Li, Qigang, Keyan Zhao, Carlos D. Bustamante, Xin Ma, and Wing H. Wong. “Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis.” Genetics in Medicine 21, no. 9 (2019): 2126-2134.
Xrare prioritizes causative gene variants in rare disease diagnosis based on the presenting phenotypes and genetic variants of a patient. The main innovations in our approach include i) a phenotype scoring method highly tolerant of noise and imprecision in the presenting phenotypes, and ii) a machine learning framework capable of incorporating domain expert knowledge in feature extraction and learning. Simulations and real clinical data demonstrated that Xrare outperforms existing alternative methods by 10-40% in various genetic diagnosis scenarios.
Because Xrare requires many data sets and software packages, we provide a Docker image to simplify installation. You can download it here.
Then load and run it:
docker load -i xrare-pub.2021.tar.gz
docker run --rm -it -p 18787:8787 xrare37-pub:2021 bash
Inside the Docker container, you can start R and perform genetic analysis with our xrare R package. RStudio Server is also installed, so you can log in with username xrare and password xrare2019 via TCP port 18787 on the host.
We use genetic data from a patient with skeletal dysplasia to demonstrate the software. The patient's phenotypes are passed as HPO term IDs in the hpoid argument:
library(xrare)
## Loading required package: data.table
## Loading required package: data.vcf
vcffile <- "/platform/raredisease/test/test.skeletal.vcf.gz"
dt <- xrare(vcffile=vcffile, hpoid="HP:0002652,HP:0001762,HP:0008905,HP:0001193,HP:0001385")
colnames(dt)
## [1] "CHROM" "POS"
## [3] "ID" "REF"
## [5] "ALT" "QUAL"
## [7] "FILTER" "INFO"
## [9] "CADD_phred" "DANN_score"
## [11] "FATHMM_pred" "INFO.GERP2_NR"
## [13] "INFO.GERP2_RS" "LRT_pred"
## [15] "LRT_score" "MutationAssessor_pred"
## [17] "MutationTaster_pred" "MutationTaster_score"
## [19] "Polyphen2_HDIV_pred" "Polyphen2_HDIV_score"
## [21] "Polyphen2_HVAR_score" "RS"
## [23] "SIFT_pred" "SIFT_score"
## [25] "dbscSNV_ADA_SCORE" "dbscSNV_RF_SCORE"
## [27] "INFO.exome_controls_AF" "INFO.exome_controls_AF_afr"
## [29] "INFO.exome_controls_AF_amr" "INFO.exome_controls_AF_eas"
## [31] "INFO.exome_controls_AF_nfe" "INFO.exome_controls_AF_sas"
## [33] "exome_controls_nhomalt" "phyloP"
## [35] "clinvar_id" "clinvar_sig"
## [37] "INFO.disease_HGMD" "exome_controls_AF_popmax"
## [39] "INFO.cosmic_id" "INFO.genome_controls_AF"
## [41] "INFO.genome_controls_AF_afr" "INFO.genome_controls_AF_amr"
## [43] "INFO.genome_controls_AF_eas" "INFO.genome_controls_AF_nfe"
## [45] "genome_controls_AF_popmax" "genome_controls_nhomalt"
## [47] "INFO.ora_mirna_binding_site" "INFO.ora_reg_region"
## [49] "INFO.ora_tf_binding_site" "dpsi_max_tissue"
## [51] "dpsi_zscore" "rmsk"
## [53] "INFO.exome_filter" "INFO.genome_filter"
## [55] "clinvar_pmids" "INFO.omim_id"
## [57] "XRARE_TMPVARID" "SVTYPE"
## [59] "END" "Allele"
## [61] "Consequence" "IMPACT"
## [63] "Gene" "Feature_type"
## [65] "Feature" "BIOTYPE"
## [67] "EXON" "INTRON"
## [69] "HGVSc" "HGVSp"
## [71] "cDNA_position" "CDS_position"
## [73] "Protein_position" "Amino_acids"
## [75] "Codons" "Existing_variation"
## [77] "DISTANCE" "STRAND"
## [79] "FLAGS" "ENSP"
## [81] "REFSEQ_MATCH" "SOURCE"
## [83] "REFSEQ_OFFSET" "GIVEN_REF"
## [85] "USED_REF" "BAM_EDIT"
## [87] "DOMAINS" "HGVS_OFFSET"
## [89] "LoF" "LoF_filter"
## [91] "LoF_flags" "LoF_info"
## [93] "dbvar_variant_id" "dbvar_clinical_assertion"
## [95] "dbvar_overlap" "gnomAD_SV_NAME"
## [97] "gnomAD_SV_AF_POPMAX" "gnomAD_SV_HOM"
## [99] "gnomAD_SV_overlap" "AF_POPMAX"
## [101] "nhomalt" "locus_group"
## [103] "hgnc_id" "symbol"
## [105] "omim_id" "tx_order"
## [107] "Feature_nover" "ACMG_PM2"
## [109] "ACMG_BA1" "ACMG_BS1"
## [111] "ACMG_BS2" "ACMG_PVS1"
## [113] "ACMG_PM5" "ACMG_PS1"
## [115] "ACMG_PM1" "ACMG_PM4"
## [117] "ACMG_PP2" "ACMG_PP3"
## [119] "Pred_Patho" "ACMG_BP3"
## [121] "ACMG_BP4" "Pred_Benign"
## [123] "ACMG_BP7" "ACMG_score"
## [125] "FORMAT.GT.LP6005038-DNA_H09" "FORMAT.AD.LP6005038-DNA_H09"
## [127] "FORMAT.DP.LP6005038-DNA_H09" "FORMAT.GQ.LP6005038-DNA_H09"
## [129] "FORMAT.ZZ.LP6005038-DNA_H09" "pathoACMG"
## [131] "tagsACMG" "xrare_score"
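With more than 130 columns in the result, it can help to pull out groups of related columns by name. The sketch below uses base R pattern matching on a small subset of the column names listed above. The grouping into "ACMG" and "allele frequency" columns is our own illustrative choice, not something defined by the xrare package.

```r
# A handful of column names copied from the colnames(dt) output above.
cols <- c("CHROM", "POS", "REF", "ALT", "CADD_phred",
          "INFO.exome_controls_AF", "exome_controls_AF_popmax",
          "genome_controls_AF_popmax", "AF_POPMAX",
          "ACMG_PM2", "ACMG_BA1", "ACMG_PVS1", "ACMG_score",
          "pathoACMG", "tagsACMG", "xrare_score")

# Columns holding individual ACMG criteria or the ACMG score.
acmg_cols <- grep("^ACMG_", cols, value = TRUE)

# Columns whose name contains "AF" as a separate underscore-delimited token.
af_cols <- grep("(^|_)AF(_|$)", cols, value = TRUE)

print(acmg_cols)
print(af_cols)
```

On a real result, replacing cols with colnames(dt) and then subsetting with dt[, ..acmg_cols] should give a narrower table that is easier to inspect.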
You can sort variants by the predicted score xrare_score in descending order.
setorder(dt, -xrare_score)  # data.table reorders dt in place, so no reassignment is needed
head(dt[, .(CHROM,POS,REF,ALT,xrare_score,symbol,pathoACMG,tagsACMG)])
## CHROM POS REF ALT xrare_score symbol pathoACMG
## 1: 5 149360517 A C 0.91211432 SLC26A2 Uncertain significance
## 2: 1 94473266 C T 0.81328905 ABCA4 Likely Pathogenic
## 3: 16 89619511 C T 0.61699539 SPG7 Uncertain significance
## 4: 12 89885848 C T 0.29401755 POC1B Uncertain significance
## 5: 16 75665388 C T 0.02056419 KARS1 Uncertain significance
## 6: X 128696703 C T 0.01976258 OCRL Uncertain significance
## tagsACMG
## 1: PP3,PM2
## 2: PP3,PM5,PM2,PM1
## 3: PP3,PM2,PM1
## 4: PP3,PM5,PM2,BS2
## 5: PP3
## 6: PP3,PP2,PM2
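After ranking, a common next step is to narrow the list to candidates for manual review. The following is a minimal sketch, not part of the xrare package: it rebuilds the six rows shown above as a toy data.table and keeps variants whose xrare_score exceeds a cutoff and whose ACMG tags include PM2 (the ACMG criterion for absence or very low frequency in population databases). The cutoff of 0.5 and the choice of PM2 are arbitrary assumptions for illustration only.

```r
library(data.table)

# Toy copy of the head() output above.
top <- data.table(
  CHROM = c("5", "1", "16", "12", "16", "X"),
  POS = c(149360517L, 94473266L, 89619511L, 89885848L, 75665388L, 128696703L),
  symbol = c("SLC26A2", "ABCA4", "SPG7", "POC1B", "KARS1", "OCRL"),
  xrare_score = c(0.91211432, 0.81328905, 0.61699539,
                  0.29401755, 0.02056419, 0.01976258),
  tagsACMG = c("PP3,PM2", "PP3,PM5,PM2,PM1", "PP3,PM2,PM1",
               "PP3,PM5,PM2,BS2", "PP3", "PP3,PP2,PM2")
)

score_cutoff <- 0.5    # assumed threshold, not an xrare recommendation
required_tag <- "PM2"  # assumed tag of interest

# Split the comma-separated tag string so that tags are matched exactly
# rather than as substrings.
has_tag <- vapply(strsplit(top$tagsACMG, ",", fixed = TRUE),
                  function(tags) required_tag %in% tags,
                  logical(1))

candidates <- top[xrare_score > score_cutoff & has_tag]
print(candidates[, .(CHROM, POS, symbol, xrare_score, tagsACMG)])
```

With these assumed settings the toy table reduces to the SLC26A2, ABCA4, and SPG7 variants. Any real cutoff should be chosen with the clinical context in mind.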