Citation

Li, Qigang, Keyan Zhao, Carlos D. Bustamante, Xin Ma, and Wing H. Wong. “Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis.” Genetics in Medicine 21, no. 9 (2019): 2126-2134.

Introduction

Xrare prioritizes causative gene variants in rare disease diagnosis based on a patient's presenting phenotypes and genetic variants. The main innovations in our approach are i) a phenotype scoring method that is highly tolerant of noise and imprecision in the presenting phenotypes, and ii) a machine learning framework that can incorporate domain expert knowledge in feature extraction and learning. Simulations and real clinical data show that Xrare outperforms existing alternative methods by 10-40% in various genetic diagnosis scenarios.

Installation

Because Xrare depends on many data sets and software packages, we provide a Docker image to simplify installation. You can download it here.

Then load and run it:

docker load -i xrare-pub.2021.tar.gz
docker run --rm -it -p 18787:8787 xrare37-pub:2021 bash

Inside the Docker container, you can start R and run genetic analyses with our xrare R package. RStudio Server is also installed; to use it, open port 18787 on the host in a browser and log in with username xrare and password xrare2019.

Example

We use the genetic data of a patient with skeletal dysplasia to show how to use this software. The patient's phenotypes are passed as a comma-separated string of HPO term IDs in the hpoid argument:

library(xrare)
## Loading required package: data.table
## Loading required package: data.vcf
vcffile <- "/platform/raredisease/test/test.skeletal.vcf.gz"
dt <- xrare(vcffile=vcffile, hpoid="HP:0002652,HP:0001762,HP:0008905,HP:0001193,HP:0001385")
colnames(dt)
##   [1] "CHROM"                       "POS"                        
##   [3] "ID"                          "REF"                        
##   [5] "ALT"                         "QUAL"                       
##   [7] "FILTER"                      "INFO"                       
##   [9] "CADD_phred"                  "DANN_score"                 
##  [11] "FATHMM_pred"                 "INFO.GERP2_NR"              
##  [13] "INFO.GERP2_RS"               "LRT_pred"                   
##  [15] "LRT_score"                   "MutationAssessor_pred"      
##  [17] "MutationTaster_pred"         "MutationTaster_score"       
##  [19] "Polyphen2_HDIV_pred"         "Polyphen2_HDIV_score"       
##  [21] "Polyphen2_HVAR_score"        "RS"                         
##  [23] "SIFT_pred"                   "SIFT_score"                 
##  [25] "dbscSNV_ADA_SCORE"           "dbscSNV_RF_SCORE"           
##  [27] "INFO.exome_controls_AF"      "INFO.exome_controls_AF_afr" 
##  [29] "INFO.exome_controls_AF_amr"  "INFO.exome_controls_AF_eas" 
##  [31] "INFO.exome_controls_AF_nfe"  "INFO.exome_controls_AF_sas" 
##  [33] "exome_controls_nhomalt"      "phyloP"                     
##  [35] "clinvar_id"                  "clinvar_sig"                
##  [37] "INFO.disease_HGMD"           "exome_controls_AF_popmax"   
##  [39] "INFO.cosmic_id"              "INFO.genome_controls_AF"    
##  [41] "INFO.genome_controls_AF_afr" "INFO.genome_controls_AF_amr"
##  [43] "INFO.genome_controls_AF_eas" "INFO.genome_controls_AF_nfe"
##  [45] "genome_controls_AF_popmax"   "genome_controls_nhomalt"    
##  [47] "INFO.ora_mirna_binding_site" "INFO.ora_reg_region"        
##  [49] "INFO.ora_tf_binding_site"    "dpsi_max_tissue"            
##  [51] "dpsi_zscore"                 "rmsk"                       
##  [53] "INFO.exome_filter"           "INFO.genome_filter"         
##  [55] "clinvar_pmids"               "INFO.omim_id"               
##  [57] "XRARE_TMPVARID"              "SVTYPE"                     
##  [59] "END"                         "Allele"                     
##  [61] "Consequence"                 "IMPACT"                     
##  [63] "Gene"                        "Feature_type"               
##  [65] "Feature"                     "BIOTYPE"                    
##  [67] "EXON"                        "INTRON"                     
##  [69] "HGVSc"                       "HGVSp"                      
##  [71] "cDNA_position"               "CDS_position"               
##  [73] "Protein_position"            "Amino_acids"                
##  [75] "Codons"                      "Existing_variation"         
##  [77] "DISTANCE"                    "STRAND"                     
##  [79] "FLAGS"                       "ENSP"                       
##  [81] "REFSEQ_MATCH"                "SOURCE"                     
##  [83] "REFSEQ_OFFSET"               "GIVEN_REF"                  
##  [85] "USED_REF"                    "BAM_EDIT"                   
##  [87] "DOMAINS"                     "HGVS_OFFSET"                
##  [89] "LoF"                         "LoF_filter"                 
##  [91] "LoF_flags"                   "LoF_info"                   
##  [93] "dbvar_variant_id"            "dbvar_clinical_assertion"   
##  [95] "dbvar_overlap"               "gnomAD_SV_NAME"             
##  [97] "gnomAD_SV_AF_POPMAX"         "gnomAD_SV_HOM"              
##  [99] "gnomAD_SV_overlap"           "AF_POPMAX"                  
## [101] "nhomalt"                     "locus_group"                
## [103] "hgnc_id"                     "symbol"                     
## [105] "omim_id"                     "tx_order"                   
## [107] "Feature_nover"               "ACMG_PM2"                   
## [109] "ACMG_BA1"                    "ACMG_BS1"                   
## [111] "ACMG_BS2"                    "ACMG_PVS1"                  
## [113] "ACMG_PM5"                    "ACMG_PS1"                   
## [115] "ACMG_PM1"                    "ACMG_PM4"                   
## [117] "ACMG_PP2"                    "ACMG_PP3"                   
## [119] "Pred_Patho"                  "ACMG_BP3"                   
## [121] "ACMG_BP4"                    "Pred_Benign"                
## [123] "ACMG_BP7"                    "ACMG_score"                 
## [125] "FORMAT.GT.LP6005038-DNA_H09" "FORMAT.AD.LP6005038-DNA_H09"
## [127] "FORMAT.DP.LP6005038-DNA_H09" "FORMAT.GQ.LP6005038-DNA_H09"
## [129] "FORMAT.ZZ.LP6005038-DNA_H09" "pathoACMG"                  
## [131] "tagsACMG"                    "xrare_score"
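The hpoid argument in the call above is a single comma-separated string. When the phenotypes arrive as a list, it can be convenient to build and sanity-check that string first. The following is a minimal sketch in base R; the variable names hpo_terms and hpoid are illustrative, and the format check only verifies the "HP:" plus seven digits pattern, not that each term exists in the ontology.

```r
# Phenotypes as a character vector of HPO term IDs (same IDs as in the call above).
hpo_terms <- c("HP:0002652", "HP:0001762", "HP:0008905", "HP:0001193", "HP:0001385")

# Basic format check: "HP:" followed by exactly seven digits.
valid <- grepl("^HP:[0-9]{7}$", hpo_terms)
if (!all(valid)) {
  stop("Malformed HPO IDs: ", paste(hpo_terms[!valid], collapse = ", "))
}

# Collapse into the comma-separated string used for the hpoid argument.
hpoid <- paste(hpo_terms, collapse = ",")
print(hpoid)
```

The resulting string can then be passed as hpoid=hpoid in the xrare() call shown above.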

You can sort variants by the predicted score xrare_score in descending order.

dt = setorder(dt, -xrare_score)
head(dt[, .(CHROM,POS,REF,ALT,xrare_score,symbol,pathoACMG,tagsACMG)])
##    CHROM       POS REF ALT xrare_score  symbol              pathoACMG
## 1:     5 149360517   A   C  0.91211432 SLC26A2 Uncertain significance
## 2:     1  94473266   C   T  0.81328905   ABCA4      Likely Pathogenic
## 3:    16  89619511   C   T  0.61699539    SPG7 Uncertain significance
## 4:    12  89885848   C   T  0.29401755   POC1B Uncertain significance
## 5:    16  75665388   C   T  0.02056419   KARS1 Uncertain significance
## 6:     X 128696703   C   T  0.01976258    OCRL Uncertain significance
##           tagsACMG
## 1:         PP3,PM2
## 2: PP3,PM5,PM2,PM1
## 3:     PP3,PM2,PM1
## 4: PP3,PM5,PM2,BS2
## 5:             PP3
## 6:     PP3,PP2,PM2
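After ranking, one might shortlist the top candidates for manual review. The sketch below rebuilds a toy table from the six rows printed above (so that it runs without the xrare package) and keeps variants above a score cutoff. The cutoff of 0.5, the variable names, and the n_tags column are assumptions made for illustration, not recommendations from the Xrare authors.

```r
library(data.table)

# Toy table copied from the printed output above (first six ranked variants).
top <- data.table(
  CHROM = c("5", "1", "16", "12", "16", "X"),
  POS = c(149360517, 94473266, 89619511, 89885848, 75665388, 128696703),
  xrare_score = c(0.91211432, 0.81328905, 0.61699539,
                  0.29401755, 0.02056419, 0.01976258),
  symbol = c("SLC26A2", "ABCA4", "SPG7", "POC1B", "KARS1", "OCRL"),
  tagsACMG = c("PP3,PM2", "PP3,PM5,PM2,PM1", "PP3,PM2,PM1",
               "PP3,PM5,PM2,BS2", "PP3", "PP3,PP2,PM2")
)

# Hypothetical cutoff, chosen only for illustration.
score_cutoff <- 0.5
shortlist <- top[xrare_score >= score_cutoff]

# Add a rank (the table is already sorted by descending score)
# and count how many ACMG tags each variant carries.
shortlist[, rank := seq_len(.N)]
shortlist[, n_tags := lengths(strsplit(tagsACMG, ",", fixed = TRUE))]

print(shortlist[, .(rank, symbol, xrare_score, n_tags)])
```

On the real result, the same filtering expression would be applied to dt instead of the toy table.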