Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis

Citation

Li, Qigang, Keyan Zhao, Carlos D. Bustamante, Xin Ma, and Wing H. Wong. “Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis.” Genetics in Medicine 21, no. 9 (2019): 2126-2134.

Introduction

Xrare prioritizes causative gene variants in rare disease diagnosis based on a patient's presenting phenotypes and genetic variants. The main innovations in our approach are i) a phenotype scoring method that is highly tolerant of noise and imprecision in the presenting phenotypes, and ii) a machine learning framework capable of incorporating domain expert knowledge in feature extraction and learning. Simulations and real clinical data demonstrated that Xrare outperforms existing alternative methods by 10-40% across various genetic diagnosis scenarios.

Installation

Because Xrare requires many data sets and software packages, we provide a Docker image to simplify the installation process. You can download it here.

Then load and run it:

docker load -i xrare-pub.2021.tar.gz
docker run --rm -it -p 18787:8787 xrare37-pub:2021 bash

Inside the Docker container, you can start R and perform genetic analysis with our xrare R package. RStudio Server is also installed, so you can log in with username xrare and password xrare2019 via TCP port 18787.

Example

We use the genetic data of a patient with skeletal dysplasia to show how to use this software. The patient has the following phenotypes:

HP:0002652 (Skeletal dysplasia)
HP:0001762 (Talipes equinovarus)
HP:0008905 (Rhizomelia)
HP:0001193 (Ulnar deviation of the hand or of fingers of the hand)
HP:0001385 (Hip dysplasia)

The function xrare runs the whole analysis pipeline. It needs at least two arguments, vcffile and hpoid: vcffile is the Variant Call Format (VCF) file of the patient, and hpoid is a string of comma-separated HPO IDs encoding the patient's phenotypes. The xrare function returns a data.table with more than 100 columns.
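
When the phenotypes are kept as a list rather than typed by hand, the hpoid string can be assembled in base R. This is a minimal sketch, not part of the xrare package: the variable name hpo_terms and the format check are our own assumptions, based only on the comma-separated form used in the example call and on the fact that HPO IDs consist of "HP:" followed by seven digits.

```r
# HPO IDs of the example patient, kept as a character vector
# (hpo_terms is a hypothetical name used only for this illustration)
hpo_terms <- c("HP:0002652", "HP:0001762", "HP:0008905",
               "HP:0001193", "HP:0001385")

# Sanity check: every ID should look like "HP:" followed by 7 digits
stopifnot(all(grepl("^HP:[0-9]{7}$", hpo_terms)))

# Collapse into the single comma-separated string passed as hpoid
hpoid <- paste(hpo_terms, collapse = ",")
print(hpoid)
```

The resulting hpoid value can then be passed to xrare together with vcffile.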

library(xrare)

## Loading required package: data.table

## Loading required package: data.vcf

vcffile <- "/platform/raredisease/test/test.skeletal.vcf.gz"
dt <- xrare(vcffile=vcffile, hpoid="HP:0002652,HP:0001762,HP:0008905,HP:0001193,HP:0001385")
colnames(dt)

##   [1] "CHROM"                       "POS"                        
##   [3] "ID"                          "REF"                        
##   [5] "ALT"                         "QUAL"                       
##   [7] "FILTER"                      "INFO"                       
##   [9] "CADD_phred"                  "DANN_score"                 
##  [11] "FATHMM_pred"                 "INFO.GERP2_NR"              
##  [13] "INFO.GERP2_RS"               "LRT_pred"                   
##  [15] "LRT_score"                   "MutationAssessor_pred"      
##  [17] "MutationTaster_pred"         "MutationTaster_score"       
##  [19] "Polyphen2_HDIV_pred"         "Polyphen2_HDIV_score"       
##  [21] "Polyphen2_HVAR_score"        "RS"                         
##  [23] "SIFT_pred"                   "SIFT_score"                 
##  [25] "dbscSNV_ADA_SCORE"           "dbscSNV_RF_SCORE"           
##  [27] "INFO.exome_controls_AF"      "INFO.exome_controls_AF_afr" 
##  [29] "INFO.exome_controls_AF_amr"  "INFO.exome_controls_AF_eas" 
##  [31] "INFO.exome_controls_AF_nfe"  "INFO.exome_controls_AF_sas" 
##  [33] "exome_controls_nhomalt"      "phyloP"                     
##  [35] "clinvar_id"                  "clinvar_sig"                
##  [37] "INFO.disease_HGMD"           "exome_controls_AF_popmax"   
##  [39] "INFO.cosmic_id"              "INFO.genome_controls_AF"    
##  [41] "INFO.genome_controls_AF_afr" "INFO.genome_controls_AF_amr"
##  [43] "INFO.genome_controls_AF_eas" "INFO.genome_controls_AF_nfe"
##  [45] "genome_controls_AF_popmax"   "genome_controls_nhomalt"    
##  [47] "INFO.ora_mirna_binding_site" "INFO.ora_reg_region"        
##  [49] "INFO.ora_tf_binding_site"    "dpsi_max_tissue"            
##  [51] "dpsi_zscore"                 "rmsk"                       
##  [53] "INFO.exome_filter"           "INFO.genome_filter"         
##  [55] "clinvar_pmids"               "INFO.omim_id"               
##  [57] "XRARE_TMPVARID"              "SVTYPE"                     
##  [59] "END"                         "Allele"                     
##  [61] "Consequence"                 "IMPACT"                     
##  [63] "Gene"                        "Feature_type"               
##  [65] "Feature"                     "BIOTYPE"                    
##  [67] "EXON"                        "INTRON"                     
##  [69] "HGVSc"                       "HGVSp"                      
##  [71] "cDNA_position"               "CDS_position"               
##  [73] "Protein_position"            "Amino_acids"                
##  [75] "Codons"                      "Existing_variation"         
##  [77] "DISTANCE"                    "STRAND"                     
##  [79] "FLAGS"                       "ENSP"                       
##  [81] "REFSEQ_MATCH"                "SOURCE"                     
##  [83] "REFSEQ_OFFSET"               "GIVEN_REF"                  
##  [85] "USED_REF"                    "BAM_EDIT"                   
##  [87] "DOMAINS"                     "HGVS_OFFSET"                
##  [89] "LoF"                         "LoF_filter"                 
##  [91] "LoF_flags"                   "LoF_info"                   
##  [93] "dbvar_variant_id"            "dbvar_clinical_assertion"   
##  [95] "dbvar_overlap"               "gnomAD_SV_NAME"             
##  [97] "gnomAD_SV_AF_POPMAX"         "gnomAD_SV_HOM"              
##  [99] "gnomAD_SV_overlap"           "AF_POPMAX"                  
## [101] "nhomalt"                     "locus_group"                
## [103] "hgnc_id"                     "symbol"                     
## [105] "omim_id"                     "tx_order"                   
## [107] "Feature_nover"               "ACMG_PM2"                   
## [109] "ACMG_BA1"                    "ACMG_BS1"                   
## [111] "ACMG_BS2"                    "ACMG_PVS1"                  
## [113] "ACMG_PM5"                    "ACMG_PS1"                   
## [115] "ACMG_PM1"                    "ACMG_PM4"                   
## [117] "ACMG_PP2"                    "ACMG_PP3"                   
## [119] "Pred_Patho"                  "ACMG_BP3"                   
## [121] "ACMG_BP4"                    "Pred_Benign"                
## [123] "ACMG_BP7"                    "ACMG_score"                 
## [125] "FORMAT.GT.LP6005038-DNA_H09" "FORMAT.AD.LP6005038-DNA_H09"
## [127] "FORMAT.DP.LP6005038-DNA_H09" "FORMAT.GQ.LP6005038-DNA_H09"
## [129] "FORMAT.ZZ.LP6005038-DNA_H09" "pathoACMG"                  
## [131] "tagsACMG"                    "xrare_score"

You can sort variants by the predicted score xrare_score in descending order.

setorder(dt, -xrare_score)  # sorts dt in place (by reference), highest score first
head(dt[, .(CHROM,POS,REF,ALT,xrare_score,symbol,pathoACMG,tagsACMG)])

##    CHROM       POS REF ALT xrare_score  symbol              pathoACMG
## 1:     5 149360517   A   C  0.91211432 SLC26A2 Uncertain significance
## 2:     1  94473266   C   T  0.81328905   ABCA4      Likely Pathogenic
## 3:    16  89619511   C   T  0.61699539    SPG7 Uncertain significance
## 4:    12  89885848   C   T  0.29401755   POC1B Uncertain significance
## 5:    16  75665388   C   T  0.02056419   KARS1 Uncertain significance
## 6:     X 128696703   C   T  0.01976258    OCRL Uncertain significance
##           tagsACMG
## 1:         PP3,PM2
## 2: PP3,PM5,PM2,PM1
## 3:     PP3,PM2,PM1
## 4: PP3,PM5,PM2,BS2
## 5:             PP3
## 6:     PP3,PP2,PM2
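
Once the table is ranked, a short list of candidates can be pulled out with ordinary data.table operations. The sketch below works on a toy table that copies a few columns and values from the output above, so it runs without the xrare package. The cutoff of 0.5 and the names score_cutoff, candidates, and has_PM2 are hypothetical choices made for illustration, not thresholds or columns recommended by Xrare.

```r
library(data.table)

# Toy stand-in for the xrare() result: a few columns and values
# copied from the example output above.
dt <- data.table(
  symbol      = c("SLC26A2", "ABCA4", "SPG7", "POC1B", "KARS1", "OCRL"),
  xrare_score = c(0.91211432, 0.81328905, 0.61699539,
                  0.29401755, 0.02056419, 0.01976258),
  tagsACMG    = c("PP3,PM2", "PP3,PM5,PM2,PM1", "PP3,PM2,PM1",
                  "PP3,PM5,PM2,BS2", "PP3", "PP3,PP2,PM2")
)

# Hypothetical cutoff, chosen only for illustration
score_cutoff <- 0.5

# Keep variants at or above the cutoff, highest score first
candidates <- dt[xrare_score >= score_cutoff][order(-xrare_score)]

# Flag variants whose ACMG tag list contains PM2 as a whole tag
dt[, has_PM2 := grepl("(^|,)PM2(,|$)", tagsACMG)]

print(candidates)
print(dt[has_PM2 == TRUE, symbol])
```

The same expressions should apply to the full xrare result, since it is a data.table with the xrare_score and tagsACMG columns shown above.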