PECA is a statistical model for gene regulation from paired expression and chromatin accessibility data.
Zhana Duren, Xi Chen, Rui Jiang, Yong Wang, and Wing Hung Wong (2017), Modeling gene regulation from paired expression and chromatin accessibility data. Proc Natl Acad Sci U S A. 2017;114(25):E4914-E4923
The rapid increase of genome-wide data sets on gene expression, chromatin states and transcription factor (TF) binding locations offers an exciting opportunity to interpret the information encoded in genomes and epigenomes. This task can be challenging as it requires joint modeling of context specific activation of cis-regulatory elements (RE) and the effects on transcription of associated regulatory factors. To meet this challenge, we propose a statistical approach based on paired expression and chromatin accessibility (PECA) data across diverse cellular contexts. In our approach, we model 1) the localization to REs of chromatin regulators (CR) based on their interaction with sequence-specific TFs, 2) the activation of REs due to CRs that are localized to them, and 3) the effect of TFs bound to activated REs on the transcription of target genes (TG). The transcriptional regulatory network inferred by PECA provides a detailed view of how trans- and cis-regulatory elements work together to affect gene expression in a context specific manner.
Our analytical approach for learning from this data is to model the distribution of the expression of target genes (TG) conditional on the accessibility of regulatory elements and the expression of transcription factors (TF) and chromatin regulators (CR).
Expression of target gene: We assume that the rate of transcription of a TG in a cellular context is affected by TFs bound to regulatory elements that are active in that cellular context. For each RE we construct a variable (the parenthesized term in eq. 3) that represents the combined effect of TFs that are expressed in that context and have significant motif matches on that RE. Target gene expression is modeled by a regression with these variables as potential predictors. However, only active REs associated with a TG are included in the regression model for that TG (eq. 3). The association of REs to TGs is determined before model building, based on the distance between them and the degree of correlation of the accessibility of the RE with the promoter accessibility and the expression of the TG.
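The regression structure described above can be sketched as follows. This is a minimal illustration, not the published eq. 3: the linear form, the 0/1 motif-match matrix, and the names `re_effect`, `predicted_tg`, `motif_match`, `associated`, and `intercept` are assumptions introduced here; only the roles of β (per RE) and γ (per TF) come from the text.

```python
import numpy as np

def re_effect(tf_expr, motif_match, gamma):
    """Combined TF effect on each RE (assumed form).

    tf_expr:     (n_tf,) TF expression in one cellular context
    motif_match: (n_re, n_tf) 1 if the TF has a significant motif match on the RE
    gamma:       (n_tf,) per-TF weights
    """
    return np.asarray(motif_match, dtype=float) @ (
        np.asarray(gamma, dtype=float) * np.asarray(tf_expr, dtype=float)
    )

def predicted_tg(tf_expr, motif_match, gamma, beta, active, associated, intercept=0.0):
    """Predicted mean TG expression: only REs that are both active (Z_i = 1)
    and associated with the TG contribute to the sum."""
    effect = re_effect(tf_expr, motif_match, gamma)
    use = np.asarray(active, dtype=float) * np.asarray(associated, dtype=float)
    return intercept + float(np.sum(np.asarray(beta, dtype=float) * use * effect))

# Toy example: 3 REs, 2 TFs; the second RE is inactive and drops out.
prediction = predicted_tg(
    tf_expr=[2.0, 3.0],
    motif_match=[[1, 0], [1, 1], [0, 1]],
    gamma=[1.0, 0.5],
    beta=[1.0, 2.0, 4.0],
    active=[1, 0, 1],
    associated=[1, 1, 1],
)
print(prediction)
```

Masking by the product of the activity and association indicators keeps the "only active, associated REs enter the model" rule in one place.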
Activity status of regulatory element: The activity status of a RE (say the ith RE) is represented by a context dependent binary variable Zi, with Zi=1 indicating that the ith RE is in an active state. Testing whether a RE is active in a cellular context, say by editing the RE in a cell line, is time consuming experimentally. As an alternative, genome-wide inference of active REs is usually done based on ChIP-seq signals for selected chromatin regulators (e.g. P300), histone modification marks (e.g. H3K4me3, H3K27ac) and local methylation signal. Thus the knowledge of which CRs have been recruited to a RE is informative on the activity status of that RE. To incorporate this into our model, we denote the recruitment status of a CR to a RE by a binary variable C, i.e., Cij=1 indicates that the jth CR has been recruited to the ith RE. These variables are used, together with the expression of CRs and the accessibility of the RE, to define predictive variables in our model for the activity status of the RE (eq. 2).
Recruitment of CR to RE: Generally CRs do not have sequence specificity. We assume a CR is likely to be recruited to a RE if the RE is open and is bound by TFs that have protein interaction propensity with the CR. For each pair of CR and RE, we consider any TF that (i) is a protein interaction partner of the CR and (ii) has a significant motif match on the RE, and use it to construct a predictor variable for the modeling of the recruitment status of the CR on the RE. This predictor variable is defined as the geometric mean of the openness of the RE, the binding potential of the TF to the RE, the expression of the TF, and the expression specificity score of the TF. The specificity score, defined as the geometric mean of the maximum TF expression and max/(min+0.5), where max and min are respectively the maximum and minimum expression over a panel of cellular contexts, measures the tissue specificity of the expression of the TF. Including it in the definition of the predictor variable has the desirable effect of down-weighting any TF whose expression is non-varying across cellular contexts. The resulting model for CR recruitment is given in eq. 1 of Fig. 2.
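The two geometric means defined above can be computed as in the following sketch. The function names (`geometric_mean`, `specificity_score`, `recruitment_predictor`) and the toy numbers are hypothetical and introduced only for illustration; the formulas follow the definitions in the text, and non-negative inputs are assumed.

```python
import numpy as np

def geometric_mean(values):
    """Geometric mean of non-negative numbers (0 if any entry is 0)."""
    values = np.asarray(values, dtype=float)
    if np.any(values < 0):
        raise ValueError("geometric mean requires non-negative inputs")
    if np.any(values == 0):
        return 0.0
    return float(np.exp(np.mean(np.log(values))))

def specificity_score(expr_across_contexts):
    """Geometric mean of max expression and max/(min+0.5) over a panel of contexts."""
    expr = np.asarray(expr_across_contexts, dtype=float)
    mx, mn = expr.max(), expr.min()
    return geometric_mean([mx, mx / (mn + 0.5)])

def recruitment_predictor(openness, binding_potential, tf_expr, specificity):
    """Geometric mean of RE openness, TF binding potential, TF expression,
    and TF specificity score, for one (CR, RE, TF) combination."""
    return geometric_mean([openness, binding_potential, tf_expr, specificity])

# A TF that varies across contexts versus one with the same peak but flat expression.
varying = specificity_score([0.5, 4.0, 2.0])   # max = 4, min = 0.5
constant = specificity_score([4.0, 4.0, 4.0])  # max = min = 4
print(varying, constant)
print(recruitment_predictor(1.0, 1.0, 4.0, varying))
```

With the same peak expression, the flat TF receives the lower score, which is the down-weighting effect described above.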
To infer the unknown parameters α,β,γ,η and latent variables (C, Z) based on the observed expression data (TG, TF, CR) and accessibility data (O), we consider the conditional density of TG given TF, CR and O:
The term P(C_ij│TF,O_i) represents the conditional density of the recruitment status of the jth CR on the ith RE, as specified in eq. 1 of Fig. 2. Similarly, the terms P(Z_i│CR,C_i,O_i) and P(TG_l│TF,Z) are specified by eq. 2 and eq. 3 of Fig. 2. Note that these terms involve different components of the parameter vector: η appears in the first term, α appears in the second term, and (β_i, γ_k) appear in the third term. This conditional experiment (TG|TF, CR, O) provides a valid basis for the inference of the unknown parameters α,β,γ,η and latent variables (C, Z). To induce sparsity, we use Laplacian priors for the parameters α and β. We employ an iterated conditional modes algorithm for this inference. The resulting model and inference methodology is named PECA, for Paired Expression and Chromatin Accessibility modeling.
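For orientation, the three terms named above suggest a factorization of the following shape. This is a reconstruction from the description on this page, not a copy of the published equation: the index ranges, the sum over the latent variables, and the placement of the parameters after the semicolons are assumptions.

```latex
P(TG \mid TF, CR, O)
  = \sum_{C,\,Z}\;
    \prod_{i,j} P(C_{ij} \mid TF, O_i;\ \eta)\;
    \prod_{i}   P(Z_i \mid CR, C_i, O_i;\ \alpha)\;
    \prod_{l}   P(TG_l \mid TF, Z;\ \beta, \gamma)
```

Under this reading, an iterated conditional modes procedure would work with the summand directly, alternately updating the latent variables (C, Z) and the parameters with the other group held fixed, rather than evaluating the full sum.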
Release: Data Download
Release: regulatoryElement-targetGene association: Download
Release: PECA driven network: Download
Release: PECA driven tissue specific network: Download
Release: Tool for context specific network: Download
Release: Tool for genomic region annotation by network: Download
Any correspondence regarding the PECA model should be directed to Zhana Duren (zduren@stanford.edu), Prof. Rui Jiang (ruijiang@tsinghua.edu.cn), Prof. Yong Wang (ywang@amss.ac.cn) and Prof. Wing Hung Wong (whwong@stanford.edu).
The PECA model was developed by researchers at Stanford University in the Wong Lab.
The Wong Lab and its research contribute to Stanford's Bio-X Initiative, which aims to bring clinicians and biomedical and life science researchers together with engineers, physicists, and computational scientists to tackle the complexity of the body in health and disease.