Genomic data clustering using Principal Component Analysis

(The repo is available on GitHub. This project was done in collaboration with Ray Iyer. It is an adaptation motivated by the original "Genes Mirror Geography" article published in Nature.)

In this project I share a simple and elegant methodology from linear algebra – PCA – which can help uncover important structure in data where each entry is a vector of the same size. PCA is used in many ways: as a dimensionality reduction method, for visualizing data, for understanding the drivers of variability in the data, and for interpreting correlations between the coordinates (which is what we will use here). See the conclusions section for how PCA differs from Singular Value Decomposition (SVD), a sister technique used for image compression. Below is the figure showing that PCA can be used to cluster individuals of different populations together. I will show you how I got to it, and more, using a real open dataset!

Figure: Population-based clustering from projections of the data onto the first and second principal components

Data

The data for this problem is a real dataset from the 1000 Genomes Project; the data file is available here. Each of the 995 lines in the file represents an individual. The first three columns give, respectively, the individual's unique identifier, their sex (1=male, 2=female), and the population they belong to¹. The subsequent 10101 columns of each line are a subsample of nucleobases from the individual's genome.

Primer on PCA

The idea behind PCA is in essence very simple. For a given set of \(m\) vectors of the same size (say \(n\)), or in other words \(m\) \(n\)-dimensional vectors, we are interested in finding \(k\) \(n\)-dimensional vectors, where \(k \leq \min(m, n)\), that “best” represent the original \(m\) vectors. In other words, when we approximately express each of the \(m\) vectors as a linear combination of the \(k\) vectors, this is the “best” possible representation of the data. What “best” means depends on how we define it. For PCA, the “best” \(k\) vectors are those that capture the maximum variance of the data (the \(m\) vectors). This is mathematically equivalent to saying (by duality) that PCA finds the “best-fit line” (one along each of the \(k\) vectors)² that minimizes the average squared Euclidean distance between the line and the data points. These top \(k\) “best” vectors are called the principal components (PCs) of the data. The illustration here should help show what a principal component looks like (adapted from Prof. Gregory Valiant’s notes; this project was done as a course project under Greg).

In the illustration, the “good” line (the first principal component) is the one for which projection of the data points onto it keeps the two clusters separated, while the projection onto the bad line merges the two clusters.

The theory for computing PCs involves eigenvector calculation. We will leave that to sklearn, but if you are keen, I would direct you to Greg's notes.
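Still, a minimal sketch of what that eigenvector calculation looks like may help build intuition. The tiny matrix A below is hypothetical and just stands in for an arbitrary set of \(m\) \(n\)-dimensional vectors; this is a by-hand sketch, not the method sklearn uses internally.

# A minimal sketch of PCA "by hand" on a tiny hypothetical matrix A:
# center the data, form the covariance matrix, and take the eigenvectors
# with the largest eigenvalues as the principal components.
import numpy as np

A = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])             # m = 5 points, n = 2 dimensions

centered = A - A.mean(axis=0)           # PCA works on mean-centered data
cov = np.cov(centered, rowvar=False)    # n x n covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices, ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
pcs = eigvecs[:, order].T               # rows are the principal components
print(pcs[0])                           # first PC: direction of maximum variance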

Read in a Pandas Dataframe

from sklearn.decomposition import PCA
import numpy as np 
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

def readAndProcessData():
    """
        Function to read the raw text file into a dataframe, keeping the
        population and gender columns separate from the genetic data
        
        We also calculate the population mode for each attribute or trait (columns)
        Note that the mode is just the most frequently occurring trait
        
        return: dataframe (df), modal traits (modes), population and gender for each individual row
    """
    
    # whitespace-delimited file; sep=r'\s+' replaces the deprecated delim_whitespace=True
    df = pd.read_csv('p4dataset2020.txt', header=None, sep=r'\s+')
    gender = df[1]
    population = df[2]
    print(np.unique(population))
    
    df.drop(df.columns[[0, 1, 2]],axis=1,inplace=True)
    modes = np.array(df.mode().values[0,:])
    return df, modes, population, gender
df, modes, population, gender = readAndProcessData()
['ACB' 'ASW' 'ESN' 'GWD' 'LWK' 'MSL' 'YRI'] 

So there are individuals from 7 populations¹ in our data.

Preprocessing

To understand what makes some population members special, I would like to focus on the mutations in the genomes rather than the genomes themselves. So I first convert the data from the text file of nucleobases to a real-valued matrix (PCA needs a real-valued matrix). Specifically, I convert the genetic data into a binary matrix \(X\) such that \(X_{i,j} = 0\) if the \(i^{th}\) individual has column \(j\)'s mode nucleobase for his or her \(j^{th}\) nucleobase, and \(X_{i,j} = 1\) otherwise. Note that in this way all mutations appear as a 1, even if they are different mutations: if the mode for column \(j\) is “G” and individual \(i\) has an “A”, “T”, or “C” there, then \(X_{i,j} = 1\).

def convertDfToMatrix(df, modes):
    """
        Create a binary matrix (binarized) representing mutations away from mode
        Each row is for an individual, and each column is a trait
        
        binarized_{i,j}= 0 if the $$i^{th}$$ individual has column 
        $$j$$’s mode nucleobase for his or her $$j^{th}$$ nucleobase, 
        and binarized_{i,j}= 1 otherwise
    """
    
    raw_np = df.to_numpy()
    # 1 wherever the nucleobase differs from the column mode, 0 otherwise
    binarized = np.where(raw_np != modes, 1, 0)
    return binarized

X = convertDfToMatrix(df, modes)

Perform PCA

pca = PCA(n_components=3)
pca.fit(X) 

# Data points projected along the first, second and third PCs.
# Note: unlike pca.transform(X), this does not subtract the column means first,
# so each projection is shifted by a constant offset, which does not affect the clustering.
projectedPrincipal = np.matmul(X, pca.components_[0])
projectedSecondary = np.matmul(X, pca.components_[1])
projectedTertiary = np.matmul(X, pca.components_[2])

# Each projection vector should have one entry per individual (one per row of X).

assert(X.shape[0] == projectedPrincipal.shape[0])
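
As a quick sanity check (a small aside, not part of the original pipeline), the fitted PCA object also reports how much of the total variance each component captures:

# Fraction of the total variance explained by each of the top 3 components,
# and the cumulative share covered by PC1..PC3
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())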

Population Clusters from Mutations Data

Let’s examine the first 2 principal components of \(X\). These components contain lots of information about our data set. We’ll create a scatter plot with each of the 995 rows of \(X\) projected onto the first two principal components. In other words, the horizontal axis is PC1, the vertical axis is PC2, and each individual should be projected onto the subspace spanned by PC1 and PC2.

def groupByPop(x, y, classLabels):   
    """
        Function that creates a scatter plot of the projections of the 
        data points onto the first (principal) PC versus their projection
        on the second PC and labels each datapoint according to which of the 
        classes the datapoint belongs to in "classLabels"
        
        args:
            x: Projection on First Principal component for all datapoints
            y: Projection on Second Principal component for all datapoints
            classLabels: population; can be in ['ACB' 'ASW' 'ESN' 'GWD' 'LWK' 'MSL' 'YRI']
                see http://www.1000genomes.org/faq/which-populations-are-part-your-study/
                for details on population info
    """

    classDict = {}
    classes = np.unique(classLabels)
    for label in classes:
        idx = np.where(classLabels == label)
        classDict[label] = (x[idx], y[idx])
    for lab in classes:
        xs, ys = classDict[lab]
        plt.scatter(xs, ys, label=lab)
    plt.legend(fontsize=12)
    plt.xlabel('Projection on First Principal component', fontsize=12)
    plt.ylabel('Projection on Second Principal Comp.', fontsize=12)
    plt.savefig('FirstSecond.png', format='png')
    plt.show()

# call the function
groupByPop(projectedPrincipal, projectedSecondary, population)

Population clustering! Notice how members of the same population have similar values for their projections onto the principal components. This means two things: (1) individuals in a particular population have similar gene mutations, so their data vectors start to look similar, and (2) individuals in different populations tend to have mutations in different nucleobases.

So the joint values of the first and second principal component projections for members of a population help cluster them into groups: it is the first and second PCs that capture the variability in genes due to population differences. Let's keep exploring!
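
To put rough numbers on this clustering, here is a small sketch (not part of the original analysis) that compares the mean (PC1, PC2) projection of each population; well-separated means line up with the separated clouds in the scatter plot:

# Mean projection onto PC1 and PC2 for each population
for pop in np.unique(population):
    idx = np.where(population == pop)
    print(pop, projectedPrincipal[idx].mean(), projectedSecondary[idx].mean())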

Gender Clusters from Mutations Data

This time, let’s examine the first and the third principal components of \(X\). We’ll create a scatter plot with each of the 995 rows of \(X\) projected onto the first and third principal components this time.

def groupByGender(x, y, classLabels):
    """
        Function that creates a scatter plot of the projections of the 
        data points onto the first (principal) PC versus their projection
        on the third PC and labels each datapoint according to which of the 
        classes the datapoint belongs to in "classLabels"
        
        args:
            x: Projection on First Principal component for all datapoints
            y: Projection on Third Principal component for all datapoints
            classLabels: sex of each individual (1 = Male, 2 = Female)
    """
    
    classDict = {}
    classes = np.unique(classLabels)
    for label in classes:
        idx = np.where(classLabels == label)
        classDict[label] = (x[idx], y[idx])
    for label in classes:
        xs, ys = classDict[label]
        plt.scatter(xs, ys, label='Male' if label == 1 else 'Female')
    plt.legend(fontsize=12)
    plt.xlabel('Projection on First Principal component', fontsize=12)
    plt.ylabel('Projection on Third Principal Comp.', fontsize=12)
    plt.savefig('FirstThird.png', format='png')
    plt.show()
    
# call the function    
groupByGender(projectedPrincipal, projectedTertiary, gender)

Interestingly enough, gender differences pop out! Notice that the projections of male individuals cluster together, as do those of the female individuals. Also note, once again, that the first and third PC projections tend to differ between males and females, which is what allows the clustering!

Which chromosomes lead to gender differences?

We just remarked that the third PC is responsible for bringing out the gender differences. But can we say more? Can we say which nucleobases differ between males and females? To answer this, we look at the absolute values of the entries of the third principal component (the loadings, not the projections) and create a scatter plot of the nucleobase index against the corresponding absolute value.

plt.figure()
plt.scatter(np.arange(1, pca.components_[2].shape[0] + 1), np.absolute(pca.components_[2]))
plt.xlabel('Nucleobase Index', fontsize=15)
plt.ylabel('|Third PC|', fontsize=15)
# set the y-limits before saving so the saved figure matches what is shown
plt.ylim([0, max(np.absolute(pca.components_[2])) + 0.02])
plt.savefig('part_f.png', format='png')
plt.show()

We can now say with some confidence that female and male individuals must differ in their nucleobases at indices \(\sim\)9500 and larger. The mutations at these indices are then responsible for the gender differences.
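
To make that statement a bit more precise, a short sketch (assuming the fitted pca object from above) can list which nucleobase indices carry the largest |PC3| loadings:

# 1-based indices (to match the plot) of the 20 nucleobases with the
# largest absolute loadings on the third principal component
third_pc_abs = np.absolute(pca.components_[2])
top = np.argsort(third_pc_abs)[::-1][:20]
print(np.sort(top + 1))   # expect these to sit near the high end of the index range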

Difference between PCA and SVD

Another sister approach to PCA is the popular Singular Value Decomposition (SVD). SVD is foundational for problems where one attempts to find a good low-rank approximation to a matrix of interest. What often remains unclear is the difference between PCA and SVD; which one to use depends on what the data is. As a rule of thumb, if you are given a matrix where the row and column positions have specific meaning (such as an image), go for SVD. But if the matrix of interest is an unordered set of vectors, go for PCA. For this reason, SVD is often used for image compression, matrix denoising, and matrix completion applications.
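
The two are also closely related in practice: the principal components of a data matrix can be computed from the SVD of the mean-centered matrix, whose right singular vectors are exactly the PCs (up to sign). A minimal sketch, assuming the X and pca objects from above:

# PCA via SVD: center X, take the SVD; rows of Vt are the principal components
centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
print(abs(np.dot(Vt[0], pca.components_[0])))  # close to 1.0 if the directions agree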

Conclusions

We have seen that we can cluster the individuals by population and by gender.

Take-home message: simple techniques can be very useful in discerning important insights! Tools like PCA and SVD are good to have in your repertoire, especially when exploring a completely new dataset that you have no clue how to begin with.

Actually the dataset we used is only approximately 0.6% of the original dataset which can be found here, so you can keep exploring more beyond this population if you like!


  1. http://www.1000genomes.org/faq/which-populations-are-part-your-study/

  2. Note that the PCA “best-fit line” is not the same as the “best-fit” linear regression line through the data. In linear regression, we minimize the vertical distance of points from the line, whereas in PCA we minimize the perpendicular distance of points from the line.
