Govinda M. Kamath

Govinda Kamath 

Office # 260,
350 Serra Mall,
Dept of Electrical Engineering,
Stanford University,
California 94305,
United States of America.
Email gkamath AT
Research interests: Machine learning, Algorithms, Computational Genomics, Information Theory
[Google Scholar] [Github]

About Me

I am a fifth year doctoral student at the Information Systems Lab at Stanford, working with Prof. David Tse. Most of my work during my doctoral studies has been on developing fast (that is almost linear time) machine learning algorithms for tackling various problems in computational genomics. Prior to this I did get a ME in Electrical Communication Engineering from Indian Institute of science, where I worked with Prof. P Vijay Kumar on designing and proving optimality results of codes related to Distributed Storage Systems.

I've also been involved in TA-ing and helping develop a new course Data science for high throughput sequencing taught by my adviser David Tse. We discuss various problems in computational genomics, the different ways of modelling them and solving them. We cover a variety of tools from statistics, computer science and information theory that have been instrumental in tackling these problems.

I'm passionate about making empirical components of my research easily reproducible. Jupyter notebooks have been a godsend in that context, and has enabled me improve the ease of reproducing of some of my more recent work quite drastically.


During my doctoral studies, the following are a sample of the problems I have worked on.

  1. Genome assembly: The genome of an organism is a long strings of A,C,T,G. Current sequencing technology allows us to obtain noisy substrings of the genome called reads from the genome. These noisy reads are of length around 10,000 while the genome is of length in millions or billions. Assembling the genome from reads is a fundamental problem of computational genomics. I worked on this from both a theoretic and a practical perspective.
    Genome assembly is constrained by repeats in the genome. We realised that the output of an assembler should be a graph where every sequence corresponding to a genome is an eulerian cycle (or an eulerian tour if the genome is linear). We wrote a practical assembler called HINGE which was published in Genome Research, and is still under active development with quite a few users primarily doing bacterial genomics. The current focus here is reducing the memory footprint. The theoretic results, which showed goodness of simplified version of HINGE from a rate distortion perspective were presented at International Symposium of Information Theory (ISIT).

  2. Single Cell RNA-seq: RNA is an intermediate compound often created as an intermediate between DNA and protein. RNA transcripts are much smaller than DNA (of length tens of thousand instead of billions), and the number of molecule of each RNA transcript is dynamically varying with time (and the type of cell in the organism), while all the cells in any organism has the same DNA. RNA-seq is a sequencing assay that sequences RNA rather than DNA. With advances in the bio-chemistry, researchers have been able to sequence RNA from single cells. This presents interesting computational problems, as there is very little RNA in each cell. The fact that millions of cells are sequenced and each cell has an expression (feature vector) of tens of thousands of dimensions makes the problem more challenging.
    One basic application of this technology is to be able to cluster cells from their RNA-seq reads. We developed an equivalence class based method that does this with significant computational savings over conventional methods. This was published in Genome Biology.

  3. Adaptive Monte-Carlo Optimization: One particular clustering algorithm that seems to be ideal for single-cell-RNA-seq data is k-medoids as opposed to k-means. This is because the data is sparse and the centroid is not an ideal representative of the cluster. Further there has been a lot of interest in clustering this data using distances other than squared euclidean distances. gene expressions are sparse. The main bottle neck here was that computing the medoid took superlinear time. We developed an algorithm: med-dit to compute medoids in almost linear time. This has been accepted for publication in AISTATS-2018. The algorithm is based on a connection between the combinatorial optimisation problem of finding medoids and the multi-armed bandits framework that we discovered and exploited. We were able to extend these ideas to speed up a wide range of machine learning algorithms, such as k-Nearest Neighbours, k-means, heirarchical clustering in a framework that we call Adaptive Monte-Carlo Optimization. This is under review at NIPS-2018.

  4. Haplotype Assembly: Humans and other higher organisms are diploid, that is they have two copies of their genome. These two copies are almost identical with some polymorphic sites and regions (less than 0.3% of the genome). The problem here is to estimate which of the polymorphisms are on the same copy of a chromosome. We attacked this in two theoretic frameworks. The first was noting that this was a convolutional code, and the second was posing this as a community detection problem. These were published at ISIT-2015 and ICML-2016.

Some other problems I've dabbled with include trying to understand deep neural nets in the context of information theoretic and computational genomics problems (like the problem of finding DNA motifs binding with proteins).

Preprints and Publications

  • Jesse M. Zhang, Govinda M. Kamath, David N. Tse, ‘‘Towards a post-clustering test for differential expression’’, Under review at RECOMB 2019. [Preprint] [Analysis]

  • Vivek Bagaria*, Govinda M. Kamath*, David N. Tse, ‘‘Adaptive Monte-Carlo Optimization’’, Under review at AISTATS 2018. [Preprint] [Analysis] * co-first authors

  • Govinda M.Kamath *, Ilan Shomorony*, Fei Xia*, Thomas A. Courtade, and David N. Tse. “HINGE: long-read assembly achieves optimal repeat resolution.” Genome research 27, no. 5 (2017): 747-756. [Preprint] [Paper] [Code] [Analysis] [Press Coverage]
    * co-first authors

  • Yuxin Chen, Govinda M. Kamath, Changho Suh, and David N. Tse, ‘‘Community Recovery in Graphs with Locality’’, International Conference on Machine Learning - 2016 (ICML- 2016), New York City, USA. [Preprint] [Paper]

  • Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda M. Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, Karin Strauss, ‘‘Scaling up DNA data storage and random access retrieval’’, to be published in Nature Biotechnology. [Preprint]
    This was a large project that I was a part of as an intern at MSR, Redmond. My work was mainly in analysing the data, modelling and characterising the storage channel.

  • Ilan Shomorony, Govinda M. Kamath, Fei Xia, Thomas A. Courtade, and David N. Tse, ‘‘Partial Assembly: A Rate-Distortion Perspective’’, IEEE International Symposium of Information Theory - 2016 (ISIT- 2016), Barcelona, Spain. [Preprint] [Paper]

  • Jesse Zhang and Govinda M. Kamath, ’'Learning the Language of the Genome using RNNs’’, CS224d Project. [Technical Report]

  • Vasilis Ntranos*, Govinda M. Kamath*, Jesse Zhang*, Lior Pachter and David N. Tse, ‘‘Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts’’, in press in Genome Biology special issue on single cell omics. [Preprint] [Paper] [Analysis]
    * co-first authors

  • Govinda M Kamath, Eren Sasoglu, and David Tse. ‘‘Optimal Haplotype Assembly from High-Throughput Mate Pair Reads’’, published in IEEE International Symposium of Information Theory - 2015 (ISIT-2015), Hong Kong. [Preprint] [Paper]

  • Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, and P Vijay Kumar.‘‘Codes With Local Regeneration and Erasure Correction’’. IEEE Transactions of Information Theory June 2014. Earlier version published in the proceedings of IEEE International Symposium of Information Theory - 2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Conference Version] [Journal Version]

  • Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, P. Vijay Kumar, Natalia Silberstein, Ankit S. Rawat, O. Ozan Koyluoglu, and Sriram Vishwanath, ‘‘Explicit MBR All-Symbol Locality Codes’’. Proceedings of IEEE International Symposium of Information Theory - 2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Paper]

  • Narayanamoorthy Prakash, Govinda M Kamath, Lalitha Vadlamani, and P Vijay Kumar.‘‘Optimal Linear Codes with a Local-Error-Correction Property’’. in Proceedings of IEEE International Symposium of Information Theory - 2012 (ISIT 2012), Cambridge, Massachusetts, USA. [Preprint] [Paper]

  • Govinda M Kamath and P Vijay Kumar. ‘‘Regenerating codes: a reformulated storage-bandwidth trade-off and a new construction.’'in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]

  • Lalitha Vadlamani, Narayanamoorthy Prakash, Govinda M Kamath, and P Vijay Kumar. ‘‘On t-designs and bounds relating query complexity to error resilience in locally correctable codes.’’ in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]