Govinda M. Kamath

Office # 260,
350 Serra Mall,
Dept of Electrical Engineering,
Stanford University,
California 94305,
United States of America.
Email gkamath AT stanford.edu
Research interests: Machine learning, Algorithms, Computational Genomics, Information Theory
[Google Scholar] [Github]

About Me
I am a fifth year doctoral student at the Information Systems Lab at Stanford,
working with Prof. David Tse. Most of my work during my doctoral studies has
been on developing fast (that is almost linear time) machine learning algorithms for tackling various problems in computational genomics.
Prior to this I did get a ME in Electrical Communication Engineering from Indian Institute of science, where I worked with Prof. P Vijay Kumar on designing and proving optimality results of codes related to Distributed Storage Systems.
I've also been involved in TAing and helping develop a new course
Data science for high throughput sequencing taught by my adviser David Tse.
We discuss various problems in computational genomics, the different ways of modelling them and solving them. We
cover a variety of tools from statistics, computer science and information theory that have been instrumental in
tackling these problems.
I'm passionate about making empirical components of my research easily reproducible. Jupyter notebooks
have been a godsend in that context, and has enabled me improve the ease of reproducing of some of my more
recent work quite drastically.
Research
During my doctoral studies, the following are a sample of the problems I have worked on.
Genome assembly: The genome of an organism is a long strings of A,C,T,G.
Current sequencing technology allows us to obtain noisy substrings of the genome called reads from the genome.
These noisy reads are of length around 10,000 while the genome is of length in millions or billions.
Assembling the genome from reads is a fundamental problem of computational genomics.
I worked on this from both a theoretic and a practical perspective.
Genome assembly is constrained by repeats in the genome. We realised that the output of an assembler should
be a graph where every sequence corresponding to a genome is an eulerian cycle
(or an eulerian tour if the genome is linear). We wrote a practical assembler called HINGE
which was published in Genome Research,
and is still under active development with quite a few users primarily doing bacterial genomics. The current focus here is reducing the memory footprint. The theoretic results, which showed goodness of simplified version of HINGE from a rate distortion perspective
were presented at International Symposium of Information Theory (ISIT).
Single Cell RNAseq: RNA is an intermediate compound often created as an intermediate between DNA and protein.
RNA transcripts are much smaller than DNA (of length tens of thousand instead of billions), and the number of
molecule of each RNA transcript is dynamically varying with time (and the type of cell in the organism), while
all the cells in any organism has the same DNA.
RNAseq is a sequencing assay that sequences RNA rather than DNA.
With advances in the biochemistry, researchers have been able to sequence RNA from single cells.
This presents interesting computational problems, as there is very little RNA in each cell.
The fact that millions of cells are sequenced and each cell has an expression (feature vector) of tens
of thousands of dimensions makes the problem more challenging.
One basic application of this technology is to be able to cluster cells from their RNAseq reads.
We developed an equivalence class based method that does this with significant computational
savings over conventional methods. This was published in
Genome Biology.
Adaptive MonteCarlo Optimization:
One particular clustering algorithm that seems to be ideal for singlecellRNAseq data is kmedoids
as opposed to kmeans. This is because the data is sparse and the centroid is not an ideal representative of the cluster. Further
there has been a lot of interest in clustering this data using distances other than squared euclidean distances.
gene expressions are sparse. The main bottle neck here was that computing the medoid took superlinear time. We
developed an algorithm: meddit to compute medoids in almost linear time. This has been accepted for publication
in AISTATS2018. The algorithm is based on a connection between the
combinatorial optimisation problem of finding medoids and the multiarmed bandits framework that we discovered
and exploited.
We were able to extend these ideas to speed up a wide range of machine learning algorithms, such as
kNearest Neighbours, kmeans, heirarchical clustering in a framework that we call
Adaptive MonteCarlo Optimization. This is under review at NIPS2018.
Haplotype Assembly: Humans and other higher organisms are diploid,
that is they have two copies of their genome. These two copies are almost identical
with some polymorphic sites and regions (less than 0.3% of the genome).
The problem here is to estimate which of the polymorphisms are on the same copy of a chromosome.
We attacked this in two theoretic frameworks.
The first was noting that this was a convolutional code, and the second was posing this as a community detection problem.
These were published at ISIT2015 and ICML2016.
Some other problems I've dabbled with include trying to understand deep neural nets in the context of
information theoretic and computational genomics problems (like the problem of finding DNA motifs binding
with proteins).
Preprints and Publications
Vivek Bagaria*, Govinda M. Kamath*, David N. Tse, ‘‘Adaptive
MonteCarlo Optimization’’, Under review at NIPS 2018. [Preprint]
[Analysis]
* cofirst authors
Govinda M.Kamath *, Ilan Shomorony*, Fei Xia*, Thomas A. Courtade, and David N. Tse. “HINGE: longread assembly achieves optimal repeat resolution.” Genome research 27, no. 5 (2017): 747756. [Preprint] [Paper] [Code] [Analysis]
[Press Coverage]
* cofirst authors
Yuxin Chen, Govinda M. Kamath, Changho Suh, and David N. Tse, ‘‘Community Recovery in Graphs with Locality’’, International Conference on Machine Learning  2016 (ICML 2016), New York City, USA. [Preprint] [Paper]
Lee Organick, Siena Dumas Ang, YuanJyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda M. Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher Takahashi, Sharon Newman, HsingYeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, Karin Strauss,
‘‘Scaling up DNA data storage and random access retrieval’’, to be published in Nature Biotechnology. [Preprint]
This was a large project that I was a part of as an intern at MSR, Redmond. My work was mainly in analysing the data, modelling and characterising the storage channel.
Ilan Shomorony, Govinda M. Kamath, Fei Xia, Thomas A. Courtade, and David N. Tse, ‘‘Partial Assembly: A RateDistortion Perspective’’, IEEE International Symposium of Information Theory  2016 (ISIT 2016), Barcelona, Spain. [Preprint] [Paper]
Vasilis Ntranos*, Govinda M. Kamath*, Jesse Zhang*, Lior Pachter and David N. Tse, ‘‘Fast and accurate singlecell RNASeq analysis by clustering of transcriptcompatibility counts’’, in press in Genome Biology special issue on single cell omics. [Preprint] [Paper] [Analysis]
* cofirst authors
Govinda M Kamath, Eren Sasoglu, and David Tse. ‘‘Optimal Haplotype Assembly from HighThroughput Mate Pair Reads’’, published in IEEE International Symposium of Information Theory  2015 (ISIT2015), Hong Kong. [Preprint] [Paper]
Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, and P Vijay Kumar.‘‘Codes With Local Regeneration and Erasure Correction’’. IEEE Transactions of Information Theory June 2014. Earlier version published in the proceedings of IEEE International Symposium of Information Theory  2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Conference Version] [Journal Version]
Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, P. Vijay Kumar, Natalia Silberstein, Ankit S. Rawat, O. Ozan Koyluoglu, and Sriram Vishwanath, ‘‘Explicit MBR AllSymbol Locality Codes’’. Proceedings of IEEE International Symposium of Information Theory  2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Paper]
Narayanamoorthy Prakash, Govinda M Kamath, Lalitha Vadlamani, and P Vijay Kumar.‘‘Optimal Linear Codes with a LocalErrorCorrection Property’’. in Proceedings of IEEE International Symposium of Information Theory  2012 (ISIT 2012), Cambridge, Massachusetts, USA. [Preprint] [Paper]
Govinda M Kamath and P Vijay Kumar. ‘‘Regenerating codes: a reformulated storagebandwidth tradeoff and a new construction.’'in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]
Lalitha Vadlamani, Narayanamoorthy Prakash, Govinda M Kamath, and P Vijay Kumar. ‘‘On tdesigns and bounds relating query complexity to error resilience in locally correctable codes.’’ in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]
