Office: Packard 251, Stanford
Email: iochoa at stanford dot edu
I am currently a Ph.D. student in the Electrical Engineering department at Stanford University, working under the supervision of Prof. Tsachy Weissman. I also received my MSc from the same department in 2012. Before Stanford, I received a BS and MSc from the Telecommunications Engineering (Electrical Engineering) department at TECNUN in Donostia, Spain, which included a six-month stay at Lulea Tekniska Universitet, Lulea, Sweden, as part of the Erasmus program. I then worked as a researcher at CEIT, also in Donostia, within the group INTECOM (Communication Systems and Mathematical Principles of Information).
My main interests are in the fields of information theory, genetics, compression, coding, communications and signal processing.
My research focuses mainly on helping the bio community handle the massive amounts of genomic data being generated, for example by designing new and more effective compression programs for genomic data (see "EE information theory is guiding improved ways to model and compress data").
My studies are/were funded by:
You can find more information in my Resume.
Compression schemes for genomic data: Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed, and transmitted. To facilitate this effort, we are working on new compression schemes for genomic data that can drastically reduce the storage requirements (see "EE information theory is guiding improved ways to model and compress data"). The genomic data is mainly composed of nucleotide sequences, called "reads", and per-base quality scores that indicate the level of confidence in the read-out of each base pair. We have developed compression algorithms for the reads, the quality scores, and for genomes that are already assembled, outperforming in all cases the previously proposed algorithms.
Denoising of quality values via lossy compression: Quality values have proven more difficult to compress than the reads, mainly due to their higher entropy and larger alphabet. Moreover, there is evidence that the quality values are inherently noisy. Although lossy compression of the quality values has recently been proposed to alleviate storage costs, reducing the noise in the quality values has remained largely unexplored. We are therefore working on a scheme that reduces both the file size and the noise of stored quality values. We have preliminary results showing that denoising via lossy compression is possible.
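To make the entropy and alphabet-size remark concrete, here is a minimal Python sketch. It is not the scheme described above: it assumes a made-up Phred+33 quality string and a hypothetical uniform binning step (`quantize`, with an arbitrary `bin_width`), and it only shows how coarser quality values shrink the alphabet and the zeroth-order empirical entropy. It says nothing about denoising.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def quantize(quals, bin_width=10):
    """Hypothetical lossy step: map each score to the midpoint of its bin."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in quals]


# Made-up Phred+33 quality string, for illustration only.
quality_string = "IIIIHHGGFFEEDDCCBBAA@@??>>==<<;;::99887766"
quals = [ord(ch) - 33 for ch in quality_string]
lossy = quantize(quals)

h_orig = empirical_entropy(quals)
h_lossy = empirical_entropy(lossy)

print(f"alphabet size: {len(set(quals))} -> {len(set(lossy))}")
print(f"entropy (bits/symbol): {h_orig:.2f} -> {h_lossy:.2f}")
```

Because the binning is a deterministic function of each score, the empirical entropy of the binned sequence can never exceed that of the original; the price is the distortion introduced by replacing each score with its bin midpoint.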
Cancer module discovery using gene expression data: We aim to identify disease driver genes associated with molecular processes and to link them with their targets. We have developed CaMoDi, a novel computational method that links potential drug targets with molecular pathways in a fast, accurate and robust manner. Joint work with Prof. Olivier Gevaert.
Compression schemes for efficient similarity queries: The number of new databases and the amount of data in existing ones are growing exponentially (e.g., databases that store genomic data). As a result, executing queries on large databases is becoming a time-consuming and challenging task. With this in mind, we have studied the problem of compressing sequences in a database so that similarity queries can still be performed efficiently on the compressed database. We have proposed an approach to this task, based in part on existing lossy compression algorithms, whose performance is close to the fundamental limits in some cases.
Internship at Google, CA, USA (Summer 2013): Design, implementation and verification of a flexible decimation filter for a digital microphone (Architecture design in Matlab, Implementation of a C-model including fixed point conversion, C-to-RTL conversion using the tool Catapult and FPGA implementation).
Research Assistant at CEIT, Basque Country, Spain (2008-2010): Conducted research on the relay channel, source-channel coding for non-uniform and cyclo-stationary sources, LDPC and Turbo codes, etc., within the group INTECOM (Communication Systems and Mathematical Principles of Information).
Current trends on genomic data compression, VA research seminar, Palo Alto, USA, November 2015.
Genomic Data Compression and Processing, EECS Rising Stars, Massachusetts Institute of Technology (MIT), USA, November 2015.