Statistical Language Modeling in the Era of Abundant Data

Ciprian Chelba
Research Scientist, Google
Given on: Jan. 9th, 2015


The talk presents an overview of statistical language modeling as applied to real-world problems: speech recognition, machine translation, spelling correction, and soft keyboards, to name a few prominent ones. We summarize the most successful estimation techniques and examine how they fare for applications with abundant data, e.g. voice search. We conclude by highlighting a few open problems: getting an accurate estimate for the entropy of text produced by a very specific source, e.g. a query stream; optimally leveraging data that is of different degrees of relevance to a given "domain"; and determining whether a bound on the size of a "good" model for a given source exists.


Ciprian Chelba is a Research Scientist with Google. Previously he worked as a Researcher in the Speech Technology Group at Microsoft Research. His research interests are in statistical modeling of natural language and speech. Recent projects include: Google Audio Indexing; indexing, ranking and snippeting of speech content; Language Modeling for Google Search by Voice; and the Android IME predictive keyboard.