Search engine companies prefer to show users pages that they can read. To do this, it is useful to identify the language(s) used on each web page. With billions of web pages in hundreds of languages, an automated statistical approach is needed. In contrast to previous work using short words or groups of three letters (trigrams) to identify perhaps a dozen different languages in single-language, well-written text corpora, we look at the more general problem of detecting ~180 languages in ~57 Unicode scripts, in the sometimes mixed-language wild-west text of Web pages. We discuss statistical detection using quadgrams, building statistics offline from the Web itself as corpus, and dealing with unusual pages -- e.g., what goes wrong.
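To illustrate the general idea (not Google's actual implementation), here is a minimal sketch of quadgram-based language detection: each language gets a frequency table of overlapping four-character sequences built from sample text, and an unknown string is assigned to the language whose table gives it the highest log-likelihood. The sample corpora, smoothing floor, and function names below are all illustrative assumptions.

```python
import math
from collections import Counter

def quadgrams(text):
    """Extract overlapping 4-character sequences from normalized text."""
    t = " " + " ".join(text.lower().split()) + " "
    return [t[i:i + 4] for i in range(len(t) - 3)]

def train(samples):
    """Build a per-language quadgram relative-frequency table."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(quadgrams(text))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items()}
    return models

def detect(text, models, floor=1e-6):
    """Return the language whose model gives the highest log-likelihood.

    Unseen quadgrams get a small floor probability so a single
    out-of-vocabulary sequence does not zero out the whole score.
    """
    grams = quadgrams(text)
    best_lang, best_score = None, float("-inf")
    for lang, freqs in models.items():
        score = sum(math.log(freqs.get(g, floor)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Toy training corpora -- real systems train on far larger Web-scale text.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "es": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
models = train(samples)
print(detect("the dog and the fox", models))
```

A production detector must additionally handle mixed-language pages, dozens of scripts, and adversarial or boilerplate text, which is where much of the difficulty described in the talk lies.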
Slides for this talk are not available for download at this time.
About the speaker:
Dick Sites is a Senior Staff Engineer at Google, where he has worked for 4-1/2 years. He previously worked at Adobe Systems, Digital Equipment Corporation, Hewlett-Packard, Burroughs, and IBM. His accomplishments include co-architecting the DEC Alpha computers, advancing the art of binary translation for computer executables, adding electronic book encryption to Adobe Acrobat, decoding image metadata for Photoshop, and building various computer performance monitoring and tracing tools at the above companies. He also taught Computer Science for four years at UC/San Diego. Most recently he has been working on Unicode text processing. Dr. Sites holds a PhD degree in Computer Science from Stanford and a BS degree in Mathematics from MIT. He also attended the Master's program in Computer Science at UNC 1969-70. He holds 33 patents and was recently elected to the National Academy of Engineering.