Downtranslation of XML Dictionaries to lexc
Kenneth R. Beesley
Xerox Research Centre Europe
6, chemin de Maupertuis
38240 Meylan, France
October 12, 2004
This handout provides an introduction to downtranslating dictionaries from an XML
format to the Xerox lexc format.
While basically simple, XML has an enormous range of uses, being suitable for encoding
almost any data that has a hierarchical, tree-like structure. In natural-language
processing, XML is an obvious and attractive way to encode dictionaries.
The wide range of XML uses, and users, is largely responsible for the confusing
state of XML processing today. Any good programmer can produce new XML applications,
and many do, providing solutions that reflect their own specific experience,
needs and tastes. All I can do is present my own opinions and recommendations,
emphasizing that what works for me may not appeal to you at all.
lexc Dictionaries vs. XML Dictionaries
lexc Dictionaries Are a Dead End
People learning to develop morphological analyzers with the Xerox finite-state
tools are first taught to define their "dictionaries" and morphotactic descriptions
directly in the lexc language. The lexc file is then compiled into a finite-state transducer.
Because any finite-state developer must understand the syntax and semantics of lexc, this is
all right and proper, but I believe that long-term development of dictionaries
directly in the lexc format is a dead end and a mistake. Dictionaries in
any serious project should instead be edited and maintained in an XML language, with an
automatic "downtranslation" to lexc format whenever a morphological analyzer is
built or rebuilt. On a Unix-like system, the downtranslation can be performed as part
of the 'make' process, and there are doubtless comparable build tools in the Windows
world as well.
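To make the idea of downtranslation concrete, here is a minimal sketch in Python, offered only as an illustration and not as the method recommended in this handout. The XML vocabulary (dictionary, entry, lemma, cont), the lexicon name Nouns, and the function names are all hypothetical, invented for this example. The set of characters escaped with % is a conservative assumption and should be checked against the lexc documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML dictionary; the element names are assumptions
# made for this sketch, not a format defined in this handout.
SAMPLE_XML = """
<dictionary>
  <entry><lemma>dog</lemma><cont>#</cont></entry>
  <entry><lemma>ice cream</lemma><cont>#</cont></entry>
</dictionary>
"""

# Characters assumed to need a preceding % to be read literally in lexc.
LEXC_SPECIAL = set("%:;!<>#0 ")


def escape_lexc(text):
    """Prefix each assumed-special character with % so lexc reads it literally."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in text)


def downtranslate(xml_text, lexicon_name="Nouns"):
    """Return lexc source text built from the entries of the XML dictionary."""
    root = ET.fromstring(xml_text)
    lines = [
        "LEXICON Root",
        f"{lexicon_name} ;",
        "",
        f"LEXICON {lexicon_name}",
    ]
    for entry in root.iter("entry"):
        lemma = (entry.findtext("lemma") or "").strip()
        # The continuation class is emitted unescaped; "#" ends the word.
        cont = (entry.findtext("cont") or "#").strip()
        if not lemma:
            continue  # skip entries that have no usable lemma
        lines.append(f"{escape_lexc(lemma)} {cont} ;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(downtranslate(SAMPLE_XML), end="")
```

A script along these lines could be invoked from a 'make' rule so that the lexc file is regenerated whenever the XML source changes. A real dictionary would carry far more information per entry than this sketch handles.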
Why is it a mistake to maintain dictionaries directly in lexc format? Traditionally,
at Xerox and elsewhere, developers would continue to develop their dictionaries by