Xerox Research Centre Europe,
6, chemin de Maupertuis, F-38240 Meylan, France
This paper is a review of some of the major applications of finite-state transducers in natural-language processing, ranging from morphological analysis to finite-state parsing. The analysis and generation of inflected word forms can be performed efficiently by means of lexical transducers. Such transducers can be compiled using an extended regular-expression calculus with restriction and replacement operators. These operators facilitate the description of complex linguistic phenomena involving morphological alternations and syntactic patterns. Because regular languages and relations can be encoded as finite automata, new languages and relations can be derived from them directly by the finite-state calculus. This is a fundamental advantage over higher-level linguistic formalisms.
The last decade has seen a substantial surge in the use of finite-state methods in many areas of natural-language processing. This is a remarkable comeback considering that in the dawn of modern linguistics, finite-state grammars were dismissed as fundamentally inadequate. Noam Chomsky's seminal 1957 work, Syntactic Structures, includes a short chapter devoted to ``finite state Markov processes'', devices that we now would call weighted finite-state automata. In that chapter Chomsky demonstrates in a few paragraphs that
English is not a finite state language. (p. 21)

In any natural language, a sentence may contain discontinuous constituents embedded in the middle of another discontinuous pair, as in ``If ... either ... or ... then ...''. It is impossible to construct a finite automaton that keeps track of an unlimited number of such nested dependencies. Any finite-state machine for English will accept strings that are not well-formed.
The persuasiveness of Syntactic Structures had the effect that, for many decades to come, computational linguists directed their efforts towards more powerful formalisms. Finite-state automata as well as statistical approaches disappeared from the scene for a long time. Today the situation has changed in a fundamental way: statistical language models are back and so are finite-state automata, in particular, finite-state transducers. One reason is that there is a certain disillusionment with high-level grammar formalisms. Writing large-scale grammars even for well-studied languages such as English turned out to be a very hard task. With easy access to text in electronic form, the lack of robustness and poor coverage became frustrating. But there are other, more positive reasons for the renewed interest in finite-state techniques. In phonology, it was discovered rather early  that the kind of formal descriptions of phonological alternations used by linguists were, against all appearances, finite-state models. In syntax, it became evident that although English as a whole is not a finite-state language, there are nevertheless subsets of English for which a finite-state description is not only adequate but also easier to construct than an equivalent phrase-structure grammar. Finally, considerable progress has been made in developing special finite-state formalisms that are suited for the description of linguistic phenomena and, along with them, compilers that efficiently produce automata from such a description. The automata in current linguistic applications are typically much too large and complex to be produced by hand.
The following sections will cover these positive developments in more detail.
Morphology is a domain of linguistics that studies the formation of words. It is traditional to distinguish between surface forms and their analyses, called lemmas. The lemma for a surface form such as the English word bigger typically consists of the traditional dictionary citation form of the word together with terms that convey the morphological properties of the particular form. For example, the lemma for bigger might be represented as big+Adj+Comp to indicate that bigger is the comparative form of the adjective big.
There are two challenges in modeling natural-language morphology: defining the morphotactics of the language, that is, how morphemes combine into words, and describing the alternations in the shape of morphemes, such as the gemination of g in bigger.
In the resulting transducer, each path (= sequence of states and arcs) from the initial state to a final state represents a mapping between a surface form and its lemma, also known as the lexical form. For example, the information that the comparative of the adjective big is bigger might be represented in the English lexical transducer by the path in Figure 1 where the zeros represent epsilon symbols.
An important characteristic of the finite-state transducers built at Xerox is that they are inherently bidirectional: there is no privileged input side. The path in Figure 1 can be traversed matching the surface form bigger to produce the lexical form big+Adj+Comp, or matching big+Adj+Comp to produce bigger. The same transducer can be used for analysis (surface input, ``upward'' application) or for generation (lexical input, ``downward'' application).
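As a concrete illustration, a single path of the kind shown in Figure 1 can be simulated directly. The sketch below (plain Python, not the Xerox tools) represents the path as a list of lexical:surface symbol pairs; the placement of the epsilon symbols is an illustrative assumption, not the actual arc layout of the Xerox English transducer. Applying the path upward or downward is just a matter of matching one side and reading off the other:

```python
# One path of a lexical transducer, as a list of (lexical, surface)
# symbol pairs; "" plays the role of the epsilon symbol (0).  The
# placement of the epsilons is an illustrative assumption, not the
# actual arc layout of the Xerox English transducer.
PATH = [("b", "b"), ("i", "i"), ("g", "g"),
        ("", "g"), ("", "e"), ("", "r"),
        ("+Adj", ""), ("+Comp", "")]

def traverse(path, side_in, side_out, s):
    """Match s against one side of the path, symbol by symbol, and
    read off the other side; return None if s does not fit the path."""
    out, rest = [], s
    for arc in path:
        if not rest.startswith(arc[side_in]):
            return None
        rest = rest[len(arc[side_in]):]
        out.append(arc[side_out])
    return "".join(out) if rest == "" else None

# Upward application (analysis): surface form in, lexical form out.
print(traverse(PATH, 1, 0, "bigger"))        # big+Adj+Comp
# Downward application (generation): the very same path, roles reversed.
print(traverse(PATH, 0, 1, "big+Adj+Comp"))  # bigger
```

A real lexical transducer stores a vast number of such paths in a single shared network, but the bidirectionality works in exactly this way: the same data structure serves both analysis and generation.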
A single surface string can be related to multiple lexical strings. For example, a morphological transducer for French applied upward to the surface string suis may produce the four lexical strings shown in Figure 2. Ambiguity in the downward direction is also possible, as in the relation of the lexical string payer+IndP+SG+P1+Verb (``I pay'') to the surface strings paie and paye, which are in fact alternate spellings in standard French orthography.
At Xerox, such lexical transducers have been created for a great number of languages including most of the European languages, Turkish, Arabic, Korean, and Japanese. The source descriptions are written using notations [12,9,1] that are helpful shorthands for ordinary regular expressions. The construction is commonly done by creating two separate modules: a lexicon description that defines the morphotactics of the language and a set of rules that define regular alternations such as the gemination of g and the epenthetical e in the surface form bigger. Irregular alternations such as être:suis are defined explicitly in the source lexicon. The separately compiled lexicon and rule networks are subsequently composed together as in Figure 3.
Lexical transducers may contain hundreds of thousands, even millions, of states and arcs and an infinite number of paths in the case of languages such as German that in principle allow noun compounds of any length. The regular expressions from which such complex networks are compiled include high-level operators that have been developed in order to make it possible to describe constraints and alternations that are commonly found in natural languages in a convenient and perspicuous way. We will describe them in the following sections.
The notation used in this section comes from the Xerox finite-state calculus. It is described in detail in Chapter 2 of the forthcoming book by Beesley and Karttunen. We use uppercase letters, A, B, etc., as variables over regular expressions. Lowercase letters, a, b, etc., stand for symbols. There are two special symbols: 0, the EPSILON symbol, which stands for the empty string, and ?, the ANY symbol, which represents the infinite set of symbols in some yet unknown alphabet. The special meaning of 0, ?, and any other symbol can be canceled by enclosing the symbol in double quotes.
An atomic expression consisting of a symbol pair such as a:b denotes a relation containing the corresponding pair of strings. An expression consisting of a single symbol such as a denotes the language consisting of ``a'' or, alternatively, the corresponding identity relation. The Xerox implementation intentionally does not distinguish between a and a:a.
Complex regular expressions can be built up from simpler ones by means of regular-expression operators. Square brackets, [ ], are used for grouping expressions. Because both regular languages and regular relations are closed under concatenation and union, the following basic operators can be combined with any kind of regular expression:

A | B   Union.
A B     Concatenation.
(A)     Optionality; equivalent to [A | 0].
A+      Iteration; one or more concatenations of A.
A*      Kleene star; equivalent to (A+).

Although regular languages are closed under complementation, subtraction, and intersection, regular relations are not; thus the following operators can be combined only with expressions that denote a regular language:

~A      Complement.
\A      Term complement; all single-symbol strings not in A.
A & B   Intersection.
A - B   Subtraction (minus).

Regular relations can be constructed by means of two basic operators:

A .x. B   Crossproduct.
A .o. B   Composition.
The crossproduct operator, .x., is used only with expressions that denote a regular language; it constructs a relation between them. [A .x. B] designates the relation that maps every string of A to every string of B. The notation a:b is a convenient shorthand for [a .x. b].
The syntax (though not the descriptive power) of regular expressions can be extended by defining new operators that allow commonly used constructions to be expressed more concisely. A simple example of a trivial but convenient extension is the CONTAINMENT operator $.
$A = [?* A ?*]
For example, $[a | b] denotes all strings that contain at least one ``a'' or ``b'' somewhere.
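In terms of ordinary pattern matching, containment corresponds to an unanchored search, since ?* plays the role of ``.*''. A minimal Python sketch, with the re module standing in for the finite-state calculus:

```python
import re

# $A corresponds to an unanchored search for A: [?* A ?*] accepts any
# string with an instance of A somewhere inside it.  Here A = [a | b].
def contains(s):
    return re.search("a|b", s) is not None

print(contains("xay"))   # True: contains an "a"
print(contains("xyz"))   # False: no "a" or "b" anywhere
```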
The addition of new operators can be more than just a notational convenience. A case in point is Koskenniemi's  RESTRICTION operator =>.
A => L _ R   Restriction; A only in the context of L _ R.

Here A, L and R may denote any regular language. This expression designates the language of strings that have the property that any string of A that occurs in them is immediately preceded by some string from L and immediately followed by some string from R. For example,
a => b _ c

includes all strings that contain no occurrence of ``a'', strings like ``bac-bac'' that completely satisfy the condition, but no strings like ``ab''. A special boundary symbol, .#., is used to indicate the beginning or the end of the string in contexts. For example,

a => _ .#.

allows ``a'' only at the end of a string.
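The condition that restriction imposes on strings can be checked with a simple membership test. The function below is an illustrative sketch of the semantics of a => b _ c, not a finite-state compilation:

```python
def satisfies_restriction(s):
    """Membership test for a => b _ c: every occurrence of "a" must be
    immediately preceded by "b" and immediately followed by "c"."""
    for i, ch in enumerate(s):
        if ch == "a":
            if not (s[:i].endswith("b") and s[i + 1:].startswith("c")):
                return False
    return True

print(satisfies_restriction("xyz"))      # True: no "a" at all
print(satisfies_restriction("bac-bac"))  # True: condition fully met
print(satisfies_restriction("ab"))       # False: "a" lacks a following "c"
```

Note that strings with no occurrence of ``a'' satisfy the restriction vacuously, just as the text states.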
The advantage of the restriction operator is that it encodes in a
compact way a useful condition that is difficult to express in
terms of the more primitive operators. The definition of
[A => L _ R] is shown below.
A => L _ R  =  ~[ [~[?* L] A ?*] | [?* A ~[R ?*]] ]
Another example of a useful high-level abstraction is the REPLACE operator ->. As we will see shortly, there are many constructions involving this operator.
The simplest variant is unconstrained, obligatory replacement:
A -> B   Replacement of A by B.
The [A -> B] relation maps any upper-language string to itself if the string contains no instance of A. Upper-language strings that contain instances of A are paired with lower-language strings that are otherwise identical except that each A segment is replaced by some B string. The definition of simple replacement is shown below.
A -> B  =  [ [~$[A - 0] [A .x. B]]* ~$[A - 0] ]
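Applied downward, simple obligatory replacement behaves like global substitution. A minimal sketch with Python's re.sub standing in for the replace transducer, for A = ``ab'' and B = ``x'':

```python
import re

# Simple obligatory replacement A -> B with A = "ab" and B = "x".
# re.sub stands in for applying the replace transducer downward:
# a string with no instance of A is mapped to itself, and every
# A segment is replaced by a B string.
def apply_down(s):
    return re.sub("ab", "x", s)

print(apply_down("cd"))      # cd    (identity: no instance of A)
print(apply_down("cabdab"))  # cxdx  (both A segments replaced)
```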
High-level abstractions like
A => L _ R and
A -> B are
conceptually easier to operate with than the logically equivalent but
very complex primitive formulas, just as it is easier to write complex
computer programs in a high-level language rather than in a logically
equivalent assembly language.
Instead of replacing the strings of a language by other strings, it is sometimes useful just to mark them in some special way. In the Xerox calculus, an expression of the form

A -> B ... C   Marking of A by B and C.

denotes a relation that leaves each instance of A unchanged but inserts B in front of it and C after it.
Replacement and marking can be constrained in many different ways: by context, by direction of application, by length of the pattern to be replaced or marked. The basic technique for compiling constrained replacement and marking transducers was invented in the early 1980's by Kaplan and Kay  for Chomsky-Halle-type rewrite rules . It was also used very early for Koskenniemi's two-level rules [16,14,12]. The idea was finally explained in detail in Kaplan and Kay's 1994 paper . There is now a rich literature on this topic [10,17,5,11,15,18]. The details vary but the basic method involves composing together a cascade of networks that introduce various auxiliary symbols into the input string, constrain their distribution, and finally eliminate the auxiliary alphabet. As there is no space to explore the compilation issue in a technical way, we will only explain the syntax of constrained replacement and marking expressions and give a few examples of the corresponding transducers without explaining how the expressions are compiled.
The transducers compiled from the simple replacement and marking expressions are in general ambiguous in the sense that a string in the upper language of the relation is paired with more than one lower-language string. For example, a | a a -> "[" ... "]" yields a marking transducer that maps the upper-language string ``aaa'' into three different lower-language strings:

a a a        a a a        a a a
[a][a][a]    [a][a a]     [a a][a]

The -> operator does not constrain the selection of the alternate substrings for replacement or marking. In this case, the upper-language string can be factored or parsed in three different ways.
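The three factorizations can be enumerated with a few lines of Python. The function below is an illustrative sketch that only handles inputs over ``a''; it brackets each substring chosen from the upper language a | a a (the outputs match the three lower strings above, spacing aside):

```python
def factorizations(s, patterns=("a", "aa")):
    """Enumerate every way of parsing s as a concatenation of the
    upper-language strings a | a a, bracketing each chosen piece.
    An illustrative sketch that only handles strings of a's."""
    if s == "":
        return [""]
    results = []
    for p in patterns:
        if s.startswith(p):
            for rest in factorizations(s[len(p):], patterns):
                results.append("[" + p + "]" + rest)
    return results

print(factorizations("aaa"))
# ['[a][a][a]', '[a][aa]', '[aa][a]']
```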
For many applications, it is useful to define another version of
replacement and marking that in all such cases yields a unique
outcome. The longest-match, left-to-right replace operator,
@->, defined in , imposes a unique
factorization on every input. The upper-language substrings to be
marked or replaced are selected from left to right, not allowing any
overlaps. If there are alternate candidate strings starting at the
same location, only the longest one is chosen.
Thus a | a a @-> "[" ... "]" denotes a relation that
unambiguously maps ``aaa'' to ``[aa][a]''. The transducers corresponding
to the -> and @-> variants of this expression
are shown in Figure 4.
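The left-to-right, longest-match strategy is easy to state procedurally. A sketch of the behavior of a | a a @-> "[" ... "]", again restricted to the patterns of this example, with unmatched symbols copied unchanged:

```python
def mark_longest(s, patterns=("a", "aa")):
    """Left-to-right, longest-match marking in the manner of
    a | a a @-> "[" ... "]": scan from the left, always mark the
    longest pattern that starts at the current position, and never
    allow overlaps.  Unmatched symbols are copied unchanged."""
    out, i = [], 0
    while i < len(s):
        best = max((p for p in patterns if s.startswith(p, i)),
                   key=len, default=None)
        if best is None:
            out.append(s[i])
            i += 1
        else:
            out.append("[" + best + "]")
            i += len(best)
    return "".join(out)

print(mark_longest("aaa"))   # [aa][a]
print(mark_longest("baab"))  # b[aa]b
```

Because the scan never backtracks past a marked substring, every input receives exactly one output, which is the point of the @-> operator.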
Replacement contexts can be checked on either side of the two-level relation. The Xerox calculus provides four context markers, ||, //, \\, and \/, to distinguish between the four possible cases:
|| L _ R   L and R both on the upper side
// L _ R   L on the lower, R on the upper side
\\ L _ R   L on the upper, R on the lower side
\/ L _ R   L and R both on the lower side
a a -> a || a a C+ _    (Slovak)
a a -> a // a a C+ _    (Gidabal)

Applied to the same upper-language string, the two rules yield different results:

Rule 1 (||):               Rule 2 (//):
b a a c a a d a a f a a    b a a c a a d a a f a a
b a a c a d a f a          b a a c a d a a f a
The two replacement transducers compiled from Rule 1 and Rule 2 are shown in Figure 5.
C* V+ C* @-> ... "-" || _ C V
s t r u k t u r a l i s m i
s t r u k - t u - r a - l i s - m i
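For this particular example, the effect of the directed replacement can be approximated with a greedy regular expression: insert a hyphen after each maximal C*V+C* sequence that is followed by a consonant-vowel pair. The vowel inventory below is an assumption that suffices for the Finnish example; greedy regex matching happens to coincide with the longest-match strategy here, though it does not in general:

```python
import re

# Greedy approximation of C* V+ C* @-> ... "-" || _ C V: insert a
# hyphen after each maximal C*V+C* sequence that is followed by a
# consonant-vowel pair.  The vowel inventory is an assumption that
# suffices for the Finnish example.
V = "aeiouy"
syllable = re.compile(rf"([^{V}]*[{V}]+[^{V}]*)(?=[^{V}][{V}])")

print(syllable.sub(r"\1-", "strukturalismi"))  # struk-tu-ra-lis-mi
```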
Because the hyphens are inserted on the lower side, the context must be checked on the upper side of the relation; the \/ version of the rule could not be used here.
The syllabification transducer is a simple finite-state parser: it recognizes and marks instances of a regular language in a text. In the next section we will show a more sophisticated example of this kind.
Although the syntax of a natural language cannot in general be described by a finite-state, or even a context-free grammar, there are many subsets of natural language that can be correctly described by very simple means, for example, names and titles, addresses, prices, dates, etc. In this section, we examine one such case in detail: a grammar for dates.
For the sake of illustration, let us consider here only one of several common date formats, expressions such as
Tuesday July 25 Tuesday, July 25 July 25, 2000 Tuesday, July 25, 2000
In the following we assume that a date expression consists of a day of the week, a month and a date with or without a year, or a combination of the two. Note that this description of the syntax of date expressions presents the same problem we encountered with the a | a a example in the previous section. Long date expressions, such as ``Tuesday, July 25, 2000'', contain smaller well-formed date expressions, e.g. ``July 25'', that should be ignored in the context of a larger date. In order to simplify the presentation, we stipulate that date expressions are contiguous strings, including the internal spaces and commas.
To facilitate the specification of the date language we first define some auxiliary terms and then use them to define a language of dates and a parser for the language. The complete set of definitions is shown below:
1To9 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
0To9 = "0" | 1To9
Day = Monday | Tuesday | ... | Saturday | Sunday
Month = January | February | ... | November | December
Date = 1To9 | [1 | 2] 0To9 | 3 ["0" | 1]
Year = 1To9 (0To9 (0To9 (0To9)))
AllDates = Day | (Day ", ") Month " " Date (", " Year)
From these definitions we can compile a small finite-state automaton, AllDates, with 13 states and 96 arcs that describes a language of about 30 million date expressions for the period from January 1, 1 to December 31, 9999.
A parser for the language can be compiled from the following simple regular expression.
AllDates @-> "[" ... "]"
It yields a transducer of 23 states and 321 arcs that marks maximal date expressions in the manner illustrated by the following text:
Today is [Tuesday, July 25, 2000] because yesterday was [Monday] and it was [July 24] so tomorrow must be [Wednesday, July 26].

Because of the left-to-right, longest-match constraints associated with the @-> operator, the transducer brackets only the maximal date expressions.
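A scaled-down version of this parser can be sketched with Python regular expressions. The pattern below is a hand-translated approximation of the AllDates grammar, not the compiled transducer; listing the longer alternatives first makes re.sub mimic the left-to-right, longest-match behavior of @-> on this input:

```python
import re

# A hand-translated, scaled-down approximation of the AllDates grammar.
# Listing longer alternatives first makes the regex engine mimic the
# left-to-right, longest-match behavior of @-> on this input.
DAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
MONTH = (r"(?:January|February|March|April|May|June|July"
         r"|August|September|October|November|December)")
DATE = r"(?:3[01]|[12][0-9]|[1-9])"    # 1 .. 31, longer forms first
YEAR = r"(?:[1-9][0-9]{0,3})"          # 1 .. 9999
ALLDATES = (rf"{DAY}(?:, {MONTH} {DATE}(?:, {YEAR})?)?"
            rf"|{MONTH} {DATE}(?:, {YEAR})?")

text = ("Today is Tuesday, July 25, 2000 because yesterday "
        "was Monday and it was July 24.")
marked = re.sub(ALLDATES, lambda m: "[" + m.group(0) + "]", text)
print(marked)
# Today is [Tuesday, July 25, 2000] because yesterday was [Monday]
# and it was [July 24].
```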
However, this regular-expression grammar is not optimal. The AllDates language includes a large number of syntactically correct but semantically invalid date expressions. For example, there is no ``April 31, 2000'', ``February 29, 1900'', or ``Sunday, September 29, 1941''. April has only 30 days in any year; unlike year 2000, year 1900 was not a leap year; and September 29, 1941 fell on a Monday.
All three types of imperfections can be corrected within the finite-state calculus. For each of these three types of invalid dates we can define a regular language that excludes such expressions. By intersecting these constraint languages with the AllDates language, we can define a language that contains only semantically valid dates. Figure 6 illustrates the idea.
* MaxDaysInMonth   Restriction on the distribution of 30 and 31.
* LeapDays         Restriction on February 29.
* WeekDayDates     Restrictions on weekdays and dates.
Even = "0" | 2 | 4 | 6 | 8
Odd = 1 | 3 | 5 | 7 | 9
N = 1To9 0To9*
Div4 = [((N) Even) ["0" | 4 | 8]] | [(N) Odd [2 | 6]]
LeapYears = Div4 - [[N - Div4] "0" "0"]
The expression [N - Div4] "0" "0" denotes numbers with two final zeros that are preceded by a number that is not divisible by four. For example, it includes ``1900'' but not ``2000''. Because LeapYears is defined as Div4 minus this set, it follows that the string ``2000'' is in the language but ``1900'' is not.
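The string-based definition can be checked against ordinary arithmetic. The sketch below uses int() as a shortcut for the digit-pattern tests that Div4 encodes, and verifies the resulting LeapYears membership against Python's calendar module for the years 1 through 9999:

```python
import calendar

# LeapYears = Div4 - [[N - Div4] "0" "0"]: years divisible by four,
# minus century years whose leading digits are not divisible by four.
# int() is used as a shortcut for the digit-pattern tests that the
# Div4 expression encodes.
def in_leap_years(year: str) -> bool:
    div4 = int(year) % 4 == 0
    century_exception = year.endswith("00") and int(year[:-2]) % 4 != 0
    return div4 and not century_exception

print(in_leap_years("2000"), in_leap_years("1900"))  # True False

# The string-based rule agrees with the Gregorian calendar:
assert all(in_leap_years(str(y)) == calendar.isleap(y)
           for y in range(1, 10000))
```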
Once the language of leap years is defined, the distribution of ``February 29'' in date expressions can be constrained with the following simple restriction:

LeapDays = February " " 2 9 ", " => _ LeapYears .#.
The boundary symbol, .#., is necessary here in order to rule out expressions like ``February 29, 1969'', which would qualify if we were allowed to take into account only the first three digits, 196 being a leap year in the Gregorian calendar.
The construction of the WeekDayDates constraint is not trivial but not as difficult as it might initially seem. See  for details. Having constructed the auxiliary constraint languages we can define the language of valid dates:

ValidDates = AllDates & MaxDaysInMonth & LeapDays & WeekDayDates
We could now construct a parser that recognizes only valid dates. But we actually can do something more interesting, namely, define a parser that recognizes all date expressions and marks them as valid, ``[VD '', or invalid, ``[ID '':
ValidDates @-> "[VD " ... "]" ,
[AllDates - ValidDates] @-> "[ID " ... "]"
The correct date for today is [VD Tuesday, July 25, 2000]. Today is not [ID Tuesday, July 26, 2000].
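The effect of the combined constraints can be illustrated without any finite-state machinery by testing date expressions against Python's datetime module. The tag function and its helper tables below are hypothetical names introduced for this sketch; an expression of the form ``Weekday, Month D, Y'' is valid iff the calendar date exists and falls on the named weekday:

```python
from datetime import date

# Hypothetical helper names; an expression "Weekday, Month D, Y" is
# valid iff the calendar date exists and falls on the named weekday.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def tag(expr):
    weekday, rest = expr.split(", ", 1)
    month_day, year = rest.rsplit(", ", 1)
    month, day = month_day.split(" ")
    try:
        d = date(int(year), MONTHS.index(month) + 1, int(day))
        valid = DAYS[d.weekday()] == weekday
    except ValueError:  # nonexistent dates such as February 29, 1900
        valid = False
    return ("[VD " if valid else "[ID ") + expr + "]"

print(tag("Tuesday, July 25, 2000"))  # [VD Tuesday, July 25, 2000]
print(tag("Tuesday, July 26, 2000"))  # [ID Tuesday, July 26, 2000]
```

The finite-state version accomplishes the same classification purely by intersection and subtraction of regular languages, with the entire table of valid weekday-date combinations compiled into the transducer.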
Although regular expressions and the algorithms for converting them into finite-state automata have been part of elementary computer science for decades, the restriction, replacement, and marking expressions we have focused on are relatively recent. They have turned out to be very useful for linguistic applications, in particular for morphology, tokenization, and shallow parsing. Descriptions consisting of regular expressions can be efficiently compiled into finite-state networks, which in turn can be determinized, minimized, sequentialized, compressed, and optimized in other ways to reduce the size of the network or to increase the application speed. Many years of engineering effort have produced efficient runtime algorithms for applying networks to strings.
Regular expressions have a clean, declarative semantics. At the same time they constitute a kind of high-level programming language for manipulating strings, languages, and relations. Although regular grammars can cover only limited subsets of a natural language, there can be an important practical advantage in describing such sublanguages by means of regular expressions rather than by some more powerful formalism. Because regular languages and relations can be encoded as finite automata, they can be more easily manipulated than context-free and more complex languages. With regular-expression operators, new regular languages and relations can be derived directly without rewriting the grammars for the sets that are being modified. This is a fundamental advantage over higher-level formalisms.
The translation was initiated by Lauri Karttunen on 2000-10-19