[an error occurred while processing this directive]

FAQ: Dotted Brackets in Replace Rules

Q: Why is there a "dotted brackets" operator [..] -> operator and how is it different from the normal square bracket [] -> operator in Replace Rules?

A: OK. To start with, the rule operator is really just the right arrow ->. The [..] is a special "dotted-bracket" notation in Replace Rules that can be used to surround a regular expression to be replaced. It is typically needed in cases where the language to be replaced contains the empty string (epsilon).

Here's the problem: your basic right-arrow rewrite rule is always of the following form

upper -> lower || left _ right

where the left and right contexts are optional. I'll stick to the simpler

upper -> lower

type of Replace Rule for now.

Looking at generation, if the upper expression can match the input string at any point, then the rule must apply and fire. In the following case, the rule states that every a in the upper string must be mapped to a b in the lower string. E.g. applying

a -> b

downward to the input string "cacacac" will yield "cbcbcbc". The rule matches the input string in three places in this case and applies three times.

Now consider the following rules, built on the exact same pattern. [] is the same as 0 and matches the empty string.

[] -> b

is equivalent to

0 -> b

Here the [] (or the 0) is the upper expression, b is the lower expression, and -> is (as before) the rule operator. For generation (downward application), the rule [] -> b states that wherever the input string contains an empty string, then the empty string must be mapped to a b in the output string.

Syntactically and semantically, the rule [] -> b is completely parallel to a -> b. In practice, the problem with [] -> b is that any input string contains an infinite number of empty strings. E.g. the input string "ab" contains an infinite number of empty strings before the a, an infinite number of empty strings between a and b, and an infinite number of empty strings after the final b. So if you apply

[] -> b

or the equivalent

0 -> b

to the input string "ab", then the rule will match an infinite number of times and will try to generate an infinite number of output strings. This lamentable situation comes from the basic semantics of replace rules and the fact that any input string contains an infinite number of empty strings.

When xfst beginners write a rule like 0 -> b or [] -> b, what they usually intend is for the rule to fire a single time (rather than an infinite number of times) wherever the rule can match an empty string. And that's what the special "dotted-bracked" notation allows you to specify. When you write [..] -> b and apply that downwards to the input string "ac", the single result is "babcb". You can test this easily in xfst:

xfst[0]: regex [] -> b ;
124 bytes. 1 state, 3 arcs, Circular.
xfst[1]: down ac
Network has an epsilon loop on the input side.
ac

[N.B. above that xfst sees that the rule can match an infinite number of times and prints a warning about the epsilon (empty string) loop.]

xfst[1]: regex [..] -> b ;
200 bytes. 3 states, 4 arcs, Circular.
xfst[2]: down ac
babcb

With this second test, using an upper expression [..], a single b is inserted wherever an empty string can be matched in the input expression, i.e. once at the beginning, once at the end, and once between symbols.

Next step: It's important to understand that the dotted brackets [. and .] can appear alone as in [..] -> b, but this in in fact a special case. They are in fact separate operators that surround an expression designating the language to be replaced. And you typically need to use them whenever the language to be replaced can match an empty string.

Consider the following (bad) rule:

a*b* -> c

This states (for generation) that whenever the input string matches "zero or more as followed by zero or more bs, then the rule must fire and map them to a c". The problem is that a*b* can (and therefore will) match the empty string. If you apply

a*b* -> c

to any input string, e.g. "d", then the rule will once again match an infinite number of times and try to output an infinite number of solutions. (There are an infinite number of empty strings at the beginning and at the end of "d".) This can be shown easily in xfst:

xfst[0]: regex  a*b* -> c ;
344 bytes. 3 states, 18 arcs, Circular.
xfst[1]: down d
Network has an epsilon loop on the input side.
ccd
ccdc
cd
cdc
d
dc
xfst[1]: regex [.a*b*.] -> c ;
324 bytes. 5 states, 11 arcs, Circular.
xfst[2]: down d
cdc

As the example shows, the way to prevent this "infinite epsilon loop" problem is to surround the input expression with the [. and .] dotted brackets. This causes the rule to be compiled differently. That way, wherever the input string contains an infinite number of empty strings, the rule treats it like a single empty string and does a single mapping. Apply [. a*b* .] -> c to the input string "d" and you get just "cdc".

For left-arrow rules, as in

a <- b

the right-hand (lower) side is the input side, so

a <- 0

or

a <- []

or something like

a <- a*b*c*

will cause the same infinite-match problem. Applying such rules upwards to an input string will result in an infinite number of matches, technically requiring an infinite number of outputs. You can get the desired behavior by surrounding the lower-side (input) expression with dotted brackets:

a <- [..]
a <- [. a*b*c* .]

as in the following xfst example:

xfst[2]: regex a <- [. a*b*c* .] ;
388 bytes. 6 states, 15 arcs, Circular.
xfst[3]: up m
ama

So, in conclusion, you typically need to surround the expression that matches the input string with [. and .] whenever the input expression can match the empty string.

[an error occurred while processing this directive][an error occurred while processing this directive]

Last Modified:Sunday, 07-Mar-2004 22:14:41 PST