FAQ: UTF-8 and Xerox/Parc Finite-State Software

ISO-8859-1 or Unicode in UTF-8 Encoding

The new versions of the Xerox/Parc Finite-State utilities xfst, lexc, tokenize and lookup can handle either

    1. ISO-8859-1 (Official ISO 8-bit Latin-1), or
    2. Unicode UTF-8

UTF-8 is now the default encoding for all applications. The character encoding can be declared explicitly on the first line of any xfst script or lexc source file:

# -*- coding: utf-8 -*-

or

# -*- coding: iso-8859-1 -*-
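
To check which declaration a source file carries before handing it to xfst or lexc, something like the following Python sketch could be used. The function name declared_encoding and the regular expression are assumptions made for this illustration; the exact matching rules that xfst and lexc apply to the first line are not described here.

```python
import re

# Hypothetical pattern for a first-line declaration such as
#   # -*- coding: utf-8 -*-
# It is an assumption for illustration, not the parser used by xfst or lexc.
_DECLARATION = re.compile(r"^#\s*-\*-\s*coding:\s*([A-Za-z0-9._-]+)\s*-\*-\s*$")


def declared_encoding(first_line: str):
    """Return the encoding named on a declaration line, or None if absent."""
    match = _DECLARATION.match(first_line.strip())
    return match.group(1).lower() if match else None
```

For the two declarations shown above, this returns "utf-8" and "iso-8859-1" respectively; for any other first line it returns None.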

We encourage users to move to Unicode UTF-8 if they need any encodings beyond the 7-bit ASCII set. Unicode is the Future. Regional 8-bit encodings such as ISO-8859-2 and mutants such as CP1252 on Windows are the Past.

The treatment of the Euro symbol is a good example of why it is best to avoid 8-bit encodings other than standard ISO-8859-1. There is no Euro symbol in the part of Unicode that corresponds to ISO-8859-1. The proper Unicode code point for € [this may or may not display correctly as the Euro sign in your browser] is decimal 8364 (0x20AC). In Windows CP1252 € has the code 128 (0x80); in ISO-8859-15 (also known as Latin-9) the € code is 164 (0xA4); in Macintosh Roman it is 219 (0xDB). These incompatible 8-bit encoding standards breed confusion. The best way out is to adopt the Unicode standard in the common UTF-8 encoding that is universally supported on all modern operating systems.
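
The incompatibility is easy to see by encoding the Euro sign with each of the encodings mentioned above. The following is a small Python sketch, not part of the Xerox/Parc tools; the codec names are the ones Python uses for these encodings, and euro_bytes is a helper name invented for this example.

```python
EURO = "\u20ac"  # the Euro sign, Unicode code point 8364 (0x20AC)


def euro_bytes(codec: str):
    """Return the bytes that encode the Euro sign in `codec`, or None
    if that encoding has no Euro sign at all."""
    try:
        return EURO.encode(codec)
    except UnicodeEncodeError:
        return None


# Python codec names for the encodings discussed in the text.
for codec in ("cp1252", "iso8859-15", "mac_roman", "utf-8", "iso8859-1"):
    print(codec, euro_bytes(codec))
```

The three 8-bit encodings each yield a different single byte (0x80, 0xA4 and 0xDB), UTF-8 yields the three-byte sequence 0xE2 0x82 0xAC, and ISO-8859-1 yields None because it has no Euro sign.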

xfst

The current version of xfst prefers Unicode in UTF-8 encoding. By default, xfst assumes that scripts and the terminal itself are in UTF-8. To change into ISO-8859-1 mode, invoke the command

xfst[]: set char-encoding latin-1

To set it back to UTF-8 mode, invoke

xfst[]: set char-encoding utf-8

You can launch xfst in ISO-8859-1 mode with an optional -latin1 flag on the Unix command line (here the dollar sign represents the Unix prompt):

$ xfst -latin1

This is equivalent to

$ xfst
xfst[]: set char-encoding latin-1

lexc

The current version of lexc assumes UTF-8 by default. The command utf8-mode toggles to the opposite latin-1 mode:

lexc> utf8-mode

To toggle back to UTF-8 mode, simply invoke the command utf8-mode again.

You can launch lexc in ISO-8859-1 with an optional -latin1 flag on the Unix command line (the dollar sign here represents the Unix prompt):

$ lexc -latin1

This is equivalent to the command sequence

$ lexc
lexc> utf8-mode

tokenize

By default, the current version of tokenize assumes that its input is in UTF-8. If the input file is in ISO-8859-1, the -latin1 flag must be added. For example, if the input file myfile.txt is in ISO-8859-1 and your tokenizer FST is in mytokenizer.fst, you could type the following at the command line:

cat myfile.txt | tokenize mytokenizer.fst -latin1 | ...

lookup

By default, the current version of lookup assumes that its input is in UTF-8. If the input is in ISO-8859-1, the -latin1 flag must be added. For example, if the input is in ISO-8859-1 and your analyzer FST is in myanalyzer.fst, the flag is added as shown below:

... | lookup myanalyzer.fst -latin1 | ...

or

lookup myanalyzer.fst -latin1 < tokenizedinputfile.txt > myout.txt

or

... | lookup -flags L"=>"LTT my.fst -latin1 > myout.txt

etc.

Beware Windows "Latin-1"

When using Latin-1, Windows (and Mac) users should stick to Official ISO Latin-1 and not use the Windows CP 1252 codepage, which is (lamentably) sometimes called "Latin-1". In real ISO Latin-1, character codes in the range 127-159 are undefined. Microsoft CP 1252 ("Windows Latin-1") assigns most of the codes in the range 128-159 to printable glyphs. For example, in Windows Latin-1, the Euro symbol has the code 128. As long as users create and apply networks on their own machine or some other Windows machine, everything seems to work fine, but the networks cannot be shared with users on other platforms and cannot be used in Xerox/Parc utf8-mode. In Latin-1 mode, xfst does not map the Microsoft Euro symbol to its proper Unicode representation \u20AC. Users whose environment is ISO-8859-15 (also known as Latin-9) run into the same problem.
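
One way to spot a file that claims to be Latin-1 but was really saved as CP 1252 is to look for bytes in the range 127-159, which true ISO-8859-1 text should not contain. The following Python sketch illustrates the idea; the function name suspicious_bytes is invented for this example, and the check is a heuristic, not something the Xerox/Parc tools perform.

```python
def suspicious_bytes(data: bytes):
    """Return (offset, value) pairs for every byte in the range 127-159
    (0x7F-0x9F), which is undefined in ISO-8859-1."""
    return [(i, b) for i, b in enumerate(data) if 0x7F <= b <= 0x9F]


# A CP 1252 Euro sign (code 128) is flagged; ordinary Latin-1 text is not.
print(suspicious_bytes("10 \u20ac".encode("cp1252")))
print(suspicious_bytes("caf\u00e9".encode("iso8859-1")))
```

An empty result does not prove that the file is ISO-8859-1; it only shows that none of the problematic codes occur in it.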

Bottom Line: Users of the Xerox/Parc finite-state software need to understand that ISO-8859-1 in xfst and the other applications means the REAL TRUE ISO-8859-1 STANDARD and not some altered variant such as Latin-9 or CP 1252 ("Windows Latin-1"). For any user who needs symbols that are not in the 7-bit ASCII set, our recommendation is to move to Unicode UTF-8. That is the only encoding that is the same across all platforms. All modern operating systems and text editors support it.
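
For users who follow this recommendation, converting existing 8-bit source files is a matter of decoding them with the encoding they were really saved in and re-encoding the text as UTF-8. Here is a minimal Python sketch under that assumption; to_utf8 is a name invented for this example, and the caller has to know the true source encoding.

```python
def to_utf8(data: bytes, source_encoding: str = "iso8859-1") -> bytes:
    """Re-encode `data` from `source_encoding` into UTF-8.

    The source encoding must be the one the file was really saved in.
    Python's iso8859-1 codec accepts every byte value, so a CP 1252 file
    decoded as ISO-8859-1 is converted without error but with the wrong
    characters for codes 128-159.
    """
    return data.decode(source_encoding).encode("utf-8")


# The same byte 0x80 means different things depending on the source encoding.
print(to_utf8(b"caf\xe9"))                        # Latin-1 input
print(to_utf8(b"\x80", source_encoding="cp1252"))  # Windows Euro sign
```

After such a conversion, any first-line encoding declaration in the file would need to be changed to utf-8 as well.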

Warning: Some UTF-8 editors insert an optional BOM (Byte Order Mark, a sequence of three bytes: 0xEF 0xBB 0xBF) at the beginning of the file. UTF-8 files that start with a BOM can be processed without removing this mark. The BOM is not required, and it is harmless unless it conflicts with the file encoding declaration that xfst and other c-fsm applications look for on the first line of a text input file:
# -*- coding: utf-8 -*-
or
# -*- coding: iso-8859-1 -*-
Please read "UTF-8 problem: BOM" for information about how to deal with this problem if it affects you.
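
If a BOM does get in the way, one generic remedy is to remove the three bytes from the start of the file. The following Python sketch shows the idea; it is not necessarily the procedure described on the page referred to above, and strip_utf8_bom is a name invented for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (0xEF 0xBB 0xBF), if any."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# With the BOM removed, the declaration is again the very first thing in the file.
with_bom = codecs.BOM_UTF8 + b"# -*- coding: utf-8 -*-\n"
print(strip_utf8_bom(with_bom))
```

Data that does not start with a BOM is returned unchanged, so the function is safe to apply to every input file.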



Last Modified: Monday, 09-Aug-2010 13:19:04 PDT