[luatex] BOM

Taco Hoekwater taco at elvenkind.com
Fri May 15 08:48:18 CEST 2009


Arthur Reutenauer wrote:
>> Does Unicode say what's supposed to happen when BOM is found in the
>> middle of a document? 
> 
>   Yes, it behaves as a zero width no-break space, i.e., a formatting
> character preventing line break on either side of it.  It draws its
> Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,
> a.k.a. ZWNBSP, a.k.a. BOM).  See The Unicode Standard, version 5.0,
> chapter 16, p. 551 (http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf)
> 
>   Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying
> zero width no-break space, but BOM nevertheless has those semantics too,
> when not at the beginning of a file.

Exactly.

   ".. because [U+FEFF] is more commonly used as byte order mark, the
   use of U+2060 word joiner to indicate word joining is strongly
   preferred for any [post-3.2] text."

As we have no legacy texts, and I cannot believe there is any editor
out there that inserts ZWNBSP on its own (without user intervention),
I see no problem with using catcode 9 (ignored character). Besides,
\catcode reassignments can be made by the user as well, so if someone
really wants to redefine U+FEFF as a \penalty, they can still do so.
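
As a rough sketch of that last point: a user could make U+FEFF an
active character and define it as a penalty. This assumes LuaTeX with
plain-TeX-style catcodes (in particular ^ with catcode 7, so that the
^^^^feff notation works); the macro body is only an illustration, not
a recommended definition.

```tex
% Sketch only: turn U+FEFF into an active character that inserts
% a penalty forbidding a break at that point.
\catcode"FEFF=13              % 13 = active character
\def^^^^feff{\penalty10000 }  % ^^^^feff is LuaTeX notation for U+FEFF
```

The \catcode line has to be executed before the \def line is read,
which is why the two are kept on separate lines.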

>> Somehow I thought LuaTeX only read UTF-8, not UTF-16.  Wrong?
> 
>   You're right, but as Yannis points out, many text editors nevertheless
> put a UTF-8 BOM at the beginning of a file.  It corresponds to three
> bytes, EF BB BF.
> 
>> A .ini file (ie, all luatex .ini files) seems the wrong place to put
>> this.  I like the idea of Taco initializing it that way in the engine.
> 
>   Then I didn't understand what Taco meant.

luatex --ini (like any TeX-like engine) starts up with a handful of
preassigned \catcodes, like the ones for REVERSE SOLIDUS (0), SPACE
(10), and NULL (9).

My proposal is to add ZERO WIDTH NO-BREAK SPACE (9) to that shortlist.
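
In macro terms, the effect would be roughly the same as having the
following assignment already in force at startup. This is only a
sketch of the equivalent user-level setting; the actual change would
live in the engine's initialization, not in any file.

```tex
% Sketch: user-level equivalent of the proposed engine default.
\catcode"FEFF=9   % 9 = ignored character
```

Whatever value is in force can be checked interactively with
\showthe\catcode"FEFF.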

Best wishes,
Taco