[luatex] BOM

Arthur Reutenauer arthur.reutenauer at normalesup.org
Fri May 15 00:22:58 CEST 2009


> Does Unicode say what's supposed to happen when BOM is found in the
> middle of a document? 

  Yes, it behaves as a zero width no-break space, i.e., a formatting
character that prevents a line break on either side of it.  It draws its
Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,
a.k.a. ZWNBSP, a.k.a. BOM).  See The Unicode Standard, version 5.0,
chapter 16, p. 551 (http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf).

  Since Unicode 3.2, U+2060 WORD JOINER has been the preferred character
for a zero width no-break space, but U+FEFF nevertheless keeps those
semantics too when it is not at the beginning of a file.
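
  For illustration only, here is a rough Python sketch of the two
characters involved.  The helper names (describe, replace_interior_bom)
are made up for this example, and swapping interior U+FEFF for U+2060 is
just one possible policy, not something Unicode or LuaTeX requires.

```python
import unicodedata

BOM = "\ufeff"          # U+FEFF, a BOM only when it is the first character
WORD_JOINER = "\u2060"  # U+2060, preferred since Unicode 3.2


def describe(ch):
    """Return the code point and Unicode character name of ch."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


def replace_interior_bom(text):
    """Hypothetical helper: leave the first character alone and replace
    every later U+FEFF with U+2060, which has the same no-break meaning."""
    if not text:
        return text
    return text[0] + text[1:].replace(BOM, WORD_JOINER)


print(describe(BOM))
print(describe(WORD_JOINER))
```

  The first character is left untouched on purpose, since that is the one
position where U+FEFF acts as a byte order mark.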

> Somehow I thought LuaTeX only read UTF-8, not UTF-16.  Wrong?

  You're right, but as Yannis points out, many text editors nevertheless
put a UTF-8 BOM at the beginning of a file.  It is U+FEFF encoded in
UTF-8, which gives three bytes, EF BB BF.
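
  As a minimal sketch (in Python, not LuaTeX's own code), this is how a
reader could drop those three bytes before handing the input to a UTF-8
decoder.  The function name strip_utf8_bom is an assumption made for
this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present.

    Only the very start of the input is checked; a later U+FEFF is left
    in place, since there it is a zero width no-break space.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = b"\xef\xbb\xbf\\relax"
print(strip_utf8_bom(raw).decode("utf-8"))
```

  Python's "utf-8-sig" codec does the same thing while decoding, so
raw.decode("utf-8-sig") is an alternative when the bytes are going to be
decoded anyway.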

> A .ini file (ie, all luatex .ini files) seems the wrong place to put
> this.  I like the idea of Taco initializing it that way in the engine.

  Then I didn't understand what Taco meant.

	Arthur

