arthur.reutenauer at normalesup.org
Fri May 15 00:22:58 CEST 2009
> Does Unicode say what's supposed to happen when BOM is found in the
> middle of a document?
Yes, it behaves as a zero width no-break space, i.e., a formatting
character that prevents a line break on either side of it. It draws its
Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,
a.k.a. ZWNBSP, a.k.a. BOM). See The Unicode Standard, version 5.0,
chapter 16, p. 551 (http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf).
Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying a
zero width no-break space, but U+FEFF nevertheless keeps those
semantics when it is not at the beginning of a file.
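
To make the distinction concrete, here is a minimal sketch in Python (not
LuaTeX code; the function name normalize_feff is made up for illustration).
It assumes the input is already decoded text: a leading U+FEFF is treated as
a BOM and dropped, and any later U+FEFF is rewritten to the preferred
U+2060 WORD JOINER.

```python
BOM = "\ufeff"          # U+FEFF ZERO WIDTH NO-BREAK SPACE, a.k.a. BOM
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER, preferred since Unicode 3.2


def normalize_feff(text: str) -> str:
    """Drop one leading U+FEFF, then turn every remaining U+FEFF into U+2060.

    Hypothetical helper: only at the very start of the text is U+FEFF
    treated as a byte order mark; anywhere else it is a zero width
    no-break space, for which WORD JOINER is the preferred character.
    """
    if text.startswith(BOM):
        text = text[len(BOM):]
    return text.replace(BOM, WORD_JOINER)
```

Whether an engine should actually rewrite interior U+FEFF this way is a
design choice; the sketch only shows where the two meanings diverge.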
> Somehow I thought LuaTeX only read UTF-8, not UTF-16. Wrong?
You're right, but as Yannis points out, many text editors nevertheless
put a UTF-8 BOM at the beginning of a file. It is U+FEFF encoded in
UTF-8, which gives three bytes: EF BB BF.
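
As an illustration of what skipping that signature amounts to, here is a
small Python sketch (again not LuaTeX code; strip_utf8_bom is a hypothetical
name). It assumes the file contents are available as raw bytes and only
removes the three bytes when they appear at the very start.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper: bytes EF BB BF anywhere else in the input are
    left alone, since there they encode a zero width no-break space.
    """
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


# Sanity check: encoding U+FEFF as UTF-8 yields exactly EF BB BF.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
```

In Python itself the "utf-8-sig" codec does the same thing while decoding,
so the helper is only there to spell the step out.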
> A .ini file (ie, all luatex .ini files) seems the wrong place to put
> this. I like the idea of Taco initializing it that way in the engine.
Then I didn't understand what Taco meant.