taco at elvenkind.com
Fri May 15 08:48:18 CEST 2009
Arthur Reutenauer wrote:
>> Does Unicode say what's supposed to happen when BOM is found in the
>> middle of a document?
> Yes, it behaves as a zero width no-break space, i.e., a formatting
> character preventing line break on either side of it. It draws its
> Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,
> a.k.a. ZWNBSP, a.k.a. BOM). See The Unicode Standard, version 5.0,
> chapter 16, p. 551 (http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf)
> Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying
> zero width no-break space, but BOM nevertheless has those semantics too
> when not at the beginning of a file.
".. because [U+FEFF] is more commonly used as byte order mark, the
use of U+2060 word joiner to indicate word joining is strongly
preferred for any [post-3.2] text."
As we have no legacy texts, and I cannot believe there is any editor
out there that inserts ZWNBSP on its own, without user intervention,
I see no problem with using catcode 9 (ignored). Besides, \catcode
reassignments can be made by the user as well, so if someone really
wants to redefine U+FEFF as a \penalty, they can still do so.
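
A minimal sketch of both options at the macro level. It assumes LuaTeX
with a plain-like format in which ^ has catcode 7, so that the ^^^^feff
notation is available. The \penalty10000 value is only an illustration
of "redefine as a \penalty", not something settled in this thread.

```tex
% Proposed default, written out by hand: treat U+FEFF as ignored (catcode 9).
\catcode"FEFF=9

% Hypothetical user override: make U+FEFF an active character (catcode 13)
% and let it insert a penalty that forbids a break at that point.
\catcode"FEFF=13
\def^^^^feff{\penalty10000 }
```

With the first assignment in effect, a stray BOM anywhere in the input is
dropped silently. The second pair of lines replaces that behaviour for a
user who wants the ZWNBSP semantics back.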
>> Somehow I thought LuaTeX only read UTF-8, not UTF-16. Wrong?
> You're right, but as Yannis points out, many text editors nevertheless
> put a UTF-8 BOM at the beginning of file. It corresponds to three
> bytes, EF BB BF.
>> A .ini file (ie, all luatex .ini files) seems the wrong place to put
>> this. I like the idea of Taco initializing it that way in the engine.
> Then I didn't understand what Taco meant.
luatex --ini (like any tex-like engine) starts up with a handful of
preassigned \catcodes, like the ones for REVERSE SOLIDUS (0), SPACE
(10), and NULL (9).
My proposal is to add ZERO WIDTH NO-BREAK SPACE (9) to that shortlist.