wl at gnu.org
Fri May 15 07:54:10 CEST 2009
>> then it should not be ignored anywhere except when it is the first
>> character of the file? i.e. setting the \catcode "FEFF = 9 would be
> Formally, yes, it's wrong. But the use of U+FEFF as zero width
> no-break space is deprecated since almost ten years, and the
> overwhelmingly vast majority of use cases of that character will be
> as "BOM" -- more correctly, as Unicode encoding scheme marker, since
> byte order is not an issue for UTF-8, as I'm sure you know.
The difference between protocol (that is, BOM at beginning of data)
and data (BOM somewhere else) is not easy to follow. The main reason
is that input for a single luatex run consists of many files; each of
them can start with a BOM -- shall it be interpreted as being part of
the protocol or part of the data?
I thus suggest, in accordance with Arthur, to always interpret U+FEFF
as the BOM, ignoring it everywhere. This reduces problems a lot
IMHO. Of course, this should be documented properly.
Perhaps it makes sense to emit a warning if U+FEFF is found in the middle
of a file.
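
To make the suggested policy concrete, here is a minimal sketch in Python
(not luatex code, and not a description of how luatex handles input). The
name strip_bom and its filename parameter are hypothetical, chosen only for
illustration. It drops U+FEFF wherever it occurs and warns if one appears
anywhere other than the very start of the input.

```python
import warnings

BOM = "\ufeff"  # U+FEFF, the character under discussion


def strip_bom(text, filename="<input>"):
    """Return text with every U+FEFF removed.

    A U+FEFF at offset 0 is treated as an ordinary BOM and dropped
    silently. Any later occurrence is dropped too, but triggers a warning.
    """
    # Character offsets of U+FEFF that are not at the very beginning.
    interior = [i for i, ch in enumerate(text) if ch == BOM and i > 0]
    if interior:
        warnings.warn(
            f"{filename}: U+FEFF found at character offset(s) {interior}; "
            "ignoring it"
        )
    return text.replace(BOM, "")
```

Calling strip_bom on each input file separately gives every file the same
treatment, so the question of whether a leading U+FEFF in the second or
third file is protocol or data no longer arises.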