[luatex] BOM

Fri May 15 09:00:39 CEST 2009

Werner LEMBERG <wl at gnu.org> wrote:

> >> then it should not be ignored anywhere except when it is the first
> >> character of the file? i.e. setting the \catcode "FEFF = 9 would be
> >> wrong?
> > 
> >   Formally, yes, it's wrong.  But the use of U+FEFF as zero width
> > no-break space has been deprecated for almost ten years, and the
> > overwhelmingly vast majority of use cases of that character will be
> > as "BOM" -- more correctly, as Unicode encoding scheme marker, since
> > byte order is not an issue for UTF-8, as I'm sure you know.
> 
> The difference between protocol (that is, BOM at beginning of data)
> and data (BOM somewhere else) is not easy to follow.  The main reason
> is that input for a single luatex run consists of many files; each of
> them can start with a BOM -- shall it be interpreted as being part of
> the protocol or part of the data?
> 
> I thus suggest, in accordance with Arthur, to always interpret U+FEFF
> as the BOM, ignoring it everywhere.  This reduces problems a lot
> IMHO.  Of course, this should be documented properly.

i agree.  if even jtc1/sc2 has agreed that u+feff in text should be
deprecated, then it *must* be regarded as an idea that was seriously
broken from the start.
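
for illustration only, a minimal sketch of the "ignore it everywhere"
policy, written in python rather than as luatex code.  the names
drop_feff and read_ignoring_feff are hypothetical; this is not how
luatex implements or would implement it, just the behaviour being
proposed: every u+feff is dropped, no matter where it appears.

```python
# Sketch of the "always ignore U+FEFF" policy discussed above.
# All names here are hypothetical; this is not LuaTeX code.

FEFF = "\ufeff"  # BOM / deprecated zero width no-break space


def drop_feff(text: str) -> str:
    """Remove every U+FEFF, at the start of the text or anywhere else."""
    return text.replace(FEFF, "")


def read_ignoring_feff(data: bytes) -> str:
    """Decode UTF-8 bytes and drop every U+FEFF.

    Plain "utf-8" decoding keeps a leading BOM as U+FEFF, so the
    removal is done explicitly and uniformly by drop_feff.
    """
    return drop_feff(data.decode("utf-8"))


# Two input files, each starting with a UTF-8 BOM, read one after
# the other during a single run: both markers disappear.
file_a = b"\xef\xbb\xbf\\input b\n"
file_b = b"\xef\xbb\xbfhello\n"
combined = read_ignoring_feff(file_a) + read_ignoring_feff(file_b)
```

with this policy the question of protocol versus data never comes up,
because the character is treated the same way in both roles.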

> Perhaps it makes sense to emit a warning if U+FEFF is found in the middle
> of a file.

how?  you've just explained that there's little chance of spotting it in
that situation.

r