Robin.Fairbairns at cl.cam.ac.uk
Fri May 15 09:00:39 CEST 2009
Werner LEMBERG <wl at gnu.org> wrote:
> >> then it should not be ignored anywhere except when it is the first
> >> character of the file? i.e. setting the \catcode "FEFF = 9 would be
> >> wrong?
> > Formally, yes, it's wrong. But the use of U+FEFF as zero width
> > no-break space has been deprecated for almost ten years, and the
> > overwhelmingly vast majority of use cases of that character will be
> > as "BOM" -- more correctly, as Unicode encoding scheme marker, since
> > byte order is not an issue for UTF-8, as I'm sure you know.
> The difference between protocol (that is, BOM at the beginning of data)
> and data (BOM somewhere else) is not easy to follow. The main reason
> is that input for a single luatex run consists of many files; each of
> them can start with a BOM -- shall it be interpreted as being part of
> the protocol or part of the data?
> I thus suggest, in accordance with Arthur, to always interpret U+FEFF
> as the BOM, ignoring it everywhere. This reduces problems a lot
> IMHO. Of course, this should be documented properly.
i agree. if even jtc1/sc2 has agreed that u+feff in text should be
deprecated, then it *must* be regarded as an idea that was seriously
broken from the start.
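
to make the two policies concrete, here is a minimal sketch in Python (not
LuaTeX code, and not how luatex implements anything). The function names and
the two sample "files" are hypothetical, made up for illustration only.
strip_leading_bom follows the formal rule (U+FEFF is protocol only as the
first character), strip_all_boms is a rough analogue of making the character
ignored everywhere, and interior_bom_offsets lists what the second policy
silently discards.

```python
BOM = "\ufeff"  # U+FEFF, the character under discussion


def strip_leading_bom(text: str) -> str:
    """Drop U+FEFF only when it is the first character (the formal rule)."""
    return text[1:] if text.startswith(BOM) else text


def strip_all_boms(text: str) -> str:
    """Drop every U+FEFF, wherever it occurs (the policy suggested above)."""
    return text.replace(BOM, "")


def interior_bom_offsets(text: str) -> list[int]:
    """Character offsets of every U+FEFF other than a leading one."""
    return [i for i, ch in enumerate(text) if ch == BOM and i > 0]


# Two hypothetical input files, each starting with a BOM, seen back to back
# as one stream of input characters.
file_a = BOM + "\\input b\n"
file_b = BOM + "hello\n"
combined = file_a + file_b

leading_only = strip_leading_bom(combined)  # second file's BOM survives
everywhere = strip_all_boms(combined)       # no U+FEFF left at all
```

under the first policy the BOM of the second file is still in the stream,
which is the multi-file problem described above; under the second policy it
is gone, along with any U+FEFF that was genuinely meant as data.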
> Perhaps it makes sense to emit a warning if U+FEFF is found in the middle
> of a file.
how? you've just explained that there's little chance of spotting it in