[luatex] BOM

luigi scarso luigi.scarso at gmail.com
Fri May 15 00:51:36 CEST 2009

On Fri, May 15, 2009 at 12:22 AM, Arthur Reutenauer <
arthur.reutenauer at normalesup.org> wrote:

> > Does Unicode say what's supposed to happen when BOM is found in the
> > middle of a document?
>   Yes, it behaves as a zero width no-break space, i.e., a formatting
> character preventing line break on either side of it.  It draws its
> Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,
> a.k.a. ZWNBSP, a.k.a. BOM).  See The Unicode Standard, version 5.0,
> chapter 16, p. 551 (http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf)
>   Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying
> zero width no-break space, but BOM nevertheless has those semantics too,
> when not at the beginning of a file.
"Where the byte order is explicitly specified, such as in UTF-16BE or
UTF-16LE, then all U+FEFF characters -- even at the very beginning of the
text -- are to be interpreted as zero width no-break spaces. Similarly, where
Unicode text has known byte order, initial U+FEFF characters are not
required, but for backward compatibility are to be interpreted as zero
width no-break spaces.
Systems that use the byte order mark must recognize when an initial U+FEFF
signals the byte order. In those cases, it is not part of the textual
content and should be removed before processing, because otherwise it may be
mistaken for a legitimate zero width no-break space. To represent an initial
U+FEFF ZERO WIDTH NO-BREAK SPACE in a UTF-16 file, use U+FEFF twice in a row.
The first one is a byte order mark; the second one is the initial zero width
no-break space."
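
For what it's worth, the rule quoted above can be sketched in a few lines of
Python. This is only an illustration under my own assumptions: the function
name decode_utf16 and its "declared" parameter are hypothetical, and falling
back to big-endian when no BOM is present is a choice made for this sketch,
not something the passage above prescribes.

```python
import codecs
from typing import Optional


def decode_utf16(data: bytes, declared: Optional[str] = None) -> str:
    """Decode UTF-16 bytes, treating an initial U+FEFF as the quote describes.

    `declared` is a hypothetical parameter: pass "utf-16-be" or "utf-16-le"
    when the byte order is known from outside the data.
    """
    if declared in ("utf-16-be", "utf-16-le"):
        # Byte order explicitly specified: every U+FEFF, even an initial
        # one, is kept as a zero width no-break space.
        return data.decode(declared)
    # Byte order not specified: an initial U+FEFF is a byte order mark,
    # so it is removed before the rest of the text is decoded.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM at all: this sketch assumes big-endian.
    return data.decode("utf-16-be")


# U+FEFF twice in a row: the first is the BOM, the second is real text.
sample = codecs.BOM_UTF16_BE + "\ufeffA".encode("utf-16-be")
stripped = decode_utf16(sample)             # one U+FEFF survives
kept = decode_utf16(sample, "utf-16-be")    # both U+FEFF survive
```

With plain "utf-16" Python's own decoder already consumes the initial BOM,
while "utf-16-be" and "utf-16-le" leave it in the text, which matches the
two cases in the quote.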


"U+FEFF affects the interpretation of text and cannot be freely deleted, The
overloading of semantics for this code point has caused problems for programs
and protocols."
