luigi.scarso at gmail.com
Fri May 15 00:51:36 CEST 2009
On Fri, May 15, 2009 at 12:22 AM, Arthur Reutenauer <
arthur.reutenauer at normalesup.org> wrote:
> > Does Unicode say what's supposed to happen when BOM is found in the
> > middle of a document?
> Yes, it behaves as a zero width no-break space, i.e., a formatting
> character preventing line break on either side of it. It draws its
> Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,
> a.k.a. ZWNBSP, a.k.a. BOM). See The Unicode Standard, version 5.0,
> chapter 16, p. 551 (http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf)
> Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying
> zero width no-break space, but BOM nevertheless has that semantics too,
> when not at the beginning of a file.
"Where the byte order is explicitly specified, such as in UTF-16BE or
UTF-16LE, then all U+FEFF characters --even at the very beginning of the
text-- are to be interpreted as zero width no-break spaces. Similarly, where
Unicode text has known byte order, initial
U+FEFF characters are not required, but for backward compatibility are to be
interpreted as zero width no-break spaces.
Systems that use the byte order mark must recognize when an initial U+FEFF
signals the byte order. In those cases, it is not part of the textual
content and should be removed before processing, because otherwise it may be
mistaken for a legitimate zero width no-break space. To represent an initial
U+FEFF ZERO WIDTH NO-BREAK SPACE in a UTF-16 file, use U+FEFF twice in a row.
The first one is a byte order mark; the second one is the initial zero width
no-break space."
"U+FEFF affects the interpretation of text and cannot be freely deleted. The
overloading of semantics for this code point has caused problems for program
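
To make the quoted rules concrete, here is a small Python sketch. It only shows how Python's standard codecs happen to treat U+FEFF, which seems to match the passages above; it is not taken from the Standard itself, and the helper strip_initial_bom is a hypothetical name introduced purely for illustration.

```python
import codecs

ZWNBSP = "\ufeff"  # U+FEFF: ZERO WIDTH NO-BREAK SPACE, also used as the BOM

# U+FEFF twice in a row, then "A", all serialized big-endian.
raw = codecs.BOM_UTF16_BE + ZWNBSP.encode("utf-16-be") + "A".encode("utf-16-be")

# "utf-16": byte order is not declared, so the first U+FEFF is consumed as a
# byte order mark and the second one survives as a real character.
detected = raw.decode("utf-16")

# "utf-16-be": byte order is declared, so every U+FEFF stays in the text,
# including the one at the very beginning.
explicit = raw.decode("utf-16-be")

print([hex(ord(c)) for c in detected])  # ['0xfeff', '0x41']
print([hex(ord(c)) for c in explicit])  # ['0xfeff', '0xfeff', '0x41']


def strip_initial_bom(text: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, leave later ones alone."""
    return text[1:] if text.startswith(ZWNBSP) else text


# Only the initial U+FEFF is removed; the second one is kept as content.
print(len(strip_initial_bom(explicit)))  # 2
```

Note that the helper removes only a single leading U+FEFF. Any U+FEFF found later in the text is left in place, since there it acts as a zero width no-break space and cannot be freely deleted.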