<br><br><div class="gmail_quote">On Fri, May 15, 2009 at 12:22 AM, Arthur Reutenauer <span dir="ltr"><<a href="mailto:arthur.reutenauer@normalesup.org">arthur.reutenauer@normalesup.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="im">> Does Unicode say what's supposed to happen when BOM is found in the<br>
> middle of a document?<br>
<br>
</div> Yes, it behaves as a zero width no-break space, i.e., a formatting<br>
character preventing line break on either side of it. It draws its<br>
Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,<br>
a.k.a. ZWNBSP, a.k.a. BOM). See The Unicode Standard, version 5.0,<br>
chapter 16, p. 551 (<a href="http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf" target="_blank">http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf</a>)<br>
<br>
  Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying<br>
a zero width no-break space, but BOM nevertheless has those semantics too,<br>
when not at the beginning of a file.<br>
<div class="im"></div></blockquote><div>"Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters -- even at the very beginning of the text -- are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial
U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces.<br>....<br>Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. To represent an initial U+FEFF ZERO WIDTH NO-BREAK SPACE in a UTF-16 file, use U+FEFF twice in a row.
The first one is a byte order mark; the second one is the initial zero width no-break space."<br><br>Further on:<br><br>"U+FEFF affects the interpretation of text and cannot be freely deleted. The overloading of semantics for this code point has caused problems for programs and protocols."<br>
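<br>A rough sketch of the three cases in the passage quoted above, using Python's built-in "utf-16" and "utf-16-be" codecs. The sample bytes and variable names are made up for illustration, and this only shows how those particular codecs treat an initial U+FEFF, not how every decoder behaves:<br>

```python
# Illustrative bytes: U+FEFF followed by "a", in big-endian order.
data = b"\xfe\xff\x00a"

# Byte order NOT declared ("utf-16"): the initial U+FEFF signals the byte
# order, is not part of the text, and is dropped by the codec.
undeclared = data.decode("utf-16")

# Byte order declared ("utf-16-be"): every U+FEFF, even the first one,
# stays in the text as ZERO WIDTH NO-BREAK SPACE.
declared = data.decode("utf-16-be")

# U+FEFF twice in a row in an undeclared UTF-16 stream: the first is the
# byte order mark, the second survives as the initial ZWNBSP.
doubled = (b"\xfe\xff\xfe\xff\x00a").decode("utf-16")

print(repr(undeclared), repr(declared), repr(doubled))
```

So whether the leading U+FEFF may be removed depends on how the encoding was declared, which is why it cannot be deleted blindly.<br>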
</div></div><br>-- <br>luigi<br><br>