<br><br><div class="gmail_quote">On Fri, May 15, 2009 at 12:22 AM, Arthur Reutenauer <span dir="ltr">&lt;<a href="mailto:arthur.reutenauer@normalesup.org">arthur.reutenauer@normalesup.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">&gt; Does Unicode say what&#39;s supposed to happen when BOM is found in the<br>

&gt; middle of a document?<br>

<br>

</div>  Yes, it behaves as a zero width no-break space, i.e., a formatting<br>

character preventing line break on either side of it.  It draws its<br>

Unicode name from that property (U+FEFF ZERO WIDTH NO-BREAK SPACE,<br>

a.k.a. ZWNBSP, a.k.a. BOM).  See The Unicode Standard, version 5.0,<br>

chapter 16, p. 551 (<a href="http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf" target="_blank">http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf</a>)<br>

<br>

  Since Unicode 3.2, U+2060 WORD JOINER is preferred for specifying<br>

zero width no-break space, but BOM nevertheless has those semantics too,<br>

when not at the beginning of a file.<br>

<div class="im"></div></blockquote><div>&quot;Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters -- even at the very beginning of the text -- are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces.<br>

....<br>Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. To represent an initial U+FEFF ZERO WIDTH NO-BREAK SPACE in a UTF-16 file, use U+FEFF twice in a row.<br>

The first one is a byte order mark; the second one is the initial zero width no-break space.&quot;<br><br>And further on:<br><br>&quot;U+FEFF affects the interpretation of text and cannot be freely deleted. The overloading of semantics for this code point has caused problems for programs and protocols.&quot;<br>
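<br>For what it is worth, the rules quoted above can be tried out with a small Python sketch. Python is my own choice here, not something the standard mentions, and the helper names decode_marked and decode_big_endian are made up for illustration. It assumes that CPython&#39;s &quot;utf-16&quot; codec removes one initial BOM while &quot;utf-16-be&quot; treats every U+FEFF as text, which is how I understand those codecs to behave:<br>

```python
import codecs


def decode_marked(data):
    """Decode UTF-16 whose byte order is signalled by an initial BOM.

    The "utf-16" codec consumes exactly one leading BOM and uses it to
    pick the byte order; it is not part of the returned text.
    """
    return data.decode("utf-16")


def decode_big_endian(data):
    """Decode UTF-16 whose byte order is stated explicitly (UTF-16BE).

    No BOM is expected here, so a leading U+FEFF is kept as an ordinary
    ZERO WIDTH NO-BREAK SPACE character.
    """
    return data.decode("utf-16-be")


payload = "abc".encode("utf-16-be")

# One BOM in front of the payload.
with_bom = codecs.BOM_UTF16_BE + payload

# U+FEFF twice in a row: a BOM followed by an initial ZWNBSP.
doubled = codecs.BOM_UTF16_BE + codecs.BOM_UTF16_BE + payload

print(repr(decode_marked(with_bom)))      # BOM removed
print(repr(decode_big_endian(with_bom)))  # U+FEFF kept as text
print(repr(decode_marked(doubled)))       # only the first U+FEFF removed
```

In the third case only the first U+FEFF is dropped, so the decoded text still starts with a zero width no-break space, which matches the &quot;use U+FEFF twice in a row&quot; advice.<br>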

</div></div><br>-- <br>luigi<br><br>