<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 7 May 2015 at 02:07, Ross Moore <span dir="ltr">&lt;<a href="mailto:ross.moore@mq.edu.au" target="_blank">ross.moore@mq.edu.au</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word">Hi David,<div><br>......<br></div></div></blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word"><div><div><span class=""></span><div>No disagreement to this.</div><span class=""><br></span></div></div></div></blockquote><div><br></div><div>OK:-) <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word"><div><div><span class=""><blockquote type="cite"><div><br>In the current versions ^^^^d835^^^^dc00 is two characters in luatex<br>and one character in xetex<br>as the implementation detail that xetex's underlying storage is mostly<br>UTF-16 is exposed. </div></blockquote><div><br></div></span><div>This seems to be premature of XeTeX then.</div><div>It seems to be making an assumption on how those bytes </div><div>will ultimately be used.</div></div></div></div></blockquote><div><br><br></div><div>I don't think it's so much an assumption; it's just that choosing to use UTF-16<br></div><div>as an internal string format tends to lead that way. 
Unlike UTF-8, UTF-16<br></div><div>cannot represent all code points in the 0-10FFFF range.<br></div><div>If I switch to Java(Script) notation, which does define numeric references<br></div><div>as UTF-16 units rather than Unicode code points, then, unless you make it an<br></div><div>error, you can encode an isolated surrogate such as "\ud835", but there<br></div><div>is no way to store the two-character sequence U+D835 U+DC00:<br>"\ud835\udc00" is the single character U+1D400. So you can only store<br></div><div>such a character sequence if you store each text block as a sequence of<br>separate strings, keeping unpaired surrogates apart ("\ud835","\udc00"),<br></div><div>which is a lot of effort for supporting input that should never appear.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word"><div><div><span class=""><br><blockquote type="cite"><div>If it is<br>not possible to prevent ^^^ or utf8 encoded surrogate pairs combining<br>then it is better to<br>prevent them being formed.</div></blockquote><div><br></div></span><div>Hmm. </div><div>What if you have an entirely different purpose in mind for those bytes?</div><div>You still need to be able to create them and do further processing with them.</div></div></div></div></blockquote><div><br></div><div>LuaTeX has a different mechanism for this: it allows UTF-8 encoding and ^^^ numeric<br></div><div>references to access the first 256 slots _above_ "10FFFF:<br><br></div><div>Quoting the LuaTeX manual:<br></div><div><br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">Output in byte-sized chunks can be achieved by using characters just outside of the valid Unicode range,<br>starting at the value 1 114 112 (0x110000). 
When the time comes to print a character c >= 1 114 112,<br>LuaTeX will actually print the single byte corresponding to c minus 1,114,112.<br></blockquote><br></div><div>This allows explicit byte-level access to file writing (so you can write binary data such as images)<br></div><div> without having to second guess and invert the character encoding the system uses to write characters to a file.<br></div><div><br><br></div></div></div><div class="gmail_extra">David<br></div></div>
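<div><br></div><div>A minimal Python sketch of the two points above, offered as an illustration only and not as XeTeX or LuaTeX code. It assumes Python's "utf-16-le" codec with the "surrogatepass" error handler as a stand-in for a UTF-16 string store, and it assumes the LuaTeX byte range is exactly the 256 slots 0x110000 to 0x1100FF mentioned above. The function names combine_surrogates, utf16_round_trip and luatex_output_byte are hypothetical.</div>

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_round_trip(text: str) -> str:
    """Store text as UTF-16 code units and read it back."""
    # "surrogatepass" lets Python encode and decode lone surrogates
    # instead of raising an error.
    data = text.encode("utf-16-le", "surrogatepass")
    return data.decode("utf-16-le", "surrogatepass")


def luatex_output_byte(c: int) -> int:
    """Byte value for a character in the 256 slots starting at 0x110000.

    Assumption: only the arithmetic described in the manual quote is modelled.
    """
    if not 0x110000 <= c <= 0x1100FF:
        raise ValueError("outside the byte-output range")
    return c - 0x110000


two_chars = "\ud835\udc00"            # two separate surrogate code points
stored = utf16_round_trip(two_chars)  # what a UTF-16 store hands back

print(len(two_chars), len(stored), hex(ord(stored)))
print(hex(combine_surrogates(0xD835, 0xDC00)))
print(luatex_output_byte(0x110000 + 0xFF))
```

<div>In this sketch the two-character input comes back from the UTF-16 round trip as the single character U+1D400, while a lone "\ud835" survives unchanged, and the slot 0x1100FF maps to byte 255.</div>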