# [texhax] unicode

pierre.mackay pierre.mackay at comcast.net
Wed Aug 3 19:01:32 CEST 2005

Alexander Grahn wrote:

>
>I will give the ucs-package a try. My computer has not been generally unicode
>enabled. Therefore, original strings are ordinary (8 bit?)-ASCII.
>
>I found an example PDF-file which suits my needs. At one position therein
>the string appears as
>
>  /T (somestring)
>
>and at another as
>
>  [(^@s^@o^@m^@e^@s^@t^@r^@i^@n^@g) 10 0 R ...]
>
>
>
Actually, both of them can be considered Unicode. The difference is that
one is stored in UTF-8, which for these characters is byte-for-byte the
same as 7-bit ASCII, and the second is stored as 16-bit wide characters,
which your editor does not recognize and shows with a ^@ before each
letter. As we slowly creep toward Unicode compatibility, it is going to
be important to keep the character codes, expressed as code points in
the Unicode Standard, distinct from their various encodings in
software. Here I might make a plug for the latest Open Office Writer,
which does a splendid job of converting the Macintosh 16-bit + 8-bit
convention to true UTF-8. (I don't know what Macintosh does about 24-bit
pages in the standard. It's not a problem I ever expect to have.

Anyway, until you get past characters with an octal value of 0177, or
decimal 127, you really can't say that ASCII is not Unicode. The early
idea was that everything should be converted to wide-char,
at the price of doubling the size of all text files, but a glance over
source code that one would expect to be affected by this idea
indicates that very few developers are willing to take the hit. UTF-8 is
really quite remarkable as a solution.)
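
The two byte layouts are easy to check for yourself. What follows is only
a minimal sketch in Python (my choice of tool here, not anything used in
the original exchange); the string is the one from the PDF example, and
the variable names are purely illustrative.

```python
# Compare the byte-level encodings of one ASCII-only string.
text = "somestring"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte-order mark

# For code points below 128, UTF-8 and ASCII give identical bytes.
same_as_ascii = (ascii_bytes == utf8_bytes)

# The 16-bit form puts a null high byte in front of every ASCII
# character; render each null the way an 8-bit editor shows it, as ^@.
caret_view = "".join("^@" if b == 0 else chr(b) for b in utf16_bytes)

# The wide form is twice as long for this kind of text.
size_ratio = len(utf16_bytes) / len(utf8_bytes)

# Above 127 the encodings diverge: U+00E9 takes two bytes in UTF-8.
e_acute_utf8 = "\u00e9".encode("utf-8")

print(same_as_ascii)
print(caret_view)
print(size_ratio)
print(e_acute_utf8)
```

The caret view reproduces the second form of the string quoted above, and
the size ratio is the doubling that made wide-char storage unattractive.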

>if I open the file in the Vim editor.
>
>The first occurrence is obviously plain ASCII and the second one Unicode
>(the PDF-specification 1.6 is saying this).
>
>If, for example, I define a string
>as
>
>  \def\mystring{^@C^@a^@r^@l}
>
>
>
  \def\anotherstring{^^@A^^@B^^@C}

is accepted, but when you typeset it with

  ABC\anotherstring DEF

the nulls seem to be stripped out.

A hex dump of the dvi file shows the sequence ABCABCDEF, with the null
bytes omitted.

Curiouser and curiouser

Pierre