[XeTeX] Bug in XeTeX 0.997?

Wed Jan 16 11:02:12 CET 2008

On 15 Jan 2008, at 8:24 pm, Ross Moore wrote:

> Hi Youssef and Jonathan,
>
> On 15/01/2008, at 11:35 PM, Youssef Jabri wrote:
>
>> Hi Jonathan, Hi everybody,
>>
>> I am preparing a new version of Arabi that works with both eTeX and
>> XeTeX, and things work quite well so far.
>> But I noticed that the following code works with eTeX and XeTeX 0.996
>> but fails with version 0.997:
>>
>> \documentclass{article}
>> % utf8 is part of the ArabTeX package, which handles Unicode on its
>> % own; it predates the utf8 code used by the inputenc package.
>> \usepackage{arabtex,utf8}
>> \begin{document}
>> bla bla
>> \end{document}
>>
>> I am using the mac intel binary from
>>   http://minimals.contextgarden.net/current/bin/xetex/
>
> I can confirm this.
> It fails with version 0.997  whereas it works with  0.996 .

I'm a bit surprised it worked in 0.996, actually... I guess xetex
is being a bit stricter about reading UTF-8 now.

> The error message is:
>
> Runaway definition?
> ->\global \let \a@scan \utfc@scan \global \def \sc@beg {\utf@beg }
> \global \ETC.
> ! File ended while scanning definition of \set@utfc.
>
>
> The actual point of failure is at line 31 in   .../arabtex/utf8.sty
>
>     \catcode `· 11
>
> This \catcode setting does not work properly and causes the '}'
> at the end of the following line not to be recognised as
> the end of the replacement tokens for  \gdef\set@utfc{...

The problem arises because .../arabtex/utf8.sty is an 8-bit,
non-Unicode file, which xetex tries to interpret as UTF-8. When it sees
the (single-byte) code for the character that appears here as a bullet,
it takes this as the first byte of a multi-byte UTF-8 sequence.

>
> If an extra '}' is appended, the definition is completed,
> but not as the author intended; viz.
>
>> \set@utfc=macro:
> ->\global \let \a@scan \utfc@scan \global \def \sc@beg {\utf@beg }
> \global \def
> \sc@word {\utf@word }\global \a@digits = {0123456789}\global \a@first
> = {Ύϕ^^
> 92^^8d}\catcode `\BAD.1 \a@message {input encoding set to UTF-8
> conventions}}.
> l.35 \show\set@utfc
>
>
> Note the  "\catcode `\BAD.1 " and the extra "}" at the end of these
> expansion tokens; whereas with  XeTeX v0.996  the correct expansion  
> is:
>
>> \set@utfc=macro:
> ->\global \let \a@scan \utfc@scan \global \def \sc@beg {\utf@beg }
> \global \def
> \sc@word {\utf@word }\global \a@digits = {0123456789}\global \a@first
> = {Ύϕ^^
> 92^^8d}\catcode `1 \a@message {input encoding set to UTF-8
> conventions}.
> l.35 \show\set@utfc

While that runs without error messages, it is not the  
"correct" (intended) meaning of the code. Note the \catcode command,  
which is going to set the catcode of some arbitrary character  
(showing as a ".notdef" box in my email) to 1, not to 11 as  
originally intended. This is because the bytes following the "bullet"  
byte were consumed by xetex's UTF-8 interpretation.

>
>
> This problem seems to be bypassed by changing line 31 to read:
>       \catcode `\· 11
>
> but then a similar problem occurs at line 1300 in
> .../arabtex/apatch.sty, which is fixed the same way.
>
>
> These small edits do not adversely affect XeTeX v0.996 either,
> so far as I can tell without actually setting anything in Arabic.
> Certainly the packages now load without errors.

While they may load without errors, they are probably not performing  
their intended function (which is probably not needed anyway in XeTeX).

To read a file like this "correctly" with xetex, you'd need to set  
the input encoding form to "bytes". Then for the UTF-8 macro support  
to work as intended, you'd need to do the same with the actual text  
files, too. But far better to forget all this and simply allow xetex  
to process the UTF-8 text natively.
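
As a rough, untested sketch of the "bytes" route: assuming a xetex
build that provides the \XeTeXdefaultencoding and \XeTeXinputencoding
primitives, the encoding used for newly opened files could be switched
around the package load, something like this:

```latex
\documentclass{article}
% Assumption: files opened from here on are read as raw 8-bit bytes,
% so the 8-bit characters in utf8.sty and apatch.sty are not
% misread as the start of multi-byte UTF-8 sequences.
\XeTeXdefaultencoding "bytes"
\usepackage{arabtex,utf8}
% Restore the usual UTF-8 reading for files opened later on.
\XeTeXdefaultencoding "utf-8"
\begin{document}
bla bla
\end{document}
```

For ArabTeX's own UTF-8 macros to see the bytes they expect, the text
file itself would presumably also need \XeTeXinputencoding "bytes"
near the top (it applies to the current file, from the following
line). I have not checked how the rest of ArabTeX behaves under this
setup.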

JK