[XeTeX] How to prevent Chinese chars from being treated as part of a TeX command?

Sat Oct 17 13:31:02 CEST 2009

On 17 Oct 2009, at 07:02, Joseph Wright wrote:

> mhbezine209 mhbezine2009 wrote:
>> I have found a problem with XeTeX: I often encounter errors like
>> "! Undefined control sequence l.6 \TeX你好"
>> when I typeset Chinese documents with XeTeX.
>> See the example below for the source of the errors.
>> ------cut from here----------
>> \documentclass{article}
>> \usepackage{xeCJK}
>> \begin{document}
>> \TeX你好Hello
>> \end{document}
>> -------end----------------------
>>
>> Such errors occur when Chinese characters (or any other non-ASCII
>> Unicode characters) immediately follow a valid command.
>> In other words, if there is no space between the Chinese characters
>> and a command name, XeTeX treats the Chinese characters as part of the
>> command name, so it issues an error message. I do not know whether
>> this is a bug in XeTeX or intended. Anyway, I find this design very
>> annoying because I must manually add a space or {} after each command
>> name to avoid such errors. Does anybody have a good solution to this
>> problem? It would be desirable if this feature of XeTeX could be
>> disabled with one command or a macro. I think it would be better to
>> restrict command names to ASCII characters.
>> Thanks for any discussion on this issue :-)
>>
>
> TeX treats any "letters" as part of a control sequence, so if I write:
>
> \TeXHello
>
> TeX will complain, and I need to write
>
> \TeX Hello
>
> All XeTeX is doing is extending this concept to UTF-8 by setting a lot
> more characters up as "letters". So everything seems pretty consistent
> to me. Most users want to use non-ASCII characters in csnames with
> XeTeX, in any case.

Right. Basically, the character (category) codes in the xetex/xelatex  
formats are initialized based on Unicode character properties;  
anything that is classified as a "letter" in Unicode is given \catcode  
11, so that xetex also treats it as a letter. This includes the  
Chinese characters, as well as letters in the various alphabetic  
scripts. These assignments are made in the file unicode-letters.tex,  
which is loaded during format file creation.
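
To see these assignments for yourself, a small test file along the
following lines should do (a sketch only; the two characters checked
are arbitrary examples). Run it with xelatex and look at the terminal
or log output; \typeout simply writes the expanded text there.

```latex
% Save as UTF-8 and run with xelatex.
\documentclass{article}
\begin{document}
% \catcode`<char> yields the current category code of <char>;
% \the turns that number into printable digits.
\typeout{catcode of the ideograph: \the\catcode`你}
\typeout{catcode of the comma: \the\catcode`\,}
\end{document}
```

Per the description above, the ideograph should report 11 (letter),
while the comma reports 12 (other).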

If you want to change this in your documents, you could write a macro  
\MakeCJKother that changes the \catcode of all those characters from  
11 to 12 ("other"). (Use a loop macro!) Then they will terminate  
control-sequence names, just like punctuation characters, etc.

(On the other hand, this would prevent you from using
multi-Chinese-character macro names such as \你好 or \谢谢. Currently,
these work just like alphabetic equivalents such as \नमस्ते or
\спасибо.)
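
A minimal, untested sketch of what such a \MakeCJKother could look
like. The helper \MakeRangeOther and the counter \cjkchar are names
made up for this illustration, and the range U+4E00..U+9FFF is an
assumption: it covers only the basic CJK Unified Ideographs block, so
extension blocks, kana, hangul, etc. would each need their own
\MakeRangeOther line.

```latex
\documentclass{article}
\usepackage{xeCJK}

% Hypothetical helper: give every code point in the inclusive range
% [#1, #2] category code 12 ("other").
\newcount\cjkchar
\newcommand*\MakeRangeOther[2]{%
  \cjkchar=#1\relax
  \loop
    \catcode\cjkchar=12\relax
    \ifnum\cjkchar<#2\relax
    \advance\cjkchar by 1
  \repeat
}

% Assumed range: basic CJK Unified Ideographs only.
\newcommand*\MakeCJKother{%
  \MakeRangeOther{"4E00}{"9FFF}%
}

\MakeCJKother

\begin{document}
% The ideograph is no longer a letter, so it ends the name \TeX.
\TeX你好Hello
\end{document}
```

The catcode change applies only to text read after \MakeCJKother has
been executed, so it belongs in the preamble, after any package that
defines macros with Chinese names.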

JK