[XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

Wed May 6 16:09:55 CEST 2015

On 6/5/15 14:14, Joseph Wright wrote:

> Based on the current files, we have a block to set \XeTeXcharclass,
> which only applies to XeTeX. The logic followed in that code is that
> characters in the file LineBreak.txt which have class "ID" (ideographs)
> not only set the \XeTeXcharclass class to 1 but also set the \catcode of
> the code point to 11. That leads to a difference between the two Unicode
> engines. My current feeling is that the data file should split this
> process such that the category code change applies to both XeTeX and
> LuaTeX, with the XeTeX-specific code separate. Does this make sense and
> indeed does the current assignment make sense?
>

ISTM that the most appropriate (default) \catcode for characters with
class ID is clearly letter (11), and I would suggest that LuaTeX should
follow XeTeX in this.

So yes, splitting out the XeTeX-specific code and having LuaTeX share 
the catcode assignments makes sense.
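
A rough sketch of what such a split might look like, assuming an engine
with e-TeX's \ifdefined (both XeTeX and LuaTeX have it). The helper name
\IDchar is made up for illustration and is not what the data file
actually uses:

```tex
% Hypothetical helper: #1 is a code point in hexadecimal, e.g. 4F60.
\def\IDchar#1{%
  % Shared by both Unicode engines: class ID characters become letters.
  \catcode"#1=11
  % XeTeX only: also put the character in \XeTeXcharclass 1.
  \ifdefined\XeTeXcharclass
    \XeTeXcharclass"#1=1
  \fi
}
\IDchar{4F60}% U+4F60
\IDchar{597D}% U+597D
```

Under LuaTeX the \ifdefined test is false, so only the \catcode
assignment takes effect.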

After all, if users can write control sequences such as

   \hello
   \halló
   \Здравствуйте
   \ሰላም
   \सलाम

they should equally well be able to write

   \你好
   \こんにちは

and have each of these treated as a single control sequence, too. This
will not work if characters of class ID are given catcode 12: TeX would
then read \你好 as the control symbol \你 followed by the character 好.
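
A minimal plain TeX illustration, assuming a Unicode engine reading
UTF-8 input. The catcodes are set explicitly here so that the example
does not depend on what the format has already done:

```tex
% U+4F60 and U+597D are the two characters of the name defined below.
\catcode"4F60=11
\catcode"597D=11
% With both characters as letters, this defines ONE control word.
\def\你好{Hello}
\message{\meaning\你好}% writes "macro:->Hello" to the terminal and log
\bye
```

If the two \catcode lines set 12 instead, the \def would define the
one-character control symbol \你 with 好 as delimiter text, not a
two-character name.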

If you're making improvements to unicode-letters.def, I would suggest
also adding a section that assigns catcode 15 (invalid) to the code
points "D800 to "DFFF (i.e. the UTF-16 surrogates, which should never be
used in isolation as characters).
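
Something along these lines might do it. This is only a sketch using
plain TeX's \loop and the scratch register \count255, not tested against
the actual file:

```tex
\begingroup
  \count255="D800
  \loop
    % \global so that the assignment survives the enclosing group.
    \global\catcode\count255=15
    \ifnum\count255<"DFFF
    \advance\count255 by 1
  \repeat
\endgroup
```

The group keeps the change to \count255 and the internals of \loop
local, while the catcode assignments themselves are global.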

JK