[XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

Joseph Wright joseph.wright at morningstar2.co.uk
Wed May 6 22:15:07 CEST 2015

On 06/05/2015 15:09, Jonathan Kew wrote:
> On 6/5/15 14:14, Joseph Wright wrote:
>> Based on the current files, we have a block to set \XeTeXcharclass,
>> which only applies to XeTeX. The logic followed in that code is that,
>> for characters which have class "ID" (ideographs) in LineBreak.txt, it
>> not only sets the \XeTeXcharclass to 1 but also sets the \catcode of
>> the code point to 11. That leads to a difference between the two Unicode
>> engines. My current feeling is that the data file should split this
>> process such that the category code change applies to both XeTeX and
>> LuaTeX, with the XeTeX-specific code separate. Does this make sense and
>> indeed does the current assignment make sense?
> ISTM that the most appropriate (default) \catcode for characters with
> class ID is clearly letter (11), and would suggest that LuaTeX should
> follow XeTeX in this.

Well, for LaTeX at least, the team get to make the call here, and I think
we will bring the two engines into line.

> So yes, splitting out the XeTeX-specific code and having LuaTeX share
> the catcode assignments makes sense.

OK, if there are no objections, I have a plan for this (I think I'll
actually keep all of the data and alter the assignment code).
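
A minimal sketch of what such a split might look like, assuming plain TeX's
\loop (also available in the LaTeX kernel) and an e-TeX-capable engine for
\ifdefined. The helper name \SetIDRange is hypothetical and is not taken
from the actual data file; the range U+4E00 to U+9FFF is only an
illustrative stretch of class "ID" code points, since the real ranges come
from LineBreak.txt.

```tex
% Hypothetical helper: \SetIDRange{<first hex>}{<last hex>}
% Uses \count255 as a scratch register.
\def\SetIDRange#1#2{%
  \count255="#1
  \loop
    % Shared by XeTeX and LuaTeX: make the character a letter.
    \catcode\count255=11
    % XeTeX only: the primitive does not exist in LuaTeX, so test for it.
    \ifdefined\XeTeXcharclass
      \XeTeXcharclass\count255=1
    \fi
    \ifnum\count255<"#2
      \advance\count255 by 1
  \repeat
}

% Illustrative call only; real ranges would be read from the data.
\SetIDRange{4E00}{9FFF}
```

The point of the sketch is only that the \catcode line sits outside the
engine test, so both engines end up with the same category codes.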

> After all, if users can write control sequences such as
>   \hello
>   \halló
>   \Здравствуйте
>   \ሰላም
>   \सलाम
> they should equally well be able to write
>   \你好
>   \こんにちわ
> and have each of these treated as a single control sequence, too. This
> will not work if class ID characters are given catcode 12.

Entirely reasonable.
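
To illustrate the tokenization point, here is a small sketch, assuming a
Unicode engine (XeTeX or LuaTeX) reading UTF-8 input. The catcodes are set
explicitly so the example does not depend on what the format has already
loaded.

```tex
% With catcode 12 ("other"), \你 is a single-character control symbol,
% so \def\你好{...} would define \你 with parameter text "好".
\catcode`\你=12 \catcode`\好=12

% With catcode 11 ("letter"), both characters join the control word.
\catcode`\你=11 \catcode`\好=11
\def\你好{Hello}% one control sequence named "你好"
\message{\meaning\你好}
```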

> If you're making improvements to unicode-letters.def, I would suggest
> also adding a section that assigns catcode 15 (invalid) to the code
> values "D800 - "DFFF (i.e. the UTF-16 surrogates, which should never be
> used in isolation as characters).

Noted: easy enough to add.
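
Something along these lines might do it; this is a sketch only, assuming
plain TeX's \loop and using \count255 as a scratch register, and it is not
the code that ended up in the file.

```tex
% Mark the UTF-16 surrogate code points "D800 to "DFFF as invalid.
\count255="D800
\loop
  \catcode\count255=15
  \ifnum\count255<"DFFF
    \advance\count255 by 1
\repeat
```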

Joseph Wright
