[XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

Wed May 6 15:14:43 CEST 2015

Hello all,

As some people will have seen, the LaTeX team have recently integrated
setting of codes (\catcode, \lccode, etc.) for the entire Unicode range
into the kernel when XeTeX/LuaTeX are in use. This is not a functional
change for end users but does mean that the team now have some control
over these important settings. Notably, the new data file we have
created (unicode-letters.def) is compatible with plain TeX and works
with both XeTeX and LuaTeX. We are therefore hopeful that it will
prove useful not only to LaTeX users but also to those using
plain-based formats.

For the initial pass we have adopted the settings applied by
unicode-letters.tex (XeTeX)/luatex-unicode-letters.tex (LuaTeX) as-is.
We have constructed a new (TeX) script to generate this data from the
raw Unicode data files.

Most of the settings are straightforward and shared between XeTeX and
LuaTeX. For example, characters marked by Unicode as letters have
\catcode 11, \lccode and \uccode are set up based on case relationships,
etc. However, we would like to raise one area that may need revision.
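
To illustrate the kind of letter settings described above, here is a
minimal sketch for a single case pair (U+00E9 and U+00C9). It is written
by hand for illustration and is not an extract from unicode-letters.def;
it assumes a Unicode engine (XeTeX or LuaTeX) and plain TeX syntax.

```tex
% Illustrative only: hand-written values, not copied from the data file.
% U+00E9 (lowercase): letter; lowercases to itself, uppercases to U+00C9.
\catcode"00E9=11 \lccode"00E9="00E9 \uccode"00E9="00C9
% U+00C9 (uppercase): letter; lowercases to U+00E9, uppercases to itself.
\catcode"00C9=11 \lccode"00C9="00E9 \uccode"00C9="00C9
```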

Based on the current files, we have a block to set \XeTeXcharclass,
which only applies to XeTeX. The logic followed in that code is that
characters in the file LineBreak.txt which have class "ID" (ideographs)
not only have their \XeTeXcharclass set to 1 but also have the \catcode
of the code point set to 11. As the block is XeTeX-only, that leads to a
difference in category codes between the two Unicode engines. My current
feeling is that the data file should split this process such that the
category code change applies to both XeTeX and LuaTeX, with the
XeTeX-specific code separate. Does this make sense and indeed does the
current assignment make sense?
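
As a rough sketch of the split suggested above (an assumption about how
it might look, not the actual contents of the data file), the category
code could be set unconditionally and the character class only when the
XeTeX primitive exists. The macro name \IDEOGRAPH is hypothetical, and
the sketch handles a single code point, whereas LineBreak.txt also lists
ranges.

```tex
% Hypothetical sketch: \IDEOGRAPH is an invented name for illustration.
% Assumes a Unicode engine and that \undefined is not defined.
\def\IDEOGRAPH#1{%
  \catcode"#1=11 % letter category code, applied on both engines
  \ifx\XeTeXcharclass\undefined\else
    \XeTeXcharclass"#1=1 % class 1 (ideograph), XeTeX only
  \fi
}
\IDEOGRAPH{4E00}
```

With this arrangement LuaTeX skips the \XeTeXcharclass assignment but
still receives the same \catcode as XeTeX.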

We are very keen to hear about any other logic changes that may be
required in the data file. This is a complex area and we have at present
done little other than copy the current logic.
--
Joseph Wright