[XeTeX] Hyphenation around „ß“

Jonathan Kew jfkthame at googlemail.com
Mon Jan 27 11:23:02 CET 2014


On 27/1/14 09:47, Ulrike Fischer wrote:
> Am Mon, 13 Jan 2014 08:24:30 +0000 schrieb Jonathan Kew:
>
>
>>> So is it really true that XeTeX is not able to apply the TeX hyphenation
>>> mechanism correctly to some Unicode characters like „ß“?
>>> I can't believe it.
>
>> That seems unlikely. It's almost certainly being affected by something
>> that latex or babel or whatever is setting up.
>
> The problem seems to be the \lccode and \uccode of ß:
>
> During format generation of xelatex (just before the patterns are
> read) they are set to 255 and 223 by "\reserved@a{"C0}{"DF}".

Aha. That looks like it relates to a legacy 8-bit codepage (Cork?), and 
is incorrect for a Unicode world.

\lccode of ß should certainly be 223 (0xDF), corresponding to its 
Unicode value U+00DF LATIN SMALL LETTER SHARP S.

Its \uccode is debatable; it should probably also be 223, as ß is 
normally treated as non-uppercaseable (or as uppercasing to "SS", which 
can't be done via \uccode), but another option would be 0x1E9E, for the 
(relatively recently-encoded) Unicode letter U+1E9E LATIN CAPITAL LETTER 
SHARP S.
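
As a minimal sketch of those two options, assuming XeTeX with UTF-8
input (which \uccode line to use is a judgment call, not something
settled here):

    \lccode`\ß="DF      % 223: U+00DF is its own lowercase form
    \uccode`\ß="DF      % option 1: no single-character uppercase
    % \uccode`\ß="1E9E  % option 2: U+1E9E LATIN CAPITAL LETTER SHARP S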

> But later on xelatex.ini resets them both to 223 and this disturbs
> the hyphenation:
>
>
> \documentclass{article}
>
> \textwidth=1in
> \usepackage{fontspec}
> \usepackage[german]{babel}
>
> \begin{document}
>
> \showthe\lccode`\ß
> \showthe\uccode`\ß
>
> \noindent wußte geißeln wußte geißeln wußte geißeln
>    wußte geißeln wußte geißeln wußte geißeln
>    wußte geißeln wußte geißeln wußte geißeln
>    wußte geißeln wußte geißeln wußte geißeln
> \par
>
> %Setting values active at format generation works:
> \lccode`\ß=255
> \uccode`\ß=223
>
> \noindent wußte geißeln wußte geißeln wußte geißeln
>    wußte geißeln wußte geißeln wußte geißeln
>    wußte geißeln wußte geißeln wußte geißeln
>    wußte geißeln wußte geißeln wußte geißeln
> \par
>
> \end{document}
>
> I don't know if it is expected behaviour or a bug in xetex that
> the lccode/uccode matters.

It's expected that it matters, because text is mapped via \lccode for 
matching against hyphenation patterns.

(AFAIR, \uccode should be irrelevant here.)
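
The same mapping is applied on both sides: \patterns stores each
pattern letter as its \lccode at the moment the patterns are read, and
the letters of a word are mapped through the \lccode in force when the
paragraph is broken into lines. A rough sketch of the resulting
mismatch, assuming iniTeX and a made-up toy pattern (not taken from any
real pattern file):

    % At format generation (iniTeX only):
    \lccode`\ß=255
    \patterns{ß1t}    % toy pattern; ß is stored as code 255

    % Later, in the format or document:
    \lccode`\ß=223    % words now present ß as 223, so the
                      % stored pattern no longer matches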

> But you get the same behaviour with
> lualatex, only the "other way round": as the patterns are read only at
> the beginning of the document, the first paragraph in my example works
> fine, but the second with the changed lccode/uccode fails.
>
>
> So this probably means that the code at the end of xelatex.ini which
> resets catcodes, lccodes/uccodes etc. should move to the beginning of
> hyphen.cfg so that the correct codes are active when the patterns
> are read.

That sounds right, I think. Or maybe this can be fixed within the 
hyph-utf8 code somewhere.

Reading the patterns with incorrect lccodes (in particular, \lccode `ß = 
255) may appear to work if you set that same lccode in the document, but 
it's still wrong - and seems likely to lead to confusion with ÿ, which 
is U+00FF.
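
To make the collision concrete, a short sketch (assuming XeTeX with
UTF-8 input; the values are the ones discussed above):

    \lccode`\ß=255    % legacy 8-bit value
    \lccode`\ÿ=255    % ÿ is U+00FF = 255, its own lowercase form
    % Both letters now look like code 255 to the hyphenation
    % routine, so a pattern containing one also matches the other.

    \lccode`\ß=223    % Unicode value: ß and ÿ stay distinct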

Thanks for the diagnosis!

JK


