[tex-hyphen] Loading patterns twice, OT1 and apostrophe
Arthur Reutenauer
arthur.reutenauer at normalesup.org
Sat Jun 28 00:59:29 CEST 2008
> I've been thinking: Perhaps the final solution is to
> do away with \lccode and \uccode completely and instead base the
> system on unicode properties?
You don't say :-)
I see at least three things here:
First, the case at hand in this thread: some legacy 8-bit patterns would
ideally result in two different Unicode patterns. This is not
possible at the moment because those pattern pairs would be converted
back to the same pattern in 8-bit engines, and iniTeX rejects
duplicate patterns. It's one of the biggest problems we have had to deal
with since the beginning, but what is the rationale behind this
no-duplicate policy? I couldn't find any justification for it; the
error is thrown in “TeX: The Program” part 43, section 963, with no
comment at all; and in spite of what the help message says, Appendix H
of the TeXbook doesn't mention anything about duplicate patterns (at
least not the definitive millennium edition). A simple explanation could
be that the patterns are suspected to be buggy if they contain
duplicates; but it seems a rather weak check (and not iniTeX's job, in
my opinion), and I don't really see the harm in duplicates (just
discarding them doesn't sound that horrible). Then again, I might be
missing something, of course.
The irony here is that LuaTeX doesn't complain about duplicate
patterns anymore since the hyphenation-handling code moved over to
libHnj last October, and part 43 of the original TeX code disappeared
entirely; Taco, can you comment about that?
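
To make the first point concrete, here is a minimal Python sketch of the collision and of the "just discard them" behaviour. The mapping table and the sample patterns are made up for illustration (they are not taken from any real pattern file), and the helper names are hypothetical; the only assumption is that two Unicode characters end up in the same 8-bit slot.

```python
def to_8bit(pattern, table):
    """Map each character through a (hypothetical) Unicode-to-8-bit table."""
    return "".join(table.get(ch, ch) for ch in pattern)


def convert_and_discard_duplicates(patterns, table):
    """Convert patterns and silently drop any that collide after conversion."""
    seen = set()
    kept = []
    for pattern in patterns:
        converted = to_8bit(pattern, table)
        if converted not in seen:
            seen.add(converted)
            kept.append(converted)
    return kept


# Assumption: U+2019 and U+0027 share one slot in the 8-bit encoding.
TABLE = {"\u2019": "'"}
# Made-up patterns, distinct in Unicode but identical once converted.
PATTERNS = ["l'1a", "l\u20191a", "1ba"]
```

With these inputs the two apostrophe patterns become the same 8-bit string, so only one of them survives; an engine that raises an error instead would stop at the second one.
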
Second, I'm also tempted to say that we don't need \lccode and
\uccode values for patterns, and that we should rely on Unicode properties only.
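
As a rough illustration of what "Unicode properties only" could mean, the Python sketch below derives an lccode-like table from the Unicode character database shipped with Python. The function name and the fallback rule are assumptions made for this example, not something any engine currently does.

```python
import unicodedata


def lccode_table(chars):
    """Build an lccode-like mapping from Unicode data alone.

    Letters map to their lowercase form (or to themselves when they
    have none); non-letters are left out, which here stands in for
    an lccode of 0.
    """
    table = {}
    for ch in chars:
        if not unicodedata.category(ch).startswith("L"):
            continue
        lower = ch.lower()
        # Assumption: fall back to the character itself when the
        # lowercase form is not a single code point.
        table[ord(ch)] = ord(lower) if len(lower) == 1 else ord(ch)
    return table
```

For example, `lccode_table("\u00c9\u00e9")` maps both É and é to é, while digits and punctuation get no entry at all.
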
Finally, I had another thought that was raised by the Sanskrit
patterns: in Indic scripts, a single glyph can correspond to a large number
of characters -- up to 5 or 6, apparently, in the current patterns
contributed by Yves Codet. This fits very badly with \lefthyphenmin
and \righthyphenmin, because if we want to, say, prevent such a
5-character glyph from being hyphenated at the end of a word, we would
have to set \righthyphenmin to 5; but this would of course prevent all
the other 5-character clusters from being hyphenated, some of them
possibly corresponding to 5 actual glyphs in the font. That is,
counting characters doesn't seem as relevant as it is in the Latin or
Cyrillic scripts. I believe we should consider what Unicode calls a
grapheme cluster (“what the user thinks of as a character”) instead of
characters. Needless to say, for the existing patterns the two concepts
overlap to a great extent; the vast majority of grapheme clusters can be
represented with a single Unicode character, if not all of them. This
does not hold at all, however, for Indic scripts (nor for Arabic,
but that's hardly relevant for hyphenation).
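
The gap between counting characters and counting clusters can be sketched in a few lines of Python. This is only an approximation and not the Unicode segmentation algorithm: it attaches combining marks to the preceding character and, as a further assumption, keeps a consonant that follows a virama (canonical combining class 9) in the same cluster, which is roughly the "one glyph for a conjunct" situation described above. The function name is hypothetical.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class shared by virama signs


def approx_clusters(word):
    """Split a word into approximate user-perceived characters."""
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        # Assumption: a character right after a virama joins the conjunct.
        after_virama = (
            bool(clusters)
            and unicodedata.combining(clusters[-1][-1]) == VIRAMA_CCC
        )
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Four code points (KA, VIRAMA, SSA, VOWEL SIGN I), one approximate cluster.
SAMPLE = "\u0915\u094d\u0937\u093f"
```

Here `len(SAMPLE)` is 4 while `len(approx_clusters(SAMPLE))` is 1, whereas for a plain Latin word such as "abc" both counts are 3. A \righthyphenmin expressed in clusters rather than characters would treat the two cases alike.
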
Arthur