[tex-hyphen] ptex-specific patterns
Arthur Reutenauer
arthur.reutenauer at normalesup.org
Mon May 31 12:35:14 CEST 2010
Hello Mojca,
Obviously if you want to use the same pattern files in pTeX as in
hyph-utf8, you need to be able to convert them from UTF-8 to the
encoding(s) pTeX uses. But:
> - reading the same utf2ec/utf2t2a etc. as for 8bit engines
> - for every UTF-8 character that might appear in some 8bit encoding,
> try if the UTF-8 code is interpreted as a single token or as two
> tokens
> - if it's two tokens, don't do anything (handled by utf2ec already)
> - if it's a single token, change its catcode to active and make it
> output the corresponding 8bit character as ^^xx
> - read the patterns
I don't get that at all. Apparently you want to start converting the
character codes to the appropriate 8-bit font encoding, depending on the
language. Then what? Why do you still consider UTF-8 byte sequences?
(And what's a "UTF-8 character"? -- I'm guessing "UTF-8 code" means a
sequence of byte(s) encoding a Unicode character in UTF-8, but I don't
know what a "UTF-8 character" is supposed to be).
> [If anyone (like Taco for example) is willing to help with the second
> step, I would welcome any idea or some working code. Arthur?]
Why not, if I understand what you want.
> Arthur - is there any counterexample that you can think of? If
> Japanese characters were composed out of three bytes, that could be a
> problem, but I think they are not.
Indeed, no encoding form used by pTeX uses more than two bytes to
encode any character. ISO-2022-JP does use escape sequences, but
that's a different issue.
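
To make the byte-level question concrete, here is a rough Python sketch
that checks which UTF-8 sequences would look like a single two-byte
character. It assumes EUC-JP input and assumes that a byte pair is taken
as one character when both bytes lie in 0xA1..0xFE (the EUC-JP range for
two-byte characters); pTeX's actual test may differ, and the function
names are made up for illustration.

```python
# Assumption: a byte pair is read as one two-byte character when both
# bytes fall in the EUC-JP two-byte range 0xA1..0xFE. This only mimics
# the idea; it is not pTeX's actual tokenizer.
def reads_as_one_euc_char(pair: bytes) -> bool:
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)


def classify(ch: str) -> str:
    """Say how the UTF-8 bytes of ch would be seen under that assumption."""
    utf8 = ch.encode("utf-8")
    return "one token" if reads_as_one_euc_char(utf8) else "separate bytes"


# A few letters that occur in the German and Russian patterns.
for ch in "äаря":
    print(ch, ch.encode("utf-8").hex(), classify(ch))
```

Under these assumptions, letters of the same alphabet can land on
different sides of the test, depending on whether their second UTF-8 byte
is below 0xA1.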
> Can ^^xx ever represent an active
> character and thus lead into an infinite recursion? (I just keep my
> fingers crossed that this is never going to happen even if possible.)
It's definitely possible for a byte value to occur both as the
starting byte of some two-byte sequence in the input text, and in the
font encoding used for output (thereby needing to be both an active
character and a letter). It's already a small miracle that this doesn't
happen for the languages we already support; I mentioned that a while
back.
In this case, you don't have infinite recursion, but the system blows
up in your face in some other way I can't remember; I ran into it two
years ago, at an early stage of converting the patterns.
There may be a workaround, though.
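
The overlap described above can be checked mechanically. The following
Python sketch collects the UTF-8 starting bytes of a set of letters and
the bytes those letters occupy in an 8-bit encoding, then intersects the
two sets. It uses Python's cp1251 codec as a stand-in for T2A, which is
an assumption on my part (the letter positions are assumed to match), and
colliding_bytes is a hypothetical helper name.

```python
def colliding_bytes(letters: str, codec: str = "cp1251") -> set:
    """Byte values that are both a UTF-8 starting byte and an output byte."""
    # Starting bytes of the multi-byte UTF-8 sequences (would become active).
    lead = {c.encode("utf-8")[0] for c in letters if len(c.encode("utf-8")) > 1}
    # Bytes used in the 8-bit output encoding (would need to be letters).
    # Assumption: cp1251 stands in for the real font encoding (T2A).
    out = {b for c in letters for b in c.encode(codec)}
    return lead & out


# Lowercase Russian letters, roughly what hyphenation patterns contain.
lowercase = "".join(chr(cp) for cp in range(0x430, 0x450)) + "ё"

print(sorted(hex(b) for b in colliding_bytes(lowercase)))
print(sorted(hex(b) for b in colliding_bytes(lowercase + "РС")))
```

With lowercase letters only, the two sets stay disjoint in this model; as
soon as uppercase letters such as Р and С appear in the output, their
byte values coincide with the UTF-8 starting bytes.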
> - For German I suggest using the new patterns.
Obviously.
> - For Ukrainian and Russian I suggest using patterns from hyph-utf8
> and remove the complex code (ukrhyph, ruhyph); whenever the first
> Russian user pops up and requests 7 different encodings and 7
> different versions of patterns, I'll change back to ruhyph. There's a
> chance that things will change in the meantime anyway.
I don't see why we shouldn't take the opportunity to adapt ruhyph's
mechanism while we're at it; it's been pending for two years...
> - If we finish by TL 2010 deadline, that's fine, if not, that's fine
> as well (no pressure; we'll try to do it, but not force it for every
> price).
It sounds crazy to imagine that we're going to have anything robust
ready for TeX Live 2010; I'd much rather take our time. Incorporating
pTeX into TeX Live seems to pose other -- in my opinion more urgent --
problems; using a copy of the old patterns in the meantime doesn't seem
to do any harm.
Arthur