[tex-hyphen] ptex-specific patterns

Arthur Reutenauer arthur.reutenauer at normalesup.org
Mon May 31 12:35:14 CEST 2010


	Hello Mojca,

  Obviously, if you want to use the same pattern files in pTeX as in
hyph-utf8, you need to be able to convert them from UTF-8 to the
encoding(s) pTeX uses.  But:

> - reading the same utf2ec/utf2t2a etc. as for 8bit engines
> - for every UTF-8 character that might appear in some 8bit encoding,
> try if the UTF-8 code is interpreted as a single token or as two
> tokens
>   - if it's two tokens, don't do anything (handled by utf2ec already)
>   - if it's a single token, change its catcode to active and make it
>     output the corresponding 8bit character as ^^xx
> - read the patterns

  I don't get that at all.  Apparently you want to start converting the
character codes to the appropriate 8-bit font encoding, depending on the
language.  Then what?  Why do you still consider UTF-8 byte sequences?
(And what's a "UTF-8 character"? -- I'm guessing "UTF-8 code" means the
sequence of bytes encoding a Unicode character in UTF-8, but I don't
know what a "UTF-8 character" is supposed to be.)
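
  To make the terminology concrete, here is a minimal Python sketch (not
TeX, and not anything pTeX itself runs).  It shows the byte sequence that
encodes one Unicode character in UTF-8, and then checks how those same
bytes read under a Japanese codec.  The helper names are hypothetical, and
using Python's euc_jp and shift_jis codecs as stand-ins for pTeX's
tokenizer is an assumption: the check is at codec level only.

```python
from typing import Optional


def utf8_code(char: str) -> bytes:
    """Return the UTF-8 byte sequence that encodes one Unicode character."""
    return char.encode("utf-8")


def reading_as(raw: bytes, codec: str) -> Optional[str]:
    """Decode raw bytes with the given codec; return None if they are invalid."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    raw = utf8_code("ä")  # two bytes: 0xC3 0xA4
    for codec in ("euc_jp", "shift_jis"):
        decoded = reading_as(raw, codec)
        count = None if decoded is None else len(decoded)
        # count == 1 means the byte pair reads as a single character,
        # count == 2 means each byte reads as a character of its own.
        print(codec, raw, count)
```

  Whether pTeX really produces one token or two for a given byte pair
would still have to be tested in pTeX itself.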

> [If anyone (like Taco for example) is willing to help with the second
> step, I would welcome any idea or some working code. Arthur?]

  Why not, if I understand what you want.

> Arthur - is there any counterexample that you can think of? If
> Japanese characters were composed out of three bytes, that could be a
> problem, but I think they are not.

  Indeed, no encoding form used by pTeX uses more than two bytes to
encode any character.  ISO 2022-JP does use escape sequences, but
that's a different issue.
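
  A quick way to look at byte lengths, as a hedged Python sketch: it
assumes EUC-JP and Shift_JIS are the two-byte encodings in question (the
message names only ISO 2022-JP), uses Python's own codecs rather than
pTeX, and the function name is hypothetical.

```python
def encoded_lengths(char, codecs=("euc_jp", "shift_jis", "iso2022_jp")):
    """Map each codec name to the number of bytes it uses for `char`."""
    return {codec: len(char.encode(codec)) for codec in codecs}


if __name__ == "__main__":
    # Hiragana "a" (U+3042), a character from JIS X 0208.
    for codec, length in encoded_lengths("あ").items():
        print(codec, length)
    # The ISO 2022-JP result is longer because the encoded string carries
    # escape sequences that switch character sets around the two data bytes.
    print("あ".encode("iso2022_jp"))
```

  The ISO 2022-JP byte count therefore says nothing about the size of the
character itself, which is why it is a separate issue.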

>                                    Can ^^xx ever represent an active
> character and thus lead into an infinite recursion? (I just keep my
> fingers crossed that this is never going to happen even if possible.)

  It's definitely possible for a byte value to occur both as the
starting byte of some two-byte sequence in the input text, and in the
font encoding used for output (thereby needing to be both an active
character and a letter).  It's already a small miracle that this doesn't
happen for the languages we already support; I mentioned that a while
back.

  In this case, you don't get infinite recursion, but the system blows
up in your face in some other way I can't remember; I ran into it two
years ago when we were converting the patterns, at some early stage.
There may be a workaround, though.
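
  The clash described above can be checked mechanically.  The following
Python sketch collects the lead bytes of the two-byte UTF-8 sequences for
a set of characters, collects the bytes the same characters occupy in an
8-bit output encoding, and intersects the two sets.  The function name is
hypothetical, and latin-1 is only a stand-in: Python ships no codec for
TeX font encodings such as EC or T2A, so real results would need the
actual encoding tables.

```python
def colliding_bytes(chars, font_codec="latin-1"):
    """Return byte values that are both a UTF-8 lead byte and an output byte.

    `chars` is the set of characters a language's patterns use;
    `font_codec` is a single-byte Python codec standing in for the
    font encoding (an assumption of this sketch).
    """
    lead_bytes = set()
    output_bytes = set()
    for char in chars:
        utf8 = char.encode("utf-8")
        if len(utf8) == 2:
            # This byte would have to be made active on input.
            lead_bytes.add(utf8[0])
        # This byte would have to be a letter on output.
        output_bytes.add(char.encode(font_codec)[0])
    return lead_bytes & output_bytes


if __name__ == "__main__":
    for sample in ("äöüß", "âÃ"):
        clash = sorted(hex(b) for b in colliding_bytes(sample))
        print(sample, clash)
```

  An empty result means no byte needs to be both active and a letter for
that character set; a non-empty one lists the bytes that would.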

> - For German I suggest using the new patterns.

  Obviously.

> - For Ukrainian and Russian I suggest using patterns from hyph-utf8
> and remove the complex code (ukrhyph, ruhyph); whenever the first
> Russian user pops up and requests 7 different encodings and 7
> different versions of patterns, I'll change back to ruhyph. There's a
> chance that things will change in the meantime anyway.

  I don't see why we shouldn't take the opportunity to adapt ruhyph's
mechanism while we're at it; it's been pending for two years...


> - If we finish by TL 2010 deadline, that's fine, if not, that's fine
> as well (no pressure; we'll try to do it, but not force it for every
> price).

  It sounds crazy to imagine that we're going to have anything robust
ready for TeX Live 2010; I'd much rather take our time.  Incorporating
pTeX into TeX Live seems to pose other -- in my opinion more urgent --
problems; using a copy of the old patterns in the meantime doesn't seem
to do any harm.

	Arthur

