[tex-hyphen] ptex-specific patterns

Arthur Reutenauer arthur.reutenauer at normalesup.org
Mon May 31 17:20:02 CEST 2010

	Mojca,

I'm still not getting the algorithm you suggest.  In particular:

> I need a definition for command
>    \def\mycommand#1#2{...}
> that I could call as
>     % A3 is code of ccaron in EC
>     % č is two tokens: ^^c4^^8d
>     \mycommand{č}{^^a3}
>     % no idea what is Tau in greek encoding (don't care)
>     % but it's only a single token
>     \mycommand{Τ}{^^ff}
>
> The pseudocode:
>   - test if #1 is one or two tokens (use the same trick as Taco suggested)
>   - if it's interpreted as two tokens, ignore
>   - if it's interpreted as one token (like Tau),
>     make that letter \active and define it to generate #2

That won't be enough.  Because, if I understand Z. R.'s explanations
correctly, you could have the following situation:

(Assuming pTeX is in EUC-JP mode)

1. The input is “ši” (U+0161, U+0069).  It's reencoded as 0xB2, 0x69
in the EC font encoding, which is not a valid EUC-JP code, hence the
first byte is interpreted as a character, and so is the second byte.

2. The input is “šč” (U+0161, U+010D).  It's reencoded as 0xB2, 0xA3
in EC, which *is* a valid EUC-JP code (corresponding to Unicode
character U+6A2A, as it happens), hence that two-byte sequence is
interpreted as a single Japanese character, and the original input is
simply lost.

I don't see how we could solve the situation by considering each
character individually (like we currently do in UTF-8), given pTeX's
behaviour.
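
To make the two cases above concrete, here is a rough sketch in Python
(not TeX) of how the same EC bytes get grouped by a reader in EUC-JP
mode.  It is only an illustration under simplifying assumptions: the
helper names are made up, the EC table is a three-letter excerpt, and
the validity test just checks that both bytes lie in 0xA1..0xFE, which
ignores the 0x8E/0x8F prefixes and unassigned codes.  I'm not claiming
this is what pTeX does internally.

```python
# Hypothetical helper names; only the byte values come from the cases above.
# Tiny excerpt of the EC (T1) font encoding, just the letters used here.
EC = {"š": 0xB2, "č": 0xA3, "i": 0x69}


def to_ec(text):
    """Re-encode text to EC bytes using the excerpt above."""
    return bytes(EC[ch] for ch in text)


def is_euc_jp_pair(first, second):
    """Simplified test: both bytes in 0xA1..0xFE form a two-byte EUC-JP code."""
    return 0xA1 <= first <= 0xFE and 0xA1 <= second <= 0xFE


def read_as_euc_jp(data):
    """Group bytes the way a reader in EUC-JP mode would (simplified)."""
    units, i = [], 0
    while i < len(data):
        if i + 1 < len(data) and is_euc_jp_pair(data[i], data[i + 1]):
            units.append(data[i:i + 2])  # swallowed as one Japanese character
            i += 2
        else:
            units.append(data[i:i + 1])  # stays a single 8-bit character
            i += 1
    return units


for word in ("ši", "šč"):
    units = read_as_euc_jp(to_ec(word))
    print(word, [u.hex() for u in units])

# Python's own EUC-JP codec shows which character the pair turns into.
print(hex(ord(to_ec("šč").decode("euc_jp"))))
```

The point is that the grouping depends on the *following* byte, so
whether "š" survives can't be decided by looking at "š" alone.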

>>  In this case, you don't have infinite recursion, but the system blows
>> up in your face in some other way I can't remember; I came into it two
>> years ago when we were converting the patterns, at some early stage.
>> There may be a workaround, though.
>
> Let's just assume that this won't happen. If it does, we'll care about it later.

You can't just assume that.  On the contrary, we need to know now
whether it's going to happen, so that we can prevent it.

> Maybe because I have no idea how the Russians use it. It would be fine
> with me to change it, but we need to do that exclusively in
> cooperation with the author.

And I think we should do that.  Again, it's been our intention for
two years anyway.

> I would worry much more about the crazyness of last-minute addition of