[tex-hyphen] ptex-specific patterns

Mon May 31 17:31:24 CEST 2010

Arthur Reutenauer wrote:
> 	Mojca,
> 
>   I'm still not getting the algorithm you suggest.  In particular:
> 
>> I need a definition for command
>>    \def\mycommand#1#2{...}
>> that I could call as
>>     % A3 is code of ccaron in EC
>>     % č is two tokens: ^^c4^^8d
>>     \mycommand{č}{^^a3}
>>     % no idea what Tau is in the Greek encoding (don't care)
>>     % but it's only a single token
>>     \mycommand{Τ}{^^ff}
>>
>> The pseudocode:
>>   - test if #1 is one or two tokens (use the same trick as Taco suggested)
>>   - if it's interpreted as two tokens, ignore
>>   - if it's interpreted as one token (like Tau),
>>     make that letter \active and define it to generate #2
> 
>   That won't be enough.  Because, if I understand Z. R.'s explanations
> correctly, you could have the following situation:
> 
>   (Assuming pTeX is in EUC-JP mode)
> 
>   1. The input is “ši” (U+0161, U+0069).  It's reencoded as 0xB2, 0x69
> in the EC font encoding, which is not a valid EUC-JP code, hence the
> first byte is interpreted as a character, and so is the second byte.
> 
>   2. The input is “šč” (U+0161, U+010D).  It's reencoded as 0xB2, 0xA3
> in EC, which *is* a valid EUC-JP code (corresponding to Unicode
> character U+6A2A, as it happens), hence that two-character sequence is
> interpreted as a single Japanese character, and the original input is
> simply lost.
> 
>   I don't see how we could solve the situation by considering each
> character individually (like we currently do in UTF-8), given pTeX's
> behaviour.

I also read that explanation (but not very thoroughly). My impression
was: pTeX understands any kind of input as long as it is a valid
Japanese character, and produces random 8-bit stuff otherwise (which
could coincide with an 8-bit font encoding for Western Europe, but only
if you are both careful and lucky).

I cannot imagine how it would be possible to work around those input
restrictions dynamically.
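
To make the two cases Arthur lists concrete, here is a rough Python
sketch of the byte-level behaviour as I understand it. It is only a
model, not pTeX's actual code: the names is_eucjp_pair and tokenize are
made up for illustration, and the rule "both bytes in 0xA1..0xFE means
one two-byte character" is a simplifying assumption about what counts
as a valid EUC-JP code.

```python
def is_eucjp_pair(lead: int, trail: int) -> bool:
    """Assumed rule: a two-byte EUC-JP code has both bytes in 0xA1..0xFE."""
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


def tokenize(data: bytes) -> list[bytes]:
    """Split EC-encoded bytes the way an EUC-JP-aware reader might.

    A byte pair that looks like a two-byte EUC-JP code is swallowed as
    one token; anything else is passed through one byte at a time.
    """
    tokens = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and is_eucjp_pair(data[i], data[i + 1]):
            tokens.append(data[i:i + 2])  # read as a single Japanese character
            i += 2
        else:
            tokens.append(data[i:i + 1])  # read as an ordinary 8-bit character
            i += 1
    return tokens


# Case 1: "ši" in EC is 0xB2 0x69; 0x69 cannot be a trail byte.
print(tokenize(b"\xb2\x69"))
# Case 2: "šč" in EC is 0xB2 0xA3; the pair is taken as one character.
print(tokenize(b"\xb2\xa3"))
```

In the first case both letters survive as separate tokens; in the
second the pair is merged, which is why looking at each character
individually cannot help.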

Best wishes,
Taco