[tex-hyphen] ptex-specific patterns
Taco Hoekwater
taco at elvenkind.com
Mon May 31 17:31:24 CEST 2010
Arthur Reutenauer wrote:
> Mojca,
>
> I'm still not getting the algorithm you suggest. In particular:
>
>> I need a definition for command
>> \def\mycommand#1#2{...}
>> that I could call as
>> % A3 is code of ccaron in EC
>> % č is two tokens: ^^c4^^8d
>> \mycommand{č}{^^a3}
>> % no idea what is Tau in greek encoding (don't care)
>> % but it's only a single token
>> \mycommand{Τ}{^^ff}
>>
>> The pseudocode:
>> - test if #1 is one or two tokens (use the same trick as Taco suggested)
>> - if it's interpreted as two tokens, ignore
>> - if it's interpreted as one token (like Tau),
>> make that letter \active and define it to generate #2
>
> That won't be enough. Because, if I understand Z. R.'s explanations
> correctly, you could have the following situation:
>
> (Assuming pTeX is in EUC-JP mode)
>
> 1. The input is “ši” (U+0161, U+0069). It's reencoded as 0xB2, 0x69
> in the EC font encoding, which is not a valid EUC-JP code, hence the
> first byte is interpreted as a character, and so is the second byte.
>
> 2. The input is “šč” (U+0161, U+010D). It's reencoded as 0xB2, 0xA3
> in EC, which *is* a valid EUC-JP code (corresponding to Unicode
> character U+6A2A, as it happens), hence that two-character sequence is
> interpreted as a single Japanese character, and the original input is
> simply lost.
>
> I don't see how we could solve the situation by considering each
> character individually (like we currently do in UTF-8), given pTeX's
> behaviour.
I also read that explanation (but not very thoroughly). My impression
was: pTeX understands any kind of input as long as it is a valid
Japanese character, and produces random 8-bit stuff otherwise (which
could coincide with an 8-bit font encoding for Western Europe, but only
if you are both careful and lucky).
I cannot imagine how it would be possible to work around those input
restrictions dynamically.
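
The two cases Arthur lists can be checked outside TeX. The following is
only a rough sketch in Python, under the assumption that Python's
"euc_jp" codec accepts and rejects these byte pairs the same way pTeX's
EUC-JP input layer does (pTeX itself is not involved here). The names
EC_SLOTS, to_ec_bytes and read_as_euc_jp are made up for the
illustration; the slot values 0xB2 and 0xA3 are the ones quoted above.

```python
# EC (T1) slots quoted in the thread: scaron = 0xB2, ccaron = 0xA3.
# "i" keeps its ASCII slot 0x69. Hypothetical table, three letters only.
EC_SLOTS = {"š": 0xB2, "č": 0xA3, "i": 0x69}


def to_ec_bytes(text):
    """Re-encode text into EC slots (illustrative helper, not a real API)."""
    return bytes(EC_SLOTS[ch] for ch in text)


def read_as_euc_jp(data):
    """Return the decoded string if data is valid EUC-JP, otherwise None.

    Assumption: this stands in for pTeX's decision whether a byte pair
    forms one Japanese character.
    """
    try:
        return data.decode("euc_jp")
    except UnicodeDecodeError:
        return None


for word in ("ši", "šč"):
    raw = to_ec_bytes(word)
    decoded = read_as_euc_jp(raw)
    if decoded is None:
        print(word, raw.hex(), "not valid EUC-JP, bytes stay separate")
    else:
        print(word, raw.hex(), "swallowed as", "U+%04X" % ord(decoded))
```

With these inputs the pair B2 69 is rejected as EUC-JP, while B2 A3
decodes to the single character U+6A2A, so whether a byte survives
depends on its neighbour and not on the character alone.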
Best wishes,
Taco