[tex-hyphen] ptex-specific patterns

Mojca Miklavec mojca.miklavec.lists at gmail.com
Mon May 31 13:33:35 CEST 2010

On Mon, May 31, 2010 at 12:35, Arthur Reutenauer wrote:
>        Hello Mojca,
>
>  Obviously if you want to use the same pattern files in pTeX as in
> hyph-utf8, you need to be able to convert them from UTF-8 to the
> encoding(s) pTeX uses.  But:

Well, yes, but pTeX seems to use just normal EC (QX, T2A or whatever)
encoding for European languages. So the challenge is not converting
"into something that pTeX uses", but "from something that pTeX
understands when reading UTF-8 patterns".

>> - reading the same utf2ec/utf2t2a etc. as for 8bit engines
>> - for every UTF-8 character that might appear in some 8bit encoding,
>> test whether the UTF-8 code is interpreted as a single token or as two
>> tokens
>>   - if it's two tokens, don't do anything (handled by utf2ec already)
>>   - if it's a single token, change its catcode to active and make it
>>     output the corresponding 8bit character as ^^xx
>> - read the patterns
>
>  I don't get that at all.  Apparently you want to start converting the
> character codes to the appropriate 8-bit font encoding, depending on the
> language.

True.

> Then what?

Read patterns and write them into format.

> Why do you still consider UTF-8 byte sequences?

Because we need to read the patterns somehow.

> (And what's a "UTF-8 character?" -- I'm guessing "UTF-8 code" means a
> sequence of byte(s) encoding a Unicode character in UTF-8,

Right.

>> [If anyone (like Taco for example) is willing to help with the second
>> step, I would welcome any idea or some working code. Arthur?]
>
>  Why not, if I understand what you want.

I need a definition of a command
\def\mycommand#1#2{...}
that I could call as
% A3 is code of ccaron in EC
% č is two tokens: ^^c4^^8d
\mycommand{č}{^^a3}
% no idea what Tau is in the Greek encoding (don't care)
% but it's only a single token
\mycommand{Τ}{^^ff}

The pseudocode:
- test if #1 is one or two tokens (use the same trick as Taco suggested)
- if it's interpreted as two tokens, ignore
- if it's interpreted as one token (like Tau),
make that letter \active and define it to generate #2

>>                                    Can ^^xx ever represent an active
>> character and thus lead into an infinite recursion? (I just keep my
>> fingers crossed that this is never going to happen even if possible.)
>
>  It's definitely possible that a byte value occurs both as a
> starting-byte of some two-byte sequence in the input text, and in the
> font encoding used for output (thereby needing to be both an active
> character and a letter).  It's already a small miracle that this doesn't
> happen for the languages we already support; I mentioned that a while
> back.
>
>  In this case, you don't have infinite recursion, but the system blows
> up in your face in some other way I can't remember; I came into it two
> years ago when we were converting the patterns, at some early stage.
> There may be a workaround, though.

Let's just assume that this won't happen. If it does, we'll deal with it later.

>> - For Ukrainian and Russian I suggest using patterns from hyph-utf8
>> and removing the complex code (ukrhyph, ruhyph); whenever the first
>> Russian user pops up and requests 7 different encodings and 7
>> different versions of patterns, I'll change back to ruhyph. There's a
>> chance that things will change in the meantime anyway.
>
>  I don't see why we shouldn't take the opportunity to adapt ruhyph's
> mechanism while we're at it; it's been pending for two years...

Maybe because I have no idea how the Russians use it. It would be fine
with me to change it, but we need to do that exclusively in
cooperation with the author. (It could be implemented as an extension
to Babel etc. but we need the author's opinion.) This IS one of the things
that I don't fully understand. (Well, I understand the mechanism, but
I have no idea what exactly users do with it. Norbert would say: do
not change that file since it will be overwritten with the next update
:)

>> - If we finish by TL 2010 deadline, that's fine, if not, that's fine
>> as well (no pressure; we'll try to do it, but not force it at any
>> price).
>
>  It sounds crazy to imagine that we're going to have anything robust
> ready for TeX Live 2010; I'd much rather take our time.  Incorporating
> pTeX into TeX Live seems to pose other -- in my opinion more urgent --
> problems; using a copy of the old patterns in the meantime doesn't seem
> to do any harm.

I only said: *IF* we have it ready, working and tested, there's
no reason not to include it. If not, the current state is working
already, so we should not worry. I'm not aware of other issues of pTeX
(I have not been involved in its inclusion), but at least we can help
where we know how to help.

I doubt that many people expect Slovenian hyphenation
patterns to work flawlessly in pTeX.

I would worry much more about the craziness of last-minute addition of