[tex-hyphen] More with the patterns

Tue Jun 10 19:39:08 CEST 2008

On Tue, Jun 10, 2008 at 3:56 PM, Javier Múgica wrote:
>
>>
>> Yes, but the problem is that every file uses a different convention.
>
>
> The problem for XeTeX you mean, don't you?

Yes and no. The problem is not in XeTeX per se, but in the fact that
all the files assume T1/T2A font encoding. And most files only make
sure that the patterns work in pdfTeX/whatever-8bit-tex.

> I have never used it and I may be
> missing the problem.

In pdfTeX, á will map to 0xE1, but that slot is an á in EC encoding
only. Most patterns make sure that á resolves to the proper code in EC,
which is fine for pdfTeX, but that ignores the fact that in some other
encoding the letter might be at some other position. In a UTF-8 file, a
lone ^^e1 byte is not a valid character and generates an error.
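
To make the 0xE1 point concrete, here is a minimal Python sketch (not
part of the pattern files). Python has no T1/EC codec, so Latin-1 is used
as a stand-in, on the assumption that it agrees with T1 for á; the helper
name is_valid_utf8 is made up for this illustration.

```python
# The single byte that 8-bit pattern files use for a-acute.
raw = b"\xe1"

# Assumption: Latin-1 stands in for T1/EC here, since both put
# a-acute in slot 0xE1. Python ships no T1 codec.
as_8bit = raw.decode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same letter (U+00E1) as a UTF-8 engine expects to read it: two bytes.
utf8_bytes = "\u00e1".encode("utf-8")

print(as_8bit == "\u00e1")      # the 8-bit reading gives a-acute
print(is_valid_utf8(raw))       # the lone byte is rejected as UTF-8
print(utf8_bytes)               # the two-byte UTF-8 form
```

An 8-bit engine sees one slot, while a UTF-8 engine expects two bytes
and rejects the lone one.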

> If I understand correctly, even if every file used its own convention, that
> isn't a problem for standard TeX,

It's not a problem because:
- you take it for granted that people will use T1 font encoding, and
the patterns really only work with a single font encoding
- people use inputenc that does its own transformations to make sure
that your characters resolve to the right character, but that's the
cleaner part
- everyone has adapted patterns to work with T1 encoding and nothing
else; there is no way to combine Slovenian and Polish, for example. If
I switch to Polish in the middle of Slovenian text (T1-encoded) and
write gbreve, TeX will interpret it as iogonek and hyphenate the word
as if it contained iogonek.

If someone adapted the patterns to work with XeTeX, they could argue
the same way: they work for XeTeX, I do not care about pdfTeX.

The Hungarian patterns are a good example. No standard editor can
interpret the contents of the patterns properly (it can read the file,
but not make sense of it), and no tool is able to convert the encoding.
That's a typical example of patterns that were really only ever used
with T1 encoding.

> if you switch a language you cannot use
> but it is for XeTeX and utf-8 editors
> because they need to prepare themselves to read a file that is not in utf-8,
> and since every patterns file has a different convention, or even a
> different target encoding, it is difficult, tricky and ugly.

Yes, ugly for two reasons. The first reason is the one that you
mention - XeTeX has to know how to interpret the convention. But the
second reason is worse. It has to disable everything else that the file
is doing, load only the patterns, and ignore the rest without any side
effects.

> When I wrote the patterns I had in mind an 8-bit-character reading engine
> (TeX), so I didn't see any problem. I enclosed my definitions of active
> characters in a group (well I would have done had I needed, but the latin1
> encoding I used for writing the file coincides with the T1 encoding in all
> the characters needed for Galician), which is possible since \patterns{ is
> always a global assignment. And no matter the criterion used, it will always
> not be a problem as long as you do everything locally.
> But for engines that expect utf-8 files that's different...

Sure. That was the only reasonable way to do it in the past. But since
that has changed recently, we would like to convince authors to start
using and possibly updating new files instead of the old ones. Purely
theoretically, we could still watch CTAN for changes and update when
needed, but it makes much more sense if authors adopt the new
convention.

There is one drawback - one now needs a few files instead of one:
- patterns themselves
- pattern loader for a specific language
- generic conversion file from UTF-8 to T1 encoding (that could be
part of pattern loader, but lots of patterns need the same conversion)
- some generic macros (could also be part of pattern loader - so you
really need only two files, but the other two are there to keep some
common macros at a common place)
but one gains a lot with that approach.
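
As a rough illustration of what the generic UTF-8 to T1 conversion does,
here is a minimal Python sketch. It is not the actual conversion file:
the table UTF8_TO_T1 and the function to_t1 are made up for this example,
and the table covers only three Slovenian letters, with the slot numbers
taken from the standard T1 layout.

```python
# Hypothetical excerpt of a UTF-8 -> T1 table (three Slovenian letters
# only); a real conversion file would cover the whole encoding.
UTF8_TO_T1 = {
    "\u010d": 0xA3,  # c with caron
    "\u0161": 0xB2,  # s with caron
    "\u017e": 0xBA,  # z with caron
}


def to_t1(text: str) -> bytes:
    """Map a pattern written in UTF-8 to 8-bit T1 slot numbers."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 0x80:
            # Assumption: plain ASCII letters keep their codes in T1.
            out.append(ord(ch))
        elif ch in UTF8_TO_T1:
            out.append(UTF8_TO_T1[ch])
        else:
            raise ValueError(f"no T1 slot known for U+{ord(ch):04X}")
    return bytes(out)


# One table serves every pattern file that needs these letters.
print(to_t1("\u010da\u0161a"))
```

Keeping the table in one shared place is the point of the extra file:
the patterns stay in UTF-8, and each 8-bit loader reuses the same mapping.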

> p.d.: From the computer I am now I get an error whenever I try to open any
> of the svn:// links, so I couldn't see any of the examples.

I have asked Karl to provide a web interface. Currently the only way to
see the files is to run
   svn co svn://tug.org/texhyphen/trunk
in a console or to use TortoiseSVN on Windows. You cannot see the files
in a browser.

If you manage to install subversion on your computer and run the
command above, you can check trunk/tests and run the testing format
generator. If you have problems, I can send you the files.

Mojca