[tex-hyphen] ptex-specific patterns

Mon May 31 10:46:23 CEST 2010

Dear all,

After reading the answer on
    http://oku.edu.mie-u.ac.jp/tex/mod/forum/discuss.php?d=460
(see the bottom of this mail) I came up with the following idea, which
I think should work properly.

The problem with pTeX and our package is not the output step: it
outputs characters in the proper format. The problem is only the
reading step. In most cases it works fine, but if the byte sequence of
some UTF-8 character happens to coincide with some Japanese character
(in whatever encoding), then it's interpreted as a single token
instead of two. I think that the following procedure should work:

- read the same utf2ec/utf2t2a etc. as for 8-bit engines
- for every UTF-8 character that might appear in some 8-bit encoding,
  test whether the UTF-8 code is interpreted as a single token or as
  two tokens
  - if it's two tokens, don't do anything (handled by utf2ec already)
  - if it's a single token, change its catcode to active and make it
    output the corresponding 8-bit character as ^^xx
- read the patterns
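
To make the second step concrete, here is a minimal sketch in Python
(not TeX, and not part of hyph-utf8). It only models pTeX's utf8 mode,
and it assumes that Python's euc_jp codec is a good enough stand-in
for pTeX's JIS X 0208 table. The function names and the slot numbers
in the usage example are illustrative assumptions.

```python
from __future__ import annotations


def in_jisx0208(ch: str) -> bool:
    """Approximate test for membership in JIS X 0208.

    Assumption: Python's euc_jp codec stands in for pTeX's own table.
    A JIS X 0208 character encodes to exactly two bytes in EUC-JP,
    both in the range 0xA1..0xFE.
    """
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return False
    return len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1


def caret_form(slot: int) -> str:
    """Return the ^^xx notation (lowercase hex) for an 8-bit slot."""
    return f"^^{slot:02x}"


def plan(utf8_to_slot: dict[str, int]) -> dict[str, str]:
    """Pick the characters that would need an active definition.

    Characters outside JIS X 0208 are read as separate bytes in utf8
    mode, so they are left to the existing utf2ec-style conversion.
    """
    return {
        ch: caret_form(slot)
        for ch, slot in utf8_to_slot.items()
        if in_jisx0208(ch)
    }
```

For instance, plan({"а": 0xE0, "ä": 0xE4}) keeps only the Cyrillic
letter, because JIS X 0208 contains basic Cyrillic but not ‘ä’. The
actual test would of course have to run inside pTeX itself.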

[If anyone (like Taco for example) is willing to help with the second
step, I would welcome any idea or some working code. Arthur?]

The only remaining question: I'm not sure how to write reliable tests
(we could try to typeset the patterns and see whether they are printed
out properly).

Arthur - is there any counterexample that you can think of? If
Japanese characters were composed of three bytes, that could be a
problem, but I think they are not. Can ^^xx ever represent an active
character and thus lead to infinite recursion? (I just keep my
fingers crossed that this is never going to happen even if possible.)

In addition:
- I would use the recognition code for pTeX suggested by Akira
- I don't really care whether pTeX wants to use a static language.ptx
or language.dat: that's up to the Japanese users to decide, but the
idea is the ability to "throw away" pattern copies and the ability to
use loadhyph-xx.tex; if they want, they may keep language.ptx with all
the languages included.
- For German I suggest using the new patterns.
- For Ukrainian and Russian I suggest using the patterns from hyph-utf8
and removing the complex code (ukrhyph, ruhyph); whenever the first
Russian user pops up and requests 7 different encodings and 7
different versions of patterns, I'll change back to ruhyph. There's a
chance that things will change in the meantime anyway.
- If we finish by the TL 2010 deadline, that's fine; if not, that's
fine as well (no pressure; we'll try to do it, but not force it at any
price).

Mojca

----------------------------------------------------------------------
For archiving purposes, here's the answer from the forum, written by Z.R.

Hello Mojca.

I think it is difficult for pTeX to share a single set of hyphenation
files with Unicode-aware engines (XeTeX and LuaTeX). The loadhyph-*
files could be shared, by elaborating engine detection in some way as
Akira Kakuto suggests. The other files, however, must be filtered so
that all occurrences of bytes with value 0x80 or greater are
translated to the escaped forms ^^??, and doing so makes the files
unusable in Unicode-aware engines that directly read UTF-8 encoded
characters.
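
The filtering described here could be sketched as follows in Python.
This is only an illustration of the byte-level transformation; the
function name is hypothetical and no such script is implied to exist
in hyph-utf8 or pTeX.

```python
def escape_high_bytes(data: bytes) -> bytes:
    """Replace every byte >= 0x80 by its TeX ^^xx form (lowercase hex).

    Bytes below 0x80 (plain 7-bit ASCII) are copied unchanged.
    """
    out = bytearray()
    for b in data:
        if b >= 0x80:
            out += b"^^%02x" % b
        else:
            out.append(b)
    return bytes(out)
```

Applied to the UTF-8 bytes of ‘ä’ (<C3 A4>) this yields ^^c3^^a4,
which an 8-bit engine can still read but which a Unicode-aware engine
no longer sees as a single UTF-8 encoded character.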

Here I explain how pTeX processes input from source files (not so) briefly.

- The pTeX engine treats ‘Japanese’ and ‘European’ characters
differently; e.g. pTeX has multiple ‘current’ fonts, one each for
European, horizontal Japanese and vertical Japanese text.
- Apart from source input, the processing of European characters is
done in the same way as in 8-bit TeX; e.g. in pTeX you can use 8-bit
font encodings like T1, and once 8-bit hyphenation patterns are safely
loaded (in some way) they will work perfectly.
- As for the processing of Japanese characters, pTeX can handle
characters in JIS X 0208 (refer to
http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
for the repertoire of JIS X 0208). Note that in pTeX a Japanese
character is always treated as one character (and thus one token).
- Input from source files is always read as 7-bit ASCII plus JIS X
0208 encoded in one of ISO-2022-JP (jis), Shift_JIS (sjis),
EUC-JP (euc) or UTF-8 (utf8). The input encoding is specified as a
command-line option, so it cannot be specified (or changed) in TeX
documents.
- When pTeX sees a byte with its high bit on, it tries to decode the
sequence starting with that byte in the specified encoding; if it
succeeds, pTeX concludes there is a Japanese character, and if it
fails, pTeX falls back to treating the byte as an 8-bit European
character, as the result of lexical analysis.
   - In utf8 mode, only byte sequences that will give a character in
JIS X 0208 are deemed to be valid; for everything else decoding will
fail.
- And finally, pTeX always regards an escaped form ^^?? as a single
8-bit European character.

For example, when pTeX scans a byte sequence <CE A4 21>:

- In utf8 mode, pTeX reads it as a Japanese character ‘Τ’<CE A4> and
a European character ‘!’<21>, since JIS X 0208 contains basic Greek
characters. Putting Japanese characters in hyphenation patterns would
cause an error because it does not make sense.
- In euc mode, pTeX reads it as a Japanese character ‘里’<CE A4> and a
European character ‘!’<21>.
- In sjis mode, pTeX reads it as three European characters ^^ce, ^^a4
and ‘!’<21>, since neither CE nor A4 is a lead byte of sjis
double-byte characters.

And when pTeX scans a byte sequence <C3 A4>:

- In utf8 mode, pTeX reads it as two European characters ^^c3 and
^^a4, since <C3 A4> is the UTF-8 encoding of ‘ä’, which is not a
JIS X 0208 character.
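
The two examples above can be replayed with a small Python model of
the lexical analysis described in this mail. It is a sketch under
assumptions: Python's utf-8, euc_jp and shift_jis codecs stand in for
pTeX's decoders, the jis (ISO-2022-JP) mode is left out, and all
function names are made up for illustration.

```python
from __future__ import annotations

# Lead bytes of Shift_JIS double-byte characters (JIS X 0208 range).
SJIS_LEAD = set(range(0x81, 0xA0)) | set(range(0xE0, 0xF0))


def in_jisx0208(ch: str) -> bool:
    """Approximate JIS X 0208 membership test via the euc_jp codec."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return False
    return len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1


def sequence_length(lead: int, mode: str) -> int:
    """Length of the multi-byte sequence a lead byte announces, or 0."""
    if mode == "utf8":
        if 0xC2 <= lead <= 0xDF:
            return 2
        if 0xE0 <= lead <= 0xEF:
            return 3
        if 0xF0 <= lead <= 0xF4:
            return 4
        return 0
    if mode == "euc":
        return 2 if 0xA1 <= lead <= 0xFE else 0
    if mode == "sjis":
        return 2 if lead in SJIS_LEAD else 0
    raise ValueError(f"unsupported mode: {mode}")


def tokenize(data: bytes, mode: str) -> list[str]:
    """Split bytes into Japanese characters, ASCII, and ^^xx bytes."""
    codec = {"utf8": "utf-8", "euc": "euc_jp", "sjis": "shift_jis"}[mode]
    tokens = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # plain 7-bit character
            tokens.append(chr(b))
            i += 1
            continue
        n = sequence_length(b, mode)
        chunk = data[i:i + n]
        ch = None
        if n and len(chunk) == n:
            try:
                ch = chunk.decode(codec)
            except UnicodeDecodeError:
                ch = None
        if ch is not None and len(ch) == 1 and in_jisx0208(ch):
            tokens.append(ch)  # one Japanese character, one token
            i += n
        else:
            tokens.append(f"^^{b:02x}")  # fallback: 8-bit European
            i += 1
    return tokens
```

With this model, tokenize(b"\xce\xa4\x21", "sjis") gives the three
tokens ^^ce, ^^a4 and ‘!’, while tokenize(b"\xc3\xa4", "utf8") gives
^^c3 and ^^a4, matching the behaviour described above.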