[tex-hyphen] UTF-8 Hyphen

Mon Jun 9 17:56:16 CEST 2008

On Mon, Jun 9, 2008 at 3:55 PM, Javier Múgica wrote:
> Hello:
>
> I got the message from Mojca today. I was planning to write utf-8 patterns
> for LuaTeX, but I was waiting for the July release of LuaTeX to start with
> LuaTeX and then see if I could write a single file, that will be read
> properly both by current TeX and LuaTeX. Something like
>
> \ifx\undefined\SomethinginLuanotinTeX
> \else
>   Set up a LuaTeX callback to read latin1 characters and transform them to
> UTF-8
> \fi
>
> And at the end of the file
>
> \ifx\undefined\SomethinginLuanotinTeX
> \else
>   Restore previous behaviour of LuaTeX
> \fi
>
> But for the existing engines I didn't want a utf-8 file. Indeed, since
> currently hyphenation patterns are for a specific encoding, it makes sense
> to have them written in that encoding. initex and friends (pdftex --ini)
> will always read them right.

Yes, but the problem is that every file uses a different convention.
If you want to know what that means, here are some examples:

some languages using EC encoding:

^^a3 -> č	ccaron
^^a2 -> ć	cacute
etc.
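
To make the ^^xy convention concrete, here is a rough sketch in Python (not
TeX, and not part of any pattern file) of how such a file could be decoded.
The table holds only the two slots listed above, and the function name and
the sample string are made up for illustration:

```python
import re

# Partial table: only the two EC slots listed above. A real table would
# cover every non-ASCII position of the encoding.
EC_TO_UNICODE = {0xA3: "č", 0xA2: "ć"}


def decode_caret_notation(pattern: str) -> str:
    """Replace each ^^xy (two lowercase hex digits) by its character."""
    def repl(match: re.Match) -> str:
        code = int(match.group(1), 16)
        # Slots missing from the partial table are left untouched.
        return EC_TO_UNICODE.get(code, match.group(0))

    return re.sub(r"\^\^([0-9a-f]{2})", repl, pattern)


# Made-up sample string, not a real pattern.
print(decode_caret_notation("a^^a3b^^a2"))
```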

polish:

/a -> ą	aogonek
/c -> ć	cacute
/e -> ę	eogonek
/l -> ł	lslash
/n -> ń	nacute
/o -> ó	oacute
/s -> ś	sacute
/x -> ź	zacute
/z -> ż	zdotaccent
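
Conventions like this one are just two-character substitutions, so a
conversion to UTF-8 can be sketched in a few lines of Python. The table is
copied from the Polish list above; the function name and the sample input
are only illustrative:

```python
# Table copied from the Polish convention listed above.
POLISH = {
    "/a": "ą", "/c": "ć", "/e": "ę", "/l": "ł", "/n": "ń",
    "/o": "ó", "/s": "ś", "/x": "ź", "/z": "ż",
}


def apply_convention(text: str, table: dict) -> str:
    """Replace every ad hoc two-character sequence by its real letter."""
    for source, letter in table.items():
        text = text.replace(source, letter)
    return text


# Made-up input, not a real pattern.
print(apply_convention("/l/o/x", POLISH))
```

The same function would work for the Slovenian or German tables, given the
corresponding dictionary.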

slovenian:

"c -> č	ccaron
"s -> š	scaron
"z -> ž	zcaron

romanian:

"a -> ă
"A -> â
"i -> î
"s -> ș
"t -> ț

galician, spanish, ...

use latin1 encoding

hungarian:

use ec encoding (not ^^xy notation, but literal ec bytes, which no single
editor I know is able to interpret)

czech, slovak:

\v e -> ě	ecaron
\v c -> č	ccaron
\v d -> ď	dcaron
\v l -> ľ	lcaron
\v n -> ň	ncaron
\v r -> ř	rcaron
\v s -> š	scaron
\v t -> ť	tcaron
\v z -> ž	zcaron
\r u -> ů	uring
\'a -> á	aacute
\'e -> é	eacute
\'i -> í	iacute
\'o -> ó	oacute
\'u -> ú	uacute
\'r -> ŕ	racute
\'y -> ý	yacute
\"a -> ä	adieresis
\^o -> ô	ocircumflex
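
For accent macros like these, one possible way to get the UTF-8 letters
(a sketch in Python, assuming only the five accents used in the list above)
is to append the matching Unicode combining mark to the base letter and let
NFC normalization compose the pair. The names below are made up:

```python
import re
import unicodedata

# Combining marks for the accent macros in the list above.
COMBINING = {
    "v": "\u030C",  # caron
    "r": "\u030A",  # ring above
    "'": "\u0301",  # acute
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
}

# \v and \r are followed by a space, the symbol accents are not.
ACCENT = re.compile(r'''\\(?:([vr]) |(['"^]))([a-z])''')


def accents_to_unicode(text: str) -> str:
    """Turn sequences such as '\\v e' or "\\'a" into precomposed letters."""
    def repl(match: re.Match) -> str:
        accent = match.group(1) or match.group(2)
        decomposed = match.group(3) + COMBINING[accent]
        return unicodedata.normalize("NFC", decomposed)

    return ACCENT.sub(repl, text)


print(accents_to_unicode(r"\v e \r u \'a"))
```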

german:

"a -> ä
"o -> ö
"u -> ü
/3 -> ß

danish:

X -> æ
Y -> ø
Z -> å

esperanto (it's not even the right control sequence, but an approximation):

^c -> ĉ
^g -> ĝ
^h -> ĥ
^j -> ĵ
^s -> ŝ
^u -> ŭ

turkish:

@      -- arabic hamza
#      -- ayn
c:  ç  E7 U+00E7   LATIN SMALL LETTER C WITH CEDILLA
d!  ḍ  -- U+1E0D   LATIN SMALL LETTER D WITH DOT BELOW
d=  ḏ  -- U+1E0F   LATIN SMALL LETTER D WITH LINE BELOW
g=  ğ  A7 U+011F   LATIN SMALL LETTER G WITH BREVE
g:  ġ  -- U+0121   LATIN SMALL LETTER G WITH DOT ABOVE
h
h!  ḥ  -- U+1E25   LATIN SMALL LETTER H WITH DOT BELOW
h=  ẖ  -- U+1E96   LATIN SMALL LETTER H WITH LINE BELOW
k!  ḳ  -- U+1E33   LATIN SMALL LETTER K WITH DOT BELOW
n=  ñ  F1 U+00F1   LATIN SMALL LETTER N WITH TILDE
s!  ṣ  -- U+1E63   LATIN SMALL LETTER S WITH DOT BELOW
s=     --    s with line below - not even in unicode
s:  ş  B3 U+015F   LATIN SMALL LETTER S WITH CEDILLA
t!  ṭ  -- U+1E6D   LATIN SMALL LETTER T WITH DOT BELOW
t=  ṯ  -- U+1E6F   LATIN SMALL LETTER T WITH LINE BELOW
z!  ẓ  -- U+1E93   LATIN SMALL LETTER Z WITH DOT BELOW
z=  ẕ  -- U+1E95   LATIN SMALL LETTER Z WITH LINE BELOW
z:  ż  BB U+017C   LATIN SMALL LETTER Z WITH DOT ABOVE

> I never use XeTeX, I don't need it at all, nor do I read the xu- files (I do
> not even have them on my computer, no need for all that stuff), and I'm likely
> not the only one, so for anything prior to LuaTeX, pattern files without
> multibyte characters need to be present.

The nice trick about TeX is that pdfTeX and the other 8-bit engines would
still see the patterns as if they were 8-bit. It is just like typing text:
one doesn't type in ec encoding, but in utf-8 or latin1 or whatever, and
inputenc adds another layer on top of the font encoding. A macro would be
called before the patterns are loaded, so that eight-bit TeX engines read
the patterns properly.
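
What such a macro has to achieve can be illustrated outside TeX. The Python
sketch below is not the actual macro, only a model of the idea: every
multibyte UTF-8 sequence collapses to the single byte that the 8-bit
encoding uses for that letter. The table is partial (just ec slots that
appear in this message), and the names are made up:

```python
# Partial table: only ec slots mentioned in this message.
UNICODE_TO_EC = {"ć": 0xA2, "č": 0xA3, "ğ": 0xA7, "ş": 0xB3, "ż": 0xBB}


def utf8_to_ec(data: bytes) -> bytes:
    """Collapse UTF-8 input to one byte per letter in the ec encoding."""
    out = bytearray()
    for char in data.decode("utf-8"):
        if ord(char) < 0x80:
            out.append(ord(char))  # ASCII passes through unchanged
        else:
            # Raises KeyError for letters missing from this partial table.
            out.append(UNICODE_TO_EC[char])
    return bytes(out)


# "č" is two bytes in UTF-8 but ends up as the single byte 0xA3.
print(utf8_to_ec("a1č".encode("utf-8")))
```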

The examples are in svn://tug.org/texhyphen, but they still need brief
instructions on how to use them (I need to make a torture test
now.)

> That said, I don't mind if you make a copy of my patterns and transform them
> to UTF-8. Galician is a good language to make experiments with, very few
> people will be affected if it crashes, so you may sacrifice it along with
> Slovenian :-)

Thanks :)

>> Since your patterns are auto-generated, it might mean that tools that
>> auto-generate them might need to be adjusted a bit
>
>
> Fortunately not. Indeed, I may use old initex to generate the patterns in
> any single-byte encoding I wish, just changing the line
>
> \def\Ti{\encodingreplacements{á é í ó ú ñ ü ï}{^^e1 ^^e9 ^^ed ^^f3 ^^fa ^^f1
> ^^fc ^^ef}}
>
> in the generating file. It does not work for UTF-8, but not because of the
> tool used but because of processing them with initex.
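
The second argument of that line can be derived mechanically for any
single-byte encoding. The Python sketch below (the function name is made
up, and it is not part of the generating file) does so for latin1 and also
shows where UTF-8 stops fitting: the letters no longer encode to one byte
each.

```python
def caret_notation(letters: str, encoding: str = "latin-1") -> str:
    """Return the ^^xy form of each space-separated letter."""
    codes = []
    for letter in letters.split():
        raw = letter.encode(encoding)
        if len(raw) != 1:
            # This is the multibyte case: one letter, several bytes.
            raise ValueError(f"{letter!r} is not a single byte in {encoding}")
        codes.append("^^%02x" % raw[0])
    return " ".join(codes)


print(caret_notation("á é í ó ú ñ ü ï"))
# ^^e1 ^^e9 ^^ed ^^f3 ^^fa ^^f1 ^^fc ^^ef
```

Calling caret_notation("á", "utf-8") raises ValueError, since á takes two
bytes there.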

I'll try to come back to the issue later then ...

> I will start playing with LuaTeX at the end of July or even in September,
> not before.

OK.

Thanks,
   Mojca