[tex-hyphen] UTF-8 Hyphen
Mojca Miklavec
mojca.miklavec.lists at gmail.com
Mon Jun 9 17:56:16 CEST 2008
On Mon, Jun 9, 2008 at 3:55 PM, Javier Múgica wrote:
> Hello:
>
> I got the message from Mojca today. I was planning to write utf-8 patterns
> for LuaTeX, but I was waiting for the July release of LuaTeX to start with
> LuaTeX and then see if I could write a single file, that will be read
> properly both by current TeX and LuaTeX. Something like
>
> \ifx\undefined\SomethinginLuanotinTeX
> \else
> Set up a LuaTeX callback to read latin1 characters and transform them to
> UTF-8
> \fi
>
> And at the end of the file
>
> \ifx\undefined\SomethinginLuanotinTeX
> \else
> Restore previous behaviour of LuaTeX
> \fi
>
> But for the existing engines I didn't want a utf-8 file. Indeed, since
> currently hyphenation patterns are for a specific encoding, it makes sense
> to have them written in that encoding. initex and friends (pdftex --ini)
> will always read them right.
Yes, but the problem is that every file uses a different convention.
If you want to know what that means, here are some examples:
some languages using EC encoding:
^^a3 -> č ccaron
^^a2 -> ć cacute
etc.
polish:
/a -> ą aogonek
/c -> ć cacute
/e -> ę eogonek
/l -> ł lslash
/n -> ń nacute
/o -> ó oacute
/s -> ś sacute
/x -> ź zacute
/z -> ż zdotaccent
slovenian:
"c -> č ccaron
"s -> š scaron
"z -> ž zcaron
romanian:
"a -> ă
"A -> â
"i -> î
"s -> ș
"t -> ț
galician, spanish, ...
use latin1 encoding
hungarian:
use ec encoding (not ^^xy notation, but literal ec-encoded bytes, which
no editor I know of is able to interpret)
czech, slovak:
\v e -> ě ecaron
\v c -> č ccaron
\v d -> ď dcaron
\v l -> ľ lcaron
\v n -> ň ncaron
\v r -> ř rcaron
\v s -> š scaron
\v t -> ť tcaron
\v z -> ž zcaron
\r u -> ů uring
\'a -> á aacute
\'e -> é eacute
\'i -> í iacute
\'o -> ó oacute
\'u -> ú uacute
\'r -> ŕ racute
\'y -> ý yacute
\"a -> ä adieresis
\^o -> ô ocircumflex
german:
"a -> ä
"o -> ö
"u -> ü
/3 -> ß
danish:
X -> æ
Y -> ø
Z -> å
esperanto (it's not even the right control sequence, but an approximation):
^c -> ĉ
^g -> ĝ
^h -> ĥ
^j -> ĵ
^s -> ŝ
^u -> ŭ
turkish:
@ -- arabic hamza
# -- ayn
c: ç E7 U+00E7 LATIN SMALL LETTER C WITH CEDILLA
d! ḍ -- U+1E0D LATIN SMALL LETTER D WITH DOT BELOW
d= ḏ -- U+1E0F LATIN SMALL LETTER D WITH LINE BELOW
g= ğ A7 U+011F LATIN SMALL LETTER G WITH BREVE
g: ġ -- U+0121 LATIN SMALL LETTER G WITH DOT ABOVE
h
h! ḥ -- U+1E25 LATIN SMALL LETTER H WITH DOT BELOW
h= ẖ -- U+1E96 LATIN SMALL LETTER H WITH LINE BELOW
k! ḳ -- U+1E33 LATIN SMALL LETTER K WITH DOT BELOW
n= ñ F1 U+00F1 LATIN SMALL LETTER N WITH TILDE
s! ṣ -- U+1E63 LATIN SMALL LETTER S WITH DOT BELOW
s= -- s with line below - not even in unicode
s: ş B3 U+015F LATIN SMALL LETTER S WITH CEDILLA
t! ṭ -- U+1E6D LATIN SMALL LETTER T WITH DOT BELOW
t= ṯ -- U+1E6F LATIN SMALL LETTER T WITH LINE BELOW
z! ẓ -- U+1E93 LATIN SMALL LETTER Z WITH DOT BELOW
z= ẕ -- U+1E95 LATIN SMALL LETTER Z WITH LINE BELOW
z: ż BB U+017C LATIN SMALL LETTER Z WITH DOT ABOVE
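To make the problem concrete, here is a minimal Python sketch of what unifying these conventions would involve: one lookup table per language, applied to the pattern text. Only the Polish and Slovenian tables listed above are included; the function name `to_utf8` and the sample strings are made up for illustration and are not real patterns.

```python
# Per-language notation tables, copied from the lists in this message.
POLISH = {
    "/a": "ą", "/c": "ć", "/e": "ę", "/l": "ł", "/n": "ń",
    "/o": "ó", "/s": "ś", "/x": "ź", "/z": "ż",
}
SLOVENIAN = {'"c': "č", '"s': "š", '"z': "ž"}


def to_utf8(text, table):
    """Replace every two-character notation found in `table` by its
    Unicode character; everything else is copied unchanged.
    (Hypothetical helper, assumes all notations are two characters long.)"""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Made-up sample strings, only to show the substitution.
print(to_utf8("1/a2/z", POLISH))
print(to_utf8('"c1"s', SLOVENIAN))
```

The same notation can mean different things in different files ("s is š in Slovenian but ș in Romanian), so the table has to be chosen per pattern file.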
> I never use XeTeX, I don't need it at all, nor do I read the xu- files (I do
> not even have them in my computer, no need all that stuff), and I'm likely
> not the only one, so for anything prior to LuaTeX, pattern files without
> multibyte characters need to be present.
The nice trick with TeX is that pdfTeX and the other 8-bit engines would
still see the patterns as if they were 8-bit. One doesn't type text
in ec encoding either, but in utf-8 or latin1 or whatever (inputenc adds
another layer on top of the font encoding). A macro would be called
before the patterns are loaded, so that eight-bit TeX engines read the
patterns properly.
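A rough illustration in Python of what that macro has to achieve (this is not the actual TeX code, and the names `EC_SLOT`, `utf8_bytes` and `to_ec` are hypothetical): an 8-bit engine sees a UTF-8 letter as several separate bytes, and those bytes have to end up as the single slot of the target encoding. The two EC slots used here are the ones quoted earlier in this message.

```python
# EC slots taken from the examples above: ^^a3 -> ccaron, ^^a2 -> cacute.
EC_SLOT = {"č": 0xA3, "ć": 0xA2}


def utf8_bytes(ch):
    """The sequence of bytes an 8-bit engine reads for one UTF-8 character."""
    return list(ch.encode("utf-8"))


def to_ec(text):
    """Map text to EC bytes: ASCII is kept, known letters go to their slot.
    Anything else raises, because this sketch only knows two letters."""
    out = bytearray()
    for ch in text:
        if ch in EC_SLOT:
            out.append(EC_SLOT[ch])
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError("no EC slot known for %r" % ch)
    return bytes(out)


print([hex(b) for b in utf8_bytes("č")])  # two bytes in the utf-8 file
print(to_ec("ča"))                        # one byte per letter after mapping
```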
The examples are in svn://tug.org/texhyphen, but they still need brief
instructions on how to use them (I need to write a torture test
now).
> That said, I don't mind if you make a copy of my patterns and transform them
> to UTF-8. Galician is a good language to make experiments with, very few
> people will be affected if it crashes, so you may sacrifice it along with
> Slovenian :-)
Thanks :)
>> Since your patterns are auto-generated, it might mean that tools that
>> auto-generate them might need to be adjusted a bit
>
>
> Fortunately not. Indeed, I may use old initex to generate the patterns in
> any single-byte encoding I wish, just changing the line
>
> \def\Ti{\encodingreplacements{á é í ó ú ñ ü ï}{^^e1 ^^e9 ^^ed ^^f3 ^^fa ^^f1
> ^^fc ^^ef}}
>
> in the generating file. It does not work for UTF-8, but not because of the
> tool used but because of processing them with initex.
I'll try to come back to the issue later then ...
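For reference, the replacement list in the quoted \encodingreplacements line is just the latin1 byte of each letter written in ^^xy notation. A small Python sketch (the function name `caret_notation` is made up) reproduces it and shows why the one-letter-one-byte scheme has no direct UTF-8 counterpart:

```python
def caret_notation(text, encoding="latin-1"):
    """Write every non-ASCII character of `text` as ^^xy, where xy is its
    byte in a single-byte encoding. ASCII characters are kept as they are."""
    out = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) != 1:
            raise ValueError("%r is not a single byte in %s" % (ch, encoding))
        out.append(ch if raw[0] < 0x80 else "^^%02x" % raw[0])
    return "".join(out)


print(caret_notation("á é í ó ú ñ ü ï"))

# In UTF-8 the same letter is two bytes, so one ^^xy cannot stand for it.
print(len("á".encode("utf-8")))
```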
> I will start playing with LuaTeX at the end of July or even in September,
> not before.
OK.
Thanks,
Mojca