[tex-live] Re: UTF-8 support

Vladimir Volovich vvv@vsu.ru
Fri, 24 Jan 2003 19:18:57 +0300


"PO" == Petr Olsak writes:

 >> encTeX cannot define all possible UTF-8 characters (due to the
 >> very large number of characters), so some valid UTF-8 characters
 >> in an input file will not be translated in any way; thus \macro ä
 >> MAY still be processed incorrectly by encTeX if the multibyte
 >> UTF-8 sequence for ä was not defined in encTeX, and \macro will
 >> misbehave: \macro ä will get the first byte of the UTF-8
 >> representation of ä instead of the whole character (just the same
 >> effect as mentioned in the ucs package).

 PO> Yes, but all characters _supported_ in the encTeX table behave
 PO> at the macro level as a single token.

but encTeX cannot support all characters (due to the very large
number of characters representable in UTF-8: 2^31), so some uses of
\macro ä MAY still be processed incorrectly by encTeX, and will fail
in exactly the same way as in ucs.sty.

 PO> On the other hand, all characters (supported or not) in ucs.sty
 PO> behave as multiple tokens.

that is, in my opinion, not a major limitation (one can always use
braces to delimit macro arguments). also, in encTeX, the "dirty
tricks" with e.g. \catcode will not always work: when some UCS
character is defined as a \macro, \catcode will produce an error.
so compatibility with old documents is not preserved in all cases.

e.g., if i use
\chardef\ruYa="code
\mubyte \ruYa ....\endmubyte
and then the input file contains:
\catcode`\<ruYa>=11
then encTeX will fail badly, as it will translate it to
\catcode`\\ruYa=11
am i wrong?

so, with standard TeX an old package which used \catcode`\<ruYa>=11
(where <ruYa> is an 8-bit character) works, but with encTeX (where
the text is recoded to UTF-8 and <ruYa> becomes a multibyte
character) it will fail, so compatibility is not preserved.
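
to make the failure concrete, here is a minimal sketch (the T2A slot
number, the UTF-8 bytes of the Cyrillic Ya, and the ^^-notation
inside \mubyte are my assumptions):

\chardef\ruYa="DF                 % assumed T2A slot of CYRYA
\mubyte \ruYa ^^d0^^af\endmubyte  % assumed UTF-8 bytes of U+042F
% an old 8-bit document then says:
%   \catcode`\<Ya>=11
% after encTeX recodes the input, the bytes of <Ya> arrive as the
% single control-sequence token \ruYa, so \catcode` no longer sees
% a character token and TeX reports an error.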

 >> while with the UCS package (a purely TeX solution) you can at
 >> least generate sensible warnings for undefined UTF-8 characters
 >> which may occur in input files, without defining all 2^31
 >> characters; this is not achievable with encTeX - characters which
 >> were left undefined and which appear in input files will fail
 >> horribly in encTeX, without any warning.

 PO> Warnings are possible, see section 3.5 in the encTeX
 PO> documentation, the \warntwobytes example. You can define
 PO> \warnthreebytes similarly, and you can declare all unsupported
 PO> UTF-8 codes by inserting these control sequences into the encTeX
 PO> conversion table with a simple \loop within a \loop.

you are saying that encTeX is capable of storing a table of
\warn...byte definitions for 2^31 characters in memory? :)

with the UCS package, it is possible to produce a warning without
defining all possible characters; the UCS package can also load
definitions for Unicode ranges on demand, when they are used in the
document, and thus does not need to store a huge set of macros in
memory (whereas encTeX has to load all its definitions into the
format file).
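
for the record, the warning technique needs no TeX extension at all;
here is a rough plain-TeX sketch of the idea (this is NOT the actual
ucs.sty code, and the "u8:" naming is my own):

\catcode`\^^d0=13  % make one UTF-8 lead byte active
\def^^d0#1{%       % it grabs its continuation byte and looks it up
  \expandafter\ifx\csname u8:\string^^d0\string#1\endcsname\relax
    \message{Warning: undefined UTF-8 character}%
  \else
    \csname u8:\string^^d0\string#1\endcsname
  \fi}
% only the characters actually needed get a definition
% (assuming \CYRYA is defined by the font setup):
\expandafter\def\csname u8:\string^^d0\string^^af\endcsname{\CYRYA}

one such active-character macro per lead byte covers a whole range of
code points, so nothing like 2^31 table entries is ever stored.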

 >> also, as far as i understand, encTeX is a very limited solution:
 >> it mostly assumes that one uses a single text font encoding
 >> (e.g. T1) throughout the document (just like TCX), and thus it
 >> does not provide a solution for really multilingual UTF-8
 >> documents.
 >> 
 >> if i'm wrong, please correct me - give an example of how one can
 >> use encTeX to support e.g. the T1 and T2A font encodings (for
 >> e.g. french and cyrillic) in the same document. i think that there
 >> will be problems, because the same slot numbers in the T1 and T2A
 >> encodings contain completely different glyphs, so reverse mappings
 >> will not work correctly.

 PO> Yes, this is a limitation of 8-bit TeX itself. You cannot switch
 PO> between several font encodings in one paragraph, because the
 PO> \lccodes are used by the hyphenation algorithm only at the end
 PO> of the paragraph. This is the reason why I did not implement
 PO> multi-tables in encTeX.

that's not what i'm arguing about: T1 and T2A (and the other standard
LaTeX T* encodings) all have compatible uc/lccode settings and do not
require changing them, thus texts in T1 and T2A can coexist in one
paragraph without breaking hyphenation, as the sketch below shows.
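
(a minimal LaTeX sketch, assuming a standard babel + fontenc setup;
the sample wording is of course arbitrary):

\documentclass{article}
\usepackage[T2A,T1]{fontenc}
\usepackage[russian,french]{babel}
\begin{document}
% french (T1) and russian (T2A) in the same paragraph; hyphenation
% stays correct because the encodings agree on \uccode/\lccode:
Voil\`a du fran\c cais et du russe
(\foreignlanguage{russian}{\CYRR\cyru\cyrs\cyrs\cyrk\cyri\cyrishrt})
dans le m\^eme paragraphe.
\end{document}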

 PO> You can declare unlimited UTF-8 codes as control sequences in
 PO> encTeX. The Russians can declare the UTF-8 code of Ya as \ruYa
 PO> (for example) and define \chardef\ruYa="code (or define it as a
 PO> macro). You can switch the font and do Russian typesetting from
 PO> UTF-8 encoding. The reverse mapping to \write files is kept for
 PO> all characters and control sequences stored in encTeX.

as far as i understand (see above), encTeX does not provide full
compatibility with old macros; more importantly, this functionality
is already available in Omega: you can use

\ocp\someOCP=sometranslation
\InputTranslation currentfile \someOCP

and put all the necessary translation rules into sometranslation.otp,
and this will provide the same functionality as encTeX does
(there is also \OutputTranslation for writing to files).

e.g. i've just made a small proof of concept OTP to process the file
math-example.tex with Omega and got all characters right (using just
ordinary TeX fonts).
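
for illustration, the core of a UTF-8 input OTP is only a few lines;
the rules below follow the pattern of Omega's own inutf8.otp, but i'm
quoting them from memory, so treat this as a sketch:

input:  1;
output: 2;

expressions:

% ASCII passes through unchanged:
@"00-@"7F => \1;
% decode a two-byte UTF-8 sequence arithmetically:
@"C0-@"DF @"80-@"BF => #(((\1 - @"C0) * @"40) + (\2 - @"80));
% a real OTP would also handle the 3- and 4-byte sequences:
. => \1;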

so why use yet another non-standard TeX extension instead of the
existing (and more powerful) Omega?

 >> sorry, i don't understand your reasoning... are you saying that it
 >> is impossible to achieve some effect with writing to files and
 >> verbatim reading of files from TeX, using purely TeX machinery?

 PO> It is possible, but you cannot \write the \'A sequences into
 PO> such files. If you are working with more than 256 codes, then
 PO> it is not possible in purely 8-bit TeX, but it is possible in
 PO> encTeX.

you are wrong... LaTeX + the UCS package provides exactly this
functionality: you can have a UTF-8 encoded input file with cyrillic,
french, czech, etc. characters, and writing to files will correctly
convert all these characters to the encoding-invariant internal
representation (\'A, \CYRYA, etc). would you like me to make a sample
file?
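
in the meantime, here is a minimal LaTeX sketch of the mechanism (the
file and stream names are mine, and i'm writing the ucs loading
incantation from memory):

\documentclass{article}
\usepackage{ucs}
\usepackage[utf8]{inputenc}  % ucs provides the utf8 inputenc module
\newwrite\testout
\begin{document}
\immediate\openout\testout=chars-out.tex
\makeatletter
% \protected@write defers the \write to shipout and expands each
% character only to its internal form, so the file receives e.g.
% \"a (possibly wrapped in a protective macro), not raw UTF-8 bytes:
\protected@write\testout{}{ä}
\makeatother
x % put something on the page so that a shipout actually happens
\end{document}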

 >> if so, could you describe what it is? you are not forced to
 >> redefine the backslash's catcode to 12 when reading or writing
 >> files - nothing prevents you from preserving the original
 >> catcodes when reading.

 PO> I am forced to redefine the backslash's catcode when (for
 PO> example) I am writing a TeX manual and some examples are shown
 PO> in two versions: the typeset version and the input-file version.
 PO> These examples are written only once in the source of this
 PO> manual. I need all Czech characters to be shown in this manual
 PO> in their native form.

i see no problem here either: there's no need for a TeX extension to
accomplish this task.
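
a rough plain-TeX sketch in the spirit of the verbatim macros in
appendix D of The TeXbook (\showverbatim is my own name): keep the
example in its own file and read it twice, once with all special
characters made harmless and once normally.

\def\uncatcodespecials{\def\do##1{\catcode`##1=12 }\dospecials}
\def\showverbatim#1{%
  \begingroup
    \tt \uncatcodespecials \obeylines \obeyspaces
    \input #1
  \endgroup}
\showverbatim{example.tex}  % the example as typed
\input example.tex          % the example typeset

the catcode changes are local to the group, so the Czech (or any
8-bit) characters in example.tex keep whatever meaning the
surrounding document gives them in the typeset pass.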

Best,
v.