[tex-live] Re: UTF-8 support

Petr Olsak olsak@math.feld.cvut.cz
Wed, 22 Jan 2003 15:51:10 +0100 (CET)


On Wed, 22 Jan 2003, Vladimir Volovich wrote:

>  >> with encTeX, expansion of a multibyte UTF-8 character can also be
>  >> not a single letter, but a sequence of several tokens (e.g. a call
>  >> to macro), - so encTeX suffers from exactly the same "problem":
>  >> you can't be sure that one UTF-8 character in the input file will
>  >> be one token,
>
>  PO> NO! Please, read the encTeX manual before this discussion.
>
> OK, sorry, - i didn't read it carefully enough...
> nevertheless, i still see a problem:
>
>   encTeX cannot define all possible UTF-8 characters (due to very big
>   number of characters), so some valid UTF-8 character in input file
>   will not be translated in any way, thus \macro ä MAY still be
>   processed incorrectly by encTeX if a multibyte UTF-8 sequence for ä
>   was not defined in encTeX, and a bad behavior of \macro will occur.
>   i.e. \macro ä will get the first byte of UTF-8 representation of ä
>   instead of the whole character (just the same effect as mentioned in
>   the ucs package).

Yes, but every character _supported_ in the encTeX table behaves as a
single token at the macro level. By contrast, every character (supported
or not) behaves as multiple tokens in ucs.sty.
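
To make the difference concrete, here is a minimal Python sketch of the
input stage. It is only a toy model, not encTeX or ucs.sty code: the
names `tokenize` and `table` are hypothetical, and the longest-match rule
is an assumption of the sketch. A byte sequence found in the table comes
out as one item; anything else falls through byte by byte.

```python
# Toy model of input conversion; NOT encTeX or ucs.sty code.
# `table` is a hypothetical stand-in for a conversion table that maps
# multibyte sequences to single tokens.

def tokenize(data: bytes, table: dict) -> list:
    """Convert known byte sequences to one token; pass other bytes through singly."""
    longest = max((len(key) for key in table), default=1)
    tokens = []
    i = 0
    while i < len(data):
        for n in range(longest, 0, -1):
            chunk = data[i:i + n]
            if chunk in table:
                tokens.append(table[chunk])
                i += len(chunk)
                break
        else:
            # No table entry starts here: emit a single raw byte.
            tokens.append(data[i:i + 1])
            i += 1
    return tokens

a_umlaut = "\u00e4".encode("utf-8")   # the two bytes C3 A4

with_entry = tokenize(a_umlaut, {a_umlaut: "a-umlaut"})
without_entry = tokenize(a_umlaut, {})

print(with_entry)      # one item
print(without_entry)   # two items, one per byte
```

In the second case a macro taking one argument would see only the first
byte, which is the failure described in the quoted text.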

>   while with UCS package (purely TeX solution) you can at least
>   generate sensible warnings for undefined UTF-8 characters which may
>   occur in input files without defining all 2^31 characters, it is not
>   achievable with encTeX - characters which were left undefined and
>   which will appear in input files will horribly fail in encTeX
>   without any warning.

Warnings are possible; see section 3.5 of the encTeX documentation, the
\warntwobytes example. You can define \warnthreebytes similarly, and you
can declare all unsupported UTF-8 codes by inserting these control
sequences into the encTeX conversion table with a simple \loop inside a
\loop.
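
The size of such a nested loop is easy to check. The following Python
sketch is only an illustration of the enumeration, not the encTeX
declaration itself; the set `supported` is a hypothetical example. It
walks all two-byte UTF-8 sequences (lead byte C2..DF, continuation byte
80..BF) and keeps those that have no real definition and would therefore
get the warning control sequence.

```python
# Enumerate every two-byte UTF-8 sequence with a loop inside a loop.
# Lead bytes are C2..DF, continuation bytes are 80..BF.
two_byte = [bytes([lead, cont])
            for lead in range(0xC2, 0xE0)
            for cont in range(0x80, 0xC0)]

# Hypothetical set of sequences that already have a real definition.
supported = {"\u00e4".encode("utf-8"), "\u00f6".encode("utf-8")}

# Every remaining sequence would be mapped to the warning control sequence.
warn = [seq for seq in two_byte if seq not in supported]

print(len(two_byte), len(warn))
```

The two-byte range covers U+0080 to U+07FF, so the table stays small; the
three-byte range needs one more loop level and is correspondingly larger.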

> also, as far as i understand, encTeX is a very limited solution: it
> mostly assumes that one uses a single text font encoding (e.g. T1)
> throughout the document (just like TCX), and thus it does not provide
> solution for really multilingual UTF-8 documents.
>
> if i'm wrong, please correct me - give an example of how one can use
> encTeX to support e.g. T1 and T2A font encodings (for e.g. french and
> cyrillic) in the same document. i think that there will be problems
> because the same slot numbers in T1 and T2A encodings contain
> completely different glyphs, and reverse mappings will not work
> correctly.

Yes, this is a limitation of 8-bit TeX itself. You cannot switch between
several font encodings within one paragraph, because the \lccodes are
used by the hyphenation algorithm only at the end of the paragraph. This
is the reason why I did not implement multiple tables in encTeX.

You can declare an unlimited number of UTF-8 codes as control sequences
in encTeX. Russian users can declare the UTF-8 code of Ya as \ruYa (for
example) and define it with \chardef\ruYa="code (or as a macro). You can
then switch the font and do Russian typesetting from UTF-8 encoding. The
reverse mapping to \write files is kept for all characters and control
sequences stored in encTeX.

> sorry, i don't understand your reasoning... are you saying that it is
> impossible to achieve some effect with writing to files and verbatim
> reading of files from TeX, using purely TeX machinery?

It is possible, but you cannot \write the \'A sequences into such files.
If you are working with more than 256 codes, then it is not possible in
pure 8-bit TeX, but it is possible in encTeX.

> if so, could you describe what is it? you are not forced to redefine
> the backslash's catcode to 12 when reading or writing files, - nothing
> prevents you to preserve the original catcode when reading.

I am forced to redefine the backslash's catcode when (for example) I am
writing a TeX manual and some examples are shown in two versions: the
typeset version and the input file version. These examples are written
only once in the source of the manual. I need all Czech characters to be
shown in this manual in their native form.

Best regards

Petr Olsak