[tex-live] Re: UTF-8 support

Vladimir Volovich vvv@vsu.ru
Wed, 22 Jan 2003 16:35:02 +0300


"PO" == Petr Olsak writes:

 PO> UTF-8 characters are interpreted by TeX as a sequence of
 PO> commands, so don't use calls like \macro ä instead of \macro{ä}.

 >>  it is always good style to delimit macro arguments with braces

 PO> It means that I can't completely switch all my old documents to
 PO> UTF-8, because problems can occur! On the other hand, encTeX
 PO> is a really robust solution.

 >> with encTeX, the expansion of a multibyte UTF-8 character can also
 >> be not a single letter but a sequence of several tokens (e.g. a call
 >> to a macro), so encTeX suffers from exactly the same "problem":
 >> you can't be sure that one UTF-8 character in the input file will
 >> be one token,

 PO> NO! Please, read the encTeX manual before this discussion.

OK, sorry - I didn't read it carefully enough...
Nevertheless, I still see a problem:

  encTeX cannot define all possible UTF-8 characters (there are far
  too many of them), so some valid UTF-8 character in an input file
  will not be translated at all. Thus \macro ä MAY still be processed
  incorrectly by encTeX if the multibyte UTF-8 sequence for ä was not
  defined there, and \macro will misbehave: \macro ä will get the
  first byte of the UTF-8 representation of ä instead of the whole
  character (just the same effect as mentioned for the ucs package).
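
  To illustrate (a minimal sketch; \macro is just a made-up macro, and
  I assume the two bytes of ä - ^^c3 ^^a4 in UTF-8 - keep their default
  catcodes, i.e. no mapping is defined for them):

    \def\macro#1{[#1]}  % takes one undelimited argument

    \macro ä    % #1 gets only the byte ^^c3; ^^a4 stays in the input
    \macro{ä}   % #1 gets the whole two-byte sequence, as intended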

  While with the UCS package (a purely TeX solution) you can at least
  generate sensible warnings for undefined UTF-8 characters which may
  occur in input files, without defining all 2^31 characters, this is
  not achievable with encTeX: characters which were left undefined and
  which appear in input files will fail badly in encTeX, without any
  warning.
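
  Here is a rough sketch (not the actual ucs code; the package name
  "myutf8" and the "u8:" prefix are made up) of how a pure TeX/LaTeX
  solution can warn: the lead byte is made active, grabs the
  continuation byte and does a \csname lookup, so only the characters
  you actually support need a definition:

    \catcode`\^^c3=13  % make the lead byte active
    \def^^c3#1{%
      \expandafter\ifx\csname u8:\string^^c3\string#1\endcsname\relax
        \PackageWarning{myutf8}{Undefined UTF-8 character}%
      \else
        \csname u8:\string^^c3\string#1\endcsname
      \fi}
    % each supported character is then just one extra definition:
    \expandafter\def\csname u8:\string^^c3\string^^a4\endcsname{\"a}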

Also, as far as I understand, encTeX is a very limited solution: it
mostly assumes that one uses a single text font encoding (e.g. T1)
throughout the document (just like TCX), and thus it does not provide
a solution for really multilingual UTF-8 documents.

If I'm wrong, please correct me - give an example of how one can use
encTeX to support e.g. the T1 and T2A font encodings (for, say, French
and Cyrillic) in the same document. I think there will be problems,
because the same slot numbers in the T1 and T2A encodings contain
completely different glyphs, and the reverse mappings will not work
correctly.

A purely (La)TeX solution works just fine - e.g. the ucs package, or
my small UTF-8 input encoding support at
CTAN:macros/latex/contrib/supported/t2/etc/utf-8
(see e.g. the multilingual example file in that directory).
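
For reference, here is the kind of document I mean - a minimal sketch
using standard LaTeX packages (the exact inputenc and babel option
names may differ between versions):

  \documentclass{article}
  \usepackage[T2A,T1]{fontenc}    % Cyrillic and Latin font encodings
  \usepackage{ucs}
  \usepackage[utf8]{inputenc}     % UTF-8 input via the ucs package
  \usepackage[russian,french]{babel}
  \begin{document}
  Voilà du français et
  \foreignlanguage{russian}{немного русского текста}.
  \end{document}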

 PO> The second example: You have written that \write files include
 PO> only the \'A notation of characters in LaTeX. Do you know documents
 PO> where you have to re-read the \write files in verbatim mode? I
 PO> know such documents. What happens in LaTeX in such a situation?
 >> nothing bad - it is very well possible to write to files in LaTeX
 >> using the ASCII LICR representation, and then read the files back:
 >> you'll need to translate \ into, say, \textbackslash, and
 >> characters like Á to \'A (which is a native representation in
 >> LaTeX); then, when you read the file back, all will be correct:
 >>   * Á will be written as \'A, and read back as Á
 >>   * \'A will be written as \textbackslash 'A, and read back as \'A
 >> so the verbatim representation will be preserved.
 >> (the fancyvrb package contains a lot of such framework)
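
[to make the quoted recipe concrete - a rough sketch, with a made-up
file name dance.tmp:

  \newwrite\verbfile
  \immediate\openout\verbfile=dance.tmp
  % Á goes out in its LICR form \'A,
  % a literal \'A goes out as \textbackslash 'A:
  \immediate\write\verbfile{\string\'A}
  \immediate\write\verbfile{\string\textbackslash\space'A}
  \immediate\closeout\verbfile
  % read back with normal catcodes: the first line produces Á again,
  % the second typesets the three characters \'A:
  \input{dance.tmp}
]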

 PO> The "\textbackslash dance" will help you if the native verbatim
 PO> environment is used. But if you first set all \catcodes to 12
 PO> (including the backslash) and then \input the external file, no
 PO> \textbackslash will help you.

 PO> Sorry, I am not a TeX novice, I _know_ what I am saying. The
 PO> LaTeX solution of UTF-8 encoding is not robust.

Sorry, I don't understand your reasoning... Are you saying that it is
impossible to achieve some effect with writing to files and verbatim
reading of files in TeX, using purely TeX machinery?

If so, could you describe what it is? You are not forced to change the
backslash's catcode to 12 when reading or writing files - nothing
prevents you from preserving the original catcode when reading.

I.e., what does encTeX buy us w.r.t. verbatim that could not be
achieved without any extensions to TeX? Could you give a small example?
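
For instance, continuing the small sketch above: the file dance.tmp
can be read back with the usual special characters made "other", while
the backslash is deliberately left alone, so \'A and \textbackslash in
it are still interpreted as commands (a real verbatim setup would of
course also have to deal with spaces and line ends):

  \begingroup
    \catcode`\{=12 \catcode`\}=12 \catcode`\$=12 \catcode`\&=12
    \catcode`\#=12 \catcode`\^=12 \catcode`\_=12 \catcode`\~=12
    \input dance.tmp % plain-TeX-style \input, no braces needed
  \endgroup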

Best,
v.