[tex-live] Bug#571679: moving ucs back to -recommended

Wed Mar 10 19:12:56 CET 2010

----- Original Message ----

> From: Zdenek Wagner <zdenek.wagner at gmail.com>
> To: Norbert Preining <preining at logic.at>
> CC: Manuel at tug.org; ?= <mpg at elzevir.fr>; 571679 at bugs.debian.org; texlive at tug.org
> Sent: Wednesday, 10 March 2010, 18:38:50
> Subject: Re: [tex-live] Bug#571679: moving ucs back to -recommended
> 
> 2010/3/10 Norbert Preining :
> > On Wed, 10 Mar 2010, Robin Fairbairns wrote:
> >> unicode-based engines good, 8-bit engines (tex, pdftex,...) bad.
> >
> > Well, so many things have been done with 8-bit engines, so I
> > would not say bad, but space for improvement ;-)
> >
> Time for change. The inputenc package makes use of active characters
> and multibyte characters come in pieces.

I wouldn't argue that way: the same is true for every system that uses a variable-width encoding such as UTF-8 or UTF-16, and there are lots of these systems (Java, Windows, Cocoa...). You simply have to know that a "char" in Java is de facto a UTF-16 code unit, and that the basic string functions (length, indexing, substring) work on code units, not characters. The situation is not that different from the TeX one; it is just more obvious with UTF-8, because non-ASCII characters are ubiquitous whereas non-BMP characters are rare.
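
A minimal Python sketch of the code-unit point, purely as an illustration (the helper names utf8_units and utf16_units are made up for this example). It counts the same text once as characters and once as UTF-8 and UTF-16 code units:

```python
# Count characters (code points) versus code units for the same text.

def utf8_units(text: str) -> int:
    """Number of UTF-8 code units, i.e. bytes."""
    return len(text.encode("utf-8"))

def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, which is what Java's String.length() counts."""
    return len(text.encode("utf-16-le")) // 2

c_caron = "\u010d"      # LATIN SMALL LETTER C WITH CARON, inside the BMP
g_clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

for label, s in (("c-caron", c_caron), ("g-clef", g_clef)):
    # len(s) counts code points in Python 3
    print(label, len(s), utf8_units(s), utf16_units(s))
```

The Czech letter already splits into two units in UTF-8 but stays whole in UTF-16, while the non-BMP character splits in both.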

> If your macro contains
> \futurelet\somechar\dosomething then \somechar may contain a part of a
> character. Another example: suppose that you redefine \chapter so that
> it changes the first character into an initial by:
> 
> \def\PutInitial#1{\global\everypar{}%
>   % code for putting #1 as an initial
> }
> \everypar{\PutInitial}
> 
> Now #1 may contain just a part of a character and using this part will
> cause an error. That's why I cannot use inputenc even with Czech, I
> need above mentioned constructs in my LaTeX macros and the same macro
> must work both with accented and non-accented characters. Moreover,
> non-accented characters can be compared by \if, multibyte characters
> and active characters cannot.

These are not necessarily limitations of inputenc: If languages like C or Java had only a "char" (meaning: UTF-8 or UTF-16 code unit) and not a "string" datatype, similar issues would arise; that's why "char" datatypes have limited usage nowadays, and some languages like Python don't use them at all. The problem is rather that TeX has no real string datatype.
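
The difference can be sketched in Python, again only as an illustration: the byte view below is a rough analogy for what a macro argument sees under inputenc, not a model of how inputenc actually works, and the sample word is my own choice.

```python
word = "\u010daj"   # Czech "caj" with a caron on the c; "tea"

# Code-unit view: take the first unit of the UTF-8 byte sequence.
raw = word.encode("utf-8")
first_unit = raw[:1]            # only the lead byte of the first character

try:
    first_unit.decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False   # a lone lead byte is not a character

# String view: the first element is a whole character.
first_char = word[0]

print(first_unit, fragment_is_valid, first_char)
```

With a string type, "first character" is a well-defined operation; with a sequence of code units, the caller has to know how many units belong together.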

I think the problem with pdfTeX and inputenc is not so much the variable-width encoding, but rather the technical intricacies (LICRs, font encodings...). Usually the first step when reading a text file is to convert it to some internal (fixed) Unicode-enabled encoding, something pdfTeX just cannot do, and the macro language is inadequate for this.
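
As a sketch of that usual first step, here is how it looks in Python (the function name read_as_code_points and the sample word are assumptions for illustration only): the bytes are decoded once at the input boundary, and everything afterwards works on one integer per character.

```python
def read_as_code_points(raw: bytes, encoding: str = "utf-8") -> list:
    """Decode once at the input boundary; return one integer per character."""
    return [ord(ch) for ch in raw.decode(encoding)]

# Bytes as they would sit in a UTF-8 file (Czech for "too much").
data = "P\u0159\u00edli\u0161".encode("utf-8")

points = read_as_code_points(data)
print(len(data), len(points))       # byte count versus character count
print([hex(p) for p in points])
```

After this conversion, comparing or extracting a character is a fixed-width operation regardless of how the file was encoded.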
