[XeTeX] Table of contents

Michiel Kamermans pomax at nihongoresources.com
Fri Apr 30 12:33:48 CEST 2010


On 4/30/2010 1:56 AM, José Carlos Santos wrote:
> Since no sophisticated solution appeared (or occurred to me), I shall 
> do that. But I think it is a flaw.

You don't need a sophisticated solution when the simple one is the only 
correct one. In the old days we had to worry about which character set 
we were using because different sets gave you different characters at 
identical codepoints. In the old, dark days of not-a-lot-of-memory and 
"there are other languages?" land, resolving chr(128) could give you the 
euro symbol, or it might be one byte of a multi-byte Japanese 
character, and - and this is the important part - the file you were 
reading wouldn't in any way indicate wtf you were supposed to do.
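
As an illustration of that ambiguity, here is a minimal Python sketch. It 
is not from the original thread: the sample bytes and the choice of 
codepages are assumptions picked for the example, and decode_all is a 
hypothetical helper. It decodes the same raw bytes under several legacy 
encodings:

```python
# The same raw bytes mean different things under different legacy codepages.
# Codec names are Python's standard ones; the sample bytes are illustrative.

def decode_all(raw, encodings):
    """Decode `raw` under each encoding and return {encoding: text}."""
    return {enc: raw.decode(enc) for enc in encodings}

# One byte, three Western/Cyrillic codepages: euro sign, C-cedilla, Cyrillic DJE.
single = decode_all(b"\x80", ["cp1252", "cp437", "cp1251"])

# Two bytes: a single hiragana in Shift-JIS, two unrelated symbols in cp1252.
pair = decode_all(b"\x82\xa0", ["shift_jis", "cp1252"])

for label, result in (("0x80", single), ("0x82 0xA0", pair)):
    for enc, text in result.items():
        print(label, enc, repr(text))
```

Nothing in the bytes themselves says which of these readings is the 
intended one; that information has to come from somewhere outside the file.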

With more modern computing technology available, and a realisation that 
there are such things as other countries and writing systems, this 
became ridiculous, and we came up with a new way of doing things: 
Unicode, mirrored by the ISO/IEC 10646 standard. This convention 
specifies a huge character map with every distinct character-entity in 
its own spot, and specifies various ways (UTF-8, UTF-16, UTF-32) in 
which you can encode the parts of that map that you actually use 
efficiently in a file, so that on average storing Unicode data takes up 
an irrelevantly small amount more disk space than storing it using the 
crooked concept of codepages. More importantly, data stored in Unicode 
can TELL you that it's stored in Unicode, by starting with a byte order 
mark. With that, the world finally realised how much better things 
were if the data actually indicated how to decode it, instead of having 
people go "okay but... what the heck codepage is this text file actually 
in?".
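
Both points can be sketched in a few lines of Python. This is an 
illustration under stated assumptions, not anything from the original 
thread: the sample characters are arbitrary, and sniff_bom is a 
hypothetical helper that only looks for a byte order mark. Note that the 
mark is optional, and plenty of UTF-8 files carry none, so a missing 
mark proves nothing either way.

```python
import codecs

# How many bytes UTF-8 spends per character: one for ASCII, more only
# for characters that need it.
samples = ["a", "é", "λ", "日"]
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

def sniff_bom(raw):
    """Return the encoding named by a leading byte order mark, or None."""
    # UTF-32 marks are checked first because the UTF-32-LE mark begins
    # with the same two bytes as the UTF-16-LE mark.
    marks = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in marks:
        if raw.startswith(bom):
            return name
    return None

print(utf8_sizes)
print(sniff_bom("é".encode("utf-8-sig")))  # UTF-8 written with a mark
print(sniff_bom("é".encode("cp1252")))     # a codepage file gives no hint
```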

The whole reason XeTeX exists is because there was no real Unicode 
awareness in the various flavours of TeX until Jonathan Kew started 
making one. The lack of indicating an encoding in XeTeX is not a flaw, 
but represents a victory for people who actually want to write things 
properly. This is what we should have had in the first place, if there 
hadn't been all those early computing technological restrictions. It 
finally let us all write in every language imaginable without having to 
worry about whether or not that letter we wanted was in the code page we 
were using, and if it wasn't, how to construct what should be an 
arbitrarily simple character using lots of TeX code to combine letters 
and symbols in ways that only worked for that one font we were actually 
using. By going with Unicode, XeTeX made, and still makes, things 
intuitively easy. You write your text, XeTeX compiles what you wrote, 
and you are not bothered by trying to figure out whether or not the 
character you want is in the codepage you're using.

Of course, note that this is very different from needing to verify 
that the character you want is in the font you are using. Codepages tell 
you which characters even exist as far as the computer is concerned. 
Need a lambda symbol when you're writing something in cp1252? Tough, it 
doesn't exist. Not just "in the font you are using", it simply doesn't 
exist until you change the codepage for your entire data context to 
something else.
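
A minimal Python sketch of that point follows. It is an illustration 
only: fits_codepage is a hypothetical helper and the test characters are 
assumptions chosen for the example. It asks whether a piece of text can 
be represented in a given codepage at all, regardless of any font:

```python
def fits_codepage(text, codepage):
    """Return True if every character of `text` exists in `codepage`."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        # At least one character has no slot in this codepage.
        return False

print(fits_codepage("λ", "cp1252"))   # Western European: no lambda
print(fits_codepage("λ", "cp1253"))   # Greek codepage: lambda exists
print(fits_codepage("œ", "latin-1"))  # French ligature missing from Latin-1
print(fits_codepage("œλ", "utf-8"))   # Unicode: both exist side by side
```

With a Unicode encoding the question never comes up; the only thing 
left to check is the font.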

Codepages are a thing from a dark past, when typesetting was severely 
impaired by fonts simply not being big enough to actually contain all 
the letters people might need, and there not being a well-defined 
codepoint mapping for glyphs (what you see with your eyes) and 
characters (what the thing you're seeing actually represents).

At this point in time (finally, one might add) only old operating 
systems still really care about codepages - the rest have moved on to 
embrace a world where it doesn't matter what language you write in, 
because letters from one language are no longer mutually exclusive with 
another. In the TeX world, too, there are great efforts being made to 
ditch the antique concept of codepages, with XeTeX and LuaTeX constantly 
improving.

If you want to typeset things nicely, and you actually care about the 
language you're using - you're using French, so you really should care - 
don't use cp1252; the "ANSI" label Windows gives it refers to the 
*AMERICAN* National Standards Institute. Codepages were invented to 
overcome the problem of only having 256 spots for letters. Unicode 
solved that problem. Why make XeTeX use a solution for a problem that 
doesn't exist anymore?

- Mike

