[XeTeX] A question regarding 7-bit versus 8-bit encoding and fonts
Jonathan Kew
jonathan_kew at sil.org
Sun Feb 19 19:39:52 CET 2006
On 18 Feb 2006, at 11:26 am, Damlamian, Alain wrote:
> Hello everyone.
>
> I am still new to XeTeX, and am a user of TeX but no wizard at it
> at all.
> I have been doing some testing to see if I can import my previous
> files created with Textures under OS9 (I was informed of a new
> upcoming version for OSX, but will not be able to afford it, I guess!)
>
> My files were created using the \input tex8bits, so that accented
> letters are recognized in the source and appear as the proper
> accented letters in the output. But my files also contain some OS9
> fonts.
> This is why I turned to XeTeX under TeXShop.
>
I can see this is going to be a confusing situation.... I guess I
should try and explain what's going on with encodings, fonts, etc.
here. Let's take it slowly and hope it all makes sense in the end....
Your Textures files will use the MacRoman encoding, Apple's standard
8-bit character encoding that extended (7-bit) ASCII to provide
various European accented letters, symbols, etc. The purpose of the
"tex8bits" file was to map the character codes in MacRoman to the
appropriate TeX constructions, so that they print as expected using
standard TeX fonts. So the MacRoman character "á" (a-acute, byte code
135, or 0x87 in hexadecimal) is defined as an "active character" that
expands to \'{a}. (I'm guessing, here, as I don't have the Textures
file, but that'll be the general idea.)
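Something like this, perhaps (just a sketch of the general idea, since
I don't have the actual file):
\catcode"87=13     % 13 = "active"; "87 is the MacRoman byte for a-acute
\def^^87{\'{a}}    % ^^87 is TeX's notation for the character code 0x87
With that in place, every 0x87 byte in the MacRoman input typesets as
\'{a}.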
Textures also (as I understand it) gave access to the OS 9 fonts; I'm
not sure what the exact mechanism was, but am confident this access
was also based on 8-bit encodings (normally MacRoman). It also
provided a mapping of some kind so that TeX-style \accent commands,
etc., would generate the appropriate characters in these fonts (e.g.,
0x87 for \'{a}).
If you want to use Mac fonts under OS X with XeTeX, the situation is
a little different, because XeTeX is Unicode-based; in particular, it
accesses the fonts via Unicode character codes. So to produce an a-
acute in Zapfino via XeTeX, you need to generate the character code
0x00E1, not 0x87.
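For example (a minimal sketch, assuming Zapfino is installed; any other
OS X font name would do just as well):
\font\fancy="Zapfino" at 14pt
\fancy á     % with Unicode input, XeTeX passes the code 0x00E1 to the font
There is no 0x87 anywhere in that process.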
OK, let's take a break from fonts for a moment and consider the input
text. XeTeX is Unicode-based. By default, it's going to assume the
input text is Unicode (in the UTF-8 encoding form). If it reads one
of your MacRoman files, it's going to encounter byte values that,
when interpreted as UTF-8, either mean something quite different than
you expect, or (more likely) simply aren't valid UTF-8 at all. So
those characters are going to go missing.
You can address this in several ways. One is to convert your MacRoman
files to Unicode (UTF-8), which is what happens, I assume, if you
open them in TeXShop, add the
"%!TEX encoding = UTF-8 Unicode"
line, and save. Then the a-acute in your file is saved as the UTF-8
byte sequence <0xC3, 0xA1>, which is the UTF-8 representation of the
Unicode character code 0x00E1. When XeTeX reads this file, it
interprets the UTF-8 and processes the Unicode a-acute character. And
if you're using a Unicode font, all is well.
Another option is to leave the encoding of your MacRoman files
unchanged, but add the line
\XeTeXinputencoding "macintosh"
at the beginning (*before* any of the "high" characters!). This tells
XeTeX to interpret the following lines as MacRoman, and apply a
MacRoman->Unicode mapping. So the byte code 0x87 will be mapped to
0x00E1 for XeTeX to process.
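So the top of a legacy file might look like this (everything below the
first line stays in its original MacRoman encoding; the font name here
is only an example):
\XeTeXinputencoding "macintosh"
\font\body="Hoefler Text" at 11pt \body
Some text with á in it.   % stored as byte 0x87, read by XeTeX as 0x00E1
\bye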
This allows XeTeX to read the MacRoman text without "losing"
characters by misinterpreting bytes as UTF-8; and again, if you're
using a Unicode font, all is well. However, if you try to use macros
that were designed to work with the 8-bit characters of MacRoman, you
may not get the results you expect: for example, if you have macros
that make code 0x87 an \active character, expecting this to do
something with a-acute, it won't happen because the a-acute is now
0x00E1. (On the other hand, if the macro file said \catcode`\á=
\active rather than \catcode"87=\active, and if you also convert the
macro file to Unicode or read it with \XeTeXinputencoding
"macintosh", then it'll work.)
There's another option, too, though I don't really recommend it; you
can say:
\XeTeXinputencoding "bytes"
and then the input text will be read as individual bytes representing
the character codes 0..255. So your a-acute in a MacRoman file will
become 0x0087 internally in XeTeX. This is the closest thing to a
"byte encoding compatibility mode", but note that if you then print
that text with Unicode fonts, you're not going to get the right
characters (0x87 will NOT print as a-acute!).
(Still with me? Take a deep breath!)
Now, whether you keep the input text as MacRoman and use
\XeTeXinputencoding, or convert the text to Unicode, either way the
accented letters should be Unicode characters once they're in XeTeX.
And so they'll work with any Unicode font (provided it includes the
characters you're using, of course), with no further special handling.
What about the CM fonts? Those are NOT Unicode fonts; they're 7-bit
fonts with somewhat unusual encodings, and don't include accented
letters as such (but separate accents, for use with the \accent
command). So to make the accented letters work in CM, you need
something that makes the letters "active" and defines them to expand
into the TeX accent sequences such as \'{a}. The old Textures version
of tex8bits won't work as-is, because the character codes are
different, but the principle is similar, and perhaps that's what
Bruno's 8bitdefs is doing.
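I haven't looked at 8bitdefs, but I'd guess it does something roughly
along these lines for each accented letter (a sketch only):
\catcode`\á=13 \defá{\'{a}}    % Unicode a-acute -> TeX's accent construction
\catcode`\é=13 \defé{\'{e}}    % and likewise for e-acute, etc.
With those definitions in effect, á and é in the source typeset
correctly with CM.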
But if you load a file like this, then the accented characters are no
longer being printed "directly" as themselves, and so you won't get
the expected results from Unicode fonts!
This is a fundamental issue. The CM fonts and the Mac OS X fonts are
based on very different encodings, and so need very different
character codes to be presented to them. (Textures may have gotten
around this in the 8-bit world using "virtual fonts" to effectively
rearrange the character set of the Mac fonts to match TeX's
expectations.)
LaTeX provides a mechanism to handle this, with its concept of font
encodings; the font selection scheme associates encodings with fonts,
and knows to redefine things like the accent commands appropriately
when you switch between fonts with different encodings. But that's a
whole extra level of complexity on top of the basic font-selection of
TeX. Perhaps you could do a similar thing by arranging to load
8bitdefs when you're using a CM font, but deactivate the definitions
when you switch to a Unicode OS X font.
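In plain XeTeX, the bare bones of that idea might look something like
this (the font and macro names are only illustrative, and a real
solution would have to cover every accented character, not just á):
\font\cmten=cmr10
\font\macten="Hoefler Text" at 10pt
{\catcode`\á=13 \gdefá{\'{a}}}          % define the active meaning once
\def\useCM{\cmten \catcode`\á=13 }      % in CM, á expands to \'{a}
\def\useMac{\macten \catcode`\á=11 }    % in the Mac font, á is a plain letter
Switching with \useCM and \useMac would then let the same á in the
source work in both worlds, within the usual limitation that
category-code changes only affect text read after they take effect.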
In summary:
I don't see an easy way to mix CM and Mac fonts in plain (Xe)TeX,
because the character encodings are fundamentally different. Plain
TeX's \'{a} is defined in terms of where the acute accent is encoded
in CM, and so it won't work with Mac fonts. A simple á character will
print fine in Mac fonts, but isn't available in CM. And as soon as
you use some kind of macro package that redefines á to work in CM,
you've effectively replaced it (on-the-fly) with \'{a} and you're
back to that problem. You can, of course, redefine \' to do the right
thing for Unicode fonts, but then it won't be right for CM any longer.
So *either* you use 8bitdefs or something similar, and legacy TeX
fonts; *or* you use real Unicode characters and real Unicode fonts.
In the LaTeX world, the font selection scheme, along with packages
like fontspec and xunicode, manages to hide a great deal of this and
let you switch fonts freely, but behind the scenes there's a lot of
stuff being redefined when you change between the two "worlds". You
could do all the same stuff yourself in Plain, of course, but I
wouldn't want to try!
Hope this helps some people get a better grasp of the whole picture.
JK