[XeTeX] A question regarding 7-bit versus 8-bit encoding and fonts
Jonathan Kew
jonathan_kew at sil.org
Sun Feb 19 19:39:52 CET 2006
On 18 Feb 2006, at 11:26 am, Damlamian, Alain wrote:
> Hello everyone.
>
> I am still new to XeTeX, and am a user of TeX but no wizard at it
> at all.
> I have been doing some testing to see if I can import my previous
> files created with Textures under OS9 (I was informed of a new
> upcoming version for OSX, but will not be able to afford it, I guess!)
>
> My files were created using the \input tex8bits, so that accented
> letters are recognized in the source and appear as the proper
> accented letters in the output. But my files also contain some OS9
> fonts.
> This is why I turned to XeTeX under TeXShop.
>
I can see this is going to be a confusing situation.... I guess I
should try and explain what's going on with encodings, fonts, etc.
here. Let's take it slowly and hope it all makes sense in the end....
Your Textures files will use the MacRoman encoding, Apple's standard
8-bit character encoding that extended (7-bit) ASCII to provide
various European accented letters, symbols, etc. The purpose of the
"tex8bits" file was to map the character codes in MacRoman to the
appropriate TeX constructions, so that they print as expected using
standard TeX fonts. So the MacRoman character "á" (a-acute, byte code
135, or 0x87 in hexadecimal) is defined as an "active character" that
expands to \'{a}. (I'm guessing, here, as I don't have the Textures
file, but that'll be the general idea.)
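Something like this, perhaps (just a sketch of the general idea, since
I don't have the actual file):
\catcode"87=13     % 13 = "active"; "87 is the MacRoman byte for a-acute
\def^^87{\'{a}}    % ^^87 is TeX's notation for the character code 0x87
With that in place, every 0x87 byte in the MacRoman input typesets as
\'{a}.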
Textures also (as I understand it) gave access to the OS 9 fonts; I'm
not sure what the exact mechanism was, but am confident this access
was also based on 8-bit encodings (normally MacRoman). It also
provided a mapping of some kind so that TeX-style \accent commands,
etc., would generate the appropriate characters in these fonts (e.g.,
0x87 for \'{a}).
If you want to use Mac fonts under OS X with XeTeX, the situation is
a little different, because XeTeX is Unicode-based; in particular, it
accesses the fonts via Unicode character codes. So to produce an a-
acute in Zapfino via XeTeX, you need to generate the character code
0x00E1, not 0x87.
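For example (a minimal sketch, assuming Zapfino is installed; any other
OS X font name would do just as well):
\font\fancy="Zapfino" at 14pt
\fancy á     % with Unicode input, XeTeX passes the code 0x00E1 to the font
There is no 0x87 anywhere in that process.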
OK, let's take a break from fonts for a moment and consider the input
text. XeTeX is Unicode-based. By default, it's going to assume the
input text is Unicode (in the UTF-8 encoding form). If it reads one
of your MacRoman files, it's going to encounter byte values that,
when interpreted as UTF-8, either mean something quite different than
you expect, or (more likely) simply aren't valid UTF-8 at all. So
those characters are going to go missing.
You can address this in several ways. One is to convert your MacRoman
files to Unicode (UTF-8), which is what happens, I assume, if you
open them in TeXShop, add the
"%!TEX encoding = UTF-8 Unicode"
line, and save. Then the a-acute in your file is saved as the UTF-8
byte sequence <0xC3, 0xA1>, which is the UTF-8 representation of the
Unicode character code 0x00E1. When XeTeX reads this file, it
interprets the UTF-8 and processes the Unicode a-acute character. And
if you're using a Unicode font, all is well.
Another option is to leave the encoding of your MacRoman files
unchanged, but add the line
\XeTeXinputencoding "macintosh"
at the beginning (*before* any of the "high" characters!). This tells
XeTeX to interpret the following lines as MacRoman, and apply a
MacRoman->Unicode mapping. So the byte code 0x87 will be mapped to
0x00E1 for XeTeX to process.
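So the top of a legacy file might look like this (everything below the
first line stays in its original MacRoman encoding; the font name here
is only an example):
\XeTeXinputencoding "macintosh"
\font\body="Hoefler Text" at 11pt \body
Some text with á in it.   % stored as byte 0x87, read by XeTeX as 0x00E1
\bye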
This allows XeTeX to read the MacRoman text without "losing"
characters by misinterpreting bytes as UTF-8; and again, if you're
using a Unicode font, all is well. However, if you try to use macros
that were designed to work with the 8-bit characters of MacRoman, you
may not get the results you expect: for example, if you have macros
that make code 0x87 an \active character, expecting this to do
something with a-acute, it won't happen because the a-acute is now
0x00E1. (On the other hand, if the macro file said \catcode`\á=
\active rather than \catcode"87=\active, and if you also convert the
macro file to Unicode or read it with \XeTeXinputencoding
"macintosh", then it'll work.)
There's another option, too, though I don't really recommend it; you
can say:
\XeTeXinputencoding "bytes"
and then the input text will be read as individual bytes representing
the character codes 0..255. So your a-acute in a MacRoman file will
become 0x0087 internally in XeTeX. This is the closest thing to a
"byte encoding compatibility mode", but note that if you then print
that text with Unicode fonts, you're not going to get the right
characters (0x87 will NOT print as a-acute!).
(Still with me? Take a deep breath!)
Now, whether you keep the input text as MacRoman and use
\XeTeXinputencoding, or convert the text to Unicode, either way the
accented letters should be Unicode characters once they're in XeTeX.
And so they'll work with any Unicode font (provided it includes the
characters you're using, of course), with no further special handling.
What about the CM fonts? Those are NOT Unicode fonts; they're 7-bit
fonts with somewhat unusual encodings, and don't include accented
letters as such (but separate accents, for use with the \accent
command). So to make the accented letters work in CM, you need
something that makes the letters "active" and defines them to expand
into the TeX accent sequences such as \'{a}. The old Textures version
of tex8bits won't work as-is, because the character codes are
different, but the principle is similar, and perhaps that's what
Bruno's 8bitdefs is doing.
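I haven't looked at 8bitdefs, but I'd guess it does something roughly
along these lines for each accented letter (a sketch only):
\catcode`\á=13 \defá{\'{a}}    % Unicode a-acute -> TeX's accent construction
\catcode`\é=13 \defé{\'{e}}    % and likewise for e-acute, etc.
With those definitions in effect, á and é in the source typeset
correctly with CM.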
But if you load a file like this, then the accented characters are no
longer being printed "directly" as themselves, and so you won't get
the expected results from Unicode fonts!
This is a fundamental issue. The CM fonts and the Mac OS X fonts are
based on very different encodings, and so need very different
character codes to be presented to them. (Textures may have gotten
around this in the 8-bit world using "virtual fonts" to effectively
rearrange the character set of the Mac fonts to match TeX's
expectations.)
LaTeX provides a mechanism to handle this, with its concept of font
encodings; the font selection scheme associates encodings with fonts,
and knows to redefine things like the accent commands appropriately
when you switch between fonts with different encodings. But that's a
whole extra level of complexity on top of the basic font-selection of
TeX. Perhaps you could do a similar thing by arranging to load
8bitdefs when you're using a CM font, but deactivate the definitions
when you switch to a Unicode OS X font.
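In plain XeTeX, the bare bones of that idea might look something like
this (the font and macro names are only illustrative, and a real
solution would have to cover every accented character, not just á):
\font\cmten=cmr10
\font\macten="Hoefler Text" at 10pt
{\catcode`\á=13 \gdefá{\'{a}}}          % define the active meaning once
\def\useCM{\cmten \catcode`\á=13 }      % in CM, á expands to \'{a}
\def\useMac{\macten \catcode`\á=11 }    % in the Mac font, á is a plain letter
Switching with \useCM and \useMac would then let the same á in the
source work in both worlds, within the usual limitation that
category-code changes only affect text read after they take effect.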
In summary:
I don't see an easy way to mix CM and Mac fonts in plain (Xe)TeX,
because the character encodings are fundamentally different. Plain
TeX's \'{a} is defined in terms of where the acute accent is encoded
in CM, and so it won't work with Mac fonts. A simple á character will
print fine in Mac fonts, but isn't available in CM. And as soon as
you use some kind of macro package that redefines á to work in CM,
you've effectively replaced it (on-the-fly) with \'{a} and you're
back to that problem. You can, of course, redefine \' to do the right
thing for Unicode fonts, but then it won't be right for CM any longer.
So *either* you use 8bitdefs or something similar, and legacy TeX
fonts; *or* you use real Unicode characters and real Unicode fonts.
In the LaTeX world, the font selection scheme, along with packages
like fontspec and xunicode, manages to hide a great deal of this and
let you switch fonts freely, but behind the scenes there's a lot of
stuff being redefined when you change between the two "worlds". You
could do all the same stuff yourself in Plain, of course, but I
wouldn't want to try!
Hope this helps some people get a better grasp of the whole picture.
JK