[OS X TeX] Using Latin-1 with TexShop preferences set at UTF-8

Richard Koch koch at math.uoregon.edu
Fri Aug 24 18:41:25 CEST 2012


Andre,

TeXShop is doing the correct thing.

Let me explain how these encodings work. Originally, TeX assumed the source
files would be in pure ASCII. This character set uses only the first 128 of
the 256 values a byte can hold, leaving the remaining 128 spots empty.

Users from other parts of the world then invented various encoding schemes
which filled in the 128 missing spots with desirable characters. For instance,
Latin 1 is designed to contain most characters required for Western European
languages. A very large number of additional encodings were invented, using
the spots for Cyrillic and many other purposes.

Unfortunately, the encoding is not part of the file. There is no way to look at
the file and say "this is Latin 1" or "this is Latin 5" or whatever. Some software
attempts to guess the encoding, but this is prone to error and I refuse to do it.

If you select Latin 1, or any other of the standard encodings, then any file (i.e.,
stream of bytes) is legal. So if you write in Latin 5, and then
open the file in Latin 1, it will open fine, but some of the characters will be
replaced by other characters.
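
To make this concrete, here is a small sketch in Python (not anything
TeXShop itself runs; the sample word and variable names are only for
illustration). It shows that every byte is legal Latin 1, and that a word
written in Latin 5 (ISO 8859-9) opens without complaint as Latin 1 but with
different letters.

```python
# Every byte value 0..255 is a legal Latin 1 character, so decoding never fails.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")
print(len(as_latin1))  # one character per byte, no error raised

# Write a Turkish word in Latin 5 (ISO 8859-9), then read it back as Latin 1.
original = "değiş"
on_disk = original.encode("iso8859_9")
misread = on_disk.decode("latin-1")

# The file "opens fine", but the two Turkish letters come back as the
# Icelandic letters that occupy the same spots in Latin 1.
print(original, "->", misread)
```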

After various encodings were invented (not just for TeX but for general
computing applications), Unicode was developed to handle all of the
symbol systems commonly used in the world. Internally, TeXShop (and any
Cocoa-based Mac application) works with Unicode.

There is no single method for writing Unicode to disk. Instead there are a
number of different standards.

Consequently, when TeXShop (or any other Cocoa application) reads a file
or writes a file, it has to be given an "encoding". If you select, say, Latin 1,
then if you type Unicode characters not in Latin 1, write to disk, and read the
file back, these Unicode characters will have been replaced by something else.
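
A rough illustration in Python of what such a lossy save looks like. This
is an assumption-laden sketch: Python's "replace" mode substitutes a
question mark, and the exact substitution a Cocoa application makes may
well differ.

```python
typed = "price: 5 €"  # the euro sign is not part of Latin 1

# errors="replace" is Python's lossy mode: anything the target encoding
# cannot represent is written as "?". (Assumption: Cocoa's own substitution
# may not be the same character.)
on_disk = typed.encode("latin-1", errors="replace")
read_back = on_disk.decode("latin-1")

print(read_back)  # the euro sign did not survive the round trip
```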

One of the standards commonly used to read and write Unicode is UTF-8.
The advantage of this standard is that standard ASCII characters get their
ordinary values, so a straight ASCII file is legal UTF-8. Other characters,
however, are encoded as sequences of two to four bytes, in a manner you
can read about via Google.
So if you set UTF-8 as your default encoding, then you can type any
Unicode symbol in TeXShop, save, and read back, and the character will
be preserved.
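
Both properties are easy to check in Python. Again this is just a sketch
with made-up sample strings, not TeXShop code.

```python
# A pure ASCII file has exactly the same bytes when written as UTF-8.
ascii_text = "\\documentclass{article}"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Characters outside ASCII take more than one byte each ...
byte_lengths = [len(ch.encode("utf-8")) for ch in "aïΩ€"]

# ... but whatever you type survives a save and a read.
typed = "naïve Ω €"
read_back = typed.encode("utf-8").decode("utf-8")

print(same_bytes, byte_lengths, read_back == typed)
```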

However, there is one difficulty with UTF-8. Because the remaining characters
are encoded, a random collection of bytes will usually not be a legal
UTF-8 file. In particular, if you write a file containing accented characters
with Latin 1, and then attempt to read it as UTF-8, it's not just that the
system may replace characters by others. Instead the Mac is going to "barf"
and complain that the file isn't legal UTF-8.
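
The same failure can be reproduced in Python, which is strict about UTF-8
in much the same way. A minimal sketch; the sample word is only an
illustration.

```python
# "café" written as Latin 1: the é becomes the single byte 0xE9.
on_disk = "café".encode("latin-1")

try:
    on_disk.decode("utf-8")
    legal_utf8 = True
except UnicodeDecodeError as err:
    # 0xE9 announces a multi-byte sequence that never arrives,
    # so the decoder rejects the file instead of guessing.
    legal_utf8 = False
    print("not legal UTF-8:", err.reason)
```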

In TeXShop, when that happens, the program tries to read with a default
encoding, which is Mac Roman or something like that. Since ANY file is
legal Mac Roman, you at least get something although it is not what you want.
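
The fallback logic can be sketched like this in Python. This is not
TeXShop's actual code (TeXShop is a Cocoa program); the function name
read_with_fallback and its parameters are hypothetical, and Mac Roman as
the fallback is taken from the description above.

```python
def read_with_fallback(data, preferred="utf-8", fallback="mac_roman"):
    """Hypothetical helper: decode with the preferred encoding and, if the
    bytes are not legal in it, decode with the fallback instead."""
    try:
        return data.decode(preferred), preferred
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


# A real UTF-8 file is read as UTF-8.
text_ok, used_ok = read_with_fallback("café".encode("utf-8"))

# A Latin 1 file is not legal UTF-8, so it is read as Mac Roman:
# you get something, but the accented letter is not the one that was typed.
text_bad, used_bad = read_with_fallback("café".encode("latin-1"))

print(used_ok, text_ok)
print(used_bad, text_bad)
```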

In summary, if you set your default encoding to Latin 1, then you'll be able
to read any file you have, or any file someone sends you, but if that file
wasn't written in Latin 1, then some characters may be wrong.

If instead you set your default encoding to UTF-8, then anything you type
in the editor will be preserved. But if you try to open a file which was
actually written in Latin 1, the editor won't use UTF-8 (because the
file contains illegal stuff) and won't use Latin 1. It will use Mac Roman.

Dick Koch



