[OS X TeX] Input encoding question
jonathan at jfkew.plus.com
Fri Feb 20 10:44:13 CET 2009
On 20 Feb 2009, at 08:21, Axel E. Retif wrote:
> On 20 Feb, 2009, at 01:51, Richard Koch wrote:
>> I'll stop writing after this short response, so don't worry about
>> an unending thread!
> On the contrary ---I'm sorry for causing apparently useless noise!
No need to apologize; it's a valid issue to discuss. Before the
thread dies completely, I hope people will excuse me also adding a
couple of notes.
On 20 Feb 2009, at 07:49, Peter Dyballa wrote:
> Just because *some* software can handle it, it's not reason enough.
> Files grow big because some (statistically: quite all) characters
> are represented by more than one byte,
No, the vast majority of characters in real-world TeX files would be
represented by 1 byte in UTF-8, because they are ASCII characters --
either English content or markup. It's true that accented Latin
characters, most of which would be 1 byte each in legacy 8-bit
encodings, become 2 bytes in UTF-8, but this is a tiny factor overall,
and the marginal size increase is well worth it. Far better to have
files that are a few percent larger than to have compact files that
may be misinterpreted depending on the codepage used.
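
A rough way to check the size argument on any given file is to encode the same text both ways and count bytes. The sketch below is only an illustration: the helper name `encoded_sizes` and the sample string are made up for this example, and Latin-1 stands in for "a legacy 8-bit encoding".

```python
def encoded_sizes(text):
    """Return (latin1_bytes, utf8_bytes, non_ascii_chars) for a string."""
    latin1_bytes = len(text.encode("latin-1"))
    utf8_bytes = len(text.encode("utf-8"))
    non_ascii_chars = sum(1 for ch in text if ord(ch) > 127)
    return latin1_bytes, utf8_bytes, non_ascii_chars


# Hypothetical sample: mostly ASCII markup with a few accented letters.
sample = "\\section{Résumé}\nLe café est fermé aujourd'hui.\n"

latin1_bytes, utf8_bytes, non_ascii_chars = encoded_sizes(sample)

# ASCII characters take 1 byte in both encodings; each accented Latin-1
# letter takes 1 byte in Latin-1 and 2 bytes in UTF-8, so the UTF-8 file
# is larger by exactly the number of non-ASCII characters.
print(latin1_bytes, utf8_bytes, non_ascii_chars)
```

For text where markup and unaccented letters dominate, the difference stays a small fraction of the total.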
East Asian characters (the really big collection in terms of number of
different characters) are mostly represented by 3 bytes each in
UTF-8, compared with 2 bytes in the pre-Unicode encodings, so they
were multi-byte even before Unicode.
> software needs to extra-process these byte sequences.
True, but the mechanisms are readily available. If you use 8-bit
codepages, software needs to know the appropriate interpretation of
those bytes, too; yes, it may be slightly simpler as each byte can be
interpreted in isolation, but you still have to configure the software
to get it right.
> And LaTeX and ConTeXt are mostly 8 bit applications with a 7 bit core.
The "core" is 8-bit since TeX 3.0; I don't think we need be concerned
about the old 7-bit version.
On 20 Feb 2009, at 07:51, Richard Koch wrote:
> That said, when my mathematical colleagues send me TeX, it is
> essentially never in unicode.
I'm sure that's true. But then, the non-ASCII math characters they
want to represent aren't in some 8-bit encoding, either, are they?
Rather, they're encoded through pure ASCII markup and escape
sequences, so the "input encoding" issue is irrelevant.
On 20 Feb 2009, at 06:06, Richard Koch wrote:
> With the current default encoding or Latin 1 or most other
> encodings, files always open and ascii always works great, and the
> only trouble you'll run into is that a few characters may not be
> what you expect.
With a UTF-8 default, "ascii always works great" too. And if a file
can't be interpreted as valid UTF-8, you can fall back to a default
8-bit encoding *and warn the user to check the non-ASCII characters*,
which is better than blindly opening a file as MacRoman when it might
equally well be Latin-1 (or vice versa).
Of course, if the file comes with (internal or external) metadata that
tells you its encoding, that's a different matter altogether.
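
The "try UTF-8 first, then fall back and warn" behaviour described above could be sketched roughly as follows. This is not how any particular editor implements it; the function name `decode_tex_source` and the choice of Latin-1 as the fallback are assumptions made for the example.

```python
import warnings


def decode_tex_source(data, fallback="latin-1"):
    """Decode bytes as UTF-8 if valid, else fall back and warn.

    Returns (text, encoding_used). The fallback must be an 8-bit
    encoding that accepts every byte value (e.g. "latin-1" or
    "mac_roman"), so the second decode cannot fail.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        warnings.warn(
            "File is not valid UTF-8; opened as %s. "
            "Please check the non-ASCII characters." % fallback
        )
        return data.decode(fallback), fallback


# Pure ASCII and real UTF-8 both take the first branch.
print(decode_tex_source(b"\\documentclass{article}"))
print(decode_tex_source("café".encode("utf-8")))

# A Latin-1 file with an accented letter is not valid UTF-8,
# so it takes the fallback branch and triggers the warning.
print(decode_tex_source("café".encode("latin-1")))
```

The point of the design is that the failure is visible: the user is told a guess was made, instead of silently getting whichever 8-bit interpretation happens to be the default.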
> But with any unicode encoding,
I don't think anyone would suggest UTF-16 as a default, but with
> your TeX friends will be confused when they get files from you, and
> you'll be confused when you try to open their files.
No more confused than if you share files between 8-bit editors that
use different encodings. In fact, that's worse because there's a
significant risk of errors (situations where there is incorrect
interpretation of the bytes) going unnoticed, if there are just
occasional non-ASCII characters in the data.
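
To illustrate how quietly this goes wrong, the sketch below decodes the same bytes under three encodings. The helper `interpretations` and the sample word are invented for the example; the codec names are the ones Python ships with.

```python
def interpretations(data, encodings=("latin-1", "mac_roman", "utf-8")):
    """Decode the same bytes under several encodings.

    Returns a dict mapping encoding name to the decoded text, or to
    None when that encoding rejects the bytes.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # this encoding refuses the bytes outright
    return results


# A name with one accented letter, saved by a Latin-1 editor.
data = "Gödel".encode("latin-1")
results = interpretations(data)

for name, text in results.items():
    print(name, repr(text))
```

Both 8-bit codecs accept the bytes without complaint and simply disagree about the accented letter, so nothing flags the mistake. The UTF-8 decoder rejects the bytes, which is what makes a warning possible.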
On 20 Feb 2009, at 06:36, Richard Koch wrote:
> The TeX world is moving toward unicode: see XeTeX and luaTeX and
> other developments. On the other hand, compatibility with older
> sources is unusually important in the TeX world, so I expect that a
> wide range of encodings will work as long as TeX itself survives.
Yes; though the vast majority of "older sources" are in ASCII and
therefore also valid and correct when interpreted as UTF-8. (Otherwise
XeTeX would never have gotten off the ground -- every time xelatex
runs, it relies on this fact to read all those classes, packages, etc!)
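
A quick check of that fact, as a sketch only (the helper name `decodes_identically` and the sample source are made up here): a pure-ASCII file yields the same text whether it is read as ASCII, UTF-8, Latin-1 or MacRoman.

```python
def decodes_identically(data, encodings=("ascii", "utf-8", "latin-1", "mac_roman")):
    """Return True if the bytes decode to the same text in every encoding."""
    try:
        decoded = {data.decode(name) for name in encodings}
    except UnicodeDecodeError:
        return False  # at least one encoding rejects the bytes
    return len(decoded) == 1


# Hypothetical stand-in for an old class or package file: ASCII only.
old_source = b"\\documentclass{article}\n\\begin{document}\nHello\n\\end{document}\n"
print(decodes_identically(old_source))

# One accented byte is enough to break the agreement.
print(decodes_identically("caf\u00e9".encode("latin-1")))
```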
"Older sources" that include non-ASCII 8-bit characters
*must* include appropriate inputenc (or equivalent) declarations,
otherwise no TeX system, old or new, can be expected to handle them
reliably. Of course we need to continue supporting these. But I
believe we would be doing users, and the TeX community as a whole, a
great favor by steering people towards UTF-8 wherever possible,
including as the default encoding for saving files.