[OS X TeX] Input encoding question

Jonathan Kew jonathan at jfkew.plus.com
Fri Feb 20 10:44:13 CET 2009


On 20 Feb 2009, at 08:21, Axel E. Retif wrote:

> On  20 Feb, 2009, at 01:51, Richard Koch wrote:
>
>> Axel,
>>
>> I'll stop writing after this short response, so don't worry about  
>> an unending thread!
>
> On the contrary ---I'm sorry for causing apparently useless noise!

No need to apologize; it's a valid issue to discuss. Before the  
thread dies completely, I hope people will excuse my adding a  
couple of notes as well.

On 20 Feb 2009, at 07:49, Peter Dyballa wrote:

> Just because *some* software can handle it, it's not reason enough.  
> Files grow big because some (statistically: quite all) characters  
> are represented by more than one byte,

No, the vast majority of characters in real-world TeX files would be  
represented by 1 byte in UTF-8, because they are ASCII characters --  
either English content or markup. It's true that accented Latin  
characters, most of which would be 1 byte each in legacy 8-bit  
encodings, become 2 bytes in UTF-8, but this is a tiny factor overall,  
and the marginal size increase is well worth it. Far better to have  
files that are a few percent larger than to have compact files that  
may be misinterpreted depending on the codepage used.

East Asian characters (the really big collection in terms of number of  
different characters) are mostly represented by 3 bytes each in  
UTF-8, versus 2 bytes each in the common pre-Unicode multi-byte  
encodings -- a real but still modest increase for CJK text.
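The byte counts involved are easy to verify directly; a quick Python check (an illustrative sketch, not part of the original discussion):

```python
# Byte counts in UTF-8 for characters typical of real-world TeX files.
samples = {
    "a": 1,     # ASCII letter: 1 byte (the bulk of any TeX source)
    "\\": 1,    # TeX markup is ASCII too: 1 byte
    "é": 2,     # accented Latin letter: 2 bytes (1 byte in Latin-1/MacRoman)
    "漢": 3,    # CJK ideograph: 3 bytes (2 bytes in Big5/GB/Shift-JIS)
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```

So for a typical LaTeX document in a Western language, where markup and most content are ASCII, the overall growth from switching to UTF-8 is indeed only a few percent.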

> software needs to extra-process these byte sequences.

True, but the mechanisms are readily available. If you use 8-bit  
codepages, software needs to know the appropriate interpretation of  
those bytes, too; yes, it may be slightly simpler as each byte can be  
interpreted in isolation, but you still have to configure the software  
to get it right.

> And LaTeX and ConTeXt are mostly 8 bit applications with a 7 bit core.

The "core" is 8-bit since TeX 3.0; I don't think we need be concerned  
about the old 7-bit version.


On 20 Feb 2009, at 07:51, Richard Koch wrote:

> That said, when my mathematical colleagues send me TeX, it is  
> essentially never in unicode.

I'm sure that's true. But then, the non-ASCII math characters they  
want to represent aren't in some 8-bit encoding, either, are they?  
Rather, they're encoded through pure ASCII markup and escape  
sequences, so the "input encoding" issue is irrelevant.

On 20 Feb 2009, at 06:06, Richard Koch wrote:

> With the current default encoding or Latin 1 or most other  
> encodings, files always open and ascii always works great, and the  
> only trouble you'll run into is that a few characters may not be  
> what you expect.

With a UTF-8 default, "ascii always works great" too. And if a file  
can't be interpreted as valid UTF-8, you can fall back to a default  
8-bit encoding *and warn the user to check the non-ASCII characters*,  
which is better than blindly opening a file as MacRoman when it might  
equally well be Latin-1 (or vice versa).
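This fallback strategy can be sketched in a few lines; here is a minimal Python illustration, assuming MacRoman as the legacy default (the function name and the choice of fallback are mine, not from any particular editor):

```python
# Sketch of the "try UTF-8 first, then warn and fall back" strategy:
# strict UTF-8 decoding either succeeds (pure ASCII always does) or
# raises, in which case we decode with a legacy 8-bit codepage and
# tell the user to double-check the non-ASCII characters.
import warnings

def read_tex(raw: bytes, fallback: str = "mac_roman") -> str:
    try:
        return raw.decode("utf-8")  # pure ASCII files also succeed here
    except UnicodeDecodeError:
        warnings.warn(
            f"Not valid UTF-8; decoded as {fallback} instead -- "
            "please check the non-ASCII characters."
        )
        return raw.decode(fallback)

# ASCII and valid UTF-8 both pass through silently:
assert read_tex(b"\\alpha + \\beta") == "\\alpha + \\beta"
assert read_tex("café".encode("utf-8")) == "café"
```

The key property being exploited is that legacy 8-bit text containing non-ASCII bytes is very unlikely to happen to be well-formed UTF-8, so the strict decode acts as a cheap detector.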

Of course, if the file comes with (internal or external) metadata that  
tells you its encoding, that's a different matter altogether.

> But with any unicode encoding,

I don't think anyone would suggest UTF-16 as a default, but with  
UTF-8...

> your TeX friends will be confused when they get files from you, and  
> you'll be confused when you try to open their files.

No more confused than if you share files between 8-bit editors that  
use different encodings. In fact, that case is worse: when the data  
contains only occasional non-ASCII characters, there's a significant  
risk that misinterpreted bytes go unnoticed entirely.

On 20 Feb 2009, at 06:36, Richard Koch wrote:

> The TeX world is moving toward unicode: see XeTeX and luaTeX and  
> other developments. On the other hand, compatibility with older  
> sources is unusually important in the TeX world, so I expect that a  
> wide range of encodings will work as long as TeX itself survives.

Yes; though the vast majority of "older sources" are in ASCII and  
therefore also valid and correct when interpreted as UTF-8. (Otherwise  
XeTeX would never have gotten off the ground -- every time xelatex  
runs, it relies on this fact to read all those classes, packages, etc!)
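The point that ASCII sources are automatically valid UTF-8 can be demonstrated directly (a small illustration of my own):

```python
# A pure-ASCII TeX line decodes to exactly the same string under
# ASCII, Latin-1, MacRoman and UTF-8 -- all these encodings agree on
# bytes 0x00-0x7F, which is why legacy ASCII sources need no change.
line = b"\\documentclass[12pt]{article} % pure ASCII"
decoded = {line.decode(enc) for enc in ("ascii", "latin-1", "mac_roman", "utf-8")}
assert len(decoded) == 1  # all four interpretations are identical
```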

"Older sources" that include non-ASCII 8-bit characters *must*  
include appropriate inputenc (or equivalent) declarations; otherwise  
no TeX system, old or new, can be expected to handle them reliably.  
Of course we need to continue supporting these. But I  
believe we would be doing users, and the TeX community as a whole, a  
great favor by steering people towards UTF-8 wherever possible,  
including as the default encoding for saving files.
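For concreteness, the declarations in question look something like this in a LaTeX preamble (the two lines are alternatives, one per document, depending on how the file is saved):

```latex
% Legacy 8-bit source: the codepage must be declared explicitly, e.g.
\usepackage[latin1]{inputenc}

% UTF-8 source: declare utf8 instead
\usepackage[utf8]{inputenc}
```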

JK




More information about the macostex-archives mailing list