[XeTeX] xetex file organization

Wed Nov 3 16:48:35 CET 2004

On 3 Nov 2004, at 3:04 pm, Bruno Voisin wrote:

> All the rest looks fine to me, but again I'm not a specialist. That 
> said:
>
>> 	tex/
>> 		generic/
>> 			hyphen/
>> 				(Unicode-compatible versions of hyphenation files;
>> 				these are designed to still work with standard TeX as well)
>
> Hopefully at some point in the future, when/if the capability of 
> reading files in a specific encoding is added to XeTeX, this directory 
> would become unnecessary (as well as the modified version of url.sty 
> in texmf.gwtex).

Actually, I've now implemented support for reading files in non-Unicode 
encodings (but haven't released a version including this yet). So that 
you know what's coming: you will be able to say

	\XeTeXinputencoding "encoding-name"

(where "encoding-name" is scanned like a filename by XeTeX, with 
optional quotes). The "encoding-name" can be one of a set of built-in 
names:
	auto		(the default setting, auto-detects utf8 or utf16 files)
	utf8
	utf16		(platform-native utf16, i.e., big-endian on Mac OS X)
	utf16be
	utf16le
	bytes		(reads individual bytes directly as character codes 0..255)
or it can be an "internet encoding name" recognized by the Mac OS Text 
Encoding Converter; so you can say things like:
	\XeTeXinputencoding "x-mac-roman"
	\XeTeXinputencoding "windows-1252"
	\XeTeXinputencoding "iso-8859-4"
	\XeTeXinputencoding "big5"
etc., and the text will be converted from that encoding to Unicode as 
the file is read.
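
As a rough sketch of how this might look in a document (the file names 
are made up for illustration, and I'm assuming the setting applies to 
files opened after it, as described for this unreleased version; a 
released version may behave differently):

```tex
% Sketch only: chapter-big5 and chapter-utf8 are hypothetical file names.
\XeTeXinputencoding "big5"
\input chapter-big5      % bytes converted from Big5 to Unicode while reading

% The quotes are optional; the name is scanned like a filename,
% so it ends at the following space.
\XeTeXinputencoding utf8
\input chapter-utf8
```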

The encoding used to read a file (whether opened with \input or 
\openin) is determined at the time the file is opened; it can't be 
changed on the fly.
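
To make the "fixed at open time" point concrete, here is a small sketch 
using \openin. It assumes plain TeX's \newread is available; the stream 
name \legacyfile, the macro \firstline and the file notes.txt are 
invented for the example:

```tex
\newread\legacyfile                 % hypothetical stream name
\XeTeXinputencoding "windows-1252"
\openin\legacyfile=notes.txt        % encoding is fixed here, at open time

% Per the description above, changing the setting now should not
% affect the stream that is already open.
\XeTeXinputencoding "auto"

\read\legacyfile to \firstline      % still read as windows-1252
\closein\legacyfile
```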

Note that it may still be necessary to adapt hyphenation files, though, 
as many of them are written in terms of specific legacy encodings using 
TeX-level mechanisms (active characters, ^^xx sequences, etc.). These 
mechanisms won't be affected by the \XeTeXinputencoding setting. 
Although such files can safely be read by XeTeX, they may not provide 
the appropriate hyphenation rules for text that actually uses the 
Unicode character codes for the given language.
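
A small illustration of why ^^xx sequences are not affected: TeX itself 
turns ^^e9 into character code "E9 (decimal 233) while scanning the 
line, whatever conversion was applied to the file's bytes. A minimal 
check, assuming plain TeX:

```tex
% ^^e9 always denotes character code "E9, regardless of the
% \XeTeXinputencoding setting in force when this file was opened.
\message{code = \number`\^^e9}   % writes "code = 233" to the terminal and log
\bye
```

In Unicode, 233 is U+00E9 (e with acute accent). A pattern file written 
for a legacy encoding that assigns a different letter to that position 
would therefore yield patterns for the wrong Unicode character.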

In a case like url.sty, yes, I'd guess that simply reading it in Latin1 
ought to solve the problems people have had. How best to ensure that 
this happens is another question; I'm still thinking about that.

Jonathan