[XeTeX] default char classes

Wed Mar 12 14:31:02 CET 2008

Jonathan, you have convinced me that language markup is needed.
Actually, with our mostly-WYSIWYG front end, you have to specify RTL 
when appropriate in order to keep the cursor from jumping every time you 
type a space -- the editor gets the direction from the font, but then 
thinks the direction has changed when it sees the space.
What I am getting out of this discussion is that the user should not 
think that he is specifying a font with a tag -- with many Unicode fonts 
this is unnecessary -- but rather that he is specifying a language. And 
the language determines much more than the font ...
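
A minimal sketch of such a language tag in XeLaTeX, under stated 
assumptions: the macro name \hebrewtext is hypothetical, the font 
Ezra SIL is only an example that is assumed to be installed, and the 
tag here sets nothing beyond font and direction. HEBREW stands in for 
real Hebrew text, as in Jonathan's examples below.

```latex
% Sketch only: \hebrewtext and the font choice are illustrative assumptions.
\documentclass{article}
\usepackage{fontspec}
\TeXXeTstate=1 % enable the \beginR/\endR direction primitives
% Assumed font; any installed font with Hebrew coverage would do.
\newfontfamily\hebrewfont[Script=Hebrew]{Ezra SIL}
% The author tags the language; font and direction follow from the tag.
\newcommand{\hebrewtext}[1]{{\hebrewfont\beginR #1\endR}}
\begin{document}
latin latin \hebrewtext{HEBREW HEBREW} latin latin.
\end{document}
```

The point of the wrapper is that the author's source stays the same if 
the font changes; only the definition of \hebrewtext would need editing.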

I am curious about Will's question. Are there efficiency concerns in 
defining lots of large token classes?

--Barry
> Message: 1
> Date: Sun, 9 Mar 2008 16:07:59 +0000
> From: Jonathan Kew <jonathan_kew at sil.org>
> Subject: Re: [XeTeX] default char classes
> To: barry.mackichan at mackichan.com,	Unicode-based TeX for Mac OS X and
> 	other platforms <xetex at tug.org>
> Message-ID: <320757AA-4287-4530-BDE5-AD6E330BD57E at sil.org>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> On 9 Mar 2008, at 3:18 pm, Barry MacKichan wrote:
>
>> Yes, that is how we do it now.
>>
>> I don't actually write multilingual documents myself, but we sell  
>> software (Scientific WorkPlace, etc.) that does, and so we are  
>> looking for ways to make things simpler for our customers.
>>
>> The main thing I'm after is to reinforce the concept in LaTeX of  
>> separating content and form. The choice of a font for a particular  
>> range of Unicode characters is strictly a matter of form, yet the  
>> author has to do different things in his document, depending on his  
>> choice of fonts.
>>
>> 1. If he uses a font like Minion Pro, which contains Hebrew  
>> characters, he needs to do nothing.
>
> He still needs to get \beginR....\endR (or something higher-level  
> that resolves to this) around the Hebrew text somehow, doesn't he?  
> That doesn't happen automatically.
>
> Now someone will no doubt tell me that it should! Perhaps; but again,  
> there's a limit to what can be done automatically. Given source text  
> that contains
>
>      latin latin HEBREW HEBREW latin latin HEBREW HEBREW latin latin.
>
> do we have a Latin-script sentence containing two separate Hebrew  
> phrases, or is that a single Hebrew phrase that itself contains an  
> embedded Latin quote? There's no way to know without some kind of  
> markup or higher-level information, and it matters for layout. In  
> other words, there's a crucial difference between these two:
>
>      latin latin \beginR HEBREW HEBREW \endR latin latin \beginR
>      HEBREW HEBREW \endR latin latin.
>
>      latin latin \beginR HEBREW HEBREW \beginL latin latin \endL
>      HEBREW HEBREW \endR latin latin.
>
> and only the author can tell us -- via markup -- which is intended.
>
> Or to take a "simpler" example, if our source text is
>
>      latin latin HEBREW HEBREW? latin latin.
>
> are we looking at a single Latin-script sentence that contains a  
> Hebrew quote that ends with a question mark, or are we looking at a  
> Latin question (containing a couple of Hebrew words), and then a  
> second Latin sentence? The answer to this will determine where the  
> question mark appears in the reordered text -- is it part of the  
> Hebrew inclusion (in which case it appears to the left), or part of  
> the surrounding Latin script (and appears to the right)?
>
> JK
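
The question-mark case can be written out in the same \beginR ... \endR 
notation as the earlier examples. This is only a sketch of the two 
readings Jonathan describes, assuming \TeXXeTstate=1 has been set so 
that the direction primitives are active; HEBREW is again a stand-in 
for real Hebrew text.

```latex
% Reading 1: the question mark is part of the Hebrew inclusion,
% so it is reordered with the Hebrew and appears to its left.
latin latin \beginR HEBREW HEBREW?\endR latin latin.

% Reading 2: the question mark is part of the surrounding Latin text,
% so it stays outside the right-to-left run and appears to its right.
latin latin \beginR HEBREW HEBREW\endR? latin latin.
```

The only difference in the source is which side of \endR the question 
mark sits on, which is exactly the information the author has to supply.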