[XeTeX] xunicode.sty bug

Jonathan Kew jonathan_kew at sil.org
Wed Jul 19 00:22:16 CEST 2006


On 18 Jul 2006, at 11:03 pm, Ross Moore wrote:

> Hi Jonathan, Ralf, Will, Toralf, and others.
>
>
> On 18/07/2006, at 8:36 PM, Jonathan Kew wrote:
>
>>> Jonathan Kew <jonathan_kew at sil.org> writes:
>>>
>>>>> Ux00AD  soft hyphen
>>>>
>>>> This is the Unicode character that means essentially the same as
>>>> TeX's "\-". A non-printing layout control that indicates a potential
>>>> break point, not a visible character in its own right. If the line
>>>> actually breaks there, the appropriate visible manifestation is
>>>> script/language-dependent; a common default would be to insert
>>>> U+2010 before the break, but this is not universally correct.
>
> If I understand correctly, \- is primitive in TeX which is basically
> a shorthand for  \discretionary{-}{}{} resulting in use or otherwise
> of the \hyphenchar for the current font.
>
> Without changing this mechanism, it seems that having
>      \hyphenchar=^^ad
> would do the right thing, provided the font has a glyph there.
> This is a matter for  fontspec  to determine, yes ?

I'd be nervous of having fontspec set \hyphenchar\font to "AD, even  
after checking that there's a glyph, as some fonts do in fact have a  
glyph that is not a "normal" hyphen (e.g., a hyphen in a dotted box,  
to indicate that it's a control function and not a normal printing  
symbol).

What fontspec could reasonably do would be to set \hyphenchar to  
"2010 if that character is supported, and leave it as "2D otherwise.  
But of course this has implications for the following point....
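
A minimal sketch of that fallback, assuming the e-TeX primitive
\iffontchar (available in XeTeX) is an adequate test for "the
character is supported". This is only an illustration for the
currently selected font, not what fontspec actually does:

```tex
% Sketch only: choose the hyphen character for the current font.
\iffontchar\font"2010
  \hyphenchar\font="2010 % U+2010 HYPHEN is present in the font
\else
  \hyphenchar\font="2D   % otherwise keep U+002D HYPHEN-MINUS
\fi
```

Note that \hyphenchar is stored per font, so a test like this would
have to run each time a new font is loaded.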

> I suppose the issue is really what happens to any hyphenations
> when you select and copy a paragraph from a PDF prepared by XeTeX.
> Is the U+00AD actually present within the Unicode string ?
> Do you see the glyph or not when the result is pasted into a
> text-editor ?

Because we can't trust fonts to have a suitable hyphen glyph at "AD,  
I think this is not achievable in general. (And let's not get too  
upset about that. The idea that one can reliably get from a glyph  
stream in a PDF back to the original Unicode character stream, based  
solely on the actual glyphs found, simply doesn't work in a bunch of  
the edge cases.)

>
>
>> In the case of xetex, I think a sensible default (to handle the
>> situation where U+00AD occurs in the input text) would be to say:
>>
>>      \catcode"AD=\active
>>      \let^^ad=\-
>
> This seems to be the right implementation, when an author
> has included ^^ad (by whatever means) within the source.
>
> It seems to me that these assignments are completely standard,
> so belong in  xetex.xfmt , rather than being added by a package.
> I could put them into xunicode.sty , but really don't think that
> it is appropriate. Agreed ?

Yes, I think it would make sense to have ^^ad mapped to \- as a  
standard definition in xetex (though of course it can still be  
overridden, like just about everything else, if you really need to).  
I could add this to the unicode-letters.tex file that is read during  
the creation of the formats.

Similarly, we should probably make ^^a0 into an active character that  
behaves like plain TeX's ~ (tie) -- a non-breaking space. Not all  
fonts support U+00A0 properly (the glyph may be missing, or may be  
the wrong width), so implementing it at the TeX level instead is  
probably going to be more reliable.
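
Something along these lines might do, mirroring the ^^ad assignments
quoted above. It assumes ~ already has its usual tie definition at
the point where this is read; it is a sketch, not necessarily what
will end up in unicode-letters.tex:

```tex
% Sketch only: make U+00A0 behave like the tie.
\catcode"A0=\active
\let^^a0=~   % copies the current meaning of ~ (plain TeX: \penalty10000\ )
```

Because \let copies the meaning of ~ at that moment, a later
redefinition of ~ would not carry over to ^^a0.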

JK


