[XeTeX] xunicode.sty bug

Jonathan Kew jonathan_kew at sil.org
Tue Jul 18 12:36:17 CEST 2006


On 18 Jul 2006, at 11:03 am, Ralf Stubner wrote:

> Jonathan Kew <jonathan_kew at sil.org> writes:
>
>>> Ux00AD  soft hyphen
>>
>> This is the Unicode character that means essentially the same as
>> TeX's "\-". A non-printing layout control that indicates a potential
>> break point, not a visible character in its own right. If the line
>> actually breaks there, the appropriate visible manifestation is
>> script/language-dependent; a common default would be to insert U+2010
>> before the break, but this is not universally correct.
>
> I vaguely remember that there are some discussions concerning soft
> hyphen being nonprinting or not. Might have been on
> <URL:http://www.cs.tut.fi/~jkorpela/shy.html>. I don't have a clear
> opinion here at the moment.

Yes, this is an interesting and informative discussion (and it's a  
messy situation!).

It seems to me that the ISO-8859-1 code 0xAD was closer to being a
presentational glyph than a character, in terms of the Unicode/WG2
character/glyph model (though the model was not clearly articulated at
that time), while Unicode itself defines U+00AD more clearly as a
layout control character.

> It is a printing character in fonts like
> MinionPro or Charis SIL.

Right; many (most) fonts map this character to a visible hyphen  
glyph. However, the Standard (p.388) says:

<quote src="http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf">
Hyphenation. U+00AD SOFT HYPHEN (SHY) indicates an intraword break
point, where a line break is preferred if a word must be hyphenated or
otherwise broken across lines. Such break points are generally
determined by an automatic hyphenator. The use of SHY is generally
limited to situations where users need to override the behavior of
such a hyphenator. The visible rendering of a line break at an
intraword break point, whether automatically determined or indicated
by a SHY, depends on the surrounding characters, the language, and, at
times, the meaning of the word. The precise rules are outside the
scope of this standard, but see Unicode Standard Annex #14, “Line
Breaking Properties,” for additional information. A common default
rendering is to insert a hyphen before the line break, but this is
incorrect in many situations.
</quote>

As such, U+00AD should not normally be rendered directly by a text  
display system, and so it is irrelevant what glyph is in the font. If  
the potential break position indicated by U+00AD is not used, it  
should have no visible result at all; and if the position is used, it  
should be rendered as appropriate depending on the surrounding  
characters, language, etc.

Having a visible glyph for U+00AD in a font may be useful if text is
displayed by a "dumb" system that does not handle its Unicode
semantics. But in this case, it may be a bad idea for the glyph to
look like a "normal" hyphen, as this could mislead people into using
it in the belief that it will always be a visible character. Using a
specially-marked glyph (e.g., with a dashed box around it) might be a
better choice. (Such a glyph can also be used by editors that want to
support a "show invisibles" mode.)

In the case of XeTeX, I think a sensible default (to handle the
situation where U+00AD occurs in the input text) would be to say:

     \catcode"AD=\active
     \let^^ad=\-
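
To make that concrete, here is a minimal sketch of a complete plain
XeTeX test file. It is only an illustration, not anything taken from
xunicode.sty: the file name, the narrow \hsize and the sample words are
assumptions chosen to provoke line breaks, and the soft hyphens are
written as ^^ad so that they are visible in the source.

     % softhyphen-test.tex (hypothetical name); run: xetex softhyphen-test
     \catcode"AD=\active   % treat U+00AD as an active character
     \let^^ad=\-           % give it the meaning of TeX's discretionary hyphen
     \hsize=4cm            % narrow measure, only to provoke line breaks
     \noindent Soft hyphens mark the permitted break points in
     inter^^adnation^^adali^^adsation and in extra^^adordinary.
     \bye

Where no break is taken at a ^^ad, nothing is printed there; where one
is taken, TeX inserts the current font's \hyphenchar. That is the
"common default" rendering mentioned in the Standard, not anything
sensitive to script or language.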

JK


