[XeTeX] Unicode hyphens etc. and Xe(La)TeX

Roland Kuhn rk at rkuhn.info
Mon Nov 1 13:53:41 CET 2010


Generally, I recommend using the correct Unicode characters in the TeX source and then defining the behavior you want for them. In this case, this is fairly straightforward:

1) TeX inserts empty discretionaries after each occurrence of the \hyphenchar (a per-font property which is usually equal to `-), which takes care of your first point quite nicely.

2) The soft hyphen can be made active and defined to yield “\-” (the only drawback to this character is that it is not very nicely displayed inside Terminal on MacOS):
% "AD is U+00AD SOFT HYPHEN; ^^ad is TeX's notation for that (invisible) character
\catcode"AD=\active
\def^^ad{\-}

3) The Unicode hyphen U+2010 can be made active and defined to yield “-” (ASCII hyphen), which is the right choice within TeX by construction:
\catcode`‐=\active
\def‐{-}

4) The non-breaking hyphen can also be made active and defined to yield “\hbox{-}” (the box prevents the discretionary after the ASCII hyphen from escaping; \nobreak does not help here, because its penalty would only come after the discretionary, which remains a legal breakpoint):
\catcode`‑=\active
\def‑{\hbox{-}}
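
To try out points 1) to 4) together, a small test file along the following lines may help. This is only a sketch for plain XeTeX: the sample words are placeholders, the characters are written in numeric notation ("AD, ^^ad, ^^^^2010, ^^^^2011) so that they stay visible in any editor, and the \leavevmode in front of the box is an addition of this sketch so that the non-breaking hyphen also works at the very start of a paragraph.

```tex
% Sketch of a plain XeTeX test file (compile with xetex).
% Assumes the current font has \hyphenchar equal to `- (the usual setting).
\catcode"AD=\active
\def^^ad{\-}                          % U+00AD SOFT HYPHEN
\catcode"2010=\active
\def^^^^2010{-}                       % U+2010 HYPHEN
\catcode"2011=\active
\def^^^^2011{\leavevmode\hbox{-}}     % U+2011 NON-BREAKING HYPHEN

\hsize=1cm \parindent=0pt             % narrow measure: the words cannot fit on one line

% Placeholder words, one paragraph each:
north-south\par                       % U+002D: break allowed after the hyphen
north^^^^2010south\par                % U+2010: behaves like the ASCII hyphen
north^^adsouth\par                    % U+00AD: break allowed, hyphen only appears at a break
north^^^^2011south\par                % U+2011: no break allowed at the hyphen
\bye
```

With these definitions the first three paragraphs should be allowed to break at the hyphen position, while the last one should stay in one piece and produce an overfull line instead.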

Where those characters are encountered does not matter much in my experience, but you can always include macros for disabling these activations, akin to
\catcode"AD=12   % U+00AD SOFT HYPHEN, given numerically because the character is invisible
\catcode`‐=12
\catcode`‑=12
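
Such switches could be wrapped as follows. This is a sketch only; the macro names \activatehyphens and \deactivatehyphens are made up for illustration, and the characters are again given by number.

```tex
% Hypothetical helper macros (names are not from any package).
\def\activatehyphens{%
  \catcode"AD=\active    % U+00AD SOFT HYPHEN
  \catcode"2010=\active  % U+2010 HYPHEN
  \catcode"2011=\active  % U+2011 NON-BREAKING HYPHEN
}
\def\deactivatehyphens{%
  \catcode"AD=12
  \catcode"2010=12
  \catcode"2011=12
}
```

The definitions of the active characters survive such a switch; only the category codes change. Category code assignments are local, so {\deactivatehyphens ...} restores the previous state at the closing brace. Note that text which has already been read as a macro argument keeps the category codes it had when it was tokenized.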

Given these, you should be able to adapt the procedure to solve the case with the middle dots.
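
One possible adaptation for the middle dots, as a sketch under stated assumptions: U+2027 is taken to be the dot after which a break is allowed, U+00B7 is left alone as an ordinary character (no break is possible directly after it when it is followed by a letter), and the current font is assumed to have a glyph for U+00B7. The name \maylinebreak is taken from the question; its definition here is only a suggestion.

```tex
% Assumption: U+2027 HYPHENATION POINT = breakable middle dot,
%             U+00B7 MIDDLE DOT        = non-breaking, left untouched.
% Assumption: the current font contains a glyph for U+00B7.
\def\maylinebreak{\discretionary{}{}{}}  % empty discretionary: a break is allowed, nothing is printed
\catcode"2027=\active
\def^^^^2027{\char"B7 \maylinebreak}     % print a middle dot, then allow a break
```

This mirrors what TeX itself does after the \hyphenchar in point 1): the dot is typeset and an empty discretionary follows it.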

Regards,

Roland

On Oct 31, 2010, at 23:09 , BPJ wrote:

> I'm trying to find out if and how Xe(La)TeX does
> or can be made to treat the following characters
> differently from each other and/or in a 'smart' way:
> 
> 	1) U+002D HYPHEN-MINUS
> 	2) U+00AD SOFT HYPHEN
> 	3) U+2010 HYPHEN
> 	4) U+2011 NON-BREAKING HYPHEN
> 
> Specifically I'd like to get the correct behavior for
> Swedish so that a linebreak may occur after an ASCII hyphen
> but not after a Unicode non-breaking hyphen. While globally
> replacing every Unicode soft hyphen with \- is easy you
> cannot, unfortunately, globally replace every ASCII hyphen
> with some command which would do the right thing (whatever
> that command may be) as the ASCII hyphen may occur in
> command arguments which I've already inserted, and which are
> not to be interpreted as text. (Though I think that such would typically be followed by a digit rather than a letter...)
> 
> I also have sort of the same thoughts about
> 
> 	5) U+00B7 MIDDLE DOT
> 	6) U+2027 HYPHENATION POINT
> 
> or rather I would want some way to distinguish between a
> middle dot after which a linebreak may occur and one after
> which it may not.
> 
> I guess I'm basically looking for a \maylinebreak command!
> 
> /bpj
> 
> 

--
I'm a physicist: I have a basic working knowledge of the universe and everything it contains!
    - Sheldon Cooper (The Big Bang Theory)



