[tex-hyphen] Apostrophe

Mon Jun 16 17:51:35 CEST 2008

On 16 Jun 2008, at 4:16 pm, Mojca Miklavec wrote:

>> IMO, where some patterns have traditionally included the  
>> apostrophe (x27),
>> we should probably provide duplicate patterns with U+2019 as well.
>
> Is there any chance of achieving the same in some other way? It seems
> like yet another hack to me, one that will prevent us from converting
> directly to 8-bit patterns.
>
> 1.) create a list of equivalent characters
>
> 2.)
> a) parse the contents of \patterns and, if a pattern contains a
> character from that list, duplicate the pattern before it's passed to TeX

It ought to be possible to do this, I guess, but it's fairly painful  
as TeX macro programming. (For LuaTeX it could no doubt be done much  
more easily in Lua, but that doesn't help XeTeX.)
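
To make the idea in 2a concrete, here is a minimal sketch of the
duplication done as an offline preprocessing step rather than in TeX
macros. It is written in Python purely for illustration; the names
EQUIVALENTS and expand_patterns are hypothetical, and the equivalence
list holds only the apostrophe pair discussed in this thread. It
assumes the patterns have already been split into individual strings.

```python
# Hypothetical offline preprocessing step; not part of any TeX distribution.
APOSTROPHE = "\u0027"   # ASCII apostrophe (x27)
RIGHT_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Each group lists characters that should behave identically in hyphenation.
# The content of this list is an assumption made for the example.
EQUIVALENTS = [(APOSTROPHE, RIGHT_QUOTE)]


def expand_patterns(patterns, groups=EQUIVALENTS):
    """Return the patterns plus one variant per equivalent character.

    For every pattern that contains a character from a group, emit one
    copy per group member, with all members of the group replaced by
    that member. Input order is kept and duplicates are dropped.
    """
    expanded = []
    seen = set()
    for pattern in patterns:
        variants = [pattern]
        for group in groups:
            if any(ch in pattern for ch in group):
                for target in group:
                    variant = pattern
                    for ch in group:
                        variant = variant.replace(ch, target)
                    variants.append(variant)
        for variant in variants:
            if variant not in seen:
                seen.add(variant)
                expanded.append(variant)
    return expanded
```

Called on a list such as ["1l'a", "a1b"], this keeps "a1b" as it is and
emits the apostrophe pattern twice, once with x27 and once with U+2019.
With several groups it only varies one group at a time, which is enough
for the single apostrophe pair.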

> b) extend the engine (only XeTeX/LuaTeX in that case) in some way to
> accept hints that some characters are equivalent during hyphenation. I
> guess that \lccode does exactly that, but I'm not sure what will
> happen if I set lccode of "adiaeresis" to lccode of "a" for example,
> when I want to use some macro to do uppercasing/lowercasing of words
> for me.

Or to take the specific example of the apostrophe, we could set
\lccode"2019="27 (or vice versa, depending on which way we want to
write the patterns). But then if someone applies \lowercase to a run
of text that includes the ’ character, they'll be surprised to see it
changed to '.

The trouble is that \lccode is overloaded, being used for multiple  
purposes that may not always want the same set of mappings. I suppose  
if we had a separate \hyphequiv table, that would help -- but you're  
not getting a new feature like that in time for the TL2008 release!
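
As a rough model of what a separate table would buy us, the sketch
below keeps case mapping and hyphenation equivalence apart. It is
Python for illustration only: no engine has a \hyphequiv primitive, and
HYPH_EQUIV, lowercase and hyphenation_key are hypothetical names.
Python's str.lower stands in for the \lccode-driven \lowercase.

```python
RIGHT_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK
APOSTROPHE = "\u0027"   # ASCII apostrophe (x27)

# Hypothetical hyphenation-only equivalence table (the "\hyphequiv" idea).
# It is consulted when looking up patterns and nowhere else.
HYPH_EQUIV = {RIGHT_QUOTE: APOSTROPHE}


def lowercase(text):
    """Stand-in for TeX's lowercase operation: case mapping only."""
    return text.lower()


def hyphenation_key(word, equiv=HYPH_EQUIV):
    """Form of the word used for pattern lookup only; it is never typeset."""
    return "".join(equiv.get(ch, ch) for ch in lowercase(word))
```

Here lowercase leaves ’ alone, so the typeset text is not altered, while
hyphenation_key folds ’ to ' so that patterns written with x27 still
match. Overloading \lccode for both jobs is what produces the surprise
described above.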

> I would really prefer not to introduce new hacks in patterns.
> Apostrophe represents a single character, so it should be left as a
> single character in patterns (assuming that we leave it there), only
> TeX might see it in a different way.

The correct Unicode character to use would be U+2019, I think, so we  
could simply use that in the patterns and ignore U+0027. The trouble  
is that there are sure to be users who have U+0027 in their text, and  
expect this to behave the same way; in order to support both the  
"best practice" and the "ASCII-like" encoding of the data, we need  
two versions of the patterns. That's not really a "hack in patterns",  
IMO, it's a concession to the fact that real-life data will not  
always be encoded in the purest and best Unicode Way, and it may be  
helpful to try and support these "variant spellings" where possible.

JK