[tex-hyphen] Apostrophe
Jonathan Kew
jonathan_kew at sil.org
Mon Jun 16 17:51:35 CEST 2008
On 16 Jun 2008, at 4:16 pm, Mojca Miklavec wrote:
>> IMO, where some patterns have traditionally included the apostrophe
>> (x27), we should probably provide duplicate patterns with U+2019 as well.
>
> Is there any chance of achieving the same in some other way? It
> seems like yet another hack to me, one that will prevent us from direct
> conversion to 8-bit patterns.
>
> 1.) create a list of equivalent characters
>
> 2.)
> a) parse the contents of \patterns and, if a pattern contains a
> character from that list, duplicate the pattern before it's passed to TeX
It ought to be possible to do this, I guess, but it's fairly painful
as TeX macro programming. (For LuaTeX it could no doubt be done much
more easily in Lua, but that doesn't help XeTeX.)
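For what it's worth, the preprocessing in (2a) is easy to sketch outside TeX. The following Python is only an illustration under assumptions: the `EQUIVALENTS` table and the `expand_patterns` helper are hypothetical names, not part of any existing package, and the patterns are assumed to be available as a plain list of strings.

```python
# Hypothetical equivalence list: each key may also appear as any of its values.
EQUIVALENTS = {"\u0027": ["\u2019"]}  # ASCII apostrophe -> right single quotation mark


def expand_patterns(patterns):
    """Return the patterns plus one duplicate per equivalent character.

    Simplification: every occurrence of the character in a pattern is
    replaced at once, so mixed variants are not generated.
    """
    result = []
    for pat in patterns:
        result.append(pat)
        for src, alternatives in EQUIVALENTS.items():
            if src not in pat:
                continue
            for alt in alternatives:
                dup = pat.replace(src, alt)
                if dup not in result:
                    result.append(dup)
    return result


expanded = expand_patterns(["a1b", "l'1a"])
```

Something like this could run when the pattern files are generated, so the engine itself would only ever see ordinary patterns.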
> b) extend the engine (only XeTeX/LuaTeX in that case) in some way to
> accept hints that some characters are equivalent during hyphenation. I
> guess that \lccode does exactly that, but I'm not sure what will
> happen if I set lccode of "adiaeresis" to lccode of "a" for example,
> when I want to use some macro to do uppercasing/lowercasing of words
> for me.
Or to take the specific example of the apostrophe, we could set
\lccode"2019="27 (or vice versa, depending which way we want to write
the patterns). But then if someone applies \lowercase to a run of
text that includes the ’ character, they'll be surprised to see it
changed to '.
The trouble is that \lccode is overloaded, being used for multiple
purposes that may not always want the same set of mappings. I suppose
if we had a separate \hyphequiv table, that would help -- but you're
not getting a new feature like that in time for the TL2008 release!
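To make the overloading concrete, here is a toy model in Python, not TeX. It assumes a much simplified view in which both \lowercase and hyphenation just map each character through a table; `apply_table` and the table names are hypothetical, and `hyphequiv` stands in for the separate table imagined above, which does not exist in any engine.

```python
import string

APOSTROPHE, RIGHT_QUOTE = 0x27, 0x2019

# Letters map to their lowercase form, roughly like the usual \lccode setup.
LETTERS = {ord(c): ord(c.lower()) for c in string.ascii_letters}


def apply_table(text, table):
    """Replace each character by its table entry; 0 or missing leaves it alone."""
    return "".join(chr(table.get(ord(c), 0) or ord(c)) for c in text)


word = "Don\u2019t"

# One shared table: the hyphenation mapping leaks into lowercasing.
shared_lccode = {**LETTERS, RIGHT_QUOTE: APOSTROPHE}
lowercased_shared = apply_table(word, shared_lccode)

# Two tables: lowercasing keeps U+2019, hyphenation lookup still sees U+0027.
plain_lccode = dict(LETTERS)
hyphequiv = {**LETTERS, RIGHT_QUOTE: APOSTROPHE}
lowercased_split = apply_table(word, plain_lccode)
hyphenation_key = apply_table(word, hyphequiv)
```

With the shared table, `lowercased_shared` contains the ASCII apostrophe, which is the surprise described above. With the split tables, `lowercased_split` keeps U+2019 while `hyphenation_key` still matches patterns written with U+0027.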
> I would really prefer not to introduce new hacks in patterns.
> Apostrophe represents a single character, so it should be left as a
> single character in patterns (assuming that we leave it there), only
> TeX might see it in a different way.
The correct Unicode character to use would be U+2019, I think, so we
could simply use that in the patterns and ignore U+0027. The trouble
is that there are sure to be users who have U+0027 in their text, and
expect this to behave the same way; in order to support both the
"best practice" and the "ASCII-like" encoding of the data, we need
two versions of the patterns. That's not really a "hack in patterns",
IMO, it's a concession to the fact that real-life data will not
always be encoded in the purest and best Unicode Way, and it may be
helpful to try and support these "variant spellings" where possible.
JK