[tex-hyphen] Newest GitHub additions into CTAN?

Wed Dec 30 23:52:23 CET 2020

On Wed, 30 Dec 2020 at 22:48, Arthur Reutenauer wrote:
> It does sound like T2A is the best choice for Macedonian,

While I agree that it might be the least problematic 8-bit encoding, I
don't agree that it's the best choice in 2021.

By supporting T2A you are actively educating the users to stick with
T2A for NEW documents without them noticing tons of issues:
- accents on those two letters will likely be misplaced (and
consequently uglier)
- you don't get any kerning around those characters
- words containing either of those two letters won't be hyphenated at all
... along with an extremely limited choice of available fonts, extreme
difficulty mixing the document with other languages, and all the other
limitations of the 8-bit engines, ...

I'm saying this because I've seen tons of hardcover books on
bookshelves in my native language that you can easily recognise as
being typeset using the wrong (OT1) encoding, with the caron on ccaron
"heavily" misplaced, esp. when bold is used. (I was among those who
kept using the wrong encoding for many years as well, and since I
learned about it, I probably haven't seen anyone using the correct
encoding in their sources either.)

The biggest problem is when things work just well enough to give the
impression of being OK, and then users don't feel the need to go the
extra mile and learn how to make them perfect, even though they would
prefer the second option if they were ever aware of it. Are users
actually requesting 8-bit support, or is the addition some kind of
personal wish to satisfy everyone?

That said, if there is real urgency to support 8-bit encodings ...
we'll do it, of course.

> Mojca will try to make an upload to CTAN by tomorrow
> (Thursday) evening, otherwise we’ll work on it some time next week.

This was the initial plan, but I can no longer promise to stick with
it after noticing that what's in the repository at the moment is
actually wrong.

The initial patterns from Vasil that we found online were claiming to
be using the T2A encoding, but the actual encoding was different.
Arthur "reverse-engineered" the contents (with the help of the
comments) by creating his own custom mapping that he provisionally
called "Macedonian" and using it to convert the original into the
final UTF-8 file. We were assuming that the patterns were actually
used in the original form, but unless there was a custom font
available somewhere, they were likely not working correctly with T2A.
I looked at those original patterns again and it seems that the
original patterns were using the cp-1251 encoding, so that they looked
OK inside a text editor on Windows. Or at least those 7 letters that I
checked seem to match cp-1251 exactly.
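
A check of this kind can be sketched in a few lines of Python. This is
only an illustration, not the procedure that was actually used: the
helper names are hypothetical, and the list of seven letters is an
assumption (the Macedonian letters that are absent from the basic
Russian alphabet), since the message does not say which letters were
compared.

```python
# Assumption: these are the seven letters worth checking; the original
# message does not list them.
MACEDONIAN_LETTERS = "ѓѕјљњќџ"


def cp1251_codes(letters):
    """Map each letter to the single byte value it has in cp-1251."""
    return {ch: ch.encode("cp1251")[0] for ch in letters}


def decodes_with_letters(raw, letters):
    """Return True if raw is valid cp-1251 and yields every given letter.

    This is a rough heuristic, not proof of the encoding: other 8-bit
    Cyrillic code pages may also decode without raising an error.
    """
    try:
        text = raw.decode("cp1251")
    except UnicodeDecodeError:
        return False
    return all(ch in text for ch in letters)


if __name__ == "__main__":
    # Print a small table to compare against a hex dump of the file.
    for ch, code in cp1251_codes(MACEDONIAN_LETTERS).items():
        print(f"{ch}  0x{code:02X}")
```

Comparing the printed byte values against a hex dump of the original
pattern file gives a quick indication of whether the file is in cp-1251.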

If we really do need to support 8-bit encodings, I would keep the
original UTF-8 patterns intact and just remove the incompatible
patterns from the 8-bit version. But that 8-bit version needs to be
done from scratch (using the existing scripts). The file currently in
the master branch is apparently using cp-1251 and would not work as
desired.
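
The filtering step could look roughly like the sketch below. It is not
the existing script (which is not shown here), just a hypothetical
Python outline. The set of unsupported letters is an assumption: the
message only speaks of "those two letters", and ѓ and ќ are used here
as stand-ins. The sample patterns are made up as well.

```python
# Assumption: the two letters without a slot in the 8-bit encoding.
# The message does not name them; adjust this set as needed.
UNSUPPORTED_IN_8BIT = frozenset("ѓќ")


def split_patterns(patterns, unsupported=UNSUPPORTED_IN_8BIT):
    """Split patterns into (kept, dropped) for the 8-bit version.

    A pattern is dropped if it contains any unsupported letter. The
    input list, i.e. the UTF-8 pattern set, is left untouched.
    """
    kept, dropped = [], []
    for pattern in patterns:
        if unsupported & set(pattern.lower()):
            dropped.append(pattern)
        else:
            kept.append(pattern)
    return kept, dropped


if __name__ == "__main__":
    # Made-up patterns, for illustration only.
    sample = ["а1б", "1ѓа", "о2ќ", "2в1"]
    kept, dropped = split_patterns(sample)
    print("kept:   ", kept)
    print("dropped:", dropped)
```

Keeping the dropped list around makes it easy to report how many
patterns the 8-bit version loses compared to the UTF-8 one.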

Mojca