[tex-hyphen] Hyphenation patterns for Belarusian

Arthur Reutenauer arthur.reutenauer at normalesup.org
Sun Aug 28 16:12:48 CEST 2016

Hi Maksim,

First of all, thank you for your efforts, although I would say you’re
trying to do a little too much at this stage; I’ll explain why at the
end.

> ! Conflicting pattern ignored.
> l.6024 }
>
> ?
> ! Emergency stop.
> l.6024 }
>
> !  ==> Fatal error occurred, no output PDF file produced!
> Transcript written on luatex.log.
>
> Is there any way to make it more verbose? Or debug the issue somehow?

You can’t really make it more verbose with LuaTeX, but debugging the
issue is easy: conflicting patterns (called “duplicate patterns” by
XeTeX and other engines) are patterns whose underlying character
strings are the same once the digits are removed, for example a1b and
a2b.  If you generate formats for XeTeX instead of LuaTeX, it gives you
the exact line number where the offending pattern is found -- i.e., the
second occurrence, which

Using that technique I found a number of conflicts such as б1ь and
б8ь, в1ь and в8ь, as well as а1й and а8й, а1ў and а8ў, and the more
intriguing pairs 1’2а and ’3а, 1’2е and ’3е, etc.  This makes me suspect
that the patterns haven’t been developed with great care.
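
If you would rather not build a XeTeX format just to find the conflicts,
the same check can be sketched in a few lines of Python.  This is only an
illustration and not part of hyph-utf8: the names skeleton and
find_conflicts are made up here, and the sketch assumes the patterns have
already been extracted from the \patterns{...} block into a list of
strings, one pattern each.

```python
import re
from collections import defaultdict


def skeleton(pattern):
    """Return the pattern with its ASCII digits removed (letters and dots remain)."""
    return re.sub(r"[0-9]", "", pattern)


def find_conflicts(patterns):
    """Group patterns by skeleton and keep the groups with more than one entry.

    Two patterns conflict when they share a skeleton, e.g. a1b and a2b.
    An exact repeat of the same pattern is reported as well.
    """
    groups = defaultdict(list)
    for p in patterns:
        groups[skeleton(p)].append(p)
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    # A handful of the patterns mentioned above, as a hypothetical input list.
    sample = ["а1й", "а8й", "б1ь", "1’2а", "’3а"]
    for key, group in find_conflicts(sample).items():
        print(key, "<-", ", ".join(group))
```

Each reported group shares one underlying character string, so all but
one of its members would need to be removed or merged by hand.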

> Also, please, clarify for me usage of quotes. There are 3 symbols used in hyph-be.tex: ' ` ’
> I suspect this can confuse the engine, since generate-plain-patterns.rb checks only the first one and convert it to the third one to populate hyph-quote-<lang>.tex
> What is the official position on quotes? Should one use only ' and *TeX will do the rest, or other symbols are allowed too?

Any symbol is allowed in a hyphenation pattern for TeX as long as you
set its \lccode correctly, which is done in a file called
unicode-letters.def, or later within hyph-utf8.  If the characters don’t
have a correct \lccode, you get an error from TeX saying “Non-letter”,
and since you’re not reporting anything like that, your system seems to
be set up correctly from that point of view.

However, TeX won’t treat the different types of apostrophes in any
special way; there are no equivalence tables or anything like that.  To
the engine, the different Unicode characters for the apostrophe are
simply that, different characters.  We enforce equivalences such as the
one between ' and ’ by duplicating every pattern containing an
apostrophe and putting it in the hyph-quote-* files as you’ve seen, so
in your case we could do that by putting all patterns with ` and ’ in
hyph-quote-be.tex, and the patterns with ' in the main file.  We can
update the Ruby scripts to do that.
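
To make the duplication step concrete, here is a rough Python sketch of
the idea; it is not the code of the actual Ruby scripts.  It assumes the
main file uses only ' and that the variants with ’ (U+2019) and ` are to
be derived mechanically from those patterns; the function name
quote_variants and its parameters are hypothetical.

```python
def quote_variants(patterns, main="'", alternates=("\u2019", "`")):
    """Return copies of every pattern containing `main`, one per alternate quote.

    Patterns without an apostrophe are skipped: they belong only in the
    main file.  The returned list is what would go into the auxiliary
    quote file under the assumptions stated above.
    """
    variants = []
    for p in patterns:
        if main in p:
            for alt in alternates:
                variants.append(p.replace(main, alt))
    return variants


if __name__ == "__main__":
    # Hypothetical input: two patterns as they would appear in the main file.
    for v in quote_variants(["1'2а", "а1й"]):
        print(v)
```

If the existing patterns with ` or ’ carry different digits from their '
counterparts, a mechanical copy like this would not reproduce them, so
those would have to be reconciled first.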

The reason for having only one type of apostrophe in the main file
(hyph-be.tex) is so that other programs that have a notion of
equivalence won’t get confused; this is not about TeX (at least not

> And the third moment with these patterns is T2A encoding. The U+2019 symbol (the third quote from the list above) make conversion impossible, since the symbol is not mapped in converter. I tried to enable it in t2a.dat and regenerate converter, but it fails with message: The encoding t2a uses more than two bytes to encode characters.

Yes, of course, in T2A there is only one character slot for the
apostrophe, so you shouldn’t try and map all the different characters
one-to-one.  This is precisely where the strategy explained in the
paragraph above helps: if you extract all the different types of
apostrophes to an auxiliary file and keep only one in the main file, you
can work around that problem.  That said, do you really need to use the
patterns in an 8-bit encoding?

In conclusion, I think you should try and test the patterns first; you
don’t need any of the machinery that hyph-utf8 provides for that, just
a minimal file such as

---- BEGIN test-hyph-be.tex
\catcode`\{=1
\catcode`\}=2
\input unicode-letters.def
\lccode`\'=`\'
\lccode`\`=`\`
\lccode`\’=`\’
\input hyph-be