[tex-live] [OT] Re: patterns generation & editing

Taco Hoekwater taco at elvenkind.com
Sat Jul 10 12:07:52 CEST 2010


Almost totally off-topic, but I thought this was a fun subject.

On 07/10/2010 01:17 AM, Manuel Pégourié-Gonnard wrote:
>> I'm quite surprised to learn that. I know near to nothing about the
>> process of building pattern files, but, being French, I already had
>> a look at hyph-fr.tex, and noticed it is full of comments, and
>> moreover the patterns are divided as "phonetic" and "etymological".

On 07/10/2010 03:25 AM, Werner LEMBERG wrote:
> Patterns for Romanic languages, AFAIK, can be manually maintained.
> You should rather have a look at languages like German: Adding, say,
> hundred compound words changes virtually the complete pattern set.

Thanks, learned something new: I did not know that Romanic languages
were *that* easy ;)

I assume the purity of the language (the extent to which it has been
influenced by other languages, especially languages from different
roots) and whether digraphs and in-word inflections exist also make a
big difference to the complexity of the generated patterns.

Here is an example of the complexities in Dutch: the following are some of
the words starting with 'arm' that were used to create the patterns in
the 'max' pattern list, with a bit of explanation (I deleted most
of the plurals and diminutives):

arm             % 1) body part 2) poverty/poor
ar-ma-da        % 1) fleet of warships
ar-ma-ged-don   % 1) end of the world
ar-mag-nac      % 1) a drink from the Armagnac region
ar-ma-tuur      % 1) armoring 2) (light) fixture/fitting
arm-band        % 1) bracelet (this is a compound)
arm-ban-den     % plural of armband
arm-band-hor-lo-ge  % 1) wristwatch (a double compound)
arm-band-je     % diminutive of armband
arm-band-jes    % plural of armbandje
ar-me           % 1) a poor person
ar-mee          % 1) army
ar-men          % plural of arm
Ar-me-ni-ër     % 1) inhabitant of Armenia
Ar-me-ni-ërs    % plural of Armeniër
ar-mer          % 1) poorer than
ar-me-re        % inflection of armer
ar-me-tie-rig   % 1) languishing
ar-me-zon-daars-ge-zicht % poor-sinners-face
arm-las-tig     % 1) the status of being poor
arm-leng-te     % 1) measure of distance (compound)
arm-leu-ning    % 1) arm rest of a chair
arm-pje         % diminutive of arm (as body part)
arms-gat        % 1) sleeve hole
arm-slag        % 1) freedom of movement

'arm' has Germanic roots in both meanings, but they behave quite
differently: the body part nearly always stays whole (except in
the simple plural), while the poverty case often breaks between
r and m.

'armada' comes from Spanish,
'armageddon' from Hebrew,
'armagnac' from French,
'armatuur' from Latin via French (but with Germanized ending),
'armee' from French (this word is not in normal use anymore),
'band' is Germanic,
'horloge' from French,
'zondaar' and 'gezicht' are Germanic,
'leuning' appears to be a local Dutch invention.

The other word constituents are too hard to explain for me quickly.

'armsgat' is a compound word in principle, but the extra 's'
is a Germanic possessive. The 's' in 'armezondaarsgezicht' is
the same, the 'e' in that case comes from the possessive.

'armetierig' and 'armslag' would not be considered compounds any
more, even though they are based on combinations of roots.

Looking at 'arm' in the middle of a word, there are many words
that have this combination of letters stemming from totally
different roots than 'arm', e.g.:

alarm          % alarm (from French)
baar-moe-der   % womb, compound from 'to bear' + 'mother'
bar-man        % bartender, compound
char-mant      % charming (from French)
daar-mee       % with that
darm           % intestine (unclear)
haar-mid-del   % a product for hair, compound
hand-warm      % lukewarm, compound
har-mo-nie     % harmony
jaar-markt     % fair, compound of 'year' and 'market'
kar-mo-zijn    % a colorant (can be traced back to Arabic via Latin)
klaar-ma-ken   % to prepare, compound of 'ready' and 'make'
mar-me-la-de   % marmalade (originally from Portuguese)
mar-mer        % marble (from Latin)
mar-mot        % small rodent (from French)
waar-merk      % verification seal, compound of 'true' and 'mark'
war-moes       % vegetable pulp
warm-te        % heat (Germanic, inflection of 'warm': somewhat hot)
zwaar-moe-dig  % depressed ('zwaar'=heavy,'moede'='tired')

The result: in hyph-nl.tex there are 18 patterns containing 'arm', and no
fewer than 61 patterns containing 'rm'. I hope it is clear that
you cannot easily alter a line or so in the patterns if you want to
fix the bad hyphenation of a word containing the sequence 'arm'.
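
A count like the one above could be approximated with a few lines of
Python. This is only a sketch: it assumes that "containing 'arm'" means
the letters of a pattern, with the digits removed, contain that sequence.
The helper names and the sample list are made up for illustration and are
NOT taken from hyph-nl.tex; reading the real file (skipping TeX comments
and the \patterns{...} wrapper) is left out.

```python
import re


def strip_digits(pattern: str) -> str:
    """Return the letters of a Liang-style pattern, dropping the digits."""
    return re.sub(r"\d", "", pattern)


def patterns_containing(patterns, letters):
    """Return the patterns whose letters (digits removed) contain `letters`."""
    return [p for p in patterns if letters in strip_digits(p)]


# Hypothetical sample patterns, invented for this example only.
sample = ["4a2r1m", ".a2r3me", "1ba", "e4r5m", "o1g"]

print(len(patterns_containing(sample, "arm")))  # 2
print(len(patterns_containing(sample, "rm")))   # 3
```

Every pattern found this way can affect any word containing that letter
sequence, which is why changing one of them by hand is risky.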

Best wishes,

Side note: of course you can 'fix' the patterns by adding a badly
hyphenated word in this way: s8m8o8g9a8l8a8r8m, but I would definitely
not call that the 'preferred way'.
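
To see why such a whole-word pattern works, here is a minimal Python
sketch of how Liang-style patterns are applied: at every position between
letters the highest digit of any matching pattern wins, and an odd value
allows a break while an even value forbids it. The function names are
made up for this example, the competing pattern 'o1g' is hypothetical
(not taken from hyph-nl.tex), and the sketch ignores TeX's minimum
fragment lengths at the start and end of a word.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'a1b2c' into ('abc', [0, 1, 2, 0]).

    values[k] is the digit in front of letter k; a missing digit counts as 0.
    """
    letters = re.sub(r"\d", "", pattern)
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            values.append(0)
    return letters, values


def hyphenate(word, patterns):
    """Insert '-' wherever the highest matching pattern value is odd."""
    padded = "." + word.lower() + "."        # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)         # scores[j]: gap before padded[j]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(1, len(word)):
        if scores[i + 1] % 2 == 1:           # gap between word[i-1] and word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Hypothetical pattern that breaks the word in the wrong place.
print(hyphenate("smogalarm", ["o1g"]))                       # smo-galarm
# The whole-word pattern overrides it: 8 forbids, 9 allows.
print(hyphenate("smogalarm", ["o1g", "s8m8o8g9a8l8a8r8m"]))  # smog-alarm
```

The high digits 8 and 9 simply outvote whatever the other patterns say,
which repairs this one word but adds nothing general to the pattern set.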

