texlive[58841] Master: patgen2-tutorial

commits+karl at tug.org commits+karl at tug.org
Sun Apr 11 23:52:07 CEST 2021


Revision: 58841
          http://tug.org/svn/texlive?view=revision&revision=58841
Author:   karl
Date:     2021-04-11 23:52:07 +0200 (Sun, 11 Apr 2021)
Log Message:
-----------
patgen2-tutorial

Modified Paths:
--------------
    trunk/Master/tlpkg/bin/tlpkg-ctan-check

Added Paths:
-----------
    trunk/Master/texmf-dist/doc/support/patgen2-tutorial/README.md
    trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.bib
    trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.pdf
    trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.tex

Removed Paths:
-------------
    trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2.pdf
    trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2.tex

Added: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/README.md
===================================================================
--- trunk/Master/texmf-dist/doc/support/patgen2-tutorial/README.md	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/support/patgen2-tutorial/README.md	2021-04-11 21:52:07 UTC (rev 58841)
@@ -0,0 +1,19 @@
+# A Patgen tutorial from 1993, revisited 28 years later
+## Yannis Haralambous
+### Licensed with LPPL 1.3c
+### 1.1 2021/03/29
+
+This the revisited version of a document written in 1993, long before Unicode, the PDF format, 
+the Web, the $n$-th Doctors (for n≥9), \eTeX, smartphones, emojis, fake news, COVID, 
+and whatever makes the world we currently live in. As a tutorial, it was utterly _bad_: 
+it was unclear, missed important information and focalized on (semi-private) jokes. 
+I realized this fact in March 2021, when I tried to use Patgen again after decades of oblivion. 
+Having re-discovered the use of Patgen, which is still as scarcely documented as it was in 1993, 
+I decided to rewrite this tutorial hoping to succeed where I once failed. I ask in advance for 
+your indulgence. Do not hesitate to contact me if parts of it are (still) obscure!
+
+|Change log        | Info                                                                           |
+|------------------|--------------------------------------------------------------------------------|
+|**1.1 2021/03/29**| Revisited the document.                                                        |
+|**1.0 1993/??/??**| The original document.                                                         |
+


Property changes on: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/README.md
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.bib
===================================================================
--- trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.bib	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.bib	2021-04-11 21:52:07 UTC (rev 58841)
@@ -0,0 +1,32 @@
+ at phdthesis{liang,
+author={Liang, Franklin Mark},
+title={Word Hy-phen-a-tion by Comp-put-er},
+school={Stanford University},
+year={1983},
+note={\url{https://www.tug.org/docs/liang/liang-thesis.pdf}}
+}
+
+ at unpublished{krantz,
+      title={Language-Agnostic Syllabification with Neural Sequence Labeling}, 
+      author={Jacob Krantz and Maxwell Dulin and Paul De Palma},
+      year={2019},
+      note={arXiv:1909.13362},
+}
+
+ at book{DEK-A,
+author={Knuth, Donald E.},
+title={The \TeX book},
+publisher ={Addison-Wesley},
+year={1984},
+}
+
+ at inproceedings{bartlett,
+author = {Bartlett, Susan  and
+      Kondrak, Grzegorz  and
+      Cherry, Colin},
+title={Automatic Syllabification with Structured SVMs for Letter-to-Phoneme Conversion},
+year={2008},
+booktitle={{ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA}},
+pages={568--576},
+publisher={Association for Computational Linguistics},
+}


Property changes on: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.bib
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.pdf
===================================================================
(Binary files differ)

Index: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.pdf
===================================================================
--- trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.pdf	2021-04-11 21:40:05 UTC (rev 58840)
+++ trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.pdf	2021-04-11 21:52:07 UTC (rev 58841)

Property changes on: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.pdf
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+application/pdf
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.tex
===================================================================
--- trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.tex	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.tex	2021-04-11 21:52:07 UTC (rev 58841)
@@ -0,0 +1,430 @@
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%                                                                             %
+%          A SMALL TUTORIAL ON THE MULTILINGUAL FEATURES OF PATGEN2           %
+%                                                                             %
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\documentclass{article}
+\usepackage{a4wide,amstext,url}
+\title{A Revisited Small Tutorial on Patgen, 28~Years After}
+\author{Yannis Haralambous\\\small IMT Atlantique, CS 83818, 29238 Brest Cedex 3, France\\[-2pt]\small\texttt{yannis.haralambous at imt-atlantique.fr}}
+\date{\today}
+\begin{document}
+\maketitle
+
+\def\XeTeX{X\kern-.1667em\lower.5ex\hbox{E}\kern-.125emT\kern-.1667em\lower.5ex\hbox{E}\kern-.125emX}
+
+\emph{This the revisited version of a document written in 1993, long before Unicode, the PDF format, the Web, the $n$-th Doctors (for $n\geq9$), \XeTeX, smartphones, emojis, fake news, COVID, and whatever makes the world we currently live in. As a tutorial, it was utterly \emph{bad}: it was unclear, missed important information and focalized on (semi-private) jokes. I realized this fact in March 2021, when I tried to use Patgen again after decades of oblivion. Having re-discovered the use of Patgen, which is still as scarcely documented as it was in 1993, I decided to rewrite this tutorial hoping to succeed where I once failed. I ask in advance for your indulgence. Do not hesitate to contact me if parts of it are (still) obscure!}
+
+This tutorial is based on PATGEN v2.4 from the \TeX live distribution 2020.
+
+\section{Introduction}
+
+Patgen is a program allowing to build hyphenation patterns (cf.\ \cite[Appendix~H]{DEK-A}  for more details on these) out of (relatively large) corpora of already-hyphenated words. It has been developed in the early eighties by Frank Liang, a PhD student of Donald Knuth. Patgen's algorithm is thoroughly described in Liang's thesis \cite{liang}. In the meantime the problem of ``syllabification'' has been tackled by many machine learning approaches, such as SVMs \cite{bartlett} bidirectional LSTM networks \cite{krantz}, etc. In some languages  hyphenation is strictly syllabic and hence syllabification is equivalent to hyphenation, in other languages this is not the case, but the same tools can be applied since the two problems are quite similar. Also there is abundant bibliography on hyphenation in the \emph{TUGboat}.
+
+%Automatic Syllabification with Structured SVMs for Letter-to-Phoneme Conversion
+%January 2008
+%SourceDBLP
+%Conference: ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA
+
+
+
+\section{The Various Files Involved}
+
+Patgen reads information from three files (the dictionary file, the pre-existing patterns file and the translate file) and writes information into two files: the output file, where patterns are written and an additional output file called \texttt{pattmp.*} containing the result of the application of patterns to the dictionary file.
+
+\subsection{The Translate File}
+
+This file comes last on the command line syntax, but because of its importance we will cover it first. The original Patgen (1983) was hardwired to use only the 26 letters of the Latin alphabet. Its November 1991 revision (version~2) by Peter Breitenlohner extended it so that up to 245 different ``abstract characters'' can be used. By ``abstract characters'' I mean members of the alphabet of a formal language that can be represented by multiple, possibly multi-character, representations.
+
+Each member of this formal alphabet must have a primary representation. Optionally it may have alternative representations which will be detected in the dictionary file and mapped to the primary one. In the test file, only primary representations are used.
+
+The fact of using multi-character (and hence multi-byte) representations was a brillant idea by Peter Breitenlohner since it makes Patgen encoding-agnostic: you can use ASCII or ISO Latin-16 or Unicode it will be the same for Patgen.
+
+\medskip
+
+Let us move on to the syntax of the translate file, which is quite peculiar.
+
+The first line of the file has fixed position-dependent semantics: 
+\begin{itemize}
+\item positions 1 and 2 are used to indicate the \verb=\lefthyphenmin= value,
+\item positions 3 and 4 are used for the \verb=\righthyphenmin= value. (As there seems to be no language using left or right hyphen min values greater than 9, you will have to systematically leave a space in front of the one-digit values (spaces at positions 1 and 3).)
+\item positions 5, 6 and 7 of the first line are used to give alternative values for characters \verb=.=, \verb=-= and \verb=*=: the first one is used for word limits, the second is used to indicate breaking points in the patterns file and in the dictionary, the third is used to indicate correct hyphenations in the test file;
+
+\item starting with line 2, you add one line per symbol of the formal alphabet you will use in your dictionary and patterns. The first character position of each line is used for the \emph{delimiter} of that line. This delimiter will surround the letter of your formal alphabet as well as all alternative representations of it. Writing the delimiter twice ends the parsing process for that line, so that one can add comments after the double delimiter.
+\end{itemize}
+
+For example, in the line
+\begin{verbatim}
+XeXEXX
+\end{verbatim}
+we use \texttt{X} as delimiter, our primary representative of this formal letter is \texttt{e} (the primary representative is used in the patterns that will be created), and a secondary representative of it is \texttt{E}. Secondary representatives can be multi-character sequences.
+
+Naturally we will use the space as delimiter, since it makes the translate file more readable:
+\begin{verbatim}
+ e E
+\end{verbatim}
+But do not omit the space at position~1, otherwise your file will not be correctly parsed.
+
+One last detail: since lines $\geq2$ use a delimiter mechanisme, you can use multi-character and multibyte sequences. Not so for line~1 which is byte-oriented: better choose ASCII characters for the potential replacements of \verb=.=, \verb=-= and \verb=*=.
+
+\subsection{The Pre-Existing Patterns File}
+
+If you are not creating patterns ab ovo but rather wish to update a pre-existing set of patterns, then you can place these patterns (in standard pattern notation) in the pre-existing patterns file (second in the list of patgen command line arguments).
+
+If you don't have pre-existing patterns, you should create (``\texttt{touch}'' on Unix systems) an empty pre-existing patterns file nevertheless.
+
+\subsection{The Dictionary File}
+
+The dictionary file contains words using the primary or secondary representatives of formal letters, in which breaking points are indicated by hyphens \texttt{-}.
+
+The dictionary file is also used to test the patterns at the end of the pattern creation process (see \S\,\ref{test}). 
+
+\subsection{The Output File}
+
+This one is written by Patgen, it contains patterns in the standard \TeX\ notation.
+
+\subsection{The Test Result Files}\label{test}
+
+Test Result files carry names \texttt{pattmp.*}, where \texttt{*} is a number. They contain the test results. Patgen will continue creating such files, giving them increasing numbers so that you can use to them to evaluate your progress.
+
+Their syntax is as follows: the period \texttt{.} (or its alternative given in position 5 of the first line of the translate file) represents wrong breaking points introduced by the patterns, the hyphen \texttt{-} (or its alternative given in position 6 of the first line of the translate file) represents missed breaking points, and the asterisk \texttt{*} (or its alternative given in position 7 of the first line of the translate file) represents correct breaking points detected by the patterns. A perfect result is one with only asterisks.
+
+\section{Command Line Syntax}
+
+Patgen expects four command line arguments:
+\begin{verbatim}
+patgen greek.dic greek.pat greek.out greek.tra
+\end{verbatim}
+namely the names (or paths) of the dictionary file, the pre-existing patterns file, the output file and the translate file. Failure of respecting this order will result in Patgen's nervous breakdown, hardly something you would like to inflict to such an old and faithful friend.
+
+\section{Strategy} As Frank Liang explains in his thesis \cite[p.~37]{liang}, at each level patterns are tested over all words, they are kept only if they satisfy the following formula:
+$$
+\alpha*\text{\#good matches}-\beta*\text{\#bad matches}\geq\eta,
+$$
+where $\alpha$ is the ``good weight,'' $\beta$ the ``bad weight'' and $\eta$ the threshold.
+
+Once Patgen launched and the parsing of files completed, it will ask you
+\begin{verbatim}
+This is PATGEN, Version 2.4 (TeX Live 2020)
+left_hyphen_min = 1, right_hyphen_min = 1, 33 letters
+0 patterns read in
+pattern trie has 256 nodes, trie_max = 256, 0 outputs
+hyph_start, hyph_finish: 
+\end{verbatim}
+You have to give two numbers between 1 and 9 (separated by a space), representing the pattern levels you want to create (odd pattern levels are hyphenating levels, even pattern levels are inhibiting levels, higher numbers prevail over lower numbers).
+
+It will then ask:
+\begin{verbatim}
+pat_start, pat_finish:
+\end{verbatim}
+These are the minimum and maximum lengths of patterns you are interesting in, between 1 and 15. With modern computers it is by no means extravagant to ask for the whole range, both your CPU and your RAM can handle it easily. Again you have to supply two numbers (separated by a space), the min and the max.
+
+Then Patgen will ask you:
+\begin{verbatim}
+good weight, bad weight, threshold: 
+\end{verbatim}
+These are the three values in our formula above ($\alpha$, $\beta$, $\eta$). In his thesis, Liang describes the strategy used for the \TeX82 version of US English patterns:
+
+\begin{center}
+\begin{tabular}{ccccccc}
+level&$\alpha$&$\beta$&$\eta$&max length&nb of patterns&accuracy\\\hline
+1&1&2&20&4&458&76.6\%\\
+2&2&1&8&4&509&68.2\%\\
+3&1&4&7&5&985&83.2\%\\
+4&3&2&1&6&1647&82.0\%\\
+5&1&$\infty$&4&8&1320&89.3\%
+\end{tabular}
+\end{center}
+
+I have used Patgen by consistently giving the same values \texttt{1 1 1} for $\alpha,\beta,\eta$, and also got good results. Testing the parameters may need several attempts, but today these attempts are measured in seconds and not in hours anymore.
+
+Once you have given the parameter values, Patgen will starting calculating all pattern lengths for level~1, and will report:
+\begin{verbatim}
+593480 good, 210386 bad, 0 missed
+100.00 %, 35.45 %, 0.00 %
+97 patterns, 712 nodes in count trie, triec_max = 734
+0 good and 97 bad patterns added 
+finding 0 good and 0 bad hyphens 
+pattern trie has 47469 nodes, trie_max = 82932, 74 outputs
+42121 nodes and 53 outputs deleted
+total of 2510 patterns at hyph_level 1
+
+pat_start, pat_finish:
+\end{verbatim}
+It's up to you to give values of pattern length for the next run. Once these given, you will be asked again
+\begin{verbatim}
+good weight, bad weight, threshold: 
+\end{verbatim}
+and so on, for each level.
+
+When there are no more levels left, you get the following message:
+\begin{verbatim}
+593480 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+0 patterns, 256 nodes in count trie, triec_max = 256
+0 good and 34 bad patterns added 
+finding 0 good and 0 bad hyphens 
+pattern trie has 11603 nodes, trie_max = 146189, 496 outputs
+0 nodes and 10 outputs deleted
+total of 0 patterns at hyph_level 9
+hyphenate word list? 
+\end{verbatim}
+where you reply `y' to get the following answer:
+\begin{verbatim}
+writing pattmp.9
+ 
+593480 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+\end{verbatim}
+and the program stops. Now the output file contains the patterns and file \verb=pattmp.9= (the name is given in the message above, and may vary) contains the result of applying the patterns to the dictionary file.
+
+\section{An example}
+\emph{This toy example originates from the 1993 version of the tutorial, I found it so pittoresque that I couldn't help but leave it. Nevertheless I ran Patgen anew to get actualized console output (in the previous document I used Wilfried Ricken's memorable \TeX\ implementation for MacOS~9, from 1993).}
+
+To illustrate this I have taken some
+Greek words (people who know $X\acute\alpha%
+\rho\rho\upsilon$ $K\lambda\acute\upsilon\nu\nu$
+and his album $\Pi\alpha\tau%
+\acute\alpha\tau\epsilon\varsigma$ will recognize some of these
+words\ldots) and included an $\alpha$ with accent, and a variant
+representation of $\pi$, in form of a macro \verb*=\varpi =. Also I
+followed the Greek alphabetical order in the translate file.
+
+Here is my list of words:
+\begin{verbatim}
+A-NU-PO-TA-QTOS
+A-KA-TA-M'A-QH-TOS
+A-EI-MNH-STOS
+MA-NA
+TOUR-KO-GU-FTIS-SA
+NU-QO-KO-PTHS
+LI-ME-NO-FU-LA-KAS
+MPRE-LOK
+A-GA-\varpi A-EI
+PEI-RAI-'AS
+\end{verbatim}
+(file {\tt greek.dic}); and here is my translate file ({\tt greek.tra}):
+\begin{verbatim}
+ 1 1
+ a A
+ 'a 'A
+ b B
+ g G
+ d D
+ e E
+ z Z
+ h H
+ j J
+ i I
+ k K
+ l L
+ m M
+ n N
+ o O
+#p#P#\varpi ##
+ r R
+ s S
+ t T
+ u U
+ f F
+ q Q
+ y Y
+ w W
+\end{verbatim}
+(in case you are wondering, Greek indeed uses values~1
+and~1 for minimal right
+and left hyphenations!). As you see, I have chosen \verb=#= as delimiter
+in the case of $\pi$, because \verb=\varpi= is supposed to be followed
+by a blank space. And then I wrote two delimiters (\verb=##=) to let Patgen know
+that I finished giving secondary representations of that formal letter.
+
+Here is what I got on my console:
+
+\begin{verbatim}
+patgen greek.dic greek.pat greek.out greek.tra
+
+This is PATGEN, Version 2.4 (TeX Live 2020)
+left_hyphen_min = 1, right_hyphen_min = 1, 24 letters
+0 patterns read in
+pattern trie has 266 nodes, trie_max = 290, 0 outputs
+hyph_start, hyph_finish: 1 2
+pat_start, pat_finish: 2 4
+good weight, bad weight, threshold: 1 1 1
+processing dictionary with pat_len = 2, pat_dot = 1
+ 
+0 good, 0 bad, 31 missed
+0.00 %, 0.00 %, 100.00 %
+69 patterns, 325 nodes in count trie, triec_max = 440
+25 good and 42 bad patterns added (more to come)
+finding 29 good and 0 bad hyphens, efficiency = 1.16
+pattern trie has 333 nodes, trie_max = 349, 2 outputs
+processing dictionary with pat_len = 2, pat_dot = 0
+ 
+29 good, 0 bad, 2 missed
+93.55 %, 0.00 %, 6.45 %
+45 patterns, 301 nodes in count trie, triec_max = 378
+2 good and 43 bad patterns added 
+finding 2 good and 0 bad hyphens, efficiency = 1.00
+pattern trie has 339 nodes, trie_max = 386, 6 outputs
+processing dictionary with pat_len = 2, pat_dot = 2
+ 
+31 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+47 patterns, 303 nodes in count trie, triec_max = 369
+0 good and 47 bad patterns added 
+finding 0 good and 0 bad hyphens 
+pattern trie has 344 nodes, trie_max = 386, 13 outputs
+51 nodes and 11 outputs deleted
+total of 27 patterns at hyph_level 1
+ 
+pat_start, pat_finish: 2 4
+good weight, bad weight, threshold: 1 1 1
+processing dictionary with pat_len = 2, pat_dot = 1
+ 
+31 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+27 patterns, 283 nodes in count trie, triec_max = 315
+0 good and 27 bad patterns added 
+finding 0 good and 0 bad hyphens 
+pattern trie has 295 nodes, trie_max = 386, 4 outputs
+processing dictionary with pat_len = 2, pat_dot = 0
+ 
+31 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+27 patterns, 283 nodes in count trie, triec_max = 303
+0 good and 27 bad patterns added 
+finding 0 good and 0 bad hyphens 
+pattern trie has 320 nodes, trie_max = 386, 6 outputs
+processing dictionary with pat_len = 2, pat_dot = 2
+ 
+31 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+24 patterns, 280 nodes in count trie, triec_max = 286
+0 good and 24 bad patterns added 
+finding 0 good and 0 bad hyphens 
+pattern trie has 328 nodes, trie_max = 386, 9 outputs
+35 nodes and 7 outputs deleted
+total of 0 patterns at hyph_level 2
+hyphenate word list? y
+writing pattmp.2
+ 
+31 good, 0 bad, 0 missed
+100.00 %, 0.00 %, 0.00 %
+\end{verbatim}
+and here are the results: first of all the output
+file {\tt greek.out}:
+
+\begin{verbatim}
+a1g
+a1e
+a1k
+a1m
+a1n
+a1p
+a1t
+a1q
+'a1q
+e1l
+e1n
+h1t
+i1'a
+i1m
+i1r
+1ko
+o1g
+o1p
+o1t
+o1f
+r1k
+s1s
+1st
+u1l
+u1p
+u1f
+u1q
+\end{verbatim}
+
+As you see, {\tt o1t} is sorted before {\tt o1f}, because of the order used for the primary representatives in the translate file. Also you see that Patgen
+has read the word \verb*=A-GA-\varpi A-EI= exactly as if it
+were written \verb*=a-ga-pa-ei= and produced the pattern
+{\tt a1p} (if you look in the dictionary word list there is no other
+reason for this pattern to exist).
+
+And here is our result (file {\tt pattmp.2}):
+\begin{verbatim}
+a*nu*po*ta*qtos
+a*ka*ta*m'a*qh*tos
+a*ei*mnh*stos
+ma*na
+tour*ko*gu*ftis*sa
+nu*qo*ko*pths
+li*me*no*fu*la*kas
+mpre*lok
+a*ga*pa*ei
+pei*rai*'as
+\end{verbatim}
+
+As you see there is no \verb*=\varpi = anymore:
+Patgen has replaced it by \verb=p=, and so
+\verb*=A-GA-\varpi A-EI= has become \verb=a*ga*pa*ei=.
+
+%\section{Where do I find more information?}
+%I voluntarily didn't discussed the various parameters used for
+%pattern generation. For these there is very good litterature:
+%\begin{itemize}
+%\item the \TeX book, by the Grand Wizard of \TeX\ arcana,
+%appendix H;
+%\item \LaTeX\ Erweiterungsm\"oglichkeiten, by Helmut Kopka,
+%Addison-Wesley, Pages 482--489;
+%\item Swedish Hyphenation for \TeX, by Jan Michael Rynning,
+%[sorry, I don't know where this paper is published];
+%\item Word Hy-phen-a-tion by Com-put-er, Stanford University
+%Report {\tt STAN-CS-83-977};
+%\item forthcoming paper by Dominik Wujastyk, on British
+%hyphenation;
+%\item Hyphenation Patterns for Ancient Greek and Latin,
+%TUGboat~13 (4), pages 457--469.
+%\item and many others\ldots
+%\end{itemize}
+%
+%Where to get PatGen2? probably everywhere, but certainly in
+%Stuttgart ({\tt IP 129.69.1.12}),
+%\verb=soft/tex/systems/pc/utilities/patgen.zip= (take a look also at
+%\verb=soft/tex/systems/knuth/texware/patgen.version2.1/patgen.README=).
+
+\section{Go Forth, etc., etc.}
+Imitating the Grand Wizard, I invite you to {\sc Go Forth} and make masterpieces of hyphenation patterns\ldots
+%{\bf but} please get in touch with the TWGMLC (Technical Working
+%Group on Multiple Language Coordination) first, since people there
+%are working on many languages: maybe they have already done what
+%you need and are still testing it; or maybe they haven't and in
+%that case you could help us a lot.
+
+And once you created them, you can contact the maintainers of packages such as Polyglossia or Babel to have them included in them---or you can directly send your patterns to CTAN so that everybody in the \TeX\ community can use them.
+
+\bibliographystyle{plain}
+\bibliography{patgen2-tutorial}
+
+\end{document}
+
+ at phdthesis{liang,
+author={Liang, Franklin Mark},
+title={Word Hy-phen-a-tion by Comp-put-er},
+school={Stanford University},
+year={1983},
+note={\url{https://www.tug.org/docs/liang/liang-thesis.pdf}}
+}
+
+ at unpublished{krantz,
+      title={Language-Agnostic Syllabification with Neural Sequence Labeling}, 
+      author={Jacob Krantz and Maxwell Dulin and Paul De Palma},
+      year={2019},
+      note={arXiv:1909.13362},
+}
+
+ at book{DEK-A,
+author={Knuth, Donald E.},
+title={The \TeX book},
+publisher ={Addison-Wesley},
+year={1984},
+}
\ No newline at end of file


Property changes on: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2-tutorial.tex
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Deleted: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2.pdf
===================================================================
(Binary files differ)

Deleted: trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2.tex
===================================================================
--- trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2.tex	2021-04-11 21:40:05 UTC (rev 58840)
+++ trunk/Master/texmf-dist/doc/support/patgen2-tutorial/patgen2.tex	2021-04-11 21:52:07 UTC (rev 58841)
@@ -1,334 +0,0 @@
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%                                                                             %
-%          A SMALL TUTORIAL ON THE MULTILINGUAL FEATURES OF PATGEN2           %
-%                                                                             %
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\documentstyle{article}
-\title{A small tutorial on the multilingual features of PatGen2}
-\author{Yannis Haralambous}
-\date{}
-\begin{document}
-\maketitle
-
-\def\XeT{X\kern-.1667em\lower.5ex\hbox{E}\kern-.125emT}
-
-(This document is released under the GNU General Public License,
-version~2 or any later version.)
-
-I will very briefly discuss and illustrate by an example the features
-of PatGen2\footnote{the extension to PatGen by Peter Breitenlohner,
-author of many beautiful and extremely useful \TeX ware, such as
-\TeX--\XeT, {\tt DVIcopy} etc.}, related to the {\em translation}
-file.
-
-\section{Syntax of the translation file}
-\subsection{What is the problem}
-The problem is that in our (non-English, non-Latin,
-non-Indonesian\footnote{As far as I know, English, Latin and
-Indonesian are the only languages without diacritics or special
-characters...}) languages, we have more characters than just the~26
-letters of the Latin alphabet. Not to mention that alphabetical orders
-can be quite different (remember: in Spanish, cucaracha comes before
-chacal), and that there can be many ways to express these additional
-characters: for example the French \oe{} (like in << Histoire de l'\oe
-il >> by Georges Bataille) can be given to \TeX\ as an 8-bit
-character, or as \verb*=\oe =, or as \verb=\oe{}=, or as \verb=^^f7=
-(since we are {\em all} now working with DC fonts) and so forth.
-
-Now let's make (or enhance the) hyphenation patterns for our language(s).
-Fred Liang has not provided PatGen with the extensibility feature: before
-PatGen2, if your language had less than~26 letters you could hack patterns by
-substituting letters, if it had more\ldots then the odds were against you.
-
-So the problem is to express additional characters in the patterns, in some
-understandable way (and if possible to get the results in the appropriate
-alphabetical order).
-
-\subsection{The solution}
-PatGen2 allows you to include more characters than the~26 letters of
-the alphabet.  You can specify arbitrary many {\em input forms} for
-each of them (these forms will be identified internally), and you can
-specify the alphabetical order of your language. The latter feature is
-only interesting when you want to study the patterns file to correct
-bugs, it doesn't affect the result.
-
-These informations are transmitted to PatGen2 by means of a {\em
-translation} file (usually with the extension \verb=.tra=). The syntax
-of this file is not very user-friendly, but if you deal with PatGen
-you are supposed to be a hacker anyway, and hackers just {\bf love}
-weird-but-efficient-syntaxes [the proof: we all like
-\TeX!\footnote{\ldots just joking}].
-
-On the first line you specify the \verb=\lefthyphenmin= and
-\verb=\righthyphenmin= values. The seven first positions of this line
-are used as follows: two positions for \verb=\lefthyphenmin=, two
-positions for \verb=\righthyphenmin=, and one position for substitute
-symbols of each of \verb=.=, \verb=-=, \verb=*= in the output file.
-This can be useful if you are going to use the dot the asterisk or the
-hyphen to describe your additional characters (for example one may use
-the dot to specify the dot accent on a letter).
-
-The next lines concern letters of the alphabet of our language. The
-first position of each line is the delimiter of our fields. You know
-from languages like C or {\tt awk} that it is very useful to define
-our own delimiter (instead of the blank space) if for any reason we
-want to {\em include} a blank space into our fields. This will be the
-case in the example: one usually leaves a blank space after a \TeX\
-macro without argument.
-
-Otherwise just leave the first position blank.  Next comes the
-standard representation of our character (preferably lowercase). This
-is how the character will appear in the patterns and in the hyphenated
-output file.
-
-Follows again a delimiter, and all the {\em equivalent}
-representations of your character: its uppercase form, and as many
-other input forms we wish, always followed by a delimiter.
-
-The order of lines specifies the alphabetical order of your characters.
-
-When PatGen sees two consecutive delimiters,
-it stops reading; so we can include
-comments after that.
-
-\section{An example}
-To illustrate this I have taken some
-Greek words (people who know $X\acute\alpha%
-\rho\rho\upsilon$ $K\lambda\acute\upsilon\nu\nu$
-and his album $\Pi\alpha\tau%
-\acute\alpha\tau\epsilon\varsigma$ will recognize some of these
-words\ldots) and included an $\alpha$ with accent, and a variant
-representation of $\pi$, in form of a macro \verb*=\varpi =. Also I
-followed the Greek alphabetical order in the translation file.
-
-Here is my list of words:
-
-\begin{verbatim}
-A-NU-PO-TA-QTOS
-A-KA-TA-M'A-QH-TOS
-A-EI-MNH-STOS
-MA-NA
-TOUR-KO-GU-FTIS-SA
-NU-QO-KO-PTHS
-LI-ME-NO-FU-LA-KAS
-MPRE-LOK
-A-GA-\varpi A-EI
-PEI-RAI-'AS
-\end{verbatim}
-
-(file {\tt greek.dic}); and here is my translation file ({\tt greek.tra}):
-
-\begin{verbatim}
- 1 1
- a A
- 'a 'A
- b B
- g G
- d D
- e E
- z Z
- h H
- j J
- i I
- k K
- l L
- m M
- n N
- o O
-#p#P#\varpi ##
- r R
- s S
- t T
- u U
- f F
- q Q
- y Y
- w W
-\end{verbatim}
-
-(yes, yes, that's not a joke; Greek really uses values~1
-and~1 for minimal right
-and left hyphenations!). As you see, I have chosen \verb=#= as delimiter
-in the case of $\pi$, because \verb=\varpi= is supposed to be followed
-by a blank space. And then I wrote two delimiters (\verb=##=) to show PatGen2
-that I finished talking about $\pi$.
-
-Here is what I got on my console:
-
-\begin{verbatim}
-PatGen -t greek.tra -o greek.out greek.dic
-This is PatGen, C Version 2.1 / Macintosh Version 2.0
-Copyright (c) 1991-93 by Wilfried Ricken. All rights reserved.
-left_hyphen_min = 1, right_hyphen_min = 1, 24 letters
-0 patterns read in
-pattern trie has 266 nodes, trie_max = 290, 0 outputs
-hyph_start:  1
-hyph_finish: 2
-pat_start:   2
-pat_finish:  4
-good weight: 1
-bad weight:  1
-threshold:   1
-processing dictionary with pat_len = 2, pat_dot = 1
-
-0 good, 0 bad, 31 missed
-0.00 %, 0.00 %, 100.00 %
-69 patterns, 325 nodes in count trie, triec_max = 440
-25 good and 42 bad patterns added (more to come)
-finding 29 good and 0 bad hyphens, efficiency = 1.16
-pattern trie has 333 nodes, trie_max = 349, 2 outputs
-processing dictionary with pat_len = 2, pat_dot = 0
-
-29 good, 0 bad, 2 missed
-93.55 %, 0.00 %, 6.45 %
-45 patterns, 301 nodes in count trie, triec_max = 378
-2 good and 43 bad patterns added
-finding 2 good and 0 bad hyphens, efficiency = 1.00
-pattern trie has 339 nodes, trie_max = 386, 6 outputs
-processing dictionary with pat_len = 2, pat_dot = 2
-
-31 good, 0 bad, 0 missed
-100.00 %, 0.00 %, 0.00 %
-47 patterns, 303 nodes in count trie, triec_max = 369
-0 good and 47 bad patterns added
-finding 0 good and 0 bad hyphens
-pattern trie has 344 nodes, trie_max = 386, 13 outputs
-51 nodes and 11 outputs deleted
-total of 27 patterns at hyph_level 1
-
-pat_start:   2
-pat_finish:  4
-good weight: 1
-bad weight:  1
-threshold:   1
-processing dictionary with pat_len = 2, pat_dot = 1
-
-31 good, 0 bad, 0 missed
-100.00 %, 0.00 %, 0.00 %
-27 patterns, 283 nodes in count trie, triec_max = 315
-0 good and 27 bad patterns added
-finding 0 good and 0 bad hyphens
-pattern trie has 295 nodes, trie_max = 386, 4 outputs
-processing dictionary with pat_len = 2, pat_dot = 0
-
-31 good, 0 bad, 0 missed
-100.00 %, 0.00 %, 0.00 %
-27 patterns, 283 nodes in count trie, triec_max = 303
-0 good and 27 bad patterns added
-finding 0 good and 0 bad hyphens
-pattern trie has 320 nodes, trie_max = 386, 6 outputs
-processing dictionary with pat_len = 2, pat_dot = 2
-
-31 good, 0 bad, 0 missed
-100.00 %, 0.00 %, 0.00 %
-24 patterns, 280 nodes in count trie, triec_max = 286
-0 good and 24 bad patterns added
-finding 0 good and 0 bad hyphens
-pattern trie has 328 nodes, trie_max = 386, 9 outputs
-35 nodes and 7 outputs deleted
-total of 0 patterns at hyph_level 2
-hyphenate word list? y
-writing PatTmp.2
-
-31 good, 0 bad, 0 missed
-100.00 %, 0.00 %, 0.00 %
-Time elapsed: 0:37:70 minutes.
-\end{verbatim}
-
-and here are the results: first of all the patterns
-(file {\tt greek.out})
-
-\begin{verbatim}
-a1g
-a1e
-a1k
-a1m
-a1n
-a1p
-a1t
-a1q
-'a1q
-e1l
-e1n
-h1t
-i1'a
-i1m
-i1r
-1ko
-o1g
-o1p
-o1t
-o1f
-r1k
-s1s
-1st
-u1l
-u1p
-u1f
-u1q
-\end{verbatim}
-
-As you see, {\tt o1t} comes before {\tt o1f}: (smile) simply
-because we are talking about $o1\tau$ and $o1\phi$. So the
-alphabetical order is well respected. Also you see that PatGen2
-has read the word \verb*=A-GA-\varpi A-EI= exactly as if it
-were \verb*=A-GA-PA-EI= and has made out of it the pattern
-{\tt a1p} (if you look in the input words there is no other
-reason for this pattern to exist).
-
-And here is our result (file {\tt PatTmp.2}):
-
-\begin{verbatim}
-a*nu*po*ta*qtos
-a*ka*ta*m'a*qh*tos
-a*ei*mnh*stos
-ma*na
-tour*ko*gu*ftis*sa
-nu*qo*ko*pths
-li*me*no*fu*la*kas
-mpre*lok
-a*ga*pa*ei
-pei*rai*'as
-\end{verbatim}
-
-(Of course, it is correct; you think I would have shown it if it
-weren't correct?) As you see there is no \verb*=\varpi = anymore:
-PatGen2 has really replaced it by \verb=p=, and so
-\verb*=A-GA-\varpi A-EI= has become \verb=a*ga*pa*ei=.
-
-\section{Where do I find more information?}
-I voluntarily didn't discussed the various parameters used for
-pattern generation. For these there is very good litterature:
-\begin{itemize}
-\item the \TeX book, by the Grand Wizard of \TeX\ arcana,
-appendix H;
-\item \LaTeX\ Erweiterungsm\"oglichkeiten, by Helmut Kopka,
-Addison-Wesley, Pages 482--489;
-\item Swedish Hyphenation for \TeX, by Jan Michael Rynning,
-[sorry, I don't know where this paper is published];
-\item Word Hy-phen-a-tion by Com-put-er, Stanford University
-Report {\tt STAN-CS-83-977};
-\item forthcoming paper by Dominik Wujastyk, on British
-hyphenation;
-\item Hyphenation Patterns for Ancient Greek and Latin,
-TUGboat~13 (4), pages 457--469.
-\item and many others\ldots
-\end{itemize}
-
-Where to get PatGen2? probably everywhere, but certainly in
-Stuttgart ({\tt IP 129.69.1.12}),
-\verb=soft/tex/systems/pc/utilities/patgen.zip= (take a look also at
-\verb=soft/tex/systems/knuth/texware/patgen.version2.1/patgen.README=).
-
-\section{Go forth, etc etc}
-OK, once again {\sc Go Forth} and make masterpieces of hyphenation
-patterns\ldots
-{\bf but} please get in touch with the TWGMLC (Technical Working
-Group on Multiple Language Coordination) first, since people there
-are working on many languages: maybe they have already done what
-you need and are still testing it; or maybe they haven't and in
-that case you could help us a lot.
-
-
-\end{document}
-

Modified: trunk/Master/tlpkg/bin/tlpkg-ctan-check
===================================================================
--- trunk/Master/tlpkg/bin/tlpkg-ctan-check	2021-04-11 21:40:05 UTC (rev 58840)
+++ trunk/Master/tlpkg/bin/tlpkg-ctan-check	2021-04-11 21:52:07 UTC (rev 58841)
@@ -95,7 +95,7 @@
     beamertheme-light beamertheme-metropolis beamertheme-npbt
     beamertheme-phnompenh beamertheme-pure-minimalistic
     beamertheme-saintpetersburg beamertheme-upenn-bc
-    beamerthemejltree
+    beamerthemejltree beamerthemelalic
     beamerthemenirma beamerthemenord
     bearwear beaulivre
     beebe begingreek begriff beilstein belleek bengali
@@ -244,7 +244,7 @@
     dejavu dejavu-otf
     delim delimseasy delimset delimtxt denisbdoc derivative dhua
     diabetes-logbook diadia diagbox diagmac2 dialogl diagnose dice dichokey
-    dickimaw dictsym diffcoeff digiconfigs dijkstra din1505
+    dickimaw dictsym diffcoeff digiconfigs dijkstra dimnum din1505
     dinat dinbrief dingbat directory dirtree dirtytalk disser ditaa dithesis
     dk-bib dlfltxb
     dnaseq dnp doc-pictex docbytex doclicense
@@ -308,7 +308,8 @@
     fetamont fetchcls
     feupphdteses fewerfloatpages feyn feynmf feynmp-auto
     ffslides fge fgruler
-    fibeamer fifinddo-info fifo-stack fig4latex figbas figbib figflow figsize
+    fibeamer fifinddo-info fifo-stack
+    fig4latex figbas figbib figchild figflow figsize
     filecontents filecontentsdef filedate filehook fileinfo filemod
     findhyph fink finstrut fira firamath firamath-otf
     first-latex-doc firstaid fitbox fithesis
@@ -340,7 +341,7 @@
     functan fundus-calligra fundus-cyr fundus-sueterlin fvextra fwlw
   g-brief gaceta galois gamebook gammas garamond-libre garamond-math garuda-c90
     garrigues gastex gatech-thesis gates gatherenum gauss
-    gb4e gbt7714 gcard gchords gcite
+    gb4e gbt7714 gcard gchords gcite gckanbun
     gender geschichtsfrkl genealogy genealogytree gene-logic
     genmpage gentium-tug gentle gentombow geometry
     german germbib germkorr



More information about the tex-live-commits mailing list.