texlive[48052] Master: kanaparser (19jun18)

commits+karl at tug.org commits+karl at tug.org
Tue Jun 19 22:49:48 CEST 2018


Revision: 48052
          http://tug.org/svn/texlive?view=revision&revision=48052
Author:   karl
Date:     2018-06-19 22:49:48 +0200 (Tue, 19 Jun 2018)
Log Message:
-----------
kanaparser (19jun18)

Modified Paths:
--------------
    trunk/Master/tlpkg/bin/tlpkg-ctan-check
    trunk/Master/tlpkg/libexec/ctan2tds
    trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc

Added Paths:
-----------
    trunk/Master/texmf-dist/doc/luatex/kanaparser/
    trunk/Master/texmf-dist/doc/luatex/kanaparser/README.md
    trunk/Master/texmf-dist/doc/luatex/kanaparser/description.pdf
    trunk/Master/texmf-dist/doc/luatex/kanaparser/description.tex
    trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.pdf
    trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.tex
    trunk/Master/texmf-dist/tex/luatex/kanaparser/
    trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.lua
    trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.tex
    trunk/Master/tlpkg/tlpsrc/kanaparser.tlpsrc

Added: trunk/Master/texmf-dist/doc/luatex/kanaparser/README.md
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/kanaparser/README.md	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/luatex/kanaparser/README.md	2018-06-19 20:49:48 UTC (rev 48052)
@@ -0,0 +1,27 @@
+# Kana Parser for LuaTeX
+
+## Author: Adam Zahumenský
+-----------------------------
+
+This is a LuaTeX package that allows for transliteration of Japanese syllabic alphabets, hiragana and katakana, to latin and vice versa.
+The intention of this package is to assist in learning of the kana alphabets and allow users to write kana directly using latin, convert between the two kanas or romanize kana using simple macros.
+I used the most common Hepburn romanization system while keeping to ASCII character set, hence not supporting long latin characters and instead using direct vowel transliteration (ou instead of ō).
+
+The package features three functional macros, one for each target alphabet (latin, hiragana, katakana), which transliterate as much of the provided text as possible to the target alphabet.
+The macros accept a multi-paragraph argument containing the text to transliterate.
+Before using any of these macros, call the \parserInit macro once to initialize the parser.
+
+Some syllables such as "ji" support multiple kana representations. Refer to kanaparser.tex for the list of these syllables and use the \toggleChars macro to toggle between their representations.
+Default choices are based on usage frequency.
+
+To remove ambiguity of syllables beginning with a vowel and following the 'n' character, this package features an isolator character, ' (apostrophe). Refer to examples.tex for its usage.
+
+To use geminated consonants in syllables such as tta using the little tsu (sokuon) character, double the desired consonant instead of typing 't'. Hence type ecchi instead of etchi.
+
+To output Japanese characters you need to use a font with support for these characters. An example of this is ipafont.
+LuaTeX cannot load otf/ttf fonts natively; use the luaotfload.sty helper bundled in TeX Live to do that.
+Refer to examples.tex for font usage; ipafont is required to use the ipagp.otf font referenced in the file.
+
+License: BSD
+Supported Lua version: 5.2
+Last package revision: 19 June 2018
\ No newline at end of file


Property changes on: trunk/Master/texmf-dist/doc/luatex/kanaparser/README.md
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
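The README's rule for geminated consonants (double the consonant, so `ecchi` rather than `etchi`) can be illustrated with a small standalone Lua sketch. The helper name `needs_sokuon` is hypothetical and not part of the package; the consonant list mirrors the whitelist in kanaparser.lua further down in this diff.

```lua
-- Hypothetical helper, not part of kanaparser.lua: decide whether two
-- adjacent Latin letters should be written with a sokuon (little tsu).
local geminable = { s = true, t = true, k = true, p = true, c = true }

local function needs_sokuon(first, second)
  -- Only an exact doubling of a whitelisted consonant counts.
  return first == second and geminable[first] == true
end

print(needs_sokuon('c', 'c')) -- "ecchi": doubled 'c'
print(needs_sokuon('t', 'c')) -- "etchi": 't' before 'c' is not a doubling
```

Under this rule the Hepburn-style spelling `etchi` is not treated as gemination, which matches the README's advice.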
Added: trunk/Master/texmf-dist/doc/luatex/kanaparser/description.pdf
===================================================================
(Binary files differ)

Index: trunk/Master/texmf-dist/doc/luatex/kanaparser/description.pdf
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/kanaparser/description.pdf	2018-06-19 20:47:54 UTC (rev 48051)
+++ trunk/Master/texmf-dist/doc/luatex/kanaparser/description.pdf	2018-06-19 20:49:48 UTC (rev 48052)

Property changes on: trunk/Master/texmf-dist/doc/luatex/kanaparser/description.pdf
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+application/pdf
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/luatex/kanaparser/description.tex
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/kanaparser/description.tex	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/luatex/kanaparser/description.tex	2018-06-19 20:49:48 UTC (rev 48052)
@@ -0,0 +1,109 @@
+% This file produces a description document for the Kana Parser project
+
+\input luaotfload.sty % otf font loader
+\input kanaparser % load the parser package
+
+\font\jp = ipagp % ipagp.otf font is included in the ipafont font package: https://www.internationalphoneticassociation.org/content/ipa-fonts
+\font\hdf = cmbx12 at 18pt
+\font\nmf = cmbx12 at 14pt
+
+% wrapper macros that change font automatically
+\def\jchar#1{{\jp #1}}
+\def\kpth#1{\jchar{\toHiragana{#1}}}
+\def\kptk#1{\jchar{\toKatakana{#1}}}
+\def\kptl#1{\jchar{\toLatin{#1}}}
+
+% styling macros
+\def\hd#1{{\hdf#1}\vskip 10pt}
+\def\nm#1{\vskip 10pt{\nmf#1}\vskip 8pt}
+
+\parserInit % initialize kana parser
+
+\hd{Kana Parser for Lua\TeX}
+
+Greetings, reader. This document will describe this Lua\TeX \kern3pt package in detail, providing all the information you need to start using this package.
+We will analyze the Japanese writing system and see how it relates to the Latin script. Then we will see how this package handles the conversion.
+
+\nm{1. The Japanese writing system}
+
+Modern Japanese uses four distinct character sets: Latin (known as `romaji' in Japanese), a pair of syllabic sets known as kana, and kanji, the complex ideographic set borrowed from Chinese. Japanese text regularly combines all four sets, a practice that newcomers to the language usually find confusing.
+
+Kanji is based on a subset of Chinese ideograms known as `hanzi' in China and cannot be transliterated to Latin by a simple state automaton, requiring solid context awareness.
+
+Kana, however, is a phonetic system based on syllables which can be directly transliterated to Latin. The two kana sets are known as hiragana and katakana.
+Kana represents a set of roughly 46 syllables (48 including two obsolete ones); each syllable has a hiragana and a katakana character assigned to it.
+There are five vowel characters (a, i, u, e, o), an `n' character and the rest are syllabic compounds of vowels and consonants, such as `wo'.
+
+Hiragana is used for native syntactic and grammatical constructs as well as common words and phrases. It is also used in material intended for young readers who do not yet understand complex kanji.
+
+Katakana is used for loanwords, foreign words and usually onomatopoeia among other uses.
+The two kanas cover the same set of syllables and as such can be freely converted between each other.
+
+\nm{2. Differences between the kanas}
+
+Despite covering the same syllable set, there are certain differences between the systems.
+The most striking difference is in how the sets prolong their syllables.
+Prolongation here means extending a vowel-terminated syllable by a pure vowel, getting a syllable of double length. An example of this is \jchar{ma => maa}.
+
+Hiragana prolongs syllables by explicitly putting a vowel character after a syllable: \kpth{ma => maa}.\break
+Here you can see how an \kpth{a} gets appended to \kpth{ma} to prolong it.
+Syllables ending in o and e are instead prolonged by u and i, respectively: \jchar{mo => mou} (\kpth{mo => mou})
+
+Katakana uses a single prolonging character, \jchar ー, to prolong any vowel-terminated syllable.
+This package ensures this character is always correctly transliterated to its respective hiragana vowel or Latin vowel.\break
+\kptk{mo => mou} in katakana translates correctly to \kpth{\toKatakana{mo => mou}} in hiragana and \kptl{\toKatakana{mo => mou}} in Latin.
+
+Another difference is in katakana's added support for various foreign syllables. These syllables don't exist in native Japanese, such as vu (\kptk{vu}).
+These syllables help in better representing foreign words and as such don't commonly have hiragana counterparts.
+However, thanks to the inter-compatibility of kana character set, even these syllables can be written in hiragana, although such use is very unusual: vu (\kpth{vu}).
+This package supports such conversions to promote learning of the character sets.
+
+\nm{3. Consonant gemination}
+
+The Japanese language supports doubling (or gemination) of certain unvoiced consonants (s, t, k, p, ch) when they appear at the beginning of a syllable. An example of this is the syllable `ka' (\kpth{ka}) which turns into `kka' (\kpth{kka}) when geminated. As seen in the example, the kana sets have a special character, \jchar っ, called sokuon (little tsu), a small version of the `tsu' (\kpth{tsu}) character, which is placed in front of the syllable to be geminated.
+This package detects correct usage of sokuon and represents it in Latin by doubling the respective consonant.
+In several romanization systems, gemination is represented by using `t' instead in all cases but I find the doubling of the affected consonant a better way to show the true nature of sokuon.
+
+\break\nm{4. Ambiguity of `n'}
+
+N is the only consonant in Japanese with its own kana character, \kpth{n}.
+As such, there is some ambiguity in following it by other characters.
+There are several syllables beginning in `n', such as nya (\kpth{nya}) or nyo (\kpth{nyo}), which could be ambiguously split into `n-ya' (\kpth{n'ya}) and `n-yo' (\kpth{n'yo}) respectively.
+To make sure there is no ambiguity in romanization of these characters, an isolating delimiter is used: '. To demonstrate its usage, `nyaa' becomes \kptk{nyaa} in katakana but `n'yaa' becomes \kptk{n'yaa} --- ambiguity resolved.
+This works backwards too, where \kpth{ren'youkei} which contains the `nyo' syllable split to `n-yo' transliterates to \kptl{\toHiragana{ren'youkei}}.
+
+\nm{5. Transliteration alternatives}
+
+As expected with completely different writing systems, the conversion between them is not really isomorphic. Several syllables have multiple kana representations and several kana characters have multiple romanization options.
+To tackle this problem, this package tries to be as permissive as possible by letting the user configure alternatives on the go.
+The most frequent alternatives are selected by default and can be viewed in the kanaparser.tex file. There is a switch macro in the package that lets the user choose which kana character(s) will be used in place of the selected syllable if that syllable supports alternatives. There is always at most one alternative to a syllable representation.
+For example, if you wish that `we' is not written as \kptk{we} in katakana but instead as the obsolete \toggleChars{we}\kptk{we}, the package lets you do it.
+On the other hand, `sisi' and `shishi' will both transliterate to \kpth{sisi} although backwards transliteration will always be the closer-sounding \kptl{\toHiragana{sisi}}.
+Romanization of all the alternative kana characters is enabled by default.
+
+\nm{6. Transliterating mixed character sets and special characters}
+
+This package has limited support for this feature. Its three macros always attempt to transliterate as much as they can into the target character set. There is no option to only transliterate hiragana, for example. When targeting Latin, both kana sets will be converted. The same goes for transliterating to the kanas: both Latin and the other kana set will be converted.
+Characters not understood by the macro in use (including ") will be left unchanged, except for apostrophes ('), which will be consumed (and treated as isolation delimiters) when translating to kana.
+
+\nm{7. Introducing romanization systems}
+
+There are several systems for romanization of Japanese and this package loosely follows the Hepburn system (\jchar{ヘボン式ローマ字}).
+The first difference is that the package ignores the characters with macron in long syllables (such as \jchar ō).
+This is to stay within the ASCII character set (which simplifies typing on a common keyboard) and lets newcomers to the language get used to the prolongation rules.
+As such, \kpth{koukou} transliterates to \kptl{\toHiragana{koukou}} instead of \jchar{kōkō}.
+Contextual variations are also ignored in this package, such as writing \kpth{ha} as \jchar{wa} when used as a topic particle.
+Another notable deviation from Hepburn is not using `t' for consonant gemination except for syllables beginning in `t'. As such, \jchar{まっちゃ} becomes \kptl{まっちゃ} and not \jchar{matcha}.
+
+\nm{8. Fonts, unicode and implementation}
+
+Kana are multibyte Unicode characters, so a compatible font is needed to display them; the bundled macros won't print anything readable without a font with Japanese support.
+An example of such font is the ipafont family.
+
+Both Lua and Lua\TeX \kern3pt support Unicode characters, although Lua only treats them as multibyte strings. As such, a UTF-8 tokenizer is needed to properly recognize individual characters.
+Once tokenized, conversion both to and from kana sets is possible using a state automaton with a processing buffer.
+
+When converting Latin to kana, a three-character buffer is needed to process characters such as `nya' (\kpth{nya}); the other way around, only a two-character buffer is required to process multi-character compounds.
+Based on the contents of this buffer the automaton decides what to transliterate, prolong, geminate or print as-is. Conversion between kana sets is implemented as a simple translation table.
+
+\bye


Property changes on: trunk/Master/texmf-dist/doc/luatex/kanaparser/description.tex
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
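Section 8 of description.tex notes that Lua sees kana as multibyte strings and that a UTF-8 tokenizer is needed. The sketch below shows one compact way to split a string into UTF-8 characters with a Lua pattern. It is an illustration under the assumption of well-formed input, not the tokenizer the package ships (that one appears in kanaparser.lua later in this diff), and the name `utf8_chars` is hypothetical.

```lua
-- Split a UTF-8 string into characters: one lead byte (ASCII 1-127 or
-- 194-244) followed by any number of continuation bytes (128-191).
-- Assumes well-formed UTF-8 without NUL bytes; no validation is done.
local function utf8_chars(s)
  local out = {}
  for ch in s:gmatch('[\1-\127\194-\244][\128-\191]*') do
    out[#out + 1] = ch
  end
  return out
end

local chars = utf8_chars('にゃa')
print(#chars)   -- number of characters
print(#'にゃa') -- number of bytes in the same string
```

Decimal escapes are used in the pattern so that the sketch does not depend on the hexadecimal escapes introduced in Lua 5.2.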
Added: trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.pdf
===================================================================
(Binary files differ)

Index: trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.pdf
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.pdf	2018-06-19 20:47:54 UTC (rev 48051)
+++ trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.pdf	2018-06-19 20:49:48 UTC (rev 48052)

Property changes on: trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.pdf
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+application/pdf
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.tex
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.tex	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.tex	2018-06-19 20:49:48 UTC (rev 48052)
@@ -0,0 +1,43 @@
+% This file shows various usage of this parser
+
+\input luaotfload.sty % otf font loader
+\input kanaparser % load the parser package
+
+\font\jp = ipagp % ipagp.otf font is included in the ipafont font package: https://www.internationalphoneticassociation.org/content/ipa-fonts
+
+\parserInit % initialize kana parser
+
+% wrapper macros that change font automatically
+\def\jchar#1{{\jp #1}}
+\def\kpth#1{\jchar{\toHiragana{#1}}}
+\def\kptk#1{\jchar{\toKatakana{#1}}}
+\def\kptl#1{\jchar{\toLatin{#1}}}
+
+Example of transliteration to Latin: \kptl{しゅんかしゅうとう しし}
+
+Example of transliteration to katakana featuring prolongation dashes: \kptk{しゅんかしゅうとう しし}
+
+Example of transliteration to hiragana converting prolongation dashes: \kpth{シュンカシュートー}
+
+Example of transliteration of multiple-form syllables to hiragana using default settings: \kpth{jiji wewe}
+
+\toggleChars{ji we} % toggles the kana representation of 'ji' and 'we' syllables
+Example of transliteration of multiple-form syllables to hiragana using alternate settings: \kpth{jiji wewe}
+
+Mixed example of transliteration to katakana: \kptk{shunkashuutouuuxxxxxchou}
+
+Example of default transliteration to hiragana using ambiguous syllables after n: \kpth{renyoukei}
+
+Example of isolated n-character to resolve ambiguity: \kpth{ren'youkei}
+
+Example of hiragana to Latin transliteration from previous example: \kptl{れんようけい}
+
+Example of consonant gemination from hiragana to Latin: \kptl{にっぽん}
+
+Example of consonant gemination from Latin to hiragana: \kpth{nippon}
+
+Example of character preservation: \kptl{when translating to Latin, ' and " are preserved}
+
+Example of character preservation 2: \kptl{\kpth{when translating to kana, ' is consumed, " is preserved}}
+
+\bye


Property changes on: trunk/Master/texmf-dist/doc/luatex/kanaparser/examples.tex
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
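The `renyoukei` and `ren'youkei` examples above depend on the apostrophe isolator. When romanizing in the other direction, the apostrophe has to be re-inserted whenever the kana after the standalone n could otherwise be read as part of an n-syllable. A minimal Lua sketch of that decision follows; `romanize_n` is a hypothetical name, and the kana list is copied from the `ambigousToN` table in kanaparser.lua.

```lua
-- Kana that would merge with a preceding standalone n when romanized
-- without a delimiter (vowels and the y-row).
local ambiguous_after_n = {}
for _, kana in ipairs({ 'あ', 'え', 'い', 'お', 'う', 'や', 'よ', 'ゆ' }) do
  ambiguous_after_n[kana] = true
end

-- Hypothetical helper: romanize a standalone n given the kana after it.
local function romanize_n(next_kana)
  return ambiguous_after_n[next_kana] and "n'" or 'n'
end

print(romanize_n('よ')) -- followed by "yo": isolator needed
print(romanize_n('か')) -- followed by "ka": plain n is unambiguous
```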
Added: trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.lua
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.lua	2018-06-19 20:49:48 UTC (rev 48052)
@@ -0,0 +1,427 @@
+-- Kana Parser lua engine
+
+local vowels = {'a', 'e', 'i', 'o', 'u'} -- latin vowels
+local vowelsK = {'ア', 'エ', 'イ', 'オ', 'ウ'} -- katakana vowels
+local ambigousToN = {'あ', 'え', 'い', 'お', 'う', 'や', 'よ', 'ゆ'} -- characters ambiguous to preceding "n"
+local littleTsuWL = {'s', 't', 'k', 'p', 'c'} -- whitelist for little tsu gemination
+local transRaw = { -- latin -> hiragana
+	n = 'ん', a = 'あ', e = 'え', i = 'い', o = 'お', u = 'う',
+	ba = 'ば', be = 'べ', bi = 'び', bo = 'ぼ', bu = 'ぶ',
+	bya = 'びゃ', byo = 'びょ', byu = 'びゅ',
+	cha = 'ちゃ', che = 'ちぇ', chi = 'ち', cho = 'ちょ', chu = 'ちゅ',
+	da = 'だ', de = 'で', di = 'でぃ', ['do'] = 'ど', du = { 'づ', 'どぅ' },
+	dya = 'でゃ', dyo = 'でょ', dyu = 'でゅ',
+	fa = 'ふぁ', fe = 'ふぇ', fi = 'ふぃ', fo = 'ふぉ',
+	fya = 'ふゃ', fyo = 'ふょ', fyu = 'ふゅ',
+	ga = 'が', ge = 'げ', gi = 'ぎ', go = 'ご', gu = 'ぐ',
+	gwa = 'ぐぁ', gwe = 'ぐぇ', gwi = 'ぐぃ', gwo = 'ぐぉ', gya = 'ぎゃ', gyo = 'ぎょ', gyu = 'ぎゅ',
+	ha = 'は', he = 'へ', hi = 'ひ', ho = 'ほ', hu = 'ふ',
+	hya = 'ひゃ', hyo = 'ひょ', hyu = 'ひゅ',
+	ja = { 'じゃ', 'ぢゃ' }, je = 'じぇ', ji = { 'じ', 'ぢ' }, jo = { 'じょ', 'ぢょ' }, ju = { 'じゅ', 'ぢゅ' },
+	ka = 'か', ke = 'け', ki = 'き', ko = 'こ', ku = 'く',
+	kwa = 'くぁ', kwe = 'くぇ', kwi = 'くぃ', kwo = 'くぉ', kya = 'きゃ', kyo = 'きょ',	kyu = 'きゅ',
+	ma = 'ま', me = 'め', mi = 'み', mo = 'も', mu = 'む',
+	mya = 'みゃ', myo = 'みょ', myu = 'みゅ',
+	na = 'な', ne = 'ね', ni = 'に', no = 'の', nu = 'ぬ',
+	nya = 'にゃ', nyo = 'にょ', nyu = 'にゅ',
+	pa = 'ぱ', pe = 'ぺ', pi = 'ぴ', po = 'ぽ', pu = 'ぷ',
+	pya = 'ぴゃ', pyo = 'ぴょ', pyu = 'ぴゅ',
+	ra = 'ら', re = 'れ', ri = 'り', ro = 'ろ', ru = 'る',
+	rya = 'りゃ', ryo = 'りょ', ryu = 'りゅ',
+	sa = 'さ', se = 'せ', si = 'し',	so = 'そ', su = 'す',
+	sha = 'しゃ', she = 'しぇ', shi = 'し', sho = 'しょ', shu = 'しゅ',
+	ta = 'た', te = 'て', ti = 'てぃ', to = 'と',
+	tha = 'てゃ', tho = 'てょ', thu = 'てゅ',
+	tsa = 'つぁ', tse = 'つぇ', tsu = 'つ', tsi = 'つぃ', tso = 'つぉ',
+	tu = 'つ',
+	va = 'ゔぁ', ve = 'ゔぇ', vi = 'ゔぃ', vo = 'ゔぉ', vu = 'ゔぅ',
+	vya = 'ゔゃ', vyo = 'ゔょ', vyu = 'ゔゅ',
+	wa = 'わ', we = { 'うぇ', 'ゑ' }, wi = 'ゐ', wo = { 'を', 'うぉ' },
+	ya = 'や', ye = 'いぇ', yo = 'よ', yu = 'ゆ',
+	za = 'ざ', ze = 'ぜ', zo = 'ぞ', zu = 'ず'
+}
+local transK = { -- hiragana -> katakana
+	['ん'] = 'ン', ['あ'] = 'ア', ['え'] = 'エ', ['い'] = 'イ', ['お'] = 'オ', ['う'] = 'ウ',
+	['ぁ'] = 'ァ', ['ぃ'] = 'ィ', ['ぅ'] = 'ゥ', ['ぇ'] = 'ェ', ['ぉ'] = 'ォ',
+	['ゃ'] = 'ャ', ['ゅ'] = 'ュ', ['ょ'] = 'ョ',
+	['は'] = 'ハ', ['へ'] = 'ヘ', ['ひ'] = 'ヒ', ['ほ'] = 'ホ', ['ふ'] = 'フ',
+	['ば'] = 'バ', ['べ'] = 'ベ', ['び'] = 'ビ', ['ぼ'] = 'ボ', ['ぶ'] = 'ブ',
+	['ぱ'] = 'パ', ['ぺ'] = 'ペ', ['ぴ'] = 'ピ', ['ぽ'] = 'ポ', ['ぷ'] = 'プ',
+	['た'] = 'タ', ['て'] = 'テ', ['ち'] = 'チ', ['と'] = 'ト', ['つ'] = 'ツ',
+	['だ'] = 'ダ', ['で'] = 'デ', ['ぢ'] = 'ヂ', ['ど'] = 'ド', ['づ'] = 'ヅ',
+	['か'] = 'カ', ['け'] = 'ケ', ['き'] = 'キ', ['こ'] = 'コ', ['く'] = 'ク',
+	['が'] = 'ガ', ['げ'] = 'ゲ', ['ぎ'] = 'ギ', ['ご'] = 'ゴ', ['ぐ'] = 'グ',
+	['ま'] = 'マ', ['め'] = 'メ', ['み'] = 'ミ', ['も'] = 'モ', ['む'] = 'ム',
+	['な'] = 'ナ', ['ね'] = 'ネ', ['に'] = 'ニ', ['の'] = 'ノ', ['ぬ'] = 'ヌ',
+	['ら'] = 'ラ', ['れ'] = 'レ', ['り'] = 'リ', ['ろ'] = 'ロ', ['る'] = 'ル',
+	['さ'] = 'サ', ['せ'] = 'セ', ['し'] = 'シ', ['そ'] = 'ソ', ['す'] = 'ス',
+	['ざ'] = 'ザ', ['ぜ'] = 'ゼ', ['じ'] = 'ジ', ['ぞ'] = 'ゾ', ['ず'] = 'ズ',
+	['わ'] = 'ワ', ['ゑ'] = 'ヱ', ['ゐ'] = 'ヰ', ['を'] = 'ヲ',
+	['や'] = 'ヤ', ['よ'] = 'ヨ', ['ゆ'] = 'ユ',
+	['ゔ'] = 'ヴ', ['っ'] = 'ッ'
+}
+local correctionsFromKana = { -- manual transliteration choices
+	['し'] = 'shi'
+}
+local longK = 'ー'
+local isolator = '\''
+local prolongRules = { -- special rules for prolonging syllables
+	o = 'u',
+	e = 'i'
+}
+
+-- builds a reverse table
+local function rev(t)
+	local res = {}
+	for k, v in pairs(t) do
+		if (type(v) == 'table') then
+			res[v[1]] = k
+			res[v[2]] = k
+		else
+			res[v] = k
+		end
+	end
+	return res
+end
+
+-- builds the default translation tables latin <-> kana from transRaw
+local function buildDefaultTransTables()
+	local tr, rtr = {}, {}
+	
+	for k, v in pairs(transRaw) do
+		tr[k] = type(v) == 'table' and v[1] or v
+	end
+
+	rtr = rev(tr)
+
+	-- apply corrections
+	for i, v in pairs(correctionsFromKana) do
+		rtr[i] = v
+	end
+
+	return tr, rtr, rev(transK)
+end
+
+-- decides which vowel should prolong the given vowel
+local function prolong(c)
+	for i, v in ipairs(vowels) do
+		if c == v then
+			if prolongRules[c] then return prolongRules[c] else return c end
+		end
+	end
+	return nil
+end
+
+-- checks if a katakana token is a vowel and returns its latin representation
+local function getWovelK(c)
+	for i, v in ipairs(vowelsK) do
+		if c == v then return vowels[i] end
+	end
+	return nil
+end
+
+-- checks if a given symbol is ambiguous to preceding n
+local function isAmbiguous(c)
+	for i, v in ipairs(ambigousToN) do
+		if c == v then return true end
+	end
+	return false
+end
+
+-- init translation tables
+local trans, revTrans, revTransK = buildDefaultTransTables()
+
+-- init default transliteration choices (everything default to first alternative)
+local transChoices = {}
+
+-- checks if two characters are valid candidates for little tsu
+local function isValidTsuCandidate(a, b)
+	if a ~= b then return false end
+	for i, v in ipairs(littleTsuWL) do
+		if a == v then return true end
+	end
+	return false
+end
+
+-- checks if two characters are a little tsu used correctly and returns the gemination consonant if true
+local function getGeminationConsonant(a, b)
+	if a ~= 'っ' then return nil end -- disregard katakana, only hiragana is processed in romanization
+	local tr = revTrans[b]
+	if not tr then return nil end -- invalid hiragana character
+	local fst = string.sub(tr, 1, 1) -- get first character of the transliteration
+	for i, v in ipairs(littleTsuWL) do
+		if fst == v then return fst end
+	end
+	return nil -- invalid gemination
+end
+
+-- parses an utf8 string into utf8 chars (tokens)
+local function tokenize(utf8str)
+	assert(type(utf8str) == 'string')
+	local res, seq, val = {}, 0, ''
+	for i = 1, #utf8str do
+		local c = string.byte(utf8str, i)
+		if seq == 0 then
+			if i ~= 1 then table.insert(res, val) end
+			seq = c < 0x80 and 1 or c < 0xE0 and 2 or c < 0xF0 and 3 or
+			      c < 0xF8 and 4 or error('invalid UTF-8 character sequence')
+			val = string.char(c)
+		else
+			val = val .. string.char(c)
+		end
+		seq = seq - 1
+	end
+	table.insert(res, val)
+	return res
+end
+
+-- PUBLIC API SECTION
+
+-- toggles used characters for supplied syllables (whitespace-separated)
+function toggleChars(input)
+	local cur, choices = '', {}
+	for s in string.gmatch(input, '%S+') do -- split by whitespaces
+		cur = trans[s]
+		if cur then -- don't process unknown syllables
+			choices = transRaw[s]
+			if type(choices) == 'table' then -- only process syllables with alternatives
+				trans[s] = cur == choices[1] and choices[2] or choices[1] -- toggle between alternatives
+			end
+		end
+	end
+end
+
+-- any kana to latin
+function toLatin(input)
+	if input == '' then return end
+	local tbl = tokenize(input)
+	local buffer, res = {}, ''
+
+	-- read tokenized input
+	local tjoin, tfst, last, gc = '', '', 0, '' -- last is the last valid transliterated vowel, gc is the last gemination consonant
+	for i, v in ipairs(tbl) do
+		if revTransK[v] ~= nil then v = revTransK[v] end -- convert all katakana to hiragana
+		table.insert(buffer, v)
+
+		if #buffer == 2 then -- kana can be formed with up to two characters, always keep two in buffer
+			tjoin, tfst, gc = revTrans[ buffer[1] .. buffer[2] ], revTrans[ buffer[1] ], getGeminationConsonant(buffer[1], buffer[2])
+			if tjoin ~= nil then -- double character
+				res = res .. tjoin
+				buffer, last = {}, string.sub(tjoin, -1)
+			elseif gc then -- check for little tsu
+				res = res .. gc
+				buffer, last = {buffer[2]}, 0
+			elseif tfst ~= nil then -- single character
+				res = res .. tfst
+				if tfst == 'n' and isAmbiguous(buffer[2]) then -- ambiguous character succeeding an "n"
+					res = res .. isolator
+				end
+				buffer, last = {buffer[2]}, string.sub(tfst, -1)
+			elseif buffer[1] == longK and prolong(last) ~= nil then -- prolonging dash
+				res = res .. prolong(last)
+				buffer, last = {buffer[2]}, 0
+			else -- cannot transliterate, output as-is
+				res = res .. buffer[1]
+				buffer, last = {buffer[2]}, 0
+			end
+		end
+	end
+
+	if #buffer == 1 then -- trailing character
+		if revTrans[ buffer[1] ] ~= nil then -- single character
+			res = res .. revTrans [ buffer[1] ]
+		elseif buffer[1] == longK and prolong(last) ~= nil then -- prolonging dash
+			res = res .. prolong(last)
+		else -- cannot transliterate, output as-is
+			res = res .. buffer[1]
+		end
+	end
+
+	tex.print(res)
+end
+
+-- latin or katakana to hiragana, 'raw' parameter is for internal use, leave it blank to get output to TeX
+function toHiragana(input, raw)
+	if input == '' then return end
+	local tbl = tokenize(input)
+	local buffer, res = {}, ''
+	local t3, t2, t1, last, lastsym, lastcnd = '', '', '', 0, nil, nil
+
+	for i, v in ipairs(tbl) do
+		if revTransK[v] then v = revTransK[v] end -- translate katakana to hiragana on the go
+		table.insert(buffer, v)
+
+		if #buffer == 3 then
+			t3, t2, t1 = trans[ buffer[1] .. buffer[2] .. buffer[3] ], trans[ buffer[1] .. buffer[2] ], trans[ buffer[1] ]
+			if t3 ~= nil then -- all three letters yield translation
+				if lastcnd then -- add little tsu
+					res = res .. 'っ'
+					lastcnd = nil
+				end
+				res = res .. t3
+				last = buffer[3]
+				buffer = {}
+			elseif t2 ~= nil then -- first two letters yield translation
+				if lastcnd then -- add little tsu
+					res = res .. 'っ'
+					lastcnd = nil
+				end
+				res = res .. t2
+				last = buffer[2]
+				buffer = {buffer[3]}
+			elseif isValidTsuCandidate(buffer[1], buffer[2]) then -- test little tsu candidates
+				if lastcnd then res = res .. lastcnd end -- add last consonant in raw form
+				lastcnd = buffer[1] -- set last candidate consonant
+				last = 0 -- is not vowel
+				buffer = {buffer[2], buffer[3]}
+			elseif t1 ~= nil then -- first letter yields translation : a, e, i, o, u, n
+				res = res .. t1
+				last = buffer[1]
+				buffer = {buffer[2], buffer[3]}
+			elseif buffer[1] == longK and prolong(last) ~= nil then -- valid prolonger sign
+				res = res .. trans[prolong(last)]
+				buffer, last = {buffer[2], buffer[3]}, 0
+			elseif buffer[1] == isolator then -- isolating apostrophe, consume it
+				buffer = {buffer[2], buffer[3]}
+			else
+				if lastcnd then -- add last consonant in raw form
+					res = res .. lastcnd
+					lastcnd = nil
+				end
+
+				-- this code allows for proper conversion of katakana's prolongation dash to hiragana
+				t1 = revTrans[ buffer[1] ]
+				if t1 then -- symbol is standalone hiragana
+					last = string.sub(t1, -1)
+					lastsym = buffer[1]
+				elseif lastsym then -- attempt to merge symbol with previous symbol
+					t1 = revTrans[ lastsym .. buffer[1] ]
+					if t1 then -- symbol is a valid non-standalone hiragana compound
+						last = string.sub(t1, -1)
+					else -- symbol is an invalid non-standalone hiragana compound
+						last = nil
+					end
+					lastsym = nil
+				else
+					last, lastsym = 0, nil
+				end
+				
+				res = res .. buffer[1]
+				buffer = {buffer[2], buffer[3]}
+			end
+		end
+	end
+
+	if #buffer == 2 then
+		if trans[ buffer[1] .. buffer[2] ] ~= nil then -- first two symbols yield translation
+			if lastcnd then res = res .. 'っ' end -- add little tsu
+			res = res .. trans[ buffer[1] .. buffer[2] ]
+			last = buffer[2]
+			buffer = {}
+		elseif trans[ buffer[1] ] ~= nil then -- first symbol yields translation
+			res = res .. trans[ buffer[1] ]
+			last = buffer[1]
+			buffer = {buffer[2]}
+		elseif buffer[1] == longK and prolong(last) ~= nil then -- valid prolonger
+			res = res .. trans[prolong(last)]
+			buffer, last = {buffer[2]}, 0
+		elseif buffer[1] == isolator then -- consume isolator
+			buffer = {buffer[2]}
+		else
+			if lastcnd then res = res .. lastcnd end -- add last consonant in raw form
+
+			-- this code allows for proper conversion of katakana's prolongation dash to hiragana
+			t1 = revTrans[ buffer[1] ]
+			if t1 then -- symbol is standalone hiragana
+				last = string.sub(t1, -1)
+				lastsym = buffer[1]
+			elseif lastsym then -- attempt to merge symbol with previous symbol
+				t1 = revTrans[ lastsym .. buffer[1] ]
+				if t1 then -- symbol is a valid non-standalone hiragana compound
+					last = string.sub(t1, -1)
+				else -- symbol is an invalid non-standalone hiragana compound
+					last = nil
+				end
+				lastsym = nil -- erase last valid symbol
+			else
+				last, lastsym = 0, nil
+			end
+
+			res = res .. buffer[1]
+			buffer = {buffer[2]}
+		end
+	end
+
+	if #buffer == 1 then -- remaining symbol
+		if trans[ buffer[1] ] ~= nil then
+			res = res .. trans[ buffer[1] ]
+		elseif buffer[1] == longK and prolong(last) ~= nil then
+			res = res .. trans[prolong(last)]
+		elseif buffer[1] ~= isolator then
+			res = res .. buffer[1]
+		end
+	end
+
+	if not raw then
+		tex.print(res)
+	else
+		return res -- for internal use
+	end
+end
+
+-- latin or hiragana to katakana
+function toKatakana(input)
+	if input == '' then return end
+	local hiraganized = tokenize(toHiragana(input, true)) -- convert everything to hiragana
+
+	-- replace hiragana with katakana
+	for i, v in ipairs(hiraganized) do
+		if transK[v] ~= nil then
+			hiraganized[i] = transK[v]
+		end
+	end
+
+	-- insert prolonging symbols and prepare output
+	local prev, nxt, vowel, tprev, tnext, res = hiraganized[1], '', '', '', '', hiraganized[1]
+	local merge, toprolong = '', nil
+	for i = 2, #hiraganized do
+		nxt = hiraganized[i]
+
+		vowel = getWovelK(nxt)
+
+		if not toprolong then -- check prev for ending vowel
+			tprev = revTransK[prev]
+			if tprev then
+				tprev = revTrans[tprev]
+				if tprev then
+					toprolong = prolong(string.sub(tprev, -1))
+				end
+			end
+		end
+
+		if toprolong then -- check nxt for matching prolonger
+			if toprolong == vowel then
+				nxt = longK
+				toprolong = nil
+			elseif vowel then
+				toprolong = prolong(vowel)
+			else
+				toprolong = nil
+			end
+		end
+
+		-- try merging prev and nxt for a single token
+		tprev, tnext = revTransK[prev], revTransK[nxt]
+		if tprev and tnext then
+			merge = revTrans[tprev .. tnext]
+			if merge then
+				toprolong = prolong(string.sub(merge, -1))
+			end
+		end
+
+		res = res .. nxt
+		prev = nxt
+	end
+
+	tex.print(res)
+end


Property changes on: trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
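The prolongation rule in kanaparser.lua (`prolongRules` together with `prolong`) can be restated as a single lookup on the last letter of a romanized syllable. The following is a minimal standalone sketch with hypothetical names, intended only to make the rule easy to check by hand.

```lua
-- Vowel that prolongs a syllable ending in the given vowel:
-- o is prolonged by u and e by i; the other vowels repeat themselves.
local prolong_with = { a = 'a', i = 'i', u = 'u', e = 'i', o = 'u' }

-- Hypothetical helper: returns the prolonging vowel for a romanized
-- syllable, or nil when the syllable does not end in a vowel.
local function prolonging_vowel(syllable)
  return prolong_with[syllable:sub(-1)]
end

print(prolonging_vowel('mo')) -- mo is prolonged as mou
print(prolonging_vowel('ma')) -- ma is prolonged as maa
```

A syllable such as the standalone `n` yields nil, which is the case in which the katakana prolonging dash cannot be resolved to a vowel.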
Added: trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.tex
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.tex	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.tex	2018-06-19 20:49:48 UTC (rev 48052)
@@ -0,0 +1,25 @@
+% Kana Parser for LuaTeX
+% Author: Adam Zahumensky, FIT CVUT
+
+% initializer macro, use it before using any other macros in this package
+\def\parserInit{ \directlua{ dofile(kpse.find_file('kanaparser.lua', 'lua')) } }
+
+% supply a whitespace-separated list of syllables whose kana characters you'd like to toggle
+% list of supported alternatives ([] denotes default choice):
+% du : [づ], どぅ
+% ja : [じゃ], ぢゃ
+% ji : [じ], ぢ
+% jo : [じょ], ぢょ
+% ju : [じゅ], ぢゅ
+% we : [うぇ], ゑ
+% wo : [を], うぉ
+\def\toggleChars#1{\directlua{ toggleChars("\luatexluaescapestring{#1}") }}
+
+% convert all kana to latin
+\long\def\toLatin#1{\directlua{ toLatin("\luatexluaescapestring{#1}") }}
+
+% convert latin and katakana to hiragana
+\long\def\toHiragana#1{\directlua{ toHiragana("\luatexluaescapestring{#1}") }}
+
+% convert latin and hiragana to katakana
+\long\def\toKatakana#1{\directlua{ toKatakana("\luatexluaescapestring{#1}") }}
\ No newline at end of file


Property changes on: trunk/Master/texmf-dist/tex/luatex/kanaparser/kanaparser.tex
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
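The toggling behaviour documented in the comments of kanaparser.tex can be sketched in plain Lua as follows. This is a simplified illustration with hypothetical names (`alternatives`, `current`, `toggle`) and only two of the listed syllables. It assumes, as the comments state, that each syllable has exactly two representations.

```lua
-- Two representations per syllable; the first one is the default.
local alternatives = { ji = { 'じ', 'ぢ' }, we = { 'うぇ', 'ゑ' } }
local current = { ji = 'じ', we = 'うぇ' }

-- Flip every known syllable in a whitespace-separated list;
-- unknown syllables are silently ignored.
local function toggle(list)
  for syllable in list:gmatch('%S+') do
    local alt = alternatives[syllable]
    if alt then
      current[syllable] = (current[syllable] == alt[1]) and alt[2] or alt[1]
    end
  end
end

toggle('ji xx')
print(current.ji) -- now the alternative form
print(current.we) -- untouched, still the default
```

Because the operation is a toggle rather than a setter, calling it twice with the same syllable restores the default.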
Modified: trunk/Master/tlpkg/bin/tlpkg-ctan-check
===================================================================
--- trunk/Master/tlpkg/bin/tlpkg-ctan-check	2018-06-19 20:47:54 UTC (rev 48051)
+++ trunk/Master/tlpkg/bin/tlpkg-ctan-check	2018-06-19 20:49:48 UTC (rev 48052)
@@ -352,7 +352,8 @@
     jknapltx jkmath jlabels jlreq jmlr jneurosci jpsj jsclasses
     jslectureplanner jumplines junicode
     jura juraabbrev jurabib juramisc jurarsp js-misc jvlisting
-  kantlipsum karnaugh karnaugh-map karnaughmap kastrup kdgdocs kerkis kerntest
+  kanaparser kantlipsum karnaugh karnaugh-map karnaughmap kastrup
+    kdgdocs kerkis kerntest
     keycommand keyfloat keyreader keystroke keyval2e keyvaltable kix kixfont
     knitting knittingpattern knowledge knuth knuth-lib knuth-local
     koma-moderncvclassic koma-script koma-script-examples koma-script-sfs

Modified: trunk/Master/tlpkg/libexec/ctan2tds
===================================================================
--- trunk/Master/tlpkg/libexec/ctan2tds	2018-06-19 20:47:54 UTC (rev 48051)
+++ trunk/Master/tlpkg/libexec/ctan2tds	2018-06-19 20:49:48 UTC (rev 48052)
@@ -1703,6 +1703,7 @@
  'jadetex',     '\.ltx|\.def|\.tex|\.ini|\.sty|\.fd',
  'js-misc',     '(cassette|idverb|js-misc|schild|sperr|xfig)\.tex',
  'jslectureplanner', '\.lps|' . $standardtex,
+ 'kanaparser',	'kanaparser.(tex|lua)$',
  'karnaugh',    'kvmacros.tex',
  'kastrup',     'binhex.tex|' . $standardtex,
  'keystroke',   'keystroke_.*|\.sty',

Modified: trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc
===================================================================
--- trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc	2018-06-19 20:47:54 UTC (rev 48051)
+++ trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc	2018-06-19 20:49:48 UTC (rev 48052)
@@ -15,6 +15,7 @@
 depend enigma
 depend fontloader-luaotfload
 depend interpreter
+depend kanaparser
 depend lua-visual-debug
 depend lua2dox
 depend luacode

Added: trunk/Master/tlpkg/tlpsrc/kanaparser.tlpsrc
===================================================================

