texlive[57257] Master: uninormalize (29dec20)

commits+karl at tug.org
Tue Dec 29 23:01:59 CET 2020


Revision: 57257
          http://tug.org/svn/texlive?view=revision&revision=57257
Author:   karl
Date:     2020-12-29 23:01:58 +0100 (Tue, 29 Dec 2020)
Log Message:
-----------
uninormalize (29dec20)

Modified Paths:
--------------
    trunk/Master/tlpkg/bin/tlpkg-ctan-check
    trunk/Master/tlpkg/libexec/ctan2tds
    trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc

Added Paths:
-----------
    trunk/Master/texmf-dist/doc/lualatex/uninormalize/
    trunk/Master/texmf-dist/doc/lualatex/uninormalize/README.md
    trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.pdf
    trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.tex
    trunk/Master/texmf-dist/tex/lualatex/uninormalize/
    trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalization.lua
    trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize-names.lua
    trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize.lua
    trunk/Master/texmf-dist/tex/lualatex/uninormalize/uninormalize.sty
    trunk/Master/tlpkg/tlpsrc/uninormalize.tlpsrc

Added: trunk/Master/texmf-dist/doc/lualatex/uninormalize/README.md
===================================================================
--- trunk/Master/texmf-dist/doc/lualatex/uninormalize/README.md	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/lualatex/uninormalize/README.md	2020-12-29 22:01:58 UTC (rev 57257)
@@ -0,0 +1,64 @@
+# The `uninormalize` package
+
+The purpose of this package is to provide Unicode normalization for LuaLaTeX. It is based on Arthur Reutenauer's
+[code for GSOC 2008](https://code.google.com/p/google-summer-of-code-2008-tex/downloads/list), which has been adapted slightly to work with
+the current `Luaotfload`. For more information, see [this question on TeX.sx](http://tex.stackexchange.com/q/229044/7712).
+
+## What does that mean?
+
+Citing [Wikipedia](https://en.wikipedia.org/wiki/Unicode_equivalence):
+
+> Unicode equivalence is the specification by the Unicode character encoding
+> standard that some sequences of code points represent essentially the same
+> character. This feature was introduced in the standard to allow compatibility
+> with preexisting standard character sets, which often included similar or
+> identical characters.
+>
+> Unicode provides two such notions, canonical equivalence and compatibility.
+> Code point sequences that are defined as canonically equivalent are assumed to
+> have the same appearance and meaning when printed or displayed. For example,
+> the code point `U+006E` (the Latin lowercase "n") followed by `U+0303` (the
+> combining tilde) is defined by Unicode to be canonically equivalent to the
+> single code point `U+00F1` (the lowercase letter "ñ" of the Spanish alphabet). 
+
+## Basic usage
+
+
+    \documentclass{article}
+    \usepackage{fontspec}
+    \usepackage[czech]{babel}
+    \setmainfont{Linux Libertine O}
+    \usepackage{uninormalize}
+    \begin{document}
+    
+    Some tests:
+    \begin{itemize}
+      \item combined letter ᾳ %GREEK SMALL LETTER ALPHA (U+03B1) 
+                              % + COMBINING GREEK YPOGEGRAMMENI 
+                              % (U+0345)
+      \item normal letter ᾳ   % GREEK SMALL LETTER ALPHA WITH 
+                              %YPOGEGRAMMENI (U+1FB3)
+    \end{itemize}
+    
+    Some more combined and normal letters: 
+    óóōōöö
+    
+    Linux Libertine does support some combined chars: \parbox{4em}{příliš}
+
+    Using the \verb|^^^^| syntax: ^^^^0061^^^^0301 ^^^^0041^^^^0301
+    \end{document}
+
+## Package options
+
+This package has three options:
+
+
+- **buffer**  -- normalize the processed document at the moment its
+  source file is read, before processing by \TeX\ starts. This method
+  seems to work better than the next one.
+- **nodes** -- normalize LuaTeX nodes. Normalization happens after the full processing by \TeX.
+- **debug** -- print debug messages to the terminal output
+
+Both the **buffer** and **nodes** options are enabled by default; you can disable either of them using:
+
+    \usepackage[nodes=false,buffer=false]{uninormalize}

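The canonical equivalence that the package relies on can be reproduced outside of LuaTeX with Python's standard `unicodedata` module (an illustration only, not part of the package):

```python
import unicodedata

# Canonical equivalence: "n" + COMBINING TILDE composes to U+00F1 under NFC.
decomposed = "n\u0303"
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00f1"

# The alpha + ypogegrammeni pair from the README example behaves the same way:
alpha_pair = "\u03b1\u0345"  # U+03B1 + U+0345
assert unicodedata.normalize("NFC", alpha_pair) == "\u1fb3"  # precomposed form
```

This is exactly the mapping the package applies to the input so that fonts lacking combining-mark support still render the precomposed glyph.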

Property changes on: trunk/Master/texmf-dist/doc/lualatex/uninormalize/README.md
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.pdf
===================================================================
(Binary files differ)

Index: trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.pdf
===================================================================
--- trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.pdf	2020-12-29 22:00:30 UTC (rev 57256)
+++ trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.pdf	2020-12-29 22:01:58 UTC (rev 57257)

Property changes on: trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.pdf
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+application/pdf
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.tex
===================================================================
--- trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.tex	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.tex	2020-12-29 22:01:58 UTC (rev 57257)
@@ -0,0 +1,49 @@
+\documentclass{article}
+\usepackage{url}
+\ifx\HCode\undefined
+\usepackage{fontspec}
+\setmainfont{Linux Libertine O}[Renderer = Harfbuzz]
+\setmonofont{DejaVu Sans Mono}[Scale=MatchLowercase]
+\fi
+\usepackage{microtype,hyperref}
+\usepackage[nodes]{uninormalize}
+\usepackage{markdown}
+\def\tightlist{}
+\begin{document}
+\title{The \texttt{uninormalize} package}
+\author{Michal Hoftich\footnote{\url{michal.h21@gmail.com}} \and Arthur Reutenauer\footnote{\url{arthur.reutenauer@normalesup.org}}}
+\date{Version 0.1\\28/12/2020}
+\maketitle
+
+\markdownInput[hybrid]{README.md}
+
+\subsection{Example results}
+
+\begin{itemize}
+  \item combined letter ᾳ %GREEK SMALL LETTER ALPHA (U+03B1) + COMBINING GREEK YPOGEGRAMMENI (U+0345)
+  \item normal letter ᾳ% GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI (U+1FB3)
+\end{itemize}
+
+Some more combined and normal letters: 
+óóōōöö
+
+Linux Libertine does support some combined chars: \parbox{4em}{příliš}
+
+Using the \verb|^^^^| syntax: ^^^^0061^^^^0301 ^^^^0041^^^^0301
+
+\subsection{License}
+
+Copyright: 2020 Michal Hoftich
+
+This work may be distributed and/or modified under the conditions of the 
+\textit{\LaTeX\ Project Public License}, either version 1.3 of this license or (at your option)
+any later version. The latest version of this license is in
+\url{http://www.latex-project.org/lppl.txt} and version 1.3 or later is part of all
+distributions of \LaTeX\ version 2005/12/01 or later.
+
+This work has the LPPL maintenance status \textit{maintained}.
+
+The Current Maintainer of this work is Michal Hoftich.
+
+\end{document}
+


Property changes on: trunk/Master/texmf-dist/doc/lualatex/uninormalize/uninormalize-doc.tex
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalization.lua
===================================================================
--- trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalization.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalization.lua	2020-12-29 22:01:58 UTC (rev 57257)
@@ -0,0 +1,351 @@
+-- char-def now contains all necessary fields, no need for a custom file
+if not characters then
+  require "char-def"
+end
+
+if not unicode then require('unicode') end
+unicode.conformance = unicode.conformance or { }
+
+unicharacters = unicharacters or {}
+uni = unicode.utf8
+unidata = characters.data
+
+function printf(s, ...) print(string.format(s, ...)) end
+-- function debug(s, ...) io.write("DEBUG: ", string.format(s, ...), "\n") end
+function warn(s, ...) io.write("Warning: ", string.format(s, ...), "\n") end
+
+function md5sum(any) return md5.hex(md5.sum(any)) end
+
+-- Rehash the character data
+unicharacters.combinee = { }
+unicharacters.context = unicharacters.context or { }
+local charu = unicode.utf8.char
+
+function unicharacters.context.rehash2()
+  for ucode, udata in pairs(unidata) -- *not* ipairs :-)
+  do
+    local sp = udata.specials
+    if sp then
+      if sp[1] == 'char' then
+        -- local ucode = udata.unicodeslot
+        local entry = { combinee = sp[2], combining = sp[3], combined = ucode }
+        if not unicharacters.combinee[sp[2]]
+        then unicharacters.combinee[sp[2]] = { } end
+        local n = #unicharacters.combinee[sp[2]]
+        unicharacters.combinee[sp[2]][n+1] = entry
+      end
+    end
+    -- copy context's combining field to combclass field
+    -- this field was in the custom copy of the char-def.lua that we no longer use
+    udata.combclass = udata.combining
+  end
+end
+
+
+unicharacters.context.rehash2()
+combdata = unicharacters.combinee
+
+--[[ function unicode.conformance.is_hangul(ucode)
+  return ucode >= 0xAC00 and ucode <= 0xD7A3
+end ]] -- Make it local for the moment
+local function is_hangul(char)
+  return char >= 0xAC00 and char <= 0xD7A3
+end
+
+local function is_jamo(char)
+  if char < 0x1100 then return false
+  elseif char < 0x1160 then return 'choseong'
+  elseif char < 0x11A8 then return 'jungseong'
+  elseif char < 0x11FA then return 'jongseong'
+  else return false
+  end
+end
+
+local function decompose(ucode, compat) -- if compat then compatibility
+  local invbuf = { }
+  local sp = unidata[ucode].specials
+  if not sp
+  then return { ucode } else
+    if compat then compat = (sp[1] == 'compat') else compat = false end
+    while sp[1] == 'char' or compat do
+      head, tail = sp[2], sp[3]
+      if not tail then invbuf[#invbuf + 1] = head break end -- singleton
+      invbuf[#invbuf + 1] = tail
+        sp = unidata[head].specials
+        if not sp then invbuf[#invbuf + 1] = head sp = { } end
+      -- end -- not unidata[head]
+    end -- while sp[1] == 'char' or compat
+  end -- not sp
+
+  local seq = { }
+  for i = #invbuf, 1, -1
+  do seq[#seq + 1] = invbuf[i]
+  end
+  return seq
+end
+
+local function canon(seq) -- Canonical reordering
+  if #seq < 3 then return seq end
+  local c1, c2, buf
+  -- I'd never thought I'd implement an actual bubble sort some day ;-)
+  for k = #seq - 1, 1, -1 do
+    for i = 2, k do -- was k - 1!  Argh!
+      c1 = unidata[seq[i]].combclass
+      c2 = unidata[seq[i+1]].combclass
+      if c1 and c2 then
+        if c1 > c2 then
+          buf = seq[i]
+          seq[i] = seq[i+1]
+          seq[i+1] = buf
+        end
+      end
+    end
+  end
+  return seq
+end
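The canonical reordering that `canon` implements (sorting marks by combining class) can be checked against Python's `unicodedata` (an illustration, not package code):

```python
import unicodedata

# q + COMBINING DOT ABOVE (ccc 230) + COMBINING DOT BELOW (ccc 220):
# canonical ordering sorts marks by combining class, so the dot below
# (lower class) moves in front of the dot above under normalization.
s = "q\u0307\u0323"
assert unicodedata.normalize("NFD", s) == "q\u0323\u0307"
assert unicodedata.combining("\u0307") == 230  # DOT ABOVE
assert unicodedata.combining("\u0323") == 220  # DOT BELOW
```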
+
+if not math.div then -- from l-math.lua
+  function math.div(n, m)
+    return math.floor(n/m)
+  end
+end
+
+local SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
+local LCount, VCount, TCount = 19, 21, 28
+local NCount = VCount * TCount
+local SCount = TCount * NCount
+
+local function decompose_hangul(ucode) -- assumes input is really a Hangul
+  local SIndex = ucode - SBase
+  local L = LBase + math.div(SIndex, NCount)
+  local V = VBase + math.div((SIndex % NCount), TCount)
+  local T = TBase + SIndex % TCount
+  if T == TBase then T = nil end
+  return { L, V, T }
+end
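The Hangul arithmetic above follows the standard algorithm from UAX #15; a minimal Python sketch of the same computation, checked against `unicodedata`:

```python
import unicodedata

SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
VCount, TCount = 21, 28
NCount = VCount * TCount

def decompose_hangul(ucode):
    # Split a precomposed syllable into leading consonant, vowel,
    # and optional trailing consonant (same arithmetic as above).
    s_index = ucode - SBase
    L = LBase + s_index // NCount
    V = VBase + (s_index % NCount) // TCount
    T = TBase + s_index % TCount
    return [L, V] if T == TBase else [L, V, T]

# U+D55C HAN decomposes to HIEUH + A + NIEUN jamo.
assert decompose_hangul(0xD55C) == [0x1112, 0x1161, 0x11AB]
assert unicodedata.normalize("NFD", "\ud55c") == "\u1112\u1161\u11ab"
```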
+
+-- To NFK?D.
+function toNF_D_or_KD(unistring, compat)
+  local nfd, seq = { }, { }
+  for uchar in uni.gmatch(unistring, '.') do
+    local ucode = uni.byte(uchar)
+    if is_hangul(ucode) then
+      seq = decompose_hangul(ucode)
+      for  _, c in ipairs(seq)
+      do nfd[#nfd + 1] = c
+      end
+      seq = { }
+    elseif not unidata[ucode]
+    then nfd[#nfd + 1] = ucode else
+      local ccc = unidata[ucode].combclass
+      if not ccc or ccc == 0 then
+        seq = canon(seq)
+        for _, c in ipairs(seq) do nfd[#nfd + 1] = c end
+        seq = decompose(ucode, compat)
+      else seq[#seq + 1] = ucode
+      end -- not ccc or ccc == 0
+    end -- if is_hangul(ucode) / elseif not unidata[ucode]
+  end -- for uchar in uni.gmatch(unistring, ".")
+
+  if #seq > 0 then
+    seq = canon(seq)
+    for _, c in ipairs(seq) do nfd[#nfd + 1] = c end
+  end
+
+  local nfdstr = ""
+  for _, chr in ipairs(nfd)
+  do nfdstr = string.format("%s%s", nfdstr, uni.char(chr)) end
+  return nfdstr, nfd
+end
+
+function unicode.conformance.toNFD(unistring)
+  return toNF_D_or_KD(unistring, false)
+end
+
+function unicode.conformance.toNFKD(unistring)
+  return toNF_D_or_KD(unistring, true)
+end
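The difference between the two entry points — NFD applies only canonical decompositions, while NFKD also applies compatibility ones — can be illustrated with `unicodedata`:

```python
import unicodedata

# The "fi" ligature U+FB01 has only a *compatibility* decomposition,
# so NFD leaves it alone while NFKD splits it into plain "f" + "i".
lig = "\ufb01"
assert unicodedata.normalize("NFD", lig) == "\ufb01"  # unchanged
assert unicodedata.normalize("NFKD", lig) == "fi"     # compatibility split
```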
+
+local function compose(seq)
+  local base = seq[1]
+  if not combdata[base] then return seq else
+    local i = 2
+    while i <= #seq do -- can I play with 'i' in a for loop?
+      local cbng = seq[i]
+      local cccprev
+      if unidata[seq[i-1]] then cccprev = unidata[seq[i-1]].combclass end
+      if not cccprev then cccprev = -1 end
+      if unidata[cbng].combclass > cccprev then
+        if not combdata[base] then return seq else
+          for _, cbdata in ipairs(combdata[base]) do
+            if cbdata.combining == cbng then
+              seq[1] = cbdata.combined
+              base = seq[1]
+              for k = i, #seq - 1
+              do seq[k] = seq[k+1]
+              end -- for k = i, #seq - 1
+              seq[#seq] = nil
+              i = i - 1
+            end -- if cbdata.combining == cbng
+          end -- for _, cbdata in ipairs(combdata[base])
+        end -- if unidata[cbng.combclass > cccprev
+      end -- if not combdata[base]
+    i = i + 1
+    end -- while i <= #seq
+  end -- if not combdata[base]
+  return seq
+end
+
+-- To NFC from NFD.
+-- Does not yet take all the composition exclusions in account
+-- (missing types 1 and 2 as defined by UAX #15 X6)
+function unicode.conformance.toNFC_fromNFD(nfd)
+  local nfc = { }
+  local seq = { }
+  for uchar in uni.gmatch(nfd, '.') do
+    local ucode = uni.byte(uchar)
+    if not unidata[ucode]
+    then nfc[#nfc + 1] = ucode else
+      local cb = unidata[ucode].combclass
+      if not cb or (cb == 0) then
+        -- if seq ~= { } then -- Dubious ...
+        if #seq > 0 then
+          seq = compose(seq) -- There was a check for #seq == 1 here
+          for i = 1, #seq do nfc[#nfc + 1] = seq[i] end
+        end -- #seq > 0
+        seq = { ucode } 
+      else seq[#seq + 1] = ucode --[[ Maybe check if seq is not empty ... ]]
+      end -- not cb or cb == 0
+    end
+  end
+
+  seq = compose(seq)
+  for i = 1, #seq do nfc[#nfc + 1] = seq[i] end
+
+  local nfcstr = ""
+  for _, chr in ipairs(nfc)
+  do nfcstr = string.format("%s%s", nfcstr, uni.char(chr)) end
+  return nfcstr, nfc
+end
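The decompose-then-recompose round trip that `toNFC_fromNFD` performs is idempotent on already-normalized text; a quick check with `unicodedata`, using the `příliš` example from the documentation:

```python
import unicodedata

# Decompose, then recompose: NFC(NFD(s)) == s for any NFC string.
word = "p\u0159\u00edli\u0161"  # "příliš" from the README example
nfd = unicodedata.normalize("NFD", word)
assert len(nfd) > len(word)                       # marks split out
assert unicodedata.normalize("NFC", nfd) == word  # recomposition restores it
```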
+
+local function cancompose(seq, compat)
+  local dec = { } -- new table to hold the decomposed sequence
+
+  local shift
+  if #seq >= 2 then -- let's do it the brutal way :-)
+    if is_jamo(seq[1]) == 'choseong' and
+       is_jamo(seq[2]) == 'jungseong' then
+      LIndex = seq[1] - LBase
+      VIndex = seq[2] - VBase
+      if #seq == 2 or is_jamo(seq[3]) ~= 'jongseong' then
+        TIndex = 0
+        shift = 1
+      else
+        TIndex = seq[3] - TBase
+        shift = 2
+      end
+      seq[1] = (LIndex * VCount + VIndex) * TCount + TIndex + SBase
+      for i = 2, #seq -- this shifts and shrinks the table at the same time
+      do seq[i] = seq[i + shift]
+      end
+    end
+  end
+
+  dec[1] = seq[1]
+  for i = 2, #seq do
+    local u = seq[i]
+    local sp = unidata[u].specials
+    if sp then
+      if compat then compat = (sp[1] == 'compat') else compat = false end
+      if (sp[1] == 'char') or compat then
+        for i = 2, #sp
+        do dec[#dec + 1] = sp[i]
+        end
+      end
+    else dec[#dec + 1] = u
+    end
+  end -- we have the fully decomposed sequence; now sort it
+
+  for i = #dec - 1, 2, -1 do -- bubble sort!
+    for j = 2, #dec - 1 do
+      local u = dec[j]
+      local ccc1 = unidata[u].combclass
+      local v = dec[j+1]
+      local ccc2 = unidata[v].combclass
+      if ccc1 > ccc2 then -- swap
+        dec[j+1] = u
+        dec[j] = v
+      end
+    end
+  end -- dec sorted; now recursively compose
+
+  local base, i, n = dec[1], 2, #dec
+  local cbd = combdata[base]
+  local incr_i = true
+  while i <= n do
+    local cbg = dec[i]
+    if cbd then
+      for _, cb in ipairs(cbd) do
+        if cb.combining == cbg then
+      -- NO :-) -- if cbd[cbg] then -- base and cbg combine; compose
+          dec[1] = cb.combined
+          base = dec[1]
+          cbd = combdata[base]
+          for j = i, n-1 -- shift table elements right of i
+          do dec[j] = dec[j+1] end
+          dec[n] = nil
+          n = n-1 -- table has shrunk by 1, and i doesn't grow
+          incr_i = false
+        end
+      end
+    end
+    if incr_i then i = i + 1
+    else incr_i = true end
+  end -- we're finally through! return
+  return dec
+end
+
+function toNF_C_or_KC(unistring, compat)
+  if unistring == "" then return "" end
+  local nfc, seq = "", { }
+  local start, space = true, ""
+  for uchar in uni.gmatch(unistring, '.') do
+    local ucode = uni.byte(uchar)
+    if start then space = ", " start = false end
+    if not unidata[ucode] then -- unknown to the UCD, will not compose
+      nfc = string.format("%s%s", nfc, uchar)
+    else
+      local ccc = unidata[ucode].combclass 
+      if not (ccc or is_jamo(ucode) == 'jongseong'
+              or is_jamo(ucode) == 'jungseong')
+         or ccc == 0 or (is_jamo(ucode) == 'choseong') then
+      -- and is actually good :-) -- Well, yes and no ;-)
+        if #seq == 0 then -- add ucode and go to next item of the loop
+          seq = { ucode }
+        else -- seq contains unicharacters, try and compose them
+          if #seq == 1 then nfc = string.format("%s%s", nfc, uni.char(seq[1]))
+          else dec = cancompose(seq, compat)
+            for _, c in ipairs(dec) -- add the whole sequence to nfc
+            do nfc = string.format("%s%s", nfc, uni.char(c)) end
+          end
+          seq = { ucode } -- don't forget to reinitialize seq with current char
+        end
+      else -- not ccc or ccc == 0 and is_choseong:
+           -- character is combining, add it to seq
+        seq[#seq + 1] = ucode
+      end
+    end
+  end
+  if #seq > 0 then dec = cancompose(seq, compat) end
+  for _, c in ipairs(dec)
+  do nfc = string.format("%s%s", nfc, uni.char(c)) end
+  return nfc
+end
+
+function unicode.conformance.toNFC(unistring)
+  return toNF_C_or_KC(unistring, false)
+end
+
+function unicode.conformance.toNFKC(unistring)
+  return toNF_C_or_KC(unistring, true)
+end


Property changes on: trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalization.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize-names.lua
===================================================================
--- trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize-names.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize-names.lua	2020-12-29 22:01:58 UTC (rev 57257)
@@ -0,0 +1,57 @@
+-- Unicode names
+
+if not characters then
+  require "char-def"
+end
+
+unicode = unicode or { }
+unicode.conformance = unicode.conformance or { }
+
+unidata = characters.data
+
+if not math.div then -- from l-math.lua
+  function math.div(n, m)
+    return math.floor(n/m)
+  end
+end
+
+local function is_hangul(char)
+  return char >= 0xAC00 and char <= 0xD7A3
+end
+
+local function is_han_character(char) -- from font-otf.lua (check)
+  return
+     (char>=0x04E00 and char<=0x09FFF) or
+     (char>=0x03400 and char<=0x04DFF) or
+     (char>=0x20000 and char<=0x2A6DF) or
+     (char>=0x0F900 and char<=0x0FAFF) or
+     (char>=0x2F800 and char<=0x2FA1F)
+end
+
+local SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
+local LCount, VCount, TCount = 19, 21, 28
+local NCount = VCount * TCount
+local SCount = LCount * NCount
+
+local JAMO_L_TABLE = { [0] = "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
+  "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H" }
+local JAMO_V_TABLE = { [0] = "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE",
+  "O", "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I" }
+local JAMO_T_TABLE = { [0] = "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L",
+  "LG", "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG",
+  "J", "C", "K", "T", "P", "H" }
+
+function unicode.conformance.name(char)
+  if is_hangul(char) then
+    local SIndex = char - SBase
+    local LIndex = math.div(SIndex, NCount)
+    local VIndex = math.div(SIndex % NCount, TCount)
+    local TIndex = SIndex % TCount
+    return string.format("HANGUL SYLLABLE %s%s%s", JAMO_L_TABLE[LIndex],
+      JAMO_V_TABLE[VIndex], JAMO_T_TABLE[TIndex])
+  elseif is_han_character(char)
+  then return string.format("CJK UNIFIED IDEOGRAPH-%04X", char)
+  elseif unidata[char] -- if unidata[char] exists, the name exists
+  then return unidata[char].description
+  end
+end
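Hangul syllables and CJK ideographs have algorithmically derived names rather than per-character UCD entries, which is why the function above computes them. The same derivation is visible in Python's `unicodedata`:

```python
import unicodedata

# Derived names: computed from the code point, exactly as above.
assert unicodedata.name("\ud55c") == "HANGUL SYLLABLE HAN"
assert unicodedata.name("\u4e2d") == "CJK UNIFIED IDEOGRAPH-4E2D"
```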


Property changes on: trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize-names.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize.lua
===================================================================
--- trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize.lua	2020-12-29 22:01:58 UTC (rev 57257)
@@ -0,0 +1,226 @@
+local M = {}
+require("unicode-normalize-names")
+require('unicode-normalization')
+local NFC = unicode.conformance.toNFC
+local char = unicode.utf8.char
+local gmatch = unicode.utf8.gmatch
+local name = unicode.conformance.name
+local byte = unicode.utf8.byte
+-- local unidata = unicharacters.data
+local length = unicode.utf8.len
+
+local glyph_id = node.id "glyph"
+
+M.debug = false
+
+-- for some reason variable number of arguments doesn't work
+local function debug_msg(a,b,c,d,e,f,g,h,i)
+  if M.debug then
+    local t = {a,b,c,d,e,f,g,h,i}
+    print("[uninormalize]", table.unpack(t))
+  end
+end
+
+local function make_hash (t) 
+  local y = {}
+  for _,v in ipairs(t) do 
+    y[v] = true
+  end
+  return y
+end
+
+local letter_categories = make_hash {"lu","ll","lt","lo","lm"}
+
+local mark_categories = make_hash {"mn","mc","me"}
+
+local function printchars(s)
+	local t = {}
+	for x in gmatch(s,".") do
+		t[#t+1] = name(byte(x))
+	end
+	debug_msg("characters",table.concat(t,":"))
+end
+
+local categories = {}
+
+
+local function get_category(charcode)
+  local charcode = charcode or ""
+  if categories[charcode] then
+    return categories[charcode] 
+  else
+    local unidatacode = unidata[charcode] or {}
+    local category = unidatacode.category
+    categories[charcode] = category
+    return category
+  end
+end
+
+-- get glyph char and category
+local function glyph_info(n)
+  local char = n.char
+  return char, get_category(char)
+end
+
+local function get_mark(n)
+  if n.id == glyph_id then
+    local character, cat = glyph_info(n)
+    if mark_categories[cat] then
+      return char(character)
+    end
+  end
+  return false
+end
+
+local function make_glyphs(head, nextn,s, lang, font, subtype) 
+  local g = function(a) 
+    local new_n = node.new(glyph_id, subtype)
+    new_n.lang = lang
+    new_n.font = font
+    new_n.char = byte(a)
+    return new_n
+  end
+  if length(s) == 1 then
+    return node.insert_before(head, nextn,g(s))
+  else
+    local t = {}
+    local first = true
+    for x in gmatch(s,".") do
+      debug_msg("multi letter",x)
+        head, newn = node.insert_before(head, nextn, g(x))
+    end
+    return head
+  end
+end
+
+local function normalize_marks(head, n)
+  local lang, font, subtype = n.lang, n.font, n.subtype
+  local text = {}
+  text[#text+1] = char(n.char)
+  local head, nextn = node.remove(head, n)
+  --local nextn = n.next
+  local info = get_mark(nextn)
+  while(info) do
+    text[#text+1] = info
+    head, nextn = node.remove(head,nextn)
+    info = get_mark(nextn)
+  end
+  local s = NFC(table.concat(text))
+  debug_msg("We've got mark: " .. s)
+  local new_n = node.new(glyph_id, subtype)
+  new_n.lang = lang
+  new_n.font = font
+  new_n.char = byte(s)
+  --head, new_n = node.insert_before(head, nextn, new_n)
+  -- head, new_n = node.insert_before(head, nextn, make_glyphs(s, lang, font, subtype))
+  head, new_n = make_glyphs(head, nextn, s, lang, font, subtype)
+  local t = {}
+  for x in node.traverse_id(glyph_id,head) do
+    t[#t+1] = char(x.char)
+  end
+  debug_msg("Variables ", table.concat(t,":"), table.concat(text,";"), char(byte(s)),length(s))
+  return head, nextn
+end
+
+local function normalize_glyphs(head, n)
+  --local charcode = n.char
+  --local category = get_category(charcode)
+  local charcode, category = glyph_info(n)
+  if letter_categories[category] then 
+    local nextn = n.next
+    if nextn and nextn.id == glyph_id then
+      --local nextchar = nextn.char
+      --local nextcat = get_category(nextchar)
+      local nextchar, nextcat = glyph_info(nextn)
+      if mark_categories[nextcat] then
+        return normalize_marks(head,n)
+      end
+    end
+  end
+  return head, n.next 
+end
+
+
+function M.nodes(head)
+	local t = {}
+	local text = false
+  local n = head
+	-- for n in node.traverse(head) do
+  while n do
+		if n.id == glyph_id then
+      local charcode = n.char
+			debug_msg("unicode name",name(charcode))
+			debug_msg("character category",get_category(charcode))
+			t[#t+1]= char(charcode)
+			text = true
+      head, n = normalize_glyphs(head, n)
+		else
+			if text then
+				local s = table.concat(t)
+				debug_msg("text chunk",s)
+				--printchars(NFC(s))
+				debug_msg("----------")
+			end
+			text = false
+			t = {}
+      n = n.next
+		end
+	end
+	return head
+end
+
+local unibytes = {}
+
+local function get_charcategory(s)
+  local s = s or ""
+  local b = unibytes[s] or byte(s) or 0
+  unibytes[s] = b
+  return get_category(b)
+end
+
+local function normalize_charmarks(t,i)
+  local c = {t[i]}
+  local i = i + 1
+  local s = get_charcategory(t[i])
+  while mark_categories[s] do
+    c[#c+1] = t[i]
+    i = i + 1
+    s = get_charcategory(t[i])
+  end
+  return NFC(table.concat(c)), i
+end
+
+local function normalize_char(t,i)
+  local ch = t[i]
+  local c = get_charcategory(ch)
+  if letter_categories[c] then
+    local nextc = get_charcategory(t[i+1])
+    if mark_categories[nextc] then
+      return normalize_charmarks(t,i)
+    end
+  end
+  return ch, i+1
+end
+
+function M.buffer(line)
+  local t = {}
+  local new_t = {}
+  -- we need to make a table with all unicode chars on the line
+  for x in gmatch(line,".") do
+    t[#t+1] = x
+  end
+  local i = 1
+  -- normalize next char
+  local c, i = normalize_char(t, i)
+  new_t[#new_t+1] = c
+  while t[i] do
+    c, i = normalize_char(t,i)
+    -- local  c = t[i]
+    -- i =  i + 1
+    new_t[#new_t+1] = c
+  end
+  return table.concat(new_t)
+end
+  
+
+return M
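The net effect of `M.buffer` — each input line is normalized to NFC before TeX ever sees it — can be sketched in Python (the function name is hypothetical, for illustration only):

```python
import unicodedata

def normalize_buffer(line):
    # Rough analogue of M.buffer: normalize one source line to NFC
    # before it is handed to the typesetting engine.
    return unicodedata.normalize("NFC", line)

assert normalize_buffer("o\u0301o\u0304") == "\u00f3\u014d"  # ó and ō composed
```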


Property changes on: trunk/Master/texmf-dist/tex/lualatex/uninormalize/unicode-normalize.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/tex/lualatex/uninormalize/uninormalize.sty
===================================================================
--- trunk/Master/texmf-dist/tex/lualatex/uninormalize/uninormalize.sty	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/lualatex/uninormalize/uninormalize.sty	2020-12-29 22:01:58 UTC (rev 57257)
@@ -0,0 +1,35 @@
+\ProvidesPackage{uninormalize}
+\RequirePackage{luatexbase}
+\RequirePackage{luacode}
+\RequirePackage{kvoptions}
+\DeclareBoolOption[true]{nodes}
+\DeclareBoolOption[true]{buffer}
+\DeclareBoolOption{debug}
+\ProcessKeyvalOptions*
+\ifuninormalize at nodes
+  \luaexec{processnodes=true}
+\fi
+\ifuninormalize at buffer
+  \luaexec{processbuffer=true}
+\fi
+\ifuninormalize at debug
+  \luaexec{uninormalize_debug = true}
+\fi
+\begin{luacode*}
+local normalize = require "unicode-normalize"
+if processnodes==true then
+  print "[uninormalize] process nodes on"
+  luatexbase.add_to_callback("pre_linebreak_filter",normalize.nodes, "normalize unicode")
+  luatexbase.add_to_callback("hpack_filter",normalize.nodes, "normalize unicode")
+end
+if processbuffer== true then
+  print "[uninormalize] process buffer on"
+  luatexbase.add_to_callback("process_input_buffer", normalize.buffer, "normalize unicode")
+end
+if uninormalize_debug then
+  normalize.debug = true
+end
+\end{luacode*}
+
+
+\endinput


Property changes on: trunk/Master/texmf-dist/tex/lualatex/uninormalize/uninormalize.sty
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Modified: trunk/Master/tlpkg/bin/tlpkg-ctan-check
===================================================================
--- trunk/Master/tlpkg/bin/tlpkg-ctan-check	2020-12-29 22:00:30 UTC (rev 57256)
+++ trunk/Master/tlpkg/bin/tlpkg-ctan-check	2020-12-29 22:01:58 UTC (rev 57257)
@@ -791,7 +791,7 @@
     unfonts-core unfonts-extra
     uni-wtal-ger uni-wtal-lin
     unicode-alphabets unicode-data unicode-bidi unicode-math
-    unifith uniquecounter unisugar
+    unifith uninormalize uniquecounter unisugar
     unitconv unitipa unitn-bimrep units unitsdef
     universa universalis univie-ling unizgklasa
     unravel unswcover

Modified: trunk/Master/tlpkg/libexec/ctan2tds
===================================================================
--- trunk/Master/tlpkg/libexec/ctan2tds	2020-12-29 22:00:30 UTC (rev 57256)
+++ trunk/Master/tlpkg/libexec/ctan2tds	2020-12-29 22:01:58 UTC (rev 57257)
@@ -2213,6 +2213,7 @@
  'underscore',  '^..[^s].*\.sty',       # not miscdoc.sty
  'undolabl',    '\.sty|[^c]\.cfg',      # omit ltxdoc.cfg, would be system-wide
  'unicode-alphabets',	'\..sv|' . $standardtex,
+ 'uninormalize',	'\.lua|' . $standardtex,
  'unitn-bimrep','\.jpg|' . $standardtex,
  'univie-ling',	'univie.*logo.*.pdf|' . $standardtex,
  'universa',    '\.fd|uni\.sty',        # not unidoc.sty

Modified: trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc
===================================================================
--- trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc	2020-12-29 22:00:30 UTC (rev 57256)
+++ trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc	2020-12-29 22:01:58 UTC (rev 57257)
@@ -58,3 +58,4 @@
 depend spelling
 depend stricttex
 depend typewriter
+depend uninormalize

Added: trunk/Master/tlpkg/tlpsrc/uninormalize.tlpsrc
===================================================================

