texlive[55158] Master: lua-uni-algos (15may20)

commits+karl at tug.org commits+karl at tug.org
Fri May 15 23:13:44 CEST 2020


Revision: 55158
          http://tug.org/svn/texlive?view=revision&revision=55158
Author:   karl
Date:     2020-05-15 23:13:44 +0200 (Fri, 15 May 2020)
Log Message:
-----------
lua-uni-algos (15may20)

Modified Paths:
--------------
    trunk/Master/tlpkg/bin/tlpkg-ctan-check
    trunk/Master/tlpkg/libexec/ctan2tds
    trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc

Added Paths:
-----------
    trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/
    trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/README.md
    trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.pdf
    trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.tex
    trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/
    trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-algos.lua
    trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-case.lua
    trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-graphemes.lua
    trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-normalize.lua
    trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-parse.lua
    trunk/Master/tlpkg/tlpsrc/lua-uni-algos.tlpsrc

Added: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/README.md
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/README.md	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/README.md	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,30 @@
+# The lua-uni-algos Package
+
+Version: v0.1
+
+Date: 2020-05-14
+
+Author: Marcel Krüger
+
+License: LPPL v1.3c
+
+A collection of small Lua modules implementing some of the most generic Unicode algorithms for use with LuaTeX.
+This package tries to reduce duplicated work by collecting a set of small utilities which can be useful for many LuaTeX packages dealing with Unicode strings.
+There is no user-level functionality provided.
+
+Additional Unicode algorithms will be added in the future. If you need a specific algorithm, feel free to open an issue on GitHub or send me an e-mail.
+
+
+## Requirements
+
+Given that this package provides Lua modules, it is only useful in Lua(HB)TeX.
+Additionally, it expects an up-to-date version of the `unicode-data` package to be present.
+
+
+## Support
+If you found a bug, please open an [issue on GitHub](https://github.com/zauguin/lua-uni-algos/issues) or contact me by mail at <tex at 2krueger.de>.
+
+## Installation
+
+In most cases it is best to use an official release provided by your TeX distribution.
+If you want to install an experimental build directly from the repository, use `l3build install`.


Property changes on: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/README.md
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.pdf
===================================================================
(Binary files differ)

Index: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.pdf
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.pdf	2020-05-15 21:09:50 UTC (rev 55157)
+++ trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.pdf	2020-05-15 21:13:44 UTC (rev 55158)

Property changes on: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.pdf
___________________________________________________________________
Added: svn:mime-type
## -0,0 +1 ##
+application/pdf
\ No newline at end of property
Added: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.tex
===================================================================
--- trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.tex	                        (rev 0)
+++ trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.tex	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,171 @@
+\documentclass{article}
+\usepackage{doc, shortvrb, metalogo, hyperref, fontspec}
+% \setmainfont{Noto Serif}
+% \setmonofont{FreeMono}
+\title{Unicode algorithms for Lua\TeX}
+\author{Marcel Krüger\thanks{E-Mail: \href{mailto:tex at 2krueger.de}{\nolinkurl{tex at 2krueger.de}}}}
+\MakeShortVerb\|
+\newcommand\pkg{\texttt}
+\begin{document}
+\maketitle
+Dealing with general Unicode-encoded data comes with many challenges because it has to respect the individual concerns of many different scripts and languages. The Unicode Consortium maintains several useful algorithms which can make this task much easier.
+
+\pkg{lua-uni-algos} tries to make the most fundamental algorithms available for authors of Lua-based packages to aid in handling Unicode data.
+
+Currently this package implements:
+\begin{description}
+  \item[Unicode normalization] Normalize a given Lua string into any of the normalization forms NFC, NFD, NFKC, or NFKD as specified in the Unicode standard, section 2.12.
+  \item[Case folding] Fold Unicode codepoints into a form which eliminates all case distinctions. This can be used for case-independent matching of strings. Not to be confused with case mapping which maps all characters to lower/upper/titlecase: In contrast to case mapping, case folding is mostly locale independent but does not give results which should be shown to users.
+  \item[Grapheme cluster segmentation] Identify a grapheme cluster, a unit of text which is perceived as a single character by typical users, according to the rules in UAX \#29, section 3.
+\end{description}
+\section{Normalization}
+Unicode normalization is handled by the Lua module |lua-uni-normalize|.
+You can either load it directly with
+\begin{verbatim}
+local normalize = require'lua-uni-normalize'
+\end{verbatim}
+or if you need access to all implemented algorithms you can use
+\begin{verbatim}
+local uni_algos = require'lua-uni-algos'
+local normalize = uni_algos.normalize
+\end{verbatim}
+
+Then, four functions are available: |normalize.NFC|, |normalize.NFD|, |normalize.NFKC|, and |normalize.NFKD|.
+If you do not know which of these you need, you should probably use |normalize.NFC|. All functions are used in the same way:
+\begin{verbatim}
+local str = "Äpfel…"
+print("Original:", str)
+print("NFC:", normalize.NFC(str))
+print("NFD:", normalize.NFD(str))
+print("NFKC:", normalize.NFKC(str))
+print("NFKD:", normalize.NFKD(str))
+\end{verbatim}
+This results in
+\begin{verbatim}
+Original:	Äpfel…
+NFC:	Äpfel…
+NFD:	Äpfel…
+NFKC:	Äpfel...
+NFKD:	Äpfel...
+\end{verbatim}
+(This example is shown in Latin Modern Mono, which has the (for this purpose) very useful property of not handling combining characters very well.
+In a well-behaved font, the |...C| and |...D| lines should look the same.)
+
+\section{Case folding}
+For case folding load the Lua module |lua-uni-case|.
+You can either load it directly with
+\begin{verbatim}
+local uni_case = require'lua-uni-case'
+\end{verbatim}
+or if you need access to all implemented algorithms you can use
+\begin{verbatim}
+local uni_algos = require'lua-uni-algos'
+local uni_case = uni_algos.case
+\end{verbatim}
+
+The main function is |uni_case.casefold(str, full, special)|. It accepts three parameters: a Lua string |str| to be case folded, a boolean |full| specifying whether the number of codepoints is allowed to change in the process (this should normally be set to |true|), and a boolean |special| which enables special handling for Turkic languages (in most cases, this should be set to |false|).
+The function returns the case-folded string:
+\begin{verbatim}
+local str = "Straße…"
+print("Original:", str)
+print("Case folded (full=false):", uni_case.casefold(str, false, false))
+print("Case folded (full=true):", uni_case.casefold(str, true, false))
+\end{verbatim}
+This results in
+
+\noindent\begingroup
+  \ttfamily
+  \directlua{
+    local uni_case = require'lua-uni-case'
+    local str = "Straße…"
+    tex.sprint("Original:", str, '\\\\')
+    tex.sprint("Case folded (full=false):", uni_case.casefold(str, false, false), '\\\\')
+    tex.sprint("Case folded (full=true):", uni_case.casefold(str, true, false), '\\\\')
+  }\par
+\endgroup
+
+In most cases, you will want to normalize the string after casefolding.
+
+For cases where you want to casefold something which is not given as a Lua string, you can use the function |uni_case.casefold_lookup(cp, full, special)|. Instead of a string, it accepts a codepoint as its first parameter and returns a table of codepoints. A string can be casefolded by replacing every codepoint with the sequence of codepoints returned by |uni_case.casefold_lookup|. If |casefold_lookup| returns |false| or |nil|, the codepoint should not be changed.
+
+\section{Grapheme clusters}
+Grapheme clusters are handled by the Lua module |lua-uni-graphemes|.
+You can either load it directly with
+\begin{verbatim}
+local graphemes = require'lua-uni-graphemes'
+\end{verbatim}
+or if you need access to all implemented algorithms you can use
+\begin{verbatim}
+local uni_algos = require'lua-uni-algos'
+local graphemes = uni_algos.graphemes
+\end{verbatim}
+
+Sometimes we want to look at a single character of a string, but identifying what a character is is not that easy in Unicode. A simple example is the character from the normalization section: ``Ä''.
+Its NFD form is certainly a single character, but it is encoded using two codepoints: U+0041 (LATIN CAPITAL LETTER A) and U+0308 (COMBINING DIAERESIS). Another example is the Tamil letter Ni, which is encoded as U+0BA8 (TAMIL LETTER NA) followed by U+0BBF (TAMIL VOWEL SIGN I). Still, it can be useful to identify such characters, e.g.\ for letterspacing or lettrines.
+
+There are two main interfaces for this: One iterator for iterating over grapheme clusters and one direct interface to the underlying state machine:
+
+\begin{verbatim}
+for final, first, grapheme in graphemes.graphemes'Äpfel' do
+  print(grapheme)
+end
+\end{verbatim}
+% \begin{verbatim}
+% for final, first, grapheme in graphemes.graphemes'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞' do
+%   print(grapheme)
+% end
+% \end{verbatim}
+
+\noindent\begingroup
+  \ttfamily
+  \directlua{
+    local graphemes = require'lua-uni-graphemes'
+    for final, first, grapheme in graphemes.graphemes'Äpfel' do
+      tex.sprint(grapheme, '\string\\\string\\')
+    end
+  }\par
+\endgroup
+
+The more powerful state machine interface |graphemes.read_codepoint| takes two parameters: A new codepoint and a state.
+At the beginning, the state can be omitted.
+For every codepoint in your input, call the function with the new codepoint and the last state. Then there are two return values: The first one is a boolean telling you if the current codepoint is the beginning of a new cluster, the second is a new state you have to pass with the next codepoint.
+
+So e.g.\ to find cluster boundaries in the Unicode codepoint sequence U+0041 U+0308 U+0BA8 U+0BBF you could use
+
+\begin{verbatim}
+local graphemes = require'lua-uni-graphemes'
+local new_cluster, state
+new_cluster, state = graphemes.read_codepoint(0x0041, state)
+print(new_cluster)
+new_cluster, state = graphemes.read_codepoint(0x0308, state)
+print(new_cluster)
+new_cluster, state = graphemes.read_codepoint(0x0BA8, state)
+print(new_cluster)
+new_cluster, state = graphemes.read_codepoint(0x0BBF, state)
+print(new_cluster)
+\end{verbatim}
+  
+\noindent resulting in
+
+\noindent\begingroup
+  \ttfamily
+  \directlua{
+    local graphemes = require'lua-uni-graphemes'
+    local new_cluster, state
+    new_cluster, state = graphemes.read_codepoint(0x0041, state)
+    tex.sprint(tostring(new_cluster), '\string\\\string\\')
+    new_cluster, state = graphemes.read_codepoint(0x0308, state)
+    tex.sprint(tostring(new_cluster), '\string\\\string\\')
+    new_cluster, state = graphemes.read_codepoint(0x0BA8, state)
+    tex.sprint(tostring(new_cluster), '\string\\\string\\')
+    new_cluster, state = graphemes.read_codepoint(0x0BBF, state)
+    tex.sprint(tostring(new_cluster), '\string\\\string\\')
+  }\par
+\endgroup
+
+\vskip-\baselineskip
+\noindent meaning the first and third codepoint start a new cluster.
+
+Do not try to interpret the |state|; it has no defined values and might change at any point.
+
+\end{document}


Property changes on: trunk/Master/texmf-dist/doc/luatex/lua-uni-algos/lua-uni-algos.tex
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
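
The difference between the NFC and NFD lines in the documentation's normalization example can be checked without the package. The following is only a minimal sketch, assuming a Lua 5.3 interpreter (as embedded in LuaTeX) for the `utf8` library and `\u{...}` escapes; the variable names are illustrative and not part of lua-uni-algos.

```lua
-- Composed and decomposed spellings of the same perceived character.
-- (Illustrative names; nothing here comes from the package itself.)
local composed   = "\u{00C4}"   -- U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
local decomposed = "A\u{0308}"  -- U+0041 followed by U+0308 COMBINING DIAERESIS

-- Codepoint count and byte count differ between the two spellings.
print(utf8.len(composed), #composed)      -- 1 codepoint, 2 bytes
print(utf8.len(decomposed), #decomposed)  -- 2 codepoints, 3 bytes

-- A plain string comparison sees two different strings,
-- which is why normalizing before comparing is useful.
print(composed == decomposed)             -- false
```

Normalizing both strings to the same form (for example with `normalize.NFC`) before comparing is what makes them compare equal.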
Added: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-algos.lua
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-algos.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-algos.lua	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,20 @@
+-- lua-uni-algos.lua
+-- Copyright 2020 Marcel Krüger
+--
+-- This work may be distributed and/or modified under the
+-- conditions of the LaTeX Project Public License, either version 1.3
+-- of this license or (at your option) any later version.
+-- The latest version of this license is in
+--   http://www.latex-project.org/lppl.txt
+-- and version 1.3 or later is part of all distributions of LaTeX
+-- version 2005/12/01 or later.
+--
+-- This work has the LPPL maintenance status `maintained'.
+-- 
+-- The Current Maintainer of this work is Marcel Krüger
+
+return {
+  case = require'lua-uni-case',
+  graphemes = require'lua-uni-graphemes',
+  normalize = require'lua-uni-normalize',
+}


Property changes on: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-algos.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-case.lua
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-case.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-case.lua	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,67 @@
+-- lua-uni-case.lua
+-- Copyright 2020 Marcel Krüger
+--
+-- This work may be distributed and/or modified under the
+-- conditions of the LaTeX Project Public License, either version 1.3
+-- of this license or (at your option) any later version.
+-- The latest version of this license is in
+--   http://www.latex-project.org/lppl.txt
+-- and version 1.3 or later is part of all distributions of LaTeX
+-- version 2005/12/01 or later.
+--
+-- This work has the LPPL maintenance status `maintained'.
+-- 
+-- The Current Maintainer of this work is Marcel Krüger
+
+local unpack = table.unpack
+local move = table.move
+local codes = utf8.codes
+local utf8char = utf8.char
+
+local empty = {}
+local result = {}
+
+local casefold, casefold_lookup do
+  local p = require'lua-uni-parse'
+  local l = lpeg or require'lpeg'
+
+  local data = p.parse_file('CaseFolding', l.Cf(
+      l.Ct(l.Cg(l.Ct'', 'C') * l.Cg(l.Ct'', 'F') * l.Cg(l.Ct'', 'S') * l.Cg(l.Ct'', 'T'))
+    * (l.Cg(p.fields(p.codepoint, l.C(1), l.Ct(p.codepoint * (' ' * p.codepoint)^0), true)) + p.eol)^0
+    * -1
+  , function(t, base, class, mapping)
+    t[class][base] = mapping
+    return t
+  end))
+  local C, F, S, T = data.C, data.F, data.S, data.T
+  data = nil
+
+  function casefold_lookup(c, full, special)
+    return (special and T[c]) or C[c] or (full and F or S)[c]
+  end
+  function casefold(s, full, special)
+    local first = special and T or empty
+    local second = C
+    local third = full and F or S
+    local result = result
+    for i = #result, 1, -1 do result[i] = nil end
+    local i = 1
+    for _, c in codes(s) do
+      local datum = first[c] or second[c] or third[c]
+      if datum then
+        local l = #datum
+        move(datum, 1, l, i, result)
+        i = i + l
+      else
+        result[i] = c
+        i = i + 1
+      end
+    end
+    return utf8char(unpack(result))
+  end
+end
+
+return {
+  casefold = casefold,
+  casefold_lookup = casefold_lookup,
+}


Property changes on: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-case.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
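
The codepoint-wise replacement that `casefold` performs, and that the documentation describes for `casefold_lookup`, can be sketched in isolation. This is a hedged illustration only: `stub`, `lookup` and `fold_codepoints` are hypothetical names, and the two-entry table stands in for the data the module reads from CaseFolding.txt.

```lua
-- Hypothetical stand-in for uni_case.casefold_lookup with just two mappings
-- (S -> s is a class C entry, sharp s -> ss is a class F entry).
local stub = {
  [0x0053] = { 0x0073 },
  [0x00DF] = { 0x0073, 0x0073 },
}
local function lookup(cp) return stub[cp] end

-- Replace every codepoint by its mapping; keep it unchanged if there is none.
local function fold_codepoints(cps, lookup_fn)
  local out = {}
  for _, cp in ipairs(cps) do
    local mapped = lookup_fn(cp)
    if mapped then
      table.move(mapped, 1, #mapped, #out + 1, out)
    else
      out[#out + 1] = cp  -- nil/false result: codepoint stays as it is
    end
  end
  return out
end

-- Collect the codepoints of the input string, then fold them.
local cps = {}
for _, cp in utf8.codes("Stra\u{00DF}e") do cps[#cps + 1] = cp end
local folded = utf8.char(table.unpack(fold_codepoints(cps, lookup)))
print(folded)  -- strasse
```

With the real module, passing `full = false` would select the single-codepoint S class instead of F, so the length of the sequence would not change.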
Added: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-graphemes.lua
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-graphemes.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-graphemes.lua	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,168 @@
+-- lua-uni-graphemes.lua
+-- Copyright 2020 Marcel Krüger
+--
+-- This work may be distributed and/or modified under the
+-- conditions of the LaTeX Project Public License, either version 1.3
+-- of this license or (at your option) any later version.
+-- The latest version of this license is in
+--   http://www.latex-project.org/lppl.txt
+-- and version 1.3 or later is part of all distributions of LaTeX
+-- version 2005/12/01 or later.
+--
+-- This work has the LPPL maintenance status `maintained'.
+-- 
+-- The Current Maintainer of this work is Marcel Krüger
+
+local property do
+  local p = require'lua-uni-parse'
+  local l = lpeg or require'lpeg'
+
+  property = p.parse_file('emoji-data',
+    l.Cg(p.fields(p.codepoint_range, l.C'Extended_Pictographic')) + p.ignore_line,
+    p.multiset)
+
+  property = p.parse_file('GraphemeBreakProperty', l.Cf(
+      l.Carg(1)
+    * (l.Cg(p.fields(p.codepoint_range, l.C(l.R('az', 'AZ', '__')^1))) + p.ignore_line)^0
+    * -1, p.multiset),
+    nil,
+    property)
+  if not property then
+    error[[Break Property matching failed]]
+  end
+end
+
+local controls = { CR = true, LF = true, Control = true, }
+local precore_lookup = {
+  Prepend = "PRECORE",
+  L = "L",
+  V = "V",
+  LV = "V",
+  LVT = "T",
+  T = "T",
+  Regional_Indicator = "RI",
+  Extended_Pictographic = "POST_PICTO",
+}
+local l_lookup = {
+  L = "L",
+  V = "V",
+  LV = "V",
+  LVT = "T",
+}
+local postcore_map = { Extend = true, ZWJ = true, SpacingMark = true, }
+local state_map state_map = {
+  START = function(prop)
+    if prop == 'CR' then
+      return 'CR', true
+    end
+    if prop == 'LF' or prop == 'Control' then
+      return 'START', true
+    end
+    return state_map.PRECORE(prop), true
+  end,
+  PRECORE = function(prop)
+    if controls[prop] then
+      return state_map.START(prop)
+    end
+    return precore_lookup[prop] or 'POSTCORE'
+  end,
+  POSTCORE = function(prop)
+    if postcore_map[prop] then
+      return 'POSTCORE'
+    end
+    return state_map.START(prop)
+  end,
+  RI = function(prop)
+    if prop == 'Regional_Indicator' then
+      return 'POSTCORE'
+    end
+    return state_map.POSTCORE(prop)
+  end,
+  PRE_PICTO = function(prop)
+    if prop == "Extended_Pictographic" then
+      return "POST_PICTO"
+    end
+    return state_map.POSTCORE(prop)
+  end,
+  POST_PICTO = function(prop)
+    if prop == "Extend" then
+      return "POST_PICTO"
+    end
+    if prop == "ZWJ" then
+      return "PRE_PICTO"
+    end
+    return state_map.POSTCORE(prop)
+  end,
+  L = function(prop)
+    local nextstate = l_lookup[prop]
+    if nextstate then
+      return nextstate
+    end
+    return state_map.POSTCORE(prop)
+  end,
+  V = function(prop)
+    if prop == 'V' then
+      return 'V'
+    end
+    return state_map.T(prop)
+  end,
+  T = function(prop)
+    if prop == 'T' then
+      return 'T'
+    end
+    return state_map.POSTCORE(prop)
+  end,
+  CR = function(prop)
+    if prop == 'LF' then
+      return 'START'
+    else
+      return state_map.START(prop)
+    end
+  end,
+}
+
+-- The value of "state" is considered internal and should not be relied upon.
+-- Just pass it to the function as is or pass nil. `nil` should only be passed when the passed codepoint starts a new cluster
+local function read_codepoint(cp, state)
+  local new_cluster
+  state, new_cluster = state_map[state or 'START'](property[cp])
+  return new_cluster, state
+end
+
+-- A Lua iterator for strings -- Only reporting the beginning of every grapheme cluster
+local function graphemes_start(str)
+  local nextcode, str, i = utf8.codes(str)
+  local state = "START"
+  return function()
+    local new_cluster, code
+    repeat
+      i, code = nextcode(str, i)
+      if not i then return end
+      new_cluster, state = read_codepoint(code, state)
+    until new_cluster
+    return i, code
+  end
+end
+-- A more useful iterator: returns the byte range of the grapheme cluster in reverse order followed by a string with the cluster
+local function graphemes(str)
+  local iter = graphemes_start(str)
+  return function(_, cur)
+    if cur == #str then return end
+    local new = iter()
+    if not new then return #str, cur + 1, str:sub(cur + 1) end
+    return new - 1, cur + 1, str:sub(cur + 1, new - 1)
+  end, nil, iter() - 1
+end
+return {
+  read_codepoint = read_codepoint,
+  graphemes_start = graphemes_start,
+  graphemes = graphemes,
+}
+--[[
+for i, c in graphemes_start'äbcdef' do
+  print(i, utf8.char(c))
+end
+for i, j, s in graphemes'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞' do
+  print(j, i, s)
+end
+]]


Property changes on: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-graphemes.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
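
The calling convention of `read_codepoint` (pass the previous state back in, get a boundary flag and a new state out) can be illustrated with a toy state machine. The sketch below does not implement UAX #29: `read_codepoint_stub` is a hypothetical stand-in that treats only U+0300..U+036F as extending characters, and `cluster_starts` is an illustrative helper name.

```lua
-- Hypothetical, heavily simplified stand-in for graphemes.read_codepoint.
-- Only combining diacritical marks (U+0300..U+036F) extend a cluster here.
local function read_codepoint_stub(cp, state)
  local extend = cp >= 0x0300 and cp <= 0x036F
  local new_cluster = (state == nil) or not extend
  return new_cluster, "SEEN"  -- the state value is opaque to the caller
end

-- Drive the state machine over a string and collect the byte offsets
-- at which a new cluster begins.
local function cluster_starts(str, read_codepoint)
  local starts, state, new_cluster = {}, nil, nil
  for pos, cp in utf8.codes(str) do
    new_cluster, state = read_codepoint(cp, state)
    if new_cluster then starts[#starts + 1] = pos end
  end
  return starts
end

local starts = cluster_starts("A\u{0308}pfel", read_codepoint_stub)
print(table.concat(starts, " "))  -- 1 4 5 6 7
```

Passing the module's own `read_codepoint` instead of the stub should work the same way, since the driver never inspects the state.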
Added: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-normalize.lua
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-normalize.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-normalize.lua	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,269 @@
+-- lua-uni-normalize.lua
+-- Copyright 2020 Marcel Krüger
+--
+-- This work may be distributed and/or modified under the
+-- conditions of the LaTeX Project Public License, either version 1.3
+-- of this license or (at your option) any later version.
+-- The latest version of this license is in
+--   http://www.latex-project.org/lppl.txt
+-- and version 1.3 or later is part of all distributions of LaTeX
+-- version 2005/12/01 or later.
+--
+-- This work has the LPPL maintenance status `maintained'.
+-- 
+-- The Current Maintainer of this work is Marcel Krüger
+
+-- Provide all four kinds of Unicode normalization
+
+local newtable = lua.newtable
+local move = table.move
+local char = utf8.char
+local codes = utf8.codes
+local unpack = table.unpack
+
+kpse.set_program_name'kpsewhich'
+local ccc, composition_mapping, decomposition_mapping, compatibility_mapping do
+  local function doubleset(ts, key, v1, kind, v2)
+    ts[1][key] = v1
+    ts[3][key] = v2
+    if not kind then
+      ts[2][key] = v2
+    end
+    return ts
+  end
+  local p = require'lua-uni-parse'
+  local l = lpeg
+  local Cnil = l.Cc(nil)
+  local letter = lpeg.R('AZ', 'az')
+  ccc, decomposition_mapping, compatibility_mapping
+                    = unpack(p.parse_file('UnicodeData', l.Cf(
+    l.Ct(l.Ct'' * l.Ct'' * l.Ct'') * (
+      l.Cg(p.fields(p.codepoint,
+                    p.ignore_field,
+                    p.ignore_field,
+                    '0' * Cnil + p.number,
+                    p.ignore_field,
+                    ('<' * l.C(letter^1) * '> ' + Cnil)
+                  * l.Ct(p.codepoint * (' ' * p.codepoint)^0)^-1,
+                    p.ignore_line)) + p.eol
+    )^0 * -1, doubleset)))
+
+  composition_mapping = {}
+  local composition_exclusions = {      [0x00958] = true, [0x00959] = true,
+    [0x0095A] = true, [0x0095B] = true, [0x0095C] = true, [0x0095D] = true,
+    [0x0095E] = true, [0x0095F] = true, [0x009DC] = true, [0x009DD] = true,
+    [0x009DF] = true, [0x00A33] = true, [0x00A36] = true, [0x00A59] = true,
+    [0x00A5A] = true, [0x00A5B] = true, [0x00A5E] = true, [0x00B5C] = true,
+    [0x00B5D] = true, [0x00F43] = true, [0x00F4D] = true, [0x00F52] = true,
+    [0x00F57] = true, [0x00F5C] = true, [0x00F69] = true, [0x00F76] = true,
+    [0x00F78] = true, [0x00F93] = true, [0x00F9D] = true, [0x00FA2] = true,
+    [0x00FA7] = true, [0x00FAC] = true, [0x00FB9] = true, [0x0FB1D] = true,
+    [0x0FB1F] = true, [0x0FB2A] = true, [0x0FB2B] = true, [0x0FB2C] = true,
+    [0x0FB2D] = true, [0x0FB2E] = true, [0x0FB2F] = true, [0x0FB30] = true,
+    [0x0FB31] = true, [0x0FB32] = true, [0x0FB33] = true, [0x0FB34] = true,
+    [0x0FB35] = true, [0x0FB36] = true, [0x0FB38] = true, [0x0FB39] = true,
+    [0x0FB3A] = true, [0x0FB3B] = true, [0x0FB3C] = true, [0x0FB3E] = true,
+    [0x0FB40] = true, [0x0FB41] = true, [0x0FB43] = true, [0x0FB44] = true,
+    [0x0FB46] = true, [0x0FB47] = true, [0x0FB48] = true, [0x0FB49] = true,
+    [0x0FB4A] = true, [0x0FB4B] = true, [0x0FB4C] = true, [0x0FB4D] = true,
+    [0x0FB4E] = true,
+    [0x02ADC] = true, [0x1D15E] = true, [0x1D15F] = true, [0x1D160] = true,
+    [0x1D161] = true, [0x1D162] = true, [0x1D163] = true, [0x1D164] = true,
+    [0x1D1BB] = true, [0x1D1BC] = true, [0x1D1BD] = true, [0x1D1BE] = true,
+    [0x1D1BF] = true, [0x1D1C0] = true,
+  }
+
+  for cp, decomp in next, decomposition_mapping do
+    if #decomp > 1 and not (composition_exclusions[cp] or ccc[decomp[1]]) then
+      local mapping = composition_mapping[decomp[1]]
+      if not mapping then
+        mapping = {}
+        composition_mapping[decomp[1]] = mapping
+      end
+      mapping[decomp[2]] = cp
+    end
+  end
+
+  local function fixup_decomp(decomp)
+    local first = decomp[1]
+    local first_decomp = decomposition_mapping[first]
+    if not first_decomp then return false end
+    if fixup_decomp(first_decomp) then
+      print('nested', first)
+    end
+    move(decomp, 2, #decomp, #first_decomp + 1)
+    move(first_decomp, 1, #first_decomp, 1, decomp)
+    return true
+  end
+  -- Fixup stage
+  for cp, decomp in next, decomposition_mapping do
+    if fixup_decomp(decomp) then
+      -- print(':(', cp)
+    end
+  end
+
+  -- NFKD edition
+  local DEBUG = false
+  local function fixup_decomp(orig, decomp)
+    local work
+    local shared = decomposition_mapping[orig] == decomp
+    local j = 0
+    for i = 1, #decomp do
+      local cp = decomp[i]
+      local cp_decomp = compatibility_mapping[cp]
+      if cp_decomp then
+        if shared then
+          local old = decomp
+          decomp = {}
+          compatibility_mapping[orig] = decomp
+          move(old, 1, #old, 1, decomp)
+        end
+        decomp[i] = cp_decomp
+        j = j + #cp_decomp
+        work = true
+      else
+        j = j + 1
+      end
+    end
+    if not work then return decomp end
+    for i = #decomp, 1, -1 do
+      local v = decomp[i]
+      if type(v) == 'number' then
+        decomp[j] = v
+        j = j - 1
+      else
+        local count = #v
+        move(v, 1, count, j - count + 1, decomp)
+        j = j - count
+      end
+    end
+    assert(j == 0)
+    return decomp
+  end
+  -- Fixup stage
+  for cp, decomp in next, compatibility_mapping do
+    fixup_decomp(cp, decomp)
+  end
+end
+
+local function ccc_reorder(codepoints, i, j, k)
+  if k >= j then return end
+  local first = codepoints[k]
+  local first_ccc = ccc[first]
+  if not first_ccc then
+    return ccc_reorder(codepoints, k+1, j, k+1)
+  end
+  local new_pos = k
+  local cur_ccc
+  repeat
+    new_pos = new_pos + 1
+    if new_pos > j then break end
+    local cur = codepoints[new_pos]
+    cur_ccc = ccc[cur]
+  until (not cur_ccc) or (cur_ccc >= first_ccc)
+  new_pos = new_pos - 1
+  if new_pos == k then
+    return ccc_reorder(codepoints, i, j, k+1)
+  end
+  move(codepoints, k+1, new_pos, k)
+  codepoints[new_pos] = first
+  return ccc_reorder(codepoints, i, j, k == i and i or k-1)
+end
+local function to_nfd_table(s, decomposition_mapping)
+  local new_codepoints = newtable(#s, 0)
+  local j = 1
+  for _, c in codes(s) do
+    local decomposed = decomposition_mapping[c]
+    if decomposed then
+      move(decomposed, 1, #decomposed, j, new_codepoints)
+      j = j + #decomposed
+    elseif c >= 0xAC00 and c <= 0xD7A3 then
+      c = c - 0xAC00
+      local tIndex = c % 28
+      c = c // 28
+      local vIndex = c % 21
+      local lIndex = c // 21
+      new_codepoints[j] = 0x1100 + lIndex
+      new_codepoints[j+1] = 0x1161 + vIndex
+      if tIndex == 0 then
+        j = j + 2
+      else
+        new_codepoints[j+2] = 0x11A7 + tIndex
+        j = j + 3
+      end
+    else
+      new_codepoints[j] = c
+      j = j + 1
+    end
+  end
+  ccc_reorder(new_codepoints, 1, #new_codepoints, 1)
+  return new_codepoints
+end
+local function to_nfd(s)
+  return char(unpack(to_nfd_table(s, decomposition_mapping)))
+end
+local function to_nfkd(s)
+  return char(unpack(to_nfd_table(s, compatibility_mapping)))
+end
+local function to_nfc_generic(s, decomposition_mapping)
+  local codepoints = to_nfd_table(s, decomposition_mapping)
+  local starter, lookup, last_ccc, lvt
+  local j = 1
+  for i, c in ipairs(codepoints) do
+    local cur_ccc = ccc[c]
+    if lookup then
+      if (cur_ccc == nil) == (cur_ccc == last_ccc) then -- unblocked
+        local composed = lookup[c]
+        if composed then
+          codepoints[starter] = composed
+          lookup = composition_mapping[composed]
+          goto CONTINUE
+        end
+      end
+    elseif lvt then
+      if lvt == 1 then
+        if c >= 0x1161 and c <= 0x1175 then -- V jamo range: VBase .. VBase + 20
+          lvt = 2
+          codepoints[starter] = ((codepoints[starter] - 0x1100) * 21 + c - 0x1161) * 28 + 0xAC00
+          goto CONTINUE
+        end
+      else -- if lvt == 2 then
+        if c >= 0x11A8 and c <= 0x11C2 then
+          lvt = nil
+          codepoints[starter] = codepoints[starter] + c - 0x11A7
+          goto CONTINUE
+        end
+      end
+    end
+    codepoints[j] = c
+    lvt = nil
+    if not cur_ccc then
+      starter = j
+      lookup = composition_mapping[c]
+      if not lookup and c >= 0x1100 and c <= 0x1112 then
+        lvt = 1
+      end
+    end
+    j = j + 1
+    last_ccc = cur_ccc
+    ::CONTINUE::
+  end
+  for i = j,#codepoints do codepoints[i] = nil end
+  return char(unpack(codepoints))
+end
+local function to_nfc(s)
+  return to_nfc_generic(s, decomposition_mapping)
+end
+local function to_nfkc(s)
+  return to_nfc_generic(s, compatibility_mapping)
+end
+
+return {
+  NFD = to_nfd,
+  NFC = to_nfc,
+  NFKD = to_nfkd,
+  NFKC = to_nfkc,
+}
+-- print(require'inspect'{to_nfd{0x1E0A}, to_nfc{0x1E0A}})
+
+-- print(require'inspect'{to_nfd{0x1100, 0x1100, 0x1161, 0x11A8}, to_nfc{0x1100, 0x1100, 0x1161, 0x11A8}})


Property changes on: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-normalize.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
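
The Hangul branch in `to_nfd_table` and the `lvt` handling in `to_nfc_generic` follow the arithmetic decomposition and composition of Hangul syllables from the Unicode Standard. A standalone sketch of that arithmetic, assuming Lua 5.3 integer division; the constant names follow the standard, while `decompose_hangul` and `compose_hangul` are illustrative names that do not exist in the module:

```lua
-- Constants as named in the Unicode Standard's Hangul syllable algorithm.
local SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
local VCount, TCount = 21, 28

-- Split a precomposed syllable into L, V and (optionally) T jamo.
local function decompose_hangul(s)
  local index = s - SBase
  local t = index % TCount
  index = index // TCount
  local l, v = index // VCount, index % VCount
  if t == 0 then
    return LBase + l, VBase + v          -- LV syllable: no trailing consonant
  end
  return LBase + l, VBase + v, TBase + t -- LVT syllable
end

-- Inverse operation: combine L, V and an optional T jamo.
local function compose_hangul(l, v, t)
  local lv = ((l - LBase) * VCount + (v - VBase)) * TCount + SBase
  return t and lv + (t - TBase) or lv
end

print(string.format("%04X %04X %04X", decompose_hangul(0xD55C)))     -- 1112 1161 11AB
print(string.format("%04X", compose_hangul(0x1112, 0x1161, 0x11AB))) -- D55C
```

Because `VCount` is 21, the valid V jamo run from U+1161 to U+1175, and because `TCount` is 28, the T jamo run from U+11A8 to U+11C2.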
Added: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-parse.lua
===================================================================
--- trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-parse.lua	                        (rev 0)
+++ trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-parse.lua	2020-05-15 21:13:44 UTC (rev 55158)
@@ -0,0 +1,71 @@
+-- lua-uni-parse.lua
+-- Copyright 2020 Marcel Krüger
+--
+-- This work may be distributed and/or modified under the
+-- conditions of the LaTeX Project Public License, either version 1.3
+-- of this license or (at your option) any later version.
+-- The latest version of this license is in
+--   http://www.latex-project.org/lppl.txt
+-- and version 1.3 or later is part of all distributions of LaTeX
+-- version 2005/12/01 or later.
+--
+-- This work has the LPPL maintenance status `maintained'.
+-- 
+-- The Current Maintainer of this work is Marcel Krüger
+
+-- Just a simple helper module to make UCD parsing more readable
+
+local lpeg = lpeg or require'lpeg'
+local R = lpeg.R
+local tonumber = tonumber
+
+local codepoint = lpeg.R('09', 'AF')^4 / function(c) return tonumber(c, 16) end
+local sep = lpeg.P' '^0 * ';' * lpeg.P' '^0
+local codepoint_range = codepoint * ('..' * codepoint + lpeg.Cc(false))
+local ignore_line = (1-lpeg.P'\n')^0 * '\n'
+local eol = lpeg.S' \t'^0 * ('#' * ignore_line + '\n')
+local ignored = (1-lpeg.S';#\n')^0
+local number = lpeg.R'09'^1 / tonumber
+
+local function fields(first, ...)
+  if first == ignore_line then
+    assert(select('#', ...) == 0)
+    return ignore_line
+  end
+  local tail = select('#', ...) == 0 and eol or sep * fields(...)
+  return first * tail
+end
+
+local function multiset(table, key1, key2, value)
+  for key = key1,(key2 or key1) do
+    table[key] = value
+  end
+  return table
+end
+
+local function parse_uni_file(filename, patt, func, ...)
+  if func then
+    return parse_uni_file(filename, lpeg.Cf(lpeg.Ct'' * patt^0 * -1, func), nil, ...)
+  end
+  local resolved = kpse.find_file(filename .. '.txt')
+  if not resolved then
+    error(string.format("Unable to find Unicode datafile %q", filename))
+  end
+  local f = assert(io.open(resolved))
+  local data = f:read'*a'
+  f:close()
+  return lpeg.match(patt, data, 1, ...)
+end
+
+return {
+  codepoint = codepoint,
+  codepoint_range = codepoint_range,
+  ignore_line = ignore_line,
+  ignore_field = ignored,
+  eol = eol,
+  sep = sep,
+  number = number,
+  fields = fields,
+  multiset = multiset,
+  parse_file = parse_uni_file,
+}


Property changes on: trunk/Master/texmf-dist/tex/luatex/lua-uni-algos/lua-uni-parse.lua
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
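
The multiset helper in lua-uni-parse.lua has the shape of a folding function for lpeg.Cf: it takes the accumulator table first, then the two captures produced by codepoint_range, then a value, and returns the table. Because codepoint_range captures false when a line names a single code point rather than a range, the loop bound `key2 or key1` collapses to one iteration in that case. The sketch below copies the helper (with the parameter renamed to tbl so that it does not shadow the standard table library) and calls it directly with hand-written arguments; the property values "Lu" and "Ll" are illustrative and not read from any data file.

```lua
-- Copy of the helper from the diff above; only the parameter name differs.
local function multiset(tbl, key1, key2, value)
  for key = key1, (key2 or key1) do
    tbl[key] = value
  end
  return tbl
end

local t = {}
-- A range such as "0041..0043" yields two code points.
multiset(t, 0x41, 0x43, "Lu")
-- A single code point such as "0061" yields the code point and false.
multiset(t, 0x61, false, "Ll")

for _, cp in ipairs{0x41, 0x42, 0x43, 0x61} do
  print(string.format("U+%04X %s", cp, t[cp]))
end
```

Returning the table from every call is what lets the same function be chained as an accumulator, one parsed line at a time.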
Modified: trunk/Master/tlpkg/bin/tlpkg-ctan-check
===================================================================
--- trunk/Master/tlpkg/bin/tlpkg-ctan-check	2020-05-15 21:09:50 UTC (rev 55157)
+++ trunk/Master/tlpkg/bin/tlpkg-ctan-check	2020-05-15 21:13:44 UTC (rev 55158)
@@ -455,7 +455,8 @@
     lstaddons lstbayes lstfiracode lt3graph ltablex ltabptch ltb2bib
     ltxcmds ltxdockit ltxfileinfo ltxguidex ltximg
     ltxkeys ltxmisc ltxnew ltxtools
-    lua-alt-getopt lua-check-hyphen lua-uca lua-ul lua-visual-debug
+    lua-alt-getopt lua-check-hyphen lua-uca lua-ul
+    lua-uni-algos lua-visual-debug
     luabibentry luabidi luacode luacolor luahyphenrules
     luaimageembed luaindex luainputenc luaintro lualatex-doc lualatex-doc-de
     lualatex-math lualatex-truncate lualibs

Modified: trunk/Master/tlpkg/libexec/ctan2tds
===================================================================
--- trunk/Master/tlpkg/libexec/ctan2tds	2020-05-15 21:09:50 UTC (rev 55157)
+++ trunk/Master/tlpkg/libexec/ctan2tds	2020-05-15 21:13:44 UTC (rev 55158)
@@ -1939,6 +1939,7 @@
  'logic',       'milstd\.tex|' . $standardtex,
  'lollipop',	'\.ini|lollipop\.tex|lollipop-.*tex|lollipop.tex',
  'ltxkeys',     '\.sty|\.clo|\.ldf|\.cls|\.def|\.fd$',  # not cfg
+ 'lua-uni-algos',    '\.lua|' . $standardtex,
  'lua-check-hyphen', '\.lua|' . $standardtex,
  'lua-ul',	'\.lua|' . $standardtex,
  'lua-visual-debug', '\.lua|' . $standardtex,

Modified: trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc
===================================================================
--- trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc	2020-05-15 21:09:50 UTC (rev 55157)
+++ trunk/Master/tlpkg/tlpsrc/collection-luatex.tlpsrc	2020-05-15 21:13:44 UTC (rev 55158)
@@ -20,6 +20,7 @@
 depend interpreter
 depend kanaparser
 depend lua-uca
+depend lua-uni-algos
 depend lua-ul
 depend lua-visual-debug
 depend luacode

Added: trunk/Master/tlpkg/tlpsrc/lua-uni-algos.tlpsrc
===================================================================

