[luatex] glyph to Unicode code point

Thu Mar 29 10:01:25 CEST 2018

On 3/29/2018 1:29 AM, maxwell wrote:
> I'm using a version of the code in the answer here
>      https://tex.stackexchange.com/questions/228312/
> to convert a LuaTeX node structure into a list of characters in that 
> structure.
> 
> My code at present traverses a Node, recursing on nodes that are hlists 
> (or vlists); when it comes to a node which is a glyph (node.id() = 29 in 
> the table on p99 of the LuaTeX reference manual, version 1.0.4 of April 
> 2017), it converts the node.char to what is (hopefully) a Unicode code 
> point:
>      unicode.utf8.char(node.char)
> 
> I say hopefully, because this conversion relies on the glyph being 
> assigned a slot in a particular font that has the same number as the 
> Unicode code point for that character.  This works ok for many simple 
> fonts, particularly some Latin fonts.  It also works for some simple 
> Arabic script fonts which encode the various glyph variants (like 
> initial, final, medial and isolated) as being the Arabic Presentation 
> Form code points (which can easily be converted to code points in the 
> normal Arabic block).  Unfortunately, it does not work for glyphs that 
> are assigned to other slots in a font, corresponding to a Private Use 
> Area code point in Unicode, or sometimes not corresponding to a valid 
> Unicode code point at all.
> 
> The cmap table in an OpenType font (or a TrueType font) provides a 
> mapping between Unicode code points and glyph slots.  Somewhere under 
> the hood, LuaTeX is presumably using this table to choose an appropriate 
> glyph.  It seems like it should be possible to do the reverse mapping, 
> i.e. to map from a particular glyph to the corresponding Unicode code 
> point.  (In the case of ligatures, this will be a one-to-many mapping.) 
> The LuaTeX reference appears to discuss this on p68-69; if I have a 
> character's hash (from a font table), I can apparently extract the 
> 'tounicode' value, which IIUC is the Unicode code point I'm looking for.
> 
> My problem is that I don't know how to go from the glyph's slot number 
> (which is apparently what node.char is giving me, for nodes that 
> represent a glyph) to the character hash in the font table, or even how 
> to find the font table from the font number.  The node.char elements are 
> numbers like 1583 (which appears to be 0x62F, and makes sense as the 
> code point for Arabic Dal) and 983159 (0xF0077, which is a private-use 
> code point with no standard character, but might be a glyph in some font).
> 
> How do I go from these node.char numbers and a node.font number (a 
> number like 29, which apparently points to a font) to a character hash 
> in the font's table?  I'm guessing I need a function that maps from the 
> node.font to a font table, and then a function that maps the number from 
> node.char plus a font table to a character hash.  Something like
>     if node.id == 29 then
>         UnicodeChar = unicode.utf8.char(
>             Node2CharHash(node.char, Node2FontTable(node.font)).tounicode)
>     end
> where Node2CharHash and Node2FontTable are the functions I'm looking 
> for, if my guess is right.  (My syntax is probably wrong, I'm used to 
> Python...)

If you use ConTeXt or LaTeX ... this is a starting point:

\starttext

\setbox0\hbox{something effe}

\directlua {
     for n in node.traverse(tex.box[0].list) do
         if n.id == node.id("glyph") then
             print(
                 fonts.hashes.identifiers[n.font].properties.filename,
                 fonts.hashes.identifiers[n.font].characters[n.char].tounicode
             )
         end
     end
}

\stoptext
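
The LuaTeX manual describes the tounicode field as the character's Unicode equivalent(s) in UTF-16BE hexadecimal form, so a ligature would carry several code units and a character outside the BMP a surrogate pair. Assuming the value printed above really is such a hex string (depending on the font loader it may instead be nil or another type, so check first), a minimal sketch of decoding it into code points could look like this. The function name tounicode_to_codepoints is made up for illustration and is not part of LuaTeX or ConTeXt. It is meant to live in a separate .lua file, since "--" comments do not survive inside \directlua.

```lua
-- Hypothetical helper (not a LuaTeX or ConTeXt function): decode a
-- hexadecimal UTF-16BE string such as "00660069" into a list of
-- Unicode code points. Assumes the input has four hex digits per unit.
local function tounicode_to_codepoints(hex)
    local codepoints = {}
    local pending_high = nil  -- high surrogate waiting for its low half
    for unit_hex in hex:gmatch("%x%x%x%x") do
        local unit = tonumber(unit_hex, 16)
        if unit >= 0xD800 and unit <= 0xDBFF then
            -- first half of a surrogate pair: remember it, emit nothing yet
            pending_high = unit
        elseif unit >= 0xDC00 and unit <= 0xDFFF and pending_high then
            -- second half: combine both halves into one code point
            codepoints[#codepoints + 1] =
                0x10000 + (pending_high - 0xD800) * 0x400 + (unit - 0xDC00)
            pending_high = nil
        else
            -- ordinary BMP code unit (an unpaired high surrogate is dropped)
            codepoints[#codepoints + 1] = unit
            pending_high = nil
        end
    end
    return codepoints
end
```

With this, "062F" would give the single code point 0x62F, and a ligature string such as "00660069" would give 0x66 followed by 0x69. Each resulting number could then be passed to unicode.utf8.char, as in the original question.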


-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------