[luatex] glyph to Unicode code point

maxwell maxwell at umiacs.umd.edu
Thu Mar 29 01:29:24 CEST 2018


I'm using a version of the code in the answer here
     https://tex.stackexchange.com/questions/228312/
to convert a LuaTeX node structure into a list of characters in that 
structure.

My code at present traverses a node, recursing on nodes that are hlists 
(or vlists); when it comes to a node that is a glyph (node.id == 29 in 
the table on p. 99 of the LuaTeX reference manual, version 1.0.4 of April 
2017), it converts the node.char to what is (hopefully) a Unicode code 
point:
     unicode.utf8.char(node.char)

I say hopefully, because this conversion relies on the glyph's slot 
number in a particular font being the same as the Unicode code point 
for that character.  This works OK for many simple 
fonts, particularly some Latin fonts.  It also works for some simple 
Arabic script fonts which encode the various glyph variants (like 
initial, final, medial and isolated) as being the Arabic Presentation 
Form code points (which can easily be converted to code points in the 
normal Arabic block).  Unfortunately, it does not work for glyphs that 
are assigned to other slots in a font, corresponding to a Private Use 
Area code point in Unicode, or sometimes not corresponding to a valid 
Unicode code point at all.

The cmap table in an OpenType font (or a TrueType font) provides a 
mapping between Unicode code points and glyph slots.  Somewhere under 
the hood, LuaTeX is presumably using this table to choose an appropriate 
glyph.  It seems like it should be possible to do the reverse mapping, 
i.e. to map from a particular glyph to the corresponding Unicode code 
point.  (In the case of ligatures, this will be a one-to-many mapping.)  
The LuaTeX reference appears to discuss this on pp. 68-69; if I have a 
character's hash (from a font table), I can apparently extract the 
'tounicode' value, which IIUC is the Unicode code point I'm looking for.
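
As far as I can tell from the reference manual, the tounicode value is 
stored as a string of UTF-16BE hexadecimal digits rather than as a 
number, so it would need decoding before it can be passed to something 
like unicode.utf8.char.  The following is a minimal sketch of such a 
decoder in plain Lua.  The function name decode_tounicode is made up 
for illustration, and the code assumes well-formed input in groups of 
four hex digits.

```lua
-- Decode a UTF-16BE hexadecimal string such as "00660069" into a list
-- of code points. Assumes well-formed input (groups of four hex digits).
-- decode_tounicode is an illustrative name, not a LuaTeX function.
local function decode_tounicode(hex)
    -- Split the string into 16-bit code units.
    local units = {}
    for group in hex:gmatch("%x%x%x%x") do
        units[#units + 1] = tonumber(group, 16)
    end

    local codepoints = {}
    local i = 1
    while i <= #units do
        local hi = units[i]
        local lo = units[i + 1]
        if hi >= 0xD800 and hi <= 0xDBFF
            and lo and lo >= 0xDC00 and lo <= 0xDFFF then
            -- Surrogate pair: combine into one supplementary code point.
            codepoints[#codepoints + 1] =
                0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
            i = i + 2
        else
            codepoints[#codepoints + 1] = hi
            i = i + 1
        end
    end
    return codepoints
end

-- An "fi" ligature would carry two code points (one-to-many mapping).
local cps = decode_tounicode("00660069")
print(table.concat(cps, " "))   -- 102 105
```

Returning a list rather than a single number keeps the ligature case 
(one glyph, several code points) from needing special treatment.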

My problem is that I don't know how to go from the glyph's slot number 
(which is apparently what node.char is giving me, for nodes that 
represent a glyph) to the character hash in the font table, or even how 
to find the font table from the font number.  The node.char elements are 
numbers like 1583 (which is 0x62F, and makes sense as the code point for 
Arabic Dal) and 983159 (0xF0077, which falls in a Unicode Private Use 
Area and so has no standard character, but might be a glyph in some font).
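
A quick way to see where such values fall is to test them against the 
private-use and surrogate ranges defined by Unicode.  This is only a 
sketch in plain Lua: classify_slot is a made-up name, and it does not 
check whether an "ordinary" code point is actually assigned to a 
character.

```lua
-- Classify a slot number against Unicode's code point ranges.
-- classify_slot is an illustrative name; the range boundaries are the
-- ones defined by the Unicode standard.
local function classify_slot(n)
    if n < 0 or n > 0x10FFFF then
        return "outside Unicode"
    elseif n >= 0xD800 and n <= 0xDFFF then
        return "surrogate"
    elseif (n >= 0xE000 and n <= 0xF8FF)            -- BMP Private Use Area
        or (n >= 0xF0000 and n <= 0xFFFFD)          -- Supplementary PUA-A
        or (n >= 0x100000 and n <= 0x10FFFD) then   -- Supplementary PUA-B
        return "private use"
    else
        return "ordinary"
    end
end

-- The two node.char values quoted in the message.
for _, n in ipairs({1583, 983159}) do
    print(string.format("%d = U+%04X: %s", n, n, classify_slot(n)))
end
-- 1583 = U+062F: ordinary
-- 983159 = U+F0077: private use
```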

How do I go from these node.char numbers and a node.font number (a 
number like 29, which apparently points to a font) to a character hash 
in the font's table?  I'm guessing I need a function that maps from the 
node.font to a font table, and then a function that maps the number from 
node.char plus a font table to a character hash.  Something like
    if node.id == 29 then
        UnicodeChar = unicode.utf8.char(
            Node2CharHash(node.char, Node2FontTable(node.font)).tounicode)
    end
where Node2CharHash and Node2FontTable are the functions I'm looking 
for, if my guess is right.  (My syntax is probably wrong, I'm used to 
Python...)
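
One possible shape for that lookup, offered as an untested guess rather 
than a known answer: LuaTeX's font library has a font.getfont function 
that takes a font number and returns a font table, and the manual 
describes a characters subtable indexed by slot number.  In the sketch 
below, glyph_tounicode and the stub data are hypothetical; the getfont 
argument exists only so the function can be run outside LuaTeX.  
Whether tounicode is actually filled in presumably depends on how the 
font was loaded.

```lua
-- Sketch only: look up the tounicode entry for a glyph node's font and
-- char fields. getfont defaults to LuaTeX's font.getfont; it is a
-- parameter so the function can be exercised outside LuaTeX with a stub.
local function glyph_tounicode(font_id, slot, getfont)
    getfont = getfont or (font and font.getfont)
    local fonttable = getfont and getfont(font_id)
    local chars = fonttable and fonttable.characters
    local entry = chars and chars[slot]
    return entry and entry.tounicode   -- nil when anything is missing
end

-- Stub standing in for font number 29 (hypothetical data, not taken
-- from any real font).
local stub = {
    [29] = { characters = { [983159] = { tounicode = "062F" } } },
}
local function stub_getfont(id) return stub[id] end

print(glyph_tounicode(29, 983159, stub_getfont))   -- 062F
print(glyph_tounicode(29, 1583, stub_getfont))     -- nil
```

Inside LuaTeX the call would presumably be glyph_tounicode(n.font, 
n.char) for a node n whose n.id equals node.id("glyph"), which avoids 
hard-coding the number 29.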

    Mike Maxwell

