[luatex] glyph to Unicode code point

Hans Hagen j.hagen at xs4all.nl
Thu Mar 29 10:01:25 CEST 2018

On 3/29/2018 1:29 AM, maxwell wrote:
> I'm using a version of the code in the answer here
>      https://tex.stackexchange.com/questions/228312/
> to convert a LuaTeX node structure into a list of characters in that
> structure.
>
> My code at present traverses a Node, recursing on nodes that are hlists
> (or vlists); when it comes to a node which is a glyph (node.id() = 29 in
> the table on p99 of the LuaTeX reference manual, version 1.0.4 of April
> 2017), it converts the node.char to what is (hopefully) a Unicode code
> point:
>      unicode.utf8.char(node.char)
>
> I say hopefully, because this conversion relies on the glyph being
> assigned a slot in a particular font that has the same number as the
> Unicode code point for that character.  This works ok for many simple
> fonts, particularly some Latin fonts.  It also works for some simple
> Arabic script fonts which encode the various glyph variants (like
> initial, final, medial and isolated) as being the Arabic Presentation
> Form code points (which can easily be converted to code points in the
> normal Arabic block).  Unfortunately, it does not work for glyphs that
> are assigned to other slots in a font, corresponding to a Private Use
> Area code point in Unicode, or sometimes not corresponding to a valid
> Unicode code point at all.
>
> The cmap table in an OpenType font (or a TrueType font) provides a
> mapping between Unicode code points and glyph slots.  Somewhere under
> the hood, LuaTeX is presumably using this table to choose an appropriate
> glyph.  It seems like it should be possible to do the reverse mapping,
> i.e. to map from a particular glyph to the corresponding Unicode code
> point.  (In the case of ligatures, this will be a one-to-many mapping.)
> The LuaTeX reference appears to discuss this on p68-69; if I have a
> character's hash (from a font table), I can apparently extract the
> 'tounicode' value, which IIUC is the Unicode code point I'm looking for.
>
> My problem is that I don't know how to go from the glyph's slot number
> (which is apparently what node.char is giving me, for nodes that
> represent a glyph) to the character hash in the font table, or even how
> to find the font table from the font number.  The node.char elements are
> numbers like 1583 (which appears to be 0x62F, and makes sense as the
> code point for Arabic Dal) and 983159 (0xF0077, which would not be a
> valid Unicode character, but might be a glyph in some font).
>
> How do I go from these node.char numbers and a node.font number (a
> number like 29, which apparently points to a font) to a character hash
> in the font's table?  I'm guessing I need a function that maps from the
> node.font to a font table, and then a function that maps the number from
> node.char plus a font table to a character hash.  Something like
>     if node.id == 29 then
>         UnicodeChar = unicode.utf8.char(
>             Node2CharHash(node.char, Node2FontTable(node.font)).tounicode)
>     end
> where Node2CharHash and Node2FontTable are the functions I'm looking
> for, if my guess is right.  (My syntax is probably wrong, I'm used to
> Python...)

If you use ConTeXt or LaTeX, this is a starting point:

\starttext

\setbox0\hbox{something effe}

\directlua {
    for n in node.traverse(tex.box[0].list) do
        if n.id == node.id("glyph") then
            print(
                fonts.hashes.identifiers[n.font].properties.filename,
                fonts.hashes.identifiers[n.font].characters[n.char].tounicode
            )
        end
    end
}

\stoptext
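
To go from the printed tounicode value to actual text, something along
the following lines might work. It is only a sketch under assumptions:
that tounicode, when present, is a string of UTF-16BE hex digits (as the
LuaTeX manual describes it, so a ligature gives several code units), and
that a glyph without tounicode whose slot is outside the Private Use
Areas can be taken as its own code point. The names utf8_encode,
decode_tounicode, is_private_use, glyph_text and collect_text are made
up for this example, and the node ids 0, 1 and 29 are only fallbacks for
running it outside LuaTeX.

```lua
-- Encode one code point as UTF-8 without relying on a utf8/bit library.
local function utf8_encode(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
  return string.char(0xF0 + math.floor(cp / 0x40000),
                     0x80 + math.floor(cp / 0x1000) % 0x40,
                     0x80 + math.floor(cp / 0x40) % 0x40,
                     0x80 + cp % 0x40)
end

-- Split a UTF-16BE hex string ("00660069") into code points,
-- combining surrogate pairs ("D835DC00" -> 0x1D400).
local function decode_tounicode(hex)
  local cps, high = {}, nil
  for unit in hex:gmatch("%x%x%x%x") do
    local u = tonumber(unit, 16)
    if high and u >= 0xDC00 and u <= 0xDFFF then
      cps[#cps + 1] = 0x10000 + (high - 0xD800) * 0x400 + (u - 0xDC00)
      high = nil
    elseif u >= 0xD800 and u <= 0xDBFF then
      high = u
    else
      cps[#cps + 1] = u
      high = nil
    end
  end
  return cps
end

-- True for the three Unicode Private Use ranges.
local function is_private_use(cp)
  return (cp >= 0xE000 and cp <= 0xF8FF)
      or (cp >= 0xF0000 and cp <= 0xFFFFD)
      or (cp >= 0x100000 and cp <= 0x10FFFD)
end

-- Text for one glyph: tounicode if the font has it, otherwise the slot
-- itself (assumption: non-private slots equal code points), otherwise nil.
local function glyph_text(fontdata, slot)
  local chr = fontdata and fontdata.characters and fontdata.characters[slot]
  local tu = chr and chr.tounicode
  if type(tu) == "string" and tu ~= "" then
    local out = {}
    for i, cp in ipairs(decode_tounicode(tu)) do
      out[i] = utf8_encode(cp)
    end
    return table.concat(out)
  end
  if slot <= 0x10FFFF and not is_private_use(slot) then
    return utf8_encode(slot)
  end
  return nil
end

-- Node ids: ask LuaTeX when available, else fall back to assumed values.
local HLIST = node and node.id("hlist") or 0
local VLIST = node and node.id("vlist") or 1
local GLYPH = node and node.id("glyph") or 29

-- Walk a node list, recursing into hlists and vlists, and collect the
-- text of every glyph. getfont maps a font number to a font table.
local function collect_text(head, getfont, out)
  out = out or {}
  local n = head
  while n do
    if n.id == HLIST or n.id == VLIST then
      collect_text(n.head, getfont, out)
    elseif n.id == GLYPH then
      -- U+FFFD marks glyphs that could not be mapped
      out[#out + 1] = glyph_text(getfont(n.font), n.char) or utf8_encode(0xFFFD)
    end
    n = n.next
  end
  return out
end
```

Inside LuaTeX this could presumably be called as
table.concat(collect_text(tex.box[0].head, font.getfont)); in ConTeXt,
passing function(id) return fonts.hashes.identifiers[id] end as getfont
is probably the safer choice. The code is best kept in a separate .lua
file, since "--" comments do not survive inside \directlua.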

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------

