[luatex] glyph to Unicode code point
maxwell
maxwell at umiacs.umd.edu
Thu Mar 29 01:29:24 CEST 2018
I'm using a version of the code in the answer here
https://tex.stackexchange.com/questions/228312/
to convert a LuaTeX node structure into a list of characters in that
structure.
My code at present traverses a node, recursing on nodes that are hlists
(or vlists); when it comes to a node which is a glyph (node.id == 29 in
the table on p. 99 of the LuaTeX reference manual, version 1.0.4 of April
2017), it converts node.char to what is (hopefully) a Unicode code
point:
    unicode.utf8.char(node.char)
I say hopefully, because this conversion relies on the glyph being
assigned a slot in a particular font that has the same number as the
Unicode code point for that character. This works ok for many simple
fonts, particularly some Latin fonts. It also works for some simple
Arabic script fonts which encode the various glyph variants (like
initial, final, medial and isolated) as being the Arabic Presentation
Form code points (which can easily be converted to code points in the
normal Arabic block). Unfortunately, it does not work for glyphs that
are assigned to other slots in a font, corresponding to a Private Use
Area code point in Unicode, or sometimes not corresponding to a valid
Unicode code point at all.
The cmap table in an OpenType font (or a TrueType font) provides a
mapping between Unicode code points and glyph slots. Somewhere under
the hood, LuaTeX is presumably using this table to choose an appropriate
glyph. It seems like it should be possible to do the reverse mapping,
i.e. to map from a particular glyph to the corresponding Unicode code
point. (In the case of ligatures, this will be a one-to-many mapping.)
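
To make the reverse mapping concrete, here is a minimal sketch in plain
Lua that inverts a toy code-point-to-glyph table. The table contents, the
glyph numbers 120 and 121, and the function name invert_cmap are all made
up for illustration; none of them come from a real font or from LuaTeX.
It shows one way the reverse direction becomes one-to-many: several code
points can share a single glyph.

```lua
-- Hypothetical cmap-like table: Unicode code point -> glyph slot.
-- The glyph numbers are invented for this example.
local toy_cmap = {
  [0x062F] = 120,  -- ARABIC LETTER DAL
  [0xFEA9] = 120,  -- ARABIC LETTER DAL ISOLATED FORM (same glyph here)
  [0xFEAA] = 121,  -- ARABIC LETTER DAL FINAL FORM
}

-- Build the reverse table: glyph slot -> sorted list of code points.
local function invert_cmap(map)
  local reverse = {}
  for codepoint, glyph in pairs(map) do
    local list = reverse[glyph]
    if not list then
      list = {}
      reverse[glyph] = list
    end
    list[#list + 1] = codepoint
  end
  -- pairs() has no defined order, so sort each list for stable results.
  for _, list in pairs(reverse) do
    table.sort(list)
  end
  return reverse
end

local by_glyph = invert_cmap(toy_cmap)
```

A glyph that is only reached through a substitution rule, as many
ligatures are, would not show up in a table like this at all, so an
inverted cmap alone may not cover every glyph.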
The LuaTeX reference appears to discuss this on pp. 68-69; if I have a
character's hash (from a font table), I can apparently extract the
'tounicode' value, which IIUC is the Unicode code point I'm looking for.
My problem is that I don't know how to go from the glyph's slot number
(which is apparently what node.char is giving me, for nodes that
represent a glyph) to the character hash in the font table, or even how
to find the font table from the font number. The node.char elements are
numbers like 1583 (which is 0x62F, and makes sense as the code point for
Arabic Dal) and 983159 (0xF0077, which lies in a Unicode Private Use
Area and so has no standard meaning, but might be a glyph in some font).
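
As a quick check of those two numbers, the following plain-Lua sketch
prints each slot number in hexadecimal and reports whether it falls in
one of the Unicode Private Use ranges (U+E000..U+F8FF,
U+F0000..U+FFFFD, U+100000..U+10FFFD). The function name slot_kind and
its return strings are hypothetical labels chosen for this example.

```lua
-- Classify a slot number by the Unicode range it falls in.
-- "not private use" only means the number is outside the Private Use
-- ranges; it does not guarantee that a character is assigned there.
local function slot_kind(slot)
  if slot < 0 or slot > 0x10FFFF then
    return "out of range"
  end
  if slot >= 0xD800 and slot <= 0xDFFF then
    return "surrogate"
  end
  if (slot >= 0xE000 and slot <= 0xF8FF)
      or (slot >= 0xF0000 and slot <= 0xFFFFD)
      or (slot >= 0x100000 and slot <= 0x10FFFD) then
    return "private use"
  end
  return "not private use"
end

for _, slot in ipairs({ 1583, 983159 }) do
  print(string.format("%d = U+%04X (%s)", slot, slot, slot_kind(slot)))
end
-- prints:
-- 1583 = U+062F (not private use)
-- 983159 = U+F0077 (private use)
```

A test like this could be used to decide when node.char can be passed
straight to unicode.utf8.char and when a lookup in the font is needed.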
How do I go from these node.char numbers and a node.font number (a
number like 29, which apparently points to a font) to a character hash
in the font's table? I'm guessing I need a function that maps from the
node.font to a font table, and then a function that maps the number from
node.char plus a font table to a character hash. Something like
    if node.id == 29 then
      UnicodeChar = unicode.utf8.char(
        Node2CharHash(node.char, Node2FontTable(node.font)).tounicode)
    end
where Node2CharHash and Node2FontTable are the functions I'm looking
for, if my guess is right. (My syntax is probably wrong, I'm used to
Python...)
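
For what it is worth, here is a hedged sketch of what such a lookup might
look like. It assumes that font.getfont(id) returns the font table for a
font number, that the table has a characters field indexed by slot
number, and that a character's tounicode field, when present, is a
string of UTF-16BE hexadecimal digits as the reference manual describes.
The names decode_tounicode and glyph_codepoints are hypothetical, and the
font-table accessor is passed in as a parameter so that the logic can be
tried outside LuaTeX with a stand-in.

```lua
-- Decode a tounicode string (UTF-16BE hex, e.g. "062F" or "00660069")
-- into a list of code points, combining surrogate pairs.
local function decode_tounicode(hex)
  local units = {}
  for unit in hex:gmatch("%x%x%x%x") do
    units[#units + 1] = tonumber(unit, 16)
  end
  local codepoints, i = {}, 1
  while i <= #units do
    local u, nxt = units[i], units[i + 1]
    if u >= 0xD800 and u <= 0xDBFF
        and nxt and nxt >= 0xDC00 and nxt <= 0xDFFF then
      -- High surrogate followed by low surrogate: one code point.
      codepoints[#codepoints + 1] =
        0x10000 + (u - 0xD800) * 0x400 + (nxt - 0xDC00)
      i = i + 2
    else
      codepoints[#codepoints + 1] = u
      i = i + 1
    end
  end
  return codepoints
end

-- n is a glyph node (or any table with font and char fields).
-- getfont maps a font number to a font table; inside LuaTeX this is
-- assumed to be font.getfont.
local function glyph_codepoints(n, getfont)
  local fonttable = getfont(n.font)
  local chr = fonttable and fonttable.characters
    and fonttable.characters[n.char]
  local tounicode = chr and chr.tounicode
  if type(tounicode) == "string" then
    return decode_tounicode(tounicode)
  end
  -- No usable tounicode entry: fall back to the slot number itself.
  return { n.char }
end
```

Inside LuaTeX the call would presumably be glyph_codepoints(n,
font.getfont), with each returned number then passed to
unicode.utf8.char. Whether tounicode is actually filled in may depend on
how the font was loaded, so the fallback branch matters.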
Mike Maxwell