[XeTeX] Latin Modern, from TFM to Unicode

Wed Jun 12 21:32:10 CEST 2013

Thanks for all the responses.

I understand the distinction between Unicode characters (code points) and 
glyphs, and that an OpenType font can have glyphs in it that do not 
correspond to any Unicode code points.  I don't quite get whether or how 
those non-Unicode glyphs are subject to being found via the 'cmap' table, 
or whether they have glyph IDs that are known or can be determined by 
some documented convention outside the OpenType font file.  Or whether 
they are part of some internal ligature-like structure that only the 
OpenType font has information about (which might mean that the glyph IDs 
can change internally from one release to the next of the OT font).

Arthur Reutenauer responded:

> These glyphs or parts of glyphs can probably be mapped one-to-one to font
> slots in the original lmex10, but that does not make them characters.

Understood about not being characters.  But it's that one-to-one mapping 
from each slot in TFM to an equivalent slot in OpenType (for Latin 
Modern) I'm interested in pinning down (hopefully not "probably").  It 
certainly appears that every glyph represented by "lmex10.tfm" can be 
found in the "Latin Modern Math" font file, though I haven't gone through 
all 128 trying to find where they appear in the OT font.

Khaled Hosny wrote:

> [snip numerous good explanations]

Thanks.  I understand better what's going on inside the OpenType font, 
and can now imagine how FontBook is figuring out which glyphs are not the 
targets of the 'cmap' table's Unicode code point inputs.  And I 
understand that the math extension font contains glyphs for different 
sizes of the same symbol, but kept in different slots with different 
glyph indices (if that's the right term) in the TFM file.

> I'm not sure what do you want to achieve, and you might be asking the wrong
> question, so it might be better to elaborate more on your actual goal.

I have my own homebrew math layout system that determines where to place 
math glyphs based on information in the lmex10.tfm and other TFM files.  
For reasons peculiar to my needs, I'm not interested in creating PDF or 
DVI output.  I just want to draw a math glyph on my screen using "Latin 
Modern Math" at a computed position, based on where TeX would place it 
using the metrics in "lmex10.tfm" or other TFM file (the extent to which 
I'm accurately simulating TeX is a side-issue, but I'm trying hard).  My 
assumption was that the glyphs in the OT file are visually the same, 
and have the same metrics/bounding boxes, etc. as the original TFM 
metrics.  Or if they don't have quite the same metrics, the differences 
are not going to change over time with new versions of the OT font.

I assumed that every one of the 128 glyphs represented by slots in 
lmex10.tfm would be found in the OpenType font "Latin Modern Math", along 
with lots of other glyphs.  I had thought that all the glyphs in the OT 
font had Unicode character designations, but have now understood that 
that is not a good assumption.

Consider the radical sign.  In the TFM file, there is information that 
TeX uses to determine which final glyph(s) to use, based on the height of 
the box of whatever's underneath the radical.  So TeX chooses the glyph 
in slot "70 for small height, or the glyph in slot "71 for medium height, 
or the one in slot "72 for large height, or slot "73 for even larger 
height.  If none of those fixed-height glyphs are high enough, presumably 
TeX goes into a tall symbol construction algorithm based on data within 
the TFM file, using glyphs representing pieces of radical signs, kept in 
slots "74, "75, and "76.
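
The selection just described can be sketched roughly as follows. This is only an illustration of my understanding, not TeX's actual code: the function name is hypothetical, and the heights are made-up placeholder values, not the real metrics from lmex10.tfm.

```python
# Slots in lmex10.tfm as described above (TeX's "70 is hexadecimal 0x70).
FIXED_RADICALS = [0x70, 0x71, 0x72, 0x73]   # small, medium, large, larger
BOTTOM, BAR, TOP = 0x74, 0x75, 0x76         # pieces for built-up radicals

# Placeholder total heights (height + depth) in points.
# ASSUMPTION: these numbers are invented for illustration, NOT real TFM data.
PLACEHOLDER_HEIGHT = {
    0x70: 12.0, 0x71: 18.0, 0x72: 24.0, 0x73: 30.0,
    0x74: 18.0, 0x75: 6.0, 0x76: 6.0,
}

def choose_radical(target, heights=PLACEHOLDER_HEIGHT):
    """Return the slots to draw, bottom to top, to cover `target` height."""
    # First try the fixed-height glyphs, smallest first.
    for slot in FIXED_RADICALS:
        if heights[slot] >= target:
            return [slot]
    # None is tall enough: stack bottom + repeated bars + top corner.
    total = heights[BOTTOM] + heights[TOP]
    bars = 0
    while total + bars * heights[BAR] < target:
        bars += 1
    return [BOTTOM] + [BAR] * bars + [TOP]
```

With the placeholder heights, a target of 20 selects slot "72 alone, while a target of 40 yields the bottom piece, three vertical bars, and the top corner.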

Using FontBook, in the "Latin Modern Math" OpenType file, the glyph for the 
official Unicode code point U+221A SQUARE ROOT is glyph ID #2839.  So 
that's a "character" I suppose.  The 'cmap' table maps that Unicode value 
to that glyph ID and it can be drawn as a character would.  But there are 
also non-Unicode glyphs for partial radical signs, all of which look 
identical to the glyphs shown by /fonttable for "lmex10.tfm" (which are 
taken from some PFB file).  In particular, I've figured out by inspection 
the following partial answer to what I'm interested in:

small radical    TFM slot "70 ==> OTF glyph #2843 (no Unicode designation)
medium radical   TFM slot "71 ==> OTF glyph #2844 (no Unicode designation)
large radical    TFM slot "72 ==> OTF glyph #2845 (no Unicode designation)
larger radical   TFM slot "73 ==> OTF glyph #2846 (no Unicode designation)

radical bottom   TFM slot "74 ==> OTF glyph #2840 (U+23B7 RADICAL SYMBOL BOTTOM)
vertical bar     TFM slot "75 ==> OTF glyph #2841 (no Unicode designation)
top corner       TFM slot "76 ==> OTF glyph #2842 (no Unicode designation)
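
Written as a lookup table, the partial mapping above might look like the following. This is a minimal sketch: the names are hypothetical, and the glyph IDs are only what I found by inspection in the copy of the font I have, so they are assumptions that may not hold for other releases.

```python
# TFM slot in lmex10.tfm -> glyph ID observed in "Latin Modern Math".
# ASSUMPTION: IDs were found by inspection in one copy of the font and
# may differ in other releases.
RADICAL_SLOT_TO_GLYPH_ID = {
    0x70: 2843,  # small radical
    0x71: 2844,  # medium radical
    0x72: 2845,  # large radical
    0x73: 2846,  # larger radical
    0x74: 2840,  # radical bottom (U+23B7 RADICAL SYMBOL BOTTOM)
    0x75: 2841,  # vertical bar
    0x76: 2842,  # top corner
}

def glyph_id_for_slot(slot):
    """Return the observed glyph ID for a TFM slot, or None if not yet mapped."""
    return RADICAL_SLOT_TO_GLYPH_ID.get(slot)
```

Slots I haven't tracked down yet simply return None, so a caller can tell "not mapped" apart from a real glyph ID.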

So given that there are partial glyphs useful for building very large 
radical signs in "Latin Modern Math", and given that most, though not all 
of them, have no official Unicode code point assigned to them, how does 
an outside process that wants to use the OT font to draw a very large 
radical sign tell the font what to draw?  Since there's no mapping from 
Unicode, then the outside process either needs to know the absolute glyph 
IDs inside the font, or it needs to cause the font to go into some 
internal construction mode, like building a ligature, where the font 
itself knows the sequence and position of the glyphs to use to construct 
the tall symbol.  The latter seems impossible, because the font can't 
know the threshold height at which to stop construction.  The former 
means hard coding internal glyph IDs somewhere outside the font, which 
I'm hoping is not fragile, but worry it might be.

Sorry for the reams of details, but I'm trying to explain my confusion 
exactly.

Doug McKenna