[XeTeX] Type0 fonts somehow not built correctly for Unicode text-extraction and Accessibility

Ross Moore ross.moore at mq.edu.au
Mon Aug 6 00:11:02 CEST 2018

There seems to be a subtle problem with the way subsetted Type0 fonts are built
by xdvipdfmx in XeLaTeX jobs, as far as finding the /ToUnicode resource is concerned.

The main view is fine, but when checking other aspects for standards compliance, some basic tests fail.
See, e.g., the included image.

Firstly, the CIDSet is not built correctly: it does not include all the glyphs that are used.
pdfTeX has a similar problem with regard to CharSet.
The issue seems to be that if an accented character is built internally from multiple glyphs,
then each of those glyphs should be included in the CIDSet, as well as the combined character.
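
To make the CIDSet point concrete, here is a minimal Python sketch of the bit table that ISO 32000-1 describes for this stream: one bit per CID, stored high-order bit first, so the most significant bit of the first byte is CID 0. The function names and the CID numbers (a composed glyph at CID 200 with components at CIDs 36 and 141) are hypothetical, chosen only for illustration; this is not how xdvipdfmx does it internally.

```python
def build_cidset(cids):
    """Return CIDSet stream data: one bit per CID, high-order bit first."""
    cids = set(cids)
    if not cids:
        return b""
    bits = bytearray(max(cids) // 8 + 1)
    for cid in cids:
        # Bit 7 of byte 0 is CID 0, bit 6 is CID 1, and so on.
        bits[cid // 8] |= 0x80 >> (cid % 8)
    return bytes(bits)


def missing_from_cidset(cidset, cids):
    """List the CIDs in `cids` whose bit is not set in `cidset`."""
    return sorted(
        c for c in cids
        if c // 8 >= len(cidset) or not cidset[c // 8] & (0x80 >> (c % 8))
    )


# Hypothetical case: only the composed glyph (CID 200) was recorded,
# but its two component glyphs (CIDs 36 and 141) are also in the font.
recorded = build_cidset({0, 200})
embedded = {0, 36, 141, 200}
print(missing_from_cidset(recorded, embedded))
```

A check along these lines, run against the glyphs actually present in the embedded font program, would report the component glyphs as absent from the CIDSet, which is the kind of mismatch a validator complains about.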

Acrobat’s Preflight has a filter to remove such incomplete CIDSets, so this isn’t a crucial deficiency.

Secondly, although clearly present, the /ToUnicode CMap resource is not being found.
The font seems to be named correctly here, according to:

page 279 of ISO 32000-1:2008

§ 9.7.6  Type 0 Font Dictionaries
§  General
A Type 0 font dictionary contains the entries listed in Table 121.

                            Table 121 – Entries in a Type 0 font dictionary

BaseFont  name    (Required) The name of the font.
  If the descendant is a Type 0 CIDFont, this name should be the concatenation of the CIDFont’s BaseFont name, a hyphen,
  and the CMap name given in the Encoding entry (or the CMapName entry in the CMap).
  If the descendant is a Type 2 CIDFont, this name should be the same as the CIDFont’s BaseFont name.

Since this is a Type 2 CIDFont, the 2nd sentence is applicable.
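
For anyone wanting to check the naming rule mechanically, the BaseFont entry of Table 121 can be sketched in Python as below. The function name and the example font name are hypothetical; "CIDFontType0" and "CIDFontType2" are the Subtype values of the descendant CIDFont, and Identity-H is used only as a sample CMap name.

```python
def expected_type0_basefont(cidfont_basefont, cidfont_subtype, cmap_name=None):
    """BaseFont that Table 121 recommends for the Type 0 font dictionary."""
    if cidfont_subtype == "CIDFontType0":
        # Descendant is a Type 0 CIDFont: CIDFont name, hyphen, CMap name.
        if cmap_name is None:
            raise ValueError("a CMap name is needed for a Type 0 CIDFont")
        return f"{cidfont_basefont}-{cmap_name}"
    if cidfont_subtype == "CIDFontType2":
        # Descendant is a Type 2 CIDFont: same name as the CIDFont.
        return cidfont_basefont
    raise ValueError(f"unexpected CIDFont subtype: {cidfont_subtype}")


# Hypothetical subsetted font name, for illustration only.
print(expected_type0_basefont("ABCDEF+SomeFont", "CIDFontType2", "Identity-H"))
print(expected_type0_basefont("ABCDEF+SomeFont", "CIDFontType0", "Identity-H"))
```

Comparing the result with the BaseFont actually written into the Type 0 dictionary is one quick way to rule the name in or out as the cause.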

And since it is a subset of the full font, the last sentence below is also applicable.

page 285 of ISO 32000-1:2008

§9.8.3 Font Descriptors for CIDFonts
§  General
In addition to the entries in Table 122, the FontDescriptor dictionaries of CIDFonts may contain the entries listed in Table 124.

           Table 124 – Additional font descriptor entries for CIDFonts

CIDSet   stream    (Optional) A stream identifying which CIDs are present in the CIDFont file.
 If this entry is present, the CIDFont shall contain only a subset of the glyphs in the character collection defined by the CIDSystemInfo dictionary.
 If it is absent, the only indication of a CIDFont subset shall be the subset tag in the FontName entry (see 9.6.4, "Font Subsets").
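
The subset tag mentioned in that last sentence is, per 9.6.4, six uppercase letters followed by a plus sign, prefixed to the font's PostScript name. A small Python sketch for testing a name against that pattern follows; the helper name and the sample font names are hypothetical.

```python
import re

# Six uppercase letters, a plus sign, then the rest of the font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when no subset tag is present."""
    match = SUBSET_TAG.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)


print(split_subset_tag("ABCDEF+SomeFont"))
print(split_subset_tag("SomeFont"))
```

The same tag should appear on the Type 0 BaseFont, the CIDFont BaseFont and the FontName in the descriptor, so running the three names through such a check would show whether they agree.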

So I cannot see why the /ToUnicode resource is not being found.

Would someone with more experience in building and subsetting fonts please have a look at this issue?



Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.moore at mq.edu.au


