[pdftex] Generating CJK in PDF

Otfried Cheong otfried at cs.uu.nl
Fri May 4 15:57:15 CEST 2001


I'm sure I've already exceeded my quota for posting on this list for
this week, but let me ask a few questions about a quite different
topic. I promise not to post as much next week :-)

Currently it is not really possible to produce CJK PDF files with
pdftex.  Certainly one can use Werner's CJK package, or HLaTeX etc.,
which cover a single 16-bit font with several 8-bit Type1 fonts.  But
the resulting PDF files are essentially encrypted: one can view and
print them, but one cannot copy and paste, or search for text in the
file, because the viewer has no idea what the character codes mean.
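
To illustrate: a page produced this way contains text operators like
the following (the resource name and byte values are made up; the
actual codes depend entirely on how the font was split):

  BT
    /F37 10 Tf      % F37 names one 8-bit subfont, say ntukai05
    (\101\232) Tj   % two bytes in the subfont's private encoding
  ET

Without knowing the splitting scheme, a viewer cannot map these bytes
back to any real characters.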

It is also currently not possible to use embedded TTF fonts for CJK
text, as there is no easily available tool for splitting a CJK TTF
font into the 8-bit pieces used by CJK.  (One can split-convert the
TTF fonts into Type1 fonts, though.)

"Real" CJK PDF files would use CIDkeyed composite fonts in a
well-known standard encoding from Adobe's CMap repertoire, and 16bit
encoded strings in content streams.
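
Concretely, such a file would contain a Type 0 font referring to one
of the predefined CMaps, and show-text operators taking two-byte
strings.  A sketch (object numbers and the font name are of course
made up):

  10 0 obj
  << /Type /Font
     /Subtype /Type0
     /BaseFont /NTUKai
     /Encoding /UniCNS-UCS2-H    % predefined CMap from Adobe
     /DescendantFonts [11 0 R]   % the actual CID-keyed font
  >>
  endobj

and, in a page content stream,

  BT
    /F1 10 Tf
    <4E2D6587> Tj   % two 16-bit codes, here U+4E2D U+6587
  ET

Since UniCNS-UCS2-H is publicly known, any viewer can map the codes
back to characters for searching and copying.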

Is any work under way to enable pdftex to create such output?  If not,
is it desirable, and what would be the right way to do it?  

Here is a possible design: extend the syntax of font map files with
the same subsetting scheme that ttf2tfm implements:

ntukai@<subsetting spec file>@ ntukai.ttf

This would define TeX fonts ntukai01, ntukai02, and so on, as
specified by the <subsetting spec file>, and one would have to
provide .tfm files for all these subfonts.

One would, however, not actually have to split ntukai.ttf into
pieces.  pdftex would know that all these TeX subfonts are parts of a
single TTF font, and how the 8-bit character codes map into the full
font.  It would use 16-bit character codes in content streams
referring to the font, and would embed (possibly a subset of)
ntukai.ttf as a single CID-keyed font.  One would not need to change
the existing macro packages at all, yet the output would be real
16-bit PDF.
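
In PDF terms, the embedded font would then appear as the descendant
of a Type 0 font like the one sketched above, roughly as follows
(again with invented names and object numbers):

  11 0 obj
  << /Type /Font
     /Subtype /CIDFontType2     % CID-keyed font based on TrueType
     /BaseFont /NTUKai
     /CIDSystemInfo
       << /Registry (Adobe) /Ordering (CNS1) /Supplement 0 >>
     /FontDescriptor 12 0 R     % carries the embedded ntukai.ttf
     /CIDToGIDMap 13 0 R
  >>
  endobj

The /CIDToGIDMap entry is the one nontrivial piece: for a
CIDFontType2 it is a stream mapping CIDs (here Adobe-CNS1 CIDs) to
the font's internal glyph indices; /Identity is only allowed when the
two coincide, which they normally won't for an off-the-shelf TTF.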

This extension is not trivial and would take quite some work, but
there do not seem to be any fundamental obstacles.  Is it worth it,
though?  Note that, as an alternative, one could also write a
postprocessor.  That would be quite non-trivial as well, since it
would actually have to parse the page content streams to change the
character encoding.

There is a much simpler route to enabling copy-and-paste and text
searching, at least in theory: one can attach a "ToUnicode" character
map to each of the 8-bit subfonts (a sketch follows below).  Viewers
that correctly implement the PDF specification should then be able to
provide search and copy.  But does this really work in practice?
Xpdf certainly does not support it.  Does it work with Acrobat Reader
on Chinese Windows, or Hangul Windows?  Does Acrobat Reader for Linux
have any support for it at all?
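
For reference, such a ToUnicode map is a small CMap stream attached
to the subfont's font dictionary, along these lines (the two example
mappings are invented):

  /CIDInit /ProcSet findresource begin
  12 dict begin
  begincmap
  /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
  /CMapName /ntukai01-ToUnicode def
  /CMapType 2 def
  1 begincodespacerange
  <00> <FF>        % the subfont is 8-bit
  endcodespacerange
  2 beginbfchar
  <41> <4E2D>      % subfont code 0x41 -> U+4E2D
  <42> <6587>      % subfont code 0x42 -> U+6587
  endbfchar
  endcmap
  CMapName currentdict /CMap defineresource pop
  end
  end

A tool that generates the subfonts could emit these maps mechanically
from the same splitting tables.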

If it works, this would be quite easy to implement.  The resulting
PDF files would still only use 8-bit fonts internally, but the user
wouldn't notice, and so it probably wouldn't matter too much.  If it
works :-)

Otfried
