[XeTeX] Unicode bookmarks in xdvipdfmx - prototype solution (long)

Timothy Eyre tinytim1234 at hotmail.com
Wed Jun 14 09:00:50 CEST 2006


I've produced a modified version of xdvipdfmx that includes prototype 
Unicode bookmark support.

It uses the usual dvipdfm special

\special{pdf: out 1 << /Title (<UTF8 string>) /Dest [ @thispage /FitH @ypos ] >>}

to receive the bookmark specification.

To make the change kick in you need to specify the following at the top of 
your document.

\special{pdf:tounicode EUC-UCS2}

The EUC-UCS2 identifier is just a dummy; any CID mapping file on your system 
will do. This is just required to trick the rest of the xdvipdfmx code into 
calling reencodestring().

The change consists of a modification to the function reencodestring() in 
the xdvipdfmx source file spc_pdfm.c. It also uses the modules

http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.h

and

http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c

The change works by converting the UTF-8 string to big-endian UTF-16 and
prepending the byte order mark 0xFEFF (the bytes FE FF). This is the Unicode
format for text strings given in the PDF spec.
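
To make the target format concrete, here is a minimal standalone sketch of
the encoding step. It is not the code from the patch: the function names
put_utf16be() and make_pdf_text_string() are made up for illustration, and
it takes already-decoded code points rather than UTF-8 input. Unlike the
prototype, it writes a surrogate pair for code points above U+FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: append one Unicode code point to out as UTF-16BE.
   Returns the number of bytes written (2 or 4), or 0 on error
   (invalid code point or not enough room). */
size_t
put_utf16be (unsigned long cp, unsigned char *out, size_t room)
{
  unsigned long hi, lo;

  assert(out != NULL);
  if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
    return 0;                          /* not a Unicode scalar value */

  if (cp < 0x10000UL) {                /* BMP: one 16-bit code unit */
    if (room < 2)
      return 0;
    out[0] = (unsigned char)(cp >> 8);
    out[1] = (unsigned char)(cp & 0xff);
    return 2;
  }

  if (room < 4)                        /* supplementary: surrogate pair */
    return 0;
  cp -= 0x10000UL;
  hi = 0xD800UL + (cp >> 10);
  lo = 0xDC00UL + (cp & 0x3FFUL);
  out[0] = (unsigned char)(hi >> 8);
  out[1] = (unsigned char)(hi & 0xff);
  out[2] = (unsigned char)(lo >> 8);
  out[3] = (unsigned char)(lo & 0xff);
  return 4;
}

/* Hypothetical helper: write FE FF followed by the code points in
   UTF-16BE. Returns the total length in bytes, or 0 on error. */
size_t
make_pdf_text_string (const unsigned long *cps, size_t n,
                      unsigned char *out, size_t room)
{
  size_t len = 0, i, w;

  assert(cps != NULL && out != NULL);
  if (room < 2)
    return 0;
  out[len++] = 0xFE;                   /* byte order mark, high byte first */
  out[len++] = 0xFF;
  for (i = 0; i < n; i++) {
    w = put_utf16be(cps[i], out + len, room - len);
    if (w == 0)
      return 0;
    len += w;
  }
  return len;
}
```

Writing each byte with shifts, rather than by casting the 16-bit value to a
char pointer, keeps the output big-endian regardless of the host byte order.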

I've only done a little testing, in which mixed Cyrillic and Kanji displayed
just fine. A limitation of this change is that it assumes each character
converts to exactly one two-byte UTF-16 code unit. I'm also concerned that
it might convert strings other than bookmark strings, although theoretically
this shouldn't matter.
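
As a rough illustration of why the Cyrillic and Kanji tests pass while other
input might not, the sketch below decodes a single UTF-8 sequence to its
code point. Anything below U+10000 fits in one UTF-16 code unit; anything
above does not. The name decode_utf8() is hypothetical and this is a
standalone sketch, not part of the patch or of ConvertUTF.

```c
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most len
   bytes) into *cp. Returns the number of bytes consumed (1 to 4), or 0
   if the input is malformed or truncated. */
size_t
decode_utf8 (const unsigned char *s, size_t len, unsigned long *cp)
{
  size_t n, i;
  unsigned long c, min;

  if (len == 0)
    return 0;

  /* The lead byte determines the sequence length and the smallest
     code point that legitimately needs that length. */
  if (s[0] < 0x80)                { n = 1; c = s[0];        min = 0; }
  else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000UL; }
  else
    return 0;                          /* stray continuation or bad lead */

  if (n > len)
    return 0;                          /* sequence runs past the input */

  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80)
      return 0;                        /* not a continuation byte */
    c = (c << 6) | (unsigned long)(s[i] & 0x3F);
  }

  if (c < min || c > 0x10FFFFUL || (c >= 0xD800UL && c <= 0xDFFFUL))
    return 0;                          /* overlong, out of range, surrogate */

  *cp = c;
  return n;
}
```

Both a Cyrillic letter (two UTF-8 bytes) and a Kanji (three UTF-8 bytes)
decode to code points below U+10000, so the one-code-unit assumption holds
for them. A four-byte UTF-8 sequence always decodes to U+10000 or above.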

I'm not planning to do any further work on this. If someone else wants to
take on the task of bringing this up to product quality then I'd be very
grateful. I think that:

- The change should only kick in if the user has specified something like 
\special{pdf:tounicode XETEX-UTF8-UCS2} and behave as before otherwise; and

- The code should cope with UTF-16 output that is longer than a single pair
of bytes (a surrogate pair); at least by returning an error.
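
The two points above could be sketched roughly as follows. This is only an
illustration under assumptions: XETEX-UTF8-UCS2 is the identifier proposed
above and does not exist as a real mapping, both function names are made up,
and how xdvipdfmx would hand the tounicode name to such a check is not
shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical gate: enable the UTF-8 conversion only when the document
   named the proposed XETEX-UTF8-UCS2 identifier. Any other name (or
   none) means "behave as before". */
int
wants_utf8_bookmarks (const char *tounicode_name)
{
  return tounicode_name != NULL
      && strcmp(tounicode_name, "XETEX-UTF8-UCS2") == 0;
}

/* Hypothetical check: number of UTF-16 code units produced by a UTF-8
   sequence that starts with the given lead byte.
   Returns 1 for BMP characters, 2 when a surrogate pair is needed,
   and 0 for a byte that cannot start a valid sequence. */
int
utf16_units_for_lead (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                          /* ASCII, 1 byte */
  if (lead >= 0xC2 && lead <= 0xDF)
    return 1;                          /* 2-byte sequence */
  if (lead >= 0xE0 && lead <= 0xEF)
    return 1;                          /* 3-byte sequence */
  if (lead >= 0xF0 && lead <= 0xF4)
    return 2;                          /* 4-byte sequence, surrogate pair */
  return 0;                            /* continuation byte or invalid lead */
}
```

A conversion loop could call the second function on each lead byte and
return an error whenever the answer is not 1, which would cover the minimum
asked for in the second point without implementing surrogate pairs.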

Tim

static int
reencodestring (CMap *cmap, pdf_obj *instring)
{
#define WBUF_SIZE 4096
  unsigned char  wbuf[WBUF_SIZE];
  unsigned char *obufcur;
  unsigned char *inbufcur;
  long inbufleft, obufleft;
  UTF16 *utf16TargetStart;
  UTF8 *utf8SourceStart;
  UTF16 utf16_result[3];
  ConversionResult result;

  if (!cmap || !instring)
    return 0;

  inbufleft = pdf_string_length(instring);
  inbufcur  = pdf_string_value (instring);

  wbuf[0]  = 0xfe;
  wbuf[1]  = 0xff;
  obufcur  = wbuf + 2;
  obufleft = WBUF_SIZE - 2;

  while (inbufleft > 0)
  {
    if (obufleft < 2)
    {
      return  -1;
    }

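    /* Refuse a sequence that claims more bytes than remain in the input,
       so the converter never reads past the end of the string. */
    if (trailingBytesForUTF8[inbufcur[0]] + 1 > inbufleft)
    {
      return -1;
    }

    memset(utf16_result, 0, sizeof(utf16_result));
    utf16TargetStart = utf16_result;
    utf8SourceStart = inbufcur;
    /* Convert exactly one UTF-8 sequence. */
    result = ConvertUTF8toUTF16((const UTF8 **)&utf8SourceStart,
                                &(inbufcur[trailingBytesForUTF8[inbufcur[0]]+1]),
                                &utf16TargetStart,
                                &(utf16_result[2]),
                                strictConversion);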
    if (result != conversionOK)
    {
      return -1;
    }

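    /* Write the code unit high byte first (big-endian), independent of
       the host byte order. Only the first UTF-16 code unit is written. */
    *(obufcur) = (unsigned char)((utf16_result[0] >> 8) & 0xff);
    obufcur++;
    obufleft--;
    *(obufcur) = (unsigned char)(utf16_result[0] & 0xff);
    obufcur++;
    obufleft--;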
    inbufleft -= trailingBytesForUTF8[inbufcur[0]] + 1;
    if (inbufleft < 0)
    {
      return  -1;
    }
    inbufcur += trailingBytesForUTF8[inbufcur[0]] + 1;
  }

  pdf_set_string(instring, wbuf, WBUF_SIZE - obufleft);

  return  0;
}
