[XeTeX] Anchor names

Mon Nov 7 02:30:59 CET 2011

Hi Heiko, and Akira,

On 06/11/2011, at 3:55 AM, Heiko Oberdiek wrote:

>       \special{%
>         pdf:ann width 4bp height 2bp depth 2bp<<%
>           /Type/Annot%
>           /foo/ab#abc
>           /Subtype/Link%
>           /Border[0 0 1]%
>           /C[0 0 1]% blue border
>           /A<<%
>             /S/GoToR%%
>             /F(t.tex)%
>             /D<66f6f8>% 
>             % Result: <66f6f8>, but ** WARNING ** Failed to convert input string to UTF16...
>             % /D<c3a46e6368c3b872>%
>             % Result: <feff00e4006e0063006800f80072>
> 	 >>%
> 	>>%
>       }%

I've verified that this is indeed what happens, with 

  This is XeTeX, Version 3.1415926-2.2-0.9997.4 (TeX Live 2010)

Now looking at the source code, at:

   http://ftp.tug.org/svn/texlive/trunk/Build/source/texk/xdvipdfmx/src/spc_pdfm.c?diff_format=u&view=log&pathrev=13771

it is hard to see how those results can occur.

The warning message is only produced when the function

   maybe_reencode_utf8(pdf_obj *instring)

returns a negative value (e.g. -1);
viz. lines 571--578, in function  modstrings :

>>>       }
>>>       else {
>>>         r = maybe_reencode_utf8(vp);
>>>       }
>>>       if (r < 0) /* error occured... */
>>>         WARN("Failed to convert input string to UTF16...");
>>>     }
>>> 	break;

or  lines 1145--1150  (for  pdf:dest  but not actually used here)

>>> #ifdef  ENABLE_TOUNICODE
>>>   error = maybe_reencode_utf8(name);
>>>   if (error < 0)
>>>     WARN("Failed to convert input string to UTF16...");
>>> #endif
>>>     array = parse_pdf_object(&args->curptr, args->endptr, NULL);

Now that function should find only ASCII bytes in  '<66f6f8>'
and  '<c3a46e6368c3b872>' , since as written these consist solely
of the characters '<', '>' and hex digits.
In both cases the string should have remained silently unmodified.

viz.    lines 474--481    function:  maybe_reencode_utf8

>>>   /* check if the input string is strictly ASCII */
>>>   for (cp = inbuf; cp < inbuf + inlen; ++cp) {
>>>     if (*cp > 127) {
>>>       non_ascii = 1;
>>>     }
>>>   }
>>>   if (non_ascii == 0)
>>>     return 0; /* no need to reencode ASCII strings */

What am I misreading, if anything?

Has there been an earlier decoding of  <....>  hex-strings
into byte values, done either by XeTeX or xdvipdfmx ?
If so, then surely it is this which is unnecessary.
(It cannot be XeTeX, since the string is correct in the .xdv file.)
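
To make the suspicion concrete, here is a minimal sketch (not code from
xdvipdfmx) that applies the same ASCII test to a hex-string both as
written and after decoding it into byte values. The names
is_strictly_ascii, hex_value and decode_hex_string are hypothetical
helpers, introduced only for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Same test as the quoted loop: 1 if no byte exceeds 127, else 0. */
static int is_strictly_ascii(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; ++i) {
        if (buf[i] > 127)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: value of one hex digit, or -1 if not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode a PDF hex-string such as "<66f6f8>".
 * Returns the number of bytes written to out, or -1 on malformed
 * input or if out is too small.  An odd final digit is padded
 * with 0, as the PDF specification requires. */
static long decode_hex_string(const char *src, unsigned char *out,
                              size_t outsize)
{
    size_t n = 0;
    int hi = -1;                      /* pending high nibble, if any */

    if (*src++ != '<')
        return -1;
    for (; *src != '\0' && *src != '>'; ++src) {
        int v;
        if (isspace((unsigned char)*src))
            continue;                 /* white space is ignored */
        v = hex_value(*src);
        if (v < 0)
            return -1;
        if (hi < 0) {
            hi = v;
        } else {
            if (n >= outsize)
                return -1;
            out[n++] = (unsigned char)(hi * 16 + v);
            hi = -1;
        }
    }
    if (*src != '>')
        return -1;
    if (hi >= 0) {                    /* odd number of digits */
        if (n >= outsize)
            return -1;
        out[n++] = (unsigned char)(hi * 16);
    }
    return (long)n;
}
```

Applied to the eight characters of  <66f6f8>  as written, is_strictly_ascii
returns 1. Applied to the three bytes 0x66 0xf6 0xf8 that
decode_hex_string produces, it returns 0. So the re-encoding path can
only be reached if something decodes the hex-string first.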

e.g.  function  pst_string_parse_hex   in  pst_obj.c  seems
to be doing this.  But that is only supposed to be used with
code from   cmap_read.c  and  t1-load.c .
And these are only meant for interpreting the font data that goes
into content streams. So I'm at a loss to understand this.

But  'modstrings'  is applied recursively, and part of it
seems to be checking for a CMap (when appropriate?).
So maybe there is an unintended un-encoding that precedes 
an encoding?
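
As a check on that guess, the following sketch (again not xdvipdfmx
code) does a strict UTF-8 to UTF-16BE conversion with a leading BOM.
The function name utf8_to_utf16be is hypothetical, and it is only an
assumption that the real conversion behaves this way. Under that
assumption, both reported results are what one would get if the
hex-strings had already been decoded into bytes.

```c
#include <stddef.h>

/* Hypothetical helper: convert UTF-8 bytes to UTF-16BE, preceded by
 * the byte-order mark FE FF.  Returns the number of bytes written to
 * out, or -1 if the input is not valid UTF-8 or out is too small. */
static long utf8_to_utf16be(const unsigned char *in, size_t inlen,
                            unsigned char *out, size_t outsize)
{
    size_t i = 0, n = 0;

    if (outsize < 2)
        return -1;
    out[n++] = 0xFE;
    out[n++] = 0xFF;

    while (i < inlen) {
        unsigned long cp, min;
        int extra, k;
        unsigned char b = in[i++];

        /* Classify the lead byte by its high bits. */
        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return -1;              /* not a valid lead byte */

        if ((size_t)extra > inlen - i)
            return -1;               /* sequence is truncated */
        for (k = 0; k < extra; ++k) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80)
                return -1;           /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (outsize - n < 2)
                return -1;
            out[n++] = (unsigned char)(cp >> 8);
            out[n++] = (unsigned char)(cp & 0xFF);
        } else {                     /* needs a surrogate pair */
            unsigned long v  = cp - 0x10000;
            unsigned long hi = 0xD800 + (v >> 10);
            unsigned long lo = 0xDC00 + (v & 0x3FF);
            if (outsize - n < 4)
                return -1;
            out[n++] = (unsigned char)(hi >> 8);
            out[n++] = (unsigned char)(hi & 0xFF);
            out[n++] = (unsigned char)(lo >> 8);
            out[n++] = (unsigned char)(lo & 0xFF);
        }
    }
    return (long)n;
}
```

Given the bytes 66 f6 f8 this returns -1, which would correspond to the
warning with the string left unchanged. Given c3 a4 6e 63 68 c3 b8 72
it writes the 14 bytes fe ff 00 e4 00 6e 00 63 00 68 00 f8 00 72, the
same as the second result quoted at the top.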

> 
> It seems that *all* literal strings are affected by the
> unhappy reconversions. But the PDF specification leaves no choice,
> there are various places for byte strings.
> In the example, if a file name has byte string XY and the destination Z,
> then the file name is XY and the destination Z and nothing else. Otherwise
> neither the file nor the destination will be found.
> 
> Thus either (XeTeX/)xdvipdfmx finds a way for specifying arbitrary
> byte strings (at least for PDF strings(/streams)) -- it is a
> requirement of the PDF specification. Or we have to conclude 
> that 8-bit is not supported and that means US-ASCII.
> 
> Yours sincerely
>  Heiko Oberdiek

Hope this helps --- or you can help me  :-)

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------