[XeTeX] Whitespace in input

Ross Moore ross.moore at mq.edu.au
Sat Nov 19 02:01:17 CET 2011


Hi Zdenek,

On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:

>> /ActualText is your friend here.
>> You tag the content and provide the string that you want to appear
>> with Copy/Paste as the value associated to a dictionary key.
>> 
> I do not know whether the PDF specification has evolved since I read
> it the last time. /ActualText allows only single-byte characters, ie
> those with codes between 0 and 255, not arbitrary Unicode characters.

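That is most certainly not true.
/ActualText takes a PDF text string, so arbitrary Unicode can be
coded as UTF-16BE, written as a Hex string that starts with the
byte-order mark FEFF.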
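
As a minimal sketch of that encoding step (Python, not part of the
tagged-pdfLaTeX macros; the function name actual_text_hex is made up
for illustration), the Hex strings seen in the snippets of this message
can be produced like this:

```python
def actual_text_hex(text: str) -> str:
    """Return text as a PDF hex string: FEFF byte-order mark + UTF-16BE."""
    # Characters outside the BMP are emitted as surrogate pairs by the codec.
    payload = text.encode("utf-16-be")
    return "<FEFF" + payload.hex().upper() + ">"


print(actual_text_hex("\U0001D44F"))  # math-italic b -> <FEFFD835DC4F>
print(actual_text_hex(")"))           # close bracket -> <FEFF0029>
```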

Here is a snippet of the (tagged-pdfLaTeX) source coding from
the main example that I showed in my TUG2011 talk (the URL for the
video of the talk is given in several of my previous emails):

>>>    \SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
>>>  b%
>>>    _{\noEMC%
>>>   \TPDFsub 
>>>    \SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
>>>  k%
>>>    \EMC 
>>>  }^{\EMC 
>>>    \SMC attr{/ActualText( )} noendtext 256 {Span}%
>>>  \pdffakespace
>>>    \EMC 
>>>  }%
>>>    \TPDFpopbrack 
>>>    \SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
>>>  \Bigr)%


Inside the resulting PDF, this content looks like:

>>> 1 0 0 1 4.902 2.463 cm
>>> /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )
>>> >>BDC
>>> BT
>>> /F11 9.9626 Tf
>>>  [(b)]TJ
>>> ET
>>> EMC
>>> 1 0 0 1 4.276 4.114 cm
>>> /Span <</MCID 11 /ActualText( )
>>> >>BDC
>>> BT
>>> /F103 1 Tf
>>>  [( )]TJ
>>> ET
>>> EMC
>>> 1 0 0 1 0 -6.577 cm
>>> /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt(  sub k ,  )
>>> >>BDC
>>> BT
>>> /F10 6.9738 Tf
>>>  [(k)]TJ
>>> ET
>>> EMC
>>> 1 0 0 1 4.901 2.463 cm
>>> /mo <</MCID 13 /Alt(  close bracket:,   , )
>>> >>BDC


The full PDF passes all of Adobe's validation tests for
correct PDF syntax, Accessible Content, and PDF/A-1b compliance.

More particularly:
 
  /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )
  >>BDC
  BT
  /F11 9.9626 Tf
   [(b)]TJ
  ET
  EMC

expresses a math-italic 'b' as:

 1.  the glyph in the position of letter 'b' (in CMMI10 font);

 2.  to be spoken aloud as  " , b , "  where commas indicate a slight pause;

 3.  to Copy/Paste as the surrogate pair  D835 DC4F  (UTF-16BE, after
      the FEFF byte-order mark), equivalent to the Plane-1 math-italic
      character 'b' (U+1D44F).
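
The surrogate pair in item 3 follows from the standard UTF-16
arithmetic. A small sketch in Python (the function name surrogate_pair
is hypothetical, chosen only for this illustration):

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1D44F)   # math-italic b
print(f"{high:04X} {low:04X}")        # D835 DC4F
```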

The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
should work independently of full tagging.
The '/mi' is immaterial; it could equally well be '/Span'.


> /ActualText is demonstrated on German hyphenated words such as Zucker
> which is hyphenated as Zuk- ker. I have tried to put /ActualText
> manually via a special, I could see it in the PDF file but it did not
> work.

Yes, because it is quite important to position the tagging pieces
correctly within the PDF content stream. It has to balance correctly
with BT ... ET  and the BDC ... EMC  operator pairs, and there may
be other subtle requirements.
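
A rough way to check the balancing on an uncompressed content stream
is sketched below in Python. It is an assumption-laden illustration,
not a validator: the function name check_nesting is invented, only
BT/ET and BDC/BMC/EMC are examined, literal strings with nested
parentheses are not handled, and none of the "other subtle
requirements" are tested.

```python
import re


def check_nesting(content: str) -> list:
    """Return a list of nesting problems; an empty list means balanced."""
    # Remove literal strings and hex strings so that their contents
    # cannot be mistaken for operators, then split off << and >>.
    text = re.sub(r"\((?:\\.|[^\\()])*\)", " ", content)
    text = re.sub(r"<[0-9A-Fa-f\s]*>", " ", text)
    text = text.replace("<<", " ").replace(">>", " ")

    expected_closer = {"BDC": "EMC", "BMC": "EMC", "BT": "ET"}
    stack, problems = [], []
    for token in text.split():
        if token in expected_closer:
            stack.append(expected_closer[token])
        elif token in ("EMC", "ET"):
            if not stack:
                problems.append(f"{token} without an opener")
            else:
                wanted = stack.pop()
                if wanted != token:
                    problems.append(f"{token} found where {wanted} was expected")
    problems.extend(f"missing {closer}" for closer in reversed(stack))
    return problems


good = "/mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )\n>>BDC\nBT\n/F11 9.9626 Tf\n [(b)]TJ\nET\nEMC\n"
bad = "/mi <</MCID 10>>BDC BT /F11 9.9626 Tf [(b)]TJ EMC ET"
print(check_nesting(good))
print(check_nesting(bad))
```

The first stream is the math-italic 'b' fragment shown earlier; the
second closes the marked content before the text object, which is the
kind of interleaving that a single misplaced literal can produce.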

Certainly it cannot be done with just a single \special .
There needs to be stuff both before and after the content
that causes actual glyphs to be displayed.


Just using \pdfliteral  is not sufficient with pdfTeX; we needed
a special modification that allowed the  /mi <<...>>BDC 
and  EMC to fit snugly around the  BT ... ET .

There could be a similar problem with XeTeX's 
     \special{pdf:literal ... }  
(or whatever is the syntax).
This is the issue that I was trying to discuss with JK in 2009 or 2010.


> 
> When converting a white space to a space character some [complex]
> heuristics is needed while proper conversion of glyphs to characters
> of Indic scripts require just a few strict rules. The ligatures as TRA
> have to appear in the toUnicode map, otherwise its meaning will be
> unclear. If you see the I-matra, go to the last consonant in the
> sequence and put the I-matra character there. If you see the RA glyph
> at the right edge of a syllable, go back to the leftmost consonant in
> the group and prepend RA+VIRAMA there. This is all what has to be done
> with Devanagari. Other Indic scripts contain two-part vowels but the
> rules will be similarly simple. We should not be forced to double the
> size of the PDF file. AR and other PDF rendering programs should learn
> these simple rules and use them when extracting text.

If you can provide the  UTF-16BE Hex representation of these,
I can create a PDF using it as the /ActualText  replacement for 
some arbitrary string of letters.

This will test whether this is a viable approach for Devanagari.
If so, then it is a matter of working out how to expand this
for a full solution.
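
For anyone wanting to prepare such a string, here is a minimal Python
sketch. The sample syllable (TA + VIRAMA + RA + vowel sign I, in
logical order) and both function names are my own choices for
illustration, not something supplied in this thread.

```python
def to_actual_text(text: str) -> str:
    """Encode text as a PDF hex string: FEFF byte-order mark + UTF-16BE."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def from_actual_text(hex_string: str) -> str:
    """Decode a <FEFF...> hex string back to text, as a round-trip check."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    return raw.decode("utf-16")  # the leading BOM selects big-endian


# Logical (Unicode) order, regardless of the visual order of the glyphs:
# U+0924 TA, U+094D VIRAMA, U+0930 RA, U+093F VOWEL SIGN I
syllable = "\u0924\u094D\u0930\u093F"
encoded = to_actual_text(syllable)
print(encoded)  # <FEFF0924094D0930093F>
assert from_actual_text(encoded) == syllable
```

The point of the example is that the Hex string records the characters
in logical order, so no reordering rules are needed at extraction time.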


> 
>> There is a macro package that can do this with pdfTeX, and it is
>> a vital part of my Tagged PDF work for mathematics.
>> Also, I have an example where the CJK.sty package is extended
>> to tag Chinese characters built from multiple glyphs so that
>> Copy/Paste works correctly (modulo PDF reader quirks).
>> 
>> Not sure about XeTeX.
>> 
>> I once tried to talk with Jonathan Kew about what would be needed
>> to implement this properly, but he got totally the wrong idea
>> concerning glyphs and characters, and what was needed to be done
>> internally and what by macros. The conversation went nowhere.

> -- 
> Zdeněk Wagner


Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------





