[XeTeX] Whitespace in input

Zdenek Wagner zdenek.wagner at gmail.com
Sat Nov 19 13:23:53 CET 2011


2011/11/19 Ross Moore <ross.moore at mq.edu.au>:
> Hi Zdenek,
>
> On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:
>
>>> /ActualText is your friend here.
>>> You tag the content and provide the string that you want to appear
>>> with Copy/Paste as the value associated to a dictionary key.
>>>
>> I do not know whether the PDF specification has evolved since I last
>> read it. /ActualText allows only single-byte characters, i.e.
>> those with codes between 0 and 255, not arbitrary Unicode characters.
>
> That is most certainly not true.
> You code up UTF-16BE as Hex strings.
>
> Here is a snippet of the (tagged-pdfLaTeX) source coding from
> the main example that I showed in my  TUG2011 talk.
> The URL for the video of the talk is given in several of my previous emails:
>
Thank you for the sample. I will try again when I have more time.
Maybe there is a stupid bug in my old code. As a matter of fact, when
I was playing with /ActualText I knew much less than I do now.

>>>>    \SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
>>>>  b%
>>>>    _{\noEMC%
>>>>   \TPDFsub
>>>>    \SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
>>>>  k%
>>>>    \EMC
>>>>  }^{\EMC
>>>>    \SMC attr{/ActualText( )} noendtext 256 {Span}%
>>>>  \pdffakespace
>>>>    \EMC
>>>>  }%
>>>>    \TPDFpopbrack
>>>>    \SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
>>>>  \Bigr)%
>
>
> Inside the resulting PDF, this content looks like:
>
>>>> 1 0 0 1 4.902 2.463 cm
>>>> /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )
>>>> >>BDC
>>>> BT
>>>> /F11 9.9626 Tf
>>>>  [(b)]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 4.276 4.114 cm
>>>> /Span <</MCID 11 /ActualText( )
>>>> >>BDC
>>>> BT
>>>> /F103 1 Tf
>>>>  [( )]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 0 -6.577 cm
>>>> /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt(  sub k ,  )
>>>> >>BDC
>>>> BT
>>>> /F10 6.9738 Tf
>>>>  [(k)]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 4.901 2.463 cm
>>>> /mo <</MCID 13 /Alt(  close bracket:,   , )
>>>> >>BDC
>
>
> The full PDF passes all of Adobe's validation tests for
> correct PDF syntax, Accessible Content, and PDF/A-1b compliance.
>
> More particularly:
>
>  /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )
>  >>BDC
>  BT
>  /F11 9.9626 Tf
>   [(b)]TJ
>  ET
>  EMC
>
> expresses a math-italic 'b' as:
>
>  1.  the glyph in the position of letter 'b' (in the CMMI10 font);
>
>  2.  to be spoken aloud as  " , b , "  where commas indicate a slight pause;
>
>  3.  to Copy/Paste as the surrogate pair  U+D835 U+DC4F,
>      equivalent to the Plane-1 math-italic character 'b' (U+1D44F).
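
For reference, the surrogate pair in item 3 follows from the standard
UTF-16 arithmetic. The following is a minimal sketch in Python; the
function name surrogate_pair is only illustrative and is not part of
the macros quoted above.

```python
def surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D44F is MATHEMATICAL ITALIC SMALL B
high, low = surrogate_pair(0x1D44F)
print(f"<FEFF{high:04X}{low:04X}>")      # prints <FEFFD835DC4F>
```

The leading FEFF is the byte-order mark that tells the PDF reader the
hex string is UTF-16BE rather than a single-byte encoding.
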
>
> The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
> should work independently of full tagging.
> The '/mi' is immaterial; it could equally well be  '/Span'.
>
>
>> /ActualText is demonstrated on German hyphenated words such as Zucker,
>> which is hyphenated as Zuk- ker. I have tried to put /ActualText
>> manually via a special; I could see it in the PDF file but it did not
>> work.
>
> Yes, because it is quite important to position the tagging pieces
> correctly within the PDF content stream. It has to balance correctly
> with BT ... ET  and the BDC ... EMC  operator pairs, and there may
> be other subtle requirements.
>
> Certainly it cannot be done with just a single \special .
> There needs to be stuff both before and after the content
> that causes actual glyphs to be displayed.
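
The balancing requirement can be checked mechanically on an extracted
content stream. The following is only a rough sketch in Python under
simplifying assumptions: operators are found with a regular expression
rather than a real PDF tokenizer, so an operator name occurring inside
a literal string such as (BT) would be miscounted, and the name
is_balanced is made up for this illustration.

```python
import re

# Marked-content and text-object operators, not preceded by "/" or a
# word character (so names such as /MCID and hex digits do not match).
OPERATORS = re.compile(r"(?<![/\w])(BDC|BMC|EMC|BT|ET)\b")


def is_balanced(stream):
    """Return True if BDC/BMC...EMC and BT...ET are properly nested."""
    closes = {"EMC": "MC", "ET": "BT"}
    stack = []
    for op in OPERATORS.findall(stream):
        if op in ("BDC", "BMC"):
            stack.append("MC")
        elif op == "BT":
            stack.append("BT")
        elif not stack or stack.pop() != closes[op]:
            return False          # closing operator without its opener
    return not stack              # nothing may be left open


good = """/mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )
>>BDC
BT
/F11 9.9626 Tf
 [(b)]TJ
ET
EMC"""

# What a badly placed literal might produce: EMC lands inside BT ... ET.
bad = "/Span <</ActualText( )>>BDC BT [( )]TJ EMC ET"

print(is_balanced(good))   # True
print(is_balanced(bad))    # False
```
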
>
>
> Just using \pdfliteral  is not sufficient with pdfTeX; we needed
> a special modification that allowed the  /mi <<...>>BDC
> and  EMC to fit snugly around the  BT ... ET .
>
> There could be a similar problem with XeTeX's
>     \special{pdf:literal ... }
> (or whatever is the syntax).
> This is the issue that I was trying to discuss with JK in 2009 or 2010.
>
>
>>
>> When converting a white space to a space character some [complex]
>> heuristics are needed, while proper conversion of glyphs to characters
>> of Indic scripts requires just a few strict rules. Ligatures such as TRA
>> have to appear in the ToUnicode map, otherwise their meaning will be
>> unclear. If you see the I-matra, go to the last consonant in the
>> sequence and put the I-matra character there. If you see the RA glyph
>> at the right edge of a syllable, go back to the leftmost consonant in
>> the group and prepend RA+VIRAMA there. This is all that has to be done
>> with Devanagari. Other Indic scripts contain two-part vowels but the
>> rules will be similarly simple. We should not be forced to double the
>> size of the PDF file. AR and other PDF rendering programs should learn
>> these simple rules and use them when extracting text.
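
The two Devanagari rules above can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not what any
PDF reader actually does: the input is one syllable at a time, already
segmented; each glyph has been mapped through ToUnicode to a string
(a half form or a ligature such as TRA maps to several characters);
the reph is represented by a made-up token REPH; and nukta and other
signs following the last consonant are not treated specially.

```python
I_MATRA = "\u093F"            # DEVANAGARI VOWEL SIGN I
RA_VIRAMA = "\u0930\u094D"    # RA + VIRAMA
REPH = "<reph>"               # hypothetical token for the RA glyph (reph)


def is_consonant(ch):
    # Basic Devanagari consonants KA (U+0915) .. HA (U+0939)
    return "\u0915" <= ch <= "\u0939"


def reorder_syllable(glyphs):
    """Turn one syllable from visual glyph order into logical order."""
    text = "".join(g for g in glyphs if g not in (I_MATRA, REPH))
    if REPH in glyphs:
        # Reph is drawn at the right edge but is typed first.
        text = RA_VIRAMA + text
    if I_MATRA in glyphs:
        # I-matra is drawn first but follows the last consonant.
        last = max((i for i, ch in enumerate(text) if is_consonant(ch)),
                   default=-1)
        text = text[:last + 1] + I_MATRA + text[last + 1:]
    return text


# Visual order: I-matra, half SA, THA  ->  logical SA VIRAMA THA I-matra
sthi = reorder_syllable([I_MATRA, "\u0938\u094D", "\u0925"])
# Visual order: KA, reph  ->  logical RA VIRAMA KA
rka = reorder_syllable(["\u0915", REPH])
```
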
>
> If you can provide the  UTF-16BE Hex representation of these,
> I can create a PDF using it as the /ActualText  replacement for
> some arbitrary string of letters.
>
> This will test whether this is a viable approach for Devanagari.
> If so, then it is a matter of working out how to expand this
> for a full solution.
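
Producing the UTF-16BE hex form for such a test is straightforward.
A minimal sketch in Python follows; the helper name actual_text_entry
is hypothetical, and the output is only the dictionary entry, which
still has to be placed correctly inside a BDC ... EMC pair as
described above.

```python
def actual_text_entry(text):
    """Format a string as an /ActualText entry with a UTF-16BE hex string."""
    payload = text.encode("utf-16-be").hex().upper()
    return f"/ActualText<FEFF{payload}>"


# KA + VOWEL SIGN I, i.e. the logical order of the syllable "ki"
print(actual_text_entry("\u0915\u093F"))   # /ActualText<FEFF0915093F>
```
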
>
>
>>
>>> There is a macro package that can do this with pdfTeX, and it is
>>> a vital part of my Tagged PDF work for mathematics.
>>> Also, I have an example where the CJK.sty package is extended
>>> to tag Chinese characters built from multiple glyphs so that
>>> Copy/Paste works correctly (modulo PDF reader quirks).
>>>
>>> Not sure about XeTeX.
>>>
>>> I once tried to talk with Jonathan Kew about what would be needed
>>> to implement this properly, but he got totally the wrong idea
>>> concerning glyphs and characters, and what needed to be done
>>> internally and what by macros. The conversation went nowhere.
>
>> --
>> Zdeněk Wagner
>
>
> Cheers,
>
>        Ross
>
> ------------------------------------------------------------------------
> Ross Moore                                       ross.moore at mq.edu.au
> Mathematics Department                           office: E7A-419
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


