[XeTeX] Whitespace in input
Zdenek Wagner
zdenek.wagner at gmail.com
Sat Nov 19 13:23:53 CET 2011
2011/11/19 Ross Moore <ross.moore at mq.edu.au>:
> Hi Zdenek,
>
> On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:
>
>>> /ActualText is your friend here.
>>> You tag the content and provide the string that you want to appear
>>> with Copy/Paste as the value associated to a dictionary key.
>>>
>> I do not know whether the PDF specification has evolved since I last
>> read it. /ActualText allows only single-byte characters, i.e.
>> those with codes between 0 and 255, not arbitrary Unicode characters.
>
> That is most certainly not true.
> You code up UTF-16BE as Hex strings.
>
> Here is a snippet of the (tagged-pdfLaTeX) source coding from
> the main example that I showed in my TUG2011 talk.
> The URL for the video of the talk is given in several of my previous emails:
>
Thank you for the sample. I will try again when I have more time.
Maybe there is a stupid bug in my old code. As a matter of fact, when
I was playing with /ActualText I knew much less than I do now.
>>>> \SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
>>>> b%
>>>> _{\noEMC%
>>>> \TPDFsub
>>>> \SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
>>>> k%
>>>> \EMC
>>>> }^{\EMC
>>>> \SMC attr{/ActualText( )} noendtext 256 {Span}%
>>>> \pdffakespace
>>>> \EMC
>>>> }%
>>>> \TPDFpopbrack
>>>> \SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
>>>> \Bigr)%
>
>
> Inside the resulting PDF, this content looks like:
>
>>>> 1 0 0 1 4.902 2.463 cm
>>>> /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )
>>>> >>BDC
>>>> BT
>>>> /F11 9.9626 Tf
>>>> [(b)]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 4.276 4.114 cm
>>>> /Span <</MCID 11 /ActualText( )
>>>> >>BDC
>>>> BT
>>>> /F103 1 Tf
>>>> [( )]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 0 -6.577 cm
>>>> /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt( sub k , )
>>>> >>BDC
>>>> BT
>>>> /F10 6.9738 Tf
>>>> [(k)]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 4.901 2.463 cm
>>>> /mo <</MCID 13 /Alt( close bracket:, , )
>>>> >>BDC
>
>
> The full PDF passes all of Adobe's validation tests for
> correct PDF syntax, Accessible Content, and PDF/A-1b compliance.
>
> More particularly:
>
> /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )
> >>BDC
> BT
> /F11 9.9626 Tf
> [(b)]TJ
> ET
> EMC
>
> expresses a math-italic 'b' as:
>
> 1. the glyph in the position of letter 'b' (in CMMI10 font);
>
> 2. to be spoken aloud as " , b , " where commas indicate a slight pause;
>
> 3. to Copy/Paste as the surrogate pair U+D835 U+DC4F,
> equivalent to the Plane-1 math-italic character 'b' (U+1D44F).
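The hex string in the snippet can be reproduced mechanically. What follows is only a minimal sketch in Python, not the code used for the example above; the function name actual_text_hex is hypothetical. It encodes the text as UTF-16BE and puts the FEFF byte-order mark in front, which is the form shown in the /ActualText values quoted above.

```python
def actual_text_hex(text):
    """Return text as a PDF hex string: <FEFF...> followed by UTF-16BE bytes.

    Hypothetical helper for illustration; it only builds the string value,
    it does not write anything into a PDF.
    """
    data = text.encode("utf-16-be")  # code points above U+FFFF become surrogate pairs
    return "<FEFF" + data.hex().upper() + ">"


if __name__ == "__main__":
    print(actual_text_hex("\U0001D44F"))  # math-italic b -> <FEFFD835DC4F>
    print(actual_text_hex(")"))           # -> <FEFF0029>
```

The two printed values match the /ActualText strings for 'b' and for the closing bracket in the source snippet.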
>
> The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
> should work independently of full tagging.
> The '/mi' is immaterial; it could equally well be '/Span'.
>
>
>> /ActualText is demonstrated on German hyphenated words such as Zucker
>> which is hyphenated as Zuk- ker. I have tried to put /ActualText
>> manually via a special, I could see it in the PDF file but it did not
>> work.
>
> Yes, because it is quite important to position the tagging pieces
> correctly within the PDF content stream. It has to balance correctly
> with BT ... ET and the BDC ... EMC operator pairs, and there may
> be other subtle requirements.
>
> Certainly it cannot be done with just a single \special .
> There needs to be stuff both before and after the content
> that causes actual glyphs to be displayed.
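The balancing requirement can at least be checked on an uncompressed content stream. The following is a rough sketch in Python under stated assumptions: the function name is_balanced is hypothetical, the stream is given as plain text, and literal strings are assumed not to contain nested parentheses. It tests only that BT ... ET and BDC/BMC ... EMC are properly nested; the "other subtle requirements" mentioned above are not covered.

```python
import re

# Each opening operator and the closing operator it must be matched by.
OPENERS = {"BT": "ET", "BDC": "EMC", "BMC": "EMC"}

_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")   # (...) strings, no nesting
_HEXSTR = re.compile(r"<[0-9A-Fa-f\s]*>")        # <FEFF...> hex strings
_OP = re.compile(r"(?<![A-Za-z0-9])(BT|ET|BDC|BMC|EMC)(?![A-Za-z0-9])")


def is_balanced(stream):
    """True if text objects and marked-content sequences nest properly."""
    # Blank out string contents so that e.g. (BT) is not taken for an operator.
    stream = _LITERAL.sub(" ", stream)
    stream = _HEXSTR.sub(" ", stream)
    expected = []  # stack of closing operators still owed
    for op in _OP.findall(stream):
        if op in OPENERS:
            expected.append(OPENERS[op])
        elif not expected or expected.pop() != op:
            return False
    return not expected


if __name__ == "__main__":
    good = "/mi <</MCID 10 /ActualText<FEFFD835DC4F>>>BDC\nBT\n[(b)]TJ\nET\nEMC"
    bad = "BT\n/Span <</ActualText( )>>BDC\n[( )]TJ\nET\nEMC"
    print(is_balanced(good), is_balanced(bad))
```

In the second sample the BDC opens inside the text object but the EMC comes after ET, which is the kind of misplacement a single \special tends to produce.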
>
>
> Just using \pdfliteral is not sufficient with pdfTeX; we needed
> a special modification that allowed the /mi <<...>>BDC
> and EMC to fit snugly around the BT ... ET .
>
> There could be a similar problem with XeTeX's
> \special{pdf:literal ... }
> (or whatever the syntax is).
> This is the issue that I was trying to discuss with JK in 2009 or 2010.
>
>
>>
>> When converting white space to a space character, some [complex]
>> heuristics are needed, while proper conversion of glyphs to characters
>> of Indic scripts requires just a few strict rules. Ligatures such as TRA
>> have to appear in the ToUnicode map, otherwise their meaning will be
>> unclear. If you see the I-matra, go to the last consonant in the
>> sequence and put the I-matra character there. If you see the RA glyph
>> at the right edge of a syllable, go back to the leftmost consonant in
>> the group and prepend RA+VIRAMA there. This is all that has to be done
>> with Devanagari. Other Indic scripts contain two-part vowels but the
>> rules will be similarly simple. We should not be forced to double the
>> size of the PDF file. AR and other PDF rendering programs should learn
>> these simple rules and use them when extracting text.
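The two Devanagari rules can be written down in a few lines. This is only a sketch in Python under several assumptions: the input is a single syllable, already split off and already mapped from glyphs to code points; the RA glyph at the right edge is represented by a hypothetical marker REPH; and the consonants are taken to be U+0915 to U+0939 only (no nukta forms). The function name syllable_to_logical is likewise hypothetical.

```python
RA, VIRAMA, I_MATRA = "\u0930", "\u094D", "\u093F"
REPH = "<reph>"  # hypothetical marker for the RA glyph at the right edge


def is_consonant(ch):
    """Devanagari consonants KA..HA (assumption: nukta forms are ignored)."""
    return len(ch) == 1 and "\u0915" <= ch <= "\u0939"


def syllable_to_logical(glyphs):
    """Reorder one syllable from visual glyph order to logical Unicode order."""
    body = [g for g in glyphs if g not in (I_MATRA, REPH)]
    if I_MATRA in glyphs:
        # Rule 1: the I-matra is drawn first but belongs after the last consonant.
        last = max((i for i, g in enumerate(body) if is_consonant(g)), default=-1)
        body.insert(last + 1, I_MATRA)
    if REPH in glyphs:
        # Rule 2: the RA glyph at the right edge becomes RA+VIRAMA at the front.
        body = [RA, VIRAMA] + body
    return "".join(body)


if __name__ == "__main__":
    KA = "\u0915"
    print(syllable_to_logical([I_MATRA, KA]) == KA + I_MATRA)   # True
    print(syllable_to_logical([KA, REPH]) == RA + VIRAMA + KA)  # True
```

Splitting the glyph run into syllables and recognising ligatures such as TRA are left out; the latter is what the ToUnicode map is needed for.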
>
> If you can provide the UTF-16BE Hex representation of these,
> I can create a PDF using it as the /ActualText replacement for
> some arbitrary string of letters.
>
> This will test whether this is a viable approach for Devanagari.
> If so, then it is a matter of working out how to expand this
> for a full solution.
>
>
>>
>>> There is a macro package that can do this with pdfTeX, and it is
>>> a vital part of my Tagged PDF work for mathematics.
>>> Also, I have an example where the CJK.sty package is extended
>>> to tag Chinese characters built from multiple glyphs so that
>>> Copy/Paste works correctly (modulo PDF reader quirks).
>>>
>>> Not sure about XeTeX.
>>>
>>> I once tried to talk with Jonathan Kew about what would be needed
>>> to implement this properly, but he got totally the wrong idea
>>> concerning glyphs and characters, and what needed to be done
>>> internally and what by macros. The conversation went nowhere.
>
>> --
>> Zdeněk Wagner
>
>
> Cheers,
>
> Ross
>
> ------------------------------------------------------------------------
> Ross Moore ross.moore at mq.edu.au
> Mathematics Department office: E7A-419
> Macquarie University tel: +61 (0)2 9850 8955
> Sydney, Australia 2109 fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
>
>
>
>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex
>
--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz