[XeTeX] New feature REQUEST for xetex

Andrew Cunningham lang.support at gmail.com
Tue Feb 23 06:05:49 CET 2016


PDF text is essentially a sequence of glyphs, and it uses the ToUnicode
mappings to resolve those glyphs to Unicode codepoints.

For OpenType fonts, the problem applies to any glyph that is not a default
glyph assigned to a specific codepoint, a true ligature or a variation
selector, so in theory for complex scripts it could include many if not most
glyphs in a font, depending on how sophisticated the typography and font
design are. The reality is that you get better performance in PDFs using
"dumb", simple fonts.

PDF accessibility has two stages:

1) The first stage (best supported) is the ToUnicode mapping. Essentially,
text in a PDF is just a sequence of glyphs; it is the ToUnicode mapping that
resolves them to real Unicode codepoints.

But the ToUnicode mapping can only map one glyph to one codepoint, or one
glyph to a sequence of codepoints (for ligatures and variation selectors).
The documentation on PDFs seems to spend a lot of time discussing the ins
and outs of this in reference to CID fonts.

For OpenType fonts, I assume that the cmap table is the basis of the
ToUnicode mapping. In OpenType fonts, not all glyphs will have mappings to
Unicode codepoints.
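
As a rough illustration of the mapping described above, here is a minimal
Python sketch. The glyph IDs and the table are invented for the example and
do not come from any real font; a real ToUnicode CMap is a PDF stream, not a
Python dict. Substituting U+FFFD for unmapped glyphs is also an assumption,
since viewers differ in what they emit.

```python
# Hypothetical glyph IDs; a real font assigns its own.
TO_UNICODE = {
    10: "\u0928",              # one glyph -> one codepoint (na)
    11: "\u092E",              # one glyph -> one codepoint (ma)
    12: "\u0915\u094D\u0937",  # conjunct glyph -> sequence (ka, virama, ssa)
}
# Glyph 99 stands for a contextual form with no entry in the table.


def extract_text(glyph_ids, to_unicode):
    """Resolve glyphs to text in glyph order, as a text extractor would."""
    # Assumption: unmapped glyphs become U+FFFD (often shown as a box).
    return "".join(to_unicode.get(g, "\uFFFD") for g in glyph_ids)


text = extract_text([10, 11, 12, 99], TO_UNICODE)
```

The last glyph has no mapping, so the extracted string ends in a replacement
character, which is roughly what shows up as boxes when copying.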

Likewise, PDFs are the end result of the rendering process; PDF tools
cannot undo the reordering and certain types of substitution that produce
the final rendered string.
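
To make the reordering problem concrete, here is a small Python sketch. It
is a toy model, not how any shaping engine or PDF tool actually works: it
assumes every glyph maps to exactly one codepoint and it only moves the
short i-matra in front of the single preceding character (a real shaper
moves it in front of the whole consonant cluster).

```python
I_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I

# The word "kitab" in logical (Unicode) order: ka, i-matra, ta, aa-matra, ba
logical = "\u0915\u093F\u0924\u093E\u092C"


def visual_order(s):
    """Toy reordering: place each i-matra before the preceding character."""
    out = []
    for ch in s:
        if ch == I_MATRA and out:
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return "".join(out)


# What a glyph-by-glyph extraction sees after rendering.
extracted = visual_order(logical)

# Searching for the logical string in the extracted text fails.
found = logical in extracted
```

Even though every glyph resolves to the right codepoint, the codepoints come
out in visual order, so copy and search both see a different string.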

2) The second stage in accessible PDFs is the use of ActualText, but
customised, dedicated tools are needed.

As indicated, cut-and-paste operations will not work, since they operate on
the text layer, not on the ActualText, and I suspect that is unlikely to
change. Adobe's APIs for screen readers, by contrast, will use the
ActualText layer.
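
For readers who have not seen ActualText, the following Python sketch builds
the kind of marked-content fragment involved. It is only an illustration
under stated assumptions: the helper name and the glyph IDs are hypothetical,
a real content stream also needs the surrounding text object and font
selection, and this is not a description of what XeTeX or any other tool
emits.

```python
def actual_text_span(text, glyph_hex):
    """Wrap a glyph string in a /Span carrying the intended Unicode text."""
    # PDF text strings can be UTF-16BE with a leading byte order mark,
    # written here as a hex string.
    payload = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{payload}> >> BDC <{glyph_hex}> Tj EMC"


# Logical text: ka + i-matra. The glyph IDs below are made up; in the
# rendered glyph string the i-matra glyph would come first.
span = actual_text_span("\u0915\u093F", "00150012")
```

The glyph string still drives what is drawn; the ActualText entry only tells
a consumer that chooses to read it what the span is meant to say.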

If cutting and pasting is an important use case for you, then PDF is the
wrong file format for you. PDF is a pre-print format, not an archival
format, despite all the rhetoric about PDF/A, PDF/UA, etc. Or, more
precisely, it is only ever going to be an archival format for a certain set
of languages in certain scripts with non-OpenType fonts, or for documents
that avoid using certain OpenType features.

Andrew



On 23 February 2016 at 14:58, ShreeDevi Kumar <shreeshrii at gmail.com> wrote:

> >> the problem is caused just by a few characters, especially the short
> i-matra. It might be more difficult in other Indic scripts containing
> two-part vowels.
>
> It is more extensive and applies to all/most glyphs used for conjuncts in
> addition to the short i-matra. It also applies to other Indic scripts as
> well as other complex scripts.
>
> Example below shows how the conjuncts get copied and displayed as square
> boxes. It is also font dependent.
>
> नमऽे ुगेदूसाजु ंगाराः क ु ुराः वाः । अनािरराः ससाः िशवाा
> भजािधपाीक ु ृताा भवि ॥ १॥
>
> >> It might be useful to use ActualText only for selected words.
>
> That might work for a predominantly English text with some devanagari, but
> not for full devanagari texts.
>
> >> It is not only the problem of copy&paste, you will not be able to use
> the search dialog in Acrobat. For instance, you will not be able to find
> किताब.
>
> Yes, you are right. Search does not work for unicode fonts for complex
> scripts in the current pdfs.
>
> Hence the request ...
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
> Hi all,
>
> the problem is caused just by a few characters, especially the short
> i-matra. It might be more difficult in other Indic scripts containing
> two-part vowels. The reason is that visually they appear in a different
> order than they should appear in Unicode representation. It can be solved
> by using ActualText. If all words were entered this way, the size of the
> PDF will double. It might be useful to use ActualText only for selected
> words.
>
> It is not only the problem of copy&paste, you will not be able to use the
> search dialog in Acrobat. For instance, you will not be able to find किताब.
>
>
>
> Zdeněk Wagner
> http://ttsm.icpf.cas.cz/team/wagner.shtml
> http://icebearsoft.euweb.cz
>
> 2016-02-22 14:38 GMT+01:00 ShreeDevi Kumar <shreeshrii at gmail.com>:
>
>> Hi Jonathan,
>>
>> I am using xetex/xelatex for typesetting of devanagari texts.
>> eg. http://sanskritdocuments.org/doc_devii/gangAShTakamkAlidAsa.pdf
>> http://sanskritdocuments.org/doc_devii/gangAShTakamkAlidAsa.html?lang=sa
>> (HTML TEXT version of the same)
>>
>> However, when the Devanagari text is copied from the pdf, it does not
>> display correctly - which is the case with complex scripts with most pdf
>> creators (as far as I know).
>>
>> eg.
>> ॥ गङ्गाष्टकं कालिदासकृतम् ॥
>> is displayed as
>> ॥ गाकं कािलदासकृतम ॥
>>
>> Is it possible to add a feature to xetex to support search and copy of
>> complex script text in scripts such as devanagari?
>>
>> It would really be great to have this "coming soon to a XeTeX near
>> you"....... :-)
>>
>> Thanks.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Thu, Feb 18, 2016 at 4:28 PM, Jonathan Kew <jfkthame at gmail.com> wrote:
>>
>>> This is a pretty specialized feature, likely to be of interest only to a
>>> small minority of users. But for those it concerns, here's something that
>>> is "coming soon to a XeTeX near you".......
>>>
>>>
>>> I've recently implemented a new feature, controlled by the integer
>>> parameter \XeTeXinterwordspaceshaping. This will be available in the TL'16
>>> release, if all goes well.
>>>
>>> This feature is relevant only when using OpenType/Graphite/AAT fonts,
>>> not legacy .tfm-based fonts.
>>>
>>> When \XeTeXinterwordspaceshaping is greater than 0, XeTeX will attempt
>>> to support fonts where the width of inter-word spaces may vary
>>> contextually, depending on the preceding and following text. This is needed
>>> by fonts such as SIL's Awami Nastaliq (in development) where words are
>>> expected to kern together across spaces.
>>>
>>> The default behavior of xetex is to measure each word in isolation, and
>>> simply string together a sequence of such word and space (glue) nodes to
>>> form the horizontal list that is then line-broken to form a paragraph.
>>> Normally, when inter-word spaces do not depend on the adjacent words, this
>>> works fine; but in Awami the width of inter-word spaces may vary
>>> drastically, even becoming negative in some cases.
>>>
>>> Setting \XeTeXinterwordspaceshaping=1 tells xetex to measure such spaces
>>> "in context" and take account of the contextually-modified widths during
>>> line breaking. This greatly improves the typeset result with such a font.
>>> Each word is still shaped and rendered individually, but line-breaking and
>>> word spacing respects the inter-word kerning.
>>>
>>> A further complication occurs when not only the width of the space but
>>> also the glyphs of the adjacent words themselves may be subject to
>>> contextual changes. An example of this would be a font that has OpenType
>>> ligature rules that apply to multiple-word sequences; e.g. a symbol font
>>> that ligates the text "credit card" to render a credit-card icon. Another
>>> example is the word-final swash forms in Hoefler Italic, which are intended
>>> to be used at end-of-line but NOT before word spaces within the line.
>>>
>>> These cases are addressed with \XeTeXinterwordspaceshaping=2. With this
>>> value, not only are inter-word spaces measured in context, but also each
>>> run of text (words and intervening spaces) in a single font will be
>>> re-shaped as a unit at \shipout time. This allows full shaping (contextual
>>> swashes, ligatures, etc) to take effect across inter-word spaces.
>>>
>>> Currently, this feature is implemented only in the "contextual-space"
>>> branch of the code at sourceforge; anyone interested in testing it will
>>> need to check out and build the code from there. After some time, if no
>>> major problems show up, I expect to merge it to the master branch, and then
>>> to the TeXLive source tree.
>>>
>>> Feedback welcome..........
>>>
>>> JK
>>>
>>>
>>>
>>> --------------------------------------------------
>>> Subscriptions, Archive, and List information, etc.:
>>>  http://tug.org/mailman/listinfo/xetex
>>>


-- 
Andrew Cunningham
lang.support at gmail.com