[XeTeX] New feature REQUEST for xetex

Zdenek Wagner zdenek.wagner at gmail.com
Tue Feb 23 10:40:48 CET 2016


Hi all,

Several years ago I typeset some texts with pdflatex and the devnag package
(XeTeX did not exist at that time); they are still available here:
http://icebearsoft.euweb.cz/dvngpdf/

The situation in the Indic scripts is much more complex and cannot be
solved by a ToUnicode map. Half-consonants can be mapped to a consonant
followed by a virama, and conjuncts such as ksha can be mapped to ka +
virama + sha. The real problem is reordering. I will give examples in Hindi
only, because I do not know other Indic languages.

Take the word kitaab (= किताब, meaning a book). The correct character order
is ka + i-matra + ta + aa-matra + ba, but in the visual representation the
glyphs are ordered as i-matra + ka + ta + aa-matra + ba. You cannot blindly
move the i-matra beyond the following consonant. The word shakti (= शक्ति,
force) is sha + ka + virama + ta + i-matra in character order but visually
sha + i-matra + {kta-conjunct | half-ka + ta}, where the second form is
usually preferred in present-day Hindi. Even stranger reorderings exist:
marzii is ma + ra + virama + za + ii-matra in character order but visually
ma + za + ii-matra + hook-repha.
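
To see the reordering in terms of code points, here is a small illustrative
sketch in plain Python (the glyph labels are my own informal names, not the
output of any actual shaping engine):

    # Logical (Unicode) character order of किताब: ka + i-matra + ta + aa-matra + ba
    word = "\u0915\u093F\u0924\u093E\u092C"
    print([f"U+{ord(c):04X}" for c in word])
    # -> ['U+0915', 'U+093F', 'U+0924', 'U+093E', 'U+092C']

    # Visual glyph order after shaping: the i-matra glyph is placed before ka,
    # so reading the glyphs naively from left to right would give "ikataab".
    visual = ["i-matra", "ka", "ta", "aa-matra", "ba"]
    print(" + ".join(visual))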

The case of two-part vowels in some scripts is difficult too. You generally
have the following scheme:

vowel-part-1 + consonant-group or conjunct + vowel-part-2

Both parts exist as separate glyphs, mapped to other characters, so you
must know whether a glyph represents a character on its own or whether two
glyphs together compose a two-part vowel.
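
For instance, the Bengali vowel sign O (U+09CB) is a single character whose
canonical decomposition is exactly such a pair: a left part drawn before the
consonant and a right part drawn after it. A quick check in Python (the
syllable "ko" below is just my own illustration):

    import unicodedata

    # BENGALI VOWEL SIGN O decomposes into U+09C7 (left part, drawn before the
    # consonant) and U+09BE (right part, drawn after it).
    print(unicodedata.decomposition("\u09CB"))   # -> '09C7 09BE'

    # "ko" is ka + vowel sign O in character order (two characters), but it is
    # rendered as three glyphs: e-matra + ka + aa-matra.  A glyph-level reading
    # must recognise that the first and last glyph together form U+09CB.
    ko = "\u0995\u09CB"   # ক + ো = কো
    print([f"U+{ord(c):04X}" for c in ko])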

These are not things that can be solved by simple ToUnicode maps. It may not
be necessary to attach ActualText to every single word, but it is certainly
necessary for a great many of them.
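
For such words the producer would have to wrap the shaped glyphs in a
marked-content span that carries the logical text. A rough sketch of what
that looks like (the glyph IDs are placeholders and the fragment assumes an
enclosing BT/ET text object with a font already selected; the point is only
the /ActualText value, the logical characters in UTF-16BE as the PDF
specification requires):

    # /ActualText value for किताब: UTF-16BE with a BOM, written as a PDF hex string.
    word = "\u0915\u093F\u0924\u093E\u092C"
    hex_actual = (b"\xFE\xFF" + word.encode("utf-16-be")).hex().upper()

    # Placeholder glyph IDs stand in for whatever the shaper produced; the span
    # tells a conforming reader which characters those glyphs represent.
    span = (f"/Span << /ActualText <{hex_actual}> >> BDC\n"
            f"  [ <0010001100120013> ] TJ\n"
            f"EMC")
    print(span)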


Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

2016-02-23 6:21 GMT+01:00 Andrew Cunningham <lang.support at gmail.com>:

> Simon,
>
> On 23 February 2016 at 14:12, Simon Cozens <simon at simon-cozens.org> wrote:
>
>> On 23/02/2016 13:54, Andrew Cunningham wrote:
>> > PDF/UA, for instance, leaves the question deliberately ambiguous.
>> > ActualText is the way to make the content accessible, but developers
>> > creating tools for PDF do not actually have to process the ActualText.
>>
>> Yeah. (Sorry to keep banging the drum but) I've just done some tests
>> with SILE, which includes some support for tagged/accessible PDFs. Even
>> when the ActualText includes the correct Devanagari, I am still seeing
>> the same problems with cut-and-paste. I'm not sure what needs to be done
>> to get it right.
>>
>>
> In terms of SILE ... supporting generation of other formats like XPS as an
> alternative to PDF is probably the only way forward for complex script
> languages.
>
> If SILE is tagging the PDFs and adding ActualText attributes, then it is
> doing everything it should be doing. The problems are with the PDF
> specification itself, what it was originally designed to be (a pre-print
> format based on the PostScript language) and the limitations placed on it
> by the developers of the spec.
>
> Andrew
>
>
>

