[texhax] extracting a plain text file of the final document

Paul Isambert zappathustra at free.fr
Sun Jan 15 10:28:00 CET 2012

Philip TAYLOR <P.Taylor at rhul.ac.uk> wrote:
> Seems a very useful utility, Reinhard, and one of
> which I was previously unaware, but why does it
> eat all the "Th" (but not "th") groups ?!

Probably because you've used a font with a "Th" ligature that isn't
recognized. Indeed, with a document in CM, "Th" comes out as "Th", while
the same document in Chaparral renders "Th" as some impossible glyph.

It also renders "ff" as a ligature, unless you include (in the TeX
document with pdfTeX or LuaTeX):

  \pdfglyphtounicode{ff}{0066 0066}

in which case it properly analyzes the ligature.

So something along those lines should be tried with "Th", provided you
find the glyph's name (not "Th" in Chaparral):

  \pdfglyphtounicode{<name>}{0054 0068}
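
Put together, a minimal sketch of a test file for pdfTeX might look like
the following. The glyph name T_h is only an assumption (Adobe fonts
often name ligatures by joining the component names with underscores),
so check the actual name in the font first. The font setup for Chaparral
is left out; with the default CM fonts the second mapping simply has no
effect. Note also that the mappings are only used once ToUnicode
generation is switched on with \pdfgentounicode, and that pdfTeX does
this for Type 1 fonts.

  \documentclass{article}
  % Switch on generation of ToUnicode CMaps.
  \pdfgentounicode=1
  % Map the "ff" ligature glyph to the two characters f f.
  \pdfglyphtounicode{ff}{0066 0066}
  % ASSUMPTION: T_h is a guessed glyph name for the "Th" ligature;
  % replace it with the name actually used in the font.
  \pdfglyphtounicode{T_h}{0054 0068}
  \begin{document}
  The staff thought so.
  \end{document}

Running the text extraction on the resulting PDF should then show
whether "Th" and "ff" come out as plain characters.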

Hopefully it works. Still, that must be done before compilation; or
perhaps pdftotext has some option signalling that a given glyph must be
mapped to given character(s)?

