[pdftex] hacking on tex parsing

Wed May 6 00:07:00 CEST 2020

Hi Peter, Karl

On 5 May 2020, at 11:21 am, Karl Berry <karl at freefriends.org> wrote:

Hi Peter,

to make some TTS software to help me proofread my papers,

If TTS stands for Text-to-Speech, have you tried the kind of screen readers that
are used by visually impaired people?

Alternatively, Adobe’s Acrobat Reader (and Pro) have a “Read Out Loud” function,
designed essentially for reading eBooks.
It does a quite reasonable job, especially when the LaTeX-generated PDF has
been processed for published standards like PDF/A, or even better PDF/UA.

The issue is that TeX-fonts can involve custom encodings, especially with mathematical expressions.
These can upset the correct extraction of the characters, by interpreting the glyphs as being
something other than what was intended in your (La)TeX source.
However, if it is mostly English prose, then it should do a fine job;
when it comes to proof-reading mathematics, then you have to look at it anyway to
appreciate the 2-dimensional aspect of mathematics having superscript, subscripts, fractions,
matrices, integrals with limits, etc.
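
For what it's worth, here is a minimal sketch of how pdfTeX can be asked to record
glyph-to-Unicode mappings, which is one way to reduce such extraction problems.
It assumes pdfLaTeX and the glyphtounicode.tex file that TeX distributions normally
ship; the sample sentence is only illustrative, and mathematical fonts may well
need further mappings beyond this.

```latex
\documentclass{article}
\input{glyphtounicode}   % table mapping glyph names to Unicode code points
\pdfgentounicode=1       % ask pdfTeX to write ToUnicode CMaps for embedded fonts
\usepackage[T1]{fontenc}

\begin{document}
% Ligatures such as ffi and fl are single glyphs in the font; with the
% mappings above they should copy and read out as ordinary letters.
An efficient workflow for the office.
\end{document}
```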

Such font aspects are minimised when you use the pdfx package to process
for PDF/A, say.
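
A minimal sketch of what that could look like, assuming pdfLaTeX and a reasonably
recent pdfx. The a-2b option is just one of the conformance levels the package
offers, and the metadata values are placeholders, not taken from any real paper.

```latex
% pdfx reads document metadata from \jobname.xmpdata; writing it from the
% source file keeps everything in one place. All values are placeholders.
\begin{filecontents*}{\jobname.xmpdata}
\Title{Sample paper}
\Author{A. N. Author}
\Keywords{text extraction}
\end{filecontents*}

\documentclass{article}
\usepackage[a-2b]{pdfx}   % request PDF/A-2b conformance

\begin{document}
Mostly English prose, with a little mathematics: $a^2 + b^2 = c^2$.
\end{document}
```

Whether the result really conforms should still be checked with a validator;
the package option alone is a request, not a guarantee.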

If you want headings to be recognised as such, then you need to process for
PDF/UA, which means using the “Tagged PDF” format.
See my web-pages for real-world examples of this:

   http://web.science.mq.edu.au/~ross/TaggedPDF/
   http://web.science.mq.edu.au/~ross/TaggedPDF/TUG2019-movies

As far as I know, general text-to-speech for LaTeX remains an open
problem.

Surely this is entirely subsumed into the efforts to handle text-extraction reliably,
as is needed with PDF/A and PDF/UA output.

Shortly before his untimely death, Eitan Gurari was working on
this as another output mode for tex4ht, but I don't believe it's seen
any development since. What he did was targeted at Emacspeak
(https://tug.org/TUGboat/tb28-3/tb90gurari.pdf, last section). For
convenience, I'll attach the eslatex script he mentions in case you want
to try it. Although it's still in the sources, I took it out of the
binary directories in TeX Live years ago.

A search for latex document to speech turns up
https://github.com/martysweet/latex-to-speech
which I've never looked at. FWIW.

but I'm hoping that I can reuse some existing tex parsing code.

This is totally unnecessary if you get the text from the output PDF,
not the input (La)TeX source.

There are other standalone programs, such as KaTeX, LaTeX2HTML, and
MathJax, which can parse TeX (or just TeX math) to greater or lesser
extents. Perhaps something in there would be useful.

instance, if I can grab the document after newcommand or
DeclareMathOperator has been processed that would be very helpful. Or if

You have to redefine the macros to do so. This is what tex4ht does -- it
runs TeX, but redefines virtually everything, often at a low level,
in order to be able to intervene and generate the various output formats.
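
As a toy illustration of the redefinition idea only (this is not how tex4ht
itself is implemented, which works at a much lower level): the sketch below
wraps one operator so that each use is also noted in a side file. The names
\spoken and \origrank and the .spoken file extension are made up for this example.

```latex
\documentclass{article}
\usepackage{amsmath}

\newwrite\spoken                          % hypothetical side-channel file
\immediate\openout\spoken=\jobname.spoken

\DeclareMathOperator{\rank}{rank}

% Save the original meaning, then wrap it so every use is recorded as well.
\let\origrank\rank
\renewcommand{\rank}{\immediate\write\spoken{operator: rank}\origrank}

\begin{document}
$\rank A \le n$
\immediate\closeout\spoken
\end{document}
```

Doing this for every macro that matters in a real document is where the work
lies, which is why existing converters redefine so much.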

there's some tree-like data structure that gets created

(pdf)tex itself (and tex4ht) don't build trees. They operate token by token.
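
One way to watch that token-by-token processing is through TeX's tracing
parameters, which log each macro as it is expanded. A small sketch, where \R
is a hypothetical user macro:

```latex
\documentclass{article}
\newcommand{\R}{\mathbf{R}}   % hypothetical user macro

\begin{document}
\tracingonline=1   % show the trace on the terminal as well as in the .log file
\tracingmacros=1   % log every macro expansion together with its arguments
$x \in \R$
\tracingmacros=0   % switch tracing off again
\end{document}
```

The log then shows the expansion of \R, followed by the expansions of whatever
it calls in turn. It is a flat stream of expansions rather than a tree.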

You might get more and better answers from texhax at tug.org (general
public mailing list), tex.stackexchange.com, etc. The above is just what
comes to my mind, certainly not definitive.

If you get anywhere, we'd like to publish something about it in TUGboat :).

All the best,
Karl

Hope this helps.
Stay safe.

Ross

Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M: +61 407 288 255  |  E: ross.moore at mq.edu.au
http://www.maths.mq.edu.au