Mike Marchywka marchywka at
Sat May 7 11:58:32 CEST 2022

[ top posting due to thread hijack :) ]
Years ago I ran into authors who were unaware that their
PDFs hid things like data and text, or had a cheap encryption feature
of no real value, just hassle :)
Can anyone recommend a decent open source OCR program that could
work with scanned PDFs? Converting the pages to TIFF or BMP is easy,
but IIRC many years ago I played with some OCR package and
it did not do well on the PDF I had. Scanned PDFs may not be
that common any more, but I could add OCR to TooBib :)

btw, does Zotero actually extract citation info from ANY PDF
or just "manage" them? I have never gotten their web form
to work with any PDF, although it might work if it can extract
a DOI from the link.

I'm adding more special case handlers and have some support
for extraction from local files, although it may not extract
the original web address or canonical URL etc. even if it
is in the HTML, although likely the DOI is good enough.
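
To illustrate the DOI fallback mentioned above, here is a minimal sketch of
pulling DOI candidates out of the text of a saved HTML file or a link. It is
not TooBib's actual code: the name find_dois and the pattern DOI_RE are
made up for this example, and the pattern is only a common heuristic
("10.", a 4 to 9 digit registrant code, a slash, then a suffix), so
unusual DOIs may be missed or over-matched.

```python
import re

# Heuristic pattern (an assumption, not an official grammar):
# "10." + 4-9 digit registrant code + "/" + suffix up to whitespace,
# a double quote, or an angle bracket.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return candidate DOIs in order of first appearance, without duplicates."""
    found = []
    for match in DOI_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;:)]}'")
        if doi not in found:
            found.append(doi)
    return found
```

For example, find_dois(open("page.html").read()) would list the candidates
found in a locally saved page; "page.html" is a placeholder file name.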

 Mike Marchywka 
306 Charles Cox Drive 
Canton, GA 30115

From: texhax < at> on behalf of Reinhard Kotucha <reinhard.kotucha at>
Sent: Friday, May 6, 2022 5:58 PM
To: Philip Taylor (Hellenic Institute)
Cc: texhax at; Herbert Voss; David Jonah
Subject: Re: Conversion

On 2022-05-06 at 20:19:43 +0100, Philip Taylor (Hellenic Institute) wrote:

 > On 06/05/2022 20:12, Herbert Voss wrote:
 > >
 > > Am 06.05.22 um 16:23 schrieb David Jonah via texhax:
 > >> I want to convert a .pdf document to a LaTeX document. The paper has
 > >> superscripts, an index, and a source document.
 > >
 > > Convert the document first to doc and then to latex. There are several
 > > programs for the first one and some for the second.
 > With Adobe Acrobat DC, one can export a PDF to an MS Word document; the
 > conversion is usually excellent, and if an MS Word to LaTeX converter
 > exists that is of the same quality, then the overall results should be
 > most acceptable.

I suppose that the graphical representation of the converted document
*looks* good but the logical structure of the document gets lost
because it can't be derived from an ordinary PDF file.  A PDF file
only describes the visual representation of a document.

But if you want to edit the converted document the visual
representation is worthless.  You have to use LaTeX macros like
\chapter, \section, \subsection, etc. just to be able to re-generate
the table of contents, for example.

Because you have to edit the converted document anyway I don't see any
benefit from using Adobe Acrobat DC, MS-Word, or any other proprietary
software.  pdftotext does what you need.

On Windows pdftotext is part of TeX Live and on other operating
systems it can be installed by the package manager.
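
As a rough sketch of the pdftotext route, the following wraps the command
line tool from Python. The function names pdftotext_command and
pdf_to_text are invented for this example, and it assumes the
pdftotext binary is on the PATH. The -layout option asks
pdftotext to keep the physical layout of the page, which may or may not be
what you want as a starting point for LaTeX markup.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout: maintain the original physical layout as far as possible.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

Calling pdf_to_text("paper.pdf", "paper.txt") would write the extracted text
to paper.txt (both file names are placeholders). The \chapter, \section and
similar macros still have to be added by hand afterwards.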


Reinhard Kotucha                            Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover                    mailto:reinhard.kotucha at
