[pdftex] searchable pdfs from scanned material

Mon Aug 30 11:01:20 CEST 2004

On Tuesday, 24 August 2004 at 22:08:00, Giampiero wrote:

GS> Hi,
GS> I am about to create some pdf files from scanned material, and
GS> I would like search engines like google to find words in them.
GS> For each scanned page I have a text file with the OCR (letter
GS> recognition) generated text.

Well, if you have a good OCR program such as FineReader, it can generate
PDF files with the hidden text behind your scanned images, and you can
then import them all with pdfpages. (pdftex includes each page object as
is, so you get the hidden text this way; all the articles on numdam.org
are done this way.)
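
As a rough sketch of the pdfpages route (the file name scanned-ocr.pdf is
only a placeholder for whatever your OCR program exports, not something
from your setup):

```latex
\documentclass{article}
\usepackage{pdfpages}
\begin{document}
% scanned-ocr.pdf is a hypothetical name for the PDF written by the
% OCR program, with the recognized text hidden behind each page image.
% pages=- asks pdfpages to include every page of that file.
\includepdf[pages=-]{scanned-ocr.pdf}
\end{document}
```

Run it through pdflatex; the pages should come through unchanged, hidden
text included.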

GS> This works fine in the sense that grep says that the pdf file matches
GS> whenever I look for a word contained in the \pdfannot command.
GS> I wonder if there is a better way of achieving my goal: one problem
GS> with my method, for example, is that \pdfannot creates a little yellow
GS> note at the beginning of the first page containing the text. This might
GS> be confusing, especially given the fact that the OCR is far from
GS> perfect, and the text contains many errors. Is there a way to hide
GS> the note?

The simplest way would be to first print the text in white (some
kind of verbatim input inside a box of zero size), then
include the image over it.
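
Something along these lines might do, as an untested sketch: the macro
name \hiddentext and the file names page001.txt and page001.png are made
up for illustration, and I assume the graphicx, xcolor and verbatim
packages under pdflatex.

```latex
\documentclass{article}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{verbatim}

% \hiddentext{file}: typeset the OCR text of one page in white, inside
% a box of zero width (\rlap) and zero height and depth (\smash), so it
% takes no space on the page. The macro name is hypothetical.
\newcommand{\hiddentext}[1]{%
  \rlap{\smash{\parbox[t]{\textwidth}{%
    \color{white}\tiny
    \verbatiminput{#1}}}}%
}

\begin{document}
\noindent
% The text is written first, so the image placed next is painted over it.
\hiddentext{page001.txt}%
\includegraphics[width=\textwidth]{page001.png}
\end{document}
```

The \tiny size is only there to keep a whole page of OCR text within the
area the image covers; very long lines or pages may still need a smaller
font or some other adjustment.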

Regards,
 Thierry