[pdftex] searchable pdfs from scanned material

Giampiero Salvi giampi at speech.kth.se
Tue Aug 24 22:08:00 CEST 2004

I am about to create some PDF files from scanned material, and
I would like search engines such as Google to find words in them.
For each scanned page I have a text file with the text generated
by OCR (optical character recognition).

What I do right now is create a TeX file (which I process with
pdflatex) containing one \includegraphics per page. I also put
the text for each page into a \pdfannot command that looks like
this:

\pdfannot width 10cm height 0cm depth 4cm { /Subtype /Text /Contents
     (text text text) }
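
To make the setup concrete, here is a minimal sketch of such a file.
The image names page-001 and page-002 and the annotation texts are
placeholders, not the real material, and the sizing options passed to
\includegraphics are only one possible choice:

```latex
\documentclass{article}
\usepackage{graphicx}
\begin{document}

% Page 1: scanned image (hypothetical file name) plus its OCR text.
\noindent\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]{page-001}
\pdfannot width 10cm height 0cm depth 4cm {
  /Subtype /Text
  /Contents (placeholder OCR text for page 1)
}
\newpage

% Page 2: same pattern, one image and one annotation per page.
\noindent\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]{page-002}
\pdfannot width 10cm height 0cm depth 4cm {
  /Subtype /Text
  /Contents (placeholder OCR text for page 2)
}

\end{document}
```

The OCR text sits inside a PDF string delimited by parentheses, so any
unbalanced parenthesis in the recognized text would presumably have to
be escaped before it is pasted in.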

This works fine in the sense that grep says that the pdf file matches
whenever I look for a word contained in the \pdfannot command.
I wonder if there is a better way of achieving my goal: one problem
with my method, for example, is that \pdfannot creates a little yellow
note at the beginning of the first page containing the text. This might
be confusing, especially since the OCR is far from
perfect and the text contains many errors. Is there a way to hide
the note? Or is there another command that just inserts text in the
file without any graphical reference to it?

Thank you!
