search accents in pdf generated by TeX

William F Hammond hmwlfsr at
Sat Jan 29 07:20:03 CET 2022

Ulrike Fischer <news3 at> writes:

> Am Thu, 27 Jan 2022 21:55:32 -0800 schrieb William F Hammond via
> texhax:
>> First, I don't know what the statement "accented letters are
>> not recognized by the pdf" means.  If we're talking about
>> typesetting with pdftex, then I think that the PDF output is
>> UTF-8 encoded. 
> No.

I do understand that a PDF file is not a text file if that
is why you are saying "no".  But ...

>> If one runs the program "pdftotext", which
>> is part of an Ubuntu package called poppler-utils on my
>> Ubuntu platform, the output text is UTF-8 encoded.  I think
>> that TeX's algorithmic accents are implemented using
>> Unicode combining characters. 
> No, not with pdftex. If you compile
> \documentclass{article}
> \begin{document}
> ä ö ü é è

When I said that I was using TeX's *algorithmic* accents,
I meant  \"a \"o \"u \'e \`e

But, anyway, the original question was about
plain TeX, not LaTeX, and I was trying to address that.

> \end{document}
> and then copy and paste you will get
>     ¨a ¨o ¨u ´e `e
> that is 
>     U+00A8a U+00A8o U+00A8u U+00B4e U+0060e
> (U+00A8 is, for example, the diaeresis).

I have not been able to duplicate what you say.

Perhaps you have a newer version of pdftex.  Mine
is from TeXLive 2017.  But I doubt if that is the
explanation.  Perhaps there are differences in
"locale" arrangements.
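
One way to settle which code points a copy-and-paste or a pdftotext
run really delivers is to list them. The following is only a minimal
Python sketch, assuming the extracted text has already been read into
a string; the names describe, spacing, combining, precomposed and nfc
are hypothetical and appear nowhere in this thread. It also shows why
the distinction matters for searching: a letter followed by a combining
diaeresis normalizes to the single precomposed character, whereas a
spacing diaeresis followed by a letter does not.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each non-space character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isspace()
    ]


def nfc(text):
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# Three candidate spellings of a-umlaut (illustrative strings, not real PDF output).
spacing = "\u00a8a"     # spacing DIAERESIS followed by the letter a
combining = "a\u0308"   # letter a followed by COMBINING DIAERESIS
precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

for label, s in [("spacing", spacing),
                 ("combining", combining),
                 ("precomposed", precomposed)]:
    print(label, describe(s), "NFC equals precomposed:", nfc(s) == precomposed)
```

Running describe on text pasted from the PDF in question would show
directly whether it contains U+00A8, U+0308 or U+00E4.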

> if you add \usepackage[T1]{fontenc} and so use a font
> which has the needed glyphs then you get the correct
> unicode code points and a searchable pdf
>      ä ö ü é è

With what input encoding for your LaTeX source?  Are you
saying that you have an arrangement for pdflatex to read
Unicode up to U+00FF when text-encoded as UTF-8 (as in your
last email)?
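
To make the encoding part of that question concrete: the five letters
in the example all lie between U+0080 and U+00FF, and in UTF-8 each of
them occupies two bytes, so a program reading the source byte by byte
sees ten non-ASCII bytes rather than five characters. A minimal Python
sketch follows; the helper name utf8_hex is hypothetical, and the
sketch says nothing about how pdflatex itself handles those bytes.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


# The five letters from the quoted example, written as escapes so the
# sketch does not depend on the encoding of this file.
letters = "\u00e4\u00f6\u00fc\u00e9\u00e8"

for ch in letters:
    print(f"U+{ord(ch):04X}", utf8_hex(ch))
```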

                              -- Bill
