[tex4ht] curiosity about unicode.4hf

Matteo Gamboz gamboz at medialab.sissa.it
Tue Mar 14 10:30:12 CET 2017

On Mon, 13 Mar 2017 22:53:47 +0100,
Karl Berry wrote:
> Hi Matteo,
>     I get "a.html" that contains:
>     ...&#x2019;...
> I guess you're expecting the literal UTF-8 right single quote instead of
> the entity syntax?


>     AFAIK, ' and " are illegal in attributes, 
> I have used those characters in attribute values. Anyway, how are
> attributes related to the example?  I'm baffled here, sorry.

my fault, sorry; more correctly:
' is illegal in attributes delimited by ' (e.g. xxx='aa'aa')
" is illegal in attributes delimited by "

I wrote that only because unicode.4hf forces " to be an entity, and I
thought I could see the reason in the "delicacy" of using " in some
attributes.
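
The quoting rule above can be illustrated with a minimal Python sketch. It uses the standard library's `xml.sax.saxutils.quoteattr`, which is only an illustration of the rule and has nothing to do with how tex4ht or unicode.4hf handle these characters; the sample values are made up.

```python
from xml.sax.saxutils import quoteattr

# A value containing only ' can be delimited by " without any escaping.
apostrophe_only = quoteattr("aa'aa")

# A value containing only " can be delimited by ' without any escaping.
double_only = quoteattr('aa"aa')

# A value containing both needs one of them written as an entity.
both = quoteattr("a'b\"c")

print(apostrophe_only)  # "aa'aa"
print(double_only)      # 'aa"aa'
print(both)             # "a'b&quot;c"
```

So an entity is only required for whichever quote character matches the delimiter actually used.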

>     (and #x2018 is not in the file - texlive2016).
>     Does anyone know why &#x2019; ended up in unicode.4hf?
> I don't know why Eitan decided to translate ASCII ' to the Unicode
> entity value and leave ASCII ` output as literal UTF-8 (with your options).
> I don't know what the implications would be of changing it, either; not
> something I would want to do lightly.
> Briefly looking at the source file (tex4ht-fonts-4hf.tex), I don't see
> any explanation. Could have missed it.
> Does outputting the entity cause some problem?

not directly (see below)

>     htlatex a "xhtml" " -cunihtf -utf8"
> Why do you want to use those options in the first place?
> (Just wondering.)

I use tex4ht to transform some TeX fragments to XML

For instance, the authors' names of some physics articles:
\author{Francesco D'Eramo}
<contrib><string-name>Francesco D&#x2019;Eramo</string-name>...

which is correctly shown by any decent xml viewer as:
<contrib><string-name>Francesco D’Eramo</string-name>...

(for instance https://repo.scoap3.org/record/19196/files/main.xml)

Sometimes I need to compare the author's name from these XML files to
what we have in our DB, and what we have in our DB is always in the form
Francesco D'Eramo
(with a plain ASCII ' instead of ’ or &#x2019;)

This is not a big problem (I just replace &#x2019; with ' and do my
comparison).
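
A minimal sketch of that kind of normalization, assuming Python is used for the comparison step. The function name `normalize_name` and the decision to fold U+2018 as well as U+2019 are assumptions made for illustration, not part of the workflow described here.

```python
import html

# Hypothetical mapping: fold typographic single quotes to ASCII '.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def normalize_name(text):
    """Decode character references such as &#x2019;, then fold quotes to '."""
    return html.unescape(text).translate(QUOTE_MAP)

from_xml = normalize_name("Francesco D&#x2019;Eramo")
from_db = "Francesco D'Eramo"
print(from_xml == from_db)  # True
```

Decoding the reference first means the entity form and the literal ’ are handled by the same code path.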

But I was wondering why the use of &#x2019; is forced and, as you noted,
I did not want to change it without knowing the rationale behind it.

Thanks anyway
