[l2h] Any way of accurately identifying/converting em- and en-dashes?

Ross Moore ross at ics.mq.edu.au
Thu Dec 10 05:50:33 CET 2009


Hello Stuart,

On 10/12/2009, at 2:50 AM, Stuart Rossiter wrote:

> Hi,
>
>   This revisits issues raised (but not resolved) in a 2003 post:
> http://tug.org/mailman/htdig/latex2html/2003-August/002400.html
>
> It appears that latex2html is (still) converting em- and en-dashes to
> -- and - respectively. Since hyphens are also left as -, there is then
> no way to distinguish (in the HTML) between things that were en-dashes
> and normal hyphens (so you can't do the conversions to &ndash; etc.
> manually, even if you want to).
>
> Also, the main script has do_cmd_textemdash and do_cmd_textendash
> routines (to convert to --- and -- respectively), but these don't seem
> to get used when you explicitly use \textemdash and \textendash
> commands, which I thought would be a way round this problem (it still
> does the conversions to -- and -).

No, that is not entirely correct.
The coding has:

# these can be overridden in charset (.pl) extension files:
sub do_cmd_textemdash { join('','---', $_[0]);}
sub do_cmd_textendash { join('','--', $_[0]);}

So if you load a charset extension that redefines these subroutines,
then you can get other results.

Alternatively, you can override these in a configuration file,
as that gets read after the main script has been loaded.
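
As a minimal sketch of such an override (my illustration, not code taken
from the LaTeX2HTML sources), something like the following could go in an
init file such as .latex2html-init. It assumes that numeric character
references are acceptable in your output, and that no charset extension
loaded later redefines the same subroutines. It is worth checking the
generated HTML to confirm that the ampersands come through untouched.

```perl
# Hypothetical overrides for a LaTeX2HTML init file (e.g. .latex2html-init).
# Same shape as the defaults in the main script: emit the replacement text,
# followed by the rest of the input passed in $_[0].
# Assumption: numeric character references are wanted in the HTML output.
sub do_cmd_textemdash { join('', '&#8212;', $_[0]); }   # U+2014 EM DASH
sub do_cmd_textendash { join('', '&#8211;', $_[0]); }   # U+2013 EN DASH

1;  # files loaded via require conventionally end with a true value
```

With this in place, \textemdash and \textendash stay distinguishable
from ordinary hyphens in the HTML, which also makes later manual
adjustments possible.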


>
> So it appears that:
>
> -- latex2html can't distinguish these dashes properly (I assume that,
> as for quotes, this is an issue with being able to definitively
> identify them), although it's distinguishing *something* in doing the
> conversions to -- and - ! (so maybe this *can* be fixed?)

It is also a matter of output encodings.

By default, LaTeX2HTML was written to produce Latin 1 output,
that is, ISO-8859-1 encoding.
This does not include single characters for endash and emdash.

If you want single characters, and HTML coding that validates,
then you must either use entities, or expand the charset, or both.
There are switches  -unicode  and  -entities  for this.

With the  -unicode  switch you should get  &#8211;  and  &#8212;
respectively, for  --  and  ---  within normal paragraphs.

With switches  -unicode -entities  then the parameter entities
are supposed to be translated into named entities:
     &ndash;  and   &mdash;

Or with switches   -unicode -utf8   then you should get
the correct single characters in UTF8 encoding.
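
To make the different forms concrete, here is a small stand-alone Perl
sketch (independent of LaTeX2HTML; the helper name dash_forms is just
something I made up for illustration). For the en dash, U+2013, and the
em dash, U+2014, it prints the numeric character reference and the bytes
of the UTF-8 encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the numeric character reference and the UTF-8 bytes (in hex)
# for a single Unicode code point.
sub dash_forms {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    my $hex   = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
    return (sprintf('&#%d;', $codepoint), $hex);
}

for my $dash ([ 'en dash', 0x2013 ], [ 'em dash', 0x2014 ]) {
    my ($name, $codepoint) = @$dash;
    my ($ref, $hex) = dash_forms($codepoint);
    printf "%-7s U+%04X  %s  UTF-8: %s\n", $name, $codepoint, $ref, $hex;
}
```

Neither code point exists in ISO-8859-1, so a Latin 1 page can only
carry these dashes as character references or entities. In UTF-8 each
one becomes a three-byte sequence (E2 80 93 and E2 80 94).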


>
> -- there is also no way to "preserve" the dashes from the original in
> a way which would allow for accurate manual adjustments afterwards.

This statement is true when you do not specify  -unicode .
It is not true when you do include this switch.

LaTeX2HTML was written at a time when browser support for Unicode
was very flaky indeed. That is why the defaults are what they are.
Since then web technologies have advanced considerably, and other
tools do quite a good job of translating LaTeX coding into HTML,
or XHTML or XML.

On the other hand, customising LaTeX2HTML is not that hard,
  **provided** you can use Perl, and have a good understanding
of just what it is that you really want to do.


>
> Am I missing something, or is there any advice people can offer?


Hopefully the above helps.

>
> Thanks in advance,
> Stuart


Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------




