[l2h] Any way of accurately identifying/converting em- and en-dashes?

Stuart Rossiter monsieurrigsby at googlemail.com
Thu Dec 10 13:59:35 CET 2009


As per my other post, I retried with Version 2008 (1.71), as shipped in
the latest Ubuntu/Debian package. Still the same issue: no combination
of options seems to sort out em- and en-dashes :-(

Stuart

2009/12/10 Ross Moore <ross at ics.mq.edu.au>:
> Hello Stuart,
>
> On 10/12/2009, at 2:50 AM, Stuart Rossiter wrote:
>
>> Hi,
>>
>>  This revisits issues raised (but not resolved) in a 2003 post:
>> http://tug.org/mailman/htdig/latex2html/2003-August/002400.html
>>
>> It appears that latex2html is (still) converting em- and en-dashes to
>> -- and - respectively. Since hyphens are also left as -, there is then
>> no way to distinguish (in the HTML) between things that were en-dashes
>> and normal hyphens (so you can't do the conversions to &ndash; etc.
>> manually, even if you want to).
>>
>> Also, the main script has do_cmd_textemdash and do_cmd_textendash
>> routines (to convert to --- and -- respectively), but these don't seem
>> to get used when you explicitly use \textemdash and \textendash
>> commands, which I thought would be a way round this problem (it still
>> does the conversions to -- and -).
>
> No, that is not entirely correct.
> The coding has:
>
> # these can be overridded in charset (.pl) extension files:
> sub do_cmd_textemdash { join('','---', $_[0]);}
> sub do_cmd_textendash { join('','--', $_[0]);}
>
> So if you set the charset then you can get other results.
>
> Alternatively, you can override these in a configuration file,
> as that gets read after the main script has been loaded.
>
>
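The override Ross describes might look like the following. This is only a sketch, assuming the configuration file is an init file such as .latex2html-init and that numeric character references are the wanted output; the replacement strings are an illustration, not something taken from the LaTeX2HTML sources, and the result has not been checked against LaTeX2HTML's later processing stages.

```perl
# Sketch of overrides for a LaTeX2HTML init file (assumed: .latex2html-init).
# Same shape as the subs quoted from the main script, but emitting numeric
# character references instead of runs of ASCII hyphens.
sub do_cmd_textemdash { join('', '&#8212;', $_[0]); }   # U+2014, em dash
sub do_cmd_textendash { join('', '&#8211;', $_[0]); }   # U+2013, en dash

1;  # a Perl file loaded by another script should return a true value
```

Because the configuration file is read after the main script, these definitions would replace the defaults for explicit \textemdash and \textendash commands only; they would not by themselves change how literal -- and --- in the source are handled.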
>>
>> So it appears that:
>>
>> -- latex2html can't distinguish these dashes properly (I assume that,
>> as for quotes, this is an issue with being able to definitively
>> identify them), although it's distinguishing *something* in doing the
>> conversions to -- and - ! (so maybe this *can* be fixed?)
>
> It is also a matter of output encodings.
>
> By default, LaTeX2HTML was written to produce Latin 1 output,
> that is, ISO-8859-1 encoding.
> This does not include single characters for endash and emdash.
>
> If you want single characters, and HTML coding that validates,
> then you must either use entities, or expand the charset, or both.
> There are switches  -unicode  and  -entities  for this.
>
> With the  -unicode  switch you should get  &#8211;  and  &#8212;
> respectively, for  --  and  ---  within normal paragraphs.
>
> With switches  -unicode -entities  then the numeric character
> references are supposed to be translated into named entities:
>    &ndash;  and   &mdash;
>
> Or with switches   -unicode -utf8   then you should get
> the correct single characters in UTF-8 encoding.
>
>
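For reference, the three output forms mentioned above are the same two characters written in different ways. The standalone Perl sketch below does not call LaTeX2HTML at all; the helper name dash_forms is hypothetical, and it only shows what each form should look like so that generated HTML can be checked by eye or with grep.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code points of the two dashes and their HTML named entities.
my %dash = (
    en => { code => 0x2013, named => '&ndash;' },
    em => { code => 0x2014, named => '&mdash;' },
);

# Return (numeric reference, named entity, UTF-8 bytes in hex) for a dash.
# dash_forms is a hypothetical helper, not part of LaTeX2HTML.
sub dash_forms {
    my ($kind) = @_;
    my $d     = $dash{$kind} or die "unknown dash kind: $kind";
    my $bytes = encode('UTF-8', chr($d->{code}));
    my $hex   = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
    return (sprintf('&#%d;', $d->{code}), $d->{named}, $hex);
}

for my $kind (qw(en em)) {
    print join('  ', "$kind dash:", dash_forms($kind)), "\n";
}
# prints:
#   en dash:  &#8211;  &ndash;  E2 80 93
#   em dash:  &#8212;  &mdash;  E2 80 94
```

If none of these forms shows up in the HTML after running with the corresponding switches, the dashes were presumably flattened to hyphens before the charset handling was reached.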
>>
>> -- there is also no way to "preserve" the dashes from the original in
>> a way which would allow for accurate manual adjustments afterwards.
>
> This statement is true when you do not specify  -unicode .
> It is not true when you do include this switch.
>
> LaTeX2HTML was written at a time when browser support for Unicode
> was very flaky indeed. That is why the defaults are what they are.
> Since then web technologies have advanced considerably, and other
> tools do quite a good job of translating LaTeX coding into HTML,
> or XHTML or XML.
>
> On the other hand, customising LaTeX2HTML is not that hard,
>  **provided** you can use Perl, and have a good understanding
> of just what it is that you really want to do.
>
>
>>
>> Am I missing something, or is there any advice people can offer?
>
>
> Hopefully the above helps.
>
>>
>> Thanks in advance,
>> Stuart
>
>
> Cheers,
>
>        Ross
>
> ------------------------------------------------------------------------
> Ross Moore                                       ross at maths.mq.edu.au
> Mathematics Department                           office: E7A-419
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
>
>
>