[l2h] An Apparent Byte Size Limit for a Portable Network Graphics (.png) Image File Containing Simplified Chinese Characters Produced by LaTeX2HTML From a .tex File Containing LaTeX and Chinese/Japanese/Korean (CJK) for LaTeX Commands

Pat Somerville l_pat_s at hotmail.com
Sat Sep 4 05:13:52 CEST 2010

In order to view some special accent marks over some characters, please use the rich-text (HTML=>HyperText Markup Language) format to view the contents of this e-mail letter body. Thanks.

A one-paragraph introduction: This report is somewhat lengthy. In it I ask LaTeX2HTML or Chinese/Japanese/Korean (CJK) code experts for help in modifying some code so that it works with pinyin commands in the Guo Biao (GB) GB2312 encoding. Pinyin romanization is a pronunciation system for simplified Chinese characters in which tone marks (diacritics) are placed over some vowels, for example Wǒ xǐhuān chī fàn. The shapes of the tone marks roughly follow the steadiness, rise, and/or fall of the pitch of the voice in pronouncing the pinyin syllables. The pronunciations of some pinyin letters or combinations of letters are not what one would expect from English; for example, pinyin xi is pronounced like the English word "she." In a .tex file, using the GB2312 encoding in one lengthy CJK environment could allow smaller and shorter .png (Portable Network Graphics) files to be produced from an environment containing simplified Chinese characters, pinyin, and/or mathematics than using the 8-bit Unicode Transformation Format (UTF-8) encoding and multiple CJK environments. With the UTF-8 encoding selected via the initial command \usepackage{CJKutf8}, the size in bytes of each of several .png images produced by LaTeX2HTML 1.70 was probably at least roughly proportional to the length of each segment of pinyin, mathematics, LaTeX commands, and/or text between the LaTeX commands \begin{CJK}{UTF8}{gbsn} and \end{CJK} in the .tex file. Until a solution is worked out in the computer code for producing pinyin in a .html file by LaTeX2HTML with the GB2312 and perhaps other non-UTF-8 encodings, I include here a "workaround" for the GB2312 encoding, which ought to be considered secondary to a potential modification of the code.

Recently I have been experimenting with a more modern set of software packages, having made the following replacements:

OpenSuSE-11.1, Linux operating system replaced by the openSUSE-11.3, Linux operating system

K Desktop Environment (KDE) 3.5.10 replaced by the Lightweight X11 (X Window System, version 11) Desktop Environment (LXDE), a desktop environment separate from KDE; KDE 4.4.4, "release 2," is also installed in the same operating system

Chinese/Japanese/Korean (CJK) 4.7.0 for LaTeX packages replaced by CJK 4.8.2

LaTeX2HTML 1.70 (year-2002 version) replaced by LaTeX2HTML 1.71 (year-2008-version)

LaTeX 2e, year-2008 version replaced by LaTeX 2e, September 24, 2009 version

And now I have Perl 5, version 12, subversion 1 built for "i586, Linux, thread, multi" installed.

Yet even with the above, more modern set of software packages, a problem remains. For a Throwaway.tex file with contents like the following:

\Wo3 \xi3\huan1 \chi1 \fan4.

I could not obtain good-looking pinyin romanizations with diacritics (tone marks) above some of the vowels, like these:

Wǒ xǐhuān chī fàn.

in the output file Throwaway.html produced by a command of the form

"latex2html -nonavigation -no_math -html_version 3.2,math -split 0 Throwaway.tex"

On the other hand, a command of the form "latex Throwaway.tex" resulted in good-looking pinyin in the output file Throwaway.dvi when it was opened by the program Okular.
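
Only the pinyin line of Throwaway.tex is reproduced above. A minimal sketch of the complete file follows, assuming it used the \usepackage{CJK}, \usepackage{pinyin}, and \begin{CJK}{GB}{gbsn} commands named elsewhere in this message and had the same layout as the Bg5 example in my quoted message of August 6. It is a reconstruction under those assumptions, not a verbatim copy:

```latex
% Throwaway.tex (assumed reconstruction; save the file in the GB2312 encoding)
\documentclass{article}
\usepackage{CJK}      % CJK support for non-UTF-8 encodings
\usepackage{pinyin}   % defines the pinyin syllable commands such as \Wo and \fan
\begin{document}
\begin{CJK}{GB}{gbsn} % GB2312 encoding, gbsn font family
\Wo3 \xi3\huan1 \chi1 \fan4.
\end{CJK}
\end{document}
```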

When the file Throwaway.tex also contained simplified Chinese characters between the commands "\begin{CJK}{GB}{gbsn}" and "\end{CJK}", they were, I am grateful to say, nicely produced in both Throwaway.dvi, output by LaTeX, and Throwaway.html, output by LaTeX2HTML. I found that in the text editor Kate it was important to save a file containing "\begin{CJK}{GB}{gbsn}" in the GB2312 (Guo Biao 2312) encoding in order to avoid the error message "! Package CJK Error: Invalid character code" after entering a command of the form "latex Throwaway.tex". [If I instead saved the file in the 8-bit Unicode Transformation Format (UTF-8) encoding, I obtained that LaTeX and/or CJK error message. The message helpfully mentioned that I could type "H" for immediate help, which displayed the following: "The second byte of the CJK code is out of range. Do you use the right encoding scheme?"] Put simply, I should save the file in the same encoding that the .tex file declares: in the GB2312 encoding when the file contains "\usepackage{CJK}" and "\begin{CJK}{GB}{gbsn}", or in the UTF-8 encoding when it instead contains "\usepackage{CJKutf8}" and "\begin{CJK}{UTF8}{gbsn}".
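
As an illustration of this matching rule, a UTF-8 counterpart of the test file might look like the following sketch. The file name Throwaway-utf8.tex and the sample Chinese sentence are assumptions made for illustration only; the essential point is that the file is saved as UTF-8 so that it agrees with the UTF8 argument of the CJK environment.

```latex
% Throwaway-utf8.tex (hypothetical name); save this file in the UTF-8 encoding
\documentclass{article}
\usepackage{CJKutf8}    % UTF-8 variant of the CJK package
\usepackage{pinyin}
\begin{document}
\begin{CJK}{UTF8}{gbsn} % the encoding argument must match how the file is saved
% Illustrative simplified Chinese text, followed by its pinyin:
我喜欢吃饭。

\Wo3 \xi3\huan1 \chi1 \fan4.
\end{CJK}
\end{document}
```

Saving this same text as GB2312, or saving the GB version as UTF-8, would be expected to trigger the "Invalid character code" error described above.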

Now I discuss the relevant error messages obtained while trying to generate Throwaway.html using the above latex2html command: 

"No implementation found for style 'pinyin'

Unknown commands: fan huan Wo"

Taken literally, the first of these two error messages indicates that the command "\usepackage{pinyin}" was properly "recognized" by LaTeX2HTML as a command directing that the style file pinyin.sty be used. But the error message "Unknown commands: fan huan Wo" indicated that LaTeX2HTML did not "recognize" the commands "\Wo3 \xi3\huan1 \chi1 \fan4." as being commands for pinyin. In fact Greek letters for \xi and \chi appeared in the .html file produced by LaTeX2HTML, so LaTeX2HTML apparently interpreted those two commands as commands for Greek letters. Since there are no Greek letters corresponding to \Wo, \huan, and \fan, the error message "Unknown commands: fan huan Wo" can at least partly be understood for that reason. Adding a "\PYactivate" command, which I guess might mean to activate pinyin "recognition," before "\Wo3 \xi3\huan1 \chi1 \fan4." in Throwaway.tex unfortunately did not help LaTeX2HTML "recognize" that "\Wo3 \xi3\huan1 \chi1 \fan4." was intended as pinyin instead of Greek letters, and adding that command made no change in the good pinyin output of LaTeX 2e in Throwaway.dvi. In fact the command "\PYactivate," which is handled inside pinyin.sty, was itself not "recognized" by LaTeX2HTML.

On the other hand, 

1) for the above contents in Throwaway.tex, again Throwaway.dvi had good-looking pinyin with tone marks (diacritics) in it produced by LaTeX in conjunction with CJK. And 

2) when the commands "\usepackage{CJKutf8}", "\usepackage{pinyin}", and "\begin{CJK}{UTF8}{gbsn}" were used instead of "\usepackage{CJK}", "\usepackage{pinyin}", and "\begin{CJK}{GB}{gbsn}" in a different .tex file, the pinyin looked good in the .html output file produced by a latex2html command like the one above.

Point 1 agrees nicely with the fact that Werner Lemberg designed CJK for use with LaTeX, as described in his early paper entitled "The CJK Package for LaTeX2e--Multilingual Support Beyond Babel"; the word LaTeX2HTML does not appear in that paper, which is at http://tug.org/TUGboat/Articles/tb18-3/cjkintro600.pdf on the Internet.

The above results lead me to think that the pinyin problem could be in how LaTeX2HTML works with the pinyin, LaTeX commands like "\Wo3 \xi3\huan1 \chi1 \fan4." within the CJK environment instead of in either the CJK software packages or LaTeX, especially since CJK and LaTeX worked well together; however, I would not discount the possibility that CJK could be adjusted so that LaTeX2HTML could work better with it for pinyin in non-UTF-8 encodings. The hope of point 2 is that since LaTeX2HTML can handle the pinyin commands properly when UTF-8-encoding-related commands are used in the .tex file, perhaps an adjustment could be made in the LaTeX2HTML code so that the pinyin commands could be properly handled when non-UTF-8-encoding-related commands are used as well.

This problem became interesting to me. But unfortunately I lack lots of knowledge about LaTeX2HTML's internal workings and the language Perl in which it is written. And the language of the CJK code for LaTeX looks "foreign" to me as well. In what computer language are the LaTeX and CJK codes written? So far from LaTeX 2e's documentation within my openSUSE-11.3, Linux operating system, located in /usr/share/texmf/doc/latex/latex2e-help-texinfo/latex2e.pdf, I learned that LaTeX is a macroprocessor for TeX and uses a markup language.

But I hope I can provide a few possible clues and conjectures to the LaTeX2HTML code experts, who are far more knowledgeable than I am about the code, that will stimulate their interest and thinking toward the solution to this problem. There could, of course, be errors in some of my following conjectures.

An obvious, but not necessarily correct, conjecture is that LaTeX2HTML 1.70 and 1.71 are presently for some reason "comfortable" working with pinyin.sty in the UTF-8 encoding but not in the GB encoding. In comparing the contents of the files CJK.sty and CJKutf8.sty in the directory /usr/share/texmf/tex/latex/cjk/texinput, I found an interesting difference in one pair of commands:

In CJK.sty:      \NeedsTeXFormat{LaTeX2e}[2001/06/01]
In CJKutf8.sty:  \NeedsTeXFormat{LaTeX2e}[2003/12/01]

Frankly I don't know for certain what action this statement is supposed to initiate, for example whether it is supposed to convert one format of LaTeX commands to another one or not. But I wonder regarding pinyin if, for example, LaTeX 2e can work with both the apparently June 1, 2001 and December 1, 2003 formats of LaTeX commands while LaTeX2HTML might only be able to work with the December 1, 2003 format of LaTeX commands. The directory /usr/share/texmf/tex/latex/cjk/texinput also contains the subdirectories Bg5, UTF8, GB, etc. The Bg5 and UTF8 subdirectories contain .enc files, while the GB subdirectory does not.
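
As far as the LaTeX2e guide for class and package writers describes it, \NeedsTeXFormat performs no conversion of commands. It declares which format the file was written for and, through the optional date, the earliest LaTeX release it expects; LaTeX issues a warning when the installed release is older. A minimal sketch, using a hypothetical package file mydemo.sty and a hypothetical command \demotext:

```latex
% mydemo.sty (hypothetical package, for illustration only)
\NeedsTeXFormat{LaTeX2e}[2003/12/01] % expect LaTeX2e released on or after 2003/12/01
\ProvidesPackage{mydemo}[2010/09/04 v0.1 Demonstration package]
\newcommand{\demotext}{The format check passed.} % hypothetical command
\endinput
```

Under that reading, the two dates in CJK.sty and CJKutf8.sty only state which LaTeX release each file expects, which would make them an unlikely cause of the difference in pinyin handling.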

Among many other differences I note the following interesting ones:

In CJK.sty:      \RequirePackage{MULEenc}
In CJKutf8.sty:  \RequirePackage{ifpdf}

I looked inside the file /usr/share/texmf/tex/latex/cjk/texinput/pinyin.sty and noticed that commands like \chi are identified as pinyin (or perhaps components of pinyin expressions like \chi1) and that macros are called to produce various accents (really diacritics) over "vocals," which should be some of the vowels, according to how I know pinyin works. So since

A) "\Wo3 \xi3\huan1 \chi1 \fan4." was not handled properly in the above contents of Throwaway.tex; 

B) yet "\usepackage{pinyin}" is in it; and 

C) since those pinyin commands were handled properly when instead using the commands "\usepackage{CJKutf8}" and "\begin{CJK}{UTF8}{gbsn}",

I wonder if the error message "Unknown commands: fan huan Wo" could indicate that the command "\usepackage{pinyin}" was found and properly interpreted, but that LaTeX2HTML either did not "look" for the file pinyin.sty or did not find it, for example because it "searched" for pinyin.sty in the wrong directory. To put this differently, suppose LaTeX2HTML needed pinyin.sty in a certain directory, but it wasn't there. If the problem is this simple, then copying pinyin.sty into the needed directory might solve it. My first guess of a directory in which to try that would be the directory in which LaTeX2HTML reads the LaTeX commands in the .tex file. What directory is that?

Okay, here I included a lot of speculation, most of which, if not all of which is wrong. Now it's time for the LaTeX2HTML code experts to think over this matter and to inform me where the problem could be or is. I ought to be able to follow possible directions from a Perl code expert to change the code to make it work as desired.

Meanwhile I worked out a "workaround" to produce output which looks like pinyin in a .html file when using the Guo Biao (GB, probably for GB2312) encoding in a CJK environment in a .tex file saved in the GB2312 encoding. According to http://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/MathAccents.html on the Internet, the following commands in mathematics mode can be used as approximate-looking substitutes for pinyin diacritics, usually tone marks, over some vowels. (The \mbox{..} commands are used to obtain upright type instead of the italic-looking type one normally obtains in mathematics mode. A comprehensive list of LaTeX commands is given in http://www.cis.rit.edu/~rvrpci/teaching/LaTeX/symbols-letter.pdf on the Internet.)

$\bar{\mbox{\i}}$ for ī, using the command \i for the dotless "i" appropriate for pinyin when it has a diacritic over it;

$\acute{\mbox{a}}$ for á;

$\breve{\mbox{o}}$ for ǒ; and

$\grave{\mbox{a}}$ for à.

The use of mathematics mode allows two kinds of marks to be placed over one vowel, as in

$\breve{\ddot{\mbox{u}}}$ for ǚ and

$\acute{\hat{\mbox{e}}}$ for ế, or "e" with a "^" above it and approximately a "/" above all of that.

The above are examples. Some other vowels may be used for all but two of the above marks: the umlaut-like pair of dots can only be placed over the "u" in pinyin, and the "^" can only be placed over the letter "e" in pinyin.
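
The substitutions above can be collected into a small test file. The following is a minimal sketch for LaTeX only; the macro name \pytone is hypothetical, and whether LaTeX2HTML processes it the same way has not been checked here. It also shows \check, the standard math-mode accent for a caron, as a possible alternative to \breve for the third tone.

```latex
\documentclass{article}
% \pytone{<math accent>}{<letter>} is a hypothetical helper (not part of CJK
% or pinyin.sty): it sets an upright letter under a math-mode accent.
\newcommand{\pytone}[2]{\ensuremath{#1{\mbox{#2}}}}
\begin{document}
% First, second, third (with a breve, as above), and fourth tones
\pytone{\bar}{\i} \pytone{\acute}{a} \pytone{\breve}{o} \pytone{\grave}{a}

% Third tone drawn with a caron instead of a breve
\pytone{\check}{o}

% Two stacked marks over one vowel
$\breve{\ddot{\mbox{u}}}$ $\acute{\hat{\mbox{e}}}$
\end{document}
```

Wrapping the accent in \ensuremath lets the helper be used in running text without typing the dollar signs each time.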

Early attempts of mine to use commands like some of the above ones led to two initial problems when using LaTeX2HTML 1.71:

1) Undesired black line segments appeared under some of the vowels with diacritics over them. This was solved with help from Shigeharu Takeno by commenting out a line containing $DVIPSOPT = ' -Ppdf -E' in the file /usr/lib/latex2html/l2hconf.pm. Actually there were two such lines in that file, one of which was already commented out. Commenting out the second one by placing a # at the beginning of it and then saving that file eliminated the black segments under some .png (Portable Network Graphics) images in the .html output file produced by LaTeX2HTML. I did this in both versions 1.70 and 1.71 of LaTeX2HTML.

2) In the .html file mathematical expressions were followed by "mathend000#", something which did not occur when I used LaTeX2HTML 1.70. Shigeharu Takeno kindly provided the solution for this problem as well, at http://tug.org/mailman/htdig/latex2html/2008-December/003489.html on the Internet: in the Perl script file /usr/bin/latex2html, insert a question mark into each of the following lines, as shown:

$math_verbatim_rx = "verbatim_mark#?math(\\d+)#"; 

$mathend_verbatim_rx = "verbatim_mark#?mathend([^#]*)#";

The contents of my test file Throwaway7.tex looked like this:





$\mbox{W}\breve{\mbox{o}}\ \mbox{x}\breve{\mbox{\i}}\mbox{hu}\bar{\mbox{a}}
\mbox{n}\ \mbox{ch}\bar{\mbox{\i}}\ \mbox{f}\grave{\mbox{a}}\mbox{n.}$


\noindent$\mbox{l}\breve{\ddot{\mbox{u}}}$ $\acute{\hat{\mbox{e}}}$



This file Throwaway7.tex was then saved with the text editor Kate in the GB2312 encoding, to match the "GB" in the file. The desire was to produce output similar to this in the .html file after executing a command of the form "latex2html......Throwaway7.tex":

    Wǒ xǐhuān chī fàn.


However, the pinyin output did not look that good in the Konqueror Web browser, for two reasons: 1) There was too much horizontal space between a vowel with a diacritic over it and the letter before it. Adding \, and \! within the mathematics mode did not seem to reduce that space significantly in the .html file, and \! tended to crowd some letters against each other in the .dvi file produced by the command "latex Throwaway7.tex." 2) The vowels with diacritics over them were shorter in height than the other lower-case letters surrounding them, which one would otherwise expect to fit between the same two horizontal lines. My hope is that if someone suggests modifications of the computer code for non-UTF-8 encodings, the resulting pinyin in a .html file could look better than the pinyin I was able to produce using the "workaround" above.


From: "Pat Somerville" <l_pat_s at hotmail.com>
Sent: Wednesday, August 11, 2010 4:48 PM
To: <latex2html at tug.org>; <cjk at ffii.org>
Subject: Re: [l2h] An Apparent Byte Size Limit for a PortableNetworkGraphics
(.png) Image File Containing SimplifiedChineseCharacters Produced by
LaTeX2HTML From a .tex FileContainingLaTeX and Chinese/Japanese/Korean (CJK)
for LaTeX Comma

> Correction: SCIM=Smart Common Input Method, not Small Common Input Method;
> sorry for my earlier error.
> Pat
> --------------------------------------------------
> From: "Pat Somerville" <l_pat_s at hotmail.com>
> Sent: Friday, August 06, 2010 1:25 PM
> To: <latex2html at tug.org>; <cjk at ffii.org>
> Subject: Re: [l2h] An Apparent Byte Size Limit for a Portable
> NetworkGraphics (.png) Image File Containing Simplified ChineseCharacters
> Produced by LaTeX2HTML From a .tex File ContainingLaTeX and
> Chinese/Japanese/Korean (CJK) for LaTeX Comma
>> Thank you, Professors Ross Moore and Shigeharu Takeno, for each of you
>> kindly taking the time to respond to me.  Switching from
>> \usepackage{CJKutf8} to \usepackage{CJK} in a .tex file of the form
>> MyFile.tex did solve two problems:
>> 1) In the case of a large segment of LaTeX commands beginning with
>> \begin{CJK}{UTF8}{gbsn} and ending with \end{CJK} in a .tex file, that
>> change eliminated the "Bad file descriptor error"s while the program
>> LaTeX2HTML attempted to generate some .png (Portable Network Graphics)
>> images.  With the above change the number of .png images produced from a
>> tex file greatly increased due to the mathematical content, more like the
>> operation with which I was accustomed using LaTeX2HTML.
>> 2) It was no longer necessary to have either an \end{CJK}command before a
>> command of the form \htmladdnormallink{http://../}{http://../} or another
>> \begin{CJK}{UTF8}{gbsn} command following the htmladdnormallink command.
>> But there was a negative side effect.  From what I have read the Chinese
>> pinyin package, which is really the file pinyin.sty, is supposed to be a
>> part of the CJK (Chinese/Japanese/Korean) software package.  With the
>> following set of commands among others in a test, .tex file of the form
>> MyFile.tex:
>> ....
>> .....
>> \usepackage{CJK}
>> \usepackage{pinyin}
>> \begin{CJK}{UTF8}{gbsn}
>> \Wo \xi3\huan1 \chi1 \fan4.
>> \PYdeactivate
>> $\chi $ $\mu $
>> \PYactivate
>> \end{CJK}
>> \end{document}
>> , neither the pinyin expression corresponding to \Wo \xi3\huan1 \chi1
>> \fan4 nor the Greek letters chi and mu were displayed in the .html file
>> produced as a result of executing a command of the form
>> "latex2html....... MyFile.tex".  But changing only the command
>> \usepackage{CJK} to \usepackage{CJKutf8}, the pinyin and Greek letters
>> were displayed correctly in such a .html file.  Changing that command to
>> \begin{CJK}{GB}{gbsn} also resulted in the set of disappointing results.
>> So for the moment in the .tex file
>> a) using the LaTeX commands \usepackage{CJKutf8} and \usepackage{pinyin},
>> b) a number of short, CJK segments each beginning with
>> \begin{CJK}{UTF8}{gbsn} and ending with \end{CJK} to avoid the "Bad file
>> descriptor error"s in generating some .png images of the text and
>> mathematics between such delimiting commands,
>> c) surrounding each \htmladdnormallink{http://../}{http://..} command
>> with a \begin{CJK}{UTF8}{gbsn} and \end{CJK} pair of commands,
>> d) and surrounding a group of LaTeX commands and text containing commands
>> for Greek letters like $\chi $ and $\mu $ with the command \PYdeactivate
>> before them and sometime or sometimes the command \PYactivate after them,
>> a command which is probably necessary if some pinyin romanizations were
>> to follow the latter command,
>> is a strategy which enabled simplified Chinese characters, Greek letters,
>> hyperlinks, and pinyin romanizations to all be displayed correctly in a
>> html file produced by executing a command of the form
>> "latex2html..........MyFile.tex".
>> But concerning the use of the pinyin software package, apparently there
>> is something basic which is a problem somewhere.  The following set of
>> LaTeX commands
>> \documentclass{article}
>> \usepackage{CJK}
>> \usepackage{pinyin}
>> \begin{document}
>> \begin{CJK}{Bg5}{fs}
>> \Wo3 \xi3\huan1 \chi1 \fan4.
>> \end{CJK}
>> \end{document}
>> in my test file Throwaway.tex differs from the set in
>> http://tug.org/TUGboat/Articles/tb18-3/cjkintro600.pdf only slightly in
>> the line of pinyin which begins with \Wo3 ..... and in not containing any
>> Chinese characters. Yet the output file Throwaway.html produced by
>> executing a command of the form "latex2html ........ Throwaway.tex"
>> contained the output 3 the lower-case Greek letter xi#xi; the lower-case
>> Greek letter chi or an X#chi;1 4. instead of good-looking pinyin.
>> Changing the \usepackage{CJK} and \begin{CJK}{Bg5}{fs} commands to
>> \usepackage{CJKutf8} and either the \begin{CJK}{UTF8}{fs} or the
>> \begin{CJK}{Bg5}{fs} commands, the output was good-looking pinyin
>> containing the proper diacritical marks. Again I am using LaTeX2HTML
>> 1.70, a year-2002 version.  And I could be using CJK 4.7.0 for LaTeX,
>> based on what I read inside the file CJKutf8.sty.--I used the CJK
>> software packages provided via the Internet using Yet another Software
>> Tool 2's (YaST2's) "Online Updates" in July of the year 2010 for
>> OpenSuSE-11.1, Linux.  What is the cause of the problem here?  And how
>> can it be fixed?  Looking at "History of the CJK Package" at
>> http://cjk.ffii.org/history.txt on the Internet, for version 4.7.0 of CJK
>> one, pinyin-related error was mentioned:
>> "pinyin.sty:
>>                      The package didn't preserve `\ding' which is defined
>>                      in pifont.sty, causing problems with older versions
>> of
>>                      the hyperref package and its `hpdftex' driver
>> option."
>> In the above problematic sets of commands I used in Throwaway.tex
>> hyperref does not appear among them.  So perhaps the problem I have found
>> is not directly mentioned among the errors for CJK 4.7.0.  For version
>> 4.8.1 of CJK, which at least based on the contents of CJKutf8.sty I might
>> not be using, the following pinyin-related error was mentioned at
>> http://cjk.ffii.org/history.txt on the Internet:
>>                    "Pinyin syllable macros (defined in pinyin.sty) were
>> not
>>                    robust, causing problems with indices, for example."
>> Pat
>> --------------------------------------------------
>> From: "Shigeharu TAKENO" <shige at iee.niit.ac.jp>
>> Sent: Monday, August 02, 2010 11:24 PM
>> To: "Pat Somerville" <l_pat_s at hotmail.com>
>> Cc: <latex2html at tug.org>
>> Subject: Re: [l2h] An Apparent Byte Size Limit for a Portable Network
>> Graphics        (.png) Image File Containing Simplified Chinese
>> Characters Produced by LaTeX2HTML From a .tex File Containing LaTeX and
>> Chinese/Japanese/Korean (CJK) for LaTeX Commands
>>> shige 08/03 2010
>>> ----------------
>>> Pat Somerville wrote:
>>>> \documentclass{article}
>>>> \usepackage{CJKutf8}
>>> Latex2html does not support "CJKutf8" style file, but supports
>>> "CJK" style file. If you use "CJK.sty" instead "CJKutf8", the
>>> large image may not be made.
>>> cf.
>>>  http://takeno.iee.niit.ac.jp/~shige/misc/data/testcjk-u.tex
>>>  http://takeno.iee.niit.ac.jp/~shige/misc/data/testcjk-u.pdf
>>>  http://takeno.iee.niit.ac.jp/~shige/misc/data/testcjk-u/index.html
>>> +========================================================+
>>> Shigeharu TAKENO     NIigata Institute of Technology
>>>                       kashiwazaki,Niigata 945-1195 JAPAN
>>> shige at iee.niit.ac.jp   TEL(&FAX): +81-257-22-8161
>>> +========================================================+
>> _______________________________________________
>> latex2html mailing list
>> latex2html at tug.org
>> http://tug.org/mailman/listinfo/latex2html