[XeTeX] Table of contents

José Carlos Santos jcsantos at fc.up.pt
Sat May 1 10:50:07 CEST 2010


On 30-04-2010 19:06, Jonathan Kew wrote:

>> Please consider again this file:
>>
>> \documentclass[10pt,a4paper]{book}
>> \usepackage[frenchb]{babel}
>> \usepackage{fontspec}
>> \usepackage{xunicode}
>> \usepackage{xltxtra}
>> \begin{document}
>> \frontmatter
>> \tableofcontents
>> \XeTeXinputencoding "cp1252"
>> \XeTeXdefaultencoding "cp1252"
>> \mainmatter\setcounter{secnumdepth}{2}
>> \chapter{Général de Gaulle}
>> Il était français.
>> \end{document}
>>
>> When the \tableofcontents command is found, the line
>>
>> \XeTeXinputencoding "cp1252"
>>
>> has not yet been read. Therefore, it seems to me (since XeTeX is Unicode-based) that the .toc file is in Unicode, and hence that XeLaTeX should have no problem with it.
>
> There are a couple of issues that make this trickier than it might initially appear.
>
> First, it's important to be aware that xetex *always* writes auxiliary files using utf-8, regardless of the \XeTeXdefaultencoding setting. There is no facility to change the \write to use a different encoding. (Perhaps the command should have been called \XeTeXdefaultinputencoding, but it's already pretty long!)
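>
> One way to see this for yourself is a minimal test along the following lines (the file name test-enc.txt is arbitrary): it writes an accented word to an external file while a cp1252 default encoding is in force, and a hex dump of the resulting file still shows the utf-8 bytes C3 A9 for each "é":
>
>    \documentclass{article}
>    \begin{document}
>    \newwrite\testfile
>    \immediate\openout\testfile=test-enc.txt
>    \XeTeXdefaultencoding "cp1252" % affects files opened for input, not \write
>    \immediate\write\testfile{Général}% emitted as utf-8: 47 C3 A9 6E ...
>    \immediate\closeout\testfile
>    \end{document}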
>
> So when xetex writes the .aux and .toc files, it will take the internal character codes of your text and encode them in utf-8 form. If you look at the .aux file that is generated from your example, you'll see that the cp1252 characters in the input have been converted to Unicode and then represented as utf-8 byte sequences in that file.
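>
> Concretely, for this example the .aux will contain a line roughly like the following (the chapter and page numbers are only illustrative), in which each "é" is the two-byte utf-8 sequence C3 A9:
>
>    \@writefile{toc}{\contentsline {chapter}{\numberline {1}Général de Gaulle}{1}}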
>
> You are correct in expecting that because you've put \tableofcontents before the \XeTeXdefaultencoding command, the .toc file should be read as utf-8. However, if you examine the .toc file, you'll find that it does NOT contain the expected utf-8 version of "Général". Why is this?
>
> The answer lies in how LaTeX creates the new .toc file during a run. It does NOT write the TOC entries directly into the .toc file as it goes; if it did, this would have worked OK -- they'd be written in utf-8, and read as utf-8 by your \tableofcontents. But if you observe the terminal output during a (xe)latex run, you'll see that at the end of the document, it reads the .aux file as an input. What's happening is that during the run, the chapters, sections, etc. are written to the .aux file (in utf-8). Then, at the end, the .aux file is closed, read back as an input, and the relevant information written to the .toc (and perhaps other files such as .lof and .lot, if you're using those features).
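>
> The mechanism behind this relay is \@writefile: each TOC entry goes into the .aux wrapped as \@writefile{toc}{...}, and when the .aux is read back at the end of the run, a definition along these (slightly simplified) lines copies the payload into the stream opened for the .toc:
>
>    % simplified from the LaTeX kernel: #1 is the target extension
>    % (toc, lof, lot, ...), #2 the entry; \tf@toc is the write stream
>    % that LaTeX has opened for the .toc file
>    \long\def\@writefile#1#2{%
>      \@ifundefined{tf@#1}{}{%
>        \@temptokena{#2}%
>        \immediate\write\csname tf@#1\endcsname{\the\@temptokena}}}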
>
> The problem is that at this point, the .aux file is read *with* your \XeTeXdefaultencoding declaration in force, so the individual utf-8 bytes that were written to it now get interpreted as cp1252 characters and mapped to their Unicode values, instead of the byte sequences being interpreted as utf-8. That's the source of the "junk" you're getting. Those utf-8-bytes-interpreted-as-cp1252 then get re-encoded to utf-8 sequences as the .toc is written, so in effect the original characters have been "doubly encoded".
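>
> Traced at the byte level for a single "é" (using the usual cp1252 and Unicode tables), the double encoding looks like this:
>
>    internal U+00E9 "é"          --\write-->  utf-8 bytes C3 A9
>    bytes C3 A9 read as cp1252   --> chars    U+00C3 "Ã", U+00A9 "©"
>    "Ã©" written to the .toc     --\write-->  utf-8 bytes C3 83 C2 A9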
>
> In this particular case, at least, you can work around the problem by resetting the default encoding immediately before the end of the document, so that when LaTeX reads in the .aux file at the end of the run, it reads it correctly as utf-8. In other words, if you modify this example to become:
>
>    \documentclass[10pt,a4paper]{book}
>    \usepackage[frenchb]{babel}
>    \usepackage{fontspec}
>    \usepackage{xunicode}
>    \usepackage{xltxtra}
>    \begin{document}
>    \frontmatter
>    \tableofcontents
>    \XeTeXinputencoding "cp1252"
>    \XeTeXdefaultencoding "cp1252"
>    \mainmatter\setcounter{secnumdepth}{2}
>    \chapter{Général de Gaulle}
>    Il était français.
>    \XeTeXdefaultencoding "utf-8"
>    \end{document}
>
> then your table of contents should correctly show "Général".
>
> However, there may be other situations where auxiliary files are written and read at unpredictable times during the processing of the document, making it more difficult to control the encodings at the right moments. In general, moving to an entirely utf-8 environment is a better and more robust way forward.
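>
> For instance, on a typical Unix-like system the source file itself can be converted once with something like
>
>    iconv -f cp1252 -t utf-8 doc-cp1252.tex > doc.tex
>
> (file names here are just placeholders), after which the \XeTeXinputencoding and \XeTeXdefaultencoding lines can simply be dropped.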

Many thanks for your detailed and clear explanation.

Best regards,

Jose Carlos Santos

