[tex-live] BibTeXU

Taco Hoekwater taco at elvenkind.com
Tue Jun 1 12:56:02 CEST 2010


Karl Berry wrote:
>     I gather this is what used to be bibtex8 in former years?
> 
> bibtexu is/was a project by Yannis (and a student or two) to use the ICU
> library with BibTeX.  Peter also put in the massive efforts needed to
> make this work in the TL build system and have bibtexu and xetex use the
> same ICU library

Bibtexu does not actually seem to work all that well, or at least it
has some quirks on my linux 64 box. I experimented a bit because it
sounds promising. Long email follows.

I created a small test.aux file with just this in it:

\citation{*}
\bibstyle{plain}
\bibdata{xampl}

At first it complained that it could not find '88591lat.csf'. This
is probably just a packaging error: as it stands, the bibtexu package
should depend on bibtex8 (or the files have to be moved to the bibtexu
package, I do not know whether bibtex8 needs them).  I installed
bibtex8, and that took care of that.

But then, I got this:

[taco at ntg tmp]$ bibtexu test
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: test.aux
The style file: plain.bst
Database file #1: xampl.bib
Terminated

I killed it after about five minutes. By then it had used
2 minutes of CPU time, with a resident size of 1G and a virtual
size of 2.3G (and growing).

valgrind gives about a gazillion

   'Conditional jump or move depends on uninitialised value(s)'

messages.

It seems \citation{*} is causing this trouble, because a test
without it runs fine (changed to \citation{article-full}).

Having found a working solution, I now wanted to see about
that 'u' at the end of the program name. Big disappointment
there: according to the documentation in 'source', the 'u'
apparently stands for 'Unified' or so, and at first glance it has
nothing to do with Unicode at all. (I could have stopped there,
because to me there would be little point in a drop-in
replacement for bibtex8.)

Nevertheless, the line:

    The 8-bit codepage and sorting file: 88591lat.csf

gave the impression that that csf file is configurable.
00readme.txt from the source says there should be a command
line option:

        -c  --csfile FILE

but this option does not work, nor is it listed in the -h
output: I just get the help text echoed back at me (there are more
options listed in 00readme.txt that do exist, but I am not
in the mood to list them all).

The 00readme.txt from the source says you can set an
environment variable (BIBTEX_CSFILE), so I tried that:


[taco at ntg tmp]$ env BIBTEX_CSFILE=cp47lat.csf bibtexu xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: plain.bst
Database file #1: xampl.bib

Didn't work. Continuing on, it turns out that kpsewhich cannot
find cp47lat.csf either, so I tried an absolute path:

[taco at ntg tmp]$ env BIBTEX_CSFILE=/home/taco/texlive/2010/texmf-dist/bibtex/csf/base/cp437lat.csf bibtexu xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: plain.bst
Database file #1: xampl.bib

Doesn't work either. Then I remembered having seen a debug
switch: --debug=search:

[taco at ntg tmp]$ env BIBTEX_CSFILE=cp437lat.csf bibtexu --debug=search xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: plain.bst
Database file #1: xampl.bib

That doesn't seem to do anything either. Unfazed, I tried --debug=all:

[taco at ntg tmp]$ env BIBTEX_CSFILE=cp437lat.csf bibtexu --debug=all xampl-latex

Lots of output this time, but _nothing_ related to file searching.


Now I could have given up, but then I realized that perhaps the
u in bibtexu is about *input*, not output or whatever is implied
by '8-bit codepage and sorting file'. So I created a copy of
xampl.bib and changed Aamport to "Aaämport", saved as UTF-8,
and ran:

[taco at ntg tmp]$ bibtexu xampl-latex

Much to my surprise, the output is UTF-8! That is exactly what
I wanted, but then what is all this talk of 8-bit csf files
about? I don't understand that at all.

Never mind, now for the real experiment (this is where old bibtex
fails):

   \citation{article-full}
   \bibdata{xampl-utf}
   \bibstyle{alpha}

The "Aaämport" above makes bibtex and bibtex8 generate invalid
UTF-8 output in this case, because they take the first 3 bytes
of the surname instead of the first 3 characters (an important
difference in UTF-8, where "ä" is a two-byte sequence). Here is
what happens with bibtexu:

[taco at ntg tmp]$ bibtexu xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: alpha.bst
Database file #1: xampl-utf.bib
6there is a error: U_ZERO_ERROR[taco at ntg tmp]$

It reports an error, but it *did* generate a bbl file, and the
content of that is correct UTF-8:

   \bibitem[Aaä86]{article-full}
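
To make the byte-versus-character difference concrete, here is a
minimal sketch in Python. It is only an illustration of the
truncation problem, not code from bibtex, bibtex8 or bibtexu, and
the helper name is_valid_utf8 is made up for this example:

```python
# "Aaämport" with a precomposed ä (U+00E4), written as an escape so the
# example does not depend on how this file itself is encoded.
surname = "Aa\u00e4mport"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Byte-oriented truncation: "ä" is two bytes (0xC3 0xA4) in UTF-8, so
# cutting after 3 bytes keeps only the first half of that sequence.
first_three_bytes = surname.encode("utf-8")[:3]

# Character-oriented truncation: keeps three whole code points.
first_three_chars = surname[:3]

print(first_three_bytes)                                   # b'Aa\xc3'
print(is_valid_utf8(first_three_bytes))                    # False
print(is_valid_utf8(first_three_chars.encode("utf-8")))    # True
```

The byte-oriented result ends in a dangling lead byte, which is the
kind of invalid label an 8-bit tool would write into the bbl file.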

Then I tried "The ḠṈÄȚŜ and Gnus Document Preparation System".
Output UTF-8: "The ḡṉäțŝ and gnus document preparation system"

It does work after all!
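
For comparison, a rough sketch in Python of what Unicode-aware
lowercasing does to that title versus ASCII-only lowercasing. This
is an assumption-laden illustration: it ignores brace protection
and everything else a real .bst style does, and both function names
are invented here:

```python
# "The ḠṈÄȚŜ and Gnus Document Preparation System", using escapes for
# the five non-ASCII capitals so the code points are unambiguous.
TITLE = ("The \u1e20\u1e48\u00c4\u021a\u015c and Gnus "
         "Document Preparation System")


def sentence_case(title: str) -> str:
    """Keep the first character, lowercase the rest by Unicode rules."""
    return title[:1] + title[1:].lower()


def ascii_only_case(title: str) -> str:
    """Lowercase only ASCII letters, the way a byte-oriented tool might."""
    data = title.encode("utf-8")
    # bytes.lower() touches only A-Z; bytes >= 0x80 are left alone.
    return (data[:1] + data[1:].lower()).decode("utf-8")


print(sentence_case(TITLE))
print(ascii_only_case(TITLE))
```

The first call lowercases the accented capitals too, matching the
bibtexu output above; the second leaves them in upper case.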

This now makes me believe that all this talk about csf files is
just a bit leftover noise that does not actually mean anything.

So what about that U_ZERO_ERROR report then? No idea. It happens
once for each \citation in the 'alpha' style (as well as in the
cont-xx.bst styles) but it seems harmless.

In the end, what is left is the \citation{*} bug and a lot of
obsolete documentation, I think. (And it took me three hours
to figure this out.)

Best wishes,
Taco