[tex-live] Unicode filename problem

Zdenek Wagner zdenek.wagner at gmail.com
Sat Nov 1 23:51:12 CET 2008


2008/11/1 Torsten Ekedahl <teke at math.su.se>:
> I've just switched from iso-latin-1 to UTF-8 on my computer. This has revealed
> a problem with tex.
>
Remember that TeX is not a UTF-8 program; it works with 8-bit characters
internally. If you use a non-ASCII character, it will be represented in
UTF-8 as 2, 3, or 4 bytes and TeX will see that many characters.
These characters may have different \catcodes, making some bytes
ignored or illegal. In order to use such a file name you would have to
change the \catcode of a lot of characters to 12, but then the included
file might not work. Using \usepackage[utf8]{inputenc} makes things
even worse because the \catcodes of some bytes used in multibyte UTF-8
characters are set to 13. The characters will be expanded to control
sequences and hence a file with such a name will be unreadable. You can
try the following definition:

\def\UTFinput{\begingroup
  \count@\z@
  \loop
    \catcode\count@ 12 % make every character code 0..255 "other"
    \ifnum\count@<255
    \advance\count@\@ne
  \repeat
  \catcode`\{1 \catcode`\}2 \catcode`\\0 % restore braces and backslash
  \doUTFinput}
\def\doUTFinput#1{\endgroup \input #1 }

I have not tried it; I have just typed it from scratch, but
\UTFinput{filename} may then work. Do not forget to put the definition
in a style file or between \makeatletter and \makeatother.
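
As an equally untested sketch of the style-file route: the file name
utfinput.sty below is only my invention, and I assume that
inlämning.tex contains just body text and sits in the current
directory. The ä in the name reaches TeX as the two bytes ^^c3 ^^a4,
which \UTFinput reads with \catcode 12 instead of letting inputenc
expand them.

% utfinput.sty (hypothetical name); @ is a letter in a style file
\ProvidesPackage{utfinput}
\def\UTFinput{\begingroup
  \count@\z@
  \loop
    \catcode\count@ 12
    \ifnum\count@<255
    \advance\count@\@ne
  \repeat
  \catcode`\{1 \catcode`\}2 \catcode`\\0
  \doUTFinput}
\def\doUTFinput#1{\endgroup \input #1 }

% test.tex
\documentclass[a4paper,twoside]{article}
\usepackage[utf8]{inputenc}
\usepackage{utfinput}
\begin{document}
\UTFinput{inlämning.tex}
\end{document}

Because \doUTFinput closes the group before \input is executed, the
included file itself is read with the normal \catcodes.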

I would also like to mention that encTeX will not help either. The
solution will be similar but the \UTFinput macro will have to set
\mubytein\z@ before calling \doUTFinput. If you want to work so
intensively with UTF-8, it is better to switch to XeTeX.
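
If you go that way, a sketch (again untested, and again assuming that
inlämning.tex contains just body text) could be as simple as

\documentclass[a4paper,twoside]{article}
% no inputenc here: XeTeX reads the source as UTF-8 by itself
\begin{document}
\input{inlämning.tex}
\end{document}

processed with xelatex instead of latex. XeTeX works with Unicode
characters internally, so ä stays a single character in the file name.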

> In case it matters I use the Ubuntu 2007-13 version of texlive and get the
> following version information
>
> homealone[1]latex
> This is pdfTeXk, Version 3.141592-1.40.3 (Web2C 7.5.6)
>
> Anyway the problem is with non-ASCII letters in file names. There is no
> problem if I use straight latex:
>
> homealone[1]latex inlämning.tex
> This is pdfTeXk, Version 3.141592-1.40.3 (Web2C 7.5.6)
>  %&-line parsing enabled.
> entering extended mode
> (./inlämning.tex
> LaTeX2e <2005/12/01>
> Babel <v3.8h> and hyphenation patterns for english, usenglishmax, swedish,
> dumy
> lang, nohyphenation, loaded.
> )
> *
>
> (The actual filename may come out funny in this mail but it is
> inl aedieresis mning.tex )
>
> However in case I try to input the file from another one it doesn't work
>
> homealone[1]cat test.tex
> \input inlämning.tex
> homealone[1]latex test.tex
> This is pdfTeXk, Version 3.141592-1.40.3 (Web2C 7.5.6)
>  %&-line parsing enabled.
> entering extended mode
> (./test.tex
> LaTeX2e <2005/12/01>
> Babel <v3.8h> and hyphenation patterns for english, usenglishmax, swedish,
> dumy
> lang, nohyphenation, loaded.
> (./inlämning.tex) (/usr/share/texmf-texlive/tex/latex/base/article.cls
> Document Class: article 2005/09/16 v1.4f Standard LaTeX document class
> (/usr/share/texmf-texlive/tex/latex/base/size10.clo))
> (/usr/share/texmf-texlive/tex/latex/base/inputenc.sty
> (/usr/share/texmf-texlive/tex/latex/base/utf8.def
> (/usr/share/texmf-texlive/tex/latex/base/t1enc.dfu)
> (/usr/share/texmf-texlive/tex/latex/base/ot1enc.dfu)
> (/usr/share/texmf-texlive/tex/latex/base/omsenc.dfu)))
> ! I can't find file `inl'.
> <to be read again>
>                   \unhbox
> l.8 \input inlä
>                mning.tex
>
> I get the same result if I first try to switch to utf8 encoding:
> homealone[1]cat test.tex
> \documentclass[a4paper,twoside]{article}
> \usepackage[utf8]{inputenc}
> \input inlämning.tex
>
> This is probably not too surprising (and I wasn't surprised), but it is not clear to me
> that this is the way God intended it to be. However, it becomes more
> surprising if one tries to dump a format:
>
> homealone[1]cat retex.tex
> \let\DDDD\dump
> \let\dump\relax
> \input latex.ltx
> \documentclass[a4paper,twoside]{article}
> \usepackage[utf8]{inputenc}
> \DDDD
> homealone[1]pdftex --ini --output-format dvi retex.tex
> homealone[1]pdftex '&retex' inlämning.tex
> This is pdfTeXk, Version 3.141592-1.40.3 (Web2C 7.5.6)
>  %&-line parsing enabled.
> ! I can't find file `inl'.
> <to be read again>
>                   \unhbox
> <*> &retex inl^^c3^^a4
>                      mning.tex
> Please type another input file name:
> ! Emergency stop.
> <to be read again>
>                   \unhbox
> <*> &retex inl^^c3^^a4
>                      mning.tex
>
> Apart from the fact that I get different characters (ä vs ^^c3^^a4) reported I
> more or less understand that the conversion from UTF-8 to TeX's internal
> character format plays havoc with the file name. However, I wanted to bring
> it to everyone's attention but wouldn't be too upset with a "don't do that
> then" kind of answer.
>
>                Torsten
>
>
>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz

