[luatex] lua-inputenc

Javier Múgica javier at digi21.eu
Fri Feb 20 14:26:30 CET 2009


>if you write something like \protect ü ...

The \protect doesn't make any difference. The relevant fact is that
the ü is, in Windows for instance, the byte FC, i.e., 11111100, and in
a utf-8 file that byte is not a single unit, so it is not a "byte" in
the sense of minimum significant unit.

>>`But I can make it active and map it to something valid.'
>>`No, that character does not exist.'
>>`What do you mean, it does not exist? Of course my file includes 11111100 bytes!'
>>`It also includes six-bit sequences, but you certainly cannot make the six-bit sequence 110100 (or any other) an active character.'

THE ONLY WAY TO HANDLE 8-BIT INPUT IS BY INTERCEPTING THE INPUT WITH A
CALLBACK AND TRANSFORMING IT BEFORE THE LUA ENGINE CAN READ IT.
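
As an illustration, here is a minimal sketch of the idea, written as
a plain Lua file. The file name and the assumption that the 8-bit
input is latin-1 are only for the example; the actual code is in my
previous message. It could be loaded with
\directlua0{dofile('byte-to-utf8.lua')}:

    -- byte-to-utf8.lua: sketch only; assumes the source is latin-1.
    -- Every byte in the range 128-255 is replaced by the utf-8
    -- encoding of the character at the same code point, before the
    -- engine ever sees the line.
    callback.register('process_input_buffer', function (line)
      return (line:gsub('[\128-\255]', function (b)
        -- the first 256 Unicode code points coincide with latin-1
        return unicode.utf8.char(string.byte(b))
      end))
    end)

(The extra parentheses around gsub matter: gsub returns a second
value, and the callback must return a single string.)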

>No, you wouldn't use utf8 but, for example, ansinew.def, which maps character 0x80 (decimal 128) to \texteuro.

Yes, I meant that.

>This method is the most compatible way, but it strips away the new possibilities of using UTF-8 encoding with LuaTeX.
>Therefore I don't think it's a good base for new development.

If you are using an 8-bit input encoding you won't have the utf-8
facilities anyway; you are limited to 256 input characters.


I see three possible scenarios:

1. Old code which we want to compile with luatex

    In this case I think the solution I proposed is the best one:

      \ifdefined\directlua
      \input byte-to-utf8.ltex   % This is the file including the code
                                 % from my previous message.
      \fi

2. New code in all respects, i.e., not only the author's document but
also the packages it loads.

    Then the files would be stored in the utf-8 encoding and there is
    no need for an input encoding scheme at all. There persists the
    need for an output encoding while we still use tfm files, but
    this is a different problem.

3. New documents, with utf-8 encoding, using old packages.

    Then it is likely that nothing needs to be done here either, for
    packages cannot include input-dependent code if they are to come
    out right on a computer other than the author's.
    The following example is taken from the LaTeX Companion itself and
    belongs to the package varioref (the definition of \reftextafter):

       \cyrn\cyra\ \cyrs\cyrl\cyre\cyrd\cyru\cyryu\cyrshch\cyre\cyrshrt ...

       `Clearly, no one wants to type text like this on a regular
       basis. Nevertheless, it has the advantage of being universally
       portable.'

    There are a few exceptions to this:

        a) Characters within comments written in the author's native
      language. Since they appear in comments and are thus ignored,
      this is not a problem. Note that even invalid characters may
      appear within comments. Neither TeX nor LuaTeX will complain
      about the following code.

                \catcode`X=15  %Invalid
                \relax %Xé
                \end

        b) Code dealing with encodings, either input or output.
      As for the .def files of the LaTeX inputenc package, as I
      pointed out, they do not include non-ascii characters but rather
      lines like \DeclareInputText{140}{\OE}. The same is true for the
      output encodings---the fontenc package:
      \DeclareTextSymbol{\OE}{T1}{215}. In LaTeX, they made the
      decision not to include non-ascii bytes in any file. I don't
      know how ConTeXt handles this.
      So before trying to solve nonexistent problems, whereby we may
      add more trouble than we solve, let's investigate whether the
      problem actually exists. I did the following in LaTeX's base
      folder: >copy *.* all.tex, and then used a text editor to find
      bytes >=128, and found none. (A short Lua script for repeating
      this kind of test is sketched after this list.)

        c) Single-character control sequences or private use of such
      characters in the package.
      I made a more radical test than the previous one. I joined all
      the files from my LaTeX folder and searched for bytes >=128. I
      found a very few within comments (author names) and, apart from
      that, only a few single-character control sequences in the
      package usr.sty: \ß, \µ and \¿.
      I then performed the same test on the babel files and found
      none.
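
    By the way, instead of a text editor, a few lines of Lua will
    repeat either test. A sketch, assuming the joined file is called
    all.tex:

        -- Report every byte >= 128 found in all.tex, with its offset.
        local f = assert(io.open('all.tex', 'rb'))
        local data = f:read('*a')
        f:close()
        for pos, b in data:gmatch('()([\128-\255])') do
          print(pos, string.byte(b))
        end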


    For the very few packages using some non-ascii characters, if
    they are still maintained by their authors, and the authors want
    them to be usable both with old and new documents, they may
    simply write

       \ifdefined\directlua
       \directlua0{mypackage_previous_pibuffer=callback.find('process_input_buffer')}
       \input byte-to-utf8.ltex
       \fi

       <the file contents>

       \ifdefined\directlua
       \directlua0{callback.register('process_input_buffer',mypackage_previous_pibuffer)}
       \fi

    The same can be said of authors who wrote packages for themselves
    or for a restricted set of nearby users, where an input encoding
    can be assumed and thus they may include many non-ascii
    characters.

    Finally, just in case there remain some frozen packages with
    non-ascii code, the documentation for the user could simply
    include something like this (I exemplify with LaTeX):

    >>In order to use a package you write
    >>
    >>   \usepackage{<packagename>}
    >>
    >>for example, \usepackage{fancyhdr}. There may be some old
    >>packages from which you get the strange message: `Text line
    >>contains invalid utf-8 sequence.' If that happens, then write
    >>instead
    >>
    >>   \useoldpackage{<packagename>}

    And a similar one, \oldinput{<filename>}, for \input <filename>.
    The definition of both of these would be something like:

     \def\oldinput#1{%
       \ifdefined\directlua
         \directlua0{previous_pibuffer=callback.find('process_input_buffer')}%
         \input byte-to-utf8.ltex
         \def\next{\input #1%
           \directlua0{callback.register('process_input_buffer',previous_pibuffer)}}%
       \else
         \def\next{\input #1}%
       \fi
       \next
     }

     \def\useoldpackage{\let\InputIfFileExists\OldInputIfFileExists\usepackage}
     \def\OldInputIfFileExists#1#2{\let\input\oldinput\@@InputIfFileExists{#1}{#2}%
       \let\input\@@input\let\InputIfFileExists\@@InputIfFileExists}

     (Actually, it has to be a bit more elaborate, since a naive use
     of a single previous_pibuffer variable like this will not work
     for nested \oldinput's.)
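
     One way to make it nest properly (a sketch; the stack and the
     function names are invented here) is to keep the saved callbacks
     in a Lua table used as a stack:

        -- Sketch: a stack of saved callbacks, so that nested
        -- \oldinput's each restore exactly what they replaced.
        local saved = {}
        function pibuffer_push()
          -- false marks "no callback was set at this level"
          saved[#saved + 1] =
            callback.find('process_input_buffer') or false
        end
        function pibuffer_pop()
          local previous = table.remove(saved)
          -- registering nil simply removes the callback
          callback.register('process_input_buffer', previous or nil)
        end

     \oldinput would then call pibuffer_push() before \input
     byte-to-utf8.ltex and pibuffer_pop() once the file has been read.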


%%%%%%%%%%%%%%%%%%%%%%%

I tested the second scenario above. I saved the book referred to in
my previous message with utf-8 encoding, did nothing more, and it
compiled perfectly. My code may write non-ascii characters to files,
but since it uses utf-8 and luatex, that written code is itself
utf-8. Note that inputenc is still necessary, but only because the
LICR objects are used by packages, in particular by fontenc. However,
for a future lualatex the LICR could just be the utf-8 itself, except
for \SS and some others. But this is a problem of latex itself more
than of the input encoding.
As for the output encoding (i.e., the font), while we still use
256-character-limited .tfm files, characters still need to be active.
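
For instance (a sketch of the idea only, not the actual lua-inputenc
code): as long as the current font cannot reach a character directly,
that character has to be made active and mapped to a command, roughly
like

     \catcode`\€=\active
     \def€{\texteuro}

which is why an output encoding scheme is still needed.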

--Javier A.

