[latex3-commits] [git/LaTeX3-latex3-latex2e] master: some more and some changed words about UTF-8 [ci skip] (a003709)

Frank Mittelbach frank.mittelbach at latex-project.org
Fri Mar 30 21:37:26 CEST 2018


Repository : https://github.com/latex3/latex2e
On branch  : master
Link       : https://github.com/latex3/latex2e/commit/a003709a3b09861559086fdf113cdfdcc2caa7c9

>---------------------------------------------------------------

commit a003709a3b09861559086fdf113cdfdcc2caa7c9
Author: Frank Mittelbach <frank.mittelbach at latex-project.org>
Date:   Fri Mar 30 21:37:26 2018 +0200

    some more and some changed words about UTF-8 [ci skip]


>---------------------------------------------------------------

a003709a3b09861559086fdf113cdfdcc2caa7c9
 doc/ltnews28.tex |  213 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 176 insertions(+), 37 deletions(-)

diff --git a/doc/ltnews28.tex b/doc/ltnews28.tex
index 07df086..59d5eeb 100644
--- a/doc/ltnews28.tex
+++ b/doc/ltnews28.tex
@@ -35,6 +35,10 @@
 
 \usepackage{lmodern,url,hologo}
 
+\providecommand\acro[1]{\textsc{#1}}
+\providecommand\meta[1]{$\langle$\textit{#1}$\rangle$}
+
+
 \publicationmonth{April}
 \publicationyear{2018}
 
@@ -47,7 +51,7 @@
 
 \setlength\rightskip{0pt plus 3em}
 
-\section{New home for \LaTeXe{} sources}
+\section{A new home for \LaTeXe{} sources}
 
 In the past the development version of the \LaTeXe{} source files has
 been managed in a Subversion source control system with read access
@@ -81,43 +85,146 @@ The requirements and the workflow for reporting a bug in the core
 \end{quote}
 and with further details also discussed in~\cite{Mittelbach:TB39-1}.
 
-\section{Default input encoding}
-Since the release of \LaTeXe, \LaTeX\ has supported multiple file encodings
-via the \package{inputenc} package. It used to be necessary to support several
-different input encodings to support different languages. These days Unicode
-and in particular the UTF-8 file encoding can support multiple languages
-in a single encoding. UTF-8 is the default  encoding in most current operating
-systems and editors, and is the only encoding natively supported by
-\hologo{LuaTeX} and \hologo{XeTeX}.
-
-With this release, the default encoding for \LaTeX\ files has been
-changed to UTF-8 if used with classic \TeX\ or PDF\TeX. The
-implementation is essentially the same as the existing UTF-8 support
-from \verb|\usepackage[utf8]{inputenc}|.
 
-Documents using non ASCII characters should already be specifying the
-encoding used via an option to the \package{inputenc} package. Such
-documents should not be affected by this change in default.
+\section{UTF-8: the new default input encoding}
+
+The first \TeX{} implementations only supported reading 7-bit
+\acro{ascii} files---any accented or otherwise ``special'' character
+had to be entered using commands, if it could be represented at
+all. For example, to obtain an ``\"a'' one would enter \verb=\"a=, and to
+typeset a ``\ss'' the command \verb=\ss=. Furthermore, fonts at that
+time held only 128 glyphs: the \acro{ascii} characters, some
+accents to build composite glyphs from a letter and an accent, and a
+few special symbols such as parentheses.
+
+With 8-bit \TeX{} engines such as \hologo{pdfTeX} this situation changed
+somewhat: it was now possible to process 8-bit files, i.e., files that
+could encode 256 different characters. However, 256 is still a fairly
+small number, and with this limitation it is only possible to encode a
+few languages; for other languages one would need to change the
+encoding (i.e., interpret the character positions 0--255 in a
+different way). The first code points 0--127 were essentially standardized
+(corresponding to \acro{ascii}) while the second half 128--255 would
+vary by holding different accented characters to support a certain set
+of languages.
+
+Each computer used one of these encodings when storing or interpreting
+files and as long as two computers used the same encoding it was
+(easily) possible to exchange files between them and have them
+interpreted and processed correctly.
+
+But different computers may have used different encodings, and given
+that a computer file is simply a sequence of bytes with no indication of
+the encoding for which it was destined, chaos could easily ensue, and
+it did. For example, the German word ``Gr\"o\ss e'' (height) entered on a
+German keyboard could show up as ``Gr\v T\`ae'' on a different
+computer using a different encoding by default.
+
+So in summary the situation was far from satisfactory, and it was clear in
+the early nineties that \LaTeXe{} (which was being developed to provide
+a \LaTeX{} version usable across the world) had to provide a solution
+to this issue.
+
+The \LaTeXe{} answer was the introduction of the \package{inputenc}
+package~\cite{Mittelbach:Brno95} through which it is possible to
+provide support for multiple encodings. It also makes it possible to correctly
+process a file written in one encoding on a computer using a different
+encoding and even supports documents where the encoding changes
+midway.
+
+Since the first release of \LaTeXe{} in 1994, \LaTeX{} documents that
+used any characters outside \acro{ascii} in the source (i.e., any
+characters in the range of 128--255) were supposed to load
+\package{inputenc} and specify in which file encoding they were
+written and stored.
+%
+If the \package{inputenc} package was not loaded then \LaTeX{} used a
+``raw'' encoding which essentially took each byte from the input file
+and typeset the glyph that happened to be in that position in the
+current font---something that sometimes produces the right result but
+often enough will not.
+
+In 1992 Ken Thompson and Rob Pike developed the UTF-8 encoding scheme,
+which allows all Unicode characters to be encoded as sequences of 8-bit
+bytes, and over time this encoding has gradually taken over the world,
+replacing the legacy 8-bit encodings used before. These days all major
+computer operating systems use UTF-8 to store their files and it
+requires some effort to explicitly store files in one of the legacy
+encodings.
+
+As a result, whenever \LaTeX{} users want to use any accented
+characters from their keyboard (instead of resorting to \verb=\"a= and
+the like) they always have to use
+\begin{verbatim}
+  \usepackage[utf8]{inputenc}
+\end{verbatim}
+in the preamble of their documents as otherwise \LaTeX{} will produce
+gibberish.
+
+\subsection*{The new default}
 
-Some documents would have been using accemted letters \emph{without}
-loading \package{inputenc}, relying on the similarities between the
-input used and the T1 font encoding.  These documents will generate an
-error that they are not valid UTF-8, however the documents may be
-easily processed by specifying the encoding used by adding a line such
-as \verb|\usepackage[utf8]{inputenc}|, or adding the new command
-\verb|\UseRawInputEncoding| as the first line of the file. This will
-re-instate the previous default.
+With this release, the default encoding for \LaTeX\ files has been
+changed from the ``fall through raw'' encoding to UTF-8 if used with
+classic \TeX\ or \hologo{pdfTeX}. The implementation is essentially
+the same as the existing UTF-8 support from
+\verb|\usepackage[utf8]{inputenc}|.  
+
+The \hologo{LuaTeX} and \hologo{XeTeX} engines always supported the
+UTF-8 encoding as their native (and only) input encoding, so with
+these engines \package{inputenc} was always a no-op.
+
+This means that with new documents one can assume UTF-8 input and it
+is no longer required to always specify
+\verb|\usepackage[utf8]{inputenc}|. But if this line is present it
+will not hurt either.
+
+
+\subsection*{Compatibility}
+
+For most existing documents this change will be transparent:
+\begin{itemize}
+\item documents using only \acro{ascii} in the input file and
+  accessing accented characters via commands;
+\item documents that specified the encoding of their file via an
+  option to the \package{inputenc} package and then used 8-bit
+  characters in that encoding;
+\item documents that already had been stored in UTF-8 (whether or not
+  specifying this via \package{inputenc}).
+\end{itemize}
+Only documents that have been stored in a legacy encoding and used
+accented letters from the keyboard \emph{without} loading
+\package{inputenc} (relying on the similarities between the input used
+and the T1 font encoding) are affected.
+
+These documents will now generate an error that they contain invalid
+UTF-8 sequences.  However, such documents may be easily processed by
+adding the new command \verb|\UseRawInputEncoding| as the first line
+of the file. This will re-instate the previous ``raw'' encoding
+default.
 
 \verb|\UseRawInputEncoding| may also be used on the command line to
-process existing files without requiring the file to be edited\\
-  \verb|pdflatex '\UseRawInputEncoding \input'  file|\\
+process existing files without requiring the file to be edited
+\begin{verbatim}
+  pdflatex '\UseRawInputEncoding \input'  file
+\end{verbatim}
 will process the file using the previous default encoding.
 
+Possible alternatives are re-encoding the file to UTF-8 using a tool
+(such as \texttt{recode}, \texttt{iconv}, or an editor) or adding the line
+\begin{flushleft}
+\verb=  \usepackage[=\meta{encoding}\verb=]{inputenc}=
+\end{flushleft}
+to the preamble specifying the \meta{encoding} that fits the file
+encoding.  In many cases this will be \texttt{latin1} or
+\texttt{cp1252}. For other encoding names and their meaning see the
+\package{inputenc} documentation.
+
 As usual, this change may also be reverted via the more general
 \package{latexrelease} package mechanism, by specifying a release date
 earlier than this release.
 
-\section{General rollback concept for packages and classes}
+\section[A general rollback concept]
+        {A general rollback concept for packages and classes}
 
   In 2015 a rollback concept for the \LaTeX{} kernel was introduced.
   Providing this feature allowed us to make corrections to the
@@ -156,10 +263,6 @@ for this. For the programming level we also added
 
 
 
-\section{Further TU encoding improvements}
-
-Anything here?
-
 \section{Changes to packages in the tools category}
 
 \subsection{\LaTeX{} table columns with fixed widths}
@@ -189,21 +292,57 @@ needs adjustment.
 
 \begin{thebibliography}{9}
   
-\bibitem{Mittelbach:TB38-2-213} Frank Mittelbach:
-  \emph{\LaTeX{} table columns with fixed widths}.  
-  In: TUGBoat, 38\#2, 2017.
-  \url{https://www.latex-project.org/publications/}
-
 \bibitem{Mittelbach:TB39-1} Frank Mittelbach:
   \emph{New rules for reporting bugs in the \LaTeX{} core software}.  
   Submitted to TUGBoat.
   \url{https://www.latex-project.org/publications/}
 
+\bibitem{Mittelbach:Brno95} Frank Mittelbach:
+  \emph{\LaTeXe{} Encoding Interface --- Purpose, concepts, and 
+   Open Problems}.  
+  Talk given in Brno June 1995.
+  \url{https://www.latex-project.org/publications/}
+
 \bibitem{Mittelbach:TB39-2} Frank Mittelbach:
   \emph{A rollback concept for packages and classes}.  
   Submitted to TUGBoat.
   \url{https://www.latex-project.org/publications/}
 
+\bibitem{Mittelbach:TB38-2-213} Frank Mittelbach:
+  \emph{\LaTeX{} table columns with fixed widths}.  
+  In: TUGBoat, 38\#2, 2017.
+  \url{https://www.latex-project.org/publications/}
+
 \end{thebibliography}
 
 \end{document}
+
+
+
+Since the release of \LaTeXe, \LaTeX\ has supported multiple file encodings
+via the \package{inputenc} package. It used to be necessary to support several
+different input encodings to support different languages. These days Unicode
+and in particular the UTF-8 file encoding can support multiple languages
+in a single encoding. UTF-8 is the default  encoding in most current operating
+systems and editors, and is the only encoding natively supported by
+\hologo{LuaTeX} and \hologo{XeTeX}.
+
+Documents using non-ASCII characters should already be specifying the
+encoding used via an option to the \package{inputenc} package. Such
+documents should not be affected by this change in default.
+
+
+Some documents would have been using accented letters \emph{without}
+loading \package{inputenc}, relying on the similarities between the
+input used and the T1 font encoding.  These documents will generate an
+error that they are not valid UTF-8, however the documents may be
+easily processed by specifying the encoding used by adding a line such
+as \verb|\usepackage[utf8]{inputenc}|, or adding the new command
+\verb|\UseRawInputEncoding| as the first line of the file. This will
+re-instate the previous default.
+
+\verb|\UseRawInputEncoding| may also be used on the command line to
+process existing files without requiring the file to be edited\\
+  \verb|pdflatex '\UseRawInputEncoding \input'  file|\\
+will process the file using the previous default encoding.
+
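
As an illustration of the behaviour described in the news text above (this
is not part of the commit), a minimal document relying on the new default
might look like the sketch below. It assumes the file is saved as UTF-8 and
compiled with pdflatex from the April 2018 LaTeX release or later. The
sample words and the T1 font encoding are illustrative choices, not taken
from the commit.

```latex
% Sketch only: assumes this file is saved as UTF-8 and compiled with
% pdflatex from the April 2018 LaTeX release or later.
%
% For a file stored in a legacy 8-bit encoding, one of the following
% would be needed instead (per the Compatibility section above):
%   \UseRawInputEncoding            % as the very first line of the file, or
%   \usepackage[latin1]{inputenc}   % in the preamble, if the file is Latin-1
\documentclass{article}
\usepackage[T1]{fontenc}  % illustrative choice: font encoding with accented glyphs
\begin{document}
% Accented letters typed directly; no \usepackage[utf8]{inputenc} required.
Größe, café, naïve
\end{document}
```

Leaving \usepackage[utf8]{inputenc} in the preamble of such a file should
be harmless, since the news text states that the line "will not hurt" when
present.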




