[tex-hyphen] [lltx] spell checking TeX files

Wed Oct 10 19:17:11 CEST 2012

[Full quote, because I mistakenly moved the discussion to the lualatex
list.  Moving now to luatex at tug.org, please reply there.  Sorry for the
multiple mails.]

Am 03.10.2012 20:06, schrieb Stephan Hennig:
> [CC'ing lualatex-dev at tug.org and tex-hyphen at tug.org,
> since spell checking is of international concern.
> Please reply to lualatex-dev at tug.org.]
> 
> Am 02.10.2012 16:01, schrieb Pander:
> 
>> You can mention that the Dutch patterns are being processed by OpenTaal.
>> They are put on hold since we are working very hard on the next version
>> of spell checking at the moment.
> 
> You're speaking about spell checking, not hyphenation, right?  Could you
> please elaborate a bit?
> 
> I've recently thought about spell checking of TeX documents and came up
> with the following idea that requires LuaTeX's node list manipulations:
> 
> 1. In the first LuaTeX run, write all typeset text into a UTF-8 encoded
> text file.
> 
> 2. Feed that text file to your favourite spell checker, generating a
> list of bad words.
> 
> 3. In the second run, LuaTeX reads in the list of bad words and puts a
> red wavy line under all bad words in the document.  A possible approach
> is to mark nodes corresponding to a bad word in pre_linebreak_filter
> with an attribute so that they can be identified later.
> 
> 
> Pro:
> 
> + The approach is spell checking application agnostic.  It only
>   requires that the spell checker can output a list of bad words
>   (aspell and hunspell can do so).
> 
> + The spell checker doesn't need to know TeX syntax.  Even though
>   aspell and hunspell can cope with TeX source files, they cannot
>   spell check TeX-generated text that is not explicitly in
>   the source file.  Additionally, commercial spell checkers likely
>   do not know about TeX (such as Duden Korrektor, a spell checker
>   for the German language).
> 
> + You can optionally use multiple spell checkers at once.
> 
> + Point'n'click people have their red wavy lines in the PDF, while
>   others can still just look at the list of bad words.
> 
> + The approach might work with grammar checkers as well.  Don't know.
> 
> 
> Cons:
> 
> - Red wavy lines are only marketing ...
> 
> 
> I have attached a small package totext (license is LPPL) trying to
> implement step 1 outlined above.  To test it, add the line
> 
>   \usepackage{totext}
> 
> to a LaTeX file and process that with LuaLaTeX.  The package should work
> with other formats as well, but then users need to adapt file
> totext.sty, which consists of only 2 lines.  During the TeX run, a file
> <jobname>.txt is created that should contain most of the text of the TeX
> output.  The output is wrapped to a fixed line length, which is
> currently hard-coded to 72 characters per line (can be adjusted on
> line 164 of
> totext.lua).  Attached is file sample2e.txt, which contains the output
> of a compile run of sample2e.tex.
> 
> The package currently hooks into the pre_linebreak_filter and
> hpack_filter callbacks.  I'm not sure what the best callbacks are, but
> to avoid irritating the spell checker words should preferably not be
> hyphenated in the text file.  The red wavy lines, on the other hand,
> need to be inserted after all text is laid out on the page (perhaps in
> buildpage_filter?).

The code is now available on GitHub,
<URL:https://github.com/hennigs/spelling>.
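
For illustration, steps 2 and 3 of the quoted idea boil down to turning
the extracted text into a list of bad words that the second run can look
up.  The following is a minimal Python sketch, not code from the package:
`bad_words` and `toy_checker` are hypothetical names, and the tiny word
set merely stands in for a real spell checker such as aspell or hunspell.

```python
import re

def bad_words(text, is_known):
    """Return the sorted, de-duplicated words in `text` rejected by `is_known`."""
    # Runs of Unicode letters count as words; digits and punctuation split them.
    words = re.findall(r"[^\W\d_]+", text)
    return sorted({w for w in words if not is_known(w)})

# Hypothetical stand-in for a real spell checker's dictionary.
KNOWN = {"the", "quick", "brown", "fox"}

def toy_checker(word):
    """Accept a word if its lower-cased form is in the toy dictionary."""
    return word.lower() in KNOWN

sample = "The quick brwon fox, the quikc fox"
print(bad_words(sample, toy_checker))
```

Because the checker is passed in as a plain function, several checkers
could be combined, e.g. by flagging a word only if all of them reject it.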

> What doesn't work:
> 
> * The package currently doesn't deal with mathematics.

See issue #8.

> * Ligatures are not resolved into their constituent letters.

I've added a code point substitution feature.  The most important Latin
ligatures, like 'ﬀ', 'ﬁ' etc., are now translated into 'ff', 'fi' etc.
to help spell checkers that cannot cope with them.  The translation
table is currently hard-coded.  A TeX interface for fine-grained
substitution control would be nice, e.g., for switching off the
substitution of long s (ſ) by s.
Contributions are warmly welcome, especially those for the TeX parts.
I'd love to see issues #1 and #2 resolved soon to make a first upload to
CTAN.
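
To make the idea concrete, here is a minimal sketch of such a
substitution table with an optional switch for long s.  It is written in
Python rather than the package's Lua, and the table contents as well as
the names `LIGATURES` and `resolve_ligatures` are assumptions made for
illustration, not the package's actual table or interface.

```python
# Illustrative table only; the package's hard-coded table may differ.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
}

def resolve_ligatures(text, table=LIGATURES, long_s=False):
    """Replace ligature code points by their constituent letters.

    With long_s=True, long s (U+017F) is additionally replaced by 's'.
    """
    mapping = dict(table)
    if long_s:
        mapping["\u017F"] = "s"
    return text.translate(str.maketrans(mapping))

print(resolve_ligatures("o\uFB03ce \uFB01sh"))
print(resolve_ligatures("Wach\u017Ftube", long_s=True))
```

Keeping the long s substitution behind a flag mirrors the wish for
fine-grained control: the default leaves ſ untouched.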

> * Footnote marks are missing in the text.

That works now.

> * It fails miserably on the \LaTeX logo.  The package adopted the
>   definition of a word from the chickenize package (start with a
>   glyph node, end with a node whose id is none of 37 glyph,
>   7 disc, 11 kern, 22 ???).  It seems like more nodes have to be
>   considered as being possible parts of a word.

The definition of the LaTeX logo contains a \vbox.  That is best
fixed by providing a definition of the logo without a \vbox within a
word (the TeX logo does without one), see issue #12.
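
The effect of the quoted word definition can be shown with a toy model.
This Python sketch is not the package's Lua code: nodes are reduced to
(id, character) pairs, the ids 37, 7, 11 and 22 are those quoted above,
the vlist id is an assumption for this example, and both node lists are
invented and far simpler than any real logo definition.

```python
GLYPH, DISC, KERN, ID_22 = 37, 7, 11, 22  # ids as quoted in the message
WORD_IDS = {GLYPH, DISC, KERN, ID_22}     # nodes that may continue a word
VLIST = 1                                 # assumed id of a \vbox node

def words(nodes):
    """Split a list of (id, char) pairs into words.

    A word starts at a glyph node and ends at the first node whose id
    is not in WORD_IDS.  Only glyph nodes contribute characters.
    """
    result, current, in_word = [], [], False
    for node_id, char in nodes:
        if not in_word:
            if node_id == GLYPH:
                in_word, current = True, [char]
        elif node_id in WORD_IDS:
            if node_id == GLYPH:
                current.append(char)
        else:
            result.append("".join(current))
            in_word = False
    if in_word:
        result.append("".join(current))
    return result

# Invented node lists: glyphs joined by kerns, without and with a vlist.
plain = [(GLYPH, "A"), (KERN, None), (GLYPH, "B"), (KERN, None), (GLYPH, "C")]
with_vbox = [(GLYPH, "A"), (KERN, None), (VLIST, None), (KERN, None), (GLYPH, "C")]

print(words(plain))
print(words(with_vbox))
```

In this model the vlist node ends the word early, so the second list
falls apart into two fragments and the boxed letter is lost.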

I'm on the road for the rest of the week and may be a bit less
responsive.  Oh, and did I mention that I'd be happy to hand over
maintenance of this package to someone else?  Check it out!

Happy TeXing,
Stephan Hennig