[tex-hyphen] [lltx] spell checking TeX files
Stephan Hennig
mailing_list at arcor.de
Wed Oct 10 19:17:11 CEST 2012
[Full quote, because I mistakenly moved the discussion to the lualatex
list. Moving now to luatex at tug.org, please reply there. Sorry for the
multiple mails.]
On 03.10.2012 20:06, Stephan Hennig wrote:
> [CC'ing lualatex-dev at tug.org and tex-hyphen at tug.org,
> since spell checking is of international concern.
> Please reply to lualatex-dev at tug.org.]
>
> On 02.10.2012 16:01, Pander wrote:
>
>> You can mention that the Dutch patterns are being processed by OpenTaal.
>> They are put on hold since we are working very hard on the next version
>> of spell checking at the moment.
>
> You're speaking about spell checking, not hyphenation, right? Could you
> please elaborate a bit?
>
> I've recently thought about spell checking of TeX documents and came up
> with the following idea that requires LuaTeX's node list manipulations:
>
> 1. In the first LuaTeX run, write all typeset text into a UTF-8 encoded
> text file.
>
> 2. Feed that text file to your favourite spell checker, generating a
> list of bad words.
>
> 3. In the second run, LuaTeX reads in the list of bad words and puts a
> red wavy line under all bad words in the document. A possible approach
> is to mark nodes corresponding to a bad word in pre_linebreak_filter
> with an attribute so that they can be identified later.
>
>
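Step 2 above could look roughly like the following Python sketch (Python only because this step happens outside the TeX run). It assumes aspell is installed and on the PATH. The function names parse_bad_words and bad_words_from_file are made up for illustration and are not part of any package.

```python
import subprocess


def parse_bad_words(checker_output):
    """Turn checker output (one word per line) into a sorted list without duplicates."""
    return sorted({line.strip() for line in checker_output.splitlines() if line.strip()})


def bad_words_from_file(text_path, lang="en"):
    """Pipe a UTF-8 text file through `aspell list` and return the bad words.

    Assumption: aspell is installed; any checker that prints one
    misspelled word per line could be swapped in here.
    """
    with open(text_path, "rb") as handle:
        result = subprocess.run(
            ["aspell", "--lang=" + lang, "--encoding=utf-8", "list"],
            stdin=handle,
            capture_output=True,
            check=True,
        )
    return parse_bad_words(result.stdout.decode("utf-8"))
```

Since the checker is only asked for a plain word list, the lists from several checkers could simply be merged before the second run.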
> Pro:
>
> + The approach is spell checking application agnostic. It only
> requires that the spell checker can output a list of bad words
> (aspell and hunspell can do so).
>
> + The spell checker doesn't need to know TeX syntax. Although
> aspell and hunspell can both cope with TeX source files, they
> cannot spell-check TeX-generated text that is not explicitly in
> the source file. Additionally, commercial spell checkers likely
> do not know about TeX (such as Duden Korrektor, a spell checker
> for the German language).
>
> + You can optionally use multiple spell checkers at once.
>
> + Point'n'click people have their red wavy lines in the PDF, while
> others can still just look at the list of bad words.
>
> + The approach might work with grammar checkers as well. Don't know.
>
>
> Cons:
>
> - Red wavy lines are only marketing ...
>
>
> I have attached a small package totext (license is LPPL) trying to
> implement step 1 outlined above. To test it, add the line
>
> \usepackage{totext}
>
> to a LaTeX file and process that with LuaLaTeX. The package should work
> with other formats as well, but then users need to adapt file
> totext.sty, which consists of only 2 lines. During the TeX run, a file
> <jobname>.txt is created that should contain most of the text of the TeX
> output. The output is broken to a fixed line length, which is currently
> hard-coded to 72 characters per line (can be adjusted on line 164 of
> totext.lua). Attached is file sample2e.txt, which contains the output
> of a compile run of sample2e.tex.
>
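The fixed-width output described above can be pictured as a greedy line filler. The following Python sketch is not the code from totext.lua; wrap_words and its parameters are hypothetical names chosen for illustration.

```python
def wrap_words(words, width=72):
    """Greedily pack whole words into lines of at most `width` characters.

    A word longer than `width` is put on a line of its own and never
    split, so the spell checker always sees complete words.
    """
    lines = []
    current = ""
    for word in words:
        if not current:
            # First word on the line is always accepted.
            current = word
        elif len(current) + 1 + len(word) <= width:
            # Word plus one separating space still fits.
            current += " " + word
        else:
            # Line is full: flush it and start a new one.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines
```

Breaking only between words, never inside them, matches the goal of keeping hyphenation out of the text file.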
> The package currently hooks into the pre_linebreak_filter and
> hpack_filter callbacks. I'm not sure what the best callbacks are, but
> to avoid irritating the spell checker, words should preferably not be
> hyphenated in the text file. The red wavy lines, on the other hand,
> need to be inserted after all text is laid out on the page (perhaps in
> buildpage_filter?).
The code is now available on GitHub,
<URL:https://github.com/hennigs/spelling>.
> What doesn't work:
>
> * The package currently doesn't deal with mathematics.
See issue #8.
> * Ligatures are not resolved into their constituent letters.
I've added a code point substitution feature. The most important Latin
ligatures, like 'ﬀ', 'ﬁ' etc., are now translated into 'ff', 'fi' etc.
to help the incapable spell-checker. The translation table is currently
hard-coded. A TeX interface for fine-grained substitution control would
be nice, e.g., for switching off substitution of long s (ſ) by s.
Contributions are warmly welcome, especially those for the TeX parts.
I'd love to see issues #1 and #2 resolved soon to make a first upload to
CTAN.
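A hard-coded code point substitution table of this kind might look like the sketch below. It is written in Python rather than the package's Lua, and the table contents and the names LIGATURES, substitute and long_s are assumptions for illustration, not the package's actual table or interface.

```python
# Hypothetical table: Latin ligature code points and their constituent letters.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
}


def substitute(text, table=LIGATURES, long_s=False):
    """Replace single code points by their substitution strings.

    `long_s` stands in for the kind of switch a finer-grained interface
    could offer: only when it is set is long s (U+017F) mapped to 's'.
    """
    mapping = dict(table)
    if long_s:
        mapping["\u017F"] = "s"
    return text.translate(str.maketrans(mapping))
```

For instance, substitute("o\uFB03ce") gives "office", while a word containing long s is left alone unless long_s=True is passed.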
> * Footnote marks are missing in the text.
That works now.
> * It fails miserably on the \LaTeX logo. The package adopted the
> definition of a word from the chickenize package (start with a
> glyph node, end with a node whose id is none of 37 glyph,
> 7 disc, 11 kern, 22 ???). It seems like more nodes have to be
> considered as being possible parts of a word.
The definition of the LaTeX logo contains a \vbox. That is best
repaired by providing a definition of the logo that has no \vbox within a
word (the TeX logo does without one), see issue #12.
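The word definition quoted above can be modelled outside LuaTeX to see why a box inside a word is a problem. In this Python sketch a node list is simplified to (id, character) pairs. The ids 37, 7, 11 and 22 are the ones named in the quote (22 is kept without knowing what it stands for). The ids 10 for glue and 1 for a vlist in the usage note are assumptions following TeX's classic node numbering, and split_words is a hypothetical name.

```python
GLYPH, DISC, KERN, UNKNOWN_22 = 37, 7, 11, 22

# Node ids that may occur inside a word, as listed in the quoted definition.
WORD_IDS = {GLYPH, DISC, KERN, UNKNOWN_22}


def split_words(nodes):
    """Collect words from a simplified node list of (id, char) pairs.

    A word starts at a glyph node and ends at the first node whose id
    is not in WORD_IDS. Only glyph nodes contribute characters.
    """
    words = []
    current = None
    for node_id, char in nodes:
        if current is None:
            if node_id == GLYPH:
                current = [char]
        elif node_id in WORD_IDS:
            if node_id == GLYPH:
                current.append(char)
        else:
            words.append("".join(current))
            current = None
    if current is not None:
        words.append("".join(current))
    return words
```

With this model, split_words([(37, "L"), (1, None), (37, "X")]) returns ["L", "X"]: the vlist node ends the word early, which is the kind of split seen with the \LaTeX logo.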
I'm on the road for the rest of the week and perhaps a bit less
responsive. Oh, and did I mention that I'd be happy to hand over
maintenance of this package to someone else? Check it out!
Happy TeXing,
Stephan Hennig