[tex-hyphen] Google Books corpus

Tue Jul 12 18:41:45 CEST 2011

Aleks Kleyn wrote:

> A few words about the Google errors mentioned below. As far as I
> understand, they restored the text from scanned images. This is
> artificial intelligence, a field which evolves slowly.

While OCR in general is a hard problem, those 'typical errors' I
referred to can very well be tackled by a dictionary approach.  In the
German language a word cannot start with 'ß'.  So a word starting with
that letter has a high probability of being an erroneous match and can
automatically be fed into a dictionary-assisted recognition stage.  The
same is true for words starting with exactly two capital letters, like
'AV'.

Note, I'm only speaking of the simple cases where the rest of the word
is already spelled correctly.  The presence of such typical errors
indicates Google (so far) doesn't use a dictionary to decrease the error
rate.
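
To make the idea concrete, here is a minimal sketch in Python of the kind
of check I have in mind.  The word list, the confusion table and the
function names are all made up for illustration; a real system would load
a full German lexicon, and none of this says anything about how Google's
pipeline actually works.

```python
import re

# Hypothetical mini-lexicon; a real system would load a full German word list.
LEXICON = {"Brot", "Wahrscheinlichkeit", "Straße"}

# Hypothetical confusion table: misread prefix -> likely intended prefix.
# Only the two cases mentioned in this thread are listed.
CONFUSIONS = {"ß": "B", "AV": "W"}

# Exactly two capital letters followed by a lowercase letter, e.g. "AVahr...".
TWO_CAPS = re.compile(r"^[A-ZÄÖÜ]{2}[a-zäöüß]")


def is_suspicious(word):
    """Flag words that cannot be valid German as they stand."""
    return word.startswith("ß") or bool(TWO_CAPS.match(word))


def correct(word):
    """Return a lexicon-backed correction, or the word unchanged."""
    if not is_suspicious(word):
        return word
    for wrong, right in CONFUSIONS.items():
        if word.startswith(wrong):
            candidate = right + word[len(wrong):]
            # Accept the fix only if the rest of the word is already
            # spelled correctly, i.e. the candidate is a known word.
            if candidate in LEXICON:
                return candidate
    return word
```

With this toy lexicon, correct("ßrot") gives "Brot" and
correct("AVahrscheinlichkeit") gives "Wahrscheinlichkeit", while a
flagged word without a lexicon match is left untouched.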

Best regards,
Stephan Hennig

> -----Original Message-----
> From: tex-hyphen-bounces at tug.org [mailto:tex-hyphen-bounces at tug.org] On
> Behalf Of Stephan Hennig
> Sent: Thursday, June 30, 2011 7:25 PM
> To: About TeX hyphenation patterns, old and new.
> Subject: [tex-hyphen] Google Books corpus
> 
> Additionally, the German corpus contains lots of
> typical OCR errors like
> 
>   incorrect                correct
> 
>   ßrot                     Brot
>   AVahrscheinlichkeit      Wahrscheinlichkeit
> 
> that I would have expected to be handled better by Google.  (Well, there
> are many such typical errors, but each with a low frequency, so that
> in total they shouldn't introduce significant skew into the data.)