[tex-hyphen] Google Books corpus

Aleks Kleyn Aleks_Kleyn at MailAPS.org
Tue Jul 12 19:10:39 CEST 2011


I agree with you. However, we do not have the code behind it. We do not know
how many rules the Google team uses in its code. Also, German has one set of
rules and French has another, and a single book may contain several languages
at the same time. But we, as human beings, can fix such errors when we look at
the text.

The problem is deeper than you think. Of course you can trust Watson.
However, when you speak with another person you want to be sure that he or
she is telling you the truth.
Let me tell you a story from my past. In 1970 there was a PC in Ukraine which
was able to do analytic calculations. I used it to calculate gravitational
fields. Everything was fine. But once I got a curvature of 0 as the answer.
It was a special field, namely Friedman space, so it was natural to believe
the answer. Then I felt something was wrong and debugged the code. I
discovered that at a certain point a large expression appeared in memory and
the PC lost half of the expression. The rest was equal to 0.

The following thought came to my mind. If software develops at such a speed,
then one day the following may happen. Somebody knows that differential
equations exist but does not know how to solve them. For his business he
relies on a PC application; however, because of his lack of knowledge he does
not notice when the PC gives a wrong answer.

I think you search in German books because you know German. This helps you
to catch errors. I can also advise you to contact the Google team to help
them fix these problems. But I do not promise that this task is easy.

Aleks Kleyn
http://alekskleyn.dyndns-home.com:4080/  
http://sites.google.com/site/AleksKleyn/ 
http://arxiv.org/a/kleyn_a_1  
http://AleksKleyn.blogspot.com/
http://KleynAleks.blogspot.com/


-----Original Message-----
From: tex-hyphen-bounces at tug.org [mailto:tex-hyphen-bounces at tug.org] On
Behalf Of Stephan Hennig
Sent: Tuesday, July 12, 2011 12:42 PM
To: tex-hyphen at tug.org
Subject: Re: [tex-hyphen] Google Books corpus

schrieb Aleks Kleyn:

> A few words about the Google errors mentioned below. As far as I
> understand, they restored the text from scanned images. This is
> artificial intelligence, a field which evolves slowly.

While OCR in general is a hard problem, those 'typical errors' I
referred to can very well be tackled by a dictionary approach.  In the
German language a word cannot start with 'ß'.  So a word starting with
that letter has a high probability of being an erroneous match and can
automatically be fed into a dictionary assisted recognition stage.  The
same is true for words starting with exactly two capital letters 'AV'.

Note, I'm only speaking of the simple cases where the rest of the word
is already spelled correctly.  The presence of such typical errors
indicates Google (so far) doesn't use a dictionary to decrease the error
rate.
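
The dictionary check described above could be sketched roughly as follows.
This is only an illustration, not what Google does: the tiny word list, the
substitution table and the function names are all assumptions made up for
this example, and a real system would load a full German dictionary.

```python
import re

# Hypothetical mini-dictionary; a real system would load a full German word list.
DICTIONARY = {"Brot", "Wahrscheinlichkeit"}

# Candidate substitutions, taken from the two example errors in this thread.
SUBSTITUTIONS = [("ß", "B"), ("AV", "W")]

# 'AV' followed by a lowercase letter, i.e. exactly two leading capitals.
AV_PREFIX = re.compile(r"^AV[a-zäöüß]")


def is_suspicious(word):
    """Flag words that start with 'ß' or with exactly the two capitals 'AV'."""
    return word.startswith("ß") or bool(AV_PREFIX.match(word))


def correct(word, dictionary=DICTIONARY):
    """Return a dictionary-confirmed correction, or the word unchanged."""
    if not is_suspicious(word):
        return word
    for wrong, right in SUBSTITUTIONS:
        if word.startswith(wrong):
            candidate = right + word[len(wrong):]
            # Accept the fix only if the rest of the word is already correct,
            # i.e. the candidate is a known dictionary word.
            if candidate in dictionary:
                return candidate
    return word


if __name__ == "__main__":
    for token in ["ßrot", "AVahrscheinlichkeit", "Brot"]:
        print(token, "->", correct(token))
```

Words that are flagged but have no dictionary match are left unchanged, so
the sketch only touches the simple cases mentioned above.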

Best regards,
Stephan Hennig


> -----Original Message-----
> From: tex-hyphen-bounces at tug.org [mailto:tex-hyphen-bounces at tug.org] On
> Behalf Of Stephan Hennig
> Sent: Thursday, June 30, 2011 7:25 PM
> To: About TeX hyphenation patterns, old and new.
> Subject: [tex-hyphen] Google Books corpus
> 
> Additionally, the German corpus contains lots of
> typical OCR errors like
> 
>     incorrect                correct
> 
>   ßrot                     Brot
>   AVahrscheinlichkeit      Wahrscheinlichkeit
> 
> that I would have expected to be handled better by Google.  (Well, there
> are many of such typical errors, but with low frequencies each so that
> in total they shouldn't generate significant skew to the data.)



