[tex-hyphen] Google Books corpus

Aleks Kleyn Aleks_Kleyn at MailAPS.org
Fri Jul 1 06:04:55 CEST 2011


A few words about the Google errors mentioned below. As far as I understand,
they restored the text from scanned images. This is artificial intelligence,
a field that evolves slowly. Not everything is perfect at Google, but I think
this problem will be fixed later.

Aleks Kleyn
http://alekskleyn.dyndns-home.com:4080/  
http://sites.google.com/site/AleksKleyn/ 
http://arxiv.org/a/kleyn_a_1  
http://AleksKleyn.blogspot.com/
http://KleynAleks.blogspot.com/

-----Original Message-----
From: tex-hyphen-bounces at tug.org [mailto:tex-hyphen-bounces at tug.org] On
Behalf Of Stephan Hennig
Sent: Thursday, June 30, 2011 7:25 PM
To: About TeX hyphenation patterns, old and new.
Subject: [tex-hyphen] Google Books corpus

Hi,

I haven't seen it mentioned here before, so ...

In
<URL:http://permalink.gmane.org/gmane.science.linguistics.corpora/13159>,
Google has announced the public availability of Google Books corpora for
several languages (English, Chinese, French, German, Hebrew, Russian,
Spanish).  The corpora are two years old (2009-07-15).  The license is
Creative Commons Attribution 3.0 Unported.

The corpora contain only words that have been observed in at least 40
different books.  For each word, frequencies are given per year of
observation.  But the years have to be taken with a grain of salt: searching
for 'computer' in the German corpus with the Books Ngram Viewer
<URL:http://ngrams.googlelabs.com/> and clicking on the range '1800-1966' at
the bottom reveals a computer lexicon dated 1902, which was obviously
printed in 1992.  Additionally, the German corpus contains lots of
typical OCR errors like

  incorrect                correct

  ßrot                     Brot
  AVahrscheinlichkeit      Wahrscheinlichkeit

that I would have expected Google to handle better.  (Well, there are
many such typical errors, but each occurs with a low frequency, so in
total they shouldn't skew the data significantly.)
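As a rough illustration of how the two error types shown above could be
flagged automatically, here is a minimal Python sketch.  The two rules and
the function name suggest_correction are assumptions made for this example
only; they are not part of the corpus or of any Google tool, and they cover
nothing beyond these two patterns.

```python
import re

# Hypothetical heuristics for illustration only:
# 1. German words do not start with 'ß', so a leading 'ß' is
#    probably a misread 'B'.
# 2. A leading 'AV' directly followed by a lowercase letter is
#    probably a misread 'W'.
LEADING_ESZETT = re.compile(r"^ß")
LEADING_AV = re.compile(r"^AV(?=[a-zäöü])")


def suggest_correction(word):
    """Return a guessed correction, or None if no rule applies."""
    if LEADING_ESZETT.search(word):
        return "B" + word[1:]
    if LEADING_AV.search(word):
        return "W" + word[2:]
    return None


for token in ["ßrot", "AVahrscheinlichkeit", "Brot"]:
    print(token, "->", suggest_correction(token))
```

Rules like these only catch errors at the start of a word; a real cleanup
would need many more patterns or a check against a trusted word list.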

A few numbers for the German corpus (the only one I have looked at so far):

  * The size of the list of 1-grams is 1 GB compressed, 5 GB
    uncompressed.

  * The most frequent word is 'der' with a frequency of 1,167,791,242.

  * The list contains 24 frequency classes, class 24 being
    incomplete (the 40 books limit).

  * After consolidating the list (summing the frequencies of the
    same word over all years), there are 3.6 million words.
    The final list has a size of ca. 60 MB.

  * The oldest books in the German corpus are from 1564.
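
The consolidation step and the frequency classes can be sketched in Python
as follows.  This is only a sketch under two assumptions that are not stated
in the message above: that each line of the 1-gram list is tab-separated as
word, year, match count, page count, volume count, and that the frequency
class of a word is floor(log2(f_max / f_word)), where f_max is the frequency
of the most frequent word.  The sample lines are made up.

```python
import math
from collections import Counter


def consolidate(lines):
    """Sum the match counts of each word over all years."""
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: word, year, match_count, page_count, volume_count
        word, match_count = fields[0], int(fields[2])
        totals[word] += match_count
    return totals


def frequency_class(count, max_count):
    """Assumed definition: floor(log2(max_count / count))."""
    return int(math.floor(math.log2(max_count / count)))


# Made-up sample data, not taken from the corpus.
sample = [
    "der\t1900\t1000\t900\t50",
    "der\t1901\t1048\t950\t52",
    "Brot\t1900\t3\t3\t2",
    "Brot\t1901\t5\t5\t3",
]

totals = consolidate(sample)
top = max(totals.values())
for word, count in totals.most_common():
    print(word, count, frequency_class(count, top))
```

Under the assumed definition, a word with 40 occurrences measured against
the 1,167,791,242 occurrences of 'der' falls into class 24, which fits the
24 classes mentioned above.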

Best regards,
Stephan Hennig



