[tex-hyphen] Google Books corpus
Aleks Kleyn
Aleks_Kleyn at MailAPS.org
Fri Jul 1 06:04:55 CEST 2011
A few words about the Google errors mentioned below. As far as I understand,
they restored the text from scanned images. This is artificial intelligence, a
field which evolves slowly. Not everything is perfect at Google, but I think
this problem will be fixed later.
Aleks Kleyn
http://alekskleyn.dyndns-home.com:4080/
http://sites.google.com/site/AleksKleyn/
http://arxiv.org/a/kleyn_a_1
http://AleksKleyn.blogspot.com/
http://KleynAleks.blogspot.com/
-----Original Message-----
From: tex-hyphen-bounces at tug.org [mailto:tex-hyphen-bounces at tug.org] On
Behalf Of Stephan Hennig
Sent: Thursday, June 30, 2011 7:25 PM
To: About TeX hyphenation patterns, old and new.
Subject: [tex-hyphen] Google Books corpus
Hi,
I haven't seen it mentioned here before, so ...
In
<URL:http://permalink.gmane.org/gmane.science.linguistics.corpora/13159>,
Google has announced public availability of Google Books corpora for several
languages (English, Chinese, French, German, Hebrew, Russian, Spanish).
The corpora are two years old (2009-07-15). License is Creative
Commons Attribution 3.0 Unported.
Corpora contain only words that have been observed in at least 40
different books. For each word, frequencies are given per observed
year. But years have to be taken with a grain of salt: Searching for
'computer' in the German corpus with the Books Ngram Viewer
<URL:http://ngrams.googlelabs.com/> and clicking on the range '1800-1966' at
the bottom reveals a computer lexicon from 1902, which was obviously
printed in 1992. Additionally, the German corpus contains lots of
typical OCR errors like
  incorrect              correct
  ßrot                   Brot
  AVahrscheinlichkeit    Wahrscheinlichkeit
that I would have expected Google to handle better. (Well, there are
many such typical errors, but each has a low frequency, so in total
they shouldn't skew the data significantly.)
A few numbers for the German corpus (the only one I have looked at so far):
* The size of the list of 1-grams is 1 GB compressed, 5 GB
  uncompressed.
* Most frequent word is 'der' with a frequency of 1,167,791,242.
* The list contains 24 frequency classes, class 24 being
  incomplete (the 40 books limit).
* After consolidating the list (cumulating frequencies of the
  same words over all years), there are 3.6 million words.
  The final list has a size of ca. 60 MB.
* The oldest books in the German corpus are from 1564.
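
The consolidation step mentioned above (cumulating the frequencies of the
same word over all years) could be sketched roughly as follows. This is
only an illustration, not the script actually used for the numbers above.
It assumes the 1-gram files are tab-separated with the columns word, year,
match count, page count, volume count. The function names and the sample
data are made up, and the frequency class formula, floor(log2(f_max / f)),
is one common definition that may differ from the one behind the 24
classes quoted above.

```python
import math
from collections import Counter


def consolidate(lines):
    """Sum the match counts of each word over all years.

    Assumes tab-separated fields: word, year, match_count, ...
    (an assumption about the file layout, not taken from the message).
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        word, match_count = fields[0], fields[2]
        totals[word] += int(match_count)
    return totals


def frequency_class(count, max_count):
    """Frequency class relative to the most frequent word.

    Uses floor(log2(max_count / count)); this formula is an assumption.
    """
    return math.floor(math.log2(max_count / count))


# Hypothetical sample data, not real corpus figures.
sample = [
    "Brot\t1900\t12\t10\t5",
    "Brot\t1950\t30\t25\t9",
    "der\t1900\t1000\t400\t50",
    "der\t1950\t3000\t900\t60",
]

totals = consolidate(sample)
top_count = totals.most_common(1)[0][1]
for word, count in totals.most_common():
    print(word, count, frequency_class(count, top_count))
```

For the real files one would pass an open file object instead of the
sample list, so that the 5 GB of uncompressed data is read line by line
and only the consolidated totals are kept in memory.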
Best regards,
Stephan Hennig