Biggest freely available English corpus?

Linguistics · Asked by Tt22 on December 6, 2021

Any help on finding the biggest freely available English corpus that can be used in research?

So far I have found the OANC, with 15 million words.

7 Answers

Common Crawl crawls the web and freely provides its archives and datasets to the public. Its web archive consists of petabytes of data collected since 2011, with new crawls completed roughly every month.

It's a few billion pages (petabytes of data).

You can find versions of it that are already cleaned, de-duplicated and split by language.

https://commoncrawl.org/
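
If you want to sample the data before committing to petabytes, the text records can be streamed directly. A minimal sketch, assuming the requests and warcio packages (pip install requests warcio) and using the CC-MAIN-2021-43 crawl as an example; any crawl ID from the archive listing works:

```python
# Stream a few plain-text records from one Common Crawl WET file.
import gzip

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"
paths_url = BASE + "crawl-data/CC-MAIN-2021-43/wet.paths.gz"

# The paths file lists every WET (extracted plain text) file in the crawl.
paths = gzip.decompress(requests.get(paths_url).content).decode().splitlines()

# Stream the first WET file and print the start of a few records.
resp = requests.get(BASE + paths[0], stream=True)
for i, record in enumerate(ArchiveIterator(resp.raw)):
    if record.rec_type == "conversion":  # 'conversion' = extracted page text
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(record.rec_headers.get_header("WARC-Target-URI"))
        print(text[:200], "\n---")
    if i >= 5:
        break
resp.close()
```

Each WET file is a small slice of the crawl, so this is a cheap way to check data quality before downloading more.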

Answered by Adam Bittlingmayer on December 6, 2021

What about the 1 Billion Word Language Model Benchmark? It is freely available for download.

You might also find this Reddit thread useful for links to other corpora.
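
The benchmark's training data is plain text with one pre-tokenized sentence per line, so the tarball can be streamed without unpacking it. A minimal sketch; the shard directory name below is the one from the r13 release, so verify it against your download:

```python
# Stream sentences from the 1 Billion Word Benchmark tarball.
import tarfile

# e.g. downloaded from the benchmark's page at statmt.org
ARCHIVE = "1-billion-word-language-modeling-benchmark-r13output.tar.gz"

def sentences(archive_path):
    """Yield one pre-tokenized sentence per line from the training shards."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Shard directory name as in the r13 release (an assumption).
            if member.isfile() and "training-monolingual.tokenized.shuffled" in member.name:
                for line in tar.extractfile(member):
                    yield line.decode("utf-8").rstrip("\n")

for i, sent in enumerate(sentences(ARCHIVE)):
    print(sent)
    if i >= 2:
        break
```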

Answered by hafiz031 on December 6, 2021

The COBUILD corpus (18M tokens) is available through WebCelex, if the arcane user interface isn't a deal-breaker. It's valuable more for its extensive manual annotations than its size, with quite a lot of morphological and phonological information available.

(It's smaller than most of the others listed here, but seems worth mentioning, since it's larger than the OANC mentioned in the question and is well-annotated.)

Answered by Draconis on December 6, 2021

I found the Exquisite Corpus, and it's freely available. Details of the sources can be seen here. I don't know the exact size, but it's on the scale of billions of words.
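
For what it's worth, Exquisite Corpus is the pipeline used to build the data behind the wordfreq Python package, so the word statistics derived from it can be queried directly without touching the raw corpus. A small sketch (pip install wordfreq):

```python
# Query word statistics built from the Exquisite Corpus data.
from wordfreq import top_n_list, word_frequency, zipf_frequency

print(word_frequency("corpus", "en"))  # frequency as a proportion of tokens
print(zipf_frequency("corpus", "en"))  # the same on a log scale from 0 to ~8
print(top_n_list("en", 10))            # the ten most frequent English words
```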

Answered by ofou on December 6, 2021

Sketch Engine, a corpus manager and text analysis tool, provides a few corpora with open access for research at https://app.sketchengine.eu/#open. The largest freely available English corpus there is the ACL Anthology Reference Corpus, with 62 million words.

Alternatively, you can try the 30-day free trial of Sketch Engine and search one of the biggest English corpora currently in existence, with over 35 billion words; see https://www.sketchengine.eu/timestamped-english-corpus/
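
Sketch Engine also exposes a JSON API for account holders. The sketch below shows the general shape of a concordance query; the endpoint, parameter names, auth scheme, and corpus identifier are all assumptions based on my reading of the API documentation, so check https://www.sketchengine.eu/documentation/ before relying on any of it:

```python
# Hypothetical concordance query against the Sketch Engine JSON API.
import requests

USERNAME = "your_username"  # placeholder credentials
API_KEY = "your_api_key"

resp = requests.get(
    "https://api.sketchengine.eu/bonito/run.cgi/view",  # endpoint as I recall it
    params={
        "corpname": "preloaded/aclarc",  # hypothetical corpus identifier
        "q": 'q[lemma="language"]',      # CQL query for the lemma "language"
        "format": "json",
    },
    auth=(USERNAME, API_KEY),
)
resp.raise_for_status()
for line in resp.json().get("Lines", [])[:5]:  # concordance lines, if any
    print(line)
```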

Answered by Rodrigo on December 6, 2021

Westbury Lab provides a ~1 billion word Wikipedia dump, from 2010, of all articles with more than 1,000 words: http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html
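
The file is bz2-compressed plain text, so it can be processed as a stream without unpacking. A minimal sketch; the filename and the article delimiter are what I recall from the corpus README, so confirm both against your copy:

```python
# Stream the Westbury Lab Wikipedia corpus and count articles and tokens.
import bz2

PATH = "WestburyLab.wikicorp.201004.txt.bz2"  # filename is an assumption
DELIM = "---END.OF.DOCUMENT---"               # per-article delimiter (assumed)

articles = 0
tokens = 0
with bz2.open(PATH, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if line.strip() == DELIM:
            articles += 1
        else:
            tokens += len(line.split())

print(f"{articles} articles, {tokens} whitespace-delimited tokens")
```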

BYU has a larger dump (1.9 billion words) from 2014, but it's not available for download.

Answered by Jeremy Salwen on December 6, 2021

Can't beat the Global Web-Based English Corpus proposed by robert, but here is another big one:

A Wikipedia dump is also huge ...
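
If you go the dump route, the pages-articles file is XML inside bz2 and can be streamed with the standard library alone. A rough sketch that just counts pages and the size of the raw wikitext (the filename is a placeholder; real dumps live at https://dumps.wikimedia.org/):

```python
# Stream a Wikipedia pages-articles dump and count pages and wikitext size.
import bz2
import xml.etree.ElementTree as ET

PATH = "enwiki-latest-pages-articles.xml.bz2"  # placeholder filename

pages = 0
chars = 0
with bz2.open(PATH, "rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag.endswith("}page"):  # tag suffix match avoids pinning the XML namespace
            text = elem.find(".//{*}text")  # wildcard namespace needs Python 3.8+
            if text is not None and text.text:
                pages += 1
                chars += len(text.text)
            elem.clear()  # free the page subtree as we stream

print(f"{pages} pages, {chars} characters of raw wikitext")
```

The output is raw wikitext, so you would still need something like wikiextractor to strip markup before using it as a corpus.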

Answered by jk - Reinstate Monica on December 6, 2021
