[ Describe Training Pipelines and overview of what it means ]
#### Web crawl data
The most heavily used source is Common Crawl, an open, continually updated repository containing more than 9.5 petabytes of data (according to the Mozilla Foundation). It offers:
- Over 250 billion pages spanning 18 years.
- Free and open corpus since 2007.
- Cited in over 10,000 research papers.
- 3–5 billion new pages added each month.
The amount of data available across domains is mind-boggling: blogs, news articles, forum discussions, and business and personal websites. Sites that require a login or carry a restrictive license are excluded.
A typical website consists of HTML content that can reference media, fonts, stylesheets, scripts, and other pages. A web crawler starts at the home page, parses it, and extracts the links it contains. It then recursively downloads each of those until all reachable content has been fetched and processed.
If a page links to external sites, the crawler navigates there and starts the process over; as you might imagine, the set of pages to visit can grow exponentially. For this reason, many web crawlers enforce a depth limit and only follow links a fixed number of hops downstream before stopping.
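To make the recursion concrete, here is a minimal sketch of such a crawler using only the Python standard library. The seed URL, depth limit, and storage step are placeholders; a real crawler would also respect robots.txt, throttle its requests, and handle far more edge cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(url, depth=0, max_depth=2, seen=None):
    """Download a page, then recursively follow its links up to max_depth hops."""
    seen = set() if seen is None else seen
    if depth > max_depth or url in seen:
        return
    seen.add(url)
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except Exception:
        return  # unreachable or non-HTML content; skip it
    # ... store `html` somewhere for later processing ...
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        crawl(urljoin(url, link), depth + 1, max_depth, seen)

crawl("https://example.com/")  # hypothetical seed URL
```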
Technically, crawlers are not very complicated. Open-source tools like Crawlee, Scrapy, or MechanicalSoup make it easy to create and run your own (as long as you have space to store all the material). All a crawler needs is a starting-point (seed) URL.
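For comparison, roughly the same logic expressed with Scrapy might look like the sketch below; the spider name, seed URL, and depth limit are illustrative values rather than a recommended configuration.

```python
import scrapy

class SiteSpider(scrapy.Spider):
    name = "site_spider"                   # illustrative name
    start_urls = ["https://example.com/"]  # seed URL the crawl begins from
    custom_settings = {"DEPTH_LIMIT": 3}   # stop following links after 3 hops

    def parse(self, response):
        # Store the raw page for later processing...
        yield {"url": response.url, "html": response.text}
        # ...then follow every link found on the page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Running it with `scrapy runspider site_spider.py -o pages.jl` dumps everything it collects to a JSON-lines file.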
In the early days of the internet, websites wanting to be discovered would submit themselves to search engines to be crawled. The engines would then revisit those sites periodically to keep the search index up to date. This was a mutually beneficial arrangement: it helped users discover content and navigate to those pages.
In the era of AI, however, web crawling has largely given way to web scraping. The difference is subtle: when crawling, content is downloaded and indexed so the search engine can send traffic back to the original site; when scraping, content is ingested and used to train models. A chatbot can then answer a user's question without ever sending them back to the original source.
As you might imagine, this has caused a lot of consternation.
Web data offers great variety, everything from news articles and blogs to discussion forums, which helps models learn a wide range of topics, formats, and modes of expression. However, there is no independent verification of the content, so its veracity and fidelity are uneven: high-quality material sits alongside a long tail of noisy, duplicative, or outright false information. Crawling pipelines do apply filters (e.g. removing spam, pornography, and obvious SEO content farms), but unsavory material still finds its way into training datasets.
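The exact filters vary by pipeline and are rarely published in full, but a toy version of the idea might look like this; the thresholds and blocklist below are made-up illustrations, not values any real pipeline uses.

```python
import re

BLOCKLIST = {"viagra", "casino"}  # hypothetical keyword blocklist; real ones are far larger

def keep_document(text: str) -> bool:
    """Toy quality filter: illustrative heuristics only."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < 50:                           # too short to be useful training text
        return False
    if any(word in BLOCKLIST for word in words):  # crude keyword screen
        return False
    if len(set(words)) / len(words) < 0.3:        # highly repetitive, e.g. SEO keyword stuffing
        return False
    return True

print(keep_document("buy viagra now " * 20))  # False: repetitive and contains a blocklisted term
```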
But that’s just a fraction of what is being ingested.
#### References
These include encyclopedias like Wikipedia, dictionaries, and other open reference works across many languages. There may be some overlap between these and what is already in Common Crawl, in which case the material has to be de-duplicated before being ingested.
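A minimal sketch of exact de-duplication, assuming documents arrive as plain strings; production pipelines typically add fuzzy matching (e.g. MinHash) to catch near-duplicates as well.

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates by hashing normalized text."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedupe(["Paris is the capital of France.",
              "Paris is the capital of France.  ",
              "Berlin is the capital of Germany."]))  # the second copy is dropped
```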
#### Other sources
- Books and literature: open-source, copyright-expired content like Project Gutenberg.
- Academic papers.
- Social media and forums.
- Code.
- Human-curated or synthetic data.
Taken together, this includes a vast corpus of written matter (analog and digital), structured and unstructured data, as well as multimedia. These are analyzed, tokenized, and turned into numerical representations that the model is trained on and can then draw upon later during inference.
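As a small illustration of the tokenization step, the sketch below uses the open-source `tiktoken` library and one of its published encodings; each vendor uses its own tokenizer and vocabulary, so the exact token IDs are specific to this example.

```python
import tiktoken  # OpenAI's open-source tokenizer library

enc = tiktoken.get_encoding("cl100k_base")

text = "Web crawl data is tokenized before training."
token_ids = enc.encode(text)           # text -> list of integer token IDs
print(len(token_ids), token_ids[:5])   # the IDs depend on this encoding's vocabulary
print(enc.decode(token_ids) == text)   # True: decoding round-trips to the original string
```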
There are a large number of public datasets with which to train a model. Once those are exhausted, companies turn to the classic web-spidering techniques pioneered by early search engines to acquire more content.
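Many of these public corpora can be pulled straight from the Hugging Face Hub; the dataset named below, `allenai/c4` (a cleaned Common Crawl derivative), is just one commonly cited example, and streaming is used so nothing has to be downloaded in full.

```python
from datasets import load_dataset  # Hugging Face `datasets` library

# Stream a public web-crawl-derived corpus instead of downloading all of it.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record["url"])
    print(record["text"][:200])  # cleaned page text, plus url/timestamp metadata
    if i == 2:
        break
```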
Most AI vendors do not publicize exactly what data sources they have used for training, considering them a strategic advantage over competitors. Content owners, meanwhile, have realized they are sitting on a potential goldmine of raw material and are quickly mobilizing to monetize those archives.
The open-source material is considered low-hanging fruit, forcing sites like Wikipedia, Project Gutenberg, and the Library of Congress to create custom datasets to prevent getting hammered by AI scrapers.