Vytenis Kaubrė
The biggest challenge in training large language models (LLMs) isn’t their architecture—it’s finding high-quality, diverse, and unbiased data in a vast and noisy digital landscape. This is true whether you’re building an LLM from scratch or fine-tuning a pre-trained one, as you’ll need to use high-quality data compiled from multiple sources.
This article overviews LLM training, the need for public web data, and the major public data sources for highly performant LLMs.
Training data for LLMs consists of vast collections of text, often terabytes in size, gathered from public sources like websites, online books, research papers, and code repositories. Raw text is the most common format; it's cleaned and then tokenized into smaller units, such as subwords, using techniques like byte pair encoding. The processed data is then fed to the model's architecture to teach it to analyze and generate human-like text by identifying patterns, context, grammar, and meaning.
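To make tokenization more concrete, here's a minimal sketch that encodes a sentence with a byte pair encoding tokenizer from the tiktoken library; the library and encoding name are just one convenient choice, and other BPE implementations, such as Hugging Face tokenizers, work the same way in principle.

```python
# A minimal byte pair encoding (BPE) sketch using the tiktoken library.
import tiktoken

# Load a pre-built BPE vocabulary and encode a sentence into subword token IDs.
encoder = tiktoken.get_encoding("cl100k_base")
text = "Large language models learn patterns from tokenized text."
token_ids = encoder.encode(text)

print(token_ids)                                         # list of integer token IDs
print([encoder.decode([token]) for token in token_ids])  # the subword pieces
assert encoder.decode(token_ids) == text                 # encoding is reversible
```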
Training an LLM on custom data generally includes the following steps:
Gather large-scale textual data from diverse sources to ensure a varied data distribution.
Clean the data to remove irrelevant and incomplete information, duplicates, and inappropriate content. Then, you can normalize the text.
Tokenize the text into words, subwords, or characters to create a manageable and efficient token set for large language model training.
You may want to use a pre-trained model like GPT, BERT, or similar, which you can fine-tune with custom data. This will save you considerable time and costs.
If you have specific objectives where pre-trained models aren’t sufficient, you can create an LLM architecture from scratch. Some popular tools include PyTorch and TensorFlow. This step requires high-performance hardware, time, and a serious budget.
Pre-training: If you’re building the model from scratch, this is the phase where it learns general language patterns. The training objective often involves predicting words or tokens in a sequence, enabling the model to understand the context and structure of language.
Fine-tuning: Adjust the model to improve its performance on a specific task or domain (see the sketch after this list). This may involve different fine-tuning approaches with human involvement, such as supervised learning and reinforcement learning.
Throughout the training process, evaluate the model’s performance using metrics like perplexity, BLEU score, F1 score, accuracy, task-specific benchmarks, and others.
Optimize various hyperparameters like learning rates, batch sizes, gradient clipping, warmup steps, and others to adjust the model’s performance.
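To tie several of these steps together, here's the minimal sketch referenced above: it fine-tunes a small pre-trained causal model (GPT-2) on custom text files and reports perplexity. It assumes the Hugging Face transformers and datasets libraries are installed and that train.txt and valid.txt are plain-text files you've already gathered and cleaned; treat it as an illustration rather than a production training setup.

```python
# A minimal fine-tuning and evaluation sketch using Hugging Face Transformers.
# Assumes train.txt and valid.txt are plain-text files you prepared yourself.
import math

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # a small pre-trained causal model, handy for experiments
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the custom text files and tokenize them.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        num_train_epochs=1,             # hyperparameters like epochs, batch size,
        per_device_train_batch_size=2,  # and learning rate are tuned in practice
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()

# Perplexity is the exponential of the evaluation cross-entropy loss.
print("Perplexity:", math.exp(trainer.evaluate()["eval_loss"]))
```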
While synthetic data created by generative AI models can help train and fine-tune large language models, so-called LLM hallucinations can significantly degrade the end result. This phenomenon refers to scenarios where large language models generate fluent and grammatically correct yet false or misleading output. There are countless reasons for LLM hallucinations, and the limitations that synthetic data introduces are one of the contributing factors.
When trained on synthetically produced data, large language models may inherit the mistakes and biases found in the original data, potentially amplifying inaccuracies. Therefore, utilizing high-quality public web data for LLM training and fine-tuning is crucial. This allows models to navigate “the jungle” by improving the overall accuracy and knowledge base, offering benefits like:
Extensive knowledge: when trained on a variety of topics gathered from different web sources, LLMs can in turn produce more accurate and relevant responses across various domains.
Diverse writing styles and perspectives: the broad range of sources, data types (e.g., conversational data), and formats (text, images, etc.) enables LLMs to capture diverse linguistic patterns, nuances, and styles, and to enhance contextual understanding.
Up-to-date information: public websites constantly update their content to rank better on search engines and attract new audiences. Consequently, training large language models on such data helps keep them aligned with the most current and relevant information.
Building a custom training dataset will require an effective web scraping tool that can handle various page types and stringent anti-scraping systems. You can either develop a custom web scraping infrastructure or choose a web scraping API that’s specifically designed to extract data from difficult websites. Before going with one or the other option, you should compile a list of public data sources you want to scrape. See the list below for a general overview, followed by a minimal scraping sketch:
Any website with domain-specific content qualifies, including sites focused on science, retail, business, etc. Depending on your use case, websites for LLM training include:
Specific public content on any website, like blog posts, articles, and reviews.
Search engine results from Google, Bing, and other engines.
E-commerce data from Amazon, Google Shopping, and similar large retail sites.
Public domain sources like Project Gutenberg and similar provide a wealth of quality data, covering a diverse range of topics and writing styles found in books.
Public access community networks, forums, and social media platforms are perfect for conversational and humanistic texts. Platforms like Stack Exchange also offer deep knowledge of various topics like mathematics, physics, linguistics, programming, and others.
If you want to train your LLM on scientific data, consider sources like Google Scholar, PLOS ONE, DOAJ, PubMed Central® (PMC), and similar platforms that provide a wealth of peer-reviewed documents.
To train an LLM proficient in current international and national events, politics, and other fields, you may want to feed it with public news data gathered from Google News and similar platforms.
Wikipedia, the free online encyclopedia, hosts around 6.8 million content pages with around 4.7 billion words, covering almost any topic. While it’s not the most reliable source since anyone can edit the content, it’s still a great source for LLM training due to its well-written and multilingual text with wide topic coverage. For a well-rounded LLM, Wikipedia should be supplemented with more data from other datasets, similar to how OpenAI’s GPT-3 and Google’s BERT were trained.
If your goal is to train an LLM that can navigate different programming nuances and generate working code in different programming languages, then consider using public sources like GitHub, StackShare, Docker Hub, and Kaggle.
Public video platforms are a great source of conversational text for LLM training. In essence, you would use the available video transcripts.
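As mentioned above, here's a minimal scraping sketch for collecting text from a single public page, assuming the requests and beautifulsoup4 packages and a placeholder URL. A production setup would add proxy rotation, retries, boilerplate removal, deduplication, and compliance checks against each site's terms of service and robots.txt.

```python
# A minimal web scraping sketch; the URL below is a hypothetical placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog/some-article"  # replace with a page you may scrape
response = requests.get(url, headers={"User-Agent": "llm-data-collector/0.1"}, timeout=10)
response.raise_for_status()

# Keep only visible paragraph text; real pipelines also strip menus, ads,
# and other boilerplate before adding documents to the corpus.
soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
document = "\n".join(p for p in paragraphs if p)

with open("scraped_corpus.txt", "a", encoding="utf-8") as corpus_file:
    corpus_file.write(document + "\n")
```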
Alternatively, open-source LLM training data, found on Common Crawl, Kaggle, and similar platforms, can significantly ease the entire training process. You may also want to use cleaned web datasets gathered from public websites that are difficult to scrape. Oxylabs datasets are set up by a team of web scraping experts and can be scheduled to deliver fresh data at the frequency you need. Using a ready-to-use dataset ensures you can fully focus your resources on an LLM rather than dealing with the complexities of web scraping.
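As a quick illustration of the dataset route, the snippet below streams a few documents from one publicly available Common Crawl derivative (the allenai/c4 dataset on Hugging Face) using the datasets library; the dataset choice is only an example, and many other open corpora work the same way.

```python
# A minimal sketch of reusing an open web dataset instead of scraping it yourself.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte corpus up front.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

for index, record in enumerate(dataset):
    print(record["url"])
    print(record["text"][:200])  # first 200 characters of the document
    if index == 2:
        break
```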
If you want to learn more about the topic of LLMs and web scraping, check out these articles:
LLM training data size depends on various factors, such as model architecture and purpose. While small-scale LLMs may be trained on tens of gigabytes of data, larger models like GPT, BERT, and Llama often consume hundreds of gigabytes to several terabytes of data.
To get data for AI & LLMs, you can use open-access datasets like Common Crawl, Kaggle, and others or acquire specific datasets that contain public data scraped from difficult websites. Alternatively, you may want to scrape public data yourself using a custom web scraper or a dedicated scraping tool that’s specifically developed to ease the entire process.
Some projections, like this research paper, predict that large language models will exhaust high-quality public human-generated data by 2026-2032. There are numerous factors that contribute to the issue, like web scraping restrictions and the slow growth of new content. However, synthetic data generation, multimodal fine-tuned LLMs, and advancements in training techniques offer potential paths forward.
To prepare data for LLM training, start with raw text and clean it by removing noise and inappropriate information. Then, you can normalize the text. Organize the data in formats like JSON or CSV for metadata or structure, and then feed the textual data to the model's tokenizer to convert it into token sequences. Ensure the data is diverse, balanced, and relevant to the task. Once the data is processed, split it into separate datasets for training, validation, and testing. This will allow you to use the sets for different training phases to prevent overfitting and ensure generalization.
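Here's a minimal sketch of that preparation flow, assuming a raw_corpus.txt file with one document per line; the cleaning rules, length threshold, and 90/5/5 split are illustrative assumptions rather than fixed recommendations.

```python
# A minimal data preparation sketch: cleaning, deduplication, and splitting.
import json
import random
import re

def clean(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

with open("raw_corpus.txt", encoding="utf-8") as raw_file:
    documents = [clean(line) for line in raw_file]

# Drop very short documents and exact duplicates while preserving order.
documents = list(dict.fromkeys(doc for doc in documents if len(doc) > 50))

random.seed(42)
random.shuffle(documents)

# Split into training, validation, and test sets (here, roughly 90/5/5).
total = len(documents)
splits = {
    "train": documents[: int(total * 0.9)],
    "validation": documents[int(total * 0.9) : int(total * 0.95)],
    "test": documents[int(total * 0.95) :],
}

for split_name, split_docs in splits.items():
    with open(f"{split_name}.jsonl", "w", encoding="utf-8") as split_file:
        for doc in split_docs:
            split_file.write(json.dumps({"text": doc}) + "\n")
```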
For LLM pre-training, a dataset usually consists of large amounts of diverse, unlabeled text data from various sources. The model may learn by predicting the next word in a sequence or by predicting masked words. For example:
“A dog is an [MASK].”
The model predicts “animal”.
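You can reproduce this kind of masked-word prediction with a pre-trained model, for instance via the Hugging Face pipeline API as sketched below; the model choice is an assumption, and the exact top predictions may differ.

```python
# A minimal masked-word prediction sketch using a pre-trained BERT model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Print the model's top candidates for the masked token, with their scores.
for prediction in fill_mask("A dog is an [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```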
For LLM fine-tuning, the training data contains labeled text specific to the target task, such as input-output pairs, question-answer pairs, or any other structured interaction. During fine-tuning, the model’s parameters, known as weights, are adjusted to minimize the difference between its response and the provided answer. For example:
Question: “Are dogs animals?”
Labeled answer: “Yes, dogs are animals classified under the biological kingdom Animalia.”
LLM Answer: “Yes, dogs are animals that are classified within the biological kingdom Animalia.”
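To store labeled pairs like this for fine-tuning, you could write them out as prompt-completion records; the JSONL schema below is an assumed convention, so adapt it to whatever format your fine-tuning framework expects.

```python
# A minimal sketch for formatting labeled question-answer pairs as JSONL records.
import json

qa_pairs = [
    {
        "question": "Are dogs animals?",
        "answer": "Yes, dogs are animals classified under the biological kingdom Animalia.",
    },
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as output_file:
    for pair in qa_pairs:
        record = {
            "prompt": f"Question: {pair['question']}\nAnswer:",
            "completion": " " + pair["answer"],
        }
        output_file.write(json.dumps(record) + "\n")
```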
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.