How Is AI Trained? A Guide for AI Training


Agnė Matusevičiūtė
2025-10-28
8 min read


Training an AI model isn’t just about feeding data into an algorithm. It’s about building a pipeline that turns raw data into intelligent behavior. Accomplishing this requires scraping relevant, high-quality data, preprocessing it properly, and using the appropriate open-source tools to maintain, train, and assess your model.
In this article, you’ll learn how to extract training data from real-world sources, prepare it for ingestion by a machine learning model, and outline the essential procedures for building an artificial intelligence model from the ground up.
AI training is a structured process that includes training data gathering, preprocessing, model evaluation, and more. It’s not just about feeding data into a model.
Relevant and high-quality data is essential for training AI models successfully.
The choice of AI model and training technique should align with your goal or the problem you're trying to solve.
Preprocessing matters: it directly impacts model performance and reliability.
Transfer learning with pre-trained models is faster and more practical than training from scratch for most use cases.
Different problems need different fixes: underfitting requires more training; overfitting needs regularization or more diverse data.
Before gathering any training data or writing any code, you have to precisely define the problem you're trying to solve, decide on a suitable model and training strategy, and choose the foundational frameworks.
Here, we're focusing on a common natural language processing task – text classification, or, in other words, identifying the category to which a particular piece of information belongs. We’ll use a transformer-based model – in this case, BERT (short for Bidirectional Encoder Representations from Transformers), one of the most popular deep learning models for text classification. Lastly, we'll use supervised learning, a training strategy in which the model gains knowledge from a labeled dataset that includes input text examples and the categories that go with them.
Potential core languages and libraries you can use: Python, PyTorch, TensorFlow, Scikit-learn, Hugging Face Transformers, JAX.
When collecting web data for AI & LLMs, it's essential to use reliable sources. Check out our guide on LLM training data, which covers the 8 main public data sources where you can find high-quality and diverse public data for training models.
One of the easiest ways to gather data from the web is by using a scraping provider like Oxylabs. Since we’re collecting data to train an AI model or fine-tune it, we’ll be using Oxylabs Web Scraper API to pull real-time results from Amazon search.
We’ll be using Python to interact with the Web Scraper API. If it’s not already installed on your system, you can download the latest version from the official Python website.
Once Python is set up, create a new project folder and initialize a virtual environment. Then, create a new file called scraper.py. Next, open a terminal window and run the following command:
pip install requests
This command installs the requests library, which we'll use to make the API call; the built-in json module will handle saving the retrieved data in JSON format. To speed up data gathering, consider using AIOHTTP for asynchronous processing (a hedged sketch follows below).
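If you later need to gather many queries or pages concurrently, here's a minimal, hedged sketch of the same request done asynchronously with aiohttp. The endpoint, payload fields, and USERNAME/PASSWORD placeholders mirror the synchronous example below; the list of queries and the output filename are purely illustrative.
import asyncio
import json

import aiohttp

# Illustrative list of search queries -- adjust to your own use case.
QUERIES = ["adidas", "nike", "puma"]

async def fetch(session: aiohttp.ClientSession, query: str) -> dict:
    # Same endpoint and payload structure as the synchronous example below.
    payload = {
        "source": "amazon_search",
        "domain": "com",
        "query": query,
        "parse": True,
    }
    async with session.post(
        "https://realtime.oxylabs.io/v1/queries", json=payload
    ) as response:
        response.raise_for_status()
        return await response.json()

async def main() -> None:
    auth = aiohttp.BasicAuth("USERNAME", "PASSWORD")  # your API credentials
    async with aiohttp.ClientSession(auth=auth) as session:
        # Fire all requests concurrently and wait for every result.
        results = await asyncio.gather(*(fetch(session, q) for q in QUERIES))
    with open("amazon_data_async.json", "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    asyncio.run(main())
The synchronous walkthrough below is sufficient for small-scale collection, so let's continue with that.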
Start by importing the necessary libraries at the beginning of your Python file.
import json
import requests
Then, define your request parameters, which include:
Search query
Amazon domain
Number of pages to fetch
Other configurable parameters
Customize these settings based on your needs.
Don’t forget to set the parse parameter to True – that way, you’ll get a nicely structured JSON response to work with. For the full documentation for scraping Amazon search results, check Oxylabs Documentation.
Your payload should look something like this:
payload = {
    "source": "amazon_search",
    "domain": "com",
    "query": "adidas",
    "start_page": 1,
    "pages": 1,
    "parse": True,
}
Now, send the request to the Web Scraper API using the following code:
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)
response.raise_for_status()
Make sure to replace USERNAME and PASSWORD with your Oxylabs Web Scraper API credentials from the Oxylabs dashboard. You can claim a free trial to test the API.
After completing the request, extract the organic search results from Amazon using this line of code:
data = response.json()["results"][0]["content"]["results"]["organic"]
You can save the data by using Python’s json library to write it to a file. Here's how:
with open("amazon_data.json", "w") as f:
    json.dump(data, f, indent=2)
Run the following command to scrape publicly available Amazon search data using the Web Scraper API:
python scraper.py
If everything works correctly and no errors appear, you should find an amazon_data.json file in the same folder as your Python script. For reference, here's the complete script:
import json
import requests

payload = {
    "source": "amazon_search",
    "domain": "com",
    "query": "adidas",
    "start_page": 1,
    "pages": 1,
    "parse": True,
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)
response.raise_for_status()

data = response.json()["results"][0]["content"]["results"]["organic"]

with open("amazon_data.json", "w") as f:
    json.dump(data, f, indent=2)
Think your raw data is ready to go? Not quite. Raw scraped data is rarely ready for immediate use in AI models. Preprocessing is where you turn that messy input into something your model can actually learn from.
Preprocessing really matters, and data scientists know that no matter how much data you've got, a messy input can drag down your model's performance.
Raw data often contains leftover HTML artifacts, special characters, encoding mismatches, inconsistencies, and noise that can reduce your model’s performance. Text cleaning removes or corrects these elements to prevent confusion for your model and ensures that your dataset is accurate, relevant, and consistent.
Deduplication identifies and removes duplicate entries in your dataset. For example, the same product might appear multiple times in your scraped Amazon data if it showed up on different search pages or in various categories. Common deduplication strategies include matching by product ASIN, comparing text similarity, or using hash functions to identify exact duplicates.
Normalization organizes your data to ensure consistency and accuracy. It can do so by converting all text to lowercase, standardizing prices, dates, or measurements, unifying abbreviations, and trimming extra whitespace. A short sketch covering cleaning, deduplication, and normalization follows below.
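To make these three steps concrete, here's a minimal, hedged sketch that cleans, deduplicates, and normalizes the scraped results from amazon_data.json. It assumes each record carries fields such as title, price, and asin – adjust the field names to match the actual response structure.
import html
import json
import re

with open("amazon_data.json") as f:
    records = json.load(f)

def clean_text(text: str) -> str:
    # Decode HTML entities, strip leftover tags, and collapse whitespace.
    text = html.unescape(text or "")
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

seen_asins = set()
cleaned = []
for record in records:
    asin = record.get("asin")  # assumed field name
    if not asin or asin in seen_asins:
        continue  # skip records with missing or duplicate ASINs
    seen_asins.add(asin)
    cleaned.append({
        "asin": asin,
        "title": clean_text(record.get("title", "")).lower(),  # normalize case
        "price": float(record.get("price") or 0),  # standardize price type
    })

with open("amazon_data_clean.json", "w") as f:
    json.dump(cleaned, f, indent=2)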
Potential data preprocessing tools you can use: Pandas, Hugging Face, Apache Spark, Great Expectations, NLP: NLTK, spaCy.
The next step is labeling the data you’ve collected for your AI model. Labeling is the process of annotating your data with meaningful tags or categories that your model will learn from. There are several approaches you can take, so you have to find the right data labeling approach for your project.
| Labeling type | Description | Best for | Drawbacks |
| --- | --- | --- | --- |
| Manual | Involves having human annotators review and label each data point individually. | Small datasets, complex classification tasks, or when high accuracy is critical. | Time-consuming, expensive, and doesn't scale well for large datasets. |
| Crowdsourcing | Distributes the labeling task across a large group of workers, typically through different platforms. | Large datasets with straightforward labeling tasks that don't require specialized domain knowledge. | Quality can vary between workers, requires quality control mechanisms, and may be costly for very large datasets. |
| Heuristic | Uses rules, patterns, or weak supervision to generate labels automatically. | Initial dataset creation, augmenting manually labeled data, or when clear patterns exist in your data. | Less accurate than manual labeling, may introduce bias, and works only when reliable heuristics can be defined. |
Many successful AI projects employ a combination of these methods, utilizing heuristics for initial labeling, crowdsourcing for bulk work, and manual expert review for quality control or handling complex cases.
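To illustrate the heuristic approach, here's a minimal, hedged sketch that assigns rough category labels to product titles using simple keyword rules. It picks up the cleaned file from the earlier preprocessing sketch; the categories and keywords are purely illustrative and would need manual review in practice.
import json

# Illustrative keyword rules -- replace with rules that fit your own categories.
RULES = {
    "footwear": ["shoe", "sneaker", "sandal", "boot"],
    "apparel": ["shirt", "hoodie", "jacket", "shorts"],
    "accessories": ["bag", "cap", "sock", "backpack"],
}

def heuristic_label(title: str) -> str:
    title = title.lower()
    for label, keywords in RULES.items():
        if any(keyword in title for keyword in keywords):
            return label
    return "other"  # fall back when no rule matches

with open("amazon_data_clean.json") as f:
    records = json.load(f)

for record in records:
    record["label"] = heuristic_label(record.get("title", ""))

with open("amazon_data_labeled.json", "w") as f:
    json.dump(records, f, indent=2)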
Potential data labeling tools you can use: Label Studio, Prodigy, Doccano, CVAT (Computer Vision Annotation Tool), Amazon SageMaker Ground Truth.
Tokenization breaks words, subwords, and sentences into smaller chunks (tokens) so that the model can understand and process them effectively. For example, a product review like "Best shoes ever! Highly recommend." might be tokenized into:
["Best", "shoes", "ever", "!", "Highly", "recommend", "."]Once the data is cleaned and structured, it’s divided into the following sets:
Training set: The data your model actually learns from. It’s the most significant portion of your dataset, and it’s used to adjust the model’s parameters during training.
Validation set: The data that’s used to fine-tune the model. It prevents overfitting and guides decisions, such as when to stop training or which model version to retain.
Test set: The data used to evaluate final performance on unseen data. It should only be used once at the very end to assess your final model.
How you split the data matters. Depending on your dataset, you may choose random splitting, stratified splitting (to keep category balance), or time-based splitting (to simulate real-world deployment).
Note: Ensure no information from validation or test sets influences your training process.
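As an illustration, here's a minimal, hedged sketch of a stratified split using scikit-learn, assuming the labeled records produced in the earlier sketches. The 80/10/10 proportions are just a common starting point.
import json

from sklearn.model_selection import train_test_split

with open("amazon_data_labeled.json") as f:
    records = json.load(f)

texts = [record["title"] for record in records]
labels = [record["label"] for record in records]

# First carve out 20% for validation + test, keeping class proportions (stratify).
train_texts, rest_texts, train_labels, rest_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# Split the held-out 20% evenly into validation and test sets.
val_texts, test_texts, val_labels, test_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.5, stratify=rest_labels, random_state=42
)

print(len(train_texts), len(val_texts), len(test_texts))
Stratifying on the labels keeps each category's share roughly the same across the three sets.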
Potential final data processing tools: Pandas, Scikit-learn, Hugging Face, Apache Spark, NLP: NLTK, spaCy.
To start, you must decide if you want to train a model from scratch or use an already pre-trained model as the base. Both use cases have their pros and cons:
Using a pre-trained model as a base is much faster, as it lets you utilize transfer learning and can offer competitive results for most use cases.
Training from scratch gives you complete control over the whole process and much more flexibility to adapt the model to specific use cases.
For now, let's focus on using a pre-trained model as a base and utilize transfer learning as much as possible.
Choosing a pre-trained model is an important initial step that sets the foundation for good results. You have to weigh things like the task type, dataset size, available compute resources, and whether you prioritize speed or accuracy. Here are some considerations to help you get on the right track:
If you’re doing text classification, most likely anything from the BERT model family will be your go-to.
Is your dataset smaller than 10k examples? You should start with smaller models.
Does your domain have specific jargon? Try to find an already-specialized model such as Legal-BERT or BioBERT.
In this conceptual overview of the process, let's focus on the most commonly used example – BERT encoder with a small classifier head that maps encodings to our labels.
Next up, we need to think about our training loop. The main things to consider are:
Loss function: A standard choice for supervised text classification is cross-entropy. You can handle class imbalance with class weights, or switch to binary cross-entropy for multi-label classification if needed.
Optimizer and learning rate: AdamW is the typical optimizer choice for BERT, with a learning rate between 2e-5 and 5e-5.
Metrics: To understand how the training is going, we can use the standard accuracy, precision/recall, and F1 combination, as sketched after this list. These will let us notice things like overfitting and where exactly the model makes mistakes.
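Here's a hedged sketch of such a metrics helper built on scikit-learn; you could plug it in as the compute_metrics step referenced in the training-loop pseudo-script below. It assumes predictions and labels arrive as lists of class indices, and weighted averaging is just one reasonable default.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(predictions, labels):
    # Weighted averaging accounts for class imbalance when summarizing scores.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }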
Finally, we need to set up the training loop itself. For that, we can use Python libraries such as Hugging Face Transformers, which cover most of what you'd need for standard training. Here's how a pseudo-script could look, to give you a rough idea:
# Load or prepare labeled dataset
texts, labels = load_dataset(path_to_data)
label_map = create_label_map(labels)

# Initialize tokenizer and preprocess
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded_data = tokenizer(
    texts,
    truncation=True,
    padding="max_length",
    max_length=MAX_SEQ_LENGTH
)

# Split into train / validation / test
train_data, val_data, test_data = split_dataset(encoded_data, labels)

# Initialize model
# (BERT encoder + lightweight classification head)
model = BertModelWithClassificationHead(
    base_model="bert-base-uncased",
    num_labels=len(label_map)
)

# Define loss function, optimizer, and learning rate scheduler
loss_fn = CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = linear_warmup_scheduler(optimizer)  # e.g. get_linear_schedule_with_warmup from transformers

# Training loop
for epoch in range(NUM_EPOCHS):  # e.g. 5 epochs
    model.train()
    for batch in train_dataloader:
        inputs = batch["input_ids"]
        masks = batch["attention_mask"]
        labels = batch["labels"]

        # Forward pass
        logits = model(inputs, attention_mask=masks)

        # Compute loss
        loss = loss_fn(logits, labels)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

    # Validation step
    model.eval()
    with no_grad():
        all_preds, all_labels = [], []
        for batch in val_dataloader:
            logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
            predictions = argmax(logits)
            all_preds.extend(predictions)
            all_labels.extend(batch["labels"])

    metrics = compute_metrics(all_preds, all_labels)  # accuracy, F1, etc.
    log(epoch, loss, metrics)

# Evaluate on test set
test_metrics = evaluate(model, test_dataloader)

# Save final model, tokenizer, and label map
save_model(model, output_dir)
save_tokenizer(tokenizer, output_dir)
save_label_map(label_map, output_dir)
The script should leave us with a model checkpoint that we can deploy.
Now that we have a trained model, how do we know if it's performing well? The first step is always to test the model's performance on the validation set. This lets you see things like the confusion matrix, F1 score, and per-class accuracy, as in the sketch below. You can also inspect the classification results manually to get a grasp of where the model is failing and adjust accordingly.
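Here's a hedged sketch of that inspection step using scikit-learn, assuming you've collected true and predicted class indices from the validation loop and have an ordered list of category names:
from sklearn.metrics import classification_report, confusion_matrix

def inspect_validation_results(true_labels, predicted_labels, label_names):
    # Rows of the confusion matrix are true classes, columns are predicted classes.
    print(confusion_matrix(true_labels, predicted_labels))
    # Precision, recall, F1, and support per class, plus overall accuracy.
    print(classification_report(
        true_labels, predicted_labels, target_names=label_names, zero_division=0
    ))

# Example usage with the predictions gathered in the validation loop above:
# inspect_validation_results(all_labels, all_preds, list(label_map.keys()))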
Taking the insights from validation and manual inspection, you can iterate: adjust the model's hyperparameters, the process, or the dataset, and train again.
If you see high training and validation loss alongside poor validation performance, the model might be underfitting, so increasing the number of epochs or the learning rate might be the change you need.
If training loss keeps dropping while validation loss rises, the model is likely overfitting. In that case, you might want to increase the dropout rate, lower the learning rate, grow the dataset's size and quality, or improve class representation (see the sketch below).
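As one example of an anti-overfitting tweak, here's a hedged sketch that raises dropout in a BERT classifier and lowers the learning rate. The exact values are illustrative starting points, not recommendations, and num_labels should match your own label map.
from torch.optim import AdamW
from transformers import BertConfig, BertForSequenceClassification

# Raise dropout inside the encoder and the classifier head to regularize harder.
config = BertConfig.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=0.3,            # default is 0.1
    attention_probs_dropout_prob=0.3,   # default is 0.1
    num_labels=3,                       # match your own label count
)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", config=config)

# Pair the extra dropout with a smaller learning rate for gentler updates.
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)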
One important note: change only one thing at a time, so each experiment leaves no doubt about whether a given change improved or hurt the model.
You should continue this process until you arrive at a model that has acceptable performance.
The final step is taking the trained model and exposing it for use in your systems. An example flow: load both the tokenizer and the classifier exactly as you saved them after training, then create a function that takes raw text, runs it through tokenizer → model → softmax → label, and returns the label.
Then, you can use something like FastAPI to wrap that function in an accessible endpoint that serves the trained model, as in the hedged sketch below.
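Here's a minimal, hedged sketch of such a service. It assumes the model and tokenizer were saved in the Hugging Face save_pretrained format with a sequence classification head and that the label names were stored in the model config; the directory path and endpoint name are illustrative.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "output_dir"  # wherever you saved the model and tokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

app = FastAPI()

class ClassifyRequest(BaseModel):
    text: str

@app.post("/classify")
def classify(request: ClassifyRequest):
    # tokenizer -> model -> softmax -> label
    inputs = tokenizer(request.text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)[0]
    predicted_id = int(probabilities.argmax())
    label = model.config.id2label[predicted_id]  # assumes labels were stored in the config
    return {"label": label, "confidence": float(probabilities[predicted_id])}
You could then serve it with uvicorn (for example, uvicorn app:app if the file is named app.py) and call the /classify endpoint from your pipeline.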
Such services can then be plugged into your scraping pipeline to extract custom data, such as the sentiment of user reviews for a product listing or categories for product descriptions.
Training AI starts – and ends – with your data. Whether you're scraping, cleaning, or labeling, how you handle it makes all the difference. Pick the right tools, stay focused on your goal, and you'll be in a good place to build something smart, reliable, and scalable. For image, video, and audio generation capabilities, make sure to train AI on multimodal data.
To see how AI training concepts apply in real-world use cases, check out our other step-by-step guides and general overviews:
Training an AI model means teaching it to recognize patterns by using data. This applies whether you're building large language models or simpler classifiers. Think of it like showing examples to help it learn how things work. The more relevant and high-quality the data, the better the model will perform. There are different types of models and training approaches depending on what you're trying to achieve – like whether you want the AI to generate content, make predictions, or categorize information. Want a deeper dive? Check out our blog post on how AI training works.
AI learns by processing large amounts of data and spotting patterns in it using neural networks – mathematical algorithms that adjust and improve as they go, kind of like learning from experience. The more useful and well-organized the data, the smarter the AI becomes over time.
It’s mostly trained, not manually programmed. But how is AI trained? Instead of giving it a list of instructions, we feed it data so it can figure out the patterns on its own and improve over time. That’s what makes AI flexible and adaptable.
There are three primary AI model training methods:
Supervised learning: The model learns from labeled examples (where the answers are already known).
Unsupervised learning: The model tries to find patterns in data that hasn't been labeled.
Reinforcement learning: The model learns by trial and error, getting feedback (like rewards or penalties) based on what it does.
For a full breakdown of how AI learns and the different methods involved, check out What Is AI Training and How It Works on our blog.
About the author

Agnė Matusevičiūtė
Technical Copywriter
With a background in philology and pedagogy, Agnė focuses on making complicated tech simple.