Web Scraping With AutoGen Tutorial

You can only prompt GenAI tools like ChatGPT through a chat interface with a direct message and get a response in a browser window. It certainly covers most users’ needs, but it’s far from the most effective way to use the full potential of GenAI. In a chat-like interface, it’s difficult to complete tasks like data analysis, where you need to fit a large amount of data into a small text field.

Agent orchestration frameworks like AutoGen address the scaling issue. You can use AutoGen with Oxylabs Web Scraper API to create a fully automated tool for using models like GPT to summarize scraped data from various sources, such as shopping items from Amazon, Google SERPs, or just about any website.

This guide showcases creating a price analysis tool for Amazon products and finding the best deals for a given search query.

Introduction to AutoGen

AutoGen is one of the most popular frameworks for orchestrating multiple large language models, commonly called agents. An agent is an AI specialized for a single task. In this case, one agent analyzes Amazon product prices while the other finds the best deals.

AutoGen allows the agents to work together in teams to communicate, share results, and collaborate on common tasks. It opens new opportunities for developers to create sophisticated workflows that would be far too difficult for a single AI to manage.

In addition, agents can use tools, such as Python functions, that provide data, do calculations, or do anything else you can code. Tools supply AI agents with data to complete specific tasks.

For this use case, you can integrate scraping web data with Web Scraper API and interpret it with AI in a single tool, which generates complex analysis by providing a single search query.

Preparing the project

To start, let’s take care of the prerequisites.

You need to have Python installed on your system. You can download the latest version here.

Next, acquire access to an LLM provider like OpenAI or Azure OpenAI. For this tutorial, let’s use OpenAI’s API and the gpt-4o-mini model. Of course, pick what’s best for you.

Create a new folder for your project. Let’s call it autogen-oxylabs.

Inside the folder, create three Python files:

main.py – to store the main function for initializing the tool.
scraper.py – to store code related to getting data from Web Scraper API.
summary.py – for code AutoGen AI agents use to summarize the retrieved data.

Use this project structure to split the logic of scraping data and using it in AutoGen.

Create a virtual environment by running this command:

Copy

python -m venv venv
source venv/bin/activate

Install the required dependencies:

autogen-agentchat provides predefined agents to make launching a quick application with AutoGen simpler.
aiohttp for performing asynchronous HTTP requests.
autogen-ext[openai] for using OpenAI models.

Copy

pip install aiohttp autogen-agentchat "autogen-ext[openai]"

The environment is now ready. You can start implementing the scraper.py module.

Using Web Scraper API

Begin by writing the code to scrape the data you wish to summarize. If you were to do this manually, it would require complex parsing logic, bypassing various blocking mechanisms, and managing different proxy types. However, you can simply use Web Scraper API to do most of the work for you.

Let’s use the Amazon source to search for items on Amazon based on a provided query.

You’ll have to provide your API credentials. You can find them on your Oxylabs dashboard.

Import a few things from the previously installed aiohttp and define an AmazonScraper class with a constructor, providing the Web Scraper API endpoint URL and your credentials.

It should look like this:

Copy

from aiohttp import BasicAuth, ClientSession


class AmazonScraper:

    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

NOTE: Don’t forget to replace placeholders with your own Web Scraper API credentials.

Now, you can define an asynchronous method called get_amazon_search_data with a single parameter called query. Specify that the query should be a str and the method should return a list of dictionaries, list[dict]. It will greatly improve the readability of the code.

Copy

async def get_amazon_search_data(self, query: str) -> list[dict]:
"""Gets search data for provided query from Amazon."""

You need this method to be asynchronous due to the way AutoGen is implemented. It’s generally a good practice to use asynchronous code when performing API calls for inherently better performance.

Next, define your API payload together with the API call. With aiohttp, it can look like this:

Copy

print(f"Fetching data for query: {query}")
payload = {
    "source": "amazon_search",
    "domain": "com",
    "query": query,
    "start_page": 1,
    "pages": 1,
    "parse": True,
}
session = ClientSession()

try:
    response = await session.post(
        self._base_url,
        auth=self._auth,
        json=payload,
    )
    response.raise_for_status()
    data = await response.json()
finally:
    await session.close()

NOTE: The amazon_search query now supports filtering and sorting of search results prior to scraping – for full details on usage and options, please refer to the documentation.

Notice how we passed the self._auth parameter we defined before to the post method. This provides authentication with Web Scraper API. We also used the finally clause to ensure the session is closed, regardless of the outcome of the API call.

After the API call is complete, you can parse the response just a bit so it will be easier for AI to parse it fully on its own.

The result is the return of a list of dictionaries that represent data retrieved from Amazon:

Copy

results = data["results"][0]["content"]["results"]
return [*results.values()]

You now have a class for scraping Amazon search results. Note that this code can be expanded further by implementing other methods for scraping different sources or parsing multiple Amazon pages.

For what you have now, the full class should look like this:

Copy

from aiohttp import BasicAuth, ClientSession


class AmazonScraper:

    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

    async def get_amazon_search_data(self, query: str) -> list[dict]:
        """Gets search data for provided query from Amazon."""
        print(f"Fetching data for query: {query}")
        payload = {
            "source": "amazon_search",
            "domain": "com",
            "query": query,
            "start_page": 1,
            "pages": 1,
            "parse": True,
        }
        session = ClientSession()

        try:
            response = await session.post(
                self._base_url,
                auth=self._auth,
                json=payload,
            )
            response.raise_for_status()
            data = await response.json()
        finally:
            await session.close()

        results = data["results"][0]["content"]["results"]
        return [*results.values()]

You can test it in the main.py file by initializing and calling the get_amazon_search_data method and printing what you get.

You can use this code for implementing the main function:

Copy

import asyncio
from pprint import pprint
from scraper import AmazonScraper


async def main():
    scraper = AmazonScraper()
    pprint(scraper.get_amazon_search_data("laptop"))


if __name__ == "__main__":
    asyncio.run(main())

If you run python main.py, you should see a list of Amazon product results for laptops.

Since you now have a way to collect data, let’s start writing the AmazonDataSummarizer class to implement your first AutoGen agents.

Defining agents with AutoGen

Continue building your tool by implementing a class responsible for summarizing retrieved data from Web Scraper API you defined before.

Start by defining the class itself. Let’s name it AmazonDataSummarizer. Let’s also define a few more variables in the constructor of the class to access them in the code later on.

Ideally, the class should expect the scraper you defined before as an argument to the constructor. Such a pattern is called dependency injection, allowing you to link together different parts of the code in a structured way.

Along with the scraper, let’s import and define the OpenAI client AutoGen should use to communicate with OpenAI models. This is the part where you can use your OpenAI API key mentioned earlier in the tutorial. The code should look similar to this:

Copy

from autogen_ext.models.openai import OpenAIChatCompletionClient
from scraper import AmazonScraper

class AmazonDataSummarizer:

    def __init__(self, scraper: AmazonScraper) -> None:
        self._client = OpenAIChatCompletionClient(
            model="gpt-4o-mini",
            api_key="YOUR_API_KEY",
        )

        self._scraper = scraper

Now, let’s begin defining the AI agents. You can start giving them names by defining an Enum class. Let’s name the class AgentName.

Copy

from enum import Enum

class AgentName(str, Enum):
    """Enum for AI agent names."""

    PRICE_SUMMARIZER = "Price_Summarizer"
    DEAL_FINDER = "Deal_Finder"

This makes it easier to track your agents later on.

After that, you can start defining the actual agents. Import the AssistantAgent class from autogen_agentchat.agents and define the _initialize_agents method that would include the configuration of your AI agents. It should look like this:

Copy

def _initialize_agents(self) -> list[AssistantAgent]:
        """Initializes the agents."""
        price_summarizer_agent = AssistantAgent(
            name=AgentName.PRICE_SUMMARIZER,
            model_client=self._client,
            reflect_on_tool_use=True,
            tools=[self._scraper.get_amazon_search_data],
            system_message="You are an expert in analyzing prices from online shopping data. Summarize the key price statistics, including average, min, max, and any interesting price patterns. Share your summary with the group",
        )

        deal_finder_agent = AssistantAgent(
            name=AgentName.DEAL_FINDER,
            model_client=self._client,
            tools=[self._scraper.get_amazon_search_data],
            reflect_on_tool_use=True,
            system_message="You are a skilled deal finder in online shopping data. Find the best possible deals based on price, availability, and general value. Share your findings with the group. Respond with 'SUMMARY_COMPLETE' when you've shared your findings.",
        )

        return [price_summarizer_agent, deal_finder_agent]

Let’s break down the initialization part of the AssistantAgent class. You should define:

The agent with a name
The OpenAI client you defined in the constructor
The method for scraping Amazon data you defined before
A system message

The system message is the prompt the agent will use to instruct what it should do. You can define prompts just like you would when chatting with ChatGPT by providing clear instructions and the expected result.

Also, ask the agent to respond with SUMMARY_COMPLETE to let you know when to stop running AI agents and process the provided summary.

The reflect_on_tool_use=True flag indicates that the AI agent should use the data it receives from the function as context for its response.

This is the main part of integrating data sources like Web Scraper API and AI agents in AutoGen. It’s enough to define an async function that would return the results in the tools parameter, and the AI could pick it up. The rest is simply defined in the prompt.

Now that your agents are defined, let’s make them work together to provide a unified summary.

Using AutoGen agents with scraped data

For this part of the tutorial, let’s use a concept in AutoGen called teams. A team is a form of collaboration between AI agents that allows them to exchange information and work together on common tasks.

There are different types of teams in AutoGen, but for this tutorial, let’s use the RoundRobinGroupChat team type, which means that each agent performs the provided task sequentially, one after the other. This way, you can simply have the agents take turns performing their own summaries on the retrieved Amazon data and return their findings.

You also need to let AutoGen know when the agents finish their summaries so you can stop the team from running. This is where the SUMMARY_COMPLETE string you asked the agent to return comes into play. You can define a text termination condition in the team, which simply means that the team should stop running once an agent says a specific thing.

Here’s how it should look:

Copy

async def generate_summary(self, query: str) -> None:
        """Generates a summary using AI agents based on the given query"""
        agents = self._initialize_agents()

        text_termination = TextMentionTermination("SUMMARY_COMPLETE")
        team = RoundRobinGroupChat(
            participants=agents,
            termination_condition=text_termination,
        )

        task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
        messages = []

        async for message in team.run_stream(task=task):
            if isinstance(message, BaseChatMessage) and message.source in {
                AgentName.PRICE_SUMMARIZER,
                AgentName.DEAL_FINDER,
            }:
                messages.append(message.to_text())

Define the team with the previously defined agents as participants and pass the SUMMARY_COMPLETE termination condition.

Define the main task for each agent and pass it to the run_stream method of the team variable.

Loop through the messages received from the agents and append them into a predefined list.

Once the task is done, you should get a full summary of the prices and deals for the product you queried in Markdown format.

You can now save these results in a Markdown file by implementing a method like this:

Copy

def _write_to_md(self, messages: list[str]) -> None:
        """Writes the messages to a Markdown file."""
        with open("summary.md", "w") as f:
            for message in messages:
                f.write(f"{message}\n\n")

Let’s call it at the end of the generate_summary method to save the results. The full method code should look like this:

Copy

async def generate_summary(self, query: str) -> None:
        """Generates a summary using AI agents based on the given query""""
        agents = self._initialize_agents()

        text_termination = TextMentionTermination("SUMMARY_COMPLETE")
        team = RoundRobinGroupChat(
            participants=agents,
            termination_condition=text_termination,
        )

        task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
        messages = []

        async for message in team.run_stream(task=task):
            if isinstance(message, BaseChatMessage) and message.source in {
                AgentName.PRICE_SUMMARIZER,
                AgentName.DEAL_FINDER,
            }:
                messages.append(message.to_text())

        self._write_to_md(messages)

Running the tool

You now have the complete tool for generating summaries using AI agents with AutoGen and data from Web Scraper API.

Put it all together in the main file like this:

Copy

import asyncio
from scraper import AmazonScraper
from summary import AmazonDataSummarizer


async def main():
    scraper = AmazonScraper()
    summarizer = AmazonDataSummarizer(scraper=scraper)
    await summarizer.generate_summary(query="laptop")


if __name__ == "__main__":
    asyncio.run(main())

If you run python main.py in a terminal, after a bit of time, you should see a file called summary.md appear in your directory.

Try viewing the results in a text editor capable of previewing Markdown. It should look like this:

Previewing summary.md

The complete code

Here’s the complete code for the whole tool.

main.py

Copy

import asyncio
from scraper import AmazonScraper
from summary import AmazonDataSummarizer


async def main():
    scraper = AmazonScraper()
    summarizer = AmazonDataSummarizer(scraper=scraper)
    await summarizer.generate_summary(query="laptop")

if __name__ == "__main__":
    asyncio.run(main())

scraper.py

Copy

from aiohttp import BasicAuth, ClientSession


class AmazonScraper:

    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

    async def get_amazon_search_data(self, query: str) -> list[dict]:
        """Gets search data for provided query from Amazon."""
        print(f"Fetching data for query: {query}")
        payload = {
            "source": "amazon_search",
            "domain": "com",
            "query": query,
            "start_page": 1,
            "pages": 1,
            "parse": True,
        }
        session = ClientSession()

        try:
            response = await session.post(
                self._base_url,
                auth=self._auth,
                json=payload,
            )
            response.raise_for_status()
            data = await response.json()
        finally:
            await session.close()

        results = data["results"][0]["content"]["results"]
        return [*results.values()]

summary.py

Copy

from enum import Enum
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.messages import BaseChatMessage
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient

from scraper import AmazonScraper


class AgentName(str, Enum):
    """Enum for agent names."""

    PRICE_SUMMARIZER = "Price_Summarizer"
    DEAL_FINDER = "Deal_Finder"


class AmazonDataSummarizer:

    def __init__(self, scraper: AmazonScraper) -> None:
        self._client = OpenAIChatCompletionClient(
            model="gpt-4o-mini",
            api_key="YOUR_API_KEY",
        )

        self._scraper = scraper

    def _initialize_agents(self) -> list[AssistantAgent]:
        """Initializes the agents."""
        price_summarizer_agent = AssistantAgent(
            name=AgentName.PRICE_SUMMARIZER,
            model_client=self._client,
            reflect_on_tool_use=True,
            tools=[self._scraper.get_amazon_search_data],
            system_message="You are an expert in analyzing prices from online shopping data. Summarize the key price statistics, including average, min, max, and any interesting price patterns. Share your summary with the group",
        )

        deal_finder_agent = AssistantAgent(
            name=AgentName.DEAL_FINDER,
            model_client=self._client,
            tools=[self._scraper.get_amazon_search_data],
            reflect_on_tool_use=True,
            system_message="You are a skilled deal finder in online shopping data. Find the best possible deals based on price, availability, and general value. Share your findings with the group. Respond with 'SUMMARY_COMPLETE' when you've shared your findings.",
        )

        return [price_summarizer_agent, deal_finder_agent]

    def _write_to_md(self, messages: list[str]) -> None:
        """Writes the messages to a Markdown file."""
        with open("summary.md", "w") as f:
            for message in messages:
                f.write(f"{message}\n\n")

    async def generate_summary(self, query: str) -> None:
        """Creates a team of agents."""
        agents = self._initialize_agents()

        text_termination = TextMentionTermination("SUMMARY_COMPLETE")
        team = RoundRobinGroupChat(
            participants=agents,
            termination_condition=text_termination,
        )

        task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
        messages = []

        async for message in team.run_stream(task=task):
            if isinstance(message, BaseChatMessage) and message.source in {
                AgentName.PRICE_SUMMARIZER,
                AgentName.DEAL_FINDER,
            }:
                messages.append(message.to_text())
        self._write_to_md(messages)

Conclusion

This tutorial covers integrating a data source, Oxylabs Web Scraper API, and an AI agent framework, AutoGen. The created tool demonstrates the capabilities of combining both technologies by generating summaries based on requirements provided in human language, which can be easily configurable and expandable.

With the resources of Web Scraper API and the power of LLMs, analyzing large amounts of data is simpler than ever before.

For more similar topics, check:

If you run into any problems or need more help, our support team is available through:

Email: Contact us at support@oxylabs.io for detailed questions.
Live Chat: Use the chat feature on the right side of your screen for fast assistance

For more setup guides with the most popular frameworks, apps, and operating systems, explore our integrations.

Please be aware that this is a third-party tool not owned or controlled by Oxylabs. Each third-party provider is responsible for its own software and services. Consequently, Oxylabs will have no liability or responsibility to you regarding those services. Please carefully review the third party's policies and practices and/or conduct due diligence before accessing or using third-party services.

Frequently asked questions

AutoGen is an open-source framework developed by Microsoft for building agent-based AI applications. It enables multiple AI agents to collaborate, communicate, and solve complex tasks through automated interactions. AutoGen agents can be customized with different capabilities to handle tasks from coding and problem-solving to data analysis.

To install AutoGen version 0.2, create a virtual environment to isolate its dependencies. You can do this using tools like venv, Conda, or Poetry. Once your virtual environment is set up and activated, you can install AutoGen.

If you plan to execute code within AutoGen, it's advisable to install Docker, as it provides a consistent and isolated environment for code execution. After installing Docker, you can utilize the DockerCommandLineCodeExecutor from the autogen.coding module to manage code execution within Docker containers. Such a setup ensures that your code runs in a controlled environment.

CrewAI and AutoGen are multi-agent frameworks but differ in design philosophies and flexibility.

CrewAI emphasizes simplicity and a role-based structure where agents, called "crew members," operate under clear roles with tasks orchestrated by a central Crew. It's built for quick prototyping and clear workflows, often favoring ease of use over deep customization.

AutoGen, developed by Microsoft, offers a more modular and extensible framework. It allows detailed control over agent behaviors, communication protocols, and execution environments, making it ideal for advanced users needing fine-grained interactions or complex logic. While AutoGen requires more setup, it offers higher flexibility for building multi-agent systems.

Get the latest news from data gathering world

I'm interested

ISO/IEC 27001:2022 certified products:

Proxy Solutions

Scraper APIs

Get Web Scraper API for $49/month

Company

About us Our values Affiliate program Service partners Press area Residential Proxies sourcing Careers OxyCon®Project 4beta Sustainability Community

Proxies

Datacenter Proxies Dedicated Datacenter Proxies Residential Proxies SOCKS5 Proxies Mobile Proxies ISP Proxies Private Proxies Free Proxies

Advanced proxy solutions

Web Unblocker