Scraping Baidu Search Results with Python: A Step-by-Step Guide
Iveta Vistorskyte
Baidu is a leading search engine in China, allowing users to search for information online. The search results are displayed similarly to other search engines, with a list of websites and web pages matching the user's search query.
This blog post covers the process of scraping publicly available Baidu organic search results using Python and Oxylabs' Baidu Scraper API.
The Baidu Search Engine Results Page (SERP) consists of various elements that help users find the required information quickly. Paid search results, organic search results, and related searches might appear when entering a search query in Baidu.
Similarly to other search engines, Baidu's organic search results are listed to provide users with the most relevant and helpful information related to their search query.
When you enter a search query on Baidu, you'll see some results marked as "advertisement (广告)". Companies pay for these results to appear at the top of the search results page.
Baidu's paid results example
Baidu's related search feature helps users find additional information related to their search queries. Usually, this feature can be found at the end of the search results page.
If you've ever tried gathering public information from Baidu, you know it's not an easy task. Baidu uses various anti-scraping techniques, such as CAPTCHAs, blocking suspicious user agents and IP addresses, and employing dynamic elements that make it difficult for automated bots to access content.
Baidu's search result page is dynamic, meaning the HTML code can change frequently. This makes it hard for web scraping tools to locate and gather specific Baidu search results, so you need to constantly maintain and update your web scraper to keep collecting public information hassle-free. This is where a ready-made web intelligence tool, such as our own Web Scraper API, comes in to save time, effort, and resources.
While the legality of web scraping is a widely discussed topic, gathering publicly available data from the web, including Baidu search results, may be considered legal. Of course, there are a few rules you must follow when web scraping, such as:
A web scraping tool shouldn't log in to websites and then download data.
Although there may be fewer restrictions on collecting public data than on private information, you must still ensure you're not breaching any laws that may apply to such data, e.g., laws covering copyrighted data.
If you're considering starting web scraping, especially for the first time, it's best to get professional legal advice to make sure your public data gathering activities don't breach any laws or regulations. For additional information, you can also check our extensive article about the legality of web scraping.
Request a free trial to test our Web Scraper API.
Let's start from the basics. First, you need to create a project environment. For this, you need to install Python on your computer.
Open your terminal or command prompt, and follow the instructions:
1. Navigate to the directory where you want to create your virtual environment.
2. Run the following command to create a new virtual environment:
python -m venv env
This will create a new directory called env that contains the virtual environment. You can replace env with any name you like.
3. Activate the virtual environment by running the appropriate command for your operating system:
On macOS and Linux:
source env/bin/activate
On Windows:
env\Scripts\activate
4. You'll now see (env) at the beginning of your command prompt, indicating that you're working in the virtual environment.
5. To install packages inside the virtual environment, use the pip command as you normally would.
pip install requests
This will install the requests package inside the virtual environment without affecting your global Python installation.
6. If you need to exit the virtual environment, run the command:
deactivate
That's it! You've now created and activated a virtual environment in Python using the venv module. You can use this environment to work on your Python projects without interfering with your global Python installation.
When scraping Baidu search results using Baidu Scraper API, you can form the URLs with specific parameters to customize web requests and retrieve certain Baidu search results. You can use the URL parameters to set limits and offsets, specify search queries, and more.
Since Baidu uses different URLs for desktop and mobile devices, you must also provide correctly formed Baidu URLs for scraping. See the structure below:
Desktop devices:
https://www.baidu.<domain>/s?ie=utf-8&wd=<query>&rn=<limit>&pn=<calculated_start_page>
Mobile devices:
https://m.baidu.<domain>/s?ie=utf-8&word=<query>&rn=<limit>&pn=<calculated_start_page>
The parameter values are as follows:
Value | Description
---|---
domain | Use .com to access English content and .cn for Chinese content.
query | Represents a search keyword. Use %20 instead of space characters between words. Note that desktop URLs specify the query with the wd parameter, while mobile device URLs must use word.
limit | Specifies how many search results to show per page.
calculated_start_page | Represents how many search results have to be skipped. The value can be calculated using this formula: calculated_start_page = limit × start_page − limit. For instance, to access the 3rd page of search results with 5 results per page, the calculated start page value must be 10 (5 × 3 − 5).
With all this in mind, here are some examples of how you can form a Baidu URL for the ‘nike shoes’ search keyword, access the 5th page, and see 10 results per page:
Desktop: https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40
Mobile: https://m.baidu.com/s?ie=utf-8&word=nike%20shoes&rn=10&pn=40
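Rather than assembling these URLs by hand, you can generate them programmatically. Below is a minimal sketch; the build_baidu_url helper and its parameter names are our own illustration, not part of the API:
from urllib.parse import quote

def build_baidu_url(query, page=1, limit=10, domain='com', mobile=False):
    # Desktop and mobile subdomains use different query parameter names
    subdomain = 'm' if mobile else 'www'
    query_param = 'word' if mobile else 'wd'
    # calculated_start_page = limit × start_page − limit
    offset = limit * page - limit
    # quote() encodes spaces as %20, as Baidu expects
    encoded_query = quote(query)
    return (f'https://{subdomain}.baidu.{domain}/s'
            f'?ie=utf-8&{query_param}={encoded_query}&rn={limit}&pn={offset}')

print(build_baidu_url('nike shoes', page=5, limit=10))
# https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40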
You can find more information in our documentation. Now that you've set up your environment and have a basic idea of the URL parameters, let's walk through the step-by-step process of scraping Baidu search results using Oxylabs' Baidu Scraper API.
When you purchase our Baidu Scraper API or start a free trial, you get the unique credentials needed to gather public data from Baidu. When you have all the information, you can start the web scraping process with Python.
Install the requests library in your Python environment using the pip install requests command, and import it together with the pprint module in your Python file:
import requests
from pprint import pprint
import requests imports the Python requests library, which allows you to send HTTP requests and receive responses.
from pprint import pprint imports the pprint function from Python's pprint module. This function is used to pretty-print Python data structures such as dictionaries and lists.
The API endpoint is the URL your requests will be sent to. Define it as follows:
url = 'https://realtime.oxylabs.io/v1/queries'
You also need to obtain authorization credentials from us. Once you've received your API username and password, you can use them to make API requests. Define your authentication as follows:
auth = ('your_api_username', 'your_api_password')
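Hardcoding credentials in source files is risky, so as an alternative (our own suggestion; the environment variable names below are arbitrary examples), you can read them from environment variables:
import os

# OXYLABS_USERNAME and OXYLABS_PASSWORD are example variable names
auth = (os.environ['OXYLABS_USERNAME'], os.environ['OXYLABS_PASSWORD'])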
Create a dictionary containing the API parameters and the full Baidu URL you want to scrape. These can include parameters such as url, user_agent_type, geo_location, etc.
Here's how you can create your Python dictionary called payload, which contains the main parameters you want to pass to the Baidu search engine scraper:
payload = {
'source': 'universal',
'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
'geo_location': 'United States',
'user_agent_type': 'desktop_firefox'
}
Check our documentation for a full list of available parameters. You can also extract parsed Baidu search results by using a free Custom Parser feature.
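As a rough illustration of what a parsed request might look like, you could add parsing instructions to the payload; the exact schema is defined in our documentation, so treat the structure below as an assumption that may need adjusting:
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'parse': True,
    # Illustrative parsing instructions; check the documentation for the exact schema
    'parsing_instructions': {
        'page_title': {
            '_fns': [{'_fn': 'xpath_one', '_args': ['//title/text()']}]
        }
    }
}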
Once you've declared everything, you can pass it as a JSON object in your request body.
response = requests.post(url, json=payload, auth=auth, timeout=180)
The requests.post() method sends a POST request with the search parameters and authentication credentials to our Web Scraper API.
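If the API returns an HTTP error, calling response.json() on the error body can be confusing to debug, so it's good practice to check the status first. The raise_for_status() call below is our own defensive addition, not something the API requires:
# Raise an exception for HTTP error codes (4xx/5xx) before parsing the body
response.raise_for_status()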
The json_data variable holds the JSON-formatted response from the API, which is then pretty-printed to the console using the pprint() function:
json_data = response.json()
pprint(json_data)
Additionally, the scraped Baidu HTML file can be saved using Python’s open() function:
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(json_data['results'][0]['content'])
Here's the full code example of how to scrape Baidu search results with Python and our Baidu Scraper API:
import requests
from pprint import pprint

# Structure the payload: the target Baidu URL and request parameters
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
    'user_agent_type': 'desktop_firefox'
}

# API endpoint and your Oxylabs credentials
url = 'https://realtime.oxylabs.io/v1/queries'
auth = ('your_api_username', 'your_api_password')

# Send a POST request with the payload and credentials
response = requests.post(url, json=payload, auth=auth, timeout=180)

# Pretty-print the JSON response
json_data = response.json()
pprint(json_data)

# Save the scraped HTML content to a file (UTF-8 for Chinese characters)
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(json_data['results'][0]['content'])
The code above imports the necessary libraries, defines the search parameters for the keyword 'nike', sends the target URL and credentials in a JSON request, and waits for the response. Once the response arrives, the code prints the data in JSON format and saves the scraped HTML document.
In the output, 'status_code': 200 indicates that the query was executed successfully.
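If you prefer to extract fields from the raw HTML yourself, you can parse the saved file with a third-party library such as BeautifulSoup (pip install beautifulsoup4). Keep in mind that Baidu's markup changes often, so the selector below is an illustrative assumption that may need updating:
from bs4 import BeautifulSoup

with open('baidu.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Baidu organic results have historically placed titles in <h3> elements;
# this selector is an assumption and may break as the markup changes.
for link in soup.select('h3 a'):
    print(link.get_text(strip=True), link.get('href'))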
Gathering search results from Baidu can be challenging, but we hope this step-by-step guide helps you scrape public data from Baidu more easily. With the assistance of Baidu Scraper API, you can bypass various anti-bot measures and extract Baidu organic search results at scale.
Proxies are essential for block-free web scraping. To make your requests resemble organic traffic, you can use proxy solutions, most notably residential and datacenter IPs.
If you have any questions or want to know more about gathering public data from Baidu, contact us via email or live chat. We also offer a free trial for our Web Scraper API, so feel free to test whether this advanced web scraping solution works for you.
About the author
Iveta Vistorskyte
Lead Content Manager
Iveta Vistorskyte is a Lead Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.