Baidu is a leading search engine in China, allowing users to search for information online. The search results are displayed similarly to other search engines, with a list of websites and web pages matching the user's search query.
This blog post covers the process of scraping publicly available Baidu organic search results using Python and Oxylabs' Web Scraper API.
The Baidu Search Engine Results Page (SERP) consists of various elements that help users find the required information quickly. Paid search results, organic search results, and related searches might appear when entering a search query in Baidu.
Similarly to other search engines, Baidu's organic search results are listed to provide users with the most relevant and helpful information related to their original search query.
When you enter a search query on Baidu, you'll see some results marked as "advertisement (广告)." Companies pay for these results to appear at the top of the search results page.
Baidu's related search feature helps users find additional information related to their search queries. Usually, this feature can be found at the end of the search results page.
If you've ever tried gathering public information from Baidu, you know it's not an easy task. Baidu uses various anti-scraping techniques, such as CAPTCHAs, blocking suspicious user agents and IP addresses, and dynamic page elements that make it difficult for automated bots to access content.
Baidu's search results page is dynamic, meaning its HTML structure changes often. This makes it hard for web scraping tools to locate and gather specific Baidu SERP data, so you need to constantly maintain and update your web scraper to keep collecting public information hassle-free. This is where a ready-made web intelligence tool, such as our own Web Scraper API, comes in to save time, effort, and resources.
Although the legality of web scraping is a widely discussed topic, gathering publicly available data from the web, including Baidu search results, may be considered legal. Of course, there are a few rules you must follow when web scraping, such as:
A web scraping tool shouldn't log in to websites and then download data.
Even if there may be fewer restrictions for collecting public data than private information, you still must ensure that you're not breaching laws that may apply to such data, e.g., collecting copyrighted data.
If you're considering starting web scraping, especially for the first time, it's best to get professional legal advice to make sure your public data gathering activities don't breach any laws or regulations. For additional information, you can also check our extensive article about the legality of web scraping.
Request a free trial to test our Web Scraper API.
When you purchase our Web Scraper API or start a free trial, you get the unique credentials needed to gather public data from Baidu. When you have all the necessary information, you can start to implement the web scraping process with Python.
It can be done with these steps:
Install the requests and bs4 libraries in your Python environment using the pip install requests bs4 command, and import them together with the pprint module in your Python file.
We’ll be using requests for making HTTP requests to the Oxylabs Web Scraper API and BeautifulSoup (from the bs4 package) for parsing the retrieved HTML content.
It should look like this:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
The API endpoint URL is the URL we’ll be sending our HTTP requests to. You can define the URL as follows:
url = 'https://realtime.oxylabs.io/v1/queries'
You also need to obtain an API key or authorization credentials from us. Once you've received the key, you can use it to make API requests. Define your authentication as follows:
auth = ('your_api_username', 'your_api_password')
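Hardcoding credentials is fine for a quick test, but for anything you plan to keep around, it's safer to read them from environment variables. Here's a minimal sketch, assuming you've exported two variables named OXYLABS_USERNAME and OXYLABS_PASSWORD (these names are just an example):

import os

# Read the API credentials from environment variables instead of hardcoding them.
# OXYLABS_USERNAME and OXYLABS_PASSWORD are example names -- use whatever you set.
auth = (os.environ["OXYLABS_USERNAME"], os.environ["OXYLABS_PASSWORD"])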
Next up, create a dictionary containing the necessary API parameters and the full Baidu URL you want to scrape.
This dictionary must include these parameters:
source must be set to universal.
url – your Baidu URL.
geo_location – the location of your scraping requests.
It should look like this:
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
}
Check our documentation for a full list of available parameters.
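For instance, the payload can carry a few optional parameters on top of the required ones. The sketch below assumes the render and user_agent_type parameters behave as described in the documentation, so double-check the exact names and accepted values there before relying on them:

payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
    # Optional parameters -- verify names and values in the documentation:
    'render': 'html',             # ask the API to render JavaScript before returning HTML
    'user_agent_type': 'desktop', # request a desktop user agent
}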
Once you've declared everything, you can pass it as a JSON object in your request body.
response = requests.post(url, json=payload, auth=auth, timeout=180)
response.raise_for_status()
The requests.post() method sends a POST request with the search parameters and authentication credentials to our Web Scraper API. We also add the response.raise_for_status() line, which raises an exception if the API call returns an error status.
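If you'd rather handle a failed call gracefully instead of letting the exception stop the script, you can wrap the request in a try/except block. A minimal sketch:

try:
    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    # Covers connection errors, timeouts, and non-2xx responses alike.
    print(f"The API request failed: {e}")
    raise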
The json_data variable contains the JSON-formatted response from the API. We should also validate if we received any results from the scraper for our provided Baidu URL.
json_data = response.json()

if not json_data["results"]:
    print("No results found for the given query.")
    return
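Since we imported pprint earlier, this is also a handy point to take a quick look at the structure of the API response before parsing it. Printing only the top-level keys keeps the output readable, as the HTML content itself is large:

# Peek at the response structure without dumping the entire HTML content.
pprint(list(json_data.keys()))
pprint(list(json_data["results"][0].keys()))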
After retrieving the HTML content for Baidu search results, we can proceed to parse it into a known data format. If you're curious, check out our blog to find information on what is a data parser.
But for now, let’s implement a function called parse_baidu_html_results. The function should accept the HTML content as a string and return a list of dictionaries.
def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Parses Baidu HTML search results content into a list of dictionaries."""
    ...
Inside the function, we can create a BeautifulSoup object from the HTML content string.
We can then use the object to extract the necessary structured data we need. For this example, we’ll be parsing out the page title of the search result and the link.
Here’s what the code can look like:
def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Parses Baidu HTML search results content into a list of dictionaries."""
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")

    # Each organic result sits in a div.c-container element that has an id attribute.
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        # Try a few known title selectors, since Baidu's markup varies between results.
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue

        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})

    return parsed_results
Within the function, we use BeautifulSoup to select each Baidu search result block using the div.c-container[id] CSS selector.
We then loop through the result blocks and extract the title and URL of each result by selecting the title tag element. We use a few fallback selectors to make the code more robust in case the site structure changes in the future, and the same idea can be extended to other fields, as sketched below.
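If you also want the short description shown under each result, you could add a similar helper for the snippet text. The selectors below are guesses based on markup Baidu has used in the past, not something confirmed in this tutorial, so inspect the HTML you actually receive and adjust them as needed:

def parse_result_snippet(block) -> str:
    """Tries to extract the snippet text from a single Baidu result block.

    The class names below are assumptions and may need adjusting after
    inspecting the HTML returned for your own queries.
    """
    snippet_tag = block.select_one("div.c-abstract") or block.select_one("div.c-span-last")
    return snippet_tag.get_text(strip=True) if snippet_tag else ""

You could call this helper inside the loop and store the returned text under an extra key in each result dictionary.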
Here’s an example of what the parsed_results variable can look like:
[{'title': '-NIKE中文官方网站',
  'url': 'http://www.baidu.com/link?url=rfgnOGuIn6H54TFBQhGFXF-52oUjNUrJc8CeHdVVfYIBUjbdyQivAZgPy7WAQmXZSkpxQlFrCUY1m7fx5fPw7K'},
 {'title': '耐克Nike新品上市专区-运动鞋-连帽衫-夹克外套-NIKE中文官方网站',
  'url': 'http://www.baidu.com/link?url=TO9UR6DTuCqQE7jteADPE1VWRO3f4PI7BlF-eInYy8VJDdIgRioa37FIdkzNkTnB'},
 {'title': '男子-NIKE中文官方网站',
  'url': 'http://www.baidu.com/link?url=58Aj0GphmvO3CR4yc4c0EMcJz0SsTG3RtrAGAzMAVOc5ow8dU1A7N8y3Mq0ulW4J2tGKYGAWIyp13fVvzJ6w4CsYKzEkj3de3UjJPw6GrPLqsvveXsJ2Nl6Ru-Q4bddl'}]
After we have that, we can implement another function for storing the results in a CSV file.
For that, we should first install the pandas library by running pip install pandas. Once that’s done, we can import pandas and implement a simple function called store_to_csv like this:
import pandas as pd

def store_to_csv(data: list[dict]):
    """Stores the parsed data into a CSV file."""
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv")
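If you'd rather not add pandas as a dependency just for this step, the same result can be achieved with Python's built-in csv module. A minimal sketch:

import csv

def store_to_csv_stdlib(data: list[dict], filename: str = "baidu_results.csv"):
    """Stores the parsed data into a CSV file using only the standard library."""
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(data)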
Let’s put everything together to have a neat application for scraping HTML content from a Baidu search site and storing the results into a CSV file.
First of all, let’s create a main function, where we’ll store the main logic for our code.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint


def main():
    ...


if __name__ == "__main__":
    main()
Next, let’s move our previously written code to the created main function.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint


def main():
    payload = {
        "source": "universal",
        "url": "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50",
        "geo_location": "United States",
    }
    url = "https://realtime.oxylabs.io/v1/queries"
    auth = ("your_api_username", "your_api_password")

    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()

    json_data = response.json()
    if not json_data["results"]:
        print("No results found for the given query.")
        return


if __name__ == "__main__":
    main()
We can now combine our previously written functions into a single application.
Here’s the full code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint


def store_to_csv(data: list[dict]):
    """Stores the parsed data into a CSV file."""
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv")


def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Parses Baidu HTML search results content into a list of dictionaries."""
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")
    for block in result_blocks:
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})
    return parsed_results


def main():
    payload = {
        "source": "universal",
        "url": "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50",
        "geo_location": "United States",
    }
    url = "https://realtime.oxylabs.io/v1/queries"
    auth = ("your_api_username", "your_api_password")

    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()

    json_data = response.json()
    if not json_data["results"]:
        print("No results found for the given query.")
        return

    html_content = json_data["results"][0]["content"]
    parsed_data_list = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data_list)


if __name__ == "__main__":
    main()
The result of the code above should be a CSV file in your project directory, called baidu_results.csv.
If you open it, the output should look something like this:
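Using the example results parsed earlier, the file contents would look roughly like this (the unnamed first column is the row index that pandas writes by default):

,title,url
0,-NIKE中文官方网站,http://www.baidu.com/link?url=rfgnOGuIn6H54TFBQhGFXF-52oUjNUrJc8CeHdVVfYIBUjbdyQivAZgPy7WAQmXZSkpxQlFrCUY1m7fx5fPw7K
1,耐克Nike新品上市专区-运动鞋-连帽衫-夹克外套-NIKE中文官方网站,http://www.baidu.com/link?url=TO9UR6DTuCqQE7jteADPE1VWRO3f4PI7BlF-eInYy8VJDdIgRioa37FIdkzNkTnB
2,男子-NIKE中文官方网站,http://www.baidu.com/link?url=58Aj0GphmvO3CR4yc4c0EMcJz0SsTG3RtrAGAzMAVOc5ow8dU1A7N8y3Mq0ulW4J2tGKYGAWIyp13fVvzJ6w4CsYKzEkj3de3UjJPw6GrPLqsvveXsJ2Nl6Ru-Q4bddl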
In the previous section of this tutorial, we covered how to scrape Baidu search results with our Baidu Search API, which makes scraping websites a really simple task.
However, if you prefer to perform requests to the website yourself, our Residential Proxies are the perfect solution for that.
Residential Proxies provide the option to send requests through IP addresses that belong to physical devices, with IP addresses provided by ISPs around the world. These proxies rotate automatically, so there’s no need to worry about additional IP management.
Let’s use the code we wrote before as a starting point for using Residential Proxies to scrape Baidu search results.
First off, let’s declare our proxies variable. To get the hostname for the proxy server, go to the Residential Proxies section of your Oxylabs Dashboard and click the Endpoint generator tab.
You can select various options here, like the region the proxy is located in, the endpoint and session types, and much more.
However, for this example we’ll be using a global HTTPS proxy with authentication and sticky sessions.
The URL should look like this:
proxy_entry = "http://customer-<your_username>:<your_password>@pr.oxylabs.io:10000"
Make sure to replace the placeholders with your Residential Proxy credentials.
Now that that’s done, let’s alter the main function a bit and call the Baidu search URL directly using a GET request with the proxies attached to it.
It can look like this:
def main():
    url = "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50"
    proxy_entry = "http://customer-<your_username>:<your_password>@pr.oxylabs.io:10000"
    proxies = {
        "http": proxy_entry,
        "https": proxy_entry,
    }

    response = requests.get(url, proxies=proxies, timeout=180)
    html_content = response.text

    parsed_data_list = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data_list)
As you can see, we reused the parsing and storing functions from before to get the exact same result, but through a different source. If you run the script, you should see the same baidu_results.csv file, with the same results as with the Web Scraper API.
Here’s what the full code for scraping Baidu SERP data with Residential Proxies looks like:
import requests
import pandas as pd
from bs4 import BeautifulSoup


def store_to_csv(data: list[dict]):
    """Stores the parsed data into a CSV file."""
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv")


def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Parses Baidu HTML search results content into a list of dictionaries."""
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")
    for block in result_blocks:
        title_tag = (
            block.select_one("h3.t a")
            or block.select_one("h3.c-title-en a")
            or block.select_one("div.c-title a")
        )
        if not title_tag:
            continue
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})
    return parsed_results


def main():
    url = "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50"
    proxy_entry = "http://customer-<your_username>:<your_password>@pr.oxylabs.io:10000"
    proxies = {
        "http": proxy_entry,
        "https": proxy_entry,
    }

    response = requests.get(url, proxies=proxies, timeout=180)
    html_content = response.text

    parsed_data_list = parse_baidu_html_results(html_content)
    store_to_csv(parsed_data_list)


if __name__ == "__main__":
    main()
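When requesting Baidu directly through proxies, an individual request can still occasionally come back blocked or as a CAPTCHA page, so a simple retry loop is often worth adding. The helper below is a sketch of one possible approach rather than part of the original script; it assumes the requests import from the code above:

import time

def fetch_with_retries(url: str, proxies: dict, retries: int = 3) -> str:
    """Fetches a URL through the configured proxies, retrying failed attempts."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, proxies=proxies, timeout=180)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(2 * attempt)  # back off a little before the next attempt
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts")

You could then call fetch_with_retries(url, proxies) in main() instead of calling requests.get() directly.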
There are several ways to scrape Baidu: you can do it manually without proxies, use proxies to avoid IP bans, or rely on a scraper API for convenience and scalability.
Each method differs in complexity, speed, and resistance to blocks – manual scraping is simplest but limited, proxies offer more freedom with added setup, and APIs handle most of the heavy lifting for you. Here are all of the ways to scrape Baidu data compared:
| Criteria | Manual scraping (without proxies) | Manual scraping using proxies | Scraper APIs |
|---|---|---|---|
| Key features | Single, static IP address • Direct network requests • Local execution environment | IP rotation • Geo-targeting • Request distribution • Anti-detection measures | Maintenance-free infrastructure • CAPTCHA handling • JavaScript rendering • Automatic proxy management |
| Pros | Maximum flexibility • No additional service costs • Complete data pipeline control • Minimal latency | Improved success rate • Reduced IP blocking • Coordinate, city, state-level targeting • Anonymity | Minimal maintenance overhead • Built-in error handling • Regular updates for site layout changes • Technical support |
| Cons | High likelihood of IP blocks • Regular maintenance • Limited scaling • No geo-targeting | Additional proxy service costs • Manual proxy management • Additional setup • Increased request latency | Higher costs • Fixed customization • API-specific limitations • Dependency on provider |
| Best for | Small-scale web scraping • Unrestricted websites • Custom data extraction logic | Medium to large-scale web scraping • Restricted websites • Global targets | Enterprise-level web scraping • Complex websites with anti-bot measures • Resource-constrained teams • Quick implementation |
Gathering search results from Baidu can be challenging, but we hope this step-by-step guide makes it easier for you to scrape public data from Baidu. With the assistance of Baidu Scraper API, you can bypass various anti-bot measures and extract Baidu organic search results at scale.
A proxy server is essential for block-free web scraping. To resemble organic traffic, you can buy proxy solutions, most notably Residential Proxies and Datacenter IPs, or choose free proxies from a reliable provider.
If you have any questions or want to know more about gathering public data from Baidu, contact us via email or live chat. We also offer a free trial for our Web Scraper API, so feel free to check whether this advanced web scraping solution works for you.
About the author
Iveta Vistorskyte
Head of Content & Research
Iveta Vistorskyte is a Head of Content & Research at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.