Scraping Baidu Search Results with Python: A Step-by-Step Guide
Iveta Vistorskyte
Baidu is a leading search engine in China, allowing users to search for information online. The search results are displayed similarly to other search engines, with a list of websites and web pages matching the user's search query.
This blog post covers the process of scraping publicly available Baidu organic search results using Python and Oxylabs' Baidu Scraper API.
The Baidu Search Engine Results Page (SERP) consists of various elements that help users find the required information quickly. Paid search results, organic search results, and related searches might appear when entering a search query in Baidu.
Similarly to other search engines, Baidu's organic search results are listed to provide users with the most relevant and helpful information related to their search query.
When you enter a search query on Baidu, you'll see some results marked as "advertisement (广告)". Companies pay for these results to appear at the top of the search results page.
Baidu's paid results example
Baidu's related search feature helps users find additional information related to their search queries. Usually, this feature can be found at the end of the search results page.
If you've ever tried gathering public information from Baidu, you know it's not an easy task. Baidu uses various anti-scraping techniques, such as CAPTCHAs, blocking suspicious user agents and IP addresses, and employing dynamic elements that make it difficult for automated bots to access content.
Baidu's search result page is dynamic, meaning the HTML code can change frequently. This makes it hard for web scraping tools to locate and gather specific Baidu search results, so you need to constantly maintain and update your web scraper to keep collecting public information hassle-free. This is where a ready-made web intelligence tool, such as our own Web Scraper API, comes in to save time, effort, and resources.
While the legality of web scraping is a widely discussed topic, gathering publicly available data from the web, including Baidu search results, may be considered legal. Of course, there are a few rules you must follow when web scraping, such as:
A web scraping tool shouldn't log in to websites and then download data.
Although there may be fewer restrictions on collecting public data than on private information, you must still ensure you're not breaching any laws that may apply to such data, e.g., laws covering copyrighted data.
If you're considering starting web scraping, especially for the first time, it's best to get professional legal advice to make sure your public data gathering activities don't breach any laws or regulations. For additional information, you can also check our extensive article about the legality of web scraping.
Request a free trial to test our Web Scraper API.
Let's start from the basics. First, you need to create a project environment. For this, you need to install Python on your computer.
Open your terminal or command prompt, and follow the instructions:
1. Navigate to the directory where you want to create your virtual environment.
2. Run the following command to create a new virtual environment:
python -m venv env
This will create a new directory called env that contains the virtual environment. You can replace env with any name you like.
3. Activate the virtual environment by running the appropriate command for your operating system:
On macOS and Linux:
source env/bin/activate
On Windows:
env\Scripts\activate
4. You'll now see (env) at the beginning of your command prompt, indicating that you're working in the virtual environment.
5. To install packages inside the virtual environment, use the pip command as you normally would.
pip install requests
This will install the requests package inside the virtual environment without affecting your global Python installation.
6. If you need to exit the virtual environment, run the command:
deactivate
That's it! You've now created and activated a virtual environment in Python using the venv module. You can use this environment to work on your Python projects without interfering with your global Python installation.
When scraping Baidu search results using Baidu Scraper API, you can form the URLs with specific parameters to customize web requests and retrieve certain Baidu search results. You can use the URL parameters to set limits and offsets, specify search queries, and more.
Since Baidu uses different URLs for desktop and mobile devices, you must also provide correctly formed Baidu URLs for scraping. See the structure below:
Desktop devices:
https://www.baidu.<domain>/s?ie=utf-8&wd=<query>&rn=<limit>&pn=<calculated_start_page>
Mobile devices:
https://m.baidu.<domain>/s?ie=utf-8&word=<query>&rn=<limit>&pn=<calculated_start_page>
The parameter values are as follows:
Value | Description
---|---
domain | Use .com to access English content and .cn for Chinese content.
query | Represents a search keyword. Use %20 instead of space characters between words. Note that desktop URLs specify the query with the wd parameter, while mobile device URLs must use word.
limit | Specifies how many search results to show per page.
calculated_start_page | Represents how many search results have to be skipped. The value can be calculated using this formula: calculated_start_page = limit × start_page − limit. For instance, to access the 3rd page of search results with 5 results per page, the calculated start page value must be 10 (5 × 3 − 5).
With all this in mind, here are some examples of how you can form a Baidu URL for the ‘nike shoes’ search keyword, access the 5th page, and see 10 results per page:
Desktop: https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40
Mobile: https://m.baidu.com/s?ie=utf-8&word=nike%20shoes&rn=10&pn=40
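Rather than assembling these URLs by hand, you can generate them programmatically. Below is a minimal sketch; the build_baidu_url helper and its parameter names are our own illustration, not part of the API:
from urllib.parse import quote

def build_baidu_url(query, page=1, limit=10, domain='com', mobile=False):
    # Desktop and mobile subdomains use different query parameter names
    subdomain = 'm' if mobile else 'www'
    query_param = 'word' if mobile else 'wd'
    # calculated_start_page = limit × start_page − limit
    offset = limit * page - limit
    # quote() encodes spaces as %20, as Baidu expects
    encoded_query = quote(query)
    return (f'https://{subdomain}.baidu.{domain}/s'
            f'?ie=utf-8&{query_param}={encoded_query}&rn={limit}&pn={offset}')

print(build_baidu_url('nike shoes', page=5, limit=10))
# https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40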
You can find more information in our documentation. Now that you've set up your environment and have a basic idea of the URL parameters, let's walk through the step-by-step process of scraping Baidu search results using Oxylabs' Baidu Scraper API.
When you purchase our Baidu Scraper API or start a free trial, you get the unique credentials needed to gather public data from Baidu. When you have all the information, you can start the web scraping process with Python.
Install the requests library in your Python environment using the pip install requests command, and import it together with the pprint module in your Python file:
import requests
from pprint import pprint
import requests imports the Python requests library, which allows you to send HTTP requests and receive responses.
from pprint import pprint imports the pprint function from Python's pprint module. This function is used to pretty-print Python data structures such as dictionaries and lists.
The API endpoint is the URL your requests will be sent to. Define it as follows:
url = 'https://realtime.oxylabs.io/v1/queries'
You also need to obtain authorization credentials from us. Once you've received your API username and password, you can use them to make API requests. Define your authentication as follows:
auth = ('your_api_username', 'your_api_password')
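Hardcoding credentials in source files is risky, so as an alternative (our own suggestion; the environment variable names below are arbitrary examples), you can read them from environment variables:
import os

# OXYLABS_USERNAME and OXYLABS_PASSWORD are example variable names
auth = (os.environ['OXYLABS_USERNAME'], os.environ['OXYLABS_PASSWORD'])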
Create a dictionary containing the API parameters and the full Baidu URL you want to scrape. These can include parameters such as url, user_agent_type, geo_location, etc.
Here's how you can create your Python dictionary called payload, which contains the main parameters you want to pass to the Baidu search engine scraper:
payload = {
'source': 'universal',
'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
'geo_location': 'United States',
'user_agent_type': 'desktop_firefox'
}
Check our documentation for a full list of available parameters. You can also extract parsed Baidu search results by using a free Custom Parser feature.
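As a rough illustration of what a parsed request might look like, you could add parsing instructions to the payload; the exact schema is defined in our documentation, so treat the structure below as an assumption that may need adjusting:
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'parse': True,
    # Illustrative parsing instructions; check the documentation for the exact schema
    'parsing_instructions': {
        'page_title': {
            '_fns': [{'_fn': 'xpath_one', '_args': ['//title/text()']}]
        }
    }
}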
Once you've declared everything, you can pass it as a JSON object in your request body.
response = requests.post(url, json=payload, auth=auth, timeout=180)
The requests.post() method sends a POST request with the search parameters and authentication credentials to our Web Scraper API.
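If the API returns an HTTP error, calling response.json() on the error body can be confusing to debug, so it's good practice to check the status first. The raise_for_status() call below is our own defensive addition, not something the API requires:
# Raise an exception for HTTP error codes (4xx/5xx) before parsing the body
response.raise_for_status()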
The json_data variable holds the JSON-formatted response from the API, which is then pretty-printed to the console using the pprint() function:
json_data = response.json()
pprint(json_data)
Additionally, the scraped Baidu HTML file can be saved using Python’s open() function:
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(json_data['results'][0]['content'])
Here's the full code example of how to scrape Baidu search results with Python and our Baidu Scraper API:
import requests
from pprint import pprint

# Structure the payload: the target Baidu URL and request parameters
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
    'user_agent_type': 'desktop_firefox'
}

# API endpoint and your Oxylabs credentials
url = 'https://realtime.oxylabs.io/v1/queries'
auth = ('your_api_username', 'your_api_password')

# Send a POST request with the payload and credentials
response = requests.post(url, json=payload, auth=auth, timeout=180)

# Pretty-print the JSON response
json_data = response.json()
pprint(json_data)

# Save the scraped HTML content to a file (UTF-8 for Chinese characters)
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(json_data['results'][0]['content'])
The code above imports the necessary libraries, defines the search parameters for the keyword 'nike', sends the target URL and credentials in a JSON request, and waits for the response. Once the response arrives, the code prints the data in JSON format and saves the scraped HTML document.
In the output, 'status_code': 200 indicates that the query was executed successfully.
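If you prefer to extract fields from the raw HTML yourself, you can parse the saved file with a third-party library such as BeautifulSoup (pip install beautifulsoup4). Keep in mind that Baidu's markup changes often, so the selector below is an illustrative assumption that may need updating:
from bs4 import BeautifulSoup

with open('baidu.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Baidu organic results have historically placed titles in <h3> elements;
# this selector is an assumption and may break as the markup changes.
for link in soup.select('h3 a'):
    print(link.get_text(strip=True), link.get('href'))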
Gathering search results from Baidu can be challenging, but we hope this step-by-step guide helps you scrape public data from Baidu more easily. With the assistance of Baidu Scraper API, you can bypass various anti-bot measures and extract Baidu organic search results at scale.
Proxies are essential for block-free web scraping. To make your requests resemble organic traffic, you can use proxy solutions, most notably residential and datacenter IPs.
If you have any questions or want to know more about gathering public data from Baidu, contact us via email or live chat. We also offer a free trial for our Web Scraper API, so feel free to test whether this advanced web scraping solution works for you.
About the author
Iveta Vistorskyte
Lead Content Manager
Iveta Vistorskyte is a Lead Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.