
Scraping Baidu Search Results with Python: A Step-by-Step Guide

Iveta Vistorskyte

2023-03-10 · 5 min read

Baidu is a leading search engine in China, allowing users to search for information online. The search results are displayed similarly to other search engines, with a list of websites and web pages matching the user's search query.

This blog post covers the process of scraping publicly available Baidu organic search results using Python and Oxylabs' Baidu Scraper API.

Baidu search engine: overview

The Baidu Search Engine Results Page (SERP) consists of various elements that help users find the required information quickly. Paid search results, organic search results, and related searches might appear when entering a search query in Baidu. 

Organic search results

Similarly to other search engines, Baidu's organic search results are listed to provide users with the most relevant and helpful information related to their search query.

Baidu's organic search results example

Paid results

When you enter a search query on Baidu, you'll see some results marked as "advertise (广告)". Companies pay for these results to appear at the top of the search results.

Baidu's paid results example

Related searches

Baidu's related search feature helps users find additional information related to their search queries. Usually, this feature can be found at the end of the search results page.

Baidu's related searches example

Challenges of scraping Baidu

If you've ever tried gathering public information from Baidu, you know it's not an easy task. Baidu uses various anti-scraping techniques, such as CAPTCHAs, blocking suspicious user agents and IP addresses, and employing dynamic elements that make accessing content difficult for automated bots.

Baidu's search result page is dynamic, meaning the HTML code can often change. It makes it hard for web scraping tools to locate and gather certain Baidu search results. You need to constantly maintain and update your web scraper to get hassle-free public information. This is where a ready-made web intelligence tool, such as our own SERP Scraper API, comes in to save time, effort, and resources.

While the legality of web scraping is a widely discussed topic, gathering publicly available data from the web, including Baidu search results, may be considered legal. Of course, there are a few rules you must follow when web scraping, such as:

  1. A web scraping tool shouldn't log in to websites and then download data. 

  2. Even if there may be fewer restrictions for collecting public data than private information, you still must ensure that you're not breaching laws that may apply to such data, e.g., collecting copyrighted data. 

If you're considering starting web scraping, especially for the first time, it's best to get professional legal advice to make sure your public data gathering activities don't breach any laws or regulations. For additional information, you can also check our extensive article about the legality of web scraping.

Scraping Baidu search results

    Let's start from the basics. First, you need to create a project environment. For this, you need to install Python on your computer. 

    Steps for creating a virtual environment

    Open your terminal or command prompt, and follow the instructions: 

    1. Navigate to the directory where you want to create your virtual environment.

    2. Run the following command to create a new virtual environment:

    python -m venv env

    This will create a new directory called env that contains the virtual environment. You can replace env with any name you like.
    3. Activate the virtual environment by running the appropriate command for your operating system. On macOS and Linux:

    source env/bin/activate

    On Windows:

    env\Scripts\activate

    4. You'll now see (env) at the beginning of your command prompt, indicating that you're working in the virtual environment.

    5. To install packages inside the virtual environment, use the pip command as you normally would.

    pip install requests

    This will install the requests package inside the virtual environment without affecting your global Python installation.

    6. If you need to exit the virtual environment, run the command:

    deactivate

    That's it! You've now created and activated a virtual environment in Python using the venv module. You can use this environment to work on your Python projects without interfering with your global Python installation.
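If you ever want to double-check from within Python whether a virtual environment is active, comparing sys.prefix with sys.base_prefix is a common idiom:

```python
import sys

# Inside a virtual environment, sys.prefix points at the env directory,
# while sys.base_prefix points at the original interpreter, so they differ.
in_venv = sys.prefix != sys.base_prefix
print('Virtual environment active:', in_venv)
```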

    How to form Baidu URLs

    When scraping Baidu search results using Baidu Scraper API, you can form the URLs with specific parameters to customize web requests and retrieve certain Baidu search results. You can use the URL parameters to set limits and offsets, specify search queries, and more. 

    Since Baidu uses different URLs for desktop and mobile devices, you must also provide correctly formed Baidu URLs for scraping. See the structure below:

    1. Desktop devices:

    https://www.baidu.<domain>/s?ie=utf-8&wd=<query>&rn=<limit>&pn=<calculated_start_page>
    2. Mobile devices:

    https://m.baidu.<domain>/s?ie=utf-8&word=<query>&rn=<limit>&pn=<calculated_start_page>

    The parameter values are as follows:

    domain: Use .com to access English content and .cn for Chinese content.

    query: Represents the search keyword. Use %20 instead of space characters between words. Please note that desktop URLs specify the query with the wd parameter, while mobile URLs use word.

    limit: Specifies how many search results to show per page.

    calculated_start_page: Represents how many search results have to be skipped. The value can be calculated using this formula:

    calculated_start_page = limit × start_page − limit

    So, for instance, if you want to access the 3rd page of search results and see a total of 5 results per page, the calculated start page value is 5 × 3 − 5 = 10.
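As a quick sanity check, the formula above can be expressed as a small Python helper (the function name is our own, not part of the API):

```python
def calculated_start_page(limit: int, start_page: int) -> int:
    """Return the pn value that skips all results before start_page."""
    return limit * start_page - limit

# 3rd page, 5 results per page -> skip the first 10 results
print(calculated_start_page(limit=5, start_page=3))  # 10
```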

    With all this in mind, here are some examples of how you can form a Baidu URL for the ‘nike shoes’ search keyword, access the 5th page, and see 10 results per page:

    Desktop: https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40

    Mobile: https://m.baidu.com/s?ie=utf-8&word=nike%20shoes&rn=10&pn=40
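To avoid encoding queries by hand, the same URLs can be assembled programmatically. Here's a minimal sketch (the helper name and its defaults are our own, not part of the API):

```python
from urllib.parse import quote

def build_baidu_url(query: str, page: int = 1, limit: int = 10,
                    domain: str = 'com', mobile: bool = False) -> str:
    """Form a Baidu search URL for desktop (wd) or mobile (word) devices."""
    host = 'm.baidu' if mobile else 'www.baidu'
    param = 'word' if mobile else 'wd'
    pn = limit * page - limit  # number of results to skip
    return (f'https://{host}.{domain}/s?ie=utf-8'
            f'&{param}={quote(query)}&rn={limit}&pn={pn}')

print(build_baidu_url('nike shoes', page=5, limit=10))
# https://www.baidu.com/s?ie=utf-8&wd=nike%20shoes&rn=10&pn=40
```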

    You can find more information in our documentation. Now that you have set up your environment and have a basic idea of URL parameters, it's time to overview a step-by-step process of scraping Baidu search results using Oxylabs’ Baidu Scraper API.

    Tutorial on scraping Baidu search results with Python & Baidu Scraper API

    When you purchase our Baidu Scraper API or start a free trial, you get the unique credentials needed to gather public data from Baidu. When you have all the information, you can start the web scraping process with Python.

    1. Import necessary Python libraries

    Install the requests library in your Python environment using the pip install requests command, and import it together with the pprint module in your Python file:

    import requests
    from pprint import pprint
    • import requests imports the Python requests library, which allows you to send HTTP requests and receive responses.

    • from pprint import pprint imports the pprint function from the Python pprint module. This function pretty-prints Python data structures such as dictionaries and lists.

    2. Define API endpoint URL

    The API endpoint URL is a target URL. Define your API endpoint URL as follows:

    url = 'https://realtime.oxylabs.io/v1/queries'

    3. Fill in your authentication credentials

    You also need to obtain authorization credentials (an API username and password) from us. Once you've received them, define your authentication as follows:

    auth = ('your_api_username', 'your_api_password')

    4. Create a payload with query parameters

    Create a dictionary containing the API parameters and the full Baidu URL you want to scrape. These can include parameters such as url, user_agent_type, geo_location, etc.

    Here's how you can create your Python dictionary called payload, which contains the main parameters you want to pass to the Baidu search engine scraper:

    payload = {
       'source': 'universal',
       'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
       'geo_location': 'United States',
       'user_agent_type': 'desktop_firefox'
    }

    Check our documentation for a full list of available parameters. You can also extract parsed Baidu search results by using a free Custom Parser feature.

    5. Send a POST request

    Once you've declared everything, you can pass it as a JSON object in your request body.

    response = requests.post(url, json=payload, auth=auth, timeout=180)

    The requests.post() method sends a POST request with the search parameters and authentication credentials to our SERP Scraper API.

    6. Load and print data

    The json_data variable holds the JSON-formatted response from the API, which is then pretty-printed to the console using the pprint() function:

    json_data = response.json()
    pprint(json_data)

    Additionally, the scraped Baidu HTML file can be saved using Python’s open() function:

    with open('baidu.html', 'w') as f:
        f.write(response.json()['results'][0]['content'])
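Because a failed or blocked request can leave the results list empty, it's safer to check the response before indexing into it. Below is a minimal sketch; the extract_html helper is our own, and the JSON layout follows the structure used above:

```python
def extract_html(json_data: dict) -> str:
    """Return the raw HTML from the API's JSON response, or raise."""
    results = json_data.get('results') or []
    if not results:
        raise ValueError('No results in API response')
    return results[0].get('content', '')

# A tiny stand-in for a real API response:
sample = {'results': [{'content': '<html>...</html>'}]}
print(extract_html(sample))  # <html>...</html>
```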

    Full code sample

    Here's the full code example of how to scrape Baidu search results with Python and our Baidu Scraper API:

    import requests
    from pprint import pprint
    
    payload = {
       'source': 'universal',
       'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
       'geo_location': 'United States',
       'user_agent_type': 'desktop_firefox'
    }
    
    url = 'https://realtime.oxylabs.io/v1/queries'
    auth = ('your_api_username', 'your_api_password')
    
    response = requests.post(url, json=payload, auth=auth, timeout=180)
    
    json_data = response.json()
    pprint(json_data)
    
    with open('baidu.html', 'w') as f:
        f.write(response.json()['results'][0]['content'])

    The code above imports the necessary libraries, defines a payload targeting the keyword ‘nike’ with the chosen parameters, sends the target URL and credentials in a JSON request, and waits for the response. Once the response is loaded, the code prints the data in JSON format and saves the scraped HTML document.

    Output

    The output of the above code looks like this:

    API's response in JSON format

    The 'status_code': 200 specifies that the query was executed successfully.

    Conclusion

    Gathering search results from Baidu can be challenging, but we hope this step-by-step guide will help you scrape public data from Baidu more easily. With the assistance of Baidu Scraper API, you can bypass various anti-bot measures and extract Baidu organic search results at scale.

    If you have any questions or want to know more about gathering public data from Baidu, contact us via email or live chat. We also offer a free trial for our SERP Scraper API, so feel free to test whether this advanced web scraping solution works for you.

    About the author

    Iveta Vistorskyte

    Lead Content Manager

    Iveta Vistorskyte is a Lead Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
