Vytenis Kaubrė
In this tutorial, you’ll learn how to use SERP Scraper API (a part of Web Scraper API) to scrape Yandex search results. Before we begin, let’s briefly discuss what Yandex Search Engine Results Pages (SERPs) look like, why they are difficult to scrape, and how proxy servers can help overcome these challenges.
Like Google, Bing, or any other search engine, Yandex provides a way to search the web. Yandex SERP displays search results based on various factors, including the relevance of the content to the search query, the website's quality and authority, the user's language and location, and other personalized factors. Users can refine their search results by using filters and advanced search options.
Let's say we searched for the term “iPhone.” You should see something similar to the below:
Notice the results page has two different sections: Advertisements on top and organic search results below. The organic search results section includes web pages that are not paid for and are displayed based on their relevance to the search query, as determined by Yandex's search algorithm.
On the other hand, you can identify ads by a label such as "Sponsored" or "Advertisement." They are displayed based on the keywords used in the search query and the advertiser's bid for those keywords. The ads usually include basic details, such as the title, the price, and the link to the product on Yandex Market.
One of the key challenges of scraping Yandex is its CAPTCHA protection. See the screenshot below:
Yandex has a strict anti-bot system to prevent scrapers from extracting data programmatically from its search engine. If the CAPTCHA is triggered frequently, Yandex can block your IP address. Moreover, the anti-bot system is updated constantly, which is tough to keep up with. This makes scraping SERPs at scale complicated, and raw scripts require frequent maintenance to adapt to the changes.
Fortunately, our Web Scraper API is an excellent solution to bypass Yandex’s anti-bot system. Web Scraper API can scale on demand by using sophisticated crawling methods and rotating proxies. In the next section, we’ll explore how you can take advantage of it to scrape Yandex using Python.
Begin by downloading and installing Python from the official website. If you already have Python installed, make sure you have the latest version.
To scrape Yandex, we’ll use three Python libraries: requests, Beautiful Soup, and pandas. You can install them using Python’s package manager pip with the following command:
pip install requests pandas beautifulsoup4
The requests module enables you to interact with the API by making network requests, Beautiful Soup helps you extract specific data from the returned HTML, and pandas lets you store the results.
Let’s get to know some query parameters for a smooth start.
The main parameters you should use are:
source – use universal as a value to access our scraper;
url – provide a link to the Yandex search results page. See our documentation on how to form Yandex URLs. Additionally, check out this Yandex documentation for specific geo-location values you can use in the URL;
geo_location – sets the location of the proxy server from our pool that the request is routed through, which changes the IP address of the request;
render – enables JavaScript rendering.
Visit our documentation to find out more about parameters and their values.
Now that everything’s ready, let’s write a Python script to interact with the Yandex SERP and retrieve results for any keyword.
Start by importing the libraries that you’ve installed in the previous step:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Next, prepare a payload as shown below:
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
    'geo_location': 'Germany',
    'render': 'html'
}
Using the above payload, we’re searching Yandex for the term “what is web scraping.” The url value points to the yandex.com domain, with the search term URL-encoded in the text parameter, while geo_location routes the request through a German proxy server. The 'render': 'html' parameter enables JavaScript rendering, which helps to overcome Yandex anti-scraping measures.
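If you want to search for different keywords, the payload can be assembled from a plain search term instead of a hand-encoded URL. The following is a minimal sketch, assuming a hypothetical helper named build_payload (it is not part of the API); it uses Python's standard urllib.parse module and reuses the parameter values from the payload above.

```python
from urllib.parse import quote, urlencode


def build_payload(keyword, geo_location='Germany'):
    """Assemble a request payload for a Yandex search on `keyword`.

    Hypothetical helper for illustration: it only builds the same
    dictionary that is written out by hand in this tutorial.
    """
    # quote_via=quote encodes spaces as %20 rather than +
    query = urlencode({'text': keyword}, quote_via=quote)
    return {
        'source': 'universal',
        'url': f'https://yandex.com/search/?{query}',
        'geo_location': geo_location,
        'render': 'html',
    }


payload = build_payload('what is web scraping')
print(payload['url'])
```

The printed URL matches the hand-written one, so the resulting payload can be sent to the API in the same way.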
Next, we need to make a POST request to the Web Scraper API. To do that, use the requests library you’ve imported previously:
credentials = ('USERNAME', 'PASSWORD')
response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=credentials,
    json=payload,
    timeout=180
)
Note that we have declared a tuple named credentials. For the code to work, you’ll have to replace the USERNAME and PASSWORD with the API authentication credentials. If you don’t have them, you can sign up and get a 1-week free trial.
We use the POST method of the requests library to send the payload to the URL https://realtime.oxylabs.io/v1/queries. We also pass the authentication credentials and the payload as JSON.
Next, let’s print the result with the following line:
print(response.json())
It’ll print the entire API response. A successful Yandex scraping request will return a 200 status code, but if you encounter a different response, we recommend visiting our documentation, where we’ve detailed common response codes.
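Rather than printing the whole response, you may want to check the status code before reading the scraped HTML. Below is a minimal sketch, assuming a hypothetical helper named extract_content; it takes the status code and the parsed JSON body, and relies only on the results[0]['content'] structure used later in this tutorial.

```python
def extract_content(status_code, body):
    """Return the scraped HTML from a parsed API response.

    Hypothetical helper: raises RuntimeError when the request was not
    successful or when the response carries no results.
    """
    if status_code != 200:
        raise RuntimeError(f'Request failed with status code {status_code}')
    results = body.get('results') or []
    if not results:
        raise RuntimeError('The response contains no results')
    return results[0]['content']


# Stand-in for a parsed response body, used here only for demonstration
sample_body = {'results': [{'content': '<html></html>'}]}
print(extract_content(200, sample_body))

# With a real response object from requests, the call would be:
# content = extract_content(response.status_code, response.json())
```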
You can easily parse the scraped Yandex search results by using a library like Beautiful Soup. Check out our in-depth Beautiful Soup tutorial to learn how to use it. You can also utilize our free Custom Parser feature for hassle-free HTML parsing.
First, define an empty data list and create a BeautifulSoup instance:
data = []
content = response.json()['results'][0]['content']
soup = BeautifulSoup(content, 'html.parser')
Then, create a for loop that extracts the h2 title and its href URL for each li element in the result:
for element in soup.find_all('li'):
    # The title text sits in a span inside the h2 heading
    h2_element = element.select_one('h2 > span')
    h2 = h2_element.text if h2_element else None
    # The result link carries the organic__url class
    link_element = element.select_one('.organic__url')
    link = link_element.get('href') if link_element else None
    data.append({'Title': h2, 'Link': link})

print(data)
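Note that the loop appends a row for every li element on the page, so data can contain entries where the title or the link is None. If you prefer to keep only complete results, a small filter can be applied to the list. This is a sketch with a hypothetical helper name, drop_empty_rows, and made-up sample rows used purely for demonstration:

```python
def drop_empty_rows(rows):
    """Keep only the rows that have both a title and a link."""
    return [row for row in rows if row.get('Title') and row.get('Link')]


# Made-up rows in the same shape as the parsed results
sample = [
    {'Title': 'What is web scraping?', 'Link': 'https://example.com/a'},
    {'Title': None, 'Link': None},
    {'Title': 'Title without a link', 'Link': None},
]
print(drop_empty_rows(sample))
```

In the script, this would be applied as data = drop_empty_rows(data) before the results are stored.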
To export the data into a CSV or JSON format, you must first create a data frame:
df = pd.DataFrame(data)
This line passes the data list of parsed results to pandas, which turns it into a data frame. Now, you can export the data frame to CSV as shown below:
df.index += 1
df.to_csv('yandex_results.csv', index=True)
Similarly, you can export the results into JSON using the following code:
df.to_json('yandex_results.json', orient='records', indent=4)
Once you execute the code, the script will create two new files in the current directory with the scraped results.
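If you would rather not depend on pandas for the export step, roughly the same files could be written with Python's standard csv and json modules. This is an alternative sketch, not the method used in this tutorial: export_rows is a hypothetical helper, and unlike the pandas version with index=True, it does not add an index column to the CSV.

```python
import csv
import json


def export_rows(rows, csv_path, json_path):
    """Write rows (dicts with 'Title' and 'Link' keys) to CSV and JSON files."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['Title', 'Link'])
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, 'w', encoding='utf-8') as f:
        # A list of objects, similar to pandas' orient='records'
        json.dump(rows, f, indent=4, ensure_ascii=False)


# Example call with the parsed results:
# export_rows(data, 'yandex_results.csv', 'yandex_results.json')
```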
Here is the complete script:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Request parameters for the scraper
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
    'geo_location': 'Germany',
    'render': 'html'
}

# Replace with your API authentication credentials
credentials = ('USERNAME', 'PASSWORD')

response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=credentials,
    json=payload,
    timeout=180
)
print(response.json())

# Parse the returned HTML
data = []
content = response.json()['results'][0]['content']
soup = BeautifulSoup(content, 'html.parser')

for element in soup.find_all('li'):
    h2_element = element.select_one('h2 > span')
    h2 = h2_element.text if h2_element else None
    link_element = element.select_one('.organic__url')
    link = link_element.get('href') if link_element else None
    data.append({'Title': h2, 'Link': link})

# Export the results to CSV and JSON
df = pd.DataFrame(data)
df.index += 1
df.to_csv('yandex_results.csv', index=True)
df.to_json('yandex_results.json', orient='records', indent=4)
Running the above code will output a CSV file that should look similar to this:
Scraping Yandex SERPs is challenging, but by following the steps outlined in this article and using the provided Python code, you can scrape Yandex search results for any chosen keyword and export the data into a CSV or JSON file. With the help of Web Scraper API and proxies, you can bypass Yandex's anti-bot measures and scrape SERPs at scale. If you need even more robust solutions, you can buy proxy services to further enhance your scraping efficiency.
If you require assistance or want to know more, feel free to contact us via email or live chat.
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.