In this tutorial, you’ll learn how to build a custom Yandex scraper with proxies and use Web Scraper API to scrape Yandex search results. Before we begin, let’s briefly discuss what Yandex Search Engine Results Pages (SERPs) look like, why they’re difficult to scrape, and how proxy servers can help overcome these challenges.
Like Google, Bing, or any other search engine, Yandex provides a way to search the web. Yandex SERP displays search results based on various factors, including the relevance of the content to the search query, the website's quality and authority, the user's language and location, and other personalized factors. Users can refine their search results by using filters and advanced search options.
Let's say we searched for the term “iPhone.” You should see something similar to the below:
Notice the results page has two different sections: Advertisements on top and organic search results below. The organic search results section includes web pages that are not paid for and are displayed based on their relevance to the search query, as determined by Yandex's search algorithm.
On the other hand, you can identify ads by a label, such as "Sponsored" or "Advertisement." They are displayed based on the keywords used in the search query and the advertiser's bid for those keywords. The ads usually include basic details, such as the title, the price, and a link to the product on Yandex Market.
One of the key challenges of scraping Yandex is its CAPTCHA protection.
Yandex has a strict anti-bot system that prevents scrapers from programmatically extracting data from its search pages. It can block your IP address if the CAPTCHA is triggered frequently. Moreover, the anti-bot system is constantly updated, which is tough to keep up with. This makes scraping SERPs at scale complicated, and raw scripts require frequent maintenance to adapt to the changes.
Fortunately, our Web Scraper API is an excellent solution to bypass Yandex’s anti-bot system. Web Scraper API can scale on demand by using sophisticated crawling methods and rotating proxy solutions. In the next sections, we’ll explore how you can take advantage of it to scrape Yandex search engine results using Python.
Get a free trial to test our Web Scraper API.
Begin by downloading and installing Python from the official website. If you already have Python installed, make sure you have the latest version.
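You can quickly confirm which version is active from your terminal:

python --version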
To scrape Yandex, we’ll use three Python libraries: requests, Beautiful Soup, and pandas. You can install them using Python’s package manager pip with the following command:
pip install requests pandas beautifulsoup4
The requests module will enable you to make network requests, the Beautiful Soup library will help you extract specific data, and pandas will let you store the results in a CSV file.
In this section, you’ll learn how to scrape Yandex search data by building a simple scraper that utilizes Residential Proxies to overcome CAPTCHAs and IP blocks.
In a new Python file, import the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Next, create a proxies dictionary that we’ll use to route requests through:
USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'
proxies = {
    'http': f'https://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777'
}
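Before moving on, you can verify that the proxy connection works by requesting Oxylabs’ IP checker endpoint (assuming ip.oxylabs.io/location is available on your plan). The printed IP and location should belong to the proxy, not to your own machine:

# Sanity check: the response should show the proxy's IP, not yours.
test = requests.get('https://ip.oxylabs.io/location', proxies=proxies, timeout=10)
print(test.text)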
It’s essential to make HTTP requests look like they’re coming from a real web browser. So, let’s create a basic HTTP headers dictionary:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) '
                  'Gecko/20100101 Firefox/137.0',
    'Accept': 'text/html,application/xhtml+xml,'
              'application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8',
    'Connection': 'keep-alive'
}
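To blend in further, you can rotate the User-Agent on each request instead of reusing a single string. Here’s a minimal sketch, using a small hand-picked pool of browser strings (examples only — keep your own pool larger and up to date):

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) '
    'Gecko/20100101 Firefox/137.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.4 Safari/605.1.15',
]

# Overwrite the default User-Agent with a random pick before each request.
headers['User-Agent'] = random.choice(USER_AGENTS)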
Send a GET request to your desired Yandex SERP URL and make sure to use the proxies and headers dictionaries:
response = requests.get(
    'https://yandex.com/search/?text=what%20is%20web%20scraping',
    proxies=proxies,
    headers=headers
)
response.raise_for_status()
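Requests can fail intermittently when scraping at scale, so it’s worth wrapping the call in simple retry logic. Here’s a minimal sketch with exponential backoff (the fetch helper is our own addition, not part of any library):

import time

def fetch(url, retries=3):
    # Retry failed requests, waiting 1s, 2s, 4s between attempts.
    for attempt in range(retries):
        try:
            resp = requests.get(url, proxies=proxies, headers=headers, timeout=15)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)

response = fetch('https://yandex.com/search/?text=what%20is%20web%20scraping')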
After getting a response back, use the BeautifulSoup class to parse the raw HTML document:
soup = BeautifulSoup(response.text, 'html.parser')
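A quick sanity check helps at this point: when Yandex suspects automation, it typically redirects to a CAPTCHA page whose URL contains "showcaptcha" — treat this as an assumption worth verifying for your region — so you can detect it before parsing:

# If Yandex served a CAPTCHA instead of results, stop early and rotate the proxy.
if 'showcaptcha' in response.url:
    raise RuntimeError('Yandex returned a CAPTCHA page - retry with a new IP.')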
After that, you can start iterating through each search result card and extract the required data for your needs. For instance, a great starting point is to retrieve the titles and links:
data = []
for listing in soup.select('li.serp-item_card'):
    title_el = listing.select_one('h2 > span')
    title = title_el.text if title_el else None
    link_el = listing.select_one('.organic__url')
    link = link_el.get('href') if link_el else None
    data.append({'Title': title, 'Link': link})
For each listing, the code extracts the title and link, falling back to None when an element isn’t found, and appends the result to the data list.
It’s time to use the pandas library to store the scraped data in a file. You may save the data to any format that’s useful to you, but for this tutorial, let’s stick to CSV:
df = pd.DataFrame(data)
df.to_csv('yandex_results.csv', index=False)
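If you’d rather keep the results as JSON, the same DataFrame can be exported with a single call:

# Writes a JSON array of {"Title": ..., "Link": ...} objects.
df.to_json('yandex_results.json', orient='records', indent=4)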
Here’s the complete scraper code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'

proxies = {
    'http': f'https://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.oxylabs.io:7777'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) '
                  'Gecko/20100101 Firefox/137.0',
    'Accept': 'text/html,application/xhtml+xml,'
              'application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8',
    'Connection': 'keep-alive'
}

response = requests.get(
    'https://yandex.com/search/?text=what%20is%20web%20scraping',
    proxies=proxies,
    headers=headers
)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')

data = []
for listing in soup.select('li.serp-item_card'):
    title_el = listing.select_one('h2 > span')
    title = title_el.text if title_el else None
    link_el = listing.select_one('.organic__url')
    link = link_el.get('href') if link_el else None
    data.append({'Title': title, 'Link': link})

df = pd.DataFrame(data)
df.to_csv('yandex_results.csv', index=False)
Running the code will produce a yandex_results.csv file containing the scraped titles and links.
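If you need more than the first page of results, Yandex appears to paginate with a 0-indexed p query parameter — an assumption worth verifying, as pagination details can vary by region. Here’s a sketch that collects the first three pages:

from urllib.parse import quote
import time

query = 'what is web scraping'
all_data = []
for page in range(3):
    # Assumption: "p" selects the result page, starting from 0.
    url = f'https://yandex.com/search/?text={quote(query)}&p={page}'
    resp = requests.get(url, proxies=proxies, headers=headers)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    for listing in soup.select('li.serp-item_card'):
        title_el = listing.select_one('h2 > span')
        link_el = listing.select_one('.organic__url')
        all_data.append({
            'Title': title_el.text if title_el else None,
            'Link': link_el.get('href') if link_el else None
        })
    time.sleep(2)  # pause between pages to reduce the chance of blocks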
Building your own web scraping tool can become burdensome, especially when you want to scale your data scraping processes. That’s where Oxylabs’ robust web scraping infrastructure comes in handy, allowing you to scrape thousands and even millions of Yandex pages without worrying about scaling, IP blocks, CAPTCHAs, and other hurdles.
Web Scraper API boasts plenty of features, including built-in proxy servers as well as dedicated scrapers and parsers for popular targets such as Google, Bing, Amazon, and more. Take a look at our documentation for a smooth start.
Begin by importing the requests and pandas libraries:
import requests
import pandas as pd
Next, create a payload dictionary that will provide the API with all the parameters required to scrape Yandex data:
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
}
You can also add more parameters to set a specific geo-location, enable JavaScript rendering, and more. Check out the supported API parameters for additional details.
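For instance, a payload that renders JavaScript and localizes results could look like this (the parameter values below are illustrative — check the documentation for the options your subscription supports):

payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
    'render': 'html',                 # execute JavaScript before returning the page
    'geo_location': 'United States'   # serve results as seen from this location
}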
Web Scraper API allows you to define your own parsing logic through the Custom Parser feature. So, let’s modify the payload dictionary with parsing_instructions:
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
    'parse': True,
    'parsing_instructions': {
        'listings': {
            '_fns': [{'_fn': 'css', '_args': ['li.serp-item_card']}],
            '_items': {
                'title': {
                    '_fns': [
                        {'_fn': 'css_one', '_args': ['h2 > span']},
                        {'_fn': 'element_text'}
                    ]
                },
                'link': {
                    '_fns': [
                        {
                            '_fn': 'xpath_one',
                            '_args': [
                                './/a[contains(@class, "organic__url")]/@href'
                            ]
                        }
                    ]
                }
            }
        }
    }
}
Custom Parser supports both CSS and XPath selectors. Hence, you can easily extract the result link from the href attribute using XPath. You can also ease the process of writing your own parsing logic by generating a Yandex parser with our AI-powered OxyCopilot.
Next, make a POST request to Web Scraper API and send the configured payload for processing:
response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=('API_USERNAME', 'API_PASSWORD'),
    json=payload
)
response.raise_for_status()
Make sure to replace the API_USERNAME and API_PASSWORD with the API user credentials you’ve created in the Oxylabs dashboard.
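To keep credentials out of your source code, you can read them from environment variables instead (the variable names below are our own choice — use whatever fits your setup):

import os

# Set OXYLABS_USERNAME and OXYLABS_PASSWORD in your shell beforehand.
auth = (os.environ['OXYLABS_USERNAME'], os.environ['OXYLABS_PASSWORD'])

Then pass auth=auth to requests.post() instead of the hard-coded tuple.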
To save the data in CSV format, you must first access the parsed results from the API’s JSON response:
data = response.json()['results'][0]['content']['listings']
Finally, create a data frame and export the search results to CSV:
df = pd.DataFrame(data)
df.to_csv('yandex_results_API.csv', index=False)
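If you want results for several keywords, you can reuse the same payload in a loop, building each URL with urllib.parse.quote. A minimal sketch (the keyword list is illustrative):

from urllib.parse import quote

keywords = ['what is web scraping', 'web scraping with python']
all_listings = []
for keyword in keywords:
    payload['url'] = f'https://yandex.com/search/?text={quote(keyword)}'
    resp = requests.post(
        'https://realtime.oxylabs.io/v1/queries',
        auth=('API_USERNAME', 'API_PASSWORD'),
        json=payload
    )
    resp.raise_for_status()
    for item in resp.json()['results'][0]['content']['listings']:
        item['Keyword'] = keyword  # tag each row with its search term
        all_listings.append(item)

pd.DataFrame(all_listings).to_csv('yandex_keywords.csv', index=False)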
Here’s the complete code for scraping Yandex search results with Web Scraper API:

import requests
import pandas as pd

payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
    'parse': True,
    'parsing_instructions': {
        'listings': {
            '_fns': [{'_fn': 'css', '_args': ['li.serp-item_card']}],
            '_items': {
                'title': {
                    '_fns': [
                        {'_fn': 'css_one', '_args': ['h2 > span']},
                        {'_fn': 'element_text'}
                    ]
                },
                'link': {
                    '_fns': [
                        {
                            '_fn': 'xpath_one',
                            '_args': [
                                './/a[contains(@class, "organic__url")]/@href'
                            ]
                        }
                    ]
                }
            }
        }
    }
}

response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=('API_USERNAME', 'API_PASSWORD'),
    json=payload
)
response.raise_for_status()

data = response.json()['results'][0]['content']['listings']
df = pd.DataFrame(data)
df.to_csv('yandex_results_API.csv', index=False)
Running the above code will output a yandex_results_API.csv file with the parsed titles and links. To wrap up, here’s a comparison of the different approaches to scraping Yandex:
| Approach | Advantages | Disadvantages |
|---|---|---|
| No proxies | Straightforward implementation, zero proxy-related expenses | Frequent IP blocking and CAPTCHA challenges, unable to access geo-restricted content, poor performance at larger scales |
| With proxies | Significantly reduces blocking risks, enables access to location-specific content, enhanced performance for high-volume scraping | Additional proxy service costs, requires managing proxy infrastructure (unless handled by provider) |
| Using a scraping API | Automatic IP rotation and CAPTCHA bypass, enterprise-grade scalability, browser emulation capabilities, rapid development and deployment | Recurring subscription fees, vendor lock-in concerns, may have constraints on certain data extraction scenarios |
| Custom solutions (Selenium, etc.) | Complete customization possibilities, particularly effective for JavaScript-heavy websites, no ongoing costs with self-hosted infrastructure | Requires substantial technical expertise, developer must implement anti-blocking strategies, often slower performance than specialized solutions |
Scraping Yandex SERPs is challenging, but by following the steps outlined in this article and using the provided Python code, you can scrape Yandex organic results for any chosen keyword and export the data to a CSV file. With the help of Web Scraper API or Residential Proxies, you can bypass Yandex’s anti-bot measures and gather real-time search data at scale.
If you require assistance or want to know more, feel free to contact us via email or live chat.
About the author
Vytenis Kaubrė
Technical Copywriter
Vytenis Kaubrė is a Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.