
How to Scrape Dynamic Websites With Python

Enrika Pavlovskytė

2023-10-12 · 5 min read

For static websites built with HTML and CSS, simple tools like Python’s Requests library and Beautiful Soup can often do the job when web scraping. However, things get tricky when dealing with advanced websites built on dynamic JavaScript frameworks like React, Angular, and Vue. These frameworks streamline web development by providing pre-built components and architecture, but they also make dynamic websites difficult to scrape.
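
To illustrate the static case, here’s a minimal sketch of such a scraper (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch a static page: the server returns the complete HTML in one response
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Everything we need is already in the markup, no JavaScript execution required
print(soup.title.text)
for link in soup.find_all("a"):
    print(link.get("href"))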

In this blog post, we’ll overview the differences between dynamic and static websites and give a step-by-step guide on scraping data from dynamic targets, particularly those featuring infinite scroll.

Dynamic vs. static pages

Static pages offer limited modes of interaction, mainly allowing content viewing and basic actions like following links or logging in. Dynamic pages provide a much broader range of interaction, often adapting to user behavior. 

This is often made possible by employing different approaches to website rendering. With static pages, you prepare an HTML file, put it on a server, and the user will then access that fixed file. Because the file is always the same, each user will see the exact same version of the web page.

Dynamic web pages, on the other hand, are not directly renderable by the web browser. Instead, when a server receives a request, it engages in intermediate actions to generate the final result on the fly. Consequently, it loads content dynamically and different users might see different versions of the same web page.

How to scrape dynamic pages?

We've previously discussed dynamic scraping techniques like employing Selenium, utilizing a headless browser, and making AJAX requests. So, we invite you to check these resources out as well. Additionally, check out the advanced web scraping with Python guide to learn more tactics, such as emulating AJAX calls and handling infinite scroll.

However, it's important to note that each target presents unique challenges. Indeed, there can be different elements that require special attention, such as pop-up windows, drop-down menus, live search suggestions, real-time updates, sortable tables, and more. 

One of the most common challenges is continuous or infinite scroll, a technique that allows content to be dynamically loaded as the user scrolls down a webpage. To scrape websites with infinite scroll, you need to customize your scraper, which is exactly what we’ll discuss below using Google Search as an example of a dynamic target.

How to scrape a dynamic target using Selenium

This section will go through the numbered steps to scrape dynamic sites using Selenium in Python. For our target, we will use the Google Search page with some keywords. 

Step 1: Setting up the environment

First, make sure you have installed the latest version of Python on your system. You’ll also need to install the Selenium library using the following command:

pip install selenium

Then, download the Chrome driver and ensure that the version matches your Google Chrome version. 

Step 2: Code in action

Start by creating a new Python file and importing the required libraries:

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from bs4 import BeautifulSoup

Then set up the Chrome WebDriver with Selenium. If the driver executable is on your system PATH, instantiating the driver is enough; otherwise, pass its location via ChromeService as shown in the comment below:

# Set up the Chrome WebDriver
# If chromedriver isn't on your PATH, point to it explicitly, e.g.:
# driver = webdriver.Chrome(service=ChromeService("/path/to/chromedriver"))
driver = webdriver.Chrome()

Following that, navigate to the Google Search Page and provide your search keyword:

# Navigate to Google Search
search_keyword = "adidas"
driver.get("https://www.google.com/search?q=" + search_keyword)
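
If your keyword contains spaces or special characters, it’s safer to URL-encode it first, for example:

import urllib.parse

# URL-encode multi-word queries before appending them to the search URL
search_keyword = "adidas running shoes"
driver.get("https://www.google.com/search?q=" + urllib.parse.quote_plus(search_keyword))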

Now you can simulate continuous scrolling with Selenium. Using the script below, you’ll scroll down multiple times to load more results before scraping them:

# Define the number of times to scroll
scroll_count = 5

# Simulate continuous scrolling using JavaScript
for _ in range(scroll_count):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for the new results to load
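
A fixed scroll count is fine for a demo, but with true infinite scroll you may prefer to keep scrolling until the page stops growing. One common pattern looks roughly like this (tune the sleep time for your target):

# Alternative: scroll until the page height stops changing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give new results time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # No new content was loaded
    last_height = new_height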

Finally, use BeautifulSoup to parse and extract the results:

# Get the page source after scrolling
page_source = driver.page_source

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Extract and print search results
search_results = soup.find_all('div', class_='tF2Cxc')
for result in search_results:
    title = result.h3.text
    link = result.a['href']
    print(f"Title: {title}")
    print(f"Link: {link}")
    print("\n")

Run the full code below to see the results:

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from bs4 import BeautifulSoup

# Set up the Chrome WebDriver
driver = webdriver.Chrome()

# Navigate to Google Search
search_keyword = "adidas"
driver.get("https://www.google.com/search?q=" + search_keyword)

# Define the number of times to scroll
scroll_count = 5

# Simulate continuous scrolling using JavaScript
for _ in range(scroll_count):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for the new results to load (adjust as needed)

# Get the page source after scrolling
page_source = driver.page_source

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")

# Extract and print search results
search_results = soup.find_all("div", class_="tF2Cxc")
for result in search_results:
    title = result.h3.text
    link = result.a["href"]
    print(f"Title: {title}")
    print(f"Link: {link}")
    print("\n")

# Close the WebDriver
driver.quit()

The output should print the title and link of each scraped search result to the console.
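
As a side note, the fixed time.sleep() calls are the simplest option but not the most reliable. A common refinement is an explicit wait that pauses only until results are actually present on the page, roughly like this (a sketch reusing the same tF2Cxc class):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one search result to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.tF2Cxc"))
)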

One problem with this web scraping approach is that Google might mistake you for a malicious bot, which would frequently trigger CAPTCHAs. This means you'll have to integrate reliable proxies into your script and rotate them continuously.
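
For illustration, a proxy can be passed to Chrome through a command-line option; the endpoint below is purely hypothetical and should be replaced with your own proxy address (note that proxies requiring a username and password typically need extra handling):

from selenium.webdriver.chrome.options import Options

# Hypothetical proxy endpoint - replace with your own address and port
options = Options()
options.add_argument("--proxy-server=http://your.proxy.address:8080")
driver = webdriver.Chrome(options=options)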

So, if you’re looking for a solution to extract data on a large scale, you might also want to consider a commercial solution that will deal with anti-bot systems for you, and we’ll cover this in the subsequent section.

Scraping dynamic web pages without Selenium

It’s important to note that you’re not restricted to Selenium when web scraping dynamic targets. In fact, there are alternative ways to do that: 

1. Oxylabs Scraper API

One of the best ways to scrape dynamic content is using a specialized scraper service. For example, Oxylabs Scraper API is designed for web scraping tasks and is adapted to the most popular web scraping targets. This solution leverages Oxylabs data gathering infrastructure, meaning that you don’t need to worry about IP blocks or JavaScript-rendered content, making it a valuable tool for web scraping dynamic targets.

2. Other library combinations

Beyond Selenium, there are numerous other libraries and tools that can handle dynamic content when web scraping. These include advanced Python libraries like Scrapy paired with Splash, Requests-HTML with Pyppeteer or Playwright, as well as Node.js with Cheerio. These tools give you the ability to interact with dynamic elements, render JavaScript, and extract data from dynamically loaded pages.
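
For a quick taste of one such alternative, here’s roughly what the rendering step looks like with Playwright’s Python API (this assumes you’ve run pip install playwright and playwright install chromium):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com/search?q=adidas")
    html = page.content()  # Fully rendered HTML, ready for BeautifulSoup or similar
    browser.close()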

3. Specialized no-code scrapers

There are also some no-code scrapers designed to handle dynamic content. These tools typically offer user-friendly interfaces that let users select and extract specific elements from dynamic pages without writing code. However, such solutions can lack flexibility and are better suited to basic scraping projects.

Scraping targets with infinite scroll using Scraper API

As mentioned above, you can use a commercial Scraper API to scrape dynamic content. The benefit of this approach is that you won’t need to worry about having your own scraping infrastructure. Specifically, you don’t need to pass any additional parameters to deal with CAPTCHAs and IP blocks, as the tool does these things for you.

What’s more, Oxylabs Scraper APIs are designed to deal with dynamically loaded content. For example, our Web Scraper API automatically detects infinite scroll and efficiently loads the requested organic results without extra parameters required.

Let’s see how it works in action!

Step 1: Setting up the environment

Before we start scraping, make sure you have the following libraries installed:

pip install requests beautifulsoup4 pandas

Also, to use the Web Scraper API, you need to get access to the Oxylabs API by creating an account. After doing that, you will be directed to the Oxylabs API dashboard. Head to the Users tab and create a new API user. These API user credentials will be used later in the code.

Step 2: Code in action

After getting the credentials, you can start writing the code in Python. Begin by importing the required library files in your Python file:

import requests
import pandas as pd

Then, create the payload for the SERP Scraper API following this structure:

# Structure payload.
payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
    'parse': True,
    'limit': 100,
}

You can then initialize the request and get the response from the API:

USERNAME = "<your_username>"
PASSWORD = "<your_password>"

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=(USERNAME, PASSWORD),
    json=payload,
)
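
Before parsing the response, it’s worth checking that the request actually succeeded, for example:

# Raise an error early if the API returned a failure status (e.g. 401 for wrong credentials)
response.raise_for_status()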

After receiving the response, you can get the content from the JSON results using pandas:

df = pd.DataFrame(columns=["Product Title", "Seller", "Price", "URL"])

results = response.json()['results']
items = results[0]['content']['results']
items_list = items['pla']['items']
for it in items_list:
    df = pd.concat(
        [pd.DataFrame([[it['title'], it['seller'], it['price'], it['url']]], columns=df.columns), df],
        ignore_index=True,
    )

Here, we’ve created a DataFrame and added all the data to it. You can then export the DataFrame to CSV and JSON files like this:

df.to_csv("google_search_results.csv", index=False)
df.to_json("google_search_results.json", orient="split", index=False)

That’s it! Let’s combine all the code and execute it to see the output:

import requests
import pandas as pd

USERNAME = "<your_username>"
PASSWORD = "<your_password>"

# Structure payload.
payload = {
    "source": "google_search",
    "domain": "com",
    "query": "adidas",
    "start_page": 11,
    "pages": 2,
    "parse": True,
}

# Get response.
response = requests.request(
    "POST",
    "https://realtime.oxylabs.io/v1/queries",
    auth=(USERNAME, PASSWORD),
    json=payload,
)
df = pd.DataFrame(columns=["Product Title", "Seller", "Price", "URL"])

results = response.json()["results"]
items = results[0]["content"]["results"]
items_list = items["pla"]["items"]
for it in items_list:
    df = pd.concat(
        [
            pd.DataFrame(
                [[it["title"], it["seller"], it["price"], it["url"]]],
                columns=df.columns,
            ),
            df,
        ],
        ignore_index=True,
    )


print(df)
df.to_csv("google_search_results.csv", index=False)
df.to_json("google_search_results.json", orient="split", index=False)

As a result, you’ll get a CSV file (along with a JSON file) that you can open in Excel or any other spreadsheet tool.

Conclusion

Scraping dynamic web pages is possible with Python, Selenium, and the Oxylabs SERP Scraper API. Your individual use case, data requirements, and preferences will determine the approach you use. While the SERP Scraper API streamlines the process of scraping search engine results, Selenium offers more control over browser automation.

We hope you found this guide on how to scrape dynamic web pages useful and if you’re looking for more materials, be sure to check out our blog posts on dynamic web scraping with Playwright, PHP, or R.  You might also be interested in web scraping with Python or asynchronous data gathering.

Frequently asked questions

Why are BeautifulSoup and Requests not enough to scrape dynamic websites?

Beautiful Soup and Requests are not enough to scrape dynamic websites primarily because they do not have the capability to execute JavaScript. Instead, you need to employ additional tools like headless browsers or leverage other libraries like Requests-HTML with Pyppeteer or Playwright.

About the author

Enrika Pavlovskytė

Former Copywriter

Enrika Pavlovskytė was a Copywriter at Oxylabs. With a background in digital heritage research, she became increasingly fascinated with innovative technologies and started transitioning into the tech world. On her days off, you might find her camping in the wilderness and, perhaps, trying to befriend a fox! Even so, she would never pass up a chance to binge-watch old horror movies on the couch.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
