In today’s data-driven world, being able to extract information from platforms like Google Hotels opens up powerful opportunities. Imagine having access to up-to-date aggregated hotel data—including prices, availability, reviews, and locations—at your fingertips, ready to fuel your market research or compare prices for better decision-making. This tutorial will show you how to do just that using Python, teaching you step-by-step how to scrape data from Google Hotels.
Using a Google Hotels web scraper, you can gather data to monitor pricing trends and compare hotel prices across regions. These insights are valuable for market research in the hospitality industry, providing a competitive edge through a better understanding of customer preferences. Alternatively, a Google Hotels API can streamline the data collection process for the same purposes, letting you retrieve geo-specific results in HTML or JSON format without getting blocked.
Note: This tutorial is intended for educational purposes only. Before implementing these techniques, thoroughly review Google’s Terms of Service to ensure your data collection complies with their web scraping policies and remains ethical.
We will use the Selenium and BeautifulSoup libraries in Python to scrape data from Google Hotels. Selenium interacts with the dynamic parts of the website, while BeautifulSoup extracts data from the rendered page using techniques like CSS selectors. This combination makes it easier to automate web scraping tasks and handle dynamic content efficiently.
This tutorial expects you to have basic Python and web scraping knowledge. However, if you are new, refer to the foundational Python web scraping blog post. For more advanced users, utilizing a Google Hotels web scraper or even integrating with Google Maps data can greatly enhance your data collection efforts, similar to scraping Google News or other dynamic platforms.
Ensure that a recent version of Python is installed on your system. For this tutorial, we used Python 3.11, but any version above 3.8 will work. Additionally, you'll need the following libraries:
BeautifulSoup
Requests
Selenium
Pandas
lxml
These libraries will allow you to extract structured information like hotel prices, hotel ratings, and locations. Such extracted data can then be used for further market research, price comparison, or monitoring pricing trends. To install these libraries, you can use the following command:
pip install requests bs4 selenium pandas lxml
After installing all the libraries, create a new Python file and start by importing the required modules: Selenium for browser automation, BeautifulSoup for extracting the data, pandas for saving the data, and os for file handling.
import pandas as pd
import os
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from selenium.webdriver import ChromeOptions
After importing the required libraries, initialize the Selenium WebDriver and open the Google Hotels URL. You can adjust search criteria such as the location or dates to customize the dataset you want to extract.
# Step 1: Set up Selenium WebDriver
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

location_name = "New York"  # Set the location for hotel search
base_url = f"https://www.google.com/travel/hotels/{location_name}/?q={location_name}&currency=USD"
driver.get(base_url)  # Load the Google Hotels page for the specified location
location_name holds the name of the location for which you want hotel data. Set it to any location of your choice; the variable is interpolated into the URL.
Sometimes, Google can present a cookies page before showing the hotel results. You can accept the cookies if the page appears using this code:
# Accept cookies if the button exists
try:
    driver.find_element(By.XPATH, '//button[contains(@aria-label, "Accept")]').click()
except NoSuchElementException:
    pass
Let's look at the Google Hotels page:
This is the main page that shows the hotel name, price, rating, and other amenities. We can extract the name, price, and rating from this page, but for the location, we need to navigate to the hotel's details page and get its address. Google’s other SERP elements, such as carousels, events, or ads, are similarly dynamic for different content types. The details page looks as follows:
Using Selenium, we will navigate to the main page, get the data from that page, then navigate to the details page and get address details from that page.
We can look at the corresponding HTML of all these elements on the developer's tools window in the browser (right-click on the element and click inspect element). The following images show the HTML of all the elements:
The next step is to get the data as a response and parse it using Beautiful Soup. Google may present the same hotel listings on different pages; hence, you should only store the data that’s not already present. You can achieve this by using the set() function.
# Step 2: Initialize a list to store hotel data
hotel_data = []
# Initialize a set to track unique entries
unique_hotels = set()

# Parse the page with BeautifulSoup using lxml
soup = BeautifulSoup(driver.page_source, "lxml")

# Find all hotel cards on the page
hotel_cards = soup.find_all("div", class_="BcKagd")  # Adjust class if necessary

for hotel in hotel_cards:
    # Extract hotel details
    name = (
        hotel.find("h2", class_="BgYkof").text
        if hotel.find("h2", class_="BgYkof")
        else "N/A"
    )  # Hotel name
    price = (
        hotel.find("span", class_="qQOQpe prxS3d").text
        if hotel.find("span", class_="qQOQpe prxS3d")
        else "N/A"
    )  # Price
    rating = (
        hotel.find("span", class_="KFi5wf lA0BZ").text
        if hotel.find("span", class_="KFi5wf lA0BZ")
        else "N/A"
    )  # Rating
From the main page, we extracted the name, rating, and price. Now we will navigate to the details page and get the location.
    hotel_link = hotel.find("a", class_="PVOOXe").get("href")  # Hotel link

    # Visit the hotel page to get more details
    driver.get("https://www.google.com" + hotel_link)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "gJGKuf"))
    )

    # Try to get the location of the hotel
    try:
        location = driver.find_element(By.XPATH, '//div[@class="K4nuhf"]/span[1]').text
        contact = driver.find_element(By.XPATH, '//div[@class="K4nuhf"]/span[3]').text
    except NoSuchElementException:
        location = "N/A"  # Default if location not found
        contact = "N/A"

    # Go back to the previous page
    driver.back()

    # Define a unique identifier for the hotel
    hotel_id = (name, price, rating, location, contact)
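The tuple-in-a-set de-duplication used above can be illustrated in isolation. This is a minimal sketch with made-up sample values, not data from Google Hotels:

```python
# Sketch of the de-duplication pattern: each record is reduced to a
# tuple of its fields, and a set of tuples tracks what has been stored.
rows = [
    ("Hotel A", "$120", "4.5"),
    ("Hotel B", "$95", "4.1"),
    ("Hotel A", "$120", "4.5"),  # duplicate listing seen again on a later page
]

seen = set()
stored = []
for row in rows:
    if row not in seen:  # tuples are hashable, so set membership is O(1)
        seen.add(row)
        stored.append(row)

print(len(stored))  # 2 unique hotels
```

Because tuples are immutable and hashable, they work as set members, which is why the scraper packs the hotel fields into a tuple before checking for duplicates.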
After collecting a hotel's data, we navigate back to the listing page and proceed to the next hotel, continuing until all desired pages have been scraped. Since the hotel data is spread across multiple pages, we need to handle pagination. We'll implement a loop that iterates through 5 pages (you can adjust this number as needed) and keeps scraping until either 5 pages have been processed or the "Next" button is no longer available. To achieve this, we enclose the scraping code in a loop that checks for the "Next" button and iterates accordingly.
hotel_data = []
# Initialize a set to track unique entries
unique_hotels = set()

no_of_pages = 0  # Counter for the number of pages scraped
next_button_clickable = True  # Flag to track if the next button is clickable

# Step 3: Scrape hotel data page by page
while next_button_clickable and no_of_pages < 5:
    # Parse the page with BeautifulSoup using lxml
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Find all hotel cards on the page
    hotel_cards = soup.find_all("div", class_="BcKagd")  # Adjust class if necessary

    for hotel in hotel_cards:
        # Extract hotel details
        name = (
            hotel.find("h2", class_="BgYkof").text
            if hotel.find("h2", class_="BgYkof")
            else "N/A"
        )  # Hotel name
        price = (
            hotel.find("span", class_="qQOQpe prxS3d").text
            if hotel.find("span", class_="qQOQpe prxS3d")
            else "N/A"
        )  # Price
        rating = (
            hotel.find("span", class_="KFi5wf lA0BZ").text
            if hotel.find("span", class_="KFi5wf lA0BZ")
            else "N/A"
        )  # Rating
        hotel_link = hotel.find("a", class_="PVOOXe").get("href")  # Hotel link

        # Visit the hotel page to get more details
        driver.get("https://www.google.com" + hotel_link)
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "gJGKuf"))
        )

        # Try to get the location of the hotel
        try:
            location = driver.find_element(By.XPATH, '//div[@class="K4nuhf"]/span[1]').text
            contact = driver.find_element(By.XPATH, '//div[@class="K4nuhf"]/span[3]').text
        except NoSuchElementException:
            location = "N/A"  # Default if location not found
            contact = "N/A"

        # Go back to the previous page
        driver.back()

        # Define a unique identifier for the hotel
        hotel_id = (name, price, rating, location, contact)

        # Only add the hotel if it hasn't been seen before
        if hotel_id not in unique_hotels:
            unique_hotels.add(hotel_id)
            hotel_data.append({
                "Name": name,
                "Price": price,
                "Rating": rating,
                "Location": location,
                "Contact": contact,
                "Link": "https://www.google.com" + hotel_link
            })

    # Clear the hotel_data list to store the next page's data
    hotel_data.clear()

    # Step 5: Check for the "Next" button and navigate to the next page
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "K1smNd"))
        )
        # Locate the "Next" button based on the current page number
        if no_of_pages == 0:
            next_button = driver.find_element(
                By.XPATH, '//div[@class="eGUU7b"]/button'
            )
        else:
            next_button = driver.find_element(
                By.XPATH, '//div[@class="eGUU7b"]/button[2]'
            )
        # Click the "Next" button
        next_button.click()
        no_of_pages += 1  # Increment the page counter
    except NoSuchElementException:
        print("No more pages to navigate.")
        next_button_clickable = False  # Stop if no "Next" button is found

# Step 6: Close the Selenium driver
driver.quit()
The loop scrapes data from up to 5 pages of hotels. After each page is scraped, it checks for a "Next" button; if the button exists, it clicks through to the next page and continues. The loop stops once 5 pages have been scraped or no "Next" button is found (indicating there are no more pages). For this, we need the XPath of the button, which differs on the first page: subsequent pages also contain a "Previous" button, shifting the "Next" button's position.
After scraping the hotel data from a page, we write it to a CSV file and move on to the next loop iteration. The following code should be inserted into the loop above, before the "Next" button is clicked.
# Step 4: Save the current page's data to CSV after scraping
df = pd.DataFrame(hotel_data)
# Write header only if the file does not exist
df.to_csv("hotels_data.csv", mode="a", header=not os.path.exists("hotels_data.csv"), index=False)
# Clear the hotel_data list to store the next page's data
hotel_data.clear()
The header=not os.path.exists("hotels_data.csv") argument ensures that the header is written only when the CSV file is first created. After saving the current page's data, we clear the hotel_data list to start fresh for the next page.
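The append-with-header pattern can be tested on its own. Here we write two single-row batches to a temporary file; the file path and sample rows are purely illustrative:

```python
import os
import tempfile

import pandas as pd

# Write two batches to the same CSV; the header is emitted only on the
# first write, when the file does not exist yet.
path = os.path.join(tempfile.mkdtemp(), "hotels_data.csv")

batch1 = pd.DataFrame([{"Name": "Hotel A", "Price": "$120"}])
batch2 = pd.DataFrame([{"Name": "Hotel B", "Price": "$95"}])

for batch in (batch1, batch2):
    batch.to_csv(path, mode="a", header=not os.path.exists(path), index=False)

combined = pd.read_csv(path)
print(len(combined))  # 2 rows under a single header
```

Appending page by page like this also means partial results survive if the scraper crashes midway, which is handy for long runs.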
Let’s combine all the code and have a look at the complete code:
# Import necessary libraries
import pandas as pd
import os
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from selenium.webdriver import ChromeOptions

# Step 1: Set up Selenium WebDriver
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

location_name = "New York"  # Set the location for hotel search
base_url = f"https://www.google.com/travel/hotels/{location_name}/?q={location_name}&currency=USD"

# Delete the CSV file if it exists
if os.path.exists("hotels_data.csv"):
    os.remove("hotels_data.csv")

driver.get(base_url)  # Load the Google Hotels page for the specified location

# Accept cookies if the button exists
try:
    driver.find_element(By.XPATH, '//button[contains(@aria-label, "Accept")]').click()
except NoSuchElementException:
    pass

WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "K1smNd"))
)

# Step 2: Initialize a list to store hotel data
hotel_data = []
# Initialize a set to track unique entries
unique_hotels = set()

no_of_pages = 0  # Counter for the number of pages scraped
next_button_clickable = True  # Flag to track if the next button is clickable

# Step 3: Scrape hotel data page by page
while next_button_clickable and no_of_pages < 5:
    # Parse the page with BeautifulSoup using lxml
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Find all hotel cards on the page
    hotel_cards = soup.find_all("div", class_="BcKagd")  # Adjust class if necessary

    for hotel in hotel_cards:
        # Extract hotel details
        name = (
            hotel.find("h2", class_="BgYkof").text
            if hotel.find("h2", class_="BgYkof")
            else "N/A"
        )  # Hotel name
        price = (
            hotel.find("span", class_="qQOQpe prxS3d").text
            if hotel.find("span", class_="qQOQpe prxS3d")
            else "N/A"
        )  # Price
        rating = (
            hotel.find("span", class_="KFi5wf lA0BZ").text
            if hotel.find("span", class_="KFi5wf lA0BZ")
            else "N/A"
        )  # Rating
        hotel_link = hotel.find("a", class_="PVOOXe").get("href")  # Hotel link

        # Visit the hotel page to get more details
        driver.get("https://www.google.com" + hotel_link)
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "gJGKuf"))
        )

        # Try to get the location of the hotel
        try:
            location = driver.find_element(By.XPATH, '//div[@class="K4nuhf"]/span[1]').text
            contact = driver.find_element(By.XPATH, '//div[@class="K4nuhf"]/span[3]').text
        except NoSuchElementException:
            location = "N/A"  # Default if location not found
            contact = "N/A"

        # Go back to the previous page
        driver.back()

        # Define a unique identifier for the hotel
        hotel_id = (name, price, rating, location, contact)

        # Only add the hotel if it hasn't been seen before
        if hotel_id not in unique_hotels:
            unique_hotels.add(hotel_id)
            hotel_data.append({
                "Name": name,
                "Price": price,
                "Rating": rating,
                "Location": location,
                "Contact": contact,
                "Link": "https://www.google.com" + hotel_link
            })

    # Step 4: Save the current page's data to CSV after scraping
    df = pd.DataFrame(hotel_data)
    # Write header only if the file does not exist
    df.to_csv("hotels_data.csv", mode="a", header=not os.path.exists("hotels_data.csv"), index=False)

    # Clear the hotel_data list to store the next page's data
    hotel_data.clear()

    # Step 5: Check for the "Next" button and navigate to the next page
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "K1smNd"))
        )
        # Locate the "Next" button based on the current page number
        if no_of_pages == 0:
            next_button = driver.find_element(
                By.XPATH, '//div[@class="eGUU7b"]/button'
            )
        else:
            next_button = driver.find_element(
                By.XPATH, '//div[@class="eGUU7b"]/button[2]'
            )
        # Click the "Next" button
        next_button.click()
        no_of_pages += 1  # Increment the page counter
    except NoSuchElementException:
        print("No more pages to navigate.")
        next_button_clickable = False  # Stop if no "Next" button is found

# Step 6: Close the Selenium driver
driver.quit()
Let’s look at the output CSV of the above code:
Now, we have successfully scraped all the hotel data from the Google Hotels Page. But before you proceed with your own project, remember to consider some of the pitfalls and challenges that you can come across while scraping the Google Hotels page.
CAPTCHA: Google frequently uses CAPTCHA to stop bots. A CAPTCHA can appear if you're making requests frequently. To reduce this, try introducing delays (time.sleep) between queries to simulate natural browsing behavior.
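One way to space out requests is a small helper that sleeps for a random interval between page loads, so the access pattern doesn't look perfectly mechanical. This is a minimal sketch; the function name and the delay bounds are our own choices, not part of any library:

```python
import random
import time

def polite_sleep(min_s=2.0, max_s=6.0):
    """Pause for a random interval to mimic human browsing; returns the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Call between driver.get() navigations, e.g.:
d = polite_sleep(0.01, 0.02)  # tiny bounds here purely for demonstration
```

In the scraper above, a call like polite_sleep() would go before each driver.get() and after each driver.back().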
Dynamic content loading: Google Hotels uses JavaScript to load a large portion of its content dynamically. This implies that the full data will not be captured by static requests. Using Selenium can be beneficial because it enables a complete page load, including dynamically loaded elements.
IP blocks: Your IP may be temporarily blocked if you make too many requests in a short amount of time. Set appropriate delays, vary your access pattern, and, if needed, consider using proxies to prevent this.
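If you do route traffic through a proxy, Chrome accepts a --proxy-server command-line switch that can be set via ChromeOptions. The sketch below is a configuration fragment only; the proxy address is a placeholder you would replace with your own endpoint:

```python
from selenium import webdriver
from selenium.webdriver import ChromeOptions

# Placeholder proxy endpoint -- substitute your own host and port.
PROXY = "http://127.0.0.1:8080"

options = ChromeOptions()
options.add_argument("--headless=new")
options.add_argument(f"--proxy-server={PROXY}")

# driver = webdriver.Chrome(options=options)
```

Note that authenticated proxies (username/password) need extra handling, since Chrome's --proxy-server switch does not carry credentials.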
This tutorial covered the essential steps for scraping hotel data from Google Hotels using Python. We walked through setting up your Python environment, navigating Google Hotels, handling dynamic content with Selenium, extracting relevant details with BeautifulSoup, and saving the data into a CSV file. Along the way, we discussed key challenges such as CAPTCHAs, dynamic content loading, and potential IP blocking.
For more specialized cases, like Google Autocomplete or Google Books, explore these additional scraping solutions.
About the author
Maryia Stsiopkina
Senior Content Manager
Maryia Stsiopkina is a Senior Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.