How to Scrape Google Flights with Python

Yelyzaveta Nechytailo

2024-03-18 · 4 min read

Google Flights is a flight booking service that makes buying tickets from third-party providers easier. It holds large amounts of flight data, including airfare, destinations, flight times, layovers, departure dates, and more. All this information is valuable not only for individual travelers but for businesses, too – it helps them conduct competitor analysis, analyze customer preferences, and more.

Without the right tools and methods, scraping Google Flights listings can get quite challenging. In today's article, we're going to demonstrate how to scrape public data from flight pages and search results with Python and the Google Flights API. We'll gather all sorts of data, including price, flight time, and airline name.

1. Installing prerequisite libraries

Let’s start by installing the prerequisites:

pip install requests bs4

We’ll need the requests library to send API requests and BeautifulSoup4 to parse and extract information from the HTML that we’re going to scrape.
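As a quick sanity check that the installation worked, here's a minimal, self-contained example (the HTML snippet is made up purely for illustration) of how BeautifulSoup parses markup:

```python
from bs4 import BeautifulSoup

# A hypothetical HTML snippet for illustration only.
html = '<ul><li class="item">First</li><li class="item">Second</li></ul>'

soup = BeautifulSoup(html, 'html.parser')

# find_all accepts a tag name and a class as the second positional argument.
items = [li.get_text() for li in soup.find_all('li', 'item')]
print(items)  # ['First', 'Second']
```

We'll rely on this same tag-name-plus-class lookup pattern throughout the scraper.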

    2. Creating core structure

Now, let’s think about the general logic of our scraper. We’ll create functionality for defining multiple Google Flights URLs that we’d like to scrape. Then, we’ll take these URLs one by one, extract the flight information we need, and save it as a JSON object.

    To start off, let’s create a function that will take a URL as a parameter, scrape that URL with Google Flights API (you can get a free 7-day trial for it) and return the scraped HTML: 

    def get_flights_html(url):
        payload = {
            'source': 'google',
            'render': 'html',
            'url': url,
        }
    
        response = requests.request(
            'POST',
            'https://realtime.oxylabs.io/v1/queries',
            auth=('username', 'password'),
            json=payload,
        )
    
        response_json = response.json()
    
        html = response_json['results'][0]['content']
    
        return html

Make sure to replace username and password with your actual Oxylabs credentials.

If we inspect the HTML of the Google Flights page, we can see that all the flight listings are inside <li> elements with the class pIav2d.
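To see how that lookup behaves, here's a sketch using simplified, made-up markup that mimics the structure described above; keep in mind that the real Google Flights class names (such as pIav2d) are obfuscated and may change over time:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup mimicking the listing structure.
html = '''
<ul>
  <li class="pIav2d">Flight A</li>
  <li class="pIav2d">Flight B</li>
  <li class="other">Some other element</li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Only the <li> elements carrying the listing class are returned.
listings = soup.find_all('li', 'pIav2d')
print(len(listings))  # 2
```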

Next up, we’ll create a function that accepts a BeautifulSoup object created from the HTML of the whole page. This function will create and return an array of objects containing information from individual flight listings. Let’s form the function in a way that makes it easily extensible if required:

    def extract_flight_information_from_soup(soup_of_the_whole_page):
        flight_listings = soup_of_the_whole_page.find_all('li','pIav2d')
    
        flights = []
    
        for listing in flight_listings:
            if listing is not None:
                # Add some specific data extraction here
    
                flight = {}
    
                flights.append(flight)
    
        return flights

    Now that we can get the HTML and have a function to hold our information extraction, we can organize both of those into one:

    def extract_flights_data_from_urls(urls):
        constructed_flight_results = []
    
        for url in urls:
            html = get_flights_html(url)
    
            soup = BeautifulSoup(html,'html.parser')
        
            flights = extract_flight_information_from_soup(soup)
    
            constructed_flight_results.append({
                'url': url,
                'flight_data': flights
            })
    
        return constructed_flight_results

This function will take an array of URLs as a parameter and return an array of objects with the extracted flight data.

    One thing left for our core is a function that takes this data and saves it as a file:

    def save_results(results, filepath):
        with open(filepath, 'w', encoding='utf-8') as file:
            json.dump(results, file, ensure_ascii=False, indent=4)
    
        return
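To verify the saving logic in isolation, we can feed it some dummy data (the values below are hypothetical placeholders) and read the file back:

```python
import json

def save_results(results, filepath):
    with open(filepath, 'w', encoding='utf-8') as file:
        json.dump(results, file, ensure_ascii=False, indent=4)

# Quick check with dummy, hypothetical data.
save_results([{'url': 'https://example.com', 'flight_data': []}], 'data.json')

with open('data.json', encoding='utf-8') as f:
    print(json.load(f)[0]['url'])  # https://example.com
```

ensure_ascii=False keeps non-ASCII characters (such as the € sign in prices) readable in the output file.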

    We can finish by creating a simple main function to invoke all that we’ve created so far:

    def main():
        results_file = 'data.json'
    
        urls = [
            'https://www.google.com/travel/flights?tfs=CBsQAhooEgoyMDI0LTA3LTI4agwIAxIIL20vMDE1NnFyDAgCEggvbS8wNGpwbBooEgoyMDI0LTA4LTAxagwIAhIIL20vMDRqcGxyDAgDEggvbS8wMTU2cUABSAFSA0VVUnABemxDalJJTkRCNVRGbDBOMU5UVEdOQlJ6aG5lRUZDUnkwdExTMHRMUzB0TFMxM1pXc3lOMEZCUVVGQlIxZ3dhRWxSUVRoaWFtTkJFZ1pWTWpnMk1qSWFDZ2lRYnhBQ0dnTkZWVkk0SEhEN2VBPT2YAQGyARIYASABKgwIAxIIL20vMDRqcGw&hl=en-US&curr=EUR&sa=X&ved=0CAoQtY0DahgKEwiAz9bF5PaEAxUAAAAAHQAAAAAQngM',
            'https://www.google.com/travel/flights/search?tfs=CBwQAhooEgoyMDI0LTA3LTI4agwIAxIIL20vMDE1NnFyDAgDEggvbS8wN19rcRooEgoyMDI0LTA4LTAxagwIAxIIL20vMDdfa3FyDAgDEggvbS8wMTU2cUABSAFwAYIBCwj___________8BmAEB&hl=en-US&curr=EUR'
        ]
    
        constructed_flight_results = extract_flights_data_from_urls(urls)
    
        save_results(constructed_flight_results, results_file)

    With the core of the application complete, we can start creating specific data extraction functions that will gather the needed flight information.

    3. Getting the price

    First up is the price. Navigating the HTML of Google Flights can get tricky, as it seems to be quite dynamic. Let’s see how we could pinpoint the price.

    The first div to find is the one with the class BVAVmf.

    Inside it, we can specify another one – YMlIz – which will contain only the price we need and nothing else.

    Now that we got that, we just need to write the function itself:

    def get_price(soup_element):
        price = soup_element.find('div','BVAVmf').find('div','YMlIz').get_text()
    
        return price
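To illustrate how this behaves, here's get_price run against a simplified, made-up snippet that mirrors the nested class structure described above (real class names and values will differ):

```python
from bs4 import BeautifulSoup

def get_price(soup_element):
    price = soup_element.find('div', 'BVAVmf').find('div', 'YMlIz').get_text()

    return price

# Hypothetical listing markup for illustration only.
html = '<li><div class="BVAVmf"><div class="YMlIz">€123</div></div></li>'
listing = BeautifulSoup(html, 'html.parser').find('li')

print(get_price(listing))  # €123
```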

    4. Getting the flight time

    Another important piece of information is the flight time. We’ll begin with the div containing both the flight time and the airline name – Ir0Voe.

    Then, we can specify an inner div that contains only the time – zxVSec.

    If we inspect this one closer, we’ll find that it has a lot of inner span containers with our time information split and repeated numerous times. But all of the needed time information can be found in the first child span container of the latter div.

We can also make good use of the fact that Oxylabs SERP Scraper API renders JavaScript for us, since some details, such as the exact flight dates, only appear in the rendered HTML (on the live page, they’re shown when hovering with a mouse).

We should keep in mind that there will be two span containers with the class eoY5cb that we will need: one for the departure and one for the arrival. We should get them both.

    Now that we’ve got all of the needed HTML information, let’s extract it.

    def get_time(soup_element):
        spans = soup_element.find('div','Ir0Voe').find('div','zxVSec', recursive=False).find_all('span', 'eoY5cb')
    
        time = ""
    
        for span in spans:
            time = time + span.get_text() + "; "
    
        return time
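Here's a sketch of get_time against hypothetical markup mirroring the structure above. Note that because of recursive=False, the zxVSec div must be a direct child of the Ir0Voe div for the lookup to succeed:

```python
from bs4 import BeautifulSoup

def get_time(soup_element):
    spans = soup_element.find('div', 'Ir0Voe').find('div', 'zxVSec', recursive=False).find_all('span', 'eoY5cb')

    time = ""

    for span in spans:
        time = time + span.get_text() + "; "

    return time

# Hypothetical markup; zxVSec is a direct child of Ir0Voe.
html = ('<li><div class="Ir0Voe"><div class="zxVSec">'
        '<span class="eoY5cb">10:00 AM on Sun, Jul 28</span>'
        '<span class="eoY5cb">1:30 PM on Sun, Jul 28</span>'
        '</div></div></li>')
listing = BeautifulSoup(html, 'html.parser').find('li')

print(get_time(listing))
```

Both eoY5cb spans (departure and arrival) end up joined into one semicolon-separated string.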

    5. Getting the airline name

For the last piece of information, we’ll need the name of the airline, which, again, won’t be trivial to extract.

    We begin with the same div containing both the flight time and the airline name – Ir0Voe.

    Then, we can specify an inner div that contains only the airline name – sSHqwe.

One more thing to note: there might be some additional information about the flight and the airline in this field, such as the operator, or the fact that tickets for this flight are sold by multiple airlines.

    As all of this information is important for us, let’s try to gather it all. We can see that all of the texts are inside classless span containers. 

    The final step is to put everything we found so far into a function for extraction.

    def get_airline(soup_element):
        airline = soup_element.find('div','Ir0Voe').find('div','sSHqwe')
    
        spans = airline.find_all('span', attrs={"class": None}, recursive=False)
    
        result = ""
    
        for span in spans:
            result = result + span.get_text() + "; "
        
        return result
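The attrs={"class": None} filter is what makes this work: it matches only spans that carry no class attribute at all. Here's a sketch on made-up markup showing how a classless text span is kept while a span with a class (an icon, say) is skipped:

```python
from bs4 import BeautifulSoup

def get_airline(soup_element):
    airline = soup_element.find('div', 'Ir0Voe').find('div', 'sSHqwe')

    spans = airline.find_all('span', attrs={"class": None}, recursive=False)

    result = ""

    for span in spans:
        result = result + span.get_text() + "; "

    return result

# Hypothetical markup; only classless direct-child spans hold airline text.
html = ('<li><div class="Ir0Voe"><div class="sSHqwe">'
        '<span>Lufthansa</span>'
        '<span class="icon">*</span>'
        '<span>Operated by Austrian</span>'
        '</div></div></li>')
listing = BeautifulSoup(html, 'html.parser').find('li')

print(get_airline(listing))  # Lufthansa; Operated by Austrian; 
```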

    Having all of these functions for data extraction, we just need to add them to the place we designated earlier to finish up our code.

    def extract_flight_information_from_soup(soup_of_the_whole_page):
        flight_listings = soup_of_the_whole_page.find_all('li','pIav2d')
    
        flights = []
    
        for listing in flight_listings:
            if listing is not None:
                price = get_price(listing)
                time = get_time(listing)
                airline = get_airline(listing)
    
                flight = {
                    "airline": airline,
                    "time": time,
                    "price": price 
                }
    
                flights.append(flight)
    
        return flights

    6. Final result

    If we add all of it together, the final product should look something like this.

    from bs4 import BeautifulSoup
    import requests
    import json
    
    def get_price(soup_element):
        price = soup_element.find('div','BVAVmf').find('div','YMlIz').get_text()
    
        return price
    
    
    def get_time(soup_element):
        spans = soup_element.find('div','Ir0Voe').find('div','zxVSec', recursive=False).find_all('span', 'eoY5cb')
    
        time = ""
    
        for span in spans:
            time = time + span.get_text() + "; "
    
        return time
    
    
    def get_airline(soup_element):
        airline = soup_element.find('div','Ir0Voe').find('div','sSHqwe')
    
        spans = airline.find_all('span', attrs={"class": None}, recursive=False)
    
        result = ""
    
        for span in spans:
            result = result + span.get_text() + "; "
        
        return result
    
    
    def save_results(results, filepath):
        with open(filepath, 'w', encoding='utf-8') as file:
            json.dump(results, file, ensure_ascii=False, indent=4)
    
        return
    
    
    def get_flights_html(url):
        payload = {
            'source': 'google',
            'render': 'html',
            'url': url,
        }
    
        # Get response.
        response = requests.request(
            'POST',
            'https://realtime.oxylabs.io/v1/queries',
            auth=('username', 'password'),
            json=payload,
        )
    
        response_json = response.json()
    
        html = response_json['results'][0]['content']
    
        return html
    
    
    def extract_flight_information_from_soup(soup_of_the_whole_page):
        flight_listings = soup_of_the_whole_page.find_all('li','pIav2d')
    
        flights = []
    
        for listing in flight_listings:
            if listing is not None:
                price = get_price(listing)
                time = get_time(listing)
                airline = get_airline(listing)
    
                flight = {
                    "airline": airline,
                    "time": time,
                    "price": price 
                }
    
                flights.append(flight)
    
        return flights
    
    
    def extract_flights_data_from_urls(urls):
        constructed_flight_results = []
    
        for url in urls:
            html = get_flights_html(url)
    
            soup = BeautifulSoup(html,'html.parser')
        
            flights = extract_flight_information_from_soup(soup)
    
            constructed_flight_results.append({
                'url': url,
                'flight_data': flights
            })
    
        return constructed_flight_results
    
    
    def main():
        results_file = 'data.json'
    
        urls = [
            'https://www.google.com/travel/flights?tfs=CBsQAhooEgoyMDI0LTA3LTI4agwIAxIIL20vMDE1NnFyDAgCEggvbS8wNGpwbBooEgoyMDI0LTA4LTAxagwIAhIIL20vMDRqcGxyDAgDEggvbS8wMTU2cUABSAFSA0VVUnABemxDalJJTkRCNVRGbDBOMU5UVEdOQlJ6aG5lRUZDUnkwdExTMHRMUzB0TFMxM1pXc3lOMEZCUVVGQlIxZ3dhRWxSUVRoaWFtTkJFZ1pWTWpnMk1qSWFDZ2lRYnhBQ0dnTkZWVkk0SEhEN2VBPT2YAQGyARIYASABKgwIAxIIL20vMDRqcGw&hl=en-US&curr=EUR&sa=X&ved=0CAoQtY0DahgKEwiAz9bF5PaEAxUAAAAAHQAAAAAQngM',
            'https://www.google.com/travel/flights/search?tfs=CBwQAhooEgoyMDI0LTA3LTI4agwIAxIIL20vMDE1NnFyDAgDEggvbS8wN19rcRooEgoyMDI0LTA4LTAxagwIAxIIL20vMDdfa3FyDAgDEggvbS8wMTU2cUABSAFwAYIBCwj___________8BmAEB&hl=en-US&curr=EUR'
        ]
    
        constructed_flight_results = extract_flights_data_from_urls(urls)
    
        save_results(constructed_flight_results, results_file)
    
    
    if __name__ == "__main__":
        main()

    Conclusion

    In today’s article, we’ve learned how to extract public pricing information, flight times, and airline names from Google Flights. By employing Python and Oxylabs SERP Scraper API, we were able to deal with the dynamic nature of Google Flights and gather the data successfully. 

    Frequently asked questions

    Is scraping Google Flights legal?

Scraping itself is not an illegal activity as long as it’s performed without violating any laws, regulations, or the target website’s requirements. Therefore, before extracting public data from Google Flights, you need to make sure your scraping activities are ethical and that you’re using a high-quality web scraping tool.

    How do I get Google flight data?

    Google doesn’t provide an official API for accessing Google Flights data. However, there are several quality third-party solutions, such as Oxylabs’ Google Flights API, that can help you access Google Flights’ public data without a problem. Alternatively, you can build your own web scraper, but this might require more time and resources from your side.

    How accurate is Google Flights?

Google Flights is considered to be a reliable service for providing up-to-date information on flights, prices, and airlines. While users can generally be sure that the provided information is trustworthy, various factors, such as sudden schedule changes or discrepancies in data provided by airlines, may occasionally affect its accuracy.

    About the author

Yelyzaveta Nechytailo

    Senior Content Manager

    Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

    All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
