How to Scrape Google People Also Ask: Python Tutorial
Maryia Stsiopkina
Scraping Google’s “People Also Ask” (PAA) section is a great way to gather insights into user search intent and identify the questions people commonly ask online. These questions can provide content ideas, improve SEO, and help you build an effective content strategy.
In this tutorial, you will learn how to scrape the PAA box using Python, extract the data with BeautifulSoup, and save the results to a file.
Google’s People Also Ask is a feature that appears on search engine results pages (SERPs) in response to user queries. It provides a box of related questions that users frequently search for, each of which can be expanded to reveal a brief answer. This data is incredibly useful for SEO purposes. If you'd prefer an automated solution, you can try Oxylabs' dedicated Google Ads Scraper API for gathering Google Ads data, their Google Search Autocomplete API for scraping search suggestions, or even integrate them with Google Sheets for easy data analysis.
For this tutorial, we will need the requests and beautifulsoup4 Python libraries (the json and os modules we’ll use later ship with Python). Let’s install them using pip:
pip install requests beautifulsoup4
Now that we have all the required libraries, we will create a file main.py that will hold all of our code and start by making a request to Google:
import requests

# Encode the query for use in a URL
query = "search query"
query = query.replace(' ', '+')

# URL for the Google search
url = f"https://www.google.com/search?q={query}"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Send the GET request to Google
response = requests.get(url, headers=headers)

# If the request is successful (status code 200)
if response.status_code == 200:
    print(response.status_code)
else:
    print(f"Error: Unable to fetch the search results. Status code: {response.status_code}")
In the code above, we replace spaces with plus (+) characters in our search query so that we can use it as a URL parameter. We then construct the URL, add some headers so that Google lets our request through, and send the GET request.
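As a side note, replacing spaces with + covers simple queries, but queries containing other special characters (quotes, ampersands, non-ASCII text) are safer to encode with the standard library’s urllib.parse.quote_plus. A minimal sketch:

from urllib.parse import quote_plus

# quote_plus turns spaces into '+' and percent-encodes other special
# characters, which is more robust than a plain str.replace
query = "how do you get diabetes & what are the symptoms?"
url = f"https://www.google.com/search?q={quote_plus(query)}"
print(url)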
Next, let’s focus on locating and extracting those questions. We will use the HTML of the page we got back from Google to create a BeautifulSoup object and then search it for the questions.
Before we begin coding, we need a CSS selector for the question elements so we can target them later. You can find it by inspecting the Google page source in a browser; at the time of writing, each question is rendered as a span element with the class CSkcDe.
All that is left now is to fetch the text from these elements using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

# Define a function to search Google
def get_soup_from_google_search(query):
    # Encode the query for use in a URL
    query = query.replace(' ', '+')

    # URL for the Google search
    url = f"https://www.google.com/search?q={query}"

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    # Send the GET request to Google
    response = requests.get(url, headers=headers)

    # If the request is successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")
        return soup
    else:
        print(f"Error: Unable to fetch the search results. Status code: {response.status_code}")
        return None

query = "how do you get diabetes"
soup = get_soup_from_google_search(query)

if soup:
    for question in soup.select('span.CSkcDe'):
        print(question.text)
We have created a function called get_soup_from_google_search that does what the name suggests and takes the desired query as a parameter. We then check that the returned object is not empty, find all of the previously identified PAA question elements, and print out their text. If you need more specialized data, consider Oxylabs' Google Books API or Google Events API.
If we run the code, we can see all of the questions printed out.
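Keep in mind that CSkcDe is simply the class name Google used at the time of writing; Google regenerates these class names regularly, so the selector may stop matching at some point. If that happens, a small diagnostic snippet like the one below (assuming soup is the object returned by get_soup_from_google_search) can help you spot the new class name:

# List every span that has a class attribute so you can find the class
# currently used for PAA questions if Google's markup has changed
for span in soup.find_all("span", class_=True):
    text = span.get_text(strip=True)
    if text:
        print(span["class"], "->", text[:60])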
Finally, we have to store the collected data somewhere. Let’s create a function that takes in the extracted data and saves it to a file together with a timestamp (the datetime, os, and json imports it relies on appear in the full script below).
def save_results(query, questions):
    results = {
        "date": datetime.now().strftime("%Y-%m-%d"),
        "query": query,
        "questions": questions
    }

    if os.path.exists("results.json"):
        with open("results.json", "r", encoding="utf-8") as file:
            data = json.load(file)
    else:
        data = []

    data.append(results)

    with open("results.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4)

    print("Results saved to results.json")
If we combine it with the code we had before, we will have our final product:
import requests
import json
import os
from bs4 import BeautifulSoup
from datetime import datetime

def save_results(query, questions):
    # Bundle the query, its questions, and today's date into one record
    results = {
        "date": datetime.now().strftime("%Y-%m-%d"),
        "query": query,
        "questions": questions
    }

    # Append to the existing results file if it exists
    if os.path.exists("results.json"):
        with open("results.json", "r", encoding="utf-8") as file:
            data = json.load(file)
    else:
        data = []

    data.append(results)

    with open("results.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4)

    print("Results saved to results.json")

def google_search(query):
    # Encode the query and build the search URL
    query = query.replace(' ', '+')
    url = f"https://www.google.com/search?q={query}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        return soup
    else:
        print(f"Error: Unable to fetch the search results. Status code: {response.status_code}")
        return None

def extract_questions(soup):
    # Collect the text of every PAA question element on the page
    titles = []
    if soup:
        for question in soup.select('span.CSkcDe'):
            titles.append(question.get_text())
    return titles

query = "how do you get diabetes"
soup = google_search(query)
questions = extract_questions(soup)
save_results(query, questions)
And if we run the code, we can inspect our results file or load the data into other software for further analysis.
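Since the file accumulates one record per run, you can load it back with the standard json module for a quick overview, for example to count how many questions were captured per query over time:

import json

# Load the accumulated results and print a one-line summary per record
with open("results.json", "r", encoding="utf-8") as file:
    data = json.load(file)

for entry in data:
    print(f'{entry["date"]} | {entry["query"]} | {len(entry["questions"])} questions')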
The important thing to consider here is that Google will quickly impose limits if you scrape above a certain threshold, so proxies and rotating user agents should be used to avoid issues like IP blocks and CAPTCHAs. For more robust scraping, you might also want to explore the Google Carousel API to handle carousel elements in search results.
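As a rough illustration of both techniques (not a production setup), the sketch below rotates user agents and routes requests through a proxy with the requests library; the proxy address and the user-agent strings are placeholders you would replace with your own:

import random
import requests

# Placeholder values - substitute your own proxy endpoint and UA strings
PROXIES = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]

def fetch(url):
    # Pick a random user agent per request and send it through the proxy
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=10)

# Example usage (hypothetical query):
# response = fetch("https://www.google.com/search?q=example")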
As People Also Ask data changes over time, it’s important to schedule a script like this to run periodically. After some time, you can start tracking shifts in user search intent and update your content strategy accordingly.
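On Linux or macOS, cron (or Task Scheduler on Windows) is the usual choice for this, but you can also keep everything in Python with a simple sleep loop. A minimal sketch, assuming the functions from the final script above (google_search, extract_questions, save_results) are available:

import time
from datetime import datetime

# Re-run the scrape once a day; a system scheduler like cron is more
# robust, but this keeps the whole workflow in a single Python script
while True:
    print(f"Running scrape at {datetime.now().isoformat()}")
    soup = google_search("how do you get diabetes")
    save_results("how do you get diabetes", extract_questions(soup))
    time.sleep(24 * 60 * 60)  # wait one day between runs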
Scraping Google’s People Also Ask (PAA) section can provide valuable insights into user behavior and search patterns, which can help you optimize your content strategy. With the Python script we developed, you can easily extract PAA questions and store them for further analysis. By staying informed on trending keywords, you’ll be better equipped to refine your content and boost SEO efforts over time.
For a more advanced approach, Oxylabs offers tools for scraping search results, including Google Featured Snippet API.
What are People Also Ask boxes?
People Also Ask boxes are sections that appear on Google’s search engine results pages (SERPs) containing questions related to the user's query. These boxes expand to show brief answers pulled from various websites. They help users explore additional information by providing quick answers to common follow-up questions.
Is it legal to scrape Google's People Also Ask?
If you plan to scrape Google, it’s important to respect its terms of service and use ethical practices, such as incorporating proxies and respecting rate limits. For scalable and ethical scraping, using an API like Oxylabs' Google API is highly recommended, as it can retrieve the data you need without violating Google's policies. Always observe the rate limits and guidelines specified by such services.
How does Google generate People Also Ask questions?
Google generates People Also Ask questions based on user search intent, analyzing vast numbers of keywords to identify related queries and surface relevant follow-up questions. For example, when you search for a topic, Google pulls up questions that have been frequently searched in relation to that query, offering helpful insights. This process is powered by complex algorithms and machine-learning techniques designed to provide the most relevant output.
Can you scrape Google search results?
Yes, it is possible to scrape Google search results, but it requires careful handling to avoid being blocked by Google's anti-scraping measures. Using proxies and rotating user agents helps mitigate issues like IP blocking. For a more efficient and scalable solution, many developers turn to API services, which provide structured data directly from Google’s search results without violating the terms of service. Scraping tools also often rely on libraries like BeautifulSoup or frameworks like Scrapy to extract and structure the output.
About the author
Maryia Stsiopkina
Senior Content Manager
Maryia Stsiopkina is a Senior Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.