How to Scrape Google News: Step-by-Step Guide


Danielius Radavicius
Google News is a personalized news aggregation platform that curates and highlights relevant stories worldwide based on user interests. It compiles news and headlines from various sources, ensuring easy access from any device. An essential feature is "Full Coverage," which delves deeper into stories by presenting diverse perspectives from different outlets and mediums.
In this tutorial, you'll learn how to scrape Google News data in two ways: by writing a custom scraper and by using a ready-made Google News Scraper. Along the way, you'll also learn how to mitigate Google News' anti-bot challenges. Before continuing, you may also want to check out this article to learn more about news scraping.
Make sure you have Python installed from the official website. The code samples shown in this blog post are written using Python 3.12.0. You'll also need to install the following libraries:
beautifulsoup4==4.13.3
pandas==2.2.3
requests==2.32.3
You can use pip to install these modules via your terminal:
pip install requests==2.32.3 beautifulsoup4==4.13.3 pandas==2.2.3
The requests library simplifies making HTTP calls to Google, while beautifulsoup4 parses the raw HTML to extract your needed data, and pandas provides an easy way to save your results to CSV files.
There are a few distinct ways Google offers news results:
Through the dedicated Google News website:
https://news.google.com/home
By accessing the News tab on Google search results:
https://www.google.com/search?q=stock+market&tbm=nws
And the RSS feed URL:
https://news.google.com/rss/headlines/section/topic/WORLD
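As a side note, the RSS feed can be read without any HTML parsing at all. Below is a minimal sketch using requests and Python's standard library XML parser; it assumes the feed follows the standard RSS 2.0 layout (channel > item > title/link/pubDate):
import requests
import xml.etree.ElementTree as ET

# Fetch the World headlines feed and parse the XML with the standard library.
rss_response = requests.get('https://news.google.com/rss/headlines/section/topic/WORLD')
root = ET.fromstring(rss_response.content)

# Each <item> carries a headline, a link, and a publication date.
for item in root.iter('item'):
    print(item.findtext('title'), item.findtext('link'), item.findtext('pubDate'))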
For this tutorial, let’s scrape news articles from the search result pages.
Create a new Python file and import the libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Next, make a GET request to the Google News URL. Once the request returns a response, create a BeautifulSoup instance to prepare the HTML for parsing.
This approach doesn't use a headless browser, so any dynamic data loaded via JavaScript won't appear in the HTML file. To work around this limitation, either save the scraped HTML document and open it in your browser or use Dev Tools to disable JavaScript when viewing the Google News page. This step ensures your CSS selectors match the actual HTML you're scraping, not the JavaScript-enhanced version.
response = requests.get(
    url='https://www.google.com/search?q=stock+market&tbm=nws'
)

with open('page.html', 'w') as f:
    f.write(response.text)

soup = BeautifulSoup(response.text, 'html.parser')
You can open the saved HTML file in your browser to view the data you're working with. If you don't see the expected search results and instead encounter a policy page, CAPTCHA, or other unexpected content, refer to the next step below.
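If you'd rather detect this in code than inspect the saved file, a rough check like the one below can help. The markers are common signs of a block, CAPTCHA, or consent page, not an official list, so treat this as a heuristic sketch:
# Heuristic check for a block, CAPTCHA, or consent page (the markers are assumptions).
block_markers = ['unusual traffic', 'recaptcha', 'consent.google.com']

page_text = response.text.lower()
if response.status_code != 200 or any(marker in page_text for marker in block_markers):
    print(f'Possible block (HTTP {response.status_code}) - consider using proxies.')
else:
    print('Response looks like a regular results page.')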
If the content you want to scrape is blocked, your IP address might be the issue. To solve this problem, you can use proxy servers to replace your actual IP address with a proxy’s IP.
Residential Proxies offer superior performance as they utilize IP addresses provided by established Internet Service Providers (ISPs) with excellent online reputation. For this reason, let’s use this proxy type to make requests by modifying the previous code:
USER = 'proxy_username'
PASS = 'proxy_password'

response = requests.get(
    url='https://www.google.com/search?q=stock+market&tbm=nws',
    proxies={
        'http': f'https://customer-{USER}:{PASS}@us-pr.oxylabs.io:10000',
        'https': f'https://customer-{USER}:{PASS}@us-pr.oxylabs.io:10000'
    }
)
Visit our documentation to learn how to use Residential Proxies and see more code examples. Additionally, you may want to use a headless browser such as Selenium or Playwright to make your requests even more resilient to blocks.
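As a rough illustration of the headless-browser route, here's a minimal Playwright sketch (it assumes you've run pip install playwright and playwright install chromium); routing it through the same Residential Proxy endpoint is optional and mirrors the credentials used above:
from playwright.sync_api import sync_playwright

USER = 'proxy_username'
PASS = 'proxy_password'

with sync_playwright() as p:
    # Launch a headless Chromium instance, optionally routed through the proxy.
    browser = p.chromium.launch(
        headless=True,
        proxy={
            'server': 'http://us-pr.oxylabs.io:10000',
            'username': f'customer-{USER}',
            'password': PASS,
        },
    )
    page = browser.new_page()
    page.goto('https://www.google.com/search?q=stock+market&tbm=nws')
    html = page.content()  # Fully rendered HTML, ready for BeautifulSoup.
    browser.close()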
Note: contact our Customer Support Team to enable Google domains for your acquired proxies.
Once you’re able to access news headlines, the next step is to create the parsing logic with CSS selectors. The idea is to select all news article cards and then iterate through each card to extract specific data. Let’s start by finding a way to select all news articles on the page.
Open your browser and either load the saved page.html file or navigate to https://www.google.com/search?q=stock+market&tbm=nws and disable JavaScript rendering (see this tutorial for the Chrome browser).
Next, open Developer Tools by right-clicking anywhere on the web page and selecting Inspect. Make sure you’re inside the Elements tab (or Inspector) and have enabled element selection by clicking the pointer icon in the top-left corner of the developer tools panel. With this option enabled, you can click on any element on the page, and its corresponding HTML section will be highlighted in the panel.
Since there are two different types of news cards on the page, the CSS selector we’ll use is div.X7NTVe > a, div.pkphOe > a. Open the search function in the Dev Tools panel by pressing CTRL + F (or Cmd + F on Mac), then paste your CSS selector. This lets you quickly verify the selector works correctly.
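You can run the same check programmatically against the soup object created earlier:
# Quick sanity check: how many news cards does the selector match?
cards = soup.select('div.X7NTVe > a, div.pkphOe > a')
print(f'Found {len(cards)} news cards')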
In your Python file, add these lines of code:
articles = []
for article in soup.select('div.X7NTVe > a, div.pkphOe > a'):
You should find the news article title inside each card’s <h3> element.
With this in mind, you can update the code like so:
articles = []

for article in soup.select('div.X7NTVe > a, div.pkphOe > a'):
    title = article.select_one('h3').text
Next, the article’s link can be parsed from the href attribute.
Add this additional line inside the for loop:
href = article.get('href').replace('/url?q=', '')
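Note that the raw href may also carry extra Google tracking parameters after the target URL (for example an &sa=... suffix). If you see those in your results, a slightly more thorough cleanup with the standard library is one option; this sketch assumes the /url?q=<target>&... pattern holds:
from urllib.parse import urlparse, parse_qs

# Extract only the 'q' query parameter, falling back to the raw value if it's absent.
raw_href = article.get('href')
params = parse_qs(urlparse(raw_href).query)
href = params.get('q', [raw_href])[0]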
You can find the name of the publisher by selecting two different element classes: .aJyiOc, .lRVwie.
Include the following line in your code:
source = article.select_one('.aJyiOc, .lRVwie').text
For all articles, you can find the time of publication by selecting span.r0bn4c.rQMQod.
Your for loop should look like this now:
articles = []

for article in soup.select('div.X7NTVe > a, div.pkphOe > a'):
    title = article.select_one('h3').text
    href = article.get('href').replace('/url?q=', '')
    source = article.select_one('.aJyiOc, .lRVwie').text
    time = article.select_one('span.r0bn4c.rQMQod').text
Next, inside the loop, append all the parsed data for each article to the articles list:
    articles.append({
        'title': title,
        'link': href,
        'source': source,
        'published': time
    })
Finally, you can save all the extracted news articles to a CSV file using pandas:
df = pd.DataFrame(articles)
df.to_csv('news_1.csv', index=False)
The final version of your code should be:
import requests
from bs4 import BeautifulSoup
import pandas as pd

USER = 'proxy_username'
PASS = 'proxy_password'

response = requests.get(
    url='https://www.google.com/search?q=stock+market&tbm=nws',
    proxies={
        'http': f'https://customer-{USER}:{PASS}@us-pr.oxylabs.io:10000',
        'https': f'https://customer-{USER}:{PASS}@us-pr.oxylabs.io:10000'
    }
)

soup = BeautifulSoup(response.text, 'html.parser')

with open('page.html', 'w') as f:
    f.write(response.text)

articles = []

for article in soup.select('div.X7NTVe > a, div.pkphOe > a'):
    title = article.select_one('h3').text
    href = article.get('href').replace('/url?q=', '')
    source = article.select_one('.aJyiOc, .lRVwie').text
    time = article.select_one('span.r0bn4c.rQMQod').text

    articles.append({
        'title': title,
        'link': href,
        'source': source,
        'published': time
    })

df = pd.DataFrame(articles)
df.to_csv('news_1.csv', index=False)
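Keep in mind that Google changes its markup frequently, so select_one() may return None for some cards and the loop would then raise an AttributeError. A small helper like the sketch below keeps the loop running when a field is missing:
def safe_text(element, selector):
    # Return the matched element's text, or None if the selector doesn't match.
    found = element.select_one(selector)
    return found.text if found else None

# Inside the loop, the fields could then be parsed like this:
# title = safe_text(article, 'h3')
# source = safe_text(article, '.aJyiOc, .lRVwie')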
After executing this custom news scraper, you'll have all the articles neatly scraped into a CSV file, which can be opened in Excel, Google Sheets, or any other program that supports CSV.
Our Google News scraper streamlines current and future scraping projects while handling the common hassles for you. The Oxylabs Web Scraper API lets you access real-time data and scrape Google News results localized for almost any location. On top of that, a single purchase gives you access to multiple ready-made scrapers, including Google SERP, Amazon, and others, so you don't have to worry about anti-scraping measures.
Oxylabs also provides a 1-week free trial to thoroughly test and develop your scraper and explore all the functionalities of the Google News API. Visit our documentation to learn more.
Start a free trial to test our Web Scraper API.
Sign up and log in to the dashboard. From there, you can create and grab your user credentials for the Web API. They will be needed in later steps.
Create a new Python file and import the modules:
import requests
import pandas as pd
Let's prepare the payload dictionary and credentials to send API requests and start scraping data. First, replace the USERNAME and PASSWORD with your sub-account credentials.
credentials = ('USERNAME', 'PASSWORD')
What’s neat about Web Scraper API is that it automatically parses the data when you set the parse parameter to True, so you don’t have to inspect the HTML yourself. Additionally, you can easily scrape multiple pages by using the pages parameter. To scrape Google News, set the following parameters:
payload = {
    'source': 'google_search',  # Define Google Search as the source.
    'query': 'stock market',    # Your search query.
    'pages': '5',               # Number of pages to scrape.
    'parse': True,              # Enable automatic data parsing.
    'context': [
        {'key': 'tbm', 'value': 'nws'},  # Enable News results.
    ]
}
You can also enable JavaScript rendering by setting the render parameter to html, if needed:
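# Optional: add this to the payload above to enable JavaScript rendering.
payload['render'] = 'html'
Next, use the post() method of the requests module to POST the payload and credentials to the API.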
response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=credentials,
    json=payload,
)
print(response.json())
If everything works, you should see the status code 200 inside the JSON response. If you get any other response code, please refer to the documentation.
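If you prefer an explicit guard before parsing, a minimal check of the HTTP status of the API call could look like this:
# Stop early if the API call itself didn't succeed.
if response.status_code != 200:
    raise RuntimeError(f'Request failed with status {response.status_code}: {response.text}')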
Finally, you can save the parsed news results to a file. It’s a good idea to first clean up the JSON response by extracting only the relevant information:
data = [item['content'] for item in response.json()['results']]

all_news = []
for page_data in data:
    page_num = page_data['page']
    for news in page_data['results']['main']:
        news['page'] = page_num
        all_news.append(news)
This will ensure that the results aren’t nested and include only the page number and news data from each page.
Next, let’s store the all_news list into a data frame object. Then, you can export it to a CSV file using the to_csv() method. You can also set the index to False so that the CSV won’t include an extra index column.
df = pd.DataFrame(all_news)
df.to_csv('news_2.csv', index=False)
Here’s the complete API code sample:
import requests
import pandas as pd

credentials = ('USERNAME', 'PASSWORD')

payload = {
    'source': 'google_search',
    'query': 'stock market',
    'pages': '5',
    'parse': True,
    'context': [
        {'key': 'tbm', 'value': 'nws'},
    ]
}

response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=credentials,
    json=payload,
)
print(response.json())

data = [item['content'] for item in response.json()['results']]

all_news = []
for page_data in data:
    page_num = page_data['page']
    for news in page_data['results']['main']:
        news['page'] = page_num
        all_news.append(news)

df = pd.DataFrame(all_news)
df.to_csv('news_2.csv', index=False)
Running the code will produce a CSV file containing the parsed news data.
When scraping Google News or other challenging targets, selecting the appropriate technique is essential. The following table provides a comparison of different approaches:
| Scraping method | Success rate | Handling blocks | Speed | Ease of use | Maintenance effort |
|---|---|---|---|---|---|
| No proxies | Low | Frequent IP bans | Fast | Simple | High – needs manual fixes due to blocks |
| With proxies | Medium | Better, requires IP rotation | Moderate | Moderate | Low – may need proxy management if not provided by the proxy provider |
| Headless browser | Medium | Can handle some blocks but may be detected | Slow | Complex | High – requires CAPTCHA handling and anti-scraping evasion |
| Web Scraper API | High | Bypasses anti-scraping systems | Fast | Easy | Low – no need for manual adjustments |
Using Oxylabs web scraping solutions, you can keep up to date with the latest news from Google News. Take advantage of Oxylabs' powerful Web Scraper API to enhance your overall scraping experience. By using the techniques described in this article, you can harness the power of Google News data without worrying about proxy rotation or anti-bot challenges.
Proxies are essential for block-free web scraping. To resemble organic traffic, you can buy proxy solutions, most notably residential proxies and datacenter IPs, or get a reliable free proxy server.
Want to broaden your Google data scraping skills? Take a look at our guides for scraping Jobs, Search, Images, Trends, Scholar, Flights, Shopping, and Maps.
The answer isn’t a simple “yes” or “no”. Before scraping public data available on Google News, you should consult with legal professionals to make sure your use case and the data you want to scrape don’t violate any laws or regulations.
Scraping Google News data involves building a custom scraper or utilizing a dedicated web scraping API. The latter is the best option if you want to avoid the hassle of dealing with complex coding, proxy management, headless browsers, and other common web scraping difficulties.
It depends on the data you want to scrape and how you use it. Scraping public web data is generally considered legal when it’s performed without violating any local or international laws and regulations. However, you should always seek legal advice and review any terms before engaging in scraping activities. To learn more on this topic, check out this in-depth article on whether web scraping is legal.
To scrape Google News articles effectively, you may want to equip yourself with a web scraping tool that handles blocks, CAPTCHAs, and infrastructure management so you can focus on results. Alternatively, you can create your own web scraper using a preferred programming language, a headless browser (Selenium, Playwright, Puppeteer, or similar), a parser, and rotating HTTP proxies to overcome IP blocks.
About the author
Danielius Radavicius
Former Copywriter
Danielius Radavičius was a Copywriter at Oxylabs. Having grown up surrounded by films, music, and books, and with a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.