Google Scholar is a search engine for accessing academic data. With Scholar, you can retrieve scientific articles, research papers, and theses. However, academic research often requires gathering and handpicking large amounts of data.
Going through endless results manually is a tedious task. With automation, you can extract data like titles, authors, and citations from each result on a Google Scholar page in seconds.
In this tutorial, you’ll learn how to collect data from Google Scholar with Oxylabs Google Scholar API and Python.
Claim a 7-day free trial to test our Web Scraper API for any use case.
You can find the following code on our GitHub.
You can download the latest version of Python from the official website.
To store the Python code, create a new Python file in your current directory by running the following command:
touch main.py
For this tutorial, let’s use Requests and Beautiful Soup. Run the following command to install the necessary dependencies.
pip3 install requests beautifulsoup4
Open the previously created Python file with an editor of your choice and import the installed libraries.
import requests
from bs4 import BeautifulSoup
After importing the installed libraries, prepare the API request for SERP Scraper API (a part of Web Scraper API).
Start by declaring the USERNAME and PASSWORD variables – your API credentials. You can retrieve these values from the Oxylabs dashboard.
For the url parameter, paste the URL of the Google Scholar page you wish to scrape. Let’s use the Google Scholar results for scholarly articles related to global warming.
Make sure the source parameter is set to google.
You can find all of the API parameters in our documentation.
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"
payload = {
    "url": "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5",
    "source": "google",
}
Now, use the declared payload and credentials to send an API request and retrieve HTML content from the page. Pass the payload and credentials to the json and auth parameters, respectively.
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=(USERNAME, PASSWORD),
    json=payload,
)
response.raise_for_status()
As a good practice, add the response.raise_for_status() line after the POST request. It raises an exception if something goes wrong with the request, so you can be sure the received response contains the expected content rather than an error code.
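If you'd rather handle a failed request gracefully instead of letting the script stop with a traceback, you could catch the exception that raise_for_status() raises. Here's a minimal optional sketch, not part of the original tutorial:

try:
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(USERNAME, PASSWORD),
        json=payload,
    )
    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    # The API responded with a 4xx or 5xx status code.
    print(f"Request failed: {error}")
    raise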
Next, you can extract the HTML content from the response:
html = response.json()["results"][0]["content"]
For cleaner code, move this part to a function called get_html_for_page with a url parameter so it can be reused later. The url parameter should be a link to the Google Scholar page, and the function should return the HTML content.
Here’s the full code for sending an API request and retrieving HTML content.
import requests
from bs4 import BeautifulSoup
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"
def get_html_for_page(url):
    payload = {
        "url": url,
        "source": "google",
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(USERNAME, PASSWORD),
        json=payload,
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]

url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"
html = get_html_for_page(url)
Use the previously installed BeautifulSoup library to parse the retrieved HTML.
First, go to the Google Scholar page you wish to scrape, right-click on the first result and click Inspect.
You should see that each result is wrapped in a div element with a class called gs_ri. Let’s use that to find each article on the page.
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("div", {"class": "gs_ri"})
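Class names on Google Scholar may change over time, and find_all() simply returns an empty list when nothing matches, so the script would silently produce no data. As an optional safeguard, not part of the original tutorial, you could add a quick sanity check:

if not articles:
    # Likely causes: the page layout changed or the request was blocked.
    raise RuntimeError("No results found on the page")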
Next, implement a separate function called parse_data_from_article to extract data for each article. The function should accept a BeautifulSoup object named article and return a dictionary.
def parse_data_from_article(article):
    ...
Now, let’s extract the title, authors, and link of a Google Scholar article. Find the title element for the article, which includes both the link and the title text.
title_elem = article.find("h3", {"class": "gs_rt"})
title = title_elem.get_text()
title_anchor_elem = article.select("a")[0]
url = title_anchor_elem["href"]
Let’s also retrieve the article ID found in the same anchor element you got the URL from. You’ll use it later for retrieving citations.
article_id = title_anchor_elem["id"]
Select the authors wrapped in a div with a class named gs_a.
authors = article.find("div", {"class": "gs_a"}).get_text()
Finally, let’s put it all together and return a dictionary with the collected data. Here’s the full code for the parse_data_from_article function.
def parse_data_from_article(article):
    title_elem = article.find("h3", {"class": "gs_rt"})
    title = title_elem.get_text()
    title_anchor_elem = article.select("a")[0]
    url = title_anchor_elem["href"]
    article_id = title_anchor_elem["id"]
    authors = article.find("div", {"class": "gs_a"}).get_text()
    return {
        "title": title,
        "authors": authors,
        "url": url,
    }
You can now use this function together with the previously written code to get data for each scraped article from your Google Scholar page. Here’s the code so far.
import requests
from bs4 import BeautifulSoup
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"
def get_html_for_page(url):
    payload = {
        "url": url,
        "source": "google",
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(USERNAME, PASSWORD),
        json=payload,
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]

def parse_data_from_article(article):
    title_elem = article.find("h3", {"class": "gs_rt"})
    title = title_elem.get_text()
    title_anchor_elem = article.select("a")[0]
    url = title_anchor_elem["href"]
    article_id = title_anchor_elem["id"]
    authors = article.find("div", {"class": "gs_a"}).get_text()
    return {
        "title": title,
        "authors": authors,
        "url": url,
    }

url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"
html = get_html_for_page(url)
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("div", {"class": "gs_ri"})
data = [parse_data_from_article(article) for article in articles]
Now that you have the initial data, let’s see how to get citations for each article. Since the citations aren’t directly present in the received HTML, you’ll need to make an additional API call.
To get citations, construct a new URL with the article ID retrieved earlier. The URL should look like this:
https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite
Let’s start by implementing a new function called get_citations. It should receive an article_id as a parameter and return a list of dictionaries with citations.
def get_citations(article_id):
    ...
You can reuse the previously defined function get_html_for_page() to retrieve HTML for citations. Let’s retrieve HTML content and create another soup object.
url = f"https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite"
html = get_html_for_page(url)
soup = BeautifulSoup(html, "html.parser")
Next, use the soup object to retrieve the title and content of each citation. Since the citations are structured in an HTML table, find them using the tr selector.
data = []
for citation in soup.find_all("tr"):
    ...
Now, extract the title and content of each citation and form them into a dictionary.
title = citation.find("th", {"class": "gs_cith"}).get_text(strip=True)
content = citation.find("div", {"class": "gs_citr"}).get_text(strip=True)
entry = {
    "title": title,
    "content": content,
}
data.append(entry)
Here’s the full picture of the get_citations function.
def get_citations(article_id):
url = f"https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite"
html = get_html_for_page(url)
soup = BeautifulSoup(html, "html.parser")
data = []
for citation in soup.find_all("tr"):
title = citation.find("th", {"class": "gs_cith"}).get_text(strip=True)
content = citation.find("div", {"class": "gs_citr"}).get_text(strip=True)
entry = {
"title": title,
"content": content,
}
data.append(entry)
return data
Now that you have a function for retrieving the citations of each article, include it in the parse_data_from_article() function. Add the citations to the returned dictionary using the previously extracted article_id variable.
def parse_data_from_article(article):
    title_elem = article.find("h3", {"class": "gs_rt"})
    title = title_elem.get_text()
    title_anchor_elem = article.select("a")[0]
    url = title_anchor_elem["href"]
    article_id = title_anchor_elem["id"]
    authors = article.find("div", {"class": "gs_a"}).get_text()
    return {
        "title": title,
        "authors": authors,
        "url": url,
        "citations": get_citations(article_id),
    }
Here’s the full code so far.
import requests
from bs4 import BeautifulSoup
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"
def get_html_for_page(url):
    payload = {
        "url": url,
        "source": "google",
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(USERNAME, PASSWORD),
        json=payload,
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]

def get_citations(article_id):
    url = f"https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite"
    html = get_html_for_page(url)
    soup = BeautifulSoup(html, "html.parser")
    data = []
    for citation in soup.find_all("tr"):
        title = citation.find("th", {"class": "gs_cith"}).get_text(strip=True)
        content = citation.find("div", {"class": "gs_citr"}).get_text(strip=True)
        entry = {
            "title": title,
            "content": content,
        }
        data.append(entry)
    return data

def parse_data_from_article(article):
    title_elem = article.find("h3", {"class": "gs_rt"})
    title = title_elem.get_text()
    title_anchor_elem = article.select("a")[0]
    url = title_anchor_elem["href"]
    article_id = title_anchor_elem["id"]
    authors = article.find("div", {"class": "gs_a"}).get_text()
    return {
        "title": title,
        "authors": authors,
        "url": url,
        "citations": get_citations(article_id),
    }

url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"
html = get_html_for_page(url)
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("div", {"class": "gs_ri"})
data = [parse_data_from_article(article) for article in articles]
Since a single Google Scholar page contains only 10 articles, it’s useful to have an option to configure how many pages you want to scrape. To do so, let’s implement additional logic for scraping multiple Google Scholar pages.
To start, navigate to the second result page in Google Scholar and inspect the URL. You can see that the URL looks like this:
https://scholar.google.com/scholar?start=10&q=global+warming+&hl=en&as_sdt=0,5
From here, use the start parameter to indicate the result offset of the page you want to scrape (each page holds 10 results, so the second page starts at 10). Let’s implement a simple helper function called get_url_for_page to construct the URL for each page. The function should accept arguments named url and page_index.
def get_url_for_page(url, page_index):
    return url + f"&start={page_index}"
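As a quick illustration of what this helper produces, passing an offset of 10 yields the URL of the second results page:

second_page_url = get_url_for_page(
    "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5", 10
)
# second_page_url is now:
# "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5&start=10"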
Now, declare a few more variables to make it work: NUM_OF_PAGES to indicate how many pages you want to scrape, and page_index to mark the starting offset for the script.
NUM_OF_PAGES = 2
page_index = 0  # start at the first page (offset 0)
Next, move your main code into a for loop that runs for the declared number of pages and increments the page_index value by 10 on each iteration. Here’s how it should look.
def get_url_for_page(url, page_index):
    return url + f"&start={page_index}"

data = []
url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"
NUM_OF_PAGES = 2
page_index = 0
for _ in range(NUM_OF_PAGES):
    page_url = get_url_for_page(url, page_index)
    html = get_html_for_page(page_url)
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.find_all("div", {"class": "gs_ri"})
    entries = [parse_data_from_article(article) for article in articles]
    data.extend(entries)
    page_index += 10
The resulting code gives a list of dictionaries with article data from the specified number of Google Scholar pages. To make the script cleaner, move your data-retrieving code to a function named get_data_from_page.
def get_url_for_page(url, page_index):
    return url + f"&start={page_index}"

def get_data_from_page(url):
    html = get_html_for_page(url)
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.find_all("div", {"class": "gs_ri"})
    return [parse_data_from_article(article) for article in articles]

data = []
url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"
NUM_OF_PAGES = 2
page_index = 0
for _ in range(NUM_OF_PAGES):
    page_url = get_url_for_page(url, page_index)
    entries = get_data_from_page(page_url)
    data.extend(entries)
    page_index += 10
Here’s the complete code for scraping Google Scholar with Oxylabs Web Scraper API and Python.
import requests
import pandas as pd
from bs4 import BeautifulSoup
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"
def get_html_for_page(url):
    payload = {
        "url": url,
        "source": "google",
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(USERNAME, PASSWORD),
        json=payload,
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]

def get_citations(article_id):
    url = f"https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite"
    html = get_html_for_page(url)
    soup = BeautifulSoup(html, "html.parser")
    data = []
    for citation in soup.find_all("tr"):
        title = citation.find("th", {"class": "gs_cith"}).get_text(strip=True)
        content = citation.find("div", {"class": "gs_citr"}).get_text(strip=True)
        entry = {
            "title": title,
            "content": content,
        }
        data.append(entry)
    return data

def parse_data_from_article(article):
    title_elem = article.find("h3", {"class": "gs_rt"})
    title = title_elem.get_text()
    title_anchor_elem = article.select("a")[0]
    url = title_anchor_elem["href"]
    article_id = title_anchor_elem["id"]
    authors = article.find("div", {"class": "gs_a"}).get_text()
    return {
        "title": title,
        "authors": authors,
        "url": url,
        "citations": get_citations(article_id),
    }

def get_url_for_page(url, page_index):
    return url + f"&start={page_index}"

def get_data_from_page(url):
    html = get_html_for_page(url)
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.find_all("div", {"class": "gs_ri"})
    return [parse_data_from_article(article) for article in articles]

data = []
url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"
NUM_OF_PAGES = 2
page_index = 0
for _ in range(NUM_OF_PAGES):
    page_url = get_url_for_page(url, page_index)
    entries = get_data_from_page(page_url)
    data.extend(entries)
    page_index += 10
To see the collected data, simply add this line at the end of your script:
print(data)
You should see a list of dictionaries, each containing the title, authors, URL, and citations of an article.
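You may have noticed that the complete code above imports pandas. It isn't required for the scraping itself, but if you'd like the results in tabular form, you could flatten them into a DataFrame and export a CSV file. Here's an optional sketch; it assumes you install pandas separately (pip3 install pandas), as it isn't among the dependencies installed earlier:

import pandas as pd

# Flatten the list of dictionaries into a table; the nested
# "citations" lists are kept as-is in a single column.
df = pd.DataFrame(data)
df.to_csv("google_scholar_results.csv", index=False)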
As the results show, using Oxylabs Web Scraper API with Python is a seamless way to collect structured data from Google Scholar for research.
For more Google tutorials, see Jobs, Search, Images, Trends, News, Flights, Shopping, and Maps web scraping guides.
Integrating proxies is key to block-free web scraping. To appear organic on a target website, you can buy proxy server solutions of various types, each suitable for different data extraction scenarios.
If you have any questions about scraping Google Scholar or want to learn more about our solutions, drop us a line at support@oxylabs.io or via the live chat on our homepage.
Before scraping Google Scholar, please consult with legal professionals to be sure that you aren't breaching third-party rights, including but not limited to intellectual property rights.
To learn more, read Is Web Scraping Legal?
To extract Google Scholar data at scale, you can build your own web scraper using a preferred language. Some languages, such as Python, R, Ruby, PHP, Node.js, and Java, are well suited for web data extraction, while others, like C++, are less convenient for web interactions.
Another option is to use third-party tools, such as APIs, to automate the bulk of the process.
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.