How to Scrape Wikipedia Data: Ultimate Tutorial
Roberta Aukstikalnyte
Wikipedia is known as one of the biggest online sources of information, covering a wide variety of topics. Naturally, it possesses a ton of valuable information for research or analysis. However, to obtain this information at scale, you’ll need specific tools, proxies, and knowledge.
In this article, we’ll give answers to questions like “Is scraping Wikipedia allowed?” or “What exact information can be extracted?”. In the second portion of the article, we’ll give the exact steps for extracting publicly available information from Wikipedia using Python and Oxylabs’ Wikipedia Scraper API (part of Web Scraper API). We’ll go through steps for extracting different types of Wikipedia article data, such as paragraphs, links, tables, and images.
Let’s get started.
Let's start by creating a Python file for our scraper:
touch main.py
Within the created file, we’ll begin assembling a request for the Web Scraper API:
import requests

USERNAME = 'yourUsername'
PASSWORD = 'yourPassword'

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
}
The USERNAME and PASSWORD variables hold our credentials for authenticating with the API. The payload holds the query parameters that the API supports. In our case, we have to specify source and url. The source parameter denotes the type of scraper that should process the request, and universal works just fine for us. The url parameter tells the API which page to scrape. For more information about the other available parameters, check out the Web Scraper API documentation.
After specifying the information required for the API, we can form and send the request:
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=(USERNAME, PASSWORD),
    json=payload,
)

print(response.status_code)
If we set everything up correctly, the code should print out 200 as the status.
Now that we can send requests to the API, we can start scraping the specific data we need. Without further instructions, the scraper returns the raw HTML of the page. With Custom Parser, however, we can specify exactly what we want to get back. Custom Parser is a free Scraper APIs feature that lets you define your own parsing and data processing logic for the HTML.
To start off, we can get the most obvious element: paragraphs of text. For that, we need a CSS selector that matches them. Inspecting the page, we can see that the text sits inside p (paragraph) elements.
We can edit our payload to the API like this:
payload = {
    'source': 'universal',
    'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
    'parse': True,
    "parsing_instructions": {
        "paragraph_text": {
            "_fns": [
                # Select every <p> element, then keep only its text.
                {"_fn": "css", "_args": ["p"]},
                {"_fn": "element_text"}
            ]
        }
    }
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=(USERNAME, PASSWORD),
    json=payload,
)

print(response.json())
Here, we pass two additional parameters: parse indicates that we want to use Custom Parser with our request, and parsing_instructions lets us specify what needs to be parsed. In this case, we add one function that fetches elements by CSS selector and another that extracts the text of each element.
After running the code, the response holds the parsed text under the paragraph_text key inside each result's content.
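A minimal sketch of reading the parsed paragraphs out of that response, assuming the results, content, and paragraph_text layout that the complete script later in this article relies on. The sample dictionary and the collect_paragraphs helper are made up for illustration and are not real API output.

```python
# Hypothetical response, trimmed to the keys used in this article.
sample_response = {
    "results": [
        {
            "content": {
                "paragraph_text": [
                    "  First example paragraph.\n",
                    "",
                    "Second example paragraph. ",
                ]
            }
        }
    ]
}


def collect_paragraphs(response_json):
    """Return non-empty paragraph strings from every result, stripped of whitespace."""
    paragraphs = []
    for result in response_json.get("results", []):
        content = result.get("content") or {}
        texts = content.get("paragraph_text") or []
        # Assumption: a single match might arrive as one string instead of a list.
        if isinstance(texts, str):
            texts = [texts]
        for text in texts:
            cleaned = text.strip()
            if cleaned:
                paragraphs.append(cleaned)
    return paragraphs


print(collect_paragraphs(sample_response))
```

With a live request, response.json() would take the place of sample_response.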
To scrape the links, we first need an XPath selector for them. As we can see, they are inside an HTML element named a.
To fetch these links now, we have to edit our Custom Parser as follows:
payload = {
    'source': 'universal',
    'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
    'parse': True,
    "parsing_instructions": {
        "links": {
            "_fns": [
                # Take the href attribute of every <a> element.
                {"_fn": "xpath", "_args": ["//a/@href"]},
            ]
        },
    }
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=(USERNAME, PASSWORD),
    json=payload,
)

print(response.json())
If we look at the response, we can see that some of the links are relative rather than full URLs.
...
'/wiki/Wikipedia:Contents',
'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement',
'//en.m.wikipedia.org/w/index.php?title=Michael_Phelps&mobileaction=toggle_view_mobile',
...
Let’s write some code that would retrieve the links and put them inside one collection:
def extract_links():
    # Reads the module-level result_set assigned by the loop below.
    links = []
    for link in result_set['content']['links']:
        if link and not link.startswith(('http://', 'https://')):
            if link.startswith('//'):
                # Protocol-relative link: add the scheme.
                link = 'https:' + link
            else:
                # Site-relative link: add the Wikipedia host.
                link = 'https://en.wikipedia.org' + link
        links.append(link)
    return links


raw_scraped_data = response.json()
processed_links = {}

for result_set in raw_scraped_data['results']:
    processed_links['links'] = extract_links()

print(processed_links)
First, we define the extract_links function. It checks whether each link is relative and, if so, turns it into a full URL.
Then, we take the scraped and parsed links from the API response, iterate through the results, and collect the processed links in a single dictionary.
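As an alternative to prefixing strings by hand, relative links can also be resolved with urljoin from Python's standard library. The sketch below is only an illustration: normalize_links and PAGE_URL are hypothetical names, the de-duplication step is an addition that the tutorial's code doesn't perform, and fragment-only links such as #section resolve against the page URL rather than the site root.

```python
from urllib.parse import urljoin

# The page that was scraped; relative links are resolved against it.
PAGE_URL = "https://en.wikipedia.org/wiki/Michael_Phelps"


def normalize_links(raw_links, base_url=PAGE_URL):
    """Resolve each non-empty link against base_url, dropping duplicates."""
    seen = set()
    normalized = []
    for link in raw_links:
        if not link:
            continue
        absolute = urljoin(base_url, link)
        if absolute not in seen:
            seen.add(absolute)
            normalized.append(absolute)
    return normalized


# The three example links from the response excerpt above.
sample = [
    "/wiki/Wikipedia:Contents",
    "https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement",
    "//en.m.wikipedia.org/w/index.php?title=Michael_Phelps&mobileaction=toggle_view_mobile",
]

for url in normalize_links(sample):
    print(url)
```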
The next bit of information we could gather is from tables. Let's find a CSS selector for them.
The table HTML element is used in several ways across the page, but we will limit ourselves to extracting the information from the tables that are within the body of text. As we can see, such tables have the wikitable CSS class.
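A sketch of the payload for this step, based on the table instruction used in the complete script later in this article. The parse flag is written as a Python boolean here, and the json.dumps call is only there to show what gets sent.

```python
import json

# Same request as before, now with a "table" parsing instruction.
payload = {
    "source": "universal",
    "url": "https://en.wikipedia.org/wiki/Michael_Phelps",
    "parse": True,
    "parsing_instructions": {
        "table": {
            "_fns": [
                # No element_text step here: the HTML of each matched table
                # is kept so that it can be passed to a table parser.
                {"_fn": "css", "_args": ["table.wikitable"]},
            ]
        }
    },
}

print(json.dumps(payload, indent=2))
```

The request itself is sent the same way as in the previous steps.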
We can begin writing our code:
from io import StringIO

import pandas


def extract_tables(html):
    # Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings.
    list_of_df = pandas.read_html(StringIO(html))
    tables = []
    for df in list_of_df:
        tables.append(df.to_json(orient='table'))
    return tables


raw_scraped_data = response.json()
processed_tables = {}

for result_set in raw_scraped_data['results']:
    processed_tables['tables'] = list(map(extract_tables, result_set['content']['table']))

print(processed_tables)
The extract_tables function accepts the HTML of a table element and parses the table into a JSON structure with the help of the Pandas Python library.
Then, as previously done with the links, we iterate over the results, map each table to our extraction function, and get our array of processed information.
Note: a table doesn’t always directly translate into a JSON structure in an organized way, so you might want to customize the information you gather from the table or use another format.
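One possible way to customize the output, sketched under the assumption that each string comes from to_json(orient='table') and therefore carries a schema block and a data list. The table_json_to_rows helper and the sample values are hypothetical; the sample is hand-written rather than scraped.

```python
import json

# Hand-written stand-in for one string returned by df.to_json(orient='table').
sample_table_json = json.dumps({
    "schema": {
        "fields": [
            {"name": "index", "type": "integer"},
            {"name": "Event", "type": "string"},
            {"name": "Result", "type": "string"},
        ],
        "primaryKey": ["index"],
    },
    "data": [
        {"index": 0, "Event": "Example event A", "Result": "1st"},
        {"index": 1, "Event": "Example event B", "Result": "2nd"},
    ],
})


def table_json_to_rows(table_json):
    """Return the data rows of a table JSON string without its index columns."""
    table = json.loads(table_json)
    index_keys = set(table.get("schema", {}).get("primaryKey") or [])
    return [
        {key: value for key, value in row.items() if key not in index_keys}
        for row in table.get("data", [])
    ]


for row in table_json_to_rows(sample_table_json):
    print(row)
```

Plain row dictionaries like these are usually easier to write to CSV or load into another tool than the full schema-plus-data structure.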
The final elements we’ll be retrieving are image sources. Time to find their XPath selector.
We can see that images have their own HTML element called img. The only thing left is to write the code.
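For reference, a sketch of the payload for this step, using the images instruction from the complete script later in this article. As before, the parse flag is written as a Python boolean, and the request is sent the same way as in the earlier steps.

```python
import json

# Same request as before, now with an "images" parsing instruction.
payload = {
    "source": "universal",
    "url": "https://en.wikipedia.org/wiki/Michael_Phelps",
    "parse": True,
    "parsing_instructions": {
        "images": {
            "_fns": [
                # Take the src attribute of every <img> element.
                {"_fn": "xpath", "_args": ["//img/@src"]},
            ]
        }
    },
}

print(json.dumps(payload, indent=2))
```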
def extract_images():
    # Reads the module-level result_set assigned by the loop below.
    images = []
    for img in result_set['content']['images']:
        # Sources that already contain 'https://' are skipped.
        if 'https://' not in img:
            if '/static/images/' in img:
                # Site-relative path: add the Wikipedia host.
                img = 'https://en.wikipedia.org' + img
            else:
                # Protocol-relative URL: add the scheme.
                img = 'https:' + img
            images.append(img)
    return images


raw_scraped_data = response.json()
processed_images = {}

for result_set in raw_scraped_data['results']:
    processed_images['images'] = extract_images()

print(processed_images)
We begin by defining the extract_images function, which takes the scraped and parsed image URLs and turns the relative ones into full links. Sources that already contain https:// are skipped.
Now that we've gone over extracting a few different pieces of Wikipedia page data, let’s join it all together and save the result to a file for a finalized version of our scraper:
import json
from io import StringIO

import pandas
import requests


def extract_links():
    # Reads the module-level result_set assigned by the loop below.
    links = []
    for link in result_set['content']['links']:
        if link and not link.startswith(('http://', 'https://')):
            if link.startswith('//'):
                # Protocol-relative link: add the scheme.
                link = 'https:' + link
            else:
                # Site-relative link: add the Wikipedia host.
                link = 'https://en.wikipedia.org' + link
        links.append(link)
    return links


def extract_images():
    # Reads the module-level result_set assigned by the loop below.
    images = []
    for img in result_set['content']['images']:
        # Sources that already contain 'https://' are skipped.
        if 'https://' not in img:
            if '/static/images/' in img:
                img = 'https://en.wikipedia.org' + img
            else:
                img = 'https:' + img
            images.append(img)
    return images


def extract_tables(html):
    # Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings.
    list_of_df = pandas.read_html(StringIO(html))
    tables = []
    for df in list_of_df:
        tables.append(df.to_json(orient='table'))
    return tables


# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
    "parse": True,
    "parsing_instructions": {
        "paragraph_text": {
            "_fns": [
                {"_fn": "css", "_args": ["p"]},
                {"_fn": "element_text"}
            ]
        },
        "table": {
            "_fns": [
                {"_fn": "css", "_args": ["table.wikitable"]},
            ]
        },
        "links": {
            "_fns": [
                {"_fn": "xpath", "_args": ["//a/@href"]},
            ]
        },
        "images": {
            "_fns": [
                {"_fn": "xpath", "_args": ["//img/@src"]},
            ]
        }
    }
}

# Create and send the request
USERNAME = 'USERNAME'
PASSWORD = 'PASSWORD'

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=(USERNAME, PASSWORD),
    json=payload,
)

raw_scraped_data = response.json()
processed_data = {}

for result_set in raw_scraped_data['results']:
    processed_data['links'] = extract_links()
    # The table field is None when no table matched the selector.
    tables = result_set['content']['table']
    processed_data['tables'] = list(map(extract_tables, tables)) if tables is not None else []
    processed_data['images'] = extract_images()
    processed_data['paragraphs'] = result_set['content']['paragraph_text']

json_file_path = 'data.json'

with open(json_file_path, 'w') as json_file:
    json.dump(processed_data, json_file, indent=4)

print(f'Data saved as {json_file_path}')
We hope that you found this web scraping tutorial helpful. Scraping public information from Wikipedia pages with Oxylabs’ Web Scraper API is a rather straightforward process. However, if you need to scale your scraping efforts, you can buy proxies to enhance performance and avoid restrictions.
If you run into any additional questions about web scraping, be sure to contact us at support@oxylabs.io and our professional customer support team will happily assist you.
You can gather publicly available information from Wikipedia with automated solutions, such as proxies, web scraping infrastructure like Oxylabs’ Web Scraper API, or a custom-built scraper.
When it comes to the legality of web scraping, it mostly depends on the type of data and whether it’s considered public. The data on Wikipedia articles is usually publicly available, so you should be able to scrape it. However, we always advise you to seek out professional legal assistance regarding your specific use case.
To scrape public Wikipedia page data, you’ll need an automated solution like Oxylabs’ Web Scraper API or a custom-built scraper. Web Scraper API is a web scraping infrastructure that, after receiving your request, gathers publicly available Wikipedia page data according to your request.
You can extract text from HTML elements by using Oxylabs’ Custom Parser feature and the {"_fn": "element_text"} function. For more information, check out the documentation. Alternatively, you can build your own parser, but it’ll take more time and resources.
About the author
Roberta Aukstikalnyte
Senior Content Manager
Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.