Data acquisition Tutorials Scrapers Python

How to Scrape Wikipedia Data: Ultimate Tutorial

Roberta Aukstikalnyte

Last updated by Augustas Pelakauskas

2025-04-17

5 min read

Wikipedia is known as one of the biggest online sources of information, covering various topics. Naturally, it possesses a ton of valuable information for research or analysis. However, to obtain this information at scale, you’ll need specific tools, proxies, and knowledge.

In this article, we’ll answer questions like “Is web scraping Wikipedia allowed?” or “What exact information can be extracted?”.

We’ll give the exact steps for extracting publicly available information from Wikipedia using Python and Oxylabs’ Wikipedia Scraper API (part of Web Scraper API). Also, you'll find out how to set up a scraper without an API, using proxies only.

We’ll go through steps for extracting different types of Wikipedia article data, such as paragraphs, links, tables, and images.

Let’s get started.

Try for free

Get a free trial to test our Web Scraper API.

Up to 2K results
No credit card required

1. Connecting to Web Scraper API

Let's start by creating a Python file for our scraper:

Copy

touch main.py

Within the created file, we’ll begin assembling a request for the Web Scraper API:

Copy

USERNAME = 'yourUsername'
PASSWORD = 'yourPassword'

# Structure payload.
payload = {
   'source': 'universal',
   'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
}

Variables USERNAME and PASSWORD will contain our credentials for authentication to the API. Payload will have all the various query parameters that the API supports. In our case, we have to specify the source and the url. Source denotes the type of scraper that should be used for processing this request, and the universal one will work just fine for us. Url will tell the API what link to scrape. For more information about the various possible parameters, check out the Web Scraper API documentation.

After specifying the information required for the API, we can form and send the request:

Copy

response = requests.request(
   'POST',
   'https://realtime.oxylabs.io/v1/queries',
   auth=(USERNAME, PASSWORD),
   json=payload,
)

print(response.status_code)

If we set everything up correctly, the code should print out 200 as the status.

2. Extracting specific data

Now that we can send requests to the API, we can start web scraping specific data that we require. Without specific instructions, our scraper will return a raw lump of HTML. However, we can use the Custom Parser functionality to specify and get exactly what we want. Custom Parser is a free Scraper API feature that lets you define your own parsing and data processing logic and parse HTML data.

1) Paragraphs

To start off, we can get the most obvious one - paragraphs of text. For that, we will need to find the CSS selector for the paragraphs of text. Inspecting the Wikipedia page, we can see that they are inside the paragraph element.

We can edit our payload to the API like this:

Copy

payload = {
   'source': 'universal',
   'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
   'parse': 'true',
   "parsing_instructions": {
       "paragraph_text": {
           "_fns": [
               {"_fn": "css", "_args": ["p"]},
               {"_fn": "element_text"}
           ]
       }
   }
}

response = requests.request(
  'POST',
  'https://realtime.oxylabs.io/v1/queries',
  auth=(USERNAME, PASSWORD),
  json=payload,
)

print(response.json())

Here, we pass two additional parameters: parse indicates that we want to use a Custom Parser with our request, and parsing_instructions allow us to specify what needs to be parsed. In this case, we add a function that fetches an element by CSS and another one that extracts the text of the element.

After running the code, we can see the response:

2) Links

To get the links scraped, first, we will need the xPath selector for them. As we can see, they are inside an HTML element named a.

To fetch these links now, we have to edit our Custom Parser as follows:

Copy

payload = {
   'source': 'universal',
   'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
   'parse': 'true',
   "parsing_instructions": {
       "links": {
           "_fns": [
               {"_fn": "xpath", "_args": ["//a/@href"]},
           ]
       },
   }
}

response = requests.request(
  'POST',
  'https://realtime.oxylabs.io/v1/queries',
  auth=(USERNAME, PASSWORD),
  json=payload,
)

print(response.json())

If we look at the response, we can see that the links sometimes are not full links, but rather relative ones.

Copy

...
'/wiki/Wikipedia:Contents'
'https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Cookie_statement',
'//en.m.wikipedia.org/w/index.php?title=Michael_Phelps&mobileaction=toggle_view_mobile',
...

Let’s write some code that would retrieve the links and put them inside one collection:

Copy

def extract_links():
    links = []
    for link in result_set['content']['links']:
        if link and not link.startswith(('https://')):
            if link.startswith('//'):
                link = 'https:' + link
            else:
                link = 'https://en.wikipedia.org' + link
            links.append(link)
    return links

raw_scraped_data = response.json()
processed_links = {}

for result_set in raw_scraped_data['results']:
    processed_links['links'] = extract_links()

print(processed_links)

First, we define a function extract_links. It checks if the link is relative and modifies it if it is.

Now, we take the scraped and parsed links from the API response that we got and iterate through the results.

3) Tables

Next, let’s see how to scrape HTML tables. Let's find a css selector for them.

A table as an HTML element is used in multiple ways across the Wikipedia page, but we will limit ourselves to extracting the information from the tables that are within the body of text. As we can see, such tables are of a Wikitable CSS class.

We can begin writing our code:

Copy

def extract_tables(html):
   list_of_df = pandas.read_html(html)
   tables = []
   for df in list_of_df:
       tables.append(df.to_json(orient='table'))
   return tables

raw_scraped_data = response.json()
processed_tables = {}

for result_set in raw_scraped_data['results']:
   tables = list(map(extract_tables, result_set['content']['table']))
   processed_tables = tables

print(processed_tables)

The extract_tables function accepts the HTML of a table element and parses the table into a JSON structure with the help of the Pandas Python library.

Then, as previously done with the links, we iterate over the results, map each table to our extraction function, and get our array of processed information.

Note: a table doesn’t always directly translate into a JSON structure in an organized way, so you might want to customize the information you gather from the table or use another format.

4) Images

The final elements we’ll be retrieving are image sources. Time to find their xPath selector.

We can see that images have a special HTML element called img. The only thing left is to write the code.

Copy

def extract_images():
    images = []
    for img in result_set['content']['images']:
        if 'https://' not in img:
            if '/static/images/' in img:
                img = 'https://en.wikipedia.org' + img
            else:
                img = 'https:' + img
            images.append(img)
        else:
            continue
    return images

raw_scraped_data = response.json()
processed_images = {}

for result_set in raw_scraped_data['results']:
    processed_images['images'] = extract_images()

print(processed_images)

We begin by defining the extract_images function, which takes the scraped and parsed image URLs and checks whether they are relative or full links.

3. Joining everything together

Now that we have gone over extracting a few different pieces of Wikipedia page data, let’s join it all together and add some saving to a file for a finalized version of our scraper:

Copy

import requests
import json
import pandas

def extract_links():
    links = []
    for link in result_set['content']['links']:
        if link and not link.startswith(('https://')):
            if link.startswith('//'):
                link = 'https:' + link
            else:
                link = 'https://en.wikipedia.org' + link
            links.append(link)
    return links


def extract_images():
    images = []
    for img in result_set['content']['images']:
        if 'https://' not in img:
            if '/static/images/' in img:
                img = 'https://en.wikipedia.org' + img
            else:
                img = 'https:' + img
            images.append(img)
        else:
            continue
    return images


def extract_tables(html):
   list_of_df = pandas.read_html(html)
   tables = []
   for df in list_of_df:
       tables.append(df.to_json(orient='table'))
   return tables

# Structure payload.
payload = {
   'source': 'universal',
   'url': 'https://en.wikipedia.org/wiki/Michael_Phelps',
   "parse": 'true',
   "parsing_instructions": {
       "paragraph_text": {
           "_fns": [
               {"_fn": "css", "_args": ["p"]},
               {"_fn": "element_text"}
           ]
       },
       "table": {
           "_fns": [
               {"_fn": "css", "_args": ["table.wikitable"]},
           ]
       },
        "links": {
           "_fns": [
               {"_fn": "xpath", "_args": ["//a/@href"]},
           ]
       },
        "images": {
            "_fns": [
                {"_fn": "xpath", "_args": ["//img/@src"]},
            ]
        }
   }
}

# Create and send the request
USERNAME = 'USERNAME'
PASSWORD = 'PASSWORD'

response = requests.request(
   'POST',
   'https://realtime.oxylabs.io/v1/queries',
   auth=(USERNAME, PASSWORD),
   json=payload,
)

raw_scraped_data = response.json()
processed_data = {}

for result_set in raw_scraped_data['results']:

    processed_data['links'] = extract_links()

    processed_data['tables'] = list(map(extract_tables, result_set['content']['table'])) if result_set['content']['table'] is not None else []

    processed_data['images'] = extract_images()

    processed_data['paragraphs'] = result_set['content']['paragraph_text']

json_file_path = 'data.json'

with open(json_file_path, 'w') as json_file:
   json.dump(processed_data, json_file, indent=4) 

print(f'Data saved as {json_file_path}')

Web scraping Wikipedia using Python and proxies

You can extract data from Wikipedia using Python web scraping with your favorite IDE, no API is required. However, to extract data from multiple Wikipedia web pages at scale, using proxies is advisable due to various anti-bot measures.

Proxies help avoid IP blocks and limits. They route requests through different IP addresses, making it appear as if multiple users are accessing the site rather than a single web scraper.

Overall, Wikipedia is one of the best websites to scrape due to its simple structure, not featuring some of the more complex features like JavaScript-based dynamic content.

The foundation of a web scraping script is selecting the right Python libraries. Beautiful Soup and Requests are a must for sending HTTP requests and parsing data.

Here's a typical workflow on how to scrape data from Wikipedia:

Identify target data - Determine exactly which Wikipedia data points to extract.
Set up your environment - download Python, choose an IDE, and install the necessary libraries.
Get proxies - Obtain reliable proxy IPs from a proxy service.
Design request headers - Set up realistic user agents and other headers to mimic browser behavior.
Code the Wiki scraper - Create functions to send requests through proxies, select HTML elements, parse HTML, and extract data.
Store the data - Save extracted data to a database or a structured file format (CSV, JSON).

You can provide your proxy details within the get() function. Create a dictionary variable with proxy details, including address (host), port, and authentication credentials.

Let’s add Oxylabs Residential Proxies:

Copy

proxies = {
    "http": "https://USERNAME:PASSWORD@pr.oxylabs.io:7777",
    "https": "https://USERNAME:PASSWORD@pr.oxylabs.io:7777"
}

Here’s the script that extracts the initial text paragraph and hyperlinks from the Wikipedia page:

Copy

import csv
import requests
from bs4 import BeautifulSoup


# Residential Proxies config
proxies = {
    'http': 'https://USERNAME:PASSWORD@pr.oxylabs.io:7777',
    'https': 'https://USERNAME:PASSWORD@pr.oxylabs.io:7777'
}

# Add more headers if needed
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/135.0.0.0 Safari/537.36'
}

def scrape_wikipedia(url):
    '''
    Scrape a Wikipedia page, extract paragraphs and their links separately.
    '''
    # Send a request and parse raw HTML
    response = requests.get(
        url,
        headers=headers,
        proxies=proxies
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find main content div and paragraphs
    content = soup.find('div', {'class': 'mw-content-ltr'})
    paragraphs = content.find_all('p', recursive=True) if content else []
    
    results = []
    for paragraph in paragraphs:
        # Skip empty paragraphs
        if not paragraph.text.strip():
            continue
        
        # Remove citation references from the paragraph
        for citation in paragraph.find_all('sup', {'class': 'reference'}):
            citation.decompose()
        text = paragraph.get_text().strip()
        
        # Extract wiki hyperlinks
        links = []
        for link in paragraph.find_all('a'):
            href = link.get('href', '')
            if (href and href.startswith('/wiki/') and 
                not href.startswith('/wiki/File:') and
                not (link.get('class') and 'reference' in link.get('class'))):
                full_url = f'https://en.wikipedia.org{href}'
                links.append(full_url)
        
        if text:
            results.append({
                'text': text,
                'links': links
            })
    
    return results


def save_to_csv(data, filename):
    '''Save the scraped data to a CSV file.'''
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Paragraph', 'Links'])
        
        for item in data:
            links_text = ',\n'.join(item['links']) if item['links'] else ''
            writer.writerow([item['text'], links_text])


if __name__ == '__main__':
    data = scrape_wikipedia('https://en.wikipedia.org/wiki/Michael_Phelps')
    
    if data:
        save_to_csv(data, 'wikipedia_data.csv')
        print(f'Successfully scraped {len(data)} paragraphs')
    else:
        print('No data was scraped.')

NOTE: Don't forget to insert your own proxy credentials.

Free proxies for Wikipedia scraping

Get free proxies and unlock Wikipedia data with an average 99.82% success rate.

5 IPs for FREE
No credit card is required

Wikipedia scraping methods comparison

Criteria	Manual scraping (without proxies)	Manual scraping using proxies	Scraper APIs
Key features	• Single, static IP address • Direct network requests • Local execution environment	• IP rotation • Geo-targeting • Request distribution • Anti-detection measures	• Maintenance-free infrastructure • CAPTCHA handling • JavaScript rendering • Automatic proxy management
Pros	• Maximum flexibility • No additional service costs • Complete data pipeline control • Minimal latency	• Improved success rate • Reduced IP blocking • Coordinate, city, state-level targeting • Anonymity	• Minimal maintenance overhead • Built-in error handling • Regular updates for site layout changes • Technical support
Cons	• High likelihood of IP blocks • Regular maintenance • Limited scaling • No geo-targeting	• Additional proxy service costs • Manual proxy management • Additional setup • Increased request latency	• Higher costs • Fixed customization • API-specific limitations • Dependency on provider
Best for	• Small-scale web scraping • Unrestricted websites • Custom data extraction logic	• Medium to large-scale web scraping • Restricted websites • Global targets	• Enterprise-level web scraping • Complex websites with anti-bot measures • Resource-constrained teams • Quick implementation

Conclusion

We hope that you found this web scraping tutorial helpful. Web scraping public information from Wikipedia pages with Oxylabs’ Web Scraper API is a rather straightforward process.

However, if you need to scale your web scraping efforts for advanced web scraping with Python, you can buy proxies suitable for your project to enhance performance and avoid restrictions, such as datacenter or residential proxies.

If you run into any additional questions about web scraping, be sure to contact us at support@oxylabs.io, and our professional customer support team will happily assist you.

Frequently asked questions

How to scrape Wikipedia? You can gather publicly available information from Wikipedia with the help of Oxylabs free proxies or paid solutions, such as Oxylabs' Web Scraper API or a custom-built Wiki scraper.

When it comes to the legality of web scraping, it mostly depends on the type of data and whether it’s considered public. The data on Wikipedia articles is usually publicly available, so you should be able to scrape it. However, we always advise you to seek out professional legal assistance regarding your specific use case.

To scrape Wikipedia page data, you’ll need an automated solution like Oxylabs’ Web Scraper API or a custom-built scraper. Web Scraper API is a web scraping infrastructure that, after receiving your request, gathers publicly available Wikipedia page data according to your request.

You can extract text from HTML elements by using the Oxylabs' Custom Parser feature and the {"_fn": "element_text"} function – for more information, check our documentation. Alternatively, you can build your own parser, but it'll take more time and resources.

Web scraping the target Wikipedia page with BeautifulSoup requires using Python to send HTTP requests and then parse the HTML content. First, you'll need to install the necessary libraries, then create a script that sends a request to a Wikipedia page, retrieves the HTML, and uses BeautifulSoup to parse and extract the desired data based on HTML tags, classes, or IDs. The final result should be structured data in a convenient format like JSON or CSV.

About the author

Roberta Aukstikalnyte

Former Senior Content Manager

Roberta Aukstikalnyte was a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

Learn more about Roberta Aukstikalnyte Learn more about Roberta Aukstikalnyte

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.