How to Parse HTML with PyQuery: Python Tutorial

Vytenis Kaubrė

2023-01-25, 4 min read

PyQuery is a Python library that allows you to manipulate and extract data from HTML and XML documents. It provides a jQuery-like syntax and API, making it easy to work with web content in Python.

Like jQuery, PyQuery lets you select elements from an XML or HTML document using CSS selectors and then manipulate or extract data from those elements. As a result, PyQuery is mostly used for parsing and manipulating XML and HTML and for extracting data from web pages and APIs.

In this article, we’ll show you how to write a web scraper in Python using the PyQuery library. We’ll first explore the basics and then compare PyQuery with Beautiful Soup. So, let’s get to it.

How to install PyQuery

To install PyQuery, you’ll need to have Python installed on your device. If you don't have Python, you can download it from the official Python website and install it. In this tutorial, we’re using Python 3.10.7 and PyQuery 2.0.0.

First, we’ll set up the PyQuery library using pip. To do this, open a terminal or command prompt and type the following line:

python -m pip install pyquery

Alternatively, you can install a specific version of the PyQuery library with pip. To install version 2.0.0, use the command below:

python -m pip install pyquery==2.0.0

This will set up PyQuery with all the necessary dependencies. If you run into any errors, check out the official PyQuery documentation.
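
To confirm that the installation worked, you can ask Python which version of the package it sees. The snippet below is a minimal sketch that uses the standard library's importlib.metadata module. The helper name installed_version is our own and isn't part of PyQuery.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package was not installed.
print(installed_version("pyquery"))
```

If this prints None, pip most likely installed the package for a different Python interpreter than the one running the script.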

Parsing DOM

Let’s write our first scraper using PyQuery to parse the Document Object Model (DOM). We’ll use the requests module to fetch an HTML page and parse it using the PyQuery module. First, let’s import the necessary libraries:

import requests
from pyquery import PyQuery as pq

With the code below, we’ll fetch the website (https://example.com) and grab the title using PyQuery:

r = requests.get("https://example.com")
doc = pq(r.content)
print(doc("title").text())

When we run this code, it prints the title of the website. The requests.get() call fetches the page, and the PyQuery class parses the response content into the doc object. Calling doc("title") then selects the title tag with a CSS selector, and text() returns the text inside it.

Extracting multiple elements using CSS selector

Let’s extract multiple HTML elements with a CSS selector, using the https://books.toscrape.com website as our target. PyQuery has built-in support for fetching an HTML document from a URL, so we’ll use that in this example:

from pyquery import PyQuery as pq
doc = pq(url="https://books.toscrape.com")
for link in doc("h3>a"):
    print(link.text, link.attrib["href"])

We use the CSS selector h3>a to grab all the links that are direct children of h3 tags. Then, we print the text and URL of each link using a for loop. The selection holds every element that matches, so the loop runs once per match and doesn’t run at all if nothing matches.

Iterating over a PyQuery object yields lxml elements. To access an element’s attributes, we use its attrib property, which behaves like a Python dictionary. So, we simply pass "href" as a key, and it returns the link’s URL exactly as written in the page, which may be a relative path.

Removing elements

Sometimes, we might need to remove unwanted elements from the DOM. PyQuery has a method called remove(), which we’ll use for this purpose. 

Let’s say we want to get rid of all the icons from the above example. We can do it by adding a few lines of code, as shown below:

from pyquery import PyQuery as pq
doc = pq(url="https://books.toscrape.com")
doc("i").remove()
print(doc)

Once we run this code, it removes every i element, which is the tag used for the icons, from doc and then prints the remaining HTML.

PyQuery vs. Beautiful Soup

PyQuery and Beautiful Soup are both great Python libraries for working with HTML and XML documents. They enable the parsing, traversing, and manipulating of HTML and XML, as well as extracting data from web pages and APIs.

One key difference between PyQuery and Beautiful Soup is their syntax and API. PyQuery is designed to have a syntax and API similar to jQuery, the JavaScript library for working with HTML and DOM elements. If you already know how to write jQuery queries, you should be able to pick up PyQuery quickly.

On the other hand, Beautiful Soup has its own syntax and API, which is closer to the ElementTree library in Python’s standard library. If you’re familiar with ElementTree, you may find Beautiful Soup easier to use. Beautiful Soup also supports HTML sanitization, which is handy when scraping websites with broken HTML, and it offers more built-in functions. For these reasons, Beautiful Soup is widely used for web scraping in Python.

However, PyQuery is a lightweight layer on top of lxml, so it generally parses and queries documents faster than Beautiful Soup. A publicly available GitHub gist lets you test the response times of Beautiful Soup, PyQuery, and other similar libraries yourself.

Ultimately, the choice between PyQuery and Beautiful Soup depends on your specific needs and preferences. Either can be a good choice for working with HTML and XML files in Python.

Let's summarize the key differences between the PyQuery and Beautiful Soup libraries:

Feature                      PyQuery              Beautiful Soup
Syntax and API               Similar to jQuery    Similar to ElementTree
Performance                  Fast                 Good
Multiple parsers support     Yes                  Yes
Unicode support              Yes                  Yes
HTML sanitization            No                   Yes
Multiple language support    No                   Yes

Wrapping up

To conclude, PyQuery is an easy-to-use Python library for working with HTML and XML files. Its jQuery-like syntax and API make it easy to parse, traverse, and manipulate HTML and XML, as well as extract data.

While PyQuery is a powerful tool, it’s not the only option available for working with HTML and XML in Python. Beautiful Soup is another popular library that offers a different syntax and API and is suitable for different use cases. Ultimately, the choice between PyQuery and Beautiful Soup depends on your specific needs and preferences.

We hope this article was useful and now you’re better equipped to use PyQuery in your own projects. Also, check out our Residential Proxies, which will help you avoid various issues when gathering public data. If you need assistance with Oxylabs products, feel free to contact our support via live chat or email.

About the author

Vytenis Kaubrė

Junior Technical Copywriter

Vytenis Kaubrė is a Junior Technical Copywriter at Oxylabs. His love for creative writing and a growing interest in technology fuels his daily work, where he crafts technical content and web scrapers with Oxylabs’ solutions. Off duty, you might catch him working on personal projects, coding with Python, or jamming on his electric guitar.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
