PyQuery is a Python library that allows you to manipulate and extract data from HTML and XML documents. It provides a jQuery-like syntax and API, making it easy to work with web content in Python.
Like jQuery, PyQuery allows you to select elements from an XML or HTML document using CSS selectors and then manipulate or extract data from those elements. This makes PyQuery well suited for HTML and XML parsing, manipulation, and data extraction from web pages and APIs.
In this article, we’ll show you how to write a web scraper in Python using the PyQuery Library. We’ll first explore the basics, and after that, we’ll compare PyQuery with Beautiful Soup. So, let’s get to it.
To install PyQuery, you’ll need to have Python installed on your device. If you don't have Python, you can download it from the official Python website and install it. In this tutorial, we’re using Python 3.10.7 and PyQuery 2.0.0.
First, we’ll set up the PyQuery library using pip. To do this, open a terminal or command prompt and type the following line:
python -m pip install pyquery
Alternatively, you can install a specific version of the PyQuery library with pip. To install version 2.0.0, use the example below:
python -m pip install pyquery==2.0.0
This will set up PyQuery with all the necessary dependencies. If you run into any errors, check out the official PyQuery documentation.
Let’s write our first scraper using PyQuery to parse the Document Object Model (DOM). We’ll use the requests module to fetch an HTML page and parse it using the PyQuery module. First, let’s import the necessary libraries:
import requests
from pyquery import PyQuery as pq
With the code below, we’ll fetch the website (https://example.com) and grab the title using PyQuery:
r = requests.get("https://example.com")
doc = pq(r.content)
print(doc("title").text())
When we run this code, it prints the website's title. We use the get() method to fetch the page content, then pass that content to the PyQuery class, which parses it and stores the result in the doc object. Finally, we use the title tag as a CSS selector to select the <title> element and print its text.
Let’s extract multiple HTML elements with the CSS Selector using the https://books.toscrape.com website as our target. PyQuery has built-in support to extract an HTML document from a URL, so we’ll implement that with this example:
from pyquery import PyQuery as pq

doc = pq(url="https://books.toscrape.com")
for link in doc("h3 > a"):
    print(link.text, link.attrib["href"])
Here, the CSS selector grabs all the links nested inside H3 tags, and the for loop prints each link's text and URL. A CSS selector returns every element that matches it, so the result may contain one element or many.
To access the element properties, we use the attrib object. The syntax is the same as the Python dictionary. So, we simply pass the "href" as a key, and it returns the URL of the element.
Sometimes, we might need to remove unwanted elements from the DOM. PyQuery has a method called remove(), which we’ll use for this purpose.
Let’s say we want to get rid of all the icons from the above example. We can do it by adding a few lines of code like shown below:
from pyquery import PyQuery as pq

doc = pq(url="https://books.toscrape.com")
doc("i").remove()
print(doc)
Once we run this code, it’ll remove all the icons from the doc.
PyQuery and Beautiful Soup are both great Python libraries for working with HTML and XML documents. They enable the parsing, traversing, and manipulating of HTML and XML, as well as extracting data from web pages and APIs.
Beautiful Soup, in contrast, has a different syntax and API that's closer to the ElementTree module in Python's standard library. If you're familiar with ElementTree, you may find Beautiful Soup easier to use. Beautiful Soup is also more tolerant of malformed markup, which is handy when scraping websites with broken HTML, and it offers more built-in functions. As a result, Beautiful Soup is extensively used in Python web scraping projects.
However, being lightweight, PyQuery can do things much faster than Beautiful Soup. Here’s a great GitHub gist code that you can use to test out the response times of Beautiful Soup and PyQuery, as well as other similar libraries.
Ultimately, the choice between PyQuery and Beautiful Soup depends on your specific needs and preferences. Either can be a good choice for working with HTML and XML files in Python.
Let's summarize the key differences between the PyQuery and Beautiful Soup libraries:
| | PyQuery | Beautiful Soup |
|---|---|---|
| Syntax and API | Similar to jQuery | Similar to ElementTree |
| Multiple parsers support | Yes | Yes |
| Multiple language support | No | Yes |
To conclude, PyQuery is an easy-to-use Python library for working with HTML and XML files. Its jQuery-like syntax and API make it easy to parse, traverse, and manipulate HTML and XML, as well as extract data.
While PyQuery is a powerful tool, it’s not the only option available for working with HTML and XML in Python. Beautiful Soup is another popular library that offers a different syntax and API and is suitable for different use cases. Ultimately, the choice between PyQuery and Beautiful Soup depends on your specific needs and preferences.
We hope this article was useful and that you're now better equipped to use PyQuery in your own projects. Also, check out our Residential Proxies, which will help you avoid various issues when gathering public data. If you need assistance with Oxylabs products, feel free to contact our support via live chat or email.
About the author
Vytenis Kaubre is a Copywriter at Oxylabs. With a passion for creative writing and a growing curiosity about anything tech, he joined the army of copywriters. After work, you might find Vytenis watching TV shows, brainstorming ideas for tabletop games, or taking Raymond Chandler's advice, "When in doubt, have a man come through a door with a gun in his hand," too seriously (when writing short stories).
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.