As web scraping and automation tools are becoming increasingly useful and accessible, more and more companies are investing in this field. It doesn’t matter which technology or framework you use; eventually, you will have one decision to make – XPath or CSS Selector. In this article, we’ll examine each option in detail and help you make a decision that’ll allow you to get the best out of XPath and CSS.
XPath stands for XML Path Language. It uses a non-XML syntax to address different parts of an XML document, and it can also be used to test whether a given node matches a pattern.
When the software boom started, companies began adopting software solutions to manage their operations one at a time. This resulted in multi-system chaos. Businesses ended up with numerous systems built on different programming languages and platforms, yet these systems still had to communicate with each other.
For example, the inventory management system might use Java while the eCommerce store used .NET. Things soon became messy as the systems grew more complex. Companies needed a standard that everyone would agree upon.
XML, or eXtensible Markup Language, was the perfect solution to this problem. It’s a way of representing information. A simple example of an XML document is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<inventory>
  <product type="bottle">Water Bottle</product>
  <brand>ABC</brand>
  <code type="upc-13">54268453659458</code>
  <stock>54</stock>
  <description>Red colored water bottle</description>
</inventory>
Two other related things came along with XML. The first, XML Schema Definition, standardizes the structure of XML documents written for a specific purpose. The second, XML Path or XPath, finds or queries specific information in such a document.
HTML is very much like XML, although it does not enforce XML's rules as rigidly. Still, you can use XPath to query HTML. If the HTML is malformed and therefore hard to read, parsers such as lxml repair the markup so that you can still query it.
XML and, along with it, XPath remained a standard for many years. As a result, many other technologies ended up using XPath. One such example is the famous automation tool Selenium. Even though it provides different methods to locate elements, automation testers mostly use XPath.
This simple HTML document will be helpful in understanding the basics of XPath. Copy this HTML markup and save it as document1.html.
<html>
  <head>
    <title>XPath vs CSS</title>
  </head>
  <body>
    <h2 id="header">Welcome to XPath</h2>
    <div id="navbar">
      <a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
      <a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
    </div>
  </body>
</html>
Let’s clarify the basic terminology first.
Any XML or HTML will have three types of nodes:
Element Node – These are the tags. For example, <title>XPath vs CSS</title>.
Attribute Node – These are the attributes that are part of an element node. For instance, id="blog" is an attribute node.
Atomic Value – These are the final values that can be either the text contained in an element or a value of an attribute. Example: “Case Studies”.
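To make the three node types concrete, here is a minimal Python sketch. It assumes the sample document is well-formed enough to be parsed as XML, and it uses the standard library's xml.etree.ElementTree, which understands only a limited subset of XPath. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The sample document from above, embedded so the sketch is self-contained.
DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Element nodes: the tags themselves.
title = root.find("./head/title")
blog_link = root.find(".//a[@id='blog']")

print(title.tag)            # name of the element node
print(blog_link.attrib)     # attribute nodes, exposed as a dictionary
print(blog_link.text)       # atomic value: the text inside the element
print(blog_link.get("id"))  # atomic value: the value of one attribute
```

ElementTree exposes attributes as a plain dictionary rather than as separate node objects, so this is an approximation of the XPath data model rather than an exact mirror of it.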
These nodes can have various kinds of relationships with each other. Knowing the names of these relations is essential to understand how to create an XPath selector. The following fundamentals will be applicable for both XPath and CSS.
Children. These are the elements that are exactly one level down. For example, in the above HTML document, the elements <h2> and <div> are children of the <body>. Note that the elements <a> are not children of the <body>.
Parent. These elements are exactly one level up. That means that each element other than the root has exactly one parent. In the above example, the parent of the <a> element is <div>.
Siblings. These elements have the same parent. In this example, tags <h2> and <div> are siblings that share the same parent <body>.
Descendants. These are elements that are any level down, including children. For example, the descendants of the <body> are <h2>, <div>, and both <a> elements.
Ancestor. These are elements any level up, including the parent. In this example, the ancestors of <a> tag are <div>, <body>, and <html>.
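These relationships can be checked against the sample document with a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which does not store parent pointers, so a parent lookup table (the hypothetical parent_of dictionary below) is built by hand.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)
body = root.find("body")
h2 = root.find(".//h2")
first_link = root.find(".//a")

# Map every element to its parent (the root has no entry).
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Return the tag names of all ancestors, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain


children = [child.tag for child in body]
descendants = [el.tag for el in body.iter() if el is not body]
siblings_of_h2 = [el.tag for el in parent_of[h2] if el is not h2]

print(children)                    # direct children of <body>
print(descendants)                 # everything below <body>
print(parent_of[first_link].tag)   # parent of the first <a>
print(siblings_of_h2)              # elements sharing <h2>'s parent
print(ancestors(first_link))       # every level above the first <a>
```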
The easiest way of creating an XPath is the “bottom-up” approach. Start by writing the name of the element that you want to select. Then, add a forward slash before it. After that, write the name of its parent before the slash. Keep repeating these two steps until there are no more parents, so that the expression ends up starting with a forward slash before the root element.
For example, the steps to write XPath for the element <h2> are going to be as follows:
Step 1 – h2
Step 2 – /h2
Step 3 – body/h2
Step 4 – /body/h2
Step 5 – html/body/h2
Step 6 – /html/body/h2
This might be a tedious way of building an XPath, but it helps in understanding how XPath is structured.
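The bottom-up steps can also be sketched as a small Python function. This is an illustration only: full_xpath and parent_of are hypothetical names, the parsing relies on the standard library's xml.etree.ElementTree, and no positional indexes are added, so the result is only unique for elements such as <h2> that have no same-named siblings.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# ElementTree has no parent pointers, so build a child-to-parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}


def full_xpath(element):
    """Build a path bottom-up: element name first, then each parent."""
    steps = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        steps.append(element.tag)
    # The steps were collected from the bottom up, so reverse them.
    return "/" + "/".join(reversed(steps))


h2 = root.find(".//h2")
print(full_xpath(h2))
```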
This XPath can also be created using Developer Tools from your browser. Here is how to do it:
Save the above HTML to a file and open it in Chrome or Firefox.
Right-click the text “Welcome to XPath” and select “Inspect”.
This will open “Developer Tools” with the <h2> element highlighted. Right-click the element, point to “Copy” and then choose “Copy Full XPath”. This will create the same path that we just created.
Copy full XPath in Chrome
In the previous screenshot, you can see two options for XPath – XPath and Full XPath. The previous section demonstrated how to create a full XPath.
If XPath is selected instead of the full XPath in the example above, then the XPath that we get is as follows:
//*[@id="header"]
So, what is wrong with Full XPath? This was a simple example that created a simple full XPath. This will not always be the case, though. For example, if the same method is followed to generate the full XPath for the table of contents of any Wikipedia page, the full XPath will be something like this:
There are three main problems with this XPath:
It’s not readable and not maintainable.
It’s too rigid. Even if there is only a slight change, like an additional information banner at the top of the page, this XPath will break.
A full XPath makes the engine traverse every node along the path, which makes it slow.
That’s why we need to learn a shorter, faster, and more readable XPath. Let’s begin with the first building block – the slash character. While a single slash selects the child node, a double-slash selects all the matching nodes, no matter where they are in the document.
For example, to find all the <a> tags in a document, the XPath expression will be as follows:
//a
Similarly, to find all the <h2> tags, the XPath will be as follows:
//h2
In the document1.html file that is being discussed, there is only one <h2> tag. This means the XPath that we created earlier, /html/body/h2, can be simply //h2.
The next building block is the asterisk. This character is used as a wild card that will match every element. For example, if we want to get the list of all the <a> elements which are children of a <div> element, without the asterisk the XPath will be as follows:
//div/a
Likewise, to select all the <p> elements that are children of <div>, we can use this XPath:
//div/p
However, if we want to select all elements which are children of a <div> element, we can use an asterisk. The XPath will be as follows:
//div/*
Use the asterisk with care, as it can match many elements if other conditions are not added. These conditions are called predicates.
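The slash, the double slash, and the asterisk can be tried out on the sample document with the sketch below. It assumes Python's standard xml.etree.ElementTree, which supports only a subset of XPath and expects relative paths, so each expression starts with a dot (.//a instead of //a).

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Double slash: match the tag anywhere below the starting element.
all_links = root.findall(".//a")
all_headings = root.findall(".//h2")

# Single slash: match direct children only.
links_in_div = root.findall(".//div/a")
paragraphs_in_div = root.findall(".//div/p")

# Asterisk: match any element, here every child of a <div> or of <body>.
any_child_of_div = root.findall(".//div/*")
any_child_of_body = root.findall("./body/*")

print(len(all_links), len(all_headings))
print(len(links_in_div), len(paragraphs_in_div))
print([el.tag for el in any_child_of_div])
print([el.tag for el in any_child_of_body])
```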
The predicates are written inside square brackets. In the simplest form, these are just a number.
For example, the following XPath will match two anchor tags.
//a
If we add a number in the square brackets, we can specify which element to select. For example, if you want to select the first anchor tag, the XPath will be as follows:
//a[1]
Note that the first node is 1, not 0, like it would be in most programming languages.
We can also use functions in the square brackets. For example, to get the last <a> tag, we can write this XPath:
//a[last()]
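Positional predicates can be sketched in Python as follows. The example assumes the standard library's xml.etree.ElementTree, whose XPath subset includes numeric positions and last(), and again uses relative paths that start with a dot.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Without a predicate, both anchor tags match.
all_links = root.findall(".//a")

# Positions start at 1, so [1] is the first <a> within its parent.
first_link = root.find(".//a[1]")

# last() selects the final <a> within its parent.
last_link = root.find(".//a[last()]")

print(len(all_links))
print(first_link.text)
print(last_link.text)
```

Note that the position is counted among siblings under the same parent, so on a larger page //a[1] can return one element per parent rather than a single element.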
There are many more functions which will be covered shortly. Before that, we will see how we can get the text from the elements.
The XPath //h2 returns the element. Let’s see how we can extract text, or in other words, an atomic value from the elements.
The text is usually contained between the element opening and closing part. For example:
<h2 id="header">Welcome to XPath</h2>
To extract this text, the XPath function text() can be used. For example, the following XPath will return the text inside the h2 element:
//h2/text()
To extract the value of any attribute, use @ symbol with the attribute name. For example, consider this anchor tag:
<a href="https://oxylabs.io/blog">Visit our Blog</a>
To extract the value of the href attribute, use @href in the XPath:
//a/@href
This will get the text https://oxylabs.io/blog
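In Python, the same values can be read with the sketch below. One caveat: the standard library's xml.etree.ElementTree, used here as a dependency-free stand-in, cannot evaluate text() or @href as the final step of a path, so the element is located first and the .text attribute and .get() method play the role of those two steps. A full XPath engine such as lxml accepts the expressions directly.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Stand-in for the text() step: locate the element, then read its text.
heading_text = root.find(".//h2").text

# Stand-in for the @href step: locate the elements, then read the attribute.
hrefs = [link.get("href") for link in root.findall(".//a")]

print(heading_text)
print(hrefs)
```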
We can also use the attribute selectors in XPath predicates. For example, this XPath will select all the <div> elements where the id attribute’s value is header:
//div[@id="header"]
If you want to look for an element that contains specific text, you can search that by using text() function in square brackets:
//a[text()="Visit our Blog"]
Note that this will be looking for a complete match. If you want to look for a partial match, use the contains function. This function will be an important criterion to help you decide between XPath and CSS.
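The difference between a complete and a partial match can be sketched in Python. The assumptions here are worth stating: the standard library's xml.etree.ElementTree (Python 3.7 or later) supports the exact-match predicate [.='text'] but has no contains() function, so the partial match below is an ordinary Python substring test standing in for contains().

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<html>
<head><title>XPath vs CSS</title></head>
<body>
<h2 id="header">Welcome to XPath</h2>
<div id="navbar">
<a href="https://oxylabs.io/blog" id="blog" class="nav">Visit our Blog</a>
<a href="https://oxylabs.io/resources/case-studies" class="nav">Case Studies</a>
</div>
</body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Attribute predicate: any element whose id is "header".
by_id = root.findall(".//*[@id='header']")

# Complete match: the whole text must equal the given string.
exact = root.findall(".//a[.='Visit our Blog']")
no_match = root.findall(".//a[.='Blog']")

# Partial match: a substring test used in place of XPath's contains().
partial = [a for a in root.iter("a") if "Blog" in (a.text or "")]

print([el.tag for el in by_id])
print(len(exact), len(no_match), len(partial))
```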
This was all about the basics of XPath. In the next section, you will learn the key area where XPath is truly powerful.
While working with any HTML, or DOM, there can be three main axes. These are ancestors, descendants, and siblings. While it is possible to select siblings and descendants using CSS as well, only XPath is capable of going up to ancestors.
Take this example:
<a href="https://oxylabs.io/"> <span class="link">Oxylabs</span> </a>
In this example, if we have to extract the value of the href attribute of the <a> tag, we can locate the span tag and go up one level to find the <a> tag:
//span[@class="link"]/../@href
Note the use of .. to go one level up. This is not possible using CSS Selectors.
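Going one level up can be sketched in Python as well. The snippet assumes the standard library's xml.etree.ElementTree, which supports the .. step, and wraps the example markup in a <div> (an addition made only so that the <a> tag has a parent document around it). The attribute is read with .get() because ElementTree does not evaluate a trailing @href step.

```python
import xml.etree.ElementTree as ET

# The example markup, wrapped in a <div> for the purposes of this sketch.
SNIPPET = """<div>
<a href="https://oxylabs.io/"><span class="link">Oxylabs</span></a>
</div>"""

root = ET.fromstring(SNIPPET)

# Locate the <span> by its class, then step up to its parent with "..".
parent_link = root.find(".//span[@class='link']/..")

print(parent_link.tag)
print(parent_link.get("href"))
```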
Further, for partial text matches, CSS Selectors once included a :contains() pseudo-class, but it was dropped from the specification. It may still work in a few specific tools, but support is not guaranteed.
On the other hand, the XPath function contains() is universally supported. To sum it up, if you want to traverse up the DOM, XPath is your only choice. If you want to perform a partial match, again, XPath may be your only option.
A CSS selector is the first part of a CSS rule. It is a pattern of elements and other terms that tells the browser which HTML elements the CSS properties in the rule should be applied to.
HTML is rarely served without CSS, or Cascading Style Sheets. If there is a need to change the appearance of any HTML element, the most common way to do that is by applying a style to it – it can be as simple as changing text color, or it can be more complex if you are using animation.
In both cases, the first step would be to locate the element. Once an element is located, styles can be applied using CSS. To locate the elements, CSS selectors are used. CSS selectors were developed for entirely different reasons, but they serve the same purpose – selecting elements.
Now that you have an idea of what XPath and CSS selectors are, we’ll take a look at how XPath and CSS are created and when you should use one over the other.
Getting started with a CSS selector is very easy. The CSS selector can use tag name, id, (pseudo-)class or attribute. Consider this element:
<h2 id="header" name="ctrl" class="fancy">XPath vs CSS</h2>
Here are a few examples of the CSS selectors for this tag:
h2 – Select by element name. Wild card is *
#header – Use # to specify id
.fancy – Use a period for specifying class
[name="ctrl"] – Use any attributes in square brackets
These can be combined too. For example:
h2#header selects <h2> elements with id as header.
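For comparison, each of these selectors has a rough XPath counterpart. The Python sketch below pairs them up and checks that every path finds the same <h2>. It is an assumption-laden illustration: Python's standard library has no CSS selector engine, so only the XPath side is executed (with xml.etree.ElementTree), the element is wrapped in a <body> for parsing, and the class comparison is only equivalent when the class attribute holds a single class name.

```python
import xml.etree.ElementTree as ET

# The example element, wrapped in a <body> so it can be searched for.
SNIPPET = """<body>
<h2 id="header" name="ctrl" class="fancy">XPath vs CSS</h2>
</body>"""

root = ET.fromstring(SNIPPET)

# CSS selector on the left, an approximate XPath counterpart on the right.
CSS_TO_XPATH = {
    "h2": ".//h2",
    "*": ".//*",
    "#header": ".//*[@id='header']",
    ".fancy": ".//*[@class='fancy']",  # exact only for a single class name
    '[name="ctrl"]': ".//*[@name='ctrl']",
    "h2#header": ".//h2[@id='header']",
}

matches = {css: root.findall(path) for css, path in CSS_TO_XPATH.items()}

for css, found in matches.items():
    print(css, "->", [el.tag for el in found])
```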
For selecting children, the > operator is used. For selecting the sibling that immediately follows an element, the + operator is used. For selecting all following siblings, ~ is used. Here are a few examples:
div > a – selects <a> elements which are children of a <div>
div a – selects <a> elements which are descendants of a <div>
div + a – selects an <a> element that immediately follows a <div> and shares its parent
div ~ a – selects all <a> elements that follow a <div> and share its parent
If you know these, you should be able to cover most of the scenarios.
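The four combinators can be illustrated with a Python sketch. Since the standard library has no CSS engine, this is an approximation under stated assumptions: the child and descendant combinators are expressed as xml.etree.ElementTree paths, the two sibling combinators are implemented by hand in the hypothetical helpers adjacent_sibling and general_siblings, and the markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up markup with links inside, right after, and further after a <div>.
SNIPPET = """<body>
<div id="first"><a id="inner">in</a><p><a id="deep">deep</a></p></div>
<a id="next">next</a>
<p>gap</p>
<a id="later">later</a>
</body>"""

root = ET.fromstring(SNIPPET)


def adjacent_sibling(tree, first, second):
    """Approximate 'first + second': second directly after first."""
    found = []
    for parent in tree.iter():
        children = list(parent)
        for previous, current in zip(children, children[1:]):
            if previous.tag == first and current.tag == second:
                found.append(current)
    return found


def general_siblings(tree, first, second):
    """Approximate 'first ~ second': every second after a first."""
    found = []
    for parent in tree.iter():
        seen_first = False
        for child in parent:
            if seen_first and child.tag == second:
                found.append(child)
            if child.tag == first:
                seen_first = True
    return found


child_ids = [a.get("id") for a in root.findall(".//div/a")]        # div > a
descendant_ids = [a.get("id") for a in root.findall(".//div//a")]  # div a
adjacent_ids = [a.get("id") for a in adjacent_sibling(root, "div", "a")]
general_ids = [a.get("id") for a in general_siblings(root, "div", "a")]

print(child_ids, descendant_ids, adjacent_ids, general_ids)
```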
The best thing about CSS selectors is that they are easy to learn and maintain.
Note that Beautiful Soup does not support XPath. CSS Selector is the only choice in this case.
The older browsers such as Internet Explorer did not support CSS fully. However, nowadays, browsers support CSS Selectors fully; thus, compatibility will not be an issue.
CSS Selectors are also faster than XPath. However, note that the difference is so tiny that it should not matter at all in the real world.
Deciding between XPath and CSS may feel like a difficult task. Let’s see the below table for comparison.
Comparison of XPath and CSS selectors
XPath enables bidirectional flow, meaning the traversal may go both ways: from child to parent and from parent to child. CSS, in contrast, enables one-directional flow, so the traversal works only from parent to child. In terms of speed and performance, XPath is generally considered the slower of the two, and CSS the better choice.
However, you should take this with a pinch of salt. Browsers are evolving at a fast rate. Except for Firefox and Safari, almost every browser is now based on Chromium. Similarly, computer resources are becoming less of a bottleneck.
Lastly, most of the tests are done for Selenium Framework alone. However, while web scraping, there are a lot of other factors in the picture. If rendering is not needed, there will not be a need to use a full browser at all.
First and foremost, you shouldn’t choose one over the other because of speed. XPath will be faster in some cases, and CSS Selector will be faster in other cases. As discussed, the difference is so tiny with modern browsers and programming languages that it should be discarded altogether.
While web scraping, other factors, such as network, play a much more significant role.
There are two scenarios when CSS is not even an option – traversing up the tree, and the contains function for partial match.
If you use a web scraping tool such as Beautiful Soup, consider its find and find_all methods. These are optimized for Beautiful Soup, and you will not have to worry about choosing between XPath and CSS.
Summing up, the decision on which selectors to use does not rely on the performance of XPath and CSS. Instead, you should focus on other factors, such as finding the required feature-set or considering the ease of use and compatibility. You can read a detailed guide on web scraping with Python in our blog.
About the author
Former Content Manager
Monika Maslauskaite is a former Content Manager at Oxylabs. A combination of tech-world and content creation is the thing she is super passionate about in her professional path. While free of work, you’ll find her watching mystery, psychological (basically, all kinds of mind-blowing) movies, dancing, or just making up choreographies in her head.