lxml Tutorial: XML Processing and Web Scraping With lxml

Gabija Fatenaite

2021-08-30 · 6 min read

In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then move on to processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml. Each step of this tutorial is complete with practical Python lxml examples.

Prerequisites

This tutorial is aimed at developers who have at least a basic understanding of Python. A basic understanding of XML and HTML is also required. Simply put, if you know what an attribute is in XML, that is enough to understand this article. 

This tutorial uses Python 3 code snippets, but everything also works on Python 2 with minimal changes.

What is lxml in Python?

lxml is one of the fastest and most feature-rich libraries for processing XML and HTML in Python. The library is essentially a wrapper over the C libraries libxml2 and libxslt, combining the speed of these native libraries with the simplicity of a Python API.

Using the Python lxml library, XML and HTML documents can be created, parsed, and queried. It is also a dependency of many other complex packages, such as Scrapy.

Installation

The best way to download and install the lxml library is from the Python Package Index (PyPI) using the pip package manager, which works on Windows, Mac, and Linux:

pip3 install lxml

On Windows, just run pip install lxml, assuming you are running Python 3. Alternatively, if you are on a Debian-based Linux distribution, lxml can also be installed from the system repositories:

sudo apt-get install python3-lxml
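
Either way, you can verify that the installation succeeded by importing the library and printing its version. A quick check (assuming a reasonably recent lxml release, which exposes a __version__ attribute on the etree module):

python3 -c "from lxml import etree; print(etree.__version__)"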

Creating a simple XML document

Any XML document, or any XML-compliant HTML document, can be visualized as a tree. A tree has a root and branches, and each branch may optionally have further branches. All these branches and the root are represented as Elements.

A very simple XML document would look like this:

<root>
    <branch>
        <branch_one>
        </branch_one>
        <branch_one>
        </branch_one>
    </branch>
</root>

If an HTML document is XML-compliant, it will follow the same concept.

Note that HTML may or may not be XML-compliant. For example, if an HTML document contains <br> without a corresponding closing tag, it is still valid HTML, but it is not valid XML. In a later part of this tutorial, we will see how these cases can be handled. For now, let’s focus on XML-compliant HTML.

The Element class

To create an XML document using Python lxml, the first step is to import the etree module of lxml:

>>> from lxml import etree

Every XML document begins with the root element, which can be created using the Element type. The Element type is a flexible container object that can store hierarchical data, and it can be described as a cross between a dictionary and a list.

In this Python lxml example, the objective is to create an HTML document that is XML-compliant. This means that the root element will be named html:

>>> root = etree.Element("html")

Similarly, every HTML document will have a head and a body:

>>> head = etree.Element("head")
>>> body = etree.Element("body")

To create parent-child relationships, we can simply use the append() method.

>>> root.append(head)
>>> root.append(body)
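
Since Element behaves much like a Python list of its children, the structure built so far can be inspected directly. A minimal sketch, continuing in the same interpreter session:

>>> len(root)
2
>>> root[0].tag
'head'
>>> root[1].tag
'body'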

This document can be serialized and printed to the terminal with the help of the tostring() function. This function expects one mandatory argument: the root of the document. We can optionally set pretty_print to True to make the output more readable. Note that the tostring() serializer actually returns bytes, which can be converted to a string by calling decode():

>>> print(etree.tostring(root, pretty_print=True).decode())
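
With an empty head and body attached to the root, the printed output should look similar to this (elements without children or text are serialized as self-closing tags):

<html>
  <head/>
  <body/>
</html>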

The SubElement class

Creating an Element object and calling the append() function can make the code messy and unreadable. The easiest way is to use the SubElement type. Its constructor takes two arguments – the parent node and the element name. Using SubElement, the following two lines of code can be replaced by just one:

body = etree.Element("body")
root.append(body)
# is the same as
body = etree.SubElement(root, "body")

Setting text and attributes

Setting text is very easy with the lxml library. Every instance of Element and SubElement exposes the text property and the set() method – the former is used to specify the text content, and the latter is used to set attributes. Here are some examples:

para = etree.SubElement(body, "p")
para.text="Hello World!"

Similarly, attributes can be set using the key-value convention:

para.set("style", "font-size:20pt")
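
To sketch the dictionary-like side of Element, attributes that have been set can be read back with the get() method or through the attrib mapping:

print(para.get("style"))  # font-size:20pt
print(dict(para.attrib))  # {'style': 'font-size:20pt'}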

One thing to note here is that attributes can also be passed in the constructor of SubElement:

para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"

The benefit of this approach is that it saves lines of code and improves clarity. Here is the complete code. Save it in a Python file and run it. It will print an HTML document which is also well-formed XML.

from lxml import etree
 
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p",  id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p",  id="secondPara")
para.text = "This is the second paragraph."
 
etree.dump(root)  # prints everything to console. Use for debug only

Note that here we used etree.dump() instead of calling etree.tostring(). The difference is that dump() simply writes everything to the console and doesn’t return anything, while tostring() is used for serialization and returns a string that you can store in a variable or write to a file. dump() is good for debugging only and should not be used for any other purpose.

Add the following lines at the bottom of the snippet and run it again:

with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

This will save the contents to input.html in the same folder where you run the script. Again, the output is well-formed XML that can be interpreted as either XML or HTML.
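
As an alternative sketch, the root element can be wrapped in an ElementTree object, which provides a write() method for saving directly to a file:

# Wrap the root element and write the document straight to a file
etree.ElementTree(root).write('input.html', pretty_print=True)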

How do you parse an XML file using lxml in Python?

The previous section was a Python lxml tutorial on creating XML files. In this section, we will look at traversing and manipulating an existing XML document using the lxml library.

Before we move on, save the following snippet as input.html.

<html>
  <head>
    <title>This is Page Title</title>
  </head>
  <body>
    <h1 style="font-size:20pt" id="head">Hello World!</h1>
    <p id="firstPara">This HTML is XML Compliant!</p>
    <p id="secondPara">This is the second paragraph.</p>
  </body>
</html>

When an XML document is parsed, the result is an in-memory ElementTree object.

The raw XML contents can come from the file system or a string. If they are in the file system, the document can be loaded using the parse() method. Note that the parse() method returns an object of type ElementTree. To get the root element, simply call the getroot() method:

from lxml import etree
 
tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem)  # prints file contents to console
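
To see the two object types side by side, a quick sketch that prints them:

print(type(tree))  # <class 'lxml.etree._ElementTree'>
print(type(elem))  # <class 'lxml.etree._Element'>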

The lxml.etree module exposes another method that can be used to parse contents from a valid XML string: fromstring().

xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)

One important difference to note here is that the fromstring() method returns an Element object directly, so there is no need to call getroot().
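
Both parse() and fromstring() also accept an optional parser argument. As a sketch, a customized XMLParser can be passed in, for example, to discard ignorable whitespace between elements:

# A parser configured to drop blank text between elements
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)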

If you want to dig deeper into parsing, we have already written a tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. But to quickly answer what lxml is in BeautifulSoup: BeautifulSoup can use lxml as a parser, and, similarly, lxml can employ BeautifulSoup as a parser backend.

Finding elements in XML

Broadly, there are two ways of finding elements using the Python lxml library. The first is by using the Python lxml querying languages: XPath and ElementPath. For example, the following code will return the first paragraph element.

Note that the selector is very similar to XPath. Also note that the root element name was not used because elem contains the root of the XML tree.

tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)
 
# Output 
# <p id="firstPara">This HTML is XML Compliant!</p>

Similarly, findall() will return a list of all the elements matching the selector.

elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)
 
# Outputs
# <p id="firstPara">This HTML is XML Compliant!</p>
# <p id="secondPara">This is the second paragraph.</p>
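
Besides find() and findall(), every element exposes an iter() method that walks the entire subtree and yields matching elements. A minimal sketch using the same document:

for e in elem.iter('p'):
    print(e.get('id'), e.text)
 
# Outputs
# firstPara This HTML is XML Compliant!
# secondPara This is the second paragraph.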

The second way of selecting elements is by using XPath directly. This approach is easier to follow for developers who are familiar with XPath. Furthermore, XPath can be used to return the instance of the element, its text, or the value of any attribute using standard XPath syntax.

para = elem.xpath('//p/text()')
for e in para:
    print(e)
 
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
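
The same standard XPath syntax also works for attributes. For example, this sketch collects the id attribute of every paragraph:

ids = elem.xpath('//p/@id')
print(ids)
 
# Output
# ['firstPara', 'secondPara']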

Handling HTML with lxml.html

Throughout this article, we have been working with well-formed HTML that is XML-compliant. This will not be the case a lot of the time. For these scenarios, you can simply use lxml.html instead of lxml.etree.

Note that fromstring() does not read directly from a file; the file contents should be read into a string first. Here is the code to print all paragraphs from the same HTML file:

from lxml import html
with open('input.html') as f:
    html_string = f.read()
tree = html.fromstring(html_string)
para = tree.xpath('//p/text()')
for e in para:
    print(e)
 
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
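
Elements returned by lxml.html also provide HTML-specific helpers that lxml.etree lacks. One example is the text_content() method, which returns the text of an element including all of its descendants:

# Print the full text of the first h1 element
print(tree.xpath('//h1')[0].text_content())
 
# Output
# Hello World!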

lxml web scraping tutorial

Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.

For this, the requests library is a great choice. It can be installed using the pip package manager:

pip install requests

Once the requests library is installed, the HTML of any web page can be retrieved using a simple get() method. Here is an example:

import requests
 
response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints source HTML
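
In practice, it is worth confirming that the request succeeded before parsing the response. A minimal sketch using the raise_for_status() helper from the requests library:

import requests
 
response = requests.get('http://books.toscrape.com/')
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
print(response.status_code)  # 200 on success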

This can be combined with lxml to retrieve any data that is required.

Here is a quick example that prints a list of countries from Wikipedia:

import requests
from lxml import html
 
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
 
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])

In this code, the HTML returned by response.text is parsed into the variable tree, which can then be queried using standard XPath syntax. XPath expressions can also be chained together. Note that the xpath() method returns a list, and thus only the first item is taken in this code snippet.
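
Because xpath() returns a list, taking the first item will raise an IndexError whenever nothing matches. A more defensive sketch of the same loop:

for country in countries:
    names = country.xpath('./following-sibling::a/text()')
    if names:  # skip nodes that have no matching sibling link
        print(names[0])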

This can easily be extended to read any attribute from the HTML. For example, the following modified code prints the country name and the image URL of its flag:

for country in countries:
    flag = country.xpath('.//img/@src')[0]
    country = country.xpath('./following-sibling::a/text()')[0]
    print(country, flag)

You can click here to find the complete code used in this article for your convenience.

Conclusion

In this Python lxml tutorial, various aspects of XML and HTML handling using the lxml library have been introduced. The Python lxml library is a lightweight, fast, and feature-rich library that can be used to create XML documents, read existing documents, and find specific elements. This makes the library equally powerful for both XML and HTML documents. Combined with the requests library, it can also be easily used for web scraping.

In our blog, you can learn more about web scraping using Selenium or other useful libraries like Beautiful Soup. Also, if you feel that building a data gathering tool is a bit challenging, check out our Web Scraper API – an advanced data collection solution for large-scale web scraping.

About the author

Gabija Fatenaite

Lead Product Marketing Manager

Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
