lxml Tutorial: XML Processing and Web Scraping With lxml

Gabija Fatenaite

2021-08-30 · 6 min read

In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then move on to processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml. Each step of this tutorial is complete with practical Python lxml examples.

Prerequisites

This tutorial is aimed at developers who have at least a basic understanding of Python. A basic understanding of XML and HTML is also required. Simply put, if you know what an attribute is in XML, that is enough to understand this article. 

This tutorial uses Python 3 code snippets, but everything also works on Python 2 with minimal changes.

What is lxml in Python?

lxml is one of the fastest and most feature-rich libraries for processing XML and HTML in Python. The library is essentially a wrapper over the C libraries libxml2 and libxslt, combining the speed of these native libraries with the simplicity of a Python API.

Using the Python lxml library, XML and HTML documents can be created, parsed, and queried. It is also a dependency of many other complex packages, such as Scrapy.

Installation

The best way to download and install the lxml library is from the Python Package Index (PyPI) using the pip package manager, which works on Windows, Mac, and Linux:

pip3 install lxml

On Windows, just run pip install lxml, assuming you are running Python 3. Alternatively, if you are on a Debian-based Linux distribution, lxml can also be installed from the system repositories:

sudo apt-get install python3-lxml
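
Either way, you can verify that the installation succeeded by importing the library and printing its version. A quick check (assuming a reasonably recent lxml release, which exposes a __version__ attribute on the etree module):

python3 -c "from lxml import etree; print(etree.__version__)"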

Creating a simple XML document

Any XML document, or any XML-compliant HTML document, can be visualized as a tree. A tree has a root and branches, and each branch may optionally have further branches. All these branches and the root are represented as Elements.

A very simple XML document would look like this:

<root>
    <branch>
        <branch_one>
        </branch_one>
        <branch_one>
        </branch_one>
    </branch>
</root>

If an HTML document is XML-compliant, it will follow the same concept.

Note that HTML may or may not be XML-compliant. For example, if an HTML document contains <br> without a corresponding closing tag, it is still valid HTML, but it is not valid XML. In a later part of this tutorial, we will see how these cases can be handled. For now, let’s focus on XML-compliant HTML.

The Element class

To create an XML document using Python lxml, the first step is to import the etree module of lxml:

>>> from lxml import etree

Every XML document begins with the root element, which can be created using the Element type. The Element type is a flexible container object that can store hierarchical data, and it can be described as a cross between a dictionary and a list.

In this Python lxml example, the objective is to create an HTML document that is XML-compliant. This means that the root element will be named html:

>>> root = etree.Element("html")

Similarly, every HTML document will have a head and a body:

>>> head = etree.Element("head")
>>> body = etree.Element("body")

To create parent-child relationships, we can simply use the append() method.

>>> root.append(head)
>>> root.append(body)
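
Since Element behaves much like a Python list of its children, the structure built so far can be inspected directly. A minimal sketch, continuing in the same interpreter session:

>>> len(root)
2
>>> root[0].tag
'head'
>>> root[1].tag
'body'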

This document can be serialized and printed to the terminal with the help of the tostring() function. This function expects one mandatory argument: the root of the document. We can optionally set pretty_print to True to make the output more readable. Note that the tostring() serializer actually returns bytes, which can be converted to a string by calling decode():

>>> print(etree.tostring(root, pretty_print=True).decode())
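
With an empty head and body attached to the root, the printed output should look similar to this (elements without children or text are serialized as self-closing tags):

<html>
  <head/>
  <body/>
</html>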

The SubElement class

Creating an Element object and calling the append() function can make the code messy and unreadable. The easiest way is to use the SubElement type. Its constructor takes two arguments – the parent node and the element name. Using SubElement, the following two lines of code can be replaced by just one:

body = etree.Element("body")
root.append(body)
# is the same as
body = etree.SubElement(root, "body")

Setting text and attributes

Setting text is very easy with the lxml library. Every instance of Element and SubElement exposes the text property and the set() method – the former is used to specify the text content, and the latter is used to set attributes. Here are some examples:

para = etree.SubElement(body, "p")
para.text="Hello World!"

Similarly, attributes can be set using the key-value convention:

para.set("style", "font-size:20pt")
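
To sketch the dictionary-like side of Element, attributes that have been set can be read back with the get() method or through the attrib mapping:

print(para.get("style"))  # font-size:20pt
print(dict(para.attrib))  # {'style': 'font-size:20pt'}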

One thing to note here is that attributes can also be passed in the constructor of SubElement:

para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"

The benefit of this approach is that it saves lines of code and improves clarity. Here is the complete code. Save it in a Python file and run it. It will print an HTML document which is also well-formed XML.

from lxml import etree
 
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p",  id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p",  id="secondPara")
para.text = "This is the second paragraph."
 
etree.dump(root)  # prints everything to console. Use for debug only

Note that here we used etree.dump() instead of calling etree.tostring(). The difference is that dump() simply writes everything to the console and doesn’t return anything, while tostring() is used for serialization and returns a string that you can store in a variable or write to a file. dump() is good for debugging only and should not be used for any other purpose.

Add the following lines at the bottom of the snippet and run it again:

with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

This will save the contents to input.html in the same folder where you run the script. Again, the output is well-formed XML that can be interpreted as either XML or HTML.
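
As an alternative sketch, the root element can be wrapped in an ElementTree object, which provides a write() method for saving directly to a file:

# Wrap the root element and write the document straight to a file
etree.ElementTree(root).write('input.html', pretty_print=True)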

How do you parse an XML file using lxml in Python?

The previous section was a Python lxml tutorial on creating XML files. In this section, we will look at traversing and manipulating an existing XML document using the lxml library.

Before we move on, save the following snippet as input.html.

<html>
  <head>
    <title>This is Page Title</title>
  </head>
  <body>
    <h1 style="font-size:20pt" id="head">Hello World!</h1>
    <p id="firstPara">This HTML is XML Compliant!</p>
    <p id="secondPara">This is the second paragraph.</p>
  </body>
</html>

When an XML document is parsed, the result is an in-memory ElementTree object.

The raw XML contents can come from the file system or a string. If they are in the file system, the document can be loaded using the parse() method. Note that the parse() method returns an object of type ElementTree. To get the root element, simply call the getroot() method:

from lxml import etree
 
tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem)  # prints file contents to console
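
To see the two object types side by side, a quick sketch that prints them:

print(type(tree))  # <class 'lxml.etree._ElementTree'>
print(type(elem))  # <class 'lxml.etree._Element'>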

The lxml.etree module exposes another method that can be used to parse contents from a valid XML string: fromstring().

xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)

One important difference to note here is that the fromstring() method returns an Element object directly, so there is no need to call getroot().
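
Both parse() and fromstring() also accept an optional parser argument. As a sketch, a customized XMLParser can be passed in, for example, to discard ignorable whitespace between elements:

# A parser configured to drop blank text between elements
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)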

If you want to dig deeper into parsing, we have already written a tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. But to quickly answer what lxml is in BeautifulSoup: BeautifulSoup can use lxml as a parser, and, similarly, lxml can employ BeautifulSoup as a parser backend.

Finding elements in XML

Broadly, there are two ways of finding elements using the Python lxml library. The first is by using the Python lxml querying languages: XPath and ElementPath. For example, the following code will return the first paragraph element.

Note that the selector is very similar to XPath. Also note that the root element name was not used because elem contains the root of the XML tree.

tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)
 
# Output 
# <p id="firstPara">This HTML is XML Compliant!</p>

Similarly, findall() will return a list of all the elements matching the selector.

elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)
 
# Outputs
# <p id="firstPara">This HTML is XML Compliant!</p>
# <p id="secondPara">This is the second paragraph.</p>
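
Besides find() and findall(), every element exposes an iter() method that walks the entire subtree and yields matching elements. A minimal sketch using the same document:

for e in elem.iter('p'):
    print(e.get('id'), e.text)
 
# Outputs
# firstPara This HTML is XML Compliant!
# secondPara This is the second paragraph.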

The second way of selecting elements is by using XPath directly. This approach is easier to follow for developers who are familiar with XPath. Furthermore, XPath can be used to return the instance of the element, its text, or the value of any attribute using standard XPath syntax.

para = elem.xpath('//p/text()')
for e in para:
    print(e)
 
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
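
The same standard XPath syntax also works for attributes. For example, this sketch collects the id attribute of every paragraph:

ids = elem.xpath('//p/@id')
print(ids)
 
# Output
# ['firstPara', 'secondPara']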

Handling HTML with lxml.html

Throughout this article, we have been working with well-formed HTML that is XML-compliant. This will not be the case a lot of the time. For these scenarios, you can simply use lxml.html instead of lxml.etree.

Note that fromstring() does not read directly from a file; the file contents should be read into a string first. Here is the code to print all paragraphs from the same HTML file:

from lxml import html
with open('input.html') as f:
    html_string = f.read()
tree = html.fromstring(html_string)
para = tree.xpath('//p/text()')
for e in para:
    print(e)
 
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
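
Elements returned by lxml.html also provide HTML-specific helpers that lxml.etree lacks. One example is the text_content() method, which returns the text of an element including all of its descendants:

# Print the full text of the first h1 element
print(tree.xpath('//h1')[0].text_content())
 
# Output
# Hello World!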

lxml web scraping tutorial

Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.

For this, the requests library is a great choice. It can be installed using the pip package manager:

pip install requests

Once the requests library is installed, the HTML of any web page can be retrieved using a simple get() method. Here is an example:

import requests
 
response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints source HTML
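
In practice, it is worth confirming that the request succeeded before parsing the response. A minimal sketch using the raise_for_status() helper from the requests library:

import requests
 
response = requests.get('http://books.toscrape.com/')
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
print(response.status_code)  # 200 on success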

This can be combined with lxml to retrieve any data that is required.

Here is a quick example that prints a list of countries from Wikipedia:

import requests
from lxml import html
 
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
 
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])

In this code, the HTML returned by response.text is parsed into the variable tree, which can then be queried using standard XPath syntax. XPath expressions can also be chained together. Note that the xpath() method returns a list, and thus only the first item is taken in this code snippet.
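
Because xpath() returns a list, taking the first item will raise an IndexError whenever nothing matches. A more defensive sketch of the same loop:

for country in countries:
    names = country.xpath('./following-sibling::a/text()')
    if names:  # skip nodes that have no matching sibling link
        print(names[0])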

This can easily be extended to read any attribute from the HTML. For example, the following modified code prints the country name and the image URL of its flag:

for country in countries:
    flag = country.xpath('.//img/@src')[0]
    country = country.xpath('./following-sibling::a/text()')[0]
    print(country, flag)

You can click here to find the complete code used in this article for your convenience.

Conclusion

In this Python lxml tutorial, various aspects of XML and HTML handling using the lxml library have been introduced. The Python lxml library is a lightweight, fast, and feature-rich library that can be used to create XML documents, read existing documents, and find specific elements. This makes the library equally powerful for both XML and HTML documents. Combined with the requests library, it can also be easily used for web scraping.

In our blog, you can learn more about web scraping using Selenium or other useful libraries like Beautiful Soup. Also, if you feel that building a data gathering tool is a bit challenging, check out our Web Scraper API – an advanced data collection solution for large-scale web scraping.

About the author

Gabija Fatenaite

Lead Product Marketing Manager

Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
