In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then jump onto processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml. Each step of this tutorial is complete with practical Python lxml examples.
This tutorial is aimed at developers who have at least a basic understanding of Python. A basic understanding of XML and HTML is also required. Simply put, if you know what an attribute is in XML, that is enough to understand this article.
This tutorial uses Python 3 code snippets but everything works on Python 2 with minimal changes as well.
- What is lxml in Python?
- Creating a simple XML document
- The Element class
- The SubElement class
- Setting text and attributes
- How do you parse an XML file using LXML in Python?
- Finding elements in XML
- Handling HTML with lxml.html
- lxml web scraping tutorial
What is lxml in Python?
lxml is one of the fastest and feature-rich libraries for processing XML and HTML in Python. This library is essentially a wrapper over C libraries libxml2 and libxslt. This combines the speed of the native C library and the simplicity of Python.
Using Python lxml library, XML and HTML documents can be created, parsed, and queried. It is a dependency on many of the other complex packages like Scrapy.
The best way to download and install the lxml library is from Python Package Index (PyPI). If you are on Linux (debian-based), simply run:
sudo apt-get install python3-lxml
Another way is to use the pip package manager. This works on Windows, Mac, and Linux:
pip3 install lxml
On windows, just use pip install lxml, assuming you are running Python 3.
Creating a simple XML document
Any XML or any XML compliant HTML can be visualized as a tree. A tree has a root and branches. Each branch optionally may have further branches. All these branches and the root are represented as an Element.
A very simple XML document would look like this:
<root> <branch> <branch_one> </branch_one> <branch_one> </branch_one > </branch> </root>
If an HTML is XML compliant, it will follow the same concept.
Note that HTML may or may not be XML compliant. For example, if an HTML has <br> without a corresponding closing tag, it is still valid HTML, but it will not be a valid XML. In the later part of this tutorial, we will see how these cases can be handled. For now, let’s focus on XML compliant HTML.
The Element class
To create an XML document using python lxml, the first step is to import the etree module of lxml:
>>> from lxml import etree
Every XML document begins with the root element. This can be created using the Element type. The Element type is a flexible container object which can store hierarchical data. This can be described as a cross between a dictionary and a list.
In this python lxml example, the objective is to create an HTML, which is XML compliant. It means that the root element will have its name as html:
>>> root = etree.Element("html")
Similarly, every html will have a head and a body:
>>> head = etree.Element("head") >>> body = etree.Element("body")
To create parent-child relationships, we can simply use the append() method.
>>> root.append(head) >>> root.append(body)
This document can be serialized and printed to the terminal with the help of tostring() function. This function expects one mandatory argument, which is the root of the document. We can optionally set pretty_print to True to make the output more readable. Note that tostring() serializer actually returns bytes. This can be converted to string by calling decode():
>>> print(etree.tostring(root, pretty_print=True).decode())
The SubElement class
Creating an Element object and calling the append() function can make the code messy and unreadable. The easiest way is to use the SubElement type. Its constructor takes two arguments – the parent node and the element name. Using SubElement, the following two lines of code can be replaced by just one.
body = etree.Element("body") root.append(body) # is same as body = etree.SubElement(root,"body")
Setting text and attributes
Setting text is very easy with the lxml library. Every instance of the Element and SubElement exposes two methods – text and set, the former is used to specify the text and later is used to set the attributes. Here are the examples:
para = etree.SubElement(body, "p") para.text="Hello World!"
Similarly, attributes can be set using key-value convention:
One thing to note here is that the attribute can be passed in the constructor of SubElement:
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara") para.text = "Hello World!"
The benefit of this approach is saving lines of code and clarity. Here is the complete code. Save it in a python file and run it. It will print an HTML which is also a well-formed XML.
from lxml import etree root = etree.Element("html") head = etree.SubElement(root, "head") title = etree.SubElement(head, "title") title.text = "This is Page Title" body = etree.SubElement(root, "body") heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head") heading.text = "Hello World!" para = etree.SubElement(body, "p", id="firstPara") para.text = "This HTML is XML Compliant!" para = etree.SubElement(body, "p", id="secondPara") para.text = "This is the second paragraph." etree.dump(root) # prints everything to console. Use for debug only
Note that here we used etree.dump() instead of calling etree.tostring(). The difference is that dump() simply writes everything to the console and doesn’t return anything, tostring() is used for serialization and returns a string which you can store in a variable or write to a file. dump() is good for debug only and should not be used for any other purpose.
Add the following lines at the bottom of the snippet and run it again:
with open(‘input.html’, ‘wb’) as f: f.write(etree.tostring(root, pretty_print=True)
This will save the contents to input.html in the same folder you were running the script. Again, this is a well-formed XML, which can be interpreted as XML or HTML.
How do you parse an XML file using LXML in Python?
The previous section was a Python lxml tutorial on creating XML files. In this section, we will look at traversing and manipulating an existing XML document using the lxml library.
Before we move on, save the following snippet as input.html.
<html> <head> <title>This is Page Title</title> </head> <body> <h1 style="font-size:20pt" id="head">Hello World!</h1> <p id="firstPara">This HTML is XML Compliant!</p> <p id="secondPara">This is the second paragraph.</p> </body> </html>
When an XML document is parsed, the result is an in-memory ElementTree object.
The raw XML contents can be in a file system or a string. If it is in a file system, it can be loaded using the parse method. Note that the parse method will return an object of type ElementTree. To get the root element, simply call the getroot() method.
from lxml import etree tree = etree.parse('input.html') elem = tree.getroot() etree.dump(elem) #prints file contents to console
The lxml.etree module exposes another method that can be used to parse contents from a valid xml string — fromstring()
xml = '<html><body>Hello</body></html>' root = etree.fromstring(xml) etree.dump(root)
One important difference to note here is that fromstring() method returns an object of element. There is no need to call getroot().
If you want to dig deeper into parsing, we have already written a tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. But to quickly answer what is lxml in BeautifulSoup, lxml can use BeautifulSoup as a parser backend. Similarly, BeautifulSoup can employ lxml as a parser.
Finding elements in XML
Broadly, there are two ways of finding elements using the Python lxml library. The first is by using the Python lxml querying languages: XPath and ElementPath. For example, the following code will return the first paragraph element.
Note that the selector is very similar to XPath. Also note that the root element name was not used because elem contains the root of the XML tree.
tree = etree.parse('input.html') elem = tree.getroot() para = elem.find('body/p') etree.dump(para) # Output # <p id="firstPara">This HTML is XML Compliant!</p>
Similarly, findall() will return a list of all the elements matching the selector.
elem = tree.getroot() para = elem.findall('body/p') for e in para: etree.dump(e) # Outputs # <p id="firstPara">This HTML is XML Compliant!</p> # <p id="secondPara">This is the second paragraph.</p>
The second way of selecting the elements is by using XPath directly. This approach is easier to follow by developers who are familiar with XPath. Furthermore, XPath can be used to return the instance of the element, the text, or the value of any attribute using standard XPath syntax.
para = elem.xpath('//p/text()') for e in para: print(e) # Output # This HTML is XML Compliant! # This is the second paragraph.
Handling HTML with lxml.html
Throughout this article, we have been working with a well-formed HTML which is XML compliant. This will not be the case a lot of the time. For these scenarios, you can simply use lxml.html instead of lxml.etree.
Note that reading directly from a file is not supported. The file contents should be read in a string first. Here is the code to print all paragraphs from the same HTML file.
from lxml import html with open('input.html') as f: html_string = f.read() tree = html.fromstring(html_string) para = tree.xpath('//p/text()') for e in para: print(e) # Output # This HTML is XML Compliant! # This is the second paragraph
lxml web scraping tutorial
Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.
For this, the ‘requests’ library is a great choice. It can be installed using the pip package manager:
pip install requests
Once the requests library is installed, HTML of any web page can be retrieved using a simple get() method. Here is an example.
import requests response = requests.get('http://books.toscrape.com/') print(response.text) # prints source HTML
This can be combined with lxml to retrieve any data that is required.
Here is a quick example that prints a list of countries from Wikipedia:
import requests from lxml import html response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010') tree = html.fromstring(response.text) countries = tree.xpath('//span[@class="flagicon"]') for country in countries: print(country.xpath('./following-sibling::a/text()'))
In this code, the HTML returned by response.text is parsed into the variable tree. This can be queried using standard XPath syntax. The XPaths can be concatenated. Note that the xpath() method returns a list and thus only the first item is taken in this code snippet.
This can easily be extended to read any attribute from the HTML. For example, the following modified code prints the country name and image URL of the flag.
for country in countries: flag = country.xpath('./img/@src') country = country.xpath('./following-sibling::a/text()') print(country, flag)
In this Python lxml tutorial, various aspects of XML and HTML handling using the lxml library have been introduced. Python lxml library is a light-weight, fast, and feature-rich library. This can be used to create XML documents, read existing documents, and find specific elements. This makes this library equally powerful for both XML and HTML documents. Combined with requests library, it can also be easily used for web scraping.