Good code is judged, among other things, by how easily others can understand the context around its design, which is what makes later maintenance possible. It should also be easy to read and carry comments that explain what the code and its functions do. These qualities point to two of the most fundamental activities in programming: developing code and documenting it.
For the computation process to be complete, it should also include execution and, depending on the programming tools being used, communication of the results. Notably, an interactive computing interface that combines all of these elements already exists. It’s known as the Jupyter Notebook.
This article explains what Jupyter Notebook is, what it’s used for, how to set it up on your computer, and whether it’s ideal for beginners. It also serves as a Jupyter Notebook tutorial on various aspects of the application, including using it for web scraping.
What is a Jupyter Notebook?
In computing, a notebook interface is a coding environment that allows you to develop code and document the logic behind it. Simply put, it’s an environment that facilitates literate programming. At launch in 1987, the notebook interface was a system of documenting mathematical metaphors to support mathematical applications that undertook the computation.
However, as the popularity of this interface grew, so did the need for new “computational engines,” known as kernels. The kernels execute the code contained in the notebook, so the notebook environment and its associated kernel became tied together: opening the notebook also starts the kernel. Some of the notable “engines” include MATLAB, SQL, Scala, Julia, Python, R, and more.
Jupyter Notebook is an open-source web-based notebook environment that supports three main computational engines or Jupyter languages, namely Julia, Python, and R; the name “Jupyter” is a nod to these three languages. By default, it uses the IPython kernel, which executes Python code, but it also supports more than 100 other computational engines.
With Jupyter Notebook, a spin-off project from the original IPython, you can create and send documents containing code, text, equations, graphs, and more. This arrangement allows you to write code, view it, and execute it within a single user interface, thus making it easy to make changes on the go and immediately see the impact of the changes.
What is Jupyter Notebook used for?
Jupyter Notebook is used for a number of operations, including:
- Data visualization: Jupyter Notebook was initially designed for use by data scientists, meaning the analysis and interpretation of data are at the center of its operations. For this reason, you can use this application to generate charts and graphs from code, using available modules such as Bokeh, Plotly, and Matplotlib. Notably, Jupyter Notebook is structured such that it generates charts immediately below the code that created them.
- Multimedia support: As a web application, Jupyter Notebook supports multimedia. There are two approaches to including multimedia on your notebooks, namely generating them using a module called IPython.display (this entails writing code) or including the multimedia as an HTML document.
- Web scraping: Web scraping refers to the automated extraction of data from websites. As we’ll detail later, Jupyter Notebook comes in handy here because you can fetch a page, parse it, and inspect the result of each step within a single environment.
- Code sharing: Jupyter Notebook is somewhat comparable to cloud services like Pastebin and GitHub, which facilitate code sharing. However, Jupyter Notebook also promotes interactivity in that you can create code, execute it, view the results, and include text-based comments, all using your web browser. You can then save the file and send it to anyone with whom you are collaborating.
- Documenting code: As we detailed earlier, writing quality code entails making it easy to read and making the context in which it was created easy to understand. As an application meant to promote collaboration, Jupyter Notebook has tools that facilitate exactly this. Its notebook environment enables you to explain your code line by line, making it easy for anyone to follow your reasoning.
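As a rough illustration of the multimedia point above: Jupyter looks for a _repr_html_ method on the last value in a code cell and, if it finds one, renders the returned HTML directly below the cell. The sketch below relies only on that display hook and Python's standard library. The BarList class and its sample figures are hypothetical, and in practice a plotting module such as Matplotlib would do this work for you.

```python
from html import escape


class BarList:
    """Hypothetical helper that renders labelled values as simple HTML bars."""

    def __init__(self, values, max_width=200):
        self.values = dict(values)
        self.max_width = max_width

    def _repr_html_(self):
        # Jupyter calls this method to build the rich output of a cell.
        largest = max(self.values.values(), default=0) or 1
        rows = []
        for label, value in self.values.items():
            # Scale each bar relative to the largest value.
            width = round(self.max_width * value / largest)
            rows.append(
                f"<div>{escape(str(label))} "
                f'<span style="display:inline-block;background:#4c72b0;'
                f'height:10px;width:{width}px"></span> {value}</div>'
            )
        return "\n".join(rows)


# Made-up sample data. As the last line of a notebook cell, `chart` would be
# drawn as three bars right under the cell.
chart = BarList({"Mon": 12, "Tue": 30, "Wed": 21})
chart
```

Outside a notebook, the same object simply prints its default representation, which is why this mechanism is specific to the notebook interface.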
Setting up a Jupyter Notebook
You can set up Jupyter Notebook in various ways; this article explores one of them. We’re assuming that you’re using Python 3 and have already installed it on your computer. If not, you can download the latest version from the Python website.
Using Command Line
- Run Command Prompt (Windows) or Terminal (macOS). Remember that the folder you are in when you launch the application automatically becomes its home directory, so it’s advisable to create a new folder first to keep the Jupyter Notebook files from cluttering an unintended folder.
- Once your command line is up and running, type the following command and hit Enter/Return:
pip install jupyter
This installs Jupyter Notebook. The process should take about a minute to complete.
- To run Jupyter Notebook, type the following on the command line:
jupyter notebook
This should open your default browser and display the Jupyter Notebook homepage.
To create a new notebook, click New on the top right side of the page and select Python 3 from the dropdown menu. Then, name your notebook.
Note that this approach to launching Jupyter Notebook makes the folder you were in before running the command the home directory for your Jupyter Notebook files. In our case, we launched it from Python’s Scripts folder, which is why the homepage listed various files with a .exe extension. In your case, you could create a new folder before opening the command line.
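If you prefer to script the folder step, the following is a minimal sketch rather than an official launcher. It assumes the jupyter command is already installed and on your PATH; the helper names and the folder name are made up for illustration.

```python
import subprocess
from pathlib import Path


def prepare_notebook_folder(path):
    """Create a dedicated folder (if missing) to act as Jupyter's home directory."""
    folder = Path(path).expanduser()
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def launch_notebook(folder):
    """Run `jupyter notebook` with `folder` as the working directory."""
    # Same effect as changing into the folder and typing the command by hand.
    return subprocess.run(["jupyter", "notebook"], cwd=folder, check=False)


# Example call; it blocks until the notebook server is stopped:
# launch_notebook(prepare_notebook_folder("~/jupyter-projects"))
```

Because the server starts inside the new folder, the homepage lists only the files you put there.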
How to use Jupyter Notebook?
After creating a new folder and subsequently a new notebook, it’s time to start using Jupyter Notebook. But how do you use it, and how is a notebook structured? Much like Microsoft Excel, Jupyter Notebook is composed of cells, i.e., boxes that allow you to input lines of code or text.
By default, the cell type dropdown in Jupyter Notebook’s toolbar reads ‘Code,’ meaning that the cell accepts only code-based entries. However, the dropdown menu enables you to choose other options, namely ‘Markdown,’ ‘Raw NBConvert,’ and ‘Heading.’ With these options, you can make use of the whole notebook environment.
As stated, Jupyter Notebook executes the written code. To do this, the toolbar contains a ‘play’ symbol that enables you to run the selected cell. Alternatively, you can use the key combination Ctrl+Enter (Windows) or Ctrl+Return (Mac).
With Jupyter Notebooks, you can do the following:
- Import a .csv file
- Include comments alongside your code
- Create graphics and charts
- Share code
- Extract data from websites
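To make the first two items concrete, here is a small sketch of three code cells, marked by comments, that import a .csv file and then reuse the loaded data. The file name and its contents are invented so that the example is self-contained, and it uses the standard csv module rather than any particular data library.

```python
import csv
import tempfile
from pathlib import Path

# Cell 1: create a small sample file (made-up data) to work with.
csv_path = Path(tempfile.mkdtemp()) / "prices.csv"
csv_path.write_text("product,price\nlaptop,999.5\nmouse,19.9\n", encoding="utf-8")

# Cell 2: import the .csv file; `rows` stays in memory for later cells.
with csv_path.open(newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))

# Cell 3: reuse `rows` without reloading the file.
average_price = sum(float(row["price"]) for row in rows) / len(rows)
print(f"{len(rows)} products, average price {average_price:.2f}")
```

In a notebook, each of these cells can be run on its own, and the output of the last one appears directly beneath it.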
Jupyter Notebook and web scraping
Successful web scraping is a multistep process that entails retrieving HTML data from the public portion of the target website, parsing it to obtain only the desired information, and, lastly, storing this information in a file. Web scraping with Jupyter Notebook follows a similar process, but successful completion relies on various Python libraries.
For this reason, you have to use modules such as urllib.request to fetch the pages behind the target Uniform Resource Locators (URLs) and a combination of the Beautiful Soup package and the lxml library to parse the HTML data.
The latter combination ensures that you can harness the benefits of both Beautiful Soup and lxml. For instance, Beautiful Soup copes with badly formed HTML and presents the document as Python objects, while lxml is a fast and powerful HTML processing library.
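The fetching step might look roughly like the sketch below. It covers only the download with urllib.request; the function name, the User-Agent string, and the example URL are placeholders, and the returned text would then be handed to a parser such as Beautiful Soup.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download the page at `url` and return its HTML as text."""
    request = urllib.request.Request(
        url,
        # Placeholder identifier; some sites reject requests without one.
        headers={"User-Agent": "jupyter-scraping-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare an encoding.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# In a notebook cell (https://example.com is a placeholder target):
# html = fetch_html("https://example.com")
# html[:200]
```

Ending the cell with html[:200] prints the first characters of the page, which is a quick way to check that the download worked before parsing anything.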
Next, the parsed data needs to be converted into a structured format, such as a table with rows and columns. This is a multistep process that begins with removing HTML tags from the text using Beautiful Soup. Thereafter, you need to create an empty list and append each row of text to it. Finally, convert the list into a data frame using two Python libraries, namely Pandas and NumPy, and save the data as a .csv file.
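The sketch below walks through the same steps on a small, invented HTML table. To keep it runnable without extra installs, it swaps in Python's built-in html.parser and csv modules for Beautiful Soup and Pandas, so treat it as an outline of the flow rather than the exact library calls described above. The class and function names are hypothetical.

```python
import csv
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # the initially empty list that receives each row
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Only text was collected, so the tags are already stripped here.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html, csv_path):
    """Parse the table rows in `html`, write them to `csv_path`, return them."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(parser.rows)
    return parser.rows


# Invented sample markup standing in for a downloaded page.
sample = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td><b>Laptop</b></td><td>999.50</td></tr>
  <tr><td>Mouse</td><td>19.90</td></tr>
</table>
"""
csv_path = Path(tempfile.mkdtemp()) / "products.csv"
rows = table_to_csv(sample, csv_path)
rows
```

With Pandas installed, the last step could instead be written as pandas.DataFrame(rows[1:], columns=rows[0]).to_csv(csv_path, index=False), which also lets you inspect the data frame in the notebook before saving it.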
Notably, in Jupyter Notebooks, the result of each line, block, or paragraph is visible after running it and doesn’t disappear with the run of the following line, block, or paragraph.
Is Jupyter Notebook good for web scraping beginners?
As a tool that supports explanatory text, coding, code execution, and visualization of the output, Jupyter Notebook is handy for beginners. Because it lets the user see everything in one place, it provides an environment that promotes experimentation: you can try out different pieces of code and run sections of the code independently. These functional elements make the Python Jupyter Notebook extremely useful for beginners, as it builds on their knowledge and promotes digital literacy.
What is a Jupyter Notebook not suitable for?
While the Jupyter Notebook is perfect for novices and beginners, seasoned developers and data scientists usually struggle with it. Its deficiencies also trickle down to its various use cases.
While we have mentioned that it promotes collaboration, it only does so in a limited capacity. By that, we mean that you have to save the Jupyter Notebook files and send them to another developer. Further, two people cannot work on the same notebook simultaneously. If two individuals are inputting code at the same time, Jupyter Notebook cannot successfully merge their changes.
Secondly, Jupyter Notebook offers no convenient way to conduct unit testing. This makes it hard to isolate bugs within a few lines of code. Testing code on this platform requires you to wrap up the entire code and then run the test. Even so, successful execution is not guaranteed, as the run may fail halfway due to errors within the code.
If we delve deeper, specifically into web scraping, the Jupyter Notebook should only be used to extract small volumes of data, for practice, or for code testing purposes.
Jupyter Notebook displays the results of each line, block, or paragraph. When scraping on a large scale, there are many results to display, which can slow down the whole web scraping process. The larger the scale of web scraping, the more important efficiency is. Running optimized code makes much more sense once the initial discovery phase has been completed in Jupyter. However, it’s quite likely that such optimized code won’t run well in Jupyter, and you would have to stop using it at this point.
A larger web scraping project also requires more developers to work on it. As mentioned above, two or more developers cannot work on the same notebook simultaneously. Instead, developers have to save the Jupyter Notebook files and send them to one another. This process is inefficient, so using a Jupyter Notebook for large-scale web scraping projects isn’t worth the effort.
Jupyter Notebook is a proven and useful tool for creating a notebook environment in which you can write comments and code, execute the code, and see its outcome within the same platform. These features make Jupyter Notebook interactive, an attribute that makes it a valuable tool for beginners. It also facilitates collaboration, but only to a limited extent. While Jupyter Notebook can be used for web scraping, it’s only recommended for small-scale projects.