Augustas Pelakauskas

Dec 23, 2021 9 min read

Web scraping is the automated extraction of data from a website. A web scraper collects the data and exports it to a more usable format (JSON, CSV) for further analysis. Building a scraper can be complicated, so guidance and practical examples help. The vast majority of web scraping tutorials concentrate on the most popular scraping languages, such as JavaScript, PHP, and, more often than not, Python. This time, let’s take a look at Golang.

Golang, or Go, is designed to combine the static typing and run-time efficiency of C with the usability of Python and JavaScript, adding built-in support for high-performance networking and concurrency. It is also a compiled language, which makes it fast.

This article will guide you through the step-by-step process of writing a fast and efficient Golang web scraper that can extract public data from a target website.

Installing Go

To start, head over to the Go downloads page. Here you can download all of the common installers, such as the Windows MSI installer, the macOS package, and the Linux tarball. Go is open-source, so if you wish to compile Go on your own, you can download the source code as well.

A package manager facilitates working with first-party and third-party libraries by helping you to define and download project dependencies. The manager pins down version changes, allowing you to upgrade your dependencies without fear of breaking the established infrastructure.

If you prefer package managers, you can use Homebrew on macOS. Open the terminal and enter the following:

brew install go

On Windows, you can use the Chocolatey package manager. Open the command prompt and enter the following:

choco install golang

Once Go is installed, you can use any code editor or an IDE that supports Go. One of the most common code editors is Visual Studio Code. If you are using Visual Studio Code, install the Go extension.

Go extension for Visual Studio Code

If you prefer a complete IDE, you can use GoLand.

Both Visual Studio Code and GoLand are available for Windows, macOS, and Linux.

Web scraping frameworks

Go offers a wide selection of frameworks. Some are simple packages with core functionality, while others, such as Ferret, Gocrawl, Soup, and Hakrawler, offer a complete web scraping infrastructure to simplify data extraction.

The most popular framework for writing web scrapers in Go is Colly.

Colly is a fast scraping framework that can be used to write any kind of crawler, scraper, or spider. If you want to know more about the difference between a scraper and a crawler, check this article.

Colly has a clean API, handles cookies and sessions automatically, supports caching and robots.txt, and most importantly, it’s fast. Colly offers distributed scraping, HTTP request delays, and concurrency per domain.

In this article, we’ll be using Colly to scrape books.toscrape.com. The website is a dummy book store for practicing web scraping.

Parsing HTML with Colly

To parse the URLs and HTML, the first step is to create a project and install Colly.

Create a new directory and navigate there using the terminal. From this directory, run the following command:

go mod init oxylabs.io/web-scraping-with-go

This will create a go.mod file that contains the following lines with the name of the module and the version of Go. In this case, the version of Go is 1.17:

module oxylabs.io/web-scraping-with-go
go 1.17

Next, run the following command to install Colly and its dependencies:

go get github.com/gocolly/colly

This command will also update the go.mod file with all the required dependencies as well as create a go.sum file.

We are now ready to write the web scraper code file. Create a new file, save it as books.go and enter the following code:

package main

import (
   "encoding/csv"
   "fmt"
   "log"
   "os"

   "github.com/gocolly/colly"
)

func main() {
   // Scraping code here
   fmt.Println("Done")
}

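The first line is the name of the package. Next, the required standard-library packages and Colly itself are imported. Keep in mind that Go refuses to compile a file with unused imports, so this skeleton builds only once the code added in the following sections uses each of them.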

The main() function is going to be the entry point of the program. This is where we’ll write the code for the web scraper.

Sending HTTP requests with Colly

The fundamental component of a Colly web scraper is the Collector. The Collector makes HTTP requests and traverses HTML pages.

The Collector exposes multiple events. We can hook custom functions that execute when these events are raised. These functions are usually anonymous and are passed as parameters.

First, to create a new Collector using default settings, enter this line in your code:

c := colly.NewCollector()

There are many other parameters that can be used to control the behavior of the Collector. In this example, we are going to limit the allowed domains. Change the line as follows:

c := colly.NewCollector(
   colly.AllowedDomains("books.toscrape.com"),
)

Once the instance is available, the Visit() function can be called to start the scraper. However, before doing so, it’s important to hook up to a few events.

The OnRequest event is raised when an HTTP request is sent to a URL. This event is used to track which URL is being visited. A simple anonymous function that prints the requested URL looks like this:

c.OnRequest(func(r *colly.Request) {
   fmt.Println("Visiting", r.URL)
})

Note that the anonymous function being sent as a parameter here is a callback function. It means that this function will be called when the event is raised.
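To make the callback idea concrete, here is a minimal sketch that uses only the standard library. The notifier type and its methods are hypothetical stand-ins written for this illustration; they are not Colly's implementation, they only mimic the register-now, run-later pattern.

```go
package main

import "fmt"

// notifier is a hypothetical stand-in for an event source such as a collector.
type notifier struct {
	onRequest []func(url string)
}

// OnRequest registers a callback. Nothing is executed at this point.
func (n *notifier) OnRequest(f func(url string)) {
	n.onRequest = append(n.onRequest, f)
}

// Visit raises the event and calls every registered callback in order.
func (n *notifier) Visit(url string) {
	for _, f := range n.onRequest {
		f(url)
	}
}

func main() {
	n := &notifier{}
	// The anonymous function is stored now and invoked later by Visit.
	n.OnRequest(func(url string) {
		fmt.Println("Visiting", url)
	})
	n.Visit("https://books.toscrape.com/")
}
```

Registering the function does not run it; it runs only when Visit raises the event, which is the same order of operations you will see with the Collector.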

Similarly, OnResponse can be used to examine the response. The following is one such example:

c.OnResponse(func(r *colly.Response) {
   fmt.Println(r.StatusCode)
})

The OnHTML event can be used to take action when a specific HTML element is found.

Locating HTML elements via CSS selector

The OnHTML event can be hooked using the CSS selector and a function that executes when the HTML elements matching the selector are found.

For example, the following function executes when a title tag is encountered:

c.OnHTML("title", func(e *colly.HTMLElement) {
   fmt.Println(e.Text)
})

This function extracts the text inside the title tag and prints it. Putting together all we have gone through so far, the main() function is as follows:

func main() {
   c := colly.NewCollector(
      colly.AllowedDomains("books.toscrape.com"),
   )

   c.OnHTML("title", func(e *colly.HTMLElement) {
      fmt.Println(e.Text)
   })

   c.OnResponse(func(r *colly.Response) {
      fmt.Println(r.StatusCode)
   })

   c.OnRequest(func(r *colly.Request) {
      fmt.Println("Visiting", r.URL)
   })

   c.Visit("https://books.toscrape.com/")
}

This file can be run from the terminal as follows:

go run books.go

The output will be as follows:

Visiting https://books.toscrape.com/
200
All products | Books to Scrape - Sandbox 

Extracting the HTML elements

Now that we know how Colly works, let’s modify OnHTML to extract the book titles and prices.

The first step is to understand the HTML structure of the page.

The books are in the <article> tags

Each book is contained in an article tag that has a product_pod class. The CSS selector would be .product_pod.

Next, the complete book title is found in the thumbnail image as an alt attribute value. The CSS selector for the book title would be .image_container img.

Finally, the CSS selector for the book price would be .price_color.

The OnHTML can be modified as follows:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
   title := e.ChildAttr(".image_container img", "alt")
   price := e.ChildText(".price_color")
   // Use both values; Go does not compile declared-but-unused variables.
   fmt.Println(title, price)
})

This function will execute every time a book is found on the page.

Note the use of the ChildAttr function, which takes two parameters: the CSS selector and the name of the attribute. Keeping the results in loose variables isn’t elegant, though. A better idea is to create a data structure to hold this information. In this case, we can use a struct as follows:

type Book struct {
	Title string
	Price string
}

The OnHTML will be modified as follows:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
	book := Book{}
	book.Title = e.ChildAttr(".image_container img", "alt")
	book.Price = e.ChildText(".price_color")
	fmt.Println(book.Title, book.Price)
})

For now, this web scraper is simply printing the information to the console, which isn’t particularly useful. We’ll revisit this function when it’s time to save the data to a CSV file.
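Because Price is kept as a string, any numeric work (sorting, totals) needs a conversion first. The sketch below is one possible approach using only the standard library. The helper name parsePrice and the sample value "£51.77" are assumptions made for this example, based on a price being rendered with a leading currency symbol.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// parsePrice is a hypothetical helper. It strips surrounding whitespace and
// any leading non-digit characters (such as a currency symbol), then parses
// the remainder as a decimal number.
func parsePrice(s string) (float64, error) {
	trimmed := strings.TrimLeftFunc(strings.TrimSpace(s), func(r rune) bool {
		return !unicode.IsDigit(r)
	})
	return strconv.ParseFloat(trimmed, 64)
}

func main() {
	price, err := parsePrice("£51.77")
	if err != nil {
		fmt.Println("could not parse price:", err)
		return
	}
	fmt.Printf("%.2f\n", price)
}
```

Keeping the original string in the struct and converting only when needed preserves the currency symbol for export.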

Handling pagination

First, we need to locate the “next” button and create a CSS selector. For this particular site, the CSS selector is .next > a. Using the selector, a new function can be added to the OnHTML event. In this function, we’ll convert a relative URL to an absolute URL. Then, we’ll call the Visit() function to crawl the converted URL:

c.OnHTML(".next > a", func(e *colly.HTMLElement) {
	nextPage := e.Request.AbsoluteURL(e.Attr("href"))
	c.Visit(nextPage)
})

The existing function that scrapes the book information will be called on all of the resulting pages as well. No additional code is needed.
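To illustrate what converting a relative URL to an absolute one means, here is a small sketch built on the standard net/url package. It is not Colly's internal code; the helper name resolve and the sample page URLs are assumptions chosen for this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper that turns href into an absolute URL,
// relative to the page on which the link was found.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	next, err := resolve("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")
	if err != nil {
		fmt.Println("could not resolve link:", err)
		return
	}
	fmt.Println(next)
}
```

The relative href replaces only the last path segment of the current page, which is why the same "next" link works on every page of the catalogue.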

Now that we have the data from all of the pages, it’s time to save it to a CSV file.

Writing data to a CSV file

The built-in CSV library can be used to save the structure to CSV files. If you want to save the data in JSON format, you can use the JSON library as well.
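For the JSON route, a minimal sketch could look like the following. It assumes the scraped books have been collected into a slice first; the json struct tags, the encodeBooks helper, and the sample book are additions made for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Book mirrors the struct used by the scraper. The json tags are an
// assumption added here to get lowercase keys in the output.
type Book struct {
	Title string `json:"title"`
	Price string `json:"price"`
}

// encodeBooks is a hypothetical helper that renders the books as indented JSON.
func encodeBooks(books []Book) ([]byte, error) {
	return json.MarshalIndent(books, "", "  ")
}

func main() {
	// Sample data standing in for scraped results.
	books := []Book{{Title: "A Light in the Attic", Price: "£51.77"}}
	out, err := encodeBooks(books)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```

Unlike CSV, where rows can be written one at a time, a JSON array is easiest to produce by appending each book to a slice inside the callback and encoding the slice once at the end.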

To create a new CSV file, enter the following code before creating the Colly collector:

file, err := os.Create("export.csv")
if err != nil {
   log.Fatal(err)
}
defer file.Close()

This will create export.csv and, because of defer, postpone closing the file until the main() function returns.

Next, add these two lines to create a CSV writer:

writer := csv.NewWriter(file)
defer writer.Flush() 
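The order of the two defer statements matters: deferred calls run in last-in, first-out order, so the writer is flushed before the file is closed. The sketch below demonstrates that ordering with a hypothetical deferOrder function that only records step names and touches no files.

```go
package main

import "fmt"

// deferOrder is a hypothetical function that records the order in which two
// deferred calls run. The named result lets the deferred calls append to the
// value that is ultimately returned.
func deferOrder() (order []string) {
	record := func(step string) { order = append(order, step) }
	defer record("close file")   // deferred first, runs last
	defer record("flush writer") // deferred second, runs first
	return
}

func main() {
	fmt.Println(deferOrder())
}
```

If the two defer statements were swapped, the file would be closed before the buffered rows were flushed to it.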

Now it’s time to write the headers:

headers := []string{"Title", "Price"}
writer.Write(headers)

Finally, modify the OnHTML function to write each book as a single row:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
	book := Book{}
	book.Title = e.ChildAttr(".image_container img", "alt")
	book.Price = e.ChildText(".price_color")
	row := []string{book.Title, book.Price}
	writer.Write(row)
})
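The snippets above ignore the errors returned by the CSV writer. As a hedged sketch of how they could be checked, the example below writes to an in-memory buffer instead of a file; the helpers writeBooks and booksCSV and the sample book are hypothetical names and data introduced for this illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"log"
)

// Book mirrors the struct used by the scraper.
type Book struct {
	Title string
	Price string
}

// writeBooks is a hypothetical helper that writes a header and one row per
// book, returning the first error it encounters.
func writeBooks(w io.Writer, books []Book) error {
	writer := csv.NewWriter(w)
	if err := writer.Write([]string{"Title", "Price"}); err != nil {
		return err
	}
	for _, b := range books {
		if err := writer.Write([]string{b.Title, b.Price}); err != nil {
			return err
		}
	}
	// Flush does not return an error itself; Error reports any failure
	// from earlier writes or from the flush.
	writer.Flush()
	return writer.Error()
}

// booksCSV renders the books into a string, which is convenient for checks.
func booksCSV(books []Book) (string, error) {
	var buf bytes.Buffer
	if err := writeBooks(&buf, books); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := booksCSV([]Book{{Title: "A Light in the Attic", Price: "£51.77"}})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```

Because writeBooks accepts any io.Writer, the same function works with the file returned by os.Create.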

That’s all! The code for the Golang web scraper is now complete.

Run the file by entering the following in the terminal:

go run books.go

This will create an export.csv file with a header row followed by 1,000 rows of data.

Conclusion

Code written in Go is cross-platform and runs remarkably fast. The code used in this article ran in less than 12 seconds. Executing the same task in Scrapy, which is one of the most optimized modern frameworks for Python, took 22 seconds. If speed is what you prioritize for your web scraping tasks, it’s a good idea to consider Golang in tandem with a modern framework such as Colly.

If you want to learn more about scraping the web with other programming languages, we have published multiple articles, such as web scraping with JavaScript, Java, R, Ruby, and many more.

About Augustas Pelakauskas

Augustas Pelakauskas is a Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his third best friend.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
