Web Scraping With C#

Monika Maslauskaite

Last updated on 2023-12-06


The first decision to make when writing a web scraper is which programming language to use. C# might be just the right match. As a popular general-purpose language, it offers a wide range of tools for tackling web scraping tasks of any kind. C# belongs to the C family of languages and scales well to larger projects while remaining simple and easy to use.

In this article, we’ll explore C# and show you how to create a real-life C# public web scraper. Keep in mind that even if we’re using C#, you’ll be able to adapt this information to all languages supported by the .NET platform, including VB.NET and F#.


C# Web Scraping Tools

Before writing any code, the first step is choosing a suitable C# library or package. Such a package provides the functionality to download HTML pages, parse them, and extract the required data from them. Some of the most popular C# packages are as follows:

ScrapySharp
Puppeteer Sharp
Html Agility Pack

Html Agility Pack is the most popular of these packages, with almost 50 million downloads from NuGet alone. There are multiple reasons behind its popularity, the most significant being that this HTML parser can download web pages directly or through a browser. The package is tolerant of malformed HTML, supports XPath, and can even parse local HTML files. For these reasons, we'll use it throughout this article.

ScrapySharp adds even more functionality. This package supports CSS selectors and can simulate a web browser. While ScrapySharp is considered a powerful C# package, it is not very actively maintained.

Puppeteer Sharp is a .NET port of the famous Puppeteer project for Node.js. It uses the same Chromium browser to load the pages. Also, this package uses the async-await style of code, enabling asynchronous, Task-based behavior (the C# counterpart of promises in JavaScript). Puppeteer Sharp might be a good option if you are already familiar with Puppeteer and need a browser to render pages.

Building a web scraper with C#

As mentioned, we'll now demonstrate how to write a C# public web scraper that uses Html Agility Pack. We will be using the .NET 5 SDK with Visual Studio Code. This code has been tested with .NET Core 3 and .NET 5, and it should work with other versions of .NET.

We'll follow a hypothetical scenario: scraping a bookstore and collecting book names and prices. Let's set up the development environment before writing a C# web crawler.

Set Up the Development Environment

For the C# development environment, install Visual Studio Code. Note that Visual Studio and Visual Studio Code are two completely different applications, even though both can be used for writing C# code.

Once Visual Studio Code is installed, install .NET 5.0 or newer. You can also use .NET Core 3.1. After installation is complete, open the terminal and run the following command to verify that the .NET CLI (command-line interface) is working properly:

dotnet --version

This should output the version number of the installed .NET SDK.

Project Structure and Dependencies

The code will be part of a .NET project. To keep it simple, we'll create a console application. First, make a folder where you want to write the C# code. Then open the terminal, navigate to that folder, and type in this command:

dotnet new console

The output of this command should be the confirmation that the console application has been successfully created.

Now, it’s time to install the required packages. To use C# to scrape public web pages, Html Agility Pack will be a good choice. You can install it for this project using this command:

dotnet add package HtmlAgilityPack

Install one more package so that we can easily export the scraped data to a CSV file:

dotnet add package CsvHelper

If you are using Visual Studio instead of Visual Studio Code, click File, select New Solution, and choose Console Application. To install the dependencies, follow these steps:

Choose Project;
Click on Manage Project Dependencies. This will open the NuGet Packages window;
Search for HtmlAgilityPack and select it;
Finally, search for CsvHelper, choose it, and click on Add Packages.

Nuget Package Manager in Visual Studio

Now that the packages have been installed, we can move on to writing the code for scraping the bookstore.

Download and Parse Web Pages

The first step of any web scraping program is to download the HTML of a web page. This HTML will be a string that you'll need to convert into an object that can be processed further. The latter part is called parsing. Html Agility Pack can read and parse HTML from local files, HTML strings, any URL, or even a browser.
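As a minimal sketch of the string and local-file cases, the snippet below parses an in-memory HTML string with HtmlDocument.LoadHtml. The markup, the class name ParseFromStringDemo, the helper FirstHeading, and the file name page.html are made up for illustration; only LoadHtml, Load, and SelectSingleNode come from Html Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

public static class ParseFromStringDemo
{
    // Hypothetical helper: parses an HTML string and returns the text of the first <h1>.
    public static string FirstHeading(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);            // parse HTML that is already in memory
        // doc.Load("page.html");      // alternatively, parse a local file (placeholder path)

        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
        return heading == null ? "" : heading.InnerText;
    }

    public static void Main()
    {
        // Made-up markup, used only for illustration
        Console.WriteLine(FirstHeading("<html><body><h1>Sample title</h1></body></html>"));
    }
}
```

Parsing from a string is also handy for testing XPath expressions without hitting the network.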

In our case, all we need to do is get HTML from a URL. Instead of relying on .NET's native functions for this, we can use a convenient class that Html Agility Pack provides: HtmlWeb. This class offers a Load function that takes a URL and returns an instance of the HtmlDocument class, which is also part of the package. With this information, we can write a function that takes a URL and returns an instance of HtmlDocument.

The first step is to import the required library files. Open the Program.cs file and import the library files using the following code:

using HtmlAgilityPack;

Then, add this function to the Program class:

// Downloads the page at the given URL and parses it into an HtmlDocument
static HtmlDocument GetDocument(string url)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    return doc;
}

With this, the first step of the code is complete. The next step is to parse the document.

Parsing the HTML: Getting Book Links

In this part of the code, we'll extract the required information from the web page. At this stage, the document is an object of type HtmlDocument. Its DocumentNode property, which is of type HtmlNode, exposes two functions to select elements. Both accept an XPath expression as input and return an HtmlNodeCollection or an HtmlNode, respectively. Here are the signatures of these two functions:

public HtmlNodeCollection SelectNodes(string xpath);

public HtmlNode SelectSingleNode(string xpath);

Let’s discuss SelectNodes first.

For this example, we are going to scrape all the book details from the Mystery category page of books.toscrape.com. First, the page needs to be parsed so that all the links to the books can be extracted. To do that, open the page in the browser, right-click any of the book links, and click Inspect. This will open the Developer Tools.

After spending some time with the markup, your XPath to select the book links should be something like this:

//h3/a

This XPath can now be passed to the SelectNodes function.

HtmlDocument doc = GetDocument(url);
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//h3/a");

Note that the SelectNodes function is called on the DocumentNode property of the HtmlDocument.
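One detail worth keeping in mind: when nothing matches, SelectNodes returns null rather than an empty collection, so looping over the result without a check can throw. A small sketch with made-up markup (the class SelectNodesDemo and the helper CountMatches are hypothetical names):

```csharp
using System;
using HtmlAgilityPack;

public static class SelectNodesDemo
{
    // Hypothetical helper: counts the nodes matching an XPath, treating null as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;   // null means "no match"
    }

    public static void Main()
    {
        // Made-up markup that mimics the structure of the book links
        string html = "<h3><a href='a.html'>A</a></h3><h3><a href='b.html'>B</a></h3>";
        Console.WriteLine(CountMatches(html, "//h3/a"));   // two links match
        Console.WriteLine(CountMatches(html, "//h4/a"));   // nothing matches
    }
}
```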

The variable linkNodes is a collection. We can write a foreach loop over it and get the href from each link one by one. There is one tiny problem that we need to take care of – the links on the page are relative. Hence, they need to be converted into an absolute URL before we can scrape these extracted links.

For converting the relative URLs, we can make use of the Uri class. We can use this constructor to get a Uri object with an absolute URL.

Uri(Uri baseUri, string? relativeUri);

Once we have the Uri object, we can simply read its AbsoluteUri property to get the complete URL.
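To make the conversion concrete, here is a minimal sketch that resolves a relative href against the category page URL. The relative path and the names UriDemo and ToAbsolute are invented for illustration; only the Uri constructor and AbsoluteUri are standard .NET.

```csharp
using System;

public static class UriDemo
{
    // Hypothetical helper: resolves a relative href against the URL of the page it was found on.
    public static string ToAbsolute(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        return new Uri(baseUri, href).AbsoluteUri;
    }

    public static void Main()
    {
        string page = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
        // Made-up relative link: each "../" climbs one directory from mystery_3/
        Console.WriteLine(ToAbsolute(page, "../../../some-book_1/index.html"));
        // prints http://books.toscrape.com/catalogue/some-book_1/index.html
    }
}
```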

We can write all this in a function to keep the code organized.

static List<string> GetBookLinks(string url)
{
    var bookLinks = new List<string>();
    HtmlDocument doc = GetDocument(url);
    HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//h3/a");
    var baseUri = new Uri(url);
    foreach (var link in linkNodes)
    {
        string href = link.Attributes["href"].Value;
        bookLinks.Add(new Uri(baseUri, href).AbsoluteUri);
    }
    return bookLinks;
}

In this function, we are starting with an empty List<string> object. In the foreach loop, we are adding all the links to this object and returning it.

Now, it’s time to modify the Main() function so that we can test the C# code that we have written so far. Modify the function so that it looks like this:

static void Main(string[] args)
{
    var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
    Console.WriteLine("Found {0} links", bookLinks.Count);
}

To run this code, open the terminal and navigate to the directory which contains this file, and type in the following:

dotnet run

The output should be as follows:

Found 20 links

Let’s move to the next part where we will be processing all the links to get the book data.

Parsing the HTML: Getting Book Details

At this point, we have a list of strings that contain the URLs of the books. We can simply write a loop that will first get the document using the GetDocument function that we’ve already written. After that, we’ll use the SelectSingleNode function to extract the title and the price of the book.

To keep the data organized, let’s start with a class. This class will represent a book. This class will have two properties – Title and Price. It will look like this:

public class Book
{
    public string Title { get; set; }
    public string Price { get; set; }
}

Now, open a book page in the browser and create the XPath for the Title – //h1. Creating an XPath for the price is a little trickier because the additional books at the bottom have the same class applied.

XPath for Price

The XPath of the price will be this:

//div[contains(@class,"product_main")]/p[@class="price_color"]

Note that this XPath contains double quotes. Inside a regular C# string literal, we have to escape these characters by prefixing them with a backslash.
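Backslash escaping is not the only option. XPath accepts single quotes as well, and a C# verbatim string doubles the quotes instead. The sketch below, built around made-up markup and hypothetical names (XPathQuotingDemo, Select), checks that all three spellings select the same node:

```csharp
using System;
using HtmlAgilityPack;

public static class XPathQuotingDemo
{
    // Regular string literal: double quotes escaped with a backslash
    public const string Escaped = "//div[contains(@class,\"product_main\")]/p[@class=\"price_color\"]";
    // Single quotes inside the XPath: nothing to escape
    public const string SingleQuoted = "//div[contains(@class,'product_main')]/p[@class='price_color']";
    // Verbatim string literal: double quotes are doubled
    public const string Verbatim = @"//div[contains(@class,""product_main"")]/p[@class=""price_color""]";

    // Hypothetical helper: returns the text of the first node matching the XPath, or "".
    public static string Select(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? "" : node.InnerText;
    }

    public static void Main()
    {
        // Made-up markup that mimics the structure described above
        string html = "<div class='col product_main'><p class='price_color'>12.34</p></div>";
        Console.WriteLine(Select(html, Escaped));
        Console.WriteLine(Select(html, SingleQuoted));
        Console.WriteLine(Select(html, Verbatim));
    }
}
```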

Now we can use the SelectSingleNode function to get the Node, and then employ the InnerText property to get the text contained in the element. We can organize everything in a function as follows:

static List<Book> GetBookDetails(List<string> urls)
{
    var books = new List<Book>();
    foreach (var url in urls)
    {
        HtmlDocument document = GetDocument(url);
        var titleXPath = "//h1";
        var priceXPath = "//div[contains(@class,\"product_main\")]/p[@class=\"price_color\"]";
        var book = new Book();
        book.Title = document.DocumentNode.SelectSingleNode(titleXPath).InnerText;
        book.Price = document.DocumentNode.SelectSingleNode(priceXPath).InnerText;
        books.Add(book);
    }
    return books;
}
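The function above keeps the price as the raw text of the element, currency symbol included. If you need a number for sorting or arithmetic, one possible approach is sketched below. It assumes the price text uses a dot as the decimal separator and contains no other digits; the sample value and the names PriceDemo and ParsePrice are made up.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class PriceDemo
{
    // Hypothetical helper: keeps only digits and the decimal point, then parses the rest.
    // Assumption: the price text looks like a currency symbol followed by "12.34".
    public static decimal ParsePrice(string raw)
    {
        string numeric = new string(raw.Where(c => char.IsDigit(c) || c == '.').ToArray());
        return decimal.Parse(numeric, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Sample value for illustration only
        decimal price = ParsePrice("£51.77");
        Console.WriteLine(price.ToString(CultureInfo.InvariantCulture));   // prints 51.77
    }
}
```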

This function will return a list of Book objects. It’s time to update the Main() function as well:

static void Main(string[] args)
{
    var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
    Console.WriteLine("Found {0} links", bookLinks.Count);
    var books = GetBookDetails(bookLinks);
}

The final part of this web scraping project is to export the data to a CSV file.

Exporting Data

If you haven't yet installed CsvHelper, you can do so by running the command dotnet add package CsvHelper from within the terminal. After installation, import the required namespaces in your Program.cs file like this:

using System.Globalization;
using CsvHelper;

The export function is pretty straightforward. First, we need to create a StreamWriter and pass the CSV file name as the parameter. Next, we will use this object to create a CsvWriter. Finally, we can use the WriteRecords function to write all the books in just one line of code.

To ensure that all the resources are closed properly, we can use the using block. We can also wrap everything in a function as follows:

static void exportToCSV(List<Book> books)
{
    using (var writer = new StreamWriter("./books.csv"))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        csv.WriteRecords(books);
    }
}
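To preview what WriteRecords produces without creating a file, the same CsvWriter can target a StringWriter instead. This is only a sketch: the sample books and the names CsvPreviewDemo and ToCsv are invented, and the Book class is repeated so that the snippet stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

public class Book
{
    public string Title { get; set; }
    public string Price { get; set; }
}

public static class CsvPreviewDemo
{
    // Hypothetical helper: serializes the books to CSV text held in memory.
    public static string ToCsv(List<Book> books)
    {
        using (var writer = new StringWriter())
        using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            csv.WriteRecords(books);   // header row first, then one row per book
            csv.Flush();               // push buffered rows into the StringWriter
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Made-up records for illustration
        var books = new List<Book>
        {
            new Book { Title = "First Sample", Price = "10.00" },
            new Book { Title = "Second Sample", Price = "12.50" }
        };
        Console.Write(ToCsv(books));
    }
}
```

The header row is derived from the property names of the Book class, so renaming a property also renames the corresponding column.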

Finally, we can call this function from the Main() function:

static void Main(string[] args)
{
    var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
    var books = GetBookDetails(bookLinks);
    exportToCSV(books);
}

Let’s bring together all the snippets and have a look at the complete code:

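using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;
using HtmlAgilityPack;

namespace webscraping
{
    // Represents one scraped book
    public class Book
    {
        public string Title { get; set; }
        public string Price { get; set; }
    }
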
    class Program
    {
        static HtmlDocument GetDocument(string url)
        {
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(url);
            return doc;
        }
        static List<string> GetBookLinks(string url)
        {
            var bookLinks = new List<string>();
            HtmlDocument doc = GetDocument(url);
            HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//h3/a");
            var baseUri = new Uri(url);
            foreach (var link in linkNodes)
            {
                string href = link.Attributes["href"].Value;
                bookLinks.Add(new Uri(baseUri, href).AbsoluteUri);
            }
            return bookLinks;
        }
        static List<Book> GetBookDetails(List<string> urls)
        {
            var books = new List<Book>();
            foreach (var url in urls)
            {
                HtmlDocument document = GetDocument(url);
                var titleXPath = "//h1";
                var priceXPath = "//div[contains(@class,\"product_main\")]/p[@class=\"price_color\"]";
                var book = new Book();
                book.Title = document.DocumentNode.SelectSingleNode(titleXPath).InnerText;
                book.Price = document.DocumentNode.SelectSingleNode(priceXPath).InnerText;
                books.Add(book);
            }
            return books;
        }
        static void exportToCSV(List<Book> books)
        {
            using (var writer = new StreamWriter("books.csv"))
            using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
            {
                csv.WriteRecords(books);
            }
        }

        static void Main(string[] args)
        {
            var bookLinks = GetBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
            Console.WriteLine("Found {0} links", bookLinks.Count);
            var books = GetBookDetails(bookLinks);
            exportToCSV(books);
        }


    }
}

That’s it! To run this code, open the terminal and run the following command:

dotnet run

Within seconds, you will have a books.csv file created.

Scraping dynamic web pages with Selenium

Scraping dynamic web pages is not possible using packages such as Html Agility Pack. The reason is that the pages are dynamically loaded using JavaScript, and the data is unavailable in a simple HTTP GET request.

For example, take a look at the dynamic version of the quotes.toscrape site — http://quotes.toscrape.com/js/.

Packages such as Html Agility Pack fail at scraping these pages because they only see the initial HTML, before any JavaScript has run.

The solution is to use Selenium.

Selenium automates real web browsers. With it, you can simulate user interactions with a website, such as clicking buttons or filling out forms, and then extract data from the resulting pages exactly as the browser rendered them. This capability makes it useful for scraping. Selenium supports multiple programming languages, including C#.

Setting up Selenium

Apart from the web browser, you need two components: a web driver and the .NET bindings for the web driver API.

The driver is a small binary file that provides a programmatic interface for controlling a web browser. The language bindings allow us to write these instructions in our intended language.

There are many ways to download the web driver. Depending on your intended browser, you can download Chrome, Firefox, or Edge driver and add the binary path to your PATH variable. Pay attention to the browser version, as using the correct driver version is crucial.

Alternatively, you can use the WebDriverManager Nuget package and create a driver without worrying about the browser version, path variable, or updating the driver.

Open Visual Studio and create a new Console application. Right-click the project, and select NuGet packages.

Search for Selenium, choose Selenium.WebDriver, and install it.

Next, search for WebDriverManager, select it, and install it.

If you are using VSCode, use the following commands to install the above-mentioned packages:

dotnet add package WebDriverManager --version 2.17.1
dotnet add package Selenium.WebDriver --version 4.15.0

You are now ready to write code that uses Selenium.

Loading a webpage

Visual Studio will create a Program.cs for you when you create a project. The first step is to import the required library files into your Program.cs file:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using WebDriverManager;
using WebDriverManager.DriverConfigs.Impl;

Modify its Main method and add the following code:

static void Main(string[] args)
{
    // setting up the driver
    new DriverManager().SetUpDriver(new ChromeConfig());
    var driver = new ChromeDriver();
    // load the webpage
    driver.Navigate().GoToUrl("http://quotes.toscrape.com/js/");
    // scraping code here
    // closing the browser 
    driver.Quit();
}

This code downloads and installs the Chrome driver, launches the browser, loads the page, and then closes the browser.
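Two refinements are often useful here, shown as a hedged sketch rather than a required setup: running Chrome without a visible window, and giving the page's JavaScript a few seconds to render before element lookups give up. The --headless=new argument is a Chrome command-line flag and assumes a reasonably recent Chrome version; the five-second timeout and the class name HeadlessDemo are arbitrary choices for illustration.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using WebDriverManager;
using WebDriverManager.DriverConfigs.Impl;

public static class HeadlessDemo
{
    public static void Main()
    {
        new DriverManager().SetUpDriver(new ChromeConfig());

        var options = new ChromeOptions();
        options.AddArgument("--headless=new");   // assumption: a Chrome version that supports this flag

        IWebDriver driver = new ChromeDriver(options);
        try
        {
            // Element lookups will retry for up to 5 seconds before failing or returning empty
            driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(5);

            driver.Navigate().GoToUrl("http://quotes.toscrape.com/js/");
            var quotes = driver.FindElements(By.CssSelector("div.quote"));
            Console.WriteLine("Found {0} quotes", quotes.Count);
        }
        finally
        {
            // Close the browser even if scraping throws
            driver.Quit();
        }
    }
}
```

Wrapping the scraping code in try/finally keeps stray browser processes from piling up when a selector fails.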

Class to hold the data

Following the best practices, you can create a class that represents a quote as follows:

public class Quote
{
    public string? Text { get; set; }
    public string? Author { get; set; }
    public override string ToString()
    {
        return Author + " says, " + Text;
    }
}

The ToString() method is overridden only for readability.

Locating the elements

You can use CSS selectors or XPath selectors to locate the elements.

For example, if the web page heading is in an h1 tag, the selector would be "h1".

You can use the driver.FindElement() method to locate an element. Note that there is a closely resembling method, driver.FindElements(), which returns a read-only collection of all matching elements.

IWebElement element = driver.FindElement(By.CssSelector("h1"));

Note the use of By.CssSelector. If you want to use XPath, use By.XPath instead.

You can get the text from this element by reading its Text property.

Console.WriteLine(element.Text);

Our target web page wraps each quote in a div with the class quote.

With this understanding, we can write the following line to get all the quote containers:

var quoteContainers = driver.FindElements(By.CssSelector("div.quote"));

Next, we can loop over the containers and scrape individual quotes. This is possible because FindElement can also be called on an element, in which case it searches only within that element.

We can use the CSS selector ".author" to extract the author.

foreach (var item in quoteContainers)
{
    Console.WriteLine(item.FindElement(By.CssSelector(".author")).Text);
}

Similarly, to get the text of the quote, the CSS selector would be span.text.

The following code, placed inside the loop, uses the data class that we created:

Quote quote = new()
{
    Text = item.FindElement(By.CssSelector("span.text")).Text,
    Author = item.FindElement(By.CssSelector(".author")).Text
};
Console.WriteLine(quote.ToString());

Exporting data to CSV

First, install the CsvHelper NuGet package if you haven't already, as explained earlier, and import the System.Globalization and CsvHelper namespaces.

Next, create a list that can hold all the quotes we are scraping:

var quotes = new List<Quote>();

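Append each quote to this list as you scrape it.

Finally, outside the foreach loop, add these lines, which will create a quotes.csv file with the quotes you have scraped. Use a relative path such as ./quotes.csv here; .NET does not expand a leading ~ to the home directory.

using (var writer = new StreamWriter("./quotes.csv"))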
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    csv.WriteRecords(quotes);
}

Complete Code

Putting together everything, here is the complete code for scraping dynamic pages with Selenium:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using WebDriverManager;
using WebDriverManager.DriverConfigs.Impl;

namespace webscraping
{
    public class Quote
    {
        public string? Text { get; set; }
        public string? Author { get; set; }
        public override string ToString()
        {
            return Author + " says, " + Text;
        }
    }


    class Program
    {
        static void Main(string[] args)
        {
            new DriverManager().SetUpDriver(new ChromeConfig());
            var driver = new ChromeDriver();
            driver.Navigate().GoToUrl("http://quotes.toscrape.com/js/");
            var quotes = new List<Quote>();
            var quoteContainers = driver.FindElements(By.CssSelector("div.quote"));
            foreach (var item in quoteContainers)
            {
                Quote quote = new()
                {
                    Text = item.FindElement(By.CssSelector("span.text")).Text,
                    Author = item.FindElement(By.CssSelector(".author")).Text
                };
                quotes.Add(quote);
                Console.WriteLine(quote.ToString());
            }


            using (var writer = new StreamWriter("./quotes.csv"))
            using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
            {
                csv.WriteRecords(quotes);
            }
            // Close the browser and end the driver session
            driver.Quit();

        }
    }
}


Conclusion

You can use multiple packages if you want to write a web scraper with C#. In this article, we've shown how to use Html Agility Pack, a powerful and easy-to-use package, and how to handle dynamic pages with Selenium. These were simple examples that can be enhanced further; for instance, you can try extending the code above to handle multiple pages.
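As a starting point for the multi-page case, here is a rough sketch that keeps following a "next" link until none is left. It assumes the category pages expose the next page as a link inside an li element with the class next; verify that against the actual markup before relying on it. The names PaginationSketch and GetAllBookLinks are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PaginationSketch
{
    // Hypothetical helper: collects book links from a category page and all pages after it.
    public static List<string> GetAllBookLinks(string startUrl)
    {
        var bookLinks = new List<string>();
        var web = new HtmlWeb();
        string pageUrl = startUrl;

        while (pageUrl != null)
        {
            HtmlDocument doc = web.Load(pageUrl);
            var baseUri = new Uri(pageUrl);

            // SelectNodes returns null when the page has no matching links
            HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//h3/a");
            if (linkNodes != null)
            {
                foreach (HtmlNode link in linkNodes)
                {
                    string href = link.GetAttributeValue("href", "");
                    bookLinks.Add(new Uri(baseUri, href).AbsoluteUri);
                }
            }

            // Assumption: the next-page link lives in <li class="next"><a href="...">
            HtmlNode next = doc.DocumentNode.SelectSingleNode("//li[@class='next']/a");
            pageUrl = next == null
                ? null
                : new Uri(baseUri, next.GetAttributeValue("href", "")).AbsoluteUri;
        }
        return bookLinks;
    }

    public static void Main()
    {
        var links = GetAllBookLinks("http://books.toscrape.com/catalogue/category/books/mystery_3/index.html");
        Console.WriteLine("Found {0} links", links.Count);
    }
}
```

The resulting list can be passed to GetBookDetails unchanged, since it has the same shape as the output of GetBookLinks.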

If you want to know more about how web scraping works using other programming languages, you can have a look at the guide on web scraping with Python or a step-by-step tutorial on how to write a web scraper using JavaScript. Also, check out our own web scraper tool suitable for most websites.