Roberta Aukstikalnyte
PowerShell is a configuration and automation engine designed by Microsoft. It consists of a command-line shell and a scripting language with object-oriented support. Users, especially system administrators, can automate, configure, and manage their network-related tasks using this engine.
PowerShell Core is the open-source, cross-platform successor to Windows PowerShell. Windows PowerShell is only compatible with Windows, whereas the Core version also runs on Unix-like operating systems, including macOS and Linux.
PowerShell is often used in the data acquisition field. Today’s tutorial answers why PowerShell is a reliable engine for web scraping and goes through each step for using it for our data acquisition needs – let’s get started.
Web scraping refers to extracting and saving useful information from online sources, including web pages. It is the art of parsing the HTML contents to retrieve specific information.
Designing a good web scraping tool requires sufficient knowledge of HTML and the target website structure. The web scraping tool is reliable if it is robust to minor changes in the target web pages.
Python and Java support several libraries for performing complex web scraping tasks. Libraries like AutoScraper are trivial to use, allowing an absolute beginner to do highly robust web scraping tasks without any in-depth understanding of the HTML and web page structure.
PowerShell provides two cmdlets to scrape HTML data from the target web page: Invoke-WebRequest and Invoke-RestMethod – they will be explained later in the article. However, one must have a sufficient background in HTML and regular expressions to design a robust and reliable web scraping tool.
If you’re a process automation engineer or a DevOps professional, chances are, you’d like everything automated with PowerShell scripts that can work in cross-platform contexts – that’s why the PowerShell engine is a great choice for web scraping.
This section provides a practical hands-on approach to web scraping with PowerShell scripts. We’ll learn how to scrape public URLs, relevant information, and images from web pages using Invoke-WebRequest and Invoke-RestMethod cmdlets. Moreover, we’ll also discover content parsing with simple regular expressions and PowerHTML.
If you prefer, here's a link to the same tutorial on GitHub.
This tutorial takes Books to Scrape as a target for our web scraping tool. The target website features hundreds of books under 52 categories. The link to each category is available on the index page, as shown here:
The Invoke-WebRequest cmdlet tells PowerShell to get a web page. It sends a request to a web page or service and receives a response that includes the content, HTTP status code, and metadata, just like a web browser would.
For instance, let’s look at a very basic use case where we invoke a web request to www.google.com.
Invoke-WebRequest 'www.google.com'
Invoke-WebRequest returns an object that has StatusCode, StatusDescription, RawContent, Links, and other metadata as its properties, as shown in the following output snippet:
To scrape the links of all the categories from the target’s index page with Invoke-WebRequest, we can use the following script:
# Collect every href on the index page. Select-Object -Unique drops duplicates
# even when they are not adjacent (Get-Unique only works on sorted input).
$scraped_links = (Invoke-WebRequest -Uri 'https://books.toscrape.com/').Links.Href | Select-Object -Unique
$reg_expression = 'catalogue/category/books/.*'
$all_matches = ($scraped_links | Select-String $reg_expression -AllMatches).Matches
$urls = foreach ($url in $all_matches) {
    $url.Value
}
$urls
Here, Invoke-WebRequest returns an object holding all the content at the target URL. The Links.Href property lists all the hyperlink references in that content. Then, we pipe it to Select-Object with the -Unique switch to keep only the unique links. Therefore, the $scraped_links object has all the unique links present on the target URL.
To get links for categories only, we further parse the $scraped_links with a regular expression $reg_expression. Therefore, the $all_matches object will have link objects for categories only.
Finally, we extract values for link objects from $all_matches and store them in the $urls list. Let’s see how it looks on the output console:
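Note that the category links collected this way are relative to the index page. If absolute URLs are needed for follow-up requests, they can be resolved against the site's base address. The following is a minimal sketch, assuming the .NET System.Uri class for the resolution; the $sample_links values are hypothetical stand-ins for the contents of $urls.

```powershell
# Hypothetical sample of relative links, standing in for the $urls list
$sample_links = @(
    'catalogue/category/books/travel_2/index.html',
    'catalogue/category/books/mystery_3/index.html'
)
$base_uri = [System.Uri]'https://books.toscrape.com/'

# Resolve each relative link against the base URI
$absolute_urls = foreach ($link in $sample_links) {
    [System.Uri]::new($base_uri, $link).AbsoluteUri
}
$absolute_urls
```

Each resolved value can then be passed directly to the -Uri parameter of a later request.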
Let’s look at another example where Invoke-WebRequest is used to scrape all the URLs for images on the target web page.
(Invoke-WebRequest -Uri 'https://books.toscrape.com/').Images | Select-Object src
Similar to the earlier example, we first call Invoke-WebRequest and read the Images property of the response. The result is then piped to the Select-Object cmdlet, which fetches the source links.
The output of the above command is:
Scraping links is not the only use case for Invoke-WebRequest; we can also scrape page content and related data. However, for the sake of demonstration, the next subsection uses Invoke-RestMethod for that.
Invoke-RestMethod is also used to send requests to web pages or web services, including web APIs. Like the Invoke-WebRequest cmdlet, it retrieves the HTML or other content of the target URI. However, in contrast to Invoke-WebRequest, Invoke-RestMethod returns only the response content, without metadata such as the status code.
Invoke-RestMethod is particularly useful for requesting APIs where the response data is usually in JSON format. The Invoke-RestMethod method automatically parses the JSON responses into objects.
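To get a feel for what a parsed JSON response looks like without calling a real API, the conversion can be imitated offline. The sketch below is only an illustration: the $json_body payload is hypothetical, and ConvertFrom-Json is used here as a stand-in for the conversion Invoke-RestMethod performs on JSON responses.

```powershell
# Hypothetical JSON payload, standing in for an API response body
$json_body = '{"title": "Sample Book", "price": 51.77, "tags": ["poetry", "classic"]}'

# ConvertFrom-Json turns the text into an object with properties
$parsed = $json_body | ConvertFrom-Json
$parsed.title
$parsed.tags.Count
```

The fields of the payload become properties of the resulting object, so no manual string parsing is needed.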
Let’s request google.com with Invoke-RestMethod and see what we get:
Invoke-RestMethod 'www.google.com'
Response:
As expected, the output shows that the response for Invoke-RestMethod contains only the HTML content of the target URL.
As discussed at the start of this article, the Invoke-RestMethod cmdlet can also be used to scrape web pages. Now, let’s see it in action.
Assume that we want to scrape some specific information from a book page on the target bookstore’s website; Invoke-RestMethod can do it in the following way:
$book_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html'
$reg_exp = '<li class="active".*>(?<name>.*)</li>(.|\n)*<th>UPC</th><td.*>(?<upc_id>.*)</td>(.|\n)*<th>Product Type</th><td.*>(?<product_type>.*)</td>(.|\n)*<th>Price.*</th><td.*>(?<price>.*)</td>(.|\n)* <th>Availability</th>(.|\n)*<td.*>(?<availability>.*)</td>'
$all_matches = ($book_html | Select-String $reg_exp -AllMatches).Matches
# Pick each named capture group out of the match and expose it as a property
$BookDetails = [PSCustomObject]@{
    'Name'         = ($all_matches.Groups.Where{$_.Name -like 'name'}).Value
    'UPC_id'       = ($all_matches.Groups.Where{$_.Name -like 'upc_id'}).Value
    'Product Type' = ($all_matches.Groups.Where{$_.Name -like 'product_type'}).Value
    'Price'        = ($all_matches.Groups.Where{$_.Name -like 'price'}).Value
    'Availability' = ($all_matches.Groups.Where{$_.Name -like 'availability'}).Value
}
$BookDetails
The example above requests the target book page and stores the received HTML content in the $book_html object. Next, it creates a regular expression to parse the name, UPC id, product type, price, and availability information from the HTML content stored in $book_html. Let’s have a look at the image of the target page along with its source to understand the formulation of this regular expression:
Note that the book’s title is inside a <li> tag with an active class. Moreover, the product information is inside a <table> tag where each type of information (e.g., UPC, price, etc.) is in a <td> tag which is preceded by a relevant <th> tag.
Keeping the above observations in mind, we designed the regular expression to match the <li> tag with an active class and capture everything enclosed in this tag to a name group. After that, the regular expression skips everything, including newlines, until it finds a <th> tag with UPC as its inner text. The inner text of the adjacent <td> tag is then captured in the upc_id group. Similarly, we follow the same pattern for the remaining product information.
Sidenote: Take utmost care when designing a regular expression. A minor mistake can cause the web scraper to extract undesired information or nothing at all. For example, in our previous scraper script, a single missing symbol can cause the regular expression to match nothing, so the scraper extracts no data. Therefore, it’s recommended to use an online regular expression tester, such as regextester, to check the expression against the example page source.
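A pattern can also be checked locally before it is run against the live page. The following is a minimal sketch, assuming a trimmed-down, hypothetical page source in $sample_html and a shortened version of the pattern in $test_exp that covers only the name and UPC groups.

```powershell
# Hypothetical, trimmed-down page source used only for testing the pattern
$sample_html = @'
<li class="active">Sample Book</li>
<table>
<tr><th>UPC</th><td>abc123</td></tr>
</table>
'@

# Shortened pattern: only the name and upc_id groups
$test_exp = '<li class="active".*>(?<name>.*)</li>(.|\n)*<th>UPC</th><td.*>(?<upc_id>.*)</td>'
if ($sample_html -match $test_exp) {
    # The automatic $Matches variable holds the named groups
    $Matches['name']
    $Matches['upc_id']
} else {
    Write-Warning 'The pattern did not match the sample HTML.'
}
```

If the warning appears, the pattern can be adjusted and re-tested without sending any further requests to the target site.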
Once the regular expression is applied to the received HTML content, the Select-String cmdlet, along with the -AllMatches flag, returns a MatchInfo object with detailed information about all the matching strings.
Finally, the resultant $all_matches MatchInfo object is converted into a PowerShell custom object, containing only the desired information. The output of the above script is as follows:
Let’s apply the above script on another book page URL and see the results:
Now that we know how to extract information from a specific web page, let’s see how scraping data from a specific category would work.
Say that we want titles and prices of all the books in a specific category – it also can be achieved with the PowerShell engine. However, before looking at the script, we need to look at one of the category pages at the target bookstore (i.e., Books to Scrape) along with its source code.
The above snippet shows the web page for Sports and Games and the corresponding HTML source code. The web page has a total of five books; the price and title of each book are available on the page.
If we look closely at the web page's source, the full book title is provided as the title attribute of the <a> tag that links to the book's page. Moreover, the price is in a paragraph tag with a price_color class.
Now that we have a sufficient understanding of the underlying page structure, we can write the script for our web scraper.
$category_page_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/category/books/sports-and-games_17/index.html'
$reg_exp = '<h3><a href=.* title=\"(?<title>.*)\">.*<\/a><\/h3>(\n.*){13}<p class="price_color">(?<price>.*)<\/p>'
$all_matches = ($category_page_html | Select-String $reg_exp -AllMatches).Matches
$BookList = foreach ($book in $all_matches) {
    [PSCustomObject]@{
        'title' = ($book.Groups.Where{$_.Name -like 'title'}).Value
        'price' = ($book.Groups.Where{$_.Name -like 'price'}).Value
    }
}
$BookList
The above script first retrieves the HTML of the Sports and Games category’s web page. Then, it applies the regular expression (stored in $reg_exp) to the retrieved HTML to select all the matching strings. As the target page has only five books, $all_matches (a collection of Match objects) will have a length of 5, with detailed information on each match along with the matched strings.
We don’t need the details associated with the matches; rather, we are concerned with the titles and prices of the books. So, the script creates a list of PowerShell custom objects, where each PSCustomObject has just the title and price of a particular match.
The output of the above script looks like this:
Let's try using our scraper on another target by replacing the link in the Invoke-RestMethod with a link to the travel category. As expected, the script will output titles and prices of the books on the travel category page.
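The scraped prices are plain strings, so they cannot be sorted or summed numerically as they are. The sketch below shows one possible way to convert them, assuming the price text contains a currency symbol followed by a number with a decimal point; the $sample_books rows are hypothetical stand-ins for the scraped list.

```powershell
# Hypothetical scraped rows, standing in for the list of titles and prices
$sample_books = @(
    [PSCustomObject]@{ title = 'Sample Book A'; price = '£41.25' }
    [PSCustomObject]@{ title = 'Sample Book B'; price = '£19.60' }
)

$invariant = [System.Globalization.CultureInfo]::InvariantCulture
$priced_books = foreach ($book in $sample_books) {
    # Keep only digits and the decimal point, then parse as a decimal
    $number = $book.price -replace '[^0-9.]', ''
    [PSCustomObject]@{
        title = $book.title
        price = [decimal]::Parse($number, $invariant)
    }
}
$priced_books | Sort-Object price
```

Stripping every character except digits and the decimal point also removes any stray encoding artifacts that may appear in front of the currency symbol.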
Until now, we’ve been using regular expressions to extract the required information. Designing a robust regular expression to extract relevant strings is very tricky. Moreover, modifying a pre-written regular expression is always problematic due to its poor readability.
Thanks to PowerHTML, we have a robust, more readable, and highly maintainable way to make PowerShell parse HTML data. PowerHTML is a powerful wrapper over the HtmlAgilityPack that supports XPath syntax to parse the HTML. It’s particularly useful in scenarios where the HTML Document Object Model (DOM) is unavailable, as in the case of content received in response to an Invoke-WebRequest.
We can install the PowerHTML module using the following command:
Install-Module -Name PowerHTML
Assume we want Product Information from a book web page, A Light in the Attic. The required information is inside the striped table, as depicted in the following snippet:
We can use the following script using PowerHTML to retrieve the Product Information from this web page:
$web_page = Invoke-WebRequest 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
$html = ConvertFrom-Html $web_page
$BookDetails = [System.Collections.ArrayList]::new()
# The book title is the inner text of the <li> tag with the "active" class
$name_of_book = $html.SelectNodes('//li') | Where-Object { $_.HasClass('active') }
$name = $name_of_book.ChildNodes[0].InnerText
$n = New-Object -TypeName psobject
$n | Add-Member -MemberType NoteProperty -Name Name -Value $name
[void]$BookDetails.Add($n)
# The product information lives in the table with the "table-striped" class
$table = $html.SelectNodes('//table') | Where-Object { $_.HasClass('table-striped') }
foreach ($row in $table.SelectNodes('tr')) {
    $name = $row.SelectSingleNode('th').InnerText.Trim()
    $value = $row.SelectSingleNode('td').InnerText.Trim() -replace "\?", " "
    $new_obj = New-Object -TypeName psobject
    $new_obj | Add-Member -MemberType NoteProperty -Name $name -Value $value
    [void]$BookDetails.Add($new_obj)
}
Write-Output 'Extracted Table Information'
$table
Write-Output 'Extracted Book Details Parsed from HTML table'
$BookDetails
The above code first retrieves the HTML contents of the target web page using the Invoke-WebRequest cmdlet. Then, the HTML content is converted to an HtmlAgilityPack HtmlNode object using the ConvertFrom-Html command and stored in $html. This conversion allows us to use XPath syntax for further content parsing.
Afterward, the book title is parsed by selecting the <li> tag with class=active. This <li> tag has the book name as its inner text. We can fetch this inner text, add it to a new psObject against a Name property, and append it to the $BookDetails array list.
The product information is displayed in a <table> tag on the webpage. The below figure shows the product information table rendered by the browser on the left and the corresponding HTML source on the right.
To get the product information, we further parse the $html object and select the <table> tag using the $html.SelectNodes('//table') | Where-Object { $_.HasClass('table-striped') } command.
Then, a foreach loop selects all the rows (the <tr> tags) of the table and builds an object for each row, using the <th> tag’s inner text as the property name and the <td> tag’s inner text as the property value.
Further, the loop also appends all the product information objects to the $BookDetails array list. The last line of the script displays this list.
The output of the above scripts is as follows:
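Because each table row ends up in its own object, the details of one book are spread across several list entries. If a single record per book is preferred, the name and value pairs can be merged into one object. The following is a minimal sketch of that idea; the $sample_rows values are hypothetical stand-ins for the <th> and <td> text of each row, so it runs without PowerHTML.

```powershell
# Hypothetical name/value pairs, standing in for the <th>/<td> text of each row
$sample_rows = @(
    [PSCustomObject]@{ Name = 'UPC'; Value = 'abc123' }
    [PSCustomObject]@{ Name = 'Product Type'; Value = 'Books' }
    [PSCustomObject]@{ Name = 'Availability'; Value = 'In stock' }
)

# An ordered dictionary keeps the properties in table order
$details = [ordered]@{ Name = 'Sample Book' }
foreach ($row in $sample_rows) {
    $details[$row.Name] = $row.Value
}
$book = [PSCustomObject]$details
$book
```

A single object per book is also easier to pass on to cmdlets such as Export-Csv or ConvertTo-Json.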
Requesting a web page without using a proxy address has several risks associated with it. It can reveal our IP, exposing our location information. Moreover, we may want to scrape some region-specific data that can only be accessed by IP addresses of a particular region. Luckily, using a proxy server can help with both cases. If you're dealing with especially difficult targets, we'd recommend choosing a Residential Proxy.
Both Invoke-WebRequest and Invoke-RestMethod support using a proxy. These cmdlets accept the -Proxy parameter for the URI of the proxy. We can also pass the proxy credentials using the -ProxyCredential parameter.
For example, the following snippets showcase the use of proxy endpoints while requesting www.google.com.
Invoke-RestMethod 'http://www.google.com' -Proxy 'PROXY_ENDPOINT'
Invoke-WebRequest 'http://www.google.com' -Proxy 'PROXY_ENDPOINT'
PROXY_ENDPOINT refers to the URI of a proxy, which consists of a protocol, optional authentication information, an IP address or hostname, and an optional port number (e.g., http://user:pass@127.0.0.1:8081 or https://127.0.0.1:8081).
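When the proxy requires authentication, the credentials can be supplied as a PSCredential object through -ProxyCredential. The sketch below uses splatting to keep the request parameters in one place. The user name, password, and proxy address are placeholder assumptions, and the actual request is left commented out so that nothing is sent until real values are filled in.

```powershell
# Placeholder values; replace them with your own proxy details
$proxy_user = 'user'
$proxy_pass = ConvertTo-SecureString 'pass' -AsPlainText -Force
$proxy_cred = [System.Management.Automation.PSCredential]::new($proxy_user, $proxy_pass)

# Splatting keeps the request parameters in one reusable hashtable
$request_params = @{
    Uri             = 'https://books.toscrape.com/'
    Proxy           = 'http://127.0.0.1:8081'
    ProxyCredential = $proxy_cred
}

# Uncomment to send the request through the proxy:
# $response = Invoke-WebRequest @request_params
# $response.StatusCode
$request_params.Keys
```

The same hashtable can be splatted onto Invoke-RestMethod, since both cmdlets accept these parameters.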
PowerShell is a powerful cross-platform task automation tool that can also be used for public web data acquisition. We can scrape data using either Invoke-WebRequest or Invoke-RestMethod cmdlets in conjunction with classic regular expressions or PowerHTML-like robust parsing tools that parse the data retrieved by these request cmdlets.
If you want to make web scraping simple and block-free, take a look at our advanced web intelligence solutions, such as Web Scraper API.
Windows PowerShell was built on the .NET Framework, which makes it Windows-specific. PowerShell Core, however, is based on .NET Core, a cross-platform application development framework, so it runs on multiple platforms.
The cmdlets are the lightweight commands provided by PowerShell (e.g., Invoke-WebRequest, Invoke-RestMethod, Write-Output, etc.). On executing these cmdlets, the PowerShell runtime invokes relevant APIs to process the commands.
About the author
Roberta Aukstikalnyte
Senior Content Manager
Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.