Web Scraping With PHP

Augustas Pelakauskas

Last updated on 2021-12-30

PHP is a general-purpose scripting language and one of the most popular options for web development. For example, WordPress, the most common content management system to create websites, is built using PHP.

PHP offers various building blocks required to build a web scraper, although it can quickly become an increasingly complicated task. Conveniently, there are many open-source libraries that can make web scraping with PHP more accessible.

This article will guide you through the step-by-step process of writing various web scraping PHP routines that can extract public data from static and dynamic web pages.

Can PHP be used for web scraping?

In short, yes. The rest of this article shows how a web page scraping routine can be put together. Whether PHP is the best language for web scraping is a different question, as numerous alternatives exist.

PHP is a mature language: it has existed since the 1990s and has reached major version 8. That maturity is an advantage, since the language is fairly easy to use and decades of common problems already have documented solutions. The simplicity comes at a cost, however. For complex, dynamic websites, Python and JavaScript are generally better equipped, but if you only need data from simple pages, PHP is a good choice.

Installing prerequisites

To begin, make sure that you have both PHP and Composer installed.

If you are using Windows, visit this link to download PHP. Alternatively, you can use the Chocolatey package manager. Using Chocolatey, run the following command from the command line or PowerShell:

choco install php

If you are using macOS, the chances are that you already have PHP bundled with the operating system. Alternatively, you can use a package manager such as Homebrew to install PHP. Open the terminal and enter the following:

brew install php

Once PHP is installed, verify that the version is 7.1 or newer. Open the terminal and enter the following to verify the version:

php --version
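
The same check can also be made from inside a script. The following is a minimal sketch, assuming the 7.1 minimum stated above; the function name meetsMinimumVersion is hypothetical, while PHP_VERSION and version_compare() are built into PHP:

```php
<?php
// Returns true when $current is at least $minimum (both are version strings).
function meetsMinimumVersion(string $current, string $minimum = '7.1.0'): bool
{
    return version_compare($current, $minimum, '>=');
}

// PHP_VERSION holds the version of the interpreter running this script.
if (!meetsMinimumVersion(PHP_VERSION)) {
    fwrite(STDERR, 'PHP 7.1 or newer is required, found ' . PHP_VERSION . PHP_EOL);
    exit(1);
}

echo 'PHP ' . PHP_VERSION . ' is recent enough.' . PHP_EOL;
```

Placing a guard like this at the top of a scraper gives a clear message on an outdated interpreter instead of a syntax error later on.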

Next, install Composer. Composer is a dependency manager for PHP. It’ll help to install and manage the required packages.

To install Composer, visit this link. Here you’ll find the downloads and instructions.

If you are using a package manager, the installation is easier. On macOS, run the following command to install Composer:

brew install composer

On Windows, you can use Chocolatey:

choco install composer

To verify the installation, run the following command:

composer --version

The next step is to install the required libraries.

Making an HTTP GET request

The first step of PHP web scraping is to load the page.

In this article, we’ll be using books.toscrape.com. The website is a dummy book store for practicing web scraping.

When viewing a website in a browser, the browser sends an HTTP GET request to the web server as the first step. To send the HTTP GET request using PHP, the built-in function file_get_contents can be used.

This function can take a file path or a URL and return the contents as a string.

Create a new file and save it as native.php. Open this file in a code editor such as Visual Studio Code. Enter the following lines of code to load the HTML page and print the HTML in the terminal:

<?php
$html = file_get_contents('https://books.toscrape.com/');
echo $html;

Execute this code from the terminal as follows:

php native.php

Upon executing this command, the entire HTML of the page will be printed.
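
Keep in mind that file_get_contents returns false when the page cannot be loaded. The following is a minimal sketch of a more defensive variant. The function name fetchPage, the timeout, and the user agent string are assumptions made for illustration; stream_context_create() and the http context options are part of PHP itself:

```php
<?php
// Fetches a URL (or a local file path) and returns its contents, or null on failure.
function fetchPage(string $location, int $timeoutSeconds = 10): ?string
{
    // Options under the 'http' key apply to http:// and https:// URLs only.
    $context = stream_context_create([
        'http' => [
            'method' => 'GET',
            'timeout' => $timeoutSeconds,
            'user_agent' => 'php-scraper-tutorial/0.1', // hypothetical value
        ],
    ]);

    // file_get_contents() returns false on failure and emits a warning,
    // which the @ operator suppresses here.
    $contents = @file_get_contents($location, false, $context);

    return $contents === false ? null : $contents;
}
```

Calling fetchPage('https://books.toscrape.com/') returns the HTML as a string, or null if the request fails, so the calling code can decide whether to retry or stop.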

As of now, it is difficult to locate and extract specific information within the HTML.

This is where various open-source, third-party libraries come into play.

Web scraping in PHP with Goutte

A wide selection of libraries is available for web scraping with PHP. In this article, we’ll use Goutte as it is accessible, well documented, and continuously updated. It is always a good idea to try the most popular solutions as supporting content and preexisting advice is plentiful.

Goutte can handle most static websites. For dynamic sites, we’ll use Symfony Panther.

Goutte, pronounced goot, is a wrapper around Symfony’s components, such as BrowserKit, CssSelector, DomCrawler, and HttpClient.

Symfony is a set of reusable PHP components. The components used by Goutte can be used directly. However, Goutte makes it easier to write the code.

To install Goutte, create a directory where you intend to keep the source code. Navigate to the directory and enter these commands:

composer init --no-interaction --require="php >=7.1"
composer require fabpot/goutte
composer update

The first command will create the composer.json file. The second command will add the entry for Goutte as well as download and install the required files. It’ll also create the composer.lock file.

The composer update command will ensure that all the files of the dependencies are up to date.

Sending HTTP requests with Goutte

The most important class for PHP web scraping using Goutte is the Client that acts like a browser. The first step is to create an object of this class:

$client = new Client();

This object can then be used to send a request. The method to send the request is conveniently called request. It takes two parameters, the HTTP method and the target URL, and returns an instance of the DOM crawler object:

$crawler = $client->request('GET', 'https://books.toscrape.com');

This will send the GET request to the HTML page. To print the entire HTML of the page, we can call the html() method.

Putting together everything we’ve built so far, this is what the code file looks like:

<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com');
echo $crawler->html();

Save this new PHP file as books.php and run it from the terminal. This will print the entire HTML:

php books.php

What we need next is a way to locate specific elements from the page.

Locating HTML elements via CSS Selectors

Goutte uses the Symfony component CssSelector. It facilitates the use of CSS Selectors in locating HTML elements.

The CSS Selector can be supplied to the filter method. For example, to print the title of the page, enter the following line to the books.php file that we are working with:

echo $crawler->filter('title')->text();

Note that title is the CSS Selector that selects the <title> node from the HTML.

Keep in mind that text() returns the text contained in the HTML element. In the earlier example, we used html() to return the entire HTML of the selected element.

If you prefer to work with XPath, use the filterXPath() method instead. The following line of code produces the same output:

echo $crawler->filterXPath('//title')->text();

Now, let’s move on to extracting the book titles and prices.

Extracting the elements

Before we write the web scraping code, we need to analyze the HTML of the page. Open https://books.toscrape.com in Chrome, right-click on a book, and select Inspect.

The books are located in the <article> tags

Upon examining the HTML of the target web page, we can see that each book is contained in an article tag, which has a product_pod class. Here, the CSS Selector would be .product_pod.

In each article tag, the complete book title is located in the thumbnail image as an alt attribute value. The CSS Selector for the book title would be .image_container img.

Finally, the CSS Selector for the book price would be .price_color.

To get all of the titles and prices from this page, first, we need to locate the container and then run the each loop.

In this loop, an anonymous function will extract and print the title along with the price as follows:

function scrapePage($url, $client)
{
    $crawler = $client->request('GET', $url);
    $crawler->filter('.product_pod')->each(function ($node) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        echo $title . "-" . $price . PHP_EOL;
    });
}

The web data extraction functionality is now isolated in a function. The same function can be reused for every page of the site that shares this layout.
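
Note that the price is extracted as a text label that includes the currency symbol, for example £51.77. If a number is needed for later analysis, the label can be converted. The following is a minimal sketch; the function name parsePrice is hypothetical, and the code assumes a dot as the decimal separator and no thousands separators:

```php
<?php
// Converts a scraped price label such as "£51.77" into a float.
// Assumes a dot as the decimal separator and no thousands separators.
function parsePrice(string $label): float
{
    // Keep digits and the decimal point, drop currency symbols and whitespace.
    $numeric = preg_replace('/[^0-9.]/', '', $label);

    return (float) $numeric;
}

echo parsePrice('£51.77') . PHP_EOL;
```

Keeping the conversion in its own function leaves the scraping code unchanged and makes the cleanup step easy to adjust for other price formats.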

Handling pagination

At this point, our PHP web scraper is performing data extraction from only a single URL. In real-life web scraping scenarios, multiple pages would be involved.

In this particular site, the pagination is controlled by a Next link (button). The CSS Selector for the Next link is .next > a.

In the function scrapePage that we’ve created earlier, add the following lines:

try {
    $next_page = $crawler->filter('.next > a')->attr('href');
} catch (InvalidArgumentException $e) { // Next page not found
    return null;
}
return "https://books.toscrape.com/catalogue/" . $next_page;

This code uses the CSS Selector to locate the Next button and to extract the value of the href attribute, returning the relative URL of the subsequent page. On the last page, this line of code will raise the InvalidArgumentException.

If the next page is found, this function will return its URL. Otherwise, it will return null.

Note that each scraping cycle will start from a URL under /catalogue/ rather than from the home page. Since every page then lives in the same directory, the same base URL can be prepended to each relative link to get an absolute one.
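
The conversion from a relative URL to an absolute one can also be derived from the current page URL instead of a hardcoded prefix. The following is a simplified sketch, not the approach used in the rest of this article; the function name resolveNextUrl is hypothetical, and only plain relative paths such as page-2.html are handled:

```php
<?php
// Builds an absolute URL from the current page URL and a relative href.
// Simplified: handles only plain relative paths such as "page-2.html",
// not "../" segments, absolute paths, or full URLs.
function resolveNextUrl(string $currentUrl, ?string $href): ?string
{
    if ($href === null || $href === '') {
        return null; // no Next link: the crawl is finished
    }

    // Everything up to and including the last slash is the base directory.
    $base = substr($currentUrl, 0, strrpos($currentUrl, '/') + 1);

    return $base . $href;
}

echo resolveNextUrl('https://books.toscrape.com/catalogue/page-1.html', 'page-2.html') . PHP_EOL;
```

With this approach, the scraper keeps working if it is started from a page in another directory, as long as the links stay simple relative paths.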

Lastly, we can use a while loop to call this function:

$client = new Client();
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";
while ($nextUrl) {
    $nextUrl = scrapePage($nextUrl, $client);
}

The web scraping code is almost complete.
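
One thing to be aware of is that a while loop like this ends only when no Next link is found. The following is a hedged sketch of a variant with a page limit and a check for repeated URLs. The names crawl and $fakeScrapePage are hypothetical, and the stand-in function only imitates scrapePage so that the loop can be tried without any network access:

```php
<?php
// Follows "next page" URLs until none is returned or a page limit is reached.
// $scrapePage is any callable that takes a URL and returns the next URL or null.
function crawl(callable $scrapePage, string $startUrl, int $maxPages = 100): array
{
    $visited = [];
    $nextUrl = $startUrl;

    while ($nextUrl !== null && count($visited) < $maxPages) {
        if (in_array($nextUrl, $visited, true)) {
            break; // the site linked back to a page we already scraped
        }
        $visited[] = $nextUrl;
        $nextUrl = $scrapePage($nextUrl);
    }

    return $visited;
}

// Stand-in for scrapePage(): pretends the site has exactly three pages.
$fakeScrapePage = function (string $url): ?string {
    $next = ['page-1' => 'page-2', 'page-2' => 'page-3'];
    return $next[$url] ?? null;
};

print_r(crawl($fakeScrapePage, 'page-1'));
```

To use it with the real function, the callable can be a closure that captures $client and forwards each URL to scrapePage.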

Writing data to a CSV file

The final step of the PHP web scraping process is to export the data to storage. PHP’s built-in fputcsv function can be used to export the data to a CSV file.

First, open the CSV file in write or append mode and store the file handle in a variable.

Next, send the variable to the scrapePage function. Then, call the fputcsv function for each book to write the title and price in one row.

Lastly, after the while loop, close the file by calling fclose.

The final code file will be as follows:

<?php
require 'vendor/autoload.php';
use Goutte\Client;

function scrapePage($url, $client, $file)
{
    $crawler = $client->request('GET', $url);
    $crawler->filter('.product_pod')->each(function ($node) use ($file) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        fputcsv($file, [$title, $price]);
    });
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}

$client = new Client();
$file = fopen("books.csv", "a");
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";

while ($nextUrl) {
    echo "<h2>" . $nextUrl . "</h2>" . PHP_EOL;
    $nextUrl = scrapePage($nextUrl, $client, $file);
}
fclose($file);

Run this file from the terminal:

php books.php

This will create a books.csv file with 1,000 rows of data.
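
Because the file is opened in append mode ("a"), running the script a second time adds another 1,000 rows to the existing file. If a fresh file with a header row is preferred, write mode ("w") can be used instead. The following is a minimal sketch; the function name writeCsv, the file name books-sample.csv, and the sample rows are placeholders rather than scraped data:

```php
<?php
// Writes rows to a CSV file, replacing any previous contents ("w" mode).
function writeCsv(string $path, array $header, array $rows): void
{
    $file = fopen($path, 'w');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    // Delimiter, enclosure, and escape character are passed explicitly.
    fputcsv($file, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($file, $row, ',', '"', '\\');
    }
    fclose($file);
}

// Sample rows; the values are placeholders, not scraped data.
writeCsv('books-sample.csv', ['title', 'price'], [
    ['A Sample Book', '£10.00'],
    ['Another, With a Comma', '£12.50'],
]);
```

fputcsv takes care of quoting, so a title that contains a comma still ends up in a single column.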

Web scraping with Guzzle, XML, and XPath

Guzzle is a PHP library that sends HTTP requests to web pages in order to get a response. In other words, Guzzle is a PHP HTTP client that you can use to scrape data. Before we can work with a web page, however, we need to understand two more concepts: XML and XPath.

XML stands for eXtensible Markup Language. It is used to create files that store structured data. These files can then be transmitted, and the data reconstructed on the receiving end.

Reading XML files raises the question of how to locate specific pieces of data, and this is where XPath comes into the picture.

XPath stands for XML Path Language and is used for navigating XML documents and selecting XML nodes.

HTML files are very similar to XML files, but HTML is often not well-formed XML. A parser is therefore needed to smooth over the differences, and some parsers can read even poorly formatted markup.

Once the parser has built a document tree from the HTML, we can use XPath to query and navigate it.
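
To see this repair step in isolation, here is a minimal sketch that uses PHP’s built-in DOMDocument and DOMXPath classes rather than the libraries introduced below. The HTML string is a made-up, deliberately sloppy sample:

```php
<?php
// Deliberately sloppy HTML: unclosed <p> and <li> tags, no <html> or <body>.
$html = '<title>Sample page</title><p>First paragraph<ul><li>One<li>Two</ul>';

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html); // the parser repairs the markup while loading
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Once parsed, the repaired tree can be queried with XPath.
$title = $xpath->query('//title')->item(0)->textContent;
$items = $xpath->query('//li');

echo $title . PHP_EOL;
echo $items->length . ' list items' . PHP_EOL;
```

Even though the input is not valid XML, the parser produces a tree in which the title and both list items can be selected.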

Setting up a Guzzle Project

To install Guzzle, create a directory where you intend to keep the source code. Navigate to the directory and enter these commands:

composer init --no-interaction --require="php >=7.1"
composer require guzzlehttp/guzzle

In addition to Guzzle, we’ll also use a library for parsing HTML code. There are many PHP libraries available, such as Simple HTML DOM Parser and Symfony DomCrawler. In this tutorial, Symfony DomCrawler is chosen because its syntax is very similar to Goutte’s, so you will be able to apply what you already know from the previous section. Another point in favor of DomCrawler over Simple HTML DOM Parser is that it handles invalid HTML code very well.

Install DomCrawler using the following command:

composer require symfony/dom-crawler

These commands will download all the necessary files. The next step is to create a new file and save it as scraper.php.

Sending HTTP requests with Guzzle

Similar to Goutte, the most important class of Guzzle is Client. Open scraper.php and enter the following lines of PHP code:

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

Now we are ready to create an object of the Client class:

$client = new Client();

The client object can then be used to send a request. The method to send the request is conveniently called request. It takes two parameters, the HTTP method and the target URL, and returns a response:

$response = $client->request('GET', 'https://books.toscrape.com');

From this response, we can extract the web page's HTML as follows:

$html = $response->getBody()->getContents();
echo $html;

Note that in this example, the response contains HTML code. If you are working with a web page that returns JSON, you can save the JSON to a file and stop the script. The next section will be applicable only if the response contains HTML or XML data.
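
One way to tell the two cases apart is to look at the Content-Type header of the response. The following is a minimal sketch that works on plain strings so it can be tried without sending a request. The function name decodeIfJson, the sample body, and the file name response.json are assumptions; with Guzzle, the header value should be available through the PSR-7 method $response->getHeaderLine('Content-Type'):

```php
<?php
// Decides how to handle a response body based on its Content-Type header.
// Returns the decoded data for JSON responses, or null for anything else.
function decodeIfJson(string $contentType, string $body): ?array
{
    // Header values may carry parameters, e.g. "application/json; charset=utf-8".
    if (stripos($contentType, 'application/json') !== 0) {
        return null; // not JSON: continue with the HTML/XML parsing steps
    }

    $data = json_decode($body, true);

    return is_array($data) ? $data : null;
}

// Sample header and body standing in for a real response.
$data = decodeIfJson('application/json; charset=utf-8', '{"title": "A Sample Book", "price": 10.5}');
if ($data !== null) {
    // Save the JSON to a file; no HTML parsing is needed in this case.
    file_put_contents('response.json', json_encode($data, JSON_PRETTY_PRINT));
    echo 'Saved ' . count($data) . ' fields to response.json' . PHP_EOL;
}
```

When the function returns null, the body is not JSON and the script can carry on with the HTML parsing steps.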

Continuing, we will use the DomCrawler to extract specific elements from this web page.

Locating HTML elements via XPath

The Crawler class was imported at the top of scraper.php with the following line:

use Symfony\Component\DomCrawler\Crawler;

We can create an instance of the Crawler class as follows:

$crawler = new Crawler($html);

Now we can use the filterXPath method to extract any XML node. For example, the following line prints only the title of the page:

echo $crawler->filterXPath('//title')->text();

A quick note about XML nodes: in XML, everything is a node. An element is a node, an attribute is a node, and text is also a node. The filterXPath method returns the matched nodes wrapped in a Crawler object rather than a plain string. Therefore, even if you use the text() function in the XPath expression, you still have to call the text() method to get the text as a string.

In other words, both the following lines of code will return the same value:

echo $crawler->filterXPath('//title')->text();
echo $crawler->filterXPath('//title/text()')->text();

Now, let's move on to extracting the book titles and prices.

Extracting the elements

Before we write web scraping code, let’s start with analyzing the HTML of our page.

Open the web page https://books.toscrape.com in Chrome, right-click on a book and select Inspect.

The books are located in <article> elements with the class attribute set to product_pod. The XPath to select these nodes will be as follows:

//*[@class="product_pod"]

In each article tag, the complete book title is located in the thumbnail image as an alt attribute value. The XPath for book title and book price would be as follows:

//*[@class="image_container"]/a/img/@alt
//*[@class="price_color"]/text()
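
These expressions can be tried out without any network access. The following sketch uses PHP’s built-in DOMDocument and DOMXPath classes instead of DomCrawler, and a hand-written HTML fragment that only imitates the structure described above (it is an assumption, not a copy of the live page). Note the leading dot that makes each inner expression relative to its container:

```php
<?php
// Hand-written fragment imitating the structure described above
// (an assumption, not a copy of the live page).
$html = <<<HTML
<ol>
  <li><article class="product_pod">
    <div class="image_container"><a href="#"><img src="a.jpg" alt="First Sample Book"></a></div>
    <p class="price_color">&pound;10.00</p>
  </article></li>
  <li><article class="product_pod">
    <div class="image_container"><a href="#"><img src="b.jpg" alt="Second Sample Book"></a></div>
    <p class="price_color">&pound;12.50</p>
  </article></li>
</ol>
HTML;

libxml_use_internal_errors(true); // keep parser warnings quiet
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($document);

$books = [];
foreach ($xpath->query('//*[@class="product_pod"]') as $pod) {
    // The leading dot makes the expression relative to $pod; without it,
    // "//" would search the whole document and always match the first book.
    $title = $xpath->query('.//*[@class="image_container"]/a/img/@alt', $pod)->item(0)->nodeValue;
    $price = $xpath->query('.//*[@class="price_color"]/text()', $pod)->item(0)->nodeValue;
    $books[] = [$title, $price];
}

print_r($books);
```

The first query selects the containers, and the two relative queries then pick the attribute node and the text node inside each container.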

To get all of the titles and prices from this page, we first need to locate the container and then use a loop to get to each of the elements containing the data we need.

In this loop, an anonymous function will extract and print the title along with the price, as shown in the following PHP code snippet:

$crawler->filterXPath('//*[@class="product_pod"]')->each(function ($node) {
    $title = $node->filterXPath('.//*[@class="image_container"]/a/img/@alt')->text();
    $price = $node->filterXPath('.//*[@class="price_color"]/text()')->text();
    echo $title . "-" . $price . PHP_EOL;
});

This was a simple demonstration of how you can scrape data from a static page using Guzzle and DomCrawler. Note that this method will not work with a dynamic website. Such websites use JavaScript to render their content, which DomCrawler cannot execute. In cases like this, you will need to use Symfony Panther.

The next step after extracting data is to save it.

Saving extracted data to a file

To store the extracted data, you can change the script to use PHP's built-in fputcsv function and create a CSV file.

Update the code as follows:

$file = fopen("books.csv", "a");
$crawler->filterXPath('//*[@class="product_pod"]')->each(function ($node) use ($file) {
    $title = $node->filterXPath('.//*[@class="image_container"]/a/img/@alt')->text();
    $price = $node->filterXPath('.//*[@class="price_color"]/text()')->text();
    fputcsv($file, [$title, $price]);
});
fclose($file);

This code snippet, when run, will save all the data to the books.csv file.

Web scraping with Symfony Panther

Dynamic websites use JavaScript to render the contents. For such websites, Goutte wouldn’t be a suitable option.

For these websites, the solution is to employ a browser to render the page. It can be done using another component from Symfony – Panther. Panther is a standalone PHP library for web scraping using real browsers.

In this section, we’ll scrape quotes and authors from quotes.toscrape.com. It is a dummy website for learning the basics of scraping dynamic web pages.

Installing Panther and its dependencies

To install Panther, open the terminal, navigate to the directory where you will be storing your source code, and run the following commands:

composer init --no-interaction --require="php >=7.1" 
composer require symfony/panther
composer update

These commands will create a new composer.json file and install Symfony/Panther.

The other two dependencies are a browser and a driver. The common browser choices are Chrome and Firefox. The chances are that you already have one of these browsers installed.

The driver for your browser can be downloaded using any of the package managers.

On Windows, run:

choco install chromedriver

On macOS, run:

brew install chromedriver

Sending HTTP requests with Panther

Panther uses the Client class to expose the get() method. This method can be used to load URLs, or in other words, to send HTTP requests.

The first step is to create the Chrome Client. Create a new PHP file and enter the following lines of code:

<?php
require 'vendor/autoload.php';
use \Symfony\Component\Panther\Client;
$client = Client::createChromeClient();

The $client object can then be used to load the web page:

$client->get('https://quotes.toscrape.com/js/');

This line will load the page in a headless Chrome browser.

Locating HTML elements via CSS Selectors

To locate the elements, first, we need to get a reference for the crawler object. The best way to get an object is to wait for a specific element on a page using the waitFor() method. It takes the CSS Selector as a parameter:

$crawler = $client->waitFor('.quote');

The code line waits for the element with this selector to become available and then returns an instance of the crawler.

The rest of the code is similar to Goutte’s as both use the same CssSelector component of Symfony.

The container HTML element of a quote

First, the CSS Selector is supplied to the filter method to get all of the quote elements. Then, the anonymous function is applied to each quote to extract the author and the text:

$crawler->filter('.quote')->each(function ($node) {
    $author = $node->filter('.author')->text();
    $quote = $node->filter('.text')->text();
    echo $author . " - " . $quote . PHP_EOL;
});

Handling pagination

To scrape data from all of the subsequent pages of this website, we can simply click the Next button. For clicking the links, the clickLink() method can be used. This method works directly with the link text.

On the last page, the link won’t be present, and calling this method will throw an exception. This can be handled by using a try-catch block:

while (true) {
    $crawler = $client->waitFor('.quote');
…
    try {
        $client->clickLink('Next');
    } catch (Exception $e) {
        break;
    }
}

Writing data to a CSV file

Writing the data to CSV is straightforward when using PHP’s fputcsv() function. Open the CSV file before the while loop, write every row using the fputcsv() function, and close the file after the loop.

Putting everything together, here is the final code:

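<?php
require 'vendor/autoload.php';
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->get('https://quotes.toscrape.com/js/');

$file = fopen("quotes.csv", "a");
while (true) {
    $crawler = $client->waitFor('.quote');
    $crawler->filter('.quote')->each(function ($node) use ($file) {
        $author = $node->filter('.author')->text();
        $quote = $node->filter('.text')->text();
        fputcsv($file, [$author, $quote]);
    });
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // Next link not found
        break;
    }
}
fclose($file);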

Once you execute the web scraper contained in this PHP script, you will have a quotes.csv file with all the quotes and authors ready for further analysis.

Click here and check out a repository on GitHub to find the complete code used in this article.

Conclusion

You shouldn’t run into major hiccups when using Goutte for most static web pages, as this popular library offers sufficient functionality and extensive documentation. However, if the typical HTML extraction methods aren’t up to the task when dynamic elements come into play, then Symfony Panther is the right solution to deal with more complicated loads.

If you are working with a site developed using Laravel, CodeIgniter, or just plain PHP, writing the web scraping part directly in PHP can be very useful, for example, when creating your own WordPress plugin. As PHP is also a scripting language, you can write web scraping code even when it is not meant to be deployed to a website.

If you want to know more about how to scrape the web using other programming languages, check out similar articles on our blog, such as web scraping with JavaScript, Java, R, Ruby, Golang, cURL in PHP, and Python. Additionally, take the opportunity to try the functionality of our own general-purpose web scraper for free.

About the author

Augustas Pelakauskas

Former Senior Technical Copywriter

Augustas Pelakauskas was a Senior Technical Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent being writing. After testing his abilities in freelance journalism, he transitioned to tech content creation. When at ease, he enjoys the sunny outdoors and active recreation. As it turns out, his bicycle is his fourth-best friend.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.