
Augustas Pelakauskas

Dec 30, 2021 12 min read

PHP is a general-purpose scripting language and one of the most popular options for web development. For example, WordPress, the most common content management system to create websites, is built using PHP.

PHP offers the building blocks required to build a web scraper, although writing one from scratch can quickly become complicated. Conveniently, there are many open-source libraries that make web scraping with PHP more accessible.

This article will guide you through the step-by-step process of writing various PHP web scraping routines that can extract public data from static and dynamic web pages.

Installing prerequisites

To begin, make sure that you have both PHP and Composer installed.

If you are using Windows, visit this link to download PHP. Alternatively, you can use the Chocolatey package manager. Using Chocolatey, run the following command from the command line or PowerShell:

choco install php

If you are using macOS, the chances are that you already have PHP bundled with the operating system. Alternatively, you can use a package manager such as Homebrew to install PHP. Open the terminal and enter the following:

brew install php

Once PHP is installed, verify that the version is 7.1 or newer. Open the terminal and enter the following to verify the version:

php --version

Next, install Composer. Composer is a dependency manager for PHP. It’ll help to install and manage the required packages.

To install Composer, visit this link. Here you’ll find the downloads and instructions.

If you are using a package manager, the installation is easier. On macOS, run the following command to install Composer:

brew install composer

On Windows, you can use Chocolatey:

choco install composer

To verify the installation, run the following command:

composer --version

The next step is to install the required libraries.

Making an HTTP GET request

The first step of PHP web scraping is to load the page.

In this article, we’ll be using books.toscrape.com. The website is a dummy book store for practicing web scraping.

When viewing a website in a browser, the browser sends an HTTP GET request to the web server as the first step. To send the HTTP GET request using PHP, the built-in function file_get_contents can be used.

This function can take a file path or a URL and return the contents as a string.

Create a new file and save it as native.php. Open this file in a code editor such as Visual Studio Code. Enter the following lines of code to load the HTML page and print the HTML in the terminal:

<?php
$html = file_get_contents('https://books.toscrape.com/');
echo $html;

Execute this code from the terminal as follows:

php native.php

Upon executing this command, the entire HTML of the page will be printed.

As of now, it is difficult to locate and extract specific information within the HTML.
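To see why, here is a hedged sketch of manual extraction using PHP's built-in DOM extension, with no third-party libraries. The HTML string below is a stand-in for the page returned by file_get_contents:

```php
<?php
// A stand-in for the HTML returned by file_get_contents.
$html = '<html><head><title>All products | Books to Scrape</title></head><body></body></html>';

$dom = new DOMDocument();
// The @ suppresses warnings that real-world, imperfect HTML often triggers.
@$dom->loadHTML($html);

// Even extracting a single element requires an XPath query and node traversal.
$xpath = new DOMXPath($dom);
$title = $xpath->query('//title')->item(0)->textContent;
echo $title . PHP_EOL;
```

This works, but the verbosity grows quickly as selectors get more complex, which is exactly what the libraries below simplify.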

This is where various open-source, third-party libraries come into play.

Web scraping in PHP with Goutte

A wide selection of libraries is available for web scraping with PHP. In this article, we’ll use Goutte as it is accessible, well documented, and continuously updated. It is always a good idea to try the most popular solutions, as supporting content and preexisting advice are plentiful.

Goutte can handle most static websites. For dynamic sites, we’ll use Symfony Panther.

Goutte, pronounced as goot, is a wrapper around Symfony’s components, such as BrowserKit, CssSelector, DomCrawler, and HTTPClient.

Symfony is a set of reusable PHP components. The components used by Goutte can be used directly. However, Goutte makes it easier to write the code.

To install Goutte, create a directory where you intend to keep the source code. Navigate to the directory and enter these commands:

composer init --no-interaction --require="php >=7.1"
composer require fabpot/goutte
composer update

The first command will create the composer.json file. The second command will add the entry for Goutte as well as download and install the required files. It’ll also create the composer.lock file.

The composer update command will ensure that all the files of the dependencies are up to date.
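After these commands run, composer.json should look approximately like the sketch below. The exact Goutte version constraint depends on when you install it, so the `^4.0` shown here is an assumption:

```json
{
    "require": {
        "php": ">=7.1",
        "fabpot/goutte": "^4.0"
    }
}
```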

Sending HTTP requests with Goutte

The most important class for PHP web scraping using Goutte is the Client that acts like a browser. The first step is to create an object of this class:

$client = new Client();

This object can then be used to send a request. The method to send the request is conveniently called request. It takes two parameters — the HTTP method and the target URL, and returns an instance of the DOM crawler object:

$crawler = $client->request('GET', 'https://books.toscrape.com');

This will send the GET request to the HTML page. To print the entire HTML of the page, we can call the html() method.

Putting together everything we’ve built so far, this is what the code file looks like:

<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com');
echo $crawler->html();

Save this new PHP file as books.php and run it from the terminal. This will print the entire HTML:

php books.php

What we need next is a way to locate specific elements from the page.

Locating HTML elements via CSS Selectors

Goutte uses the Symfony component CssSelector. It facilitates the use of CSS Selectors in locating HTML elements.

The CSS Selector can be supplied to the filter method. For example, to print the title of the page, enter the following line to the books.php file that we are working with:

echo $crawler->filter('title')->text();

Note that title is the CSS Selector that selects the <title> node from the HTML.

Keep in mind that in this particular case, text() returns the text contained in the HTML element. In the earlier example, we used html() to return the entire HTML of the selected element.

If you prefer to work with XPath, use the filterXPath() method instead. The following line of code produces the same output:

echo $crawler->filterXPath('//title')->text();

Now, let’s move on to extracting the book titles and prices.

Extracting the elements

Before writing the web scraping code, we need to analyze the HTML of the page. Open https://books.toscrape.com in Chrome, right-click on a book, and select Inspect.

The books are located in the <article> tags

Upon examining the HTML of the target web page, we can see that each book is contained in an article tag, which has a product_pod class. Here, the CSS Selector would be .product_pod.

In each article tag, the complete book title is located in the thumbnail image as an alt attribute value. The CSS Selector for the book title would be .image_container img.

Finally, the CSS Selector for the book price would be .price_color.

To get all of the titles and prices from this page, we first need to locate the book containers and then iterate over them with the each() method.

In this loop, an anonymous function extracts and prints the title along with the price:

function scrapePage($url, $client)
{
    $crawler = $client->request('GET', $url);
    $crawler->filter('.product_pod')->each(function ($node) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        echo $title . "-" . $price . PHP_EOL;
    });
}

The data extraction functionality is isolated in a function, so the same function can be reused for extracting data from different pages.

Handling pagination

At this point, our web scraper is performing data extraction from only a single URL. In real-life web scraping scenarios, multiple pages would be involved.

In this particular site, the pagination is controlled by a Next link (button). The CSS Selector for the Next link is .next > a.

In the function scrapePage that we’ve created earlier, add the following lines:

try {
    $next_page = $crawler->filter('.next > a')->attr('href');
} catch (InvalidArgumentException) { //Next page not found
    return null;
}
return "https://books.toscrape.com/catalogue/" . $next_page;

This code uses the CSS Selector to locate the Next button and extract the value of its href attribute, which holds the relative URL of the subsequent page. On the last page, no element matches the selector, and attr() will throw an InvalidArgumentException.

If the next page is found, this function will return its URL. Otherwise, it will return null.

Since the href attribute holds a relative URL, the function prepends the base catalogue URL to return an absolute one. This way, each scraping cycle starts with a ready-to-use URL.
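The conversion can be sketched as a small standalone helper. The function name toAbsoluteUrl is hypothetical and not part of the tutorial's code; it just isolates the string handling that scrapePage performs inline:

```php
<?php
// Hypothetical helper illustrating the relative-to-absolute URL conversion.
function toAbsoluteUrl(string $base, string $relative): string
{
    // If the URL is already absolute, return it unchanged.
    if (preg_match('#^https?://#', $relative)) {
        return $relative;
    }
    // Otherwise, join the base and the relative part with a single slash.
    return rtrim($base, '/') . '/' . ltrim($relative, '/');
}

echo toAbsoluteUrl('https://books.toscrape.com/catalogue/', 'page-2.html') . PHP_EOL;
```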

Lastly, we can use a while loop to call this function:

$client = new Client();
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";
while ($nextUrl) {
    $nextUrl = scrapePage($nextUrl, $client);
}

The web scraping code is almost complete.

Writing data to a CSV file

The final step of the PHP web scraping process is to export the data to a storage. PHP’s built-in fputcsv function can be used to export the data to a CSV file.
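Before wiring fputcsv into the scraper, here is a minimal, self-contained sketch of how it behaves. It is illustrative only: it writes two sample rows to an in-memory php://temp stream instead of a real file so the raw CSV can be inspected directly:

```php
<?php
// Write sample rows to an in-memory stream instead of a file on disk.
$file = fopen('php://temp', 'r+');
fputcsv($file, ['A Light in the Attic', '£51.77']);
fputcsv($file, ['Tipping the Velvet', '£53.74']);

// Rewind and read back the raw CSV that fputcsv produced.
rewind($file);
$csv = stream_get_contents($file);
fclose($file);
echo $csv;
```

Each call to fputcsv writes one row, quoting fields only when they contain delimiters, enclosures, or newlines.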

First, open the CSV file in write or append mode and store the file handle in a $file variable.

Next, send the $file variable to the scrapePage function. Then, call the fputcsv function for each book to write the title and price in one row.

Lastly, after the while loop, close the file by calling fclose.

The final code file will be as follows:

<?php
require 'vendor/autoload.php';
use Goutte\Client;

function scrapePage($url, $client, $file)
{
    $crawler = $client->request('GET', $url);
    $crawler->filter('.product_pod')->each(function ($node) use ($file) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        fputcsv($file, [$title, $price]);
    });
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException) { //Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}

$client = new Client();
$file = fopen("books.csv", "a");
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";

while ($nextUrl) {
    echo $nextUrl . PHP_EOL;
    $nextUrl = scrapePage($nextUrl, $client, $file);
}
fclose($file);

Run this file from the terminal:

php books.php

This will create a books.csv file with 1,000 rows of data.

Web scraping with Symfony Panther

Dynamic websites use JavaScript to render the contents. For such websites, Goutte wouldn’t be a suitable option.

For these websites, the solution is to employ a browser to render the page. It can be done using another component from Symfony – Panther. Panther is a standalone PHP library for web scraping using real browsers.

In this section, we’ll scrape quotes and authors from quotes.toscrape.com. It is a dummy website for learning the basics of scraping dynamic web pages.

Installing Panther and its dependencies

To install Panther, open the terminal, navigate to the directory where you will be storing your source code, and run the following commands:

composer init --no-interaction --require="php >=7.1" 
composer require symfony/panther
composer update

These commands will create a new composer.json file and install Symfony/Panther.

The other two dependencies are a browser and a driver. The common browser choices are Chrome and Firefox. The chances are that you already have one of these browsers installed.

The driver for your browser can be downloaded using any of the package managers.

On Windows, run:

choco install chromedriver

On macOS, run:

brew install chromedriver

Sending HTTP requests with Panther

Panther’s Client class exposes the get() method. This method can be used to load URLs, or in other words, to send HTTP requests.

The first step is to create the Chrome Client. Create a new PHP file and enter the following lines of code:

<?php
require 'vendor/autoload.php';
use \Symfony\Component\Panther\Client;
$client = Client::createChromeClient();

The $client object can then be used to load the web page:

$client->get('https://quotes.toscrape.com/js/');

This line will load the page in a headless Chrome browser.

Locating HTML elements via CSS Selectors

To locate the elements, we first need a reference to the crawler object. The best way to get one is to wait for a specific element on the page using the waitFor() method. It takes a CSS Selector as a parameter:

$crawler = $client->waitFor('.quote');

The code line waits for the element with this selector to become available and then returns an instance of the crawler.

The rest of the code is similar to Goutte’s as both use the same CssSelector component of Symfony.

The container HTML element of a quote

First, the CSS Selector is supplied to the filter method to get all of the quote elements. Then, an anonymous function is passed to each() to extract the author and the text:

$crawler->filter('.quote')->each(function ($node) {
    $author = $node->filter('.author')->text();
    $quote = $node->filter('.text')->text();
    echo $author . " - " . $quote . PHP_EOL;
});

Handling pagination

To scrape data from all of the subsequent pages of this website, we can simply click the Next button. For clicking the links, the clickLink() method can be used. This method works directly with the link text.

On the last page, the link won’t be present, and calling this method will throw an exception. This can be handled by using a try-catch block:

while (true) {
    $crawler = $client->waitFor('.quote');
…
    try {
        $client->clickLink('Next');
    } catch (Exception) {
        break;
    }
}

Writing data to a CSV file

Writing the data to CSV is straightforward when using PHP’s fputcsv() function. Open the CSV file before the while loop, write every row using the fputcsv() function, and close the file after the loop.

Putting everything together, here is the final code:

<?php
require 'vendor/autoload.php';
use \Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->get('https://quotes.toscrape.com/js/');

$file = fopen("quotes.csv", "a");
while (true) {
    $crawler = $client->waitFor('.quote');
    $crawler->filter('.quote')->each(function ($node) use ($file) {
        $author = $node->filter('.author')->text();
        $quote = $node->filter('.text')->text();
        fputcsv($file, [$author, $quote]);
    });
    try {
        $client->clickLink('Next');
    } catch (Exception) {
        break;
    }
}
fclose($file);

Once you execute the web scraper contained in this PHP script, you will have a quotes.csv file with all the quotes and authors ready for further analysis.

Conclusion

You shouldn’t run into major hiccups when using Goutte for most static web pages, as this popular library offers sufficient functionality and extensive documentation. However, when JavaScript-rendered elements come into play and the typical HTML extraction methods aren’t up to the task, Symfony Panther is the right tool for these more complicated, dynamic pages.

If you are working with a site developed using Laravel, CodeIgniter, or plain PHP, writing the web scraping component directly in PHP can be very useful, for example, when creating your own WordPress plugin. As PHP is also a scripting language, you can write web scraping code even when it isn’t meant to be deployed to a website.

If you want to know more about how to scrape the web using other programming languages, check out similar articles, such as web scraping with JavaScript, Java, R, Ruby, and Python on our blog.


About Augustas Pelakauskas

Augustas Pelakauskas is a Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his third best friend.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
