
How to Automate Competitors' & Benchmark Analysis With Python

Daniel Heredia Mejias

2021-12-22 · 4 min read

Doing competitor or benchmark analysis for SEO can be a burdensome task, as it requires taking into account many factors that are usually extracted from different data sources.

The purpose of this article is to help you automate the data extraction processes as much as possible. After learning how to do this, you can dedicate your time to what matters: the analysis itself and coming up with actionable insights to strategize.


The logic that we’ll follow to automate this process is as follows:

  1. We’ll use Oxylabs’ SERP Scraper API (part of Web Scraper API) to gather the public data from SERPs and get the top results for a specific keyword. 

  2. We’ll scrape the URLs that are ranking in the first positions of the SERPs and obtain the required on-page content that we’ll use for our analysis.

  3. We’ll connect to the MOZ API and obtain the necessary off-page metrics.

  4. We’ll use PageSpeed Insights API to put together some metrics related to Core Web Vitals.

  5. Finally, we’ll convert the Python list into a dataframe and export it as an Excel file.
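
To make the flow concrete, here’s a pseudocode-style skeleton of how these five steps could chain together. None of these functions exist yet; they are hypothetical placeholders for the code blocks developed in the rest of this article:

# Hypothetical skeleton of the pipeline; each function is a placeholder
# for the corresponding code block explained below.
def run_benchmark_analysis(keyword, filename):
    rows = get_serp_results(keyword)      # Step 1: SERP Scraper API
    scrape_onpage_content(rows, keyword)  # Step 2: on-page metrics
    add_moz_metrics(rows)                 # Step 3: off-page metrics
    add_pagespeed_metrics(rows)           # Step 4: Core Web Vitals
    export_to_excel(rows, filename)       # Step 5: dataframe + Excel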

Explaining the basics

In short, we’ll be able to export into our Excel sheet up to 18 metrics from the best-performing pages for a keyword, analyze them further, and get an idea of what it takes to rank in the top SERP spots. The metrics that we’re going to obtain in our Excel sheet as columns are:

  1. Meta Title SERPs: the meta title appearing on the SERPs;

  2. Meta Title On Page: the meta title that is written on-page;

  3. Meta Title Equal: True or False depending on whether the meta title on the SERPs matches the on-page title;

  4. Meta Description: on-page meta description; 

  5. H1: on-page H1;

  6. Paragraphs: content contained in <p> tags;

  7. Text length: number of characters from the paragraphs;

  8. Keyword Occurrences Paragraphs: how many times the keyword is used in the paragraphs;

  9. Meta Title Occurrence: whether the keyword is used in the meta title;

  10. Meta Description Occurrence: whether the keyword is used in the meta description;

  11. Equity Backlinks MOZ: the number of backlinks that pass link equity, according to MOZ;

  12. Total Backlinks MOZ: total number of backlinks from MOZ;

  13. Domain Authority: metric used by MOZ to show how authoritative a domain is;

  14. FCP (First Contentful Paint): measures the time from when a page starts loading to when any part of that page’s content is rendered on the screen;

  15. FID (First Input Delay): the time it takes for the browser to respond to the user’s first interaction;

  16. LCP (Largest Contentful Paint): the amount of time to render the largest content element visible in the viewport, from when the user requests the URL;

  17. CLS (Cumulative Layout Shift): proportion of the viewport that was impacted by layout shifts and the movement distance of the elements that were moved;

  18. Overall PSI Score: page speed overall score that ranges from 0 to 100.
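
To illustrate the data structure we’re building, each row of the final sheet will be a plain Python list holding the URL followed by these 18 values in order. All the values below are made up for the example:

# A hypothetical example of one finished row (all values invented):
example_row = [
    "https://example.com/page",      # URL
    "Example Title - Brand",         # Meta Title SERPs
    "Example Title - Brand",         # Meta Title On Page
    True,                            # Meta Title Equal
    "An example meta description.",  # Meta Description
    "Example H1",                    # H1
    ["First paragraph...", "..."],   # Paragraphs
    5230,                            # Text length
    12,                              # Keyword Occurrences Paragraphs
    True,                            # Meta Title Occurrence
    False,                           # Meta Description Occurrence
    37,                              # Equity Backlinks MOZ
    102,                             # Total Backlinks MOZ
    54,                              # Domain Authority
    1.8,                             # FCP (seconds)
    0.012,                           # FID (seconds)
    2.4,                             # LCP (seconds)
    0.05,                            # CLS
    87,                              # Overall PSI Score
]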

So, we’ve already explained the logic of the code and the variables that we’re going to obtain. Let’s get started with the process of getting the required public data for the automated analysis.

Using Oxylabs’ solution to retrieve the SERPs results

First of all, we’ll use Oxylabs’ Web Scraper API to extract the top results from the SERPs for an inputted keyword. Remember that you’ll need to get your API username and password to use this piece of code. You’ll also need to input a keyword.

import requests

keyword = "<your_keyword>"

payload = {
    "source": "SEARCH_ENGINE_search",
    "domain": "com",
    "query": keyword,
    "parse": "true",
}

response = requests.request(
    "POST",
    "https://realtime.oxylabs.io/v1/queries",
    auth=("<your_username>", "<your_password>"),
    json=payload,
)

list_comparison = [
    [x["url"], x["title"]]
    for x in response.json()["results"][0]["content"]["results"]["organic"]
]

* You’ll need to specify the exact source, for example, the largest search engine.

Example content of list_comparison:

>>> print(list_comparison)
[
	["https://example.com/result/example-link", "Example Link - Example"],
	["https://more-examples.net", "Homepage - More Examples"],
	["https://you-searched-for.com/query=your_keyword", "You Searched for 'your_keyword'. Analyze your search now!"],
]

From my point of view, Oxylabs offers a very competitive and robust service for scraping the SERPs. I’ve previously written on my blog about Oxylabs and how you can get the most out of it for SERP scraping.

Scraping URLs of the top results

After scraping the SERPs with Oxylabs’ scraping solution and getting the best-performing pages that rank for a particular keyword, we’ll scrape their URLs and extract their on-page contents with the Python library Requests. We’ll also parse the content with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

keyword_lower = keyword.lower()

for y in list_comparison:
    try:
        print("Scraping: " + y[0])
        html = requests.get(y[0])
        soup = BeautifulSoup(html.text, "lxml")

        try:
            metatitle = soup.find("title").get_text()
        except Exception:
            metatitle = ""

        try:
            metadescription = soup.find("meta", attrs={"name": "description"})["content"]
        except Exception:
            metadescription = ""

        try:
            h1 = soup.find("h1").get_text()
        except Exception:
            h1 = ""

        paragraph = [a.get_text() for a in soup.find_all("p")]
        text_length = sum(len(a) for a in paragraph)
        text_counter = sum(a.lower().count(keyword_lower) for a in paragraph)
        metatitle_occurrence = keyword_lower in metatitle.lower()
        # Check the meta description rather than the H1 so that this value
        # lines up with the "Metadescription Occurrence" column at the end.
        metadescription_occurrence = keyword_lower in metadescription.lower()
        metatitle_equal = metatitle == y[1]
        y.extend([metatitle, metatitle_equal, metadescription, h1, paragraph, text_length, text_counter, metatitle_occurrence, metadescription_occurrence])

    except Exception as e:
        print(e)
        y.extend(["No data"] * 9)
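
A practical note: some sites block or throttle requests that use the default Python user agent, which sends those rows straight into the “No data” branch. Here’s a small hardening sketch for the request inside the loop above; the User-Agent string is an arbitrary example, not a requirement of any API:

import requests

# The User-Agent value is an arbitrary, hypothetical example.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; benchmark-script/1.0)"}

def fetch(url):
    """Fetch a URL with a browser-like header and a timeout."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # Fail fast so the loop records "No data".
    return response.text

With this helper in place, you can swap requests.get(y[0]) in the loop for fetch(y[0]) and pass the returned text straight to BeautifulSoup.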

Obtaining the off-page metrics

Now, we’ll use the MOZ API to obtain the off-page metrics. Of course, you’ll need to get your MOZ API credentials (the access ID and secret key) and input them in the code below. It’s also worth mentioning that MOZ enables customers to make up to 2,000 API requests for free.

Install the Moz API library as follows:

pip install "git+https://github.com/seomoz/SEOmozAPISamples.git#egg=mozscape&subdirectory=python"

import time
from mozscape import Mozscape

client = Mozscape("<MOZ access ID>", "<MOZ secret key>")

for y in list_comparison:
    try:
        print("Getting MOZ results for: " + y[0])
        # ueid: equity links, uid: total links, pda: domain authority.
        metrics = client.urlMetrics(y[0])
        y.extend([metrics["ueid"], metrics["uid"], metrics["pda"]])
    except Exception as e:
        print(e)
        time.sleep(10)  # Wait out the rate limit, then retry once.
        metrics = client.urlMetrics(y[0])
        y.extend([metrics["ueid"], metrics["uid"], metrics["pda"]])
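
If you’re auditing many URLs, the mozscape library linked above can also take a list of URLs in a single urlMetrics call instead of one call per URL. A minimal sketch, assuming the same client and list_comparison as before and that the batch form returns one dict per URL in order:

# Batched variant: one urlMetrics call for all URLs at once (assumes the
# installed mozscape version supports list arguments, as the linked
# SEOmozAPISamples client does).
urls = [y[0] for y in list_comparison]
batched = client.urlMetrics(urls)
for y, metrics in zip(list_comparison, batched):
    y.extend([metrics["ueid"], metrics["uid"], metrics["pda"]])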

Obtaining the Page Speed metrics

Finally, we’ll obtain the page speed metrics using the PageSpeed Insights API. To use it, you’ll need to set up a project and get an API key from the Google Cloud Platform.

import requests

pagespeed_key = "<your page speed key>"

for y in list_comparison:
    try:
        print("Getting results for: " + y[0])
        url = ("https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url="
               + y[0] + "&strategy=mobile&locale=en&key=" + pagespeed_key)
        response = requests.get(url)
        data = response.json()

        overall_score = data["lighthouseResult"]["categories"]["performance"]["score"] * 100
        # Field (CrUX) percentiles are reported in milliseconds, so divide
        # by 1000 to express FCP, FID, and LCP in seconds.
        fcp = data["loadingExperience"]["metrics"]["FIRST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000
        fid = data["loadingExperience"]["metrics"]["FIRST_INPUT_DELAY_MS"]["percentile"] / 1000
        lcp = data["loadingExperience"]["metrics"]["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000
        cls = data["loadingExperience"]["metrics"]["CUMULATIVE_LAYOUT_SHIFT_SCORE"]["percentile"] / 100

        y.extend([fcp, fid, lcp, cls, overall_score])

    except Exception as e:
        print(e)
        y.extend(["No data"] * 5)
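
One caveat: the loadingExperience block contains field data from the Chrome UX Report, and it’s simply absent for low-traffic URLs, which currently sends the whole row into the except branch. A hedged fallback is to read Lighthouse’s lab metrics from the same response instead; lab values come from a simulated page load, so treat them as approximations rather than equivalents of the field percentiles:

# Fallback sketch using lab data, assuming "data" is the parsed PSI
# response from the loop above. numericValue is reported in milliseconds.
audits = data["lighthouseResult"]["audits"]
fcp = audits["first-contentful-paint"]["numericValue"] / 1000    # seconds
lcp = audits["largest-contentful-paint"]["numericValue"] / 1000  # seconds
cls = audits["cumulative-layout-shift"]["numericValue"]
fid = audits["max-potential-fid"]["numericValue"] / 1000  # lab proxy; no true lab FID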

Converting Python list into a dataframe and exporting it as an Excel file

As a final step, you can download all the data from your notebook as an Excel file with Pandas:

import pandas as pd
 
df = pd.DataFrame(list_comparison)
df.columns = [
    "URL", "Metatitle SERPs", "Metatitle Onpage", "Metatitle Equal",
    "Metadescription", "H1", "Paragraphs", "Text Length",
    "Keyword Occurrences Paragraph", "Metatitle Occurrence",
    "Metadescription Occurrence", "Equity Backlinks MOZ",
    "Total Backlinks MOZ", "Domain Authority", "FCP", "FID", "LCP",
    "CLS", "Overall Score",
]
df.to_excel('<filename>.xlsx', header=True, index=False)
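
Since the result is an ordinary Pandas dataframe, quick sanity checks are one-liners before you even open Excel. For instance, to surface the competitors with the strongest domains (defensively coercing any "No data" placeholders to NaN so the sort doesn’t mix types):

# Coerce non-numeric placeholders to NaN, then sort by MOZ Domain Authority.
df["Domain Authority"] = pd.to_numeric(df["Domain Authority"], errors="coerce")
print(df.sort_values("Domain Authority", ascending=False).head(10))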

The complete code used in this article is available in a repository on GitHub.

Conclusion

That’s it! You don’t even need to know how to code: just insert your credentials into the input forms, run the code, and extract all the publicly available data for the best-performing pages ranking for a particular keyword.

About the author

Daniel Heredia Mejias

SEO Marketer at Casumo

Daniel Heredia is a Spanish SEO manager who lives in Barcelona and works for Casumo. He’s in love with SEO and especially automation; the Python snake first bit him around three years ago. He writes about SEO and Python on his site, and in his spare time he runs some SEO side projects.

