HTTP headers let both the client and the server pass additional information along with a request or response.
As you might be aware, web scraping and web data collection tools are a popular way to gather vast amounts of publicly available intelligence in an automated way. Simply put, the more you know, the more you grow, right? But how much do you know about the web scraping process itself?
When it comes to the technical side of web scraping, which has evolved into an art of its own, perhaps the most interesting part is that there is no single correct way to set up a web scraper.
However, there are proven resources and techniques, such as using a proxy and practicing IP rotation (also known as rotating proxies), that will substantially increase your chances of success at web scraping, i.e. of avoiding blocks by target servers.
Another sometimes overlooked technique is to use and optimize HTTP headers. This practice will allow you to significantly decrease your web scraper’s chances of getting blocked by various data sources, and also ensure that the retrieved data is of high quality.
Hence, in this article, we discuss what HTTP headers are and elaborate on their purpose. What’s more, we discuss why using and optimizing HTTP headers is essential when web scraping, and how to secure your web app with various HTTP headers. Let’s begin.
- What are HTTP headers?
- List of HTTP headers
- Examples of HTTP headers
- Why use and optimize HTTP headers?
- How to secure your web app with HTTP headers?
- It’s a wrap
What are HTTP headers?
The purpose of HTTP headers is to let both the client and the server pass additional details along with a request or response.
However, let’s take a step back and dig a little deeper to understand what HTTP headers are and what their primary purpose is.
HTTP stands for HyperText Transfer Protocol. On the internet, it governs how communication is structured and transferred, and how web servers (think websites) and browsers (e.g. Chrome, Internet Explorer) should respond to different requests.
Generally, the client sends a request that includes headers. These headers carry additional information for the web server. The web server then responds by sending the requested data back to the client. Where possible, the data is structured according to the preferences stated in the request headers.
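To make the request side concrete, here is a minimal Python sketch that uses only the standard library. It builds a request object with two headers and reads them back without sending anything over the network. The URL and the header values are placeholders chosen for illustration, not values from a real site.

```python
from urllib.request import Request

# Hypothetical target URL and header values, used only for illustration.
request = Request(
    "https://example.com/products",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Accept-Language": "en-US,en;q=0.5",
    },
)

# Nothing has been sent yet; the headers simply travel with the request object.
# urllib stores header names via str.capitalize(), e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Header names are case-insensitive in HTTP, so the capitalization applied by urllib does not change how a server reads them.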
List of HTTP headers
HTTP headers can be grouped based on their context:
HTTP request header
Request headers are sent by the client, i.e. the internet browser, in an HTTP transaction. These headers carry many details about the source of the request, e.g. the type of browser (or application in general) being used and its version.
HTTP request headers are an important part of any HTTP communication. Websites tailor their layout and design to the type of machine, operating system and application making the request. Otherwise, content might be displayed incorrectly. The client describes its software and platform in the User-Agent header, and this information is commonly called the “user agent”.
If the website in question does not recognize the user agent, it will often default to one of two actions. Some websites will display a default HTML version they have prepared for cases like these, while others will block the request entirely.
HTTP response header
Response headers are sent by a web server in HTTP transaction responses. They often contain information on the type of connection, encoding, etc. Whether the initial request went through is reported by the status code on the first line of the response, just above the headers. If the request did not go through, that status code will be an error code. HTTP status codes are divided into specific categories:
- 1xx – Informational
- 2xx – Success
- 3xx – Redirection
- 4xx – Client Error
- 5xx – Server Error
Each category has many situation-specific responses. A full list of HTTP status codes is available on many different websites.
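The five categories above follow directly from the first digit of the code. The sketch below shows one way to map a code to its category in Python. The function name `describe_status` is our own choice for illustration, and the reason phrases come from the standard library’s `http.HTTPStatus` enum.

```python
from http import HTTPStatus

# Category names as listed above, keyed by the first digit of the code.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (Client Error)'."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered codes known to the enum.
        phrase = "Unregistered code"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
print(describe_status(503))
```

A scraper could use a check like this to decide whether to retry (typically 5xx) or to revisit its own request headers (typically 4xx).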
General HTTP header
General headers apply to both requests and responses, but don’t apply to the content itself. These headers can be present in any HTTP message.
The most common general headers are Connection, Cache-Control, and Date.
HTTP entity header
Entity headers contain information about the body of the resource. Like every HTTP header, each one is a name-value pair. Examples include Content-Language, Content-Length, etc.
Examples of HTTP headers
GET /URL/destination/to/get/ HTTP/1.1
Host: targetwebsite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=r2t5uvjq435r4q7ib3vtdjq120
Pragma: no-cache
Cache-Control: no-cache
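A raw request like the example above is just text: a request line, followed by one "Name: value" pair per line. The Python sketch below splits a shortened copy of that request into its parts. The helper name `parse_request` is hypothetical, and the parser is deliberately simplified: it keeps only the last value when a header repeats and does not handle every edge case of the HTTP specification.

```python
# A shortened copy of the example request; lines end with CRLF as in HTTP/1.1.
RAW_REQUEST = (
    "GET /URL/destination/to/get/ HTTP/1.1\r\n"
    "Host: targetwebsite.com\r\n"
    "Accept-Language: en-us,en;q=0.5\r\n"
    "Accept-Encoding: gzip,deflate\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)


def parse_request(raw: str):
    """Split a raw request into method, path, version and a header dict."""
    # Everything before the first blank line is the request head.
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ", 2)

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


method, path, version, headers = parse_request(RAW_REQUEST)
print(method, path, version)
print(headers["host"], headers["connection"])
```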
The User-Agent header is one of the most important headers and can decide whether your request succeeds. Use the user agent strings of the most common browsers to reduce the risk of being blocked while web scraping.
Some HTTP headers can be grouped according to how proxies deal with them. We have already written about HTTP proxies and their configuration. Here are some headers that interact with proxies:
- Connection – a general header that controls if network connection stays open after the current transaction finishes.
- Keep-Alive – allows the sender to indicate how the connection may be used, by setting a timeout and a maximum number of requests. For this header to have any effect, the Connection header must be set to keep-alive.
- Proxy-Authenticate – a response header that defines the authentication method for gaining access to a resource behind a proxy server. The proxy sends it with a 407 Proxy Authentication Required response, so that the client can authenticate itself and have the proxy transmit the request further.
- Proxy-Authorization – a request header that includes the credentials to authenticate a user agent to a proxy server.
- Trailer – a header that allows the sender to include additional fields at the end of chunked messages, for example a message integrity check, post-processing status, or digital signature.
- Transfer-Encoding – specifies the form of encoding used to safely transfer the payload body to the recipient. This header is applied to a message between two nodes, not to a resource itself.
These are just a few examples of HTTP headers. Listing all of the possible variations of HTTP headers is nearly impossible. Headers can accompany many types of requests, and can ask for specific languages, encodings, etc.
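Two of the proxy-related headers above have values that are easy to build or read in code. The Python sketch below assumes the proxy uses the Basic authentication scheme (other schemes exist) and that Keep-Alive carries the common "timeout=…, max=…" parameters. Both function names and the credentials are hypothetical, for illustration only.

```python
import base64


def basic_proxy_authorization(username: str, password: str) -> str:
    """Build a Proxy-Authorization value for the Basic scheme."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


def parse_keep_alive(value: str) -> dict:
    """Parse a Keep-Alive value such as 'timeout=5, max=1000' into integers."""
    params = {}
    for part in value.split(","):
        key, sep, number = part.strip().partition("=")
        # Skip anything that is not a simple key=integer pair.
        if sep and number.strip().isdigit():
            params[key.strip().lower()] = int(number)
    return params


# Placeholder credentials; never hard-code real ones.
print("Proxy-Authorization:", basic_proxy_authorization("user", "pass"))
print(parse_keep_alive("timeout=5, max=1000"))
```

Note that Basic credentials are only encoded, not encrypted, so they should be sent over a secured connection.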
Why use and optimize HTTP headers?
- Decrease web scraper’s chances of getting blocked by the target server
- Increase the quality of data retrieved from the target server
Simply put, the HTTP headers you send have a direct impact on what type of data web servers return, and on its quality.
What’s more, if you use HTTP headers appropriately, you can substantially reduce the chances of getting blocked by web servers.
Nowadays, most web service owners understand that they will most likely get their data scraped by someone. Some scrapers slow down websites, so website owners use every tool at their disposal to protect the website. One of these tools might be to automatically block any fake user agents they detect. Some web server owners might even display incorrect information if they detect a fake user agent! If you are looking for information on how to crawl a website without getting blocked, check out our blog.
As mentioned before, HTTP headers carry additional information to web servers. By optimizing the content of this message, it is possible to make a request look as if it is coming from an organic user. Such traffic is much less likely to be blocked by web servers.
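One way to apply this is to keep a single set of browser-like default headers and attach it to every request. The Python sketch below does that with the standard library. The User-Agent string is only an example of the format a common desktop browser sends and will go out of date, the `build_request` helper is hypothetical, and nothing here guarantees that a given site will accept the request.

```python
from urllib.request import Request

# Example defaults modelled on what a desktop browser sends. The User-Agent
# value is illustrative and should be kept current in real use.
# Accept-Encoding is left out because urllib does not decompress
# gzip responses automatically.
BROWSER_LIKE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Create a request carrying the defaults plus any per-request headers."""
    headers = {**BROWSER_LIKE_HEADERS, **(extra_headers or {})}
    return Request(url, headers=headers)


# Hypothetical URLs; the request is built here but not sent.
req = build_request(
    "https://example.com/catalog",
    {"Referer": "https://example.com/"},
)
print(req.header_items())
```

Keeping the defaults in one place makes it easier to keep the headers consistent with each other, for instance so that the Accept-Language value matches the locale the User-Agent implies.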
How to secure your web app with HTTP headers?
While HTTP headers can be used by web scrapers to avoid IP blocks, they can also be employed by web servers for web security. Simply speaking, HTTP security headers are a contract between the browser and the developer. This contract is defined by HTTP response headers that set the level of the website’s security.
Here are some of the most common HTTP headers that allow you to secure your web applications:
- Content-Security-Policy header – provides an additional layer of security. It helps prevent various attacks, including Cross-Site Scripting (XSS) and other code injection attacks. The policy defines which content sources are approved, and the browser loads content only from them.
- Feature-Policy header – either allows or denies the use of browser features in the page’s own frame as well as in content within <iframe> elements. It has since been renamed Permissions-Policy.
- X-Frame-Options header – protects website visitors from clickjacking attacks.
- X-XSS-Protection header – used to configure the built-in reflected XSS protection found in Chrome, Internet Explorer, and Safari (WebKit). Newer browser versions have largely dropped this filter, so it should not be your only defence against XSS.
- Referrer-Policy header – controls how much referrer information, sent via the Referer header (the misspelling is part of the HTTP standard), ought to be included with requests.
- X-Content-Type-Options response header – a marker used by the server to indicate that the MIME types advertised in the Content-Type headers should be followed, not changed. Its only defined value, nosniff, tells the browser not to guess a different MIME type.
You can easily check your HTTP header security online. Various tools let you check which HTTP security headers are currently set on your website. All you need is the URL that you want to check.
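The same kind of check can be sketched in a few lines of Python. The function below takes a dictionary of response headers and reports which of the security headers discussed above are absent. The list of headers to look for is our own selection for illustration, not an official standard, and the sample response headers are made up.

```python
# Security headers from the list above that this sketch looks for.
RECOMMENDED_SECURITY_HEADERS = (
    "Content-Security-Policy",
    "X-Frame-Options",
    "Referrer-Policy",
    "X-Content-Type-Options",
)


def missing_security_headers(response_headers):
    """Return the recommended security headers absent from a response."""
    # Header names are case-insensitive, so compare them in lowercase.
    present = {name.lower() for name in response_headers}
    return [
        header
        for header in RECOMMENDED_SECURITY_HEADERS
        if header.lower() not in present
    ]


# Made-up response headers for demonstration.
sample_response_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Frame-Options": "DENY",
    "x-content-type-options": "nosniff",
}
print(missing_security_headers(sample_response_headers))
```

This only checks for presence. A header can be present and still be configured too loosely, which is what the online tools additionally try to assess.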
It’s a wrap
Hopefully, by now you have a decent idea of what HTTP headers are, what their purpose is, and how they come into play in the web scraping world. We also briefly touched on HTTP security headers and how they work.
Of course, this is only the tip of the iceberg, and there are quite a few HTTP headers that need to be taken into account when web scraping. In a separate article, we have covered 5 essential HTTP headers that every web scraper must use and optimize. Check it out, and happy scraping!