Web Scraping: Another Block In The Wall | OxyCast #2

[0:00:00.0] Martynas: It is a very weird concept when you think about it… It's like a war between us, trying to get information, and them, not giving the information.

[0:00:11.6] Augustinas: Alex mentioned the phrase – "cat and mouse game."

[0:00:16.0] Martynas: Yes.

[0:00:17.8] Augustinas: It's always like that with, you know, these servers or, you know, the developers from other websites that don't want to be scraped – they keep coming.

[0:00:26.2] Martynas: For us, developers, we're the soldiers, and for us, it's a war.

[0:00:35.0] Augustinas: Hello and welcome to the OxyCast – the show where we talk about everything web scraping related. My name is Oggy, Augustinas Kalvis – call me whatever you want. And right on the other side of this little table, we have my beloved colleague and friend – Martynas Saulius. Why don't you introduce yourself?

[0:00:53.2] Martynas: Yeah, so I'm Martynas Saulius, as Oggy kind of introduced me already. I am a Python Developer here at Oxylabs, and I've been working on scrapers for the past three years.

[0:01:08.2] Augustinas: Martynai, how experienced would you call yourself? Are you at a mid-senior level already?

[0:01:14.5] Martynas: I am at mid-level.

[0:01:43.8] Augustinas: Before we move forward, let's clarify some terminology. We noticed that the internet sometimes uses the term web scraping a little bit differently than what we're used to here at Oxylabs. Web scraping, as a term, tends to cover multiple processes – scraping, parsing, and even crawling. As developers, we should try to be as concrete as possible when we're talking about technical things, which is why, in general, we avoid the term web scraping and, instead, use those individual terms. Let's start with scraping.

Scraping means requesting data from a web page's server. For example, we have a website called potatomarketplace.com. We use cURL, or any other HTTP client, to get the raw data, which can be HTML, JSON, or even an image.

Parsing means filtering particular information out of the raw data. This usually means that the scraped data is analyzed to extract the necessary information, which can later be structured into JSON, CSV, or other data formats.

And last but not least, we have crawling – scraping a web page, parsing it for links, and repeating the process for every parsed link. So that means we scrape a web page, we parse the HTML, and during this process, we look for links pointing to other web pages. Then we scrape and parse over and over again until we run out of links to follow. This extended process is what we call crawling. Now, with clear terminology, we can move forward.
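To make those terms concrete, here's a minimal sketch of the scrape-parse-crawl loop in Python, using the hypothetical potatomarketplace.com from the example above. It assumes the third-party requests and beautifulsoup4 packages and is only an illustration of the terminology, not Oxylabs' actual tooling.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape(url):
    """Scraping: ask the web server for the raw data (here, HTML)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_links(html, base_url):
    """Parsing: filter the information we care about out of the raw data - here, links."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(start_url, limit=50):
    """Crawling: scrape a page, parse it for links, repeat for every parsed link."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        queue.extend(parse_links(scrape(url), url))
    return seen


# The domain is the episode's hypothetical example, so this won't resolve for real.
crawl("https://potatomarketplace.com", limit=10)
```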

[0:03:18.0] Martynas: Oh, so a scraper is very basic when you think about it. It's just a mechanism that goes to a certain website and extracts data from it – fetches it.

[0:03:30.1] Augustinas: Alright.

[0:03:31.2] Martynas: Mostly, it's just the HTML of the web page.

[0:03:35.1] Augustinas: Okay, and have you ever built a web scraper yourself?

[0:03:39.4] Martynas: Yes, very long ago. I don't really remember the details, but mostly I made my own scraper to bypass the RSS feed for updates on a certain forum of mine. I wanted to see the updates in almost real time – the exact updates, not just new forum posts.

[0:04:01.9] Augustinas: Okay, and that was a long time ago. Right?

[0:04:04.8] Martynas: Yes.

[0:04:06.6] Augustinas: You remember how long ago that was?

[0:04:07.6] Martynas: That would be somewhere at the start of my university time, so about seven to eight years ago.

[0:04:12.9] Augustinas: And how long did you say you've worked at Oxy?

[0:04:15.5] Martynas: For three years here.

[0:04:17.1] Augustinas: Okay, so it's a three-year window from the point where you had to build your own web scraper.

[0:04:22.8] Martynas: Until I started working, yes.

[0:04:25.6] Augustinas: Okay.

[0:04:26.5] Martynas: It's a weird coincidence like that.

[0:04:28.6] Augustinas: And, well, I know it was a long time ago, but perhaps you remember any sorts of issues that you encountered while building that particular web scraper?

[0:04:37.4] Martynas: Well, basically, most of my issues back then came down to inexperience, let's say. I didn't really know what I was doing. So I just went in and started writing something and made a lot of mistakes. Most of them came from not understanding how the internet works in general and what you do with the data from the internet.

[0:04:58.3] Augustinas: And now, would you say you're comfortable with your understanding of the internet?

[0:05:01.3] Martynas: Yes, maybe.

[0:05:06.6] Augustinas: Maybe?

[0:05:06.8] Martynas: Well, I mean, it changes every day, and usually, there are some dark corners of the internet that I wouldn't want to know, but that's a topic for another day.

[0:05:14.9] Augustinas: Okay, and so three years in Oxy, well, you probably had quite the journey so far, probably encountered a lot of web scrapers and maybe even built your own.

[0:05:25.8] Martynas: Built my own, no. But I've mostly maintained the scraper that we have here.

[0:05:31.9] Augustinas: Alright, and what would you say are the differences between, well, like, the web scraper that you built yourself all those years ago and the ones that we maintain here at Oxy?

[0:05:45.5] Martynas: Generally, well, the first word that comes to mind comparing my own little project and the scraper that I'm working on here would be scale. The sheer scale – how much bigger it is, how many more requests it accommodates, and what kind of targets it's scraping. Because that little forum of mine had nothing; I could just shoot one or two requests at it every day and be happy with it. But here, we have to deal with thousands in a second.

[0:06:20.1] Augustinas: So if you had to, you know, give me some sort of number, how many requests do you think your little scraper from back then used to do per day, in comparison to the one..?

[0:06:32.2] Martynas: As I said, about two, three a day.

[0:06:35.3] Augustinas: Two, three a day of your little web scraper.

[0:06:37.5] Martynas: Yes, and that was sufficient for me.

[0:06:39.6] Augustinas: And here we probably do by the millions, right? 

[0:06:42.4] Martynas: In a day, yes. If we had to ballpark the numbers and account for a whole month, the number would go into the billions.

[0:06:54.4] Augustinas: Okay, so the scale is an issue, right? What about any other issues that you..?

[0:07:06.8] Martynas: I'm not sure if that would be an issue because, as I said, back then it was more of an inexperience thing. But now, seeing what you scrape and how the scraper acts – the observability part – I feel is very important as well. If you don't know why your scraping failed or why some issue happened, you wouldn't even know what to fix.

[0:07:31.2] Augustinas: Right, because when you're building your own little web scraper, and you're playing with it locally, you can use the debugger, and then you can actually see what's happening under the hood. But then you can't do that in production, right? So observability is clearly important so that we can see what's really happening in our own scrapers here at Oxylabs.

[0:07:48.6] Martynas: Yes, absolutely.

[0:07:50.2] Augustinas: How do we make observability happen in general? Do you happen to know some of the tooling that we use, or..?

[0:08:00.8] Martynas: Generally, we use logs. We log all events that happen in our system, store them, and use the Elasticsearch stack just to aggregate all the data and see how things went in the past. We have certain metrics that we observe with Prometheus, and now we're entering the deep waters of the OpenTelemetry object. An object…I mean, project.

[0:08:32.9] Augustinas: Yeah, I recently picked up a Rust book by Luca Palmieri where he talked about the concept of tracing. So, if I'm not mistaken, OpenTelemetry is a tracing library, right?

[0:08:44.6] Martynas: It's an open project that tries to standardize all tracing methods and so on.

[0:08:52.3] Augustinas: So logs, metrics, traces, but we're still kind of young in the trace sphere of things?

[0:08:59.9] Martynas: Yes, the trifecta of observability, yes.
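For illustration, here is a rough sketch of the logs-and-metrics half of that trifecta in Python, assuming the prometheus_client package. The metric and label names are invented for this example; they are not Oxylabs' actual instrumentation.

```python
import logging

from prometheus_client import Counter, start_http_server

# Plain-text event logs: in a setup like the one described, a log shipper
# would forward these into the Elasticsearch stack for aggregation.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")

# A Prometheus counter, exposed on :8000/metrics for the Prometheus server to pull.
REQUESTS = Counter("scraper_requests_total", "Scrape requests by outcome", ["target", "outcome"])


def record(target, outcome):
    """Log the event and bump the matching metric."""
    log.info("target=%s outcome=%s", target, outcome)
    REQUESTS.labels(target=target, outcome=outcome).inc()


if __name__ == "__main__":
    start_http_server(8000)  # serve the /metrics endpoint
    record("potatomarketplace.com", "blocked")
```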

[0:09:03.3] Augustinas: Anything else that comes to mind? Any other big issues or..?

[0:09:10.5] Martynas: Well, the reason why we need observability is not just that the scrapers sometimes crash and burn. Sometimes, the targets that we scrape don't want to give us information. If we don't know what happened, then…

[0:09:25.7] Augustinas: So what's happening here? The targets, they don't want to give us information? Oh, right, I think we talked about this last episode. Alex mentioned that businesses usually don't want to be scraped because, well, it's not in their best interest.

[0:09:42.9] Martynas: This would be the crux here for us at Oxylabs – what we're mostly dealing with.

[0:09:51.1] Augustinas: What do we call that? 

[0:09:52.6] Martynas: We basically call it being blocked.

[0:09:56.1] Augustinas: Okay, so we even have a name for this particular issue already.

[0:09:59.1] Martynas: Yes.

[0:10:03.0] Augustinas: So, we know why it happens. How are we blocked, really? What happens when we are blocked?

[0:10:11.0] Martynas: We don't exactly know, well, how should I say, why it happens – we do know from what Alex said. But how we're blocked in general – the target decides to either give you the information or not. The web page. Right? And the blocking usually appears as just being outright forbidden access. They can give you false information if they determine that you're, well, an antagonist to them. And the third way, which is the most familiar to everyone, is CAPTCHA pages. You need to prove that you're not a bot.

[0:10:57.1] Augustinas: Okay, how do you even solve a CAPTCHA? I mean, there are so many things I want to ask about this, but this is…

[0:11:09.8] Martynas: Take your time - I'm here all day.

[0:11:15.5] Augustinas: Okay, no, but I'm really interested in what you do, image recognition or something like that?

[0:11:23.1] Martynas: If there's a CAPTCHA, usually, credit where credit's due, they do a good job. It's very, very difficult to bypass a CAPTCHA using the technology that we have right now. So, instead of brute-forcing a CAPTCHA, even if there were attempts from our side to do so, we tend to avoid it altogether. We try to avoid blocking.

[0:11:54.9] Augustinas: Avoid it.

[0:11:55.2] Martynas: Yes, the idea of blocking is that you're blocked when the target believes you're a bot. To avoid that, you need to show yourself as a human, act as a human. 

[0:12:11.0] Augustinas: And, well, how do they know that we're a bot?

[0:12:14.5] Martynas: That's where our job begins. We need to understand the ways they figure out that we're bots and avoid those.

[0:12:23.9] Augustinas: So, what are those ways then? Or is that like a secret that you're not allowed to give out?

[0:12:30.0] Martynas: No, no, no, you can practically Google it, and then you get a lot of answers. But there are a few very important points.

[0:12:38.8] Augustinas: Okay.

[0:12:40.5] Martynas: The first one being that we all communicate through the HTTP protocol, right? And the HTTP protocol can have parameters. Those parameters usually just help the target understand what kind of browser is asking for information – does it need to format the page, change some fields, or make it more accessible, and yada yada. There are a lot of reasons why one would look at the parameters.

[0:13:09.9] Augustinas: So, if I remember correctly, that's through headers and things such as that. 

[0:13:18.7] Martynas: Well, a lot of things but, mostly, headers, cookies, and so on. Cookies are part of the headers, I think. 

[0:13:25.6] Augustinas: Okay.

[0:13:30.5] Martynas: They might see some suspicious activity in those parameters, and they would believe you're not human. Like, let's say…the easiest one would be cURL, right? It's a CLI tool. Well, how should I say it? I don't remember the exact terminology for what cURL does – it does a lot – but mostly it just makes HTTP requests.

[0:13:59.4] Augustinas: Right, and yeah, it's a CLI tool that you just use to, usually, ping a website. I use it to debug my internet connection all the time. You know, cURL, google.com to see if anything comes back.

[0:14:11.2] Martynas: Yes, if you can't reach Google, that's a problem.

[0:14:14.1] Augustinas: Yeah, then the obvious question would be - how would a website know that you're using cURL? I know that there are headers, right, that probably give away that kind of information.

[0:14:27.2] Martynas: Generally, if you are not specifying anything, then the browser or the tool that you're using usually fills in the gaps. The user agent is one of the more important headers. It usually gets filled in just to say what kind of browser you are. If you're using a Google Chrome browser, for instance, it will set a user agent that defines what operating system, what version of it, and what kind of browser or engine is making this type of request.

[0:15:02.5] Augustinas: So what you're saying is, basically, if I change my user agent from time to time, I have a better chance of not being blocked by a website?

[0:15:12.7] Martynas: Not quite, but it's one of the parts.

[0:15:17.2] Augustinas: Okay. So there's more to the story. It's not just the user agent, right?

[0:15:19.1] Martynas: Yeah, there's more to it. So, to get back to the suspicious parameters part – if, for instance, you're using cURL, cURL automatically puts on its own user agent. Like, "I'm cURL. Please give me your information." They would start to think, "Wait, this is weird, that's not a human. A human would not browse that web page like that." And they'd say – "You're gone."

[0:15:46.2] Augustinas: "Stop it" – just because of the user agent? But you're saying that there's more to it.

[0:15:51.2] Martynas: There's more to it. There are a lot of headers that they can check. Like the Referer header – let's say it shows you came from Google. If you came from somewhere else, would they believe you?

[0:16:08.3] Augustinas: I, personally, if I was, you know, doing a security system, I would think that it doesn't usually matter that much. Because, well, maybe if my website is loved by my users, you know, it's possible that they entered my URL directly instead of, you know, being redirected there from Google or some other search engine.

[0:16:28.2] Martynas: But imagine that your connection is a new one. You're a new visitor, not a returning one.

[0:16:33.6] Augustinas: Absolutely new computer trying to connect to a website?

[0:16:35.6] Martynas: Yes, imagine someone connecting directly but claiming a known referrer out of nowhere.

[0:16:45.3] Augustinas: Yeah, I think I can see why that would be suspicious.

[0:16:45.8] Martynas: That would be suspicious, and you would scrutinize the request even more.

[0:16:50.3] Augustinas: You know, it might not necessarily be a deal breaker, but it would surely add to my suspicions. I would probably still need even more data to, you know, be absolutely sure that it's a bot.
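To illustrate the header discussion above, here is a small sketch with Python's requests library. The values are purely illustrative: without an explicit User-Agent, requests announces itself as python-requests, much as cURL announces itself as curl, and a claimed Referer only helps when it is plausible.

```python
import requests

headers = {
    # Without this, requests sends "python-requests/x.y.z" - as much of a
    # giveaway as cURL's default "curl/x.y.z" user agent.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    # A plausible referrer - but, as discussed, an implausible one
    # raises suspicion instead of lowering it.
    "Referer": "https://www.google.com/",
    "Accept-Language": "en-US,en;q=0.9",
}

# Hypothetical target from earlier in the episode.
response = requests.get("https://potatomarketplace.com", headers=headers, timeout=10)
print(response.status_code)
```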

[0:17:04.2] Martynas: Yes, and that's where the statistical viewpoint comes in. Usually, the target has so many requests coming at them every day, right? You cannot check them all, right? You can't scrutinize every parameter and see if they're a bot or not. So you tend to use a statistical viewpoint. And if somebody is using a parameter combination that doesn't really fit their statistics, they would mark that as a suspicious request.

[0:17:39.7] Augustinas: But you can do that, you know, on the other side of the spectrum, like, they use statistics to defend themselves, we can use statistics to be more credible.

[0:17:50.4] Martynas: Yes, but their data range is way larger than ours.

[0:17:55.3] Augustinas: Okay.

[0:17:55.5] Martynas: They get millions of requests. We're doing, like, let's say, 2,000 requests per second. So there's a gap in knowledge here.

[0:18:04.4] Augustinas: So, are these statistics made from only the headers or..?

[0:18:11.2] Martynas: Not just headers. They might have IP locations as well.

[0:18:13.1] Augustinas: Okay.

[0:18:17.0] Martynas: If you're shooting everything from the same subnet, maybe it's kind of suspicious. If you're shooting 1,000 requests from one computer, it might be suspicious as well. If we're putting, like, very arbitrary numbers here.

[0:18:30.8] Augustinas: What if you're a citizen from China, where, you know, the internet is probably very, how should I say it, tight, you know, there are many IP addresses in the same location?

[0:18:40.9] Martynas: I am not specialized in this part. I wouldn't say what exact tools they use here, but, in general, I think they have their own workarounds with their own ISPs, just to get the information they need, or they just get blocked outright. It's not the first time you hear about some websites being blocked.

[0:19:02.1] Augustinas: Yeah, I think Alex mentioned a little bit that, you know, we have proxy products, and one of the ways that I can think of how I would solve that is, you know, either use an IP address from a location where there's usually an incredibly huge amount of IP addresses in, you know, one small zone, or, you know, I would just use a proxy or a VPN, or something among those lines. So I'm guessing that's what we do.

[0:19:30.1] Martynas: Yes, to make it look like we're not doing all the requests from the same point, we use proxies that mask our IP, and that makes it a lot harder for the target to figure out whether you're a bot or a human.

[0:19:45.5] Augustinas: Okay.

[0:19:46.1] Martynas: And, usually, by default, they pick human. There are some targets that are very, very adamant about their security, and they set by default that any suspicious connection, or anything that they cannot check, is a bot. So those need to be checked. Once you're checked, you get a cookie. So you came to their website, and now you're identified as someone that came to the website before, and you use that cookie until they think – "Wait, this guy is doing something suspicious, we gotta check."
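A minimal sketch of that proxy idea in Python with requests. The proxy endpoints and credentials below are placeholders; in practice they would come from a proxy provider.

```python
import itertools

import requests

# Placeholder proxy endpoints - a real pool would come from a provider.
PROXIES = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])


def fetch(url):
    """Route each request through the next proxy so they don't all share one IP."""
    proxy = next(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)


response = fetch("https://potatomarketplace.com")
```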

[0:20:27.9] Augustinas: Alright, so what I'm hearing from this is that, you know, you have a risk of getting your IP blocked, really.

[0:20:38.7] Martynas: That is usually the easiest way for them to restrict.

[0:20:43.4] Augustinas: Right.

[0:20:44.8] Martynas: And restrict any further requests.

[0:20:49.2] Augustinas: Okay. Still, I'm thinking that, you know, even if we were a proxy provider or something like that, we still have a limited amount of IP addresses, right?

[0:21:00.7] Martynas: Yes. So you can practically think of them as a resource at this point.

[0:21:06.3] Augustinas: Okay.

[0:21:06.9] Martynas: So, if you're always doing the same request without changing anything, with the same proxy, it might be flagged as a bot. Some targets even check if the bot is from the same subnet, and they will kill off part of that subnet and say, "We don't want any requests from this subnet."

[0:21:27.7] Augustinas: Right.

[0:21:28.5] Martynas: And it might become a problem because if you run out of IPs, how can you scrape anything?

[0:21:35.6] Augustinas: So it's not just about, you know, scraping a single website. It's also about, you know, scraping it responsibly.

[0:21:48.8] Martynas: Yes. 

[0:21:50.3] Augustinas: So that's what Alex meant last time. Oh, okay, now I think I'm connecting the dots here. So if you're not actually being careful about the resources that you're using; if you're not careful about, you know, the way that you're making these requests with, you know, some specific IP addresses, there's a risk that your IP addresses are going to get banned, right? If it happens enough times, well, then suddenly you're either going to have to spend more money on getting more IP addresses or, well, you're not going to be able to scrape the website.

[0:22:28.1] Martynas: I mean, there are some ways to get around that. There are, like, DC proxies, and there are, like, residential proxies. You can opt to use a very expensive but real machine to do the requests.

[0:22:41.9] Augustinas: Right.

[0:22:43.3] Martynas: And those are more likely to pass because they are more likely to be believed as real human devices.

[0:22:48.0] Augustinas: Okay. Okay, I want to talk about, you know, maybe some other ways of getting blocked that you know about.

[0:23:04.1] Martynas: There are some, but they fall more into the territory of exceptions rather than rules.

[0:23:11.6] Augustinas: Okay, yeah, because I'm thinking, you know, you can validate, probably, a request through, you know, all of these HTTP protocol things like the headers, right, or the cookies.

[0:23:20.7] Martynas: There's the TLS handshake. What kind of…how…what secure connection are you making? Some targets really do take that into consideration.

[0:23:32.6] Augustinas: But then, you know, if you're using something like cURL to scrape every single website, what if I'm validating that you're a real user through something like your mouse movements? Because you can do that with JavaScript.

[0:23:45.6] Martynas: Yes, that's one of the exceptions. Well, it's becoming a rule these days because most websites are starting to use JavaScript more and more. So, basically, the main idea is that we've just talked about blocking when there's a single request. Right? Now, what if there is a chain of multiple requests that you need to go through to get validated? The most common way of seeing that is when you enter a web page and it does not load everything instantly. First it loads the templates, the static elements, and so on, right?

[0:24:25.8] Augustinas: Like, you could probably validate that, you know, you loaded a CDN image before you loaded a local image.

[0:24:33.0] Martynas: Yes, and they do – a browser does extra JavaScript actions, right, and that results in more requests towards them, which usually help with validation. That's how they work it out.

[0:24:51.1] Augustinas: So what kind of tools do you then use to get around that?

[0:24:56.9] Martynas: See, we work with certain headless technologies, like Puppeteer or Playwright.

[0:25:05.3] Augustinas: I've never heard of those.

[0:25:05.9] Martynas: Well, basically, imagine their goal is not to work as a mechanism that does an HTTP request but as a mechanism that emulates a browser.

[0:25:15.6] Augustinas: Okay.

[0:25:16.2] Martynas: In the virtual environment.

[0:25:16.4] Augustinas: Okay.

[0:25:18.5] Martynas: For instance, they will open up a whole Chromium instance.

[0:25:21.0] Augustinas: So, like Selenium?

[0:25:22.7] Martynas: Selenium is a headless technology as well.

[0:25:24.8] Augustinas: Okay. Alright, and so you're using these headless browsers in order to, basically, emulate a browser, and then?

[0:25:36.0] Martynas: And let it do everything that the browser would do, like doing JavaScript actions.

[0:25:40.4] Augustinas: Okay.

[0:25:41.3] Martynas: But that's expensive.
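For reference, a minimal sketch of that headless-browser approach with Playwright's Python API, one of the tools named above (pip install playwright, then playwright install chromium). Launching a whole Chromium per page and letting it run the site's JavaScript is exactly why this route is expensive.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Opens up a whole Chromium, as described above - headless, so no window.
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Hypothetical target from earlier in the episode.
    page.goto("https://potatomarketplace.com")
    # Let the page fire its extra JavaScript-driven requests before reading it.
    page.wait_for_load_state("networkidle")
    html = page.content()  # the fully rendered HTML
    browser.close()
```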

[0:25:45.8] Augustinas: I wanted to summarize a little bit, you know, what we talked about just now. So there are the HTTP protocol things. We need to be careful about those, right? Then there's, I guess, the kind of requests that you do. Those also matter. Like, you know, maybe you're making a POST request somewhere.

[0:26:06.0] Martynas: Yes.

[0:26:09.1] Augustinas: You emulate the browser in order to fake yourself as a real user, basically. Is there anything else we missed?

[0:26:25.0] Martynas: If we missed something, it would come up eventually, and that's basically our job – to work with such unique problems, let's say.

[0:26:32.6] Augustinas: Okay.

[0:26:34.1] Martynas: And…but these are mainly what you'd encounter in this kind of sphere.

[0:26:42.3] Augustinas: Well, okay. Martynai, is there anything else you want to say about, you know, blocking in general or web scraping? Any other thoughts that you have about this topic?

[0:26:53.4] Martynas: It is a very weird concept when you think about it. It's like a war between us, trying to get information, and them, not giving the information.

[0:27:05.7] Augustinas: Alex mentioned the phrase – "cat and mouse game."

[0:27:11.1] Martynas: Yes.

[0:27:13.2] Augustinas: It's always like that with, you know, these servers or, you know, the developers from other websites that don't want to be scraped – they keep coming.

[0:27:20.8] Martynas: For us, developers, we're the soldiers, and for us, it's a war.

[0:27:25.2] Augustinas: Yes, that's exactly right, and they keep developing new techniques to recognize us, but, you know, in general, it's probably either headers or, you know, some form of checking you via JavaScript.

[0:27:38.4] Martynas: The internet has been developing for so long now – for, like, several decades, right?

[0:27:43.7] Augustinas: Right.

[0:27:44.2] Martynas: The technologies of scraping and bot detection have evolved. So these kinds of things about parameters that we talked about before – they were exceptions as well, but now they're rules. Same as I said, JavaScript is becoming more and more a part of the whole browsing experience.

[0:28:04.6] Augustinas: It's basically a trend that kept up with the times, and then that trend became common knowledge, and everybody's doing it. Okay, anything else that we missed?

[0:28:16.6] Martynas: From me - no. Unless you have some questions.

[0:28:20.5] Augustinas: Extra questions? No, I don't think I have any more questions. I will keep the other questions about the next topic that we will have.

[0:28:32.4] Martynas: And that topic would be?

[0:28:35.1] Augustinas: It's going to be about the next step after you scrape a website.

[0:28:40.0] Martynas: Oh, yes, let me guess, it's called parsing, isn't it?

[0:28:43.4] Augustinas: And there you have it, folks, this was the episode with Martynas Saulius, our expert on scraping. Well, then, Martynai, do you happen to know about our little ending phrase?

[0:28:54.3] Martynas: No, not really.

[0:28:55.8] Augustinas: Oh, well, before that – thank you so much, guys, for joining us. I'd like you to visit us on Spotify, Apple Podcasts, SoundCloud, and, of course, YouTube. Guys, thank you so much for joining us, and remember – scrape responsibly and parse safely.
