
Web Scraping In A Nutshell: The How, When, And Why Answered | OxyCast #1

[0:00:00.0] Augustinas: Hi, my name is Augustinas. I'm a Software Developer at Oxylabs, and everybody calls me Oggy.

[0:00:13.7] Aleksandras: Yeah, hi, my name is Alex. I'm a Product Owner here at Oxylabs. Welcome to the first episode of OxyCast.

[0:00:23.0] Augustinas: That's right, welcome to the OxyCast. Alex, today we're going to talk about all things web scraping-related, but I'm pretty sure that you have a concrete topic for today.

[0:00:32.9] Aleksandras: Yeah, sort of, I mean, let's see where this goes, but I think we should be talking about, you know, what is web scraping in general? Why is it useful? What makes it difficult to do web scraping? Should you do it at all? I mean, should you scrape the web yourself? Or just buy the data, or, you know, that sort of stuff? So I think that's the, you know, the broad topic of today's discussion, right?

[0:01:03.0] Augustinas: Sounds about right. So I think we should probably start with the bread and butter – what exactly is a web scraper?

[0:01:12.5] Aleksandras: Okay, well, web scraping. You know, I'm sure there are like a hundred definitions of what web scraping is. But here at Oxylabs, at least, we consider scraping to be the act of, you know, making a request to a web server somewhere out there to get a piece of data, automatically, where that data is, mostly, not meant to be consumed by bots. It's meant to be consumed by, sort of, live actors. So scraping is automated data gathering from the web, where you pretend to be, you know, mostly, a human being, and you try to, sort of, get the data this way.

[0:02:06.1] Augustinas: So, in developer terms, it would be something like getting structured data from raw HTML rather than having something like JSON or XML.

[0:02:18.8] Aleksandras: Well, you know, it depends. Sometimes you can get the structured data right from the source, but, mostly, your data gathering from the web will consist of at least two parts. So the first part is actually getting some content from out there, right? And the second part is structuring the content that you get to make it useful to you, and that process is called parsing. So, for the purposes of this discussion, let's say that scraping is the part where you get the data from out there, and parsing is where you transform it from a bit of HTML code to something that is structured, whether it's JSON, CSV, or, you know, any other format you choose.
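
To make that split concrete, here's a minimal sketch in Python of the two steps – fetch, then parse. It assumes the third-party requests and beautifulsoup4 packages, and the URL and selector are just illustrative placeholders:

```python
import json

import requests
from bs4 import BeautifulSoup

# Step 1: scraping - fetch raw content from a web server.
url = "https://example.com"  # placeholder target
response = requests.get(url, timeout=10)
response.raise_for_status()

# Step 2: parsing - transform the raw HTML into structured data.
soup = BeautifulSoup(response.text, "html.parser")
heading = soup.select_one("h1")
record = {"url": url, "title": heading.get_text(strip=True) if heading else None}

print(json.dumps(record, indent=2))  # structured output: JSON
```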

[0:03:12.5] Augustinas: Okay, and so the obvious question that comes to my mind is – why would anybody want to do that?

[0:03:19.2] Aleksandras: Yeah, I mean, there are a few reasons you would want to do that. But, you know, you wouldn't want to do that, obviously, if you can't use the data. But there are a lot of uses for the data once you get it from the web. So, just to mention a few popular use cases: if you want to do SEO, you'll want to keep tracking where your domain is among the search results on any web search engine. Or, let's say, you want to do a bit of market analysis and find out the prices that your competitors are selling, for instance, pencils for on a hundred different marketplaces. Well, you'll soon find that it's very hard, if not impossible, to do it by hand. So you must have an automated system that does it all for you. And, you know, at the end of the day, you'll either be looking at a report that's got all the scraped data in it; or you may be looking at a chart or something really visually well prepared; or, instead of just being available for you to read, that data could be fueling some other processes in your company and be used as input for other stages. For instance, you could use the scraped data to power your dynamic pricing model: you find out your competitors' pricing levels, and you adjust your price based on that.
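
As a toy illustration of that dynamic pricing idea (all numbers made up): take the competitor prices you scraped, undercut the cheapest one by a small margin, and never drop below your own floor price.

```python
# Made-up competitor prices scraped from different marketplaces.
competitor_prices = [2.49, 2.35, 2.60]
floor_price = 2.00  # the lowest price we can afford to charge

# Undercut the cheapest competitor by 2%, but respect the floor.
new_price = max(min(competitor_prices) * 0.98, floor_price)
print(f"Adjusted price: {new_price:.2f}")  # 2.35 * 0.98 = 2.303 -> 2.30
```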

[0:05:07.4] Augustinas: You mentioned an example where you talked about finding where exactly we are in the SEO results. Just to confirm, that's not exactly what we do. As in, we don't form the reports ourselves. All we do is just give other people the data so that they can construct the reports themselves, right?

[0:05:28.8] Aleksandras: Yeah, that's right. So we mostly operate on a very granular level, if I may put it this way. So, for us, a single scraping job is when you get some data that can be found at a particular URL. So you get the HTML, parse it, and deliver it back to, well, to your customer. So that's a single operation. But if you, let's say, want to build a good report, you'll need to do at least a few hundred of these operations to find out how you're doing, basically. You know, where you rank for different keywords and in all the various locations, if that makes sense. Because a lot of search engines will return a different result set based on where you are as a user, and you may also get different results depending on what type of device you're running the search on. Whether it's a mobile phone or, you know, a desktop computer, you may get very different results.
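
In other words, one granular scraping job per combination of keyword, location, and device – a rank-tracking report fans out into many such jobs. A small sketch (the field names are illustrative, not any particular API's):

```python
from itertools import product

keywords = ["running shoes", "trail running shoes"]
locations = ["United States", "Germany", "Japan"]
devices = ["desktop", "mobile"]

# One scraping job per (keyword, location, device) combination.
jobs = [
    {"query": kw, "geo_location": loc, "device": dev}
    for kw, loc, dev in product(keywords, locations, devices)
]
print(len(jobs))  # 2 keywords * 3 locations * 2 devices = 12 jobs
```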

[0:06:40.5] Augustinas: So, just to confirm, we not only scrape and parse these websites, but we also put them in a standard data structure so that people could use that data for what they want. Is that about right?

[0:06:56.6] Aleksandras: Well, we do scraping and parsing, mostly, and I should perhaps mention that when I say "we" – I'm talking about Oxylabs and the product that's called Scraper APIs in particular. So we do scraping, and we do parsing, and then we just give the structured result, which is JSON, mostly. We give that to our customers, and then they try to make sense of it. So the result actually goes through multiple transformations once it gets to the customer, and we, sort of, just assume that they know what they're doing and can actually derive the insights from the data that we give them.

[0:07:38.8] Augustinas: Alright, and Alex, just out of curiosity, have you ever had to use a scraper for, for example, one of your own purposes? 

[0:07:51.1] Aleksandras: I'll be honest, I didn't have to do it. Well, I mean, I haven't done it. I did, you know, play around and run some tests with our system to see how it works – whether or not I'm able to use it as a regular customer. I've not built my own project, no.

[0:08:11.7] Augustinas: I'm asking because, before even joining Oxylabs, I had actually built a web scraper myself.

[0:08:18.2] Aleksandras: Yeah?

[0:08:19.1] Augustinas: It was for this little e-commerce website where I wanted to, well, it was, sort of, like eBay. 

[0:08:29.7] Aleksandras: Okay, it's just a marketplace, like an auction marketplace?

[0:08:32.6] Augustinas: Exactly, and people would post all of their own things and try to sell, like, all these random things, like audio gear and whatnot. And the purpose of that was to, sort of, find the best deals on audio gear, and then I would buy those things and resell them. And, you know, in my case, building a web scraper wasn't that hard. In essence, all I had to do was make an HTTP request and, honestly, just parse the HTML result, and that was basically all there was to my little project. But I'm assuming that, since we're working at such a big company, there's more than that to web scrapers in general and that there are probably a million issues that can occur while we try to do this web scraping thing.

[0:09:29.0] Aleksandras: Yeah, I mean, there are at least three types of issues I could talk about here. So one thing that's very evident is that, well, even though you were able to scrape the site that you extracted the audio gear listings from, it does not mean that it would be as easy to do on some other sites that we work with. And, well, perhaps you were lucky, or perhaps you just didn't have a big enough appetite for the data for the site to really notice you.

[0:10:06.2] Augustinas: Yeah, that's about right. I made, like, a hundred requests per day, tops, and...

[0:10:11.8] Aleksandras: That's just, you know, hundreds of requests per day from your home IP. That shouldn't raise too many red flags for most sites. So that's alright, but imagine if you want to scale your operations, like, tenfold or a hundredfold. At some point, you would run into issues. And, you know, at that point, you start thinking about how you can optimize this rather than, like, stupidly bombarding the site with requests and trying to collect your data somehow. So you must develop smart ways of scraping sites, and that includes, you know, managing the proxies that you're using. So, first of all, you should use proxies, right?

[0:10:58.4] Augustinas: Right, so these are the smart ways to scrape a website, as in you're not actually using your own home IP address, but you're using some proxy to forward the traffic to the websites that you're trying to scrape.

[0:11:11.7] Aleksandras: Yeah, well, you know, what a proxy does is mask your IP address. So, instead of the web server – the one you're trying to get data from, basically – seeing your home IP address, what it sees is the IP address of a proxy server. And then imagine you have a hundred of those proxy servers. So, instead of being seen as this big entity that's trying to scrape something from, you know, a search engine, what the search engine sees on its side is a hundred small entities with the IPs assigned to your proxy servers, right? The search engine only sees those, and it's a bit harder for it to really trace it all back to you and, you know, come to the conclusion that it's this one bad guy, basically, that's trying to scrape us, right? So that's just step one; that's the first point where you should start optimizing your scraping activity, but there's so much more.
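
A minimal sketch of that idea using Python's requests library – each request goes out through the next proxy in a pool, so the target sees many small clients rather than one big one. The proxy endpoints are made-up placeholders:

```python
import itertools

import requests

# Hypothetical proxy endpoints; a real pool would come from your provider.
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

def fetch_via_proxy(url: str) -> requests.Response:
    """Route each request through the next proxy in the rotation."""
    proxy = next(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)

# fetch_via_proxy("https://example.com")  # target sees the proxy's IP, not yours
```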

[0:12:20.7] Augustinas: Like what?

[0:12:21.1] Aleksandras: Well, you know, you could look at the request that you are sending to the website. Okay, well, to get a bit techy here. When you're sending an HTTPS request, you're making a GET request to a particular URL, and you're asking the web server to return whatever is at that URL, basically. But you're not just asking for stuff. You are sending a bit of information about who you are. So they get stuff like your ability to consume compressed content; the language preferences you may have in your browser; the browser and operating system model; and, you know, they may also get some more information. If you've been on the site before as a real user, you will have received a set of cookies from, in most cases, the website, and with consecutive requests, you would be sending those cookies as well. So there's this whole, sort of, user fingerprint that comes into the picture here. And if you're trying to scrape – that is, gather data automatically, without, you know, too much human interference – you will have to form a fingerprint that's believable to the site you're trying to scrape. So that can be tough at times.
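
For illustration, a sketch of the fingerprint pieces Alex lists – compression support, language preferences, browser/OS identity, and cookies carried across consecutive requests. The header values are examples, not a recipe for any particular site:

```python
import requests

session = requests.Session()
session.headers.update({
    # Browser and operating system identity.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
    # Language preferences a browser would declare.
    "Accept-Language": "en-US,en;q=0.9",
    # Ability to consume compressed content.
    "Accept-Encoding": "gzip, deflate",
})

# A Session keeps cookies the server sets and replays them on
# consecutive requests, like a returning visitor would.
first = session.get("https://example.com", timeout=10)
second = session.get("https://example.com", timeout=10)  # carries cookies from `first`
```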

[0:13:54.3] Augustinas: And is it correct to say that a lot of what we do here at Oxylabs is exactly that? Preparing those fingerprints and making sure that we essentially don't get blocked by these websites.

[0:14:09.4] Aleksandras: Yeah, yeah. I mean, that's a huge part of what we do here. And it's not always easy, I'll tell you that. I mean, I've worked in this company for like four years, and it's always been a game of cat and mouse. You know, we would evolve our ability to scrape a particular site, and then they would evolve their capabilities of, you know, finding us out and then blocking us.

[0:14:42.2] Augustinas: So, it's pretty hard for newcomers to get into, because there are probably already all of these, like, strategies that websites use in order to block you from…

[0:14:52.1] Aleksandras: Yeah, I mean, it used to be much easier, but now it's getting harder, definitely, and the thing is you cannot just build it once and, you know, forget it. You have to regularly get back to tweaking your approaches, too, you know.

[0:15:08.8] Augustinas: It's a pet. It's something that you keep maintaining for a while.

[0:15:12.4] Aleksandras: Yeah, yeah. I mean, obviously, if you're trying to scrape a website, you will want to monitor how successful you are at doing it. And when you see your success rate dropping below a certain threshold, that's the point where you would usually say – "Well, let's do something." Because in a lot of cases, even if you're making a request that you know does not really bring you useful data, or you get blocked, basically, you get a CAPTCHA to solve, right. In a lot of cases, you will still be consuming traffic while you're making those requests, and, usually, you are paying for your proxy traffic. So you are paying regardless of whether you're getting good results or not. You're paying for the traffic. So you'd better, you know, optimize your scraping approach so that you're not spending too much on traffic.
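
A small sketch of that kind of monitoring: keep the outcomes of the last N requests and raise a flag when the success rate dips below a threshold. The window size and threshold are arbitrary example values:

```python
from collections import deque

WINDOW = 200        # how many recent requests to consider (arbitrary)
THRESHOLD = 0.90    # alert below 90% success (arbitrary)
outcomes = deque(maxlen=WINDOW)  # True = useful data, False = block/CAPTCHA

def record_result(success: bool) -> None:
    """Track each request's outcome and flag a dropping success rate."""
    outcomes.append(success)
    if len(outcomes) == WINDOW:
        rate = sum(outcomes) / WINDOW
        if rate < THRESHOLD:
            # Every failed request still burned paid proxy traffic,
            # so this is the point to revisit the scraping approach.
            print(f"Success rate {rate:.0%} is below {THRESHOLD:.0%}")
```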

[0:16:11.7] Augustinas: And then, when you use something like our products, you not only get a website scraped, you also get, you know, a bunch of these, like, dependencies, like the proxies that we use, sort of, paid for you already.

[0:16:25.4] Aleksandras: Yeah, I mean, it's all included in the price. So, really, as a user, you don't need to manage proxies, or even worry about whether there are proxies at all. Just tell us what to scrape, you know, give us a few parameters we can work with, and then we figure out the optimal approach towards a particular site or URL and get it done in real-time.
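
In spirit, using a hosted scraper API tends to look something like the sketch below. The endpoint and payload fields here are assumptions for illustration only, not Oxylabs' actual parameter names – the real ones live in the product documentation:

```python
import requests

# Hypothetical endpoint and payload shape for a hosted scraper API.
API_URL = "https://scraper-api.example.com/v1/queries"

payload = {
    "url": "https://example.com/some-page",  # what to scrape
    "geo_location": "Germany",               # one of the "few parameters"
    "parse": True,                           # ask for structured JSON back
}

# Proxies, fingerprints, and retries are handled on the provider's side.
response = requests.post(API_URL, json=payload, auth=("USER", "PASS"), timeout=60)
result = response.json()
```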

[0:16:54.5] Augustinas: All right, well, we have all these, like, blocking issues, but then another question that comes to my mind is that you probably can't scrape every single website on the internet. It's probably a bit more difficult than that, because when I had to make my scraper, the process of parsing the HTML was that I had to find specific CSS selectors. You know, I just had to go through the website with the HTML inspector and find those identifiers, basically, to get me the data that I need. But when you're trying to build a product that handles all of those cases, it's probably not as easy as…

[0:17:36.3] Aleksandras: Yeah, I mean, scraping can be generalized much more easily than parsing. So, you know, if you know how to create a good, believable fingerprint that works on one site, and you optimize it enough, it's going to work on, you know, a bunch of other sites, basically. So the knowledge you have built while figuring out how to create a good fingerprint so that you can scrape something – once you have that knowledge, you can apply it to scrape a lot of different sites. But then with parsing, well, it's a different game, because each website, or each page type on the website you're trying to scrape, will have a different element layout. So whether you're using, you know, XPath or RegEx or CSS selectors, you will have to define your selectors for every particular page type you want to parse. Because you will find your useful elements – for instance, your product title, your price, or, you know, whatever else there may be – in different places within the HTML code. So even though, once you learn how to scrape, you can apply that knowledge across a lot of different sites quite seamlessly, right, you will have to, more or less, build your parsers from scratch for each individual type of page you are trying to parse.
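
So a parser ends up as a per-page-type map of selectors. A sketch with made-up CSS selectors, using the third-party parsel library (lxml or BeautifulSoup would do the same job):

```python
from parsel import Selector

# Each page type needs its own selector map; these selectors are
# invented and would differ for every site and layout.
SELECTORS = {
    "product_page": {"title": "h1.product-title::text", "price": "span.price::text"},
    "search_page": {"title": "a.result-title::text", "price": "div.result-price::text"},
}

def parse_page(html: str, page_type: str) -> dict:
    """Apply the selector map for one specific page type."""
    sel = Selector(text=html)
    return {field: sel.css(css).get() for field, css in SELECTORS[page_type].items()}

# parse_page(html, "product_page") -> {"title": "...", "price": "..."}
```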

[0:19:26.2] Augustinas: So, I guess I don't really feel like my question was answered here. The idea is that we can scrape a lot of websites, right? We have the technology to scrape basically anything. Well, I'm saying basically. There are probably a bunch of exceptions, but...

[0:19:45.2] Aleksandras: Oh, yeah, you know, every time you, sort of, think you can do everything, there comes the site that…

[0:19:51.3] Augustinas: Yeah, but what I wanted to ask is that the parsing part is, sort of, unclear to me. We can scrape a lot of things, but we probably don't know how to parse most of the things that we scrape, right? Because, as you said, every website is unique; they have their own HTML structure. You have to find the selectors or the regexes or whatever you want to use to, kind of, ID the specific elements or the specific data that you want to get out of that website. And so my question is – how do we handle those cases where we don't really know how to parse a specific website? Do we just tell the client – "I'm sorry, we just don't know how to do that," or..?

[0:20:39.2] Aleksandras: Well, you know, it's not really a technical question, the thing you are asking. It's always possible to find a selector that matches the stuff you want to parse. So, given enough time and having spent enough effort, you can build a parser for any page type. The question for us is whether it's worth actually building it for our clients. You know, we must have a financial incentive to do it, right? Because we're a business. So it's about finding that common ground with our client and, you know, agreeing that, if we build it, they're actually going to use it and pay for it. If we have that kind of understanding, we can justify building something from scratch.

[0:21:22.1] Augustinas: So, if I come as a new client to Oxylabs and I find that you don't exactly at this moment support parsing a specific website, what you're saying is that in those cases, because there are probably cases where you just can't really parse websites, or…

[0:21:43.3] Aleksandras: I don't know, you tell me. I think it's mostly doable, but, you know, the question is whether it's worth it – whether it's worth our effort, and whether it's worth it for the client to actually, you know, wait a couple of weeks at least while we create the parser and then, you know, subscribe to using it.

[0:22:08.2] Augustinas: Okay, I got myself some water, Alex. I had a thought – I remember seeing this thing called a universal source in one of our products. And what I remember seeing specifically there is that you can, basically, select any website that you want, right? And when you select any website, I'm assuming that you will have some standardized response – a JSON response from that website.

[0:22:40.0] Aleksandras: Yeah, well…

[0:22:40.4] Augustinas: Sorry…

[0:22:41.4] Aleksandras: Okay.

[0:22:43.2] Augustinas: Meaning that we probably do have some rules in order to parse a website completely, no matter what kind of website it is. And we probably can't always succeed with those rules, which also probably means that the data we end up parsing there doesn't always make sense – it's not always what the client needs, right?

[0:23:07.0] Aleksandras: It's a loaded question. Well, yeah, I mean, we can scrape almost any website out there. Like I said, you know, if you want to build a parser, mostly, you have to build it for the page type you want to parse the bits out of, right? Well, we have this machine learning-based parser. We call it – "the adaptive parser". So, basically, it's an ML model that we fed a lot of e-commerce product pages into, and we labeled the data so that the model knows, on every page, where the price is, the title, the bullet points of the item description, things like that.

[0:23:54.7] Augustinas: First time I'm hearing about it.

[0:23:56.0] Aleksandras: Yeah, so we fed a lot of data to that model, and basically now it's able to make a really good prediction on where you'll find the price, the title, and, you know, all the other descriptors of the…

[0:24:09.0] Augustinas: Key information that you usually need.

[0:24:11.3] Aleksandras: Yeah, the key information – where to find it on a page that we've not seen before, basically. So even though it's something that we've not seen before, we've not worked with it before, we're still able to extract, you know, the basic information that most of our customers have to extract anyway.

[0:24:35.3] Augustinas: And so, for those cases where this doesn't work, that's where you meant that we actually invest an additional amount of time, where we talk with our clients and, you know, make sure that the websites are parsed correctly and all of that.

[0:24:49.0] Aleksandras: Yeah, I mean, so there are two things – we may be unable to parse a particular layout completely. Well, you know, it just may happen. You never know. So that may happen, and the other thing is that, well, some clients have really custom requirements, you know. They don't want to just get the standardized output that covers, sort of, 80% of their requirements. They want to have exactly what they need, and it may not be what we provide as part of our standard.

[0:25:23.2] Augustinas: With this machine learning application…

[0:25:24.7] Aleksandras: Yeah, you can just do so much, you know, and if you want something that's really, you know, tailored to your particular needs, you will have to either do it yourself or ask us nicely to do it for you.

[0:25:39.1] Augustinas: Okay, well, Alex, that's a lot about parsing. I wanted to come back to what we thought was going to be our talk for the day. We wanted to talk about all things web scraping-related – can you remind me of the other things that we wanted to talk about?

[0:25:56.8] Aleksandras: Yeah, I mean, we should discuss whether or not you should build your own scraper and parser.

[0:26:05.4] Augustinas: I'm pretty sure that, by now, I'm convinced that it's not always worth the effort.

[0:26:11.7] Aleksandras: Yeah. Well, I mean, you know, the way I look at it: a lot of companies we work with right now have built their own stuff before. And they know just how hard it is, and, at some point, they realize that they want to concentrate on the part of their business where they generate, you know, the most value. And that, in most cases, is not scraping or parsing. It's the analysis of the data that they extract from the web or, you know, some other place. So, after this realization, it's much easier to make a decision to just buy a solution that's out there instead of trying to build your own stuff. Because we – and even some other solutions out there – are reliable enough that you can just set it up and stop worrying about it.

[0:27:07.6] Augustinas: It just works.

[0:27:08.9] Aleksandras: Yeah, it does.

[0:27:11.6] Augustinas: Most of the time.

[0:27:13.2] Aleksandras: Yeah, I mean, there's nothing…nothing is perfect, right? But we work well enough to be able to provide this service to hundreds of clients.

[0:27:29.9] Augustinas: And so, is our reliability something that we're proud of? We probably are, right?

[0:27:36.8] Aleksandras: Yeah, yeah, of course.

[0:27:37.7] Augustinas: We have all these monitoring tools. And I keep walking into the office, and I keep seeing the two monitors with our Grafana boards where we show these things…

[0:27:49.4] Aleksandras: Yeah, I mean, there's a mantra in our team, like, "There's never enough visibility and observability," or whatever you want to call it – monitoring tools. There's never enough. And, you know, at this point, I'm not sure whether I should take it as a joke or whether it actually is true. But it seems like the more data points you have about your system performance, the more you realize that there's something you don't know, and you want to, sort of, dissect it and get more visibility into the inside of every process. And then you're trying to build new stuff as well. So what will happen is you will never have enough visibility for everything. You can just cover, you know, the main points and then, if you have time, try to go into more depth, but there's never going to be enough visibility. Just, you know, sort of an okayish level that you can really cope with.

[0:28:57.8] Augustinas: And so, on the topic of visibility, do you think we've covered all the things that are important to scrapers?

[0:29:05.8] Aleksandras: Yeah, I would say most of them. I mean, we certainly know when things go wrong; we have good tools that help us figure out what went wrong and, you know, what the root cause of the issue was, so that we're able to eliminate it in the least amount of time that we can.

[0:29:29.3] Augustinas: Well then, on the topic of visibility, I'd like to call this an end for today. Alex, it's been an honor to have you. It's been so fun to hear about all of these things.

[0:29:38.5] Aleksandras: Yeah, good talking. I mean, it's an honor to be opening this series of podcasts. So thanks for having me here. 

[0:29:49.3] Augustinas: I loved it, Alex, and on the topic of visibility, I'd like to also remind you to look at our social media. You can find us on Spotify, YouTube, Apple Podcasts. Anything I'm missing?

[0:30:03.4] Aleksandras: Spotify, YouTube, Apple Podcasts – that's the three, yeah.

[0:30:05.5] Augustinas: Magic trio. Alex, do you happen to have any last words for today?

[0:30:12.0] Aleksandras: Yeah, scrape responsibly and parse safely.

[0:30:15.2] Augustinas: Thank you so much for being with us. Once again, welcome to the OxyCast. We'll see you another time.

[0:30:20.4] Aleksandras: Cheers!
