Proxies for Web Scraping: a Complete Guide | OxyCast #5

Proxies for Web Scraping

[0:00:00.0] Mindaugas: Try and search on Google for free proxies. There is a possibility to get them because there are people or companies that are allowing you to use proxies for free. It's not very safe, it's not very reliable, but this is an option, basically to start with, at least. 

[0:00:23.3] Augustinas: Hello, everyone, and welcome to the OxyCast. The show where we talk about everything web scraping related. Today's a special day outside - it's kind of raining, the weather is kind of nasty, but through fire and rain, we are here to bring you the best content we can. With that in mind, today is a special episode, especially because it's our fifth episode, I believe. It's sort of an anniversary at this point, and on that topic, we thought about bringing to you a topic that is essential to proxies, something or, sorry, essential to web scrapers. I'm already hinting at the topic here. But yeah, at some point in your web scraping experience, you'll find that you are blocked by certain websites, which is why proxies are a really important topic, and for that topic, we brought to you on the other side of our small table - Mindaugas Dunderis. Who's the product owner or product manager? 

[0:01:25.2] Mindaugas: Yes, product owner, one of our products, basically.

[0:01:28.3] Augustinas: Who's responsible for our proxy team? Right? 

[0:01:32.0] Mindaugas: Yeah, specifically for the residential proxy team.

[0:01:36.2] Augustinas: Okay, so from what I understand, you're sort of an OG here at Oxylabs. You've, like, been here since the very beginning, and you know, before getting into the meat and bones of proxies in general, I wanted to get to know you a little bit better because I really don't know much, you know, about the people here at Oxylabs besides, well, my own team. So why don't you just tell me a little bit about yourself? How did you get here, to Oxylabs, and that team in particular? What has your career path looked like up until now?

[0:02:04.5] Mindaugas: Yeah, okay. So I've been with Oxylabs for quite a while now. The story is not very interesting, I just happened to be hired here, and that's it, pretty much through my friends, basically, and, yeah, since I started a while ago, I had a chance to work on multiple projects, multiple products, many of them actually started when there was no product, basically. So I was there at the very early stages, basically. So it was a very interesting journey, and it still is.

[0:02:41.8] Augustinas: So, how exactly did you get into the proxy team? Was it like, I believe we talked a little bit about this before, but you weren't always a product manager here?

[0:02:53.8] Mindaugas: Yeah. Yes, exactly. I started as a data analyst, but as the company grew, there was a need for people to do some product management. So I moved there. I'm not sure I was the best person at that time, but, you know, it just happened, and here I am. 

[0:03:10.6] Augustinas: So they picked you for this particular team as a product manager, or were you moving around?

[0:03:19.1] Mindaugas: Moving around, basically. The team I'm working in at the moment is probably the third or the fourth team down the road.

[0:03:32.7] Augustinas: So you moved to the residential proxy team when it was already established. Do you happen to know how many years that team existed when you came there? 

[0:03:39.4] Mindaugas: Actually, I moved to the team before it was established. So, basically, I was, you know, one of the first people on the team, basically building up the product. So, well, that's the question I'm never, you know, good at answering, how many years, because for me everything is very dizzy. It's very hard for me to understand whether it was five years ago or three years ago. So I'm not really sure, but I think the product is, like, four or five years old.

[0:04:14.7] Augustinas: Do you happen to know what was the motivation behind that particular team forming?

[0:04:16.9] Mindaugas: So, well, basically, it was natural for the company. We started with datacenter proxies, which is kind of simple - you buy, you convert them, and you sell them, and then we moved to new things, better things, like advanced things for scraping. So residential proxies, scraping APIs as well. So, it basically evolved naturally for the company. 

[0:04:40.4] Augustinas: All right, I was thinking before we even get into, like, these residential proxies and these datacenter proxies that you keep mentioning, it would be nice to establish what is a proxy in general.

[0:04:53.4] Mindaugas: Okay, so a proxy is basically, you know, an intermediary: instead of you using your internet connection to reach a website, you're using a different server, a different IP address, to do that thing. To, basically, get the data from the website. So you're, basically, hiding yourself or expanding your availability, let's say. Let's put it that way.
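
To make that definition concrete, here is a minimal Python sketch of sending requests through an intermediary instead of your own connection. It uses only the standard library; the helper name `make_proxy_opener` and the address `proxy.example.com:8080` are placeholders for illustration, not a real endpoint.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address: replace it with a proxy you are allowed to use.
opener = make_proxy_opener("http://proxy.example.com:8080")

# A call such as opener.open("https://example.com") would now reach the
# website from the proxy's IP address rather than from your own.
```

The website sees the proxy's IP address as the origin of the request, which is the "hiding yourself" part of the description above.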

[0:05:18.8] Augustinas: And I already hinted that most web scraping professionals should know a little bit about proxies, but I want to hear from your perspective why they should actually be interested in this particular topic. 

[0:05:33.0] Mindaugas: Well, as a developer, maybe, you know, you will never use proxies if you're working on things that don't need that. If you don't really extract the data from the internet, proxies are not really a topic for you to be interested in. But in general, yeah, proxies are pretty important. As long as you're trying to get some kind of data from the public resources, from the internet, chances are that you're eventually going to have to use proxies. So, you know, at the end of the day, it's kind of important to, at least, have some basic information, what they are and how they work and what they can do for you. 

[0:06:16.4] Augustinas: Chances are that you'll eventually have to use proxies, but why exactly? 

[0:06:22.8] Mindaugas: So, again, if you're trying to get data from the internet, websites are usually not really keen to give you that data on a large scale, which means that if you need to get a lot of pages, scrape a lot of data, you're going to have to use proxies because this is the way for you to access that data. Because, usually, if you're doing, like, trying to, like, open a lot of pages from one computer, from one IP address, that website is going to tell you that - "no, that's not okay, that's not allowed." And this means that if you want to get a lot of pages, a lot of data from the internet, you have to have a lot of IP addresses or proxies. 
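
One simple way to spread many page requests over many IP addresses is round-robin rotation. The sketch below only plans which proxy each URL would use and sends nothing; the function name `assign_proxies` and the proxy addresses are assumptions made up for the example.

```python
import itertools


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the pool in order."""
    if not proxies:
        raise ValueError("the proxy pool is empty")
    pool = itertools.cycle(proxies)
    return [(url, next(pool)) for url in urls]


# Hypothetical pool and target pages.
proxies = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]
urls = [f"https://example.com/page/{n}" for n in range(1, 5)]

plan = assign_proxies(urls, proxies)
for url, proxy in plan:
    print(url, "via", proxy)
```

With two proxies and four pages, each IP address ends up responsible for only half of the requests the website sees.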

[0:07:04.0] Augustinas: So, that was what I was trying to get to, really. Proxies are generally a topic that you should be interested in when you are growing in scale. When you are not doing, you know, ten requests per second, but we're already talking about maybe hundreds, and at that point, I believe, what you're gonna get is a rate limit. Websites usually block you per IP, and if they notice that you are hitting them too often, that's what they will usually do, from my personal experience. For my personal projects, because I've done a few web scrapers in the past, I’ve never faced that particular case yet. But, I believe that if I were to start selling my web scraper as a product, for example, that is one of the first issues that I would get myself into.
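
For a feel of what per-IP rate limiting can look like from the website's side, here is a small sliding-window sketch. It is only an illustration under assumed numbers (three requests per second); the class name `PerIpRateLimiter` is hypothetical, and real websites may use very different rules.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # IP address -> recent request times

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this IP is hitting the site too often
        hits.append(now)
        return True


limiter = PerIpRateLimiter(max_requests=3, window_seconds=1.0)
# Five requests from the same (documentation-range) IP within half a second.
results = [limiter.allow("203.0.113.7", now=0.1 * i) for i in range(5)]
print(results)
```

Because the counter is kept per IP address, the same burst spread over several addresses would stay under the limit, which is why scale and proxies go together.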

[0:07:56.1] Mindaugas: Yeah, sure, I mean, when you're at scale, basically, that's when this problem comes into the picture, and that's, basically, when you have to think about using proxies. 

[0:08:09.0] Augustinas: Okay, so it's a scale thing. My next question would be - how would I get these proxies in general as a developer?

[0:08:16.8] Mindaugas: Okay, so there are a few ways. The easiest one is to try and search on Google for free proxies. There's a possibility to get them because there are people or companies that are allowing you to use proxies for free. It's not very safe, and it's not very reliable, but this is an option basically to start with, at least. 

[0:08:36.3] Augustinas: What do you mean by not safe and not reliable?

[0:08:38.9] Mindaugas: You don't really know what's behind that proxy, who actually owns it. They could see the data that you're transferring, especially if it's not encrypted. If it's not HTTPS traffic, then that proxy is able to see what kind of data is going through, and if you're using that proxy and sending some kind of data that you don't want to be seen by anyone else, that might be a problem for you.

[0:09:05.5] Augustinas: So, Mindaugas is sharing a very important tip here - you should always use HTTPS, my dear developers, especially when you're scraping. It is a question that I believe we have been getting in some of our videos — can proxy providers see your traffic, and I guess what you just said is that yes, they can.

[0:09:28.0] Mindaugas: Yes, they can. For HTTP, they can. For HTTPS, they can't unless it's a man-in-the-middle proxy, but you will know whether it's a man-in-the-middle proxy. 

[0:09:38.2] Augustinas: How will I know?

[0:09:39.6] Mindaugas: You're gonna have to use the certificate that the provider is providing you - their own certificate. So it's an intermediary certificate, which you accept instead of the certificate from the website, which is HTTPS. Otherwise, that proxy is not going to work. So I mean, for it to work, you're gonna have to accept that certificate, and you would know that you're doing that basically.
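
The "you would know" point can be seen in code. In the Python sketch below, the default TLS settings verify the website's certificate, so an intercepting proxy only works if you explicitly load the provider's certificate yourself. The helper name `make_tls_context` and its `extra_ca_file` parameter are assumptions for the example.

```python
import ssl


def make_tls_context(extra_ca_file=None):
    """Return a TLS context that verifies server certificates.

    Passing extra_ca_file is the deliberate step of trusting an additional
    certificate authority, for example one supplied by a proxy provider.
    """
    ctx = ssl.create_default_context()
    if extra_ca_file is not None:
        # Explicit opt-in: without this call, an intercepting proxy's
        # certificate would fail verification.
        ctx.load_verify_locations(cafile=extra_ca_file)
    return ctx


ctx = make_tls_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

If a scraper starts working only after certificate verification is disabled or after an extra certificate is added, that is the signal that something in the middle can read the traffic.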

[0:10:08.9] Augustinas: So we're getting into security, into this like sphere of security knowledge already, and I guess thank you for hinting at the idea. That's probably going to be one of our episodes in the future. Everybody should know how to exactly scrape safely and securely. Well, my next question would be - if there are free available proxies online, what's the use case for commercial proxies in general? Because I believe, well, since you mentioned that you work with residential proxies, what I'm assuming is that that means that you are providing a product that sells proxies in general. So why do people usually use you instead of gathering like these online proxies that are freely available, and you've hinted a little bit at that answer which is security, and can you repeat once again what it was? 

[0:11:03.5] Mindaugas: Yeah. So, well, security is one thing, but I would say that at scale, you would need reliability as well. And free proxies, you don't know where they come from, you don't know where they're going to be online tomorrow or in the next few hours, let's say. So, once you have a project that you need to continuously update, you need the data to be coming in all the time without any interruptions. You need to make sure that the resources that you're using are reliable, fast, and available. And this means that you can't really get them for free - you have to buy some sort of proxies from a provider that is good at this.
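
The reliability problem shows up in code as failover logic: if a proxy has gone offline, the scraper has to move on to the next one. Below is a minimal sketch, assuming a caller-supplied `fetch(url, proxy)` function; `fetch_with_failover`, `fake_fetch`, and the addresses are hypothetical names used only for this illustration.

```python
def fetch_with_failover(url, proxies, fetch):
    """Try each proxy in turn and return the first successful result."""
    last_error = None
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except OSError as error:  # connection errors are OSError subclasses
            last_error = error
    raise RuntimeError(f"all {len(proxies)} proxies failed for {url}") from last_error


def fake_fetch(url, proxy):
    """Stand-in for a real HTTP call so the sketch runs without a network."""
    if proxy == "http://dead.example:8080":
        raise ConnectionError("proxy offline")
    return f"fetched {url} via {proxy}"


body = fetch_with_failover(
    "https://example.com",
    ["http://dead.example:8080", "http://live.example:8080"],
    fake_fetch,
)
print(body)
```

The less reliable the pool, the more of these retries a project pays for in time and failed requests.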

[0:11:49.3] Augustinas: So how big are we? What can we offer as, like, commercial proxy providers to our clients?

[0:11:59.4] Mindaugas: We're pretty big. I'm not sure how to answer that, but we're offering basically everything that's related to scraping. Whatever you want, we probably have this. It's either proxies or scraping APIs, so we're covering a lot of ways how to scrape and how to get the data. So if you aren’t sure how to use proxies, for example, you can go with scraper APIs, which is an easier option for you.

[0:12:26.7] Augustinas: I guess my question is, like - so we're providing commercial proxies. Right? But is our only selling point just the amount of proxies that we have? Like, what's the dynamic over here? Can we, like, offer our clients proxies that are specifically in some area of the world, or how does that work?

[0:12:49.7] Mindaugas: Yeah. So yeah, the coverage is one of the things that we're pretty good at. We have proxies from a lot of countries. From different countries around the world, from very small and very tiny countries as well. It's not necessarily that we have them all the time, but, yeah, you know, on a theoretical level, we cover basically the whole world. So if you need proxies in any country, basically - we probably are able to provide you with that. And also, we've been in this business for a while. And one of our biggest selling points, I would say, even though we don't really mention this, is our expertise. I mean, we've seen a lot of people who are scraping websites, who are trying to gather data, we've seen a lot of projects that people were doing, and we know, I mean, a lot about websites and how to scrape them, basically.

[0:13:45.1] Augustinas: I want to talk a little bit about availability. You mentioned something that we can provide you with a proxy from a specific country if that proxy is available. So what does that mean exactly? How is it that we can sometimes have proxies in a country and sometimes not?

[0:14:01.8] Mindaugas: Yeah. So what I mentioned is our residential proxy network. So there are devices that are running on, basically, residential connections, on regular devices, like computers and mobile phones. And they're connecting to our proxy network from many locations around the world. But since they belong to regular people, these devices may go offline, you know, they may not be, you know, turned on, and once they go offline, we don't have that proxy anymore. So maybe we have like five proxies from a particular country, and everybody is sleeping at that time. So all the computers are shut down, and we don't have any proxies in that country.
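
Here is a toy Python model of that availability problem: a pool of residential nodes where each one may be online or offline, and a lookup by country that can come back empty. The `Node` type, the `pick_online_node` function, and the node data are invented for the illustration and say nothing about how a real network is organised.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    country: str  # two-letter country code
    online: bool


def pick_online_node(nodes, country):
    """Return the first online node in the given country, or None."""
    for node in nodes:
        if node.country == country and node.online:
            return node
    return None


# Hypothetical pool: every device in "LT" is currently switched off.
nodes = [
    Node("a1", "LT", online=False),
    Node("a2", "LT", online=False),
    Node("b1", "US", online=True),
]

print(pick_online_node(nodes, "US"))
print(pick_online_node(nodes, "LT"))
```

The second lookup returns `None`: the country is covered in principle, but nothing is available at that moment.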

[0:14:47.2] Augustinas: And we're already getting another one of those questions that I had in mind - how exactly do we, as a company, gather up all of those, like, proxies? So from what I'm hearing right now, one of those ways is to just ask for our clients to like, basically, provide their device temporarily for us to do something with them, or, basically, just use them as proxies? Raises the question - what exactly do we give them in return for that kind of service?

[0:15:17.9] Mindaugas: Okay, so not really our clients, but yeah, in a sense, our clients, I would say. So we were working with partners like Honeygain, for example. This is a company that has an app that you can install on your computer, on your mobile phone, and what it does - it basically sells your traffic. So we're using that device with that application as a proxy node. So that computer or device works as a proxy - we're sending the data, and depending on how much data we send through that device, the person that owns that device is going to get paid for that. So basically, we're buying the traffic, the internet connection from people.

[0:16:03.5] Augustinas: Residential proxies. I've heard another interesting term - datacenter proxies. I'm assuming that means that we also just use, like, regular servers, that we rent. When should we then use residential proxies as developers, and why should we use datacenter proxies?

[0:16:23.6] Mindaugas: Okay, so this is a very broad topic, but yeah, there are basically two types of proxies that you can have - it's datacenter and residential. Now how do you know whether this proxy or this IP address is either datacenter or residential? There's no way of telling that right away. But there are companies around the world, like MaxMind, IP2Location, or IPinfo, and these companies, what they do, they basically map all the available IPs with some certain properties. So, for example, there's an IP somewhere, like any single IP, and they assign the owner, the location, and, maybe, the usage type. And, for example, there's an IP that belongs to a company that's usually providing datacenter services. One of those IP mapping companies can assume that this IP address, since it belongs to a company that is, you know, providing datacenter services, is very likely to be used in a datacenter. 

[0:17:29.6] So now they show that in their database, their IP database, they show that this IP address is used as a datacenter IP. And now, if we use this IP as a proxy, it can be recognized as an IP from a datacenter, which, kind of, means that it's used for commercial purposes; it's probably not used for, I mean, residential networking and stuff like that. And on the other hand, an IP that belongs to a company that's providing residential connectivity, like AT&T, for example, or Verizon, chances are that this IP address is used somewhere at home, at an office, as an IP address, basically, to provide connectivity to the internet. So you have these two types of IP addresses, and you can then use that information and determine what kind of IP you're seeing and then do something about it.

[0:18:29.3] Augustinas: Yeah, so I guess what that means is that some services, say, I love using potatomarketplace.com, I've used it in previous episodes as an example. So, let's say I am a web developer for potatomarketplace.com, and I am being scraped very actively. And my idea to defend myself from web scraping would be to just look at the IP addresses that are being used and use one of those IP2Location services, for example, and see if they are residential IPs or if they are owned by some company.

[0:19:15.3] Mindaugas: Yeah, exactly. So this is what you could do, right? Like you said, if you see a lot of traffic from an IP address, you can buy the database, and you can see what kind of IP address that is, and if it's a datacenter IP address, you may assume that, okay, this is something suspicious. Probably it's not a real person behind that IP address, and you can block it, or you can throttle it in some way. But if it's a residential IP address according to that database, then you might think - "okay, even though this person is doing a lot of requests, opening a lot of pages, that IP address is residential. So I can't really block it because there probably really is a person behind it, maybe trying to buy something from my website."
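
The decision described here can be sketched in a few lines of Python. The lookup table below is a made-up stand-in for a commercial IP database and uses reserved documentation address ranges; the names `USAGE_BY_NETWORK`, `usage_type`, and `decide`, and the threshold of 100 requests per minute, are assumptions for the example.

```python
import ipaddress

# Hypothetical stand-in for an IP database: network -> usage type.
# These are reserved documentation ranges, not real allocations.
USAGE_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "datacenter",
    ipaddress.ip_network("198.51.100.0/24"): "residential",
}


def usage_type(ip):
    """Look up the usage type recorded for an IP address."""
    address = ipaddress.ip_address(ip)
    for network, usage in USAGE_BY_NETWORK.items():
        if address in network:
            return usage
    return "unknown"


def decide(ip, requests_per_minute, threshold=100):
    """Block heavy traffic only when it comes from a datacenter IP."""
    if requests_per_minute <= threshold:
        return "allow"
    if usage_type(ip) == "datacenter":
        return "block"  # probably not a real person behind it
    return "allow"  # residential or unknown: possibly a real customer


print(decide("192.0.2.10", 500))
print(decide("198.51.100.10", 500))
```

The same volume of traffic gets a different outcome depending only on how the database labels the IP address.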

[0:20:02.0] Augustinas: So, in general, I want to use residential proxies once I have a hunch that I'm being blocked, particularly from datacenter IPs.

[0:20:11.8] Mindaugas: Yeah, that's one of the reasons why you would need to use residential proxies - if you're not able to get the data from the website using datacenter proxies.

[0:20:21.3] Augustinas: So, if you've ever been to the Oxylabs website, you'd find that we pride ourselves in being the frontrunner of innovations. How exactly are we innovating in the proxy sphere of things, then?

[0:20:32.4] Mindaugas: So every time we come up with a new feature that we think is unique, and it actually helps our product and helps our customers, we try to patent that feature. So we have around, I think, twenty patents, at the moment, in the United States alone, and some of them are, for example, for proxy networks and residential proxy networks. We have a proprietary solution on how to determine whether a proxy or the device that's running that proxy is fast enough, and responsive enough to be used as a proxy by our clients. Otherwise, it's, you know, it will do more harm than real good. We also have a patent for scraper APIs, where we manage the proxy network internally within the scraper API. So we always make sure that the proxy that we're using is fresh, is new, and it works. If it doesn't work, we then try to change some qualities around it and try to make it work again. 

[0:21:43.3] Augustinas: Fresh as in - we haven't really seen that particular proxy in our proxy pool?

[0:21:55.8] Mindaugas: Yeah, for example, yeah. 

[0:21:55.8] Augustinas: Okay, anything else that comes to mind?

[0:21:52.1] Mindaugas: No.

[0:21:55.4] Augustinas: Well, no problem. In that case, I'd like you to look at that camera, that camera, that camera. How do you feel about, you know, the future of proxies, and where do you think the path of, you know, being a proxy manager will take us in general?

[0:22:10.4] Mindaugas: Wow, I'm not ready for this question. I'm not sure, I think, that definitely this market is going to keep on growing. It's been growing like crazy in the past few years. I mean, when I joined this company, there were only a few, probably, proxy providers on the market. All of them looked super shady, super suspicious, and now it's a different market - we have real brands that are well known. To a certain extent, obviously, it's a proxy market at the end of the day. But I think it's, I mean, it's gonna grow basically, still gonna grow. 

[0:22:55.4] Augustinas: Well, in that case, everyone - thank you for coming to the OxyCast. I'd like to say a very big thank you for, you know, staying with us here and just hearing us out, and I hope that you learned something new. As always, I'd like to remind you of a special little phrase (I don't think that Mindaugas has heard of it yet): scrape responsibly, guys, and parse safely.
