Data Parsing: The Basic, the Easy, and the Difficult | OxyCast #3

[0:00:00.0] Povilas: I wanted to parse this company's catalog, and everything went well until one important piece of information - the phone number was not text-based, but it was an image. So, they tried to protect the data that way, you know. Yeah, and that was kind of a challenge of what to do, yeah.

[0:00:26.3] Augustinas: How did you solve that?

[0:00:27.0] Povilas: I came up with the solution just to use some OCR technology, like Tesseract.

[0:00:31.3] Augustinas: OCR is what exactly?

[0:00:33.0] Povilas: It's optical character recognition. Yeah, basically, you can recognize characters from an image.

[0:00:44.7] Augustinas: Hello and welcome to the OxyCast - the podcast show where we talk about everything web scraping related. Today's episode is about parsing, and I don't think that there are a lot of people better suited to talk about this topic than my colleague, who's sitting here on the other side of this table. Why don't you introduce yourself?

[0:01:02.2] Povilas: Hi everyone! My name is Povilas, I'm a software engineer here at Oxylabs, and I've been doing web scraping for, well, quite some time now.

[0:01:11.0] Augustinas: And that particular sentence that you've been doing web scraping for quite some time now. That is exactly why I think that you're the best person for this particular topic because I remember talking to you previously, and you told me about your, like, how your career started in general and that you've been doing, like, web scraping and parsing for pretty much forever. Can you tell us a little bit about that? 

[0:01:33.8] Povilas: Forever is quite a strong word, but, yeah, actually, in general, my path to software engineering wasn't this traditional one. I remember the first book I got was when I was thirteen years old - it was Turbo Pascal. It's a programming language.

[0:01:55.3] Augustinas: Yeah, I've heard about that one. 

[0:01:57.4] Povilas: And then the passion for software engineering kind of appeared, but I always thought that maybe it's not for me because you have to be... I had this imagination that you have to be very good at math, and you have to be this geeky person or whatever. So, I kind of laid aside that dream and went into, like, business because I like sales, I like marketing. So I started studying business and eventually opened my own business, and it was going quite well for some time. But later on, I just, well, I had to close it and found myself with this pretty scary situation where you don't know what's next, you know, what you're gonna do, who you are, and so on. So my friend offered me to go to Norway to work in this fish factory. Okay, and I thought - "okay, maybe it's like a good time for me, yeah," just to figure out what to do next, just have some time, earn some money. So I took that opportunity, and so I started working in this Norwegian town in a fish factory, and, to be fair, it was, well, a very harsh environment. I was working in this warehouse - it was freezing cold. It was hard physically, but mostly it was hard mentally because I just got probably a bit different vision of what I want to do, you know. So that gave me a lot of motivation to just pursue my old passion for software engineering. So I worked shifts during the day, and in the evening, I studied Python programming language. 

[0:03:43.2] Augustinas: Okay.

[0:03:46.9] Povilas: Yeah, and it was for almost nine months, I think. And so, you know, when you're learning software engineering, at some point, you think, "I don't know anything." Then you learn something, and then "I'm gonna build anything now." Yeah, that feeling, right, so it was that feeling where I think I can do anything, and so I went on and looked for some freelance jobs, and I got some just realizing that - "okay, I actually don't have the skill to deliver those." So I had to study even more. But, yeah, so those freelance jobs that I got were actually web scraping, so that's how I got into the scraping field.

[0:04:33.2] Augustinas: Okay, you had to do a few of these, like, web scraping jobs. So what do you think about them? Were they, like, easy, were they hard?

[0:04:43.8] Povilas: I mean, there are different cases. Some were easy, you know. We have great frameworks in the Python programming language for building scrapers, but some jobs are hard.

[0:05:01.0] Augustinas: Okay, so when you say that, like, some jobs are easy, can you give me an easy example maybe of, like, since we're talking about parsing, I want you to focus on the parsing part of the job. Can you give me some examples of something that was, like, super easy to do?

[0:05:19.3] Povilas: I mean, in general talking, like, it's easy to. Let's say you have some microservice that takes input in HTML and spits out the JSON. So it's kind of easy to build such a microservice. You would just need, like, a library to build an element tree, you know, from our source HTML.

[0:05:41.0] Augustinas: Why would you need an element tree in the first place?

[0:05:44.3] Povilas: Well, you see, like, HTML is a hierarchical structure, which can be represented as a tree and to be able to... but as a tree in a sense that a trunk would be like a top HTML element, then we have two branches, like head and body branch, and that body branch will branch out into more branches. So that's representable as an element tree, and when we have this kind of object, we can access it, access the branches and the elements.
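
The trunk-and-branches picture can be made concrete with a small sketch. It uses Python's standard-library `xml.etree.ElementTree`, which is an assumption for illustration only: the episode does not name the library used at Oxylabs, and real-world HTML is often not well-formed enough for an XML parser. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup so the standard-library XML parser can read it.
# Messy real-world HTML usually needs a dedicated, more forgiving HTML parser.
HTML = """
<html>
  <head><title>Catalog</title></head>
  <body>
    <div id="company">
      <h1>Example Co.</h1>
      <p class="city">Vilnius</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(HTML.strip())


def outline(element, depth=0):
    """Return the tag names of the tree, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# The trunk is <html>; its two branches are <head> and <body>.
print("\n".join(outline(root)))
```

Once the document is an object like `root`, each branch can be reached by walking the tree instead of searching through raw text.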

[0:06:16.7] Augustinas: Yeah, but why can't you use, like, any sort of string searching algorithm to just find the parts that you're interested in? Why would you actually need something like an element tree in the first place?

[0:06:29.7] Povilas: Because it's, well, you might use, like, regex, for example, to access some part of it. But it's really not efficient because, with regex, it's more, like, for working with text and when you want to find some string in that text. But when we have a website, it's very dynamic. It's like a living organism - it changes constantly, and the layout changes. So it would be super hard to write, you know, sustainable regex that would work all the time. So that's when we have the element tree, and we can use selectors to access different objects.
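
A rough sketch of why a regex tends to be fragile here, assuming a made-up page and a deliberately naive pattern. Both layouts carry the same price, but the second one has a wrapper element and an extra attribute. The standard-library `xml.etree.ElementTree` module stands in for whatever element tree library is actually used.

```python
import re
import xml.etree.ElementTree as ET

OLD_LAYOUT = '<html><body><span class="price">19.99</span></body></html>'
# Same data after a small redesign: a wrapper <div> and an extra attribute.
NEW_LAYOUT = (
    '<html><body><div><span data-v="1" class="price">19.99</span></div>'
    "</body></html>"
)

# A regex tied to the exact characters of the old markup.
PRICE_REGEX = re.compile(r'<body><span class="price">([^<]+)</span>')


def price_by_regex(html):
    """Return the price captured by the regex, or None if it does not match."""
    match = PRICE_REGEX.search(html)
    return match.group(1) if match else None


def price_by_selector(html):
    """Return the price found by querying the element tree, or None."""
    node = ET.fromstring(html).find(".//span[@class='price']")
    return node.text if node is not None else None


for name, html in (("old", OLD_LAYOUT), ("new", NEW_LAYOUT)):
    print(name, price_by_regex(html), price_by_selector(html))
```

The regex stops matching after the redesign, while the selector still finds the element because it describes the element rather than the surrounding characters.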

[0:07:08.7] Augustinas: What's a selector?

[0:07:09.9] Povilas: Well, you can think about it as a query language which we can use, write a pattern and use that query language to query and get some element in that element tree.

[0:07:27.1] Augustinas: Okay, so you have a selector. You use that sort of selector thing to select a particular element from the element tree, which is actually the HTML that you got.

[0:07:39.7] Povilas: In this case, yeah.

[0:07:42.1] Augustinas: Right, okay. Tell me more about these selectors. Is there just one type of selector, or are there multiple types of selectors? 

[0:07:49.8] Povilas: There are not a lot of types of selectors. We have two main selectors, which would be, like, the CSS selector and the XPath selector. And, well, in Oxylabs, we use XPath all the time, and there's no clear advantage. It's more a matter of preference which one you choose. I would probably mention that XPath has, like, a function "contains," which allows you to select an element based on the text it has, which sometimes can be useful. The other part is that with XPath, you can traverse up through the element tree, which means that you can select the parent of a child element. You cannot do that with a CSS selector. And last but probably not least would be the community in web scraping. Like, in web scraping, XPath is a bit more popular, so you would get more examples and more help when you use XPath.

[0:08:52.3] Augustinas: More links on Stack Overflow. All right, so you have a selector, which can be XPath or CSS, and then there's regex. Oh, so regex is not, like, one of those element tree selectors because it doesn't work with element trees. Right?
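
The two XPath features mentioned here can be sketched as follows. One caveat, stated loudly: Python's standard-library `xml.etree.ElementTree` implements only a subset of XPath. It supports the parent step `..`, but not `contains()`, so the text match below is emulated in plain Python. With a full XPath engine the equivalent expressions would be `//span[contains(text(), "Phone")]` and `//span[@class="label"]/..`. The markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

HTML = """
<html><body>
  <div class="card"><span class="label">Phone: 555-0100</span></div>
  <div class="card"><span class="label">Email: info@example.com</span></div>
</body></html>
"""

root = ET.fromstring(HTML.strip())

# Parent traversal: ".." steps from each matched <span> up to its <div>.
parents = root.findall(".//span[@class='label']/..")

# ElementTree has no contains(), so the text-based match is done in Python.
phone_span = next(
    span for span in root.iter("span") if "Phone" in (span.text or "")
)

print([parent.get("class") for parent in parents])
print(phone_span.text)
```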

[0:09:09.5] Povilas: Exactly, you cannot say that regex is a selector. 

[0:09:14.1] Augustinas: Okay, well, I guess we could show a little example of how an XPath or a CSS selector looks on the screen, which we'll do: if you're watching us on YouTube right now, you should see a little image right here. I'm also interested in when you should use, like, each of those. You mentioned it's a matter of preference - is that all there is to it? 

[0:09:40.8] Povilas: Basically, yeah. It's actually, I mean, if you are someone who's listening and using a CSS selector - there's no strong reason to switch to XPath, no. But if you have an option to choose now which one, I would, personally, I would recommend XPath because of the reasons I mentioned before.

[0:10:00.9] Augustinas: Okay, so I guess it's easy when all you have is like a minimal case, where you need to find, like, some specific elements in some HTML code. Personally, I've had, like, these situations where, you know, as you mentioned before, websites are usually a sort of an organism, right? They constantly change because, well, they are, usually, generated content, right? The website's content depends on the product descriptions if we're talking about e-commerce websites, and if we're talking about SEO websites, you know, the number of results that we get could be different. How do you write a selector that withstands all of those changes?

[0:10:52.9] Povilas: That's a very good question, actually, because there's a difference, you know, to write... I can give you a good/bad selector example. So if you go to the browser, you open the developer tools, and you select some element in the DOM tree, you would right-click on it, and you would get an option to copy XPath - that would be a really bad example of this.

[0:11:20.4] Augustinas: Why is it bad? 

[0:11:23.5] Povilas: Because it won't sustain, I mean, it won't survive any turbulence of the website. Something could change, and that's it.

[0:11:30.7] Augustinas: So, any single change will make that particular XPath string invalid?

[0:11:35.7] Povilas: Yeah, you could think about it, like, in a city if you want to find a building and you describe the road to your taxi driver. So you would say, you wouldn't give the coordinates or an address, you would say - "okay, now go straight three blocks, and at the third road, you turn right and so on." And so that's the same with bad XPath, you know if you have straight directions to the element you're particularly interested in. If a road is closed, the taxi driver is gonna be lost because of some change in the road.
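
The taxi analogy can be sketched in code. The absolute path below plays the role of turn-by-turn directions, similar in spirit to what "copy XPath" in the developer tools tends to produce, and the attribute-based path plays the role of an address. The markup and the `id` value are made up, and the standard-library `xml.etree.ElementTree` module is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

BEFORE = """
<html><body>
  <div><p>Header</p></div>
  <div><span id="phone">555-0100</span></div>
</body></html>
"""

# The site adds a banner at the top of <body>; nothing else changes.
AFTER = """
<html><body>
  <div><p>Banner</p></div>
  <div><p>Header</p></div>
  <div><span id="phone">555-0100</span></div>
</body></html>
"""

ABSOLUTE = "./body/div[2]/span"    # turn-by-turn directions
ANCHORED = ".//span[@id='phone']"  # the "address"


def select_text(html, path):
    """Return the text of the first element matching path, or None."""
    node = ET.fromstring(html.strip()).find(path)
    return node.text if node is not None else None


for label, html in (("before", BEFORE), ("after", AFTER)):
    print(label, select_text(html, ABSOLUTE), select_text(html, ANCHORED))
```

After the banner is added, the second `div` is no longer the one holding the phone number, so the absolute path finds nothing, while the anchored path is unaffected.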

[0:12:15.1] Augustinas: But I've also had these particular scenarios where you can't really give your "taxi driver" an address. Sometimes there's just nothing to hook on. It is great when your website has, like, a specific ID on that HTML element, and you can just say: "Hey, find me an element in the DOM tree that has this particular ID." Right? So those kinds of situations are easy to deal with, but what about when the data is in a super generic element, like a UL, which is a numbered list element, right? I'm pretty sure that's what it's called. Sorry if I'm mistaken. So how do you write an XPath that would withstand the turbulence of, you know, a changing website, but the XPath could still find the particular element that you're, you know, thinking about? 

[0:13:11.5] Povilas: Well, the example you gave us actually would be a pretty bad HTML, and it's not that often that you would find, like, a list of elements without any class or attribute on which, you know, you could hook your selector. So, yeah, you just have to find a way to look for those classes and attributes or some particular text that you could use to select an element. 

[0:13:41.8] Augustinas: So, am I understanding you correctly that if there's, like, no particular class that you can hook onto in order to find that particular element that you're interested in, so, you're basically lost? You can't really...

[0:13:53.3] Povilas: You're not lost. You could probably say, "Okay, I need a third element in this list," and that's very, well, dangerous because if another element appears in that list—that's it. 
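
A small sketch of why "the third element in this list" is dangerous, using made-up list content. The positional selector does not fail when an item is inserted; it quietly returns a different value. The `looks_like_phone` check is a hypothetical safeguard added for illustration and is not something described in the episode.

```python
import re
import xml.etree.ElementTree as ET

LIST_BEFORE = "<ul><li>Example Co.</li><li>Vilnius</li><li>555-0100</li></ul>"
# One new item appears in the list; the phone number moves to position 4.
LIST_AFTER = (
    "<ul><li>Example Co.</li><li>example.com</li>"
    "<li>Vilnius</li><li>555-0100</li></ul>"
)


def third_item(html):
    """Return the text of the third <li>, whatever it happens to be."""
    node = ET.fromstring(html).find("./li[3]")
    return node.text if node is not None else None


def looks_like_phone(value):
    """Hypothetical sanity check: only digits and common phone punctuation."""
    return bool(value) and re.fullmatch(r"[0-9+\-() ]{7,}", value) is not None


for html in (LIST_BEFORE, LIST_AFTER):
    value = third_item(html)
    print(value, looks_like_phone(value))
```

A value check like this does not make the selector robust, but it can turn silently wrong data into a visible error.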

[0:14:10.9] Augustinas: Sorry, I choked on the water for a bit there. Okay, so all of this is still like the easy case when we are talking about parsing in general. What about the hard cases? Can you think of, like, something that was, you know, still parsing related, but situations where parsing a specific piece of data from an HTML website was kind of hard?

[0:14:38.2] Povilas: I mean, there were some cases. One thing that comes to my mind is that I wanted to parse this company's catalog and everything went well until one important piece of information - the phone number was not text-based, but it was an image. So, they tried to protect the data that way, you know. Yeah, and that was kind of a challenge of what to do, yeah. 

[0:15:08.6] Augustinas: How did you solve that? 

[0:15:09.4] Povilas: I came up with the solution just to use some OCR technology, like Tesseract.

[0:15:14.4] Augustinas: OCR is what exactly?

[0:15:14.7] Povilas: It's optical character recognition. Yeah, basically, you can recognize characters from an image. Yeah, so you have to take some time, you know.

[0:15:25.6] Augustinas: You have to download every single one of those images and run it through, like, some piece of software that would recognize the phone numbers from those particular images.

[0:15:33.1] Povilas: Yeah, exactly. So it took some time. I had to tweak the settings because I didn't have previous experience.

[0:15:38.3] Augustinas: I'm pretty sure that I stumbled upon Tesseract in the past and, you know, my particular scenario was a bit different. My ex-girlfriend used to work as a translation coordinator. She would often get requests from her clients to translate some documents, basically. Rather than translating the documents herself, she would have to, you know, find a translator that would do those particular documents, but that's beside the point. The point is that once in a while, she would get like these documents that were actually images, and so, you know, translators don't like to work with images. They, you know, have their specific pieces of software that help them to, basically, map out specific words into translated phrases that they've done in the past. And, you know, I used Tesseract, I believe, as well, and it just didn't work for that particular use case because the accuracy of it was horrible, really. I mean, it was better than nothing. 

[0:16:42.2] Povilas: But this was a scanned document, or it was like...

[0:16:44.6] Augustinas: I'm pretty sure that it was a scanned document and...

[0:16:48.1] Povilas: That would be a problem, you know, because the quality is a bit different. But in my case, it was quite easy because it wasn't like that; it was just a string of digits converted to an image. So it was pretty good quality, you know. So all it took me was just to find the right settings, you know, to get that string. 
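
The episode does not say which settings were tweaked, so the following is only a guess at what "finding the right settings" could look like for a clean image of digits. It calls the Tesseract command-line tool with page segmentation mode 7 (treat the image as a single text line) and a character whitelist. Whether the whitelist is honored depends on the Tesseract version and engine, and the function names and the file name are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path):
    """Build a Tesseract CLI call tuned for one line of phone-number characters."""
    return [
        "tesseract",
        image_path,
        "stdout",  # print the recognized text instead of writing a file
        "--psm",
        "7",  # page segmentation mode 7: a single text line
        "-c",
        "tessedit_char_whitelist=0123456789+-()",  # assumed phone characters
    ]


def read_phone_number(image_path):
    """Run Tesseract if it is installed; return the text, or None if it is not."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(" ".join(build_tesseract_command("phone.png")))
```

For scanned documents, like the translation case mentioned above, settings like these would likely not be enough, because the input quality is the limiting factor.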

[0:17:12.0] Augustinas: Okay, so if I understand you correctly, all you have to do is just play around with the Tesseract settings, and that was basically it. What was your accuracy? Do you remember?

[0:17:24.0] Povilas: It was very high, I think, close to 100%.

[0:17:26.6] Augustinas: Okay, so, you know, one or two images, the fact that you didn't... Is translating the right word? Well, let's assume it is.

[0:17:36.4] Povilas: Converting, maybe?

[0:17:37.2] Augustinas: Parsing, or parsing, right?

[0:17:38.1] Povilas: That's right. 

[0:17:40.2] Augustinas: So one or two incorrectly parsed images weren't a big deal for your particular case.

[0:17:45.8] Povilas: Yeah, I mean parsing is a big topic. There are different cases, like, now I remember one more case that we have with this. With one of the search engines, we have this case where the HTML source is not, well, how do I say it, is not complete, or the data is there in the source, but it's hidden in JavaScript objects and code which is in the same HTML. So, yeah, it's kind of an edge case, but what's...

[0:18:24.5] Augustinas: Well, from what I know, like these particular cases, at very least in Oxylabs, they are usually solved by just having a headless browser. So we're basically running an actual browser that, you know, just clicks a few buttons in order to get, you know, the fully complete HTML that is actually ready for us to parse.

[0:18:45.6] Povilas: That's another scraping part, you know, when you do... we use headless if we need to make an additional request, like, well, a request with JavaScript. But this case that I'm describing is that the data is already there. There are no additional requests. So the thing is that when you go to the search engine, and there's a section of questions with expandable bars, when you click on that bar - the JavaScript runs, and it takes some piece of data, you know, HTML, and puts it into that bar. So, yeah, in that case, we kind of had to, you know, think about the logic that happens. I think a front-end programmer was sent to do it, and you kind of have to, well, replicate that on the parser level. So that's the challenge. 

[0:19:39.1] Augustinas: I'm guessing that in this particular case, evaluating the JavaScript was a little bit too expensive, and that's why you weren't running a full browser in order to get the HTML that you actually would need to use to actually parse the data easily.

[0:19:57.8] Povilas: Exactly, that's the reason that we can't afford ourselves at the parsing level to just, you know, fire up a headless browser because it takes resources and it takes time, and, you know, we have to, you know, parse it fast.

[0:20:11.0] Augustinas: Okay, so what you did there, if I'm understanding you correctly, what you did there, you were just looking at the JavaScript code, and you replicated the whole thing, but in Python.

[0:20:22.8] Povilas: Well, similarly to that with some parts of the JavaScript that runs, we have to replicate those actions in order to be able to get the data, to get the full HTML. Then we could load it to the element tree and do the parsing. 
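
A minimal sketch of the general idea, under heavy assumptions: the page below, the `window.__QUESTIONS__` variable, and the data layout are all invented, and the real search engine page certainly looks different. The point is only that data embedded in a script block can be pulled out and parsed without running a browser, by replicating in Python what the page's JavaScript would do on click.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical page: the answers are not in the visible markup, only in a
# JavaScript object that the page would inject when a bar is expanded.
PAGE = """
<html><body>
  <div id="questions"></div>
  <script>
    window.__QUESTIONS__ = [{"q": "What is parsing?", "a": "<p>Turning HTML into data.</p>"}];
  </script>
</body></html>
"""

DATA_PATTERN = re.compile(r"window\.__QUESTIONS__\s*=\s*(\[.*?\]);", re.DOTALL)


def extract_questions(page):
    """Replicate the page's JavaScript step: read the embedded data directly."""
    match = DATA_PATTERN.search(page)
    if match is None:
        return []
    items = json.loads(match.group(1))
    results = []
    for item in items:
        # Each answer is an HTML fragment; load it into an element tree.
        fragment = ET.fromstring(item["a"])
        results.append(
            {"question": item["q"], "answer": "".join(fragment.itertext())}
        )
    return results


print(extract_questions(PAGE))
```

A regex is used here only to locate the embedded data block; the HTML fragments themselves still go through an element tree.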

[0:20:38.6] Augustinas: How do you monitor parsers in general? Like, I mean, we talked about the usual case where you have to write selectors that withstand some turbulence within, like, changing websites, right? But the scenario where a website completely changes is still probably possible. Well, not probably. It's definitely possible that some, you know, front-end developer that maintains another website will suddenly decide - "This is a good day to update our website and our whole design." So in those particular scenarios, your selector probably doesn't work anymore. How do you possibly...

[0:21:20.2] Povilas: Yeah, you're right, it's possible. It doesn't happen so often, but we have more of, like, slight changes in layout, slight changes in some elements. And visibility is a big part. Visibility becomes a big part when, you know, your service is a crucial part of your client's business. So visibility has to be very important for us because we have to catch very quickly that our parsers are not doing well. So, with visibility, we constantly ask ourselves how our parsers are doing and what the success rate of our parsers is.

[0:22:02.6] Augustinas: So how do you measure success rate?

[0:22:07.3] Povilas: Well, we have these status codes for our parsers for the job that we do and, well, you know, when we are writing the parsers, we kind of describe what fields are going to be in the output. So we might have optional fields, and we definitely have the required fields. So whenever we fail to parse a required field, that means that there's some error, and that's an indication that maybe the layout changed or any other changes happened to the website. So if we get an error, we assign a specific status code for the job, and we track those status codes. So anything less than a 99% success rate is kind of not acceptable.
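
Tracking a success rate from per-job status codes could look roughly like this. The status names are hypothetical, since the actual codes are not given in the episode; only the 99% threshold comes from the conversation.

```python
from collections import Counter

# Hypothetical status codes; the real ones are not named in the episode.
STATUS_OK = "ok"
STATUS_PARSE_ERROR = "parse_error"

SUCCESS_THRESHOLD = 0.99  # "anything less than 99% ... is not acceptable"


def success_rate(statuses):
    """Share of jobs that finished with STATUS_OK (1.0 for an empty batch)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts[STATUS_OK] / total if total else 1.0


def needs_attention(statuses, threshold=SUCCESS_THRESHOLD):
    """True when the success rate drops below the acceptable threshold."""
    return success_rate(statuses) < threshold


healthy = [STATUS_OK] * 995 + [STATUS_PARSE_ERROR] * 5
degraded = [STATUS_OK] * 980 + [STATUS_PARSE_ERROR] * 20
print(success_rate(healthy), needs_attention(healthy))
print(success_rate(degraded), needs_attention(degraded))
```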

[0:22:55.2] Augustinas: Okay, it's kind of weird because what's the point of having required fields in general? I mean, wouldn't it be... Okay, I'm going to assume that you're going to say something like, you know, you're scraping an e-commerce website, and one of the required fields would probably be something like the title, and if you can't find the title of the product, then you know the whole job goes to hell, basically.

[0:23:20.6] Povilas: Exactly.

[0:23:23.5] Augustinas: But I'm thinking, isn't it better to parse the title and then return everything else?

[0:23:34.6] Povilas: Well, we actually do that, yeah. If we just fail to parse some required element, it doesn't mean that we just, you know, fail the whole job. We do return the result, but we clearly indicate that this field has an error. We have failed to parse this. But there is a case when we fail the job because sometimes you just, I don't know, an example helps: if you can't find the body branch, you know, in the tree.

[0:24:03.4] Augustinas: Right, those are, like, critical failures. 

[0:24:05.8] Povilas: Yeah, that's a critical failure 
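
The difference between a field-level error and a critical failure can be sketched like this. Field names, selectors, status strings, and the exception class are all hypothetical; the sketch only mirrors the behavior described above: a missing required field is flagged while the rest of the result is still returned, and a missing `body` fails the whole job.

```python
import xml.etree.ElementTree as ET

# Hypothetical field definitions for an e-commerce product page.
REQUIRED = {"title": ".//h1[@class='title']"}
OPTIONAL = {"rating": ".//span[@class='rating']"}


class CriticalParseError(Exception):
    """Raised when the document is unusable, e.g. it has no <body>."""


def parse_product(html):
    root = ET.fromstring(html)
    body = root.find("./body")
    if body is None:
        # Nothing can be found without a body, so the whole job fails.
        raise CriticalParseError("no <body> element")

    result = {"status": "ok", "fields": {}, "errors": []}
    for name, path in {**REQUIRED, **OPTIONAL}.items():
        node = body.find(path)
        value = node.text if node is not None else None
        result["fields"][name] = value
        if value is None and name in REQUIRED:
            # Flag the field, but keep everything else that was parsed.
            result["errors"].append(name)
            result["status"] = "parse_error"
    return result


GOOD = '<html><body><h1 class="title">Widget</h1><span class="rating">4.5</span></body></html>'
NO_TITLE = '<html><body><span class="rating">4.5</span></body></html>'
print(parse_product(GOOD))
print(parse_product(NO_TITLE))
```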

[0:24:09.2] Augustinas: Because if there's no body, you basically can't find anything in the element tree. Yeah, I can understand that. Are there a lot of cases like this?

[0:24:18.2] Povilas: When do we have to, well, critically fail the job?

[0:24:19.1] Augustinas: No, I mean in general when, like, websites are changing their design and then…

[0:24:24.5] Povilas: Yeah, that's our life, you know, marketing guys doing, like, A/B tests and so on. The website changes and evolves, and that's everyday life. So, you know, in our office, we have on the wall these big screen TVs with Grafana dashboards where we can clearly see what's the success rate of our parsers. And we even have this parser duty thing that's a rotation. Every week we have one person assigned to the parser duty. So his task is to just watch those screens, watch the Grafana dashboard, and, you know, react quickly. If we see that something is failing, you know, if we have to...

[0:25:08.5] Augustinas: Do they have to wake up at night to fix this?

[0:25:12.9] Povilas: Not at night, yeah, not at night, but that's, you know...

[0:25:13.7] Augustinas: So it's like a specific job that you give one of your colleagues to just respond quickly to websites that change their design?

[0:25:24.4] Povilas: Yeah, because it's a thing, you know, with websites.

[0:25:27.6] Augustinas: They probably have, like, to drop everything that they're doing and go to fix that particular website, or do you still, like, consider the priorities of fixing a particular parser?

[0:25:39.6] Povilas: That's a priority, yeah, because we're personally delivering data.

[0:25:44.4] Augustinas: The parsers that you already have are the priority, okay. What kind of other, like, breaking things in the parsers have you noticed within the last year? Have there been situations where, like, parsers just in general stopped working?

[0:26:02.7] Povilas: Just, in general, like, what do you mean?

[0:26:06.5] Augustinas: Any, sort of, like downtimes that you remember that you had, like, with parsers in general. 

[0:26:14.8] Povilas: If it really happened, we kind of know. The stability of our services is also, like, a crucial part, so we try to make sure that we wouldn't have any downtime.

[0:26:29.3] Augustinas: So, no parsers were completely down within the last year?

[0:26:30.8] Povilas: Maybe it was in some cases. Yeah, definitely not often.

[0:26:33.5] Augustinas: You're just not aware of it, I guess. All right, so…

[0:26:39.6] Povilas: I mean, you know, just on the topic, like, it's very important not to release something bad to the production server. So, yeah, so we, kind of, have to do this testing of the code, you know, we try to prevent anything that might create this downtime.

[0:26:58.3] Augustinas: Right, so how do you test your parsers then?

[0:27:02.5] Povilas: Well, ironically, we don't write any tests for parsers.

[0:27:09.8] Augustinas: But how do you know that you didn't break your parser then?

[0:27:11.0] Povilas: The cool thing that we have in the parsers is that we have these two files for each parsing case, like, for our target website. One file is the fixture, which is a JSON file. It has some metadata about the job, and most importantly, it has the HTML of the target website. And the other file, this golden file, it's like a source of truth, the expected output of the parsers. So, whenever you make any changes to your code, you kind of run the tests, and it takes the input file, the fixture, and it checks: does the output match the golden file or not?
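
A minimal sketch of a fixture and golden file check. The file names, the fixture keys (`url`, `html`), the JSON format of the golden file, and the toy `parse` function are assumptions for illustration; the episode only says that the fixture is a JSON file with job metadata and the page HTML, and that the golden file holds the expected output.

```python
import json
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET


def parse(html):
    """Toy parser standing in for a real one: extract the product title."""
    root = ET.fromstring(html)
    return {"title": root.findtext(".//h1")}


def matches_golden(fixture_path, golden_path):
    """Run the parser on the fixture's HTML and compare with the golden file."""
    fixture = json.loads(Path(fixture_path).read_text(encoding="utf-8"))
    golden = json.loads(Path(golden_path).read_text(encoding="utf-8"))
    return parse(fixture["html"]) == golden


# Demo with temporary files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "product.fixture.json"
    golden_path = Path(tmp) / "product.golden.json"
    fixture_path.write_text(
        json.dumps(
            {
                "url": "https://example.com/p/1",  # job metadata
                "html": "<html><body><h1>Widget</h1></body></html>",
            }
        ),
        encoding="utf-8",
    )
    golden_path.write_text(json.dumps({"title": "Widget"}), encoding="utf-8")
    golden_ok = matches_golden(fixture_path, golden_path)

print(golden_ok)
```

With this setup, covering a new layout is a matter of adding another fixture and golden pair rather than writing new test code.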

[0:28:01.2] Augustinas: So what I hear here is that you don't have unit tests, but you have functional tests?

[0:28:09.4] Povilas: Yeah, it would be a functional test.

[0:28:09.6] Augustinas: Okay. So I'm hearing that there's quite a lot of nuance to parsing in general, but I'm also wondering about what you think the future is like. What is the future of parsing going to look like?

[0:28:29.9] Povilas: Well, I think, like, a lot of things are impacted today by machine learning. So, same with this field, the parsing field: I think machine learning will do more and more, will replace more and more of the manual tasks that we do, and to be fair, like, even now, in Oxylabs, I believe that we have this glimpse of the future, you know. Because we have this universal parser, it's done by our R&D department, and we already have it in production, and our clients are using it. It's a fascinating piece of software. I always think if I had it back then when I was freelancing, I would have made a fortune at that time. So that piece of software doesn't have to know anything about the target website, and you provide the HTML document of the e-commerce page, and it kind of gives you back the parsed elements, yeah.

[0:29:29.5] Augustinas: So, is this for e-commerce right now?

[0:29:32.5] Povilas: Yeah, it's for e-commerce.

[0:29:32.5] Augustinas: Oh, I see. So, I'm guessing that one of the ways that Oxylabs is going to improve is just by keeping that. We'll keep working on this particular universal parser, and that will probably make it work for SEO websites as well, maybe?

[0:29:50.1] Povilas: It might be, I don't know these plans, but I believe that it should be, you know, and eventually we will just be in Mexico drinking margaritas, and hopefully we'll do it.

[0:30:02.8] Augustinas: Hopefully, or maybe we'll find something else interesting to do. I'm guessing that when it comes to scraping, parsing, and everything, like, web scraping related, in general, there's always going to be work for us. And even when machine learning gets developed, I'm pretty sure that for the next 50 years, we'll still have things to do. I mean, from one side, you would like to hope that machine learning improves enough to, basically, allow us to parse websites easily, but, you know, not everything that we do here in Oxylabs is just, you know, about the scraping or about parsing. Pretty sure that by now, we consider ourselves a data extraction company rather than just parsing one.

[0:30:48.8] Povilas: Yeah, I would agree, yeah.

[0:30:51.3] Augustinas: Also, I'm guessing, you know, if you were to ask my personal opinion, I'm pretty sure that we're going to work on products that maybe allow clients to write less code and instead, like, get specific parse results for themselves without having to write any code at all, almost like a no-code platform.

[0:31:13.7] Povilas: Yeah.

[0:31:14.0] Augustinas: That being said - crawling, right. When you think about scraping and parsing, if you watched our previous episode, you probably heard about the term crawling. And I'm pretty sure you know about it already. It's a topic that I want to introduce to our viewers in our next episode. I'm hoping that you guys will stay around and find out what we have to say about this particular topic. That being said, I'd like to remind you all that you can find us on SoundCloud, YouTube, Apple Podcasts, and Spotify. I'm pretty sure I did not forget anything this time. And, you know, guys, scrape responsibly and parse safely.
