While the current global situation is unfavorable for face-to-face meetings, Oxylabs has held the virtual event we’ve all been waiting for: OxyCon 2021. This late summer has finally brought us together for a two-day conference packed with know-how from experts at market-leading businesses. Sixteen speakers gathered online to cover the most relevant aspects of web scraping.
On day one, we had a mix of in-depth presentations and discussions, wrapped up with some friendly interactions. Whether you attended the event or are just curious, we’ve put together a brief rundown of the first day. So, let’s get started.
Data Quality – Your Worst Nightmare
Oxylabs CEO Julius Černiauskas opened the conference and introduced its moderators: Gabija Fatėnaitė, Product Marketing Manager, and Vaidotas Šedys, Head of Risk Management. Then it was time to meet the first presenter: Allen O’Neill, founder and CTO at DataWorks.
Allen shared tips and tricks on how to achieve top data quality and make sure working with data is cost-efficient and not overly time-consuming:
- First and foremost, he suggested that businesses should focus on their core competencies and leverage experts to ensure data is high-quality.
- Allen also noted it’s essential to evaluate the scope of data needed and define the expected value range. Different industries require different data.
- Finally, he provided ten key points that are crucial to observe when ensuring data quality. They include clustering data, evaluating unexpected values, considering impossible values, checking different types of data, keeping extraordinary data in mind, tracking values that fall out of scope or range, checking for outliers, making sure data is logical, and eliminating spelling mistakes.
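A few of these checks are easy to illustrate. Below is a minimal Python sketch, assuming a hypothetical list of scraped price records, that flags impossible values (wrong type), out-of-range values, and simple 3-sigma outliers; the function name and bounds are illustrative, not from the talk.

```python
import statistics

def quality_report(prices, lo=0.0, hi=10_000.0):
    """Flag impossible, out-of-range, and outlier values in scraped prices."""
    issues = {"impossible": [], "out_of_range": [], "outliers": []}
    numeric = [p for p in prices if isinstance(p, (int, float))]
    for p in prices:
        if not isinstance(p, (int, float)):
            issues["impossible"].append(p)      # wrong type, e.g. "N/A"
        elif not lo <= p <= hi:
            issues["out_of_range"].append(p)    # e.g. a negative price
    if len(numeric) >= 2:
        mean = statistics.mean(numeric)
        stdev = statistics.stdev(numeric)
        # Simple 3-sigma rule as a stand-in for a proper outlier test
        issues["outliers"] = [
            p for p in numeric if stdev and abs(p - mean) > 3 * stdev
        ]
    return issues

report = quality_report([19.99, 24.50, "N/A", -5.0, 21.00])
```

In practice the thresholds and rules would come from the expected value ranges defined for each industry, as Allen suggested.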
Indexing: Scraping Website from Zero to Sitemap
Eivydas Vilčinskas, Senior Software Engineer at Oxylabs, opened up a series of more technical topics designed for developers in the field of web scraping. Eivydas introduced website indexing and explained how businesses could use it for data collection.
Eivydas noted the two most important things businesses have to define before gathering data: 1) what data they need; 2) where to find it. Only then can the process continue.
According to the speaker, website indexing is crucial, yet how useful it is depends on the website. For instance, if the content doesn’t change, indexes can remain valid for a long period of time. Websites with dynamic content, however, must update their index constantly.
TLS Fingerprinting in Web Scraping
Another technical topic was covered by Martynas Juravičius, Lead Data Analyst at Oxylabs. He presented the concept of TLS fingerprinting and its applications, and discussed the impact this process has on web scraping and bot detection.
Firstly, Martynas explained what fingerprinting is in general: a process of taking protocol settings and combining them into a unique fingerprint stored in a database. Fingerprints are used to track malicious software and identify device parameters. TLS fingerprinting is a passive type of fingerprinting and might be one reason bot scrapers get blocked during data scraping.
How do they get blocked? Basically, anti-bot software compares TLS fingerprints with HTTP user agents. If they don’t match, the scraper gets restricted. To avoid this, developers use vast databases of user agents.
Martynas provided three solutions on how to avoid TLS fingerprinting while web scraping:
- Randomize parameters (cipher suites and TLS versions) – sending different parameters that are hard for anti-bot software to detect.
- Align parameters – employ massive user agent databases and align them with TLS parameters.
- Use real browsers and user agents – for example, installing a variety of real browsers on your machine.
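The first approach, randomizing parameters, can be sketched in Python with the standard `ssl` module: shuffling the cipher-suite order a client offers changes the TLS ClientHello and therefore the resulting fingerprint. The cipher list below is an illustrative subset of common OpenSSL names, not a recommendation from the talk.

```python
import random
import ssl

BASE_CIPHERS = [
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-ECDSA-CHACHA20-POLY1305",
    "ECDHE-RSA-CHACHA20-POLY1305",
]

def randomized_context():
    """Build an SSLContext whose cipher order differs on every call."""
    ciphers = BASE_CIPHERS[:]
    random.shuffle(ciphers)
    ctx = ssl.create_default_context()
    # A different offer order produces a different TLS fingerprint
    ctx.set_ciphers(":".join(ciphers))
    return ctx

ctx = randomized_context()
```

Note that naive randomization can itself look suspicious, which is why Martynas’ second option, aligning TLS parameters with realistic user agents, is often preferred.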
Harnessing the Power of External Data in E-commerce
Tomas Montvilas, Chief Commercial Officer at Oxylabs, continued the conference with one more fascinating topic on data collection for businesses. He outlined the main challenges of collecting external data in real-time, namely:
- Building and maintaining real-time scraping pipelines.
- Managing proxy infrastructure.
- Handling CAPTCHAs and website changes.
- Data parsing and cleaning.
Tomas also noted that external data is a powerful tool that helps companies stand out from competitors. In the presentation, he discussed the following use cases and solutions:
- Optimizing assortment in digital shelves by identifying selection gaps and overlaps using pricing ladders.
- Enabling real-time dynamic pricing by applying multiple price recommendation algorithms, such as competitive response, KVI, and elasticity algorithms.
- Monitoring marketplace search placements to move products to the first page of results and thus boost sales.
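To give a feel for the elasticity algorithms mentioned above, here is a hypothetical back-of-the-envelope Python sketch using the arc-elasticity formula; the numbers and the interpretation rule are illustrative, not from the presentation.

```python
def arc_elasticity(p1, q1, p2, q2):
    """Percentage change in quantity divided by percentage change in price,
    using midpoints so the result is symmetric in direction."""
    dq = (q2 - q1) / ((q1 + q2) / 2)
    dp = (p2 - p1) / ((p1 + p2) / 2)
    return dq / dp

# Hypothetical: demand fell from 120 to 110 units when price rose from 10 to 12
e = arc_elasticity(10, 120, 12, 110)
# |e| < 1 means demand is inelastic here, so a price increase raises revenue
```

Real dynamic-pricing systems would combine such signals with competitive-response and KVI logic rather than rely on one formula.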
Monitoring Web Scrapers: Best Practices
Oxylabs Data Analyst Andrius Kūkšta emphasized the necessity of quick reaction to any potential deviations while maintaining web scrapers. During his presentation, Andrius pointed out some beneficial practices for building and maintaining scraper monitoring systems:
- Building block detection tools. This involves fetching the HTML, parsing it, and passing it to a classifier that predicts whether the response is a block.
- Collecting statistics. For this, you’ll need to prepare dashboards and do a lot of analysis of operation outcomes, durations, and all the steps of a process chain.
- Setting up alerting systems. It’s wise to start with highly sensitive thresholds and relax them later if alerts become too frequent. It also comes in handy to split alerts into critical and non-critical ones.
- Testing scraper monitoring capacity. This covers metrics such as how many requests you can make per proxy and the average number of requests per scraping parameter set.
- Making data-based decisions. It’s essential to evaluate whether you have enough capacity before onboarding new customers, continuously improve it, and factor it into pricing.
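The first and third practices can be sketched together in a few lines of Python. The rule-based detector below is a hypothetical stand-in for the trained classifier Andrius described, and the alert thresholds are made-up examples of the critical/non-critical split.

```python
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")

def looks_blocked(status_code, html):
    """Heuristic stand-in for a trained block classifier."""
    if status_code in (403, 429):
        return True
    return any(marker in html.lower() for marker in BLOCK_MARKERS)

def alert_level(block_rate, critical=0.20, warning=0.05):
    """Split alerts into critical and non-critical tiers."""
    if block_rate >= critical:
        return "critical"
    if block_rate >= warning:
        return "warning"
    return None

# Classify a batch of responses, then compute the block rate for alerting
results = [
    looks_blocked(200, "<html>ok</html>"),
    looks_blocked(403, ""),
    looks_blocked(200, "Please solve this CAPTCHA"),
]
rate = sum(results) / len(results)
```

In a production system, `rate` would be aggregated per target and per proxy pool on the dashboards mentioned above, with the thresholds tuned over time.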
Machine Learning Infrastructure
The final presentation of day one was made by Pujaa Rajan, Machine Learning Engineer at Stripe. Being an experienced professional in machine learning, Pujaa presented what tools and resources are a must when developing machine learning infrastructure in a company. The key takeaways from the presentation:
- Pujaa introduced the machine learning lifecycle: preparing data, building a model, training it, putting the model into production, scaling continuously, and retraining to make it more cost-effective, add new features, and decide how to improve the product overall.
- She also discussed how an ML infrastructure supports the above-mentioned lifecycle by going through the steps of development, from writing code to monitoring the overall result.
- What’s more, Pujaa introduced software that is used for building and maintaining ML infrastructure. According to the speaker, choosing a programming language is a vital step before developing any project.
- Pujaa discussed some technical details of the ML infrastructure and summed up her presentation by defining the characteristics of an excellent ML-based infrastructure: usefulness, explainability, simplicity, and scalability.
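The lifecycle steps can be illustrated with a toy, dependency-free Python example: prepare data, train a model, evaluate it, then retrain on fresh data. The one-parameter linear model and the data are invented for the sketch and have nothing to do with Stripe’s actual infrastructure.

```python
def train(data, epochs=200, lr=0.01):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def evaluate(w, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

train_set = [(1, 2), (2, 4), (3, 6)]   # "prepare data"
w = train(train_set)                    # "train": converges to w ~ 2
mse = evaluate(w, train_set)            # "evaluate" before deployment
new_data = train_set + [(4, 8.2)]       # fresh data arrives in production
w = train(new_data)                     # "retrain" to keep the model current
```

Real infrastructure wraps each of these steps in tooling (data pipelines, experiment tracking, serving, monitoring), which is exactly what the lifecycle discussion was about.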
Here’s a quick wrap-up of what we’ve experienced on day one. We also had a second inspiring day filled with expertise sharing. Head over to our blog post to learn more about day two of OxyCon 2021!