OxyCon 2024: The Top Takeaways

Augustas Pelakauskas

2024-09-26, 6 min read

OxyCon 2024 brought another round of insights on all things web data. The special five-year edition marked the explosion of AI, ML, and LLMs.

We, the Oxylabs team, thank you for your enthusiasm in attending, watching, and asking so many questions. Whether you missed the live event or want to refresh your memory, here you’ll find highlights from each topic.

The OxyCon on-demand videos are just around the corner and will be available soon. Before diving in, join our Discord to network and stay tuned for updates.

Rūta Giriūnaitė, OxyCon Host

Ensuring Scalability in Data Collection: Key Components, Challenges, and Advancements

To kick things off – some scraping basics from our CTO, Žydrūnas Tamašauskas. What are the ingredients for scalable data collection? Žydrūnas presented a formula for choosing proxies by balancing success rate, cost, and speed. But with so many proxy types, it’s difficult to prioritize. Even though residential proxies are widely regarded as the best proxy type in the industry, choosing them alone won’t solve every problem.

Combining multiple proxy types and performing proxy load balancing through trial and error while running headless browsers is the solution. How many headless browsers? Tens, hundreds, or thousands? Well, it always depends. An entire headless farm might be in order.
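To make the balancing act a bit more tangible, here is a minimal Python sketch of scoring proxy pools by success rate, cost, and speed. It is not the formula from the talk: the weights, the normalization caps, the `ProxyStats` fields, and all the numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProxyStats:
    name: str
    success_rate: float   # fraction of successful requests, 0..1
    cost_per_gb: float    # hypothetical price in USD
    avg_latency_s: float  # mean response time in seconds


def score(p, w_success=0.6, w_cost=0.2, w_speed=0.2,
          max_cost=20.0, max_latency=10.0):
    """Weighted score in 0..1, higher is better (weights are assumptions)."""
    # Cost and latency are capped, normalized, and inverted so cheaper/faster scores higher.
    cost_term = 1.0 - min(p.cost_per_gb / max_cost, 1.0)
    speed_term = 1.0 - min(p.avg_latency_s / max_latency, 1.0)
    return w_success * p.success_rate + w_cost * cost_term + w_speed * speed_term


def rank(pools):
    """Return pools ordered from best to worst score."""
    return sorted(pools, key=score, reverse=True)


# Made-up measurements for three proxy types.
pools = [
    ProxyStats("datacenter", 0.70, 1.0, 0.8),
    ProxyStats("residential", 0.95, 10.0, 2.5),
    ProxyStats("mobile", 0.97, 18.0, 4.0),
]

for p in rank(pools):
    print(f"{p.name}: {score(p):.3f}")
```

In practice, the measurements would come from your own per-target trial runs, and the weights would shift depending on whether a job is cost-sensitive or success-critical.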

Even if you unblock a target today, the following week might bring an unpleasant surprise. So go on and iterate. Don’t get tired, don’t give up, and prepare for a tedious and never-ending process. Try every possible combination. There is no single best setup. Sharing sessions and proxies effectively between headless browsers is an art, as Žydrūnas put it.

Žydrūnas Tamašauskas, CTO @ Oxylabs

Human-Centered Approach to Streamlined Data Gathering

Machine learning, artificial this, artificial that, but what about human learning and the natural instinct to grow as a rookie coder specializing in web data gathering? Vilius Visockas, CEO at City Now, shared his day-to-day technical experience of gathering real estate data, focusing on human development. He explained his take on employee selection and on putting junior devs at ease.

Vilius discussed City Now's innovative approach to managing large, niche datasets with a small team. As he remarked, for 100% correct data, humans are (still) needed, and 99% isn’t good enough. To be more specific, Vilius is all about peer review, seeing it as a code audit.

When delegating tasks to new coders, the package can include mentorship, a stronger resume for a future job, and simply having fun. In a nutshell, web data gathering is validating: you either succeed or you don’t. There is no middle ground, which is a peculiar benefit for newcomers.

Sometimes, making the whole scraping process more personal can be psychologically useful. Say you track a commodity item you actually buy and consume daily and get email alerts when the price drops. You might save a hefty amount in the process. This is just one example of a homework project that makes you sweat a bit while getting your feet wet. Simple projects like this are great for engaging newcomers.
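The core of such a homework project can be tiny. Below is a minimal Python sketch of the price-drop check only; the function name, the 5% threshold, and the sample prices are assumptions for illustration, and the scraping and emailing parts are deliberately left out.

```python
def price_dropped(history, new_price, min_drop_pct=5.0):
    """Return True if new_price is at least min_drop_pct below the last recorded price."""
    if not history:
        return False  # nothing to compare against yet
    last = history[-1]
    if last <= 0:
        return False  # guard against bad data and division by zero
    drop_pct = (last - new_price) / last * 100
    return drop_pct >= min_drop_pct


# Hypothetical daily prices for one tracked item.
history = [2.10, 2.05, 2.00]
todays_price = 1.80

if price_dropped(history, todays_price):
    print("Price drop detected, time to send that email alert.")
history.append(todays_price)
```

A newcomer can then grow this step by step: first fetch the price, then store the history, then wire up the alert.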

Imitating Real User Behavior With Mouse Movements

Let’s assume your proxies, fingerprints, and headless browsers are set up perfectly. And yet, you’re still getting blocked. Why? Tadas Gedgaudas, a Developer at Oxylabs, had the same question when he realized that a target website tracked mouse movements – in his case, a lack thereof.

Unnatural mouse movements can be a culprit. Some websites track movement just to know what customers are interested in; others do it to feed data to AI models that detect bots. Even some CAPTCHAs use movement detection.

How can artificial mouse movements effectively mimic human behavior? First, Tadas explained how to check if a website uses mouse movement detection. Then, he introduced three common mouse trajectory algorithms and his open-source project, OxyMouse, which supports them.
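This recap doesn't name the three algorithms, so the following is only an illustration of the general idea rather than OxyMouse's own code or API. A cubic Bézier curve with randomized control points is one widely used way to generate a curved, non-linear path between two screen coordinates. The function name, `spread`, and `steps` are assumptions.

```python
import random


def bezier_path(start, end, steps=50, spread=100.0, rng=None):
    """Return steps + 1 (x, y) points along a cubic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Two random control points pull the path away from a straight line.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier formula applied to each coordinate.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


path = bezier_path((0, 0), (300, 200), steps=20, rng=random.Random(1))
print(len(path), path[0], path[-1])
```

The resulting points could then be fed one by one into a browser automation tool's mouse-move call. Realistic timing between points matters too, which this sketch ignores.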

Today, in addition to all the basic and advanced anti-blocking tricks, you might also need to simulate mouse movements. So don’t spend too much time on the usual unblocking matters when the real reason might be mouse movement detection.

Pro tip: mobile data scraping is mouseless with no need for mouse simulators.

Harnessing Gen AI for Data-Driven Answers

How do you push GenAI to its limits? Imagine a chatbot with billions of rows of code. Hardly feasible just four years ago, right? The same chatbot powered by GenAI – here’s an answer to your query in 30 seconds; you’re welcome.

The impossible is now possible, as Paul Felby, Adthena's CTO, previewed a system designed to transform users' interactions with large data sets using conversational language.

Besides the quirks of preparing a workflow that integrates GenAI, Paul covered:

  • The differences between the most popular GenAIs. 

  • Building long prompts.

  • AI confidence layer implementation.

From starting with a basic SQL schema and adding a semantic layer to a multi-agent architecture approach, technicalities-wise, Paul dove the deepest at this year’s OxyCon. So go ahead and watch on-demand to catch all the details.
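As a rough illustration of the first two steps, here is a minimal Python sketch that combines a SQL schema with a semantic layer (plain-language column descriptions) into one long prompt. It is not Paul's system: the table, the column descriptions, and the prompt wording are all hypothetical, and no model is actually called.

```python
# Hypothetical schema and semantic layer, made up for this example.
SCHEMA = "CREATE TABLE ad_clicks (day DATE, kw TEXT, clicks INT, cost_usd REAL);"

SEMANTIC_LAYER = {
    "ad_clicks.kw": "search keyword the ad appeared for",
    "ad_clicks.cost_usd": "total ad spend for that keyword and day, in US dollars",
}


def build_prompt(question, schema=SCHEMA, semantics=SEMANTIC_LAYER):
    """Assemble a text-to-SQL prompt from the schema, column meanings, and a question."""
    notes = "\n".join(f"- {col}: {desc}" for col, desc in sorted(semantics.items()))
    return (
        "You translate questions into SQL.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Column meanings:\n{notes}\n\n"
        f"Question: {question}\n"
        "Reply with a single SQL query."
    )


print(build_prompt("Which keyword cost the most last week?"))
```

The semantic layer is what lets cryptic column names like `kw` make sense to the model. A confidence layer or a multi-agent setup would sit on top of a prompt like this.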

AI-Powered Public Web Data Collection at Scale

An AI-powered abstraction layer can greatly benefit web data acquisition pipelines. As Aleksandras Šulženko, Product Owner at Oxylabs, explained, you can turn to AI to complete web scraping workflows. However, scraping code generated by LLMs is often expensive and, as Aleksandras put it, prone to hallucinations.

At Oxylabs, our data acquisition pipeline previously had some AI-powered features, but now, to help connect all the pieces, Aleksandras introduced an AI companion, OxyCopilot.

OxyCopilot is the culmination of years of work in data acquisition, AI, and ML. The industry’s first AI-driven scraping assistant allows you to build scraping and parsing pipelines with minimal programming knowledge.

During a demo, Aleksandras showcased the simplicity of inputting a prompt with a task description and a target URL to generate code for parsing instructions. The instructions were then sent to our Web Scraper API, which returned a structured result.

Aleksandras’ team is still testing various GenAIs to fine-tune the performance, but the tool promises to greatly simplify the whole scraping process.

To take a break from technicalities, Nerijus Šveistys, Senior Legal Counsel at Oxylabs, highlighted the most recent ethical, social, and legal issues in AI. Compared to 2023, the use of AI went up by 70%. Of course, the main culprit is GenAI.

Currently, we have no uniform AI regulations. Nerijus covered AI regulations in the US, EU, and China, which all differ substantially. It’s somewhat of a grey legal area.

The main legal dilemma is scraping intellectual property and copyrighted content to feed AI models. To illustrate this point, Nerijus reviewed some notable legal cases. They shared the same point of contention: copyright holders felt exploited, while AI developers argued that the generated output is always different from the sample data and should therefore be deemed original content. The fundamental question is: could fair use cover AI training? As of now, courts have not settled the matter.

Such a legal battle could discourage AI investors in the short term. In the future, additional permits will probably be needed to use public data for AI training, as it seemingly goes beyond fair use.

AI laws aren’t set in stone just yet. Regulations are being prepared for various levels of restrictive use, with some extreme AI model designs planned to be outright prohibited. The bottom line is that you don’t want to be asked to delete your AI model, Nerijus stressed.

Advanced Unblocking Strategies

Finally, Juras Juršėnas, COO at Oxylabs, hosted a panel discussion. As collectively noted, just a few years ago, bots were running relatively free, and now, with a dramatic increase in anti-bot measures, data really is a priceless commodity.

Evaluation of IP reputation, mouse movement detection, and more new bot detection measures are coming online. Some argued that you can expect measures such as daily official browser updates and websites letting through only the newest versions.

The satisfaction of unblocking a website is increasing in parallel with the sophistication of anti-bot measures. All in all, for enterprises, this is a real hurdle, and for some developers, it's a pretty fun cat-and-mouse game, as Paulius Gervė put it.

Another emerging trend is the commercialization of anti-bot measures. The market for such providers is growing at an accelerating rate, with many hungry customers. It seems that the days of websites putting everything they can muster into combating bots are coming to an end. They will just outsource the whole process.

Some joked that we should hope for a reasonable future in which we won’t have to downscale tasks to a simple home PC level to appear as organic as possible.

And lastly, here are some general tips from the panelists:

  • Watch out for LLM hallucinations.

  • Exploit bugs in anti-bot measures as they are plentiful.

  • If you like puzzles, consider developing web data acquisition solutions.

  • Browse Discord and GitHub; there are always some hidden gems there.

  • Start a field manual on all of your successes with unblocking.

Juras Juršėnas, COO @ Oxylabs

And that's a wrap! If you've registered for the event, rest assured – you'll receive the on-demand videos soon. If you haven't registered, keep an eye on the OxyCon page for updates. Have questions? Feel free to contact us at events@oxylabs.io.

See you at OxyCon 2025!

About the author

Augustas Pelakauskas

Senior Copywriter

Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
