One of our clients is in the business of internet crawling and data mining. Every month they need to collect millions of web pages from hundreds of different web sites and extract the relevant data from them.
The number of pages that need to be retrieved and analyzed per web site varies from small (several hundred) to huge (several million). Some of the sites are crawled on a recurring basis and some just once, sometimes under a very strict deadline. The original approach to meeting such demands was to develop a dedicated app for each target web site. Initially, none of this relied on a modern cloud platform, so large payloads had to be handled by splitting them and running multiple instances of the same app on many different machines, each processing its own part. Over time, maintaining all those apps became difficult. One had to deal with constant changes in the structure of the crawled web sites, update apps to keep pace with technology trends, write new apps in the most efficient and optimized way and, finally, constantly reconfigure how and where they run.
All of this merged into the idea of developing a highly scalable, generic solution able to run any kind of web crawling process - in fact, any kind of workflow - so that processing, transforming and storing the collected data is covered as well. Additionally, we had to improve the crawling logic for some of the problematic web sites whose advanced bot protection required constant workflow changes and manual intervention. Finally, we had to find a way for the new solution to adapt flexibly to constantly changing load and deadline demands.
The first challenge was to abstract and standardize the way users define what the system needs to do across all the different crawling processes. The core logic is essentially the same everywhere: retrieve pages, extract the relevant data from them, transform it and store the results.
The idea is to pack such common functionality into tasks, which users can then invoke, defining web-site-specific behavior via properties on those tasks. The most challenging part was how to present this to the users. We had to come up with a user-friendly solution so that people could define, view and understand complex workflows as easily and as quickly as possible.
The most important requirement was that the solution must be able to scale the work across multiple machines automatically. That meant coming up with some kind of cluster solution in the cloud. For the many workflows where processing time is not critical and which are scheduled to run at different times, the cluster would dynamically expand and contract based on load. For the important ones under a strict deadline, one may need to run a dedicated cluster with scaling parameters calculated so that they can complete in time.
We were tasked with building a generic web crawling workflow system able to handle any kind of custom logic, deal with web site bot protection, remain usable by technically inclined staff who are not experienced software developers, and scale horizontally. We found no commercial solution on the market that met all of these demands. We implemented a generic solution by describing workflows in XML, building a library of prepackaged tasks able to run that XML, and supporting custom task scripting via the Roslyn compiler platform. Advanced web crawling tasks and bot protection were handled via CEF - the Chromium Embedded Framework. Horizontal scaling was implemented by modeling the entire application architecture with the Akka.NET actor model framework and using its built-in cluster capabilities. Docker and Kubernetes were used to run the application parts in the cloud environment.
One of the initial demands was to produce a user interface that could make the system usable by someone technically savvy, but not necessarily an experienced software developer. We looked at commercial tools like Mozenda for ideas and developed a flowchart-like user interface in ReactJS and .NET Web API, with the ability to drag and drop tasks and define their properties visually. We also created a class-library workflow model on the backend which we could transform to/from our front end and serialize/deserialize to/from a database.
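The serialization round-trip can be sketched with the standard .NET `XmlSerializer`; the `Workflow` and `TaskDef` classes below are hypothetical, simplified stand-ins for the actual backend model.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical, simplified stand-ins for the real workflow model classes.
public class TaskDef
{
    [XmlAttribute] public string Type { get; set; }
    [XmlAttribute] public string Url { get; set; }
}

[XmlRoot("Workflow")]
public class Workflow
{
    [XmlAttribute] public string Name { get; set; }
    [XmlElement("Task")] public TaskDef[] Tasks { get; set; }
}

public static class WorkflowXml
{
    static readonly XmlSerializer Serializer = new XmlSerializer(typeof(Workflow));

    // Serialize the in-memory model to XML for storage (e.g. in a database column).
    public static string Save(Workflow wf)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, wf);
            return writer.ToString();
        }
    }

    // Restore the model from its stored XML form.
    public static Workflow Load(string xml)
    {
        using (var reader = new StringReader(xml))
        {
            return (Workflow)Serializer.Deserialize(reader);
        }
    }
}
```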
After testing this solution for a while, it turned out that really complex workflows with hundreds of tasks - and most of our client's workflows are like that - were simply too hard to comprehend, search through and change in such a visual manner. Another problem was that frequent changes were hard to track and revert; we practically needed a source control system for the stored workflows.
Since we used XML for serialization, it occurred to us that using a decent XML-friendly text editor and working directly on the XML made everyone more productive. A simple file as the workflow definition was also well suited to source control. Hence we opted for this simpler yet more effective solution and completely separated the logic of designing workflows from running them.
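To give a flavor of the approach, a workflow definition might look something like the fragment below. This is purely illustrative - the task names and attributes are hypothetical, not the actual schema:

```xml
<!-- Illustrative sketch only: task names and attributes are not the real schema -->
<Workflow Name="example-shop">
  <Task Type="DownloadPage" Url="https://example.com/products?page={PageNumber}" Retries="3" />
  <Task Type="ExtractLinks" Selector="a.product" Output="ProductUrls" />
  <ForEach Items="ProductUrls">
    <Task Type="DownloadPage" Url="{Item}" />
    <Task Type="ExtractField" Selector="h1.title" Output="ProductName" />
  </ForEach>
  <Task Type="StoreRecords" Target="Products" />
</Workflow>
```

A definition like this lives in an ordinary file, so it diffs and versions cleanly in source control.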
As workflows grew in complexity, our XML needed full support for almost anything one might do in a programming language. We packed complex functionality into tasks, but there were many custom math operations and data transformations that simply couldn't be defined in such a generic manner. We realized we needed the ability to write small code snippets to accomplish any kind of custom functionality, so we looked for an expression engine able to parse and run custom code.
Looking further, we decided to try Microsoft's (back then relatively new) .NET Compiler Platform, Roslyn, and run our scripts in C#. We never looked back, since the performance was much better than anything else we tried. There is an initial compilation penalty, but we reduced it by caching each of our snippets as a dynamically generated assembly on disk and loading it from there on every repeated run. This gave us the final ingredient for writing workflows with any kind of custom logic. Given how our workflows looked at that point, we had an internal joke that we had produced XML#.
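A minimal sketch of how such a snippet can be compiled and run with Roslyn's scripting API (the `Microsoft.CodeAnalysis.CSharp.Scripting` NuGet package). The `Globals` type and the expression are illustrative, and the on-disk assembly caching described above is reduced here to reusing the compiled delegate:

```csharp
using System;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// Variables exposed to the script; the type must be public so the script can bind to it.
public class Globals
{
    public decimal Price { get; set; }
    public decimal Discount { get; set; }
}

public static class ScriptDemo
{
    public static decimal Run()
    {
        // The script body is an ordinary C# expression, as it would appear in the workflow XML.
        Script<decimal> script = CSharpScript.Create<decimal>(
            "Price * (1 - Discount)",
            ScriptOptions.Default,
            globalsType: typeof(Globals));

        // The first call pays the compilation cost; the compiled runner can be
        // cached and reused for every repeated execution of the same snippet.
        ScriptRunner<decimal> runner = script.CreateDelegate();

        return runner(new Globals { Price = 10m, Discount = 0.2m }).Result;
    }
}
```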
One may wonder why all this XML and C# scripting effort - why not just pack the major functionality into a class library and have programmers write custom apps or modules on top of it? A number of reasons. First, and most important, it was simply hard to enforce the discipline on the developer-users of the app not to introduce their own modified versions of standard tasks, and then we'd end up again with a maintenance and possibly performance mess. Second, the XML scripts for simpler workflows were written by other roles, not strictly developers. Third, simplicity - although it would be possible to write a custom library module for each workflow, recompile it on every change and plug it into the main platform for running, it was far more complex than just changing a simple script file and re-running it.
We used Xilium.CefGlue, a .NET CEF implementation, as it gave us a de facto 1:1 mapping to the native C++ CEF code, compared to CefSharp, which is more user friendly and has a gentler learning curve. It was also a better fit because we were able to do our own custom CEF builds and upgrade to newer versions more frequently.
Finally, with the emergence of .NET Core, we wanted to run isolated crawling processes on a number of available Linux machines, so we had to do a custom Linux CEF build and integrate it into our app.
While looking into the best way to do horizontal scaling, we chose the actor model. Studying resources like "Distributed Systems Done Right: Embracing the Actor Model" convinced us that this was the right architecture for us. It later proved to be the right decision, not only due to the built-in scaling and cluster support, but also in terms of how it handles concurrency problems (e.g. different parts of a workflow performing conflicting actions at the same time) and fault tolerance (when one of the workflow worker parts becomes inaccessible on the network or fails due to an error).
With .NET as our core technology, the choice was between two actor framework solutions - Microsoft's Orleans and Akka.NET, a port of the Akka framework from Java/Scala. We spent some time evaluating both and chose Akka.NET, despite its slightly steeper learning curve, because it is more open, more customizable and closer to the original actor model philosophy.
A cluster environment was our primary goal, but as we went along with the implementation, we realized that there were problems we simply couldn't have solved without the actor model. These problems are inherent to the nature of distributed systems. For example, since our workflows were parallelized, there was a need to pause, restart or cancel a whole workflow, either by manual action from the user interface or by the workflow logic itself in one of the parallel iterations. On a single machine one could use various locking techniques to synchronize the behavior of parallel executions, but in a distributed system all of that changes. Akka, like other actor model implementations, is designed to deal with exactly these kinds of problems. With every logical unit of work defined as a single actor responsible for handling it, concurrency was easy to deal with, since messages (commands) are always processed one at a time. The workflow control problem was solved by having a single runner actor which responded to pause, restart or cancel messages one at a time, notifying the other (distributed) parts/actors of the same workflow and not proceeding with further message processing until it made sure they had completed their part.
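The runner-actor pattern can be sketched in Akka.NET roughly as follows. This is a simplified, hypothetical version - the message types, the acknowledgment replies and the worker bookkeeping are illustrative, and the real system waits for acknowledgments from all workers before switching state:

```csharp
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical message types; the real system has richer commands.
public sealed class PauseWorkflow { }
public sealed class ResumeWorkflow { }
public sealed class CancelWorkflow { }

// One runner actor per workflow. Because an actor processes one message at a
// time, pause/resume/cancel can never race against each other or against
// normal progress messages from the parallel workflow parts.
public class WorkflowRunnerActor : ReceiveActor
{
    // References to the (possibly remote) worker actors of this workflow.
    private readonly List<IActorRef> _workers = new List<IActorRef>();

    public WorkflowRunnerActor() => Become(Running);

    private void Running()
    {
        Receive<PauseWorkflow>(_ =>
        {
            foreach (var worker in _workers)
                worker.Tell(new PauseWorkflow()); // real system also awaits worker acks
            Become(Paused);                        // stop accepting progress messages
            Sender.Tell("paused");
        });
        Receive<CancelWorkflow>(_ => Context.Stop(Self));
    }

    private void Paused()
    {
        Receive<ResumeWorkflow>(_ =>
        {
            foreach (var worker in _workers)
                worker.Tell(new ResumeWorkflow());
            Become(Running);
            Sender.Tell("resumed");
        });
        Receive<CancelWorkflow>(_ => Context.Stop(Self));
    }
}
```

The state switch via `Become` replaces the locking one would reach for on a single machine: while paused, the actor simply has no handler for normal progress messages.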
We dockerized all of our application parts and ran them in Kubernetes, which provided adaptive load balancing, i.e. cluster elasticity. This was not without challenges, as we had to make all our actors persistent and resilient to both cluster expansion and contraction. At any time they had to be able to terminate on one machine and re-emerge with the same restored state on another. This was done via Akka cluster sharding, but we couldn't solve the headless browser part this way, as it was simply impossible to save and restore its state. Thus we had to implement the browser part as a separate microservice with the ability to scale out to new machines via regular Akka cluster routing, but without the contraction, which had to be controlled manually.
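On the Kubernetes side, elasticity of this kind is typically expressed declaratively. A minimal sketch using a standard `HorizontalPodAutoscaler` - the resource names and thresholds here are hypothetical, not our actual deployment configuration:

```yaml
# Illustrative sketch: names and thresholds are hypothetical
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: crawler-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: crawler-worker      # the dockerized worker node of the Akka cluster
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

With cluster sharding in place, pods added or removed by the autoscaler simply join or leave the Akka cluster, and the sharded actors are redistributed with their state restored.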
The client also had a number of physical machines available, which were still running legacy crawling apps but were underused. We were able to run a direct, static cluster environment on them too. There was no cluster elasticity in this case, and the work distribution was done via the adaptive load balancing provided by the Akka cluster metrics module.