One of the main challenges users face when adopting an on-premises solution is integrating it into their infrastructure. The days when EJBs and application servers ruled the world are gone, and organizations have bet on virtualization. Virtual machines offer convenient features like isolation and replication, but they come with a critical drawback: performance. Docker has emerged as a serious alternative to virtual machines, and dockerized applications are the new EJBs. It is not uncommon to find dockerized services and processes in a company’s infrastructure. This raises a question: what about dockerized text analytics?
The problem with virtual machines
By definition, a virtual machine runs a complete stack of virtualized hardware and operating system. It takes a powerful host machine to run a large number of virtual machines smoothly. Organizations often find themselves forced to invest in powerful servers to run solutions that are not, in fact, especially hardware-demanding.
In recent years, an alternative approach called containers has been widely adopted. In short, a container is an isolated file system with its own processes, users, and network interfaces, but without any virtualized hardware: it shares the kernel of the host operating system.
Dockerized text analytics with MeaningCloud
MeaningCloud runs seamlessly in Docker containers, which makes it convenient to deploy in infrastructures that already rely on Docker. It also benefits from several appealing properties inherited from the internal design of Docker.
One of the most exciting features of dockerized applications is that their installation is fully automated by definition, unlike virtual machines, which have to be stored on disk byte by byte or provisioned with a third-party tool like Ansible. Accordingly, MeaningCloud installs all its dependencies automatically and configures the web server without user intervention.
Because containers are ephemeral, any files that must persist have to be kept in external databases and storage directories. MeaningCloud stores all its data in a MySQL database and a shared volume, allowing users to back up and restore everything. Users have full access to and control over their data.
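As an illustration of the backup side, here is a minimal Python sketch that archives a shared volume directory into a compressed tar file. The function name and the paths are hypothetical, and the sketch covers only the volume; the MySQL database would need its own dump with a tool such as mysqldump. It is not MeaningCloud's own backup procedure.

```python
import tarfile
from pathlib import Path


def archive_volume(volume_dir: str, archive_path: str) -> int:
    """Write volume_dir into a gzip-compressed tar archive.

    Returns the number of regular files stored in the archive.
    Both arguments are hypothetical paths chosen by the operator;
    archive_path should live outside volume_dir.
    """
    root = Path(volume_dir)
    if not root.is_dir():
        raise NotADirectoryError(volume_dir)

    # Store everything under the directory's own name so that
    # extracting the archive recreates a single top-level folder.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(root), arcname=root.name)

    # Re-open the archive to count what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())
```

Restoring would be the reverse operation: extracting the archive into an empty directory and mounting that directory as the shared volume.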
As with virtual machines, containers can scale up and down to adapt to incoming traffic. Docker does a great job of caching and reusing filesystem layers between running containers. Thanks to this, many instances of a single MeaningCloud container can run on a single host without any performance penalty.
As a result, MeaningCloud offers a strong dockerized solution for on-premises text analytics: it provides all the available APIs, adapts the deployment to security and privacy requirements, and reduces hardware and operating costs.
A dockerized text analytics use case
The most relevant cloud providers offer a solution to run containerized applications in a private cloud. Two well-known examples are Amazon ECS and Google GCE.
In this use case, a huge number of documents had to be processed, and Amazon ECS was chosen as the service provider. The following figure depicts the solution architecture:
The solution used a 5-node ECS cluster, with each node mounting an EFS volume containing the documents. The cluster allowed up to 20 MeaningCloud containers to be deployed, along with an additional monitoring application. The MeaningCloud containers also included a script that implemented a naive distributed dispatching algorithm.
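The dispatching script itself is not shown here, so the following Python sketch is only one plausible way to build a naive distributed dispatcher, not the one used in this deployment. It assumes that every container knows its own index and the total number of containers (both hypothetical parameters), and that all containers see the same document names on the shared volume. Each container hashes every document name and keeps only the documents that map to its own index, so no coordination between containers is needed.

```python
import hashlib


def owner_index(doc_name: str, total_workers: int) -> int:
    """Map a document name to the index of the worker responsible for it."""
    # A stable hash is required: Python's built-in hash() is salted
    # per process, so different containers would disagree.
    digest = hashlib.sha256(doc_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_workers


def documents_for_worker(doc_names, worker_index: int, total_workers: int):
    """Return, in sorted order, the documents this worker should process."""
    if not 0 <= worker_index < total_workers:
        raise ValueError("worker_index must be in [0, total_workers)")
    return [
        name
        for name in sorted(doc_names)
        if owner_index(name, total_workers) == worker_index
    ]


if __name__ == "__main__":
    # Hypothetical file names standing in for documents on the shared volume.
    docs = [f"doc-{i}.txt" for i in range(10)]
    for index in range(3):
        print(index, documents_for_worker(docs, index, 3))
```

The scheme is naive in the sense that it does not balance by document size and does not reassign work when the number of containers changes; it only guarantees that every document is picked up by exactly one container.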
The following figure shows a monitoring panel of the solution:
The panels at the top show the number of documents processed, and the ones at the bottom show resource consumption by container and node. Together they show how the processing power is distributed among the containers and how resource usage reacts to the system load.
Other users who want to run MeaningCloud on-premises can take advantage of docker-compose or Docker Swarm to launch and scale containers in a similar way to Amazon ECS.
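As a rough sketch of what scaling with these tools could look like, the Python helpers below build the command lines for `docker-compose up --scale` and `docker service scale`. The service name `meaningcloud` and the compose file name are assumptions made for illustration; the actual names depend on how the deployment is defined.

```python
import shlex


def compose_scale_command(service: str, replicas: int,
                          compose_file: str = "docker-compose.yml"):
    """Build a docker-compose command that runs `replicas` containers of a service."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "docker-compose", "-f", compose_file,
        "up", "-d", "--scale", f"{service}={replicas}",
    ]


def swarm_scale_command(service: str, replicas: int):
    """Build the equivalent Docker Swarm command for an existing service."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return ["docker", "service", "scale", f"{service}={replicas}"]


if __name__ == "__main__":
    # "meaningcloud" is a hypothetical service name.
    print(shlex.join(compose_scale_command("meaningcloud", 20)))
    print(shlex.join(swarm_scale_command("meaningcloud", 20)))
```

The returned lists can be passed to `subprocess.run(command, check=True)` on a host where Docker is installed; building the command separately keeps the scaling decision easy to inspect before anything is executed.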
Summing up
The most disruptive aspect of Docker is its resource efficiency. Docker containers install automatically and scale out of the box. MeaningCloud runs seamlessly in Docker, avoiding the cost of manually deploying and configuring on-premises text analytics solutions, and offering the ability to scale replicas up and down to adapt to incoming traffic and load. Such solutions are ideal in security, legal, compliance, or health scenarios, where data privacy is a major concern, or whenever large-volume and low-latency requirements make SaaS alternatives undesirable.