How Algolia engineers manage billions of API calls per month without losing sleep

With more than 3,000 customers and 120 billion API calls each month, reliability is a must at cloud-based search company Algolia. We spoke with Anthony Seure, a Site Reliability Engineer at Algolia, about how sleep was the main constraint when the processing pipeline was redesigned. By keeping the system running, the Site Reliability Engineers, or Sleep Reinforcement Engineers (SREs) as they preferred to be known, made sure the whole system was highly reliable. As much as the team would have liked to call forth Harry Potter's magic, it relied instead on crafty use of multiple cloud services and the removal of single points of failure.

What were some of the early decisions you needed to make around hosting and cloud to be able to scale Algolia?

We really wanted provider independence up front, so data is replicated across more than one country by default and we have relays between multiple DNS providers, so there is no single point of failure. It is difficult to be consistent across cloud providers, and we are contracting both hardware and software for our service. For example, we are using Cloudflare for caching for external services. We use a custom Python tool to abstract all the products we have, so when we need to provision something new we can do it in a transparent way.
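Algolia's internal tool isn't public, but a minimal sketch of what such a provider-abstraction layer might look like in Python is below. The class names and the `provision_server` interface are hypothetical, and the real vendor API calls are omitted.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Common interface so the rest of the tooling never depends on one vendor."""

    @abstractmethod
    def provision_server(self, region: str, spec: dict) -> str:
        """Create a machine matching `spec` and return its identifier."""


class LeaseWebProvider(Provider):
    def provision_server(self, region: str, spec: dict) -> str:
        # The real call to LeaseWeb's API is omitted here.
        return f"leaseweb-{region}-{spec['size']}"


class SecondProvider(Provider):
    def provision_server(self, region: str, spec: dict) -> str:
        # A second, hypothetical vendor, so no single provider is a point of failure.
        return f"second-{region}-{spec['size']}"


def provision_everywhere(providers: list[Provider], regions: list[str],
                         spec: dict) -> list[str]:
    """Provision the same spec on every provider in every region, keeping
    capacity replicated across more than one country by default."""
    return [p.provision_server(r, spec) for p in providers for r in regions]


hosts = provision_everywhere([LeaseWebProvider(), SecondProvider()],
                             ["eu-nl", "us-east"], {"size": "large"})
print(hosts)
```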

How do you manage the process of changing one type of technology or service while keeping Algolia running? What are some SRE tips?

The most important thing is deployment and what's happening before and after. We do simple rolling redeployments by ranges of machines: we deploy to one range, then another, and another. This way we can scale easily.

A tip is to not rush the deployment, and to always have monitoring on top of what you are deploying, as it is the main way of knowing whether what you are doing is correct. Be able to roll back fast in case there is a major issue. You will also need to fetch libraries easily when needed, so we are using Chef for system management.
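Putting those tips together, a rolling deployment by ranges of machines might look like the Python sketch below. `deploy`, `rollback`, and `healthy` are hypothetical stubs standing in for a Chef run and a real monitoring query; this is an illustration of the pattern, not Algolia's actual tooling.

```python
import time


def deploy(host: str, version: str) -> None:
    """Push the new version to one machine (in reality a Chef run or similar)."""
    print(f"deploying {version} to {host}")


def rollback(host: str) -> None:
    """Revert one machine to the previous version, fast."""
    print(f"rolling back {host}")


def healthy(host: str) -> bool:
    """Stand-in for a monitoring query (error rates, latency) on the host."""
    return True


def rolling_deploy(hosts: list[str], version: str,
                   batch_size: int = 3, settle_seconds: int = 60) -> None:
    """Deploy range by range; halt and roll back the current range as
    soon as monitoring reports a problem."""
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy(host, version)
        time.sleep(settle_seconds)  # give monitoring time to surface regressions
        if not all(healthy(h) for h in batch):
            for host in batch:
                rollback(host)
            raise RuntimeError(f"deployment halted at {batch}")


rolling_deploy([f"host-{n}" for n in range(9)], "v2.4.1", settle_seconds=1)
```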

How is the process of generating and managing log data changing? What needs to happen to make it less of a burden on companies that do lots of data processing?

Companies have to put more logging tools in place, as the quantity of logs to process is directly linked to the growth of the company. We could process this with one machine at first, but it gets to a stage where you can't process the logs any more. Even if it's not the main product, logs are the only way of knowing what is happening in your systems.
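A back-of-the-envelope calculation shows why a single machine stops being enough. At the 120 billion API calls a month quoted above, even a single log line per call works out to tens of thousands of lines per second, sustained:

```python
calls_per_month = 120_000_000_000      # figure from the interview
seconds_per_month = 30 * 24 * 60 * 60  # ~2.59 million seconds

lines_per_second = calls_per_month / seconds_per_month
print(f"{lines_per_second:,.0f} log lines/second")  # ~46,296 lines/second
```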

You can use a SaaS company to do log processing, but you should consider building something yourself past a certain volume, or if you also need real-time processing. A lot of SaaS companies do a very good job at real-time processing for most volumes. We investigated SaaS, but it was not feasible for the amount of logs we have.

Also, with the data coming out of our logs, we are the only ones who know how to extract value from it. Our metrics are really specific, and extracting information from them takes domain knowledge. We need the logs generated by our search engine, and our analytics relies on all of those logs. The ability to merge the two types of information and extract intelligence from them can be complex.
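As a rough illustration (not Algolia's actual pipeline), merging the two streams might amount to joining them on a shared query identifier; the field names below are hypothetical.

```python
def merge_logs(search_logs: list[dict], analytics_logs: list[dict]) -> list[dict]:
    """Join search-engine logs with analytics logs on a shared query id,
    so each record carries both engine timing and user behaviour."""
    by_query = {entry["query_id"]: entry for entry in analytics_logs}
    merged = []
    for entry in search_logs:
        analytics = by_query.get(entry["query_id"], {})
        merged.append({**entry, **analytics})
    return merged


search = [{"query_id": "q1", "latency_ms": 12, "index": "products"}]
analytics = [{"query_id": "q1", "clicked_position": 3}]
print(merge_logs(search, analytics))
```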

You are using a combination of container technology and multiple cloud services. How is this hybrid cloud going and what challenges have you had to deal with?

The biggest pain point is having to deal with adjacent technology that is difficult to get support for, and dealing with young technology can be difficult. Even if the cloud products are good and working, the APIs provided might not be so good. When you switch from one version to another you get breaking changes, and you are relying on all of them at once. For example, if you want to perform a security upgrade you have to make sure everything keeps working and nothing breaks.
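One common safeguard against this kind of breakage, sketched below under assumed endpoints, is a smoke test that exercises a few known-good requests on the upgraded stack before any real traffic is routed to it. The host and paths here are hypothetical.

```python
import urllib.request


def smoke_test(base_url: str, paths: list[str]) -> bool:
    """Hit a few known-good endpoints on the upgraded stack and verify
    they still answer, before shifting real traffic to it."""
    for path in paths:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status != 200:
                    return False
        except OSError:  # covers URLError/HTTPError and network failures
            return False
    return True


# Hypothetical staging host and endpoints; roll back if this fails.
if not smoke_test("https://staging.example.com", ["/health", "/search?q=test"]):
    print("upgrade failed the smoke test; keep the old version")
```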

In the hybrid cloud you have to be very aware of all the limits and quotas the different services have. When running your own physical machines you are dealing with known limits, but when you're dealing with SaaS providers you have limits on transactions, CPU, bandwidth, and so on.
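In practice that means clients have to treat quota errors as routine. A minimal sketch of the usual pattern, exponential backoff with jitter, is below; `RateLimitError` is a stand-in for whatever error a given provider actually raises.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for whatever error a provider raises when a quota is hit."""


def call_with_backoff(request, max_retries: int = 5):
    """Retry a rate-limited call with exponential backoff plus jitter,
    the usual way to live within a SaaS provider's quotas."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter
    raise RuntimeError("quota still exceeded after retries")
```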

Have you, or are you looking to, publish the work you have done with open source code or design docs? What are the advantages of a collaborative approach?

We are open sourcing some of our code and helping the community. Our log processing is closely tied to what we are doing, and open sourcing it might be a good idea, but once you open source something you have to deal with the community and other problems that are not yours.

Regarding our design docs, we are looking to be transparent and publish articles on them. A collaborative approach here will benefit us through peer review and feedback. We also challenge them internally, as all our logs carry some value derived from customers.

What future plans do you have for Algolia’s infrastructure?

Currently, our Kubernetes cluster is running on a managed container engine, but it might make sense to run it on our own dedicated cluster in the future. It's unlikely we will use SaaS for log processing. As for the Algolia search engine, we are staying on bare metal with a number of providers, including LeaseWeb.