Foreword / Apologies
We pride ourselves on offering a world-class package management service that manages it all, so you don't have to. Because we are a high-criticality part of your pipelines and distribution processes, remaining up and available at all costs is part of our mission. It's our core reason for being. We distill and remove problems, and work extremely hard not to introduce them.
On Tuesday 4th February, we suffered a multi-phased outage that went from partial to total, eventually affecting downloads for all users. This is extremely unusual for us, and in this case we fell far short of our own high expectations for availability. Outages do happen, of course, but that makes them no less serious when they occur, and total outages in particular weigh heavily on myself and the rest of the team.
We feel that it is our responsibility to provide a detailed summary, via this post-mortem, of what happened, when, why, and what we'll do to prevent a recurrence. We want you to trust us, and as a company that operates as transparently as possible, we firmly believe that this is how we earn that trust from you.
No matter what, for all of our users that were affected, paying or not, please accept my heartfelt apologies on behalf of the team.
We'll do better, I promise.
Co-Founder & CTO @ Cloudsmith
Between 22:20 UTC and 23:42 UTC on Tuesday 4th February, all users of Cloudsmith experienced a partial and then major/total outage. The event was triggered by a huge spike in traffic which, combined with caching that was ineffectual for this scenario, led to a ripple effect: first crippling the production databases, then the frontend servers, the upload APIs, and finally the Content Distribution Network for downloads. The event was detected by monitoring and service reports. This is extremely unusual for us, as several layers of failsafes failed to prevent the incident from occurring. It was eventually mitigated by blocking the source of the traffic, bringing up additional database replicas, fixing an issue that prevented instances from recycling, and finally deploying fixes to reduce the pressure on the master database. The incident was classified as critical.
At 21:02 UTC on Tuesday 4th February, traffic from the problematic source began and increased at a high rate until the incident. The traffic originated from a diverse set of geographic locations. This huge volume of traffic generated additional background data for the system to process, which, combined with caching that was ineffectual for this scenario, put increasing pressure on the system.
The production databases couldn't keep up with the increased load, gradually falling behind in processing until they started to impact other parts of the service. The ineffectual caching meant that the databases were being hit far more often than normal, and as such they couldn't keep up with requests.
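To illustrate why effective caching matters so much here, the sketch below shows a read-through cache with a TTL and per-key "single flight" locking, so that a burst of simultaneous misses for one key results in a single backend query rather than a stampede against the database. This is purely illustrative and not Cloudsmith's actual implementation; the `fetch` callback and class name are hypothetical.

```python
import threading
import time

class SingleFlightCache:
    """Read-through cache with a TTL and per-key locking ("single flight"),
    so a burst of misses for one key triggers only one backend fetch
    instead of a stampede of identical database queries."""

    def __init__(self, fetch, ttl=30.0):
        self._fetch = fetch          # fallback loader, e.g. a database query
        self._ttl = ttl
        self._store = {}             # key -> (value, expires_at)
        self._locks = {}             # key -> lock serializing the fetch
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]          # fresh hit: the backend is not touched
        with self._lock_for(key):
            entry = self._store.get(key)   # re-check after waiting on the lock
            if entry and entry[1] > time.monotonic():
                return entry[0]
            value = self._fetch(key)       # only one caller per key reaches here
            self._store[key] = (value, time.monotonic() + self._ttl)
            return value
```

The key property is that repeated requests for the same hot key cost one database round-trip per TTL window, so request volume at the edge decouples from load on the database.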
For 1 hour and 18 minutes between 21:02 UTC and 22:20 UTC on Tuesday 4th February, some or all users experienced degraded performance. This resulted in requests to the site or downloads taking longer than usual.
For 39 minutes between 22:20 UTC and 22:59 UTC on Tuesday 4th February, some or all users experienced a partial outage. This resulted in some requests to the site or downloads intermittently failing to load.
For 43 minutes between 22:59 UTC and 23:42 UTC on Tuesday 4th February, all users experienced a major/total outage. This resulted in all requests to the site and downloads failing completely.
No data was lost during the incident.
The team was made aware of a potential incident at 22:02 UTC due to monitoring and service reports. Our observability and monitoring systems detected increased latency on the frontend and in the databases. Due to the unusual patterns of traffic, the incident was escalated to the full team for investigation. By 22:10 UTC the Cloudsmith team was investigating the incident.
The service was restored at 23:42 UTC by:
- Blocking all of the problematic traffic at the source.
- Bringing up additional database replicas to help process the backlog.
- Fixing an issue that prevented booting of the frontend instances.
- Bringing up new instances to replace the old ones that got killed.
- Deploying fixes to significantly reduce pressure on the databases.
A moment-by-moment timeline describes exactly how this unfolded:
- 21:02 UTC: The service was exposed to a huge spike of traffic from many distinct locations. Over the following hour, our services remained functional, but the requests created a flood of new logs and other records to be processed.
- 21:49 UTC: Our production master database hits 100% utilization as it attempts to process backlogs related to background data processing, as a side-effect of the huge number of requests.
- 21:54 UTC: Our production replica databases hit 100% utilization as the backlog from the writes starts to impact reads.
- 22:02 UTC: Alerts are generated due to the write and then read latency over the past 15 minutes.
- 22:10 UTC: The Cloudsmith team starts to investigate the reason for the high utilization of the databases. It's commonplace to see brief spikes, but not accompanied by this kind of slowness.
- 22:20 UTC: The first reports of the site and downloads for the site being slow are received, as users start to see 504 Gateway Timeout errors due to the service not being able to respond in time.
- 22:28 UTC: The status page incident is created as "Partial Outage" and users are informed that we're investigating; at this point, it's understood to be a degraded database, but we're trying to determine why.
- 22:30 UTC: Frontend instances start to cycle due to failing health checks, because of the extreme latency with the databases.
- 22:36 UTC: More reports arrive of users encountering 502 Bad Gateway errors, due to an ever-smaller number of instances servicing requests.
- 22:38 UTC: Additional database replicas are brought online to help deal with the backlog while the source of traffic is identified. These don't come online until much later, as a side-effect of the master database being so write-locked.
- 22:40 UTC: The team discovers that a massive amount of traffic has been received and seems to correlate to the incident. The traffic shape is huge and irregular for the time of day and is also being sourced from many distinct geographic locations.
- 22:53 UTC: Blocks are put into place to drop the traffic. This is an extremely harsh measure that we wouldn't normally consider, but given the gravity of the situation, an executive decision was taken to block the traffic until we could recover and work out how to handle it gracefully.
- 22:54 UTC: Almost immediately, the traffic spikes halt as the problematic traffic is eliminated and the shape returns to normal; however, the databases remain at 100% due to backlogs.
- 22:59 UTC: The site is reported as completely down. New instances are not able to come online due to the databases being inaccessible, so this ultimately results in a total outage of the frontend servers that serve users, the API, and downloads.
- 23:02 UTC: The status page incident is updated to "Identified" and the components are flagged as "Major Outage". The team investigates how to bring the master database back to nominal levels, alongside trying to bring the frontend instances back online to restore functionality.
- 23:17 UTC: The source of the issue preventing the frontend instances from coming back online is discovered: beyond the load on the databases, the system had never run with zero instances live, and there was an issue with "bootstrapping" from no instances.
- 23:24 UTC: A fix is deployed to restore the bootstrapping process for the instances, and the team starts a new cycle of instances.
- 23:41 UTC: The load on the replica databases begins to drop as the pressure from the traffic spike has been relieved. The new instances start to come up and start serving traffic immediately.
- 23:42 UTC: The site is reported as being back up and operational. The performance is still degraded but improves as more instances come online.
- 23:47 UTC: The new replicas for the database are finally online, and also greatly relieve pressure on the original database instances. The performance of the website returns to "normal" latency.
- 23:59 UTC: The team continues to investigate how to fix the write and read latencies on the master database. Without fixing it, there's a risk of the replicas falling behind again.
- 00:22 UTC: The status page is updated to "Monitoring" as the components are updated to "Degraded Performance". We continue to keep an extremely close eye on the system and performance.
- 00:54 UTC: Several fixes have been identified to help restore the database completely, and these are deployed.
- 01:18 UTC: Pressure begins to reduce on the master database as the fixes take effect.
- 02:01 UTC: All of the databases recover, and all of the backlogs completely clear. The team begins to examine all aspects of the service to ensure that it is fully operational.
- 02:29 UTC: A miniature pre-emptive post-mortem is held, along with various notes, to contribute to the full post-mortem the next day.
- 02:53 UTC: The status page is updated to "Resolved" and all components are set back to "Fully Operational". The incident has been resolved.
Ultimately, the trigger was the huge spike in traffic from a diverse range of geographic locations. Its effect was a massive net increase in the amount of background data processed by the system, exacerbated by caching that was ineffectual in this scenario, and this was the root cause of the failures in the databases. The failure rippled to the frontend servers, which serve the UI and APIs, as their requests to the databases began to fail. This in turn impacted downloads and uploads, finally bringing them to a halt when the frontend servers failed completely.
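One link in the cascade above was frontend instances being recycled because their health checks failed when the shared database became slow, which took the whole fleet down at once. A common mitigation, sketched here purely for illustration (the class and fields are hypothetical, not Cloudsmith's configuration), is to separate "liveness" (restart me) from "readiness" (stop routing traffic to me):

```python
class Instance:
    """Toy model of a frontend server, for the health-check sketch only."""
    def __init__(self, process_ok=True, database_ok=True):
        self.process_ok = process_ok      # is the server process itself healthy?
        self.database_ok = database_ok    # is the shared database reachable/fast?

def liveness(inst):
    # Restart-worthy only when the process itself is wedged; a slow shared
    # dependency should never cause the entire fleet to be recycled at once.
    return inst.process_ok

def readiness(inst):
    # Take the instance out of rotation while the database is degraded,
    # but leave it running so the fleet never drops to zero instances.
    return inst.process_ok and inst.database_ok
```

With this separation, a database slowdown drains traffic from instances rather than destroying them, avoiding the "bootstrapping from no instances" edge case encountered during the incident.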
We learned the following lessons during this incident:
- Despite routinely processing many multiples of the problematic traffic volume per day, the shape and velocity of this traffic caused issues, exacerbated by caching that was ineffectual for this scenario. As such, we need better detection of unusual traffic that gets served, especially when it impacts service latencies.
- The rippling failures led to some very specific edge-case scenarios that hadn't been frequently tested, such as destroying the entire server farm and starting from scratch. As such, we need additional types of resilience testing (e.g. chaos engineering).
- Having a master database that was pegged/hot led to a greatly increased time before new replicas could be provisioned. As such, we need to investigate alternative methods of bringing up additional replicas (e.g. scaling ahead of time).
We'll be taking the following actions to ensure this doesn't happen again:
- The issue with ineffectual caching has already been fixed.
- The issue with background data processing causing backlogs has also been fixed.
- Monitoring for unusual traffic shapes will be added.
- Monitoring for backlogs at the database will be added.
- Additional pre-emptive alerting will be added for these.
- Circuit-breakers and rate-limiting at the CDN will be added.
- Additional dedicated channels to service downloads will be added.
These are prioritized from top to bottom, but we believe the fixes implemented during and since the incident already provide much greater resilience against traffic spikes of this nature.
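To give a sense of the rate-limiting planned for the CDN edge, a classic shape for it is the token bucket: requests drain tokens, tokens refill at a fixed rate, and traffic beyond the configured burst is rejected early, before it can reach the databases. This is a minimal sketch of the general technique, not our production configuration; the rates and the class name are illustrative only.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows sustained traffic at `rate`
    requests/second with short bursts up to `burst`, and rejects the rest
    at the edge before it can put pressure on backend systems."""

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum bucket size
        self.tokens = float(burst)   # start full: an initial burst is allowed
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # caller would respond with e.g. HTTP 429
```

A circuit-breaker composes naturally with this: when backend error rates cross a threshold, the edge stops forwarding requests entirely for a cool-down period, rather than merely throttling them.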
You've reached the end of this post-mortem. Thank you for staying with us, and believing in Cloudsmith. Let's get back to providing the best service that we can for you.