On Monday, 30 October at 8:38am EDT / 12:38 UTC we suffered a significant disruption to our primary Questions API cluster. From then until 10:28am EDT / 14:28 UTC, the majority of customer traffic displaying or saving questions was failing.
Learnosity’s systems are designed to handle failure scenarios, recover automatically, and scale up to meet increased load. In this case, however, a cascading failure we had not anticipated brought down a core system. We have completed a full root cause analysis and made the changes necessary to ensure it cannot happen again.
As traffic ramped up for the day, some of our internal database proxy systems initiated scale-up events in response to the increasing load. As part of the scale-up process, the pre-baked AMIs (Amazon Machine Images) perform a system check to ensure they have all current security patches before being marked available for service. This check failed: some of the required package updates were not available (more on this below). The newly started instances marked themselves as unhealthy and terminated, creating a failure loop in which each replacement machine hit the same error and failed in the same way.
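The failure loop can be sketched as follows. This is a simplified illustration, not our actual boot tooling, and the package names are hypothetical:

```python
# Hypothetical sketch of the boot-time check described above. An
# instance only marks itself in-service if every required security
# patch is present in the package repository; otherwise it terminates,
# and autoscaling launches a replacement that hits the same error --
# the failure loop.

REQUIRED_PATCHES = {"openssl-1.0.2k", "kernel-4.9.58"}  # illustrative names

def boot_check(repo_packages: set) -> str:
    """Return the instance state after the boot-time package check."""
    missing = REQUIRED_PATCHES - repo_packages
    if missing:
        # Required updates unavailable: mark unhealthy and terminate.
        return "terminated"
    return "in-service"

# A complete mirror lets new instances come into service...
assert boot_check({"openssl-1.0.2k", "kernel-4.9.58", "curl-7.53"}) == "in-service"
# ...but a partial mirror terminates every new instance it launches.
assert boot_check({"curl-7.53"}) == "terminated"
```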
Our systems do not rely on third-party repositories for any packages used in our production environment. Instead, we maintain a local replica of every repository our systems need to operate, deployed as a package repository mirror on S3. This safeguards our systems against failures in those third-party repositories. The S3 mirror is updated automatically on a regular schedule, but earlier in the day one of these updates had failed: a cache disk volume had not mounted correctly on the machine that handles the mirror sync. Because of the failed mount, that machine held only a partial copy of the repository, which it then synced to our S3 mirror. This in turn deleted a number of packages from the mirror, and those missing packages were what caused our machines to fail to start.
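The deletion happens because a mirror sync makes the destination an exact copy of the source, removing anything absent locally (the same semantics as `aws s3 sync --delete`). A minimal sketch of that behaviour, with hypothetical package names:

```python
def sync_mirror(local: dict, remote: dict) -> None:
    """Make `remote` an exact copy of `local`, deleting anything that
    no longer exists locally (mirror-with-delete semantics)."""
    for name in list(remote):
        if name not in local:
            del remote[name]  # package deleted from the remote mirror
    remote.update(local)

# The cache volume failed to mount, so the sync host saw only a
# partial local copy of the repository...
partial_local = {"curl-7.53": b"..."}
s3_mirror = {
    "curl-7.53": b"...",
    "openssl-1.0.2k": b"...",
    "kernel-4.9.58": b"...",
}

sync_mirror(partial_local, s3_mirror)
# ...and the packages "missing" locally were deleted from the S3 mirror.
assert set(s3_mirror) == {"curl-7.53"}
```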
Once the cause was identified, our team resynchronised the local mirror, which allowed the machines to boot successfully. Due to the volume of packages to resync, this took almost 30 minutes. The systems recovered approximately 5 minutes after the resync completed, as the additional capacity came online.
We are making several operational changes as a result of this event to ensure it cannot happen again. We have added safety checks to the sync script, and we are implementing a process to run a full AMI build after any update to the mirror, so that machines no longer depend on our local mirror for their packages at boot time.
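One form such a safety check could take (an assumption on our part, not the exact script): refuse to push any sync that would delete more than a small fraction of the mirror, which would have caught the partial local copy in this incident.

```python
def safe_to_sync(local_count: int, remote_count: int,
                 max_delete_fraction: float = 0.01) -> bool:
    """Hypothetical pre-sync guard: abort if the sync would delete
    more than `max_delete_fraction` of the packages currently in the
    remote mirror."""
    if remote_count == 0:
        return True  # empty mirror: nothing can be deleted
    deletions = max(0, remote_count - local_count)
    return deletions / remote_count <= max_delete_fraction

# A routine update removing a handful of obsolete packages proceeds...
assert safe_to_sync(local_count=9990, remote_count=10000)
# ...but a partial local copy, as in this incident, aborts the sync.
assert not safe_to_sync(local_count=1200, remote_count=10000)
```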
Finally, I want to apologize for the impact this event had on our customers. While we are proud of our long track record of availability with our APIs, we know how critical this service is to our customers, their applications and end users, and their businesses.
Mark Lynch
CTO