On 2023-12-13, Learnosity suffered a major outage that began at 13:22 UTC and was resolved at 18:29 UTC, lasting 5 hours and 7 minutes. Its primary impact was on the assessment stack in the AMER region, with partial impact to authoring and select analytics APIs due to the use of assessment functionality in some preview features.
Initially, we thought this incident was a scaling issue because errors were linked to the creation of new virtual servers to meet daily traffic. However, we ultimately discovered that the problem was unrelated to scaling. The root cause was a faulty release of our configuration management client, salt v3006.5. A bug in this version, the most recent SaltStack release (published on 2023-12-12 at 21:38), caused a failure in our cloud-init process when spawning new EC2 instances. From 13:05 UTC, as new instances were launched, the salt bug caused the new machines to fail a health check and fall into a loop of retried launches.
Resolution
Immediately after this discovery, we pinned salt to a prior version, removing the regression. New EC2 instances were successfully provisioned and scaled to comfortably handle all traffic. All APIs returned to normal operations soon after.
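The difference between an unpinned and a pinned install can be sketched as follows. This is only an illustration of the idea, not our provisioning code: the function names and the example version list are hypothetical, and the version we actually pinned to is not stated in this report.

```python
from typing import Optional, Sequence, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '3006.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def resolve_version(available: Sequence[str], pinned: Optional[str]) -> str:
    """Choose which version to install (hypothetical helper).

    Unpinned: the newest available version wins, so a fresh upstream
    release is picked up automatically.
    Pinned: exactly the pinned version is used, or an error is raised
    if it is not available.
    """
    if pinned is None:
        return max(available, key=parse_version)
    if pinned not in available:
        raise LookupError(f"pinned version {pinned} is not available")
    return pinned


# Example version list; "3006.4" as the pin is an assumption for illustration.
versions = ["3006.3", "3006.4", "3006.5"]
unpinned_choice = resolve_version(versions, None)
pinned_choice = resolve_version(versions, "3006.4")
```

With no pin, the newest release is selected as soon as it is published; with a pin, a new upstream release has no effect until the pin is changed deliberately.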
Additional Analysis and Prevention
Upon further investigation, we discovered that our system became vulnerable to the salt regression due to error handling dependencies. To facilitate fast and reliable scale up, our image build process is designed so that dependencies are pre-installed at build time, with only minor applicable config changes required at instantiation. We found that a prior fix for a build dependency issue had inadvertently moved the installation of the affected salt package from image build time to instance launch time. This meant that the creation of new instances no longer relied on the salt version baked into the original image; instead, the newest salt version was installed during the scaling process.
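One way to detect this kind of drift is to record package versions when the image is built and compare them with what is present at instance launch. The sketch below assumes a hypothetical manifest of name-to-version pairs and a hypothetical `find_drift` helper; it does not describe our actual tooling, and the versions in the example are illustrative.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    baked: Dict[str, str], at_launch: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (baked_version, launch_version)} for every package
    whose launch-time version differs from the one recorded at build time.

    A package missing at launch is reported with None as its launch version.
    """
    drift = {}
    for name, baked_version in baked.items():
        launch_version = at_launch.get(name)
        if launch_version != baked_version:
            drift[name] = (baked_version, launch_version)
    return drift


# Hypothetical manifests: the image was built with one salt version,
# but a launch-time install pulled in a newer one.
image_manifest = {"salt": "3006.4"}
launch_state = {"salt": "3006.5"}
drift_report = find_drift(image_manifest, launch_state)
```

A non-empty result at launch indicates that a package was installed or upgraded outside the image build, which is the condition that exposed new instances to the regression.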
We are making the following changes as a result of this operational event, to prevent this from happening again.