Issues loading assessment content in AMER region
Incident Report for Learnosity
Postmortem

On 2023-12-13, Learnosity suffered a major outage that began at 13:22 UTC and was resolved at 18:29 UTC, lasting 5 hours and 7 minutes. Its primary impact was on the assessment stack in the AMER region, with partial impact to authoring and select analytics APIs due to the use of assessment functionality in some preview features.
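
As a quick check of the stated duration, the arithmetic can be reproduced from the two UTC timestamps above. This is only an illustrative sketch; the variable names are ours and are not part of any Learnosity tooling.

```python
from datetime import datetime

# Start and end of the outage in UTC, as stated in the report.
start = datetime(2023, 12, 13, 13, 22)
end = datetime(2023, 12, 13, 18, 29)

# Split the elapsed time into whole hours and leftover minutes.
total_seconds = int((end - start).total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60

print(f"{hours} hours {minutes} minutes")
```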

Initially, we suspected a scaling issue because the errors coincided with the creation of new virtual servers to meet daily traffic. However, we ultimately discovered that the problem was unrelated to scaling. The root cause was a faulty configuration management client, salt v3006.5. A bug in this version, the most recent Salt release, published on 2023-12-12 at 21:38, caused a failure in our cloud-init process when spawning new EC2 instances. From 13:05 UTC, as new instances were launched, the salt bug caused the new machines to fail a health check and fall into a loop of retried launches.

Resolution

Immediately after this discovery, we pinned salt to a prior version, removing the regression. New EC2 instances were then provisioned successfully and scaled to comfortably handle all traffic. All APIs returned to normal operation soon after.

Additional Analysis and Prevention

Upon further investigation, we discovered that our system had become vulnerable to the salt regression because of a launch-time dependency. To facilitate fast and reliable scale-up, our image build process is designed so that dependencies are pre-installed at build time, with only minor config changes required at instantiation. However, a prior fix for a build dependency issue had inadvertently moved the installation of the affected salt package from image build time to instance launch time. As a result, new instances no longer relied on the salt version baked into the image; instead, the newest salt version was installed during the scaling process.
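
The difference between the two behaviours described above can be illustrated with a minimal sketch. It assumes a hypothetical resolver, `resolve_version`, that picks a package version from a list of available releases; it is not Learnosity's build tooling or Salt's installer, and every version string other than 3006.5 is purely illustrative.

```python
from typing import Optional, Sequence, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '3006.5' into a sortable tuple."""
    return tuple(int(part) for part in version.split("."))


def resolve_version(available: Sequence[str], pin: Optional[str] = None) -> str:
    """Hypothetical resolver: return the pinned version if one is given,
    otherwise whatever release is newest at the moment of installation."""
    if pin is not None:
        if pin not in available:
            raise ValueError(f"pinned version {pin} is not available")
        return pin
    return max(available, key=version_key)


# Releases visible at launch time (illustrative list).
releases = ["3006.3", "3006.4", "3006.5"]

# Unpinned install at instance launch: picks up the newest release.
print(resolve_version(releases))

# Pinned install: unaffected by a new upstream release.
print(resolve_version(releases, pin="3006.4"))
```

With no pin, the result changes as soon as a new release is published, which is the exposure described above. With a pin, or with the package baked into the image, it does not.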

We are making the following changes as a result of this operational event, to prevent this from happening again.

  • We are conducting a full review of the launch process to ensure there are no other unknown launch-time dependencies.
  • We are adding guard rails to the build process to catch any dependency regression at the imaging phase, before production use.
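
One possible shape for such a guard rail is sketched below, under stated assumptions: the build produces a manifest of packages baked into the image, and a list of required pins is maintained alongside it. The function name, the manifest format, and the package versions are all hypothetical and do not describe Learnosity's actual checks.

```python
from typing import Dict, List


def find_dependency_regressions(
    required_pins: Dict[str, str], image_packages: Dict[str, str]
) -> List[str]:
    """Return one message per required package that is missing from the
    image or present at a version other than the pinned one."""
    problems = []
    for name, pinned in sorted(required_pins.items()):
        installed = image_packages.get(name)
        if installed is None:
            # Not in the image, so it would have to be installed at launch.
            problems.append(f"{name}: not baked into image")
        elif installed != pinned:
            problems.append(f"{name}: image has {installed}, expected {pinned}")
    return problems


# Illustrative data: salt is required but absent from the image manifest.
required = {"salt": "3006.4"}
manifest = {"openssl": "3.0.2"}

for problem in find_dependency_regressions(required, manifest):
    print(problem)
```

A build pipeline could refuse to publish an image whenever the returned list is non-empty, so that a dependency moved to launch time is caught before production use.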
Posted Dec 14, 2023 - 04:49 EST

Resolved
As of 19:40 UTC, we've concluded one hour of active monitoring without additional issues. We will continue to keep an eye on everything but are now ready to call this incident resolved.

Learnosity Support and Systems Engineering teams will follow up with a postmortem once we have completed root cause analysis and finalized any next steps or preventative measures required.
Posted Dec 13, 2023 - 14:42 EST
Monitoring
As of 18:36 UTC, we've now recovered fully and all stacks are operational. We will continue to monitor for a period before resolving the incident.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 13:36 EST
Update
As of 18:20 UTC, we've verified a cause of the current incident and recovery is already partially complete. We have restored part of the assessment stack and are working on full restoration now, with the remaining stacks to follow. We will provide additional updates soon, including a possible ETA as soon as a reasonably accurate estimate has been determined.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 13:25 EST
Identified
We've now identified a possible cause for the current incident and are working to resolve the issue. We will share additional information as soon as it becomes available.

We've also added EMEA authoring and APAC authoring to affected regions for users of the hosted author site, which operates in the AMER region.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the current issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 12:16 EST
Update
Correction: The authoring and analytics stacks are also being impacted intermittently in workflows where the assessment stack is in play (e.g., editing views where the assessment stack is used for previews). Errors in the AMER region continue to reduce. 502 errors have cleared and 504 timeout errors are now being investigated. Elevating classifications to include all stacks and partial outage.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 10:40 EST
Investigating
As of 14:15 UTC, the assessment stack returned to normal operating metrics in the EMEA and APAC regions. As of 14:30 and 14:45 UTC, the AMER region error count dropped by approximately 15% each period, and error counts are continuing to reduce as of 15:00 UTC. Intermittent errors continue to occur in the assessment stack in AMER. Authoring and Analytics APIs remain unaffected.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 10:06 EST
Identified
As of 14:15 UTC, we've identified that the AWS Auto Scaling Groups failed to create additional EC2 instances causing an elevated number of 502 errors in the assessment stack.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 09:37 EST
Investigating
As of 14:00 UTC, we are seeing intermittent issues with loading demo assessment content and are investigating out of an abundance of caution.

Learnosity Support and Systems Engineering teams are actively investigating the issue, and will follow on with an update and resolution as soon as possible.
Posted Dec 13, 2023 - 09:06 EST
This incident affected: AMER || Analytics (Loading and rendering of reports), AMER || Authoring (Loading and rendering of Item/Activity Edit view), APAC || Authoring (Loading and rendering of Item/Activity Edit view), AMER || Assessment (Loading and rendering of Items/Questions/Features), and EMEA || Authoring (Loading and rendering of Item/Activity Edit view).