Issue affecting rendering of Questions in VA
Incident Report for Learnosity
Postmortem

On Monday, 30 October at 8:38am EDT / 12:38 UTC we suffered a significant disruption to our primary Questions API cluster. From that time until 10:28am EDT / 14:28 UTC, the majority of customer traffic for displaying or saving questions was failing.

Learnosity’s systems are designed to recover automatically from failures and to scale up to meet increased load. In this case, however, a cascading failure that we had not anticipated took down a core system. We have completed a full root cause analysis and made the changes necessary to prevent this from happening again.

As traffic began to ramp up for the day, some of our internal database proxy systems initiated scale-up events in response to the increasing load. As part of the scale-up process, instances launched from our pre-baked AMIs (Amazon Machine Images) perform a system check to ensure they have all current security patches before they are marked available for service. This check failed: some of the required package updates were not available (more on this below). The newly started instances marked themselves as unhealthy and terminated, creating a failure loop in which each replacement instance hit the same error and terminated in turn.
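To make the failure mode concrete, here is a minimal sketch of this kind of boot-time patch check. It is illustrative rather than our production code: it assumes a yum-based image, and the update command and the use of the EC2 metadata and Auto Scaling health APIs are assumptions about a typical setup.

```python
#!/usr/bin/env python3
"""Illustrative boot-time patch check (not Learnosity's actual script).

If the security-update step fails -- as it did here, because packages
were missing from the repository mirror -- the instance reports itself
unhealthy, the Auto Scaling group replaces it, and the replacement
fails the same check: the failure loop described above.
"""
import subprocess
import urllib.request


def instance_id() -> str:
    # The EC2 instance metadata endpoint is available on every instance.
    with urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ) as resp:
        return resp.read().decode()


def main() -> None:
    # Apply pending security updates from the configured repository
    # (in this incident, the partially synced S3 mirror).
    result = subprocess.run(["yum", "-y", "update", "--security"])
    if result.returncode != 0:
        # Mark this instance unhealthy; the Auto Scaling group will
        # terminate it and launch a replacement, which then repeats
        # the same failing check.
        subprocess.run([
            "aws", "autoscaling", "set-instance-health",
            "--instance-id", instance_id(),
            "--health-status", "Unhealthy",
        ])


if __name__ == "__main__":
    main()
```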

Our systems do not rely on third-party repositories for any packages used in our production environment. Instead, we maintain a local replica of every repository our systems require, in the form of a package repository mirror deployed to S3. This safeguards our systems against failures in those third-party repositories. Our S3 mirror is updated automatically on a regular schedule, but earlier in the day one of these updates had failed: a cache disk volume failed to mount correctly on the machine that handles the syncing. As a result, that machine held only a partial copy of the repository, which it then synced to our S3 mirror, deleting a number of packages from the mirror in the process. These missing packages were what caused our new machines to fail to start.
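The sync step can be sketched as follows. This is a hypothetical reconstruction: the paths, bucket name, and script are illustrative, but they show how a sync run from an unmounted (and therefore nearly empty) cache directory, combined with a delete-on-sync option, removes packages from the mirror.

```python
#!/usr/bin/env python3
"""Illustrative mirror-sync step (paths and bucket are hypothetical).

`aws s3 sync --delete` removes objects from the destination that are
absent from the source. If the cache volume is not mounted, the mount
point is just an empty directory, and the sync deletes most of the
mirror -- the failure described above.
"""
import os
import subprocess
import sys

CACHE_DIR = "/mnt/repo-cache"            # cache volume that failed to mount
MIRROR_URI = "s3://example-repo-mirror"  # illustrative bucket name


def main() -> None:
    # The guard that was effectively missing: confirm the cache volume
    # is actually mounted, not just an empty directory on the root disk.
    if not os.path.ismount(CACHE_DIR):
        sys.exit(f"{CACHE_DIR} is not mounted; refusing to sync")

    subprocess.run(
        ["aws", "s3", "sync", CACHE_DIR, MIRROR_URI, "--delete"],
        check=True,
    )


if __name__ == "__main__":
    main()
```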

Once the cause was identified, our team resynchronised the local mirror, which allowed new machines to boot successfully. Due to the volume of packages involved, the resync took almost 30 minutes. The systems recovered approximately 5 minutes after it completed, as the additional capacity came online.

We are making several operational changes as a result of this event to ensure it cannot happen again. Additional safety checks have been added to the sync script, and we are implementing a process to run a full AMI build after any update to the mirror, so that newly launched machines do not depend on the mirror for their packages at boot time.
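As one example of the kind of safety check involved (the exact production checks may differ; the threshold and names below are illustrative), the sync can compare the local package count against the mirror and refuse to proceed if it would shrink the repository significantly:

```python
#!/usr/bin/env python3
"""Illustrative sanity check before syncing (threshold is hypothetical)."""
import subprocess
import sys
from pathlib import Path

CACHE_DIR = Path("/mnt/repo-cache")      # illustrative, as above
MIRROR_URI = "s3://example-repo-mirror"
MAX_SHRINKAGE = 0.01                     # refuse to drop >1% of packages


def local_count() -> int:
    # Count packages in the local repository copy.
    return sum(1 for _ in CACHE_DIR.rglob("*.rpm"))


def mirror_count() -> int:
    # Count package objects currently in the S3 mirror.
    listing = subprocess.run(
        ["aws", "s3", "ls", MIRROR_URI, "--recursive"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in listing.splitlines() if line.endswith(".rpm"))


def main() -> None:
    local, remote = local_count(), mirror_count()
    if local < remote * (1 - MAX_SHRINKAGE):
        sys.exit(
            f"local copy has {local} packages vs {remote} in the mirror; "
            "refusing to sync a partial repository"
        )
    subprocess.run(
        ["aws", "s3", "sync", str(CACHE_DIR), MIRROR_URI, "--delete"],
        check=True,
    )


if __name__ == "__main__":
    main()
```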

Finally, I want to apologize for the impact this event had on our customers. While we are proud of our long track record of API availability, we know how critical this service is to our customers, their applications and end users, and their businesses.

Mark Lynch, CTO

Posted Oct 31, 2017 - 10:03 EDT

Resolved
We can confirm that all services have returned to normal, with final errors seen at 14:28 UTC.

All assessments, authoring and reporting should be functioning normally.

Learnosity Support and Systems Engineering teams will follow up with a postmortem once we have completed root cause analysis and finalised any next steps or preventative measures required.

Please reach out if you have any questions or concerns.
Posted Oct 30, 2017 - 10:43 EDT
Update
As of 14:31 UTC, we have identified the core issue affecting the loading and rendering of questions in VA (Virginia) and have put a fix in place. Loading and rendering of questions is starting to return to normal as infrastructure with the fix is brought online. During this period there may still be some issues affecting the loading of questions; we will follow up once the issue is confirmed as fully resolved.

Learnosity Support and Systems Engineering teams will follow up with an update and resolution as soon as possible.
Posted Oct 30, 2017 - 10:35 EDT
Identified
As of 13:54 UTC, we have identified the core issue affecting the loading and rendering of questions in VA (Virginia) and are in the process of putting mitigation steps in place.

Learnosity Support and Systems Engineering teams will follow up with an update and resolution as soon as possible.
Posted Oct 30, 2017 - 09:55 EDT
Investigating
As of 12:50 UTC, we are experiencing an issue affecting the rendering of questions in VA (Virginia).

This issue relates to the loading of the Questions API JavaScript files, and currently affects the rendering of questions via the Questions API, Assess API and Items API.

Learnosity Support and Systems Engineering teams are actively investigating the issue, and will follow up with an update and resolution as soon as possible.
Posted Oct 30, 2017 - 09:18 EDT