We’ve completed our investigation and have determined the following:
- A database error, caused by a Student Response DB instance running an older version of our DB schema, resulted in an error condition for a number of our sync queue service workers.
- As a result, the affected sync queue service workers began to loop, generating a significantly higher than normal volume of requests for those worker types (illustrated in the sketch after this list).
- This temporarily starved the Items API Application Servers of connections to the database, causing timeouts for a portion of requests.
- As a result of the rejected and timed-out requests, the Cloudfront CDN instance in front of the Application Servers (and their Application Load Balancers) stopped sending requests for a period of time, which magnified the impact.
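To illustrate the looping behavior described above, the following is a simplified sketch only, not our actual worker code; the job handler and retry parameters are hypothetical. A worker that retries immediately on a persistent error, such as a schema mismatch, floods its backing services with requests, whereas a capped exponential backoff bounds the retry rate:

    import time

    def process_job(job):
        # Hypothetical job handler: against an outdated schema this call
        # fails every time, so a naive "retry immediately" loop turns one
        # job into an unbounded stream of database requests.
        raise RuntimeError("schema version mismatch")

    def run_worker(job, max_delay=60):
        # Capped exponential backoff keeps retrying but decays the request
        # rate, so a persistent error cannot starve other clients of
        # database connections.
        delay = 1
        while True:
            try:
                process_job(job)
                return
            except RuntimeError:
                time.sleep(delay)
                delay = min(delay * 2, max_delay)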
Mitigation & Resolution:
Based on the impacted infrastructure areas, we:
- Temporarily halted the service worker causing the resource starvation, allowing the Items API Application Servers to process requests normally.
- Temporarily removed the Cloudfront layer in front of our Application Servers to allow all traffic back through to our Application Servers.
- Refreshed our Cloudfront instance to allow it to function appropriately, and placed it back in front of our Application Servers when ready.
- Upgraded the impacted Student Response DB instance to the latest version of our DB schema.
Based on this event, we are improving or implementing the following safeguards:
- Prevent the error condition from recurring by reviewing all schema changes for this class of issue.
- Improve the automation of schema changes to mitigate human error.
- Add automated alerts for any DB instance running a schema version below our current one (a minimal monitoring sketch follows this list).
- Implement fixes that prioritize the front-end API servers over the backend workers to prevent resource starvation.
- Improve separation of traffic to limit impact of any similar event.
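As a rough sketch of the schema-version alerting referenced above (hypothetical only: it assumes a PostgreSQL instance with a schema_migrations table and a generic notify hook, neither of which is taken from our actual tooling):

    import psycopg2  # assumes a PostgreSQL instance; hypothetical example

    EXPECTED_SCHEMA_VERSION = 42  # hypothetical target version

    def check_schema_version(dsn):
        # Read the highest applied migration version from the instance.
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT MAX(version) FROM schema_migrations")
                (version,) = cur.fetchone()
        return version

    def alert_if_outdated(dsn, notify):
        version = check_schema_version(dsn)
        if version is None or version < EXPECTED_SCHEMA_VERSION:
            # notify() is a placeholder for whatever paging/alerting hook is in use.
            notify(f"DB schema at version {version}, expected {EXPECTED_SCHEMA_VERSION}")

A check like this could run on a schedule against each DB instance so that an out-of-date schema is flagged before it causes worker errors.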
Timeline:
17:15 UTC Errors began occurring on the ALB backends, with 17% of requests failing.
17:19 UTC Our operations team was alerted and began diagnosing the issue. Investigation initially centered on Cloudfront or a network issue, as the ALBs and the machines behind them were showing low CPU and scaling down from their peaks based on autoscaling rules. We transitioned traffic for the Items and Assess APIs away from Cloudfront, sending it directly to the ALBs, and service recovery began.
18:25 UTC Majority of service restored - 98.5% of requests succeeding.
18:50 UTC Additional machines were manually scaled up to ensure there were enough workers to cover the full load.
18:50 UTC Full service restored - 100% of requests succeeding.