Issue affecting authoring and specific kinds of assessments in U.S.-East-1
Incident Report for Learnosity
Postmortem

On 2024-04-30, Learnosity suffered a service interruption in the AMER (US-East-1) region that affected a subset of customers. The incident began at 12:52 UTC and was resolved at 14:18 UTC, lasting 1 hour and 26 minutes. We apologize for the impact that this had on our valued customers and learners. We have learned, and will continue to learn, from this incident and have taken steps to ensure this issue doesn’t repeat.

Details of Incident

Performance issues were initially detected by our Site Reliability Engineering team, who followed our escalation process, raising Tier1 and subsequently Tier2 in short order.

While assessment and authoring load were typical, we saw a massive spike in Data API activity from a single client. This activity was limited to the get_/itembank set of endpoints, which is typically used for authoring. The client had never used these Data API endpoints before, and went from zero requests to a volume equal to that of all other clients combined in a short period of time.
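As an illustration only, the pattern described above (one client accounting for as much traffic as all others combined) can be expressed as a share-of-requests check. This is a minimal sketch, not Learnosity's monitoring: the function name `dominant_clients`, the input shape (one client identifier per request), and the 50% threshold are all assumptions made for the example.

```python
from collections import Counter


def dominant_clients(request_log, threshold=0.5):
    """Return {client_id: share} for clients at or above `threshold`.

    `request_log` is a hypothetical iterable with one client identifier
    per request observed in some time window. A share of 0.5 means the
    client sent as many requests as all other clients combined.
    """
    counts = Counter(request_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        client: count / total
        for client, count in counts.items()
        if count / total >= threshold
    }


# Example window: client "a" sends as much as "b" and "c" together.
window = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
flagged = dominant_clients(window)
```

A check like this only flags share of volume; it says nothing about how expensive each request is, which also mattered in this incident.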

These endpoints run a particularly expensive query that is necessary for authoring. Calling them at such a high volume and frequency created a hotspot and contention on the cluster of four itembank delivery databases, which handle a subset of customers. Under a high volume of typical traffic, these databases peak at about 30% load. The atypical traffic outlined above put the databases at full capacity.

While all of our assessment-critical systems auto-scale, the itembank delivery databases are overprovisioned to have significant headroom, and as such don’t auto-scale. Auto-scaling of database instances can introduce risk if anything goes wrong while scaling, so when designing the system our team determined that manual scaling of these itembank delivery databases was the more reliable option.

This atypical activity led to resource starvation of the Data API, with knock-on effects to the assessments stack. Data API impacts included limiting access to authoring and session data, as well as slowing down session scoring and follow-on dispatching of scoring events via Learnosity Firehose.

During the incident, we experienced degraded performance with up to 30% of requests not succeeding. Assessment impacts included cases where the Data API was used in assessment delivery, and in some cases limited new initializations (i.e. starting a test). For the avoidance of doubt, all save and submit calls were successful without data loss.

Resolution

Once the atypical traffic was identified, temporary request rate limits were put in place so that we could quickly scale to support the increased level of activity and process the backlog. Once all services were again operating at optimal levels, the temporary request limits were removed.
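For readers unfamiliar with request rate limiting, the general idea can be sketched with a per-client token bucket. This is a generic technique shown under stated assumptions, not a description of how the Data API enforces its limits: the class names, the `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        # Refill according to elapsed time, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        # Spend one token per request if one is available.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keep an independent bucket per client so one client cannot starve others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

In this sketch a request for which `allow(client_id)` returns `False` would be rejected or delayed by the caller; keeping buckets separate per client means a burst from one client does not consume the allowance of any other.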

Additional Analysis and Prevention

We are making the following changes as a result of this operational event, to prevent this from happening again.

  • We identified that the baseline rate limit configured for the itembank endpoints was not appropriate, and we have since reviewed and corrected it.
  • We are reviewing improvements and resource allocations of the Data API to handle atypical usage patterns and increase resilience.
  • We are continuing to perform analysis to determine what other preventative measures are appropriate to implement.
Posted May 01, 2024 - 23:09 EDT

Resolved
As of 15:00 UTC, we are closing this incident as resolved. The availability and performance of the Data API and downstream systems in US-East-1 have been restored, and these systems have been operating correctly for 30 minutes.

Learnosity Support and Systems Engineering teams will follow up with a postmortem once we have completed root cause analysis and finalized any next steps or preventative measures required.

Please reach out if you have any questions or concerns.
Posted Apr 30, 2024 - 11:24 EDT
Monitoring
As of 14:30 UTC, we have addressed an issue primarily affecting Data API use. The error rate has dropped to zero, all newly submitted sessions are scoring immediately, and the backlog of sessions awaiting scoring has been cleared. The Firehose service queue is clearing quickly, dispatching information about recently scored sessions.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Apr 30, 2024 - 10:54 EDT
Update
As of 13:45 UTC, we are investigating an issue with the availability of authoring and assessment content in the US-East-1 region. This issue is also impacting analytics, with degraded performance of Reports and select Data API endpoints that require content, such as the Session Detail by Item report and Data API item bank and scoring endpoints.

We will continue to update this incident as we learn more.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Apr 30, 2024 - 09:49 EDT
Investigating
We are currently experiencing an issue affecting the availability of newly authored and edited content. This is also having a ripple effect on assessments dynamically retrieving content, such as adaptive and on-the-fly assessments.

We are investigating and will update this record as soon as we learn more.
Posted Apr 30, 2024 - 09:29 EDT
This incident affected: AMER || Analytics (Loading and rendering of reports, Availability of session information), AMER || Assessment (Loading and rendering of Items/Questions/Features), AMER || Data Centric (Scoring endpoint, Firehose), and AMER || Authoring (Creating and saving of Items/Activities/Tags).