On 2024-04-30, Learnosity suffered a service interruption in the AMER (US East-1) region that affected a subset of customers. The incident began at 12:52 UTC and was resolved at 14:18 UTC, lasting 1 hour and 26 minutes. We apologize for the impact that this had on our valued customers and learners. We have learned, and will continue to learn, from this incident and have taken steps to ensure this issue doesn’t repeat.
Details of Incident
Performance issues were initially detected by our Site Reliability Engineering team, who executed our escalation process, raising a Tier 1 incident and escalating to Tier 2 in short order.
While assessment and authoring load were typical, we saw a massive spike in Data API activity from a single client. This was limited to the get_/itembank set of endpoints, which is typically used for authoring. The single client had never used these Data API endpoints before, and went from zero to a request volume equal to that of all other clients combined in a short period of time.
These are particularly expensive queries, necessary for authoring use. Calling them at such high volume and frequency created a hotspot and contention on the cluster of four itembank delivery databases, which handle a subset of customers. Under a typical high volume of traffic, these databases operate at around 30% of peak load; the atypical traffic outlined above pushed them to full capacity.
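To make the traffic pattern concrete, the sketch below is a hypothetical illustration, not our production tooling; the function name, inputs, and thresholds are assumptions. It flags any single client whose request count on a group of endpoints rivals that of all other clients combined, which is the shape of traffic seen here.

```python
# Hypothetical sketch of per-client spike detection on an endpoint group
# (e.g. the get_/itembank endpoints). Not Learnosity's actual monitoring;
# names and thresholds are illustrative assumptions.

def find_spiking_clients(window_counts, min_requests=1000):
    """window_counts maps client_id -> request count over a short window."""
    total = sum(window_counts.values())
    spiking = []
    for client, count in window_counts.items():
        others = total - count
        # Flag any single client whose traffic rivals all other clients
        # combined -- the pattern observed in this incident.
        if count >= min_requests and count >= others:
            spiking.append((client, count, others))
    return spiking

# Example: one client jumps from zero to roughly the combined volume of
# everyone else and gets flagged.
print(find_spiking_clients({"client-a": 12000, "client-b": 4000, "client-c": 5000}))
```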
While all of our assessment-critical systems auto-scale, the itembank delivery databases are over-provisioned to have significant headroom and as such don't auto-scale. Auto-scaling database instances introduces risk if anything goes wrong while scaling, so when designing the system our team determined that manual scaling of these itembank delivery databases was the more reliable option.
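The trade-off behind that decision can be summarized in a small sketch: with typical high-volume traffic sitting at roughly 30% of peak database load, the remaining headroom is what absorbs unexpected spikes, and eating into it is the signal for a manual scale-up. The alert threshold below is an illustrative assumption, not our actual alerting configuration.

```python
# Hypothetical headroom check for a manually scaled database cluster.
# TYPICAL_PEAK_UTILIZATION reflects the ~30% figure in this report; the
# alert threshold is an illustrative assumption.
TYPICAL_PEAK_UTILIZATION = 0.30
HEADROOM_ALERT_THRESHOLD = 0.60

def headroom_action(current_utilization):
    """Suggest an operator action based on current database utilization."""
    if current_utilization >= HEADROOM_ALERT_THRESHOLD:
        return "alert on-call: consider manual scale-up and/or rate limiting"
    if current_utilization > TYPICAL_PEAK_UTILIZATION:
        return "watch: above typical peak load"
    return "ok"
```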
This atypical activity led to resource starvation of the Data API, with knock-on effects on the assessment stack. Data API impacts included limited access to authoring and session data, as well as slower session scoring and delayed follow-on dispatching of scoring events via Learnosity Firehose.
During the incident, we experienced degraded performance, with up to 30% of requests not succeeding. Assessment impacts included cases where the Data API was used in assessment delivery and, in some cases, limits on new initializations (i.e. starting a test). For the avoidance of doubt, all save and submit calls were successful and no data was lost.
Resolution
Once the atypical traffic was identified, temporary request rate limits were put in place, allowing our remediation efforts to quickly scale capacity to support the increased level of activity and process the backlog. When all services were again operating at optimal levels, the temporary request limits were removed.
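As an illustration of what a temporary per-client request limit can look like, here is a minimal sketch assuming a simple fixed-window counter; the actual mechanism and limits applied during the incident are not detailed here, and the numbers below are placeholders.

```python
# Minimal fixed-window, per-client rate limiter sketch. Hypothetical
# illustration only; limits and identifiers are placeholders.
import time
from collections import defaultdict

class PerClientRateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client_id -> [window_start_time, request_count_in_window]
        self._windows = defaultdict(lambda: [0.0, 0])

    def allow(self, client_id):
        now = time.monotonic()
        window = self._windows[client_id]
        if now - window[0] >= self.window_seconds:
            window[0], window[1] = now, 0  # start a fresh window
        if window[1] < self.max_requests:
            window[1] += 1
            return True
        return False  # caller would reject the request until the window resets

# Example: cap each client at 100 requests per minute on the affected
# endpoints while capacity is scaled and the backlog is processed.
limiter = PerClientRateLimiter(max_requests=100, window_seconds=60)
if not limiter.allow("client-123"):
    pass  # respond with HTTP 429 Too Many Requests
```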
Additional Analysis and Prevention
We are making the following changes as a result of this operational event to prevent this from happening again.