On November 29, 2022, our systems detected delays in available analytics data affecting the Live Activity Events API, Data API, and Reports API in the US-East-1 region. Investigation began immediately upon detection and identified resource starvation in our api-events system. Requests to events/lastevents (used internally by our Live Progress by Activity by User report) encountered a long-running function, which exhausted select workers. This in turn starved api-data and api-reports of resources. This siloed issue did not trigger autoscaling, but Learnosity manually scaled the relevant EC2 instances to temporarily add workers, limiting the incident to 1 hour and 19 minutes from detection to remediation.
Steps were taken to reduce the impact of extraneous calls to events/lastevents that result from malformed customer initializations, for example when the Events API is enabled in assessment initialization requests for assessments that are not part of a Live Progress by Activity by User report. This errant use publishes live events indiscriminately, even when no Live Progress by Activity by User report is in use to receive them. As a further safeguard, a "wait for subscribe" feature is in testing to reduce the extraneous events fired in this edge case.
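
To illustrate the general idea behind a "wait for subscribe" safeguard, the following is a minimal sketch only and does not represent Learnosity's actual implementation. It assumes a simple in-process publisher; the class name WaitForSubscribePublisher, its methods, and the activity_id key are all hypothetical names chosen for illustration. Events are delivered only once at least one subscriber has registered for the activity, and are otherwise dropped and counted.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class WaitForSubscribePublisher:
    """Hypothetical publisher that drops events until a subscriber exists."""

    def __init__(self) -> None:
        # Maps an activity id to the handlers subscribed to it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        # Count of events dropped because nobody was listening.
        self.dropped = 0

    def subscribe(self, activity_id: str, handler: Handler) -> None:
        """Register a handler (for example, a live report) for an activity."""
        self._subscribers[activity_id].append(handler)

    def publish(self, activity_id: str, event: Event) -> bool:
        """Deliver the event only if the activity has a subscriber.

        Returns True if the event was delivered, False if it was dropped.
        """
        handlers = self._subscribers.get(activity_id)
        if not handlers:
            self.dropped += 1
            return False
        for handler in handlers:
            handler(event)
        return True


# Usage: the first event has no subscriber and is dropped;
# the second is delivered after a subscriber registers.
publisher = WaitForSubscribePublisher()
received: List[Event] = []

first = publisher.publish("activity-1", {"kind": "started"})
publisher.subscribe("activity-1", received.append)
second = publisher.publish("activity-1", {"kind": "answered"})
```

In this sketch, an assessment initialized with events enabled but with no report listening produces no downstream work, which is the edge case described above.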