Possible delays affecting the availability of session data in US-East-1 (VA) region
Incident Report for Learnosity
Postmortem

On November 29th 2022, our systems detected delays in the availability of analytics data affecting the Live Activity Events API, Data API, and Reports API in the US-East-1 region. Investigation began immediately upon detection and identified resource starvation in our api-events system. Requests to events/lastevents (used internally by our Live Progress by Activity by User report) invoked a function that ran for an extended period of time, exhausting a subset of workers. This in turn starved api-data and api-reports of resources. Because the issue was siloed, it did not trigger autoscaling; Learnosity manually scaled the relevant EC2 instances to temporarily add workers, limiting the incident to 1 hour and 19 minutes from detection to remediation.
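
The starvation pattern described above can be illustrated with a small queueing sketch. This is an illustration only and not Learnosity's actual code: the function name `queue_delays`, the worker counts, and the request timings are all assumptions chosen to show how a few long-running calls can hold every worker in a fixed-size pool, and how adding workers relieves the wait.

```python
import heapq


def queue_delays(num_workers, requests):
    """Return how long each request waits for a free worker.

    requests: list of (arrival_time, duration) tuples, sorted by arrival_time.
    All values are hypothetical and expressed in seconds.
    """
    # Min-heap holding the time at which each worker next becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)

    delays = []
    for arrival, duration in requests:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait if no worker is free yet
        delays.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return delays


# Two slow calls arrive first, followed by two fast calls.
slow_calls = [(0.0, 60.0), (0.0, 60.0)]
fast_calls = [(1.0, 0.1), (2.0, 0.1)]

starved = queue_delays(2, slow_calls + fast_calls)  # pool fully occupied
scaled = queue_delays(4, slow_calls + fast_calls)   # pool after adding workers
print(starved)
print(scaled)
```

With two workers, the fast calls wait until a slow call finishes. With four workers, they start immediately, which mirrors the effect of the manual scaling described above.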

Steps have been taken to reduce the impact of extraneous calls to events/lastevents that result from malformed customer initializations, for example when the Events API is enabled in assessment initialization requests for assessments that are not part of a Live Progress by Activity by User report. This errant use publishes live events indiscriminately, even when no Live Progress by Activity by User report is in use to receive them. As a further safeguard, a "wait for subscribe" feature is in testing to reduce the extraneous events fired in this edge case.
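
One way to picture a "wait for subscribe" safeguard is a publisher that discards events until at least one subscriber has registered for the activity. The sketch below rests on that assumption. The report does not describe how the feature is built, so the class `EventPublisher`, its methods, and the `activity-1` identifier are hypothetical names used only for illustration.

```python
class EventPublisher:
    """Hypothetical publisher that drops live events until someone subscribes."""

    def __init__(self):
        self._subscribers = {}  # activity_id -> list of callbacks
        self.dropped = 0        # count of events discarded for lack of a subscriber

    def subscribe(self, activity_id, callback):
        """Register a callback (e.g. a live report) for one activity."""
        self._subscribers.setdefault(activity_id, []).append(callback)

    def publish(self, activity_id, event):
        """Deliver the event only if a subscriber exists; report whether it was sent."""
        callbacks = self._subscribers.get(activity_id)
        if not callbacks:
            self.dropped += 1
            return False
        for callback in callbacks:
            callback(event)
        return True


publisher = EventPublisher()
received = []

# No report is listening yet, so this event is dropped rather than published.
sent_before = publisher.publish("activity-1", {"type": "started"})

# Once a report subscribes, later events are delivered.
publisher.subscribe("activity-1", received.append)
sent_after = publisher.publish("activity-1", {"type": "answered"})
```

In this sketch, an assessment initialized with events enabled but with no report subscribed generates no downstream traffic, which is the edge case the safeguard is meant to cover.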

Posted Dec 16, 2022 - 13:09 EST

Resolved
As of 3:10 UTC, a further 30 minutes of monitoring has confirmed that all delays affecting the availability of session data in the US-East-1 (VA) region have been eliminated and all functions are operating normally.

Learnosity Support and Systems Engineering teams will follow up with a postmortem once we have completed root cause analysis and finalised any next steps or preventative measures required.
Posted Nov 29, 2022 - 10:10 EST
Monitoring
As of 2:40 UTC, availability of session data is fully restored, the message backlog has been eliminated, and load balancing is performing normally. Data API, Reports API, and Firehose streams are all operating as expected.

No impact was seen in authoring or assessment data flows, and persisting of learner data was also unaffected.

Learnosity Support and Systems Engineering teams are continuing to actively monitor the issue, and will follow on with an update and resolution as soon as possible.
Posted Nov 29, 2022 - 09:42 EST
Identified
At approximately 2:10 UTC, Learnosity engineers identified an apparent scaling issue and manually scaled the available server instances to reduce the message backlog immediately. Queues are steadily reducing and we are actively monitoring.

Learnosity Support and Systems Engineering teams are continuing to actively investigate the issue, and will follow on with an update and resolution as soon as possible.
Posted Nov 29, 2022 - 09:36 EST
Update
2:05 UTC. We are continuing to investigate possible delays affecting the availability of session data in the US-East-1 (VA) region. Firehose messages and Data API queries started showing slower than normal response times at approximately 1:20 UTC.

Learner submissions, persisting data, and scoring remain unaffected, as do all authoring and assessment APIs which continue to operate normally.

Learnosity Support and Systems Engineering teams will follow on with an update and resolution as soon as possible.
Posted Nov 29, 2022 - 09:07 EST
Investigating
As of 1:40 UTC, we are investigating possible delays affecting the availability of session data in the US-East-1 (VA) region. Scoring is not affected, nor are any authoring or assessment APIs.

Learnosity Support and Systems Engineering teams are actively investigating the issue, and will follow on with an update and resolution as soon as possible.
Posted Nov 29, 2022 - 08:59 EST
This incident affected: AMER || Analytics (Loading and rendering of reports, Availability of session information, Live Progress (Live Activity by User) report) and AMER || Data Centric (Firehose).