We’ve completed our investigation and have determined the following:
- A database error, caused by a Student Response DB instance running an older version of our DB schema, resulted in an error condition for a number of our sync queue service workers.
- As a result, the affected sync queue service workers began to loop, generating a significantly higher than normal volume of requests for those worker types (illustrated in the sketch after this list).
- This temporarily starved the Items API Application Servers of connections to the database, causing timeouts for a portion of requests.
- As a result of the rejected and timed-out requests, the Cloudfront CDN instance in front of the Application Servers (and their Application Load Balancers) stopped sending requests for a period of time, which magnified the impact.
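To illustrate the looping behavior described above, the following is a simplified sketch only, not our actual worker code; the job handler and retry parameters are hypothetical. A worker that retries immediately on a persistent error, such as a schema mismatch, floods its backing services with requests, whereas a capped exponential backoff bounds the retry rate:

    import time

    def process_job(job):
        # Hypothetical job handler: against an outdated schema this call
        # fails every time, so a naive "retry immediately" loop turns one
        # job into an unbounded stream of database requests.
        raise RuntimeError("schema version mismatch")

    def run_worker(job, max_delay=60):
        # Capped exponential backoff keeps retrying but decays the request
        # rate, so a persistent error cannot starve other clients of
        # database connections.
        delay = 1
        while True:
            try:
                process_job(job)
                return
            except RuntimeError:
                time.sleep(delay)
                delay = min(delay * 2, max_delay)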
Mitigation & Resolution:
Based on the impacted infrastructure areas, we:
- Temporarily halted the service worker causing the resource starvation, allowing the Items API Application Servers to process requests normally.
- Temporarily removed the Cloudfront layer in front of our Application Servers to allow all traffic back through to our Application Servers.
- Refreshed our Cloudfront instance to allow it to function appropriately, and placed it back in front of our Application Servers when ready.
- Upgraded the impacted Student Response DB instance to the latest version of our DB schema.
Based on this event, we are improving or implementing the following safeguards:
- Prevent the error condition from recurring by reviewing all schema changes for this class of issue.
- Improve the automation of schema changes to mitigate human error.
- Add automated alerts for any DB instance running a schema version below our current one (a minimal monitoring sketch follows this list).
- Implement fixes that prioritize the front-end API servers over the backend workers to prevent resource starvation.
- Improve separation of traffic to limit impact of any similar event.
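As a rough sketch of the schema-version alerting referenced above (hypothetical only: it assumes a PostgreSQL instance with a schema_migrations table and a generic notify hook, neither of which is taken from our actual tooling):

    import psycopg2  # assumes a PostgreSQL instance; hypothetical example

    EXPECTED_SCHEMA_VERSION = 42  # hypothetical target version

    def check_schema_version(dsn):
        # Read the highest applied migration version from the instance.
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT MAX(version) FROM schema_migrations")
                (version,) = cur.fetchone()
        return version

    def alert_if_outdated(dsn, notify):
        version = check_schema_version(dsn)
        if version is None or version < EXPECTED_SCHEMA_VERSION:
            # notify() is a placeholder for whatever paging/alerting hook is in use.
            notify(f"DB schema at version {version}, expected {EXPECTED_SCHEMA_VERSION}")

A check like this could run on a schedule against each DB instance so that an out-of-date schema is flagged before it causes worker errors.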
Timeline:
17:15 UTC Errors began occurring on the ALB backends, with 17% of requests failing.
17:19 UTC Our operations team was alerted and began diagnosing the issue. Investigation initially centered on Cloudfront or a network issue, as the ALBs and the machines behind them were showing low CPU and scaling down from their peaks based on autoscaling rules. We transitioned traffic for the Items and Assess APIs away from Cloudfront, sending it directly to the ALBs, and service recovery began.
18:25 UTC Majority of service restored - 98.5% of requests succeeding.
18:50 UTC Additional machines were manually scaled up to ensure there were enough workers to cover the full load.
18:50 UTC Full service restored - 100% of requests succeeding.