01/26/2018: Slowness in Canvas on Wednesday, 1/24 [UPDATE]

Instructure has released a statement about the slowdown on Wednesday:

Canvas Incident Report

Canvas loaded slowly for some users on Wednesday, January 24, 2018, from 7:28 AM to 8:19 AM MT

Summary

Canvas loaded slowly for some users between 7:28 AM and 8:19 AM MT on Wednesday, January 24, 2018. An update intended to improve performance caused the issue; reverting the update resolved the incident and restored normal performance.

Details

A database caching system called “Redis” sits between two critical elements of the Canvas infrastructure: the application servers, which handle user requests, and the database layer, which stores the data. Caching lets the application servers handle requests faster because it keeps frequently used data where it is quick to retrieve.
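To make the pattern concrete, here is a minimal, hypothetical Python sketch of a cache-aside read through Redis. This is not Canvas's actual code (Canvas itself is a Ruby on Rails application); the host name, key format, and database helper are invented for illustration.

    import json
    import redis  # third-party client: pip install redis

    # Hypothetical cache host; the real topology is not described in this report.
    cache = redis.Redis(host="cache-fleet.internal", port=6379)

    def fetch_course(course_id, db):
        """Cache-aside read: try Redis first, fall back to the database."""
        key = f"course:{course_id}"
        cached = cache.get(key)
        if cached is not None:
            # Cache hit: the application server never touches the database.
            return json.loads(cached)
        # Cache miss: query the database layer, then populate the cache
        # so later requests for the same data are served quickly.
        row = db.query_course(course_id)  # invented DB helper
        cache.setex(key, 300, json.dumps(row))  # expire after 5 minutes
        return row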

Our DevOps team constantly seeks to improve Canvas performance and stability. Recently, they have focused efforts on improving caching performance. Rather than run all requests through one large fleet of caching servers, the plan is to route smaller groups of requests to many smaller, faster fleets. We successfully tested this approach in our test and free environments before approving it for release to production. We deployed it in production for a few database clusters (about 10%) overnight on Tuesday, January 23, 2018.
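In outline, the change amounts to replacing a single shared routing target with a per-cluster routing table. The sketch below is hypothetical Python; the cluster and fleet names are invented.

    # Hypothetical routing table: each migrated database cluster is
    # pinned to one of the new, smaller caching fleets.
    NEW_FLEETS = {
        "cluster-07": "cache-fleet-a.internal",
        "cluster-12": "cache-fleet-b.internal",
        # ...roughly 10% of clusters in the first wave
    }
    OLD_FLEET = "cache-fleet-legacy.internal"

    def cache_host_for(cluster_id):
        """Route a cluster's cache traffic to its assigned fleet.

        Clusters not yet migrated stay on the old shared fleet.
        """
        return NEW_FLEETS.get(cluster_id, OLD_FLEET)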

When we deployed the change, we misconfigured a critical setting. Rather than re-routing traffic from only a few clusters to one of the new, small fleets of caching servers, we re-routed all traffic. The error did not cause a problem immediately, since we made the change when volume was low. But as more users came online on Wednesday morning, volume quickly outstripped the capacity of the new fleet and requests began to queue. Users experience that queueing as slow performance.
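The report does not describe the setting itself, but reusing the hypothetical names from the sketch above, a defaulting mistake of the following shape would produce exactly this behavior:

    def cache_host_intended(cluster_id):
        # Correct default: unmigrated clusters stay on the old shared fleet.
        return NEW_FLEETS.get(cluster_id, OLD_FLEET)

    def cache_host_misconfigured(cluster_id):
        # Buggy default: every cluster, not just the migrated ~10%, now
        # sends cache traffic to one small new fleet. Overnight volume is
        # low enough to absorb, but morning load outstrips the fleet's
        # capacity and requests queue.
        return NEW_FLEETS.get(cluster_id, "cache-fleet-a.internal")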

Beginning at 7:27 AM MT, our DevOps team received automated alerts of slow performance from several database clusters.

Mitigation

Our DevOps team immediately began researching the problem and eliminating possible causes. They quickly determined that the caching-layer change was the likely culprit and concluded that they could safely revert it without negative side effects.

DevOps reverted the change at 8:00 AM MT. Performance improved steadily over the next 15 minutes as servers in the old caching fleet began handling requests again. Canvas was performing normally by 8:19 AM MT. 

In our internal post-mortem, we discussed what we could have done to prevent this incident. One key insight: after pointing queries from the small group of production clusters at the new caching fleet, our testing and validation covered only the clusters we had moved. We checked that queries from those clusters were flowing to the new fleet, but we did not verify that queries from all other clusters were still flowing to the old fleet. Had we done both, we would have spotted the problem immediately and could have fixed it before it affected users. When we re-implement the change, we will perform both kinds of verification (a sketch of such a check follows), and we will apply the same philosophy to similar changes in the future.
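Continuing the hypothetical sketch from above, the missing step amounts to checking the routing of every cluster against its expected fleet, not only the migrated ones (ALL_CLUSTERS here is invented for illustration):

    def verify_routing(router, all_clusters, new_fleets, old_fleet):
        """Check cache routing for every cluster, migrated or not."""
        errors = []
        for cluster_id in all_clusters:
            # Migrated clusters must hit their new fleet; all others
            # must still be flowing to the old shared fleet.
            expected = new_fleets.get(cluster_id, old_fleet)
            actual = router(cluster_id)
            if actual != expected:
                errors.append(f"{cluster_id}: expected {expected}, got {actual}")
        return errors

    # A check like this, run over all clusters, would have flagged the
    # misrouted traffic overnight, before Wednesday-morning load arrived.
    problems = verify_routing(cache_host_for, ALL_CLUSTERS, NEW_FLEETS, OLD_FLEET)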

Conclusion

This incident was an unintended consequence of a change meant to improve Canvas performance. We learn from our mistakes, and this issue highlighted a weakness in our testing protocol for a specific kind of maintenance action. We will apply the lesson we learned here.

Your users expect Canvas to be available whenever they are ready to learn. We hold ourselves to a high standard of meeting that expectation with strong uptime and performance, and we apologize that we did not meet that standard this morning.

Posted 01/24/2018 by ODL Technical Support

Between approximately 9:45 AM and 10:15 AM ET on Wednesday, January 24, 2018, Florida State users experienced longer-than-normal load times in Canvas. This was a service-wide event that affected other institutions as well. Canvas's development team identified the source of the issue and restored normal performance within 30 minutes.

You can find a more detailed report of this event on status.instructure.com.

If you have any questions about how this may have impacted other users, please contact ODL Technical Support at (850) 644-8004 or by using our contact form.
