[Resolved] AWS Restored: Impaired functionality between Canvas and Amazon continues

All services appear to be restored.

Monitoring Feb 28, 15:59 MST

Amazon has verified that uploads to their service should be working again; users should be seeing improved performance with their uploads to Canvas. Our DevOps team is continuing to monitor the situation, but we are not currently aware of any lingering issues that affect Canvas functionality.

Update Feb 28, 14:37 MST

In our previous update, we mentioned there would still be areas of impaired functionality between Canvas and Amazon. The biggest area of impact right now is that uploads are not yet working. This includes student uploads to assignments, instructor grade uploads, and similar functions, as well as the ability of Canvas’ background processes to upload files such as admin reports (a required step in generating a report at the account level). You may continue to see issues with uploads, and with other areas in Canvas, as Amazon works to fully restore all services.

Update Feb 28, 14:15 MST

Canvas performance and service recovery continue to progress quickly. Although many users should now be able to access Canvas, there may still be areas of impaired functionality as we work through remaining issues.

Update Feb 28, 13:54 MST

We are beginning to see positive indications of recovery and have successfully tested workflows that were previously failing. We are still awaiting full resolution, and we will provide updates as the situation continues to improve.

Update Feb 28, 13:45 MST

AWS is still working through their recovery process. Unfortunately, the number of Amazon services that have been impacted has grown in the time it took to find the root cause, and it will be a significant effort on their side to recover all of the services. They are understandably starting with the most critical ones. Since Canvas depends on so many of their services, a full recovery may still take some time.

On our side, our DevOps team has moved on to other ideas about how to get from a “service disruption” state to a “degraded performance” state in Canvas. We are also discussing plans for addressing similar circumstances in the future. Our options are limited due to the perniciousness of this incident, but we are considering all of them.

Update Feb 28, 13:05 MST

Amazon is continuing to work through their recovery process. On our side, our DevOps team has implemented a temporary change to ensure tools and apps not hosted on AWS (Amazon Web Services) are still accessible to those who are able to access Canvas, which is an improvement over the complete service disruption we have had since 10:37 AM MST. However, the majority of Canvas users are still unable to access their Canvas site due to the outage with AWS.

We will continue our efforts to ensure a good experience with Canvas for users once they are able to access the site again, and will provide an update on the overall issue within the next 30 minutes.

Update Feb 28, 12:29 MST

As Amazon works to restore availability in their systems, our DevOps team continues their efforts to expedite the process to restore access to Canvas. We will provide a new update on their progress in 30 minutes or less.

Update Feb 28, 12:04 MST

Amazon Web Services has informed us that they have identified the root cause of the issue and are beginning the remediation process. Our internal DevOps team continues to explore options to facilitate faster recovery.

Update Feb 28, 11:52 MST

Amazon is still working to restore server access for sites that have been affected by their outage today, including many Canvas sites. They will keep us updated on their progress.

Identified Feb 28, 11:27 MST

Amazon has narrowed the scope of their investigation and has identified a specific region impacted by the networking issue. They are actively working on a solution. Our own DevOps team is investigating options that may allow us to work around the problem. We will provide another update in 15 minutes.

Identified Feb 28, 11:27 MST

Amazon has identified the issue as being limited to a set of servers in the US. They are actively working on finding a fix to address the errors you are seeing.

Update Feb 28, 11:08 MST

Amazon has updated their status page to indicate they are investigating increased error rates for their servers. They are working with us to provide updates on the issue; we will update this page with any new information. In the meantime, you can monitor their status page at https://status.aws.amazon.com/. Other applications that rely on Amazon Web Services may also be affected.

Update Feb 28, 11:03 MST

Amazon Web Services is currently experiencing what appears to be a large-scale networking issue that has impacted Instructure along with many other companies. We are working with Amazon to diagnose the problem and are waiting for updates on their mitigation timeline. We will update you as soon as we have more information.

Investigating Feb 28, 10:50 MST

Canvas is currently experiencing an outage that we are investigating. Our DevOps team has determined that this is an AWS (Amazon Web Services) outage. We will post updates as they become available.

