Outage issue
Incident Report for Deputy.com
Postmortem

Date: 31 July 2014

Time start: 6:15 AM Sydney time

Time end: 10:30 AM Sydney time

Affected services

  • Login
  • Web Application
  • iOS/Android application

Not affected

  • Deputy Kiosk
  • Deputy Enterprise

Timeline

6:15 AM: We started receiving minor error reports that some customers were having connection issues. Investigation found intermittent connectivity failures returning HTTP 503 errors.

7:15 AM: All Deputy servers, cache nodes, and database servers were checked for issues and rebooted. Everything appeared to be fully functional, and we were unable to identify or rectify the issue.

9:15 AM: We contacted AWS Support, who debugged the issue further.

10:00 AM: The problem was identified: one of the load balancer nodes was incorrectly returning 503 errors. This explained why the problem was intermittent, since traffic is distributed evenly across the nodes and only requests routed to the faulty node failed. We immediately disconnected that node. Traffic was routed through one of the healthy nodes, which restored connectivity with degraded performance.
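
To illustrate why a single faulty node produces intermittent rather than total failure, here is a minimal sketch using a simplified round-robin model. The node names and the node count of three are assumptions made for illustration only; the actual number of nodes and the routing behaviour inside Elastic Load Balancing are not described in this report.

```python
from itertools import cycle

def simulate(nodes, faulty, requests):
    """Return the HTTP status each request would see under round-robin routing."""
    rotation = cycle(nodes)
    # A request fails with 503 only when it lands on a faulty node.
    return [503 if next(rotation) in faulty else 200 for _ in range(requests)]

# Hypothetical node names; one of three nodes is misbehaving.
nodes = ["node-a", "node-b", "node-c"]
statuses = simulate(nodes, {"node-c"}, 9)
failure_rate = statuses.count(503) / len(statuses)
print(statuses)
print(f"failure rate: {failure_rate:.2f}")
```

Under this model roughly one request in N fails when one of N nodes is faulty, so most retries succeed and the fault is easy to mistake for a transient network problem.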

10:30 AM: AWS restored the faulty node. The Deputy service was operating at 100%.

Root cause

Amazon Elastic Load Balancing is a component that is designed never to fail. As infrastructure users, we do not have access to administrative functions such as rebooting or logging in remotely to a load balancer node. It is a component that has never failed for us in the last five years. Even Amazon acknowledged that this was a new issue for them.

Prevention

We follow the best practices recommended by Amazon, and there is nothing in our current setup that we could configure better. We could have restored service earlier had we been able to identify the issue sooner. We have implemented additional checks to identify load balancer failures in future, so that should an outage as rare as this happen again, we will be able to identify it and restore service faster.
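
One way such a check could work is to probe each load balancer node individually instead of going through the shared hostname, so that a single bad node is not masked by the healthy ones. The following is a hedged sketch of that idea, not the checks actually deployed. It assumes the load balancer publishes one IPv4 address per node in DNS and serves plain HTTP on port 80; the function names, the path, and the hostname in the usage note are hypothetical.

```python
import http.client
import socket

def resolve_nodes(hostname, port=80):
    """Return the distinct IPv4 addresses currently published for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def probe(ip, hostname, path="/", port=80, timeout=5.0):
    """Send one GET directly to ip; return the HTTP status, or None on error."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # Keep the original Host header so the backend routes the request normally.
        conn.request("GET", path, headers={"Host": hostname})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()

def find_unhealthy(hostname, resolve=resolve_nodes, check=probe):
    """Map each node IP that failed its probe to the status it returned."""
    results = {ip: check(ip, hostname) for ip in resolve(hostname)}
    # Treat connection failures (None) and any 5xx response as unhealthy.
    return {ip: status for ip, status in results.items()
            if status is None or status >= 500}
```

Calling `find_unhealthy("lb.example.com")` on a schedule and alerting when the result is non-empty would point directly at the failing node. The `resolve` and `check` parameters are injectable so the logic can be exercised without network access.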

Posted Aug 01, 2014 - 02:49 UTC

Resolved
@awscloud has fixed the load balancer issues. We are back on track. It's a very rare incident, one that we have never seen in 5 years. We are extremely sorry for any inconvenience caused. Rest assured, Deputy is fully functioning now.
Posted Jul 31, 2014 - 00:37 UTC
Monitoring
Slight improvement. We have isolated the node having issues. It should be working better now.
Posted Jul 31, 2014 - 00:01 UTC
Identified
The problem has been identified: one of the @awscloud load balancer nodes (beyond our control) is having issues. Their engineers are working on it.
Posted Jul 30, 2014 - 23:49 UTC
Investigating
We seem to have an issue with the login services. We are working on it.
Posted Jul 30, 2014 - 21:24 UTC