Date: 31 July 2014
Time start: 6:15am Sydney time
Time end: 10:30am Sydney time
6:15AM: We started receiving minor error reports that some customers were having connection issues. Investigation found intermittent connectivity failures with HTTP 503 errors.
7:15AM: All Deputy servers, cache nodes, and database servers were rebooted and checked for issues. Everything appeared fully functional, and we were unable to identify or rectify the issue.
9:15AM: We contacted AWS support, who debugged the issue further.
10:00AM: The problem was identified: one of the load balancer nodes was incorrectly returning 503 errors. This explained why the problem was intermittent, as traffic is distributed evenly across the nodes. We immediately disconnected the faulty node; traffic was routed through the remaining healthy nodes, which restored connectivity with degraded performance.
10:30AM: AWS restored the faulty node. Deputy service was operating at 100%.
Amazon Elastic Load Balancing is a managed component that is not meant to fail, and as infrastructure users we have no access to administrative functions such as rebooting or logging into a load balancer node. This component has not failed once for us in the last five years, and even Amazon acknowledged that this was a new issue for them.
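Because we cannot log in to the load balancer nodes, a single faulty node can only be detected from the outside. An ELB DNS name resolves to one IP address per node, so probing each address individually can expose the one node returning 503s while an aggregate check still looks mostly healthy. The sketch below illustrates the idea; the function names and the response codes treated as healthy are our own illustrative choices, not AWS-provided tooling.

```python
# Per-node health probe sketch for an AWS load balancer.
# The ELB DNS name resolves to one IP per LB node; probing each IP
# separately can reveal a single faulty node whose failures look
# "intermittent" behind even traffic distribution.
import socket
import urllib.error
import urllib.request


def resolve_nodes(dns_name):
    """Return the distinct IPv4 addresses the ELB name currently resolves to."""
    infos = socket.getaddrinfo(dns_name, 80, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe_node(ip, host, timeout=5):
    """GET / against one LB node directly, sending the real Host header.

    Returns the HTTP status code, or None on a connection-level failure.
    """
    req = urllib.request.Request(f"http://{ip}/", headers={"Host": host})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from a faulty node
    except OSError:
        return None


def unhealthy_nodes(results, ok=(200, 301, 302)):
    """Filter {ip: status} probe results down to the nodes needing attention."""
    return {ip: code for ip, code in results.items() if code not in ok}
```

Run periodically, `unhealthy_nodes(...)` pinpoints which node to report to AWS, rather than only knowing that "some" requests fail.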
We follow the best practices recommended by Amazon, and there is little we could improve in the setup itself. However, we could have restored service earlier had we been able to identify the issue sooner. We have therefore implemented additional checks to detect load balancer failures, so that should an outage as rare as this one happen again, we will be able to identify the cause and restore service faster.
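One form such a check can take is a CloudWatch alarm on the `HTTPCode_ELB_5XX` metric, which counts errors generated by the load balancer itself rather than by our backend servers. The sketch below builds the alarm parameters as plain data; the alarm name, load balancer name, and thresholds are illustrative assumptions, not our exact production values.

```python
# Sketch of a CloudWatch alarm on ELB-generated 5xx errors.
# HTTPCode_ELB_5XX counts errors the load balancer itself returns,
# so a spike here points at the LB layer, not the backends.
def elb_5xx_alarm(lb_name, threshold=10, period=60):
    """Build parameters for a CloudWatch alarm on HTTPCode_ELB_5XX."""
    return {
        "AlarmName": f"{lb_name}-elb-5xx",      # illustrative naming scheme
        "Namespace": "AWS/ELB",
        "MetricName": "HTTPCode_ELB_5XX",       # errors from the ELB itself
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "Statistic": "Sum",
        "Period": period,                        # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,                  # 5xx count that triggers the alarm
        "ComparisonOperator": "GreaterThanThreshold",
    }


# The dict can be passed straight to the AWS SDK, e.g. with boto3:
#   boto3.client("cloudwatch").put_metric_alarm(**elb_5xx_alarm("my-prod-elb"))
```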