504 errors reported

Incident Report for Deputy.com

Postmortem

Details of the Deputy Outages in Australia on 25th January (and 29th March)

Deepesh Banerji, SVP Technology

At Deputy, it is our vision to build thriving workplaces in every community. Trust and transparency are key pillars that underpin a thriving workplace, which is why we are sharing with our valued customers some of the challenges we faced during our toughest outage to date (disclaimer: this will get technical!).

On the 25th of January, a system failure caused a full platform outage for Deputy customers in Australia between 8:39am and 2:30pm (roughly 6 hours). This quickly led to an investment in improving the underlying infrastructure of the Deputy platform. Our journey from outage to improvement is detailed below.

The events of Monday the 25th January

January 25th is a unique and busy work day in Australia, especially for our customers. It’s the day before a public holiday, it’s the end of the month, and this year, it fell on a Monday, when many of our customers export timesheets and run payroll simultaneously.

At 8:39am our automated system alerts triggered with "Alert: Heavy Response Times". At the same time, our customer support team started receiving hundreds of chats from customers who had trouble accessing Deputy. We declared an incident at this time and updated our status page.

Investigations began. Our software engineering team hadn’t released anything new that day, so no new code or infrastructure changes were present. Our code continued to pass all of our automated quality tests. Traffic to the login page had grown naturally through the morning, as expected.

Meanwhile, symptoms were surfacing. Our elastic infrastructure kept adding more web servers to try to cope with the increasing load. Digging one level deeper, our databases were seeing 10x-20x their usual load. Continuing to dig, Redis, our in-memory cache, which is normally used to drive high performance, was seeing abnormally high utilisation. It was at this point that we confirmed that Redis was the single point of failure: our scalable databases and elastic web servers were all waiting on our single Redis instance, resulting in a cascading failure.

Traffic pattern on the morning of the 25th: traffic followed its normal pattern.

Behind the scenes, each web server started seeing significant over-utilisation (requests per instance).

Meanwhile, the databases were seeing significant load (connections per database).

The root cause: Redis CPU started showing signs of over-utilisation after 8:30am, which in turn caused database and web server utilisation to hit an unsustainable peak (utilisation % of Redis).

By midday, we had provisioned a new version of Redis and effectively restarted all of our processes and systems, and by 2:30pm Deputy was again accessible to our customers. However, the Redis risk remained: provisioning a new version was only a patch. In fact, the problem came back in a smaller, more controlled way on a few other occasions over the next few weeks.

So What Happened With Redis?

Deputy was using a non-scaling implementation of Redis across an entire region (i.e. Australia) as a caching solution. As our customer base grew, this created a single point of failure, with workloads becoming heavy and concentrated.

We had outgrown our existing Redis architecture, and it quickly became apparent that it was time to implement a more scalable solution. To make an analogy, we had 1 cash register for a very, very busy supermarket. Even as the supermarket got to peak capacity, we still had 1 cash register. In the new architecture, we can open as many cash registers as we need!
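The cash register analogy can be made concrete with a small capacity sketch. The request rates below are invented for illustration and are not Deputy's real figures; the model simply assumes each register (cache node) serves a fixed number of requests per second and that anything above that rate queues up.

```python
def utilisation(arrival_rate, service_rate, registers):
    """Fraction of total capacity in use; a value above 1.0 means overload."""
    return arrival_rate / (service_rate * registers)


def backlog_after(arrival_rate, service_rate, registers, seconds):
    """Requests still waiting after `seconds`, in a simple fluid model."""
    surplus_per_second = arrival_rate - service_rate * registers
    return max(0, surplus_per_second) * seconds


if __name__ == "__main__":
    # Hypothetical numbers: 1200 requests/s arriving, 1000 requests/s per register.
    for registers in (1, 10):
        print(
            registers,
            utilisation(1200, 1000, registers),
            backlog_after(1200, 1000, registers, 60),
        )
```

With one register the queue grows for as long as demand exceeds capacity, which is why adding more web servers in front of it could not help. With ten registers the same demand leaves plenty of headroom.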

Our team went hard to work, consulting with our AWS enterprise architecture team and working nights and weekends to develop a scalable, distributed Redis.

In this new architecture, our infrastructure has 10x the Redis clusters to spread and orchestrate workloads effectively, and we continue to add new clusters as our customer base grows. In short, our infrastructure now reflects the requirements of today’s customers and future-proofs us for our growth ambitions.
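As a rough illustration of how workloads can be spread across several clusters, here is a minimal sketch that routes each cache key to one of ten clusters with a stable hash. This is an assumption for illustration only: the report does not describe how Deputy actually distributes keys, and the `pick_cluster` helper and the `customer:<id>` key format are hypothetical.

```python
import hashlib
from collections import Counter


def pick_cluster(key, cluster_count):
    """Map a cache key to a cluster index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomised
    per process for strings and would route keys inconsistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cluster_count


if __name__ == "__main__":
    # Hypothetical key format; count how many keys land on each cluster.
    counts = Counter(pick_cluster(f"customer:{i}", 10) for i in range(10_000))
    for cluster, count in sorted(counts.items()):
        print(cluster, count)
```

Simple modulo routing like this remaps most keys whenever the number of clusters changes, so a system that keeps adding clusters would more likely use consistent hashing or fixed hash slots.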

29th March, The False Start

On Monday the 29th of March, we released the new scalable, distributed Redis to all customers, with the intent of resolving these issues once and for all. However, as irony would have it, this inadvertently led to another outage that day, caused by how the new system was tuned. It was quickly resolved.

19th April, All Systems Go!

The outage on the 29th of March was a growing pain and a speed bump on the way to the full working solution, which is now in production and handling usage with ease!

This incident was a key catalyst in our ongoing journey to improve system resilience and to systematically remove any single points of failure that may exist as our customer utilisation expands.

What Else Has Happened to Improve Our Uptime and Customer Experience?

Redis has been reworked and re-architected
Increased Monitoring, alerts and logs have been introduced in the application
Circuit Breakers have been implemented to reduce likelihood of cascading failures
Elastic computing scaling rules have been adjusted to better handle scale up when required
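To show what a circuit breaker does, here is a minimal sketch, not Deputy's implementation. The class name, the thresholds, and the injectable clock are all assumptions made for illustration. After a run of consecutive failures the breaker "opens" and rejects calls immediately, so callers stop piling up behind a struggling dependency such as an overloaded cache.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on the unhealthy dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_from_cache, key)` where `fetch_from_cache` is a hypothetical function, and fall back to a default or a direct lookup when `CircuitOpenError` is raised.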

Conclusion

We understand this was an upsetting outage for our customers, especially on a payroll day before a public holiday. We responded quickly to correct the situation, and have systematically dealt with Redis scalability as the root cause.

Thank you for your patience and understanding. We do not take for granted the trust you have placed in Deputy. We will continue on our journey to make Deputy highly available and your trusted partner!

Posted Apr 27, 2021 - 06:51 UTC

Resolved

This incident is now resolved.

Posted Jan 25, 2021 - 06:22 UTC

Monitoring

We've deployed a fix and are now monitoring the recovery. 504 errors should no longer occur.

Posted Jan 25, 2021 - 03:37 UTC

Update

We are still in the process of testing and rolling out a resolution for this issue. Our next update will be at 3pm Sydney time, if no further changes happen in that time.

Posted Jan 25, 2021 - 03:01 UTC

Update

We believe we have identified a resolution to our system outage. We are in the process of testing it and will roll it out imminently.

Posted Jan 25, 2021 - 02:02 UTC

Update

We continue to work on implementing a fix for this issue. Our next update will be at 1pm Sydney time, 25th Jan. Currently there is no ETA for this to be resolved.

Posted Jan 25, 2021 - 01:02 UTC

Identified

We've identified what we believe to be the cause of this issue and are currently working on implementing a fix. There is still no ETA for this issue.

Posted Jan 25, 2021 - 00:26 UTC

Update

We continue to investigate 504 errors. We have no ETA at this time. If there is no further information, our next update will be at 11.30am Sydney, Australia time.

Posted Jan 24, 2021 - 23:20 UTC

Update

We are still investigating 504 errors. At this time, we are aware that customers will not be able to log in or access parts of the Deputy platform.

Posted Jan 24, 2021 - 22:43 UTC

Update

We are continuing to investigate this issue.

Posted Jan 24, 2021 - 22:23 UTC

Investigating

Currently investigating reports of customers getting 504 errors when accessing Deputy.

Posted Jan 24, 2021 - 22:19 UTC

This incident affected: Login Services and Deputy - All regions (Deputy - AU).