At Deputy, it is our vision to build thriving workplaces in every community. Trust and transparency are key pillars that underpin a thriving workplace, which is why we are sharing with our valued customers some of the challenges we faced during our toughest outage to date (disclaimer: this will get technical!).
On the 25th of January, a system failure caused a full platform outage for Deputy customers in Australia between 8:39am and 2:30pm (roughly 6 hours). This quickly led to investment in improving the underlying infrastructure of the Deputy platform. Our journey from outage to improvement is detailed below.
January 25th is a unique and busy work day in Australia, especially for our customers. It’s the day before a public holiday, it’s the end of the month, and this year, it fell on a Monday, when many of our customers export timesheets and run payroll simultaneously.
At 8:39am our automated system alerts triggered with “Alert: Heavy Response Times”. At the same time, our customer support team started receiving hundreds of chats from customers who had trouble accessing Deputy. We declared an Incident at this time and updated our status page.
Investigations began. Our software engineering team hadn’t released anything new that day, so no new code or infrastructure changes were present. Our code continued to pass all of our automated quality tests. Traffic to the login page had grown naturally through the morning, as expected.
Meanwhile, symptoms were surfacing. Our elastic servers kept scaling out, adding more web servers to try to cope with the increasing load. Digging one level deeper, our databases were seeing 10x-20x their usual load. Digging further, Redis, our in-memory cache, which is normally used to drive high performance, was seeing abnormally high utilisation. It was at this point that we confirmed that Redis was the single point of failure: our scalable databases and elastic web servers were all waiting on our single Redis instance, resulting in a cascading failure.
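To make the cascade concrete, here is a rough back-of-the-envelope sketch using Little's law (average requests in flight = arrival rate × time per request). Every number and name below is a hypothetical assumption for illustration, not a measurement from the incident: the point is only that when each request waits on one shared cache, a slow cache multiplies the number of web workers tied up at once, even if traffic itself is unchanged.

```python
def busy_workers(requests_per_second: int, cache_ms: int, other_ms: int = 50) -> float:
    """Average number of web workers occupied at once.

    Little's law: in-flight requests = arrival rate * time per request.
    All inputs here are illustrative assumptions, not incident data.
    """
    return requests_per_second * (cache_ms + other_ms) / 1000


# Hypothetical steady traffic of 1000 requests per second.
healthy = busy_workers(1000, cache_ms=2)     # cache answers in 2 ms
degraded = busy_workers(1000, cache_ms=500)  # cache answers in 500 ms

print(f"healthy cache:  ~{healthy:.0f} workers busy")
print(f"degraded cache: ~{degraded:.0f} workers busy")
```

Under these assumed numbers, the same traffic needs roughly ten times as many workers once the cache slows down, which is consistent with autoscaling adding servers without relieving the bottleneck.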
Traffic pattern on the morning of the 25th: traffic followed its normal pattern.
Behind the scenes, each web server started seeing significant over-utilisation (requests per instance).
Meanwhile, the databases were seeing significant load (connections per database).
The root cause: Redis CPU started showing signs of over-utilisation after 8:30am, which in turn caused database and web server utilisation to hit an unsustainable peak (utilisation % of Redis).
By midday, we had provisioned a new version of Redis and effectively restarted all of our processes and systems, and by 2:30pm Deputy was again accessible to our customers. However, the Redis risk remained: provisioning a new version was only a patch fix. In fact, the problem came back in a smaller, more controlled way on a few other occasions over the next few weeks.
Deputy was using a non-scaling implementation of Redis across an entire region (i.e. Australia) as a caching solution. As our customer base grew, this created a single point of failure, with workloads becoming heavy and concentrated.
We had outgrown our existing Redis architecture, and it was quickly made apparent to us that it was time to implement a more scalable solution. To make an analogy, we had 1 cash register for a very, very busy supermarket. Even as the supermarket got to peak capacity, we still had 1 cash register. In the new architecture, we have unlimited cash registers!
Our team went to work, consulting with our AWS enterprise architecture team and working nights and weekends, to develop a scalable, distributed Redis.
In this new architecture, our infrastructure now has 10x the Redis clusters to effectively spread and orchestrate workloads, and we continue to add new clusters as our customer base grows. In short, our infrastructure now reflects our requirements for today’s customers and future-proofs us for our growth ambitions.
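As an illustration of what spreading workloads across many clusters can look like, here is a minimal sketch that routes each customer's cache traffic to one of several clusters by hashing a customer identifier. This is an assumption for illustration only, not a description of Deputy's actual routing: the cluster names, the `pick_cluster` helper, and the choice of hashing on a customer ID are all hypothetical.

```python
import hashlib
from typing import Sequence


def pick_cluster(customer_id: str, clusters: Sequence[str]) -> str:
    """Deterministically map a customer to one cache cluster.

    Hypothetical helper: hashes the customer ID and takes it modulo the
    number of clusters, so load is spread instead of hitting one instance.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


# Hypothetical cluster names; ten of them to mirror the "10x" in the text.
CLUSTERS = [f"redis-cluster-{n}" for n in range(10)]

# The same customer always lands on the same cluster.
print(pick_cluster("customer-42", CLUSTERS))
```

One design caveat with simple modulo routing: adding a cluster changes the mapping for most keys, so a growing fleet would more likely use consistent hashing or a fixed number of hash slots assigned to clusters, which keeps most keys in place when capacity is added.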
On Monday the 29th of March, we released the new scalable and distributed Redis to all customers, with the intent to resolve these issues once and for all. However, as irony would have it, this inadvertently led to another outage on the 29th of March, caused by how the new system was tuned, which was quickly resolved.
The outage on the 29th of March was a growing pain and a speed bump on the way to delivering the full working solution that is now in production and handling usage with ease!
This incident was a key catalyst in our ongoing journey to improve system resilience and systematically remove any single points of failure that may exist as our customer utilisation expands.
We understand this was an upsetting outage for our customers, especially on a payroll day before a public holiday. We responded quickly to correct the situation, and have systematically dealt with Redis scalability as the root cause.
Thank you for your patience and understanding. We do not take for granted the trust you have placed in Deputy. We will continue on our journey to make Deputy highly available and your trusted partner!