A weekend where deliveries don’t seem to be... delivering. One of Britain’s favourite food delivery go-tos, Just Eat, reported on its social channels that it was “experiencing some technical issues” on Friday evening, a crucial time when reliance on food delivery services is at its peak. Peak periods are when tech platforms expect spikes in demand compared to other times of day. The outage left customers and restaurants alike frustrated and confused and, more importantly, left customers hungry and restaurants with fewer orders and less revenue.
It is not always straightforward to predict the pattern of demand or its impact on the underlying technology stack that makes up your platform. Just Eat went through a programme of decoupling and worked towards a ‘microservice’ architecture back in 2019. This would have helped isolate such issues and reduce the Mean Time to Restore (MTTR) of services. However, it does not mean that complications will not arise, especially when you consider third-party integrations and services managed outside your ecosystem.
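For readers who have not worked with the metric, mean time to restore is the average time between a service degrading and that service being restored. The sketch below shows the arithmetic only. The incident timestamps are hypothetical, invented for illustration, and are not Just Eat data.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (service degraded at, service restored at).
incidents = [
    (datetime(2024, 1, 5, 18, 0), datetime(2024, 1, 5, 18, 45)),
    (datetime(2024, 1, 12, 19, 30), datetime(2024, 1, 12, 19, 50)),
    (datetime(2024, 1, 19, 18, 15), datetime(2024, 1, 19, 19, 10)),
]


def mean_time_to_restore(records):
    """Return the average outage duration across all incident records."""
    if not records:
        raise ValueError("at least one incident is required")
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


mean_restore = mean_time_to_restore(incidents)
print(mean_restore)  # 0:40:00 (outages of 45, 20 and 55 minutes)
```

Tracking this figure per service, rather than for the platform as a whole, is one way to see whether decoupling is actually shortening recovery.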
The fundamentals remain the same: to maintain a good service under unpredicted traffic and to handle unforeseen issues within individual components, you need to be confident in these four areas and able to answer these questions:
Your Technology
- How well do your Service Management and SRE teams know your architecture?
- What confidence do you have in the back-up and recovery process?
- How tightly coupled are your microservices?
- How does your technology scale?
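One common way to loosen coupling between services is a circuit breaker: after repeated failures, calls to a struggling dependency are skipped for a cool-down period and a fallback is returned instead. The following is a minimal sketch under stated assumptions, not a production implementation. The class name, the thresholds and the injectable `clock` parameter are all illustrative choices.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means circuit closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: skip the dependency
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_estimate, lambda: None)`, where `fetch_estimate` is a hypothetical function that calls another service. The key design choice is that one slow or failing component degrades a single feature instead of tying up the whole request path.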
Your Partners
- What SLAs and NFRs are in place with vendor systems?
- How often are partner systems tested for stability and scalability?
- Are all dependencies known and understood?
Your Observability
- What are the critical technical and business metrics to monitor?
- Are alerts set up correctly, or are you being inundated with false positives?
- What automation is in place to report Problems from a series of Issues and Incidents?
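On the false-positive question, one simple technique is to alert only when a metric stays above its threshold for several consecutive samples, so a single spike does not page anyone. The sketch below assumes a hypothetical error-rate metric, a 5% threshold and a three-sample window. Real values would depend on your own traffic.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when every sample in the window breaches the threshold."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps the last `window` results

    def observe(self, value):
        self.recent.append(value > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


# Hypothetical error rates sampled once per minute.
error_rates = [0.01, 0.09, 0.02, 0.08, 0.09, 0.11, 0.03]
alert = SustainedThresholdAlert(threshold=0.05, window=3)
fired = [alert.observe(rate) for rate in error_rates]
print(fired)  # [False, False, False, False, False, True, False]
```

The isolated spike at the second sample is ignored, while the sustained breach across the fourth to sixth samples fires once. The trade-off is a slower alert, so the window should be short for business-critical metrics.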
Your Processes
- What quality gates are in place for assuring changes prior to hitting production?
- What risk are you carrying forward, and do you have visibility of all mitigation avenues based on the risk profile in production?
- Do your operational runbooks reflect possible scenarios? Have they been tested?
When supporting our retail clients to prepare for known and unknown peaks, we follow a tried and tested methodology. We review the technology landscape, its dependencies and the weaknesses within the architecture, then test different scenarios representing expected user profiles and malicious activity such as bots. This, together with improvements to DevOps and SRE processes, has consistently helped retail clients unblock platform scaling limitations and remove bottlenecks within complex integrated platforms, at both application and infrastructure layers, so they can serve their customers successfully and cost-effectively when demand peaks.
How are you assuring your site reliability and operational stability? Reach out for a free diagnostic and consultation.