I want to take a few minutes to share how I came to realise, once again, how common so many of our challenges in tech are. This time the commonality is between the world of test automation and operational alerting. Before I dive into the details, though, let me quickly explain the career journey that led me here.
From the beginning of 2012 to the middle of 2018 I worked as a test consultant in organisations ranging from small non-profits to large multinational financial institutions. During this time I saw the scale, financing, and culture change fairly drastically; however, the challenges were often quite similar at the team level of delivering software. Most notably, over these years I realised that being a specialist in quality often meant my biggest contribution was in how we deliver code through Continuous Integration and Delivery processes, rather than just in how I test things. This led me to work with many DevOps and infrastructure teams and activities, and eventually to join MOO as a Test Engineer on their Platform Engineering team.
Introduction to our platform engineering team
At MOO the Platform Engineering team is responsible for the infrastructure of both production and pre-production environments, as well as the running of internal tooling such as source control and backlog management. As you may expect, this includes being on call for these systems 24 hours a day, 7 days a week. From what I can see around the industry, our team is not particularly unique in how we handle this.
We have a rota which makes sure we share the responsibility, and our monitoring generates both warning and critical alerts to make sure we respond if issues occur. These alerts give us a huge opportunity to improve our Mean Time To Recovery (MTTR) by decreasing our time to discovery. However, when I joined the team they were also generating a lot of stress, which is why one of my first tasks was to review our alerts for value and work out how to remove that stress.
Tackling our alerting challenges
Our alert stress stemmed from a few major causes. First of all, some of our alerts were for things that MOO team members could do nothing about. For example, if our Google authentication is failing there is an impact on our internal users, which could trigger an alert to our on-call engineer; however, there is nothing we can do but wait until Google sorts out their incident. Another issue was critical alerts (meaning someone is woken up in the middle of the night) that resolved themselves within minutes. Often these were machine health issues in AWS, which has its own self-healing mechanisms, and our thresholds were simply too low. Finally, we had some alerts that made people wonder whether they were worthwhile at all, as they had never triggered once in over a year of tracking.
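The self-healing case above comes down to giving an alert a grace period before it pages anyone. As a minimal sketch of that idea (the function name and the 15-minute window are illustrative assumptions, not our actual configuration):

```python
from datetime import datetime, timedelta

# Illustrative grace period: long enough for AWS's self-healing to
# replace an unhealthy machine before anyone is woken up.
GRACE_PERIOD = timedelta(minutes=15)

def should_page(unhealthy_since: datetime, now: datetime,
                grace_period: timedelta = GRACE_PERIOD) -> bool:
    """Page only if the unhealthy state has outlived the grace period."""
    return (now - unhealthy_since) >= grace_period
```

With a rule like this, a machine that recovers within a few minutes never pages, while one that is genuinely stuck still does.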
When I joined the team there was a ticket on the backlog to “Improve team alert fatigue”. This was intentionally a pretty vague topic, meant as an analysis placeholder, as there wasn’t just one solution to all of these concerns with our alerts. Alert fatigue occurs when someone is “exposed to a large number of frequent alarms and consequently becomes desensitized to them. Desensitization can lead to longer response times or to missing important alarms.” Our team was definitely showing the early symptoms. After spending some time with each team member, we decided that the definition of done for this ticket would be to:
- Identify the shape and requirements of a “good” alert
- Review our alerts against this definition
- Set up future work to bring our current alerts in line with our desired state
Achieving a working solution by putting it to the test
To achieve these goals I put together a straw man definition of what a warning and a critical alert would each need to have. I then scheduled reviews of all of our alerts against these definitions so that they could be hardened and evolved by our real context. This was a (very) long process, but by grouping alarms together, sharing the load across all team members in small groups, and keeping meetings to no more than 30 minutes, we got through the review in less than a month, and with only a limited amount of meeting fatigue!
In the end these became our team’s working definitions:
All critical alerts should have…
…an immediate and quantifiable end user impact that requires a specific human interaction to rectify
…and should be paged on immediately
All warning alerts should have…
…no short-term action required, but a proven critical impact on the system if left unmanaged
…and should be paged on after 72 hours in a persistent state
We of course dove into the details, like how to define “short term”, but this high-level framework hopefully demonstrates how we also uncovered the need to discuss the information an alert provides (if it requires human interaction, shouldn’t that action be clear from the get-go?) and how we handle different types of services (Grafana, which just displays data and is used most heavily during office hours, may not need to be paged on on a Sunday morning).
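To make the framework concrete, here is a minimal sketch of the paging decision it implies. Only the critical/warning split and the 72-hour escalation come from the definitions above; the function and type names are hypothetical, not our actual tooling:

```python
from datetime import timedelta
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"

# Warnings escalate to a page only after 72 hours in a persistent state,
# per the team's working definition above.
WARNING_ESCALATION = timedelta(hours=72)

def should_page_now(severity: Severity, persisted_for: timedelta) -> bool:
    """Critical alerts page immediately; warnings page only once they
    have remained in a persistent state for 72 hours."""
    if severity is Severity.CRITICAL:
        return True
    return persisted_for >= WARNING_ESCALATION
```

The 72-hour escalation is what keeps warnings honest: they never interrupt anyone immediately, but they cannot be ignored forever either.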
Outcomes and learnings from the process
By the end of the review we had identified some alerts that could be removed outright, some that would be retired along with the systems they were attached to, and many that could use some sprucing up. In all these cases we created story cards with the review context and have been regularly feeding them into our workload since then.
As we are now about four months on, I thought it would be a good time to retrospect on the experience and the value of the work done. My biggest takeaway is how closely it mirrored my experience with test automation suite reviews.
In particular, stepping away from a single story card or a single feature helped our team see the bigger picture of alerts and come up with a framework to keep them in order. Nothing we came up with was novel to our team; it is just that we had not been able to sit down and agree while simultaneously applying our theories to real-world scenarios. The alert review is what really cemented agreement over our alert framework, through discussion and debate about possible outliers.
Another similarity to test automation is that the team cared deeply about quality alerts, but given the intermittent pain of poor alerts and the lack of customer-facing impact, it was hard to justify spending the time to improve them. To be honest, this continues to be a challenge as we try to strike a balance between interrupt-driven work (incidents or alerts on a given day), feature delivery, and long-term maintenance. That being said, having thoughtfully created and thorough documentation of fixes in place has meant that when a “mildly annoying” alert goes off again, we can often pluck that card from the backlog and prioritise it quickly.
Finally, the test community doesn’t seem to have as well-known a term for red build fatigue as the operations community has for alert fatigue, but trust me, they are the same thing! Our team had started to rely on instinct and experience to decide which warning alerts actually mattered, and more often than not just wanted to remove the culprit alerts, because that was the only way they saw any chance of improving the situation.
I am not suggesting we are in an ideal place now, but introducing practices that bring the pain forward, like escalating all warnings after 72 hours, has increased our on-call engineers’ and product owners’ focus on alert quality for the sake of team sanity.
*This post was originally published on May 10, 2019 on the DevTestOps Community site.*
Abby Bangser is a software tester with a keen interest in working on products where fellow engineers are the users. Abby brings the techniques of analysing and testing customer facing products to tools like delivery pipelines and logging so as to generate clearer feedback and greater value. Currently Abby is a Test Engineer on the Platform Engineering team at MOO which supports the shared infrastructure and tooling needs of the organisation.
Outside of work Abby is active in the community: co-leading Speak Easy, which mentors new and diverse speakers; co-hosting the free London meetup Software Testing Clinic, which brings together mentors and newcomers to the software testing industry; and co-organising European Testing Conference 2019. The easiest way to get in touch is on Twitter at @a_bangser.