IT Brief Asia - Technology news for CIOs & IT decision-makers
How change tracking can foster a blameless culture and reduce the impact of IT downtime
Wed, 22nd Feb 2023

Technology downtime due to application failure, cyberattacks or human error can cause substantial damage to a company's reputation, hitting revenue and saddling the business with significant recovery costs.

Most often, change events are at the root of software performance degradation and outages, but unearthing them can be very challenging. It gets even more complicated when larger organisations with siloed teams running complex, interconnected architectures are involved.

Engineers and operations teams could save an enormous amount of time if they simply knew when changes were made, so that issues could be diagnosed and rectified as soon as they surface.

Unfortunately, many engineers find themselves operating in a somewhat authoritarian culture where mistakes can result in disciplinary action. This deters them from adequately documenting their work or admitting that a mistake was made.

This environment doesn't bode well for employee retention and wellbeing. In fact, Amy Edmondson, author and Harvard Business School Novartis Professor of Leadership and Management, says high-performing teams have 'psychological safety', which she describes as a belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.

Finding a balance between tracking changes and open, honest communication can go a long way in preventing unplanned downtime and keeping engineers happy. Here are three ways teams can improve the visibility of changes across their organisation:

1. Foster a blameless culture and learn from mistakes

Embedding a blameless culture and keeping feedback constructive gives people the confidence to raise issues without fear. 'Blameless post-mortems' have become a fixture of Site Reliability Engineering (SRE) culture. For instance, at Google, blameless post-mortems assume that everyone involved in an incident had good intentions and acted in good faith on the information they had.

Cultivating such a culture where people feel safe to make, admit and learn from their mistakes is essential to preventing repeat incidents. This cultural shift will be critical as technology architectures become more complex and unpredictable.

2. Implement a comprehensive change-tracking tool

With a blameless culture comes clearer communication and documentation. Since high-performing engineering teams ship code multiple times per day, identifying and diagnosing issues becomes more challenging. It's essential that teams know when changes occur so remediation steps can be taken immediately when a change has unanticipated consequences.

Instead of using monitoring and visualisation tools that only show metrics in time-series charts, organisations could consider equipping their teams with a comprehensive change-tracking tool that overlays contextual markers on top of charts to easily correlate the impact of change. This would provide IT teams with immediate and actionable insights to resolve issues faster.
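As a rough illustration of how change markers can be correlated with an incident, the Python sketch below lists the changes recorded shortly before a problem began. The `ChangeEvent` fields, the 30-minute lookback window and the sample data are all assumptions made for this example, not the behaviour of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical record of a change; field names are illustrative only.
    timestamp: datetime
    service: str
    description: str


def changes_before(incident_start, changes, lookback=timedelta(minutes=30)):
    """Return changes recorded within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes if window_start <= c.timestamp <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.timestamp, reverse=True)


# Made-up sample data for the sketch.
changes = [
    ChangeEvent(datetime(2023, 2, 22, 9, 0), "checkout", "config update"),
    ChangeEvent(datetime(2023, 2, 22, 10, 50), "payments", "new release deployed"),
    ChangeEvent(datetime(2023, 2, 22, 10, 58), "checkout", "feature flag enabled"),
]
incident_start = datetime(2023, 2, 22, 11, 0)

suspects = changes_before(incident_start, changes)
for change in suspects:
    print(change.timestamp.isoformat(), change.service, change.description)
```

The most recent change is listed first because it is often the first place to look, although the culprit may equally be an earlier change whose effect took time to show.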

3. Consider integrating with CI/CD tools for clear visibility

The increasing complexity of the software ecosystem and the way in which modern teams seek to improve release velocity require meaningful integrations. The ideal change-tracking tool gives all engineers real-time information on the performance of their Continuous Integration/Continuous Deployment (CI/CD) pipeline. Teams can integrate change tracking with CI/CD to seamlessly capture changes and deployments that can be broadly shared with the engineering community. This allows engineers to view changes in the context of the systems and applications they support, in real time, and to act swiftly when changes introduce unexpected outcomes.
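One way a pipeline step might capture a deployment as a shareable change event is sketched below in Python. The environment variable names (`SERVICE_NAME`, `GIT_COMMIT`, `PIPELINE_ID`) and the event fields are hypothetical; a real CI/CD system and change-tracking tool will define their own.

```python
import json
from datetime import datetime, timezone


def build_change_event(env, now=None):
    """Build a deployment event from pipeline variables (names are assumptions)."""
    now = now or datetime.now(timezone.utc)
    return {
        "type": "deployment",
        "service": env.get("SERVICE_NAME", "unknown"),
        "version": env.get("GIT_COMMIT", "unknown"),
        "pipeline": env.get("PIPELINE_ID", "unknown"),
        "timestamp": now.isoformat(),
    }


# Example values standing in for what a pipeline would provide.
example_env = {
    "SERVICE_NAME": "checkout",
    "GIT_COMMIT": "abc1234",
    "PIPELINE_ID": "build-42",
}
event = build_change_event(
    example_env, now=datetime(2023, 2, 22, 11, 0, tzinfo=timezone.utc)
)

# The JSON payload is what would be sent to whichever tool records changes.
payload = json.dumps(event, sort_keys=True)
print(payload)
```

The event deliberately records what changed and when rather than who made the change, which keeps the record useful for diagnosis while staying consistent with a blameless culture.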

It's challenging to recover from the material and reputational cost of an outage. However, planning in advance and using change tracking can offer organisations a consolidated view - with powerful insights - to drive real-time operational decisions in the event of unplanned downtime.

Harnessing the power of an all-in-one observability platform provides clear visibility of any impact caused by changes to the tech stack, leading to faster resolution - a huge relief to IT teams, who may inadvertently find themselves at the receiving end of criticism for data and monetary losses.