The Curious Case of the Leaking Land Rover
The Art of Root Cause Analysis: Solving Problems at Their Source

We recently concluded our inaugural Customer Success kickoff session for 2025, themed “Pioneering Excellence”, where I conducted a session with the same title as this article. As head of the SRE team at WSO2, I have created guidelines for my team on conducting Root Cause Analysis (RCA) sessions and have participated in many RCA review sessions. In our line of work, outages and incidents are part and parcel of life. However, we strive to learn from every such occurrence and take every action possible to prevent recurrence. After all, we are in the business of making customers successful, and repeat incidents can only have negative consequences. Hence, every incident related to the deployments my team manages, which include WSO2 Choreo, WSO2 Asgardeo, and the managed and private clouds that we run on behalf of our customers, requires an RCA once the incident is resolved. This session was inspired by what I learned and observed as a participant in such sessions. I believe that documenting it here will be helpful for others as well.
TL;DR
Organizations aim for smooth operations and reliability, but incidents still happen, disrupting services and affecting customer trust; Root Cause Analysis (RCA) is a powerful tool that enables teams to identify the underlying causes of these issues and implement effective, long-lasting solutions. This blog explores the fundamentals of RCA, the principles of blameless analysis, and methodologies such as 5-Whys and Fishbone analysis, demonstrating how to identify actionable solutions that address the roots of the problem.

A car owner regularly finds oil patches in his driveway and low engine oil levels. His mechanic repeatedly tops up the oil and replaces the oil filter gasket. The mechanic also cleans the engine to remove visible oil residue, giving the impression that the issue is resolved, but the leak persists and the owner is frustrated. This or something similar has happened to many of us, hasn’t it? Finally he decides to take his car to a different mechanic, who takes a step back, takes his time to analyze the problem, and discovers that the source of the leak is a cracked valve cover gasket. Once that is replaced, the leak is permanently fixed. The owner is relieved, but he has needlessly spent time and money and suffered disappointment because the first mechanic treated the symptoms instead of spending a bit more time to find the root cause. The owner will never go back to that mechanic.
The Why Behind Every Problem
Root Cause Analysis is a systematic process to uncover the true source of a problem or incident. By addressing root causes, teams prevent recurrence and build robust systems. Unlike quick fixes that address symptoms, RCA drives meaningful change by targeting the underlying problems.
RCA offers several tangible advantages. By identifying and addressing underlying issues, it minimizes repeated outages, which reduces downtime and keeps operations running smoothly. It also improves processes and reliability by optimizing workflows and systems. Furthermore, RCA fosters a culture of continuous improvement, promoting accountability and encouraging teams to embrace proactive problem-solving. This approach not only resolves current issues but also strengthens the foundation for long-term success.
RCA aims to achieve the following critical objectives:
Identify the True Source of the Problem

Determining the specific underlying causes instead of addressing symptoms alone is crucial. Many a time, people end up applying plasters to counter the symptoms without investing time in understanding why the symptoms manifested in the first place. Needless to say, these symptoms will raise their ugly heads from time to time unless the reasons behind them are addressed.
Implement Effective Corrective Actions
Developing actionable and lasting solutions that fix the root causes identified during the RCA is one of the fundamental objectives.
Prevent Similar Incidents in the Future
Introducing process improvements to eliminate the risk of recurrence is another important objective.
Fix It at the Top!
Living in Sri Lanka, we know that to eliminate corruption, it must be tackled at the top. However, what happens in reality is that the big fish go scot-free while the small fry get caught, leading to the situation we are in today.

Getting back to RCA, we have observed that once you start plotting a graph of the root causes, many incidents lead to a handful of ultimate causes, as shown in the illustration above. Fixing the problems closer to the top eliminates problems and incidents further down the hierarchy. For example, incidents A, B and C are all due to a partial deployment outage. Further analysis uncovered a misconfiguration, which occurred due to a lack of awareness in the SRE team as well as poor documentation, and these ultimately point to operational process issues and product management issues. This indicates that other problems could stem from those two roots, and hence, to avoid future problems, we should scrutinize and update these processes.
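To make the idea concrete, here is a minimal Python sketch of such a cause graph. The node names mirror the example above, but the exact edges (for instance, which intermediate cause maps to which ultimate cause) are assumptions made for illustration, and the sketch assumes the graph has no cycles.

```python
from collections import Counter

# Each key is caused by the entries in its list.
# The edges below are an assumed reading of the example, not real incident data.
CAUSED_BY = {
    "incident A": ["partial deployment outage"],
    "incident B": ["partial deployment outage"],
    "incident C": ["partial deployment outage"],
    "partial deployment outage": ["misconfiguration"],
    "misconfiguration": ["lack of awareness in SRE team", "poor documentation"],
    "lack of awareness in SRE team": ["operational process issues"],
    "poor documentation": ["product management issues"],
}


def ultimate_causes(node, graph):
    """Return the causes reachable from node that have no further cause."""
    parents = graph.get(node, [])
    if not parents:
        return {node}
    roots = set()
    for parent in parents:
        roots |= ultimate_causes(parent, graph)
    return roots


def rank_root_causes(incidents, graph):
    """Count how many incidents trace back to each ultimate cause."""
    counts = Counter()
    for incident in incidents:
        counts.update(ultimate_causes(incident, graph))
    return counts


ranking = rank_root_causes(["incident A", "incident B", "incident C"], CAUSED_BY)
for cause, count in ranking.most_common():
    print(f"{cause}: {count} incident(s)")
```

Causes that account for the most incidents are the ones closest to the top, and are the best candidates to fix first.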
Conducting an RCA
Conducting an effective RCA requires a methodical approach. Here are the fundamental steps in detail:
- Define the Problem Clearly: Frame the issue with a specific and concise statement to guide the analysis. Clearly articulating the problem is as good as solving half the problem.
- Focus on Root Causes, Not Symptoms: Look beyond immediate effects to discover the deeper systemic issues.
- Gather Data and Evidence: Collect logs, metrics, timelines and all relevant data to establish a factual foundation for the analysis.
- Involve Relevant Stakeholders: Include people from different roles to ensure diverse perspectives.
- Use Structured Tools and Techniques: Leverage methods like Fishbone Diagrams and 5-Whys for a systematic exploration of causes.
- Develop Practical Solutions: Focus on realistic and actionable remedies that can be effectively implemented. Ivory tower solutions will not yield anticipated results.
- Focus on Blameless RCA: Encourage open communication and analyze processes instead of blaming individuals.
- Document and Share Findings: Compile and disseminate results to ensure organizational learning and transparency.
- Implement and Verify Corrective Actions: Execute and monitor solutions to validate their effectiveness.
- Promote Continuous Improvement: Use RCA outcomes to refine practices and foster a culture of learning.
Blameless RCA: Focus on systems and processes, not individuals

What would be the best way to avoid making mistakes? Not doing anything at all would ensure that you never make one, but humanity wouldn’t progress if everyone thought like that. Most people don’t make mistakes deliberately, so it is better to give individuals the benefit of the doubt.
Blameless RCA fosters an environment of psychological safety, where individuals feel secure sharing mistakes and insights. Instead of asking, “Who caused the problem?” teams focus on “What allowed this problem to happen?” This approach encourages collaboration, honesty, and innovation.
How to Conduct a Blameless RCA:
- Establish Psychological Safety: Cultivate a culture where team members can openly discuss issues without fear of blame or retribution.
- Focus on Systems and Processes: Examine workflows, tools, and systemic factors contributing to the problem instead of attributing fault to individuals.
- Use Neutral Language: Frame discussions constructively, such as “The configuration check was missed” instead of “The engineer failed to check the configuration.”
- Rely on Data: Base analysis on objective evidence like logs, metrics, and timelines rather than assumptions and biases.
Methodologies
Two popular tools for RCA are Fishbone Analysis and the 5-Whys technique. Let’s explore how they work.
5-Whys Analysis
The 5-Whys method is an iterative questioning technique used to drill down to the root cause. Here’s an example:
- Why did the system go down? — A configuration file was missing.
- Why was the configuration file missing? — It wasn’t included in the deployment process.
- Why wasn’t it included? — The checklist didn’t cover this file.
- Why didn’t the checklist cover it? — The checklist was outdated.
- Why was the checklist outdated? — No process existed for regular updates.
By repeatedly asking “Why?” teams can uncover deeper systemic issues that might otherwise be overlooked. Even though this is called the 5-Whys method, you don’t have to ask the question exactly five times. The depth can be more or less than five, as the problem requires.
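The chain above can also be captured as data, which makes it easy to store alongside an incident record. The following is a minimal Python sketch. The `Why` class and `candidate_root_cause` function are hypothetical names introduced for illustration, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a 5-Whys chain (hypothetical structure)."""
    question: str
    answer: str


def candidate_root_cause(chain):
    """The last answer in the chain is the deepest cause found so far."""
    if not chain:
        raise ValueError("ask at least one 'why'")
    return chain[-1].answer


chain = [
    Why("Why did the system go down?", "A configuration file was missing."),
    Why("Why was the configuration file missing?", "It wasn't included in the deployment process."),
    Why("Why wasn't it included?", "The checklist didn't cover this file."),
    Why("Why didn't the checklist cover it?", "The checklist was outdated."),
    Why("Why was the checklist outdated?", "No process existed for regular updates."),
]

for depth, step in enumerate(chain, start=1):
    print(f"{depth}. {step.question} {step.answer}")
print("Candidate root cause:", candidate_root_cause(chain))
```

Because the chain is just a list, it can be shorter or longer than five steps without any change to the code.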
Fishbone Analysis

The Fishbone Diagram, also known as the Ishikawa Diagram, is a visual tool that maps out cause-and-effect relationships. The diagram consists of:
- The Head: Represents the defined problem or issue.
- The Bones: Major categories of causes, such as People, Process, Equipment, Materials, Environment, and Management. This encourages a multi-dimensional analysis of the problem.
- The Sub-Causes: Specific contributing factors branching out from the main categories. At each of these “bones”, we would conduct a 5-whys analysis or at least ask the question “why” as many times as appropriate.
By systematically breaking down causes, teams can comprehensively explore all potential contributors to the problem.
Hypothetical Scenario: WSO2 Gateway Timeout Issue
Let’s apply the above methodologies to a hypothetical scenario involving a WSO2 cloud deployment to get a better understanding.
Problem Statement
Multiple users report gateway timeout errors after a recent update. These errors are impacting API calls and causing disruptions.
Initial Investigation:
Following the incident run book, the SRE team reverted recent changes to mitigate the impact temporarily.
Stakeholders: feature developers from the product team, product leads, SRE members who handled the incident, SRE leads, relevant CS leads.
5-Whys Analysis:
- Why did the gateway timeout occur? The product change introduced a mandatory configuration parameter. Leaving it unset changes the behavior of the product.
- Why was this not detected by the product team? The product team tested with the parameter properly set. Why? The product team always tests fresh product builds with the latest config files and doesn’t test with older config files.
- Why wasn’t this mandatory configuration change communicated to the SRE team? The documentation and the related change log were not updated. Why? The feature release checklist doesn’t mandate checking whether docs have to be updated.
- Why wasn’t this parameter introduced with a sensible default, so that existing systems would not be impacted? The impact was overlooked during the design phase of the feature. Why? Impact on existing deployments is not considered as part of the feature design phase.
- Why didn’t SRE detect this issue until users reported it? The monitors were missing. Why? The monitor was disabled along with the new feature deployment and was not re-enabled afterwards. Why? There is no process to keep track of temporarily disabled monitors.
Fishbone Analysis:
- People: Lack of awareness about configuration changes.
- Process: No checklist for updating configurations.
- Equipment: Missing monitors for new deployments.
- Materials: Outdated reference documentation.
- Environment: High system load during the update.
- Management: Lack of oversight on deployment procedures.
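If a diagramming tool isn’t at hand, the same fishbone can be kept as plain data. The following is a minimal Python sketch using the categories and causes listed above. The structure and the function names are assumptions made for illustration.

```python
CATEGORIES = ("People", "Process", "Equipment", "Materials", "Environment", "Management")

# The head is the problem statement; each bone maps a category to its sub-causes.
FISHBONE = {
    "problem": "Gateway timeout errors after a recent update",
    "bones": {
        "People": ["Lack of awareness about configuration changes"],
        "Process": ["No checklist for updating configurations"],
        "Equipment": ["Missing monitors for new deployments"],
        "Materials": ["Outdated reference documentation"],
        "Environment": ["High system load during the update"],
        "Management": ["Lack of oversight on deployment procedures"],
    },
}


def render_fishbone(fishbone):
    """Render the fishbone as an indented text outline."""
    lines = [f"Problem: {fishbone['problem']}"]
    for category, causes in fishbone["bones"].items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


def unexplored_categories(fishbone, expected=CATEGORIES):
    """List the categories that have no sub-causes recorded yet."""
    return [c for c in expected if not fishbone["bones"].get(c)]


print(render_fishbone(FISHBONE))
print("Unexplored categories:", unexplored_categories(FISHBONE))
```

Checking for unexplored categories is a simple way to confirm that the analysis looked at every dimension of the problem before moving on to root causes.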
Root Causes:
- Lack of a standardized process for updating and validating configurations.
- No process to track temporarily disabled monitors.
Action Items:
- Introduce sensible defaults for new configurations.
- Test with older configuration files during development.
- Mandate documentation updates in release checklists.
- Implement a system to track temporarily disabled monitors.
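Two of these action items lend themselves to a short illustration. The Python sketch below shows one possible shape for a sensible default and for tracking temporarily disabled monitors. The parameter name `backend_timeout_seconds`, its value, the `listen_port` key, the monitor name, and the `DisabledMonitorLog` class are all hypothetical and are not taken from any WSO2 product or monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical new parameter with a default that preserves the old behavior.
DEFAULTS = {"backend_timeout_seconds": 30}


def load_config(raw):
    """Overlay an existing (possibly older) config on top of the defaults."""
    return {**DEFAULTS, **raw}


class DisabledMonitorLog:
    """Tracks monitors that were disabled temporarily and when they are due back."""

    def __init__(self):
        self._entries = {}

    def disable(self, monitor, reason, now, max_duration):
        self._entries[monitor] = {"reason": reason, "reenable_by": now + max_duration}

    def enable(self, monitor):
        self._entries.pop(monitor, None)

    def overdue(self, now):
        """Return monitors that are still disabled past their deadline."""
        return sorted(m for m, e in self._entries.items() if now > e["reenable_by"])


# An older config file that predates the new parameter still loads safely.
old_config = {"listen_port": 8243}
config = load_config(old_config)
print(config)

# A monitor disabled for a deployment shows up as overdue if nobody re-enables it.
log = DisabledMonitorLog()
start = datetime(2025, 1, 6, 9, 0)
log.disable("gateway-latency", "feature deployment", start, timedelta(hours=2))
print(log.overdue(start + timedelta(hours=3)))
```

Loading an older config file through `load_config` in a test is also a cheap way to exercise the “test with older configuration files” action item, and a periodic check of `overdue` could feed an alert so that a disabled monitor is never silently forgotten.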
Post-RCA Actions
Effective RCA doesn’t end with identifying causes. It’s critical to:
- Implement Recommendations: Ensure timely execution of corrective actions and monitor their success.
- Update Processes: Revise and standardize workflows to eliminate the recurrence of similar issues.
- Monitor Progress: Establish metrics and KPIs to assess the effectiveness of solutions over time.
- Share Lessons Learned: Document findings comprehensively and share them across teams to promote organizational learning.
Common Pitfalls to Avoid
- Inadequate Stakeholder Involvement: Failing to include all relevant parties can lead to incomplete analyses.
- Vague Problem Statements: Ambiguous definitions of the issue hinder the identification of root causes.
- Superficial Analysis: Avoiding deeper exploration due to time constraints or fear of blame results in recurring problems.
- Overlooking Systemic Changes: Addressing only immediate issues without improving underlying processes leads to recurring incidents.
Key Takeaways
- Fix the root, not just the symptom.
- Focus on systems, not individuals.
- Learn, improve, and prevent future issues.
Root Cause Analysis is more than just a problem-solving tool; it’s a mindset. By systematically identifying and addressing root causes, organizations can build resilient systems, foster collaboration, and drive continuous improvement. Remember, every problem is an opportunity to learn and grow.