Problem Solving Revisited

Why is it that you always find the cause of a problem in the last place you look? The correct answer, I think, is because when you find it, you stop looking. This post is a response to a conversation that I had with a manager I work with. We were talking about why people (software developers) take too long to correct software defects. My response was that often it is that they are lacking a problem solving framework that keeps them focused.

In my experience, there are two key behaviors that act to slow down problem solving:

1) Getting Stuck – not knowing what to do next or how to do it. This happens when I don’t know enough about the technical domain or the business domain of the product that I am working on.

2) Rabbit Holing – becoming focused on things that cannot lead to a solution of the problem at hand. This happens when I either don’t know that there are other possible causes for a problem, or I am unable to effectively evaluate the probability of a potential cause, or understand the cost of proving that something is the cause of the problem.

The Cause

Causes of problems are sometimes elusive. Symptoms are misleading, reporters' expectations are misguided, and layers of business process and automation, along with years of workarounds and patch jobs, render the situation very confusing.

The cause of a problem has a category related to its proximity to the symptom-revealing component:

  • Local cause – when analysis reveals that the problem is specific to the component itself.
  • Adjacent cause – when analysis reveals that the cause is in components directly adjacent to the component revealing the symptom.
  • Subsystem cause – when analysis reveals that the cause is in components closely related to the symptomatic component.
  • Systemic cause – the cause is isolated in a component that is shared or used by many unrelated components in the system.
  • External cause – the cause is outside of the boundaries of the system – or business process. This is often referred to as "garbage in, garbage out".

The cause of a problem has a category related to its probable frequency of occurrence:

  • Incessant – when one occurrence is not likely to complete before the next begins. If the symptom is performance related, the result can be a complete log jam.
  • Regular – when an occurrence is expected every time a process happens.
  • Routine – when an occurrence is expected only under certain conditions, that happen routinely.
  • Infrequent – when an occurrence is expected only under certain conditions, that happen infrequently.
  • Unlikely – when an occurrence is expected only under a rare confluence of unlikely conditions.

The cause of a problem has a category related to its class or type:

  • Human Error – A human did not correctly execute according to process and procedure.
  • Sequence Error – Process was executed out of sequence.
  • Resource Issue – Some critical resource was unavailable when needed to complete the process.
  • Process Definition Issue – The process definition did not account for some real condition, and error resulted.
  • Automation Issue – Some process automation (either software or machinery) malfunctioned.

When documenting a cause for a problem, assign a value from each of these three categories. Note that these categories are equally applicable to problems expressed in terms of manual or automated systems or processes. In our current, mostly automated state, a problem is just as likely to involve both automated and manual components. Even when we are only responsible for the software, or the "systems" components, we need to consider that the problem can be caused by errors in the manual components of the "system".
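The three categories above can be recorded together as a small data structure. The following is only a sketch: the class names, field names, and the example cause are assumptions made for illustration, while the category values come from the lists above.

```python
from dataclasses import dataclass
from enum import Enum


class Proximity(Enum):
    """Where the cause sits relative to the symptom-revealing component."""
    LOCAL = "local"
    ADJACENT = "adjacent"
    SUBSYSTEM = "subsystem"
    SYSTEMIC = "systemic"
    EXTERNAL = "external"


class Frequency(Enum):
    """How often the cause is expected to produce an occurrence."""
    INCESSANT = "incessant"
    REGULAR = "regular"
    ROUTINE = "routine"
    INFREQUENT = "infrequent"
    UNLIKELY = "unlikely"


class CauseType(Enum):
    """The class or type of the cause."""
    HUMAN_ERROR = "human error"
    SEQUENCE_ERROR = "sequence error"
    RESOURCE_ISSUE = "resource issue"
    PROCESS_DEFINITION_ISSUE = "process definition issue"
    AUTOMATION_ISSUE = "automation issue"


@dataclass(frozen=True)
class Cause:
    """A documented cause; all three categories are required fields."""
    description: str
    proximity: Proximity
    frequency: Frequency
    cause_type: CauseType

    def summary(self) -> str:
        return (f"{self.description} "
                f"[{self.proximity.value}, {self.frequency.value}, "
                f"{self.cause_type.value}]")


# Hypothetical example, not taken from a real incident.
example = Cause(
    description="Operator skipped the calibration step",
    proximity=Proximity.LOCAL,
    frequency=Frequency.ROUTINE,
    cause_type=CauseType.HUMAN_ERROR,
)
print(example.summary())
```

Making every category a required field means a cause cannot be recorded with one of them left blank, which is the point of the paragraph above.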

The Impact

No description of a problem should be considered complete without some explanation of the impact. Impact is simply stated as the result of not solving the problem. All statements of impact should have a cost, a timeline to realize the impact, and a likelihood or probability of realization. None of these are precise measurements, because measurement would only be valid after the impact was realized. These are informed conjecture.

Let me walk through an example:

In a manufacturing plant, a piece of equipment essential to making hottentot widgets breaks down. We are currently manufacturing to backfill our inventory, and do not have any open customer orders for hottentot widgets. This piece of equipment is one of two that are capable of doing the required work, so I can continue to manufacture, but at half speed.

In this case, impact is only realized after I have sufficient customer orders to exhaust my inventory of hottentot widgets, and insufficient capacity to meet my customers' delivery expectations (causing my customer to cancel the order and place it with an alternate supplier). The immediate cost would be the profit on the canceled order. A subsequent cost might be that the customer would then favor the alternate supplier in a way that causes me a longer term loss of profitable business.

Given that I have three clients who purchase hottentot widgets regularly, my timeline to the immediate cost would be determined by the normal schedule for those clients and my current inventory. The probability of this impact would be relatively high, say 95%, in that if I don't fix that machine, it will happen. The longer term cost might only happen with one client, and only after two missed ship dates, maybe only with a 60% probability. This second risk can be mitigated by buying the widgets from the competitor myself and selling them at a loss to keep my customer happy.

Let's say that the immediate cost is $10,000, and the timeline to realization is 2 weeks, with a probability of 95%. That is an easy impact to understand. If I have the machine fixed within 2 weeks, there is a strong likelihood that I can get by with no impact. Call the service company and schedule the technician. However, if by the middle of week 2, things have not been resolved, I might get a little excited. I would probably be calling the service provider daily or more frequently, escalating with their management, potentially threatening to use a competitor if they can't meet my need.

Understanding the impact in terms of cost, timeline and probability is important to assessing the urgency of a solution. These tell me when I need to get out my cape and tights, and when even that is too late. More than anything, they allow me to manage my customers' and my management's expectations and to react to their concerns in ways that build confidence and credibility.
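The widget example can be written down as a small record of cost, timeline and probability. This is a rough sketch only: the `Impact` class, its field names, and the idea of multiplying cost by probability to get a probability-weighted cost are assumptions added for illustration. The figures are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """An impact statement; every value is informed conjecture."""
    description: str
    cost: float              # estimated cost if the impact is realized
    weeks_to_realize: float  # estimated timeline to realization
    probability: float       # estimated likelihood, 0.0 to 1.0

    def expected_cost(self) -> float:
        # Probability-weighted cost: a rough way to compare impacts.
        return self.cost * self.probability

    def weeks_remaining(self, weeks_elapsed: float) -> float:
        # Time left before the impact is expected to be realized.
        return max(self.weeks_to_realize - weeks_elapsed, 0.0)


# Figures from the hottentot widget example.
canceled_order = Impact(
    description="Customer cancels order for hottentot widgets",
    cost=10_000,
    weeks_to_realize=2,
    probability=0.95,
)

print(round(canceled_order.expected_cost()))   # probability-weighted cost
print(canceled_order.weeks_remaining(1.5))     # middle of week 2
```

By the middle of week 2 only half a week remains on a probability-weighted cost of about $9,500, which is when the daily calls to the service provider start.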

The Symptom

Problem solving is done better when the symptom is articulated separately from the cause and the impact. The symptom should be articulated as the experience of the person reporting the problem, and his or her opinion about where the process/system failed to meet his or her expectations. There may be some steps leading up to the failure, and maybe an outcome (what happened after the failure).

Analysis of what is necessary to reproduce the symptom is valuable. Validation that the symptom can be reproduced by following some steps is extremely valuable.

The context, or business process that the reporter was executing, is important. A description of what should have happened instead is also valuable. These are part of the symptom.

— the cause is separate from the symptom. When reporting the problem, speculation with respect to the cause is simply that, speculation.
— the impact is separate from the symptom. When reporting the problem, the impact is only to be understood in the context in which the problem was reported. 

The Problem

Problem solving is difficult. The good news is that problem solving is domain independent, so good problem solving skills can be applied to any problem. They can be applied to business process problems, to technical software problems, organizational problems, anything that presents itself as a system, for which a potential customer can ask you to solve a problem.

Perhaps the most difficult part is understanding enough about the problem to ensure that any potential solution will be effective. Your customer doesn't necessarily have a complete understanding of the symptoms. Your technical team doesn't necessarily have a complete understanding of the causes. Your customer relations group doesn't necessarily have a complete understanding of the impact. Yet those three aspects of the problem are required to assess potential solutions.

To get a complete enough understanding of the symptoms, you should be able to reproduce the symptom. If the symptom occurs inconsistently, or unpredictably, it is much harder to tell if you have addressed it. At the same time, an understanding of the symptom including the circumstances under which it can be reproduced, divorced from any speculation about the cause is an essential aspect of the problem definition.

With a good understanding of the symptom, you can document the impact of the problem. What does it mean to the individual customer, and how frequently or how likely does it happen? What does it prevent the customer from doing? How many customers is it likely to affect? Are there any opportunities to work around the problem to reduce or remove the impact? What is the potential for customer relationship impact? (Are you going to lose customers?) Sometimes the identity of the customer alone is an impact. When your customer is a public figure, or an organizational leader – the reputational impact is greater.

If your impact is a risk (no real damage has been done yet), what is the timeline on the realization of that risk? Is it end of day? End of billing cycle? How long is the fuse on this bomb? Does the customer have a business event that is impacted? When is that business event? (Do you routinely ask the customer those questions?)

When you have a complete enough understanding of the symptom, you can isolate causes of the symptom easily enough. For symptoms that appear inconsistent, it may be the case that several causal factors are required to align, rather than an individual cause. Isolating these individual causes is the challenge.

Understanding the causes then exposes a more comprehensive impact statement. What other symptoms might result from the issues that are causing the currently known symptoms? What are other potential impacts from this problem, based on these new possible symptoms and the understanding of the causes?

A solution that addresses the symptom without reducing or eliminating the impact merely "pushes the bubble." A solution that addresses symptoms without mitigation at the cause increases the complexity of the system and its long term maintenance or operating costs. A good solution eliminates the impact, not necessarily the symptoms, without increasing the complexity or reducing the sustainability of the system.

To solve a problem well, it is necessary to understand the symptoms, the causes and the impact in isolation from each other, as well as in relation to each other. A well documented problem statement has these three elements, clearly described, and also explains how they are related.

Truth is, most people and organizations waste a lot of time because they start to work on a solution before they have a good grasp on the problem. If you cannot explain the problem to others, you definitely should not be working on the solution. If you start to solve every problem by attempting to articulate the problem in this way, you will likely be regarded as a genius at problem solving. You will save your company time and money and quickly become a star performer.

The Solution

The notion that there is a single solution to any problem is a fallacy. There may be a solution to an equation, but every problem has more than one possible solution.

We learned in math that there is one right answer for every arithmetic "problem". But in fact even that is fallacious, because we can represent that answer correctly in many forms. 1/2 = 2/4 = 0.5 = 50%, etc… We also learned in math that the teacher expected us to show our work. Because the exercise was not to get the answer (that was in the back of the textbook) but to learn the method.

In the real world, when we have a problem, it is more likely to be a "word" problem, and if there is math behind it (rather than boolean logic) we need to represent the answer, or solution, in a form that fits whatever we are going to do with it. We don't get credit for doing the right method, or showing our work (unless you are building a repeatable process, that others will follow) – we get the answer and move on.

In the real world, there are always multiple paths to solution, and if the problem has any degree of complexity, it is the fastest, lowest cost, least effort, optimized path that is valued. In the real world, sometimes a quick approximation is more valuable than a 100% certainty. In the real world, the need changes faster than we can solve problems, so sometimes a quick fix is more valuable than a perfect solution.

In the real world, knowing what the likely points of failure are within the solution, what the probability of experiencing those failures is, and what external events would trigger those failures is as important as knowing how to construct the solution itself.

In the real world, understanding the problem, the impact of the problem, and the timeline for realizing that impact is as important to the solution as the solution itself.

In the real world, there are almost always solution options. Sometimes the right answer is to solve the problem multiple times.

1) A quick fix to manage the risk – in hours to days (you may call this a workaround, a band-aid, or a hack)

This is like applying a tourniquet. Good to stop the bleeding, but only for a short period of time, otherwise we will lose a limb. This solution incurs technical debt, as this fix will need to be unwound, and soon.

2) A more thorough solution to provide more of whatever attribute needs to be increased – in days to weeks (this might be a well structured hardening or bullet-proofing exercise, or a non-behavioral system change, or a behavioral change to accommodate new real world conditions)

This can be a long term fix, but usually adds complexity at the expense of continual maintenance. Every new project that has to change this thing will need to contemplate this complexity. Enough of these in our system and change becomes difficult; beyond a certain point it is better to start over than to fix. This solution will last for a long time, but may make us despise our own handiwork after a while. The benefit here is that we keep the change inside the bounds of our own control.

3) A correction at the root cause of the problem – in months to years, and may require agreement/negotiation with multiple stakeholders, change to work process, legal documents, and other elements outside of the direct control of those who are impacted by the problem (this is what we always want to do, but may require changes to human behavior, customer expectations, etc. that require a much greater planning effort to realize)

This change requires organizational management attention. If the root cause is outside the bounds of our control, or if the consequences of the change exert influence beyond the bounds of our control, we need to negotiate, and exert influence to get others (systems, teams, business units, customers) to accept the net impact of this change.

How many times do we stop after the first, or especially the second, solution, and how much complexity is inserted and maintained because we do not go the distance to solve the root cause? How many times have we moaned that "There is never time to do it right, but there is always time to do it over…"?

A Framework

The first step in problem solving is root cause analysis. Identifying the root cause requires a framework. A simple set of questions, the answers to which allow you to rule out probable causes, so that you can investigate only the probable causes you can't rule out. I have watched as team members exercise a brute force framework, investigating each identified cause to conclusion in sequence, instead of using a more optimized approach. I have observed colleagues spend hours investigating problems based on a single assumption about the root cause of the problem, only to realize that their assumption was incorrect, and they could have proven it in 15 minutes.

Try the following framework:

1) Carefully document the symptom that is manifest.
— make sure that you separate the symptom from any speculation as to cause.

2) Do you understand the system or portion of the system where the symptom is manifest?
— if not then engage someone who does – knowledge of the system is essential.

3) List out as many probable causes of the problem as you can think of.

4) For each probable cause, find the simplest way to prove that it is not the cause.
— ask yourself the question – if this were the cause, what else would need to be true?
— ask yourself the question – if this were the cause, what could not possibly also be true?

5) Select some optimized order or sequence to disprove
— the goal would be to disprove the most likely first, or to let the work of each proof build on the last – whichever feels more efficient.

6) Disprove each probable cause

7) The remaining probable causes must be investigated

8) Do not stop at one probable cause.
— a symptom can be the result of more than one cause.
— an occurrence of a symptom can be the result of exactly one cause.
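
Steps 3 through 8 can be sketched in code. This is an illustration under stated assumptions, not a definitive implementation: the `Candidate` class, the numeric likelihood and cost fields, the ordering rule (most likely first, cheapest check first among ties), and the three example causes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A probable cause plus the simplest check that could rule it out."""
    name: str
    likelihood: float                 # rough guess, 0.0 to 1.0
    check_minutes: float              # rough effort to run the check
    is_ruled_out: Callable[[], bool]  # True when the check disproves the cause


def order_for_disproof(candidates: List[Candidate]) -> List[Candidate]:
    # Step 5: most likely first; among equally likely causes, cheapest check first.
    return sorted(candidates, key=lambda c: (-c.likelihood, c.check_minutes))


def eliminate(candidates: List[Candidate]) -> List[Candidate]:
    # Steps 6 to 8: run every check and keep every cause that survives,
    # rather than stopping at the first survivor.
    return [c for c in order_for_disproof(candidates) if not c.is_ruled_out()]


# Hypothetical candidates; the lambdas stand in for real checks.
candidates = [
    Candidate("network timeout", 0.3, 30, lambda: True),
    Candidate("stale cache", 0.5, 15, lambda: True),
    Candidate("bad configuration", 0.3, 5, lambda: False),
]

print([c.name for c in order_for_disproof(candidates)])
print([c.name for c in eliminate(candidates)])
```

Here the stale cache is checked first because it is the most likely, the configuration check runs before the network check because it is cheaper at equal likelihood, and only the configuration survives to be investigated in depth.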