Why Most Root Cause Failure Initiatives Fail
Let's shed some light on what Root Cause Analysis is all about - and why most industries fail in their Root Cause initiatives.
Many industries have some form of Root Cause Analysis or problem-solving tools they use to analyze problems, yet many seem to be misguided about what Root Cause Analysis really is. I admit that I only learned its true meaning when my good old friend Charles Robert Nelms invited me to attend his Latent Cause Analysis training in March of 2007. In this issue of my reliability newsletter, I would like to shed some light on what Root Cause Analysis is all about and why most industries fail in their Root Cause initiatives.
Wrong Belief: In order to perform Root Cause, we must have DATA:
Data, data, data: without data, we can’t perform a thorough Root Cause Analysis. I hear this message a lot, and I cannot emphasize enough how wrong it is. The lifeblood of any Root Cause Analysis is “EVIDENCE”, not data. When something goes wrong, there is a cause behind every bit of the problem, and to determine that cause we gather the pieces of the puzzle until we can see the big picture. Data is just one part of the evidence; only when we tie all the pieces together can we understand what really caused the problem, and that is what Root Cause is all about. To understand the cause of a failure, envision an ICEBERG: the failure itself is evident, but what lies underneath remains hidden, and our only link to what is underneath is the evidence.
Most of the evidence is washed away by the fast team:
When an equipment failure occurs, there are two teams: the fast team and the slow team. The fast team is the restoration team, which restores the equipment so that operations can be up and running again. They must be fast. Spares are at hand and they are timed on how fast they can repair the equipment; the longer they take, the more shadows (people watching them) gather at their backs. There is pressure on the fast team: their MTTR (Mean Time To Repair) has improved by 30% over the last 5 years, and they are always keen on improving their repair time. They are considered heroes all the time, because without them operations are delayed. Does this sound familiar?
On the other hand, we have the slow team, the team that will analyze the failure. The problem is that by the time the slow team arrives, the equipment is already running again and most of the evidence has been washed away or has evaporated. So how in the world can we perform a Root Cause Analysis in this situation?
Many will disagree with me, but I live by my principles and by what I believe is right. MTTR, or Mean Time To Repair, can never be a measure of reliability. In fact, when people become too good at repairing, it means only one thing: the failure keeps repeating itself. I would rather take my time to understand and learn from a failure than keep repairing the same failure over and over. My recommendation for this situation is simple. Before fixing the failure, have the restoration team photograph the equipment inside and out, with a close-up of the affected part or component. Then have them secure the failed part or component in a plastic bag, together with any foreign object they find near it, and hand everything over to the slow team, which will investigate and analyze the problem. Make this a habit before repairing the equipment!
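To see why a faster repair time says nothing about reliability, here is a minimal sketch that tracks repair time alongside Mean Time Between Failures (MTBF) for one machine. The yearly figures, the YearLog record and the function names are all hypothetical, made up purely for illustration; only the 30% improvement in repair time echoes the story above.

```python
from dataclasses import dataclass


@dataclass
class YearLog:
    """One year of failure history for a single machine (hypothetical figures)."""
    year: int
    operating_hours: float     # hours the machine was actually running
    repair_hours: list[float]  # duration of each repair, one entry per failure


def mttr(log: YearLog) -> float:
    """Mean Time To Repair: average hours spent restoring the machine per failure."""
    return sum(log.repair_hours) / len(log.repair_hours)


def mtbf(log: YearLog) -> float:
    """Mean Time Between Failures: operating hours divided by the number of failures."""
    return log.operating_hours / len(log.repair_hours)


# Illustrative data only: the fast team gets quicker while the failure keeps repeating.
history = [
    YearLog(2005, 8000.0, [10.0, 10.0]),
    YearLog(2006, 8000.0, [8.0, 8.0, 8.0, 8.0]),
    YearLog(2007, 8000.0, [7.0] * 8),
]

for log in history:
    print(
        f"{log.year}: repair time = {mttr(log):.1f} h, "
        f"MTBF = {mtbf(log):.0f} h, "
        f"total downtime = {sum(log.repair_hours):.0f} h"
    )
```

With these made-up numbers, the average repair time drops from 10 hours to 7 hours, a 30% improvement, while MTBF falls from 4000 hours to 1000 hours and total downtime rises from 20 hours to 56 hours. The fast team looks better every year and the machine gets worse every year.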
How deep we should probe in a Root Cause Analysis is unknown:
A true and meaningful Root Cause Analysis holds that all failures have a physical cause, that all physical causes are triggered by humans, which means there is always a human error involved, and that humans are negatively influenced by latent forces. Therefore the goal of any Root Cause Analysis is to expose and identify these latent causes of failure. The probing in any Root Cause Analysis effort should end once the latent causes are reached.
During one of my classes on Lubrication and Oil Contamination Control, a manager handed me an oil analysis report on one of their compressors. When I scanned through the report I was so shocked that it almost slipped out of my hand. I asked the person who handed me the report whether the compressor failed frequently. He said “NOT THIS TIME”, and I said that I did not believe him, unless he had given me the wrong report. He then explained that the air compressor used to fail a lot when they were using mineral-based oil. They talked to the vendor about it, the vendor recommended synthetic oil, and after they switched to the recommended oil the failures seemed to be gone. My question is, was the root cause of the problem determined? NO!
You see, the oil analysis report indicated a moisture content of around 7000 ppm. The standard sample size for an oil analysis is 100 ml. The allowable moisture content for oil is around 100 ppm, and a reading of around 700 ppm signals a warning that the oil needs to be changed. The reading here was 10 times that warning limit, yet the oil seemed capable of handling it. Why? Because the oil is synthetic, and this is what synthetic oils are designed for. Synthetic oil can withstand the heat, but you must also be able to withstand its cost. When I asked how much the synthetic oil cost, he told me it was around Php 120,000.00 (around 2,667 USD) per barrel, while the mineral or petroleum oil they used before cost around Php 15,000.00 (around 333 USD) per drum. Note that we are comparing one barrel against one drum: there are 55 gallons in a drum and 42 gallons in a barrel.
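Because the two prices are quoted for different container sizes, a fair comparison needs a per-gallon figure. The short sketch below works it out from the numbers quoted above, assuming the synthetic oil price is for a 42-gallon barrel and the mineral oil price is for a 55-gallon drum; the variable and function names are my own.

```python
# Figures quoted in the story above (prices in Philippine pesos).
SYNTHETIC_PHP_PER_BARREL = 120_000.0
MINERAL_PHP_PER_DRUM = 15_000.0
GALLONS_PER_BARREL = 42.0  # assumption: synthetic oil is sold by the barrel
GALLONS_PER_DRUM = 55.0    # assumption: mineral oil is sold by the drum

MOISTURE_READING_PPM = 7000.0
MOISTURE_WARNING_PPM = 700.0


def price_per_gallon(container_price: float, gallons: float) -> float:
    """Convert a container price into a price per gallon."""
    return container_price / gallons


synthetic = price_per_gallon(SYNTHETIC_PHP_PER_BARREL, GALLONS_PER_BARREL)
mineral = price_per_gallon(MINERAL_PHP_PER_DRUM, GALLONS_PER_DRUM)
cost_ratio = synthetic / mineral
moisture_ratio = MOISTURE_READING_PPM / MOISTURE_WARNING_PPM

print(f"Synthetic oil: Php {synthetic:,.2f} per gallon")
print(f"Mineral oil:   Php {mineral:,.2f} per gallon")
print(f"Synthetic oil costs {cost_ratio:.1f} times as much per gallon")
print(f"Moisture reading is {moisture_ratio:.0f} times the warning limit")
```

Container against container, the synthetic oil looks 8 times more expensive. Per gallon it is about 10.5 times more expensive (roughly Php 2,857 against Php 273), which is the price being paid to tolerate water that should never have reached the oil in the first place.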
Surely, when we probe further in our Root Cause Analysis, the compressor failures will have something to do with the oil. But all failures are triggered by humans, and when we visited the oil storage room, the lids were either missing or open, not only on this oil but on the other drums as well. Perhaps the maintenance person did not close the lid after getting some oil from the barrel. So my second question is, do we blame the maintenance person for not closing the lid of the container, which led to the compressor failure? NO!
But when we probe much deeper into the latent cause of the problem, we find that the drums and barrels are stocked out in the open, exposed to sun, dust, rain and moisture. I bet that even with the lids on, there would still be water present due to condensation, especially in the early morning. The drums are piled up against the wall, and every single employee who enters and leaves the plant may have noticed this “SMALL” problem, that the lids of the oil containers are missing. Yet no one seemed to care, or no one seemed to understand that it caused the failure of the air compressor, which had paralyzed 1/2 of their plant a year back. Latency lies in each and every one of us, yet it is sad to note that once the “WHO” in the root cause is known, most organizations end their probe and punish the guilty one. I cannot emphasize this enough.
Root Cause takes much more than 24 hours to complete:
Root Cause is not about some fancy software that gives you a list of failure modes to select from, where we can click here and there. Going through a Root Cause Analysis is a slow and painful process. It requires time to complete and to absorb the learning. In fact, we can only learn from a failure if we acknowledge that each and every one of us is part of the problem. As part of my work, I teach a selection of courses on reliability and maintenance to industry. In one of my classes, a participant from top-level management approached me at the end of the day and said, “Rolly, come with me and I will show you something.” I followed him to his office, where he handed me a Failure Analysis report done by one of their clients. I glanced through it and, as usual, the analysis ended at the physical cause of the failure. After I placed the report on his desk, he handed me another report on the same failure, this one written 3 years earlier. The new report was an exact replica, copied word for word and pasted from the old one. The only differences between the 2 reports were the name of the person who prepared them and the dates, which were 3 years apart.
You see, Root Cause takes more than 24 hours to undertake. In fact, depending on the gravity of the problem, it can take a week or more. Time is needed especially to verify the causes of the problem against the evidence uncovered; if the evidence does not show up, we need to dig for more until each cause is verified. I came from manufacturing and spent quite a number of years there, and most manufacturing plants are required to complete their analysis within 24 hours, especially if the customer is the one who complained about a defective product. We want to impress customers with how fast we can resolve problems, but even if you ask the experts on Root Cause, it simply cannot be completed in just 24 hours. Again, Root Cause is a slow and painful process, and we must bear with it to get to the truth and the cause of the problem.
Root Cause is not about probability of failure and failure modes:
There is some confusion about which methods and tools constitute a Root Cause Analysis, and some are led to believe that the tools below will eventually lead them to the root cause of the problem.
Fault Tree Analysis was developed in the 1950s by Boeing aerospace engineers for use in the development stages of the design process. It is a very mathematical tool, and it yields probabilities. Its primary intent is to predict the probability of a specific failure. This is a design tool. (Source: What You Can Learn From the Things That Go Wrong by Bob Nelms)
The Failure Mode and Effects Analysis discipline was developed in the United States military. Procedure MIL-P-1629, titled Procedures for Performing a Failure Mode, Effects and Criticality Analysis, is dated November 9, 1949. It was used as a reliability evaluation technique to determine the effect of system and equipment failures. Failures were classified according to their impact on mission success and on personnel and equipment safety.
Pareto’s 80/20 Rule: Quality management pioneer Dr. Joseph Juran, working in the US in the 1930s and 1940s, recognized a universal principle he called the vital few and trivial many and reduced it to writing. In an early work, a lack of precision on Juran's part made it appear that he was applying Pareto's observations about economics to a broader body of work. The name Pareto's Principle stuck, probably because it sounded better than Juran's Principle. As a result, Dr. Juran's observation of the "vital few and trivial many", the principle that 20 percent of something is always responsible for 80 percent of the results, became known as Pareto's Principle or the 80/20 Rule.
Ishikawa Diagram or Fishbone Diagram: developed by Kaoru Ishikawa in 1969. A fishbone is constructed and labeled with the 4M’s and 1E, and the probable causes that relate to the problem are listed under each. The team then brainstorms and focuses on the most likely or probable cause of failure.
No disrespect to the people who developed these tools and those who use them. These are powerful problem-solving tools, but they are simply not designed to determine the Root Cause, only the Probable Cause, of failures. Root Cause and Probable Cause are two entirely different things. A Root Cause Analysis will always depend on the evidence found in order to determine the precise cause of the problem. It is not about selecting the vital few, it is not about brainstorming, it is not about probability of failure, and it is not about experience. It is all about the “EVIDENCE” that we must uncover, which will eventually lead us to the truth behind the cause of failure.
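To make the difference concrete, here is a minimal sketch of the kind of number a fault tree produces. The two gate formulas are the standard ones for independent events; the events, their probabilities and the tree itself are hypothetical, invented only to illustrate the calculation.

```python
from math import prod


def and_gate(probabilities: list[float]) -> float:
    """Output event occurs only if ALL independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities: list[float]) -> float:
    """Output event occurs if AT LEAST ONE independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical yearly probabilities, for illustration only.
p_water_in_oil = 0.10
p_filter_clogged = 0.05
p_alarm_fails = 0.20

# A lubrication problem exists if either contaminant condition is present...
p_lube_problem = or_gate([p_water_in_oil, p_filter_clogged])
# ...and the compressor is damaged only if the alarm also fails to trip.
p_compressor_damage = and_gate([p_lube_problem, p_alarm_fails])

print(f"P(lubrication problem) = {p_lube_problem:.3f}")
print(f"P(compressor damage)   = {p_compressor_damage:.3f}")
```

With these assumed inputs the tree gives 0.145 for a lubrication problem and 0.029 for compressor damage. That is a useful figure at the design stage, but it only says what might happen. It cannot say what actually did happen to a particular compressor on a particular day, and that is the question the evidence has to answer.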
Focus of Root Cause should be on small problems and not big problems:
Big problems are caused by small problems; in fact, these small problems accumulate over time and end up causing chaos in our organization. Remember my example above: an air compressor failure shut down a portion of the plant and cost it revenue because missing lids on the oil containers allowed moisture to get in. A power plant was shut down for days because all three of its triple-redundant protective devices failed over time; the plant never inspected them, confident that they would not cause any problem because each protective device had 2 more back-ups. In a manufacturing plant where I previously worked, one of the main customers pulled out all its work because the products shipped were off spec, and the problem was later traced to a tiny LED that was clogged with dirt. This LED, after all, determines whether the product is off spec or not. I can go on and on, but these stories have one thing in common: the small things caused the big fires. Root Cause is reactive, because we only perform a Root Cause Analysis after a failure has occurred. But what we learn from Root Cause, especially about the little things that are often neglected, shows us the causes of big problems. When people in the plant start to care about those little things, that is what makes Root Cause Analysis PROACTIVE.
Root Cause is not about “WHO” caused the problem:
Root Cause Analysis is not designed to determine who caused the problem, but rather to understand why and how the problem manifested itself. While it is true that for every failure there is always a human cause, there is also a hidden cause that led this person to commit the mistake. In fact, according to Bob Nelms, the Golden Rule of any Root Cause Analysis is that we will try to understand to such an extent that we are convinced we would have done the same thing if we were that person. It must be clear to everyone in the organization that no one should be punished as a result of any Root Cause Analysis investigation. Our point is to understand what led that person to commit the mistake so that we can all learn from it. Before we go on to punish someone, we should ask our conscience: are we really sure that we would have done differently if we were in his shoes? Blame and finger pointing have no place in any Root Cause Analysis. If our intent in a Root Cause Analysis is simply to punish people, then our efforts will be entirely useless and the industry will never learn.
Acknowledgement:
I would like to thank my friend Charles Robert (BOB) Nelms for helping me understand the true meaning of Root Cause Analysis. Bob is a Root Cause Analysis consultant and President of Failsafe Network. You can reach him through his website at http://www.failsafe-network.com
This article is authored by Rolly A., who travels from Sta. Rosa, Laguna, Philippines. Rolly is available for professional consulting work both virtually and in person.