I didn’t study manufacturing engineering. Most of what I know about reliability comes from failures in the field, as they happen: when a tap leaks, when the front panel jams on unboxing, when an LED turns yellow in the first production lot.
So I’m writing this series as a way to learn vocationally: by documenting, testing, and applying. Thanks to two of my team members, Krishna (Industrial Designer) and Tejas (Design Engineer), who are constantly pushing FMEA across our products and educating me about it along the way. As I study their documentation, I decided to put together a set of notes for myself so I can share them with the broader team and keep reflecting on them in the future.
There are four sources I'm learning from: a. documenting all failures across the stages of deep R&D and designing the machine, b. issues from assembly, processes, quality checks, and moulding, c. the always-active document two of my team members are building, and d. two books: The Basics of FMEA and FMEA from Theory to Execution.
1. What is FMEA?
FMEA stands for Failure Mode and Effects Analysis. It's basically a way to ask: if this thing we're building ever fails, what's the worst that could happen, how likely is it to happen, and how soon would we catch it?
A “failure mode” is simply the way something can go wrong. For example, a water-purifier tap can leak, jam, or break off. Each of those is a different mode of failure.
An “effect” is what that failure causes. It could be water damage, an angry user, a warranty claim, or a bad post on social media.
So, FMEA = listing the ways things fail + what those failures cause + what we can do about them. That’s it. And because it’s so simple, it scales beautifully, from a tap lever to, probably, a jet engine.
2. Why is this important?
Because discovering failure late is expensive. Very expensive. Changing a CAD model costs almost nothing in comparison, but changing a mould costs lakhs. Changing a field-installed unit costs money and, most importantly, reputation.
FMEA brings failures forward in time so we can catch them when they’re cheap and harmless to fix. It’s also the only meeting where pessimists are welcome. Everyone gets to say “I told you so” in advance.
3. Three buckets: Design, Process, and User
Design FMEA (DFMEA)
This is done on the product itself. E.g. will the lever hinge fail from mechanical fatigue after 500 cycles or after 10,000 cycles? Each answer exposes a potential design weakness before tooling begins. DFMEA lives mostly inside CAD, material specs, and assembly diagrams. The goal is to minimise retooling and make the design inherently robust, so it doesn't need fixes later.
Process FMEA (PFMEA)
This is done on the build process. Could the operator install the lid in the wrong orientation, or install the wrong gasket? Could the water tank get contaminated during assembly? PFMEA is where design hands off to manufacturing, and both teams work together to control variation.
User FMEA (UFMEA)
This is done on how people use (or misuse) the product. E.g. the user pulls the lever with a jerk, or accidentally wipes the fascia with a dirty cloth and scratches the fine, matte-finish surface. The user isn't making these mistakes by choice, so it's our job to make misuse unlikely by design and make the correct use obvious. UFMEA helps us design for wet hands, low light, impatience, and known habits.
Lens | Focus | Owners | Output |
---|---|---|---|
DFMEA | The product | R&D, industrial design, design engineering, electronics & firmware | Stronger architecture, design, fit, materials, etc. |
PFMEA | The build process | Manufacturing, assembly, quality, supplier | Clear SOPs, fixtures & jigs, torque & inspection controls, part quality, manufacturing variability |
UFMEA | Humans using the product | Design, user experience, service & install | Clear design with affordances, feedback, error-proofing |
4. How to rate & compare failure risks
Every potential failure is scored on three things:
Factor | What it asks | Scale |
---|---|---|
Severity (S) | How bad is it if it happens? | 1 (barely noticeable) – 10 (worst case) |
Occurrence (O) | How likely is it to happen? | 1 (rare) – 10 (almost certain) |
Detection (D) | How likely are we to miss it before the user finds it? | 1 (almost always caught) – 10 (almost never caught) |
Then you multiply them to get a Risk Priority Number (RPN): RPN = S × O × D. With each factor scored from 1 to 10, RPN ranges from 1 to 1,000. High RPN = danger zone. Low RPN = probably fine.
This scoring is just a conversation starter. The real activity here is when a designer argues with a manufacturing engineer about why something should be a 6 instead of a 4. That argument is the learning. The learning compounds.
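To make the arithmetic concrete for myself, here is a minimal Python sketch of the scoring and ranking. The three failure modes come from the examples earlier in these notes, but the S, O, and D scores are made up purely for illustration, and the names `FailureMode` and `rank_by_rpn` are my own, not from any FMEA tool or book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA sheet (hypothetical structure for illustration)."""
    name: str
    severity: int    # S: 1-10
    occurrence: int  # O: 1-10
    detection: int   # D: 1-10

    def __post_init__(self):
        # Reject scores outside the 1-10 scale used in the table above.
        for label in ("severity", "occurrence", "detection"):
            score = getattr(self, label)
            if not 1 <= score <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        # RPN = S x O x D
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative scores only; real scores come out of the team discussion.
modes = [
    FailureMode("Tap leaks", severity=7, occurrence=4, detection=6),
    FailureMode("Front panel jams on unboxing", severity=5, occurrence=3, detection=2),
    FailureMode("LED turns yellow", severity=3, occurrence=5, detection=8),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:>4}  {mode.name}")
```

With these made-up scores the tap leak lands on top (7 × 4 × 6 = 168), followed by the yellowing LED (120) and the jammed panel (30). Change one score by a point or two and the order can flip, which is exactly why the argument over the number matters more than the number.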
5. How to run FMEA
FMEA books list ten steps. Here’s the same thing in plain English (for my understanding), along with where each step should be baked into the full product development process:
Design stage | What's happening now in product development (high-level) | FMEA step |
---|---|---|
| Product vision is locked, design brief is approved, and high-level requirements are frozen. | - |
| Surface CAD from Industrial Designers is complete. Basically, the shape & form is closed. Initial mechanical layout starts now. | DFMEA / UFMEA: Review the product or process |
| Design Engineers start working here. Form–fit–function exploration with early DFMEA. | DFMEA / PFMEA / UFMEA: Brainstorm everything that could fail |
| Design and aesthetics (non-CMF) close here. Functional reliability targets are defined. | DFMEA / PFMEA / UFMEA: Note the effect of each failure |
| Features of the machine are closed here. Feature-level trade-offs are also resolved. | DFMEA / UFMEA: Rate severity (S) |
| Engineering assumptions are tested here. Material/process selection is in progress. | DFMEA / PFMEA: Rate occurrence (O) |
| Reliability test planning begins. Controls and checkpoints are identified. | DFMEA / PFMEA / UFMEA: Rate detection (D) |
| Risks sheet is created. Top 10/20 items become visible to all teams. | Calculate RPN = S × O × D |
| Final 3D ready for manufacturing readiness. DFM/DFA reviews under way. | DFMEA / PFMEA / UFMEA |
| Tooling release preparation begins. Vendor & fixture readiness detailing also starts shortly after. | DFMEA / PFMEA / UFMEA |
| Testing phase is live. Re-validation of design + process changes. | DFMEA / PFMEA / UFMEA |
| Manufacturing SOPs freeze here. Service, installation training begins. | PFMEA / UFMEA |
| Product is in the market. FMEA evolves via service feedback and data. | DFMEA / PFMEA / UFMEA |
That’s the whole game. It’s a simple but exhaustive process, and it forces everyone to speak the same language of risk and appreciate it.
6. Trade-offs of FMEA and cutting traditional biases
Trade-offs and biases are important to understand because I know what it feels like to fight against time. Time-to-market matters in hardware. There’s always a push to launch the next version, meet factory deadlines, and lock BOMs before the quarter ends. But somewhere in that rush, the slow, unglamorous thinking gets squeezed out, and that’s usually where reliability dies first. FMEA is slow because it’s the one time the team gets to ask “What if it breaks, and why?”
6.1. Time trade-off
FMEA adds days or weeks early in design, but saves months later in rework, tooling changes, and assembly firefighting. Every untested failure mode that was skipped eventually shows up as a customer issue. And when that happens, it costs time & trust.
6.2. Mindset trade-off
I'm guilty of designing for success. Most of us are. FMEA requires us to also start designing against failure. That’s not natural for builders. We love imagining how things work, not how they break. Of course, the danger is swinging too far to the opposite side: obsessing over unlikely edge cases and delaying good ideas.
We should treat FMEA as a hygiene and sanity lens, and limit each FMEA cycle to 15-20 high-impact risks. If an item's failure doesn't affect user trust, safety, or service cost, we can park it for later.
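The parking rule above can be sketched in a few lines of Python. This is only my reading of it, under stated assumptions: the three impact flags mirror the three criteria (user trust, safety, service cost), ordering the high-impact items by RPN is my own choice, and `Risk`, `triage`, and `cycle_limit` are hypothetical names. The example items and their numbers are illustrative, and the last one is invented to show an item that gets parked.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """A scored FMEA item with the three impact criteria as flags (hypothetical)."""
    name: str
    rpn: int
    affects_user_trust: bool = False
    affects_safety: bool = False
    affects_service_cost: bool = False

    @property
    def high_impact(self) -> bool:
        # An item qualifies if its failure touches any one of the three criteria.
        return self.affects_user_trust or self.affects_safety or self.affects_service_cost


def triage(risks, cycle_limit=20):
    """Split risks into (this_cycle, parked).

    High-impact items are ordered by RPN (an assumption) and capped at
    cycle_limit; everything else is parked for a later cycle.
    """
    high = sorted((r for r in risks if r.high_impact), key=lambda r: r.rpn, reverse=True)
    low = [r for r in risks if not r.high_impact]
    this_cycle = high[:cycle_limit]
    parked = high[cycle_limit:] + low
    return this_cycle, parked


# Illustrative data only; cycle_limit is set to 2 to keep the example small.
risks = [
    Risk("Tap leaks", rpn=168, affects_user_trust=True, affects_service_cost=True),
    Risk("LED turns yellow", rpn=120, affects_user_trust=True),
    Risk("Fascia scratched by a dirty cloth", rpn=60, affects_user_trust=True),
    Risk("Inner bracket colour mismatch (invented example)", rpn=90),
]

this_cycle, parked = triage(risks, cycle_limit=2)
print("This cycle:", [r.name for r in this_cycle])
print("Parked:", [r.name for r in parked])
```

Note that the invented bracket item is parked even though its RPN is higher than the scratched fascia's: the impact criteria act as the gate, and RPN only orders what passes through it.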
6.3. Cutting biases pragmatically
I've seen that the concept of exhaustive FMEA attracts inputs from two kinds of people who have their own biases:
The over-believers: People from traditional manufacturing and large OEMs who treat FMEA as gospel. They love the process for its own sake. The bias is toward completeness, which doesn't necessarily serve the purpose, or fit the different dynamics, of a fast-moving start-up.
The under-believers: People from start-ups (like us) who love velocity, iteration, and proof over paperwork. This also comes from a decade or more of exposure to building software. The bias is toward intuition. It's only when things break that we realise the fix isn't a software patch we can push, but a nightmare spread across thousands of real machines in the field.
6.4. Middle ground
The goal is not to adopt or reject FMEA but to use it as a default thinking framework. If the analysis helps us understand failure and substantially reduces service revisits, we should do it. Are we learning fast enough to minimise similar failures in the next machine? If yes, we should do it. And we should consistently look for the minimum structured effort required to produce real insights.
I'll keep writing the next parts as I absorb this more deeply and get into running the process, the specifics of DFMEA, PFMEA, and UFMEA, and scoring. Maybe along the way I'll explore even faster and more efficient ways to do it. Maybe.