On August 1, 2012, Knight Capital deployed new trading software to production. Within 45 minutes, the firm had lost $440 million—accumulated over 16 years, gone before lunch. The bug? Old code that should have been removed was accidentally reactivated, causing the system to execute 4 million trades at unfavorable prices.
Days later, Knight Capital had to accept a roughly $400 million rescue from outside investors to stay solvent; within a year the firm had been acquired.
This wasn't a sophisticated edge case. It was a failure mode that proper analysis would have identified: "What happens if deprecated code paths are triggered in production?"
FMEA Isn't Just for Hardware
Mention FMEA (Failure Mode and Effects Analysis) to software engineers, and you'll often get puzzled looks. Isn't that for manufacturing? For hardware? For safety-critical systems?
The methodology originated in military hardware development in the 1940s, gained prominence after NASA adopted it following spacecraft failures, and became mandatory in automotive and aerospace industries. But the core principle—systematically identifying what can fail and ensuring adequate controls—applies universally.
Every software system has failure modes:
- Deployments fail
- Dependencies break
- Edge cases trigger unexpected behavior
The question isn't whether you have failure modes; it's whether you've identified them before your users do.
The Evolution of FMEA: From RPN to AIAG-VDA
Traditional FMEA assigns a Risk Priority Number (RPN) by multiplying Severity (1-10), Occurrence (1-10), and Detection (1-10). A bug that causes data corruption (Severity 9), happens occasionally (Occurrence 5), and is hard to detect before production (Detection 8) yields RPN 360—demanding attention.
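The arithmetic is simple enough to express directly. A minimal sketch in Python (the `rpn` helper is hypothetical, not part of any FMEA tool):

```python
# Hypothetical helper: classic RPN scoring for a failure mode.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: each factor rated 1-10, higher is worse.
    Note that a higher Detection rating means HARDER to detect."""
    for rating in (severity, occurrence, detection):
        assert 1 <= rating <= 10, "FMEA ratings run from 1 to 10"
    return severity * occurrence * detection

# The data-corruption bug from the text:
print(rpn(severity=9, occurrence=5, detection=8))  # 360
```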
The modern AIAG-VDA approach, refined by the automotive industry, adds nuance. The Action Priority (AP) system considers combinations of factors rather than simple multiplication, and the seven-step process ensures nothing is overlooked:
1. Define Scope – What you're analyzing: a service, a deployment pipeline, a critical feature
2. Map Structure – Microservices, dependencies, data flows
3. Document Functions – What each component must do correctly
4. Identify Failures – How each function can fail
5. Assess Risk – Severity, likelihood, detectability
6. Implement Controls – Testing, monitoring, rollback mechanisms
7. Document Results – Traceable, auditable, maintainable
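To illustrate the difference between multiplication and combination-based priority, a toy Action Priority function might look like the sketch below. The thresholds are invented for illustration only; the actual AIAG-VDA handbook defines AP via a published lookup table, not these rules:

```python
# Simplified illustration of Action Priority: combinations of ratings
# map to High/Medium/Low instead of one multiplied score. These
# thresholds are invented, NOT the official AIAG-VDA lookup table.
def action_priority(severity: int, occurrence: int, detection: int) -> str:
    if severity >= 9 and (occurrence >= 4 or detection >= 7):
        return "High"    # severe failure with real exposure: act now
    if severity >= 7 and occurrence >= 4 and detection >= 5:
        return "High"
    if severity >= 4 and (occurrence >= 6 or detection >= 8):
        return "Medium"  # review controls, schedule improvement
    return "Low"         # monitor through normal practice

print(action_priority(10, 4, 9))  # High
print(action_priority(2, 2, 2))   # Low
```

The point of the combination approach: a Severity-10 failure gets attention even when its raw product is modest, which pure RPN multiplication can miss.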
Applying FMEA to Knight Capital's Failure
The failure mode: deprecated code reactivated during deployment. Let's work through how systematic analysis would have flagged this.
Structure Analysis
The trading system contained multiple code paths, including legacy functionality from a previous algorithm. The deployment process activated features via configuration flags.
Function Analysis
The deployment process must activate only current, tested functionality. Legacy code paths must remain dormant.
Failure Analysis
- What if a deployment activates legacy code?
- What if configuration flags are misconfigured?
- What if old binaries contain active but untested paths?
Risk Assessment
- Severity: 10 (automated trading at unfavorable prices can cause unlimited losses)
- Occurrence: 4 (configuration errors happen in complex deployments)
- Detection: 9 (no automated verification that only intended code paths were active)
This combination—high severity, moderate occurrence, poor detection—demands immediate action. The optimization step would have prescribed what every modern deployment should include: feature flags with explicit activation, dead code removal, canary deployments, and automated rollback triggers.
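The feature-flag control in particular can be made fail-closed, so that unknown or deprecated paths default to off. A minimal sketch (the `FeatureFlags` class and flag names are hypothetical):

```python
# Sketch of a fail-closed feature flag registry (hypothetical API).
# Flags must be explicitly registered and activated; anything unknown
# stays OFF, so a stale code path cannot silently wake up.
class FeatureFlags:
    def __init__(self):
        self._flags = {}

    def register(self, name: str, enabled: bool = False) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown or unregistered flag -> default OFF, never ON.
        return self._flags.get(name, False)

flags = FeatureFlags()
flags.register("new_router", enabled=True)
print(flags.is_enabled("new_router"))       # True
print(flags.is_enabled("legacy_algo"))      # False: dormant by default
```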
Common Software Failure Modes Worth Analyzing
Database Migration Failures
- Failure Mode: Schema migration partially completes, leaving database in inconsistent state
- Effect: Application errors, data corruption, extended downtime
- Current Controls: Migration runs in transaction (but doesn't handle all edge cases)
- Detection: 6 (often discovered when application throws errors)
- Action: Pre-migration validation, rollback scripts tested in staging
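One way to improve that detection number is to validate inside the transaction itself, so a bad migration rolls back instead of committing. A sketch using Python's sqlite3 as a stand-in for a production database (the schema is invented):

```python
# Sketch: run a schema migration inside a transaction and validate the
# result BEFORE committing; any exception triggers a rollback.
import sqlite3

def migrate(conn: sqlite3.Connection) -> None:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
        # Pre-commit validation: abort if the schema is not as expected.
        cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
        if "email" not in cols:
            raise RuntimeError("migration validation failed: email missing")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
migrate(conn)
```

Real databases differ in whether DDL is transactional, which is exactly the kind of edge case this failure-mode entry should record.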
Third-Party API Changes
- Failure Mode: External API changes behavior or schema without notice
- Effect: Integration failures, data processing errors, customer-facing issues
- Current Controls: Monitoring for error rates
- Detection: 5 (discovered after impact begins)
- Action: Contract testing, API version pinning, graceful degradation
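A consumer-side contract check with graceful degradation might look like this sketch (the payload shape and `parse_price` helper are invented for illustration):

```python
# Sketch: validate a third-party payload against our expected contract
# and degrade to a fallback instead of crashing on schema drift.
EXPECTED_FIELDS = {"symbol": str, "price": float}

def parse_price(payload: dict, fallback=None):
    for field, ftype in EXPECTED_FIELDS.items():
        if not isinstance(payload.get(field), ftype):
            # Schema drift detected: log and alert here, then degrade.
            return fallback
    return payload["price"]

print(parse_price({"symbol": "ACME", "price": 3.75}))    # 3.75
print(parse_price({"symbol": "ACME", "price": "3.75"}))  # None: type changed upstream
```

Running the same check in CI against a recorded or provider-supplied contract moves detection from "after impact begins" to "before deploy".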
Race Conditions Under Load
- Failure Mode: Concurrent operations cause data inconsistency
- Effect: Incorrect state, duplicate processing, financial discrepancies
- Current Controls: Unit tests (which don't test concurrency)
- Detection: 8 (surface unpredictably under specific load patterns)
- Action: Concurrency testing, idempotency enforcement, distributed tracing
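Idempotency enforcement is one of the cheaper controls here. A toy sketch using an in-memory key set guarded by a lock (a production system would back this with a database unique constraint instead):

```python
# Sketch: idempotency-key enforcement so concurrent or retried
# requests produce side effects exactly once.
import threading

class IdempotentProcessor:
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()
        self.processed = 0

    def process(self, idempotency_key: str) -> bool:
        with self._lock:
            if idempotency_key in self._seen:
                return False      # duplicate: skip the side effect
            self._seen.add(idempotency_key)
            self.processed += 1   # the side effect runs exactly once
        return True

p = IdempotentProcessor()
threads = [threading.Thread(target=p.process, args=("order-42",)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(p.processed)  # 1
```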
Memory Leaks in Long-Running Services
- Failure Mode: Gradual memory consumption leads to OOM termination
- Effect: Service unavailability, request failures, cascading effects
- Current Controls: Container memory limits trigger restart
- Detection: 4 (memory metrics are monitored)
- Action: Memory profiling in load tests, trend analysis alerting
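Trend analysis can catch a slow leak long before the OOM kill. A naive sketch that fits a least-squares slope to recent memory samples (the threshold and window are illustrative, not tuned values):

```python
# Sketch: flag a suspected leak when memory grows faster than
# slope_threshold MB per sample across the observation window.
def leak_suspected(samples_mb, slope_threshold=1.0) -> bool:
    n = len(samples_mb)
    if n < 2:
        return False
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb)) \
            / sum((x - mean_x) ** 2 for x in xs)
    return slope > slope_threshold

print(leak_suspected([100, 105, 111, 118, 124]))  # True: ~6 MB per sample
print(leak_suspected([100, 101, 100, 102, 101]))  # False: flat
```

Alerting on the slope rather than the absolute value is what drags the detection rating down from "restart after OOM" to "ticket filed days in advance".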
Integrating FMEA into Your Development Process
FMEA shouldn't be a separate ceremony—it should be woven into how you build software.
During Design Reviews
Before implementing significant features, spend 30 minutes on failure analysis:
- What external dependencies are we adding?
- What happens when they're unavailable?
- How will we know if this breaks?
During Deployment Planning
Each deployment is an opportunity for failure:
- What's the rollback plan?
- How will we detect problems?
- What's our blast radius?
Document these answers; you're building your FMEA.
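A failure-mode inventory doesn't need heavyweight tooling; a small structured record is enough to start. A sketch with an invented schema (field names are illustrative, not a standard format):

```python
# Sketch of a lightweight FMEA record for capturing deployment answers.
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    failure: str
    effect: str
    severity: int    # 1-10
    occurrence: int  # 1-10
    detection: int   # 1-10: higher means harder to detect
    controls: str    # rollback plan, alerts, tests

inventory = [
    FailureMode("deploy pipeline", "bad config flag activates wrong path",
                "incorrect behavior in production", 9, 4, 7,
                "canary deploy, automated rollback on error-rate spike"),
]
worst = max(inventory, key=lambda fm: fm.severity * fm.occurrence * fm.detection)
print(worst.component)  # deploy pipeline
```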
After Incidents
Every post-mortem should update your failure mode inventory. You've just learned something about your system—capture it. Adjust detection ratings. Add new failure modes. Update controls.
Before Major Releases
High-stakes releases deserve explicit FMEA sessions. Gather engineering, QA, and operations. Work through what can go wrong. You'll launch with confidence because you've already addressed the obvious failure modes.
Why This Matters for Your Career
The engineers who prevent disasters are more valuable than those who heroically respond to them. Building systematic failure analysis into your practice—whether you call it FMEA or not—demonstrates the kind of judgment that distinguishes senior engineers.
- When you ask "what happens if this fails?" before writing code, you're practicing FMEA
- When you insist on rollback mechanisms before deployment, you're implementing FMEA controls
- When you document system dependencies and single points of failure, you're conducting FMEA structure analysis
The methodology gives a name and structure to practices that effective engineers develop through painful experience. Learning it deliberately means fewer lessons learned the hard way.
The Aerospace Standard for Software
Commercial aircraft run on software analyzed to DO-178C standards. Medical devices follow IEC 62304. These aren't suggestions—they're requirements that have prevented countless failures.
Your software may not control aircraft or medical devices, but your users depend on it. Your business depends on it. The principles that keep airplanes in the sky apply to keeping your services running.
Start with your most critical path. What's the user journey that absolutely cannot fail? Analyze its failure modes. Implement controls. You'll be surprised how quickly this thinking becomes natural—and how many problems you catch before they reach production.
NirmIQ Team
The NirmIQ team shares insights on requirements management, FMEA, and safety-critical systems engineering.