On August 1, 2012, Knight Capital deployed new trading software to production. Within 45 minutes, the firm had lost $440 million—accumulated over 16 years, gone before lunch. The bug? Old code that should have been removed was accidentally reactivated, causing the system to execute 4 million trades at unfavorable prices.
Days later, Knight Capital had to accept a roughly $400 million rescue from outside investors to stay solvent; within a year the firm had been acquired.
This wasn't a sophisticated edge case. It was a failure mode that proper analysis would have identified: "What happens if deprecated code paths are triggered in production?"
FMEA Isn't Just for Hardware
Mention FMEA (Failure Mode and Effects Analysis) to software engineers, and you'll often get puzzled looks. Isn't that for manufacturing? For hardware? For safety-critical systems?
The methodology originated in military hardware development in the 1940s, gained prominence after NASA adopted it following spacecraft failures, and became mandatory in automotive and aerospace industries. But the core principle—systematically identifying what can fail and ensuring adequate controls—applies universally.
Every software system has failure modes:
- Deployments fail
- Dependencies break
- Edge cases trigger unexpected behavior
The question isn't whether you have failure modes; it's whether you've identified them before your users do.
The Evolution of FMEA: From RPN to AIAG-VDA
Traditional FMEA assigns a Risk Priority Number (RPN) by multiplying Severity (1-10), Occurrence (1-10), and Detection (1-10). A bug that causes data corruption (Severity 9), happens occasionally (Occurrence 5), and is hard to detect before production (Detection 8) yields RPN 360—demanding attention.
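The arithmetic is simple enough to express directly. A minimal sketch in Python (the `rpn` helper is hypothetical, not part of any FMEA tool):

```python
# Hypothetical helper: classic RPN scoring for a failure mode.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: each factor rated 1-10, higher is worse.
    Note that a higher Detection rating means HARDER to detect."""
    for rating in (severity, occurrence, detection):
        assert 1 <= rating <= 10, "FMEA ratings run from 1 to 10"
    return severity * occurrence * detection

# The data-corruption bug from the text:
print(rpn(severity=9, occurrence=5, detection=8))  # 360
```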
The modern AIAG-VDA approach, refined by the automotive industry, adds nuance. The Action Priority (AP) system considers combinations of factors rather than simple multiplication, and the seven-step process ensures nothing is overlooked:
1. Define Scope – What you're analyzing: a service, a deployment pipeline, a critical feature
2. Map Structure – Microservices, dependencies, data flows
3. Document Functions – What each component must do correctly
4. Identify Failures – How each function can fail
5. Assess Risk – Severity, likelihood, detectability
6. Implement Controls – Testing, monitoring, rollback mechanisms
7. Document Results – Traceable, auditable, maintainable
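To illustrate the difference between multiplication and combination-based priority, a toy Action Priority function might look like the sketch below. The thresholds are invented for illustration only; the actual AIAG-VDA handbook defines AP via a published lookup table, not these rules:

```python
# Simplified illustration of Action Priority: combinations of ratings
# map to High/Medium/Low instead of one multiplied score. These
# thresholds are invented, NOT the official AIAG-VDA lookup table.
def action_priority(severity: int, occurrence: int, detection: int) -> str:
    if severity >= 9 and (occurrence >= 4 or detection >= 7):
        return "High"    # severe failure with real exposure: act now
    if severity >= 7 and occurrence >= 4 and detection >= 5:
        return "High"
    if severity >= 4 and (occurrence >= 6 or detection >= 8):
        return "Medium"  # review controls, schedule improvement
    return "Low"         # monitor through normal practice

print(action_priority(10, 4, 9))  # High
print(action_priority(2, 2, 2))   # Low
```

The point of the combination approach: a Severity-10 failure gets attention even when its raw product is modest, which pure RPN multiplication can miss.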
Applying FMEA to Knight Capital's Failure
The failure mode: deprecated code reactivated during deployment. Let's work through how systematic analysis would have flagged this.
Structure Analysis
The trading system contained multiple code paths, including legacy functionality from a previous algorithm. The deployment process activated features via configuration flags.
Function Analysis
The deployment process must activate only current, tested functionality. Legacy code paths must remain dormant.
Failure Analysis
- What if a deployment activates legacy code?
- What if configuration flags are misconfigured?
- What if old binaries contain active but untested paths?
Risk Assessment
- Severity: 10 (automated trading at unfavorable prices can cause unlimited losses)
- Occurrence: 4 (configuration errors happen in complex deployments)
- Detection: 9 (no automated verification that only intended code paths were active)
This combination—high severity, moderate occurrence, poor detection—demands immediate action. The optimization step would have prescribed what every modern deployment should include: feature flags with explicit activation, dead code removal, canary deployments, and automated rollback triggers.
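The feature-flag control in particular can be made fail-closed, so that unknown or deprecated paths default to off. A minimal sketch (the `FeatureFlags` class and flag names are hypothetical):

```python
# Sketch of a fail-closed feature flag registry (hypothetical API).
# Flags must be explicitly registered and activated; anything unknown
# stays OFF, so a stale code path cannot silently wake up.
class FeatureFlags:
    def __init__(self):
        self._flags = {}

    def register(self, name: str, enabled: bool = False) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown or unregistered flag -> default OFF, never ON.
        return self._flags.get(name, False)

flags = FeatureFlags()
flags.register("new_router", enabled=True)
print(flags.is_enabled("new_router"))       # True
print(flags.is_enabled("legacy_algo"))      # False: dormant by default
```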
Common Software Failure Modes Worth Analyzing
Database Migration Failures
- Failure Mode: Schema migration partially completes, leaving database in inconsistent state
- Effect: Application errors, data corruption, extended downtime
- Current Controls: Migration runs in transaction (but doesn't handle all edge cases)
- Detection: 6 (often discovered when application throws errors)
- Action: Pre-migration validation, rollback scripts tested in staging
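One way to improve that detection number is to validate inside the transaction itself, so a bad migration rolls back instead of committing. A sketch using Python's sqlite3 as a stand-in for a production database (the schema is invented):

```python
# Sketch: run a schema migration inside a transaction and validate the
# result BEFORE committing; any exception triggers a rollback.
import sqlite3

def migrate(conn: sqlite3.Connection) -> None:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
        # Pre-commit validation: abort if the schema is not as expected.
        cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
        if "email" not in cols:
            raise RuntimeError("migration validation failed: email missing")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
migrate(conn)
```

Real databases differ in whether DDL is transactional, which is exactly the kind of edge case this failure-mode entry should record.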
Third-Party API Changes
- Failure Mode: External API changes behavior or schema without notice
- Effect: Integration failures, data processing errors, customer-facing issues
- Current Controls: Monitoring for error rates
- Detection: 5 (discovered after impact begins)
- Action: Contract testing, API version pinning, graceful degradation
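A consumer-side contract check with graceful degradation might look like this sketch (the payload shape and `parse_price` helper are invented for illustration):

```python
# Sketch: validate a third-party payload against our expected contract
# and degrade to a fallback instead of crashing on schema drift.
EXPECTED_FIELDS = {"symbol": str, "price": float}

def parse_price(payload: dict, fallback=None):
    for field, ftype in EXPECTED_FIELDS.items():
        if not isinstance(payload.get(field), ftype):
            # Schema drift detected: log and alert here, then degrade.
            return fallback
    return payload["price"]

print(parse_price({"symbol": "ACME", "price": 3.75}))    # 3.75
print(parse_price({"symbol": "ACME", "price": "3.75"}))  # None: type changed upstream
```

Running the same check in CI against a recorded or provider-supplied contract moves detection from "after impact begins" to "before deploy".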
Race Conditions Under Load
- Failure Mode: Concurrent operations cause data inconsistency
- Effect: Incorrect state, duplicate processing, financial discrepancies
- Current Controls: Unit tests (which don't test concurrency)
- Detection: 8 (surface unpredictably under specific load patterns)
- Action: Concurrency testing, idempotency enforcement, distributed tracing
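Idempotency enforcement is one of the cheaper controls here. A toy sketch using an in-memory key set guarded by a lock (a production system would back this with a database unique constraint instead):

```python
# Sketch: idempotency-key enforcement so concurrent or retried
# requests produce side effects exactly once.
import threading

class IdempotentProcessor:
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()
        self.processed = 0

    def process(self, idempotency_key: str) -> bool:
        with self._lock:
            if idempotency_key in self._seen:
                return False      # duplicate: skip the side effect
            self._seen.add(idempotency_key)
            self.processed += 1   # the side effect runs exactly once
        return True

p = IdempotentProcessor()
threads = [threading.Thread(target=p.process, args=("order-42",)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(p.processed)  # 1
```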
Memory Leaks in Long-Running Services
- Failure Mode: Gradual memory consumption leads to OOM termination
- Effect: Service unavailability, request failures, cascading effects
- Current Controls: Container memory limits trigger restart
- Detection: 4 (memory metrics are monitored)
- Action: Memory profiling in load tests, trend analysis alerting
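Trend analysis can catch a slow leak long before the OOM kill. A naive sketch that fits a least-squares slope to recent memory samples (the threshold and window are illustrative, not tuned values):

```python
# Sketch: flag a suspected leak when memory grows faster than
# slope_threshold MB per sample across the observation window.
def leak_suspected(samples_mb, slope_threshold=1.0) -> bool:
    n = len(samples_mb)
    if n < 2:
        return False
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb)) \
            / sum((x - mean_x) ** 2 for x in xs)
    return slope > slope_threshold

print(leak_suspected([100, 105, 111, 118, 124]))  # True: ~6 MB per sample
print(leak_suspected([100, 101, 100, 102, 101]))  # False: flat
```

Alerting on the slope rather than the absolute value is what drags the detection rating down from "restart after OOM" to "ticket filed days in advance".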
Integrating FMEA into Your Development Process
FMEA shouldn't be a separate ceremony—it should be woven into how you build software.
During Design Reviews
Before implementing significant features, spend 30 minutes on failure analysis:
- What external dependencies are we adding?
- What happens when they're unavailable?
- How will we know if this breaks?
During Deployment Planning
Each deployment is an opportunity for failure:
- What's the rollback plan?
- How will we detect problems?
- What's our blast radius?
Document these answers; you're building your FMEA.
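A failure-mode inventory doesn't need heavyweight tooling; a small structured record is enough to start. A sketch with an invented schema (field names are illustrative, not a standard format):

```python
# Sketch of a lightweight FMEA record for capturing deployment answers.
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    failure: str
    effect: str
    severity: int    # 1-10
    occurrence: int  # 1-10
    detection: int   # 1-10: higher means harder to detect
    controls: str    # rollback plan, alerts, tests

inventory = [
    FailureMode("deploy pipeline", "bad config flag activates wrong path",
                "incorrect behavior in production", 9, 4, 7,
                "canary deploy, automated rollback on error-rate spike"),
]
worst = max(inventory, key=lambda fm: fm.severity * fm.occurrence * fm.detection)
print(worst.component)  # deploy pipeline
```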
After Incidents
Every post-mortem should update your failure mode inventory. You've just learned something about your system—capture it. Adjust detection ratings. Add new failure modes. Update controls.
Before Major Releases
High-stakes releases deserve explicit FMEA sessions. Gather engineering, QA, and operations. Work through what can go wrong. You'll launch with confidence because you've already addressed the obvious failure modes.
Why This Matters for Your Career
The engineers who prevent disasters are more valuable than those who heroically respond to them. Building systematic failure analysis into your practice—whether you call it FMEA or not—demonstrates the kind of judgment that distinguishes senior engineers.
- When you ask "what happens if this fails?" before writing code, you're practicing FMEA
- When you insist on rollback mechanisms before deployment, you're implementing FMEA controls
- When you document system dependencies and single points of failure, you're conducting FMEA structure analysis
The methodology gives a name and structure to practices that effective engineers develop through painful experience. Learning it deliberately means fewer lessons learned the hard way.
The Aerospace Standard for Software
Commercial aircraft run on software analyzed to DO-178C standards. Medical devices follow IEC 62304. These aren't suggestions—they're requirements that have prevented countless failures.
Your software may not control aircraft or medical devices, but your users depend on it. Your business depends on it. The principles that keep airplanes in the sky apply to keeping your services running.
Start with your most critical path. What's the user journey that absolutely cannot fail? Analyze its failure modes. Implement controls. You'll be surprised how quickly this thinking becomes natural—and how many problems you catch before they reach production.
NirmIQ Team
The NirmIQ team shares insights on requirements management, FMEA, and safety-critical systems engineering.