
Why IT Infrastructure Needs FMEA: Lessons from the CrowdStrike Outage

NirmIQ Team · January 22, 2025

On July 19, 2024, a routine software update brought down 8.5 million Windows devices worldwide. Airlines grounded flights. Hospitals postponed surgeries. Banks froze transactions. The CrowdStrike incident is estimated to have cost Fortune 500 companies alone $5.4 billion in losses, and it started with a single configuration file that nobody thought to stress-test.

This wasn't a sophisticated cyberattack. It was a failure mode that went unexamined.

The Aviation Industry Solved This Problem Decades Ago

When you board a commercial aircraft, you're trusting your life to systems analyzed with a methodology called Failure Mode and Effects Analysis (FMEA). Developed by the U.S. military in the 1940s and adopted by NASA after early spacecraft failures, FMEA systematically identifies what can go wrong, assesses the consequences, and ensures controls exist before problems occur.

Every critical aircraft system undergoes this scrutiny. Engineers ask:

  • What happens if this sensor fails?
  • What if that redundancy doesn't activate?
  • How will the pilot know something is wrong?

The result is an industry where fatal accidents are measured in parts per billion passenger miles. Your IT infrastructure deserves the same discipline.

Understanding FMEA: From RPN to AIAG-VDA

Traditional FMEA uses the Risk Priority Number (RPN)—a simple multiplication of Severity, Occurrence likelihood, and Detection difficulty on scales of 1-10. A failure that's severe (9), likely (7), and hard to detect (8) yields an RPN of 504, flagging it for immediate attention.
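In code, the traditional calculation is a one-liner. Here's a minimal Python sketch; the range check is our addition, not part of any FMEA standard:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: each factor rated 1-10, higher is worse."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection

# The example from the text: severe (9), likely (7), hard to detect (8)
print(rpn(9, 7, 8))  # 504
```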

Modern FMEA has evolved. The AIAG-VDA methodology, developed jointly by the Automotive Industry Action Group (AIAG) and Germany's Verband der Automobilindustrie (VDA), introduces the Action Priority (AP) system and a structured seven-step process:

  1. Planning and Preparation – Define scope, team, and timeline
  2. Structure Analysis – Map system architecture and dependencies
  3. Function Analysis – Document what each component must do
  4. Failure Analysis – Identify how each function can fail
  5. Risk Analysis – Assess severity, occurrence, and detection
  6. Optimization – Implement controls for high-priority risks
  7. Results Documentation – Create audit-ready records
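To make steps 2 through 5 concrete, here's a sketch of what one worksheet row might look like in code. Note that the published AIAG-VDA standard defines Action Priority via a full lookup table over every S/O/D combination; the coarse rules below are an illustrative stand-in, not the official table:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    """One FMEA worksheet row, populated by steps 2-5 of the process."""
    component: str     # step 2: Structure Analysis
    function: str      # step 3: Function Analysis
    failure_mode: str  # step 4: Failure Analysis
    effect: str
    severity: int      # step 5: Risk Analysis, each factor rated 1-10
    occurrence: int
    detection: int

    def action_priority(self) -> str:
        """Illustrative stand-in for the official AIAG-VDA AP lookup table."""
        if self.severity >= 9 and (self.occurrence >= 4 or self.detection >= 7):
            return "High"
        if self.occurrence >= 6 and self.detection >= 6:
            return "High"
        if self.severity >= 6 and self.occurrence >= 4:
            return "Medium"
        return "Low"

# A hypothetical row for illustration:
row = FailureMode(
    component="Load balancer", function="Route traffic to healthy nodes",
    failure_mode="Health check passes on a dead node", effect="Traffic blackholed",
    severity=8, occurrence=4, detection=7,
)
print(row.action_priority())  # Medium
```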

How FMEA Would Have Caught the CrowdStrike Issue

Let's apply FMEA thinking to what happened:

  • Failure Mode: Malformed channel file causes kernel-level crash
  • Effect: Complete system unavailability
  • Cause: Insufficient validation of configuration updates before deployment

A proper Structure Analysis would have mapped the dependency chain: Falcon Sensor → Channel Files → Windows Kernel.

Function Analysis would have documented: "Channel files must be syntactically valid and semantically correct."

Failure Analysis would have asked: "What if a channel file contains malformed data?"

The risk assessment would have been alarming:

  • Severity: 10 (complete system failure)
  • Occurrence: 5 (medium; software bugs happen)
  • Detection: 10 (no automated validation existed; on the FMEA scale, a higher detection rating means a failure is harder to catch before it ships)

That combination demands immediate action: even at moderate occurrence, 10 × 5 × 10 yields an RPN of 500 out of a possible 1,000. The optimization step would have prescribed exactly what CrowdStrike later implemented: staged rollouts, automated validation, and rollback mechanisms.
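Those controls are easy to express in outline. The sketch below shows the shape of a staged rollout with validation and rollback gates; all four callables (validate, deploy, health_check, rollback) are placeholders for your own platform's tooling, not a real vendor API:

```python
import time

CANARY_RINGS = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet per stage

def release_config(update, fleet, validate, deploy, health_check, rollback):
    """Staged rollout with validation and rollback gates."""
    if not validate(update):              # automated pre-deployment validation
        raise ValueError("update failed validation; nothing was deployed")

    deployed = []
    for ring in CANARY_RINGS:
        batch = fleet[:int(len(fleet) * ring)]
        targets = [host for host in batch if host not in deployed]
        deploy(update, targets)
        deployed.extend(targets)
        time.sleep(300)                   # soak period before widening the ring
        if not all(health_check(host) for host in deployed):
            rollback(update, deployed)    # automated rollback gate
            raise RuntimeError(f"health check failed at {ring:.0%}; rolled back")
```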

Real-World IT Failure Modes You Should Analyze

Configuration Drift

  • Failure Mode: Production configuration diverges from documented baseline
  • Effect: Unexpected behavior during recovery or scaling
  • Current Controls: Manual audits (quarterly)
  • Detection Rating: 7 (often discovered only during incidents)
  • Action: Implement continuous configuration validation
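What does continuous validation look like in practice? At its simplest, it's comparing live files against hashes captured at the last approved change. A minimal sketch (the path and digest below are hypothetical placeholders):

```python
import hashlib
import pathlib

# Expected SHA-256 digests, captured when each config was last approved.
BASELINE = {
    "/etc/nginx/nginx.conf": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def drifted(baseline: dict[str, str]) -> list[str]:
    """Return the paths whose live contents no longer match the baseline."""
    changed = []
    for path, expected in baseline.items():
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if digest != expected:
            changed.append(path)
    return changed
```

Run it from cron or a CI job and alert on a non-empty result; that alone moves a detection rating of 7 toward 2.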

Single Points of Failure in "Redundant" Systems

  • Failure Mode: Both redundant nodes depend on shared component (DNS, authentication, storage)
  • Effect: Complete service outage despite architectural redundancy
  • Current Controls: Architecture reviews (annual)
  • Detection Rating: 8 (hidden dependencies surface during cascading failures)
  • Action: Automated dependency mapping and chaos engineering
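Finding those shared components is fundamentally a graph problem: take the transitive dependencies of each "redundant" node and intersect them. A sketch with a hypothetical topology:

```python
def shared_dependencies(graph: dict[str, set[str]], replicas: list[str]) -> set[str]:
    """Transitive dependencies common to every replica. Anything returned
    is a potential single point of failure hiding behind the redundancy."""
    def closure(node: str) -> set[str]:
        seen: set[str] = set()
        stack = [node]
        while stack:
            for dep in graph.get(stack.pop(), set()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen
    return set.intersection(*(closure(r) for r in replicas))

# Hypothetical topology: two "redundant" app nodes that quietly share
# one DNS resolver and one storage array.
deps = {
    "app-a": {"db-a", "dns-1"}, "app-b": {"db-b", "dns-1"},
    "db-a": {"san-1"}, "db-b": {"san-1"},
}
print(shared_dependencies(deps, ["app-a", "app-b"]))  # {'dns-1', 'san-1'}
```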

Certificate Expiration

  • Failure Mode: TLS certificate expires without renewal
  • Effect: Service unavailability, security warnings, automated system failures
  • Current Controls: Calendar reminders
  • Detection Rating: 5 (often detected only when expiration is imminent or has already passed)
  • Action: Automated certificate lifecycle management with 90-day advance alerting
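Checking expiry takes a few lines with Python's standard library; the point is to feed the result into your alerting pipeline rather than a calendar:

```python
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 443) -> float:
    """Days until the TLS certificate presented by host:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

# The 90-day advance-alerting threshold from above:
if days_until_expiry("example.com") < 90:
    print("renewal window open: act now")
```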

Backup Restoration Failure

  • Failure Mode: Backups complete successfully but restoration fails
  • Effect: Data loss during disaster recovery
  • Current Controls: Occasional test restores
  • Detection Rating: 9 (discovered only when restoration is urgently needed)
  • Action: Automated restoration testing with synthetic data validation
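If your backups are, say, PostgreSQL dumps, a scheduled job along these lines closes the gap. The `canary` table and sentinel token are hypothetical conventions you would establish yourself, seeded into production so every backup is expected to contain them:

```python
import subprocess

def test_restore(backup_path: str, scratch_dsn: str) -> bool:
    """Restore a backup into a throwaway database, then verify a
    synthetic sentinel row that every backup should contain."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--dbname", scratch_dsn, backup_path],
        check=True,
    )
    probe = subprocess.run(
        ["psql", scratch_dsn, "-tAc",
         "SELECT count(*) FROM canary WHERE token = 'synthetic-sentinel'"],
        capture_output=True, text=True, check=True,
    )
    return probe.stdout.strip() == "1"
```

A restore that succeeds but returns the wrong count fails the test, which is exactly the failure mode that completion-status monitoring misses.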

Building FMEA into IT Operations

The organizations leading in reliability don't treat FMEA as a one-time exercise. They embed it into their operational rhythm.

During Architecture Review

Before deploying new systems, require FMEA documentation:

  • What happens when this service is unavailable?
  • What if that database becomes corrupted?
  • What's our blast radius?

After Every Incident

Each post-mortem should update your FMEA. The failure mode you just experienced? That detection rating just improved because you now have monitoring. But what adjacent failure modes were exposed?

Quarterly Risk Reviews

Reassess your highest-priority items:

  • Have occurrence rates changed?
  • Are your detection mechanisms still functioning?
  • Have new dependencies emerged?

The Competitive Advantage of Systematic Reliability

While competitors scramble after outages, organizations with mature FMEA practices operate differently. They know their risks. They've documented their controls. When incidents occur—and they will—recovery is faster because failure modes and responses are already mapped.

Amazon, Google, and Microsoft didn't achieve their reliability through luck. Behind the scenes, they practice rigorous failure analysis, even if they don't call it FMEA. Chaos engineering, game days, and architecture reviews are all manifestations of the same principle: systematically identify what can fail before it fails.

Your organization can adopt these practices without their scale. The methodology is straightforward. The tools exist. The only question is whether you'll invest in prevention or continue paying for reaction.

Getting Started

Begin with your most critical service. Gather the people who understand its architecture—infrastructure engineers, developers, operations staff. Spend two hours mapping failure modes. You'll leave with a prioritized list of risks and actions.

That's your FMEA. Refine it over time. Apply it to the next service. Within a year, you'll have transformed how your organization thinks about reliability.

The airplanes flying overhead right now depend on this methodology. Your infrastructure can too.


NirmIQ Team

The NirmIQ team shares insights on requirements management, FMEA, and safety-critical systems engineering.


Ready to improve your systems engineering?

See how NirmIQ connects requirements to FMEA analysis.