
Why IT Infrastructure Needs FMEA: Lessons from the CrowdStrike Outage

NirmIQ Team · January 22, 2025

On July 19, 2024, a routine software update brought down 8.5 million Windows devices worldwide. Airlines grounded flights. Hospitals postponed surgeries. Banks froze transactions. The CrowdStrike incident is estimated to have cost Fortune 500 companies alone $5.4 billion in losses, and it started with a single configuration file that nobody thought to stress-test.

This wasn't a sophisticated cyberattack. It was a failure mode that went unexamined.

The Aviation Industry Solved This Problem Decades Ago

When you board a commercial aircraft, you're trusting your life to systems analyzed with a methodology called Failure Mode and Effects Analysis (FMEA). Developed by the U.S. military in the 1940s and adopted by NASA after early spacecraft failures, FMEA systematically identifies what can go wrong, assesses the consequences, and ensures controls exist before problems occur.

Every critical aircraft system undergoes this scrutiny. Engineers ask:

  • What happens if this sensor fails?
  • What if that redundancy doesn't activate?
  • How will the pilot know something is wrong?

The result is an industry where fatal accidents are measured in parts per billion passenger miles. Your IT infrastructure deserves the same discipline.

Understanding FMEA: From RPN to AIAG-VDA

Traditional FMEA uses the Risk Priority Number (RPN)—a simple multiplication of Severity, Occurrence likelihood, and Detection difficulty on scales of 1-10. A failure that's severe (9), likely (7), and hard to detect (8) yields an RPN of 504, flagging it for immediate attention.
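In code, the traditional calculation is a one-liner. Here's a minimal Python sketch; the range check is our addition, not part of any FMEA standard:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: each factor rated 1-10, higher is worse."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection

# The example from the text: severe (9), likely (7), hard to detect (8)
print(rpn(9, 7, 8))  # 504
```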

Modern FMEA has evolved. The AIAG-VDA methodology, developed jointly by the Automotive Industry Action Group (AIAG) and Germany's Verband der Automobilindustrie (VDA), introduces the Action Priority (AP) system and a structured seven-step process:

  1. Planning and Preparation – Define scope, team, and timeline
  2. Structure Analysis – Map system architecture and dependencies
  3. Function Analysis – Document what each component must do
  4. Failure Analysis – Identify how each function can fail
  5. Risk Analysis – Assess severity, occurrence, and detection
  6. Optimization – Implement controls for high-priority risks
  7. Results Documentation – Create audit-ready records
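To make steps 2 through 5 concrete, here's a sketch of what one worksheet row might look like in code. Note that the published AIAG-VDA standard defines Action Priority via a full lookup table over every S/O/D combination; the coarse rules below are an illustrative stand-in, not the official table:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    """One FMEA worksheet row, populated by steps 2-5 of the process."""
    component: str     # step 2: Structure Analysis
    function: str      # step 3: Function Analysis
    failure_mode: str  # step 4: Failure Analysis
    effect: str
    severity: int      # step 5: Risk Analysis, each factor rated 1-10
    occurrence: int
    detection: int

    def action_priority(self) -> str:
        """Illustrative stand-in for the official AIAG-VDA AP lookup table."""
        if self.severity >= 9 and (self.occurrence >= 4 or self.detection >= 7):
            return "High"
        if self.occurrence >= 6 and self.detection >= 6:
            return "High"
        if self.severity >= 6 and self.occurrence >= 4:
            return "Medium"
        return "Low"

# A hypothetical row for illustration:
row = FailureMode(
    component="Load balancer", function="Route traffic to healthy nodes",
    failure_mode="Health check passes on a dead node", effect="Traffic blackholed",
    severity=8, occurrence=4, detection=7,
)
print(row.action_priority())  # Medium
```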

How FMEA Would Have Caught the CrowdStrike Issue

Let's apply FMEA thinking to what happened:

  • Failure Mode: Malformed channel file causes kernel-level crash
  • Effect: Complete system unavailability
  • Cause: Insufficient validation of configuration updates before deployment

A proper Structure Analysis would have mapped the dependency chain: Falcon Sensor → Channel Files → Windows Kernel.

Function Analysis would have documented: "Channel files must be syntactically valid and semantically correct."

Failure Analysis would have asked: "What if a channel file contains malformed data?"

The risk assessment would have been alarming:

  • Severity: 10 (complete system failure)
  • Occurrence: 5 (medium; software bugs happen)
  • Detection: 10 (no automated validation existed; on the FMEA scale, a higher detection rating means a failure is harder to catch before it ships)

That combination demands immediate action: even at moderate occurrence, 10 × 5 × 10 yields an RPN of 500 out of a possible 1,000. The optimization step would have prescribed exactly what CrowdStrike later implemented: staged rollouts, automated validation, and rollback mechanisms.
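Those controls are easy to express in outline. The sketch below shows the shape of a staged rollout with validation and rollback gates; all four callables (validate, deploy, health_check, rollback) are placeholders for your own platform's tooling, not a real vendor API:

```python
import time

CANARY_RINGS = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet per stage

def release_config(update, fleet, validate, deploy, health_check, rollback):
    """Staged rollout with validation and rollback gates."""
    if not validate(update):              # automated pre-deployment validation
        raise ValueError("update failed validation; nothing was deployed")

    deployed = []
    for ring in CANARY_RINGS:
        batch = fleet[:int(len(fleet) * ring)]
        targets = [host for host in batch if host not in deployed]
        deploy(update, targets)
        deployed.extend(targets)
        time.sleep(300)                   # soak period before widening the ring
        if not all(health_check(host) for host in deployed):
            rollback(update, deployed)    # automated rollback gate
            raise RuntimeError(f"health check failed at {ring:.0%}; rolled back")
```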

Real-World IT Failure Modes You Should Analyze

Configuration Drift

  • Failure Mode: Production configuration diverges from documented baseline
  • Effect: Unexpected behavior during recovery or scaling
  • Current Controls: Manual audits (quarterly)
  • Detection Rating: 7 (often discovered only during incidents)
  • Action: Implement continuous configuration validation
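What does continuous validation look like in practice? At its simplest, it's comparing live files against hashes captured at the last approved change. A minimal sketch (the path and digest below are hypothetical placeholders):

```python
import hashlib
import pathlib

# Expected SHA-256 digests, captured when each config was last approved.
BASELINE = {
    "/etc/nginx/nginx.conf": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def drifted(baseline: dict[str, str]) -> list[str]:
    """Return the paths whose live contents no longer match the baseline."""
    changed = []
    for path, expected in baseline.items():
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if digest != expected:
            changed.append(path)
    return changed
```

Run it from cron or a CI job and alert on a non-empty result; that alone moves a detection rating of 7 toward 2.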

Single Points of Failure in "Redundant" Systems

  • Failure Mode: Both redundant nodes depend on shared component (DNS, authentication, storage)
  • Effect: Complete service outage despite architectural redundancy
  • Current Controls: Architecture reviews (annual)
  • Detection Rating: 8 (hidden dependencies surface during cascading failures)
  • Action: Automated dependency mapping and chaos engineering
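Finding those shared components is fundamentally a graph problem: take the transitive dependencies of each "redundant" node and intersect them. A sketch with a hypothetical topology:

```python
def shared_dependencies(graph: dict[str, set[str]], replicas: list[str]) -> set[str]:
    """Transitive dependencies common to every replica. Anything returned
    is a potential single point of failure hiding behind the redundancy."""
    def closure(node: str) -> set[str]:
        seen: set[str] = set()
        stack = [node]
        while stack:
            for dep in graph.get(stack.pop(), set()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen
    return set.intersection(*(closure(r) for r in replicas))

# Hypothetical topology: two "redundant" app nodes that quietly share
# one DNS resolver and one storage array.
deps = {
    "app-a": {"db-a", "dns-1"}, "app-b": {"db-b", "dns-1"},
    "db-a": {"san-1"}, "db-b": {"san-1"},
}
print(shared_dependencies(deps, ["app-a", "app-b"]))  # {'dns-1', 'san-1'}
```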

Certificate Expiration

  • Failure Mode: TLS certificate expires without renewal
  • Effect: Service unavailability, security warnings, automated system failures
  • Current Controls: Calendar reminders
  • Detection Rating: 5 (often detected only when expiration is imminent or has already passed)
  • Action: Automated certificate lifecycle management with 90-day advance alerting
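Checking expiry takes a few lines with Python's standard library; the point is to feed the result into your alerting pipeline rather than a calendar:

```python
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 443) -> float:
    """Days until the TLS certificate presented by host:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

# The 90-day advance-alerting threshold from above:
if days_until_expiry("example.com") < 90:
    print("renewal window open: act now")
```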

Backup Restoration Failure

  • Failure Mode: Backups complete successfully but restoration fails
  • Effect: Data loss during disaster recovery
  • Current Controls: Occasional test restores
  • Detection Rating: 9 (discovered only when restoration is urgently needed)
  • Action: Automated restoration testing with synthetic data validation
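If your backups are, say, PostgreSQL dumps, a scheduled job along these lines closes the gap. The `canary` table and sentinel token are hypothetical conventions you would establish yourself, seeded into production so every backup is expected to contain them:

```python
import subprocess

def test_restore(backup_path: str, scratch_dsn: str) -> bool:
    """Restore a backup into a throwaway database, then verify a
    synthetic sentinel row that every backup should contain."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--dbname", scratch_dsn, backup_path],
        check=True,
    )
    probe = subprocess.run(
        ["psql", scratch_dsn, "-tAc",
         "SELECT count(*) FROM canary WHERE token = 'synthetic-sentinel'"],
        capture_output=True, text=True, check=True,
    )
    return probe.stdout.strip() == "1"
```

A restore that succeeds but returns the wrong count fails the test, which is exactly the failure mode that completion-status monitoring misses.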

Building FMEA into IT Operations

The organizations leading in reliability don't treat FMEA as a one-time exercise. They embed it into their operational rhythm.

During Architecture Review

Before deploying new systems, require FMEA documentation:

  • What happens when this service is unavailable?
  • What if that database becomes corrupted?
  • What's our blast radius?

After Every Incident

Each post-mortem should update your FMEA. The failure mode you just experienced? That detection rating just improved because you now have monitoring. But what adjacent failure modes were exposed?

Quarterly Risk Reviews

Reassess your highest-priority items:

  • Have occurrence rates changed?
  • Are your detection mechanisms still functioning?
  • Have new dependencies emerged?

The Competitive Advantage of Systematic Reliability

While competitors scramble after outages, organizations with mature FMEA practices operate differently. They know their risks. They've documented their controls. When incidents occur—and they will—recovery is faster because failure modes and responses are already mapped.

Amazon, Google, and Microsoft didn't achieve their reliability through luck. Behind the scenes, they practice rigorous failure analysis, even if they don't call it FMEA. Chaos engineering, game days, and architecture reviews are all manifestations of the same principle: systematically identify what can fail before it fails.

Your organization can adopt these practices without their scale. The methodology is straightforward. The tools exist. The only question is whether you'll invest in prevention or continue paying for reaction.

Getting Started

Begin with your most critical service. Gather the people who understand its architecture—infrastructure engineers, developers, operations staff. Spend two hours mapping failure modes. You'll leave with a prioritized list of risks and actions.

That's your FMEA. Refine it over time. Apply it to the next service. Within a year, you'll have transformed how your organization thinks about reliability.

The airplanes flying overhead right now depend on this methodology. Your infrastructure can too.


NirmIQ Team

The NirmIQ team shares insights on requirements management, FMEA, and safety-critical systems engineering.


Ready to improve your systems engineering?

See how NirmIQ connects requirements to FMEA analysis.