Cyber Resilience Lessons from the CrowdStrike Outage

Imagine arriving at London’s Heathrow Airport only to find your flight merged with two others and baggage scattered everywhere. This was the real-life impact of the July 2024 Microsoft-CrowdStrike outage, experienced firsthand by many travelers, including myself. The global implications of this outage extend beyond delayed flights and lost baggage, underscoring the vulnerabilities in our digital ecosystem and the need for thorough preparedness.

Understanding the Outage Event

The Microsoft-CrowdStrike outage occurred on July 19, 2024, and had immediate and widespread impacts on businesses globally. The outage was reportedly sparked by a botched CrowdStrike software update that took millions of Microsoft Windows systems around the world offline.

Operations were halted, data access was disrupted, and communication breakdowns occurred. Hospitals faced delays in patient care due to inaccessible medical records. Grocery stores experienced supply chain disruptions that affected inventory management. Airlines dealt with flight cancellations and rescheduling problems. Financial institutions experienced transaction delays and security concerns. Retail businesses struggled with point-of-sale system failures and customer service interruptions. According to Microsoft’s blog, the outage impacted approximately 8.5 million Windows devices, demonstrating the extensive reach and severity of the event.

The incident was a wake-up call for businesses about the importance of robust cybersecurity measures and reinforced the need for testing and contingency planning in patch management processes. It also shed light on the vulnerability of interconnected systems and the role that major service providers play in the global economy, where a single failure can cascade into substantial economic losses and operational setbacks. According to GovInfoSecurity, the event could cost cyber insurers $1.5 billion, and DarkReading reported that overall monetary losses to businesses could reach $5.4 billion.

Differentiating Cyber Risks: CBI vs. CSF

Understanding the terminologies is crucial in grasping the full scope of such incidents and possible insurance coverage. Two key terms are Contingent Business Interruption (CBI) and Contingent System Failure (CSF).

Contingent business interruption (CBI): In the cyber world, this occurs when a business relies on a third-party vendor that suffers a cyber event, causing a disruption in the reliant business’s operations. An example is a car dealership that cannot process sales because its revenue management system, provided by a third-party vendor, is compromised.
Contingent system failure (CSF): This refers to a non-malicious system failure caused by an update or patch from a third-party vendor. The Microsoft-CrowdStrike incident falls under this category: a faulty software update, rather than an attack, led to a system failure that affected numerous businesses.

CBI and CSF impact businesses differently, so from an insurance standpoint, they are not always covered the same way. CBI coverage focuses on losses that follow a cyber event at a supplier, while CSF coverage addresses non-malicious failures in third-party systems the business depends on. Payouts can differ based on the cause of the interruption and the specific terms of each policy.

Businesses should have separate plans for each scenario. A CBI plan might focus on alternative suppliers and maintaining supply chain resilience. A CSF plan could emphasize cybersecurity measures and backup systems to keep operations running smoothly.

By understanding these differences, business leaders can ensure comprehensive risk management and better prepare for potential disruptions.

Lessons from the Microsoft-CrowdStrike Outage

The Microsoft-CrowdStrike outage revealed key lessons in vendor management and business continuity. It called attention to the need for rigorous vendor management and quality assurance to ensure the robustness and security of vendors’ systems. Further, the incident underscored the significance of business continuity and incident response plans to quickly address and mitigate disruptions.

Vendor management and quality assurance

Quality assurance, in this context, refers to the systematic process of evaluating and ensuring that vendors’ products and services meet specific standards of quality, particularly in their update and patch management protocols. Vendors, including major technology providers like Microsoft, must be rigorously assessed to confirm their systems are secure, reliable, and functioning as intended. Essential questions to ask vendors include:

How do you handle updates and patches?
What is your protocol for quality assurance?
How frequently do you review and test your systems?

Maintain an ongoing dialogue with vendors about their security measures and incident response capabilities. Establishing clear expectations and performance benchmarks can help ensure that vendors adhere to the highest standards of cybersecurity.

Business continuity and incident response plans

Robust business continuity and incident response plans, with alignment and buy-in from all areas of the organization, help ensure that decisions and processes are already in place should an incident occur. Having these plans ready ahead of time allows you to act more quickly and minimize the impact. Key elements include:

In-depth risk assessment: Identify and evaluate potential risks to the organization. Tailor the plan accordingly.
Clear roles and responsibilities: Define specific roles and responsibilities to ensure everyone knows their duties during a crisis.
Effective communication strategy: Establish a communication plan to keep all stakeholders informed during an incident.
Regular testing and updates: Conduct regular drills and simulations to test the plan and make necessary updates based on the results.
Recovery procedures: Develop detailed procedures for business recovery and continuity after an incident.

Such plans allow businesses to respond to disruptions quickly and efficiently, minimizing downtime and associated costs.

Proactive Cyber Risk Management Strategies

In the wake of significant cyber incidents like the Microsoft-CrowdStrike outage, businesses are increasingly recognizing the importance of proactive risk management strategies. Such incidents underscore the need for holistic approaches to identify vulnerabilities and prepare for potential disruptions. Implementing effective risk management practices can help businesses minimize the impact of outages and ensure continuity of operations. Additionally, having appropriate insurance coverage, such as cyber liability insurance, can provide crucial support for businesses affected by cyber incidents.

Conducting tabletop exercises

Tabletop exercises are an effective way to prepare for cyber incidents. These simulated scenarios help teams practice their response to various threats, identify weaknesses, and improve coordination. By conducting regular tabletop exercises, businesses can mitigate the impact of outages and cyber incidents. These exercises also foster a culture of preparedness, so that employees are confident and capable of handling real-world crises. For instance, during the Microsoft-CrowdStrike outage, organizations with well-rehearsed response plans were better positioned to manage the disruption and maintain operational stability. Insurance coverage that includes incident response costs can be particularly beneficial in such scenarios.

Balancing security measures with human error

The Microsoft-CrowdStrike outage emphasized the importance of addressing human factors in cybersecurity. To minimize human error, organizations should consider providing regular training and updates on best practices, implementing robust oversight and review processes, and ensuring clear communication and thorough documentation of procedures.

Vendor management and continuous oversight are also critical in minimizing risks associated with human error. By establishing strict protocols and fostering a culture of accountability, businesses can significantly reduce the likelihood of mistakes that could lead to system failures. Coverage options such as business interruption insurance and cyber liability insurance that includes human error can help businesses recover swiftly from incidents like the Microsoft-CrowdStrike outage and sustain their resilience and security posture.

Navigating Cyber Insurance

Understanding your cyber policy helps you identify the protection it offers against different types of cyber risks and interruptions, such as issues from third-party vendors or direct system failures. By knowing the coverage details, exclusions, and limits of your policy, you can better prepare for potential disruptions and mitigate financial losses.

Understanding your cyber policy

A thorough understanding of your cyber policy can significantly impact your organization’s resilience. In the Microsoft-CrowdStrike example, a business whose cyber insurance policy covered CBI but not CSF would likely find its losses outside the policy’s coverage parameters. Each organization can look at its unique systems and processes to determine the best cyber policy for its needs. Key aspects to consider include:

Business interruption: Coverage for lost income due to a cyber event.
Contingent business interruption: Coverage for losses when a third-party provider experiences an outage.
System failure: Coverage for direct system failures within your organization.
Contingent system failure: Coverage for failures in external systems your business relies on.
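One way to read the four categories above is as a two-by-two grid: whose system failed (your own or a third party’s) and whether the cause was malicious. The sketch below encodes that reading in Python. It is a simplification for illustration only; the names Coverage, applicable_coverage, and is_covered are hypothetical, and actual policy wording, not this mapping, determines what an insurer pays.

```python
from enum import Enum


class Coverage(Enum):
    BUSINESS_INTERRUPTION = "business interruption"
    CONTINGENT_BUSINESS_INTERRUPTION = "contingent business interruption"
    SYSTEM_FAILURE = "system failure"
    CONTINGENT_SYSTEM_FAILURE = "contingent system failure"


def applicable_coverage(third_party: bool, malicious: bool) -> Coverage:
    """Map an incident to the coverage category it most plausibly falls under.

    Assumption: malicious events map to the (contingent) business
    interruption categories and non-malicious failures map to the
    (contingent) system failure categories. Real policies may draw
    these lines differently.
    """
    if third_party:
        if malicious:
            return Coverage.CONTINGENT_BUSINESS_INTERRUPTION
        return Coverage.CONTINGENT_SYSTEM_FAILURE
    if malicious:
        return Coverage.BUSINESS_INTERRUPTION
    return Coverage.SYSTEM_FAILURE


def is_covered(policy: set, third_party: bool, malicious: bool) -> bool:
    """Return True if the policy lists the category this incident maps to."""
    return applicable_coverage(third_party, malicious) in policy


# A hypothetical policy that includes CBI but not CSF.
policy = {
    Coverage.BUSINESS_INTERRUPTION,
    Coverage.CONTINGENT_BUSINESS_INTERRUPTION,
}

# A non-malicious failure at a third-party vendor, as in the
# Microsoft-CrowdStrike incident, maps to contingent system failure.
print(applicable_coverage(third_party=True, malicious=False).value)
print(is_covered(policy, third_party=True, malicious=False))
```

Under these assumptions, the hypothetical policy would not respond to a non-malicious vendor failure, which is the gap a broker review is meant to surface.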

Businesses should work closely with their brokers to tailor policies that address their specific risks. This might involve conducting a thorough risk assessment to identify potential vulnerabilities and ensuring that the policy provides adequate coverage for all identified risks.

Sublimiting in cyber policies

Sublimiting refers to the practice of setting lower limits for specific types of losses within a policy. This can have significant implications for businesses. Knowing your policy limits and exclusions helps you confirm that coverage is adequate before an incident occurs.

For example, if your policy has a sublimit on contingent business interruption, the payout may be insufficient to cover all your losses from a third-party outage. Engaging with your broker to understand these limits and negotiating higher sublimits where necessary can provide better protection against extensive losses.
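The effect of a sublimit is easiest to see with numbers. The sketch below is a rough illustration in Python, assuming a simplified policy in which the payout is the loss minus a retention (deductible), capped by both the sublimit and the overall policy limit. The function names and all dollar figures are hypothetical, and real policies add terms such as waiting periods and coinsurance that are ignored here.

```python
def estimated_payout(loss: int, retention: int, sublimit: int, policy_limit: int) -> int:
    """Estimate the payout for one loss under a simplified sublimit model.

    Assumption: payout = min(loss above retention, sublimit, policy limit).
    """
    if min(loss, retention, sublimit, policy_limit) < 0:
        raise ValueError("all amounts must be non-negative")
    loss_above_retention = max(loss - retention, 0)
    return min(loss_above_retention, sublimit, policy_limit)


def uncovered_loss(loss: int, retention: int, sublimit: int, policy_limit: int) -> int:
    """Portion of the loss the business bears itself, retention included."""
    return loss - estimated_payout(loss, retention, sublimit, policy_limit)


# Hypothetical figures: a $2,000,000 loss from a third-party outage,
# a $100,000 retention, a $500,000 contingent business interruption
# sublimit, and a $5,000,000 overall policy limit.
payout = estimated_payout(2_000_000, 100_000, 500_000, 5_000_000)
gap = uncovered_loss(2_000_000, 100_000, 500_000, 5_000_000)
print(payout)  # 500000
print(gap)     # 1500000
```

In this hypothetical, the overall limit looks generous, but the sublimit caps recovery at a quarter of the loss. Running the same arithmetic on your own figures is one way to frame a conversation with your broker about higher sublimits.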

Future Outlook and Implications

The Microsoft and CrowdStrike outage showed us that cybersecurity isn’t solely the responsibility of third-party providers. Moving forward, this incident should prompt a reassessment of cybersecurity strategies, increased investment in robust defenses, and a review of insurance policies. The potential for far-reaching consequences, from global economic impacts to the devastation of smaller businesses, emphasizes the necessity of these forward-looking measures. As cyber threats continue to evolve, stay ahead by anticipating and preparing for future challenges.