
When the Next CrowdStrike Happens: Thoughts on a Swift Recovery

Bill Church

July 25, 2024

Critical Strategies for Swift Recovery

Fully Automated Recovery for Remote Machines

Remote machines require a fully automated recovery process to minimize downtime and manual intervention. In my early IT days, I spent a week manually visiting hundreds of machines to prepare them for Y2K, an experience that led me to focus on automation and scripting.

Deploying scripts that automatically detect and remediate issues can significantly reduce recovery time. Centralized tools like Microsoft Configuration Manager (formerly Microsoft Endpoint Configuration Manager) can help manage and deploy recovery tasks across remote machines efficiently. For macOS, tools like Jamf Pro and Apple Business Manager offer similar capabilities, while Red Hat Satellite and Canonical Landscape provide robust management and automation for Linux systems.
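As a rough illustration of the detect-and-remediate idea, the sketch below looks for files matching a known-bad pattern and moves them to a quarantine folder instead of deleting them. The function name, directory layout, and file pattern are all hypothetical; a real script would be pushed through whichever management tool you already use and would need logging, privilege handling, and a reboot step.

```python
import shutil
from pathlib import Path


def quarantine_matching_files(target_dir: Path, pattern: str, quarantine_dir: Path) -> list[Path]:
    """Move files in target_dir that match pattern into quarantine_dir.

    Returns the new locations of the files that were moved.
    All names here are illustrative, not taken from any vendor tool.
    """
    moved: list[Path] = []
    if not target_dir.is_dir():
        return moved  # nothing to detect on this machine

    quarantine_dir.mkdir(parents=True, exist_ok=True)
    # sorted() materializes the matches first, so moving files is safe.
    for path in sorted(target_dir.glob(pattern)):
        if path.is_file():
            destination = quarantine_dir / path.name
            shutil.move(str(path), str(destination))
            moved.append(destination)
    return moved
```

Quarantining rather than deleting keeps the evidence for the post-incident review and makes the remediation reversible if the pattern turns out to be too broad.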

PXE Booting: Modern Solutions for Legacy Challenges

PXE (Preboot Execution Environment) lets systems boot from a network server instead of local storage, which simplifies the deployment of system images and updates. In an outage, systems can automatically boot into a recovery environment and download a fresh, uncorrupted operating system image.
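One way to back up the word "uncorrupted" is to check the downloaded image against a known-good digest before it is written to disk. The following is a minimal sketch, assuming the expected SHA-256 value is published through a trusted channel; the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def image_matches_digest(image_path: Path, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA-256 digest equals expected_sha256.

    The image is read in chunks so large files do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with image_path.open("rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    # Tolerate surrounding whitespace and uppercase hex in the published value.
    return digest.hexdigest() == expected_sha256.strip().lower()
```

A checksum only detects corruption or tampering relative to the published value, so the digest itself has to come from a source the recovery environment can trust.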

PXE booting also introduces security risks, chiefly unauthorized access and interception of network traffic. Essential mitigations include using secure boot mechanisms such as UEFI Secure Boot, encrypting boot traffic (classic PXE transfers images over unencrypted TFTP, so this typically means moving to an option such as UEFI HTTP Boot over HTTPS), and ensuring only authorized devices can connect to the PXE server.
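To make "only authorized devices" concrete, here is a small sketch of an allowlist check keyed on MAC address. The function names are hypothetical, and because MAC addresses can be spoofed, a check like this should be treated as one layer alongside network segmentation and signed boot images, not as a complete control.

```python
def normalize_mac(mac: str) -> str:
    """Return a MAC address as 12 lowercase hex digits with separators removed."""
    digits = mac.replace(":", "").replace("-", "").replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return digits


def is_authorized(mac: str, allowlist: set[str]) -> bool:
    """True only if the MAC parses cleanly and its normalized form is in the allowlist."""
    try:
        return normalize_mac(mac) in allowlist
    except ValueError:
        return False  # malformed input is never authorized
```

Normalizing first means the same device is recognized whether its address arrives as `00:1A:2B:3C:4D:5E`, `00-1a-2b-3c-4d-5e`, or `001a.2b3c.4d5e`.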

Easy Recovery Procedures for Business-Critical Devices

A user-friendly recovery procedure is essential for business-critical notebooks and workstations. This can be facilitated through network-based recovery partitions, secure UEFI boot, VPN-based recovery, and cloud-based recovery options. These methods ensure that recovery images and instructions are always accessible and up to date while maintaining security.

Handling Encrypted Storage: BitLocker Challenges

Encrypted storage, while necessary for security, can complicate recovery processes. BitLocker adds an extra layer of protection but requires careful handling during recovery. Strategies to manage systems using BitLocker include:

Robust recovery key management
Integration of BitLocker recovery steps into automated recovery scripts
Comprehensive training and documentation for IT staff

Accessing recovery keys stored in secure locations becomes crucial during a global outage. Post-outage recovery involves restoring BitLocker management servers from backups and ensuring synchronization with stored recovery keys. Implementing redundant servers and regular disaster recovery tests can help organizations manage and recover BitLocker-encrypted systems while maintaining data security and availability.
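The value of redundant key stores is easiest to see in code. The sketch below tries each store in order and returns the first key it finds, so an unreachable primary server does not block recovery. The store names and lookup callables are hypothetical stand-ins for whatever directory, management server, or offline escrow holds your keys.

```python
from typing import Callable, Optional

# A key store is modeled as any callable that maps a device ID to a key or None.
KeyStore = Callable[[str], Optional[str]]


def find_recovery_key(device_id: str, stores: list[tuple[str, KeyStore]]) -> tuple[str, str]:
    """Try each (name, lookup) store in order; return (store_name, key) from the first hit.

    Raises LookupError listing what went wrong with every store if none has the key.
    """
    errors: list[str] = []
    for name, lookup in stores:
        try:
            key = lookup(device_id)
        except Exception as exc:  # an unreachable store should not stop the search
            errors.append(f"{name}: {exc}")
            continue
        if key:
            return name, key
        errors.append(f"{name}: no key for {device_id}")
    raise LookupError("; ".join(errors) or "no key stores configured")
```

Returning the store name alongside the key gives you an audit trail of which source was used, which is useful when reconciling stores after the outage.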

Stateless Systems: Reducing Complexity and Increasing Resilience

Adopting stateless systems for signage and kiosks can simplify recovery processes. Operating systems like Flatcar, Container OS, or ChromeOS can be configured to download and run a fresh OS image at each boot, ensuring a consistent and clean state. Their minimal attack surface also reduces the need for complex security solutions.

Learning from the Incident: Turning Lemons into Lemonade

The CrowdStrike outage provided a real-world test of Business Continuity Planning (BCP). Key learnings include:

Conduct a thorough post-incident review to identify where processes, automation, and plans failed
Enhance training and documentation to prepare staff for actual incidents
Invest in automation tools and enhanced monitoring for proactive management
Continuously test and update your BCP to address new threats and changes in your IT environment
Implement canary testing strategies to prevent enterprise-wide issues
Mitigate opportunistic attacks during chaotic periods by reinforcing strict authentication and authorization practices
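Of the items above, canary testing is the one that lends itself most directly to a sketch. The example below rolls an update out ring by ring and stops before the next ring if any host in the current ring fails its health check. The `deploy` and `is_healthy` callables and the ring names are hypothetical placeholders for your own tooling.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Deploy ring by ring, halting after the first ring with unhealthy hosts.

    Returns (hosts_updated, unhealthy_hosts). An empty second list means
    every ring completed and passed its health check.
    """
    updated: list[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            updated.append(host)
        unhealthy = [host for host in ring if not is_healthy(host)]
        if unhealthy:
            return updated, unhealthy  # later rings are never touched
    return updated, []
```

With rings such as a single canary host, a small pilot group, and then the wider fleet, a bad update is contained to the ring where it first shows up. In practice you would also wait for a soak period between rings before running the health check.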

Conclusion

Organizations inevitably face increasing system complexity while striving for more robust recovery processes. The challenge lies not only in recovering from incidents but also in designing systems that can adapt and evolve without compromising stability or security.

Innovative approaches like microservices architectures, containerization, and infrastructure-as-code allow for more granular control and easier updates, potentially reducing the impact of incidents like the CrowdStrike outage. However, implementing these advanced architectures requires a shift in mindset and skillset across the organization.

As technology leaders, our job isn't just to prevent disasters but to ensure swift recovery when they happen. By implementing these strategies and embracing the complexity challenge, organizations can enhance their resilience against similar incidents in the future. Fully automated recovery processes, secure systems, and a culture of continuous improvement all contribute to swift and effective recovery, minimizing downtime and ensuring continuity of operations.

Bill Church

Vice President, Engineering & Services