Saturday, 20 July 2024

With great kernel power comes great operating responsibility!

The recent Microsoft-CrowdStrike incident, which caused Blue Screen of Death (BSOD) errors on Windows hosts, resulted from an update pushed to the Falcon sensor, reported as version 6.58. The update was pulled after widespread reports of BSOD incidents.

The incident raises several critical questions about the root cause, the testing and deployment processes, the capabilities and shortcomings of CrowdStrike's tools, and the oversight mechanisms in place.


The issue is linked to the sensor's interaction with the Windows operating system at the kernel level. CrowdStrike's sensors operate at this level to provide deep security insights and to prevent sophisticated attacks that might otherwise bypass user-level protections. By integrating at the kernel level, these sensors can monitor and respond to system calls and processes in real-time, offering robust security measures against advanced threats.

However, kernel-level modifications come with significant risks. Any error or incompatibility in the kernel-mode drivers can lead to critical system failures, like BSODs. In this case, the specific problem likely arose from an unintended conflict or bug within the sensor's driver code, which directly interacts with the Windows kernel.

Root Cause

The root cause of the Windows host crashes was identified as a defect in a single content update for the Falcon sensor. The problematic update, specifically the channel file matching "C-00000291*.sys", caused the Windows OS to crash. CrowdStrike's engineering team recommended reverting to a previous stable version of the channel file.
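
As a minimal sketch of what locating the affected channel file might look like, the snippet below searches a directory for files matching the "C-00000291*.sys" pattern named above. The function name find_channel_files and the directory argument are illustrative assumptions; this post does not say where Falcon stores its channel files, so take the real location from CrowdStrike's official guidance. The function only lists matches and changes nothing on disk.

```python
from pathlib import Path

# Pattern taken from the post; everything else here is illustrative.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_channel_files(directory: str, pattern: str = CHANNEL_FILE_PATTERN) -> list[Path]:
    """Return files in `directory` whose names match `pattern`, sorted by name.

    `directory` is a caller-supplied assumption, not a path confirmed by the post.
    """
    root = Path(directory)
    if not root.is_dir():
        return []  # a missing directory simply yields no matches
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

Calling find_channel_files on the sensor's channel-file directory would return the candidate files to review before any revert, which keeps the inspection step separate from the remediation step.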

Lack of Thorough Testing

One of the primary issues highlighted by this incident is the apparent lack of thorough testing in a controlled test environment before deploying the update to production. Proper testing procedures are crucial to ensure that any updates or changes do not adversely affect the system's stability and functionality. The failure to identify such a critical issue in the testing phase suggests that the update was either inadequately tested or not tested in an environment that accurately mirrored the production setup.

Capabilities and Shortcomings of CrowdStrike Falcon Tools

CrowdStrike is known for its next-generation threat detection and prevention capabilities, but this incident underscores some significant shortcomings:

Strengths

  • Advanced Threat Detection: Falcon is equipped with robust machine learning and behavioral analytics to detect and prevent threats.
  • Cloud-Based Architecture: The cloud-based platform allows for real-time threat intelligence and updates.
  • Scalability: Falcon can scale to protect large enterprises with numerous endpoints.

Shortcomings

  • Update Management: The incident revealed weaknesses in the update management process, particularly in testing and validation.
  • Oversight and Quality Control: The lack of oversight in ensuring the quality and stability of updates before deployment is a critical flaw.
  • Customer Impact: The rapid deployment of untested updates directly impacted customer operations, leading to significant downtime and disruption.

Lack of Security Standards and Process Controls

The incident highlights a broader issue of insufficient security standards and process controls in place to prevent such configuration or administration errors. Effective security practices should include:

  • Comprehensive Testing: Updates should undergo rigorous testing in environments that replicate production setups.
  • Change Management: A robust change management process should be implemented to ensure that any updates are carefully reviewed and approved.
  • Incident Response: Clear incident response procedures should be in place to quickly address and mitigate any issues that arise from updates.
People talk about the highest levels of quality, yet realistic implementation often sits at the lowest level. This reflects the gap between documented process and practical adoption: the seriousness expressed at the top is rarely reflected by the time the work reaches the individual worker on the ground.

Microsoft's Oversight Responsibilities

Microsoft, as the provider of the Windows operating system, shares a degree of responsibility in ensuring that third-party integrations, such as those from CrowdStrike, do not compromise system stability. The delegation of control to third-party vendors without adequate oversight can lead to such incidents.

Recommendations for Microsoft
  • Stricter Integration Policies: Implement stricter policies and guidelines for third-party integrations to ensure compatibility and stability.
  • Joint Testing Initiatives: Collaborate with third-party vendors to conduct joint testing and validation of updates.
  • Monitoring and Auditing: Regularly monitor and audit third-party integrations to identify and address potential issues proactively.

CrowdStrike's Accountability

CrowdStrike must take responsibility for the failure and implement measures to prevent recurrence. The company needs to address several critical areas:

Improving Update Testing
  • Enhanced Testing Protocols: Develop and enforce stringent testing protocols for updates.
  • Simulated Production Environments: Use simulated production environments to test updates thoroughly.
  • Beta Programs: Introduce beta testing programs where updates are tested by a small group of users before wider deployment.
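
One way to picture the beta-program idea above is a ring-based rollout, where an update reaches a small share of hosts first and only widens while crash telemetry stays healthy. The sketch below is a hypothetical illustration, not CrowdStrike's actual process: the ring names, the percentages, the crash-rate threshold, and all function names are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Ring:
    name: str
    percent: int  # cumulative share of hosts (0-100) that receive the update


# Hypothetical rollout plan: internal hosts first, then beta users, then everyone.
RINGS = [Ring("internal", 1), Ring("beta", 10), Ring("general", 100)]
MAX_CRASH_RATE = 0.001  # assumed threshold: halt above 0.1% crashes among updated hosts


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_eligible(host_id: str, ring: Ring) -> bool:
    """A host gets the update once its bucket falls below the ring's cumulative percent."""
    return bucket(host_id) < ring.percent


def next_ring(current: int, updated_hosts: int, crashed_hosts: int) -> int:
    """Return the index of the ring to use next, or -1 to halt and roll back."""
    if updated_hosts == 0:
        return current  # no telemetry yet, stay in the current ring
    if crashed_hosts / updated_hosts > MAX_CRASH_RATE:
        return -1
    return min(current + 1, len(RINGS) - 1)
```

Because the bucket is derived from a hash of the host ID, the same hosts stay in the early rings across runs, and any host eligible in an early ring remains eligible in every later one.
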
Strengthening Quality Control
  • Quality Assurance Teams: Establish dedicated quality assurance teams to review and approve updates.
  • Automated Testing Tools: Utilize automated testing tools to identify potential issues quickly.
Customer Communication
  • Transparent Communication: Maintain transparent communication with customers about updates and potential issues.
  • Support Channels: Ensure robust support channels are available for customers to report and resolve issues promptly.
Great minds can have great ideas, but unless those ideas are delivered with a customer lens and accountability, the result is only hyped-up product security.

Conclusion

This incident exposes critical gaps in basic update testing and deployment practices, both at CrowdStrike and at Microsoft. While CrowdStrike offers powerful cybersecurity tools, the incident underscores the importance of rigorous testing, quality control, and effective communication with customers.

Moving forward, both CrowdStrike and Microsoft must implement stronger safeguards to prevent such incidents and ensure the stability and security of their systems. 

Don't strike the wrong places and lose your market to the competition!




Sunday, 14 July 2024

Enterprise Responsible AI Adoption – A Holistic AI Perspective

Enterprise Trade-off: Enterprises can combine multiple open-source models to reach around 90% accuracy, whereas the latest OpenAI model may reach 95% on its own. Open-source models also require additional training and Reinforcement Learning from Human Feedback (RLHF). Whether the gap is 90% versus 95% or as wide as 60% versus 90%, the trade-off between open-source and proprietary models needs careful evaluation.
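
To make the "multiple models" side of this trade-off concrete, the sketch below computes the accuracy of a majority vote over several models on a binary task. It rests on strong assumptions that are not from this post: every model is correct with the same probability, and their errors are independent. Real models tend to make correlated mistakes, so treat the result as an optimistic upper bound. The function name and the 80% figure are illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes an odd number of models and a binary task, so the vote never ties.
    """
    if n_models < 1 or n_models % 2 == 0:
        raise ValueError("n_models must be a positive odd number")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


for n in (1, 3, 5):
    print(f"{n} model(s) at 80% each -> {majority_vote_accuracy(n, 0.8):.3f}")
```

Under these assumptions, three models at 80% each reach about 0.896 and five reach about 0.942. Each extra model also adds inference, training, and maintenance cost, which is the other half of the evaluation.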

  • Model and Data Alignment: Failing to invest time in understanding the models, aligning them with the right data, and establishing proper benchmarks will lead to a random, fragile implementation. A "lift-and-shift" approach to building AI products is not a sustainable strategy.
  • Data and Model Understanding: Without a full understanding of the data sources and the limitations of the models in use, handling only the happy-path scenarios is not enough to deliver successful GenAI applications.
  • Responsible AI Adoption: Relying on open-source models that deliver subpar accuracy does not constitute responsible AI adoption. It reflects a short-term vision and a failure to prioritize long-term sustainability.
  • Open Source Paradox: There's a growing push to leverage open-source models and frameworks, but expectations for state-of-the-art accuracy remain unrealistically high.
  • Long-term Costs: The broader impact and cost of fixing data issues or model errors are often overlooked in favor of flashy, short-term demo solutions that generate applause but don't provide lasting value.

Key Questions to Ask About the Model:
  1. Data: Is the data representative, reliable, and aligned with the intended use case?
  2. Domain: Does the model have domain-specific knowledge to perform effectively?
  3. Benchmark: Have clear benchmarks and performance metrics been set and evaluated?

Key Questions to Ask About the Use Case:
  1. Why do we need an LLM?: Is an LLM the best solution for this problem, or are there alternatives?
  2. How much effort does it save?: What quantifiable efficiencies or cost savings does the LLM offer compared to traditional methods?
  3. What is the plan to improve accuracy?: How will you progress from the current accuracy level, and what steps will be taken to continuously improve the model's performance?

  • Leadership Clarity: Leaders must understand that simply purchasing a platform or tool will not solve the underlying challenges of responsible AI adoption. A clear vision and strategy are critical for long-term success.
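
The benchmark question can be made concrete with a small evaluation harness. The sketch below is a minimal illustration under stated assumptions: toy_model is a hypothetical stand-in for a real model call, the three test cases are invented for the example, and exact-match scoring with a 0.90 target is only one possible metric and threshold.

```python
from typing import Callable, Sequence


def evaluate(model: Callable[[str], str], cases: Sequence[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model's answer matches the expected one."""
    if not cases:
        raise ValueError("benchmark needs at least one case")
    correct = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(cases)


def meets_target(accuracy: float, target: float) -> bool:
    """Gate a release on the benchmark: pass only when accuracy reaches the target."""
    return accuracy >= target


# Hypothetical stand-in for a real model call; replace with your own client.
def toy_model(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


# Tiny illustrative benchmark, including one case off the happy path.
CASES = [
    ("capital of France?", "paris"),
    ("2 + 2?", "4"),
    ("capital of Atlantis?", "no such country"),
]

accuracy = evaluate(toy_model, CASES)
print(f"accuracy = {accuracy:.2f}, meets 0.90 target: {meets_target(accuracy, 0.90)}")
```

The toy model answers the two happy-path cases and misses the third, so it scores two out of three and fails the gate. Tracking the same number release after release is what turns the plan to improve accuracy into something measurable.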
