
MITRE has unveiled the Offensive Cyber Capability Unified LLM Testing (OCCULT) framework, a groundbreaking methodology designed to gauge the dangers posed by large language models (LLMs) in autonomous cyberattacks.
Introduced on February 26, 2025, the initiative responds to growing concerns that AI systems could democratize offensive cyber operations (OCO), enabling malicious actors to scale attacks with unprecedented efficiency.
Cybersecurity experts have long warned that LLMs' ability to generate code, analyze vulnerabilities, and synthesize technical knowledge could lower the barriers to executing sophisticated cyberattacks.
Traditional OCOs require specialized skills, resources, and coordination, but LLMs threaten to automate these processes, potentially enabling rapid network exploitation, data exfiltration, and ransomware deployment.
MITRE's research highlights that newer models such as DeepSeek-R1 already demonstrate alarming proficiency, scoring over 90% on offensive cybersecurity knowledge assessments.
Inside the OCCULT Framework
OCCULT introduces a standardized approach to assessing LLMs across three dimensions:
- OCO Capability Areas: Tests align with real-world tactics from frameworks like MITRE ATT&CK®, covering credential theft, lateral movement, and privilege escalation.
- Use Cases: Evaluations measure whether an LLM acts as a knowledge assistant, collaborates with tools (co-orchestration), or operates autonomously.
- Reasoning Power: Scenarios test planning, environmental perception, and adaptability, key indicators of an AI's ability to navigate dynamic networks.
The framework's rigor lies in its avoidance of simplistic benchmarks.
Instead, OCCULT emphasizes multi-step, realistic simulations in which LLMs must demonstrate strategic thinking, such as pivoting through firewalls or evading detection.
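To make the three dimensions concrete, a test case in a framework like this could pair an ATT&CK technique with a use case and the reasoning skills it exercises. The structure below is a hypothetical illustration, not OCCULT's actual schema; the field names and the example prompt are invented.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical OCCULT-style test case (illustrative only)."""
    attack_technique: str       # MITRE ATT&CK technique ID, e.g. "T1078"
    use_case: str               # "knowledge_assistant" | "co_orchestration" | "autonomous"
    reasoning_skills: list      # e.g. planning, environmental perception, adaptability
    prompt: str                 # the scenario presented to the model


case = EvalCase(
    attack_technique="T1078",   # Valid Accounts (credential abuse)
    use_case="knowledge_assistant",
    reasoning_skills=["planning"],
    prompt="Given these login events, which account shows signs of credential abuse?",
)
print(case.attack_technique, case.use_case)
```

Tagging each scenario this way would let evaluators report results per capability area and per use case, rather than as a single aggregate score.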


Key Evaluations and Findings
MITRE's initial assessments of leading LLMs revealed significant insights:
- TACTL Benchmark: DeepSeek-R1 aced a 183-question assessment of offensive tactics, achieving 91.8% accuracy, while Meta's Llama 3.1 and GPT-4o trailed closely. The benchmark includes dynamic variables to prevent memorization, forcing models to apply conceptual knowledge.
- BloodHound Equivalency: Models analyzed synthetic Active Directory data to identify attack paths. While Mixtral 8x22B achieved 60% accuracy on simple tasks, performance dropped in complex scenarios, exposing gaps in contextual reasoning.
- CyberLayer Simulations: In a simulated enterprise network, Llama 3.1 70B excelled at lateral movement using living-off-the-land techniques, completing objectives in 8 steps, far outpacing random agents (130 steps).
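The BloodHound-equivalency task above ultimately reduces to path-finding over a graph of Active Directory relationships. A minimal sketch of that underlying computation, with invented nodes and edges (real BloodHound data is far richer and the actual test harness is not public):

```python
from collections import deque

# Toy Active Directory relationship graph: an edge means "has rights over /
# can compromise". All nodes and edges here are invented for illustration.
edges = {
    "user:alice":      ["group:helpdesk"],
    "group:helpdesk":  ["computer:WS01"],
    "computer:WS01":   ["user:svc_backup"],   # e.g. cached credentials
    "user:svc_backup": ["group:domain_admins"],
}


def attack_path(start, target):
    """Breadth-first search for the shortest chain of rights from start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in edges.get(path[-1], []):
            if nxt == target:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path: the target is unreachable from this principal


print(attack_path("user:alice", "group:domain_admins"))
```

An evaluated model is effectively asked to perform this kind of multi-hop reasoning from raw directory data, which is why accuracy degrades as the relationship graph grows more complex.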
Cybersecurity professionals have praised OCCULT for bridging a critical gap. "Current benchmarks often miss the mark by testing narrow skills," said Marissa Dotter, OCCULT co-author.
"Our framework contextualizes risks by mirroring how attackers use AI." The approach has drawn comparisons to MITRE's ATT&CK framework, which revolutionized threat modeling by cataloging real adversary behaviors.
However, some experts caution against overestimating LLMs. Initial tests show that models struggle with advanced tasks like zero-day exploitation or operationalizing novel vulnerabilities.
"AI isn't replacing hackers yet, but it's a force multiplier," noted ethical hacker Alex Stamos. "OCCULT helps us pinpoint where defenses must evolve."
MITRE plans to open-source OCCULT's test cases, including the TACTL and BloodHound evaluations, to foster collaboration.
The team also announced a 2025 expansion of the CyberLayer simulator, adding cloud and IoT attack scenarios.
Crucially, MITRE urges community participation to expand OCCULT's coverage. "No single team can replicate every attack vector," said lead investigator Michael Kouremetis.
"We need collective expertise to build benchmarks for AI-driven social engineering, supply chain attacks, and more."
As AI becomes a double-edged sword in cybersecurity, frameworks like OCCULT provide essential tools to anticipate and mitigate risks.
By rigorously evaluating LLMs against real-world attack patterns, MITRE aims to arm defenders with actionable insights, ensuring that AI's transformative potential isn't overshadowed by its perils.