Last Week in AI Security — Week of January 26, 2026

Top Stories

NIST Releases AI Red Teaming Framework

The National Institute of Standards and Technology published its long-anticipated AI 600-1 Companion Guide on Red Teaming, providing a standardized methodology for adversarial testing of generative AI systems. The framework covers:

Scope definition — Categorizing AI systems by risk level and determining appropriate testing depth.
Attack taxonomy — A structured catalog of 147 attack techniques across prompt injection, jailbreaking, data extraction, and denial of service categories.
Evaluation metrics — Quantitative measures for attack success rate, guardrail robustness, and defense coverage.
Reporting standards — Templates for communicating findings to both technical and executive audiences.

The framework is expected to become the baseline for compliance audits under the forthcoming US Executive Order on AI Safety implementation rules. Organizations deploying customer-facing AI systems should begin mapping their testing practices against the NIST AI RMF taxonomy.

Critical RCE Vulnerability in vLLM

CVE-2026-2847 (CVSS 9.8) was disclosed in vLLM versions prior to 0.7.3, affecting the OpenAI-compatible API serving endpoint. The vulnerability allows an attacker to execute arbitrary code on the inference server by sending a specially crafted request with a malicious tensor payload in the token embedding override parameter.

Impact: Any vLLM deployment exposing the API endpoint to untrusted networks is vulnerable. This includes many production LLM serving configurations.

Mitigation: Upgrade to vLLM 0.7.3 immediately. If upgrade is not possible, disable the --allow-embedding-override flag and restrict API access to trusted networks.

Researchers at the University of Toronto published “CrossFire: Adversarial Attacks Across Modalities,” demonstrating that safety guardrails trained on text inputs are consistently bypassed when malicious instructions are encoded in images, audio, or structured data:

Typography injection — Embedding harmful instructions as text rendered in images. The vision encoder reads the text, bypassing the text-input safety classifier.
Audio steganography — Hiding instructions in audio spectrograms that are interpreted by multi-modal models but inaudible to human reviewers.
Structured data manipulation — Encoding instructions in JSON or CSV fields that are processed differently by the model’s data parsing pipeline.

The paper reports a 73% bypass rate across tested commercial multi-modal models, compared to a 12% bypass rate using text-only injection techniques.

Research Highlights

“Signed Prompts: Cryptographic Instruction Authentication for LLMs” (Stanford, arXiv) — Proposes embedding HMAC signatures in system prompts that models are trained to verify before executing instructions. Early results show 94% instruction authentication accuracy with minimal performance degradation.
“Sleeper Agents Revisited: Detecting Deferred Deception in Fine-Tuned Models” (Anthropic) — Follow-up research on detecting trojan behaviors that activate only after deployment. Introduces a new probing technique that identifies latent deceptive policies with 89% recall.
“Quantization-Aware Backdoors” (ETH Zurich) — Demonstrates that backdoors can be injected that only activate after model quantization (e.g., FP16 → INT4), evading detection in the full-precision model. Highlights the need to evaluate model security at the actual deployed precision.

Industry Updates

EU AI Act enforcement officially begins on February 2nd for providers of high-risk AI systems. Companies must demonstrate conformity assessments, risk management systems, and human oversight mechanisms. Non-compliance fines up to 35M EUR or 7% of global turnover.
Hugging Face launches Model Provenance Attestations — A new feature allowing model publishers to cryptographically sign their uploads with training pipeline metadata. Verification is integrated into the transformers library’s from_pretrained() method.
OpenAI expands Bug Bounty scope to include prompt injection and jailbreak vulnerabilities in ChatGPT and the API, with bounties up to $25,000 for critical findings that demonstrate real-world impact.
MITRE releases ATLAS v4.0 — The Adversarial Threat Landscape for AI Systems framework adds 34 new techniques and introduces sub-techniques for multi-modal attacks, supply chain compromises, and agentic system manipulation.

Tools and Resources

Garak v0.12 — Adds probes for multi-modal injection, MCP server exploitation, and agentic loop manipulation. Now supports testing against local Ollama models.
ModelScan v2.0 — Expanded static analysis for detecting malicious code in model files. Now supports SafeTensors, GGUF, and ONNX formats in addition to pickle.
LLM Guard v0.8 — Open-source input/output firewall adds real-time injection detection with sub-10ms latency. New scanner modules for PII detection and topic restriction.

Key Highlights