Showing top 135 results for "AI safety defenses" · filtered from 137 indexed

Videos

Anthropic just wrote itself a safety loophole

“Safety first” was the mantra that made Anthropic unique among its big AI competitors. …

Feb 25, 2026 · By Ben Patterson

Paper page - One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

… The following papers were recommended by the Semantic Scholar API: SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics (2026); ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming (2026); Transient Turn Injection: Exposing Stateless Multi-… …

May 13, 2026

Google expands Gemini DoD partnership with Gem-like agents for unclassified projects

… This comes as the DoD/DoW recently came into partnership with OpenAI, ousting Anthropic due to concerns over red-line safety measures for citizens. …

Mar 10, 2026 · Andrew Romero

Paper page - MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

… To demonstrate its reconfigurability, we apply MASCing to two different safety objectives and observe consistent gains with negligible overhead across seven open-source MoE models. …

May 4, 2026

Discussions and forums

r/netsec · u/unknownhad · May 10, 2026

The compression of the exploit timeline: Why n-day gaps and 90-day embargoes are failing in practice.

The traditional vulnerability disclosure timeline relies on a fundamental assumption: exploit development and vulnerability discovery take time. Over the last 12 months the integration of LLMs into offensive tooling has …

Hacker News · u/dk189 · 13h ago

Show HN: We post-trained a model that pen tests instead of refusing

Anthropic and OpenAI's publicly available models are explicitly guard-railed so that they refuse offensive tasks. And their cyber-focussed models are gated for enterprises. This leaves SMEs and mid market open to major v…

r/Android · u/MishaalRahman · May 12, 2026

New features, emojis, & security improvements: Here’s everything new coming to Android!

Hi Reddit, We just wrapped up The Android Show | I/O Edition, and a core theme of the show was how we’re making your phone more helpful so that you can spend less time looking at it and more time living your life. To mak…

Gemini is stopping harmful ads before people ever see them

… Read the 2025 Ads Safety Report to learn how we're stopping threats and supporting businesses. "Gemini is stopping harmful ads before people ever see them" – this article explains how. …

Apr 16, 2026 · Keerat Sharma

Google Workspace’s continuous approach to mitigating indirect prompt injections

… Deterministic defenses, including user confirmation, URL sanitization, and tool chaining policies, are designed for rapid response against new or emerging prompt injection attacks by relying on simple configuration updates. …

Apr 2, 2026 · Adam Gavish

Paper page - AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

… The following papers were recommended by the Semantic Scholar API: Orchard: An Open-Source Agentic Modeling Framework (2026); Auditing Agent Harness Safety (2026); SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety (2026); Security Risks in Tool-Enabled AI Agents: A Systematic Analysis … …

May 29, 2026

Securing internal systems against increasingly capable and imperfectly aligned AI

… Scaling security as AI gets smarter As AI models continue to advance, our defenses must also strengthen in tandem. …

Jun 18, 2026 · Rohin Shah and Four Flynn

In the Wake of Anthropic’s Mythos, OpenAI Has a New Cybersecurity Model—and Strategy

… Over the long term, to ensure the ongoing sufficiency of AI safety in cybersecurity, we also expect the need for more expansive defenses for future models, whose capabilities will rapidly exceed even the best purpose-built models of today.” The company says that it has homed in on three pillars for… …

Apr 14, 2026 · Lily Hay Newman

Apple Provides Update on App Store, Highlights Key 2025 Safety Stats

… "Apple's Trust and Safety teams integrate AI throughout the entire moderation process to detect spam, offensive content, and inauthentic reviews at scale," the company explained. …

May 20, 2026 · Joe Rossignol
