Search: LLM capabilities

Paper page - Code World Model Preparedness Report

…The following papers were recommended by the Semantic Scholar API: Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks (2026); CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge (2026); Evaluating…

May 6, 2026

Paper page - Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning

…Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent (2026); Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution (2026)…

May 15, 2026

Paper page - PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

…language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on-device agents preserve privacy at the cost of overall capability. Existing device-cloud…

May 13, 2026

Paper page - Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

…Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks (2026); ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces (2026); Spec Kit Agents: Context-Grounded Agentic Workflows (2026…

May 1, 2026

Paper page - AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

…A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction (2026); TMAS: Scaling Test-Time Compute via Multi-Agent Synergy (2026); Agent^2 RL-Bench: Can LLM Agents…

May 28, 2026

Paper page - Instruction-Guided Poetry Generation in Arabic and Its Dialects

…While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g…

May 1, 2026

Paper page - From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

…AI-generated summary: LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by…

May 4, 2026

Paper page - MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

…Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary…

May 7, 2026

Paper page - One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

…AI-generated summary: Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable…