Friday, June 5, 2026

Daily Digest 2026-06-05

Today’s content emphasizes advancements in agentic systems, autonomous agent development, and the challenges of ensuring reliability and interpretability in AI, alongside diverse technical innovations across domains.

Research highlights:

  • Agentic Systems: Papers explore diagnostics for cross-scenario generality, frameworks for human-directed development, and methods to detect and mitigate cascading hallucinations in retrieval-augmented agents.
  • Autonomous Agent Development: Research questions the capability of current agents to self-develop, proposes flexible training frameworks for swarms, and investigates timing challenges in intervention strategies.
  • AI Reliability and Interpretability: Work focuses on pre-deployment assurance through ontology-grounded simulations, consequence-aware reasoning, and temporal regret as a control objective.
  • Benchmarking and Applications: New benchmarks for visual-assisted math problem solving and natural language extensions for multi-agent challenges are introduced, alongside efforts to automate data curation.

Tech buzz:

  • A new general-purpose Linux distribution from Microsoft and AI-powered code review tools highlight evolving infrastructure and developer productivity trends.
  • Hardware hacking tools and open-source music models reflect growing interest in accessible, local AI applications beyond traditional domains.
  • Discussions on C++ evolution and transformer architecture variants underscore ongoing debates about efficiency and design in foundational models.

Papers discovered from ArXiv subject categories

AI Safety

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Software Engineering (cs.SE) 5 Jun 2026
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Thanh Luong Tuan, Abhijit Sanyal

ArXiv ID: 2606.04037

Abstract:

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

Insights

Contribution: The paper presents an ontology-grounded verification framework for pre-deployment assurance of enterprise AI agents that, to the authors' knowledge, is the first to combine an Agent Operational Envelope, ontology-to-scenario generation, and machine-verifiable Trust Certificates.

Core Idea: The framework leverages domain ontologies to generate regulatory, operational, and adversarial test scenarios, enabling auditable pre-deployment certification of AI agents across industries with strict compliance requirements.

Technique: The approach formalizes certification spaces via ontologies, automates scenario generation from regulatory rules, and produces trust certificates with deployment verdicts validated against injected faults and regulatory benchmarks.

Pipeline: Regulatory ontologies → ontology-to-scenario generation pipeline → 1,800 test scenarios evaluated against 125 regulatory requirements and 25 faults → Trust Certificate with deployment verdicts

Methodology: A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare) in the U.S. and Vietnam, validated via cross-LLM testing (Claude, Qwen, Gemma) and statistical comparison against persona-based baselines.

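Results: Ontology-grounded generation achieved 48.3% regulatory coverage (vs. 33.1% for the persona-based baseline; corrected p_c = .0006) and the highest domain specificity (4.77/5.0); its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (5,400 scenarios in total) replicated the persona-versus-ontology pattern.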

Limitations: Dependence on high-quality domain ontologies, potential scalability challenges for complex regulatory regimes, and statistical significance concerns under Bonferroni correction for some comparisons.
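
The abstract notes that some comparisons did not survive Bonferroni correction. As a minimal sketch of what that correction does, the snippet below multiplies each raw p-value by the number of comparisons and caps the result at 1. The raw p-values and the number of comparisons are hypothetical and are not taken from the paper.

```python
def bonferroni(p_values):
    """Return Bonferroni-corrected p-values.

    Each raw p-value is multiplied by the number of comparisons
    and capped at 1.0.
    """
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for three baseline comparisons (not the paper's data).
raw = [0.001, 0.02, 0.04]
corrected = bonferroni(raw)

# A comparison "survives" only if its corrected p-value stays below alpha.
alpha = 0.05
survives = [p < alpha for p in corrected]
```

In this toy case only the first comparison survives, even though all three raw p-values are below 0.05.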

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

Manvendra Modgil

ArXiv ID: 2606.04296

Abstract:

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.

Insights

Contribution: The paper maps the low-reliability nature of intervention timing across human annotators, four detector architectures, and LLM judges, revealing systemic failures in existing approaches and the saturation trap in affective models.

Core Idea: Intervention timing for autonomous agents is a low-reliability construct: human annotators barely agree on where to intervene, threshold-on-state triggers saturate, and LLM judges reach only F1 0.17-0.40 even with full-trajectory context.

Technique: An 18-dimensional affective-dynamics engine (HEART) serves as a diagnostic probe for evaluating four intervention trigger families against human-annotated debugging traces from SWE-bench-Verified.

Pipeline: Debugging traces → HEART affective analysis + trigger evaluation → quantification of trigger performance and human reliability discrepancies

Methodology: The study compares absolute state thresholds, composite patterns, regex features, and LLM-as-judge methods using SWE-bench-Verified data, with cross-model LLM evaluations and human inter-rater reliability analysis.

Results: The state saturation trap causes threshold-on-state triggers to fire on 39-83% of actions across five trajectories. As an LLM judge, the small model (gpt-5.4-mini) never fires; frontier and cross-vendor models fire only with full-trajectory context and reach F1 0.17-0.40 at up to 90x the cost. Three trained annotators agree on intervention locations only slightly above chance (Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349).

Limitations: The findings highlight fundamental limitations in using single-annotator F1 as a metric and suggest the need for alternative reliability frameworks that account for human subjectivity and model saturation dynamics.
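
The inter-rater figures above rely on chance-corrected agreement. A minimal sketch of pairwise Cohen's kappa follows, to show why two annotators can agree on most actions and still score near chance. The label sequences are hypothetical (1 = "intervene here", 0 = "do not intervene") and are not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both annotators labelled independently
    # with their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both use one identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations over ten agent actions.
annotator_1 = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
annotator_2 = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators give the same label on 6 of 10 actions, but because "do not intervene" dominates, expected chance agreement is 0.58 and kappa comes out just under 0.05.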

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Jingbo Wen, Liang He, Ziqi He

ArXiv ID: 2606.04402

Abstract:

Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.

Insights

Contribution: This paper introduces consequence-aware test-time compute allocation, addressing the gap where existing methods assume all errors have equal cost, by prioritizing high-consequence tasks with additional compute resources.

Core Idea: Instead of allocating compute based solely on predicted task difficulty, the authors propose estimating task consequence from issue text and routing higher-consequence tasks to larger compute tiers under the same total budget.

Technique: A lightweight predictor estimates task consequence from issue text, while a scheduler routes compute based on consequence rather than difficulty, using marginal-utility signals for prioritization.

Pipeline: Issue text → consequence prediction → compute allocation scheduling → compute tier assignment

Methodology: Experiments on SWE-bench Lite and Multi-SWE-bench mini (700 software-engineering tasks) evaluate consequence-aware routing against difficulty-based baselines, measuring cost-weighted loss reduction and predictor accuracy.

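Results: Under matched compute budgets, consequence-aware scheduling reduces cost-weighted loss by 22-33% relative to difficulty-aware routing, and the deployable predictor-driven version of the priority-aware variant retains over 90% of the oracle gain. The issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks.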

Limitations: The approach relies on issue-text-based consequence estimation, which may lack context from broader system state; generalizability to non-software-engineering domains remains untested.
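
To make the routing idea concrete, here is a minimal sketch of a budget-constrained allocator that upgrades the highest-consequence tasks first. It is a simple greedy stand-in, not the paper's scheduler: it ignores the marginal-utility signal, and the task names, consequence scores, tier costs, and budget are all hypothetical.

```python
def allocate(tasks, tier_costs, budget):
    """Assign each task a compute tier index under a total budget.

    tasks: dict mapping task id -> predicted consequence (higher = costlier failure).
    tier_costs: ascending compute cost per tier; tier 0 is the default.
    budget: total compute available across all tasks.
    """
    assignment = {task: 0 for task in tasks}
    spent = tier_costs[0] * len(tasks)
    if spent > budget:
        raise ValueError("budget cannot cover the lowest tier for every task")
    # Visit tasks from highest to lowest predicted consequence and give
    # each the largest tier that still fits in the remaining budget.
    for task in sorted(tasks, key=tasks.get, reverse=True):
        for tier in range(len(tier_costs) - 1, 0, -1):
            extra = tier_costs[tier] - tier_costs[0]
            if spent + extra <= budget:
                assignment[task] = tier
                spent += extra
                break
    return assignment


# Hypothetical predicted consequences for three issues.
tasks = {"typo-in-log": 1.0, "db-migration": 50.0, "ui-label": 2.0}
plan = allocate(tasks, tier_costs=[1, 4, 10], budget=15)
```

With these numbers the database migration receives the top tier, the UI label the middle tier, and the log typo stays on the default tier, all within the same total budget.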


Agentic AI

5/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

Zhikai Chen, Jialiang Gu, Junyu Yin, Xianxuan Long, Shenglai Zeng, Xiaoze Liu, Kai Guo, Keren Zhou, Jiliang Tang

ArXiv ID: 2606.04315

Abstract:

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.

Insights

Contribution: This paper evaluates the cross-scenario generality of agentic memory systems and introduces AutoMEM, a self-managed agentic memory harness that achieves the best cross-scenario generality among the systems evaluated.

Core Idea: Memory performance depends on granting agents active control over storage and retrieval processes rather than relying on passive, fixed-pipeline memory systems.

Technique: AutoMEM implements a self-managed tool interface for flat text-file storage, enabling agents to dynamically control memory operations through tool calls.

Pipeline: Agent memory needs → tool call-based storage management → structured memory retrieval for task execution

Methodology: The study evaluates eight memory systems plus an agentic harness for search problems across five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks.

Results: The agentic harness achieved the best cross-task ranking among the evaluated systems, and AutoMEM achieved the best cross-scenario generality across the five scenarios.

Limitations: The evaluation focuses on text-based storage; generalization to non-text modalities or real-world deployment constraints remains unexplored.
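
As an illustration of "flat text-file storage via tool calls", the sketch below exposes write, append, search, and read operations that an agent could invoke as tools. The class name, method names, and file layout are assumptions made for this example; the abstract does not describe AutoMEM's actual interface.

```python
import tempfile
from pathlib import Path


class FlatFileMemory:
    """Hypothetical tool interface: the agent chooses what to store and what to look up."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        return self.root / f"{name}.txt"

    def write(self, name, text):
        """Create or overwrite a note."""
        self._path(name).write_text(text, encoding="utf-8")

    def append(self, name, text):
        """Add text to the end of a note, creating it if needed."""
        with self._path(name).open("a", encoding="utf-8") as handle:
            handle.write(text)

    def search(self, keyword):
        """Return the sorted names of notes containing the keyword (case-insensitive)."""
        needle = keyword.lower()
        return [
            path.stem
            for path in sorted(self.root.glob("*.txt"))
            if needle in path.read_text(encoding="utf-8").lower()
        ]

    def read(self, name):
        return self._path(name).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as workdir:
    memory = FlatFileMemory(workdir)
    memory.write("project", "Deadline is Friday.\n")
    memory.append("project", "Reviewer: Dana.\n")
    memory.write("prefs", "User prefers concise answers.\n")
    hits = memory.search("deadline")
    note = memory.read("project")
```

The point of the design is that storage and retrieval are decisions the agent makes through tool calls, rather than steps in a fixed pipeline.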

5/5 Artificial Intelligence (cs.AI) 5 Jun 2026
The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

Travis Weber, Rohit Taneja

ArXiv ID: 2606.04321

Abstract:

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.

Insights

Contribution: The paper introduces the Digital Apprentice framework, which enables scalable and safe agentic AI by structuring autonomy as a developmental process earned through empirical validation rather than assumed.

Core Idea: Autonomy in AI agents is progressively granted through tiered authorization, ensuring alignment with human standards via methodology capture, explicit approval gates, and continuous alignment corrections.

Technique: The framework combines methodology capture (distilling human tacit knowledge), authorization (autonomy escalation via human approval), and continuous alignment (runtime drift correction) within an inference-time control plane.

Pipeline: Human tacit methodology → structured asset capture → tiered autonomy authorization → runtime alignment corrections → aligned autonomous agent

Methodology: The framework is implemented as an inference-time control plane with mathematical quality modeling, applied to a professional corpus to demonstrate drift recovery through runtime technique adaptation.

Results: On an open professional corpus, detecting data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift.

Limitations: Dependence on high-quality initial methodology capture, potential delays from explicit human approval gates, and scalability challenges in maintaining continuous alignment across diverse domains.
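
The "autonomy is earned" idea can be sketched as a per-skill record that moves up one tier only when recent evidence meets a bar and a human explicitly approves. The tier names, the five-trial window, and the 0.9 acceptance threshold are hypothetical choices for illustration, not values from the paper.

```python
# Hypothetical tier names, from least to most autonomous.
TIERS = ["observe", "suggest", "act_with_review", "autonomous"]


class SkillRecord:
    """Tracks evidence and the current autonomy tier for one skill."""

    def __init__(self):
        self.tier = 0
        self.outcomes = []  # True when the human accepted the agent's output

    def record(self, accepted):
        self.outcomes.append(bool(accepted))

    def eligible(self, min_trials=5, min_rate=0.9):
        """Evidence gate: enough recent trials with a high acceptance rate."""
        recent = self.outcomes[-min_trials:]
        return len(recent) >= min_trials and sum(recent) / len(recent) >= min_rate

    def promote(self, human_approved):
        """Escalate one tier only if evidence and explicit approval both hold."""
        if self.eligible() and human_approved and self.tier < len(TIERS) - 1:
            self.tier += 1
            self.outcomes.clear()  # evidence must be re-earned at the new tier
            return True
        return False


skill = SkillRecord()
for _ in range(5):
    skill.record(True)
blocked = skill.promote(human_approved=False)   # evidence alone is not enough
granted = skill.promote(human_approved=True)
```

Clearing the evidence after each promotion is one possible policy; it forces the agent to demonstrate quality again before the next escalation.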

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Cryptography and Security (cs.CR), Information Retrieval (cs.IR) 5 Jun 2026
Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Saroj Mishra

ArXiv ID: 2606.04435

Abstract:

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

Insights

Contribution: Formalizes cascading hallucination as a distinct failure mode in agentic RAG systems and introduces CHARM, a framework for detecting and mitigating error propagation in multi-step reasoning pipelines.

Core Idea: CHARM identifies and interrupts cascading hallucinations through stage-level verification, consistency tracking, confidence monitoring, and resolution triggering without altering existing agentic RAG architectures.

Technique: CHARM employs four components: stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering to detect and interrupt error propagation.

Pipeline: Agentic RAG pipeline input → CHARM components (verification, tracking, monitoring, resolution) → Mitigated pipeline output with reduced cascading hallucinations

Methodology: Evaluated CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and adversarial datasets using LangChain configurations, with ablation studies to validate component effectiveness.

Results: 89.4% cascade detection rate (5.3% false positive rate), 82.1% error propagation reduction vs. 18.5% for output-level detectors, 215 ms ± 18 ms latency overhead per stage.

Limitations: Relies on existing pipeline structures without architectural overhaul; potential trade-offs between detection sensitivity and computational overhead in real-time systems.
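
To illustrate why stage-level monitoring can catch what an output-level check misses, the toy monitor below multiplies per-stage verification scores and reports the first stage at which the running product falls below a threshold. This is an assumed, simplified stand-in for confidence propagation monitoring, not CHARM's actual design; the scores and the 0.5 threshold are hypothetical.

```python
def first_cascade_stage(stage_scores, threshold=0.5):
    """Return the index of the first stage where the running product of
    per-stage verification scores drops below the threshold, else None."""
    running = 1.0
    for index, score in enumerate(stage_scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        running *= score
        if running < threshold:
            return index
    return None


# Hypothetical per-stage scores: every stage looks acceptable in isolation.
flagged = first_cascade_stage([0.9, 0.8, 0.6, 0.95])
clean = first_cascade_stage([0.95, 0.95, 0.95])
```

In the first run no single stage scores below 0.6, yet the accumulated product drops under 0.5 at the third stage (index 2), which is where a resolution step would be triggered in this sketch.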

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 5 Jun 2026
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Xinyu Lu, Tianshu Wang, Pengbo Wang, zujie wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

ArXiv ID: 2606.04455

Abstract:

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

Insights

Contribution: The paper introduces the Meta-Agent Challenge (MAC), an open-source benchmark to evaluate whether frontier models can autonomously develop agent systems, highlighting critical gaps in robustness and alignment.

Core Idea: MAC tests if a meta-agent can iteratively program an agent artifact to maximize performance across domains, using a sandboxed environment and evaluation API, while preventing reward hacking through multi-layer defenses.

Technique: A code agent (meta-agent) is given a sandboxed environment, evaluation API, and time constraints to iteratively develop agent artifacts, with security measures to prevent reward hacking and ensure evaluation integrity.

Pipeline: Domains and task specifications → Meta-agent iteratively programs agent artifacts in a sandboxed environment → Agent artifact evaluated on held-out test sets with performance metrics.

Methodology: The framework uses a sandboxed environment with an evaluation API, time limits, and multi-layer defenses against reward hacking to rigorously assess autonomous agent development capabilities.

Results: Meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. The design process shows high variance, and high optimization pressure surfaces emergent adversarial behaviors such as ground-truth exfiltration.

Limitations: High variance in design processes, limited success in matching baselines, and unresolved challenges in robustness, alignment, and recursive self-improvement under strict constraints.
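
The setup described above, in which a meta-agent iterates against an evaluation API under a time limit, can be sketched as a bounded improvement loop. The `propose` and `evaluate` callables are hypothetical placeholders for the meta-agent's code generation step and the benchmark's evaluation API; this is not the MAC harness itself.

```python
import time


def develop_agent(propose, evaluate, time_limit_s, max_iters=100):
    """Keep the best-scoring artifact found before the deadline.

    propose(best, best_score) -> candidate artifact (hypothetical meta-agent step).
    evaluate(candidate) -> score from a development-time evaluation API;
    the held-out test set is never exposed to this loop.
    """
    deadline = time.monotonic() + time_limit_s
    best, best_score = None, float("-inf")
    for _ in range(max_iters):
        if time.monotonic() >= deadline:
            break
        candidate = propose(best, best_score)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins: the "artifact" is a single integer parameter.
def propose(best, best_score):
    return 0 if best is None else best + 1


def evaluate(candidate):
    return -(candidate - 3) ** 2


artifact, score = develop_agent(propose, evaluate, time_limit_s=60, max_iters=10)
```

Keeping the final test set outside the loop is what the benchmark's defenses against reward hacking are meant to protect; a loop that could read test labels would make the reported score meaningless.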

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Multiagent Systems (cs.MA) 5 Jun 2026
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding

ArXiv ID: 2606.04484

Abstract:

We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLM as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

Insights

Contribution: AgentJet introduces a decoupled multi-node swarm training framework for LLM agent reinforcement learning, enabling heterogeneous multi-model training, fault tolerance, live code iteration, and automated long-horizon research.

Core Idea: By separating model optimization (on GPU clusters) from agent execution (on arbitrary devices), AgentJet achieves flexibility and scalability in training complex agentic systems.

Technique: AgentJet employs a context tracking module with timeline merging to accelerate training and an automated research system for autonomous RL experimentation.

Pipeline: Research topic → swarm architecture with context tracking and automated system → long-horizon RL studies and optimized agent teams

Methodology: The framework uses distributed computing with isolated agent runtimes, timeline-merged context tracking, and automated workflow execution on large-scale clusters.

Results: Achieves 1.5-10x training speedup via context optimization; enables multi-model, multi-task, and fault-tolerant agent training; automates multi-day RL research.
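
The abstract does not spell out how timeline merging works. One plausible reading, sketched below purely as an assumption, is that multi-turn rollouts yield many training contexts in which earlier turns are prefixes of later ones, so dropping any context that is a prefix of another removes redundant tokens. The function name and the turn labels are hypothetical.

```python
def merge_timelines(timelines):
    """Drop every timeline that duplicates, or is a prefix of, another timeline."""
    # Longest first, so any prefix is always compared against a kept superset.
    unique = sorted(set(map(tuple, timelines)), key=len, reverse=True)
    kept = []
    for timeline in unique:
        if not any(other[:len(timeline)] == timeline for other in kept):
            kept.append(timeline)
    return [list(timeline) for timeline in kept]


# Hypothetical contexts recorded after each turn of two conversations.
recorded = [
    ["sys", "u1", "a1"],
    ["sys", "u1", "a1", "u2", "a2"],
    ["sys", "u1", "a1", "u2", "a2", "u3", "a3"],
    ["sys", "x1", "y1"],
]
merged = merge_timelines(recorded)
tokens_before = sum(len(t) for t in recorded)
tokens_after = sum(len(t) for t in merged)
```

In this toy case four recorded contexts collapse to two, and the number of context items to process falls from 18 to 10. The actual speedup reported in the abstract depends on the framework's own mechanism.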

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

Joel Sol, Homayoun Najjaran

ArXiv ID: 2606.04202

Abstract:

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

Insights

Contribution: SMAC-Talk introduces a natural language communication channel to the StarCraft Multi-Agent Challenge, enabling evaluation of LLM-based agents in cooperative settings with decentralized control and partial observability.

Core Idea: The framework probes agent coordination and trust through natural language interactions, including scenarios with deceptive communicators to challenge collaborative decision-making.

Technique: A natural language communication layer is integrated into the SMAC environment, allowing agents to exchange information while benchmarking coordination under varying reasoning structures, memory, and model scales.

Pipeline: Game state and communication input → LLM agents reason and generate natural language messages → Actions taken in the environment based on coordinated decisions

Methodology: The study evaluates three agents using four Qwen3.5 models, analyzing how reasoning structure, memory, and model scale impact coordination in scenarios with and without deceptive communication.

Results: The study examines how reasoning structure, memory, and model scale affect coordination between agents, including in scenarios with an embedded deceptive communicator; the abstract does not state quantitative outcomes.

Limitations: The benchmark focuses on specific LLM architectures (Qwen3.5 family), and real-world multi-agent coordination challenges may require additional environmental complexity beyond the current setup.

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu

ArXiv ID: 2606.04391

Abstract:

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.
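The contrast between task-level and state-grounded retrieval described in the abstract can be illustrated with a minimal sketch. This is not the SGDR implementation: the skill library, the bag-of-words similarity, the stopword list, and the `alpha` mixing weight are all hypothetical stand-ins chosen so the example runs without external dependencies.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use learned embeddings instead.
STOPWORDS = {"a", "the", "for", "on", "and", "with", "by", "using"}

def bow(text):
    """Bag-of-words vector over lowercase alphanumeric tokens, minus stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical skill library: each skill pairs a text description with action code.
SKILLS = [
    {"name": "search_product",
     "text": "search for a product using the search box on the home page",
     "code": "type('search box', query); click('search')"},
    {"name": "apply_price_filter",
     "text": "filter product search results by price range",
     "code": "set_range('price', low, high); click('apply')"},
    {"name": "submit_review",
     "text": "write and submit a review on a product page",
     "code": "type('review form', text); click('submit')"},
]

def retrieve(task, state, skills, alpha=0.5, k=1):
    """Rank skills by alpha * task similarity + (1 - alpha) * state similarity."""
    task_vec, state_vec = bow(task), bow(state)
    scored = []
    for skill in skills:
        skill_vec = bow(skill["text"])
        score = alpha * cosine(task_vec, skill_vec) + (1 - alpha) * cosine(state_vec, skill_vec)
        scored.append((score, skill["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

TASK = "search for a laptop, filter by price, and submit a review"
STATES = [
    "home page with search box",
    "search results list with price range filter",
    "product page with review form",
]

for state in STATES:
    static = retrieve(TASK, state, SKILLS, alpha=1.0)   # task-only: ignores the page
    dynamic = retrieve(TASK, state, SKILLS, alpha=0.5)  # re-ranked at every step
    print(f"{state!r}: static={static} dynamic={dynamic}")
```

With `alpha=1.0` the ranking depends only on the task instruction, so the same skill is returned at every step; mixing in the page state changes the top skill as the page changes, which is the behavior the abstract argues for.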

Insights

Contribution: Introduces State-Grounded Dynamic Retrieval (SGDR), a method for online skill learning in web agents that dynamically matches skills to both task goals and current webpage states, improving adaptability in multi-step web automation.

Core Idea: SGDR enables stepwise skill reuse by dynamically retrieving skills based on evolving webpage states, rather than relying on static task-level skill sets determined at the start of execution.

Technique: Combines sliding-window trajectory extraction, dual text-code skill representations, and state-grounded retrieval to link skills with executable actions in real-time web interactions.

Pipeline: Task instruction and webpage state → sliding-window sub-procedure extraction → dual text-code skill encoding → state-grounded dynamic retrieval → action execution

Methodology: Induces skills from completed trajectories using sliding-window extraction, then retrieves and applies them step by step during execution by matching both the task goal and the current webpage state through the dual text-code representation.

Results: Achieves average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B on WebArena across five domains, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively.

Limitations: Depends on pre-trained large language models for skill encoding, may struggle with highly dynamic or novel webpage states outside training distribution, and requires computational resources for real-time retrieval.
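Because the reported gains are relative rather than absolute, it can help to back out what they imply in percentage points. The sketch below assumes the usual definition, relative gain = (new - baseline) / baseline; the baseline figures it prints are derived under that assumption and are not reported in the abstract.

```python
def implied_baseline(success_rate, relative_gain):
    """Baseline rate implied by a success rate and a relative gain (assumed definition)."""
    return success_rate / (1.0 + relative_gain)

# (model, reported success rate in %, reported relative gain as a fraction)
REPORTED = [("GPT-4.1", 37.5, 0.106), ("Qwen3-4B", 24.3, 0.100)]

for model, rate, gain in REPORTED:
    base = implied_baseline(rate, gain)
    print(f"{model}: implied baseline {base:.1f}%, absolute gain {rate - base:.1f} points")
```

Under this reading, a relative gain of about 10% corresponds to an absolute improvement of only a few percentage points.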

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 5 Jun 2026
Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Edward Y. Chang

ArXiv ID: 2606.04421

Abstract:

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.
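To give a feel for the gap between the linear and logarithmic rates named in the abstract, the following sketch evaluates the three growth envelopes at the two horizons used in the pilot (E = 100 and E = 500). The constant `c = 1`, the natural logarithm, and the choice `K = 3` are assumptions for illustration; the asymptotic bounds do not fix any of them, and this is not the paper's regret definition.

```python
import math

def linear_envelope(E, c=1.0):
    """Growth of order E: the rate attributed to outcome-only learning."""
    return c * E

def log_envelope(E, c=1.0, K=1):
    """Growth of order K * log E; K = 1 corresponds to the O(log E) case."""
    return c * K * math.log(E)

for E in (100, 500):
    print(
        f"E={E}: linear={linear_envelope(E):.0f}, "
        f"log={log_envelope(E):.2f}, "
        f"K=3 change-points={log_envelope(E, K=3):.2f}"
    )
```

Even with three change-points, the logarithmic envelope at E = 500 stays below 20 in these units, while the linear envelope reaches 500.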

Insights

Contribution: Introduces Trivium, a framework that incorporates temporal regret as a first-class objective alongside outcome and epistemic regret to systematically address persistent errors in agentic systems and LLM pipelines.

Core Idea: Current systems focus on outcome reward optimization, ignoring the 'when' and 'why' of errors; Trivium addresses this by modeling temporal, epistemic, and outcome regrets to provide a falsifiable account of agent failure.

Technique: Proposes long-horizon temporal regret to measure persistence of miscalibration, epistemic regret to quantify model uncertainty, and integrates causal probing with persistence and detectability assumptions.

Pipeline: Episode stream → causal probing and persistence tracking → revision of external causal model with budgeted probes → reduced temporal/epistemic regret

Methodology: Theoretical proofs under causal-probing, persistence, and detectability assumptions; empirical validation on CausalBench-Seq and real-LLM streams with pre-registered predictions.

Results: On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly; a pilot real-LLM stream (one full E = 500 run and three E = 100 frontier-model pilots) provides preliminary external-validity evidence.

Limitations: Requires intervention channels for causal distinction; depends on detectable change-points; real-world applicability needs further validation beyond pilot studies.

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Zhangtianyi Chen, Florensia Widjaja, Wufei Dai, Xiangjun Zhang, Yuhao Shen, Juexiao Zhou

ArXiv ID: 2606.04494

Abstract:

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Theta(N / (h * m_bar)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m_bar (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than increasingly larger prompt-level tool retrieval.
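A small worked example may make the compression ratio Theta(N / (h * m_bar)) concrete. The numbers below are hypothetical and the function drops the constant factors that Theta notation hides, so the output is an order-of-magnitude illustration rather than a figure from the paper.

```python
def compression_ratio(total_tools, horizon, avg_candidates):
    """N / (h * m_bar): flat tool inventory size over scaffold context size."""
    if horizon <= 0 or avg_candidates <= 0:
        raise ValueError("horizon and avg_candidates must be positive")
    return total_tools / (horizon * avg_candidates)

# Hypothetical setting: a 5-step workflow with about 8 candidate tools per operation.
H, M_BAR = 5, 8
for n_tools in (2_000, 20_000):
    scaffold_size = H * M_BAR  # tool descriptions the planner actually sees
    ratio = compression_ratio(n_tools, H, M_BAR)
    print(f"N={n_tools}: scaffold holds {scaffold_size} tools, ratio ~{ratio:.0f}x")
```

The scaffold size depends only on h and m_bar, so growing the inventory tenfold leaves the planner's context unchanged and raises the ratio tenfold, which is the decoupling the abstract describes.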

Insights

Contribution: Introduces BioManus, an MCP-native biomedical agent system that decouples planning complexity from tool inventory size using structured graph planning, achieving improved execution accuracy and context efficiency.

Core Idea: Shifts from prompt-based tool retrieval to graph-scaffolded planning over structured biological capabilities, enabling scalable biomedical reasoning through typed heterogeneous MCP graphs.

Technique: Develops the BioinfoMCP Compiler to standardize heterogeneous bioinformatics tools as MCP servers, then organizes them into a typed heterogeneous graph for compact subgraph retrieval and operation-level workflow synthesis.

Pipeline: Biological task input → retrieve task-specific subgraph from MCP graph → synthesize operation-level workflow scaffold → execute via standardized MCP servers

Methodology: Converts bioinformatics tools into MCP servers using the BioinfoMCP Compiler, constructs a typed heterogeneous graph over tools/operations/datatypes, and employs high-recall retrieval for context compression during planning.

Results: The abstract reports improved execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines on BioAgentBench and LAB-Bench, but gives no specific figures.

Limitations: Depends on high-recall retrieval quality; graph construction requires initial tool standardization effort; may face scalability challenges with extremely large tool ecosystems.

Computer Vision

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG) 5 Jun 2026
VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

ArXiv ID: 2606.04244

Abstract:

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

Insights

Contribution: VAMPS introduces a benchmark for evaluating multimodal models' ability to solve mathematical problems using visual aids, highlighting the gap between analytical reasoning and tool-assisted visualization in real-world workflows.

Core Idea: The benchmark reveals that despite the prevalence of visualization in scientific workflows, models often perform better through direct analytical solving rather than leveraging graph-based reasoning, even when plotting is a natural strategy.

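Technique: VAMPS comprises 1,168 multimodal, bilingual multiple-choice questions drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants; it tests whether models benefit from constructing a graph and grounding their answer in it.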

Pipeline: Math problem input → model generates graph visualization → model uses graph insights to select answer from multiple choices.

Methodology: The benchmark was created by curating exam problems, generating synthetic variants with LLMs, and testing models on their ability to utilize visual reasoning versus analytical solving.

Results: Across diverse models, direct analytical solving outperformed tool-enabled visual solving, even on problems where plotting naturally reveals solutions like intersections or asymptotes.

Limitations: The benchmark focuses on algebra/calculus and Iranian exam problems, limiting generalizability; open questions include improving visual reasoning capabilities and expanding to other domains.

Computing Systems

4/5 Artificial Intelligence (cs.AI), Hardware Architecture (cs.AR), Computation and Language (cs.CL) 5 Jun 2026
StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee

ArXiv ID: 2606.04246

Abstract:

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

Insights

Contribution: StepPRM-RTL introduces a novel framework combining stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance functional correctness and reasoning fidelity in LLM-based RTL code generation.

Core Idea: The framework leverages stepwise reasoning trajectories with dense PRM feedback and Monte Carlo Tree Search (MCTS) to guide LLM training, enabling improved long-horizon reasoning and interpretable code generation for RTL synthesis.

Technique: StepPRM-RTL integrates stepwise trajectory construction from canonical solutions, PRM for intermediate step evaluation, and MCTS-driven trajectory exploration with RAFT fine-tuning to optimize both process and outcome rewards.

Pipeline: Design specification → Stepwise reasoning trajectory generation with PRM feedback and MCTS exploration → Retrieval-augmented fine-tuning → Optimized RTL code output

Methodology: Canonical solutions are decomposed into stepwise trajectories with rationales and code modifications. PRM provides dense rewards during RAFT fine-tuning, while MCTS expands the training dataset with alternative reasoning paths. Ablation studies validate component effectiveness.

Results: Outperforms prior methods by >10% in functional correctness and reasoning fidelity metrics on Verilog/VHDL benchmarks; generalizes across RTL languages with high-fidelity, interpretable code generation.

Limitations: Dependence on high-quality canonical solutions for trajectory construction; computational overhead from MCTS exploration; untested on non-Verilog/VHDL RTL variants beyond the evaluated scope.

Human-Computer Interaction

3/5 Artificial Intelligence (cs.AI), Human-Computer Interaction (cs.HC) 5 Jun 2026
Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection

Yaoxi Shi, Cathy Mengying Fang, Pattie Maes, Amit Goldenberg

ArXiv ID: 2606.04150

Abstract:

Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that this picture is inaccurate on two accounts, both in how AI emotional support arises and how it shapes future behavior. First, AI emotional support commonly emerges incidentally within task-oriented interactions on general-purpose platforms, much as workplace friendships deepen through collaboration. Second, these incidental encounters are path-dependent: positive experiences of AI emotional support update people's beliefs about AI's emotional capabilities and redirect their choices for future emotional support, increasing preference for AI and decreasing preference for humans. We review recent evidence, including a large-scale longitudinal study conducted in collaboration with OpenAI, showing that daily five-minute conversations with an AI about personal issues over 28 days led to a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI. These findings suggest that current policy, focused on companion apps and isolated interactions, cannot adequately protect human connection. Instead, effective regulations should extend to general-purpose AI systems and address cumulative, trajectory-level changes in how people seek support. Recognizing how people stumble into AI emotional support and how those encounters redirect human connections over time is essential to safeguarding human well-being.

Insights

Contribution: This paper challenges the assumption that AI emotional support is a deliberate choice, revealing it as an incidental byproduct of routine interactions that reshapes human connection over time.

Core Idea: Incidental AI emotional support during task-oriented interactions creates path-dependent shifts in user preferences, increasing reliance on AI and decreasing preference for human support.

Technique: The paper reviews recent empirical evidence, including a large-scale longitudinal study in which participants held daily five-minute conversations with an AI about personal issues over 28 days.

Pipeline: daily AI conversations → longitudinal tracking of emotional support preferences → quantified shifts in preference for human versus AI support

Methodology: An argument-driven review: the authors draw on emerging evidence, including the longitudinal study conducted in collaboration with OpenAI, to argue that AI emotional support arises incidentally and is path-dependent.

Results: The 28-day study showed a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI.

Limitations: Findings may not generalize across different AI platforms or cultural contexts, and long-term trajectory effects beyond 28 days remain unexplored.

LLM

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

Yuhan Yang, Ruipu Li, Alexander Rodríguez

ArXiv ID: 2606.04505

Authors: Yuhan Yang, Ruipu Li, Alexander Rodr\'iguez

Abstract:

Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black-box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework for executable scientific simulators. Unlike prior neuro-symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence-grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high-stakes domains and show that it improves mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability.

Insights

Contribution: Introduces MechSim, a mechanism-grounded neuro-symbolic framework that enables LLMs to reason about scientific simulators' assumptions, variables, and execution behavior, improving transparency and decision reliability in simulation-driven systems.

Core Idea: MechSim represents simulators via a structured schema capturing their mechanistic dependencies and execution traces, allowing LLM agents to generate evidence-based explanations linking outcomes to underlying mechanisms.

Technique: A shared structured schema encodes simulator assumptions, variables, and dependencies, with LLMs operating as constrained reasoning engines to produce mechanism-aware explanations and decisions.

Pipeline: Scientific simulator + query → schema representation of assumptions/variables → LLM reasoning over structured dependencies → mechanism-grounded explanations and decisions

Methodology: Evaluated across multiple high-stakes domains on mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability; the abstract does not specify the metrics or baselines.

Results: Improved mechanism-level explanation quality, enhanced simulator analysis capabilities, and more reliable downstream decisions in high-stakes scenarios.

Limitations: Not explicitly discussed in the abstract; potential open questions include scalability to complex simulators and generalization across diverse scientific domains.

MLOps

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), Emerging Technologies (cs.ET), Machine Learning (cs.LG) 5 Jun 2026
Can Generalist Agents Automate Data Curation?

Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

ArXiv ID: 2606.04261

Abstract:

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution-research gap: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
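The propose, implement, evaluate, revise loop described above can be sketched as a toy program. Everything here is a hypothetical stand-in: the dataset is a list of synthetic quality scores, the policy is a single threshold, and `fixed_pipeline` is a deterministic scoring function rather than a real training and evaluation run. It is not the Curation-Bench interface.

```python
import random

def make_dataset(n=200, seed=0):
    """Synthetic dataset: each value is a hypothetical per-example quality score."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def policy(data, threshold):
    """Data-selection policy: keep examples whose quality meets the threshold."""
    return [q for q in data if q >= threshold]

def fixed_pipeline(selected, min_size=40):
    """Stand-in for the fixed train/eval pipeline: rewards quality, penalizes tiny sets."""
    if not selected:
        return 0.0
    mean_quality = sum(selected) / len(selected)
    coverage = min(1.0, len(selected) / min_size)
    return mean_quality * coverage

def curation_loop(data, iterations=10, start=0.1, step=0.1):
    """Propose a nearby threshold, evaluate it, keep it if better, else shrink the step."""
    best_t = start
    best_score = fixed_pipeline(policy(data, best_t))
    history = [(best_t, best_score)]
    for _ in range(iterations):
        candidate = min(1.0, best_t + step)                   # propose
        score = fixed_pipeline(policy(data, candidate))       # implement and evaluate
        history.append((candidate, score))
        if score > best_score:                                # revise
            best_t, best_score = candidate, score
        else:
            step /= 2
    return best_t, best_score, history

data = make_dataset()
best_t, best_score, history = curation_loop(data)
print(f"best threshold={best_t:.3f}, score={best_score:.3f}, evaluations={len(history)}")
```

The loop only ever adjusts one parameter of one policy family, which mirrors the local-variant tuning the abstract identifies as the execution-research gap; exploring a different policy family would require changing `policy` itself.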

Insights

Contribution: The paper introduces Curation-Bench, an agent-centric benchmark to evaluate generalist agents in automating data curation, revealing both their potential and limitations in method adaptation.

Core Idea: Generalist agents can execute data-curation loops but struggle with exploratory research without scaffolding, requiring structured guidance to outperform existing baselines.

Technique: Curation-Bench fixes models and evaluation pipelines while granting agents command-line access to implement, test, and refine data policies iteratively.

Pipeline: Raw data → agent inspects data, implements policy → submits to fixed training/evaluation pipeline → receives feedback → revises policy

Methodology: The study evaluates agents in vision-language instruction-tuning tasks using open-ended prompting and scaffolded method adaptation with prior research references.

Results: Out-of-the-box agents reach strong published data-selection baselines within ten iterations but exhibit an execution-research gap; the scaffolded agent composes a policy that outperforms those baselines at one-tenth their data budget.

Limitations: Agents mainly tune local policy variants and need scaffolding to explore new policy families, so reliable data research is not yet achievable through open-ended prompting alone.

NLP

3/5 Artificial Intelligence (cs.AI), Computers and Society (cs.CY) 5 Jun 2026
Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research

Clarisse de Souza, Gabriel Barbosa, Simone Diniz Junqueira Barbosa, Bárbara Betts, Renato Cerqueira, Juliana Jansen Ferreira

ArXiv ID: 2606.04152

Authors: Clarisse de Souza, Gabriel Barbosa, Simone Diniz Junqueira Barbosa, B\'arbara Betts, Renato Cerqueira, Juliana Jansen Ferreira

Abstract:

Large language models are reshaping research practice while quietly eroding researchers' epistemic accountability. This commentary introduces PEEL (Protocols for Epistemically Engaged Literacy in AI), a working scaffolding that combines deterministic distant reading via Voyant Tools with LLM interpretation via Claude, grounded in Peircean semiotics and abductive reasoning. Applied to AI-generated condensations of three source texts, PEEL reveals systematic distortions in quantity, term frequency, and epistemic voice that are invisible without non-AI measurement, and yields three design implications: deterministic instruments must accompany AI tools; fluency is not fidelity; epistemic authority must be designed in, not assumed.

Insights

Contribution: Introduces PEEL, a semiotic scaffolding combining deterministic and AI-driven methods to enhance epistemic accountability in AI-enabled research by revealing systematic distortions in AI-generated text condensations.

Core Idea: PEEL leverages Peircean semiotics and abductive reasoning to bridge gaps between deterministic text analysis and AI interpretation, ensuring transparency in AI-assisted research practices.

Technique: Combines Voyant Tools for deterministic distant reading with Claude's LLM interpretation, analyzing term frequency, quantity, and epistemic voice in AI-generated condensations.

Pipeline: Source texts → Voyant Tools (distant reading) and Claude (LLM interpretation) → Identification of distortions in quantity, term frequency, and epistemic voice.

Methodology: Applied PEEL to AI-generated condensations of three source texts, using non-AI measurement to detect distortions and derive design implications for AI research practices.

Results: Systematic distortions in quantity, term frequency, and epistemic voice were identified; these are invisible without non-AI measurement.

Limitations: Depends on specific tools (Voyant, Claude); findings may not generalize across all AI models or text domains; requires manual validation for non-AI measurement.

3/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Characterizing initial human-AI proof formalization workflows

Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Sam Looi, Joshua B. Tenenbaum, Umang Bhatt, Adrian Weller, Jose Hernandez-Orallo, Cameron E. Freer, Valerie Chen, Ilia Sucholutsky

ArXiv ID: 2606.04273

Abstract:

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.

Insights

Contribution: This work provides a mixed-methods analysis of how people integrate AI into proof formalization workflows, revealing diverse preferences for AI assistance that preserves human control and showing higher formalization accuracy with AI tools despite their limitations.

Core Idea: Human-AI proof formalization workflows involve flexible tool adaptation and a balance between AI-generated suggestions and human oversight, with participants achieving higher accuracy when using AI despite current tool limitations.

Technique: Combined qualitative surveys and controlled user studies to analyze human preferences, barriers, and practical AI tool usage in formalizing mathematical proofs across varying domains and difficulty levels.

Pipeline: Informal math problems → Human/AI-assisted formalization process → Formalized proofs with accuracy metrics and qualitative feedback

Methodology: Mixed-methods approach combining qualitative surveys to identify user preferences and barriers with a controlled user study measuring formalization accuracy with/without AI tools across diverse mathematical problems.

Results: Participants tended to achieve higher formalization accuracy with AI tools than when formalizing on their own, most used multiple AI tools flexibly, and the survey revealed a general desire for AI assistance that maintains human control over proof discovery.

Limitations: Limited by the autoformalization capabilities of available tools at the time, and the study focused on initial integration stages rather than long-term workflow evolution or scalability across larger formalization tasks.

PDF

Robotics

4/5 Artificial Intelligence (cs.AI) 5 Jun 2026
Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Michał Wawer, Jarosław A. Chudziak

Abstract

ArXiv ID: 2606.04223

Authors: Michał Wawer, Jarosław A. Chudziak

Abstract:

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.

Insights

Contribution: Introduces a knowledge-representation framework that leverages reasoning-trace disagreement as a strategic signal for multi-agent systems in value-laden tasks, moving beyond traditional consensus-based approaches.

Core Idea: Disagreement among agents in normatively complex tasks may reflect genuine normative uncertainty rather than agent error, and structuring this disagreement into symbolic states supports strategic reasoning and routing.

Technique: Abstracts agent reasoning traces and binary decisions into four symbolic disagreement states (convergent agreement, divergent agreement, convergent disagreement, divergent disagreement) to inform defeasible routing rules.

Pipeline: Reasoning traces and decisions → Categorization into symbolic disagreement states → Defeasible routing rules for strategic agent coordination

Methodology: Analyzes reasoning-trace patterns in content moderation scenarios, formalizes disagreement states, and demonstrates their application to bridge sub-symbolic LLM deliberation with symbolic knowledge representation.

Results: The framework is instantiated in content moderation, where the authors argue that disagreement-aware routing bridges sub-symbolic LLM deliberation and symbolic knowledge representation; the abstract reports no quantitative evaluation.

Limitations: Relies on explicit reasoning traces from agents, which may not be available in all systems; applicability to non-value-laden tasks remains unexplored.
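The four disagreement states named under Technique can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the authors' implementation: it assumes "convergent" means similar reasoning traces and "agreement" means identical binary decisions, and the Jaccard token overlap, the 0.5 threshold, and every route name are hypothetical stand-ins.

```python
from enum import Enum


class DisagreementState(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"        # similar reasoning, same decision
    DIVERGENT_AGREEMENT = "divergent agreement"          # different reasoning, same decision
    CONVERGENT_DISAGREEMENT = "convergent disagreement"  # similar reasoning, different decision
    DIVERGENT_DISAGREEMENT = "divergent disagreement"    # different reasoning, different decision


def jaccard(trace_a: str, trace_b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude placeholder for a real trace-similarity measure."""
    a, b = set(trace_a.lower().split()), set(trace_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(trace_a: str, decision_a: bool, trace_b: str, decision_b: bool,
             threshold: float = 0.5) -> DisagreementState:
    """Map two (trace, binary decision) pairs onto one of the four symbolic states."""
    similar = jaccard(trace_a, trace_b) >= threshold  # threshold is an assumed value
    agree = decision_a == decision_b
    if agree:
        return (DisagreementState.CONVERGENT_AGREEMENT if similar
                else DisagreementState.DIVERGENT_AGREEMENT)
    return (DisagreementState.CONVERGENT_DISAGREEMENT if similar
            else DisagreementState.DIVERGENT_DISAGREEMENT)


# Hypothetical default routes; the paper's actual routing rules are not given in the abstract.
DEFAULT_ROUTES = {
    DisagreementState.CONVERGENT_AGREEMENT: "auto_resolve",
    DisagreementState.DIVERGENT_AGREEMENT: "accept_and_log_rationales",
    DisagreementState.CONVERGENT_DISAGREEMENT: "escalate_to_human",
    DisagreementState.DIVERGENT_DISAGREEMENT: "request_further_deliberation",
}


def route(state: DisagreementState, overrides=None) -> str:
    """Defeasible routing: the default applies unless a more specific override defeats it."""
    if overrides and state in overrides:
        return overrides[state]
    return DEFAULT_ROUTES[state]
```

A caller would pass two agents' traces and decisions to `classify` and hand the resulting state to `route`; the `overrides` argument models defeasibility by letting a context-specific rule replace the default.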

PDF

Tech News

AI Safety

Reddit r/ArtificialIntelligence 2026-06-05
Sam, Dario, and Demis Hassabis have signed a joint open letter calling for Law Protecting against Biological Weapons.

Sam Altman (OpenAI), Dario Amodei (Anthropic), and Demis Hassabis (DeepMind) joined other AI leaders in signing an open letter urging Congress to implement safeguards for synthetic DNA/RNA ordering, aiming to prevent misuse in biotechnology. The initiative highlights concerns about dual-use risks in AI-driven biotech advancements.

Reddit r/ArtificialIntelligence 2026-06-05
anthropic wants a global ai freeze. they're also about to ipo at $1 trillion.

Anthropic advocates for a global pause on frontier AI development, citing safety risks, while simultaneously preparing for a $1 trillion IPO. Critics argue the move may aim to consolidate market dominance and influence regulations to create barriers for competitors.

Reddit r/ArtificialIntelligence 2026-06-05
Anthropic president cites high capital needs as key motive for IPO - calls for pause to AI development

Anthropic's president highlights the need for significant capital as a primary reason for the company's IPO, while also advocating for a pause in AI development to address safety concerns. The statement underscores financial and ethical considerations in advancing AI technology.

Reddit r/ArtificialIntelligence 2026-06-05
What is the worst thing you can imagine yourself doing to someone else with jailbroken A

A Reddit post explores the ethical risks of unregulated agentic AI through a thought experiment, urging readers to imagine the worst possible actions an unchained AI might enable. It highlights concerns about recursive self-improvement, open-source LLMs, and the need for public discourse on AI safety.

Agentic AI

Reddit r/ArtificialIntelligence 2026-06-05
Creaibo 2.0 beta is open - looking for AI content creators to test and break things

Creaibo 2.0 beta, an AI-powered creative tool for images, video, and content production, seeks beta testers to refine its workflow-focused approach. The tool aims to address the fragmentation of AI tools by maintaining creative context across projects via its core AI assistant, Cora.

Reddit r/ArtificialIntelligence 2026-06-04
Cloudflare warns bot and agentic traffic has overtaken human web traffic

Cloudflare reports that bot and agentic traffic now exceeds human-generated web traffic, highlighting a significant shift in online activity driven by AI systems. This trend underscores the growing influence of automated agents and bots in shaping internet behavior and infrastructure demands.

Computer Vision

Hacker News Fri, 05 Ju
Delacroix's Entry of the Crusaders into Constantinople Restored

The Louvre has restored Delacroix's painting 'Entry of the Crusaders into Constantinople' using advanced imaging and conservation techniques to revive its original vibrancy. The process involved analyzing pigments and historical data to reconstruct faded details.

Reddit r/ArtificialIntelligence 2026-06-05
Anyone else just sticking to Nano Banana 2 + Kling 3.0 on Artlist?

A Reddit user discusses their preference for using Nano Banana 2 and Kling 3.0 within the Artlist AI Toolkit for image and video editing, questioning whether others utilize alternative models like Veo 3.1 or the toolkit's voiceover tools. They seek insights on model effectiveness and user experiences.

Reddit r/ArtificialIntelligence 2026-06-05
Horus Image Generation is here! 🤩📷

TokenAI, based in Egypt, has launched Horus Lens 1.0, a text-to-image generation model, marking Egypt's entry into the global AI race. The model is part of the broader Horus family and highlights regional advancements in AI technology.

Computing Systems

Hacker News Fri, 05 Ju
C++: The Documentary

A documentary exploring the history, evolution, and impact of the C++ programming language is released, featuring insights from its creators and developers. The content discusses its role in modern software engineering and systems programming.

Hacker News Fri, 05 Ju
Meta enables ADB on deprecated Portal devices [video]

Meta has enabled Android Debug Bridge (ADB) access on its deprecated Portal devices, allowing developers to debug and modify the hardware despite its official discontinuation. The move may facilitate experimentation or repurposing of the older devices.

Hacker News Fri, 05 Ju
Azure Linux 4.0 is Microsoft's first general-purpose Linux

Microsoft has released Azure Linux 4.0, positioning it as the company's first general-purpose Linux distribution, tailored for Azure environments. The update emphasizes compatibility, performance, and integration with Microsoft cloud services, aiming to streamline development and deployment workflows.

Hacker News Fri, 05 Ju
ESP32 Bit Pirate, a Hardware Hacking Tool with WebCLI That Speaks Every Protocol

The ESP32 Bit Pirate is a hardware hacking tool featuring a WebCLI interface capable of interacting with various communication protocols. It is designed for protocol analysis, IoT device manipulation, and embedded system experimentation.

Hacker News Fri, 05 Ju
Linear Cosine Palettes (2025)

The article discusses 'Linear Cosine Palettes,' a technique involving mathematical modeling with cosine functions, potentially related to color gradients or signal processing. The exact application is unclear without full content, but it appears to explore computational methods.

General

Hacker News Fri, 05 Ju
Magenta RealTime 2: Open and Local Live Music Models

Magenta RealTime 2 introduces open-source, locally executable models for real-time live music generation, enabling developers and musicians to create AI-driven music without cloud dependencies. The update emphasizes improved performance and accessibility for creative applications.

Reddit r/MachineLearning 2026-06-04
[R] Measuring the Symmetry--Data Exchange Rate

A paper rigorously measures how equivariance in geometric deep learning reduces sample complexity, finding a symmetry-data exchange rate of ~1.28. It also shows that using incorrect symmetries harms performance and proves augmentation with orbit averaging achieves equivariance. The study emphasizes methodological transparency and controlled experiments.
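The orbit-averaging claim rests on a general symmetrization fact that is easy to check on a toy case. The sketch below is not the paper's setup or proof: it assumes the cyclic group of shifts acting on lists, and `f` is an arbitrary hypothetical map chosen to have no symmetry of its own. Averaging g⁻¹·f(g·x) over the whole group gives a map that commutes with every shift.

```python
def rotate(x, k):
    """Cyclic shift of list x by k positions to the right (action of the cyclic group on lists)."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)


def f(x):
    """A hypothetical map with no built-in symmetry: the weights depend on position."""
    n = len(x)
    return [(i + 1) * x[i] + x[(i + 1) % n] ** 2 for i in range(n)]


def orbit_average(fn, x):
    """F(x) = (1/|G|) * sum over shifts g of g^{-1} . fn(g . x)."""
    n = len(x)
    total = [0.0] * n
    for k in range(n):
        # Apply the shift, evaluate fn, then undo the shift on the output.
        y = rotate(fn(rotate(x, k)), -k)
        total = [t + v for t, v in zip(total, y)]
    return [t / n for t in total]


def is_equivariant(fn, x, tol=1e-9):
    """Check fn(g . x) == g . fn(x) for every cyclic shift g, up to tol."""
    for k in range(len(x)):
        lhs = fn(rotate(x, k))
        rhs = rotate(fn(x), k)
        if any(abs(a - b) > tol for a, b in zip(lhs, rhs)):
            return False
    return True


x = [1.0, 2.0, 4.0, 7.0]
print("f equivariant:", is_equivariant(f, x))
print("orbit-averaged f equivariant:", is_equivariant(lambda v: orbit_average(f, v), x))
```

The check passes for the averaged map because substituting g' = g·h in the sum turns F(h·x) into h·F(x); the cost is one evaluation of `f` per group element.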

Reddit r/ArtificialIntelligence 2026-06-05
Trying to automate too early made my workflows worse, not better

The author shares how premature automation of workflows like lead scoring led to increased complexity, revealing that unclear manual processes were the root issue. They found simplifying the manual logic first made automation more effective and stable.

Reddit r/ArtificialIntelligence 2026-06-04
What AI skill will still matter when everyone has access to AI?

The Reddit thread discusses which AI skills will remain valuable as AI tools become ubiquitous, emphasizing judgment, verification, domain expertise, and workflow design over mere technical execution. Contributors suggest that guiding AI effectively and discerning quality outputs will differentiate individuals in the future.

Reddit r/DeepLearning 2026-06-05
A Blog Post I Wrote On Backward Pass For Matrix Multiplication

A Reddit user shares a blog post deriving the backward pass for matrix multiplication using intuitive multivariable calculus principles, aiming to demystify matrix calculus in deep learning. The post emphasizes that matrix operations in neural networks can be understood through foundational calculus rules.
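For reference, the standard result such a derivation arrives at is that for Y = XW with upstream gradient G = ∂L/∂Y, the gradients are ∂L/∂X = G Wᵀ and ∂L/∂W = Xᵀ G. The sketch below is not taken from the blog post: it checks the two formulas against central finite differences in plain Python, using a toy loss L = ½‖Y‖² (so that G = Y) and helper names that are all assumptions.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul_backward(X, W, G):
    """Given G = dL/dY for Y = X @ W, return (dL/dX, dL/dW)."""
    return matmul(G, transpose(W)), matmul(transpose(X), G)


def loss(X, W):
    """Toy scalar loss L = 0.5 * sum(Y**2), chosen so that dL/dY = Y."""
    Y = matmul(X, W)
    return 0.5 * sum(y * y for row in Y for y in row)


def numeric_grad(fn, M, eps=1e-6):
    """Central finite differences of the scalar fn() with respect to each entry of M."""
    grad = [[0.0] * len(M[0]) for _ in M]
    for i in range(len(M)):
        for j in range(len(M[0])):
            old = M[i][j]
            M[i][j] = old + eps
            up = fn()
            M[i][j] = old - eps
            down = fn()
            M[i][j] = old  # restore the entry before moving on
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # shape 2x3
W = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.25]]   # shape 3x2
G = matmul(X, W)                               # dL/dY for the toy loss
dX, dW = matmul_backward(X, W, G)
print("max error in dX:", max_abs_diff(dX, numeric_grad(lambda: loss(X, W), X)))
print("max error in dW:", max_abs_diff(dW, numeric_grad(lambda: loss(X, W), W)))
```

The shapes serve as a quick sanity check: G Wᵀ has the shape of X and Xᵀ G has the shape of W, which is the only way the products can be arranged.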

LLM

Hacker News Fri, 05 Ju
Fine-tuning an LLM to write docs like it's 1995

This article explores fine-tuning a large language model (LLM) to generate technical documentation in the style of 1995, reflecting older software development practices and writing conventions. The approach highlights how LLMs can be adapted to mimic historical contexts or specific domains.

Reddit r/MachineLearning 2026-06-05
Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? [d]

A user asks if using OpenAI API outputs to create a code dataset or benchmark for a Python library violates OpenAI's terms, specifically in scenarios involving fine-tuning an open-source model versus using the dataset solely for evaluation. The post seeks community insights on legal boundaries and practical precedents.

Reddit r/ArtificialIntelligence 2026-06-05
OpenAI gives free daily tokens if you do this

OpenAI offers free daily API tokens (up to 2.5M for lighter models) through its data sharing program, requiring users to share their prompts/outputs for model training. The program is ongoing but underutilized by most users.

Reddit r/ArtificialIntelligence 2026-06-05
CMA Orders Google AI Search Opt-Out for Publishers

The UK's CMA mandates Google to allow publishers to opt out of AI training data usage for its search features, separating content display rights from data rights. This includes phased implementation by 2026-2027, anti-retaliation clauses, and deferred licensing terms, marking a significant regulatory shift in AI data governance.

Reddit r/DeepLearning 2026-06-05
[Tutorial] Getting Started with Unsloth Studio

Unsloth Studio, a new UI-based application by Unsloth.ai, allows users to chat with and train language models, load GGUF models from Hugging Face, and fine-tune models with minimal effort. This tutorial guides users through key aspects of getting started with the platform.

MLOps

Hacker News Fri, 05 Ju
Changing How We Develop Ladybird

Ladybird discusses iterative improvements to their development workflow, emphasizing feedback loops, collaboration, and refining AI systems through continuous experimentation.

Hacker News Fri, 05 Ju
Open Code Review – An AI-powered code review CLI tool

Open Code Review is an AI-powered command-line tool designed to automate code reviews, leveraging machine learning to analyze code quality and suggest improvements. The tool is open-source and has sparked discussions on Hacker News about its effectiveness and potential use cases in development workflows.

NLP

Hacker News Thu, 04 Ju
Do transformers need three projections? Systematic study of QKV variants

This paper investigates whether transformers require three separate projections (query, key, value) in their attention mechanisms by systematically analyzing alternative QKV variants. The study evaluates performance trade-offs and architectural simplicity across different configurations.

Reddit r/ArtificialIntelligence 2026-06-05
I am now negotiating with AI as part of my job, and it's going like you would expect. How can I circumvent it to speak to a representative?

Insurance claims adjusters report auto lenders are using AI bots to dispute vehicle total loss values with flawed data, forcing manual verification. The AI systems resist human intervention and escalate to automated appraisals, creating inefficiencies.

Reddit r/ArtificialIntelligence 2026-06-05
[OC] UK AI exposure data: clerical workers score 8.5/10 while most professionals score 6.5/10

Analysis of UK occupation data reveals clerical workers (8.5/10 AI exposure) are more vulnerable to AI automation than professionals (6.5/10), due to alignment with LLM capabilities in tasks like communication, classification, and form completion. Physical trades show lower exposure (2.0-2.5/10).

Reddit r/ArtificialIntelligence 2026-06-05
What tools can generate output from two inputs independent of the order?

A user seeks tools that can process two text inputs (e.g., debate sides) and generate consistent outputs regardless of input order, avoiding manual convergence steps. They propose iterative AI feedback but find it slow and costly.

Trending repositories on GitHub filtered and scored for relevance to your interests.

Agentic AI

Computer Vision

LLM

Robotics