Monday, June 15, 2026

Daily Digest 2026-06-15

Today’s research focuses on the transition from isolated chatbots to persistent, autonomous agents, with a heavy emphasis on robust orchestration, safety protocols, and verifiable memory systems.

Research highlights:

Agent Orchestration and Evolution: Research explores multi-modal orchestration, proactive communication policies, and frameworks for evolving agent skills through trajectory auditing and adaptive harnesses.
Safety and Privacy: New methodologies introduce risk-aware causal gating to enforce least-privilege access and local sanitization techniques to ensure privacy-aware data views for agents.
Memory and Reasoning: Developments include version-controlled reasoning logs for replayable memory, streaming evaluations for future-oriented assistance, and latent-space synthesis for parallel workflows.
Trust and Reliability: Studies characterize conditional trust in agent swarms and propose calibration bonuses to bridge the gap between agent reflection and actual performance.
Planning and Architecture: Work explores causal object-centric models for Monte Carlo Tree Search and the shift toward persistent “digital colleagues” over transient chat interfaces.

Tech buzz:

The ecosystem is seeing a surge in infrastructure tools for deploying, caching, and managing large-scale AI workloads.
Open-Source Frameworks: New releases provide accessible pathways for agentic workflows and beginner-friendly AI education.
Infrastructure Optimization: Tools for distributed AI scaling and efficient KV cache management are gaining traction to support high-throughput agentic tasks.

Global Trends

Papers discovered from ArXiv subject categories

AI Safety

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Multiagent Systems (cs.MA) 15 Jun 2026

WorkBench Revisited: Workplace Agents Two Years On

Olly Styles

ArXiv ID: 2606.13715

Abstract:

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

Insights

Contribution: The paper provides a longitudinal evaluation of agentic performance on the WorkBench benchmark, demonstrating significant improvements in both task completion and safety over a two-year period.

Core Idea: The study explores the evolution of frontier models, finding that capability and safety are positively correlated rather than in trade-off, while highlighting the democratization of high-performance agents through open-weight models.

Technique: The authors updated the WorkBench benchmark with improved data and code quality, then re-evaluated frontier models such as Claude Opus 4.8 against the March 2024 results.

Pipeline: WorkBench tasks → Frontier model execution → Performance and safety evaluation → Comparative analysis of progress and costs.

Methodology: A comparative longitudinal study measuring task completion rates and the frequency of unintended harmful actions across different model generations from 2024 to 2026.

Results: Task completion improved from 43% (GPT-4) to 89% (Claude Opus 4.8), while unintended harmful actions dropped from 26% to 2.5%.

Limitations: Despite progress, frontier models still make basic mistakes that can result in irreversible harm, such as sending emails to the wrong recipients.
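
The two headline metrics can be reproduced from per-task outcome records. The sketch below assumes a hypothetical record format (`completed` and `harmful` flags per task); it is not the benchmark's actual schema or scoring code, only an illustration of how the completion rate, harm rate, and relative reduction relate.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    # Hypothetical record; the real WorkBench schema may differ.
    completed: bool  # the task finished correctly
    harmful: bool    # an unintended side effect occurred (e.g. wrong recipient)


def summarize(outcomes):
    """Return (completion_rate, harm_rate) as fractions of all tasks."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to summarize")
    completion = sum(o.completed for o in outcomes) / n
    harm = sum(o.harmful for o in outcomes) / n
    return completion, harm


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Reported harm rates: 26% (GPT-4, 2024) vs 2.5% (Claude Opus 4.8, 2026).
print(f"{relative_reduction(0.26, 0.025):.1%}")  # prints 90.4%
```

Read this way, the drop from 26% to 2.5% is roughly a 90% relative reduction in unintended harmful actions.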

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

ArXiv ID: 2606.13884

Abstract:

Modern decision systems increasingly rely on learned components whose outputs may be confident yet wrong, exposing downstream actions to costly errors. We introduce Risk-Aware Causal Gating (RACG), a framework that decides whether to act on, defer, or abstain from a model's prediction by combining causal effect estimation with calibrated risk control. RACG models the causal pathway from candidate actions to outcomes and gates each decision according to an estimated counterfactual risk rather than raw predictive confidence. To make gating reliable, we derive distribution-free bounds on the probability of acting under high-risk conditions and show how these bounds translate into operating thresholds that satisfy user-specified safety constraints. We further propose an adaptive gating policy that adjusts to distribution shift by monitoring discrepancies between predicted and realized outcomes, tightening the gate when causal assumptions appear violated. Across simulated interventions and real-world decision benchmarks, RACG reduces high-cost errors substantially while preserving most of the utility of an ungated policy, and it outperforms confidence-based and selective-prediction baselines at matched abstention rates. Our results indicate that explicitly separating causal risk from predictive uncertainty yields decision systems that are both safer and more transparent, offering a principled mechanism for trustworthy automation in high-stakes settings.

Insights

Contribution: The paper introduces Risk-Aware Causal Gating (RACG), a framework that prioritizes safety by gating model actions based on estimated counterfactual risk rather than raw predictive confidence. As the title frames it, this serves least-privilege LLM agents by restricting a model's ability to take high-risk actions.

Core Idea: The core idea is to separate predictive uncertainty from causal risk, so that the system defers or abstains when the estimated counterfactual risk of an action is high, even if the model is confident in its prediction.

Technique: The technique combines causal effect estimation with distribution-free risk bounds to create a gating mechanism that adjusts to distribution shifts by monitoring discrepancies between predicted and realized outcomes.

Pipeline: Model Prediction → Causal Pathway Modeling → Counterfactual Risk Estimation → Distribution-Free Bound Calculation → Adaptive Gating Policy → Action (Act, Defer, or Abstain)

Methodology: The authors derive mathematical bounds on the probability of high-risk actions and implement an adaptive policy that tightens safety thresholds when causal assumptions are violated.

Results: RACG substantially reduces high-cost errors while preserving most utility, outperforming confidence-based and selective-prediction baselines at matched abstention rates across simulated and real-world benchmarks.

Limitations: The effectiveness of the framework depends on the accuracy of the underlying causal pathway modeling and the availability of data to monitor realized outcomes for adaptive gating.
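
The act/defer/abstain decision can be illustrated with a generic distribution-free bound. The sketch below is an assumption-laden stand-in, not the paper's derivation: it uses a one-sided Hoeffding bound on observed losses in [0, 1], skips the causal estimation step entirely, and the thresholds `act_below` and `abstain_above` are hypothetical.

```python
import math


def hoeffding_upper_bound(losses, delta=0.05):
    """Upper bound on the mean loss in [0, 1], valid with probability >= 1 - delta."""
    n = len(losses)
    if n == 0:
        return 1.0  # no evidence: assume the worst case
    mean = sum(losses) / n
    return min(1.0, mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def gate(losses, act_below=0.2, abstain_above=0.5, delta=0.05):
    """Return 'act', 'defer', or 'abstain' from the risk bound (illustrative thresholds)."""
    risk = hoeffding_upper_bound(losses, delta)
    if risk <= act_below:
        return "act"
    if risk >= abstain_above:
        return "abstain"
    return "defer"


print(gate([0.0] * 200))  # plenty of clean evidence
print(gate([0.0] * 10))   # same mean, but too little evidence to act
print(gate([]))           # no evidence at all
```

The point of gating on a bound rather than a mean is visible in the second call: the observed losses are identical, but the smaller sample widens the bound and the gate declines to act.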

5/5 Machine Learning (cs.LG), Computation and Language (cs.CL) 15 Jun 2026

Natively Unlearnable Large Language Models

Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan

ArXiv ID: 2606.13873

Abstract:

Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia's ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.

Insights

Contribution: The paper introduces NULLs (Natively Unlearnable LLMs), a model architecture that enables source-specific unlearning without sacrificing the benefits of joint representation learning. It demonstrates that unlearning can be built natively into the training process rather than as a post-hoc correction.

Core Idea: The authors propose a dual-component architecture where shared information is stored in a backbone while source-specific information is isolated in sparsely activated sinks.

Technique: The model utilizes a set of shared backbone neurons and a pool of sparsely activated sinks, allowing for the removal of specific data sources by simply disabling their corresponding sinks at deployment.

Pipeline: Training data (shared and source-specific) → Dual-path training (backbone for shared info, sinks for source-specific info) → Deployment with sink-disabling for unlearning → Output (unlearned model with preserved general capabilities).

Methodology: The researchers trained a model on Wikipedia's ~6M articles, treating each as an independent source, and evaluated the effectiveness of unlearning through knowledge removal, adversarial extraction, and relearning tests.

Results: NULLs successfully removed knowledge specific to individual articles while preserving shared facts, matched the performance of retraining from scratch, resisted adversarial extraction, and maintained general language capabilities on downstream benchmarks.

Limitations: The paper does not extensively explore the computational overhead of maintaining the sink pool or the potential for information leakage between sinks during the joint learning phase.
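
The backbone-plus-sinks idea can be shown with a deliberately tiny model. In the sketch below, scalar weights stand in for neurons, and the class name, the `active_sources` routing argument, and the weights are all hypothetical; it only demonstrates that disabling a sink removes one source's contribution without touching the shared part.

```python
class ToySinkLayer:
    """Toy layer: output = shared backbone term + contributions from active sinks."""

    def __init__(self, backbone_weight, sink_weights):
        self.backbone_weight = backbone_weight  # shared across all sources
        self.sink_weights = dict(sink_weights)  # source id -> sink weight
        self.disabled = set()                   # sources that were unlearned

    def unlearn(self, source_id):
        # No gradient step and no retained data: just switch the sink off.
        self.disabled.add(source_id)

    def forward(self, x, active_sources):
        out = self.backbone_weight * x
        for source_id in active_sources:
            if source_id in self.disabled:
                continue  # an unlearned source contributes nothing
            out += self.sink_weights.get(source_id, 0.0) * x
        return out


layer = ToySinkLayer(backbone_weight=2.0, sink_weights={"a": 0.5, "b": 1.0})
print(layer.forward(1.0, ["a", "b"]))  # 3.5
layer.unlearn("a")
print(layer.forward(1.0, ["a", "b"]))  # 3.0
```

After `unlearn("a")`, the backbone term and source `b` are unchanged, which mirrors the claim that shared knowledge survives while source-specific knowledge is dropped.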

Agentic AI

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV) 15 Jun 2026

Orchestra-o1: Omnimodal Agent Orchestration

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

ArXiv ID: 2606.13707

Abstract:

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

Insights

Contribution: The paper introduces Orchestra-o1, a scalable omnimodal agent orchestration framework that enables efficient collaboration across heterogeneous modalities like text, image, audio, and video. It also proposes DA-GRPO, a decision-aligned reinforcement learning approach for training omnimodal agents.

Core Idea: The core idea is to move beyond single-modality agent swarms by creating a unified orchestration mechanism that can decompose and execute tasks involving diverse, interacting information sources.

Technique: The framework utilizes modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution, optimized via Decision-Aligned Group Relative Policy Optimization (DA-GRPO).

Pipeline: Omnimodal inputs (text, image, audio, video) → Modality-aware task decomposition → Online sub-agent specialization → Parallel sub-task execution → Unified final output

Methodology: The authors developed a scalable orchestration design to handle heterogeneous data and trained an 8B-parameter model, Orchestra-o1-8B, with DA-GRPO, a decision-aligned variant of group relative policy optimization.

Results: Orchestra-o1 surpassed the second-best approach by 10.3% accuracy on the OmniGAIA benchmark and achieved state-of-the-art performance against all existing open-source omnimodal agents.

Limitations: The abstract names no explicit limitations; the reported results center on the OmniGAIA benchmark and an 8B-parameter model.
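
The decomposition and parallel execution steps in the pipeline can be sketched with the standard library. Everything named below (`SPECIALISTS`, `decompose`, `orchestrate`) is hypothetical, and the lambdas are placeholders for modality-specific models; the paper's actual orchestration mechanism is not described at this level of detail.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialists; a real system would call modality-specific models.
SPECIALISTS = {
    "text": lambda item: f"text-summary({item})",
    "image": lambda item: f"image-caption({item})",
    "audio": lambda item: f"audio-transcript({item})",
    "video": lambda item: f"video-description({item})",
}


def decompose(inputs):
    """Split a task into one sub-task per (modality, item) pair with a known specialist."""
    return [(modality, item) for modality, item in inputs if modality in SPECIALISTS]


def orchestrate(inputs):
    """Run sub-tasks in parallel and merge the results in input order."""
    subtasks = decompose(inputs)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves the order of `subtasks` in its results.
        results = list(pool.map(lambda st: SPECIALISTS[st[0]](st[1]), subtasks))
    return " | ".join(results)


print(orchestrate([("text", "question"), ("image", "chart.png")]))
```

Threads are used here only because the placeholder work is trivial; calls to remote models would be I/O-bound and fit the same pattern.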

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 15 Jun 2026

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Derek Li, Ying Wei, Bryan Dai

ArXiv ID: 2606.13710

Abstract:

Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.

Insights

Contribution: The paper introduces the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which enables AI agents to autonomously evolve their capabilities for open-ended research tasks by bridging deep research and agent evolution.

Core Idea: The core idea is to move beyond static parametric capabilities by creating a collaborative evolutionary loop where three distinct agent roles (proposer, solver, and judge) improve simultaneously based on web-scale knowledge.

Technique: The framework utilizes hybrid-mode reinforcement learning to facilitate the co-evolution of a proposer, solver, and judge in an open-ended environment.

Pipeline: Open-ended research task → Hybrid-mode RL evolution of Proposer, Solver, and Judge modules → Enhanced deep research capabilities

Methodology: HOTE employs a tri-evolutionary approach where a proposer generates tasks, a solver attempts to research them, and a judge evaluates the results, with all three modules evolving concurrently through reinforcement learning.

Results: An 8B model trained via HOTE outperformed the strongest static open 8-32B models and state-of-the-art deep research training methods with lower time overhead across three long-form benchmarks.

Limitations: The paper focuses on the evolution of these three specific modules, leaving open questions regarding the scalability of the tri-evolutionary framework to even more complex multi-agent ecosystems.

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization

Hexuan Yu, Chaoyu Zhang, Heng Jin, Shanghao Shi, Ning Zhang, Y. Thomas Hou, Wenjing Lou

ArXiv ID: 2606.13949

Abstract:

Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a ternary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and the interactive affordances required for reliable agent actions.

Insights

Contribution: The paper introduces MINIM, a trusted local broker that performs privacy-aware minimization of UI states to prevent sensitive data leakage when using LLM-powered autonomous agents.

Core Idea: The core idea is to filter UI observations on the client side by balancing privacy preservation with task necessity, ensuring only relevant information is sent to remote inference servers.

Technique: MINIM utilizes a dual-score representation (sensitivity and necessity) grounded in Contextual Integrity (CI) to implement a ternary disclosure policy for UI elements.

Pipeline: Raw UI State → Local Broker (Sensitivity & Necessity Scoring) → Ternary Disclosure Policy (Keep/Abstract/Remove) → Minimized UI Observation → Remote Inference Server

Methodology: The authors developed a CI-aware objective function that penalizes necessity errors more heavily on high-risk content, optimizing the trade-off between privacy and agent performance.

Results: MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and interactive affordances for reliable agent actions on the WebArena benchmark.

Limitations: The paper does not explicitly detail the computational overhead of local processing or the potential for 'abstraction' to lose subtle but necessary cues for complex edge cases.
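
The ternary disclosure policy can be sketched as a pair of threshold checks over the sensitivity score `s` and necessity score `n`. The thresholds, the element dictionary layout, and the `[REDACTED]` placeholder are assumptions made for illustration; in MINIM the scores are learned, whereas here they are supplied by hand.

```python
def disclose(element, s_high=0.7, n_high=0.5):
    """Ternary policy over a UI element with sensitivity `s` and necessity `n` in [0, 1]."""
    s, n = element["s"], element["n"]
    if n < n_high:
        return None  # remove: not needed for the current task
    if s >= s_high:
        # abstract: keep the role so the agent can still act, hide the value
        return {"role": element["role"], "text": "[REDACTED]"}
    return {"role": element["role"], "text": element["text"]}  # keep as-is


def minimize(ui_state):
    """Apply the policy to every element and drop the removed ones."""
    views = (disclose(e) for e in ui_state)
    return [v for v in views if v is not None]


ui_state = [
    {"role": "button", "text": "Submit", "s": 0.1, "n": 0.9},
    {"role": "textbox", "text": "482913", "s": 0.95, "n": 0.8},
    {"role": "notification", "text": "Mom: call me", "s": 0.9, "n": 0.05},
]
print(minimize(ui_state))
```

In this example the button is kept, the one-time code is abstracted so the field remains actionable, and the unrelated notification never leaves the device.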

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 15 Jun 2026

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Yihan Xia, Taotao Wang

ArXiv ID: 2606.14200

Abstract:

Open platforms increasingly route tasks among heterogeneous LLM agents--differing in base model, scaffold, and tool stack--whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k)--the trust to place in agent i for a task requiring skill k, rather than one score per agent--and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase-diagram analysis answers the first two: conditional trust wins only in a specific regime--high agent heterogeneity, sparse per-skill evidence, and correlated skills--and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime--a small but genuine gain, with the per-skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero-cost Conditional Information Value Test (CIVT) rates GREEN--while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance--we quantify the trade-off.

Insights

Contribution: The paper introduces skill-conditional trust (R(i | k)) as a superior alternative to global reputation scores for heterogeneous agent swarms and identifies a security vulnerability where cross-skill evidence borrowing can be exploited by attackers.

Core Idea: Instead of a single trust score, agents should be evaluated based on specific skills, but the mechanism used to infer trust in low-evidence skills from high-evidence skills creates a 'laundering channel' for malicious actors.

Technique: The authors use a phase-diagram analysis to define the regime where conditional trust is beneficial and introduce a zero-cost Conditional Information Value Test (CIVT) that rates an agent pool; the attack succeeds even on a pool that CIVT rates GREEN.

Pipeline: Heterogeneous agent pool → Skill-specific task routing → Cross-skill evidence borrowing (coupling) → Conditional trust score R(i | k) → Task assignment

Methodology: The study employs a controlled phase-diagram analysis to study the trade-offs between data efficiency and security, followed by empirical testing on a public benchmark of 14 AppWorld agents.

Results: Conditional trust provides a small but genuine gain on a pool of 14 AppWorld agents; however, an attacker with cheap evidence in one skill and none in the target skill can hijack the router, driving routing regret from 0 to 0.94, while the contaminated ungated trust verdict reads -0.06 instead of the honest +0.19.

Limitations: The authors do not claim Sybil-resistance. A zero-evidence gate bounds the attack without eliminating it, and the residual cost is characterized under an explicit budget.
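
The laundering channel is easy to reproduce with a toy estimator. The formula below is an assumption chosen for illustration (a Beta-prior estimate in which evidence from other skills is discounted by the coupling strength `beta`); the paper's actual definition of R(i | k) and of its zero-evidence gate may differ.

```python
def conditional_trust(evidence, skill, beta=0.5, zero_evidence_gate=False, prior=(1.0, 1.0)):
    """Estimate R(i | k) for one agent.

    evidence: dict mapping skill -> (successes, trials) for that agent.
    beta: coupling strength; 0 ignores other skills, 1 pools all skills equally.
    """
    a, b = prior
    own_s, own_n = evidence.get(skill, (0, 0))
    if zero_evidence_gate and own_n == 0:
        return a / (a + b)  # refuse to borrow when the target skill is unseen
    other_s = sum(s for k, (s, n) in evidence.items() if k != skill)
    other_n = sum(n for k, (s, n) in evidence.items() if k != skill)
    return (a + own_s + beta * other_s) / (a + b + own_n + beta * other_n)


attacker = {"easy": (100, 100)}  # cheap evidence in an unrelated skill
honest = {"target": (8, 10)}     # real evidence in the target skill

print(conditional_trust(attacker, "target"))                           # about 0.98
print(conditional_trust(honest, "target"))                             # 0.75
print(conditional_trust(attacker, "target", zero_evidence_gate=True))  # 0.5
```

Without the gate, the attacker outranks the honest agent on a skill it has never demonstrated; with the gate, it falls back to the prior. The gate only covers the zero-evidence case, which is consistent with the abstract's statement that it bounds the attack rather than eliminating it.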

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 15 Jun 2026

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

Yinglun Zhu

ArXiv ID: 2606.14211

Abstract:

LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback -- even for questions they correctly answered -- and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate 44.4% → 7.7%) and task accuracy (e.g., 75.1% → 76.5%) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.

Insights

Contribution: The paper introduces RefGRPO, a method to close the 'reflection gap' where LLM agents mis-assess their own performance despite receiving concrete environment feedback.

Core Idea: By incorporating a free calibration bonus that contrasts an agent's self-reflection with actual outcomes, the model learns to align its internal confidence with external reality.

Technique: The authors augment standard RL algorithms with a calibration bonus and a dynamic coefficient schedule, requiring no external reward models or human annotations.

Pipeline: Agent action → Environment feedback → Agent reflection → Calibration bonus calculation → RefGRPO update → Calibrated agent

Methodology: The methodology uses a contrastive approach to penalize discrepancies between the agent's self-assessment and the ground truth, optimizing for both accuracy and reflection calibration.

Results: Compared with standard RL baselines, RefGRPO reduced the underconfidence rate from 44.4% to 7.7% and improved text-to-SQL accuracy from 75.1% to 76.5% across five benchmarks.

Limitations: The paper focuses on environment-based feedback and does not explicitly detail performance in scenarios where feedback is ambiguous or missing.
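
A calibration bonus of this kind can be sketched as reward shaping followed by a GRPO-style group normalization. The bonus form (an indicator that the reflection matches the outcome), the linear schedule, and the coefficient values are assumptions for illustration; the paper's exact bonus and schedule are not given in the abstract.

```python
import statistics


def bonus_coefficient(step, total_steps, start=0.5, end=0.1):
    """Linear schedule for the bonus weight (hypothetical; the real schedule may differ)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def shaped_rewards(outcomes, reflections, step, total_steps):
    """outcomes/reflections: per-rollout bools (actual success, self-assessed success)."""
    lam = bonus_coefficient(step, total_steps)
    # The bonus needs only the agent's own reflection and the observed outcome.
    return [float(o) + lam * float(o == r) for o, r in zip(outcomes, reflections)]


def group_relative_advantages(rewards):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


outcomes = [True, True, False, False]
reflections = [True, False, False, True]  # rollout 2 is underconfident, rollout 4 overconfident
rewards = shaped_rewards(outcomes, reflections, step=0, total_steps=10)
print(rewards)  # [1.5, 1.0, 0.5, 0.0]
print(group_relative_advantages(rewards))
```

Two rollouts with the same task outcome now receive different rewards depending on whether the reflection was right, which gives the policy a gradient signal for calibration at no extra labeling cost.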

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, Xueqi Cheng

ArXiv ID: 2606.14239

Abstract:

Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards -- signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

Insights

Contribution: The paper introduces SkillAudit, a framework for evolving agent skills without requiring ground-truth feedback, hidden test outcomes, or environment rewards.

Core Idea: The framework uses paired trajectory auditing to isolate the specific impact of a skill by comparing agent behavior with and without the skill on the same task.

Technique: It employs Process-Aligned Contrastive Evaluation (PACE) to map trajectory divergences to diagnostic signals and a structural verifier to ensure task constraint compliance.

Pipeline: Task description and workspace data → Paired trajectory execution (with/without skill) → PACE diagnostic mapping → Refine/Repair edit pipelines → Updated skill document

Methodology: SkillAudit identifies behavioral differences between paired trajectories, routes edits through refinement or repair pipelines based on diagnostic signals, and uses a fixed structural verifier to roll back harmful updates.

Results: SkillAudit achieved a 73.9% average task reward across 89 containerized tasks spanning 8 professional domains, outperforming an agent without skills (40.9%) and the static expert skill (56.7%).

Limitations: The paper does not explicitly detail the computational overhead of running paired trajectories for every iteration or the scalability of the PACE cluster across highly complex, multi-step reasoning tasks.
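
The first step of paired trajectory auditing, isolating what the skill changed, can be approximated with a sequence diff. This sketch uses Python's `difflib` on action names, which is a simplification chosen for illustration; PACE and the structural verifier are not modeled, and the action names are made up.

```python
import difflib


def audit_pair(baseline_actions, skill_actions):
    """Diff two action sequences from the same task, run without and with a skill."""
    matcher = difflib.SequenceMatcher(a=baseline_actions, b=skill_actions, autojunk=False)
    divergences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            divergences.append({
                "kind": tag,  # 'replace', 'delete', or 'insert'
                "without_skill": baseline_actions[i1:i2],
                "with_skill": skill_actions[j1:j2],
            })
    return divergences


baseline = ["open_file", "edit", "submit"]
with_skill = ["open_file", "run_tests", "edit", "submit"]
print(audit_pair(baseline, with_skill))
```

Here the only divergence is the inserted `run_tests` step, so any diagnostic signal can be attributed to the skill passage that prompted it, without reference to external labels.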

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan

ArXiv ID: 2606.14249

Abstract:

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

Insights

Contribution: The paper introduces HarnessX, a foundry for creating composable, adaptive, and evolvable agent harnesses that move beyond static, hand-crafted scaffolding.

Core Idea: Agent performance can be significantly improved by treating the runtime harness (prompts, tools, memory, and control flow) as a dynamic, evolvable component that learns from execution traces.

Technique: HarnessX utilizes a substitution algebra for composing typed primitives and the AEGIS engine, which uses an operational mirror between symbolic adaptation and reinforcement learning to evolve harnesses.

Pipeline: Execution traces → AEGIS multi-agent evolution engine → Updated harness primitives and model training signals

Methodology: The authors developed a framework to assemble harness components via substitution algebra and evaluated it across five diverse benchmarks including ALFWorld, GAIA, and SWE-bench Verified.

Results: HarnessX achieved an average performance gain of +14.5% across benchmarks, with individual gains reaching up to +44.0%, particularly in low-baseline scenarios.

Limitations: The full codebase is not yet available (future release), and the extent of scalability for extremely complex, multi-step reasoning tasks remains to be fully explored.
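
The idea of composing typed harness primitives by substitution can be sketched as follows. This is a minimal illustration, not the HarnessX API: the `Harness` fields, the `substitute` helper, and the type check are all hypothetical names chosen for this sketch, since the codebase has not been released.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


# Hypothetical slot layout; the paper's actual primitive types are not given here.
@dataclass(frozen=True)
class Harness:
    prompt: str
    tools: Tuple[str, ...]
    memory_size: int
    control: Callable[[str], str]


def substitute(harness: Harness, slot: str, value) -> Harness:
    """Return a new harness with one slot swapped, rejecting type mismatches."""
    if slot not in harness.__dataclass_fields__:
        raise KeyError(f"unknown slot: {slot}")
    current = getattr(harness, slot)
    # Callable slots accept any callable; other slots must keep their type.
    ok = callable(value) if callable(current) else isinstance(value, type(current))
    if not ok:
        raise TypeError(f"slot {slot!r} expects {type(current).__name__}")
    return replace(harness, **{slot: value})


base = Harness("Solve the task step by step.", ("search",), 4, lambda obs: obs.strip())
evolved = substitute(base, "tools", ("search", "calculator"))
```

Because the dataclass is frozen, every substitution yields a new harness and leaves the original intact, which makes it easy to compare a candidate against its parent after an evolution step.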

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

Communication Policy Evolution for Proactive LLM Agents

Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

ArXiv ID: 2606.14314

Abstract:

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asymmetry for proactive agents, we set up two complementary settings, User-Agent and Planner-Executor. Experimental results reveal complementary strengths between interaction channels: text-based interaction often facilitates task performance, while structured UI improves agents' response quality and persona compliance. Motivated by that, a hybrid method combines these advantages. We further propose Communication Policy Evolution (CPE), a self-evolution framework for refining communication policies through rollout and prompt-level evolving. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.

Insights

Contribution: The paper formalizes 'Communication Policy' for proactive LLM agents and introduces a self-evolution framework (CPE) to optimize how agents exchange information across different modalities.

Core Idea: To bridge the information gap between users and autonomous agents, communication should be treated as a dynamic policy that balances task performance, response quality, and persona compliance through hybrid text and UI channels.

Technique: The authors propose Communication Policy Evolution (CPE), a framework that refines communication strategies through rollout and prompt-level evolution without requiring model fine-tuning.

Pipeline: User/Planner requirements → Communication Policy selection (Text/UI/Hybrid) → Agent execution → Performance evaluation → CPE feedback loop → Refined Communication Policy

Methodology: The study establishes textual and UI-based policies and evaluates them across diverse environments, personas, and model combinations using User-Agent and Planner-Executor settings.

Results: Text-based interaction often improves task performance, while structured UI improves response quality and persona compliance. A hybrid method combines both strengths, and CPE achieves the best task success across multiple settings using prompt refinement alone.

Limitations: The study focuses on prompt-level evolution without model modification, leaving the potential for architectural changes in communication modules as an open question.
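
A prompt-level evolution loop of the kind CPE describes might look like the sketch below. Everything here is an assumption made for illustration: the candidate rules, the toy `rollout_score` function (which stands in for real agent rollouts), and the greedy toggle-one-rule mutation are not taken from the paper.

```python
# Hypothetical instructions a communication policy prompt might contain.
CANDIDATE_RULES = [
    "ask a clarifying question when the goal is ambiguous",
    "confirm before irreversible actions",
    "offer options as a structured UI form",
    "summarize progress in plain text",
]


def rollout_score(policy: frozenset) -> float:
    """Stand-in for rollouts: toy reward minus a per-rule communication cost."""
    useful = {CANDIDATE_RULES[0], CANDIDATE_RULES[2], CANDIDATE_RULES[3]}
    return len(policy & useful) - 0.4 * len(policy)


def evolve(rounds: int = 2) -> frozenset:
    """Greedy prompt-level evolution: toggle one rule, keep strict improvements."""
    best = frozenset()
    best_score = rollout_score(best)
    for _ in range(rounds):
        for rule in CANDIDATE_RULES:
            candidate = best ^ {rule}  # add the rule if absent, drop it if present
            score = rollout_score(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best


policy = evolve()
```

The model weights never change in this loop; only the set of instructions in the policy does, which mirrors the "without model modification" claim in the abstract.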

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG) 15 Jun 2026

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

ArXiv ID: 2606.14470

Abstract:

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.

Insights

Contribution: The paper introduces GitOfThoughts, a framework that treats LLM reasoning as a version-controlled repository, and provides a rigorous empirical analysis of memory substrates for LLM accuracy.

Core Idea: Reasoning processes should be treated like software code, allowing for replayability, auditing, and merging by storing every thought as a commit in a git-like structure.

Technique: The authors map reasoning trees to git repositories where thoughts are commits, scores are notes, and outcomes are tags, while evaluating five different memory substrates (none, markdown, vector, graph, and git).

Pipeline: Agent reasoning steps → Git commit storage (thoughts, scores, tags) → Retrieval via 'git log' and history analysis → Auditable/mergeable reasoning paths.

Methodology: The researchers conducted pre-registered replications across five memory substrates, two benchmarks, and two model scales to test if specific memory formats improve accuracy on novel problems.

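Results: No memory format reliably improves accuracy on novel problems; gains appear only above a 'copyability threshold' (similarity of roughly 0.8 or higher), where the model retrieves a near-duplicate answer rather than transferring a method. Test-time sampling is the only general lever the authors find.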

Limitations: Current memory substrates fail to facilitate transferable method learning from worked examples, and the primary benefit of GitOfThoughts is auditability and provenance rather than raw accuracy gains.
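
The mapping from reasoning tree to repository can be illustrated with an in-memory stand-in. This sketch does not call git and is not the authors' implementation: `ThoughtRepo`, its methods, and the simplified hashing (parent id plus thought text, without the tree, author, or timestamp that real git commits include) are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    parent: Optional[str]
    thought: str


class ThoughtRepo:
    """In-memory stand-in for a git repository of reasoning steps."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.notes: Dict[str, float] = {}  # commit id -> score
        self.tags: Dict[str, str] = {}     # outcome label -> commit id

    def commit(self, thought: str, score: float, parent: Optional[str] = None) -> str:
        # Content-addressed id derived from the parent id and the thought text.
        cid = hashlib.sha1(f"{parent}\n{thought}".encode()).hexdigest()
        self.commits[cid] = Commit(parent, thought)
        self.notes[cid] = score
        return cid

    def log(self, head: str) -> List[str]:
        """Walk from head back to the root, newest first, like `git log`."""
        out: List[str] = []
        cur: Optional[str] = head
        while cur is not None:
            out.append(self.commits[cur].thought)
            cur = self.commits[cur].parent
        return out


repo = ThoughtRepo()
root = repo.commit("Restate the problem", 0.5)
a = repo.commit("Try induction", 0.8, parent=root)
b = repo.commit("Try brute force", 0.3, parent=root)  # pruned branch stays on record
repo.tags["solved"] = a
```

Here `repo.log(a)` replays the path that led to the tagged outcome, while the lower-scored branch `b` remains available for auditing instead of vanishing with the context window.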

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu

ArXiv ID: 2606.14502

Abstract:

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

Insights

Contribution: The paper conceptualizes the paradigm shift of LLMs from conversational chatbots to 'Digital Colleagues' by defining a transition toward persistent autonomous systems capable of reasoning, memory, and self-improvement.

Core Idea: The transition is defined by two dimensions: moving from 'fast thinking' next-token prediction to deliberate inference-time reasoning, and from ad-hoc tool-calling to persistent 'Workspace + Skill' systems.

Technique: The authors propose the 'Workspace + Skill' paradigm, which utilizes persistent workspaces, reusable procedures, verification loops, and state-action-observation trajectories to enable colleague-like behavior.

Pipeline: User Task/Goal → Thinking LLM (Reasoning & Reflection) → Workspace + Skill Execution (State Persistence & Tool Use) → Verification Loop → Task Closure & Experience Reuse

Methodology: The research analyzes the shift in data construction from instruction-response pairs to state-action-observation trajectories and evaluates the transition from static benchmarks to sandboxed, auditable ecosystems.

Results: The framework establishes a roadmap for moving beyond episodic tool use toward persistent work, emphasizing the importance of state persistence, process supervision, and self-evolving AI ecosystems.

Limitations: The paper focuses on conceptualizing the paradigm shift and architectural requirements, leaving specific implementation details of the 'OpenClaw' workstation and long-term governance challenges as areas for further exploration.

5/5 Artificial Intelligence (cs.AI) 15 Jun 2026

StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance

Guanming Liu, Yuqi Ren, Hansu Gu, Peng Zhang, Weihang Wang, Jiahao Liu, Ning Gu, Tun Lu

ArXiv ID: 2606.14571

Abstract:

A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance. In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks. Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested. We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task tests evidence use, while the follow-up task tests whether feedback and interaction experience are reused. Four metrics diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse. Experiments with eight memory systems across two backbones show that current systems often fail to use observed evidence or turn feedback into reliable follow-up behavior, even when evidence is stored or feedback is incorporated locally. StreamMemBench is publicly available at https://github.com/landian60/StreamMemBench.

Insights

Contribution: The paper introduces StreamMemBench, a new streaming benchmark designed to evaluate how personal agents carry forward observations and interaction feedback into future-oriented assistance.

Core Idea: Existing benchmarks test memory in isolation, whereas real-world agents must bridge the gap between streaming observations and subsequent tasks by reusing evidence and feedback.

Technique: The authors construct two-step task sequences around evidence anchors from EgoLife egocentric streams to test both immediate evidence use and long-term feedback incorporation.

Pipeline: EgoLife egocentric streams → Evidence anchor identification → Two-step task sequence generation (initial task + follow-up task) → Multi-metric evaluation (recall, use, feedback, reuse)

Methodology: The researchers evaluated eight memory systems across two backbones using four specific metrics to diagnose the trajectory from observation to future assistance.

Results: Current systems frequently fail to utilize observed evidence or convert feedback into reliable follow-up behaviors, even when the information is successfully stored or incorporated locally.

Limitations: The study focuses on specific egocentric streams and may not capture the full diversity of all possible real-world personal agent interaction scenarios.
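
To make the four diagnostics concrete, the sketch below aggregates them as simple rates over two-step episodes. The field names and the binary pass/fail judgments are assumptions for illustration; the digest does not describe how StreamMemBench actually scores each metric.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    # Hypothetical binary judgments for one evidence anchor.
    evidence_recalled: bool      # was the observed evidence retrievable?
    evidence_used_initial: bool  # did the initial task answer use it?
    feedback_incorporated: bool  # was user feedback absorbed locally?
    reused_in_followup: bool     # did the follow-up task reuse the experience?


def diagnose(episodes: List[Episode]) -> Dict[str, float]:
    """Report each diagnostic as the fraction of episodes that passed it."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "evidence_recall": sum(e.evidence_recalled for e in episodes) / n,
        "initial_evidence_use": sum(e.evidence_used_initial for e in episodes) / n,
        "feedback_incorporation": sum(e.feedback_incorporated for e in episodes) / n,
        "followup_reuse": sum(e.reused_in_followup for e in episodes) / n,
    }


episodes = [
    Episode(True, True, True, False),
    Episode(True, False, True, False),
    Episode(False, False, False, False),
    Episode(True, True, True, True),
]
report = diagnose(episodes)
```

In this made-up sample, recall and feedback incorporation are both high while follow-up reuse is low, which is the kind of gap between storing information and acting on it that the paper reports.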

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 15 Jun 2026

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

ArXiv ID: 2606.14672

Abstract:

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

Insights

Contribution: The paper introduces Parallel-Synthesis, a plug-and-play framework that allows LLMs to synthesize information directly from the KV caches of parallel agent branches rather than through sequential text concatenation.

Core Idea: By bypassing the need to re-process textual outputs from parallel workers, the system preserves the parallel structure of agent workflows and eliminates redundant prefill computations.

Technique: The framework utilizes a cache mapper to calibrate independent branch caches and a fine-tuned synthesizer adapter to generate outputs from this non-sequential cache interface.

Pipeline: Parallel worker KV caches → Cache Mapper (calibration) → Synthesizer Adapter → Final synthesized output

Methodology: The authors trained the system using data that exposes the synthesizer to parallel cache contexts, teaching it to aggregate across branches while distilling reasoning behavior from standard text-concatenation methods.

Results: Parallel-Synthesis matched or outperformed text-based synthesis on 7 of 9 datasets, stayed close on the other two, and reduced time-to-first-token by 2.5x-11x.

Limitations: The study does not fully explore the scalability of the cache mapper across extremely large numbers of parallel branches or the impact of varying cache lengths on synthesis quality.

LLM

5/5 Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 15 Jun 2026

SuperThoughts: Reasoning Tokens in Superposition

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos

ArXiv ID: 2606.13862

Abstract:

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves ~20-30% CoT length reduction while maintaining accuracy with minimal degradation (1-2 points accuracy drop on most tasks).

Insights

Contribution: The paper introduces SuperThoughts, a method to accelerate Long Chain-of-Thought (CoT) reasoning by compressing consecutive tokens into latent representations to improve inference throughput.

Core Idea: Bypass the sequential bottleneck of discrete token generation by compressing each pair of consecutive CoT tokens into a single latent representation while keeping discrete token supervision during training.

Technique: The authors use a lightweight Multi-Token Prediction (MTP) module to decode two tokens per step and a confidence-based adaptive mechanism to fall back to standard decoding when uncertainty is high.

Pipeline: Input prompt → Latent representation of consecutive CoT tokens → Multi-Token Prediction (MTP) decoding → Confidence-based adaptive selection → Final reasoning output

Methodology: The researchers fine-tuned various Qwen2.5-Math models by training the model to predict pairs of tokens simultaneously, using a confidence threshold to decide between compressed and standard decoding.

Results: SuperThoughts reduces CoT length by roughly 20-30% on MATH500, AMC, OlympiadBench, and GPQA-Diamond, with an accuracy drop of 1-2 points on most tasks.

Limitations: The method may face challenges in extremely complex, long-horizon tasks where the confidence-based fallback might trigger frequently, potentially negating throughput gains.
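
The confidence-based fallback can be illustrated with a toy decoder that counts steps. This sketch replays a known token sequence rather than running a model, and `pair_conf` (a per-position confidence for emitting two tokens at once) and the 0.9 threshold are hypothetical values, not figures from the paper.

```python
from typing import List, Tuple


def adaptive_decode(
    target: List[str], pair_conf: List[float], threshold: float = 0.9
) -> Tuple[List[str], int]:
    """Emit tokens, taking two per step when the pair confidence clears the threshold.

    pair_conf[i] is an assumed confidence for predicting tokens i and i+1 jointly.
    """
    out: List[str] = []
    steps = 0
    i = 0
    while i < len(target):
        steps += 1
        if i + 1 < len(target) and pair_conf[i] >= threshold:
            out.extend(target[i:i + 2])  # compressed step: two tokens at once
            i += 2
        else:
            out.append(target[i])        # fallback: standard one-token decoding
            i += 1
    return out, steps


target = ["so", "x", "=", "4", "and", "done"]
pair_conf = [0.95, 0.2, 0.5, 0.97, 0.1, 0.0]
tokens, steps = adaptive_decode(target, pair_conf)
```

With these made-up confidences the six tokens are produced in four steps. Lowering the threshold trades more two-token steps for more risk, and a threshold that is rarely met degrades to ordinary decoding, which is the failure mode the limitation above describes.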

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 15 Jun 2026

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

ArXiv ID: 2606.13683

Abstract:

To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving a 100% success rate in multiple dialogue tasks. In negotiation tasks in particular, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism.

Insights

Contribution: The paper introduces UP-NRPA, an online framework that enables dialogue systems to dynamically adapt to diverse user characteristics without requiring offline reinforcement learning or pre-trained group-specific models.

Core Idea: Achieve real-time policy adaptation by mapping user portraits (personality, preferences, and objectives) and live feedback into a nested rollout mechanism for LLM-based planning.

Technique: User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) uses LLMs to customize dialogue strategies on the fly from dynamic user profiles.

Pipeline: User characteristics and real-time feedback → User Portrait mapping and Nested Rollout Policy Adaptation → Customized dialogue strategy and planning

Methodology: The methodology involves extracting user traits to form a portrait, then using a nested rollout process to adapt the LLM's planning policy based on that portrait and ongoing interaction feedback.

Results: Achieved a 100% success rate in multiple dialogue tasks and a 56.41% increase in the sale-to-list (SL) ratio in negotiation tasks.

Limitations: The paper does not explicitly detail the computational overhead of real-time nested rollouts or the scalability of the portrait mapping across extremely complex, multi-turn long-term goals.
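
For readers unfamiliar with the search that UP-NRPA builds on, the sketch below shows a generic Nested Rollout Policy Adaptation loop on a toy problem. It omits the user portrait and the LLM entirely: the dialogue moves, the `reward` function (a stand-in for user feedback), and all hyperparameters are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

ACTIONS = ["concede", "hold", "probe"]  # hypothetical dialogue moves
HORIZON = 4
Policy = Dict[Tuple[int, str], float]


def reward(seq: List[str]) -> int:
    """Stand-in for user feedback: matches against a hidden preferred sequence."""
    preferred = ["probe", "hold", "hold", "concede"]
    return sum(a == t for a, t in zip(seq, preferred))


def rollout(policy: Policy, rng: random.Random) -> Tuple[int, List[str]]:
    """Sample one sequence with softmax probabilities over policy weights."""
    seq = []
    for step in range(HORIZON):
        weights = [math.exp(policy.get((step, a), 0.0)) for a in ACTIONS]
        seq.append(rng.choices(ACTIONS, weights=weights)[0])
    return reward(seq), seq


def adapt(policy: Policy, seq: List[str], alpha: float = 1.0) -> Policy:
    """Shift weights toward the best sequence (gradient step on its log-probability)."""
    new = dict(policy)
    for step, chosen in enumerate(seq):
        exp_w = {a: math.exp(policy.get((step, a), 0.0)) for a in ACTIONS}
        z = sum(exp_w.values())
        for a in ACTIONS:
            new[(step, a)] = new.get((step, a), 0.0) - alpha * exp_w[a] / z
        new[(step, chosen)] += alpha
    return new


def nrpa(level: int, policy: Policy, rng: random.Random, iters: int = 20):
    """Level 0 is a single rollout; higher levels adapt toward the best child result."""
    if level == 0:
        return rollout(policy, rng)
    best_score, best_seq = -1, []
    for _ in range(iters):
        score, seq = nrpa(level - 1, policy, rng, iters)
        if score >= best_score:
            best_score, best_seq = score, seq
        policy = adapt(policy, best_seq)  # rebinding keeps the caller's policy intact
    return best_score, best_seq


score, seq = nrpa(2, {}, random.Random(0))
```

No offline training is involved: the policy weights are learned during the search itself and discarded afterwards, which is the property the paper relies on for adapting to each user online.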

RL

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 15 Jun 2026

A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

Faezeh Ardali, Mwembezi A. Nyelele, Gerald M. Knapp

ArXiv ID: 2606.13682

Abstract:

The open shop scheduling problem (OSSP) arises in many industrial and service settings but remains computationally challenging as the number of jobs and machines increases. While exact methods quickly become intractable, classical dispatching rules and metaheuristics may require substantial tuning to maintain solution quality at large scales. This study develops a Transformer-based scheduling policy for OSSP using an encoder-decoder architecture with multi-head attention. The model is trained on Taillard benchmark instances (4x4, 5x5, 7x7, and 10x10) using only the processing-time matrix as input and produces feasible schedules with makespans typically within 15-30% of best-known values. To evaluate scalability, the trained policy is applied without retraining to randomly generated instances from 40x40 to 100x100 and compared against classical dispatching heuristics, including SPT, LPT, MWKR, and EST. Across these large instances, the Transformer achieved average gaps of 12.89-15.12% relative to a standard lower bound. Compared with EST, the Transformer remained competitive, typically within a modest margin, while substantially outperforming SPT and LPT. These results indicate that a Transformer policy trained on small OSSP instances can generalize to substantially larger problems and provide a feature-light, learning-based alternative to classical dispatching rules.

Insights

Contribution: The paper introduces a Transformer-based Deep Reinforcement Learning policy for the Open Shop Scheduling Problem (OSSP) that is trained on small benchmark instances and applied without retraining to much larger, randomly generated instances.

Core Idea: A Transformer architecture can learn to produce high-quality, feasible schedules by capturing complex dependencies in processing-time matrices without requiring manual heuristic tuning.

Technique: The study utilizes an encoder-decoder Transformer architecture with multi-head attention trained via Deep Reinforcement Learning.

Pipeline: Processing-time matrix → Transformer encoder-decoder architecture → Feasible schedules

Methodology: The model was trained on Taillard benchmark instances (4x4 to 10x10) and evaluated on randomly generated large-scale instances (40x40 to 100x100) against classical dispatching rules.

Results: On the large instances, the Transformer achieved average gaps of 12.89-15.12% relative to a standard lower bound, substantially outperforming SPT and LPT while remaining competitive with EST.

Limitations: The model produces schedules within 15-30% of best-known values on small instances, suggesting a potential gap in reaching optimal solutions compared to exact methods.
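
The gap figures above are measured against a lower bound on the makespan. The sketch below assumes the classical open shop bound (the larger of the heaviest job and the heaviest machine) and pairs it with a simplified earliest-start-time dispatching rule; the paper's exact bound and EST variant are not specified in this digest, and the 2x2 instance is made up.

```python
from typing import List


def lower_bound(p: List[List[int]]) -> int:
    """Classical bound: no schedule beats the busiest job or the busiest machine."""
    job_load = max(sum(row) for row in p)
    machine_load = max(sum(col) for col in zip(*p))
    return max(job_load, machine_load)


def est_makespan(p: List[List[int]]) -> int:
    """Greedy dispatch: repeatedly schedule the operation that can start earliest."""
    n_jobs, n_machines = len(p), len(p[0])
    job_free = [0] * n_jobs
    mach_free = [0] * n_machines
    pending = {(j, m) for j in range(n_jobs) for m in range(n_machines)}
    while pending:
        # Ties on start time are broken by shorter processing time, then by index.
        j, m = min(
            pending,
            key=lambda op: (max(job_free[op[0]], mach_free[op[1]]), p[op[0]][op[1]], op),
        )
        end = max(job_free[j], mach_free[m]) + p[j][m]
        job_free[j] = mach_free[m] = end
        pending.remove((j, m))
    return max(job_free)


def gap_percent(makespan: int, bound: int) -> float:
    return 100.0 * (makespan - bound) / bound


p = [[3, 2], [1, 4]]  # p[j][m]: processing time of job j on machine m
bound = lower_bound(p)
makespan = est_makespan(p)
gap = gap_percent(makespan, bound)
```

On this tiny instance the greedy rule happens to reach the bound, so the gap is zero; on large instances the bound is generally not attainable by simple heuristics, which is why the reported gaps of 12.89-15.12% overstate the distance to the true optimum.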

Robotics

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Robotics (cs.RO) 15 Jun 2026

Causal Object-Centric Models for Planning with Monte Carlo Tree Search

Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov

ArXiv ID: 2606.14418

Abstract:

We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

Insights

Contribution: The paper introduces COMET, a model-based reinforcement learning algorithm that integrates object-level inductive biases into MuZero-style latent planning. It achieves superior early-stage training performance by performing Monte Carlo Tree Search in a slot-structured latent space.

Core Idea: Replace monolithic latent representations with object-centric ones in which actions are explicitly bound to specific objects and attention is modulated by per-slot relevance.

Technique: A frozen unsupervised object-centric encoder is paired with a transformer-based world model that uses an action-slot fusion mechanism, while the policy and value heads use object-causal attention.

Pipeline: Visual input → Unsupervised object-centric encoder → Slot-structured latent space → Transformer-based world model with action-slot fusion → Monte Carlo Tree Search with object-causal attention → Policy and value heads

Methodology: The authors evaluate COMET across eight diverse tasks in the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, comparing it against monolithic and standard object-centric baselines.

Results: COMET achieves a higher mean normalized score during the early stages of training compared to both object-centric and monolithic baselines across multiple benchmarks.

Limitations: The paper does not explicitly detail the scalability of the action-slot fusion mechanism to environments with a very high number of dynamic objects or the computational overhead of the per-slot relevance scores.

5/5 Machine Learning (cs.LG) 15 Jun 2026

Diffusion Policy Optimization without Drifting Apart

Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab

ArXiv ID: 2606.13795

Abstract:

RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected return. We propose DiPOD, a diffusion policy optimization framework that maintains tight-bound behavior throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.

Insights

Contribution: The paper introduces DiPOD, a framework that stabilizes diffusion policy optimization by addressing the 'double-drift' phenomenon where surrogate optimization causes the proxy policy gradient to misalign with the true policy gradient.

Core Idea: The authors propose interleaving self-distillation with policy-improving gradient updates to maintain a tight bound between the variational surrogate and the true log-likelihood.

Technique: Each diffusion policy-gradient update is augmented with an on-policy ELBO regularizer, which keeps the ELBO close to the true log-likelihood so that the proxy policy gradient stays aligned with the true one.

Pipeline: Diffusion Policy → Policy-Gradient Update + On-policy ELBO Regularization → Stabilized Policy Improvement

Methodology: The authors trace the instability of diffusion policy-gradient methods to the double-drift phenomenon, then interleave self-distillation with policy-improving gradient updates so that reward maximization does not loosen the variational bound.

Results: DiPOD substantially stabilizes training and achieves higher rewards than previous methods in both diffusion language model post-training and continuous-control diffusion policies.

Limitations: The paper focuses on the double-drift phenomenon; further exploration into the scalability of the ELBO regularizer in extremely high-dimensional action spaces remains an open area.

Trending repositories on GitHub filtered and scored for relevance to your interests.
