Thursday, June 18, 2026

Daily Digest 2026-06-18

Today’s digest highlights a significant shift toward long-horizon planning and stateful memory in embodied agents, alongside advancements in verifiable reasoning and specialized industrial applications.

Research highlights:

Embodied AI and Robotics: Research focuses on multi-agent reinforcement learning in sports environments, long-horizon stateful navigation, and navigation in partially observable spaces.
Agentic Frameworks: New methods are being developed to decouple search from reasoning, synthesize data for multi-turn tool use, and distill skills for GUI interaction.
Reasoning and Interpretability: Papers introduce frameworks for interpretable vision-language models and benchmarks for defeasible abduction in foundation models.
Memory and Profiling: Studies explore what generalist agents must retain for long-term tasks and how to align user profiling for industrial dispatch systems.
Earth Observation: A milestone was reached with the first in-orbit demonstration of a zero-shot vision-language model for autonomous satellite data analysis.

Tech buzz:

The industry is seeing a move toward specialized hardware optimizations and more nuanced comparisons between local and proprietary models.
Hardware and Infrastructure: The announcement of x86 AI Compute Extensions (ACE) signals continued hardware-level optimization for AI workloads.
Open Source and Tooling: New releases include desktop-oriented UI agents, robotics dataset diffing tools, and comprehensive guides for building AI engineering pipelines from scratch.

Global Trends

Papers discovered from ArXiv subject categories

AI Safety

4/5 Artificial Intelligence (cs.AI), Computers and Society (cs.CY) 18 Jun 2026

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

ArXiv ID: 2606.18936

Abstract:

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

Insights

Contribution: The paper introduces SciRisk-Bench, a comprehensive benchmark designed to evaluate the safety of Large Language Models in AI4Science contexts by focusing on specific risk dimensions and scientific disciplines.

Core Idea: Existing AI4Science safety datasets leave the underlying risk dimensions underspecified; SciRisk-Bench addresses this with a structured framework that diagnoses model safety across 7 disciplines, 31 subdisciplines, and 10 risk dimensions.

Technique: The authors developed a multi-dimensional evaluation framework that categorizes risks into 10 distinct dimensions across 31 subdisciplines to enable fine-grained safety diagnosis.

Pipeline: Scientific queries/scenarios → SciRisk-Bench evaluation framework (Risk Dimensions & Disciplines) → Fine-grained safety diagnosis of LLMs

Methodology: The researchers curated a dataset covering 7 disciplines and 10 risk dimensions, then evaluated both mainstream and science-oriented LLMs to identify specific safety vulnerabilities.

Results: The benchmark enables a fine-grained diagnosis of where scientific models remain unsafe across various disciplines and subdisciplines, highlighting specific risk-dimension failures.

Limitations: The abstract does not specify which risks make up the 10 dimensions, nor does it report the performance gaps of the evaluated models.

Agentic AI

5/5 Artificial Intelligence (cs.AI) 18 Jun 2026

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

ArXiv ID: 2606.18385

Abstract:

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

Insights

Contribution: The paper introduces CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework designed to mitigate hallucinations in Vision-Language Models by enforcing step-level citation grounding. It also proposes a comprehensive suite of 23 component-wise metrics and a composite CaVeScore to evaluate retrieval quality and cross-modal grounding.

Core Idea: The core idea is a closed-loop reasoning system in which ungrounded claims are detected and fed back to the Extractor for targeted re-retrieval, so that every step of the chain-of-thought is anchored in retrieved evidence.

Technique: The framework employs a five-stage agentic pipeline (Extractor, Retriever, Solver, Citation Injector, and Verifier) that utilizes a reflection mechanism to trigger targeted re-retrieval upon verification failure.

Pipeline: Visual/Textual Input → Extractor → Retriever → Solver → Citation Injector → Verifier → (Feedback Loop to Extractor if ungrounded) → Final Grounded Output

Methodology: The authors developed a modular agentic-RAG architecture that evaluates each reasoning step for faithfulness to retrieved evidence. They established a multi-dimensional evaluation framework using 23 metrics to measure accuracy, citation precision, and evidence grounding.

Results: CaVe-VLM-CoT achieved 87.1% accuracy and a 56.6% CaVeScore on ScienceQA, and 55.2% accuracy with a 35.7% CaVeScore on MMMU (30 subjects) without requiring architectural or prompt modifications.

Limitations: The paper does not explicitly detail the computational overhead of the multi-stage reflection loop or the potential latency introduced by the iterative re-retrieval process.
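
The closed-loop pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stage functions, the `State` fields, the toy `EVIDENCE_STORE`, and the hard-coded reasoning steps are all hypothetical stand-ins for model and retriever calls. It only shows the control flow in which a Verifier failure sends feedback to the Extractor for another retrieval round.

```python
from dataclasses import dataclass, field

# Toy evidence store standing in for a real retrieval index (hypothetical data).
EVIDENCE_STORE = {
    "leaf color": "Chlorophyll gives leaves their green color.",
    "light absorption": "Chlorophyll absorbs mostly red and blue light.",
}

@dataclass
class State:
    question: str
    queries: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)   # topic -> retrieved passage
    steps: list = field(default_factory=list)      # (topic, claim) reasoning steps
    cited: list = field(default_factory=list)      # (claim, cited topic or None)
    feedback: list = field(default_factory=list)   # topics flagged by the Verifier

def extractor(s):
    # Initial query plus any topics the Verifier asked to re-retrieve.
    s.queries = ["leaf color"] + s.feedback

def retriever(s):
    s.evidence = {q: EVIDENCE_STORE[q] for q in s.queries if q in EVIDENCE_STORE}

def solver(s):
    # A real Solver would call a VLM; these steps are hard-coded for the sketch.
    s.steps = [
        ("leaf color", "Leaves are green because of chlorophyll."),
        ("light absorption", "Chlorophyll absorbs red and blue light."),
    ]

def citation_injector(s):
    # Attach a citation to each step, or None when no evidence supports it.
    s.cited = [(claim, topic if topic in s.evidence else None)
               for topic, claim in s.steps]

def verifier(s):
    # Ungrounded steps become structured feedback for the next round.
    s.feedback = [topic for topic, _ in s.steps if topic not in s.evidence]
    return not s.feedback

def run(question, max_rounds=3):
    s = State(question)
    for round_no in range(1, max_rounds + 1):
        for stage in (extractor, retriever, solver, citation_injector):
            stage(s)
        if verifier(s):
            break
    return s, round_no

state, rounds = run("Why are leaves green?")
print(rounds, state.cited)
```

In this toy run the first pass leaves one step uncited, so the Verifier's feedback widens the next retrieval and the second pass grounds both steps.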

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Software Engineering (cs.SE) 18 Jun 2026

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

ArXiv ID: 2606.18543

Abstract:

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

Insights

Contribution: The paper introduces CEO-Bench, a new benchmark designed to evaluate the ability of LLM agents to manage long-horizon, complex, and multi-faceted real-world tasks. It shifts the focus from isolated task execution to sustained, adaptive decision-making in a dynamic environment.

Core Idea: To test true agentic intelligence, models must be evaluated on their ability to navigate uncertainty, handle noisy data, adapt to changing conditions, and orchestrate multiple moving parts over a long period.

Technique: The authors developed a programmable Python interface that simulates a startup environment, requiring agents to manage pricing, marketing, and budgeting over a 500-day period.

Pipeline: Startup environment state → Agent analysis of noisy databases and history → Strategy formulation and code execution → Environment update and feedback loop → Final financial outcome

Methodology: The benchmark simulates a fictional company where agents must interact with interconnected business databases and write code to forecast trends and mine customer preferences. Success is measured by the agent's ability to maintain and grow a starting capital balance over 500 days.

Results: Most state-of-the-art models failed to maintain the initial $1M balance; only Claude Opus 4.8 and GPT-5.5 finished above the starting balance, and neither consistently achieved profitability.

Limitations: The current results show that even the strongest models struggle with sustained profit, highlighting a significant gap in the ability of agents to drive long-term, adaptive progress.
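
The abstract mentions that the strongest agents simulate customer cohorts to forecast future cash. A minimal sketch of that kind of forecast follows; it is not taken from the benchmark, and the function name, the constant-retention model, and every parameter value are assumptions chosen for illustration.

```python
def forecast_cash(start_cash, new_customers, retention, revenue_per_customer,
                  fixed_costs, periods):
    """Project the cash balance period by period from a simple cohort model.

    Assumes (hypothetically) a constant number of new customers per period,
    a constant retention rate, and flat revenue per active customer.
    """
    cohorts = []          # surviving customers of each past cohort
    cash = start_cash
    history = []
    for _ in range(periods):
        # Existing cohorts shrink by the retention rate, then a new one joins.
        cohorts = [size * retention for size in cohorts]
        cohorts.append(new_customers)
        active = sum(cohorts)
        cash += active * revenue_per_customer - fixed_costs
        history.append(cash)
    return history

history = forecast_cash(
    start_cash=10_000, new_customers=100, retention=0.5,
    revenue_per_customer=10, fixed_costs=1_000, periods=3,
)
print(history)
```

An agent could run such a projection under several pricing or marketing settings and compare the resulting balances before committing to a decision.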

5/5 Artificial Intelligence (cs.AI) 18 Jun 2026

Skill-Guided Continuation Distillation for GUI Agents

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

ArXiv ID: 2606.18890

Abstract:

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

Insights

Contribution: The paper introduces Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework that addresses the supervision gap in GUI agents caused by policy-induced off-trajectory states.

Core Idea: By generating successful continuations from states reached by the current policy but absent in expert data, the model can learn to recover from its own mistakes.

Technique: The method uses a skill-guided policy to complete tasks from off-trajectory states, extracting Continuation Plans, Critical Targets, Failure Traps, and Success Criteria to provide supervision.

Pipeline: Expert trajectories and policy-induced off-trajectory states → Skill-guided completion of tasks → Mixed supervision data → Policy distillation and self-improvement

Methodology: SGCD runs a plain policy to reach off-trajectory states, then uses a skill-guided policy to produce successful continuations which are mixed with original expert trajectories for training.

Results: SGCD improved the success rate of three base models on OSWorld-Verified from the low-30% range to over 50%.

Limitations: The paper does not explicitly detail the computational overhead of the iterative distillation process or the potential for compounding errors if the skill-guided policy fails to find a valid continuation.
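
The data-collection step can be illustrated on a toy grid world. The sketch below is an assumption-laden stand-in, not the paper's code: the grid, the `plain_policy` with its deliberate mistake, and the rule-based `skill_guided_policy` are hypothetical. It shows how a short plain-policy prefix reaches a state the expert never visited, and how a successful continuation from that state is mixed into the training set.

```python
GOAL = (3, 0)
MOVES = {"right": (1, 0), "up": (0, 1), "down": (0, -1)}

# Expert demonstrations only cover states on the bottom row.
EXPERT = [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "right")]

def step(state, action):
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)

def plain_policy(state):
    # Current policy with a systematic mistake at (1, 0) (hypothetical).
    return "up" if state == (1, 0) else "right"

def skill_guided_policy(state):
    # Stand-in for a policy prompted with extracted skills:
    # return to the bottom row first, then head for the goal.
    return "down" if state[1] > 0 else "right"

def collect_continuation(prefix_steps=2, max_steps=10):
    # 1) Run the plain policy briefly to reach a policy-induced state.
    state = (0, 0)
    for _ in range(prefix_steps):
        state = step(state, plain_policy(state))
    # 2) Let the skill-guided policy finish the task from there.
    pairs = []
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = skill_guided_policy(state)
        pairs.append((state, action))
        state = step(state, action)
    # 3) Keep only successful continuations.
    return pairs if state == GOAL else []

continuation = collect_continuation()
training_data = EXPERT + continuation
print(continuation)
```

The pair for state `(1, 1)` is the new supervision: no expert trajectory contains that state, yet the current policy reaches it on its own.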

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Information Retrieval (cs.IR), Multiagent Systems (cs.MA) 18 Jun 2026

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

ArXiv ID: 2606.18947

Abstract:

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

Insights

Contribution: The paper introduces Decoupled Search Grounding (DSG), a vendor-agnostic architecture that separates search retrieval and grounding from the reasoning model's internal logic. This allows for granular control over costs, latency, and output formatting while enabling model interchangeability.

Core Idea: Grounding should be treated as an optimizable interface boundary rather than a fixed model feature to prevent search-induced verbosity and high costs.

Technique: DSG utilizes an MCP-compatible gateway to expose first-class controls such as provider routing, source-aware context rendering, retrieval-depth control, and exact plus semantic caching.

Pipeline: User Query → DSG Gateway (Routing, Retrieval, Context Rendering, Caching) → Reasoning Model → Final Output

Methodology: The authors evaluated DSG against native search grounding across five frontier models on SimpleQA, FreshQA, and HotpotQA benchmarks, as well as a production e-commerce query-understanding (QIU) workload.

Results: DSG achieved 86.1% accuracy on SimpleQA (vs. 87.7% native) at 91% lower search cost, reached a 99.4% warm-cache hit rate with 68% lower latency, and cut search costs by over 98% on QIU workloads.

Limitations: Native search still leads on recency-sensitive tasks like FreshQA, suggesting that the decoupling may slightly trade off extreme recency for control and efficiency.
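
Two of the controls named above, configured fallback and exact plus semantic caching, can be sketched as a small gateway class. This is an illustrative sketch only: `GroundingGateway`, `ProviderError`, the provider functions, and the similarity threshold are hypothetical, and token-set Jaccard similarity stands in for whatever embedding-based matching a production system would use.

```python
class ProviderError(Exception):
    """Raised by a search provider that cannot serve a query (hypothetical)."""

def jaccard(a, b):
    # Token-overlap similarity; a stand-in for embedding similarity.
    return len(a & b) / len(a | b) if (a | b) else 1.0

class GroundingGateway:
    def __init__(self, providers, sim_threshold=0.6):
        self.providers = providers      # ordered: primary first, then fallbacks
        self.sim_threshold = sim_threshold
        self.exact = {}                 # normalized query -> result
        self.semantic = []              # (token set, result) pairs
        self.stats = {"exact": 0, "semantic": 0, "miss": 0}

    def search(self, query):
        key = " ".join(query.lower().split())
        if key in self.exact:           # exact cache hit
            self.stats["exact"] += 1
            return self.exact[key]
        tokens = set(key.split())
        for cached_tokens, result in self.semantic:
            if jaccard(tokens, cached_tokens) >= self.sim_threshold:
                self.stats["semantic"] += 1
                return result           # near-duplicate query
        self.stats["miss"] += 1
        result = self._route(key)
        self.exact[key] = result
        self.semantic.append((tokens, result))
        return result

    def _route(self, query):
        # Try providers in order; fall back when one fails.
        for provider in self.providers:
            try:
                return provider(query)
            except ProviderError:
                continue
        raise ProviderError("all providers failed")

def flaky_primary(query):
    raise ProviderError("primary unavailable")

def backup(query):
    return f"results for: {query}"

gateway = GroundingGateway([flaky_primary, backup])
first = gateway.search("capital of France")
second = gateway.search("Capital of   france")
third = gateway.search("the capital of France")
print(gateway.stats)
```

Because the cache and routing live outside the reasoning model, the same gateway can sit in front of any model that consumes the returned context.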

4/5 Artificial Intelligence (cs.AI), Human-Computer Interaction (cs.HC) 18 Jun 2026

Searching for Synergy in Shared Workspace Human-AI Collaboration

Nachiket Kotalwar, Rohini Das, Carolyn Rose

ArXiv ID: 2606.18413

Abstract:

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

Insights

Contribution: The paper investigates the dynamics of human-AI collaboration in shared workspaces, identifying how team structure and coordination scaffolding impact performance. It demonstrates that adding collaborators can lead to process loss without proper responsibility signals and proposes a scaffolding method to mitigate this.

Core Idea: The effectiveness of human-AI teams depends as much on the coordination structure and integration of expertise as it does on the individual capabilities of the agents.

Technique: The researchers developed a scaffolding framework that combines shared group memory with simulated human-in-the-loop (HITL) gates to manage responsibility and route expertise.

Pipeline: Shared-workspace tasks → Collaborative Gym environment with DiscoveryBench tasks → Team coordination with shared memory and HITL gates → Final performance evaluation

Methodology: The study analyzed 1,482 sessions in a simulated environment to compare team performance with and without coordination scaffolding across different team sizes.

Results: Adding relevant collaborators can lower performance due to coordination overhead; however, the proposed scaffolding yielded higher mean performance, particularly in three-person teams, by providing clearer responsibility signals.

Limitations: The study relies on simulated human collaborators and may not fully capture the nuances of real-world human-AI interaction dynamics.
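
The scaffolding idea, shared group memory plus approval gates on selected actions, can be sketched as follows. The class, the action names, and the approval rule are hypothetical and are not drawn from Collaborative Gym; the sketch only shows a gated action being logged to shared memory whether it is approved or rejected.

```python
class SharedWorkspace:
    def __init__(self, approver, gated_actions):
        self.memory = []                 # shared group memory, visible to all members
        self.approver = approver         # designated participant's decision function
        self.gated_actions = set(gated_actions)

    def act(self, member, action, payload):
        # Gated actions need approval from the designated participant.
        if action in self.gated_actions and not self.approver(member, action, payload):
            self.memory.append((member, action, payload, "rejected"))
            return False
        self.memory.append((member, action, payload, "done"))
        return True

def approve_if_cited(member, action, payload):
    # Hypothetical rule: a final answer must cite an analysis step.
    return "analysis" in payload

workspace = SharedWorkspace(approve_if_cited, gated_actions=["submit_answer"])
ran = workspace.act("agent", "run_analysis", "regression on dataset")
rejected = workspace.act("agent", "submit_answer", "final: effect is positive")
accepted = workspace.act("agent", "submit_answer", "final, per analysis: effect is positive")
print([entry[3] for entry in workspace.memory])
```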

4/5 Artificial Intelligence (cs.AI) 18 Jun 2026

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

ArXiv ID: 2606.18746

Abstract:

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

Insights

Contribution: The paper provides a formal theoretical framework characterizing the necessary memory requirements for generalist agents to achieve near-optimal performance across diverse environments. It establishes a separation theorem proving that agents must store domain-relevant information to disambiguate environments with shared observational bottlenecks.

Core Idea: Generalist agents require memory to distinguish between different domains that appear identical in current observations but require different actions. Furthermore, memory sufficient for goal-value estimation can be leveraged to reconstruct local transition dynamics.

Technique: The authors use a formal mathematical account and a separation theorem to analyze the relationship between observational bottlenecks, memory distributions, and policy optimality.

Pipeline: Multi-domain observations → Memory storage of domain-relevant information → Domain disambiguation and transition-model reconstruction → Near-optimal action selection

Methodology: The study employs theoretical analysis to demonstrate that uniformly near-optimal policies must induce distinct memory distributions at observational bottlenecks. It also explores the relationship between value estimation and transition dynamics reconstruction.

Results: The research establishes a separation theorem showing that successful agents cannot rely solely on current state observations. It identifies memory as the fundamental substrate for domain disambiguation, transition-model reconstruction, and planning.

Limitations: The paper focuses on the theoretical requirements of memory rather than specific architectural implementations or empirical benchmarks for large-scale generalist models.
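
A toy instance makes the separation argument concrete. In the sketch below, which is an illustration constructed for this digest and not the paper's formal setting, two domains show the same observation at a bottleneck but reward opposite actions. A policy that sees only the current observation must pick one action for both domains, while a policy that remembers an earlier domain cue can act correctly in each.

```python
# Each domain shows a cue first, then the same bottleneck observation ("door").
# The rewarded action at the door differs by domain.
BEST_ACTION = {"A": "left", "B": "right"}
ACTIONS = ("left", "right")

def memoryless_return(action_at_door):
    # Without memory the policy sees only "door", so it takes the same
    # action in both domains; return is averaged over the two domains.
    hits = sum(action_at_door == best for best in BEST_ACTION.values())
    return hits / len(BEST_ACTION)

def memory_return():
    # With memory the policy stores the cue and conditions on it at the door.
    hits = 0
    for cue, best in BEST_ACTION.items():
        memory = cue
        action = "left" if memory == "A" else "right"
        hits += action == best
    return hits / len(BEST_ACTION)

best_memoryless = max(memoryless_return(a) for a in ACTIONS)
print(best_memoryless, memory_return())
```

The best memoryless policy succeeds in only one of the two domains, so being near-optimal in both requires the memory contents at the bottleneck to differ by domain.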

4/5 Artificial Intelligence (cs.AI), Computers and Society (cs.CY) 18 Jun 2026

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

ArXiv ID: 2606.18803

Abstract:

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

Insights

Contribution: The paper introduces ProfiLLM, an agentic LLM data pipeline that extracts semantic user profiles from large-scale behavioral logs to improve industrial ride-hailing dispatching.

Core Idea: The authors propose bridging the gap between unstructured behavioral signals and production matching systems by using LLMs to generate utility-aligned profiles that capture contextual preferences like regional aversions.

Technique: The system utilizes a tool-augmented LLM agent for global knowledge mining and a utility-aligned exploration module that uses a lightweight proxy to refine profiles for DPO fine-tuning.

Pipeline: Platform-scale behavioral logs → Tool-Augmented Global Knowledge Mining & Utility-Aligned Profile Exploration → LLM-generated user profiles & DPO-fine-tuned models → Improved dispatching outcomes.

Methodology: The methodology involves mining global priors and clustering rules using 27 analytical tools, followed by iterative profile refinement based on downstream utility proxies and DPO fine-tuning.

Results: ProfiLLM achieved up to +6.14% relative AUC improvement in outcome prediction and up to +4.35% GMV gain in dispatching simulation. In a 14-day online A/B test it delivered +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

Limitations: The paper notes the challenges of scaling to millions of daily orders, handling long-tail users with sparse data, and ensuring surface-fluent profiles translate to actual prediction utility.
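
The step of scoring candidate profiles with a utility proxy and turning them into preference pairs can be sketched as below. The abstract does not say how pairs are chosen, so pairing the best-scoring candidate against the worst is an assumption, as are the function name, the `min_gap` parameter, and the example scores.

```python
def build_preference_pairs(candidates_by_cluster, utility_proxy, min_gap=0.0):
    """Pick a (chosen, rejected) profile pair per cluster from proxy scores.

    Assumption: best-vs-worst pairing; clusters whose candidates score too
    closely (gap <= min_gap) yield no pair.
    """
    pairs = []
    for cluster, candidates in candidates_by_cluster.items():
        ranked = sorted(candidates, key=utility_proxy, reverse=True)
        best, worst = ranked[0], ranked[-1]
        if utility_proxy(best) - utility_proxy(worst) > min_gap:
            pairs.append({"prompt": cluster, "chosen": best, "rejected": worst})
    return pairs

# Hypothetical proxy scores, e.g. validation AUC of a small downstream model.
SCORES = {
    "avoids airport runs at night": 0.71,
    "prefers short trips": 0.64,
    "drives a car": 0.50,
    "active on weekends": 0.60,
    "mostly weekend activity": 0.60,
}

candidates = {
    "night-shift drivers": ["drives a car", "avoids airport runs at night",
                            "prefers short trips"],
    "weekend drivers": ["active on weekends", "mostly weekend activity"],
}

pairs = build_preference_pairs(candidates, SCORES.get, min_gap=0.05)
print(pairs)
```

Pairs in this prompt/chosen/rejected shape are the usual input for DPO-style fine-tuning; the second cluster is skipped here because its candidates are indistinguishable under the proxy.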

4/5 Artificial Intelligence (cs.AI) 18 Jun 2026

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

ArXiv ID: 2606.18874

Abstract:

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

Insights

Contribution: The paper introduces Xcientist, a research harness that externalizes the implicit reasoning of AI scientists into inspectable, contract-governed processes to ensure scientific accountability.

Core Idea: By treating research synthesis and validation as persistent artifacts rather than hidden model inferences, the system prevents 'claim drift' and ensures that final mechanisms remain grounded in their original evidential basis.

Technique: The system utilizes a research harness to organize literature evidence, idea states, implementation plans, and repair traces as traceable, persistent artifacts.

Pipeline: Problem Formulation → Literature Evidence Synthesis → Idea State Generation → Implementation Planning → Experimental Validation → Bounded Revision → Final Artifact

Methodology: The authors evaluated Xcientist across three diverse domains: training-free memory systems, graph-structured traffic forecasting, and multi-scale physics-informed neural networks.

Results: Across the three domains, Xcientist preserved traceable trajectories from problem formulation through mechanism design, validation, and bounded revision, keeping runnable artifacts tied to the mechanisms they were claimed to support.

Limitations: The paper focuses on the framework for externalizing synthesis and does not fully explore the scalability of the harness across extremely large-scale, multi-year longitudinal research projects.

Computer Vision

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 18 Jun 2026

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson

ArXiv ID: 2606.18271

Abstract:

As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.

Insights

Contribution: The paper presents what is, to the authors' knowledge, the first in-orbit demonstration of a zero-shot vision-language model (VLM) performing autonomous multi-modal inference on a Low Earth Orbit (LEO) spacecraft. It demonstrates the feasibility of using onboard foundation models to perform semantic compression of Earth observation data.

Core Idea: By processing imagery onboard with a VLM, the system replaces the 'acquire-then-downlink-everything' approach with in-orbit semantic compression, producing scene classifications and text descriptions onboard instead of relying on downlinking all raw imagery.

Technique: The system utilizes a local Gemma 3 vision-language model orchestrated by a graph-based state machine (LangGraph) to coordinate dedicated agents for detection and natural-language dialogue.

Pipeline: Raw Earth imagery → Hardware-accelerated GPU inference (Gemma 3) → Scene classification and text description → Natural-language dialogue/re-tasking via plain-English prompts

Methodology: The authors deployed NAVI-Orbital on a LEO spacecraft, validating it through ground benchmarking on the AID dataset, Flatsat validation, and live in-orbit captures of previously unseen imagery, with no fine-tuning for the flight instrument.

Results: Achieved 88.16% accuracy on the 7,960-image AID benchmark and successfully performed live in-orbit inference on uncorrected YAM-9 imagery.

Limitations: The paper focuses on zero-shot performance and does not detail the specific power/thermal constraints of the satellite-class edge computer or the long-term reliability of the graph-based state machine in space.

4/5 Artificial Intelligence (cs.AI) 18 Jun 2026

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

ArXiv ID: 2606.18950

Abstract:

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent that manages units by an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination, multiagent coordination and when task scale increases.

Insights

Contribution: The paper introduces RTSGameBench, a comprehensive benchmark designed to evaluate and diagnose the strategic reasoning capabilities of Vision-Language Models (VLMs) in complex, large-scale real-time strategy environments.

Core Idea: RTS games serve as a natural testbed for assessing a VLM's ability to handle long-horizon planning, partial observability, and multi-agent coordination under uncertainty.

Technique: The authors developed a self-evolving generation framework to create new mini-games from free-form queries and an RTSGameAgent system using Finite State Machines (FSM) with agentic memory.

Pipeline: Free-form queries → Self-evolving generation framework → New mini-games → VLM evaluation via RTSGameAgent → Strategic competency diagnosis

Methodology: The researchers built a benchmark on the 'Beyond All Reason' game, utilizing diverse matchup structures and specific mini-games to isolate and assess individual strategic competencies.

Results: Empirical validation shows that state-of-the-art VLMs struggle significantly with tight coordination, multi-agent cooperation, and increasing task scales.

Limitations: The study highlights the current inability of VLMs to perform complex strategic reasoning at scale, leaving open questions on how to improve long-horizon planning and multi-agent coordination.

General

3/5 Artificial Intelligence (cs.AI) 18 Jun 2026

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

ArXiv ID: 2606.18988

Abstract:

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle, cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

Insights

Contribution: The paper introduces ThinkDeception, a framework that transforms multimodal deception detection from a black-box classification task into an interpretable cognitive reasoning process using MLLMs and a novel step-by-step CoT dataset.

Core Idea: The core idea is to leverage multimodal Chain of Thought (CoT) and a progressive training strategy to explicitly capture cross-modal inconsistencies and provide transparent reasoning trajectories for detecting deception.

Technique: The authors propose Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) combined with a progressive difficulty curriculum and a multi-dimensional, process-aware reward mechanism.

Pipeline: Multimodal input (Visual/Audio) → Progressive CoT Reasoning (Easy-to-Hard) → VAC-GRPO Optimization → Interpretable Deception Detection & Rationale

Methodology: The methodology involves creating a meticulously annotated multimodal CoT dataset, training a base model to identify modal inconsistencies, and refining it via a reflective learning paradigm with a dynamic curriculum scheduler.

Results: ThinkDeception establishes a new State-of-the-Art (SOTA) on mainstream benchmarks, significantly outperforming existing methods in both detection accuracy and the quality of generated rationales.

Limitations: The paper does not explicitly detail the computational overhead of the progressive training strategy or the scalability of the meticulously annotated CoT dataset to diverse, real-world deception scenarios.

LLM

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Logic in Computer Science (cs.LO) 18 Jun 2026

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

ArXiv ID: 2606.18557

Abstract:

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

Insights

Contribution: The paper introduces DeFAb, a verifiable benchmark for defeasible abduction that uses formal logic to measure a model's ability to construct hypotheses that explain anomalies while preserving unrelated expectations.

Core Idea: The authors propose that logical rigor—specifically valid derivation, conservativity, and minimality—should be the primary metric for evaluating theoretical reasoning and creativity in foundation models.

Technique: The authors developed a generation pipeline that converts large-scale knowledge bases into formally grounded instances with polynomial-time verifiable gold standards.

Pipeline: Taxonomic hierarchies (OpenCyc, YAGO, Wikidata) and behavioral property graphs (ConceptNet, UMLS) → Rule materialization and instance generation → Formally verifiable defeasible abduction tasks.

Methodology: The researchers evaluated four frontier language models against a rule-based logic solver across three difficulty levels, using rendering-robust evaluation (worst case over four surface renderings) and a matched contamination control.

Results: A rule-based solver resolved every instance with 100% accuracy in under 50 microseconds, while the best frontier model reached at most 65%, dropping to 23.5% under rendering-robust evaluation. Chain-of-thought variance (~36 pp) exceeded any inter-model gap.

Limitations: The CONJURE pilot found zero novel concepts in the transformative-creativity variant, and the significant performance gap between symbolic solvers and LLMs suggests a fundamental struggle with internalizing defeasible reasoning.
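
To make the three checks named above concrete (valid derivation, conservativity, minimality), here is a toy sketch in Python. It is not the DeFAb verifier: the rule encoding, the `predict` and `verify` helpers, and the bird/penguin example are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Toy knowledge base (hypothetical): individuals mapped to the classes they belong to.
FACTS = {"tweety": {"bird", "penguin"}, "robin": {"bird"}}
# Default rules as (class, property): members of the class normally have the property.
DEFAULTS = [("bird", "flies")]

def predict(individual, prop, exceptions):
    """A default applies unless an exception rule for one of the individual's classes blocks it."""
    classes = FACTS[individual]
    if any(cls in classes and p == prop for cls, p in exceptions):
        return False
    return any(cls in classes and p == prop for cls, p in DEFAULTS)

def explains(hypothesis, anomaly):
    """Derivation check: the revised theory must predict the observed anomaly."""
    individual, prop, observed = anomaly
    return predict(individual, prop, hypothesis) == observed

def conservative(hypothesis, expectations):
    """Unrelated expectations must come out exactly as they did before the revision."""
    return all(predict(i, p, hypothesis) == predict(i, p, ()) for i, p in expectations)

def verify(hypothesis, anomaly, expectations):
    def valid(h):
        return explains(h, anomaly) and conservative(h, expectations)
    if not valid(hypothesis):
        return False
    # Minimality: no proper subset of the hypothesis may already be a valid explanation.
    return not any(valid(sub)
                   for r in range(len(hypothesis))
                   for sub in combinations(hypothesis, r))

anomaly = ("tweety", "flies", False)   # Tweety is a bird but does not fly
expectations = [("robin", "flies")]    # the robin should still be predicted to fly

print(verify((("penguin", "flies"),), anomaly, expectations))                       # True
print(verify((("bird", "flies"),), anomaly, expectations))                          # False: breaks the robin
print(verify((("penguin", "flies"), ("penguin", "swims")), anomaly, expectations))  # False: not minimal
```

Each check is a plain loop over a small rule set, which is the sense in which a hypothesis can be scored without a judge model; the benchmark's own formal definitions may differ from this simplification.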

MLOps

4/5 Artificial Intelligence (cs.AI) 18 Jun 2026

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

ArXiv ID: 2606.19079

Abstract:

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.

Insights

Contribution: The paper introduces ARIADNE, a training-free and adapter-agnostic routing framework that enables dynamic selection of task-specialized adapters from a heterogeneous pool without requiring access to adapter internals or additional training.

Core Idea: The framework treats adapter selection as a proximity problem in the latent space, representing each adapter by the distribution of its training data rather than its internal weights.

Technique: ARIADNE computes a set of centroids from the embeddings of each adapter's training set and selects the best adapter by measuring the distance between the input query's embedding and these centroids.

Pipeline: Unlabeled input query → Embedding generation → Proximity measurement against adapter centroids → Optimal adapter selection

Methodology: The authors evaluated ARIADNE on Llama 3.2 1B Instruct across 23 and 44 diverse NLP tasks, comparing selection accuracy against an upper bound of perfect task knowledge.

Results: ARIADNE recovered 97.44% of the upper bound performance on 23 tasks and achieved 89.7% average selection accuracy on 44 tasks without any additional training.

Limitations: The paper does not explicitly detail the performance impact of very large adapter pools or the computational overhead of calculating distances across a massive number of centroids.
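
The centroid routing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single mean centroid per adapter (the paper describes a set of centroids), Euclidean distance, and hand-written 2-D "embeddings"; the adapter names and the `build_centroids` and `route` helpers are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def build_centroids(adapter_embeddings):
    """Map each adapter name to a list of centroids (here just one: the mean)."""
    return {name: [mean_vector(vecs)] for name, vecs in adapter_embeddings.items()}

def route(query_embedding, centroids):
    """Return the adapter whose nearest centroid is closest to the query embedding."""
    best_name, best_dist = None, math.inf
    for name, adapter_centroids in centroids.items():
        for centroid in adapter_centroids:
            dist = math.dist(query_embedding, centroid)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name

# Toy embeddings of each adapter's training set (assumed, for illustration only).
training_embeddings = {
    "sentiment": [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]],
    "translation": [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]],
}
centroids = build_centroids(training_embeddings)
print(route([0.85, 0.1], centroids))  # sentiment
print(route([0.1, 0.7], centroids))   # translation
```

Because routing only touches input embeddings, adding a new adapter means adding one more entry to `centroids`; nothing about the adapter's weights is inspected.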

RL

5/5 Artificial Intelligence (cs.AI) 18 Jun 2026

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

ArXiv ID: 2606.19047

Abstract:

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.

Insights

Contribution: The paper introduces RODS, a framework that addresses the depletion of informative samples in multi-turn tool-use RL by dynamically synthesizing new data that aligns with the agent's evolving capability boundary.

Core Idea: The authors identify that policy gradients are most effective on samples where success and failure are balanced (high reward variance), and they propose using this variance as a zero-cost signal to identify and synthesize new training data in real-time.

Technique: RODS utilizes a reward-driven online data synthesis pipeline that identifies boundary samples, generates structurally similar multi-turn variants via skill-aligned resampling, and manages a dynamic replay buffer.

Pipeline: Static human seeds → RL rollouts → Reward variance detection → Skill-aligned resampling → Dynamic replay buffer → Policy update

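Methodology: The method uses rollout reward variance as the boundary signal: by the Popoviciu upper bound, variance peaks when successes and failures are roughly balanced, so high-variance tasks sit near the agent's capability boundary. A skill-aligned resampling pipeline then preserves structural complexity (API topology and dependency depth) in the synthesized data.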

Results: Starting from 400 human seeds and maintaining an active pool of ~800 samples, RODS achieves performance comparable to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in the authors' controlled setting.

Limitations: The paper focuses on controlled settings and does not extensively explore the scalability of the resampling pipeline across vastly different tool domains or the long-term stability of the dynamic replay buffer.
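
The variance-based boundary detection can be illustrated with a short Python sketch. Popoviciu's inequality bounds the variance of any quantity confined to [low, high] by (high - low)^2 / 4, and the bound is attained when outcomes split evenly between the two extremes. The code below is an assumption-laden illustration rather than the RODS pipeline: it scores plain 0/1 rollout rewards instead of the paper's progress reward, and the task names, `boundary_score`, and the 0.9 threshold are hypothetical.

```python
from statistics import pvariance

def popoviciu_bound(low, high):
    """Largest possible variance of any reward confined to [low, high]."""
    return (high - low) ** 2 / 4

def boundary_score(rewards, low=0.0, high=1.0):
    """Rollout reward variance as a fraction of the Popoviciu bound (0 to 1)."""
    return pvariance(rewards) / popoviciu_bound(low, high)

def select_boundary_tasks(rollouts, threshold):
    """Keep tasks whose rollouts mix successes and failures enough to clear the threshold."""
    return [task for task, rewards in rollouts.items()
            if boundary_score(rewards) >= threshold]

# Rewards from rollouts that training has already produced (toy values).
rollouts = {
    "mastered": [1.0, 1.0, 1.0, 1.0],
    "too_hard": [0.0, 0.0, 0.0, 0.0],
    "boundary": [1.0, 0.0, 1.0, 0.0],
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],
}
for task, rewards in rollouts.items():
    print(task, boundary_score(rewards))
print(select_boundary_tasks(rollouts, threshold=0.9))  # ['boundary']
```

Tasks the policy always solves or always fails score zero, which matches the observation that they contribute no group-relative gradient signal; the score reuses rewards already computed for training, so it needs no extra inference.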

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG) 18 Jun 2026

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

Jaeho Lee, Nick Merrill, Ezra Karger

ArXiv ID: 2606.18686

Abstract:

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

Insights

Contribution: The paper introduces ForecastBench-Sim, a new forecasting benchmark based on simulated game rollouts to overcome the limitations of real-world forecasting data. It provides a controlled environment for evaluating probabilistic reasoning, causal inference, and tail-event prediction.

Core Idea: By using a turn-based strategy game (Freeciv) as a world simulator, the authors create a sandbox where future states are immediately resolvable and counterfactual scenarios can be perfectly controlled.

Technique: The benchmark utilizes game rollouts to generate structured world reports and corresponding future states, allowing for the automated generation of continuous, binary, and conditional forecasting questions.

Pipeline: Game state snapshot (World Report) → Forecasting question generation → Model/Human prediction → Simulation rollout → Automated scoring against ground truth.

Methodology: The authors developed a pipeline to generate diverse question families (e.g., causal, rare events) and validated the benchmark using both large language model evaluations and an anonymized human pilot.

Results: The benchmark successfully provides immediately resolvable tasks for studying dynamic world states and offers a scalable way to generate high-quality, diverse forecasting data that real-world datasets lack.

Limitations: The benchmark is limited to the dynamics of a simulated game environment, which may not fully capture the complexity or noise of real-world socio-economic systems.
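
As an illustration of the final scoring step, the sketch below scores binary forecasts with the Brier score once the simulation has resolved them. The digest does not state which scoring rule the benchmark uses, so the choice of Brier score is an assumption, and the forecasts and outcomes are made-up values.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and resolved 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need one resolved outcome per forecast")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical binary questions: forecast probabilities, then the outcomes
# obtained by continuing the simulated game to the question's horizon.
forecasts = [0.9, 0.2, 0.5]
outcomes = [1, 0, 1]
print(brier_score(forecasts, outcomes))
```

Lower is better: a perfect forecaster scores 0, and always answering 0.5 scores 0.25.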

Robotics

5/5 Artificial Intelligence (cs.AI) 18 Jun 2026

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii

ArXiv ID: 2606.18786

Abstract:

Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.

Insights

Contribution: The authors introduce R2D-RL, a new reinforcement learning environment that bridges the RoboCup 2D Soccer Simulation (RCSS2D) with modern Python-based multi-agent reinforcement learning (MARL) workflows.

Core Idea: The core idea is to overcome the limitations of the competition-oriented server-client architecture of RCSS2D by enabling high-performance, synchronized communication for MARL training.

Technique: The framework utilizes shared-memory communication and cycle-level synchronization to connect RCSS2D and HELIOS-based player clients to a Python interface.

Pipeline: RCSS2D Simulation State → Shared-Memory Communication → Python MARL Interface (Agent Decision) → HELIOS Player Clients → Action Execution in RCSS2D

Methodology: The environment supports full-field and scenario-based training with configurable action spaces, action masks, and EPV-based reward shaping for complex tactical behavior.

Results: The paper provides a benchmark for front-goal scenarios and 11-vs-11 full-field play, establishing baseline results for multi-agent training in a realistic soccer environment.

Limitations: The paper focuses on the infrastructure and initial benchmarks, leaving the exploration of more complex long-horizon tactical behaviors as an open area for future research.
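
Action masks of the kind mentioned above are commonly applied by zeroing out invalid actions before sampling from the policy. The following Python sketch shows that general technique only; it does not use the R2D-RL interface, and the action names, logits, and mask are hypothetical.

```python
import math

def masked_action_probs(logits, mask):
    """Softmax over policy logits with invalid actions forced to probability zero."""
    valid = [logit for logit, ok in zip(logits, mask) if ok]
    if not valid:
        raise ValueError("at least one action must be valid")
    peak = max(valid)  # subtract the max valid logit for numerical stability
    weights = [math.exp(logit - peak) if ok else 0.0
               for logit, ok in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]

ACTIONS = ["dribble", "pass", "shoot", "tackle"]  # hypothetical discrete action set
logits = [0.5, 1.2, 2.0, 0.1]
mask = [True, True, False, False]  # e.g. out of shooting range, and a ball holder cannot tackle
probs = masked_action_probs(logits, mask)
for action, prob in zip(ACTIONS, probs):
    print(action, round(prob, 3))
```

Masking before normalization keeps the remaining probabilities summing to one, so the agent never samples an action the environment would reject.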

5/5 Artificial Intelligence (cs.AI) 18 Jun 2026

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

ArXiv ID: 2606.18847

Abstract:

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

Insights

Contribution: The paper introduces WorldLines, a new benchmark for long-horizon stateful embodied agents, and ObsMem, a memory framework designed for visibility-aware state tracking.

Core Idea: Embodied agents require the ability to maintain and utilize long-term memories of user routines and dynamic world states to perform complex household tasks over extended periods.

Technique: The authors develop an observer-grounded memory framework (ObsMem) that maintains visibility-aware memories and action-native state trails to handle partial observability and state changes.

Pipeline: Household traces (dialogues, actions, feedback, state changes) → Evidence-linked sample conversion → Memory QA and Embodied Task Planning evaluation

Methodology: The researchers constructed temporally extended household traces to create a project-driven benchmark and evaluated agent performance using a memory-augmented architecture.

Results: Experiments highlight significant challenges in handling partial observability and overwritten world states, while demonstrating that ObsMem provides a superior reference architecture for state-aware decision-making.

Limitations: The study identifies persistent difficulties in effectively translating long-term memories into executable embodied plans in dynamic environments.

4/5 Artificial Intelligence (cs.AI) 18 Jun 2026

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

ArXiv ID: 2606.18888

Abstract:

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) Imagining plausible environment configurations based on observation history and (2) Planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.

Insights

Contribution: The paper introduces BeliefDiffusion, a framework that integrates diffusion models with Model Predictive Control (MPC) to handle multimodal belief distributions in partially observable navigation tasks.

Core Idea: By combining generative modeling to represent complex environment uncertainties with MPC for long-term planning, the method overcomes the limitations of unimodal belief approximations and data-heavy generative models.

Technique: The framework uses diffusion models to imagine plausible environment configurations from observation history and employs MPC to plan navigation strategies across these aggregated configurations.

Pipeline: Observation history → Diffusion-based environment imagination → Aggregated configuration planning via MPC → Navigation actions

Methodology: The authors developed a two-step process: first, generating a distribution of possible environment states using a diffusion model, and second, optimizing paths across these states using a predictive control loop.

Results: BeliefDiffusion significantly outperformed model-free reinforcement learning and existing generative approaches in both navigation success rate and path efficiency in synthetic map environments.

Limitations: The paper focuses on synthetic map environments, leaving the scalability and robustness of the diffusion-based imagination in complex, real-world dynamic environments as an open question.
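
The two-step loop (imagine plausible maps, then plan across them) can be sketched in Python as below. This is a rough illustration under heavy assumptions, not BeliefDiffusion itself: a random fill of unknown cells stands in for the diffusion model, the planner looks only at the next move and averages breadth-first-search path lengths over the sampled maps, and `imagine`, `plan_step`, `wall_prob`, and `penalty` are hypothetical names and parameters.

```python
import random
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def imagine(observed, k, wall_prob, rng):
    """Stand-in for the generative model: fill unknown cells (None) with random walls."""
    return [[[cell if cell is not None else int(rng.random() < wall_prob)
              for cell in row] for row in observed] for _ in range(k)]

def path_length(grid, start, goal):
    """Breadth-first-search path length on a 0 (free) / 1 (wall) grid, or None if unreachable."""
    if grid[start[0]][start[1]] == 1:
        return None
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

def plan_step(observed, pos, goal, k=32, wall_prob=0.3, penalty=100, rng=None):
    """Pick the next move with the lowest average cost-to-goal over the imagined maps."""
    rng = rng or random.Random(0)
    samples = imagine(observed, k, wall_prob, rng)
    best_action, best_cost = None, float("inf")
    for action, (dr, dc) in MOVES.items():
        nxt = (pos[0] + dr, pos[1] + dc)
        if not (0 <= nxt[0] < len(observed) and 0 <= nxt[1] < len(observed[0])):
            continue
        if observed[nxt[0]][nxt[1]] == 1:  # known wall, never a valid move
            continue
        lengths = [path_length(grid, nxt, goal) for grid in samples]
        cost = sum(penalty if d is None else 1 + d for d in lengths) / k
        if cost < best_cost:
            best_action, best_cost = action, cost
    return best_action

# Fully observed map: the only route to the goal goes right, around the wall.
known = [[0, 0, 0],
         [1, 1, 0],
         [0, 0, 0]]
print(plan_step(known, pos=(0, 0), goal=(2, 0)))  # right

# Partially observed map: None marks cells the agent has not seen yet.
partial = [[0, None, 0],
           [None, 1, None],
           [0, None, 0]]
print(plan_step(partial, pos=(0, 0), goal=(2, 2)))
```

In a receding-horizon loop the agent would execute the chosen move, update `observed` with new sensor readings, and call `plan_step` again.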

Personal Interests

Papers discovered through your interest topics.

Multi-Agent Systems

3/5 stat.AP 17 Jun 2026

Accelerating Network-Agent Dispersion: Territorial Behavior and Directionally Biased Lazy Random Walks

Li Zeng, Steve Alpern

ArXiv ID: 2606.19294

Abstract:

Territorial behavior can greatly accelerate decentralized agent dispersion on networks. This paper studies a network-agent dispersion problem in which m autonomous agents move in discrete time on a connected graph and seek a configuration in which no two agents occupy the same node. We focus on the dispersion case m = n, where successful configurations contain exactly one agent per node. In the baseline model, each agent follows a lazy random walk with a common laziness parameter p. This process defines a finite absorbing Markov chain, and the expected absorption time is used to measure dispersion efficiency. We introduce two local behavioral extensions: territorial behavior, in which an agent that is alone at a node claims that node and repels later arrivals, and directional bias, in which agents share a preferred direction of movement on paths and cycles. Exact calculations on three-agent path and cycle networks and Monte Carlo simulations on larger instances show that territorial behavior substantially reduces expected dispersion time, with larger relative reductions as network size increases. Directional bias alone has limited effect in most small-network cases, but when combined with territorial behavior it can produce large additional speedups. In particular, the simulations show reductions of 99.22% on L100 and 97.48% on C100 when all agents start from one node. These results show how simple local movement rules can strongly affect global dispersion time in decentralized networked multi-agent systems.

Insights

Contribution: The paper demonstrates how incorporating territorial behavior and directional bias into local agent movement rules significantly accelerates the time required for decentralized agents to disperse across a network.

Core Idea: Simple local behavioral rules, specifically territoriality and directional bias, can drastically reduce the expected absorption time in a network-agent dispersion problem compared to standard lazy random walks.

Technique: The authors model the dispersion as a finite absorbing Markov chain and analyze the expected absorption time using exact calculations and Monte Carlo simulations.

Pipeline: Network graph and initial agent positions → Local movement rules (Lazy Random Walk + Territoriality + Directional Bias) → Final configuration with one agent per node

Methodology: The study compares a baseline lazy random walk model against two extensions: territorial behavior (repelling arrivals) and directional bias (shared movement preferences) on path and cycle networks.

Results: Territorial behavior substantially reduces expected dispersion time, and the relative reduction grows with network size. With all agents starting from a single node, the simulations show reductions of 99.22% on L100 and 97.48% on C100 (presumably the 100-node path and cycle).

Limitations: Directional bias alone has limited effect on small networks, and the study focuses specifically on path and cycle topologies.
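
The comparison summarized above can be sketched as a small Monte Carlo experiment on a path. This is a hedged illustration, not the paper's simulator: the digest does not spell out the exact territorial rule, so the code assumes one possible reading in which an agent that finds itself alone on an unclaimed node claims it and never moves again, while any other agent standing on a claimed node is repelled, meaning it must move on its next step instead of being lazy. Directional bias is not modeled, and the function names and parameters are hypothetical.

```python
import random


def simulate_dispersion(n, p, territorial, rng, max_steps=1_000_000):
    """Steps until n agents, all starting at node 0 of an n-node path (n >= 2),
    occupy n distinct nodes. The territorial rule is an assumed reading."""
    pos = [0] * n            # current node of each agent
    owner = [None] * n       # owner[v]: agent that has claimed node v, if any
    settled = [False] * n    # settled[a]: agent a owns a node and stays there
    for step in range(1, max_steps + 1):
        for a in range(n):
            if settled[a]:
                continue     # owners never leave their node
            u = pos[a]
            repelled = owner[u] is not None  # standing on another agent's node
            if not repelled and rng.random() < p:
                continue     # lazy step: stay put
            nbrs = [v for v in (u - 1, u + 1) if 0 <= v < n]
            pos[a] = rng.choice(nbrs)
        counts = [0] * n
        for u in pos:
            counts[u] += 1
        if territorial:
            # An unsettled agent alone on an unclaimed node claims it.
            for a in range(n):
                u = pos[a]
                if not settled[a] and owner[u] is None and counts[u] == 1:
                    owner[u], settled[a] = a, True
        if max(counts) == 1:
            return step      # one agent per node: dispersion achieved
    raise RuntimeError("no dispersion within max_steps")


def mean_time(n, p, territorial, trials=100, seed=0):
    """Average dispersion time over independent trials with a fixed seed."""
    rng = random.Random(seed)
    total = sum(simulate_dispersion(n, p, territorial, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for flag in (False, True):
        label = "territorial" if flag else "baseline"
        print(label, mean_time(8, 0.5, flag))
```

Under these assumptions the baseline has to wait for all walkers to land on distinct nodes at the same instant, whereas the territorial variant locks in progress one node at a time. That is the intuition behind the speedup, although the numbers this sketch produces are not comparable to the paper's reported reductions.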

Tech News

Computer Vision

Hacker News Thu, 18 Ju

Midjourney Medical

The discussion explores the application of Midjourney's generative capabilities within the medical field. It covers potential use cases such as medical illustration, anatomical visualization, and the ethical considerations of synthetic imagery in healthcare.

Computing Systems

Hacker News Thu, 18 Ju

[x86] AI Compute Extensions (ACE) Specification

The AI Compute Extensions (ACE) specification introduces a standardized set of instructions for x86 processors to accelerate AI workloads. It aims to improve efficiency and performance for deep learning operations by providing hardware-level optimizations. This initiative seeks to bridge the gap between general-purpose CPUs and specialized AI accelerators.

LLM

Hacker News Thu, 18 Ju

Local Qwen isn't a worse Opus, it's a different tool

The article argues that local models such as Qwen are not inferior versions of proprietary models such as Claude Opus, but distinct tools with their own advantages in privacy, cost, and customization. It encourages users to evaluate models against their specific use cases rather than expecting direct performance parity.

Reddit r/DeepLearning 2026-06-18

Seeking Peer Review: Comprehensive Mathematical Derivations of GPT-2 Backpropagation (Index-Form)

A user on the r/DeepLearning subreddit has shared a comprehensive derivation of GPT-2 backpropagation written in index-form notation. The post asks the community to review the derivation and check the correctness of the gradient calculations.

MLOps

Reddit r/DeepLearning 2026-06-18

LiteLLM Stability Announcement

LiteLLM, a library widely used for managing calls to multiple LLM APIs, has published a stability announcement. The update matters for developers who need reliable integration across model providers, and it signals the project's maturation for production use.

Robotics

Reddit r/DeepLearning 2026-06-18

I built a CLI tool to diff robotics datasets at the episode level (so you can figure out why your imitation learning model regressed)

A new CLI tool called EpisodeVault has been released to help robotics researchers debug imitation learning regressions by diffing datasets at the episode level. It uses DuckDB and PyArrow to provide sub-second analysis of task distributions and quality metrics without loading raw video data. Key features include anomaly detection, custom Python-based quality metrics, and HuggingFace Hub integration.

Trending repositories on GitHub filtered and scored for relevance to your interests.
