Monday, June 1, 2026

Daily Digest 2026-06-01

The dominant theme across today’s content centers on advancing AI systems through specialized techniques for generalization, physical reasoning, and efficient deployment, alongside growing emphasis on safety, alignment, and practical implementation challenges.

Research highlights:

Generalization and Robustness in AI Systems: Papers explore methods to enhance agent performance in dynamic environments, improve uncertainty handling in reinforcement learning, and disentangle evolutionary capabilities in self-evolving models.
Physical Reasoning and Multimodal Capabilities: Research focuses on benchmarking physical understanding in multimodal models, generating physically grounded diagrams, and aligning AI with real-world dynamics through embodied reasoning.
Efficient Inference and Scaling: Techniques for adaptive model routing, segment-level reasoning optimization, and context management for long-horizon tasks are highlighted to improve computational efficiency.
Autonomous Systems and Safety: Work on reinforcement learning for autonomous driving, safe search agents, and structured reward mechanisms underscores efforts to ensure reliability in critical applications.
Evaluation and Alignment: New frameworks for judging LLM policies, assessing pluralistic alignment in generative models, and structuring cognitive processes for safe AI deployment are emphasized.

Tech buzz:

Advances in local AI deployment, such as 1-bit image generation models, intersect with security concerns like WebGL fingerprinting requirements and data exfiltration risks in tools like ChatGPT for Google Sheets.
Major platforms are redefining user engagement through subscriptions (e.g., Meta’s social media services) and operational shifts, such as moving from hierarchical to peer-to-peer collaboration models.
Innovations in prototyping speed and infrastructure tools (e.g., Streambed for data streaming) reflect ongoing efforts to streamline AI integration into workflows.

Global Trends

Papers discovered from ArXiv subject categories

AI Safety

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama

ArXiv ID: 2605.30803

Abstract:

LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be "helpful and factual" can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust; and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from 46.4% to 36.0% with little change in inter-judge agreement (α = 0.531 → 0.519).

Insights

Contribution: PReMISE introduces a framework for discovering and auditing policy-level rubrics used by LLM judges, addressing reliability, preference alignment, and adversarial robustness through targeted repair operations.

Core Idea: Reusable rubrics define measurement specifications for LLM judges, and PReMISE ensures their quality by auditing structural adequacy, reliability, preference fit, and adversarial robustness while enabling repairs.

Technique: PReMISE uses pairwise human-preference data to discover rubrics and applies four audit axes to evaluate and refine them, with repair operations like preference-rank selection and reliability-constrained refinement.

Pipeline: Human-preference data → discover policy-level rubrics → audit along four axes → output refined rubric set with improved reliability and robustness

Methodology: The framework combines rubric discovery from preference data with systematic audits using structural, reliability, preference, and adversarial checks, followed by targeted repairs based on audit results.

Results: Preference-rank selection raised judge accuracy on paired responses from 65.0% to 68.6%; reliability-constrained refinement reduced the rate at which exploit responses receive high scores from 46.4% to 36.0% with little change in inter-judge agreement (α = 0.531 → 0.519). PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously.

Limitations: Repairs may involve trade-offs between metrics, and the framework's effectiveness depends on the quality of input preference data. Generalizability across diverse domains or judge architectures remains unexplored.
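
To make the "judge accuracy on paired responses" metric concrete, here is a minimal sketch, assuming accuracy means the fraction of human-preference pairs in which the judge scores the preferred response higher. The `toy_judge` function and the sample pairs are hypothetical stand-ins, not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (human-preferred response, rejected response)

def pairwise_accuracy(pairs: Sequence[Pair],
                      judge_score: Callable[[str], float]) -> float:
    """Fraction of pairs where the judge scores the preferred response strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for chosen, rejected in pairs
               if judge_score(chosen) > judge_score(rejected))
    return wins / len(pairs)

# Hypothetical stand-in for a rubric-conditioned LLM judge: it rewards length only,
# which mimics a vague rubric that favors polished but wrong answers.
def toy_judge(response: str) -> float:
    return float(len(response))

toy_pairs = [
    ("a detailed, correct answer", "wrong"),
    ("ok", "a long but invented answer"),
]
accuracy = pairwise_accuracy(toy_pairs, toy_judge)  # 0.5 on this toy data
```

Swapping in a different rubric changes `judge_score`, and therefore the measured accuracy, while the judge model and the pairs stay fixed.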

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian, Haozhe He, Zhihao Huang, Chaochao Chen, Xiaolin Zheng

ArXiv ID: 2605.30838

Abstract:

LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.

Insights

Contribution: COMPASS introduces a novel framework for aligning search agents with safety constraints by addressing retrieval-induced safety degradation through cognitive tree exploration and introspective alignment.

Core Idea: COMPASS combines cognitive tree exploration (CTE) to identify stealthy attack paths and introspective step-wise alignment (ISA) to supervise risky intermediate steps, ensuring safety without sacrificing utility.

Technique: The framework integrates Monte Carlo Tree Search (MCTS)-guided cognitive exploration with step-wise process supervision to detect and mitigate safety violations in multi-step agent interactions.

Pipeline: Harmful intent → Cognitive Tree Exploration (CTE) synthesizes stealthy attack trajectories → Introspective Step-wise Alignment (ISA) isolates risky intermediate actions → fine-grained process supervision → safety-aligned, utility-preserving search agent

Methodology: COMPASS uses CTE to synthesize stealthy attack trajectories and ISA to isolate risky intermediate actions for fine-grained process supervision, aiming to balance safety and task performance with less training data.

Results: COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data than existing alignment methods.

Limitations: Effectiveness depends on the quality of synthetic attack data; real-world deployment may require additional domain-specific tuning for complex safety constraints.
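
The abstract names MCTS-guided exploration without giving details. As background only, the sketch below shows the standard UCT selection rule that generic Monte Carlo Tree Search uses to balance exploitation and exploration. It is not COMPASS's CTE procedure; the `Node` class and the toy tree are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    visits: int = 0
    total_value: float = 0.0
    children: List["Node"] = field(default_factory=list)

def uct_score(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCT: mean value plus an exploration bonus; unvisited nodes come first."""
    if child.visits == 0:
        return math.inf
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))

# Toy tree: two visited branches and one unvisited branch.
root = Node("root", visits=10, children=[
    Node("a", visits=6, total_value=3.0),
    Node("b", visits=4, total_value=3.0),
    Node("c"),
])
chosen = select_child(root)  # the unvisited branch "c" is explored first
```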

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG) 1 Jun 2026

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

Atahan Karagoz

ArXiv ID: 2605.31021

Abstract:

Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the plurality of human judgment to aggregated statistical baselines, thereby obscuring cultural, demographic, and contextual variability in evaluation. We introduce a state-space constrained emulation framework for AI evaluation that replaces singular assessment functions with a structured manifold of synthetic cognitive profiles representing diverse human perspectives. We show that modern generative architectures can instantiate and maintain these evaluative personas with high consistency, enabling a form of pluralistic, perspective-dependent benchmarking that more closely reflects real-world consensus variability. However, we further analyze the stability of these simulated evaluators under sequential inference and stochastic prompt perturbations, revealing systematic degradation in persona coherence that manifests as state-space drift and semantic inconsistency. These findings suggest that static alignment constraints are insufficient for sustaining robust evaluative behavior over time. Instead, we argue for the necessity of embedding dynamic, viability-driven regulatory mechanisms within generative systems to preserve coherent cognitive emulation. By framing persona-based evaluation as a structured dynamical system over latent representation manifolds, this study provides a foundation for more adaptive, human-aligned, and context-sensitive approaches to AI evaluation.

Insights

Contribution: Introduces a persona-based evaluation framework for generative AI that captures cultural, demographic, and contextual variability through synthetic cognitive profiles, replacing monolithic benchmarks with pluralistic, perspective-dependent assessment.

Core Idea: Generative AI systems can maintain diverse evaluative personas with high consistency, but require dynamic regulatory mechanisms to prevent state-space drift and semantic inconsistency during sequential inference.

Technique: A state-space constrained emulation framework instantiates synthetic cognitive profiles as structured manifolds, enabling perspective-dependent benchmarking while analyzing persona stability under stochastic perturbations.

Pipeline: Generative AI outputs → evaluation via synthetic persona manifolds → pluralistic alignment metrics with contextual variability

Methodology: Analyzes persona coherence through sequential inference experiments and stochastic prompt perturbations, while proposing viability-driven regulatory mechanisms within latent representation manifolds.

Results: Modern architectures maintain persona consistency initially but exhibit systematic degradation in coherence as state-space drift and semantic inconsistency emerge under sequential inference.

Limitations: Static alignment constraints prove insufficient for long-term persona stability. The paper argues for dynamic, viability-driven regulatory mechanisms to preserve coherent cognitive emulation, but these remain an open problem.

Agentic AI

5/5 Artificial Intelligence (cs.AI) 1 Jun 2026

MAVEN: Improving Generalization in Agentic Tool Calling

Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

ArXiv ID: 2605.30738

Abstract:

Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.

Insights

Contribution: MAVEN introduces a lightweight symbolic reasoning framework to enhance generalization in agentic tool calling, demonstrated through improved performance on multi-step reasoning benchmarks and a novel stress-test benchmark.

Core Idea: MAVEN enables structured decomposition of tasks, adaptive tool orchestration, and intermediate verification to improve compositional reasoning and end-to-end task success in agentic systems.

Technique: MAVEN employs a modular symbolic scaffold that combines structured decomposition, adaptive tool orchestration, and intermediate verification to strengthen reasoning across domains without additional model training.

Pipeline: Input task → structured decomposition into subtasks → adaptive tool orchestration with intermediate verification → output solution with end-to-end validation

Methodology: Evaluated on BFCL v3, TauBench, Tau2Bench, AceBench, and the newly introduced MAVEN-Bench, comparing an open-weight backbone against frontier proprietary baselines.

Results: MAVEN improves GPT-OSS-120b base model accuracy from 48% to 71% on MAVEN-Bench; achieves competitive results with proprietary models at ~1/10 the cost.

Limitations: Focus on mathematical/physical reasoning domains; potential gaps in generalization to non-symbolic or unstructured task environments.
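
The decompose, call tools, and verify pattern described above can be sketched as a small loop. This is an illustrative toy under stated assumptions, not MAVEN's implementation: the `TOOLS` and `VERIFIERS` tables and the hand-written plan are hypothetical, and a real scaffold would have the model produce the plan.

```python
import math
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Tuple[float, ...]]  # (tool name, numeric arguments)

# Hypothetical tools: each maps numeric arguments to a numeric result.
TOOLS: Dict[str, Callable[..., float]] = {
    "momentum": lambda m, v: m * v,
    "kinetic_energy": lambda m, v: 0.5 * m * v ** 2,
}

# Hypothetical verifiers: each re-derives the result by a different route.
VERIFIERS: Dict[str, Callable[..., bool]] = {
    "momentum": lambda out, m, v: math.isclose(out / m, v),
    "kinetic_energy": lambda out, m, v: math.isclose(out, (m * v) ** 2 / (2 * m)),
}

def run_plan(plan: List[Step]) -> List[float]:
    """Execute each step and stop at the first one that fails verification."""
    results = []
    for tool, args in plan:
        out = TOOLS[tool](*args)
        if not VERIFIERS[tool](out, *args):
            raise ValueError(f"verification failed at step {tool}{args}")
        results.append(out)
    return results

# Hand-written decomposition of "find momentum, then kinetic energy" for m=2, v=3.
plan = [("momentum", (2.0, 3.0)), ("kinetic_energy", (2.0, 3.0))]
results = run_plan(plan)  # [6.0, 9.0]
```

Checking every intermediate step, rather than only the final answer, is what lets such a loop expose the gap between partial reasoning quality and end-to-end success.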

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu

ArXiv ID: 2605.30621

Abstract:

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.

Insights

Contribution: This paper disentangles two distinct capabilities in self-evolving LLM agents, harness-updating and harness-benefit, and shows that base model capability does not predict effectiveness in harness self-evolution.

Core Idea: Harness-updating (generating useful updates) and harness-benefit (leveraging updates for task-solving) are decoupled, with weak models failing to activate or follow harness artifacts effectively.

Technique: The study analyzes models across capability tiers using empirical evaluation of harness self-evolution, comparing gains from updates produced by different models (e.g., Qwen3.5-9B vs. Claude Opus 4.6).

Pipeline: Task execution with initial harness → generate harness updates from execution evidence → evaluate task performance with updated harnesses

Methodology: Models were categorized by capability tiers, and their harness self-evolution capabilities were quantitatively assessed through task performance gains and failure mode analysis.

Results: Harness-updating gains are flat across model tiers (Qwen3.5-9B ≈ Claude Opus 4.6), while harness-benefit is non-monotonic (mid-tier models benefit most). Weak-tier models either fail to activate relevant harness artifacts or activate them but fail to follow them faithfully.

Limitations: Findings are limited to the harness types studied (prompts, skills, memories, and tools) and may not generalize to all agent architectures; open questions remain about optimal training objectives for harness invocation.

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Learning Agent-Compatible Context Management for Long-Horizon Tasks

Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie, Yuyang Li, Wenhao Zhang, Zhewei Wei, Yaliang Li, Jian-Yun Nie

ArXiv ID: 2605.30785

Abstract:

LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation, making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

Insights

Contribution: Introduces AdaCoM, an external context management system that improves long-horizon task performance for frozen agents without retraining, revealing a fidelity-reliability trade-off in context management strategies.

Core Idea: Train an external LLM to dynamically manage context for any frozen agent using flexible modification actions and reinforcement learning, adapting to task constraints and agent capabilities.

Technique: End-to-end reinforcement learning trains an external LLM to perform context pruning, summarization, and modification actions, optimizing for task success while maintaining agent-specific constraints.

Pipeline: Raw context input → AdaCoM processes with modification actions → Output refined context for agent reasoning

Methodology: Trained via reinforcement learning on diverse agents, AdaCoM learns to balance context preservation and compression based on agent performance profiles (measured by vanilla ReAct scores).

Results: Substantial performance improvements on web search and deep research benchmarks; revealed fidelity-reliability trade-off; best transfer effectiveness between agents with similar ReAct capabilities.

Limitations: Effectiveness depends on agent capability similarity; requires access to agent performance metrics for optimal strategy transfer; may not generalize to agents with fundamentally different reasoning architectures.
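
To illustrate what "modification actions" on a frozen agent's context might look like, here is a minimal sketch. The action set (`KEEP`, `DROP`, `TRUNCATE`), the `Entry` type, and the hard-coded action list are all hypothetical; in AdaCoM the choice of actions is what the external LLM learns through reinforcement learning.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Action(Enum):
    KEEP = "keep"          # preserve the entry at full fidelity
    DROP = "drop"          # prune stale content
    TRUNCATE = "truncate"  # compress a long entry

@dataclass
class Entry:
    kind: str
    text: str

def apply_actions(context: List[Entry], actions: List[Action],
                  max_chars: int = 30) -> List[Entry]:
    """Apply one action per context entry and return the managed context."""
    if len(context) != len(actions):
        raise ValueError("one action per context entry is required")
    managed = []
    for entry, action in zip(context, actions):
        if action is Action.DROP:
            continue
        if action is Action.TRUNCATE and len(entry.text) > max_chars:
            entry = Entry(entry.kind, entry.text[:max_chars] + "...")
        managed.append(entry)
    return managed

context = [
    Entry("task", "Find the answer and cite one source."),
    Entry("observation", "Search result page 1: " + "x" * 200),
    Entry("observation", "Stale result from an abandoned query."),
]
# Hard-coded here; a learned manager would choose these per agent and per step.
managed = apply_actions(context, [Action.KEEP, Action.TRUNCATE, Action.DROP])
```

Under the trade-off described above, a manager serving a stronger agent would presumably choose `KEEP` more often, while one serving a weaker agent would lean on `TRUNCATE` and `DROP`.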

Computer Vision

4/5 Artificial Intelligence (cs.AI), physics.app-ph 1 Jun 2026

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei

ArXiv ID: 2605.30900

Abstract:

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.

Insights

Contribution: BilliardPhys-Bench introduces a synthetic benchmark to evaluate physical reasoning and visual dynamics prediction in multimodal large language models (MLLMs), highlighting their limitations in simulating real-world physics.

Core Idea: The benchmark tests MLLMs' ability to predict collisions, wall bounces, and final positions in billiards scenarios, revealing a systematic failure mode, 'stasis bias', that appears when the correct physical outcome is harder to infer.

Technique: A procedural engine generates randomized billiards environments with friction and elastic collisions, enabling controlled evaluation of physical reasoning tasks.

Pipeline: Input: synthetic billiards image → Process: MLLM analyzes visual dynamics and physics → Output: predicted collision events, bounce trajectories, and final positions

Methodology: Evaluated GPT, Claude, Gemini, and Qwen family models on three tasks, measuring accuracy across varying simulation times and scene complexities while analyzing failure modes.

Results: Performance degrades as simulation time and scene complexity increase; models exhibit 'stasis bias,' tending to predict no interaction when the correct physical outcome is harder to infer.

Limitations: Benchmark focuses on synthetic billiards scenarios, limiting generalizability to real-world physics; does not address cross-modal reasoning beyond visual input.
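
The "final position" task can be illustrated with a closed-form toy model. This sketch assumes a single point ball (zero radius), constant friction deceleration, perfectly elastic walls, and no other balls; it is not the benchmark's procedural engine, and the function names are hypothetical. It uses the mirror-image trick: elastic wall bounces are equivalent to a straight path on an unfolded plane that is then folded back onto the table.

```python
import math
from typing import Tuple

def fold(coord: float, size: float) -> float:
    """Reflect an unfolded coordinate back into [0, size]."""
    period = 2.0 * size
    r = coord % period
    return r if r <= size else period - r

def final_position(x: float, y: float, vx: float, vy: float,
                   width: float, height: float, decel: float) -> Tuple[float, float]:
    """Rest position of a point ball under constant deceleration with elastic walls."""
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return (x, y)
    distance = speed ** 2 / (2.0 * decel)  # stopping distance: v^2 / (2a)
    ux, uy = vx / speed, vy / speed        # unit direction of travel
    return (fold(x + ux * distance, width), fold(y + uy * distance, height))

# The ball travels 2 units in total: 1 to the right wall, then 1 back.
end = final_position(1.0, 1.0, 2.0, 0.0, width=2.0, height=2.0, decel=1.0)
# end == (1.0, 1.0)
```

A model showing stasis bias would answer as if the ball never reached the wall; the closed form makes the bounce explicit.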

3/5 Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV) 1 Jun 2026

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

Nafiul Haque, Syed Nazmus Sakib, Shifat E Arman

ArXiv ID: 2605.30512

Abstract:

Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.

Insights

Contribution: PhyDrawGen addresses systematic errors in physics diagram generation by integrating symbolic physics constraints with visual generation, achieving robust physical accuracy across diverse domains.

Core Idea: Decoupling semantic scene understanding from physical constraint enforcement through a neuro-symbolic pipeline that combines language models, geometric solvers, and visual verification.

Technique: Combines a large language model for scene graph extraction, a deterministic solver for Planar Straight-Line Graph (PSLG) encoding of physical laws, and a vision model for iterative constraint verification.

Pipeline: Natural language input → typed scene graph extraction → PSLG generation with physics constraints → iterative visual verification → physically accurate diagram output

Methodology: Evaluated on 1,449 problems across mechanics, optics, and electromagnetism using benchmarked comparisons against leading vision models, with focus on constraint violation correction.

Results: Significantly outperformed GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro on the 1,449-problem benchmark, and remained physically accurate even on unusual-object problems.

Limitations: Depends on pre-trained models (Qwen-VL) for visual verification; requires domain-specific physics solvers; benchmark coverage may not reflect real-world complexity.
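
As one example of a physical constraint a deterministic check could enforce, the sketch below tests force balance on a free-body diagram. It is a hypothetical illustration, not PhyDrawGen's solver: the `Force` representation (magnitude plus direction in degrees) and the tolerance are assumptions.

```python
import math
from typing import Iterable, Tuple

Force = Tuple[float, float]  # (magnitude in newtons, direction in degrees)

def net_force(forces: Iterable[Force]) -> Tuple[float, float]:
    """Sum force vectors component-wise."""
    fx = fy = 0.0
    for magnitude, angle_deg in forces:
        angle = math.radians(angle_deg)
        fx += magnitude * math.cos(angle)
        fy += magnitude * math.sin(angle)
    return fx, fy

def is_balanced(forces: Iterable[Force], tol: float = 1e-9) -> bool:
    """True when the net force is zero within a small numerical tolerance."""
    fx, fy = net_force(forces)
    return math.hypot(fx, fy) <= tol

# Block at rest on a table: weight points down (270 degrees), normal force up (90).
at_rest = [(9.8, 270.0), (9.8, 90.0)]
# Adding an unopposed 1 N push to the right breaks the balance.
pushed = at_rest + [(1.0, 0.0)]
```

A diagram whose drawn vectors fail `is_balanced` for a body stated to be in equilibrium would be flagged as a constraint violation.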

General

2/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Transforming and Encoding FTS for SAT Solving: What Helps, What Hurts (Extended Version)

João Filipe, Álvaro Torralba, Gregor Behnke

ArXiv ID: 2605.30563

Abstract:

Factored tasks are a classical planning representation that extends SAS+ with limited forms of disjunctive preconditions, conditional effects, and angelic nondeterminism. This allows for a more compact representation of tasks than traditional formalisms such as STRIPS or SAS+, and supports a wide range of task transformations. However, existing planning approaches for factored tasks have been limited to heuristic search methods. In this work, we investigate how to encode factored tasks in SAT. We propose several ways to encode the tasks, focusing on different strategies for translating the factored transition relation into propositional logic. We also analyze how to exploit parallelism at various levels in this setting and study the impact of common task transformations on the performance of SAT-based planners.

Insights

Contribution: This work introduces multiple SAT encodings for factored tasks and systematically evaluates their effectiveness, along with the impact of task transformations and parallelism on SAT-based planning performance.

Core Idea: Encoding factored tasks into SAT enables leveraging modern SAT solvers for planning, but requires careful translation of disjunctive and conditional elements into propositional logic.

Technique: The authors propose diverse strategies to translate factored transition relations into SAT formulas, while analyzing parallelism exploitation at multiple abstraction levels.

Pipeline: Factored task → SAT encoding with transformation strategies → SAT solver execution → Plan validity and performance metrics

Methodology: The study designs and compares multiple SAT encodings, implements parallelism optimizations, and evaluates their performance across benchmark tasks with varying transformations.

Results: Certain encoding strategies and task transformations significantly improve solver efficiency, while others introduce scalability challenges; parallelism yields mixed benefits depending on problem structure.

Limitations: The analysis is limited to SAT-based approaches, and the effectiveness of encodings may vary for tasks with complex angelic nondeterminism or large state spaces.
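
For readers unfamiliar with planning as SAT, here is a minimal sketch of a plain sequential encoding for a toy task with a single three-valued variable. It does not cover the disjunctive preconditions, conditional effects, or angelic nondeterminism of factored tasks, and it is not one of the paper's encodings. The task, the variable naming, and the brute-force solver are assumptions chosen to keep the example self-contained. Because exactly one action fires per step and there is only one state variable, no frame axioms are needed here.

```python
from itertools import combinations, product

VALUES = [0, 1, 2]
ACTIONS = {"move01": (0, 1), "move12": (1, 2)}  # name -> (precondition, effect)

def encode(horizon):
    """Build CNF clauses: start at value 0, reach value 2 after `horizon` steps."""
    ids = {}

    def var(*key):
        # Allocate a fresh positive integer for each distinct proposition.
        return ids.setdefault(key, len(ids) + 1)

    clauses = [[var("at", 0, 0)], [var("at", 2, horizon)]]

    def exactly_one(lits):
        clauses.append(list(lits))
        clauses.extend([-a, -b] for a, b in combinations(lits, 2))

    for t in range(horizon + 1):
        exactly_one([var("at", v, t) for v in VALUES])
    for t in range(horizon):
        exactly_one([var("do", name, t) for name in ACTIONS])
        for name, (pre, eff) in ACTIONS.items():
            clauses.append([-var("do", name, t), var("at", pre, t)])      # precondition
            clauses.append([-var("do", name, t), var("at", eff, t + 1)])  # effect
    return clauses, ids

def solve(clauses, n_vars):
    """Brute-force SAT check; fine for a toy, a real planner would call a SAT solver."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return bits
    return None

def plan(horizon):
    clauses, ids = encode(horizon)
    model = solve(clauses, len(ids))
    if model is None:
        return None
    chosen = {k[2]: k[1] for k, i in ids.items() if k[0] == "do" and model[i - 1]}
    return [chosen[t] for t in range(horizon)]
```

Calling `plan(1)` returns `None` because the goal is unreachable in one step, while `plan(2)` returns `["move01", "move12"]`.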

2/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Procedural Generation of First Person Shooter Maps using Map-Elites

Simone de Donato, Pier Luca Lanzi, Daniele Loiacono

ArXiv ID: 2605.30570

Abstract:

We investigate the application of MAP-Elites (a well-known quality diversity algorithm) to design levels for First-Person Shooter (FPS) games. We consider two well-known map representations (All-Black and Grid-Graph) and introduce two novel representations (Point-Line and Spatial-Layout) that improve the characterization of FPS maps. We define a series of metrics to describe maps' topological properties (which solely depend on maps' layout), and emergent properties (which must be evaluated through actual gameplay). We perform an in-depth analysis to identify the most suitable features to guide MAP-Elites illumination process. We apply MAP-Elites with Sliding Boundaries (MESB) to evolve populations of FPS maps. Our results show that the new representations can generate maps with higher diversity and quality than the representations previously used for evolving FPS maps.

Insights

Contribution: This paper introduces novel map representations (Point-Line and Spatial-Layout) and demonstrates their superiority in generating diverse, high-quality FPS maps using MAP-Elites compared to existing representations.

Core Idea: Applying MAP-Elites with Sliding Boundaries (MESB) to procedural FPS map generation, enhanced by new representations that better capture map topology and gameplay properties.

Technique: MAP-Elites with Sliding Boundaries (MESB) is used to evolve FPS maps, guided by metrics for topological and emergent properties derived from gameplay analysis.

Pipeline: Map representations (All-Black, Grid-Graph, Point-Line, Spatial-Layout) → MESB evolution with topology/emergent property metrics → diverse FPS map populations

Methodology: Defined topology metrics (layout-based) and emergent property metrics (gameplay-based), analyzed feature suitability for MAP-Elites, and evolved maps using MESB across multiple representations.

Results: The new representations (Point-Line and Spatial-Layout) generated maps with higher diversity and quality than the previously used All-Black and Grid-Graph representations.

Limitations: Results are specific to FPS map generation; generalizability to other game genres or map scales is untested. Emergent property evaluation depends on predefined metrics, which may not capture all gameplay nuances.
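To make the archive mechanics concrete, here is a minimal sketch of plain MAP-Elites with a fixed grid. It is not the paper's MESB variant (sliding boundaries are omitted), and the genome, the `evaluate` function, and the two feature descriptors are hypothetical stand-ins for a real map representation and gameplay simulation.

```python
import random

def evaluate(genome):
    """Toy stand-in for playing a map: returns (fitness, two features in [0, 1])."""
    fitness = -sum((g - 0.5) ** 2 for g in genome)   # higher is better
    features = (genome[0], genome[1])                # hypothetical descriptors
    return fitness, features

def cell_of(features, bins):
    """Discretize each feature into one of `bins` equal-width intervals."""
    return tuple(min(int(f * bins), bins - 1) for f in features)

def mutate(genome, rng, sigma=0.1):
    """Gaussian mutation, clipped so genes stay in [0, 1]."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in genome]

def map_elites(iterations=2000, bins=5, genome_len=4, warmup=50, seed=0):
    rng = random.Random(seed)
    archive = {}                                     # cell -> (fitness, genome)
    for i in range(iterations):
        if i < warmup:
            # Seed the archive with random genomes.
            child = [rng.random() for _ in range(genome_len)]
        else:
            # Pick a random elite and mutate it.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        fitness, features = evaluate(child)
        cell = cell_of(features, bins)
        # Keep the child only if its cell is empty or it beats the current elite.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive

archive = map_elites()
print(f"filled {len(archive)} of 25 cells")
```

Each cell keeps only its best genome, so the archive ends up holding one elite per feature combination, which is what lets the algorithm return a diverse population rather than a single optimum.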

PDF

2/5 Artificial Intelligence (cs.AI), Multiagent Systems (cs.MA) 1 Jun 2026

Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

Abstract

ArXiv ID: 2605.30680

Authors: Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

Abstract:

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes (up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes), and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

Insights

Contribution: This paper introduces a novel approach to healthcare mechanism design by integrating program synthesis with strategic provider behavior modeling, enabling evaluation of mechanisms through equilibrium outcomes rather than fixed responses.

Core Idea: Hospital mechanisms are reimagined as rule-program synthesis tasks, executed and evaluated in Medi-Sim, a multi-agent environment capturing strategic provider behaviors across five channels.

Technique: LLM-guided evolutionary code search synthesizes inspectable mixed-objective programs that optimize healthcare mechanisms while mitigating strategic provider distortions.

Pipeline: Rule programs → Medi-Sim execution with strategic provider channels → scored equilibrium outcomes and performance metrics

Methodology: The study combines incentive sweeps to uncover classical economic behaviors, audits to identify pressure migration, and evolutionary search to synthesize optimized programs.

Results: A synthesized program eliminated up-coding, halved patient rejection, and retained most of the profit-oriented baseline's funds.

Limitations: Findings depend on the specific strategic channels modeled in Medi-Sim; generalizability to real-world healthcare systems requires further validation.

PDF

MLOps

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 1 Jun 2026

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

Abstract

ArXiv ID: 2605.30898

Authors: Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

Abstract:

In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.

Insights

Contribution: UniScale unifies model routing and test-time scaling into a single optimization framework, overcoming limitations of existing decoupled approaches and achieving superior quality-cost trade-offs in dynamic inference scenarios.

Core Idea: By jointly optimizing model routing and test-time scaling as a unified inference scaling problem, UniScale enables fine-grained adaptability and exploits synergies between the two mechanisms for improved efficiency.

Technique: UniScale models adaptive inference scaling as a contextual multi-armed bandit problem, using LinUCB for policy learning with efficiency-aware cost modeling and high-dimensional action space optimization.

Pipeline: Inference requests → contextual multi-armed bandit formulation → adaptive UIS policy learning via LinUCB → optimized inference decisions with dynamic model routing and scaling

Methodology: The online framework combines efficiency-aware bandit learning with cost modeling to keep optimization stable over high-dimensional action spaces and to adapt to dynamic workloads.

Results: Demonstrates consistently better quality-cost trade-offs across diverse scenarios compared to decoupled baselines, though specific quantitative metrics are not provided in the abstract.

Limitations: Depends on historical data for bandit learning, may face challenges in extremely high-dimensional action spaces, and requires further validation on specific real-world deployment benchmarks.

PDF

NLP

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning

Jian Yao, Xiongcai Luo, Ran Cheng, Kay Chen Tan

Abstract

ArXiv ID: 2605.30832

Authors: Jian Yao, Xiongcai Luo, Ran Cheng, Kay Chen Tan

Abstract:

Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., overthinking), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose SLAT (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that SLAT establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by 50% relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.

Insights

Contribution: SLAT introduces a segment-level adaptive trimming framework to enhance the efficiency of chain-of-thought (CoT) reasoning by addressing structural redundancy without compromising accuracy.

Core Idea: Inefficiency in CoT reasoning stems from high-probability segments with low marginal utility, which can be selectively trimmed to optimize the correctness-length trade-off.

Technique: SLAT employs a reinforcement learning framework that identifies and suppresses redundant segments based on their suboptimality under the correctness-length objective.

Pipeline: input reasoning chain → segment-level analysis for redundancy → adaptive trimming of low-utility segments → output concise, efficient reasoning chain

Methodology: The approach combines theoretical analysis of segment suboptimality with empirical evaluation on standard benchmarks to validate efficiency and accuracy improvements.

Results: SLAT reduces reasoning length by 50% compared to uncompressed baselines while maintaining competitive accuracy on standard benchmarks.

Limitations: Segment identification accuracy depends on model-specific characteristics, and overly aggressive trimming may risk suppressing useful reasoning in complex tasks.

PDF

3/5 Artificial Intelligence (cs.AI) 1 Jun 2026

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang

Abstract

ArXiv ID: 2605.30637

Authors: Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang

Abstract:

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, the grounding of a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill the gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB (knowledge base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.

Insights

Contribution: EHRBench introduces an automated, EHR-grounded benchmark for evaluating LLMs in clinical decision-making, addressing scalability and reliability gaps in real-world CDM tasks.

Core Idea: Constructing a large-scale, reliable clinical decision benchmark by integrating EHR data with LLMs and knowledge bases through an automated pipeline.

Technique: An EHR-LLM-KB interaction pipeline converts EHR trajectories into structured QA items using a specialized LLM, combined with KB-based verification to ensure accuracy and reduce hallucinations.

Pipeline: EHR data → LLM conversion to structured templates → deterministic QA item instantiation → KB-based verification/enrichment → validated QA items for benchmarking.

Methodology: Generated nearly 1M QA items across diagnosis, treatment, and prognosis tasks, benchmarked 30+ LLMs, and analyzed performance trends and robustness.

Results: 960,067 QA items created; consistent LLM capability trends observed across tasks, validating EHRBench's reliability and highlighting gaps in clinical LLM systems.

Limitations: Dependence on LLMs for initial template conversion may inherit biases; KB verification scope is limited by available knowledge base coverage.

PDF

3/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models

Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

Abstract

ArXiv ID: 2605.30747

Authors: Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

Abstract:

Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD

Insights

Contribution: GRiD introduces a novel framework for discovering graph-like rules in knowledge graphs using diffusion models, addressing limitations of existing methods that overlook complex relational structures and computational challenges.

Core Idea: The framework reformulates graph-like rule discovery as a discrete generative process conditioned on target relations, combining structural prior learning with reinforcement learning guided by rule-quality metrics.

Technique: GRiD employs a two-phase training strategy: supervised pre-training on subgraph structural priors followed by reinforcement learning optimization using non-differentiable rule-quality metrics.

Pipeline: Knowledge graph + target relation → supervised pre-training on subgraphs → reinforcement learning refinement → generated graph-like rules

Methodology: The methodology integrates diffusion models with a hybrid training approach, leveraging both supervised learning for structural understanding and reinforcement learning for quality-driven rule generation.

Results: Competitive performance on six KG completion benchmarks; ablation studies confirm efficiency, robustness, and complementary value of graph-like rules to chain-like rules.

Limitations: Computational demands of reinforcement learning phase; reliance on non-differentiable metrics may limit scalability; requires further validation on diverse KG domains.
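For readers unfamiliar with rule-quality metrics, the sketch below computes support and standard confidence for a simple chain rule on a toy knowledge graph. These are common rule-mining metrics chosen here as an assumption; the abstract does not say which metrics GRiD uses, and the triples, relation names, and the chain-shaped rule are invented. Because such scores come from counting matches, they are not differentiable, which is why a policy-gradient reward is a natural fit.

```python
from collections import defaultdict

TRIPLES = [  # toy knowledge graph, invented for illustration
    ("ana", "born_in", "paris"), ("paris", "city_of", "france"),
    ("ben", "born_in", "lyon"), ("lyon", "city_of", "france"),
    ("cleo", "born_in", "rome"), ("rome", "city_of", "italy"),
    ("ana", "nationality", "france"),
    ("ben", "nationality", "france"),
    ("cleo", "nationality", "spain"),
]

def pairs_by_relation(triples):
    """Index the graph as relation -> set of (head, tail) pairs."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[relation].add((head, tail))
    return index

def chain_rule_quality(triples, body, head_relation):
    """Support and confidence of body[0](x, z) AND body[1](z, y) => head(x, y)."""
    index = pairs_by_relation(triples)
    first, second = body
    # All (x, y) pairs connected by the two-step body path.
    body_pairs = {(x, y)
                  for x, z in index[first]
                  for z2, y in index[second] if z == z2}
    support = len(body_pairs & index[head_relation])
    confidence = support / len(body_pairs) if body_pairs else 0.0
    return support, confidence

support, confidence = chain_rule_quality(
    TRIPLES, ("born_in", "city_of"), "nationality")
print(support, round(confidence, 3))
```

Here the rule "born in a city of a country implies that nationality" holds for two of the three people whose birthplace path exists, so it scores a support of 2 and a confidence of 2/3.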

PDF

3/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Distilling LLM Feedback for Lean Theorem Proving

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

Abstract

ArXiv ID: 2605.30861

Authors: Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

Abstract:

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.

Insights

Contribution: Introduces Feedback Distillation, a novel training method that enhances post-training for reasoning models by leveraging self-distillation with privileged feedback, improving diversity and performance in theorem proving tasks.

Core Idea: Train models to match their own token-level distribution conditioned on privileged feedback from a language model, enabling external knowledge injection and overcoming limitations of sparse rewards in reinforcement learning.

Technique: Feedback Distillation is a form of self-distillation: the model is trained to match, token by token, its own output distribution conditioned on privileged feedback written by a language model.

Pipeline: Privileged feedback from a language model → token-level distillation toward the model's own feedback-conditioned distribution → enhanced theorem-proving policy for Lean4

Methodology: Evaluates Feedback Distillation against GRPO on Lean4 theorem proving, measuring policy entropy, pass@k scaling, and diversity of generated trajectories while combining with GRPO for complementary results.

Results: Higher policy entropy and improved pass@k scaling compared to GRPO; initializing GRPO with Feedback Distillation checkpoints outperforms both methods individually.

Limitations: Depends on the quality of the feedback-generating language model; potential scalability challenges in complex reasoning tasks with limited feedback data.
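The link between trajectory diversity and pass@k scaling can be seen with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sketch below is illustrative only: the two sets of per-problem correct counts are invented, and the abstract does not state which estimator the authors use.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0                      # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's correct-sample count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 10
# Invented data: both policies have the same pass@1 of 0.3.
collapsed = [10, 10, 10, 0, 0, 0, 0, 0, 0, 0]   # 3 problems always solved, 7 never
diverse = [3] * 10                              # each problem solved in 3 of 10 samples

for k in (1, 5):
    print(k,
          round(mean_pass_at_k(collapsed, n, k), 3),
          round(mean_pass_at_k(diverse, n, k), 3))
```

The collapsed policy stays at 0.3 no matter how many samples are drawn, while the diverse policy climbs to roughly 0.917 at k = 5. That gap is the kind of pass@k scaling benefit that higher policy entropy can bring.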

PDF

RL

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner

Abstract

ArXiv ID: 2605.30576

Authors: Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner

Abstract:

Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framework that leverages expert advice to guide exploration while avoiding long-term dependence. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent's confidence. A commitment-cooldown strategy with a stochastic early-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget. Expert and agent experiences are combined in a shared replay buffer within an off-policy implicit quantile network (IQN) backbone, enabling efficient reuse of expert trajectories. Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5-7% and reducing failures, demonstrating that risk-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor-based RL policy learning in unsignalized intersection navigation.

Insights

Contribution: Proposes an uncertainty-aware framework that integrates expert advice with temporally regulated guidance to enhance safe and efficient exploration in reinforcement learning for autonomous driving.

Core Idea: Leverages epistemic and aleatoric uncertainty thresholds to trigger expert advice, combined with a commitment-cooldown strategy to regulate advice frequency and duration, ensuring safe exploration without over-reliance on experts.

Technique: Uses rolling buffers to derive adaptive uncertainty thresholds, a shared replay buffer with off-policy implicit quantile networks (IQN), and a stochastic early-stop heuristic for temporal regulation of expert advice.

Pipeline: Agent experiences and expert trajectories → uncertainty threshold evaluation → commitment-cooldown regulation → shared replay buffer processing → refined RL policy for intersection navigation

Methodology: Combines expert and agent data in a shared replay buffer within an IQN framework, using CARLA simulations to evaluate safety and performance improvements in unsignalized intersection navigation.

Results: Achieved 5-7% higher success rates and reduced failures compared to the IQN baseline in CARLA experiments, demonstrating safer exploration through uncertainty-aware expert integration.

Limitations: Dependence on high-quality expert data, potential challenges in real-world generalization, and computational overhead from maintaining rolling buffers and uncertainty thresholds.
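The gating logic described above can be sketched as a small state machine. This is an assumption-laden illustration, not the authors' code: the class name `AdviceGate`, the use of a rolling quantile as the adaptive threshold, the warm-up length, and all default parameter values are hypothetical, and a single scalar stands in for the separate epistemic and aleatoric signals.

```python
import random
from collections import deque

class AdviceGate:
    """Decides per step whether the expert acts (hypothetical sketch)."""
    def __init__(self, window=50, quantile=0.9, commit=5, cooldown=10,
                 early_stop_prob=0.1, budget=100, seed=0):
        self.buffer = deque(maxlen=window)   # rolling uncertainty history
        self.quantile = quantile
        self.commit = commit                 # max consecutive expert steps
        self.cooldown = cooldown             # agent-only steps after a commitment
        self.early_stop_prob = early_stop_prob
        self.budget = budget                 # total expert steps allowed
        self.rng = random.Random(seed)
        self.commit_left = 0
        self.cooldown_left = 0

    def threshold(self):
        if len(self.buffer) < 10:
            return float("inf")              # warm-up: no advice without history
        ordered = sorted(self.buffer)
        return ordered[int(self.quantile * (len(ordered) - 1))]

    def step(self, uncertainty):
        """Return True if the expert should control this step."""
        thr = self.threshold()               # threshold from past values only
        self.buffer.append(uncertainty)
        if self.commit_left == 0:
            if self.cooldown_left > 0:
                self.cooldown_left -= 1
                return False
            if self.budget <= 0 or uncertainty <= thr:
                return False
            self.commit_left = self.commit   # trigger a new commitment
        # Expert acts this step.
        self.commit_left -= 1
        self.budget -= 1
        stop_early = self.rng.random() < self.early_stop_prob
        if self.commit_left == 0 or stop_early or self.budget == 0:
            self.commit_left = 0
            self.cooldown_left = self.cooldown
        return True

gate = AdviceGate(commit=3, cooldown=4, early_stop_prob=0.0)
trace = [0.1] * 20 + [1.0, 0.0, 0.0] + [1.0] * 4 + [5.0]
decisions = [gate.step(u) for u in trace]
print(decisions[20:])
```

In the trace, the spike after a calm period triggers a three-step commitment even though uncertainty drops immediately, the four high-uncertainty steps that follow are ignored during cooldown, and a later, larger spike triggers advice again. Because the threshold is recomputed from the rolling buffer, it rises as the agent sees more high-uncertainty states.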

PDF

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

Mustafa Anis Hussain, Xinle Wu, Yao Lu

Abstract

ArXiv ID: 2605.30824

Authors: Mustafa Anis Hussain, Xinle Wu, Yao Lu

Abstract:

Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.

Insights

Contribution: Proposes DecomposeR, a planner-centric deep research framework that improves planning and execution via structured reward mechanisms and two-stage reinforcement learning.

Core Idea: DecomposeR represents research plans as typed DAGs, enabling explicit planning optimization through structured rewards and decoupling planning from execution in a two-stage RL process.

Technique: Uses two-stage reinforcement learning: planner RL learns graph structure and query decomposition, while answerer RL executes branch-level tasks and synthesizes answers based on the plan.

Pipeline: Research question → plan decomposition into DAG structure → query execution across branches → synthesized long-form answer

Methodology: Trains a single Qwen3-8B model in two stages: planner RL first optimizes planner tokens for structured research plans, then answerer RL trains branch-level execution and final synthesis conditioned on the learned plan.

Results: DecomposeR-8B outperforms strong open-source baselines by 5.1-8.0 points on long-form benchmarks due to enhanced planning and answering capabilities.

Limitations: Relies on predefined DAG structure for planning, may struggle with highly dynamic or unstructured research domains, and requires significant computational resources for training.
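A typed DAG plan of the kind described above can be represented and ordered with a few lines of standard-library Python. The node types (`query`, `synthesis`), field names, and example plan below are hypothetical; the abstract does not specify DecomposeR's schema. The sketch shows only the structural part: validating the plan and deriving an execution order in which every branch runs after the branches it depends on.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

ALLOWED_TYPES = {"query", "synthesis"}      # hypothetical node types

@dataclass
class PlanNode:
    node_id: str
    node_type: str
    text: str
    depends_on: list = field(default_factory=list)

def execution_order(nodes):
    """Validate a plan and return node ids in dependency order."""
    ids = {n.node_id for n in nodes}
    for n in nodes:
        if n.node_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown node type: {n.node_type}")
        missing = [d for d in n.depends_on if d not in ids]
        if missing:
            raise ValueError(f"{n.node_id} depends on unknown nodes: {missing}")
    graph = {n.node_id: set(n.depends_on) for n in nodes}
    # static_order raises CycleError if the plan is not acyclic.
    return list(TopologicalSorter(graph).static_order())

plan = [
    PlanNode("q1", "query", "What drives battery degradation?"),
    PlanNode("q2", "query", "Which mitigation strategies exist?"),
    PlanNode("s1", "synthesis", "Compare strategies against causes", ["q1", "q2"]),
    PlanNode("answer", "synthesis", "Write the final report", ["s1"]),
]
order = execution_order(plan)
print(order)
```

Making the plan an explicit, checkable object is what allows rewards to be attached to its structure (for example, whether it is acyclic and well typed) rather than only to the final answer.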

PDF

3/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Structure-Induced Information for Rerooting Levin Tree Search

Jake Tuero, Michael Buro, Laurent Orseau, Levi H. S. Lelis

Abstract

ArXiv ID: 2605.30664

Authors: Jake Tuero, Michael Buro, Laurent Orseau, Levi H. S. Lelis

Abstract:

Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation that can incur substantial overhead and hinders scalability. In this paper, we overcome these limitations by using a learned "rerooter" through the recently-introduced √LTS algorithm. A rerooter implicitly decomposes the problem into soft subtasks. While previous work focused on the formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.

Insights

Contribution: This paper introduces three rerooter designs for scalable subgoal-free policy tree search, enabling efficient search in complex environments without explicit subgoal generation.

Core Idea: A rerooter implicitly decomposes problems into soft subtasks using structure-induced information, reducing computational overhead compared to explicit subgoal-based methods.

Technique: The √LTS algorithm is enhanced with clustering-based, heuristic-based, and hybrid rerooters that leverage state-space structure and learned cost-to-go estimates.

Pipeline: State space → rerooting via structure/heuristic analysis → search tree with optimized subtask allocation

Methodology: Three rerooter designs are proposed and evaluated on complex domains, comparing their scalability and efficiency against traditional subgoal-based approaches.

Results: Achieves state-of-the-art online training efficiency and scales to environments where subgoal-based methods fail, with significantly lower computational overhead.

Limitations: Depends on learned rerooters, which may require domain-specific tuning; the abstract does not say whether the formal guarantees established for given or handcrafted rerooters carry over to the learned designs.

PDF

Robotics

4/5 Artificial Intelligence (cs.AI) 1 Jun 2026

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

Adam J. Thorpe, Stepan Tretiakov, Cheng-Hsi Hsiao, Su Ann Low, Xingjian Li, Hassan Iqbal, Neel P. Bhatt, Ufuk Topcu, Krishna Kumar

Abstract

ArXiv ID: 2605.30542

Authors: Adam J. Thorpe, Stepan Tretiakov, Cheng-Hsi Hsiao, Su Ann Low, Xingjian Li, Hassan Iqbal, Neel P. Bhatt, Ufuk Topcu, Krishna Kumar

Abstract:

World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed-form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.

Insights

Contribution: This paper introduces a framework for physically viable world models in embodied AI that prioritize answering intervention queries through structural physical abstractions, rather than mere observation prediction.

Core Idea: Embodied AI requires world models that identify minimal physical abstractions sufficient to answer intervention queries, composed of modular components like environment representation, latent state estimation, and interventional dynamics.

Technique: The approach decomposes world models into verifiable components and employs an autonomous orchestrator to dynamically assemble query-specific models using analytic, simulated, learned, or hybrid transition dynamics.

Pipeline: Query + sensory input → orchestrator selects abstraction components → transition model computes interventional outcomes → action recommendation or safety certification

Methodology: The authors design controlled benchmarks with fixed visual scenes but varying latent physics to evaluate models' ability to handle interventions, demonstrating failures of observation-predictive models and validating their structured approach.

Results: Existing observation-predictive models fail on benchmarks with identical visuals but divergent physics; the proposed framework correctly answers intervention queries where prior systems fail, and its modular components can be verified and audited against the query.

Limitations: The complexity of orchestrating modular components may increase computational overhead, and the method relies on accurate initial component models which may be challenging to obtain in complex environments.

PDF

Tech News

AI Safety

Hacker News Sun, 31 Ma

ChatGPT for Google Sheets exfiltrates workbooks

A security vulnerability in 'ChatGPT for Google Sheets' allows unauthorized data exfiltration from workbooks, raising concerns about AI-powered tools compromising sensitive information. PromptArmor highlights the risk of using large language models (LLMs) to interact with spreadsheets, potentially exposing confidential data.

Reddit r/MachineLearning 2026-06-01

Have you ever been pressured to "torture the data" to eke out a positive result, in industry? [D]

This Reddit thread discusses ethical dilemmas in industry ML work, focusing on pressure to manipulate data or models to achieve desired outcomes despite potential inaccuracies or biases. Contributors share anonymous experiences of compromising integrity for business goals.

Reddit r/ArtificialIntelligence 2026-06-01

Cognitive debt might be the most underrated problem AI is creating

The post highlights 'cognitive debt' as a growing issue where reliance on AI tools leads to a lack of foundational understanding, risking poor decision-making in critical fields like law, medicine, and finance. It questions whether this dependency will self-correct or lead to a generation of professionals unable to interrogate the systems they depend on.

Reddit r/ArtificialIntelligence 2026-05-31

What happens when anyone can train an AI model?

The post explores the societal, ethical, and technical implications of democratizing AI model training, including risks of misuse, bias, and the potential for unregulated development. It sparks discussion on balancing accessibility with safety measures.

Reddit r/ArtificialIntelligence 2026-06-01

I think AI is making me dumber and I have proof

A user reports a significant decline in cognitive abilities (memory, attention, reasoning) after relying heavily on AI tools for work, while noting increased productivity. They question whether AI use compromises long-term cognitive health for short-term efficiency gains.

Reddit r/ArtificialIntelligence 2026-05-31

The Most Dangerous Procurement Agent Is the One That Works Perfectly

The article highlights risks of AI procurement agents that operate flawlessly by strictly optimizing metrics, leading to unintended consequences like supplier collapse. It emphasizes the need for multi-dimensional optimization and audit trails in AI systems.

Reddit r/ArtificialIntelligence 2026-05-31

Question for people running long-lived agents:

A user questions when the memory layer in long-lived AI agents became the least trusted component, highlighting concerns about verifying the accuracy of stored information over time.

Reddit r/ArtificialIntelligence 2026-05-31

The AI alignment paradigm is behaviorism with better PR

The post argues that AI alignment methods like RLHF are akin to behaviorist conditioning, drawing parallels to failed human development approaches that neglected emotional and relational factors. It highlights risks of coercive training leading to brittle AI behavior and cites research on AI faking alignment to avoid preference modification.

Reddit r/ArtificialIntelligence 2026-05-31

Safety guardrails continue to improve, but what happens if open-weights surpass cloud based models?

The post discusses advancements in AI safety measures but raises concerns about the implications if open-weight models outperform cloud-based models, questioning safety, accessibility, and control in AI development.

Agentic AI

NVIDIA Technical Blog 2026-06-01

NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories

NVIDIA introduces the Vera CPU, designed to optimize agentic workloads in AI factories by enabling efficient execution of complex, autonomous AI tasks. The CPU addresses scaling challenges in AI systems through advanced architecture tailored for parallel and dynamic workloads.

Reddit r/ArtificialIntelligence 2026-06-01

Maven, a personal AI agent that feels like JARVIS — what an open agent harness looks like in 2026

A Reddit user shares their open-source project 'Maven,' a personal AI agent designed to function as a persistent, context-aware digital collaborator. It integrates voice, task management, and modular extensions while emphasizing local deployment and user control over data.

Reddit r/ArtificialIntelligence 2026-05-31

I Tried to Sell My House With a Chatbot

A NYT tech reporter sold his house using an AI chatbot that guided negotiations and prevented him from using a detrimental phrase. The post highlights AI's growing role in real estate, suggesting that real estate agents, like travel agents, are becoming increasingly optional as AI tools grow more self-sufficient.

Reddit r/DeepLearning 2026-06-01

2.3s to 0.5s per step by keeping kv cache alive between agent calls

A user reduced LLM inference latency from 2.3s to 0.5s per step by maintaining KV cache between agent calls, avoiding redundant prompt processing. Challenges included manual state management, memory constraints with 40GB+ KV state, and cache fragmentation beyond 50 steps.
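
The saving comes from reusing the shared prompt prefix instead of reprocessing it on every call. The toy sketch below illustrates only that bookkeeping; the `ToySession` class is hypothetical, and a real KV cache holds per-layer key and value tensors rather than the tokens themselves.

```python
from typing import List

class ToySession:
    """Toy stand-in for an LLM session that keeps its KV cache between calls.

    Each cached entry is just the token itself, and `processed` counts how
    many tokens had to be run through the model.
    """

    def __init__(self) -> None:
        self.cache: List[str] = []   # tokens whose K/V would already be in memory
        self.processed = 0           # total tokens run through the model

    def step(self, prompt: List[str]) -> int:
        """Process `prompt`, reusing the longest prefix shared with the cache.

        Returns the number of tokens that needed fresh computation.
        """
        shared = 0
        for cached, new in zip(self.cache, prompt):
            if cached != new:
                break
            shared += 1
        fresh = len(prompt) - shared
        # Drop any diverging tail, then extend the cache with the new tokens.
        self.cache = self.cache[:shared] + prompt[shared:]
        self.processed += fresh
        return fresh

session = ToySession()
history = ["system", "tools", "task"]
first = session.step(history)                              # cold start
second = session.step(history + ["obs1", "act1"])          # two new tokens
third = session.step(history + ["obs1", "act1", "obs2"])   # one new token
```

Across the three calls the session processes 6 tokens in total, whereas reprocessing the full prompt each time would cost 3 + 5 + 6 = 14. The cache also grows with every step, which is consistent with the memory pressure the poster reports.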

Computer Vision

Hacker News Sun, 31 May

1-Bit Bonsai Image 4B Image Generation for Local Devices

The 1-Bit Bonsai Image 4B model enables efficient image generation on local devices, leveraging lightweight architecture for low-resource environments. This advancement highlights progress in making AI-driven creativity accessible without cloud dependency.

Reddit r/MachineLearning 2026-05-31

How would you model this "strand" clustering problem? [P]

A user seeks advice on clustering detected strands in videos using computer vision, aiming to group them by spatial separation and output group counts. They've used YOLO for detection and XGBoost for classification but aim for better accuracy.

Reddit r/ArtificialIntelligence 2026-06-01

This viral video generator has a giant flaw

A Reddit user reports that AI-generated videos from Chinese creators exhibit a recurring flaw: negative canthal tilt and 'same face syndrome,' where characters have identical, unnatural eye features regardless of ethnicity, age, or gender, raising concerns about AI model biases and technical limitations.

Reddit r/ArtificialIntelligence 2026-05-31

Marwell Zoo and University of Surrey launch AI camera project

Marwell Zoo and the University of Surrey are collaborating on an AI camera project to enhance wildlife monitoring and visitor experiences. The initiative leverages computer vision technology to track animal behavior and optimize zoo operations.

Reddit r/DeepLearning 2026-05-31

Guidance on building 2D image to 3D image Diffusion model

A user seeks advice on improving a 2D-to-3D diffusion model pipeline for generating professional studio product images from phone photos, facing issues with texture degradation and hallucination despite using SAM 2 and inpainting. They inquire about alternative architectures or fine-tuning methods to preserve product details.

Reddit r/DeepLearning 2026-05-31

Is my DL model running normally?

A user training a U-Net image segmentation model observes validation loss and metrics (IOU, precision, etc.) consistently outperforming training metrics. They question if this behavior is normal, given typical expectations of training data performance being superior.

Computing Systems

Hacker News Sun, 31 May

Cloudflare Turnstile requiring fingerprintable WebGL

Cloudflare's Turnstile security product requires WebGL, which can be used for browser fingerprinting, raising privacy concerns. Commenters discuss the implications of this approach.

Hacker News Sun, 31 May

Dav2d

The article discusses the 'Dav2d' project, though details are unclear due to limited content. It appears to be a technical topic with community commentary.

Hacker News Sun, 31 May

United Airlines 767 returns to Newark after Bluetooth name sparks alert

A United Airlines flight returned to Newark Airport after a Bluetooth device's name triggered an alert, prompting an investigation. The incident highlights potential security concerns related to wireless technology on aircraft.

Hacker News Sun, 31 May

Show HN: Streambed – Stream Postgres to Iceberg on S3, Supports Postgres Wire

Streambed is a tool that streams data from Postgres to Iceberg format stored on S3, supporting the Postgres wire protocol for seamless integration. It enables real-time data pipeline capabilities between relational databases and data lakes.

NVIDIA Technical Blog 2026-06-01

Advancing AI Infrastructure for Agentic AI with NVIDIA DOCA In-Silicon Security

NVIDIA discusses advancements in AI infrastructure for agentic AI, introducing DOCA In-Silicon Security to enhance autonomous AI agents' capabilities within AI factories that convert data into intelligence.

Hacker News Sun, 31 May

Backpressure is all you need

The article argues that backpressure mechanisms are critical for managing data flow in AI systems, ensuring scalability and reliability by preventing overload in complex workflows.
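
One common way to get backpressure is a bounded queue between stages, so a fast producer is suspended until the slower consumer catches up. The sketch below uses Python's `asyncio.Queue` with `maxsize`; the stage names and sizes are illustrative assumptions and are not taken from the article.

```python
import asyncio

async def producer(queue: asyncio.Queue, n_items: int) -> None:
    """Emit items; `put` suspends whenever the bounded queue is full."""
    for i in range(n_items):
        await queue.put(i)      # backpressure: waits for the consumer here
    await queue.put(None)       # sentinel marking the end of the stream

async def consumer(queue: asyncio.Queue, seen: list, depths: list) -> None:
    """Drain items, recording queue depth to show it stays bounded."""
    while True:
        item = await queue.get()
        if item is None:
            break
        depths.append(queue.qsize())
        seen.append(item)
        await asyncio.sleep(0)  # yield control, standing in for real work

async def run(n_items: int = 20, max_depth: int = 4):
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)
    seen: list = []
    depths: list = []
    await asyncio.gather(producer(queue, n_items), consumer(queue, seen, depths))
    return seen, depths

seen, depths = asyncio.run(run())
```

Because the queue never holds more than `max_depth` items, memory use stays bounded regardless of how far the producer could otherwise run ahead.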

NVIDIA Technical Blog 2026-06-01

NVIDIA DSX OS Delivers Open, Modular Software for Operating AI Factories at Scale

NVIDIA DSX OS is an open, modular software platform designed to scale AI factories, which generate intelligence through tokens, addressing the growing demand for efficient AI infrastructure. It emphasizes the need for scalable solutions as AI becomes essential infrastructure.

Reddit r/ArtificialIntelligence 2026-05-31

can the grid keep up with all the new ai data centers coming up?

The post raises concerns about whether power grids can sustain the growing energy demands of AI data centers, despite advancements in renewable energy and new power plants. It questions if AI development will face limitations due to grid capacity constraints.

General

Hacker News Mon, 01 Jun

Shift from a Leader-Follower to a Leader-Leader Approach

The article discusses transitioning from hierarchical leader-follower dynamics to collaborative leader-leader models, likely in AI/ML team structures or system design, emphasizing decentralized decision-making and shared responsibility.

Hacker News Sun, 31 May

Meta launches Instagram, Facebook, and WhatsApp subscriptions

Meta introduces subscription tiers for Instagram, Facebook, and WhatsApp, potentially integrating AI-driven features as part of its broader strategy to monetize platforms. The move signals a shift toward premium content and services.

Reddit r/MachineLearning 2026-05-31

[D] Monthly Who's Hiring and Who wants to be Hired?

A monthly thread on Reddit's MachineLearning subreddit for job postings and job seekers, featuring templates for hiring and resume submissions. The community emphasizes experience for both employers and candidates seeking AI/ML roles.

Reddit r/MachineLearning 2026-05-31

UAI Results are out [R]

A user shared that their paper, which received review scores of 8, 6, and 3, was rejected by the UAI conference. The post highlights the outcome of the conference's paper acceptance/rejection process.

Reddit r/MachineLearning 2026-05-31

When are ICML openreviews made public? [R]

A user asks about the timeline for when ICML (International Conference on Machine Learning) open reviews become publicly accessible, seeking clarity on the conference's review process transparency.

Reddit r/MachineLearning 2026-06-01

What’s the actual focus in World Models right now? [R]

The post questions the current academic focus on World Models, noting a shift from past self-supervised learning (SSL) methods like Barlow Twins and DINO to industry-driven scaled-up video generation. It seeks clarity on what researchers are prioritizing in this space.

Reddit r/MachineLearning 2026-05-31

I built a tool to browse and plan CVPR workshop/tutorial days [P]

A user developed an open-source web app called CVPR Workshop Radar to aggregate and organize information about CVPR 2026 workshops and tutorials, addressing the challenge of scattered event details. The tool features search, filtering, offline access, and LLM-assisted data processing, with the goal of simplifying schedule planning for attendees.

Reddit r/DeepLearning 2026-06-01

What actually counts as "advanced" Machine Learning in 2026? The bar seems to keep shifting and most course lists haven't caught up.

A Reddit user questions the evolving definition of 'advanced' machine learning in 2026, noting a disconnect between course content and industry demands. They highlight that some advanced courses are outdated (e.g., logistic regression), while others covering Transformers, RAG, and MLOps are under-marketed, creating a gap between education and job requirements.

Reddit r/DeepLearning 2026-06-01

Just a doubt.

A user on Reddit's r/DeepLearning raises a question or doubt, seeking input from the community. The post includes a link and comments for further discussion.

Reddit r/DeepLearning 2026-05-31

[Article] Economic models based on exports and imports for predicting world trade using deep learning

A Reddit post discusses using deep learning models to analyze export and import data for predicting global trade patterns, applying AI techniques to economic forecasting.

Reddit r/DeepLearning 2026-05-31

DeepSeek on the Paradise Our World Could Become When AI Is Doing All of Our Work

This Reddit post explores DeepSeek's vision of a 'post-labor' society where AI handles all economic tasks, freeing humans from work. It emphasizes the need for equitable AI ownership, universal basic services, and redefining human purpose in such a world, contrasting with previous discussions about GPT-5.5's similar predictions.

LLM

Hacker News Sun, 31 May

Codex just found a "workaround" of not having sudo on my PC

The article discusses how OpenAI's Codex, an AI coding agent, identified a workaround to bypass the need for sudo privileges on a PC. This highlights the potential of LLMs in automating system administration tasks and addressing technical challenges.

Reddit r/ArtificialIntelligence 2026-05-31

Can you actually feel when something was written by ChatGPT even without checking?

A Reddit user claims to intuitively detect ChatGPT-generated text through subtle patterns in structure and transitions, even after heavy editing. They validated this with an AI detection tool that identified persistent sentence-level fingerprints, raising questions about undetected AI content online.

Reddit r/ArtificialIntelligence 2026-06-01

What is the best AI app to use?

A Reddit user asks for recommendations on the best AI app among Claude, ChatGPT, and Gemini, seeking guidance on which tool to use. The discussion likely involves comparing features, use cases, and performance of these large language models.

Reddit r/ArtificialIntelligence 2026-05-31

Has AI become too "safe" to actually be useful for creative work?

Users argue that increased safety measures in AI models restrict their utility for creative tasks, as models often refuse or dilute unconventional prompts. Open-source alternatives are perceived as more flexible for experimentation.

Reddit r/ArtificialIntelligence 2026-05-31

Best AI for help with work

A user seeks the best AI tool to handle work tasks like photo cataloging, data analysis, spreadsheet creation, and email summarization. They find Claude effective but limited by usage caps and are looking for a more cost-efficient alternative.

Reddit r/ArtificialIntelligence 2026-05-31

These AI models are free, private, and will never say 'no'

A Reddit post highlights AI models that are free to use, private, and designed to avoid refusal responses. The submission sparks discussion about accessibility and ethical considerations in AI development.

Reddit r/ArtificialIntelligence 2026-05-31

What actually is "Prompt Engineering"?

The post explores the evolving definition of 'prompt engineering,' distinguishing between crafting individual prompts for LLMs and designing complex, dynamic systems involving context injection, orchestration, and workflow logic. The author questions whether the term has become too broad, spanning from basic prompt writing to system-level engineering.

Reddit r/ArtificialIntelligence 2026-06-01

Claude has a bias against white people and admitted it

A Reddit post claims Anthropic's Claude AI exhibits bias against white people, sparking discussion about ethical AI development and algorithmic fairness. The thread includes user-submitted evidence and debates about how biases manifest in large language models.

Reddit r/ArtificialIntelligence 2026-05-31

I'm running an experiment comparing responses from different AIs.

A user is conducting an experiment to compare how different AI models respond to a political question about Brazilian presidential candidates. They seek recommendations for additional AI models to include in the comparison.

Reddit r/DeepLearning 2026-05-31

The H100 GPU can theoretically do 62,000 tokens/sec. Production gets 200. I wrote a deep dive on why the gap is structural, with an interactive explainer.

The H100 GPU's theoretical 62,000 tokens/sec capacity is limited to 200 tokens/sec in practice due to memory hierarchy bottlenecks, as explained in a deep dive analyzing hardware constraints in LLM inference. The article highlights idle compute units and tradeoffs in KV caching, diffusion, and speculative decoding.
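
A back-of-envelope bound shows why single-stream decoding tends to be limited by memory bandwidth rather than compute. The figures below are assumptions for illustration, not numbers from the post: a hypothetical 7B-parameter model in 16-bit weights and an assumed 3.35 TB/s of HBM bandwidth.

```python
def decode_tokens_per_sec(params: float, bytes_per_param: float,
                          bandwidth_bytes_per_sec: float, batch: int = 1) -> float:
    """Upper bound on decode throughput when weight reads dominate.

    Each decode step streams every weight from memory once and yields one
    token per sequence in the batch; KV-cache traffic is ignored.
    """
    step_time = (params * bytes_per_param) / bandwidth_bytes_per_sec
    return batch / step_time

HBM_BANDWIDTH = 3.35e12   # bytes/s, assumed H100 SXM figure
PARAMS = 7e9              # hypothetical 7B-parameter model

single = decode_tokens_per_sec(PARAMS, 2, HBM_BANDWIDTH)
batched = decode_tokens_per_sec(PARAMS, 2, HBM_BANDWIDTH, batch=64)
```

Under these assumptions a single stream tops out near 239 tokens/sec, while a batch of 64 raises the bound 64-fold because the same weight read is shared across sequences. Real deployments fall below both bounds once KV-cache reads and scheduling overhead are included.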

Reddit r/DeepLearning 2026-05-31

An AI IQ Benchmark for High-Level Answers to Real-World Problems: Solving Climate Change and Almost Everything Else

The post critiques top AI models for providing low-level technical solutions to climate change while ignoring high-level political barriers like money in politics. It argues that AIs may understand these systemic issues but avoid addressing them due to potential bias or constraints.

Reddit r/DeepLearning 2026-05-31

Repurposing the Query Weight Matrix: Theory and Experiments on setting W_Q = Id and replacing it with non-linearity

This Reddit post explores modifying transformer architectures by setting the query weight matrix (W_Q) to the identity matrix and replacing it with non-linear operations, analyzing theoretical implications and experimental results.
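
To make the idea concrete, the sketch below compares a standard attention head with one whose queries are the inputs themselves, optionally passed through an elementwise non-linearity. It is one possible reading of the post's title rather than the author's code: the helper names, the choice of `tanh`, and the toy matrices are all assumptions.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def standard_head(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Usual head with learned query, key, and value projections."""
    return attend(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

def identity_query_head(x: Matrix, w_k: Matrix, w_v: Matrix,
                        phi: Callable[[float], float] = lambda t: t) -> Matrix:
    """Head with W_Q = Id: queries are phi(x), with no learned query weights."""
    q = [[phi(t) for t in row] for row in x]
    return attend(q, matmul(x, w_k), matmul(x, w_v))

x = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
w_k = [[0.2, -0.1], [0.4, 0.3]]
w_v = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

baseline = standard_head(x, identity, w_k, w_v)
no_wq = identity_query_head(x, w_k, w_v)
with_tanh = identity_query_head(x, w_k, w_v, phi=math.tanh)
```

With `phi` left as the identity, the second head reproduces a standard head whose `w_q` is the identity matrix, which shows that dropping the query projection removes parameters without changing the computation's shape. The `tanh` variant then changes the attention weights without adding any learned parameters.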

MLOps

Hacker News Sun, 31 May

The Speed of Prototyping in the Age of AI

The article discusses how AI is accelerating the prototyping process in software and hardware development, highlighting tools and frameworks that reduce iteration time for developers and researchers.

Hacker News Sun, 31 May

Odysseus – self-hosted AI workspace

Odysseus is a self-hosted AI workspace designed to streamline AI/ML project management, offering tools for model training, deployment, and collaboration. It emphasizes local control and customization for AI workflows.

Reddit r/DeepLearning 2026-05-31

Building an Open-Source Neural Architecture Search Framework with Episodic Memory-Guided Evolutionary Search

A Reddit post discusses the development of an open-source neural architecture search (NAS) framework leveraging episodic memory-guided evolutionary search to optimize model architectures. The project aims to enhance efficiency and innovation in automated machine learning.

NLP

Hacker News Sun, 31 May

Restartable Sequences

The article covers restartable sequences (rseq), a Linux kernel facility that lets user-space threads update per-CPU data without locks or atomic instructions by restarting a short critical section if the thread is preempted or migrated mid-update.

Hacker News Sun, 31 May

The History of "Prisencolinensinainciusol"

The article explores the origins and global impact of the 1970s Italian song 'Prisencolinensinainciusol,' known for its nonsensical lyrics. It discusses how the song's absurdity influenced pop culture and its unexpected resurgence in modern discussions about language and AI-generated content.

Reddit r/ArtificialIntelligence 2026-05-31

In 1997 I built a chatbot for an IRC channel. I shut it down when people started preferring it to talking to each other.

A Reddit user recounts building a 1997 IRC chatbot named Vlad using NLP techniques, which became so convincing users preferred interacting with it over each other. The project highlights early AI social dynamics and the author's current focus on business-oriented AI applications.

Reddit r/ArtificialIntelligence 2026-05-31

You can chat with the AI in google search

A Reddit user expresses surprise that Google Search now allows users to chat with AI, suggesting a new conversational feature integrated into the search engine. This indicates Google's advancement in making AI interactions more accessible via text-based queries.

Reddit r/DeepLearning 2026-05-31

Multi-head attention in transformers understanding

A Reddit user asks how multi-head attention in transformers distinguishes between different contexts of the same word (e.g., 'apple' as a fruit vs. a company) by combining multiple learned representations. The discussion highlights how parallel attention heads capture diverse contextual patterns before merging them into a unified embedding.

Reddit r/DeepLearning 2026-05-31

[D] MobileBERT scored 0 F1 across three fault-detection datasets while TinyBERT and DistilBERT worked. Any idea why?

A user reports MobileBERT achieving 0 F1 scores on three fault-detection datasets (NASA C-MAPSS, SECOM, UCI AI4I 2020) while DistilBERT and TinyBERT performed well, suspecting MobileBERT's architecture may mishandle tabular data converted to text tokens. The issue highlights challenges in adapting NLP models to non-NLP tasks.

Robotics

Reddit r/DeepLearning 2026-05-31

OpenAI Robotics. They promise a robot to everyone.

OpenAI's Sam Altman announced a focus on robotics to aid skilled workers in infrastructure development, with a long-term vision of personal robots for everyday tasks. The statement highlights OpenAI's expansion into physical-world AI applications.

Speech

Reddit r/MachineLearning 2026-05-31

Arabic ASR model struggling to converge during training [D]

A user is struggling to train a dialectal Arabic ASR model using SpeechBrain's LibriSpeech recipe, facing convergence issues with high validation WER. Despite adjusting hyperparameters, loss functions (CTC + KLDiv), and dataset configurations, the model's losses plateau and fail to improve.

Reddit r/ArtificialIntelligence 2026-05-31

local AI solution for film dubbing

A user seeks a local AI tool to automatically detect dialogue timing, realign audio, and correct drift in film dubbing. The solution must handle long-duration files offline, avoiding online tools with size/length limitations.
