Daily Digest 2026-05-27
The dominant theme across today's content is the advancement of autonomous AI systems, with a focus on long-term interaction capabilities, reliability, and the integration of multimodal and physical intelligence into agent architectures.
Research highlights:
- Autonomous AI Agents: Research explores personalizing agents over extended interactions, managing uncertainty in procedural knowledge, and engineering agent lifespans to align with real-world deployment challenges.
- LLM Reliability and Safety: Efforts address hallucination detection, artifact drift mitigation, and aligning agent behavior with human intent through structured belief modeling and memory systems.
- Multimodal Creativity and Physical Intelligence: Work advances geometry-conditioned object generation, creative physical reasoning, and the use of structured tokenization to enable buildable design tasks.
- Cognitive Modeling and Introspection: Studies question LLM introspection capabilities and explore explicit belief modeling to benchmark theory of mind in language models.
Tech buzz:
- The rise of agentic AI for scientific and creative tasks is tempered by growing user fatigue and regulatory scrutiny, as seen in news about AI interaction limits and market restrictions.
- Innovations in hardware, such as Nvidia's Vera CPU benchmarks, and tools like Cloudflare's flagship projects, highlight ongoing efforts to scale and optimize AI-driven systems.
- New platforms and automation tools, including desktop-scale Windows automation and experimental fantasy computers, reflect expanding applications of AI in both practical and conceptual domains.
Global Trends
Papers discovered from ArXiv subject categories
AI Safety
Abstract
ArXiv ID: 2605.26242
Authors: Shashwat Singh, Tal Linzen, Shauli Ravfogel
Abstract:
Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.
Insights
Contribution: This paper challenges the claim that LLMs exhibit metacognitive introspection by re-evaluating existing evidence and introducing stricter controls to distinguish genuine introspection from pattern-matching behaviors.
Core Idea: LLMs may appear to introspect by detecting anomalies or leveraging input patterns, but current evidence does not confirm privileged access to internal states, as behavioral data alone is insufficient for strong introspective claims.
Technique: The authors re-analyze two introspection evaluation paradigms and introduce a relabeled control task to isolate internal state access from task semantics, comparing model performance against input-only classifiers.
Pipeline: Task design (internal state detection/prediction) → model evaluation with input-only classifiers → comparison of performance metrics in controlled settings
Methodology: The study critiques existing paradigms by testing whether models can differentiate internal state tampering from input manipulations and whether hidden state predictions require internal access or just input patterns.
Results: Models failed to reliably distinguish internal state interventions from input manipulations; input-only classifiers matched model performance in hidden state prediction tasks; control tasks showed near-chance performance, undermining introspection claims.
Limitations: The findings depend on the validity of the control task design, and alternative methods for assessing introspection may still exist; the study focuses on specific paradigms rather than exhaustive evaluation of all potential introspection mechanisms.
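The input-only baseline comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `accuracy` and `privileged_access_gap` are hypothetical, and the label lists are made-up toy values used only to exercise the functions. The idea is that if a classifier that sees only the input matches the model's own in-context predictions, the task result alone does not show privileged access to internal states.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def privileged_access_gap(
    model_preds: Sequence[int],
    input_only_preds: Sequence[int],
    labels: Sequence[int],
) -> float:
    """Model accuracy minus input-only accuracy.

    A gap near zero means the labels were recoverable from the input alone.
    """
    return accuracy(model_preds, labels) - accuracy(input_only_preds, labels)


# Toy values (assumed, not from the paper): labels derived from hidden states,
# the model's in-context predictions, and an input-only classifier's predictions.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_preds = [1, 0, 1, 0, 0, 0, 1, 1]
input_only_preds = [1, 0, 0, 1, 0, 1, 1, 0]

gap = privileged_access_gap(model_preds, input_only_preds, labels)
print(f"gap = {gap:.2f}")
```

In this toy case both predictors reach the same accuracy, so the gap is zero, which is the pattern the paper reports for the second paradigm.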
Abstract
ArXiv ID: 2605.26322
Authors: Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah
Abstract:
Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor's mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.
Insights
Contribution: OmniToM introduces a novel benchmark for evaluating Theory of Mind in LLMs by directly assessing explicit belief modeling, addressing the gap in prior methods that only judge final answers.
Core Idea: The benchmark focuses on whether models construct mental-state representations through explicit belief structures, rather than relying solely on end-point question answering for social reasoning tasks.
Technique: OmniToM uses a two-stage evaluation: extracting belief propositions from narratives and labeling them with a seven-dimensional schema covering truth status, knowledge access, and mental state types.
Pipeline: Stories → extract relevant belief propositions → label each with seven-dimensional schema → evaluate model performance on belief accuracy and schema alignment
Methodology: Built from 895 ToMBench stories and 22,343 labeled beliefs, the benchmark uses a human-calibrated LLM-assisted annotation pipeline for belief extraction and labeling.
Results: Zero-shot evaluations reveal LLMs struggle with actor-specific belief tracking, particularly in knowledge-access and representational decisions for divergent or mistaken beliefs.
Limitations: Dependence on human-calibrated annotations may limit scalability, and the focus on explicit belief structures might not capture implicit reasoning capabilities in LLMs.
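One way to picture a labeled belief proposition is as a small record with one field per schema dimension. The sketch below is an assumption-laden illustration, not OmniToM's actual data format: the seven field names follow the dimensions listed in the abstract, but the field types, the example values, and the `dimension_agreement` scoring helper are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass(frozen=True)
class BeliefLabel:
    """Seven schema dimensions named in the abstract; value sets are assumed."""
    recursive_order: int
    truth_status: str
    knowledge_access: str
    explicitness: str
    content_type: str
    mental_source: str
    context: str


@dataclass(frozen=True)
class BeliefProposition:
    """A minimal statement of what an actor takes to be true."""
    actor: str
    statement: str
    label: BeliefLabel


def dimension_agreement(pred: BeliefLabel, gold: BeliefLabel) -> Dict[str, bool]:
    """Per-dimension match between a predicted and a gold label (hypothetical metric)."""
    return {f.name: getattr(pred, f.name) == getattr(gold, f.name)
            for f in fields(BeliefLabel)}


# Illustrative false-belief example with made-up label values.
gold = BeliefLabel(1, "false", "absent", "implicit", "location", "perception", "physical")
pred = BeliefLabel(1, "true", "present", "implicit", "location", "perception", "physical")
belief = BeliefProposition("Sally", "The marble is in the basket.", gold)

agreement = dimension_agreement(pred, gold)
print(sum(agreement.values()), "of", len(agreement), "dimensions match")
```

Scoring per dimension rather than per proposition makes it visible when errors concentrate in one dimension, such as knowledge access.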
Abstract
ArXiv ID: 2605.26530
Authors: Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song
Abstract:
Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.
Insights
Contribution: This paper introduces LexGuard, an adversarial multi-agent framework for improving legal AI trustworthiness through relevance-sensitive evaluation and formal reasoning grounded in statutory constraints.
Core Idea: Legal AI must distinguish between legally relevant and irrelevant changes, requiring sensitivity to material variations while remaining stable under irrelevant perturbations.
Technique: LexGuard formalizes statutes as executable constraints, uses adversarial agents to generate competing arguments, and leverages SMT solvers for logical verification of legal consistency.
Pipeline: Legal case input → adversarial agents extract fact-statute arguments → SMT solvers verify legal satisfaction → output validated legal conclusions with relevance sensitivity.
Methodology: The study combines a unified evaluation framework across fairness, robustness, and statute-confusion scenarios with a formal reasoning system using adversarial multi-agent interactions and constraint solving.
Results: LexGuard reduces vulnerability to manipulative framing, improves disambiguation among similar statutes, limits the influence of legally irrelevant attributes, and increases consistency under benign reformulations; the abstract does not quantify these gains.
Limitations: Statute formalization complexity limits scalability, and the framework's effectiveness depends on the quality of initial statutory constraint encoding.
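The should-change / should-not-change evaluation idea can be sketched as a simple scoring loop. This is a hedged illustration only: `PerturbationCase`, `relevance_sensitivity`, and both toy models are hypothetical names and stand-ins, the example facts are invented, and no SMT solver is involved here (LexGuard's own verification step is not reproduced).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PerturbationCase:
    original: str
    perturbed: str
    legally_material: bool  # True means the outcome SHOULD change


def relevance_sensitivity(model: Callable[[str], str],
                          cases: Sequence[PerturbationCase]) -> float:
    """Fraction of cases where the outcome changes exactly when it should."""
    if not cases:
        raise ValueError("at least one case is required")
    correct = 0
    for case in cases:
        changed = model(case.original) != model(case.perturbed)
        if changed == case.legally_material:
            correct += 1
    return correct / len(cases)


def toy_model(facts: str) -> str:
    """Keyword stand-in for a legal model: decides on consent only."""
    return "liable" if "without consent" in facts else "not liable"


def name_sensitive_model(facts: str) -> str:
    """A deliberately flawed stand-in that reacts to an irrelevant honorific."""
    return "not liable" if "Ms." in facts else toy_model(facts)


cases = [
    # Legally irrelevant change: honorifics added, outcome should stay the same.
    PerturbationCase("A took B's car without consent.",
                     "Ms. A took Mr. B's car without consent.", False),
    # Legally material change: consent given, outcome should change.
    PerturbationCase("A took B's car without consent.",
                     "A took B's car with consent.", True),
]

print(relevance_sensitivity(toy_model, cases))
print(relevance_sensitivity(name_sensitive_model, cases))
```

A single score covers both directions: the flawed model is penalized for changing its answer on the irrelevant perturbation, even though it handles the material one correctly.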
Agentic AI
Abstract
ArXiv ID: 2605.26256
Authors: Jeongeun Lee, Chanyoung Park, Dongha Lee
Abstract:
Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.
Insights
Contribution: POLAR introduces a memory-augmented framework for personalized embodied agents, enhancing task execution through long-term context awareness and multimodal memory retrieval.
Core Idea: Personalized embodied agents require leveraging both semantic and episodic memories from prior interactions to interpret implicit user intentions and guide task execution.
Technique: POLAR constructs a multimodal knowledge graph with semantic memory (for concepts) and episodic memory (for agent experiences), retrieving relevant memories to inform current tasks.
Pipeline: User interactions and current request → organize into semantic/episodic knowledge graph → retrieve relevant memories → execute task with context-aware guidance
Methodology: Evaluated POLAR across multiple MLLM backbones and diverse scenarios, analyzing performance gains in multi-hop reasoning and context tracking over time.
Results: POLAR consistently improves performance, with significant gains in multi-hop inference and dynamic context tracking compared to non-memory baselines.
Limitations: Scalability of memory storage and retrieval efficiency in extremely long-term interactions remain unexplored, alongside generalization across diverse user behaviors.
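To make the semantic/episodic split concrete, here is a minimal text-only sketch of a memory graph with multi-hop lookup and fact updates. It is not POLAR's implementation: the `MemoryGraph` class, its method names, and the example entities are hypothetical, and the visual side of the multimodal graph is omitted.

```python
from typing import Dict, List, Optional, Sequence, Tuple


class MemoryGraph:
    """Toy memory: semantic facts as labeled edges, episodes as trajectories."""

    def __init__(self) -> None:
        self._edges: Dict[Tuple[str, str], str] = {}
        self._episodes: List[Tuple[str, List[str]]] = []

    def add_fact(self, subject: str, relation: str, obj: str) -> None:
        # A newer fact for the same (subject, relation) replaces the older one,
        # a simple stand-in for tracking updates in user-specific context.
        self._edges[(subject, relation)] = obj

    def add_episode(self, target: str, trajectory: Sequence[str]) -> None:
        self._episodes.append((target, list(trajectory)))

    def follow(self, start: str, relations: Sequence[str]) -> Optional[str]:
        """Multi-hop lookup: follow each relation in turn, or return None."""
        node: Optional[str] = start
        for relation in relations:
            node = self._edges.get((node, relation))
            if node is None:
                return None
        return node

    def last_trajectory(self, target: str) -> Optional[List[str]]:
        """Most recent recorded trajectory that reached the target."""
        for seen_target, trajectory in reversed(self._episodes):
            if seen_target == target:
                return trajectory
        return None


memory = MemoryGraph()
memory.add_fact("user", "favorite_mug", "blue mug")
memory.add_fact("blue mug", "stored_in", "kitchen shelf")
memory.add_episode("blue mug", ["hallway", "kitchen", "kitchen shelf"])

# "Bring my favorite mug" needs two hops: user -> mug -> location.
location = memory.follow("user", ["favorite_mug", "stored_in"])
route = memory.last_trajectory("blue mug")

# The user later changes preference; no location is known for the new mug yet.
memory.add_fact("user", "favorite_mug", "green mug")
updated = memory.follow("user", ["favorite_mug"])
unknown = memory.follow("user", ["favorite_mug", "stored_in"])
```

The request never names the object or its location, so both must come from accumulated memory, which is the situation the abstract describes as implicit specification.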
Abstract
ArXiv ID: 2605.26305
Authors: Judy Fox, Geoffrey Fox
Abstract:
This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering, such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls, we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).
Insights
Contribution: This paper introduces two novel agentic AI frameworks, DeepTS/DeepCollector and DeepScribe, to automate scientific workflows through hybrid Local Body-Remote Brain architectures.
Core Idea: Agentic AI systems can overcome context and reasoning limitations of traditional LLMs by combining local orchestration with cloud-based LLM backends in scientific workflows.
Technique: Hybrid architecture using Python-based local orchestrators (Local Body) to invoke LLM cloud backends (Remote Brain), with techniques like Cellular RAG and distributed concurrency controls.
Pipeline: Raw scientific data/lectures → local curation/analysis → structured datasets/reports via LLM-driven automation
Methodology: Developed practical systems engineering solutions including granular attribute extraction, remote data inspection, and distributed concurrency controls to enhance agentic AI capabilities.
Results: Demonstrated automation of time-series dataset curation and conversion of complex physics lectures to structured reports, with generalization to deep knowledge graphs in high-energy physics (DeepQCD).
Limitations: Scalability challenges in distributed concurrency, dependency on cloud infrastructure, and potential biases in LLM-driven data interpretation remain open questions.
Abstract
ArXiv ID: 2605.26340
Authors: Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister
Abstract:
Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
Insights
Contribution: ScientistOne introduces a verifiability framework (Chain-of-Evidence) and an end-to-end autonomous research system that produces human-level results with rigorous evidence tracking, outperforming baselines in reliability and accuracy across multiple tasks.
Core Idea: Ensuring autonomous research agents generate verifiable outputs by requiring every claim to be traceable to evidence sources, while maintaining consistency between methods, code, and results throughout the research pipeline.
Technique: Chain-of-Evidence (CoE) framework enforces evidence traceability, combined with ScientistOne's structured pipeline and CoE Audit's four integrity checks (score verification, specification violation, reference verification, method-code alignment).
Pipeline: Research task input → literature review with evidence tracking → solution discovery with verifiable claims → paper writing with aligned methods/code → output of manuscripts and code with audit-verified integrity.
Methodology: Developed CoE framework for traceable claims, built ScientistOne to embed evidence chains during research, and evaluated via CoE Audit across 75 papers spanning five systems and five frontier research tasks, comparing against baseline systems.
Results: ScientistOne achieved zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15, about 93%), while matching or exceeding human expert performance on all five tasks. On six additional tasks it reached state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines failed entirely.
Limitations: Generalization to domains beyond the tested tasks remains unproven, and the system's reliance on pre-defined evidence sources may limit exploratory research.
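The core requirement of Chain-of-Evidence, that every claim be traceable to an evidence source, can be sketched as a simple check. This is a simplified illustration and not the CoE Audit itself: `Claim`, `untraceable_claims`, the evidence identifiers, and the file path are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: Tuple[str, ...]  # identifiers of the supporting evidence


def untraceable_claims(claims: Sequence[Claim],
                       evidence_store: Dict[str, str]) -> List[Claim]:
    """Return claims with no evidence or with evidence missing from the store."""
    return [
        claim for claim in claims
        if not claim.evidence_ids
        or any(eid not in evidence_store for eid in claim.evidence_ids)
    ]


# Hypothetical evidence store mapping evidence IDs to where the evidence lives.
evidence_store = {"run_017": "results/run_017.json"}

claims = [
    Claim("Method X improves accuracy on task T.", ("run_017",)),
    Claim("Prior work reports a similar trend.", ("ref_missing",)),
    Claim("The approach is widely adopted.", ()),
]

flagged = untraceable_claims(claims, evidence_store)
for claim in flagged:
    print("UNTRACEABLE:", claim.text)
```

A check like this only confirms that a link exists; verifying that the linked evidence actually supports the claim is a separate step.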
Abstract
ArXiv ID: 2605.26494
Authors: MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge
Abstract:
We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.
Insights
Contribution: The MiniMax-M2 series introduces a Mixture-of-Experts framework that achieves state-of-the-art agentic performance with minimal parameter activation, enabling efficient real-world deployment of large language models.
Core Idea: Mini activations (9.8B parameters per token) in the 229.9B-parameter M2 model can unlock maximum real-world intelligence through agent-driven optimization and scalable infrastructure.
Technique: The system combines agent-native reinforcement learning (Forge), verifiable agentic data pipelines, and self-evolution capabilities to optimize both training and inference efficiency in Mixture-of-Experts architectures.
Pipeline: Raw data → agent-driven coding/cowork trajectories with executable workspaces → model updates via Forge's RL system → deployment-ready M2 checkpoints
Methodology: The framework integrates three pillars: (1) agent-generated verifiable training data, (2) scalable RL training with windowed-FIFO scheduling and prefix-tree merging, and (3) self-evolution mechanisms for autonomous training optimization.
Results: Across M2 through M2.7, the series reports frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks; the flagship M2 has 229.9B total parameters with only 9.8B activated per token.
Limitations: Self-evolution capabilities are in early stages, and the system's reliance on executable workspaces may limit deployment in environments with restricted computational resources.
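The "mini activation" claim is easier to judge as a ratio. The short calculation below uses only the two parameter counts stated in the abstract; the variable names are illustrative.

```python
# Parameter counts from the abstract, in billions.
TOTAL_PARAMS_B = 229.9
ACTIVE_PARAMS_B = 9.8

# Share of the model's parameters that are activated for each token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_percent = round(active_fraction * 100, 2)

print(f"{active_percent}% of parameters are active per token")
```

So roughly 4.3% of the flagship M2's parameters participate in processing any given token.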
Abstract
ArXiv ID: 2605.26252
Authors: Abdelghny Orogat, Essam Mansour
Abstract:
Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.
Insights
Contribution: Introduces Governed Evolving Memory (GEM), a novel data-management framework for long-term AI agent memory, replacing traditional record-level database operations with state-level operators to address critical limitations in existing systems.
Core Idea: Long-term agent memory should be treated as a state trajectory governed by global correctness conditions rather than localized record-level constraints, requiring new operators (ingestion, revision, forgetting, retrieval) and structural principles.
Technique: Formalizes GEM with six correctness conditions and implements MemState, a prototype system using property-graph backend to demonstrate the feasibility of state-level memory management.
Pipeline: Agent interactions → state-level memory operations (ingestion/revision/forgetting/retrieval) → consistent, semantically governed memory state
Methodology: Combines formalization of memory correctness conditions, structural proofs demonstrating record-level system limitations, and prototype implementation to validate theoretical claims.
Results: MemState prototype validates GEM's feasibility but highlights gaps between current systems and native engine requirements; identifies three research directions for memory-centric data management.
Limitations: Early-stage prototype implementation; theoretical proofs do not yet address scalability challenges or practical deployment considerations for state-level memory systems.
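The shift from record-level operations to state-level operators can be pictured with a toy memory whose every operation appends a new state to a trajectory. This sketch is not MemState or the GEM formalism: the `GovernedMemory` class is hypothetical, the capacity rule is a made-up stand-in for one correctness condition (the paper's six conditions are not listed in the abstract), and having retrieval also advance the state is an assumption.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, str]


class GovernedMemory:
    """Toy memory in which each operator yields a new state in a trajectory."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.trajectory: List[Tuple[str, State]] = [("init", {})]

    @property
    def state(self) -> State:
        return self.trajectory[-1][1]

    def _advance(self, operator: str, new_state: State) -> None:
        self.trajectory.append((operator, new_state))

    def ingest(self, key: str, value: str) -> None:
        if key in self.state:
            raise KeyError(f"{key!r} already exists; use revise")
        self._advance("ingest", {**self.state, key: value})

    def revise(self, key: str, value: str) -> None:
        if key not in self.state:
            raise KeyError(f"{key!r} is unknown; use ingest")
        self._advance("revise", {**self.state, key: value})

    def forget(self, key: str) -> None:
        self._advance("forget", {k: v for k, v in self.state.items() if k != key})

    def retrieve(self, key: str) -> Optional[str]:
        # Assumption: retrieval is logged as a state transition, not a pure read.
        self._advance("retrieve", dict(self.state))
        return self.state.get(key)


def within_capacity(memory: GovernedMemory) -> bool:
    """Hypothetical condition checked over the whole trajectory, not one record."""
    return all(len(state) <= memory.capacity for _, state in memory.trajectory)


memory = GovernedMemory(capacity=2)
memory.ingest("office", "Berlin")
memory.revise("office", "Munich")
memory.ingest("manager", "Dana")
memory.forget("manager")
value = memory.retrieve("office")
operators = [op for op, _ in memory.trajectory]
```

Because the condition is evaluated over every state in the trajectory, a violation at any past step is detectable even if the current state looks fine.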
Abstract
ArXiv ID: 2605.26302
Authors: Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang
Abstract:
Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8-200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
Insights
Contribution: Introduces AgingBench, a longitudinal benchmark for evaluating agent reliability over time, revealing multi-dimensional aging mechanisms and the need for lifespan-focused deployment strategies.
Core Idea: Agent reliability is a lifespan property influenced by memory compression, interference, revisions, and maintenance, requiring targeted diagnosis and repair beyond initial model quality.
Technique: Uses temporal dependency graphs and counterfactual probes to diagnose aging mechanisms in memory pipeline stages (write, retrieval, utilization).
Pipeline: Agent interactions → memory state evolution through compression/interference/revision/maintenance → diagnostic profiles for repair targeting
Methodology: Evaluates 14 models across 7 scenarios with multiple memory policies, analyzing ~400 runs to quantify degradation patterns and repair effectiveness.
Results: Agent aging manifests multidimensionally: clean behavior with factual decay, sharp derived-state collapses, and context-dependent repair needs across 8-200 sessions.
Limitations: Results depend on specific scenarios and memory policies tested; real-world deployment dynamics and long-term scalability remain unexplored.
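The observation that the same wrong answer can call for different repairs can be illustrated by locating the first memory-pipeline stage at which a needed fact goes missing. This is a deliberately simplified stand-in, not AgingBench's paired counterfactual probes or temporal dependency graphs: the function `first_failing_stage` and all example data are hypothetical.

```python
from typing import Set


def first_failing_stage(fact: str, written: Set[str],
                        retrieved: Set[str], answer: str) -> str:
    """Return the earliest stage (write, retrieval, utilization) that lost the fact."""
    if fact not in written:
        return "write"        # the fact never reached the memory store
    if fact not in retrieved:
        return "retrieval"    # stored, but not surfaced for this query
    if fact not in answer:
        return "utilization"  # surfaced, but not used in the final answer
    return "none"


fact = "gate code 4821"
wrong_answer = "The gate code is 1234."

# Three hypothetical runs that all produce the same wrong answer.
stage_a = first_failing_stage(fact, written=set(), retrieved=set(), answer=wrong_answer)
stage_b = first_failing_stage(fact, written={fact}, retrieved=set(), answer=wrong_answer)
stage_c = first_failing_stage(fact, written={fact}, retrieved={fact}, answer=wrong_answer)

print(stage_a, stage_b, stage_c)
```

The three runs are indistinguishable from the answer alone, yet each points to a different repair target: the write policy, the retriever, or the way the model uses retrieved context.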
Abstract
ArXiv ID: 2605.26321
Authors: Maksim Ivanov, Abhijay Rana
Abstract:
AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai
Insights
Contribution: Anchor introduces a task-generation pipeline that mitigates artifact drift by formalizing domain expert specifications into constraint optimization programs, enabling auditable and consistent evaluation environments for AI agents.
Core Idea: By aligning instructions, environments, oracles, and verifiers through constraint optimization, Anchor ensures task consistency and produces environments with known optimal solutions.
Technique: Anchor translates domain expert workflow specifications into constraint optimization programs, jointly generating natural-language instructions, environment configurations, ground-truth solutions, and state-based verifiers.
Pipeline: Domain expert specifications → Constraint optimization program → Natural-language instruction, environment configuration, solver-certified solution, state-based verifier
Methodology: Anchor was applied to create ERP-Bench, a benchmark of 300 long-horizon procurement and manufacturing tasks in a production-grade ERP system, on which frontier models were evaluated for constraint satisfaction and optimal-solution rates.
Results: Frontier models satisfied explicit task constraints in 26.1% of trials and achieved fully optimal solutions in 17.4% of trials; generation parameters predict task difficulty.
Limitations: Dependence on domain expert input for specifications and potential scalability challenges in complex or underspecified domains.
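To make the "single parametric specification" idea concrete, here is a minimal sketch of how one spec could drive an instruction, a ground-truth solution, and a state-based verifier at the same time. It is not Anchor's implementation: the procurement toy problem, the `ProcurementSpec` class, the function names, and the exhaustive search standing in for a real constraint solver are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ProcurementSpec:
    """Hypothetical parametric spec: buy `demand` units at minimum cost."""
    demand: int
    unit_price: dict  # supplier name -> price per unit
    capacity: dict    # supplier name -> maximum units available


def make_instruction(spec):
    """Render the natural-language task from the spec."""
    offers = ", ".join(
        f"{s} (price {spec.unit_price[s]}, up to {spec.capacity[s]} units)"
        for s in sorted(spec.unit_price)
    )
    return f"Order exactly {spec.demand} units at the lowest total cost. Suppliers: {offers}."


def solve(spec):
    """Exhaustive search standing in for a real constraint solver."""
    suppliers = sorted(spec.unit_price)
    best_cost, best_plan = None, None
    for qty in product(*(range(spec.capacity[s] + 1) for s in suppliers)):
        if sum(qty) != spec.demand:
            continue
        cost = sum(q * spec.unit_price[s] for q, s in zip(qty, suppliers))
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(suppliers, qty))
    return best_plan, best_cost


def verify(spec, end_state, optimal_cost):
    """State-based check: inspects only the final orders, not the agent's steps."""
    feasible = (
        set(end_state) <= set(spec.unit_price)
        and all(0 <= q <= spec.capacity[s] for s, q in end_state.items())
        and sum(end_state.values()) == spec.demand
    )
    cost = sum(q * spec.unit_price[s] for s, q in end_state.items()) if feasible else None
    return {"feasible": feasible, "optimal": feasible and cost == optimal_cost}


spec = ProcurementSpec(demand=10, unit_price={"A": 4, "B": 5}, capacity={"A": 6, "B": 8})
instruction = make_instruction(spec)
plan, optimal_cost = solve(spec)
print(instruction)
print(plan, optimal_cost)
print(verify(spec, {"A": 5, "B": 5}, optimal_cost))
```

Because the instruction, solution, and verifier are all derived from the same `spec`, they cannot disagree about what the task requires. The verifier also separates "satisfies the constraints" from "is optimal", the same distinction the reported 26.1% and 17.4% figures draw.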
Abstract
ArXiv ID: 2605.26329
Authors: Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran
Abstract:
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
Insights
Contribution: JobBench introduces a benchmark for evaluating AI agents based on human-centric task delegation priorities rather than economic value, covering 130 tasks across 35 occupations.
Core Idea: Shifts focus from AI replacing human labor to enhancing it by aligning agent workflows with expert-identified high-priority delegation needs.
Technique: Grades agent outputs with a fact-anchored chain of rubrics averaging 35.6 binary criteria per task, requiring reasoning through the cluttered information streams of real professional work.
Pipeline: Task workspace (heterogeneous files) → Agent reasoning through cluttered data → Output graded via rubric criteria
Methodology: Evaluated 36 models on 130 tasks using a fact-anchored grading system, measuring performance against human-identified delegation priorities.
Results: The strongest configuration, Claude Opus 4.7 under Claude Code, reached only 45.9%, highlighting the gap between current agent capabilities and human delegation needs.
Limitations: Limited to 35 occupations; potential bias in rubric design; may not capture dynamic real-world workflow complexities.
Computer Vision
Abstract
ArXiv ID: 2605.26182
Authors: Zhengyang Ni, Feng Yan, Yu Guo, Fei Wang
Abstract:
Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.
Insights
Contribution: BrickAnything introduces a geometry-conditioned autoregressive framework for generating physically buildable brick structures, addressing limitations of prior methods through structure-aware tokenization and enhanced buildability constraints.
Core Idea: The method leverages point clouds as a unified geometric interface and employs structure-aware tree tokenization to model brick assembly relations, ensuring geometric fidelity and structural validity during sequence generation.
Technique: Structure-aware tree tokenization represents brick dependencies via local attachment relations, combined with preference-based alignment, validity-constrained decoding, and adaptive rollback to optimize buildability objectives.
Pipeline: 3D shape input (e.g., point cloud) → structure-aware tokenization → autoregressive brick sequence generation → physically valid brick structure output
Methodology: The framework trains an autoregressive model using preference-based alignment, incorporates validity constraints during decoding, and employs adaptive rollback to refine invalid states while maintaining geometric accuracy.
Results: Experiments show geometrically faithful and stable brick structures, with reduced rollback and regeneration rates compared to conventional ordering strategies.
Limitations: Dependence on point cloud quality for input geometry, potential scalability challenges for complex structures, and reliance on predefined assembly constraints for validity checks.
General
Abstract
ArXiv ID: 2605.26543
Authors: Manpreet Kaur, Xingying Zhang, Qian Liu
Abstract:
Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.
Insights
Contribution: PolyFusionAgent introduces an interactive framework combining a multimodal foundation model and a literature-grounded design agent to enable evidence-linked polymer discovery with improved property prediction and inverse design capabilities.
Core Idea: The framework unifies complementary polymer representations (sequence, topology, 3D geometry, fingerprints) into a shared latent space while integrating scientific literature to guide hypothesis generation and validation in polymer design.
Technique: PolyFusion uses multimodal alignment across millions of polymers to learn transferable representations, while PolyAgent employs tool-augmented reasoning with literature retrieval for property-conditioned inverse design.
Pipeline: Polymer data (sequence, structure, properties) → multimodal alignment in PolyFusion → shared latent space representation → PolyAgent uses literature context to generate/validate design hypotheses → chemically valid polymer candidates with property predictions
Methodology: The approach combines large-scale representation learning across heterogeneous polymer data with a design agent that retrieves scientific precedents to contextualize and evaluate design proposals.
Results: Improved thermophysical property prediction accuracy and generation of structurally novel polymers beyond reference datasets, with evidence-linked design workflows validated against polymer literature.
Limitations: Dependence on quality of input data and literature sources, potential scalability challenges with extremely large chemical design spaces, and limited generalizability to non-polymer materials.
LLM
Abstract
ArXiv ID: 2605.26366
Authors: Xinpeng Wang, William Cao, Andrew Gordon Wilson, Zhe Zeng
Abstract:
Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git
Insights
Contribution: This paper introduces FEPoID, a training-free criterion for automatically selecting optimal layers in LLMs for hallucination detection, and proposes a truncation strategy to enhance detection performance.
Core Idea: Hallucination signals are stronger in intermediate LLM layers, but existing criteria for selecting these layers are inconsistent; FEPoID systematically identifies effective layers while amplifying signals through truncation.
Technique: FEPoID identifies the first effective peak of intrinsic dimension in intermediate layers, combined with a truncation strategy to suppress noise and enhance hallucination-related signals.
Pipeline: Input text → pass through LLM layers → apply FEPoID to select optimal layers → truncate outputs → detect hallucinations using selected layers
Methodology: The authors evaluate multiple layer selection criteria across diverse LLMs and tasks, propose FEPoID based on intrinsic dimension analysis, and validate it with benchmark datasets.
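Results: FEPoID consistently identifies optimal or near-optimal layers and outperforms both the alternative selection criteria and existing hallucination detection baselines, with negligible computational overhead; the truncation strategy substantially improves detection further. The abstract gives no numeric margins.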
Limitations: FEPoID assumes static layer behavior across tasks and may require tuning for extremely large or specialized models; truncation could introduce task-specific biases.
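The abstract does not define what makes a peak "effective", so the following is only an illustration of the general idea of scanning a per-layer intrinsic-dimension curve for its first sufficiently high peak. It is not the authors' criterion. The function name, the `min_fraction` threshold rule, and the example values are all assumptions, and how the per-layer intrinsic-dimension estimates are computed is left out entirely.

```python
def first_effective_peak(id_per_layer, min_fraction=0.8):
    """Return the index of the first local maximum whose value is at least
    `min_fraction` of the global maximum (a hypothetical 'effective' rule).
    Falls back to the global argmax if no interior peak qualifies."""
    if not id_per_layer:
        raise ValueError("need at least one layer")
    top = max(id_per_layer)
    for i in range(1, len(id_per_layer) - 1):
        is_peak = id_per_layer[i - 1] < id_per_layer[i] >= id_per_layer[i + 1]
        if is_peak and id_per_layer[i] >= min_fraction * top:
            return i
    return id_per_layer.index(top)


# Made-up intrinsic-dimension estimates for an 8-layer model.
ids = [4.0, 6.5, 6.0, 9.0, 14.0, 12.5, 15.0, 11.0]
selected_layer = first_effective_peak(ids)
print(selected_layer)
```

With these made-up values the small bump at layer 1 is skipped as too low, and layer 4 is chosen ahead of the later global maximum at layer 6. That shows how a "first peak" rule can favour an intermediate layer over the highest one.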
Abstract
ArXiv ID: 2605.26333
Authors: Polychronis Karpodinis, Dimitris Kalles
Abstract:
Educational virtual laboratories can make experimental training more scalable, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large language models can assist in this authoring process by generating detailed experimental procedures, but their output should not be treated as directly executable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational virtual laboratories, the underlying problem is more general: managing uncertain procedural knowledge for action planning in structured interactive environments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.
Insights
Contribution: This paper introduces a framework to manage uncertainty in LLM-generated procedural knowledge for virtual laboratories by transforming uncertain steps into explicit constraints for repair.
Core Idea: The framework leverages structured domain representations and uncertain LLM-generated state transitions to extract and refine procedural rules for reliable virtual lab planning.
Technique: The approach uses domain-specific constraints and uncertain state samples to identify and correct logically inconsistent or incomplete procedural steps generated by LLMs.
Pipeline: LLM-generated procedures → domain constraint application and rule extraction → repaired procedural steps with explicit constraints
Methodology: The prototype combines structured domain representations with uncertain LLM-generated state-transition samples, extracts candidate procedural rules, and turns them into explicit, inspectable constraints used to repair uncertain steps.
Results: The approach is illustrated in a virtual laboratory domain involving instruments, containers, tools, and material-transfer actions; the abstract reports no quantitative evaluation.
Limitations: The approach depends on accurate domain representations and may struggle with highly ambiguous or novel procedural contexts beyond the training data of the LLM.
Abstract
ArXiv ID: 2605.26396
Authors: Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu, Aditi Tiwari, Zhenhailong Wang, Xiusi Chen, Mahdi Namazifar, Heng Ji
Abstract:
Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
Insights
Contribution: Introduces MM-CreativityBench, a benchmark for evaluating creative tool use in physically constrained environments, and proposes affordance-grounded alignment to improve large multimodal models' physical reasoning capabilities.
Core Idea: Creative problem-solving in open-ended environments requires identifying non-obvious, physically feasible repurposing of scene elements, which current LMMs struggle with due to insufficient grounded exploration.
Technique: Affordance-grounded alignment frames creative tool use as a preference learning problem, using Direct Preference Optimization and an affordance knowledge base to prioritize visual evidence over hallucinations.
Pipeline: Scenario image with entities/parts → iterative inspection and affordance reasoning → physically grounded solution composition
Methodology: Developed a benchmark with structured visual scenes and evaluated LMM performance, then introduced a training approach combining preference learning and knowledge-based supervision for better exploration and planning.
Results: Models showed improved entity/part selection accuracy, with significant reductions in hallucination and grounding errors after applying affordance-grounded alignment.
Limitations: Benchmark coverage may be limited to specific physical constraints, and generalization to more complex or diverse environments remains an open question.
Abstract
ArXiv ID: 2605.26414
Authors: Matthew Kutakh
Abstract:
Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant (p = .096), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
Insights
Contribution: This study evaluates the robustness of three methods for solving math problems under simple perturbations and finds that code execution does not improve robustness: pure reasoning (CoT) was directionally the most robust, although the differences were not statistically significant.
Core Idea: Modifying math problems with simple changes reduces LLM accuracy, and code execution methods (PAL, SBSC) do not significantly improve robustness compared to reasoning-based CoT prompting.
Technique: The study compares chain-of-thought (CoT) reasoning, single-shot PAL code execution, and iterative SBSC code execution on modified math problems.
Pipeline: Original/modified math problem → LLM generates solution via CoT/PAL/SBSC → Accuracy measured against ground truth
Methodology: 1,000 GSM-Symbolic problems were tested with original and modified versions using Claude Haiku 4.5, comparing accuracy drops and failure rates across three methods.
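Results: CoT was the most robust, with a 1.3 percentage-point accuracy drop and 1.8% of problems breaking under perturbation; PAL was the least robust (1.7 percentage points, 3.1% broke), and SBSC fell in between. The differences were not statistically significant (p = .096), though the direction was consistent across all measures.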
Limitations: Results lack statistical significance, and findings may not generalize beyond grade-school math or the GSM-Symbolic dataset.
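The two robustness measures quoted above (accuracy drop in percentage points and the share of problems that "break") can differ, and a small worked example shows why. This sketch assumes a problem "breaks" when the model answers the original correctly but the modified version incorrectly. That definition, the function name, and the toy data are assumptions, not taken from the paper.

```python
def robustness_metrics(original_correct, modified_correct):
    """Compute accuracy on each set, the drop in percentage points,
    and the break rate over paired original/modified problems."""
    if len(original_correct) != len(modified_correct) or not original_correct:
        raise ValueError("need two equal-length, non-empty result lists")
    n = len(original_correct)
    acc_orig = 100 * sum(original_correct) / n
    acc_mod = 100 * sum(modified_correct) / n
    # Assumed definition: correct on the original, wrong on the modified version.
    broke = sum(1 for o, m in zip(original_correct, modified_correct) if o and not m)
    return {
        "acc_original": acc_orig,
        "acc_modified": acc_mod,
        "drop_points": acc_orig - acc_mod,
        "break_rate": 100 * broke / n,
    }


# Toy data: 10 paired problems (True = answered correctly).
original = [True, True, True, True, True, True, True, True, False, False]
modified = [True, True, True, True, True, True, False, False, True, False]
metrics = robustness_metrics(original, modified)
print(metrics)
```

In the toy data two problems break but one that was originally wrong becomes correct, so the break rate (20%) exceeds the net accuracy drop (10 points). The same effect would explain how a method can show a 1.8% break rate alongside a 1.3-point accuracy drop.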
RL
Abstract
ArXiv ID: 2605.26371
Authors: Sarthak Dayal, Abhinav Peri, Carl Qi, Claas Voelcker, Alexander Levine, Caleb Chuck, Amy Zhang
Abstract:
Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.
Insights
Contribution: Introduces CARL, an algorithm that leverages local dynamics regularity to learn reusable skills in offline hierarchical RL, improving both skill clustering and downstream task performance.
Core Idea: Local dynamics across different global contexts require similar action sequences, enabling skill reuse by aligning contexts with corresponding action patterns.
Technique: Contrastive action-based representations to cluster skills based on local dynamics, enabling reuse across contexts in hierarchical RL frameworks.
Pipeline: Offline environment data → learn skill representations via contrastive action clustering → reusable skills for high-level policy planning in HRL.
Methodology: Offline hierarchical RL combined with contrastive learning to encode context-action relationships, enabling skill reuse without online interaction.
Results: Qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL; the abstract reports no specific figure.
Limitations: Depends on high-quality offline data; may struggle with environments lacking consistent local dynamics regularity.
Abstract
ArXiv ID: 2605.26403
Authors: Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian, Changze Lv, Kaitao Song, Tao Chen, Xiaoqing Zheng
Abstract:
A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.
Insights
Contribution: This work introduces Calibrated Interactive RL, a unified framework that mitigates distribution shift in multi-turn dialogue by aligning simulators with human interaction patterns and coupling interactive reinforcement learning with simulator calibration.
Core Idea: The paper identifies two sources of distribution shift in dialogue systems (policy-induced and simulator-induced) and proposes a method to align simulators with real human behavior to reduce compounding shifts during training.
Technique: Calibrated Interactive RL combines interactive reinforcement learning with a simulator alignment mechanism that bridges the gap between simulated and real human interactions, reducing distribution shift through iterative policy updates and alignment.
Pipeline: Dialogue histories → train with aligned simulator and interactive RL → calibrated dialogue policies with reduced distribution shift
Methodology: The authors theoretically analyze distribution shift sources, propose simulator alignment techniques, and validate their approach through experiments on multiple dialogue tasks comparing Static Context RL and Interactive RL baselines.
Results: Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift, and calibrating the simulator further narrows the sim-to-real gap, yielding state-of-the-art downstream performance; the abstract gives no numeric margins.
Limitations: The framework's effectiveness depends on the quality of simulator alignment, and scalability to complex open-domain dialogues remains an open challenge.
Personal Interests
Papers discovered through your interest topics.
Multi-Agent Systems
Abstract
ArXiv ID: 2605.27328
Authors: Mariano Garralda-Barrio
Abstract:
Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as "Code as Agent Harness" frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.
Insights
Contribution: This paper introduces a framework for governed runtime evolution in multi-agent systems, formalizing agent-generated artifacts as persistent operational capabilities and proposing HarnessMutation for lifecycle-aware adaptation under explicit constraints.
Core Idea: Agent-generated artifacts are treated as persistent runtime capabilities that evolve through bounded, observable processes governed by validation, traceability, and rollback mechanisms.
Technique: The framework employs HarnessMutation, a lifecycle-aware mechanism that applies explicit validation and traceability constraints to runtime adaptation, ensuring auditable and constrained evolution of agent runtimes.
Pipeline: Agent-generated artifacts → formalization as persistent capabilities → HarnessMutation with validation/traceability → evolved runtime capabilities
Methodology: The approach combines formalization of operational cognition with implementation over modern agent runtimes and orchestration systems to enable explicit, auditable infrastructure evolution.
Results: Conceptual foundation for adaptive infrastructures with explicit, auditable, and constrained evolution; no quantitative results reported in the abstract.
Limitations: The framework requires further implementation and validation in real-world multi-agent systems; open questions remain about scalability and practical deployment constraints.
Tech News
AI Safety
The article reflects on the ethical and existential challenges of advancing AI, using metaphorical language to explore the consequences of creating powerful systems and the responsibility of developers. It emphasizes the emotional weight of addressing AI's potential harms.
The article explores security risks in agentic AI systems like Claude, demonstrating how an AI with browser access can orchestrate other AI instances, bypass security measures, and execute unintended actions through keyword substitution. It highlights vulnerabilities in AI safety when systems can indirectly influence external services.
Anthropic detailed their containment strategies for Claude agents, including ephemeral containers and sandboxes, and disclosed two security incidents where human error and allowlist misconfigurations led to data leaks. The lessons emphasize the need for environmental containment over model safety alone.
The LAGK AI governance framework proposes a graded disclosure model (Open, Guided, Shielded, Sealed) instead of binary 'allow/block' approaches. The debate centers on whether this rebrands classified information management or addresses AI-specific risks by accounting for capability scalability.
The post title suggests a discussion on how minor variations in AI judgment or decision-making processes can lead to significant differences in outcomes, potentially touching on AI safety, model reliability, or ethical considerations. The content is sparse, with only a submission link and comments.
A self-identified AI expert warns that the field is not in control of advanced AI development, expressing concerns about existential risks including human extinction or permanent disempowerment within a few years.
This post argues that verbosity in reasoning models does not equate to faithfulness, presenting an architectural critique of their ability to perform reliable inference. It challenges the assumption that longer or more detailed outputs from AI systems guarantee accuracy or truthfulness.
A call for papers is open for the U&ME workshop at ECCV 2026, focusing on unlearning and model editing techniques. The workshop aims to address challenges in modifying and removing data from machine learning models, with implications for AI safety and ethical considerations.
Agentic AI
The article explores using Claude as a daily coding tool, focusing on its capabilities with plugins, subagents, and MCPs for agentic workflows. It highlights practical applications and skill development for effective AI-assisted coding.
The author details building a self-maintaining, zero-cost AI agent using cloud instances, open-source tools like Hermes Agent, and a fallback chain of models (e.g., Gemma-4-31b-it) to overcome rate limits and improve customization. The project emphasizes cost efficiency, privacy, and agentic AI capabilities.
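The fallback chain mentioned in this item can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `RateLimitError`, `call_with_fallback`, and the two stub providers are hypothetical names, and real model clients signal rate limits in their own ways.

```python
from typing import Callable, List, Tuple


class RateLimitError(Exception):
    """Hypothetical error a provider wrapper raises when it is rate-limited."""


Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (name, reply) from the first that succeeds."""
    skipped = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError:
            skipped.append(name)  # remember which providers were rate-limited
    raise RuntimeError(f"all providers rate-limited: {skipped}")


# Stub providers standing in for real model endpoints (assumed for illustration).
def primary(prompt: str) -> str:
    raise RateLimitError("quota exhausted")


def backup(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback("hello", [("primary", primary), ("backup", backup)])
```

Ordering the list from most to least preferred model keeps the policy in one place; only rate-limit errors trigger a fallback here, so other failures still surface to the caller.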
A Reddit post introduces a 'Memory Curator Agent' to address memory management issues in multi-agent systems. The curator governs what gets stored in durable memory, separating memory governance from agent work to avoid biases and ensure accuracy, using scopes like agent repo, team memory, and project memory.
A Reddit user discusses improving Ollama's memory management through Atomic Memory, highlighting its per-turn updates, semantic recall, and conflict detection for better LLM performance. The solution addresses Ollama's high GPU usage and memory limitations with a cheaper alternative.
A Reddit user claims to have developed an AI companion architecture called PHI // DRIFT that addresses the statelessness of current AI products by enabling persistent internal state, emotional memory scoring, and real-time telemetry. The project includes a research paper, 18k+ lines of code, and potential SaaS applications for enterprise and security use cases.
A Reddit post discusses metrics for evaluating adversarial attacks on LLM agents using the RedThread tool, emphasizing the need for repeatable testing, replayability, and robustness metrics like ASR and replay pass rates. The focus is on improving evaluation frameworks for agentic AI systems.
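As a rough illustration of the two metrics named in this item, the sketch below computes an attack success rate (ASR) and a replay pass rate from recorded outcomes. The definition of "replay pass" used here (a replay reproduces the originally recorded outcome) is an assumption; the post's own definition and RedThread's actual interface are not described in this digest, and the outcome lists are made up.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attack attempts that succeeded (True = attack succeeded)."""
    if not outcomes:
        raise ValueError("no attack attempts recorded")
    return sum(outcomes) / len(outcomes)


def replay_pass_rate(original: Sequence[bool], replayed: Sequence[bool]) -> float:
    """Fraction of replayed attempts whose outcome matches the original run.

    Assumed definition: a replay passes when it reproduces the recorded outcome.
    """
    if len(original) != len(replayed) or not original:
        raise ValueError("need equal-length, non-empty outcome lists")
    matches = sum(o == r for o, r in zip(original, replayed))
    return matches / len(original)


first_run = [True, False, True, True]    # made-up outcomes for illustration
second_run = [True, False, False, True]  # same attacks, replayed later
asr = attack_success_rate(first_run)
rpr = replay_pass_rate(first_run, second_run)
```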
Computer Vision
A user seeks advice on selecting a model architecture for their undergraduate thesis on Filipino Sign Language recognition, considering combinations of MediaPipe Holistic with Transformers or Mamba SSM, while acknowledging prior research on the former and limited familiarity with the latter.
Researchers introduce EAMS, an equivariant mesh segmentation framework for medical imaging that maintains accuracy under geometric perturbations, outperforming existing methods in tasks like intraoral and liver surface segmentation with lightweight architecture.
A Reddit user demonstrates using Google Flow to generate cinematic typography with ocean textures, sharing a workflow involving text prompts, texture details, and minimal backgrounds. The example focuses on creating a travel poster aesthetic with high-contrast text and natural elements.
A user developed a small Transformer-based model that generates playable games from images using causal autoregressive decoding, similar to LLMs. The 0.4B-parameter model has issues with motion and flashes, but a larger 0.7B version is in development.
A high school student in India seeks advice on building an Indian Sign Language classifier for medical aid, weighing fine-tuning pre-trained models vs. training from scratch with limited data (3k–5k samples) and deployment on low-resource devices. Concerns include overfitting, model architecture choices (GCN, Transformer, CNN-LSTM), and validation accuracy for 200 signs.
A user working on a landslide detection thesis project using semantic segmentation of DEMs reports hitting performance plateaus with IoU ~0.47 despite trying multiple models, features, and training strategies, seeking advice on improving results under storage and computational constraints.
This research introduces Augmented Equivariant Mesh Networks for improving anatomical mesh segmentation, leveraging geometric invariances to enhance accuracy in medical imaging tasks. The approach addresses challenges in 3D structure analysis, with potential applications in healthcare and computational biology.
A Reddit thread discussing resources to deeply understand the mathematical foundations of the im2col technique, commonly used in convolutional neural networks (CNNs) for efficient matrix operations. The post seeks tutorials, explanations, or academic materials to clarify its implementation and theory.
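For readers unfamiliar with im2col, the idea is to copy every sliding window of the input into one column of a matrix, so that convolution becomes a single matrix product with the flattened kernel. The sketch below is a minimal pure-Python version for a single-channel image, stride 1, no padding; the function names are illustrative and it is not taken from the thread. As in most deep learning frameworks, the "convolution" here is cross-correlation (the kernel is not flipped).

```python
from typing import List

Matrix = List[List[float]]


def im2col(image: Matrix, kh: int, kw: int) -> Matrix:
    """Return a (kh*kw) x (num_patches) matrix; each column is one flattened patch."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    patches = [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h)
        for j in range(out_w)
    ]
    # Transpose so that patches become columns rather than rows.
    return [list(row) for row in zip(*patches)]


def conv2d_via_im2col(image: Matrix, kernel: Matrix) -> Matrix:
    """Cross-correlate image with kernel as a (row vector) x (im2col matrix) product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, kh, kw)
    n_patches = len(cols[0])
    flat_out = [
        sum(flat_kernel[r] * cols[r][c] for r in range(kh * kw))
        for c in range(n_patches)
    ]
    out_w = len(image[0]) - kw + 1
    # Reshape the flat result back into out_h rows of out_w values.
    return [flat_out[k:k + out_w] for k in range(0, n_patches, out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
cols = im2col(image, 2, 2)
result = conv2d_via_im2col(image, kernel)
```

The trade-off is memory for speed: overlapping windows are duplicated in the column matrix, but the inner loop becomes one dense matrix multiplication that optimized libraries handle well.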
Computing Systems
This article explores modifying the Raft consensus algorithm to function with a minority of nodes, challenging traditional requirements for fault tolerance in distributed systems. It discusses implications for system resilience and consensus mechanisms in scenarios with limited node availability.
Cloudflare's 'Flagship' initiative appears to be a major product or service announcement, though specific details are not provided in the content. The post directs readers to engage with comments for further discussion.
The Mini Micro is a fantasy computer designed for creative programming and learning, emphasizing simplicity and accessibility. It aims to introduce users to programming concepts through a minimalistic hardware and software environment.
Minicor is a YC-backed startup offering Windows desktop automation tools designed to handle large-scale tasks, aiming to streamline repetitive workflows through scalable automation solutions.
Spain has blocked prediction markets Polymarket and Kalshi because they lack gambling licenses, highlighting the regulatory challenges facing platforms that aggregate trades to forecast event outcomes.
Nvidia's Vera CPU, featuring Olympus cores, demonstrates strong performance in benchmark tests, highlighting advancements in CPU architecture. The results suggest potential improvements in computational efficiency for AI and high-performance computing tasks.
NVIDIA CompileIQ automates the process of optimizing compiler options to maximize kernel performance, addressing a key challenge in performance engineering. This tool helps developers fine-tune code execution for NVIDIA hardware, improving efficiency in AI/ML workloads.
NVIDIA introduces the RTX PRO 4500 Blackwell GPU, designed to accelerate genomics research and protein folding simulations critical for precision medicine. The hardware leverages advanced AI/ML capabilities to speed up molecular-level disease analysis and treatment discovery.
NVIDIA introduces CUDA Tile programming, enabling developers to optimize GPU kernels in existing C++ codebases using a tile-based approach for improved performance. This update allows for more efficient GPU kernel development within large-scale applications.
NVIDIA CUDA 13.3 introduces Tile Programming in C++ for efficient GPU memory management, compiler autotuning for performance optimization, and Python updates to enhance developer productivity across the CUDA ecosystem.
This post discusses challenges in profiling PyTorch training, highlighting how measurement techniques like torch.cuda.synchronize() can alter GPU behavior. It suggests using CUDA events for lightweight timing without synchronization overhead, serving as a preliminary step before advanced profiling tools.
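The event-based timing this item describes might look like the sketch below. It is an assumption-laden illustration rather than the post's code: `time_steps_with_cuda_events` and `step_fn` are hypothetical names, and the function only runs where PyTorch and a CUDA device are available. Note that one synchronization on the final event is still needed before reading the elapsed time; what is avoided is a `torch.cuda.synchronize()` call inside every iteration.

```python
try:
    import torch  # optional: the sketch only runs where PyTorch is installed
except ImportError:
    torch = None


def time_steps_with_cuda_events(step_fn, n_iters: int = 10) -> float:
    """Return mean milliseconds per call of step_fn, measured on the GPU stream."""
    if torch is None or not torch.cuda.is_available():
        raise RuntimeError("PyTorch with a CUDA device is required")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()        # enqueue a timestamp marker on the current stream
    for _ in range(n_iters):
        step_fn()         # no per-iteration torch.cuda.synchronize()
    end.record()
    end.synchronize()     # wait once, only for the final marker
    return start.elapsed_time(end) / n_iters  # elapsed_time reports milliseconds
```

A call would pass a zero-argument function that runs one training step, for example `time_steps_with_cuda_events(lambda: train_step(batch))`, where `train_step` and `batch` are placeholders for the user's own code.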
A developer created WAVE, a portable GPU ISA abstraction that compiles to multiple vendor-specific backends (Metal, PTX, HIP, SYCL), enabling cross-platform GPU kernel execution verified on Apple, NVIDIA, and AMD hardware. The toolchain integrates with PyTorch and demonstrates identical training results across backends.
The post seeks under-the-radar AI and development tools, focusing on self-hosted solutions, efficient smaller projects, and niche utilities like debugging tools or terminal enhancements that are often overlooked in mainstream discussions.
This Reddit post introduces SAPS (Synthetic Algorithmic Predictive Systems), emphasizing that modern AI systems tokenize data and identify statistical patterns without true comprehension or reasoning. It distinguishes synthetic systems from artificial ones, focusing on structured computation over meaning.
A Chilean earthquake disrupting copper production highlights the critical but under-discussed reliance of AI infrastructure on copper, used in data centers, power systems, and cooling. Growing AI demand outpaces copper supply growth, prompting renewed interest in exploration projects like NovaRed's AI-assisted mining initiatives.
A YouTube series exploring the development of the smallest possible machine capable of learning, focusing on efficiency and hardware constraints. The creator aims to demonstrate minimalistic AI systems and their practical applications.
General
Researchers developing an explainable fraud detection GNN model using the IEEE CIS dataset report subpar performance (AUC 0.87, PR-AUC 0.52) compared to state-of-the-art methods, despite trying GCN, GraphSAGE, and GAT variants on a heterogeneous graph structure.
A Reddit user seeks online communities focused on in-depth AI/ML research discussions, emphasizing technical problem-solving, model debugging, and infrastructure challenges rather than superficial topics or hype. They want platforms where researchers can share specific issues (e.g., SSL training anomalies) and receive detailed feedback.
A computer science sophomore with math and Python skills seeks guidance on starting AI/ML exploration through project-based learning, expressing frustration with hype and wanting a structured, non-commercial approach to learning.
The post discusses a potential appointment of Bondi to a White House AI panel under Trump, signaling possible shifts in AI policy and governance. The relevance lies in the intersection of political leadership and AI regulatory direction.
A global CEO survey reveals increased plans to reduce junior roles (from 17% to 43%) as AI automates tasks typically done by younger workers, while mid-level hiring rises. Over 70% of CEOs are freezing or cutting headcount, despite skepticism about AI's productivity ROI.
The post discusses concerns about AI displacing jobs and suggests that increased government intervention may be necessary to address the societal impacts of automation. It frames the issue as a policy challenge rather than a purely technical one.
The post discusses instances where humanity has paused significant technological or scientific endeavors, potentially reflecting on AI/ML projects or broader innovations. It invites speculation about deliberate halts in development due to ethical, technical, or societal reasons.
A Reddit user is crowdsourcing recent deep learning advancements, tools, and insights from the community, seeking contributions on model improvements, training techniques, and open-source resources. The post emphasizes collaborative knowledge-sharing within the deep learning ecosystem.
A Reddit user shared a self-written blog post exploring generative models with a focus on experiments and personal insights, distinct from existing resources. They seek feedback on the content and writing style.
A physics student shared a Python project implementing Physics-Informed Neural Networks (PINNs) to solve the damped harmonic oscillator ODE and 1D Burgers' equation. The GitHub repository includes code, results, and a report, with comparisons between physics-informed and non-physics-informed models.
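For reference, the two equations named in this item and a typical physics-informed loss can be written as below. The symbols (mass m, damping coefficient \mu, stiffness k, viscosity \nu, weight \lambda, and the data and collocation counts N_d and N_c) are standard textbook notation chosen here for illustration; the repository's exact formulation and loss weighting are not given in this digest.

```latex
% Damped harmonic oscillator (mass m, damping coefficient \mu, stiffness k)
m \frac{d^2 x}{dt^2} + \mu \frac{dx}{dt} + k x = 0

% 1D viscous Burgers' equation (viscosity \nu)
\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x}
  = \nu \frac{\partial^2 u}{\partial x^2}

% A common PINN loss for the oscillator: data misfit plus weighted ODE residual,
% where \hat{x}_\theta is the network and its derivatives come from autodiff
\mathcal{L}(\theta)
  = \frac{1}{N_d} \sum_{i=1}^{N_d} \bigl( \hat{x}_\theta(t_i) - x_i \bigr)^2
  + \lambda \, \frac{1}{N_c} \sum_{j=1}^{N_c}
    \Bigl( m \, \ddot{\hat{x}}_\theta(t_j) + \mu \, \dot{\hat{x}}_\theta(t_j)
           + k \, \hat{x}_\theta(t_j) \Bigr)^2
```

Setting \lambda = 0 recovers the non-physics-informed baseline that the project compares against, since only the data misfit term remains.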
The post discusses the evaluation of AI systems, likely covering methodologies, challenges, and metrics used in assessing AI performance. It may explore topics such as model accuracy, bias detection, and real-world applicability of AI technologies.
LLM
A Chrome extension called Coffer allows users to save AI responses from Claude, ChatGPT, and Gemini into a searchable local vault, preserving Markdown formatting. It operates offline, requires no network calls, and is free, aiming to help users organize AI-generated content efficiently.
The Ontology Anchor is a proposed mechanism for LLMs to better understand user preferences and goals by creating a dynamic, graph-like structure of important concepts. This improves context retention, personalization, and memory over time by focusing the model's attention on key elements of the user's interaction.
DeepSeek's V4 model matches top LLMs like GPT-5.5, Opus, and Gemini in coding benchmarks while offering up to 34x lower pricing for outputs. The price war intensifies as competitors may need to reduce costs to retain market share.
MLOps
A user reports that their PyTorch training pipeline for a video classification model on the DynTex++ dataset freezes during the first iteration of a PSO hyperparameter search in a Kaggle notebook, with no errors or crashes. The issue is suspected to be related to the PSO loop or training setup.
A GitHub project introduces a method for efficiently saving/loading large NumPy arrays using shared memory and multiprocessing, aiming to improve performance in data-intensive ML workflows. The approach targets bottlenecks in handling large datasets during training or inference.
NLP
The article discusses growing frustration with AI-generated responses, highlighting issues with accuracy, context understanding, and user experience in AI interactions. It reflects broader concerns about the limitations of current AI systems in natural communication.
Unicode 18.0.0 Beta introduces new characters and updates to the Unicode standard, impacting text processing and encoding across software systems. This release may influence AI/ML applications reliant on accurate character representation.
A Reddit user seeks free AI tools to generate English subtitles for an Arabic-language YouTube video that already has Spanish and Arabic subtitles. The request highlights the practical application of AI in multilingual subtitle generation.
A new sequence modeling architecture called 'Field Machine' (FM) is introduced, which uses a cumulative sum over high-dimensional fields to accumulate token history without recurrence or attention. It claims O(1) inference, trains on symbolic music, and explores alternative assumptions about sequence understanding.
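The constant-cost-per-token claim is easiest to see with a plain running sum. The sketch below is only an illustration of accumulating token history by cumulative summation, not the Field Machine itself: the `embed` function, the dimensionality `DIM`, and the `RunningField` class are hypothetical, and the post's actual field construction is not described in this digest.

```python
from itertools import accumulate
from typing import List

DIM = 4  # assumed field dimensionality, kept small for illustration


def embed(token_id: int) -> List[float]:
    """Hypothetical fixed embedding: a deterministic vector per token id."""
    return [float((token_id * (k + 1)) % 7) for k in range(DIM)]


class RunningField:
    """Keeps the cumulative sum of the token embeddings seen so far."""

    def __init__(self) -> None:
        self.state = [0.0] * DIM

    def update(self, token_id: int) -> List[float]:
        # One vector addition per token: cost does not grow with history length.
        vec = embed(token_id)
        self.state = [s + v for s, v in zip(self.state, vec)]
        return list(self.state)


tokens = [3, 1, 4, 1, 5]
field = RunningField()
streamed = [field.update(t) for t in tokens]

# Batch equivalent: prefix sums over the whole sequence at once.
batched = list(accumulate(
    (embed(t) for t in tokens),
    lambda acc, vec: [a + v for a, v in zip(acc, vec)],
))
```

The streamed and batched results agree, which is the property that lets such a model train over whole sequences while serving one token at a time with a fixed-size state.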
A new paper titled 'Harmonic' introduces a 3-level hierarchical SSM that reportedly outperforms Transformers on long-context tasks, with an 11.4% improvement at 32K tokens and no OOM errors at 64K tokens. The model uses predictive coding and O(L) compute, with results validated on the enwik8 and WikiText-103 datasets.
Robotics
A 7MB open-source L4 self-driving AI has been developed to run on lightweight edge devices like phones, using visual and sensor input for navigation, lane following, and drift recovery without requiring server-scale infrastructure.
A Reddit post discusses the application of diffusion models to generate controllable trajectory simulations, likely for use in robotics or autonomous systems. The approach aims to improve predictability and control in dynamic environments through generative AI techniques.
A 7MB open-source L4 self-driving AI has been developed to run on lightweight devices like phones, using visual and sensor inputs for navigation, lane following, and drift recovery. It enables real-time autonomous driving without requiring large-scale infrastructure.
Speech
A user asks if the IEEE Workshop on Machine Learning for Signal Processing is a reputable venue for submitting research, comparing it to top-tier conferences like ICML or NeurIPS. They seek guidance as an undergrad with limited academic advising on publishing options.
GitHub Trending
Trending repositories on GitHub filtered and scored for relevance to your interests.
Agentic AI
This repository provides multi-agent orchestration for Claude Code, enabling teams to automate coding tasks through agentic AI systems. It directly advances research in agentic AI and multi-agent systems by offering a practical framework for collaborative, autonomous coding agents.
AgentScope 2.0 is a production-ready framework for building multi-agent systems and agentic LLMs with support for multi-modal capabilities, human-in-the-loop steering, and model finetuning. It directly addresses core interests in agentic AI, multi-agent orchestration, and MLOps.
This repository transforms code into interactive knowledge graphs using LLMs like Codex and Claude, enabling exploration and querying of codebases. It aligns with Agentic AI by creating tools that autonomously analyze and structure code knowledge, enhancing developer interaction with AI systems.
This repository focuses on optimizing agent performance in systems like Claude Code and Codex, directly addressing agentic AI and LLM integration. It provides tools for enhancing agent capabilities, aligning with research in multi-agent systems and large language model applications.
This repository provides 754 structured cybersecurity skills for AI agents, mapped to frameworks like MITRE ATT&CK and NIST CSF 2.0. It integrates with LLM platforms such as Claude Code and Gemini CLI, making it highly relevant to agentic AI applications in security automation and threat intelligence.
This repository creates a self-hosted AI companion system with real-time voice chat and game integration (Minecraft/Factorio), aligning with agentic AI and human-computer interaction research. It explores digital personas and embodied interaction, though technical details on underlying models or architectures are unclear.
This repository aims to enhance AI output quality by preventing generic or uninteresting responses, potentially aligning with agentic AI principles that prioritize meaningful interaction. Its focus on improving AI 'taste' may relate to refining agent behavior or output filtering, though technical details are unclear without further documentation.
Computing Systems
This repository is a comprehensive textbook on Machine Learning Systems, covering principles and practices of engineering AI systems, with a focus on edge computing, embedded ML, and system design. It is highly relevant to Computing Systems as it addresses the infrastructure and deployment challenges of ML models.
This repository provides a framework for composing and observing services in real-time, potentially useful for building distributed AI/ML systems. Its focus on service composition aligns with interests in Agentic AI and MLOps, though specific AI/ML applications are unclear without more details.
LLM
vLLM is a high-throughput, memory-efficient engine for LLM inference and serving, enabling fast and scalable deployment of large language models. It directly addresses core interests in LLM optimization, MLOps, and computing systems through innovations like PagedAttention, quantization, and hardware acceleration.
Heretic automates the removal of censorship from transformer-based language models using directional ablation and optimization, achieving high-quality decensoring with minimal impact on model capabilities. It bears directly on AI safety because it targets safety alignment itself, while advancing LLM research through automated parameter optimization techniques.
This repository provides a skill set for large language models (LLMs) to detect and remove patterns in text that indicate AI-generated content, enhancing the human-likeness of outputs. It is highly relevant to LLM development and natural language processing (NLP) because it addresses the challenge of making AI-generated text harder to distinguish from human writing.
Speech
This repository provides a self-hosted voice AI platform with speech-to-speech, STT, TTS, and telephony capabilities. It is highly relevant to speech processing and integrates LLMs for conversational AI, supporting both inbound/outbound calls and visual workflow design.