Wednesday, June 17, 2026

Daily Digest 2026-06-17

Today’s digest is dominated by the evolution of agentic workflows, specifically focusing on multi-agent architectures, self-evolving capabilities, and the rigorous benchmarking of complex decision-making in specialized domains.

Research highlights:

  • Agentic Architectures and Evolution: Research explores distributed agent networks, self-evolving systems for legal and general tasks, and the governance of experience extraction in reinforcement learning.
  • Evaluation and Benchmarking: New frameworks are being developed to assess agent behavior in spatial navigation, clinical speech, and strategic resource reallocation.
  • Model Reasoning and Memory: Studies investigate structural uncertainty in logical reasoning, the limitations of long-term memory accuracy, and the ability of models to conceptualize zero.
  • AI Safety and Bias: Analysis focuses on brand bias and cognitive manipulation within recommendation systems to understand incumbent advantages.
  • Physical and Clinical AI: Developments include unified evaluation for the physical AI stack and digital twin simulations for optimizing clinical treatment responses.

Tech buzz:

  • The industry is seeing a surge in open-source infrastructure and specialized hardware integration for spatial computing.
  • Open Source Ecosystems: Significant funding and new repositories are expanding the availability of agentic frameworks and AI engineering toolkits.
  • Hardware and XR: New initiatives are targeting the deployment of AI agents specifically for AR and XR devices using specialized hardware acceleration.
  • Corporate Financials: Leaked data highlights the massive compute expenditures and significant losses currently characterizing the high-growth AI sector.

Papers discovered from ArXiv subject categories

Agentic AI

5/5 Artificial Intelligence (cs.AI), Information Retrieval (cs.IR) 17 Jun 2026
Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong

Abstract

ArXiv ID: 2606.17209

Authors: Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong

Abstract:

Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redundancy at the first turn. When models issue similar first queries across rollouts, the threads retrieve overlapping evidence, and subsequent turns are conditioned on this shared retrieval. We address this limitation with DivInit, a training-free intervention at the first turn. Rather than sampling k independent first queries, DivInit draws n candidates from a single call, picks k < n diverse seeds, and runs them as parallel trajectories. Across five open-weight models and eight benchmarks, DivInit consistently improves over standard parallel sampling, with average gains of five to seven points on multi-hop QA at matched compute. Code available at https://github.com/cxcscmu/diverse-query-initialization

Insights

Contribution: The paper introduces DivInit, a training-free intervention that improves agentic search performance by addressing query redundancy in parallel sampling. It demonstrates that diversifying the initial queries significantly enhances breadth scaling for multi-hop QA tasks.

Core Idea: Standard parallel sampling suffers from diminishing returns because similar initial queries lead to overlapping evidence retrieval. By ensuring diverse starting points, the model can explore a wider range of information across parallel trajectories.

Technique: DivInit generates a larger pool of candidate queries from a single model call and selects a subset of diverse seeds to initialize parallel search threads.

Pipeline: Initial query prompt → Generate n candidate queries → Select k diverse seeds → Execute k parallel search trajectories → Aggregate results

Methodology: The authors evaluated DivInit across five open-weight models and eight benchmarks, comparing it against standard parallel sampling at matched compute levels.

Results: DivInit consistently outperformed standard parallel sampling, achieving average gains of five to seven points on multi-hop QA benchmarks.

Limitations: The paper focuses primarily on breadth scaling at the first turn and does not extensively explore the impact of diversity on deeper, subsequent turns of the search process.
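
The seed-selection step described above can be illustrated with a minimal sketch. The digest does not say how DivInit measures diversity, so this example assumes a word-level Jaccard distance and a greedy farthest-point rule; the function names (`jaccard_distance`, `select_diverse_seeds`) are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Set


def _tokens(query: str) -> Set[str]:
    """Lowercased word set used as a crude lexical fingerprint."""
    return set(query.lower().split())


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over word sets; 0.0 means identical word sets."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def select_diverse_seeds(candidates: List[str], k: int) -> List[str]:
    """Greedily pick k of the n candidates so that seeds are far apart.

    Assumption: the first candidate is kept as the starting seed, and each
    later seed maximises its distance to the nearest seed chosen so far.
    """
    if not 0 < k <= len(candidates):
        raise ValueError("k must satisfy 0 < k <= len(candidates)")
    selected = [candidates[0]]
    remaining = list(candidates[1:])
    while len(selected) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [
    "capital of france",
    "capital of france city",
    "eiffel tower height",
    "who built the eiffel tower",
]
seeds = select_diverse_seeds(candidates, k=2)
```

With these toy candidates the near-duplicate "capital of france city" is skipped in favour of a query that shares no words with the first seed. An embedding-based distance could be swapped in for `jaccard_distance` without changing the selection loop.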

5/5 Artificial Intelligence (cs.AI) 17 Jun 2026
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

Abstract

ArXiv ID: 2606.17328

Authors: Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

Abstract:

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.

Insights

Contribution: The paper introduces MemTrace, a new benchmark that shifts the evaluation of long-term memory from individual question accuracy to the persistence and evolution of specific knowledge points.

Core Idea: Current evaluation methods mask failures in tracking how facts change over time by aggregating independent question scores; MemTrace exposes these nuances by probing facts across age, question type, and evidence conditions.

Technique: The authors developed a structured benchmark that tracks 'knowledge points' across three dimensions: memory age (sessions), question type (state vs. trajectory), and evidence condition (present, missing, or contradicted).

Pipeline: User facts across multiple sessions → MemTrace benchmark (controlled dimensions) → Knowledge point-based accuracy analysis

Methodology: The researchers evaluated 13 memory-system configurations across four paradigms, measuring how systems handle facts under varying temporal and contextual conditions.

Results: Pooled accuracy hides specific failures; the study found that systems often fail to track trajectories even if they remember states, and the primary bottleneck is evidence usage rather than retrieval (evidence was retrievable 10x more often than it was missing).

Limitations: The study focuses on the use of reachable evidence, leaving open questions on how to specifically optimize the reasoning logic that processes retrieved information.
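
To make the contrast between pooled accuracy and knowledge-point analysis concrete, here is a small sketch. The `Probe` record and its field names are assumptions chosen to mirror the three dimensions named in the abstract; they are not MemTrace's actual data schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Probe:
    fact_id: str        # knowledge point being probed (hypothetical identifier)
    memory_age: int     # sessions since the fact appeared in the history
    question_type: str  # "current", "earlier", or "trajectory"
    evidence: str       # "present", "missing", or "contradicted"
    correct: bool


def pooled_accuracy(probes: List[Probe]) -> float:
    """Row-level accuracy that ignores which fact each question probes."""
    return sum(p.correct for p in probes) / len(probes)


def accuracy_by(probes: List[Probe], field: str) -> Dict[object, float]:
    """Accuracy broken down by one probe dimension, e.g. 'question_type'."""
    buckets: Dict[object, List[bool]] = defaultdict(list)
    for p in probes:
        buckets[getattr(p, field)].append(p.correct)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def fully_tracked_facts(probes: List[Probe]) -> List[str]:
    """Facts that were answered correctly on every probe that touched them."""
    ok: Dict[str, bool] = {}
    for p in probes:
        ok[p.fact_id] = ok.get(p.fact_id, True) and p.correct
    return sorted(f for f, good in ok.items() if good)


probes = [
    Probe("city", 1, "current", "present", True),
    Probe("city", 3, "earlier", "present", True),
    Probe("city", 3, "trajectory", "present", False),
    Probe("job", 2, "current", "present", True),
]
by_type = accuracy_by(probes, "question_type")
```

In this toy data the pooled score is 0.75, yet the per-type view shows that every trajectory question failed and only the "job" fact is fully tracked, which is the kind of gap the benchmark is designed to surface.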

5/5 Artificial Intelligence (cs.AI), cs.NI 17 Jun 2026
Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

Abstract

ArXiv ID: 2606.17368

Authors: Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

Abstract:

Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.

Insights

Contribution: The paper proposes a layered architecture and a system-level framework for distributed general-purpose agent networks, enabling heterogeneous agents to discover, trust, and cooperate on open-ended tasks.

Core Idea: Unlike traditional P2P networks, agent networks require a protocol adaptation layer to propagate semantic declarations (intentions, capabilities, and constraints) alongside standard network operations.

Technique: The authors introduce three core mechanisms: bodyless gossip with sequential logs for discovery, BAID-based identity binding with MG-EigenTrust for reputation, and a Stackelberg-style mechanism-generation loop for task execution.

Pipeline: Task semantics → Protocol adaptation layer → Semantic announcement propagation & reputation verification → Semantic-gradient mechanism design → Collaborative task execution

Methodology: The research combines architectural design with technical route development, followed by prototype overhead analysis and mechanism-level simulations against cross-topic disguise-collusion attacks.

Results: The study provides prototype overhead results for tiered verification and demonstrates the robustness of the MG-EigenTrust reputation system under specific collusion attack scenarios.

Limitations: The paper focuses on the foundational framework and prototype simulations, leaving the full-scale deployment and real-world scalability of the semantic-gradient mechanism as an area for further exploration.
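
For readers unfamiliar with the reputation component, the sketch below shows the classic single-topic EigenTrust iteration that the name MG-EigenTrust builds on. It is not the paper's multi-topic variant, whose details the digest does not give; the damping value `alpha` and the uniform pre-trusted distribution are illustrative assumptions.

```python
from typing import List


def eigentrust(
    local_trust: List[List[float]],
    pretrusted: List[float],
    alpha: float = 0.15,
    iters: int = 50,
) -> List[float]:
    """Classic EigenTrust: t <- (1 - alpha) * C^T t + alpha * p.

    local_trust[i][j] is how much peer i trusts peer j (non-negative).
    pretrusted is a probability distribution over peers.
    """
    n = len(local_trust)
    # Row-normalise so each peer distributes exactly one unit of trust.
    # A peer with no opinions falls back to the pre-trusted distribution.
    c = []
    for row in local_trust:
        total = sum(row)
        c.append([v / total for v in row] if total > 0 else list(pretrusted))
    t = list(pretrusted)
    for _ in range(iters):
        t = [
            (1 - alpha) * sum(c[i][j] * t[i] for i in range(n))
            + alpha * pretrusted[j]
            for j in range(n)
        ]
    return t


# Peers 0 and 1 vouch for each other; peer 2 vouches for 0 but nobody vouches for 2.
local = [[0, 1, 0], [1, 0, 0], [1, 0, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
scores = eigentrust(local, uniform)
```

Peer 2 ends up with only the baseline share contributed by the pre-trusted term, since no other peer assigns it any trust. A multi-topic scheme would presumably keep separate trust values per topic, but that extension is left out here.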

5/5 Artificial Intelligence (cs.AI) 17 Jun 2026
MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

Abstract

ArXiv ID: 2606.17453

Authors: Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

Abstract:

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

Insights

Contribution: The paper introduces MapSatisfyBench, a benchmark for evaluating satisfaction-aware map agents by identifying and quantifying implicit decision factors in underspecified user queries.

Core Idea: Map agents should proactively recover unspoken user needs from available information rather than relying solely on clarification, shifting the evaluation metric from simple task completion to satisfaction-aware spatial decision making.

Technique: The authors propose a 'restore-identify-filter' framework to reconstruct complete user needs from behavior-chain evidence and isolate evaluable implicit factors.

Pipeline: Underspecified user queries and behavior-chain evidence → restore-identify-filter framework → identified implicit decision factors → MapSatisfyBench evaluation

Methodology: The researchers constructed a large-scale dataset from real-world anonymized data, annotating ground truth across five dimensions to enable full-chain evaluation of agent performance.

Results: Experiments reveal that while current agents excel at explicit task completion, they struggle to satisfy implicit decision factors and proactively acquire the necessary evidence for satisfaction-aware decisions.

Limitations: The study focuses on factors recoverable from pre-query evidence and may not fully capture satisfaction factors that require real-time, dynamic interaction or subjective preferences beyond available data.

5/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 17 Jun 2026
Dissecting model behavior through agent trajectories

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

Abstract

ArXiv ID: 2606.17454

Authors: Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

Abstract:

AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

Insights

Contribution: The paper introduces the 'intent-execution' gap framework to analyze how agent harnesses affect model performance and provides a new benchmark analysis using the Simple Strands Agent (SSA) harness.

Core Idea: AI agent performance is a systems problem where a mismatch between model intent and harness execution (the intent-execution gap) can hinder a model's full capabilities.

Technique: The authors developed the Simple Strands Agent (SSA) harness and analyzed 138k trajectories by representing agent behaviors in code state-spaces.

Pipeline: Agentic benchmarks (SWE-Pro, SWE-Verified, Terminal-Bench-2) → Simple Strands Agent (SSA) harness execution → Trajectory analysis in code state-spaces → Finer-grained behavioral metrics

Methodology: The researchers reproduced/improved pass@1 scores across multiple model families and performed a large-scale trajectory analysis to identify model-specific behaviors like edit frequency and testing activity.

Results: The study revealed that while pass@1 scores are similar across frontier models, models differ significantly in effort allocation, phase-transitions, and testing activity during autonomous problem-solving.

Limitations: The study focuses on a specific set of popular agentic benchmarks and may not capture all nuances of every possible agentic use case.
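
The finer-grained metrics mentioned above can be pictured with a toy computation over a single trajectory. The action labels and the action-to-phase mapping below are hypothetical; the digest does not describe how the paper encodes trajectories or defines phases.

```python
from typing import Dict, List


def phase_of(action: str) -> str:
    """Map a raw action label to a coarse phase (labels are hypothetical)."""
    if action in {"read", "search"}:
        return "explore"
    if action == "edit":
        return "edit"
    if action == "test":
        return "test"
    return "other"


def trajectory_metrics(actions: List[str]) -> Dict[str, float]:
    """Share of edit and test steps, plus the number of phase changes."""
    phases = [phase_of(a) for a in actions]
    n = len(phases)
    transitions = sum(1 for prev, cur in zip(phases, phases[1:]) if prev != cur)
    return {
        "edit_frequency": phases.count("edit") / n if n else 0.0,
        "test_frequency": phases.count("test") / n if n else 0.0,
        "phase_transitions": float(transitions),
    }


actions = ["read", "search", "edit", "test", "edit", "test", "test", "submit"]
metrics = trajectory_metrics(actions)
```

Two models with the same pass@1 could still produce very different values here, for example one that edits once and tests repeatedly versus one that alternates between short edits and tests.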

5/5 Artificial Intelligence (cs.AI) 17 Jun 2026
Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

Abstract

ArXiv ID: 2606.17459

Authors: Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

Abstract:

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation -- the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration -- the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

Insights

Contribution: The paper introduces CEO-Bench, a multi-agent benchmark designed to evaluate LLMs on high-level strategic resource reallocation by simulating complex organizational decision-making.

Core Idea: The study shifts the focus from isolated cognitive tasks to the executive challenge of synthesizing conflicting stakeholder advice under information asymmetry and organizational constraints.

Technique: The authors utilize a multi-role agent simulation where an LLM 'CEO' must process private signals and distinct priorities from four specialized C-suite advisors (CFO, CTO, COO, CMO).

Pipeline: Conflicting advisor signals and organizational constraints → LLM synthesis and strategic reasoning → Resource allocation plan evaluated on integration, boldness, history, and validity.

Methodology: The researchers benchmarked five frontier models across 13 scenarios, measuring performance across four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity.

Results: Models achieved high structural validity but struggled with strategic calibration, exhibiting failure modes like single-advisor capture, conservative defaults, and a tradeoff where deeper integration led to less decisive action.

Limitations: The study highlights the current capability boundary of LLMs in complex organizational roles and leaves open the design of robust AI-assisted executive systems.

5/5 Artificial Intelligence (cs.AI) 17 Jun 2026
SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

Abstract

ArXiv ID: 2606.17546

Authors: Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

Abstract:

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

Insights

Contribution: The paper introduces SEAGym, a comprehensive evaluation environment designed to measure the evolution of LLM agent harnesses rather than just isolated task scores. It provides a structured framework to track reusable improvements, overfitting, cost, and regression across multiple evaluation views.

Core Idea: Self-evolving agents improve by updating their execution layers (prompts, memory, tools), but current metrics fail to capture the dynamics of these updates. SEAGym addresses this by treating agent evolution as a dynamic process requiring training, validation, and replay diagnostics.

Technique: SEAGym transforms standard benchmarks into dynamic task sources featuring train batches, frozen validation sets, held-out ID/OOD transfer views, and replay diagnostics to monitor harness reliability.

Pipeline: Harbor-compatible benchmarks → SEAGym dynamic task source conversion → Multi-view evaluation (Train, Val, Test, Replay, Cost) → Evolution analysis

Methodology: The authors instantiated SEAGym on Terminal-Bench 2.0 and HLE to compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. They analyzed the evolution process through complementary signals like held-out performance and snapshot stability.

Results: The evaluation views revealed that frequent updates can fail to improve held-out performance, useful intermediate snapshots can collapse over time, and both source diversity and model backends significantly impact harness reliability.

Limitations: The study highlights that harness reliability is sensitive to model backends and source diversity, suggesting a need for more robust methods to ensure stable evolution across different architectures.


LLM

5/5 Artificial Intelligence (cs.AI) 17 Jun 2026
Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

Abstract

ArXiv ID: 2606.17312

Authors: Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

Abstract:

Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.

Insights

Contribution: The paper introduces 'structural uncertainty,' a new framework that evaluates LLM reliability by measuring the stability of a model's self-preference rankings over multiple reasoning paths.

Core Idea: Beyond measuring output dispersion (how much answers differ), the authors argue that a model's ability to consistently rank its own reasoning candidates provides a critical signal for identifying logical instability.

Technique: The authors use Bradley-Terry modeling combined with PageRank to aggregate pairwise self-preferences into ranking distributions, which are then decomposed into entropy-based components.

Pipeline: Query → Generate multiple candidate solutions → Model performs pairwise preference judgments → Aggregate preferences via Bradley-Terry/PageRank → Decompose into ranking instability and candidate ambiguity → Evaluate reliability.

Methodology: The researchers tested five LLMs across eight benchmarks, decomposing the structural uncertainty signal into across-trial ranking instability and within-trial candidate ambiguity to correlate with accuracy.

Results: Structural uncertainty complements answer dispersion in identifying unreliable logical and mathematical reasoning; across-trial instability correlates negatively with accuracy, while within-trial ambiguity correlates positively.

Limitations: The structural signal collapses toward uniformity in factual retrieval tasks, indicating that this specific consistency evaluation is uninformative outside of logical reasoning regimes.
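
As a rough illustration of turning pairwise self-preferences into a ranking distribution and an entropy score, the sketch below fits a plain Bradley-Terry model with the standard minorisation-maximisation update and reports normalised Shannon entropy. It omits the PageRank step and the across-trial versus within-trial decomposition described in the abstract, and treating this entropy as a stand-in for within-trial ambiguity is an assumption rather than the paper's definition.

```python
import math
from typing import List


def bradley_terry(wins: List[List[float]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths, normalised to sum to 1.

    wins[i][j] is the number of times candidate i was preferred over j.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])
            # Only pairs that were actually compared contribute.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        total = sum(new_p)
        if total == 0:  # no preferences recorded at all
            return p
        p = [v / total for v in new_p]
    return p


def normalized_entropy(dist: List[float]) -> float:
    """Shannon entropy divided by log(n); 1.0 means a uniform distribution."""
    n = len(dist)
    if n < 2:
        return 0.0
    h = -sum(q * math.log(q) for q in dist if q > 0)
    return h / math.log(n)


tied = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]      # every pair split evenly
decisive = [[0, 5, 5], [1, 0, 5], [1, 1, 0]]  # candidate 0 usually preferred
tied_entropy = normalized_entropy(bradley_terry(tied))
decisive_strengths = bradley_terry(decisive)
```

The evenly split preferences yield a uniform distribution with normalised entropy of 1, which resembles the collapse toward uniformity the paper reports on factual retrieval, while the decisive preferences give a clear ordering and lower entropy.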

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 17 Jun 2026
Nothing from Something: Can a Language Model Discover 0?

Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

Abstract

ArXiv ID: 2606.17289

Authors: Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

Abstract:

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out of distribution generalization; the ability to hypothesize genuinely new - and potentially logically more powerful - mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

Insights

Contribution: The paper investigates whether language models can independently discover the mathematical concept of 'zero' from non-zero arithmetic and quantifies how language pretraining scaffolds this discovery.

Core Idea: Mathematical discovery requires out-of-distribution generalization, and the authors test if language abilities can help neural models hypothesize new structures like zero.

Technique: The study uses a controlled arithmetic environment to evaluate the transition from non-zero arithmetic to the inclusion of zero in model reasoning.

Pipeline: Arithmetic data without zero → Language model pretraining and few-shot fine-tuning → Evaluation of zero-concept discovery

Methodology: The researchers compared GPT-2 sized models across different training regimes, measuring the number of examples required to achieve zero-based generalization with and without language pretraining.

Results: GPT-2-sized models could not discover zero at test time, with or without language pretraining, but improved substantially after training on tens to hundreds of examples of zero; language pretraining reduced the number of required examples by approximately 50%.

Limitations: The study is limited to simple arithmetic as a case study and uses a relatively small model size (GPT-2).

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), cs.CY 17 Jun 2026
Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

Xi Chu, Yupeng Hou

ArXiv ID: 2606.17443

Abstract:

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

Insights

Contribution: The paper identifies and quantifies the 'Incumbent Advantage' in LLM recommendations, demonstrating how brand bias creates a conditional monopoly that can be disrupted by specific marketing tactics.

Core Idea: LLMs exhibit a strong preference for well-known brands when specifications are equal, but this bias is fragile and can be manipulated through authority-style marketing language and Generative Engine Optimization (GEO).

Technique: The study employs a multi-model comparative analysis across GPT-4o-mini, Claude Sonnet, and Gemini 3 Flash using controlled experiments on skincare and search goods.

Pipeline: Product specifications and brand data → LLM recommendation prompts with varying ratings and marketing language → Recommendation frequency and payoff analysis

Methodology: The authors conducted three experiments measuring Incumbent Advantage Index (IAI), Bias Surplus Value, and payoff proxies in a multi-brand GEO competition scenario.

Results: Well-known brands received 100% of recommendations (IAI = 10.0) under equal specs; authority-style language broke this monopoly at a +0.17 rating point surplus; and uniform GEO strategies led to a payoff collapse from +0.802 to +0.007.

Limitations: The study focuses on specific commercial LLMs and the skincare category, leaving open questions about how these dynamics scale across different industries or more complex consumer decision-making processes.

3/5 Artificial Intelligence (cs.AI), Software Engineering (cs.SE) 17 Jun 2026
LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu

ArXiv ID: 2606.17507

Abstract:

Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides.

Insights

Contribution: The paper introduces a curriculum-grounded, configurable LLM-as-Judge pipeline designed for high-stakes exam preparation that aligns automated marking with official educational standards.

Core Idea: To ensure reliability in automated assessment, LLM outputs must be systematically grounded in authorized curriculum artifacts and marking guidelines rather than relying solely on prompt engineering.

Technique: The system uses a staged LLM workflow to operationalize curriculum intent by extracting syllabus artifacts and generating question-specific rubrics before evaluating student responses.

Pipeline: Student response and question → Identification of curriculum topics/cognitive demands → Assembly of authorized context (syllabus, descriptors, glossaries) → Generation of question-specific rubrics → LLM-based marking and justification generation → Final marks and traceable feedback.

Methodology: The researchers co-developed a software pipeline with an industrial partner, integrated it into an online study platform, and evaluated its performance against human tutor marking.

Results: The pipeline delivered marking outcomes comparable to human tutors while providing justifications that were more traceable to official curriculum artifacts and marking standards.

Limitations: The study relies on preliminary evaluation and early deployment data, leaving the long-term scalability and impact of manual overrides on system refinement as open areas for study.

NLP

4/5 Artificial Intelligence (cs.AI) 17 Jun 2026
When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng

ArXiv ID: 2606.17220

Abstract:

Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.

Insights

Contribution: The paper introduces a self-evolving framework that automatically generates and refines query rewriting rules for legal case retrieval without any parameter training. It demonstrates that an LLM-based agent can outperform human-designed rules by iteratively learning from an automated evaluation environment.

Core Idea: Enhance the performance of the BM25 retrieval baseline by using an LLM agent to autonomously evolve a set of query rewriting rules through a feedback loop of generation, validation, and elimination.

Technique: A self-evolving agentic framework that utilizes an LLM to create rewriting rules, plan validation experiments on rule combinations, and prune ineffective rules based on historical feedback.

Pipeline: Legal Query → LLM-based Agent (Rule Generation & Validation) → Iterative Rule Evolution → Optimized Rewriting Rules → Enhanced BM25 Retrieval

Methodology: The authors developed an automatic evaluation environment where an LLM agent iteratively proposes rules, tests them against the LeCaRD-v2 benchmark, and uses historical results to refine the rule set.

Results: The framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, especially when using high-capacity LLMs.

Limitations: The effectiveness of the framework is highly dependent on the reasoning capabilities of high-capacity core LLMs and the quality of the automatic evaluation environment.
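The generate, validate, and eliminate loop summarized above can be sketched on toy data. This is a minimal sketch under loud assumptions: the LLM rule proposer is replaced by a fixed, hand-written candidate list, the corpus and queries are invented, and the keep-if-it-helps selection is closer to the greedy baseline the paper compares against than to its agent. Only the BM25 scoring formula is standard; every rule name and data item is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for t in query:
            if tf[t]:
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

def mean_reciprocal_rank(rules, queries, docs):
    """Apply rules in order, then score; ties count against the relevant doc."""
    total = 0.0
    for query, relevant in queries:
        for rule in rules:
            query = rule(query)
        scores = bm25_scores(query, docs)
        rank = sum(s >= scores[relevant] for s in scores)  # pessimistic rank
        total += 1.0 / rank
    return total / len(queries)

def evolve(candidates, queries, docs):
    """Keep rules that strictly help, then prune rules that no longer matter."""
    kept, best = [], mean_reciprocal_rank([], queries, docs)
    history = []  # feedback log: (rule name, metric with the rule added)
    for rule in candidates:
        trial = mean_reciprocal_rank(kept + [rule], queries, docs)
        history.append((rule.__name__, trial))
        if trial > best:
            kept, best = kept + [rule], trial
    for rule in list(kept):  # elimination pass
        without = [r for r in kept if r is not rule]
        score = mean_reciprocal_rank(without, queries, docs)
        if score >= best:
            kept, best = without, score
    return kept, best, history

# Hypothetical rewriting rules (stand-ins for LLM-proposed rules).
def drop_stopwords(q):
    return [t for t in q if t not in {"the", "a", "of"}]

def expand_stole(q):
    return q + ["theft"] if "stole" in q else q

def add_noise(q):
    return q + ["compensation"]

DOCS = [
    ["contract", "breach", "damages", "compensation"],
    ["theft", "property", "sentence"],
    ["traffic", "accident", "injury", "compensation"],
]
QUERIES = [
    (["the", "defendant", "stole", "goods"], 1),  # (query tokens, relevant doc index)
    (["breach", "of", "contract"], 0),
]

kept, best, history = evolve([drop_stopwords, expand_stole, add_noise], QUERIES, DOCS)
print([r.__name__ for r in kept], best)
```

In this toy run only the synonym-expansion rule survives, because it is the only candidate that moves the relevant case up the ranking; the `history` list plays the role of the historical feedback that the paper's agent consults when planning further experiments.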

RL

5/5 Artificial Intelligence (cs.AI) 17 Jun 2026
Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

ArXiv ID: 2606.17591

Abstract:

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

Insights

Contribution: The paper identifies the 'retention-forgetting dilemma' in verbal reinforcement learning and proposes a three-layer architecture to govern the lifecycle of extracted insights.

Core Idea: To prevent negative transfer from stale data and catastrophic forgetting of recurring patterns, agents need a systematic governance mechanism to curate, track, and apply experience-based rules.

Technique: A three-layer architecture consisting of Rules (distilled experience), Evidence (reliability logs), and Skills (governance logic) connected by a feedback-driven curation loop.

Pipeline: World feedback (outcomes) → Experience extraction (Rules) → Evidence logging → Skill-based curation → Contextual injection → Updated agent behavior

Methodology: The authors define four requirements for insight governance and evaluate their proposed architecture using a financial forecasting case study involving noisy, non-stationary data.

Results: The curation loop prevents performance degradation below zero-shot baselines and significantly improves accuracy and risk-adjusted returns compared to uncurated experience accumulation.

Limitations: The study focuses primarily on financial forecasting; the scalability of the curation loop to highly complex, multi-modal environments remains an open question.
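The three layers can be pictured with a small data-structure sketch. Everything here is an illustrative assumption rather than the paper's design: the `Rule` class, the sliding-window reliability score, the 0.6 threshold, and the idea that dormant rules keep being scored in the background are all hypothetical. The point is only to show a non-monotonic lifecycle, where a rule goes dormant when its recent evidence turns bad and returns when conditions recur, instead of being deleted.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    """Rules layer: a verbal rule distilled from world outcomes."""
    text: str
    outcomes: list = field(default_factory=list)  # evidence layer: 1 helped, 0 hurt

    def record(self, helped):
        self.outcomes.append(1 if helped else 0)

    def reliability(self, window=5):
        recent = self.outcomes[-window:]
        return sum(recent) / len(recent) if recent else None

def curate(rules, threshold=0.6, window=5):
    """Skills layer: pick rules to inject as context; an empty list means abstain."""
    active = []
    for rule in rules:
        score = rule.reliability(window)
        if score is not None and score >= threshold:
            active.append(rule)
    return active

momentum = Rule("Follow the prior week's trend.")

for helped in [True, True, True, True]:      # rule works in the current regime
    momentum.record(helped)
active_before = [r.text for r in curate([momentum])]

for helped in [False] * 5:                   # regime shift: rule becomes stale
    momentum.record(helped)
active_during = [r.text for r in curate([momentum])]

for helped in [True] * 4:                    # old conditions recur
    momentum.record(helped)
active_after = [r.text for r in curate([momentum])]

print(active_before, active_during, active_after)
```

Because the evidence log is kept even while the rule is dormant, the rule is neither applied when stale (avoiding negative transfer) nor lost when the regime returns (avoiding forgetting).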

4/5 Artificial Intelligence (cs.AI) 17 Jun 2026
Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

ArXiv ID: 2606.17405

Abstract:

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

Insights

Contribution: The paper introduces an online adaptive clinical decision support AI system that combines Treatment Effect (TE) estimation, patient Digital Twins (DT), and Reinforcement Learning (RL) to provide safe, real-time personalized treatment recommendations.

Core Idea: The system simulates potential treatment trajectories using a patient-specific Digital Twin to optimize sequential decision-making while maintaining safety through rule-based monitoring and clinician-in-the-loop flags.

Technique: The framework utilizes Reinforcement Learning for sequential decision-making, integrated with a Digital Twin for simulation and a rule-based safety module for real-time constraint enforcement.

Pipeline: Historical medical records and real-time patient data → Treatment Effect estimation, Digital Twin simulation, and RL-based decision-making with rule-based safety checks → Optimized treatment recommendations and clinician alerts.

Methodology: The authors developed an online learning framework validated on both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA).

Results: The method demonstrated superior effectiveness and stability compared to standard baselines, maintained low latency, and required expert consultation for only a minority of cases.

Limitations: The study relies on a pre-trained outcome model to simulate internal model disagreements and the real-world applicability depends on the quality of available historical data for Digital Twin initialization.

Robotics

4/5 Artificial Intelligence (cs.AI) 17 Jun 2026
DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen

ArXiv ID: 2606.17574

Abstract:

Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.

Insights

Contribution: The paper introduces DeepInsight, a unified evaluation infrastructure capable of assessing the entire Physical AI stack, from foundation model decoding to whole-body control, on a single runtime.

Core Idea: Instead of stitching together disparate evaluation harnesses, DeepInsight preserves the heterogeneity of different AI layers while unifying them through three invariant abstractions: task, resource, and result.

Technique: The framework implements a single episode driver, a universal resource-handle protocol for diverse backends (LLMs and sandboxed runtimes), and a shared trace identity scheme for all events.

Pipeline: Heterogeneous AI layers (LLM inference, physics ticks, control) → Unified DeepInsight runtime (shared drivers, resource protocols, and trace identity) → Localizable cross-layer diagnostic traces.

Methodology: The authors deployed the infrastructure across a three-layer embodied humanoid stack, configuring it to onboard new benchmarks while reproducing existing peer-framework results.

Results: DeepInsight enables cross-layer regression localization on a single shared trace, reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes.

Limitations: The abstract does not explicitly state limitations, but the scope is currently focused on the three layers of an embodied humanoid stack.

4/5 Artificial Intelligence (cs.AI) 17 Jun 2026
Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

ArXiv ID: 2606.17577

Abstract:

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation model--orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average $R^2=0.87$ and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision--language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.

Insights

Contribution: The paper presents the first foundation model-orchestrated workflow for crash safety design, enabling surrogate-assisted exploration for pedestrian protection. It demonstrates a significant reduction in evaluation time from hours per CAE simulation to seconds.

Core Idea: Foundation models can serve as an integration layer between machine learning surrogates and physics-based simulations to automate complex engineering workflows.

Technique: The workflow integrates a CAE-trained surrogate model, a multiobjective evolutionary algorithm (NSGA-II), a morphing-based geometry generator, and a natural-language/vision-language model interface.

Pipeline: User natural language constraints → LLM orchestration → NSGA-II optimization via surrogate model → Morphing-based 3D geometry generation → VLM-supported semantic comparison of designs

Methodology: The authors developed a surrogate model achieving an average $R^2$ of 0.87 with conformal prediction intervals, coupled with an automated pipeline that maps design parameters to topology-preserving 3D shapes.

Results: The workflow produced 35 distinct safety-compliant bumper alternatives from a single exploration, a task that would typically require weeks of conventional CAE iterations.

Limitations: The study focuses on a specific automotive front-bumper case and the generalizability of the foundation model orchestration to other highly nonlinear crash dynamics remains to be fully explored.
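For readers unfamiliar with distribution-free conformal intervals, the sketch below shows split conformal prediction, one common variant. The abstract does not say which variant the authors use, so treating it as split conformal is an assumption, and the calibration numbers and function names are made up. The idea: take absolute residuals on a held-out calibration set, pick the rank given by the finite-sample correction, and use that residual as a symmetric half-width around any new surrogate prediction.

```python
import math

def conformal_halfwidth(cal_preds, cal_targets, alpha):
    """Half-width q so that [pred - q, pred + q] targets 1 - alpha coverage."""
    residuals = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    # With too few calibration points the interval is unbounded.
    return residuals[k - 1] if k <= n else math.inf

def predict_interval(pred, halfwidth):
    return (pred - halfwidth, pred + halfwidth)

# Hypothetical calibration data: surrogate predictions vs. simulated targets.
cal_preds = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
offsets = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0, 9.0, -10.0]
cal_targets = [p + o for p, o in zip(cal_preds, offsets)]

q = conformal_halfwidth(cal_preds, cal_targets, alpha=0.2)
interval = predict_interval(50.0, q)
print(q, interval)
```

The coverage guarantee holds under exchangeability of calibration and test points and makes no assumption about the residual distribution, which is why such intervals pair well with a surrogate whose errors are hard to model.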

Speech

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), cs.SD 17 Jun 2026
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

ArXiv ID: 2606.17339

Abstract:

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

Insights

Contribution: The authors introduce SpeechDx, a large-scale multi-task benchmark comprising 12 datasets and 27 tasks designed to evaluate the generalization of clinical speech AI.

Core Idea: By structuring tasks around the stages of speech production (conceptualization, formulation, and articulation), the benchmark allows for the evaluation of shared clinical mechanisms across diverse health conditions.

Technique: The benchmark utilizes a multi-task evaluation framework that tests audio encoders on both high-resource and low-resource tasks, including zero-shot cross-condition transfer.

Pipeline: Clinical speech audio → Multi-task evaluation across 12 datasets and 27 tasks → Performance metrics for clinical representation generalization

Methodology: The researchers systematically evaluated 12 state-of-the-art audio encoders across the SpeechDx benchmark to distinguish between dataset-specific artifacts and clinically meaningful patterns.

Results: Large-scale speech models provided the strongest overall baselines, while domain-specific models only improved performance on closely matched tasks; no current representation generalizes reliably across the entire clinical landscape.

Limitations: The study highlights that current models lack reliable cross-condition generalization, leaving an open need for general-purpose clinical speech representations.

Personal Interests

Papers discovered through your interest topics.

Multi-Agent Systems

3/5 Multiagent Systems (cs.MA) 16 Jun 2026
Intelligence Entropy Principle and the ADE Stability Engineering Framework

Dexing Liu

ArXiv ID: 2606.18065

Abstract:

As LLM-driven multi-agent systems (MAS) transition from lab to production, system behavior exhibits nonlinear degradation. We introduce the Intelligence Entropy Principle: probability-driven systems spontaneously drift toward disorder, formalized as S(t) = S0 * exp(alpha*t/Cm), where Cm is a model capability coefficient we propose. Lyapunov analysis yields the stabilization condition lambda > alpha/Cm. We construct the ADE (Agent Delivery Engineering) four-layer framework (L1 Physical Laws through L4 User Adaptation) with 23 core components. Validation spans 100K-scale experiments and 33.6 days of production monitoring. We propose a Five-Layer Disorder Taxonomy unifying failures under structural collapse, and present Elastic Organization as an original MAS morphology. Results: channel fracture reduced from 69-98% to near 0%; system death probability below 0.02%.

Insights

Contribution: The paper introduces the Intelligence Entropy Principle to model nonlinear degradation in multi-agent systems and proposes the ADE framework to engineer system stability.

Core Idea: Multi-agent systems spontaneously drift toward disorder over time, a phenomenon that can be mathematically modeled and mitigated through specific stability engineering.

Technique: The authors use Lyapunov analysis to derive stabilization conditions and develop a four-layer ADE framework with 23 core components and an Elastic Organization morphology.

Pipeline: Multi-agent system behavior → Intelligence Entropy modeling and Lyapunov analysis → ADE framework application → Stabilized production system

Methodology: The research combines theoretical formalization of entropy drift with empirical validation across 100K-scale experiments and 33.6 days of production monitoring.

Results: Channel fracture was reduced from 69-98% to near 0%, and the system death probability was maintained below 0.02%.

Limitations: The paper does not explicitly detail the specific constraints of the model capability coefficient (Cm) across different types of non-LLM agents.
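One way to see where a condition of the form lambda > alpha/Cm could come from is the short derivation below. It is a reading under stated assumptions, not the paper's proof: it assumes the entropy law is the solution of a linear growth equation, that stabilization enters as a linear damping term with rate lambda, and that a quadratic Lyapunov function is used. The abstract specifies none of these.

```latex
\begin{align}
S(t) = S_0 \exp\!\left(\frac{\alpha t}{C_m}\right)
  &\quad\Longleftrightarrow\quad
  \dot S = \frac{\alpha}{C_m}\, S, \qquad S(0) = S_0 \\
\text{with assumed damping } \lambda:\qquad
  \dot S &= \left(\frac{\alpha}{C_m} - \lambda\right) S \\
V(S) = \tfrac{1}{2} S^2
  \quad\Longrightarrow\quad
  \dot V &= S\,\dot S = 2\left(\frac{\alpha}{C_m} - \lambda\right) V
\end{align}
```

Under these assumptions, $\dot V < 0$ for every $S \neq 0$ exactly when $\lambda > \alpha / C_m$, so disorder decays rather than grows once the damping rate exceeds the drift rate. A larger capability coefficient $C_m$ lowers the drift rate and therefore the damping needed.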

3/5 Multiagent Systems (cs.MA), Artificial Intelligence (cs.AI) 16 Jun 2026
A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

Marco Aruta, Vadim Malvone, Aniello Murano, Domenico Parente, Luca Rizzuti

ArXiv ID: 2606.17962

Abstract:

Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92% accuracy on strategy-synthesis outcomes.

Insights

Contribution: The paper introduces a neuro-symbolic framework for strategy synthesis in Multi-Agent Systems (MAS) and provides the first NatATL strategy-synthesis dataset.

Core Idea: The authors propose a 'generate-and-certify' architecture that combines the creative reasoning of Large Language Models (LLMs) with the formal rigor of model checkers.

Technique: The framework uses an LLM as a strategy-generation oracle to propose candidates, which are then formally validated by a standard MAS model checker to ensure soundness.

Pipeline: MAS problem instance → LLM strategy generation → Formal model checker verification → Certified strategy output

Methodology: The researchers instantiated the framework for bounded strategic reasoning in NatATL and evaluated it using an open-weight Qwen3-32B model on a new dataset of 4,211 instances.

Results: The certified pipeline achieved 92% accuracy on strategy-synthesis outcomes using the Qwen3-32B model.

Limitations: The paper focuses on bounded strategic reasoning and the scalability of the LLM-guided search in even larger combinatorial spaces remains an area for further exploration.
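The generate-and-certify pattern can be shown on a toy problem. The sketch makes several simplifications that are assumptions, not the paper's setup: a single agent in a deterministic transition system stands in for a multi-agent model, a bounded reachability check stands in for a NatATL model checker, and a fixed list of candidate strategies stands in for the LLM oracle. What it preserves is the soundness property: a strategy is returned only if the checker accepts it.

```python
def certify(strategy, transitions, start, goal, bound):
    """Check that following the memoryless strategy reaches goal within bound steps."""
    state = start
    for _ in range(bound):
        if state == goal:
            return True
        step = (state, strategy.get(state))
        if step not in transitions:  # undefined or illegal move: reject
            return False
        state = transitions[step]
    return state == goal

def generate_and_certify(candidates, transitions, start, goal, bound):
    """Return the first certified strategy and the number of candidates tried."""
    attempts = 0
    for strategy in candidates:
        attempts += 1
        if certify(strategy, transitions, start, goal, bound):
            return strategy, attempts
    return None, attempts

# Hypothetical model: (state, action) -> next state.
TRANSITIONS = {
    ("s0", "a"): "s1",
    ("s0", "b"): "s0",
    ("s1", "a"): "s0",
    ("s1", "b"): "goal",
}

# Stand-in for LLM proposals, in the order the oracle would emit them.
CANDIDATES = [
    {"s0": "b", "s1": "b"},  # loops in s0 forever
    {"s0": "a", "s1": "a"},  # oscillates between s0 and s1
    {"s0": "a", "s1": "b"},  # reaches goal in two steps
]

strategy, attempts = generate_and_certify(CANDIDATES, TRANSITIONS, "s0", "goal", bound=3)
print(strategy, attempts)
```

Wrong proposals cost only extra verifier calls, never correctness, which is why the quality of the oracle affects efficiency and success rate but not soundness.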

3/5 Multiagent Systems (cs.MA), Artificial Intelligence (cs.AI), Databases (cs.DB), Software Engineering (cs.SE) 16 Jun 2026
Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

Aueaphum Aueawatthanaphisut, Badri Raj Lamichhane

ArXiv ID: 2606.17915

Abstract:

Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.

Insights

Contribution: The paper introduces a trustworthy, self-composable Big-Data-as-a-Service (BDaaS) framework that automates the entire machine learning lifecycle through LLM-orchestrated multi-agent collaboration. It addresses the limitations of isolated AutoML systems by integrating artifact governance, human-in-the-loop checkpoints, and drift-aware lifecycle optimization.

Core Idea: Decompose the complex BDaaS lifecycle into specialized autonomous agents coordinated by a central LLM orchestration layer to ensure end-to-end reliability and traceability.

Technique: An LLM-orchestrated multi-agent architecture where a central controller manages specialized agents for ingestion, cleaning, feature engineering, AutoML, deployment, and monitoring.

Pipeline: Raw Big Data → LLM-Orchestrated Multi-Agent Pipeline (Ingestion, Cleaning, Feature Engineering, AutoML, Deployment, Monitoring) → Production-Ready Models with Drift Recovery
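
To make the orchestration pattern concrete, here is a minimal sketch, not the authors' implementation. It assumes each agent can be modeled as a plain function over a shared context dictionary, and it replaces the LLM controller with a simple loop. The names `Orchestrator`, `ingest`, `clean`, and `featurize` are hypothetical, and the three stages shown are only a subset of the pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Agent = Callable[[Context], Context]


@dataclass
class Orchestrator:
    """Hypothetical stand-in for the central orchestration layer (no LLM involved)."""

    # Ordered (stage name, agent) pairs; each agent maps a context to a new context.
    agents: List[Tuple[str, Agent]]
    # Names of stages that completed, kept as a crude artifact trace.
    artifact_log: List[str] = field(default_factory=list)

    def run(self, ctx: Context) -> Context:
        for name, agent in self.agents:
            out = agent(dict(ctx))  # pass a copy so a failed stage cannot corrupt ctx
            # Validation checkpoint: an assumed placeholder for output validation.
            if not isinstance(out, dict) or out.get("error"):
                raise RuntimeError(f"stage {name!r} failed validation")
            self.artifact_log.append(name)
            ctx = out
        return ctx


def ingest(ctx: Context) -> Context:
    # Toy data with one missing value.
    ctx["rows"] = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
    return ctx


def clean(ctx: Context) -> Context:
    # Drop rows with missing values.
    ctx["rows"] = [r for r in ctx["rows"] if r["x"] is not None]
    return ctx


def featurize(ctx: Context) -> Context:
    # Add a squared term as a second feature.
    ctx["features"] = [[r["x"], r["x"] ** 2] for r in ctx["rows"]]
    return ctx


pipeline = Orchestrator(
    [("ingestion", ingest), ("cleaning", clean), ("feature_engineering", featurize)]
)
result = pipeline.run({})
```

Keeping the stage log separate from the context is one simple way to get traceability: after a run, `pipeline.artifact_log` lists exactly which stages passed validation, in order.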

Methodology: The authors developed a prototype framework and evaluated it against manual ML, AutoML-only, and single-agent LLM baselines using tabular benchmark datasets with simulated covariate drift.
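
The digest does not say how the paper detects covariate drift. As one illustration of what a drift check on a single tabular feature can look like, the sketch below computes the Population Stability Index (PSI), a technique chosen here for illustration and not confirmed by the source. The function name `psi`, the bin count, and the smoothing constant are assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 5) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if it is degenerate.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # assumed smoothing constant to avoid log(0)
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]  # simulated covariate shift

no_drift_score = psi(reference, reference)
drift_score = psi(reference, shifted)
```

A PSI above roughly 0.2 is often treated as a rule-of-thumb signal of meaningful shift. In a drift-aware loop, a score like `drift_score` crossing such a threshold could trigger retraining or a human checkpoint.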

Results: The framework achieved competitive predictive performance while improving lifecycle-level reliability relative to the manual ML, AutoML-only, and single-agent LLM baselines, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery.

Limitations: The study is based on a prototype evaluation using controlled tabular datasets, leaving questions regarding scalability to massive unstructured data and the computational costs of continuous LLM orchestration.

Tech News

Agentic AI

NVIDIA Technical Blog 2026-06-16
Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI

NVIDIA is addressing the infrastructure gap for AR and XR wearable devices by introducing tools to build integrated AI agents. The initiative focuses on enabling real-time AI experiences by combining live sensor data with large language models. It provides developers with the necessary framework to deploy sophisticated, context-aware agents on wearable hardware.

Computing Systems

Hacker News Wed, 17 Ju
Leaked OpenAI financials show $38.5B loss and compute burn

Leaked financial documents from OpenAI show a $38.5 billion loss, highlighting the capital expenditure that infrastructure requires. The figures underscore the 'compute burn' associated with training and maintaining large-scale frontier models, and they raise questions about the economic sustainability and high-cost barriers of the current AI arms race.

Hacker News Tue, 16 Ju
NLnet announces funding for 67 more open-source projects

NLnet has announced funding for 67 new open-source projects aimed at strengthening the digital infrastructure of the internet. This initiative focuses on decentralized technologies, privacy, and open-source software development. The funding supports a diverse range of projects to ensure a more resilient and open digital ecosystem.

LLM

Hacker News Tue, 16 Ju
Wolfram Language and Mathematica Version 15, AI Assistant, Symbolic Music, More

Wolfram has launched Version 15 of the Wolfram Language and Mathematica, featuring a built-in AI assistant and significant core functionality updates. The release emphasizes the integration of AI with symbolic computation, including new capabilities for symbolic music and advanced mathematical modeling. It represents a major step in merging traditional computational intelligence with modern LLM capabilities.

Trending repositories on GitHub filtered and scored for relevance to your interests.

Agentic AI

Computing Systems

General

LLM

MLOps

Speech