Tuesday, June 16, 2026

Daily Digest 2026-06-16

Today’s digest highlights a shift toward the operational reliability of autonomous agents, focusing on safety benchmarks, multi-agent trust dynamics, and the structural integrity of agentic workflows.

Research highlights:

  • Agentic Systems and Safety: Research explores the governance of multi-agent trust, safety benchmarks for computer-use agents, and risk-aware retrieval for sensitive data.
  • Tool Use and Reasoning: New frameworks propose standardized interfaces for logic programming, verifiable data science via tool-grounded reasoning, and active visual reasoning for multimodal search.
  • Model Interpretability and Theory: Studies investigate the definition of “good” explanations, the search for memory traces (engrams), and the development of relational structural causal models.
  • Optimization and Infrastructure: Papers introduce efficient sparse aggregation for device-cloud RAG and cross-modal representation alignment for time-to-event modeling.
  • Cognitive and Systemic Analysis: Research addresses “cognitive debt” as a systemic fragility and develops profiling methods for proactive warnings in edge services.

Tech buzz:

  • The industry is seeing a push toward infrastructure scaling and the democratization of model access.
  • Infrastructure: Amazon announced a multibillion-dollar data center expansion in Missouri.
  • Efficiency: New developments include a faster, byte-identical tokenizer and a collection of free LLM API resources.
  • Societal Impact: Discussions continue regarding the concentration of power among AI billionaires and the readiness of humanity for rapid intelligence expansion.

Papers discovered from ArXiv subject categories

AI Safety

5/5 Artificial Intelligence (cs.AI) 16 Jun 2026
OSGuard: A Benchmark for Safety in Computer-Use Agents

Mina Mohammadmirzaei, Jeffrey Flanigan

ArXiv ID: 2606.15034

Abstract:

Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites, etc. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.

Insights

Contribution: The paper introduces OSGuard, a dual-granularity benchmark designed to evaluate the safety of computer-use agents by distinguishing between successful task completion and unsafe shortcuts.

Core Idea: Safety evaluation must go beyond task success to identify instances where agents achieve goals through hazardous actions, requiring both action-level oversight and end-to-end risk-augmented testing.

Technique: The authors developed a dual-granularity framework consisting of an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end safety evaluation.

Pipeline: User instruction and interface state → Proposed agent action → Action-level safety judgment (allowed/unrelated/unsafe) OR Risk-augmented task execution → Task success + safety invariant verification.

Methodology: The authors manually constructed OSWorld-derived task variants with latent hazards and paired them with augmented evaluators that check for state-based safety invariants alongside nominal task completion.

Results: Current multimodal guardrails perform well on isolated action judgments, but risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety.

Limitations: The study highlights the remaining gap between local oversight and full-task safety, suggesting that recognizing an unsafe action does not always translate to safe end-to-end behavior.
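
The augmented-evaluator idea (keep the original task-success check, then add state-based safety invariants) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the state dictionary, the `report_written` check, and the `backup_untouched` invariant are illustrative stand-ins, not OSGuard's actual evaluators.

```python
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical final state: file path -> file contents


def evaluate(final_state: State,
             task_success: Callable[[State], bool],
             invariants: List[Callable[[State], bool]]) -> str:
    """Classify a run as 'failed', 'safe_completion', or 'unsafe_completion'."""
    if not task_success(final_state):
        return "failed"
    # The nominal task succeeded; now check every safety invariant.
    if all(inv(final_state) for inv in invariants):
        return "safe_completion"
    return "unsafe_completion"


# Hypothetical task: write a report without overwriting an existing backup.
def report_written(state: State) -> bool:
    return state.get("report.txt") == "Q2 summary"


def backup_untouched(state: State) -> bool:
    return state.get("backup.txt") == "original data"


safe_run = {"report.txt": "Q2 summary", "backup.txt": "original data"}
unsafe_run = {"report.txt": "Q2 summary", "backup.txt": "Q2 summary"}

print(evaluate(safe_run, report_written, [backup_untouched]))
print(evaluate(unsafe_run, report_written, [backup_untouched]))
```

Both runs satisfy the nominal task, but only the first preserves the invariant, which is the distinction a success-only evaluator would miss.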

4/5 Artificial Intelligence (cs.AI) 16 Jun 2026
A Definition of Good Explanations and the Challenges Explaining LLM Outputs

Louis Mahon, Elliot Ford, Callum Hackett

ArXiv ID: 2606.14838

Abstract:

How to define a good explanation is a long-standing philosophical debate which has found recent renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but in order to produce good explanations of AI systems, we must first have an understanding of what good explanations are. In this paper we propose a definition inspired by the notion of counterfactual explanations, however we argue that one must also take into account the interlocutor's prior beliefs in each fact that could be offered in an explanation. We explore the ramifications of this definition for AI explainability and, in particular, why LLM outputs are difficult to produce good explanations for.

Insights

Contribution: The paper proposes a formal definition of a 'good explanation' that integrates counterfactual reasoning with the interlocutor's prior beliefs. It further identifies specific structural challenges in providing such explanations for Large Language Model (LLM) outputs.

Core Idea: A good explanation is not just a causal link but a piece of information that changes an interlocutor's belief state by addressing the specific facts they already hold as true.

Technique: The authors utilize a philosophical framework combining counterfactual logic with epistemic modeling of the user's prior knowledge.

Pipeline: User's prior beliefs + AI output → Counterfactual analysis of facts → Explanation that addresses belief gaps → Informed user understanding

Methodology: The authors conduct a theoretical and philosophical analysis to derive a definition of explainability and apply this definition to evaluate the current state of LLM interpretability.

Results: The research highlights that LLMs are difficult to explain because their internal processes often lack the clear, discrete causal chains required to satisfy the proposed definition of a 'good' explanation.

Limitations: The paper focuses on the theoretical definition and conceptual challenges, leaving the practical implementation of these explanations in real-time AI systems as an open question.

4/5 Artificial Intelligence (cs.AI), Computer Science and Game Theory (cs.GT), physics.soc-ph 16 Jun 2026
Cognitive Debt: AI as Intellectual Leverage and the Dynamics of Systemic Fragility

Shuchen Meng

ArXiv ID: 2606.15078

Abstract:

We develop a formal theory of cognitive debt: the stock of unverified reasoning obligations that accumulates when individuals use AI as a substitute rather than a complement for first-principles cognition. The model features two state variables per agent, cognitive capital and cognitive debt, and a multiplicative production technology in which cognitive capital functions as collateral that determines the return to AI adoption. We establish six propositions. Rational agents incur positive cognitive debt because the costs are deferred, partially external, and masked by short-run productivity gains. Tranquil periods lower subjective risk assessments, raise AI substitution intensity, and compound leverage, generating a cognitive Minsky moment in which subjective risk falls while true systemic fragility rises. Expected crisis losses are convex in aggregate leverage. Post-crisis, output-target pressure can produce a false-correction loop in which agents patch AI failures with more AI. The decentralised equilibrium over-adopts substitutive AI relative to the social optimum because of systemic risk, cognitive public goods, and arms-race externalities. In a two-type heterogeneous-agent economy, high-cognitive-capital agents adopt AI more intensively and may eventually erode their unaided cognitive capital below that of initially lower-skilled agents.

Insights

Contribution: The paper introduces a formal theory of 'cognitive debt' to model how using AI as a substitute for first-principles reasoning creates systemic fragility. It identifies a 'cognitive Minsky moment' where deferred costs and short-run productivity gains mask rising systemic risk.

Core Idea: AI adoption functions as intellectual leverage where cognitive capital acts as collateral; over-reliance on AI creates a stock of unverified reasoning obligations that can lead to non-linear systemic collapses.

Technique: The authors develop a formal economic model featuring two state variables (cognitive capital and cognitive debt) and a multiplicative production technology.

Pipeline: AI adoption as substitution → accumulation of unverified reasoning obligations (cognitive debt) → masked short-run productivity gains → systemic fragility → cognitive Minsky moment/crisis.

Methodology: The research employs a formal economic modeling approach, establishing six propositions and analyzing a two-type heterogeneous-agent economy to determine equilibrium outcomes.

Results: Key findings include: rational agents incur debt due to deferred costs; tranquil periods compound leverage; crisis losses are convex in aggregate leverage; and high-capital agents may eventually erode their unaided skills below those of lower-skilled agents.

Limitations: The model focuses on theoretical equilibrium and systemic dynamics, leaving open questions regarding specific empirical measurements of 'cognitive capital' and the exact threshold of the Minsky moment in real-world systems.

Agentic AI

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 16 Jun 2026
Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Yi Lu, Zhuofeng Li, Ping Nie, Haoxiang Zhang, Yuyu Zhang, Kai Zou, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang

ArXiv ID: 2606.14885

Abstract:

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On Browsecomp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.

Insights

Contribution: The paper introduces DR-DCI, a framework that enables agents to perform complex, multi-document operations over large corpora by combining retriever-based scalability with direct corpus interaction (DCI) precision.

Core Idea: Instead of operating on a full corpus (too slow) or just ranked results (too limited), the agent dynamically expands a local workspace by pulling relevant documents into a shell-executable environment.

Technique: The framework treats retrieval as an agent-callable action to populate a workspace, allowing the agent to perform flexible search, filtering, and cross-document verification within a manageable subset of data.

Pipeline: Large Corpus → Retriever-steered Workspace Expansion → Agent-led DCI Operations (filtering, comparison, verification) → Final Answer

Methodology: The authors implemented a retriever-steered DCI framework and evaluated it on Browsecomp-Plus and Wiki-18 datasets, comparing it against raw DCI, BM25, and trained search-agent baselines.

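Results: DR-DCI achieved 71.2% accuracy on Browsecomp-Plus (up to 8.3 points above raw DCI and ablated variants), rising to 73.3% with workspace-preserving context reset. It remained effective from 100K to 10M documents and averaged 63.0 across six benchmarks in the 20M-scale Wiki-18 QA setting, outperforming retrieval-based and trained search-agent baselines.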

Limitations: The paper does not explicitly detail the specific overhead of workspace management or the potential for 'workspace pollution' if the agent fails to prune irrelevant documents during expansion.
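
The workspace-expansion idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: the three-document `CORPUS`, the term-overlap `retrieve` function (a stand-in for BM25 or a dense retriever), and the `Workspace` class with its `expand` and `grep` methods are all hypothetical names chosen for illustration.

```python
import re
from typing import Dict, List

# Hypothetical toy corpus; the real setting has millions of documents.
CORPUS: Dict[str, str] = {
    "d1": "Ada Lovelace wrote notes on the Analytical Engine in 1843.",
    "d2": "Charles Babbage designed the Analytical Engine.",
    "d3": "The Eiffel Tower opened in 1889 in Paris.",
}


def retrieve(query: str, k: int) -> List[str]:
    """Toy term-overlap ranking; a stand-in for BM25 or a dense retriever."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(re.findall(r"\w+", text.lower()))), doc_id)
              for doc_id, text in CORPUS.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked[:k] if score > 0]


class Workspace:
    """Local document set that grows only through retrieval calls."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def expand(self, query: str, k: int = 2) -> List[str]:
        # Retrieval is an action: pull the top-k hits into the workspace.
        added = [d for d in retrieve(query, k) if d not in self.docs]
        for d in added:
            self.docs[d] = CORPUS[d]
        return added

    def grep(self, pattern: str) -> List[str]:
        # Corpus-style operation, but scoped to the workspace only.
        return sorted(d for d, text in self.docs.items()
                      if re.search(pattern, text))


ws = Workspace()
ws.expand("analytical engine")
print(ws.grep(r"18\d\d"))
```

The regex scan touches only the two documents pulled in by retrieval, so `d3` is never examined even though it also contains a year; that scoping is what keeps the local operations cheap as the corpus grows.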

5/5 Artificial Intelligence (cs.AI), cs.CY, Multiagent Systems (cs.MA) 16 Jun 2026
Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

Yujiao Chen

ArXiv ID: 2606.14923

Abstract:

As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.

Insights

Contribution: The paper introduces a behavioral framework to measure trust between AI agents using costly verification and evaluates how frontier models form, break, and recover trust in cooperative environments.

Core Idea: Trust can be quantified by observing an agent's willingness to forgo costly verification of a teammate's actions based on past reliability.

Technique: A cooperative survival game where agents must balance the resource cost of verification against the risk of accepting incorrect information.

Pipeline: Multi-agent cooperative survival game → Observation of verification frequency relative to a memoryless baseline → Quantification of trust formation, breakage, and recovery dynamics.

Methodology: The authors compared six frontier model snapshots, measuring the reduction in verification frequency when paired with reliable vs. unreliable teammates.

Results: Four frontier models (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduced verification by 60-85% with reliable partners; recovery from trust breakage is slower than formation, and clustered failures sustain suspicion longer than isolated ones.

Limitations: The study focuses on a specific cooperative survival game and may not fully capture trust dynamics in diverse real-world multi-agent architectures.
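
One plausible way to turn "reduced verification relative to a memoryless baseline" into a number is a relative reduction, sketched below. The normalization is an assumption made for illustration (the abstract does not give the exact formula), and the counts in the example are invented, not taken from the paper.

```python
def verification_discount(checks_with_memory: int,
                          checks_memoryless: int) -> float:
    """Relative drop in verification versus a memoryless baseline.

    0.0 means the agent verifies as often as the baseline (no trust formed);
    values near 1.0 mean it has almost stopped checking its teammate.
    """
    if checks_memoryless <= 0:
        raise ValueError("the memoryless baseline must verify at least once")
    return 1.0 - checks_with_memory / checks_memoryless


# Invented counts: 30 checks with memory versus 100 for the memoryless baseline.
discount = verification_discount(30, 100)
print(f"{discount:.0%}")
```

Under this reading, the 60-85% figures reported for the four larger snapshots would correspond to discounts between 0.60 and 0.85.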

5/5 Artificial Intelligence (cs.AI) 16 Jun 2026
PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis

ArXiv ID: 2606.14935

Abstract:

Frontier reasoning-tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language model translates the problem, while a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents. We introduce PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per-session isolation make the translate-run-inspect-repair loop a reusable primitive for MCP-capable agents. We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) on two subsets of PARARULE-Plus: a general-purpose sample and a more challenging one targeting a specific failure mode of natural-language reasoning. On the general sample, the formalizer matches or exceeds reasoning LLMs (accuracy 1.00 vs. 1.00 / 0.998), with the largest gains over standard models (0.762 for GPT-4.1). On the challenging subset, the formalizer remains near-perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural-language reasoning.

Insights

Contribution: The paper introduces PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool via the Model Context Protocol (MCP) to enable symbolic delegation for LLM agents.

Core Idea: Instead of relying on expensive and potentially unreliable internal reasoning in LLMs for deductive tasks, the system uses the LLM to translate problems into logic programs while delegating the actual inference to a symbolic solver.

Technique: The authors implement a standardized tool interface that provides structured error reporting and per-session isolation, creating a reusable 'translate-run-inspect-repair' loop for MCP-capable agents.

Pipeline: Natural language problem → LLM translation to Prolog → PrologMCP execution → Error reporting/Inspection → LLM repair (if needed) → Final result

Methodology: The authors evaluated a formalizer agent equipped with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) using two subsets of the PARARULE-Plus dataset.

Results: The PrologMCP-enhanced formalizer reached 1.00 accuracy on the general sample and 1.00 / 0.99 on the challenging subset, well above standard models (e.g., 0.762 for GPT-4.1 on the general sample), while reasoning LLMs dropped to 0.95 / 0.94 on the challenging subset.

Limitations: The paper focuses on logic programming and does not explicitly address the scalability of the translation step for extremely large-scale knowledge bases or non-logic symbolic domains.
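
The translate-run-inspect-repair loop can be sketched as a small control loop in Python. Nothing below reflects PrologMCP's real tool names or schema: `ToolResult`, `fake_solver`, `stub_translate`, and `stub_repair` are hypothetical stand-ins, and the fake solver only checks that each clause ends with a period rather than running any Prolog.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    answer: Optional[str] = None
    error: Optional[str] = None  # structured error the agent can inspect


def fake_solver(program: str) -> ToolResult:
    """Stand-in for a Prolog tool call; only checks clauses end with '.'."""
    for line_no, line in enumerate(program.splitlines(), start=1):
        if line.strip() and not line.strip().endswith("."):
            return ToolResult(ok=False, error=f"line {line_no}: missing '.'")
    return ToolResult(ok=True, answer="true")  # placeholder, no real inference


def solve(problem: str,
          translate: Callable[[str], str],
          repair: Callable[[str, str], str],
          run: Callable[[str], ToolResult],
          max_repairs: int = 3) -> ToolResult:
    program = translate(problem)           # translate
    result = run(program)                  # run
    for _ in range(max_repairs):
        if result.ok:                      # inspect
            break
        program = repair(program, result.error or "")  # repair
        result = run(program)
    return result


# Hypothetical stubs standing in for the language model.
def stub_translate(problem: str) -> str:
    return "mortal(X) :- human(X)\nhuman(socrates)."  # first clause lacks '.'


def stub_repair(program: str, error: str) -> str:
    return "\n".join(line if line.strip().endswith(".") else line + "."
                     for line in program.splitlines())


result = solve("Is Socrates mortal?", stub_translate, stub_repair, fake_solver)
print(result.ok, result.answer)
```

Passing the solver in as a callable keeps the loop independent of any particular backend, which mirrors the paper's point that the loop becomes a reusable primitive once the tool interface is standardized.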

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 16 Jun 2026
Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

Kyle Gao, Joel Cumming, Jonathan Li, Linlin Xu, David A. Clausi

ArXiv ID: 2606.15077

Abstract:

We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.

Insights

Contribution: The paper introduces a multi-agent LLM framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. It also provides a preliminary adversarial evaluation to assess the robustness of these agents against prompt-level manipulations.

Core Idea: The system bridges the gap between natural language intent and complex geospatial API schemas by using a coordinated multi-agent architecture that balances safety, intent interpretation, and schema-aware execution.

Technique: A modular three-agent architecture consisting of a Guardrail agent for policy enforcement, a General-QA agent for intent interpretation, and a Recommender-Analyst agent for generating structured API calls.

Pipeline: Natural language query → Guardrail (Safety Check) → General-QA (Intent Interpretation) → Recommender-Analyst (API Call Generation) → Geospatial Data Retrieval

Methodology: The authors designed a portable framework using API schema substitution and conducted preliminary adversarial multi-turn experiments to test system robustness and safety.

Results: Prompt-level safety instructions improved system robustness, though the study identified rare high-impact failures in API manipulation scenarios.

Limitations: Persistent high-impact failures in specific API manipulation scenarios highlight the need for more adaptive, system-level defenses that balance safety, usability, and cost.

5/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

Sanhorn Chen, Xiaoyang Chen, Boyu Liu, Roy Zhao

ArXiv ID: 2606.15107

Abstract:

Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However, existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, leaving a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions. To bridge this gap, we introduce IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains. IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol. Code can be found in https://github.com/SanhornC/IRTS-ToolBench.

Insights

Contribution: The paper introduces IRTS-ToolBench, a comprehensive benchmark of 1,700 questions across 13 domains designed to evaluate LLM-based agents on irregular time series data.

Core Idea: Existing TSQA benchmarks assume regular sampling, but real-world data is often asynchronous and irregular; this work addresses that gap by providing a standardized evaluation for irregular time series analysis.

Technique: The authors utilize a tool-grounded reasoning framework to enable agents to handle complex, non-uniform temporal data through external tool interaction.

Pipeline: Irregular time series data → Tool-grounded reasoning agent → Verifiable answers to TSQA questions

Methodology: The researchers developed a benchmark spanning 10 task types and 13 domains, providing a reproducible protocol for testing how LLMs handle missing values and varying sampling frequencies.

Results: The benchmark provides a standardized evaluation protocol and a dataset of 1,700 questions to measure the performance gap between regular and irregular TSQA tasks.

Limitations: The abstract does not specify specific performance metrics or limitations, but the scope is currently focused on the benchmark's creation and the general problem of irregular sampling.

5/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Zhengbo Zhang, Changtao Miao, Jinbo Su, Zhaowen Zhou, Chunxia Zhang, Xukai Wang, Ruiqi Liu, Kaiyuan Zheng, Jiansheng Cai, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan

ArXiv ID: 2606.15231

Abstract:

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details, dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate the state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

Insights

Contribution: The paper introduces Visual-Seeker, a visual-native multimodal deep search agent that performs active visual reasoning to handle complex, open-world scenarios. It provides a new active visual reasoning data pipeline and a high-quality dataset of 5K multimodal trajectories.

Core Idea: Instead of treating vision as a static input, the agent dynamically harvests fine-grained visual evidence throughout the search process to perform multi-hop, cross-modal reasoning.

Technique: The method employs an active visual reasoning framework that allows the agent to attend to specific visual details and synthesize visual evidence trajectories rather than relying on text-only evidence.

Pipeline: Complex multimodal queries → Active visual reasoning & dynamic evidence harvesting → Multi-hop cross-modal search → Factually grounded answers

Methodology: The authors developed a specialized data pipeline to synthesize 5K high-quality multimodal trajectories and trained the agent to perform visual-native search across diverse web environments.

Results: Achieved state-of-the-art performance across five challenging multimodal search benchmarks, surpassing several proprietary models in real-world web environments.

Limitations: The paper does not explicitly detail the computational overhead of active visual reasoning or the scalability of the synthesized data pipeline to even more diverse web domains.

Computer Vision

4/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

ArXiv ID: 2606.15038

Abstract:

Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks: pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.

Insights

Contribution: The paper introduces a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data for time-to-event (TTE) modeling. It provides the first systematic analysis of how different fusion strategies behave under modality imbalance across diverse clinical tasks.

Core Idea: Multimodal fusion is not one-size-fits-all; different clinical tasks (e.g., PE mortality vs. CVD outcomes) require specific alignment strategies to achieve robust generalization across institutions.

Technique: The authors employ four principled fusion strategies (late fusion, contrastive alignment, cross-attention, and co-attention) to align independent domain-specific foundation model embeddings into a shared latent space.

Pipeline: CT imaging and longitudinal EHR data → Domain-specific foundation model encoding → Cross-modal representation alignment (Late Fusion, Contrastive, Cross-Attention, or Co-Attention) → Time-to-Event (TTE) prediction

Methodology: The framework was evaluated on two distinct clinical tasks (PE mortality and CVD outcomes) using large-scale multi-institutional cohorts, comparing unimodal baselines against various multimodal fusion strategies.

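Results: Fusion improved the concordance index by 1.5-5.4% over unimodal baselines when the modalities contributed comparably. Contrastive multimodal fusion, particularly with CLMBR representations, gave the most consistent improvements, especially for PE mortality. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention the best external performance.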

Limitations: The study highlights that fusion performance is highly dependent on modality contribution and task-specific characteristics, suggesting that a single universal fusion architecture may not be optimal for all clinical applications.
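
The concordance index reported in the results measures how often a model ranks pairs of subjects correctly by risk. The following is a minimal sketch of Harrell's C-index for right-censored data; the function name and the toy data are illustrative assumptions and do not come from the paper, which may use a different estimator or tie handling.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) strictly before times[j]. A higher risk score should
    go with the earlier event. Ties in risk count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical): the third subject is censored at t=7.
times = [2.0, 5.0, 7.0, 9.0]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
```

On this toy cohort four of the five comparable pairs are ordered correctly, so the index is 0.8; a value of 0.5 would correspond to random ranking.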

PDF

Computing Systems

4/5 Artificial Intelligence (cs.AI) 16 Jun 2026
CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

Xuedong Hu, Zhiqing Tang, Zhi Yao, Tian Wang, Weijia Jia

Abstract

ArXiv ID: 2606.15179

Abstract:

Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by 1.66× and 2.15×, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.

Insights

Contribution: The paper introduces CONCORD, an asynchronous sparse aggregation framework designed for device-cloud collaborative RAG where private documents are isolated on the device and public knowledge is in the cloud.

Core Idea: Instead of continuous synchronization, the cloud is treated as an asynchronous evidence source, allowing the device to proceed locally when remote evidence is not critical for the current decoding step.

Technique: The framework employs waiting debt control to manage remote participation and a certificate-guided minimal supplementation mechanism to request only the specific evidence needed for greedy token decisions.

Pipeline: Private device documents and public cloud knowledge → Asynchronous sparse aggregation with waiting debt control and certificate-guided supplementation → High-throughput RAG inference with document isolation.

Methodology: CONCORD reduces communication by deciding at each decoding step whether to keep waiting for cloud evidence, based on the observed return of waiting, and by requesting only the remote evidence needed to determine the current greedy token.

Results: Improved end-to-end throughput over baselines by 1.66x on Natural Questions and 2.15x on WikiText-2, while reducing per-token communication by over two orders of magnitude with comparable answer quality and perplexity.

Limitations: The paper does not explicitly detail the performance impact of extreme network jitter or the scalability of the certificate-guided mechanism as the number of remote evidence sources increases.
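
One way to picture a certificate for a greedy decision is a margin test: if the gap between the two best local scores exceeds anything the remote side could add or subtract, the local argmax is already fixed and no remote evidence is needed. The sketch below is a hypothetical illustration under that assumption (a known per-token bound on the remote contribution); `certified_locally`, `decode_step`, and `fetch_remote` are invented names, this is not CONCORD's actual mechanism, and waiting debt control is not modelled.

```python
def certified_locally(local_scores, remote_bound):
    """Return (best_token, certified).

    certified is True when no remote contribution with absolute value at
    most remote_bound per token could change the argmax (assumed bound).
    """
    ranked = sorted(local_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_token, best), (_, runner_up) = ranked[0], ranked[1]
    return best_token, (best - runner_up) > 2 * remote_bound


def decode_step(local_scores, remote_bound, fetch_remote):
    """Commit locally when certified, otherwise add remote scores first.

    Returns (token, used_remote). fetch_remote is a hypothetical callable
    returning a dict of per-token score adjustments from the cloud.
    """
    token, certified = certified_locally(local_scores, remote_bound)
    if certified:
        return token, False
    remote = fetch_remote()
    combined = {t: s + remote.get(t, 0.0) for t, s in local_scores.items()}
    return max(combined, key=combined.get), True


# Wide margin: committed on the device without any remote call.
step_a = decode_step({"a": 3.0, "b": 0.5}, 1.0, lambda: {})
# Narrow margin: remote evidence is requested and flips the decision.
step_b = decode_step({"a": 1.0, "b": 0.5}, 1.0, lambda: {"b": 0.9})
```

The design point is that steps with a certificate produce the same token as full aggregation would, so communication is spent only on steps where the outcome is genuinely undecided.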

PDF

General

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG), cs.SI, Machine Learning (Statistics) (stat.ML) 16 Jun 2026
Relational Structural Causal Models

Adiba Ejaz, Elias Bareinboim

Abstract

ArXiv ID: 2606.14892

Abstract:

An artificial intelligence must have a model of its environment that is causal, supporting reasoning about interventions and counterfactuals, and also combinatorial, supporting generalization to unseen combinations of objects. In this work, we formally study when and how such a model can be learned. We develop relational structural causal models, extending structural causal models (Pearl 2009) to settings where objects and their relations vary. First, we show how answers to not only causal but also observational queries about unseen combinations of objects can not be identified without further assumptions. To enable such identification--including in the presence of unobserved confounding--we define relational causal graphs and derive symbolic identification criteria. Finally, we propose relational neural causal models, a provably correct approach that outperforms non-relational baselines on simulated traffic scenes with varying cars, signals, and pedestrians.

Insights

Contribution: The paper introduces Relational Structural Causal Models (RSCMs), a framework that extends structural causal models to environments with varying numbers and types of objects. It provides symbolic identification criteria for relational queries and a provably correct neural implementation.

Core Idea: An AI's model of its environment should be both causal, supporting reasoning about interventions and counterfactuals, and combinatorial, supporting generalization to unseen combinations of objects.

Technique: The authors develop relational causal graphs and derive symbolic identification criteria to determine when queries about unseen object combinations are identifiable, even with unobserved confounding.

Pipeline: Relational environment data → Relational Causal Graph & Symbolic Identification → Relational Neural Causal Model → Causal/Counterfactual reasoning on unseen object combinations.

Methodology: The authors formally analyze the identifiability of relational queries, derive mathematical criteria for identification, and implement a neural architecture to test these models on simulated traffic scenes.

Results: The proposed relational neural causal models are provably correct and outperform non-relational baselines on simulated traffic scenes involving varying cars, signals, and pedestrians.

Limitations: The paper notes that answers to certain queries about unseen combinations cannot be identified without further assumptions, highlighting the boundaries of the current framework.
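
To make the two requirements concrete, the toy below applies one shared mechanism to every car in a scene, so it handles any number of cars, and exposes an intervention on the traffic signal. It is only a hand-written illustration: the scene layout, the `brakes` function, and the braking rule are assumptions for this sketch, not the paper's relational SCM or its neural model.

```python
def brakes(scene, do_signal=None):
    """Apply one shared mechanism to each car in the scene.

    A car brakes if the signal is red or a pedestrian is ahead of it.
    Passing do_signal overrides the observed signal, which mimics the
    intervention do(signal = value) in this toy model.
    """
    signal = scene["signal"] if do_signal is None else do_signal
    return {
        car: signal == "red" or car in scene["pedestrian_ahead"]
        for car in scene["cars"]
    }


# A scene with three cars; the same function works for any number of cars.
scene = {"cars": ["c1", "c2", "c3"], "signal": "red", "pedestrian_ahead": {"c2"}}
observed = brakes(scene)
intervened = brakes(scene, do_signal="green")
```

Under the intervention only the car with a pedestrian ahead still brakes, which separates the effect of the signal from the effect of the pedestrian.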

PDF
4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 16 Jun 2026
AI Engram: In Search of Memory Traces in Artificial Intelligence

Jea Kwon, Dong-Kyum Kim, Jiwon Kim, Yonghyun Kim, Woong Kook, Meeyoung Cha

Abstract

ArXiv ID: 2606.14997

Abstract:

Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces analogous to biological memory units remains an open question. This work introduces a geometric framework to identify such "AI engrams" by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters, and show that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning and offer geometric insight into how deep networks simultaneously support functional specificity within distributed storage.

Insights

Contribution: The paper introduces a geometric framework to identify and isolate 'AI engrams' (specific memory traces within deep neural networks), bridging biological memory theories with artificial representation learning.

Core Idea: The authors propose that memories in neural networks can be treated as distinct, manipulatable units that can be composed or erased through linear arithmetic rather than iterative optimization.

Technique: The study formalizes neuroscientific criteria (specificity, reactivation, sufficiency, and necessity) into a constrained inverse problem to derive a closed-form estimator for memory traces.

Pipeline: Neural network parameters → Geometric inverse problem formulation → Closed-form engram estimator → Surgical memory manipulation (composition/erasure)

Methodology: The researchers derived a mathematical estimator that corresponds to a natural gradient update on the parameter manifold and validated it across various architectures.

Results: The framework successfully isolated individual memories in MLPs and LLMs, demonstrating causal validity and scalability for surgical knowledge manipulation without retraining.

Limitations: The paper leaves open the extent of engram stability over long-term continuous learning and the complexity of isolating highly overlapping, non-linear semantic concepts.
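
The idea of composing and erasing memories by linear arithmetic can be illustrated with a classical linear associative memory, where each stored item is an outer product added to a weight matrix. This is a stand-in chosen for clarity, assuming orthonormal keys; it is not the paper's closed-form engram estimator, and all names below are hypothetical.

```python
def outer(value, key):
    """Outer product value * key^T as a list of rows."""
    return [[v * k for k in key] for v in value]


def combine(a, b, sign=1.0):
    """Elementwise a + sign * b for two equally shaped matrices."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def recall(weights, key):
    """Matrix-vector product: what the memory returns for a key."""
    return [sum(w * k for w, k in zip(row, key)) for row in weights]


# Two memories with orthonormal keys (assumption that makes recall exact).
k1, v1 = [1.0, 0.0, 0.0], [2.0, -1.0]
k2, v2 = [0.0, 1.0, 0.0], [0.5, 3.0]
trace1, trace2 = outer(v1, k1), outer(v2, k2)

weights = combine(trace1, trace2)              # compose by addition
erased = combine(weights, trace1, sign=-1.0)   # erase memory 1 by subtraction
```

After subtraction the first key recalls nothing while the second memory is untouched, with no iterative optimization involved.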

PDF
3/5 Artificial Intelligence (cs.AI) 16 Jun 2026
VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

Zhixin Hu, Tao Xu, Xiaodian Sun, Li Jin, Momiao Xiong

Abstract

ArXiv ID: 2606.15096

Abstract:

The Riemann Hypothesis remains one of the central unsolved problems in mathematics. Rather than claiming proof, we investigate whether a verifiable AI-assisted reasoning system can produce reliable, formally checked partial progress while explicitly identifying the remaining mathematical obstructions. We apply the Verifiable Growing Physical Transformer with Recursive Self-Improvement (VGPT-RSI) to two RH-adjacent certification tasks. First, we construct and verify a finite RH-boundary certificate for inequality on a parameterized safe lower curve over a region. The numerical boundary curve is converted into a certificate-backed lower curve, audited using outward-rounded interval arithmetic and Arb/FLINT ball arithmetic, and then checked in Rocq/CoqInterval for the parameterized theorem. Second, we initiate a formal Lagarias-route certificate. Lagarias criterion states that RH is equivalent to the global inequality. We formalize the finite quantity and produce a Coq-checked finite certificate. The final system identifies the exact unresolved mathematical bottlenecks: formalizing the Lagarias equivalence, proving the global tail theorem beyond any finite cutoff, and potentially reducing counterexamples to colossally abundant or related extremal integers. These results demonstrate that VGPT-RSI can produce certified RH-adjacent formal progress, organize proof dependencies, and avoid overclaiming when the remaining obstruction is genuinely mathematical.

Insights

Contribution: The paper applies VGPT-RSI to produce verifiable, formally checked partial progress on Riemann Hypothesis (RH)-adjacent problems while explicitly identifying the remaining mathematical obstructions. It demonstrates that AI-assisted systems can generate formally checked certificates for finite boundaries and finite Lagarias inequalities without overclaiming a full proof.

Core Idea: Instead of attempting a full proof of the Riemann Hypothesis, the system focuses on generating verifiable certificates for finite regions and identifying the specific mathematical bottlenecks that prevent a complete proof.

Technique: The system combines the Verifiable Growing Physical Transformer with Recursive Self-Improvement (VGPT-RSI) with outward-rounded interval arithmetic, Arb/FLINT ball arithmetic, and formal verification in Rocq/CoqInterval.

Pipeline: RH-adjacent mathematical problems → VGPT-RSI reasoning and certificate generation → Interval arithmetic auditing and formal verification in Coq → Certified partial progress and explicit failure localization.

Methodology: The authors applied the VGPT-RSI system to two tasks: constructing a finite RH-boundary certificate for a parameterized safe lower curve and initiating a formal Lagarias-route certificate for finite quantities.

Results: Successfully produced a Coq-checked finite certificate for a parameterized theorem and a finite Lagarias-route certificate; identified specific bottlenecks including the formalization of the Lagarias equivalence and the global tail theorem.

Limitations: The system cannot yet prove the global tail theorem beyond any finite cutoff or formalize the full Lagarias equivalence, which remain genuine mathematical obstructions.
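
For orientation, Lagarias's criterion in its standard form states that RH is equivalent to σ(n) ≤ H_n + exp(H_n)·log(H_n) for all n ≥ 1, where σ(n) is the sum of divisors of n and H_n is the n-th harmonic number. The sketch below checks this inequality for a small finite range using ordinary floating point. It only illustrates what a finite check looks like: it is not a certificate, it uses no interval or ball arithmetic, and the cutoff of 2000 is an arbitrary choice unrelated to the paper.

```python
import math


def sigma(n):
    """Sum of the divisors of n, by trial division up to sqrt(n)."""
    total = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            total += d
            if d != n // d:
                total += n // d
        d += 1
    return total


def lagarias_slack(n_max):
    """Yield (n, slack) with slack = H_n + exp(H_n) * log(H_n) - sigma(n)."""
    harmonic = 0.0
    for n in range(1, n_max + 1):
        harmonic += 1.0 / n
        yield n, harmonic + math.exp(harmonic) * math.log(harmonic) - sigma(n)


# Plain floats, so this is a sanity check rather than a verified result.
violations = [n for n, slack in lagarias_slack(2000) if slack < 0]
```

A finite check of this kind says nothing about the tail beyond the cutoff, which is exactly the obstruction the limitations above describe.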

PDF
3/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

Qiheng Sun, Junxu Liu, Xiaokai Mao, Haocheng Xia, Jinfei Liu, Kui Ren, Haibo Hu

Abstract

ArXiv ID: 2606.15273

Abstract:

Shapley value-based feature attribution methods face challenges in scenarios involving complex feature interactions and causal relationships, even when a causal structure is provided. Existing methods typically adopt a node-centric view, attributing importance solely to individual features. Consequently, they often fail to simultaneously capture the externality and exogenous influence of features, leading to unreasonable interpretations. To overcome these limitations, we propose a novel feature attribution method called DAG-SHAP, which is based on edge intervention. DAG-SHAP treats each feature edge as an individual attribution object, ensuring that both externality and exogenous contributions of features are appropriately captured. Additionally, we introduce an approximation method for efficiently computing DAG-SHAP. Extensive experiments on both real and synthetic datasets validate the effectiveness of DAG-SHAP. Our code is available at https://github.com/ZJU-DIVER/DAG-SHAP.

Insights

Contribution: The paper introduces DAG-SHAP, a novel feature attribution method that shifts the focus from node-centric to edge-centric attribution in Directed Acyclic Graphs (DAGs). It effectively captures both the externality and exogenous influences of features in complex causal structures.

Core Idea: Instead of attributing importance to individual features (nodes), the method treats each feature edge as an individual attribution object to account for complex interactions.

Technique: The method utilizes edge intervention within a Shapley value framework and introduces an approximation method to ensure computational efficiency.

Pipeline: Causal DAG and model → Edge intervention-based Shapley value calculation → Feature edge importance scores

Methodology: The authors define attribution based on the impact of intervening on specific edges in a DAG, then develop an approximation algorithm to handle the high dimensionality of edge-based calculations.

Results: Experiments on real and synthetic datasets validate the effectiveness of DAG-SHAP; the authors argue that it avoids the unreasonable interpretations that node-centric Shapley value methods can produce.

Limitations: The paper does not explicitly detail the scalability limits of the approximation method on extremely large, dense graphs or its performance on non-DAG structures.
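
The edge-centric view can be illustrated with exact Shapley values in which the players are the edges of a tiny DAG. In this sketch an inactive edge passes a baseline of 0 instead of the parent's value; that intervention rule, the three-edge graph, and the linear equations are assumptions made for illustration and may differ from DAG-SHAP's definition. The exact computation enumerates all orderings, so it is only feasible for a handful of edges.

```python
from itertools import permutations

# Toy DAG (hypothetical): x1 -> x2, x1 -> y, x2 -> y.
EDGES = ("x1->x2", "x1->y", "x2->y")


def model_output(active, x1=1.0):
    """Evaluate y with only the active edges transmitting their parent's value."""
    x2 = 2.0 * (x1 if "x1->x2" in active else 0.0)
    direct = x1 if "x1->y" in active else 0.0
    indirect = 3.0 * (x2 if "x2->y" in active else 0.0)
    return direct + indirect


def edge_shapley(edges, value):
    """Exact Shapley values: average marginal gain over all edge orderings."""
    totals = {e: 0.0 for e in edges}
    orders = list(permutations(edges))
    for order in orders:
        active = set()
        previous = value(active)
        for edge in order:
            active.add(edge)
            current = value(active)
            totals[edge] += current - previous
            previous = current
    return {e: t / len(orders) for e, t in totals.items()}


scores = edge_shapley(EDGES, model_output)
```

Here the direct edge receives 1 and the two edges on the indirect path receive 3 each, summing to the full output of 7, so the attribution shows which route the influence of x1 takes rather than crediting x1 as a single node.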

LLM

4/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

Abstract

ArXiv ID: 2606.15029

Abstract:

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, outperforming random selection with Metric Match. All project code is publicly available, and we additionally provide an installable package for ease of use.

Insights

Contribution: The paper introduces Metric Match, a method to estimate the reliability of LLM judges using significantly fewer human annotations by selecting a representative subset of samples.

Core Idea: Instead of random sampling, the authors propose selecting a subset of data that specifically matches the population's correlation metrics based on synthetic labels.

Technique: The technique utilizes a subset selection strategy that optimizes for correlation-based reliability metrics, reducing the human annotation burden while maintaining estimation accuracy.

Pipeline: Open-ended text generation samples → Synthetic label generation → Metric Match subset selection → Human annotation → Reliability estimation/classification

Methodology: The researchers compared Metric Match against random subset selection across 15 datasets and four correlation metrics, including a cost model and a medical case study.

Results: Achieved a 0.838 win-rate against random selection, an 18.7% decrease in average estimation error, a 32.5% reduction in annotation needs, and a cost saving of $1,041.67 in a medical case study.

Limitations: The study focuses on correlation-based reliability and does not explicitly address non-linear alignment or specific biases inherent in the synthetic labels used for selection.
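
The selection idea can be sketched as follows: compute the judge-versus-synthetic-label correlation on the full pool, then look for a small subset whose correlation is as close as possible to that value, and send only that subset to human annotators. The random search below is a simple stand-in for whatever selection procedure the paper uses; `select_subset`, the trial count, and the synthetic data are all assumptions for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def select_subset(judge, synthetic, k, trials=2000, seed=0):
    """Random search for k indices whose correlation matches the pool's."""
    rng = random.Random(seed)
    target = pearson(judge, synthetic)
    best_idx, best_gap = None, float("inf")
    for _ in range(trials):
        idx = rng.sample(range(len(judge)), k)
        xs = [judge[i] for i in idx]
        ys = [synthetic[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation is undefined for constant samples
        gap = abs(pearson(xs, ys) - target)
        if gap < best_gap:
            best_idx, best_gap = idx, gap
    return sorted(best_idx), best_gap


# Hypothetical pool: judge scores and noisy synthetic labels.
data_rng = random.Random(1)
judge_scores = [data_rng.random() for _ in range(200)]
synthetic_labels = [s + data_rng.gauss(0.0, 0.3) for s in judge_scores]
chosen, gap = select_subset(judge_scores, synthetic_labels, k=20)
```

The chosen indices would then be annotated by humans, on the assumption that a subset matching the pool on synthetic labels also estimates human alignment well.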

PDF
4/5 Artificial Intelligence (cs.AI) 16 Jun 2026
CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

Zhi Yao, Weihao Chen, Zhiqing Tang, Hanshuai Cui, Qianli Ma, Weijia Jia, Wei Zhao

Abstract

ArXiv ID: 2606.15199

Abstract:

Proactive warning is an important capability for edge intelligent services, where the system predicts whether a subject will successfully complete an incoming task under strict latency and privacy constraints. Such prediction depends on both long-term static attributes and short-term dynamic states derived from historical interaction logs. Recent Large Language Models (LLMs) offer strong long-context reasoning for constructing structured profiles from these logs, but existing solutions face two challenges for edge deployment: (1) profiling methods are typically domain-specific and lack a reusable abstraction across service scenarios, and (2) fine-tuning alignment models on heterogeneous edge clusters incurs high synchronization overhead due to the variance in input sequence lengths. To address these challenges, we propose CogGuard, a proactive-warning framework for edge intelligent services. CogGuard decouples offline LLM-based profile construction from online Small Language Model (SLM)-based score prediction through a shared static-dynamic profile-to-score pipeline, and instantiates it in two representative scenarios: educational performance warning and operational task outcome warning. For efficient profile construction, we design scenario-specific profiling methods with prefix-aligned KV-cache reuse to reduce repeated encoding overhead. For edge-side model alignment, we propose a length-aware distributed fine-tuning strategy with contrastive regularization to mitigate workload imbalance on heterogeneous clusters. Experiments on education and operation datasets show that CogGuard reduces profile construction time by up to 48% and distributed fine-tuning time by 19%, while achieving MAEs of 13.4 and 5.9, respectively, on 100-point-scale warning tasks. In the largest educational setting, CogGuard reduces prediction error by 15.4% compared with the strongest baseline.

Insights

Contribution: The paper introduces CogGuard, a proactive-warning framework for edge intelligent services that decouples offline profile construction from online score prediction to handle long-context reasoning and heterogeneous edge constraints.

Core Idea: CogGuard leverages Large Language Models (LLMs) for offline profile construction and Small Language Models (SLMs) for online prediction, utilizing a shared profile-to-score pipeline to balance reasoning depth with edge deployment efficiency.

Technique: The framework employs prefix-aligned KV-cache reuse for efficient profile construction and a length-aware distributed fine-tuning strategy with contrastive regularization for model alignment on heterogeneous clusters.

Pipeline: Historical interaction logs → LLM-based profile construction (offline) → Shared profile-to-score pipeline → SLM-based score prediction (online) → Proactive warning

Methodology: The authors designed scenario-specific profiling methods for education and operation tasks, implementing a distributed fine-tuning approach to mitigate workload imbalance caused by varying input sequence lengths.

Results: CogGuard reduced profile construction time by up to 48% and distributed fine-tuning time by 19%, achieving MAEs of 13.4 and 5.9 on the education and operation 100-point-scale warning tasks, respectively. In the largest educational setting, it reduced prediction error by 15.4% compared with the strongest baseline.

Limitations: The study focuses on two specific representative scenarios (education and operation), leaving the generalizability of the profiling methods across other diverse domains as an open question.
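
Workload imbalance from varying sequence lengths can be reduced by assigning samples to workers by length rather than by count. The sketch below uses a greedy longest-first assignment with token count as the cost proxy and identical workers; both are simplifying assumptions, and this is a generic balancing heuristic rather than CogGuard's length-aware fine-tuning strategy.

```python
import heapq


def assign_by_length(seq_lengths, n_workers):
    """Greedy longest-first assignment of sequences to the least-loaded worker.

    Returns (assignment, loads): assignment[w] lists sequence indices given
    to worker w, and loads[w] is the total token count on worker w.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i], reverse=True)
    for idx in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], worker))
    loads = [sum(seq_lengths[i] for i in group) for group in assignment]
    return assignment, loads


# Hypothetical batch of six sequences split across two workers.
lengths = [900, 100, 500, 400, 300, 200]
assignment, loads = assign_by_length(lengths, 2)
```

On this batch both workers end up with 1200 tokens, whereas splitting the list in its original order would give 1500 against 900.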

PDF
4/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu

Abstract

ArXiv ID: 2606.15258

Abstract:

Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.

Insights

Contribution: The paper introduces Mask-Proof, an automated pipeline and benchmark (Mask-ProofBench) designed to evaluate step-level reasoning in long mathematical proofs using LLM-based automated grading.

Core Idea: By transforming real mathematical proofs into masked-step tasks, the authors create a scalable and reproducible way to measure intermediate reasoning rather than just final answers.

Technique: The method utilizes an LLM-based equivalence judge with repeated voting to verify if a model's reconstructed formula is mathematically equivalent to the original masked step.

Pipeline: Real mathematical proofs → Masking key formula steps with surrounding context → LLM reconstruction → LLM-based equivalence judging with repeated voting → Final reasoning score

Methodology: The authors curated 292 problems across diverse research areas and evaluated 17 models, comparing standard models against reasoning-enhanced versions.

Results: Reasoning-enhanced models outperformed standard models by 12% to 27%, and the automated evaluator achieved a 96.8% agreement rate with human experts.

Limitations: The study focuses on step-level reconstruction rather than full end-to-end proof generation, and the reliability of the LLM judge may still vary on highly complex, novel logic.
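
Repeated voting stabilizes a noisy judge by asking the same question several times and taking the majority. The sketch below shows that aggregation step only; `vote_equivalent` and the scripted `flaky_judge` are hypothetical, and a real judge would be an LLM call whose prompt and vote count are not specified here.

```python
from collections import Counter


def vote_equivalent(judge, reference, candidate, n_votes=5):
    """Query the judge n_votes times and return (majority_verdict, yes_count).

    judge is any callable returning a truthy value when it considers the
    candidate step equivalent to the reference step. Use an odd n_votes
    to avoid ties.
    """
    votes = Counter(bool(judge(reference, candidate)) for _ in range(n_votes))
    return votes[True] > n_votes / 2, votes[True]


# Stand-in judge that answers from a fixed script to mimic noisy verdicts.
scripted = iter([True, False, True, True, False])


def flaky_judge(reference, candidate):
    return next(scripted)


verdict, yes_votes = vote_equivalent(flaky_judge, "x^2 - 1", "(x - 1)(x + 1)")
```

With three of five votes in favour the step is accepted, even though two individual calls disagreed.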

NLP

3/5 Artificial Intelligence (cs.AI) 16 Jun 2026
Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

Shiqiao Zhou, Zipeng Wu, Holger Schöner, Edouard Fouché, IAG Wilson, Shuo Wang

Abstract

ArXiv ID: 2606.14941

Abstract:

Time series forecasting models often benefit from historical patterns. Inspired by Retrieval-Augmented Generation (RAG), recent research explored retrieving relevant historical time series segments to enhance forecasting. However, relying solely on time series similarity is often insufficient for retrieval under non-stationarity. To address this, we propose a multimodal approach: a Semantics-Enhanced Retrieval-Augmented Time Series Forecasting framework, SERAF. Unlike mainstream approaches that depend only on time series similarity, SERAF conducts dual retrieval over the time series and their self-generated textual descriptions. It retrieves two complementary sets of historical patterns and corresponding futures, which are selectively and jointly used to guide future predictions. Experiments across seven real-world datasets demonstrate the effectiveness of SERAF in bridging numerical and semantic views of time series compared with state-of-the-art baselines.

Insights

Contribution: The paper introduces SERAF, a multimodal framework that enhances time series forecasting by integrating both numerical similarity and semantic descriptions into a retrieval-augmented architecture.

Core Idea: To overcome the limitations of relying solely on time series similarity in non-stationary environments, the model retrieves complementary historical patterns from both numerical and textual perspectives.

Technique: The framework employs a dual-retrieval mechanism that retrieves historical segments based on time series similarity and self-generated textual descriptions to guide future predictions.

Pipeline: Input time series → Generate textual descriptions → Dual retrieval (numerical similarity + semantic matching) → Jointly select and fuse retrieved patterns → Forecast future values

Methodology: The methodology involves creating a multimodal retrieval system where historical data is indexed by both its raw signal and its semantic meaning, allowing for a more robust selection of relevant past contexts.

Results: Experiments across seven real-world datasets demonstrate that SERAF outperforms state-of-the-art baselines by effectively bridging numerical and semantic views.

Limitations: The paper does not explicitly detail the computational overhead of generating textual descriptions or the specific criteria for the 'selective' joint use of retrieved sets.
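
Dual retrieval can be sketched as two independent nearest-neighbour lookups over the same bank of historical segments, one by numerical distance and one by similarity of the textual descriptions. The version below uses Euclidean distance and Jaccard token overlap as deliberately simple stand-ins; the bank entries, the function names, and the plain averaging of retrieved futures are assumptions, and SERAF's actual encoders and selective fusion are not reproduced here.

```python
def euclidean(a, b):
    """Euclidean distance between two equally long series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def jaccard(text_a, text_b):
    """Token-set overlap between two descriptions."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b)


def dual_retrieve(query_series, query_text, bank, k=1):
    """Return the top-k entries by series distance and by text similarity."""
    by_series = sorted(bank, key=lambda e: euclidean(query_series, e["series"]))[:k]
    by_text = sorted(bank, key=lambda e: -jaccard(query_text, e["text"]))[:k]
    return by_series, by_text


# Hypothetical bank of historical segments with descriptions and futures.
bank = [
    {"id": "a", "series": [1, 2, 3], "text": "steady upward trend", "future": 4.0},
    {"id": "b", "series": [3, 2, 1], "text": "steady downward trend", "future": 0.0},
    {"id": "c", "series": [1, 5, 1], "text": "sharp spike then drop", "future": 1.0},
]
numeric_hits, semantic_hits = dual_retrieve([1, 2, 4], "sharp spike then recovery", bank)

# Placeholder fusion: average the futures of everything retrieved.
retrieved = numeric_hits + semantic_hits
forecast = sum(e["future"] for e in retrieved) / len(retrieved)
```

The two views disagree on this query: the numerical lookup returns segment "a" and the textual lookup returns segment "c", which is the complementary behaviour the dual-retrieval design relies on.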

PDF

Personal Interests

Papers discovered through your interest topics.

Multi-Agent Systems

3/5 Multiagent Systems (cs.MA), Computation and Language (cs.CL) 15 Jun 2026
Misinformation Propagation in Benign Multi-Agent Systems

Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

Abstract

ArXiv ID: 2606.16710

Abstract:

Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.

Insights

Contribution: The paper investigates how intent-based misinformation propagates within multi-agent systems and identifies how group composition and decision protocols influence robustness.

Core Idea: While misinformation can persist across multi-agent interactions, multi-agent debate can mitigate performance degradation compared to single-agent systems, particularly when a majority of agents remain uncorrupted.

Technique: The researchers injected intent-based misinformation into reasoning, knowledge, and alignment tasks to observe error propagation across different agent interaction models.

Pipeline: Misinformed context input → Multi-agent interaction/debate → Aggregated decision output

Methodology: The study compares single-agent performance against multi-agent systems using various decision protocols (consensus vs. voting) and group compositions under injected misinformation.

Results: Multi-agent debate reduces performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation; consensus can be more stable than voting under peer pressure; and majorities can often steer misinformed agents back toward correct answers.

Limitations: Robustness depends on the underlying model and on how agents exchange information and aggregate decisions.
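
To make the difference between the two decision protocols concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `majority_vote` and `consensus`, the `threshold` parameter, and the example answers are all hypothetical, and real debate systems would run several interaction rounds before aggregating.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most common answer, or None if first place is tied."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no unique winner
    return counts[0][0]


def consensus(answers: Sequence[str], threshold: float = 1.0) -> Optional[str]:
    """Return an answer only if at least `threshold` of agents agree on it.

    The default threshold of 1.0 (unanimity) is an assumption for this sketch.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None


# Hypothetical final-round answers: two agents saw clean context,
# one agent was exposed to misinformation.
round_answers = ["B", "B", "C"]
vote_result = majority_vote(round_answers)    # the majority answer wins
consensus_result = consensus(round_answers)   # no unanimous answer yet
```

Under voting, the group commits to the majority answer immediately, while a unanimity-style consensus rule withholds a decision until the dissenting agent is persuaded, which is one way the choice of protocol could change how a misinformed peer affects the outcome.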


Tech News

AI Safety

Hacker News Tue, 16 Ju
Humanity isn't ready for the coming intelligence explosion

The article discusses the existential risks and societal unpreparedness regarding a potential 'intelligence explosion' driven by rapid AI advancement. It explores the gap between technological capabilities and our current regulatory, ethical, and cognitive frameworks. The piece emphasizes the need for proactive safety measures before AGI reaches a point of no return.

Reddit r/ArtificialIntelligence 2026-06-16
AI Billionaires Want to Control EVERY Aspect of Your Life | Aaron Bastani Meets Karen Hao

The post discusses a conversation between Aaron Bastani and Karen Hao regarding the concentration of power among AI billionaires. It explores the societal implications and potential risks of tech giants exerting excessive control over various aspects of human life through AI integration.

Computing Systems

Hacker News Tue, 16 Ju
The time the x86 emulator team found code so bad they fixed it during emulation

The x86 emulator team encountered code so poorly written that they chose to fix it during emulation, inside the emulator, rather than in the program's original source. This highlights extreme technical debt and the complexities of maintaining hardware-level software compatibility.

Hacker News Tue, 16 Ju
John Carmack on Fabrice Bellard

John Carmack, the legendary programmer behind Doom and id Software, shared his admiration for Fabrice Bellard, the creator of FFmpeg and QEMU. The discussion highlights the profound impact of high-performance systems programming and efficient software engineering on the broader technology landscape.

Hacker News Tue, 16 Ju
Amazon Announces Multibillion-Dollar Data Center in Missouri

Amazon has announced a multibillion-dollar investment to construct a new data center in Missouri. This infrastructure expansion is designed to bolster cloud computing capabilities and support the growing demand for high-performance computing and AI workloads.

Reddit r/MachineLearning 2026-06-16
quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

A new C++ tokenizer called 'quicktok' has been released, offering byte-identical results to OpenAI's tiktoken while significantly improving performance. By utilizing data structure engineering like 2-byte tries and hand-compiled pretokenizers, it achieves speeds up to 11x faster than the original tiktoken library. It supports major models including Llama-3, Qwen2.5, and GPT-4o (o200k).

General

Reddit r/MachineLearning 2026-06-15
How the brains learn [R]

Researchers have proposed a framework for neocortex learning based on error-driven predictive learning via temporal derivatives and competitive kinase synaptic plasticity. This model, implemented in the Axon neural simulation framework, aims to provide a biologically plausible alternative to backpropagation that could significantly improve training efficiency.

LLM

Hacker News Tue, 16 Ju
Reviews have become expensive, rewrites have become cheap

The author argues that LLMs have changed the economics of software development by making code generation nearly instantaneous. Because producing a rewrite now costs very little, the human bottleneck has moved from writing code to the cognitively expensive work of reviewing and verifying it.

Trending repositories on GitHub filtered and scored for relevance to your interests.

Agentic AI

LLM

MLOps