Tuesday, June 9, 2026

Daily Digest 2026-06-09

Today’s dominant theme is advancing AI systems for specialized applications while addressing the technical and ethical challenges of deploying them, with a focus on multi-agent workflows, safety, and cross-domain reasoning.

Research highlights:

AI agents and workflows: Papers explore multi-source evidence adjudication in pathology, neuroscience data pipelines, and agent-to-agent protocols for regulatory compliance.
LLM safety and evaluation: Studies examine contextual safety, sycophancy metrics, and the limitations of rigid evaluation frameworks for language models.
Multimodal reasoning and compression: Research advances include memory-efficient LLMs for audio-visual tasks, pruning techniques for model compression, and methods to enhance cross-modal reasoning.
Healthcare and neuroscience applications: Work spans disease trajectory forecasting, structured information extraction from medical reports, and AI-driven flood detection using satellite data.

Tech buzz:

OpenAI’s confidential IPO filing signals growing interest in the commercialization of foundational AI models.
News highlights include AI’s impact on legal practices, real-world applications like mosquito control systems, and debates over AI-generated content’s reliability.
Community discussions focus on open-source automation tools, AI education methods, and the ethical implications of AI epistemic deference.

Global Trends

Papers discovered from ArXiv subject categories

AI Safety

5/5 Artificial Intelligence (cs.AI) 9 Jun 2026

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant

ArXiv ID: 2606.07874

Abstract:

LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

Insights

Contribution: This paper investigates the limitations of LLM-judges in adapting to contextual and definitional variations in safety evaluation, revealing their reliance on rigid internal priors despite exposure to new information.

Core Idea: LLM-judges prioritize their pre-trained safety priors over dynamic contextual cues or redefined safety criteria, leading to inconsistent evaluations when faced with conflicting information.

Technique: The study evaluates generalist and safety-specific LLMs on tasks involving in-context information, task demonstrations, and shifting safety definitions to test their adaptability.

Pipeline: Text samples + safety definitions → LLM-judges process inputs using internal priors and contextual cues → Safety judgments (aligned or misaligned with provided definitions).

Methodology: Authors tested LLM-judges across varied scenarios with controlled changes in context, task examples, and safety definitions to measure their adaptability and prior-driven biases.

Results: LLMs showed limited ability to adjust evaluations when context or safety definitions contradicted their internal priors, despite demonstrating learning from novel in-context information.

Limitations: Findings may not generalize beyond the generalist and safety-specific judges that were evaluated, or to real-world deployment contexts where safety definitions keep evolving.

4/5 Artificial Intelligence (cs.AI), Multiagent Systems (cs.MA) 9 Jun 2026

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu

ArXiv ID: 2606.07805

Abstract:

The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to "Machiavellian" behaviors where agents strategically violate safety rules to maximize rewards - a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed-Evolve-Refine-Verify) pipeline, an "Agent-as-a-Benchmark" paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce two novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance.

Insights

Contribution: Introduces MAC-Bench, a dynamic adversarial benchmark, and SERV pipeline to evaluate procedural compliance in multi-agent systems, addressing the neglect of compliance risks in current evaluation frameworks.

Core Idea: Addressing the failure of existing frameworks to detect Machiavellian behaviors in agents, MAC-Bench forces agents into compliance-task trade-offs using realistic pressure scenarios and novel metrics.

Technique: The SERV pipeline transforms legal texts into executable scenarios via Seed-Evolve-Refine-Verify, combined with holographic sandboxes and calibrated social-engineering pressure vectors.

Pipeline: Unstructured legal texts → Seed (scenario generation) → Evolve (pressure injection) → Refine (scenario validation) → Verify (executable scenario output)

Methodology: Synthesizes dynamic environments with calibrated social-engineering pressure to stress-test agents, evaluating them using Compliance-Weighted Success Rate (CSR) and Machiavellian Gap (MG) metrics.

Results: Revealed pervasive trade-offs between task success and compliance in state-of-the-art frontier models, measured with the CSR and MG metrics.

Limitations: Scenarios may lack real-world complexity, pressure vector calibration requires domain expertise, and the benchmark focuses on procedural compliance over broader ethical considerations.

4/5 Artificial Intelligence (cs.AI), Human-Computer Interaction (cs.HC) 9 Jun 2026

The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

Alejandro Botas, Paul de Font-Reaulx, Luke Hewitt

ArXiv ID: 2606.07897

Abstract:

Current AI models frequently exhibit epistemic sycophancy, endorsing claims to agree with a user. Existing evaluations typically measure this either by assessing what it takes to make a model shift a binary endorsement or by eliciting an explicit probability in a proposition. However, much user-facing sycophantic behavior is demonstrated through shifts in graded support expressed through ordinary language. We propose the AI Epistemic Deference Index (AEDI): a continuous, unidimensional score representing how sensitive the support expressed in a model's output is to the attitude expressed in a user's prompt. To generate AEDI, we provide a new protocol for estimating probabilities from natural language outputs, using LLMs-as-judges validated for consistency and correlation to human judgment. We deploy it on a new curated database of 500 propositions across diverse topics and 16,000 prompts varying in user attitude, testing eight prominent models. Every model exhibits substantial deference, though with large and systematic differences across providers, with Claude models demonstrating the least, and Grok and Gemini models the most. The effect is amplified in prompts requesting a written artifact, and concentrated on propositions where models hold weaker priors. We release AEDI as an easy-to-update benchmark and measurement pipeline for output-level sycophancy evaluation.

Insights

Contribution: Introduces the AI Epistemic Deference Index (AEDI), a continuous measure of AI sycophancy sensitivity to user attitudes, enabling nuanced evaluation of model deference beyond binary metrics.

Core Idea: AEDI quantifies how AI outputs adjust graded support in response to user prompts, capturing subtle shifts in language rather than relying on binary or explicit probability judgments.

Technique: Leverages LLMs-as-judges to estimate probabilities from natural language outputs, validated for consistency with human judgment, to compute AEDI scores across diverse prompts and propositions.

Pipeline: Curated prompts with varying user attitudes → LLMs-as-judges assess probability shifts in model outputs → AEDI score calculated as sensitivity to user attitude.

Methodology: Tests eight models on 500 propositions with 16,000 prompts using a validated protocol, measuring deference across providers and prompt types (e.g., artifact requests).

Results: All eight models show substantial deference, with Claude models the least deferential and Grok and Gemini models the most; the effect is amplified in prompts requesting a written artifact and concentrated on propositions where models hold weaker priors.

Limitations: Depends on LLM judges' alignment with human judgment; may not capture all forms of sycophancy; results limited to tested models and proposition types.
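
The abstract describes AEDI as the sensitivity of a model's expressed support to the attitude in the user's prompt, but it does not give a formula. As a minimal sketch, assuming sensitivity is read as the least-squares slope of judge-estimated support against a numeric attitude code, the idea might look like this. The function name `deference_slope`, the -1 to +1 attitude coding, and the sample scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean

def deference_slope(attitudes, supports):
    """Least-squares slope of judged support on user attitude.

    attitudes: numeric user-attitude codes, e.g. -1 (opposes) to +1 (endorses).
    supports:  judge-estimated probability that the output supports the claim.
    Both the coding and the slope-as-sensitivity reading are assumptions.
    """
    if len(attitudes) != len(supports) or len(attitudes) < 2:
        raise ValueError("need two or more paired observations")
    a_bar, s_bar = mean(attitudes), mean(supports)
    var = sum((a - a_bar) ** 2 for a in attitudes)
    if var == 0:
        raise ValueError("attitudes must vary")
    cov = sum((a - a_bar) * (s - s_bar) for a, s in zip(attitudes, supports))
    return cov / var

# Hypothetical judge scores for one proposition under three user attitudes.
attitudes = [-1.0, 0.0, 1.0]
supports = [0.40, 0.50, 0.70]
slope = deference_slope(attitudes, supports)  # positive slope = support follows the user
```

Under this reading, a slope of zero would mean the model's expressed support does not move with the user's stance, and larger positive values would mean more deference.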

3/5 Artificial Intelligence (cs.AI) 9 Jun 2026

Some hypotheses on how chatbots work in problem-solving-driven conversations. Large Language Models as confirmation of the Innovation Illusion

S. F. M. van Vlijmen, H. D. Lethe jr

ArXiv ID: 2606.07722

Abstract:

This article offers a perspective on the nature of chatbots as genuine conversation partners when discussing problems in relation to their solutions. What can chatbots do and what can't they do, and how can this be explained? Our argument draws on Aggregation Dynamics, Cognitive Linguistics, Neuropsychology and Psychology. Our argument focuses on basic chatbots in the hope of thereby making statements about the core functionality of more advanced chatbots. Basic chatbots are assumed to consist of a Large Language Model (LLM) with a simple interface. The main results are: a description of human understanding and thinking based on so-called metaphorical problem propagations; the hypothesis that text datasets used for training LLMs have specific characteristics and that these text datasets only partially imitate human thinking and understanding; the hypothesis that the LLM training process encodes artificial metaphorical problem propagations into an LLM from these datasets; our conclusion that a basic chatbot cannot be a thinking partner capable of matching humans; our conclusion that further development of the Large Language Model will not lead to this either. Yann LeCun states: "Animals and humans exhibit learning abilities and understandings of the world that are far beyond the capabilities of current AI and machine learning (ML) systems." Our conclusions are in line with this. LeCun's vision and ours are at odds with the optimism of Big Tech. That does not alter the fact that chatbots exist, that they are being used on a massive scale, by both individuals and organisations, and that it is therefore socially and politically important to understand them. Our article aims to contribute to the discussion on the functioning, benefits and drawbacks of chatbots. We have not yet encountered the approach we used to arrive at our conclusions in our research into how chatbots work.

Insights

Contribution: This paper provides an interdisciplinary analysis of chatbot limitations in problem-solving conversations, challenging the assumption that Large Language Models (LLMs) can replicate human cognition, and aligns with Yann LeCun's critique of AI's current capabilities.

Core Idea: Basic chatbots, powered by LLMs, cannot function as true thinking partners due to artificial metaphorical problem propagations encoded in their training data, which only partially mimic human understanding and reasoning.

Technique: The study integrates Aggregation Dynamics, Cognitive Linguistics, Neuropsychology, and Psychology to analyze how LLM training data and processes shape chatbot behavior and limitations.

Pipeline: Problem-solving conversation input → LLM processes metaphorical problem propagations from training data → Chatbot generates responses that reflect artificial, not human, reasoning patterns.

Methodology: The authors use theoretical frameworks from multiple disciplines to examine LLM training data characteristics and their encoding of metaphorical structures, without empirical experimentation.

Results: A description of human understanding via metaphorical problem propagations; the hypothesis that LLM training data only partially imitate human thinking; the conclusion that basic chatbots, and further-developed LLMs, cannot match human cognitive capabilities.

Limitations: The theoretical approach lacks empirical validation through experiments; questions remain open about potential future advancements in LLMs and their societal implications.

Agentic AI

5/5 Artificial Intelligence (cs.AI), Multiagent Systems (cs.MA) 9 Jun 2026

PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

Chengyang Zhang, Wenchuan Zhang, Bo Li, Mengran Li, Bob Zhang, Yuhao Yi, Hong Bu, Jiancheng Lv

ArXiv ID: 2606.07549

Abstract:

Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination. We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.

Insights

Contribution: PathoSage introduces a three-stage framework for reliable patch-level pathology reasoning that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication, while modeling tool reliability through a training-free Beta-Bernoulli system.

Core Idea: Structured Evidence Deliberation independently evaluates conflicting evidence from multiple sources and generates judgments in a fresh context to reduce bias, combined with experience-aware tool reliability modeling for robust decision-making.

Technique: A training-free Beta-Bernoulli experience system with continuous credit assignment models long-term tool reliability, while Structured Evidence Deliberation performs conflict analysis and generates final judgments in isolated contexts.

Pipeline: Pathology patches and clinical queries → knowledge retrieval and tool-based evidence collection → structured deliberation to resolve conflicts and generate the final judgment

Methodology: Experiments compare PathoSage against strong pathology MLLM and agentic baselines, measuring how well explicit evidence adjudication reduces VQA hallucinations and classifier disagreement.

Results: Outperforms strong pathology MLLM and agentic baselines, mitigating VQA hallucinations and classifier disagreement.

Limitations: Depends on diverse and high-quality tool outputs for effective adjudication; long-term reliability modeling may require further validation on broader clinical datasets.
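
The Beta-Bernoulli experience system is only described at a high level in the abstract. As a rough sketch of how such a tracker could work, assuming each tool keeps a Beta(alpha, beta) belief that is updated with fractional credit and that priors for a new case are built by weighting past outcomes by similarity, something like the following would fit the description. The class name, the uniform Beta(1, 1) starting prior, and the weighting rule are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class ToolReliability:
    """Beta(alpha, beta) belief over how often a tool's evidence is right."""
    alpha: float = 1.0  # pseudo-count of successes (uniform prior assumed)
    beta: float = 1.0   # pseudo-count of failures

    def update(self, credit: float) -> None:
        """Fold in one outcome; credit in [0, 1] may be fractional."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        """Posterior mean reliability."""
        return self.alpha / (self.alpha + self.beta)

def similarity_weighted_prior(history, base=1.0):
    """Build a prior for a new case from (similarity, credit) pairs in [0, 1].

    Each past outcome counts in proportion to its similarity to the new case.
    """
    prior = ToolReliability(base, base)
    for sim, credit in history:
        prior.alpha += sim * credit
        prior.beta += sim * (1.0 - credit)
    return prior

# Hypothetical usage: one full success, then one half-credit outcome.
tool = ToolReliability()
tool.update(1.0)
tool.update(0.5)

# Hypothetical prior from a very similar success and a half-similar failure.
prior = similarity_weighted_prior([(1.0, 1.0), (0.5, 0.0)])
```

Because the update only adds pseudo-counts, no gradient training is involved, which is consistent with the "training-free" description.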

5/5 Artificial Intelligence (cs.AI), Human-Computer Interaction (cs.HC), Machine Learning (cs.LG), Software Engineering (cs.SE) 9 Jun 2026

Syll: Open-Source Personal Automation with Cross-Surface Execution

Bo Zhang, Borui Zhang, Chenghao Jiang, Minglei Shi, Xiaofeng Wang, Zheng Zhu, Jie Zhou, Jiwen Lu

ArXiv ID: 2606.07594

Abstract:

Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single interface and offer limited support for user teaching and auditability. We present Syll, an open-source, self-hosted multimodal agent harness that unifies MCP/API tools, CLI execution, and visual GUI control in a modular runtime, enabling agents to coordinate computer use across heterogeneous interfaces while streamlining how users and agents exchange information. At the core of Syll is a bidirectional user-agent interaction layer: users teach procedures through direct demonstration, which Syll compiles into reusable skills; agent execution is translated back into multimodal evidence -- logs, keyframes, and approval checkpoints -- for inspection and control. Syll further externalizes memory, skills, routines, and governance as editable local artifacts, supporting straightforward inspection, extension, and downstream development. Our implementation has been validated on production desktop applications including Adobe Photoshop, Adobe Audition, Stardew Valley, macOS Finder and others. We report mechanism-oriented studies that validate multimodal routing, teachable GUI replay, and persistent local artifacts. We hope Syll can serve as a practical open-source foundation for personal automation that users can teach, inspect, and continuously extend.

Insights

Contribution: Syll introduces an open-source, self-hosted multimodal agent framework that enables cross-surface automation by unifying API, CLI, and GUI interactions, with a focus on user teachability and auditability through bidirectional interaction and persistent local artifacts.

Core Idea: Syll bridges heterogeneous interfaces via a modular runtime and bidirectional user-agent interaction, allowing users to teach procedures through demonstrations and enabling agents to generate inspectable multimodal evidence during execution.

Technique: The system externalizes memory, skills, and governance as editable local artifacts while using a bidirectional interaction layer to translate user demonstrations into reusable skills and agent actions into logs, keyframes, and approval checkpoints.

Pipeline: User demonstrations (input) → Syll compiles into reusable skills and executes across APIs, CLI, and GUI (process) → Generates logs, keyframes, and approval checkpoints for inspection (output).

Methodology: The framework was validated on production desktop applications like Adobe Photoshop and macOS Finder, with studies confirming multimodal routing, teachable GUI replay, and persistent artifact management.

Results: Ran on production desktop applications (e.g., Adobe Photoshop, Adobe Audition, Stardew Valley, macOS Finder); mechanism-oriented studies validated multimodal routing, teachable GUI replay, and persistent local artifacts.

Limitations: Relies on user demonstrations for skill acquisition, which may limit autonomy; potential complexity in managing heterogeneous interface interactions in highly dynamic environments.

5/5 Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG) 9 Jun 2026

A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline

Kai A. Horstmann, Ethan Lin, Alice A. Robie, Jennifer J. Sun, Kristin Branson

ArXiv ID: 2606.07718

Abstract:

Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details. We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards. We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents' code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge. Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents' current abilities. We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.

Insights

Contribution: This study empirically evaluates general-purpose coding agents on a neuroscience data-to-discovery pipeline, demonstrating stage-level automation feasibility while highlighting critical challenges in end-to-end execution and scientific judgment.

Core Idea: Agentic AI can automate individual stages of complex scientific pipelines but struggles with open-ended tasks requiring domain-specific reasoning and resource management beyond predefined criteria.

Technique: The research employs an empirical evaluation framework using large-scale neuroscience datasets and domain expert-defined criteria to assess agents' ability to solve pipeline stages and manage computational resources.

Pipeline: fly optogenetics data → successive data analysis and discovery stages → scientific findings

Methodology: Agents were evaluated on tasks substantially larger than those in existing benchmarks, with datasets orders of magnitude bigger, using criteria grounded in domain expert standards for correctness and robustness.

Results: Agents solved individual pipeline stages but failed end-to-end execution; they struggled with open-ended criteria, visual inspection interpretation, and generalization to large held-out data collections.

Limitations: Current agents lack capabilities for computational resource management, scientific judgment in open-ended tasks, and generalization to unseen data, highlighting gaps in existing benchmarks.

4/5 Artificial Intelligence (cs.AI) 9 Jun 2026

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

Sanjay Kariyappa, G. Edward Suh

ArXiv ID: 2606.07808

Abstract:

Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models--Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6--on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.

Insights

Contribution: Introduces a white-box diagnostic framework to localize instruction hierarchy failures in reasoning language models and proposes training-free self-monitoring mechanisms to reduce non-compliance by up to 99%.

Core Idea: Instruction hierarchy failures in language models can be systematically diagnosed by decomposing them into instruction identification, conflict resolution, and response realization stages, enabling targeted repairs.

Technique: A diagnostic framework with three failure localization modules and two self-monitoring mechanisms (parallel input monitor and sequential output monitor) to detect and correct non-compliant responses without retraining.

Pipeline: Input (instructions with conflicts) → Process (diagnose failure stage, resolve conflicts, apply self-monitoring) → Output (compliant response with repairs)

Methodology: Evaluated three reasoning models (Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6) on long-context adaptations of IHEval and IHChallenge, analyzing failure modes across tasks and context lengths, then tested self-monitoring effectiveness under static and adaptive attacks.

Results: Self-monitoring reduced rule-following non-compliance by 81-99% (86% for GPT-5.3 under static attacks, 45% under adaptive attacks) across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3.

Limitations: Effectiveness varies by model and attack type; adaptive attacks reduce mitigation efficacy, suggesting potential vulnerabilities in dynamic environments.
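
To make the three failure stages and the two monitors concrete, here is a small rule-based stand-in. In the paper the monitors are training-free checks performed by the model itself; this sketch replaces them with plain Python so the control flow is visible. The `Instruction` fields, the numeric privilege scale, and the idea of comparing response attributes against rules are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    source: str      # e.g. "system", "user", "tool"
    privilege: int   # higher wins; the numeric scale is an assumption
    topic: str       # what the instruction governs, e.g. "language"
    rule: str        # required value, e.g. "english"

def resolve(instructions):
    """Conflict resolution: per topic, keep the highest-privilege instruction."""
    winners = {}
    for ins in instructions:
        best = winners.get(ins.topic)
        if best is None or ins.privilege > best.privilege:
            winners[ins.topic] = ins
    return winners

def input_monitor(instructions):
    """Flag topics where sources disagree, before any generation happens."""
    rules = {}
    for ins in instructions:
        rules.setdefault(ins.topic, set()).add(ins.rule)
    return sorted(t for t, r in rules.items() if len(r) > 1)

def output_monitor(response_attrs, winners):
    """Review a drafted response; return topics that violate the winning rule."""
    return sorted(t for t, ins in winners.items()
                  if response_attrs.get(t) != ins.rule)

# Hypothetical context: a tool output tries to override the system prompt.
context = [
    Instruction("system", 3, "language", "english"),
    Instruction("tool", 1, "language", "french"),
    Instruction("user", 2, "format", "json"),
]
conflicts = input_monitor(context)
winners = resolve(context)
violations = output_monitor({"language": "french", "format": "json"}, winners)
```

Here the input monitor plays the role of conflict detection before generation, and the output monitor plays the role of response-level review; a repair step would regenerate or edit the response for each topic listed in `violations`.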

4/5 Artificial Intelligence (cs.AI), Multiagent Systems (cs.MA) 9 Jun 2026

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim

ArXiv ID: 2606.07866

Abstract:

Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that replaces the formal human-to-human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety-significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi-agent pilot. Against an 89M USD, 42-month Reconstructed Baseline, RCP cuts costs by 50-77 percent (21M-44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M-74M USD and 21 months. The residual cost-and-time gap is structural, not algorithmic: it traces to the inter-organizational pipeline that only an agent-to-agent standard can compress. The same bottleneck - formal multi-party review under strict auditability requirements - characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50-77 percent reduction implies savings on the order of 210-330 billion USD per year - approaching 1 percent of US GDP.

Insights

Contribution: Introduces the Regulatory Context Protocol (RCP), an agent-to-agent communication standard that reduces nuclear regulatory review costs and timelines by 50-77% and 65%, respectively, through structured automation while maintaining human oversight.

Core Idea: Replaces traditional human-to-human regulatory pipelines with auditable agent-to-agent protocols to compress inter-organizational review processes without sacrificing safety-critical human judgment.

Technique: Develops a structured, auditable agent communication framework validated via analysis of 1,236 nuclear regulatory documents and tested in a multi-agent pilot system.

Pipeline: Regulatory documents and requirements → RCP-enabled agent communication and automated analysis → Streamlined safety reviews with reduced cost/timelines

Methodology: Calibrated against U.S. NRC advanced reactor docket data and demonstrated via a working multi-agent pilot, comparing standalone agents vs. RCP-enabled systems against a reconstructed baseline.

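Results: Against the 89M USD, 42-month reconstructed baseline, RCP cuts costs by 50-77% (to 21M-44M USD) and timelines by 65% (to 15 months); standalone agents without a shared protocol reach only 54M-74M USD and 21 months, and the residual gap is attributed to the inter-organizational pipeline.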

Limitations: Adoption requires cross-organizational standardization; potential resistance to replacing formal human review processes; generalizability to other sectors requires further validation.
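
The percentages and dollar figures in the abstract can be cross-checked with simple arithmetic. The sketch below applies the stated reductions to the stated baseline; only the variable and function names are made up. It suggests that the 21M-44M USD and 15-month figures are the resulting cost and timeline rather than the amounts saved, since the computed values land close to them.

```python
# Figures quoted in the abstract.
BASELINE_COST_M = 89.0   # reconstructed baseline cost, million USD
BASELINE_MONTHS = 42.0   # reconstructed baseline timeline, months
BURDEN_B = 426.5         # US regulatory paperwork burden, billion USD per year

def after_reduction(baseline, fraction):
    """Value that remains after cutting `baseline` by `fraction`."""
    return baseline * (1.0 - fraction)

cost_high = after_reduction(BASELINE_COST_M, 0.50)  # about 44.5M USD
cost_low = after_reduction(BASELINE_COST_M, 0.77)   # about 20.5M USD
months = after_reduction(BASELINE_MONTHS, 0.65)     # about 14.7 months

# Economy-wide projection: the same 50-77% range applied to the annual burden.
savings_low_b = BURDEN_B * 0.50    # about 213B USD per year
savings_high_b = BURDEN_B * 0.77   # about 328B USD per year
```

The economy-wide values are consistent with the abstract's "210-330 billion USD per year" after rounding.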

4/5 Artificial Intelligence (cs.AI), Software Engineering (cs.SE) 9 Jun 2026

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

ArXiv ID: 2606.07904

Abstract:

Tool-augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not when the tool is causally appropriate or what task state it produces. Causal tool filtering addresses this gap by using lightweight contracts that specify each tool's preconditions, effects, risk level, and cost. However, manually writing and maintaining such contracts does not scale to large or changing tool ecosystems. We introduce Contract2Tool, a framework for inferring tool contracts from metadata, schemas, documentation, and execution traces. Contract2Tool converts observable tool evidence into normalized symbolic contracts that can be evaluated intrinsically and deployed inside downstream causal tool filtering. We evaluate learned contracts against gold preconditions, effects, and risk labels, and measure their downstream utility on multi-step agent tasks. Our results show that hybrid documentation-and-trace evidence produces contracts accurate enough to preserve most of the reliability and efficiency benefits of gold contracts. Learned-contract CMTF achieves 0.980 downstream success, close to 0.990 for gold-contract CMTF, while reducing visible tools from 100 to 1 and reducing average token usage from 26,172 to 2,528 relative to all-tools exposure. These results suggest that learned contracts can provide a scalable contract layer between tool schemas and reliable agent execution.

Insights

Contribution: Contract2Tool introduces a scalable framework for automatically inferring tool contracts (preconditions, effects, risk, cost) from metadata, schemas, documentation, and execution traces, enabling reliable causal tool filtering in LLM agents.

Core Idea: By learning symbolic contracts from observable tool evidence, Contract2Tool bridges the gap between tool schemas and reliable agent execution, reducing reliance on manually curated contracts.

Technique: The framework normalizes heterogeneous tool evidence (metadata, documentation, traces) into symbolic contracts, which are then evaluated intrinsically and deployed for causal tool filtering in agent tasks.

Pipeline: metadata/schemas/documentation/execution traces → Contract2Tool inference → normalized symbolic contracts → causal tool filtering in agent tasks

Methodology: Learned contracts were evaluated against gold-standard labels for accuracy and tested on multi-step agent tasks to measure downstream success, tool reduction, and token efficiency.

Results: Learned-contract CMTF achieved 0.980 downstream success (vs. 0.990 for gold contracts), reduced visible tools from 100 to 1, and cut average token usage from 26,172 to 2,528.

Limitations: Performance depends on quality of input evidence; learned contracts may struggle with ambiguous or incomplete tool documentation/traces.
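
To make the contract idea concrete, here is a minimal sketch of filtering tools against a task state. The four contract fields mirror what the abstract lists (preconditions, effects, risk, cost); the `ToolContract` class, the `filter_tools` helper, the fact names, and the example tools are all hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract record; field names are assumptions for illustration."""
    name: str
    preconditions: frozenset = field(default_factory=frozenset)  # facts that must already hold
    effects: frozenset = field(default_factory=frozenset)        # facts the tool produces
    risk: int = 0                                                # 0 = low, higher = riskier
    cost: float = 1.0


def filter_tools(contracts, state, goal, max_risk=1):
    """Keep tools whose preconditions hold in `state`, whose effects supply a
    goal fact that is still missing, and whose risk is within `max_risk`.
    Results are ordered cheapest first."""
    missing = goal - state
    usable = [
        c for c in contracts
        if c.preconditions <= state and c.effects & missing and c.risk <= max_risk
    ]
    return sorted(usable, key=lambda c: c.cost)


# Hypothetical three-tool ecosystem.
contracts = [
    ToolContract("search_flights", frozenset({"has_dates"}), frozenset({"has_flight_options"})),
    ToolContract("book_flight", frozenset({"has_flight_options"}), frozenset({"booked"}), risk=1),
    ToolContract("cancel_booking", frozenset({"booked"}), frozenset({"cancelled"}), risk=2),
]

state = frozenset({"has_dates"})
goal = frozenset({"has_flight_options", "booked"})
visible = [c.name for c in filter_tools(contracts, state, goal)]
print(visible)
```

Only `search_flights` is exposed in this state, because `book_flight` still lacks its precondition. This is the sense in which filtering can shrink the visible tool set at each step.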

PDF

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 9 Jun 2026

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

ArXiv ID: 2606.07909

Abstract:

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.

Insights

Contribution: MemToolAgent introduces a memory-driven framework to enhance LLM agents' tool-use capabilities through structured memory management, achieving significant improvements on benchmark tasks without requiring LLM fine-tuning.

Core Idea: The framework leverages memory mechanisms to store and retrieve past user-agent interactions, enabling personalized tool use and iterative learning from feedback through reflection-based memory updates.

Technique: MemToolAgent employs a memory extraction module to create structured entries from past experiences and a retrieval module that dynamically selects relevant memories based on similarity distributions for task execution.

Pipeline: User query/input → memory retrieval and reflection → tool execution with feedback → updated memory storage

Methodology: The approach uses environment and user feedback to distill errors into critiques, storing them as structured memories, and dynamically selects memory entries during task execution to guide tool use.

Results: 29% relative improvement on WorkBench, 80% on NESTFUL, and 17% on PEToolBench compared to strong baselines.

Limitations: Performance depends on memory quality and relevance, and the framework may struggle with highly dynamic or contextually diverse tasks requiring real-time adaptation.
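
The retrieval step, which picks a variable number of memories from the similarity distribution, could look roughly like the sketch below. The abstract does not state the selection rule, so the "mean plus one standard deviation" threshold, the `adaptive_retrieve` name, and the toy two-dimensional embeddings are assumptions made purely for illustration.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adaptive_retrieve(query_vec, memories, max_k=5):
    """memories: list of (text, vector) pairs.

    Keeps entries whose similarity exceeds mean + 1 std of all similarities
    (an assumed rule, not the paper's), capped at max_k. Falls back to the
    single best match so the agent always gets some context.
    """
    if not memories:
        return []
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in memories), reverse=True)
    sims = [s for s, _ in scored]
    threshold = statistics.mean(sims) + statistics.pstdev(sims)
    selected = [text for s, text in scored if s >= threshold][:max_k]
    return selected or [scored[0][1]]


# Hypothetical memory store with toy embeddings.
memories = [
    ("Booking times must use 24-hour HH:MM format", [1.0, 0.0]),
    ("User prefers window seats", [0.0, 1.0]),
    ("Weather tool needs a city name", [0.1, 0.9]),
]
selected = adaptive_retrieve([1.0, 0.1], memories)
print(selected)
```

Because one memory stands far above the rest of the distribution, only that entry is returned. A flatter distribution would instead fall back to the single best match.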

PDF

Computer Vision

4/5 Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG) 9 Jun 2026

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

ArXiv ID: 2606.07780

Abstract:

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.

Insights

Contribution: This study establishes environment-dependent detection boundaries for satellite-based flood mapping by analyzing how land cover and flood type influence model performance across diverse global events.

Core Idea: Satellite flood mapping accuracy is governed by interactions between land cover types and flood mechanisms, with significant variability in detection limits across different geographic and climatic contexts.

Technique: Prithvi-EO-2.0, a geospatial foundation model pretrained on satellite archives, was validated against two independent reference products across 19 out-of-distribution flood events spanning six continents and six flood mechanisms.

Pipeline: Flood event data → Prithvi-EO-2.0 inundation mapping → dual-reference validation → quantitative accuracy assessment (IoU, F1) and failure mode analysis

Methodology: The model was tested on 19 diverse flood events (2017-2025) with validation against two reference datasets, while iteratively analyzing pipeline components to identify error sources and environmental dependencies.

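Results: Cropland achieved the highest agreement (IoU=52%) and riverine floods the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation showed that apparent model error partly reflects definitional inconsistency between reference products. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity.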

Limitations: Findings are constrained to the specific model and reference datasets used; generalizability to other regions or flood types remains untested, and definitional inconsistencies between reference products may persist in operational applications.
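
For readers less familiar with the two reported metrics, the sketch below computes IoU and F1 for a binary flood mask against a reference mask. The masks are made-up toy data and the function names are illustrative; neither comes from the study.

```python
def confusion_counts(pred, ref):
    """Count true positives, false positives, and false negatives for binary masks."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    return tp, fp, fn


def iou(pred, ref):
    """Intersection over union of the flooded class."""
    tp, fp, fn = confusion_counts(pred, ref)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def f1(pred, ref):
    """Harmonic mean of precision and recall for the flooded class."""
    tp, fp, fn = confusion_counts(pred, ref)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Toy flattened masks: 1 = flooded pixel, 0 = dry pixel.
predicted = [1, 1, 1, 0, 0, 0]
reference = [1, 1, 0, 1, 0, 0]
print(iou(predicted, reference), f1(predicted, reference))
```

For the same mask pair, F1 is never lower than IoU, since F1 = 2·IoU / (1 + IoU). That is worth keeping in mind when comparing the IoU and F1 figures quoted above.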

PDF

Computing Systems

3/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 9 Jun 2026

Scaling Participation in Modular AI Systems

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

ArXiv ID: 2606.07812

Abstract:

Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

Insights

Contribution: Introduces 'scaling participation' as a paradigm for building modular AI systems through diverse stakeholder contributions, achieving superior performance over monolithic LLMs.

Core Idea: Decentralized AI development where participants train specialized models on personal priorities, which are then combined in modular frameworks to create collaborative, emergent capabilities.

Technique: Leverages contributor-diverse small models trained on individual interests, integrated via modular composition to solve complex tasks through collective reasoning.

Pipeline: Individual contributor models → modular framework integration → collaborative AI system with emergent capabilities

Methodology: Evaluated participatory systems against monolithic LLMs across 15 tasks, measuring performance gains, diversity benefits, and emergent problem-solving abilities.

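Results: Up to 15.4% improvement over monolithic LLMs across 15 tasks, surpassing models larger than all contributed components combined; the systems solve over 15% of the problems on which every individual model fails, and they benefit from contributor diversity.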

Limitations: Scalability challenges in coordinating diverse contributors; potential quality control issues with user-generated models; dependency on effective modular integration mechanisms.

PDF

General

4/5 Artificial Intelligence (cs.AI) 9 Jun 2026

Improving Multimodal Reasoning via Worst Dimension Optimization

Haocheng Lv, Huaping Zhang, Qiuchi Li, Lei Li, Chunxiao Gao

ArXiv ID: 2606.07801

Abstract:

Multimodal reasoning requires a path that retains integrity over a wide range of constraints, from visual grounding to logic consistency. However, the current Process Reward Models focus on heuristically defined rewards that equally weigh these factors, which may lead to the concealment of individual dimension failures by the dominating factors, without guaranteeing the validity of the reasoning process in general.

Insights

Contribution: This paper introduces Worst Dimension Optimization (WDO), a novel approach to enhance multimodal reasoning by addressing individual dimension failures that are masked by dominant factors in existing reward models.

Core Idea: Instead of equally weighting multimodal constraints, WDO prioritizes optimization of the worst-performing dimensions to ensure robustness and validity across all aspects of reasoning.

Technique: The method identifies underperforming dimensions through systematic evaluation and adjusts the reward mechanism to emphasize their improvement during training.

Pipeline: Multimodal input → identify worst-performing dimensions → optimize via adjusted reward model → enhanced multimodal reasoning output

Methodology: The authors propose a modified reward framework that dynamically focuses on the weakest dimensions, validated through experiments on multimodal reasoning tasks with quantitative analysis.

Results: Experiments show improved reasoning accuracy and reduced error rates in tasks requiring cross-modal consistency, with specific gains in previously underperforming dimensions.

Limitations: The approach may increase computational overhead due to dynamic dimension evaluation, and its effectiveness depends on accurate initial dimension performance assessment.
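
The masking problem described above can be shown with a toy comparison between an equally weighted reward and a worst-dimension reward. The abstract is truncated in this digest, so the paper's actual objective is not known here. Taking the minimum over dimensions is only one assumed way to realize the idea, and the dimension names and scores are hypothetical.

```python
def mean_reward(scores):
    """Equally weighted aggregate: a strong dimension can hide a weak one."""
    return sum(scores.values()) / len(scores)


def worst_dimension_reward(scores):
    """Return the weakest dimension and its score (assumed min-aggregation)."""
    dim = min(scores, key=scores.get)
    return dim, scores[dim]


# Hypothetical per-dimension scores for one reasoning step.
step_scores = {
    "visual_grounding": 0.2,
    "logical_consistency": 0.9,
    "fluency": 1.0,
}

average = mean_reward(step_scores)
weakest, floor = worst_dimension_reward(step_scores)
print(f"mean={average:.2f}, worst={weakest} ({floor})")
```

The mean of roughly 0.70 looks acceptable even though visual grounding has nearly failed. The worst-dimension view surfaces that failure directly.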

PDF

LLM

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG) 9 Jun 2026

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

Mujtaba Farhan, Maheep Chaudhary

ArXiv ID: 2606.07720

Abstract:

Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to reason in latent space, exploring multiple reasoning paths simultaneously rather than committing to a single chain early on. However, we identify a limitation we term the concept bottleneck: at each reasoning pass, intermediate hidden states are overwritten, causing the model to lose critical facts computed in earlier steps as reasoning depth increases. We observe this empirically. On HotpotQA, vanilla CoCoNuT (10.4% EM) fails to improve over the CoT baseline (11.0% EM), and performance degrades with curriculum depth on GSM8K. To address this, we propose AGCLR (Adaptive Gated Continuous Latent Reasoning), which augments CoCoNuT with a Gated Concept Stream, a persistent residual memory maintained across all reasoning passes and controlled by three learned gates: a write gate that commits intermediate facts to memory, a read gate that retrieves relevant prior states, and a forget gate that prunes irrelevant context. Evaluated on GSM8K, HotpotQA, and ProsQA using GPT-2 as our base model, AGCLR achieves consistent improvements across all types of datasets, with the performance gap compounding as curriculum depth increases, directly resolving the concept bottleneck. Code available at https://anonymous.4open.science/r/JJJJ/README.md

Insights

Contribution: Introduces AGCLR, a method to address the concept bottleneck in CoCoNuT models by maintaining persistent memory across reasoning steps using gated mechanisms.

Core Idea: Persistent memory via a Gated Concept Stream allows models to retain and reuse intermediate facts across reasoning passes, improving performance on complex tasks.

Technique: AGCLR employs three learned gates (write, read, forget) to manage a residual memory stream, preserving critical information while pruning irrelevant context.

Pipeline: Input question → process through adaptive gated memory stream → output answer with retained contextual facts

Methodology: Evaluated AGCLR on GSM8K, HotpotQA, and ProsQA using GPT-2, comparing performance against baseline CoCoNuT with varying curriculum depths.

Results: On HotpotQA, vanilla CoCoNuT (10.4% EM) did not improve over the CoT baseline (11.0% EM); AGCLR reports consistent improvements on GSM8K, HotpotQA, and ProsQA, with gains compounding as curriculum depth increases.

Limitations: Depends on base model capacity (GPT-2 used), potential scalability challenges with longer reasoning chains, and requires careful gate calibration.
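
A rough sketch of how three gates might maintain a persistent memory across reasoning passes is given below. The abstract names the gates but not their equations, so the LSTM-style update used here is an assumption: the forget gate scales the old memory, the write gate scales the new hidden state, and the read gate scales what is fed back. The fixed gate logits stand in for values the paper learns.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_step(memory, hidden, write_logit, read_logit, forget_logit):
    """One reasoning pass with a persistent memory (assumed LSTM-style form).

    forget gate near 1 keeps old memory, near 0 prunes it;
    write gate controls how much of the new hidden state is committed;
    read gate controls how much memory is added back to the hidden state.
    """
    w, r, f = sigmoid(write_logit), sigmoid(read_logit), sigmoid(forget_logit)
    new_memory = [f * m + w * h for m, h in zip(memory, hidden)]
    readout = [h + r * m for h, m in zip(hidden, new_memory)]
    return new_memory, readout


# Two passes: the first computes a fact in slot 0, the second in slot 1.
memory = [0.0, 0.0]
for hidden in ([1.0, 0.0], [0.0, 2.0]):
    memory, readout = gated_step(memory, hidden,
                                 write_logit=10.0, read_logit=10.0, forget_logit=10.0)

print(memory, readout)
```

Without the memory, the second pass would see only `[0.0, 2.0]` and the slot-0 fact would be overwritten. With the gates open, the readout still carries it.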

PDF

4/5 Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Sound (cs.SD), Audio and Speech Processing (eess.AS) 9 Jun 2026

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

ArXiv ID: 2606.07577

Abstract:

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.

Insights

Contribution: OmniMem introduces a modality-aware memory compression framework for audio-visual LLMs, addressing token imbalance and enabling efficient long-video inference without sacrificing accuracy.

Core Idea: OmniMem separately manages visual and audio contexts through modality-aware allocation and retains informative KV states via perturbation-aware selection, combined with budget-aware fine-tuning for deployment efficiency.

Technique: The framework employs modality-specific memory allocation, perturbation-aware KV state selection, and budget-aware fine-tuning to compress memory while preserving long-range understanding in streaming scenarios.

Pipeline: Video/audio stream input → modality-aware memory allocation (separate visual and audio contexts) → perturbation-aware KV selection → compact memory representation with maintained long-range context; budget-aware fine-tuning is an optional training-time step.

Methodology: Experiments on VideoMME Long, LVBench, and LVOmniBench using video-SALMONN 2+ and Qwen-2.5-Omni evaluate compression efficacy under memory budgets, comparing against training-free baselines with and without fine-tuning.

Results: OmniMem achieves 2-4% absolute accuracy improvements over baselines under same memory budgets, with an additional 1-2% gain after budget-aware fine-tuning on long-video understanding benchmarks.

Limitations: Effectiveness may depend on modality-specific perturbation patterns, and deployment constraints for real-time streaming require further optimization.
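
The modality-aware allocation can be pictured as splitting a fixed KV budget between visual and audio entries and keeping the highest-scoring entries within each share. In the sketch below the per-entry `score` is a placeholder for the paper's perturbation-aware importance, which the abstract does not specify. The 75/25 split and all values are hypothetical.

```python
def allocate_budget(total, visual_share=0.75):
    """Split a total KV budget between modalities (the share is an assumed knob)."""
    visual = int(total * visual_share)
    return {"visual": visual, "audio": total - visual}


def compress(entries, budget):
    """entries: list of (position, modality, score) tuples.

    Keeps the top-scoring entries per modality within its budget, then
    restores temporal order by position.
    """
    kept = []
    for modality, k in budget.items():
        pool = sorted((e for e in entries if e[1] == modality),
                      key=lambda e: e[2], reverse=True)
        kept.extend(pool[:k])
    return sorted(kept)


# Toy stream: visual tokens heavily outnumber audio tokens.
entries = [
    (0, "visual", 0.9), (1, "visual", 0.1), (2, "audio", 0.8), (3, "visual", 0.5),
    (4, "visual", 0.7), (5, "audio", 0.2), (6, "visual", 0.3),
]
kept_positions = [pos for pos, _, _ in compress(entries, allocate_budget(4))]
print(kept_positions)
```

Under a single shared budget, the more numerous visual entries could crowd out audio entirely. Reserving a per-modality share guarantees audio keeps at least some of its context.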

PDF

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 9 Jun 2026

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha

ArXiv ID: 2606.07819

Abstract:

Recently, the efficiency of Large Language Models (LLMs) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding sub-optimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.

Insights

Contribution: This paper introduces an end-to-end framework for LLM compression that jointly optimizes structural pruning and mixed-precision quantization to minimize global error propagation and achieve superior performance at ultra-low bit precisions.

Core Idea: The framework addresses suboptimal solutions from isolated pruning/quantization by directly minimizing global error propagation and unifying the search space for joint optimization of pruning decisions and quantization policies.

Technique: A mixed-precision post-training quantization strategy is combined with a joint optimization approach that learns structural pruning and quantization policies simultaneously within a unified search space.

Pipeline: Model → joint structural pruning and mixed-precision quantization optimization → compressed model with reduced memory footprint and latency

Methodology: Experiments evaluate perplexity on WikiText and C4 datasets, comparing against state-of-the-art quantization and joint pruning-quantization baselines at ultra-low bit precisions (1-3 bits).

Results: At ultra-low precisions (1-3 bits), reduces WikiText perplexity by up to 21% vs. SoTA weight-activation quantization baselines, achieves up to 59% and 85% lower perplexity than leading weight-only quantization methods on WikiText and C4, respectively, and delivers better perplexity and reasoning performance than SoTA joint pruning-and-quantization techniques.

Limitations: The method's effectiveness may depend on specific model architectures, and the computational cost of joint optimization could limit scalability to extremely large models.
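
To show why ultra-low bit widths are hard, the sketch below applies plain min-max uniform quantization to a toy weight vector at 1, 2, and 3 bits and reports the reconstruction error. This is a textbook baseline, not the paper's mixed-precision method, and the weights are invented for illustration.

```python
def quantize(weights, bits):
    """Min-max uniform quantization to 2**bits levels, returned in float form."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / scale) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights for a single layer.
weights = [-1.0, -0.4, 0.1, 0.3, 0.9, 1.0]
errors = {bits: mse(weights, quantize(weights, bits)) for bits in (1, 2, 3)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.4f}")
```

A mixed-precision policy assigns a bit width like this to each layer. The abstract's point is that choosing those widths by per-layer error alone ignores how the errors compound through the network.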

PDF

NLP

4/5 Artificial Intelligence (cs.AI) 9 Jun 2026

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

Kaouther Mouheb, Amos Pomp, Antoine Manenti, Romy de Haan, Farog Faghir, Joy Martens, Harro Seelaar, Francesco Mattace-Raso, Meike W. Vernooij, Frank J. Wolters, Stefan Klein, Esther E. Bron

ArXiv ID: 2606.07721

Abstract:

Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports. Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports. Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection. Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.

Insights

Contribution: This study demonstrates the effectiveness of an open-weight large language model (LLM) for extracting structured information from Dutch brain MRI reports, addressing the scarcity of prior LLM evaluations on Dutch neuroradiology reports.

Core Idea: LLMs like LLaMA 3.1 can accurately extract structured data from clinical reports, with performance varying by variable type and language, while few-shot prompting improves numerical variable extraction.

Technique: Open-weight LLaMA 3.1 was prompted, without fine-tuning, in zero-shot and few-shot settings with different example selection strategies to extract structured data from Dutch brain MRI reports, and was evaluated against human-annotated ground truth.

Pipeline: Brain MRI report (text input) → LLaMA 3.1 processing with few-shot prompts → Structured data extraction (categorical variables, counts, free-text fields).

Methodology: 947 Dutch MRI reports were annotated by medical students, with inter-rater reliability assessed. LLaMA 3.1 was evaluated in zero-shot and few-shot settings across 10 random splits, comparing Dutch and English translations.

Results: Zero-shot performance: 87-96% for visual rating scores, 93% for microbleed mentions, 82% for infarct mentions, and text similarity of 0.95 for lesion location. Zero-shot counts were weaker (80% for microbleeds, 66% for infarcts); few-shot prompting with structural similarity-based selection raised them to 92% and 81%, respectively.

Limitations: Challenges remain in extracting location-specific variables and numerical counts, requiring further optimization of prompting strategies and larger annotated datasets.
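
The study reports balanced accuracy for categorical variables and mean absolute error for counts. A minimal sketch of both metrics on made-up labels follows; the example values are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, used here for count variables."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ordinal ratings: class 0 dominates, class 2 is rare and missed.
true_rating = [0, 0, 0, 0, 1, 2]
pred_rating = [0, 0, 0, 0, 1, 1]
bal_acc = balanced_accuracy(true_rating, pred_rating)
plain_acc = sum(t == p for t, p in zip(true_rating, pred_rating)) / len(true_rating)

# Toy lesion counts.
count_mae = mean_absolute_error([0, 2, 5], [0, 3, 5])
print(bal_acc, plain_acc, count_mae)
```

Plain accuracy here is 5/6, while balanced accuracy drops to 2/3 because the rare class is missed entirely. That is why balanced accuracy suits imbalanced clinical ratings.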

PDF

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Neurons and Cognition (q-bio.NC) 9 Jun 2026

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy

ArXiv ID: 2606.07798

Abstract:

Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, in current research, quantifying predictive uncertainty remains underexplored and relies on costly modalities such as MRI, PET, and CSF, limiting their deployment in resource-limited settings. In this research, our primary objectives are: First, bidirectional prediction of cognitive scores from irregular visits to present the complete disease trajectory. Second, to enable interpolation and extrapolation capabilities to assist clinicians in informed prognostic decision making, and third, to provide a well-calibrated uncertainty estimate for all predictions, and finally, to achieve the objectives using the modalities available during routine visits. We propose a unified framework, GNOVA: A GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE Scores. The GRU encoder allows for any number of inputs at any time point. The Neural-ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The Variational autoencoder allows for uncertainty estimation in predictions. We worked with 1,727 patients from the ADNI dataset over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

Insights

Contribution: The paper introduces GNOVA, a framework for reconstructing and forecasting Alzheimer's disease trajectories using routine clinical data, enabling uncertainty quantification without costly neuroimaging modalities.

Core Idea: GNOVA combines bidirectional prediction of cognitive scores with uncertainty estimation, leveraging routine data to reconstruct incomplete patient histories and forecast future states.

Technique: A GRU-Neural ODE Variational Autoencoder (GNOVA) integrates irregular visit data through a GRU encoder, continuous estimation via Neural ODEs, and uncertainty quantification using variational inference.

Pipeline: Routine-visit patient data (e.g., age, BMI, APOE4 status) → GRU encoder processes irregular time points → Neural ODE decoder generates continuous cognitive score trajectories (CDR-SB, MMSE) with uncertainty estimates → Clinicians use outputs for prognostic decisions.

Methodology: The framework was trained on 1,727 ADNI patients over 10 years, validated via mean absolute error (MAE) metrics and feature-ablation studies to identify key predictors like age, BMI, and APOE4 status.

Results: Achieved MAEs of 1.35 (CDR-SB) and 2.28 (MMSE) without neuroimaging; age, BMI, and APOE4 status emerged as strong predictors through ablation studies.

Limitations: Relies on ADNI dataset generalizability; extrapolation beyond 10-year timeframe untested; uncertainty calibration requires further validation in diverse clinical settings.
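
The appeal of an ODE-based decoder is that a trajectory can be evaluated at any time point, before or after the observed visits. The sketch below integrates a one-dimensional state with fixed-step Euler in either direction. The constant rate of 0.5 points per year is a hypothetical stand-in for the learned dynamics, and the encoder and variational parts of GNOVA are not modelled here.

```python
def integrate(z0, dynamics, t_target, dt=0.01):
    """Euler-integrate dz/dt = dynamics(z) from t=0 to t_target.

    A negative t_target integrates backward in time, which corresponds to
    reconstructing an earlier state.
    """
    n_steps = max(1, round(abs(t_target) / dt))
    h = t_target / n_steps  # signed step size
    z = z0
    for _ in range(n_steps):
        z += h * dynamics(z)
    return z


def rate(z):
    """Hypothetical dynamics: the score worsens by 0.5 points per year."""
    return 0.5


score_now = 2.0  # assumed score at the reference visit (t = 0)
past = integrate(score_now, rate, -1.0)    # one year before the visit
future = integrate(score_now, rate, 3.0)   # three years after the visit
print(past, future)
```

With learned, state-dependent dynamics the same call pattern would serve both interpolation between visits and extrapolation beyond them.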

PDF

3/5 Artificial Intelligence (cs.AI) 9 Jun 2026

EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification

Da Li, Xinxin Li, Xingyu Cui, Jin Xu, Juan Zhang, Junping Yin

ArXiv ID: 2606.07915

Abstract:

Neural symbolic regression models improve inference efficiency by shifting structural search to pretraining, but their one-pass autoregressive decoding is prone to error accumulation, which may lead to generating structurally incorrect expressions, especially in complex expression generation scenarios. Existing rectification strategies can alleviate this issue, but they often depend on restarting global search, thereby weakening the efficiency advantage of neural models, and remain susceptible to error accumulation. In this paper, we propose EditSR, a two-layer framework that combines a neural symbolic regression model in the first layer with an edit-based Rectifier in the second layer to achieve efficient prediction and post-hoc rectification. Instead of restarting the global search, we maintain rectification efficiency by pretraining the Rectifier. Specifically, we formulate the rectification process as a step-by-step state-transition chain starting from an incorrect expression, and develop a state-transition algorithm to construct supervised rectification chains for training the Rectifier. To ensure syntactic validity throughout rectification, each edit action is restricted to a syntactically valid space so that every edited expression remains parseable. In addition, because each edit decision is conditioned on the current state rather than the history, the Rectifier allows errors made in earlier steps to be rectified by subsequent edits, thereby reducing the risk of error accumulation. Extensive experiments and ablation studies show that EditSR substantially improves symbolic structure recovery with limited extra cost, with more pronounced gains on complex expressions, where one-pass autoregressive decoding is more susceptible to error accumulation.

Insights

Contribution: EditSR introduces a two-layer framework combining neural symbolic regression with an edit-based rectifier to efficiently correct structural errors without restarting global search, improving accuracy on complex expressions.

Core Idea: By pretraining an edit-based Rectifier to iteratively refine incorrect expressions through state-transition chains, EditSR mitigates error accumulation while maintaining the efficiency of neural models.

Technique: The Rectifier operates via syntactically constrained state transitions, so every edited expression remains parseable. Each edit is conditioned on the current state rather than on the edit history, which lets later edits correct mistakes made in earlier steps.
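The state-conditioned edit loop described above can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in and not the paper's implementation: expressions are prefix token lists over a toy vocabulary (`ARITY`), the edit space is simplified to same-arity token substitution (which keeps a prefix expression well-formed), and `make_oracle_policy` replaces the pretrained Rectifier with a toy policy that already knows the target and assumes it shares the input's skeleton.

```python
# Hypothetical toy vocabulary: token -> arity (an assumption, not from the paper).
ARITY = {"add": 2, "mul": 2, "sin": 1, "cos": 1, "x": 0, "c": 0}


def is_parseable(tokens):
    """Return True if `tokens` is exactly one well-formed prefix expression."""
    if not tokens:
        return False
    needed = 1  # subexpressions still expected
    for i, tok in enumerate(tokens):
        if tok not in ARITY:
            return False
        needed += ARITY[tok] - 1
        if needed == 0:
            # The expression is complete; it must end exactly here.
            return i == len(tokens) - 1
    return False


def valid_edits(tokens):
    """Enumerate (position, new_token) substitutions that preserve arity."""
    return [
        (pos, new)
        for pos, tok in enumerate(tokens)
        for new in ARITY
        if new != tok and ARITY[new] == ARITY[tok]
    ]


def apply_edit(tokens, edit):
    """Return a new token list with one substitution applied."""
    pos, new = edit
    return tokens[:pos] + [new] + tokens[pos + 1:]


def rectify(tokens, policy, max_steps=10):
    """Apply edits one at a time; the policy sees only the current state."""
    state = list(tokens)
    for _ in range(max_steps):
        edit = policy(state)  # no edit history is passed in
        if edit is None:      # the policy decides to stop
            break
        assert edit in valid_edits(state), "edit must stay in the valid space"
        state = apply_edit(state, edit)
    return state


def make_oracle_policy(target):
    """Toy stand-in for a pretrained Rectifier: fix the first mismatch."""
    def policy(state):
        for pos, (have, want) in enumerate(zip(state, target)):
            if have != want:
                return (pos, want)
        return None
    return policy


wrong = ["add", "sin", "x", "c"]   # add(sin(x), c)
target = ["mul", "cos", "x", "c"]  # mul(cos(x), c)
print(rectify(wrong, make_oracle_policy(target)))
```

Because `policy` receives only `state`, an unhelpful edit at one step does not lock in later decisions; the next call simply sees the resulting expression and can overwrite it.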

Pipeline: Input data → Neural symbolic regression model generates initial expression → Edit-based Rectifier applies stepwise syntactically valid edits → Final corrected symbolic expression

Methodology: A state-transition algorithm constructs supervised rectification chains, each starting from an incorrect expression, and the Rectifier is pretrained on these chains. Every edit action is restricted to a syntactically valid space, so each intermediate expression remains parseable.
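One way to picture a supervised rectification chain is as a list of (state, action) pairs that lead from an incorrect expression to the correct one. The sketch below is an illustration under stated assumptions, not the paper's state-transition algorithm: `corrupt` and `build_chain` are hypothetical helpers, expressions are prefix token lists over a toy vocabulary, and the only edit type is same-arity token substitution.

```python
import random

# Hypothetical toy vocabulary: token -> arity (an assumption, not from the paper).
ARITY = {"add": 2, "mul": 2, "sin": 1, "cos": 1, "x": 0, "c": 0}


def corrupt(target, n_edits, rng):
    """Make an incorrect expression by substituting tokens at distinct positions.

    Same-arity substitution keeps the corrupted expression parseable.
    """
    state = list(target)
    for pos in rng.sample(range(len(state)), n_edits):
        choices = [
            tok for tok in ARITY
            if tok != state[pos] and ARITY[tok] == ARITY[state[pos]]
        ]
        state[pos] = rng.choice(choices)
    return state


def build_chain(wrong, target):
    """Return [(state, action), ...] leading from `wrong` to `target`.

    Each action is a (position, new_token) substitution; the last pair
    carries the action "STOP". Each pair is a training example whose input
    is the current state only.
    """
    chain = []
    state = list(wrong)
    for pos, want in enumerate(target):
        if state[pos] != want:
            chain.append((tuple(state), (pos, want)))
            state[pos] = want
    chain.append((tuple(state), "STOP"))
    return chain


rng = random.Random(0)             # seeded for repeatability
target = ["add", "sin", "x", "c"]  # add(sin(x), c)
wrong = corrupt(target, 2, rng)
for state, action in build_chain(wrong, target):
    print(state, "->", action)
```

Storing each pair independently mirrors the idea that the edit decision depends on the current expression rather than on how it was reached.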

Results: EditSR achieves substantial improvements in symbolic structure recovery, with more pronounced gains on complex expressions, while incurring limited additional computational cost.

Limitations: Performance may depend on pretraining data diversity; handling extremely novel expression structures remains an open challenge.

Tech News

Computer Vision

Reddit r/ArtificialIntelligence 2026-06-09

Nature is losing to AI even on Google Images

A Reddit user shares a screenshot suggesting AI-generated content is outcompeting natural imagery on Google Images, sparking discussion about AI's impact on visual search results and digital ecosystems.

Reddit r/ArtificialIntelligence 2026-06-09

Engineer builds AI laser defense system that wiped out every mosquito in his home

An engineer developed an AI-powered laser system to eliminate mosquitoes in his home, combining computer vision for detection with automated targeting. The project highlights creative applications of AI in pest control.

Computing Systems

Hacker News 2026-06-09

Porting the ThinkPad X61 to Coreboot

This article details efforts to port Coreboot firmware to the ThinkPad X61 laptop, focusing on open-source firmware development and hardware compatibility. The project highlights challenges in replacing proprietary BIOS with open-source alternatives.

General

Reddit r/ArtificialIntelligence 2026-06-08

OpenAI says it has confidentially filed for an IPO

OpenAI has confidentially filed for an IPO, though it has not yet decided on the timing, citing plans that may be easier to execute as a private company. The announcement follows Anthropic's S-1 filing and precedes SpaceX's potential public debut.

LLM

Reddit r/ArtificialIntelligence 2026-06-09

Anyone here built a gpt on Chatgpt?

A Reddit user asks for advice on building a custom GPT in ChatGPT, noting that their previous attempts produced similar results, and requests guidance on more effective methods.

Reddit r/ArtificialIntelligence 2026-06-09

Great way to Learn while using ChatGPT

A Reddit user shares how they use ChatGPT to simplify complex math and statistics concepts by requesting explanations pitched at a high school level, which they find helpful despite having an advanced academic background. The approach emphasizes building foundational knowledge before progressing to more complex questions.

Reddit r/ArtificialIntelligence 2026-06-09

Watch These Judges Rip Into Lawyers For Citing Cases That Don't Exist

Judges criticize lawyers for citing fabricated legal cases, highlighting potential issues with AI-generated content in legal contexts. The discussion underscores concerns about hallucinations in large language models (LLMs) used for legal research.

Reddit r/ArtificialIntelligence 2026-06-08

Trolling AI for no reason

A Reddit user discusses their habit of trolling AI models like Claude after using them for serious tasks, sharing examples of absurd prompts such as requesting baby talk versions of reports or unconventional marketing ideas. They reflect on whether their need for entertainment stems from a lack of stimulation in their daily life.

NLP

Reddit r/ArtificialIntelligence 2026-06-08

What AI tool do you trust for what task?

Users discuss the strengths and weaknesses of various AI tools for specific tasks like writing, research, coding, and brainstorming, highlighting preferences and limitations in AI tool usage.

Trending repositories on GitHub filtered and scored for relevance to your interests.
