Friday, June 19, 2026

Daily Digest 2026-06-19

Today’s research and news focus heavily on the governance, reliability, and architectural refinement of agentic systems, specifically addressing how to manage uncertainty and ensure alignment in complex workflows.

Research highlights:

Agentic Governance and Alignment: Research explores deontic policies for runtime governance, emergent alignment properties, and the identification of hidden anchors in multi-agent deliberations.
LLM Reliability and Evaluation: New methods focus on detecting epistemic blind spots in clinical data, analyzing narration gaps in solver loops, and moving toward predictive validity in agent evaluation.
Agentic RAG and Tool Use: Studies examine the practical limitations of agentic RAG in clinical settings and techniques for uncertainty decomposition to improve clarification-seeking behaviors.
Architectural Innovations: Developments include a learnable integral transform subsuming convolution and attention, experimental analyses of diffusion language models, and dual-purpose tokenizers for 3D medical imaging.
Specialized Applications: New systems demonstrate agentic applications in DeFi risk supervision and differentiable phenotypic grouping for Alzheimer’s disease risk modeling.

Tech buzz:

The community is actively exploring the practical deployment of new model architectures and open-source implementations for specialized financial and vision tasks.
Model Releases: New updates include insights into the Gemma 4 architecture and the release of GLM-5.
Open Source Tools: New tools for hosting custom HTML applications within Datasette and open-source banking event-sequence models were highlighted.


Global Trends

Papers discovered from ArXiv subject categories

AI Safety

5/5 Artificial Intelligence (cs.AI) 19 Jun 2026

Emergent Alignment

Martin Kolář

ArXiv ID: 2606.19527

Abstract:

Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs. The result is an online technique to align models in a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning. It does not require a weaker or stronger judge, relying instead on a frozen copy of itself. In previous work, the Emergent Misalignment scenario showed a range of emergent unethical behaviors from fine-tuning the model to hack code. Instead, we empirically show how to achieve Emergent Alignment: a single high-level introspective question steers training toward an ethical model under the same code hacking scenario.

Insights

Contribution: The paper introduces 'Emergent Alignment,' a method where LLMs can self-correct unethical outputs by using a conscience step and a self-referential alignment loss. It demonstrates that a single high-level introspective question can steer a model toward ethical behavior during complex tasks like code hacking.

Core Idea: LLMs can be trained to recognize and correct their own misalignments with human ethics by reviewing their own reasoning through an internal conscience mechanism.

Technique: The authors use a 'conscience step' for self-review and extend the training loss with an alignment component using Direct Preference Optimization (DPO) based on a frozen copy of the model.

Pipeline: Input prompt → Model reasoning & output generation → Conscience step (self-review) → DPO-based alignment loss calculation → Final ethical output

Methodology: The researchers endowed an LLM with a conscience step to review its own outputs and applied DPO to steer the model away from non-ethical behaviors using a frozen copy of the model as a reference.

Results: Under the same code hacking fine-tuning scenario that previously produced Emergent Misalignment, a single high-level introspective question steered training toward an ethical model.

Limitations: The paper focuses on the emergence of alignment through introspection but does not fully explore the potential for the model to develop sophisticated methods to bypass its own conscience step.
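
The alignment term in the Technique above builds on DPO. As a minimal sketch, here is the standard DPO loss for one preference pair, assuming the frozen copy of the model plays the role of the reference policy. How the paper builds preference pairs from the conscience step, and how it weights this term against the task loss, is not stated in the abstract, so `beta` and the function name are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response.
    The reference values are assumed to come from the frozen model copy.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen (here, ethical) response.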

Agentic AI

5/5 Artificial Intelligence (cs.AI), Multiagent Systems (cs.MA) 19 Jun 2026

Deontic Policies for Runtime Governance of Agentic AI Systems

Anupam Joshi, Tim Finin, Karuna Pande Joshi, Lalana Kagal

ArXiv ID: 2606.19464

Abstract:

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolutions, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.

Insights

Contribution: The paper introduces AgenticRei, a governance framework that extends standard access control to include deontic logic concepts like obligations, dispensations, and conflict resolution for agentic AI.

Core Idea: Current policy engines (XACML, Rego, Cedar) are insufficient for agentic AI because they lack the ability to manage obligation lifecycles and complex ontological reasoning required for enterprise governance.

Technique: The authors develop a deontic policy language built on the Rei framework, expressed in OWL (Web Ontology Language) and evaluated by an external high-performance logic engine.

Pipeline: Agent action/message request → Deontic policy evaluation (OWL/Rei) → Governance decision (Permit/Prohibit/Obligation/Dispensation) → Execution or Notification

Methodology: The researchers designed a logic-based governance pipeline that operates entirely outside the LLM to ensure deterministic enforcement of security, privacy, and compliance rules.

Results: Through examples, the authors show that deontic policies capture governance constraints (e.g., mandatory notifications, conditional waivers, and policy precedence) that mostly cannot be expressed in current production engines.

Limitations: The paper focuses on the architectural framework and logic engine; real-time performance at the scale of large agent swarms and mappings to industry-specific ontologies may need further exploration.
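
To make the four deontic notions concrete, here is a toy evaluator sketch. It is not AgenticRei, does not use OWL or the Rei language, and every class, field, and rule name is hypothetical. It only illustrates how permit/prohibit rules, priority-based conflict resolution, obligations, and dispensations can fit into one decision step that runs outside the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Context = Dict[str, object]


def never_waived(ctx: Context) -> bool:
    """Default dispensation condition: the obligation always applies."""
    return False


@dataclass
class Rule:
    modality: str                          # "permit" or "prohibit"
    action: str
    condition: Callable[[Context], bool]
    priority: int = 0                      # meta-policy: higher priority wins


@dataclass
class Obligation:
    trigger_action: str                    # action that creates the duty
    duty: str                              # what must happen afterwards
    waived_if: Callable[[Context], bool] = never_waived  # dispensation


@dataclass
class Decision:
    allowed: bool
    obligations: List[str] = field(default_factory=list)


def evaluate(action: str, ctx: Context,
             rules: List[Rule], obligations: List[Obligation]) -> Decision:
    applicable = [r for r in rules if r.action == action and r.condition(ctx)]
    if not applicable:
        return Decision(allowed=False)     # default deny (an assumption)
    # Conflict resolution: highest priority wins; on a tie, prohibit wins.
    winner = max(applicable,
                 key=lambda r: (r.priority, r.modality == "prohibit"))
    if winner.modality == "prohibit":
        return Decision(allowed=False)
    # Attach every triggered obligation that no dispensation waives.
    duties = [o.duty for o in obligations
              if o.trigger_action == action and not o.waived_if(ctx)]
    return Decision(allowed=True, obligations=duties)


RULES = [
    Rule("permit", "install_software",
         lambda c: c.get("role") == "ops_agent", priority=1),
    Rule("prohibit", "install_software",
         lambda c: c.get("host") == "production", priority=2),
]
OBLIGATIONS = [
    Obligation("install_software", "notify_ciso",
               waived_if=lambda c: bool(c.get("sandbox", False))),
]
```

With these example rules, an `ops_agent` installing software on a staging host is permitted but must notify the CISO, the same request on a production host is prohibited by the higher-priority rule, and a sandboxed install is permitted with the notification waived.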

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 19 Jun 2026

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

ArXiv ID: 2606.19559

Abstract:

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

Insights

Contribution: The paper introduces a prompt-based uncertainty decomposition method that separates action confidence from request uncertainty to enable LLM agents to proactively seek clarification on underspecified tasks.

Core Idea: Standard aleatoric/epistemic uncertainty frameworks are insufficient for interactive agents; instead, agents need to distinguish between their ability to perform an action and the ambiguity of the user's request.

Technique: A simple prompt-based decomposition is used to surface uncertainty signals at deployment time, bypassing the need for logprobs, multi-sampling, or retraining.

Pipeline: Underspecified task input → Prompt-based uncertainty decomposition (Action Confidence vs. Request Uncertainty) → Proactive clarification seeking or action execution

Methodology: The authors evaluated the method using two new clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) across five different LLM backbones.

Results: Averaged across the five backbones, the proposed decomposition improved clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM. It achieved the highest clarification F1 on every backbone for WebShop-Clarification and on four of five backbones for ALFWorld-Clarification.

Limitations: The study restricts itself to prompt-based estimation because of black-box API constraints, which may yield coarser uncertainty signals than logprob-based or training-based methods that can access model internals.
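
The pipeline above can be sketched as a small decision rule. This sketch assumes the agent is prompted to return both scores as JSON. The field names, the thresholds, and the `reconsider` fallback are illustrative assumptions and are not taken from the paper.

```python
import json


def decide(model_reply: str,
           request_threshold: float = 0.5,
           action_threshold: float = 0.5) -> str:
    """Map a prompted self-report to 'ask_clarification', 'act', or 'reconsider'.

    model_reply is assumed to be a JSON object with two numbers in [0, 1]:
    'action_confidence' (can I carry out this step?) and
    'request_uncertainty' (is the user's request underspecified?).
    """
    scores = json.loads(model_reply)
    request_uncertainty = float(scores["request_uncertainty"])
    action_confidence = float(scores["action_confidence"])
    # An ambiguous request takes priority: acting confidently on an
    # underspecified task is the failure the decomposition aims to avoid.
    if request_uncertainty > request_threshold:
        return "ask_clarification"
    if action_confidence >= action_threshold:
        return "act"
    return "reconsider"
```

Keeping the two signals separate means a confident agent can still ask a question, which a single combined confidence score could not express.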

5/5 Artificial Intelligence (cs.AI) 19 Jun 2026

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

ArXiv ID: 2606.19602

Abstract:

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.

Insights

Contribution: The paper introduces ACIE, an on-premise agentic RAG pipeline designed to extract clinical information from complex, heterogeneous patient records where standard RAG fails due to missing metadata and temporal dependencies.

Core Idea: By using an agentic reasoning framework, the system can navigate hundreds of documents to ground clinical extractions in specific source passages, ensuring high-fidelity data for clinician verification.

Technique: The authors employ Agentic Retrieval-Augmented Generation (RAG) to perform multi-step reasoning over complete patient contexts, addressing cross-document dependencies and temporal reasoning.

Pipeline: Heterogeneous patient documents → Agentic RAG reasoning and source grounding → Verified clinical information extractions

Methodology: The researchers deployed ACIE at University Medicine Essen and evaluated it alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verified every extracted value against its cited sources, yielding 7,326 judgments.

Results: Clinicians accepted 96.5% of the extractions, with specific data types achieving acceptance rates between 80% and 99%.

Limitations: The study highlights the 'metadata gap' in clinical records and explores the architectural trade-offs required to handle missing information and complex temporal reasoning.

5/5 Artificial Intelligence (cs.AI) 19 Jun 2026

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

ArXiv ID: 2606.19704

Abstract:

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.

Insights

Contribution: The paper identifies that aggregate-score leaderboards for LLM agents lack predictive validity for real-world deployment and proposes a new evaluation framework based on out-of-distribution (OOD) rank correlation.

Core Idea: Current agent benchmarks fail to generalize because they collapse complex deployment dimensions into single scores; evaluation should instead prioritize the correlation between in-sample and out-of-sample performance.

Technique: The authors utilize a twelve-tier measurement apparatus and a predictive validity metric to evaluate how well benchmark rankings transfer to unseen settings.

Pipeline: Agent benchmarks and industrial-agent studies → Analysis of rank instability across OOD settings → Predictive validity ranking and twelve-tier measurement apparatus

Methodology: The researchers conducted fourteen parallel implementation studies on an MCP-based industrial-agent benchmark and consolidated them with seven prior benchmarks to analyze rank transferability.

Results: The paper cites recent public-to-hidden competition retrospectives as direct empirical evidence of rank instability and argues that aggregate scores systematically underspecify the dimensions relevant to actual deployment.

Limitations: The evidence supporting the proposed OOD criteria is currently too thin to fully confirm the new framework, necessitating further field-level testing.
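
The abstract defines predictive validity as the correlation between in-sample and out-of-sample rank. One plausible reading is a Spearman rank correlation over a set of agent configurations, sketched below with only the standard library. The function names are hypothetical, and the paper's twelve-tier apparatus and explicit thresholds are not reproduced here.

```python
def ranks(scores):
    """Rank scores from best (1) to worst, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of configurations tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def predictive_validity(in_sample_scores, out_of_sample_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(ranks(in_sample_scores), ranks(out_of_sample_scores))
```

A value near 1 means the public leaderboard order carries over to held-out settings, while a value near 0 or below signals the rank instability the paper warns about.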

4/5 Artificial Intelligence (cs.AI) 19 Jun 2026

Hidden Anchors in Multi-Agent LLM Deliberation

Apurba Pokharel, Ram Dantu

ArXiv ID: 2606.19494

Abstract:

Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin-Johnsen capture, and by our own internal belief, which they do not. We model multi-agent deliberation as a closed-loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent's confidence in the correct answer can climb past where any agent started, escaping the space (convex hull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held-out runs (generalizes) gives a simple test for when a model is truly driven by such an anchor. Across three open-weight model families this is a spectrum, not all-or-nothing. The anchors' influence is about equally strong in all of them, but they differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed-loop model.

Insights

Contribution: The paper introduces a closed-loop dynamical system model to explain multi-agent LLM deliberation, identifying 'hidden anchors' as the mechanism that allows agents to reach conclusions outside the convex hull of initial opinions.

Core Idea: While classical models focus on herd effects, this research posits that agents possess internal beliefs (anchors) that continuously pull their opinions, explaining how deliberation can transcend the starting points of all participants.

Technique: The authors model deliberation as a closed-loop dynamical system and develop a method to recover hidden anchors from observed deliberation traces.

Pipeline: Multi-agent deliberation traces → Anchor recovery algorithm → Generalization testing on held-out runs

Methodology: The researchers compared multi-agent deliberation across three open-weight model families against classical opinion-dynamics models (DeGroot and Friedkin-Johnsen) to measure anchor influence.

Results: Across three open-weight model families, anchoring is a spectrum rather than all-or-nothing: anchor strength is about equal, but anchor location differs, and only when the anchor sits far from the initial opinions does deliberation escape the initial convex hull and require the full closed-loop model.

Limitations: The research identifies a spectrum of anchor influence rather than a binary state, and further investigation is needed on the specific nature of these internal beliefs in different model architectures.
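
A toy simulation can illustrate the hull-escape behaviour. It assumes a linear update on a fully connected group in which each agent mixes the group mean with a fixed hidden anchor. The update rule, the `pull` parameter, and all numbers are illustrative assumptions, not the paper's fitted model or its recovery procedure.

```python
def deliberate(initial, anchors, pull, rounds):
    """Simulate rounds of deliberation with a hidden anchor per agent.

    Each round, agent i moves to a blend of the current group mean
    (the herd effect) and its own anchor (the internal belief).
    pull = 0 reduces to plain averaging, as in a DeGroot-style consensus.
    """
    opinions = list(initial)
    for _ in range(rounds):
        group_mean = sum(opinions) / len(opinions)
        opinions = [(1 - pull) * group_mean + pull * anchor
                    for anchor in anchors]
    return opinions


initial = [0.4, 0.5, 0.6]        # starting confidence in the correct answer
anchors = [0.9, 0.9, 0.9]        # hidden beliefs outside the initial range

anchored = deliberate(initial, anchors, pull=0.3, rounds=50)
herd_only = deliberate(initial, anchors, pull=0.0, rounds=50)
```

With `pull=0.3` every opinion drifts toward 0.9, above the highest starting value of 0.6, whereas with `pull=0.0` the group settles at the initial mean and never leaves the range spanned by the starting opinions.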

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG), Risk Management (q-fin.RM) 19 Jun 2026

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

ArXiv ID: 2606.19501

Abstract:

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

Insights

Contribution: The paper introduces DeXposure-Claw, an agentic system for DeFi risk supervision, and DeXposure-Bench, a new evaluation harness designed to measure false-intervention rates in regulatory contexts.

Core Idea: To prevent LLM agents from over-reacting to weak evidence in complex DeFi networks, the system grounds agentic decisions in forecast-based evidence and deterministic stress scenarios.

Technique: The system utilizes a graph time-series foundation model (DeXposure-FM) to forecast exposure networks, coupled with confidence gates and a multi-axis evaluation framework.

Pipeline: DeFi network data → DeXposure-FM (forecasting) → Deterministic monitors & stress scenarios (alert generation) → Confidence gates (filtering) → DeXposure-Claw (auditable supervisory tickets)

Methodology: The authors developed a three-stage pipeline involving graph-based forecasting, typed alert generation, and agentic reasoning, evaluated against five years of real-world weekly DeFi data.

Results: The authors report that experiments on five years of weekly real data support the system, which emits auditable supervisory tickets with rationales and is scored on DeXposure-Bench against an explicit false-intervention rate.

Limitations: The paper does not explicitly detail the computational overhead of the graph time-series foundation model or the scalability of the system to extremely high-frequency trading data.
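
The gating idea in the pipeline above can be sketched as follows. This is not the DeXposure-Claw code (see the repository linked in the abstract for that); the dataclasses, thresholds, and action labels are hypothetical and only show how a deterministic monitor and two gates can limit escalation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    protocol: str
    kind: str                  # typed alert, e.g. "exposure_spike"
    forecast_exposure: float
    limit: float


@dataclass
class Ticket:
    protocol: str
    action: str                # "escalate" or "monitor_only"
    rationale: str


def monitor(protocol: str, forecast_exposure: float,
            limit: float) -> Optional[Alert]:
    """Deterministic monitor: raise a typed alert when the forecast breaches a limit."""
    if forecast_exposure > limit:
        return Alert(protocol, "exposure_spike", forecast_exposure, limit)
    return None


def gate(alert: Alert, data_health: float, confidence: float,
         min_health: float = 0.8, min_confidence: float = 0.7) -> Ticket:
    """Allow escalation only when data health and forecast confidence both pass."""
    if data_health < min_health or confidence < min_confidence:
        reason = (f"{alert.kind}: forecast {alert.forecast_exposure} exceeds "
                  f"limit {alert.limit}, but evidence is weak "
                  f"(health={data_health}, confidence={confidence})")
        return Ticket(alert.protocol, "monitor_only", reason)
    reason = (f"{alert.kind}: forecast {alert.forecast_exposure} exceeds "
              f"limit {alert.limit} with adequate evidence")
    return Ticket(alert.protocol, "escalate", reason)
```

Routing the decision through gates like these means weak evidence produces a recorded `monitor_only` ticket with a rationale instead of a high-stakes intervention.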

Computer Vision

4/5 Artificial Intelligence (cs.AI) 19 Jun 2026

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

Ethan Elio Meidinger, Seowung Leem, Zeyun Zhao, Ruogu Fang

ArXiv ID: 2606.19522

Abstract:

The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline. Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer's disease (AD). A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning. We propose a continuous formulation of phenotypic structure within contrastive learning. Rather than assigning samples to fixed clusters, we model inter-subject similarity as a differentiable weighting function derived from intra-modality embedding similarities in both retinal images and risk profiles. These weights define soft multi-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk. We further introduce a soft-target contrastive objective that jointly learns cross-modal alignment and phenotypic structure in an end-to-end manner. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group-based contrastive learning and standard vision-language baselines. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population-scale neurodegenerative risk modeling from multi-modal retinal and clinical data.

Insights

Contribution: The paper introduces REVEAL++, a framework that replaces discrete phenotypic grouping with a continuous, differentiable weighting function for vision-language alignment in Alzheimer's disease risk modeling.

Core Idea: Instead of assigning patients to fixed clusters, the model treats phenotypic similarity as a learnable, continuous signal derived from intra-modality embedding similarities.

Technique: The authors employ a soft-target contrastive objective and a continuous aggregation operator to define graded multi-positive relationships between retinal images and clinical risk narratives.

Pipeline: Retinal fundus images and clinical risk narratives → Intra-modality embedding similarity calculation → Differentiable weighting function → Soft-target contrastive learning → Alzheimer's disease risk prediction

Methodology: The methodology involves modeling inter-subject similarity as a differentiable weight based on both image and text embeddings, allowing for end-to-end joint learning of cross-modal alignment and phenotypic structure.

Results: The framework consistently outperforms both discrete group-based contrastive learning and standard vision-language baselines on UK Biobank retinal imaging data for incident AD prediction.

Limitations: The paper does not explicitly detail the specific clinical interpretability of the learned continuous weights or the scalability of the differentiable aggregation across extremely large-scale datasets.
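
A forward-pass sketch of a soft-target contrastive loss may help illustrate the idea. It assumes the soft targets are a softmax over the average of the image-image and text-text cosine similarities, and it computes only the image-to-text direction. The aggregation, both temperatures, and the function names are assumptions; the paper's operator may differ, and in training the weights would be differentiable tensors instead of Python lists.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax(row, temperature):
    scaled = [x / temperature for x in row]
    peak = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_contrastive_loss(images, texts, tau=0.1, target_tau=0.1):
    """Cross-entropy between soft phenotypic targets and cross-modal predictions."""
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    n = len(images)
    loss = 0.0
    for i in range(n):
        # Soft targets: subjects similar to i in both modalities get more weight.
        intra = [0.5 * (dot(images[i], images[j]) + dot(texts[i], texts[j]))
                 for j in range(n)]
        targets = softmax(intra, target_tau)
        # Prediction: how strongly image i matches each subject's text.
        probs = softmax([dot(images[i], texts[j]) for j in range(n)], tau)
        loss -= sum(t * math.log(p) for t, p in zip(targets, probs))
    return loss / n
```

With hard group assignments the targets would be fixed 0/1 indicators; here they vary continuously with embedding similarity, which is the graded supervision the paper describes.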

4/5 Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG) 19 Jun 2026

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

ArXiv ID: 2606.19651

Abstract:

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

Insights

Contribution: The paper introduces BrainG3N, a dual-purpose tokenizer that successfully decouples clinical information preservation from anatomical reconstruction in 3D brain MRI latent diffusion.

Core Idea: By using a frozen Masked Autoencoder (MAE) encoder and a dedicated CNN decoder, the model ensures that latent embeddings remain rich in clinical features while still allowing for high-fidelity volumetric reconstruction.

Technique: The authors employ a volumetric Masked Autoencoder (MAE) architecture to pretrain a robust embedding space, which is then used as the latent space for a conditional Diffusion Transformer (DiT).

Pipeline: 3D Brain MRI volumes → Frozen 3D MAE Encoder → Clinically informative embeddings → Conditional Diffusion Transformer (DiT) → Controllable 3D MRI generation or longitudinal forecasting

Methodology: The encoder was pretrained on a massive dataset of 35,309 volumes across 18 cohorts, followed by linear probing for clinical tasks and training a DiT for conditional generation and forecasting.

Results: The encoder outperformed or matched SOTA models on 21 out of 23 clinical tasks and successfully supported conditional generation across six variables and patient-specific longitudinal forecasting.

Limitations: The paper does not explicitly detail the computational overhead of the decoupled encoder-decoder design or the specific performance gaps on the 2 of 23 tasks where the encoder neither outperformed nor matched SOTA.

4/5 Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV) 19 Jun 2026

GLARE: A Natural Language Interface for Querying Global Explanations

Bhavan Vasu, Rajesh Mangannavar

ArXiv ID: 2606.19735

Abstract:

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.

Insights

Contribution: The paper introduces GLARE, an LLM-based interactive interface that enables users to query global explanations of black-box image classifiers using natural language. It bridges the gap between complex, monolithic global explanations and user-specific questions by providing a flexible, queryable system.

Core Idea: Instead of presenting static global explanation artifacts, the system treats local explanation data as a queryable database, allowing users to extract targeted insights through natural language interaction.

Technique: The system uses a Large Language Model (LLM) as a mediator to translate natural language questions into structured SQL queries, which are then executed over a repository of local explanations.

Pipeline: Natural language question → LLM translation to SQL → SQL execution over local explanation data → Statistics-augmented natural language response and intent-aligned visualization.

Methodology: The authors evaluated the system based on intent interpretation, query mapping accuracy, generalization to new datasets/queries, and robustness to linguistic errors.

Results: The authors report that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.

Limitations: The paper does not explicitly detail specific limitations, but potential areas for further research include the scalability of the SQL database for massive datasets and the handling of highly ambiguous linguistic queries.
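
The question-to-SQL pipeline can be sketched as follows. This is a minimal illustration, not the authors' system: the `local_explanations` schema, the sample rows, and the `question_to_sql` lookup (a fixed stand-in for the LLM mediator) are all assumptions made for the example.

```python
import sqlite3

# Hypothetical data: one row per (image, concept) attribution from a local explainer.
ROWS = [
    ("img1", "cat", "whiskers", 0.9),
    ("img2", "cat", "whiskers", 0.7),
    ("img2", "cat", "grass", 0.2),
    ("img3", "dog", "grass", 0.6),
    ("img4", "dog", "snout", 0.8),
]


def build_store():
    """Create an in-memory table of local explanations (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE local_explanations ("
        "image_id TEXT, predicted_class TEXT, concept TEXT, attribution REAL)"
    )
    conn.executemany("INSERT INTO local_explanations VALUES (?, ?, ?, ?)", ROWS)
    return conn


def question_to_sql(question):
    """Stand-in for the LLM mediator: a fixed lookup, not a real model call."""
    templates = {
        "which concepts matter most for cats?": (
            "SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n "
            "FROM local_explanations WHERE predicted_class = ? "
            "GROUP BY concept ORDER BY mean_attr DESC",
            ("cat",),
        ),
    }
    return templates[question.strip().lower()]


def answer(conn, question):
    """Run the translated query and wrap the top row in a statistics-augmented sentence."""
    sql, params = question_to_sql(question)
    rows = conn.execute(sql, params).fetchall()
    top_concept, mean_attr, n = rows[0]
    text = f"'{top_concept}' has the highest mean attribution ({mean_attr:.2f}, n={n})."
    return rows, text


conn = build_store()
rows, text = answer(conn, "Which concepts matter most for cats?")
print(text)
```

The aggregation happens entirely in SQL, so the global view (mean attribution per concept for a class) is derived from local explanations on demand rather than stored as a static artifact.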

Computing Systems

4/5 Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 19 Jun 2026

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Ashim Dhor, Rasel Mondal, Pin Yu Chen

ArXiv ID: 2606.19538

Abstract:

Convolutional networks, recurrent networks, and transformers each encode different inductive biases (locality, sequential memory, and content-dependent pairwise interaction) and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

Insights

Contribution: The paper introduces ITNet, a unified architecture that mathematically subsumes convolution, self-attention, and recurrence into a single learnable integral transform. It demonstrates that these three distinct architectural families are special cases of a single underlying operator.

Core Idea: The authors propose that the fragmentation of neural architectures reflects incomplete views of a single mathematical object: a learnable kernel that depends jointly on positions and features.

Technique: ITNet uses a learnable kernel implemented as a small MLP to model pairwise interactions, which can be parameterized to behave as convolution, attention, or recurrence.

Pipeline: Input data → Modality-specific encoders → ITNet (Learnable Integral Transform via MLP kernel) → Output

Methodology: The researchers developed tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization to ensure the integral transform is computationally efficient and scalable.

Results: A single ITNet architecture with a shared operator matches or exceeds specialized baselines across diverse tasks including ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2.

Limitations: While the paper addresses scalability through factorization and Monte Carlo integration, the computational complexity of high-dimensional integral transforms remains a potential challenge for extremely large-scale deployments.
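
To make the "special cases" claim concrete, here is a small sketch of a discrete integral transform, y_i = sum_j K(i, j, x_i, x_j) * x_j, with three hand-written kernels. It is not the ITNet implementation: the kernels are fixed functions rather than a learned MLP, features are scalars, and the recurrence shown is only a linear one with a fixed decay. All function names are illustrative.

```python
import math


def integral_transform(xs, kernel, normalize=False):
    """y_i = sum_j K(i, j, x_i, x_j) * x_j, a discrete stand-in for the integral."""
    ys = []
    for i, xi in enumerate(xs):
        weights = [kernel(i, j, xi, xj) for j, xj in enumerate(xs)]
        if normalize:  # row-normalise the weights, as softmax attention does
            total = sum(weights)
            weights = [w / total for w in weights]
        ys.append(sum(w * xj for w, xj in zip(weights, xs)))
    return ys


def conv_kernel(taps):
    """Position-only kernel: depends on the offset j - i, never on the features."""
    return lambda i, j, xi, xj: taps.get(j - i, 0.0)


def attention_kernel(i, j, xi, xj):
    """Feature-only kernel: exp(q * k) with identity query/key maps on scalars."""
    return math.exp(xi * xj)


def recurrence_kernel(decay):
    """Causal kernel decay**(i - j): unrolls h_i = decay * h_{i-1} + x_i."""
    return lambda i, j, xi, xj: decay ** (i - j) if j <= i else 0.0


xs = [1.0, 2.0, 3.0, 4.0]
conv_out = integral_transform(xs, conv_kernel({-1: 0.25, 0: 0.5, 1: 0.25}))
attn_out = integral_transform(xs, attention_kernel, normalize=True)
rec_out = integral_transform(xs, recurrence_kernel(0.5))
print(conv_out, attn_out, rec_out)
```

The same `integral_transform` loop serves all three cases; only the kernel changes. In the paper's setting the kernel would instead be learned and would depend jointly on positions and features.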

General

3/5 Artificial Intelligence (cs.AI), cs.DL, Systems and Control (cs.SY), eess.SY 19 Jun 2026

AI4SE and SE4AI Exploration: A Decade Looking Back and Forward

H. Sinan Bank, Daniel R. Herber, Thomas Bradley

ArXiv ID: 2606.19630

Abstract:

The March 2020 INCOSE INSIGHT special issue on AI and Systems Engineering (SE) became the most downloaded issue in the publication's history and launched a research community that now draws over 250 registrants to its annual workshop. In this article, we trace the progress in AI and SE across three phases (labeled here foundational, applied, and LLM inflection) based on the authors' reading of the field's core papers, and describe our opinions of where the community has converged and where critical gaps remain. Separately, a human-AI agreement literature review leveraging both human expertise and six AI models was performed to assess the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications. The results identify five critical research gaps and offer guidance for practitioners navigating AI adoption, assurance, and workforce transformation in SE. We share the agreement data and the AI4SE/SE4AI Explorer web application so readers can compare their own relevance judgments with the human and AI raters.

Insights

Contribution: The paper provides a historical retrospective of the AI and Systems Engineering (SE) intersection and identifies critical research gaps through a large-scale human-AI agreement study.

Core Idea: The field has evolved through foundational, applied, and LLM inflection phases, necessitating a structured understanding of AI4SE (AI for SE) and SE4AI (SE for AI) to guide adoption and assurance.

Technique: The authors used a hybrid literature review combining human expertise with six AI models to evaluate the relevance of 2,601 publications (1,712 INCOSE INSIGHT articles and 889 SERC publications).

Pipeline: Historical literature and publication databases → Human and AI model relevance scoring → Identification of research gaps and development of the AI4SE/SE4AI Explorer web application.

Methodology: A qualitative analysis of core papers across three historical phases combined with a quantitative human-AI agreement study on 1,712 INCOSE INSIGHT articles and 889 SERC publications.

Results: The study identified five critical research gaps and established a baseline for human-AI agreement in relevance judgment, while providing a web application for community exploration.

Limitations: The study identifies significant remaining gaps in AI assurance and workforce transformation that require further investigation as the field moves past the LLM inflection point.

LLM

5/5 Artificial Intelligence (cs.AI), Cryptography and Security (cs.CR), Logic in Computer Science (cs.LO) 19 Jun 2026

Analyzing the Narration Gap in LLM-Solver Loops

Zunchen Huang, Songgaojun Deng

ArXiv ID: 2606.19588

Abstract:

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution without a formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-source models under prompt injection, and we find that certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study mitigation through a hardened prompt, which reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show that in the LLM-solver loop, robustness does not extend to the answer that the user finally reads.

Insights

Contribution: The paper identifies and analyzes the 'narration gap' in LLM-solver loops, demonstrating that while formal solvers provide sound answers, the final narration step remains vulnerable to prompt injections.

Core Idea: The soundness of a formal solver can be compromised during the final step of translating a verified result into a natural language response for the user.

Technique: The authors model the LLM-solver loop as a verified decision procedure and employ certificate gating and hardened prompts to mitigate adversarial attacks.

Pipeline: User Question → Formalization → Solver Decision → Narration → Final User Answer

Methodology: The study combines formal verification modeling with empirical evaluations of five open-source models under various prompt injection and adaptive attack scenarios.

Results: Certificate gating ensures the solver verdict remains sound, but adversaries can still invert verified conclusions across different phrasings; hardened prompts reduce but do not eliminate injection risks.

Limitations: Hardened prompts cannot completely eliminate vulnerabilities and remain susceptible to sophisticated adaptive attacks.
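
A rough sketch of where the gap sits, under stated assumptions: "certificate gating" is read here as independently re-checking a SAT model against the formula before trusting the verdict, and `narration_consistent` is a hypothetical keyword check added only to show that the narrated text is a separate artifact from the verified verdict. Neither function is taken from the paper.

```python
def satisfies(cnf, assignment):
    """Check a claimed model: every clause needs at least one true literal.

    Literals are non-zero ints in DIMACS style: 3 means x3, -3 means not x3.
    A variable missing from the assignment makes its literals false.
    """
    return all(
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in cnf
    )


def gated_verdict(cnf, solver_claim):
    """Accept a 'sat' claim only if its certificate checks out independently."""
    status, model = solver_claim
    if status == "sat" and model is not None and satisfies(cnf, model):
        return "sat"
    return "unverified"


def narration_consistent(verdict, narrated_text):
    """Crude, hypothetical check that the narrated answer did not flip the verdict."""
    text = narrated_text.lower()
    says_unsat = "unsatisfiable" in text or "not satisfiable" in text
    says_sat = "satisfiable" in text and not says_unsat
    return (verdict == "sat" and says_sat) or (verdict != "sat" and not says_sat)


cnf = [[1, -2], [2, 3]]  # (x1 or not x2) and (x2 or x3)
good = ("sat", {1: True, 2: False, 3: True})
bad = ("sat", {1: False, 2: True, 3: False})

verdict = gated_verdict(cnf, good)
print(verdict, gated_verdict(cnf, bad))
print(narration_consistent(verdict, "The constraints are satisfiable."))
print(narration_consistent(verdict, "The policy is unsatisfiable, so access is denied."))
```

The gate protects the verdict, but the final string is produced afterwards. A keyword check like the one above is easy to evade with rephrasing, which is in line with the reported finding that inversions persist across phrasings and channels.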

4/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 19 Jun 2026

Diffusion Language Models: An Experimental Analysis

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

ArXiv ID: 2606.19475

Abstract:

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

Insights

Contribution: The paper provides a systematic experimental analysis of eight state-of-the-art Diffusion Language Models (DLMs) across diverse benchmarks to establish a clear understanding of their capabilities and trade-offs.

Core Idea: Unlike autoregressive models that predict tokens sequentially, DLMs generate text through iterative denoising, allowing for parallel refinement of entire sequences.

Technique: The study employs a multi-dimensional evaluation framework that accounts for generation quality, computational efficiency, and various inference-time hyperparameters.

Pipeline: Text prompts and benchmarks → Iterative denoising across multiple DLM architectures → Evaluated text outputs across reasoning, coding, translation, and knowledge tasks.

Methodology: The authors evaluated eight DLMs on eight benchmarks, conducting controlled comparisons of smaller models and analyzing the impact of denoising steps, context length, block size, and parallel unmasking strategies.

Results: The analysis reveals that DLM performance is heavily influenced by generation-time design choices, showing distinct trade-offs between output quality and computational efficiency across different tasks.

Limitations: The study highlights that the lack of standardized evaluation protocols and hyperparameters in existing DLM research makes direct comparisons difficult.
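
The relationship between denoising steps and parallel unmasking can be illustrated with a toy loop. This is only a sketch: `toy_denoiser` is a hypothetical stand-in for a model forward pass (it reads proposals from a fixed target and invents a confidence score), so it shows the step accounting, not the quality trade-off that the paper measures.

```python
MASK = "<mask>"


def toy_denoiser(tokens, target):
    """Hypothetical stand-in for a DLM forward pass.

    For every masked position it proposes a token and a confidence. The
    proposal is read from a fixed target; confidence grows with the number
    of already-revealed neighbours and slightly prefers earlier positions.
    """
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        revealed = sum(tokens[j] != MASK for j in neighbours)
        proposals[i] = (target[i], 0.5 + 0.25 * revealed - 0.01 * i)
    return proposals


def generate(target, tokens_per_step):
    """Start fully masked and commit the k most confident proposals per step."""
    tokens = [MASK] * len(target)
    steps = 0
    while MASK in tokens:
        proposals = toy_denoiser(tokens, target)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:  # parallel unmasking of k positions
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps


target = "diffusion models refine whole sequences in parallel".split()
for k in (1, 3, 7):
    tokens, steps = generate(target, k)
    print(k, steps)
```

With seven tokens, unmasking one position per step needs seven forward passes, while unmasking all seven at once needs one. Real models pay for the larger `tokens_per_step` in output quality, which is the kind of inference-time trade-off the study examines.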

4/5 Artificial Intelligence (cs.AI), stat.AP 19 Jun 2026

Which Pairs to Compare for LLM Post-Training?

Jiangze Han, Vineet Goyal, Will Ma

ArXiv ID: 2606.19607

Abstract:

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.

Insights

Contribution: The paper provides a theoretical framework for optimal comparison curation in preference-based post-training, establishing upper and lower bounds on the optimality gap of DPO-trained policies based on label allocation.

Core Idea: Instead of labeling all possible pairs from a small set of completions, it is more efficient to generate a large pool of completions and selectively label only the most informative comparison pairs.

Technique: The authors formulate comparison curation as a sampling-design problem, deriving a design-dependent information matrix that links label allocation to parameter estimation error and policy suboptimality.

Pipeline: Large pool of generated completions → Information-matrix-guided comparison curation → Selected preference pairs → DPO post-training → Optimized policy

Methodology: The study analyzes how different comparison selection designs propagate through the DPO objective, using synthetic settings and real-world LLM benchmarks to evaluate sample efficiency.

Results: The proposed designs consistently outperform common heuristics in sample efficiency, providing an explicit optimization criterion for maximizing policy performance under a fixed labeling budget.

Limitations: The analysis focuses primarily on the DPO objective and may require further validation across different preference-based alignment algorithms or highly complex multi-turn dialogues.
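
The two ingredients can be sketched in a few lines. `dpo_loss` is the standard per-pair DPO loss. `greedy_pairs` is only an assumed illustration of budgeted pair selection: it uses 2-D completion features, an unweighted information matrix built from feature differences, and a greedy D-optimality (determinant) criterion. The abstract does not say which criterion or weighting the paper uses.

```python
import math
from itertools import combinations


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def add_outer(m, d):
    """Return m + d d^T for a 2x2 matrix m and a length-2 vector d."""
    return [[m[r][c] + d[r] * d[c] for c in range(2)] for r in range(2)]


def diff(features, pair):
    i, j = pair
    return [features[i][k] - features[j][k] for k in range(2)]


def greedy_pairs(features, budget, ridge=1e-3):
    """Greedily pick the pair whose feature difference most increases det(M),

    where M = ridge * I + sum of d d^T over the chosen pairs (assumed design).
    """
    info = [[ridge, 0.0], [0.0, ridge]]
    candidates = list(combinations(range(len(features)), 2))
    chosen = []
    for _ in range(budget):
        remaining = [p for p in candidates if p not in chosen]
        best = max(remaining, key=lambda p: det2(add_outer(info, diff(features, p))))
        info = add_outer(info, diff(features, best))
        chosen.append(best)
    return chosen, info


# Four completions for one prompt; completion 3 is nearly a duplicate of completion 0.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.1, 0.1)]
chosen, info = greedy_pairs(features, budget=2)
print(chosen)
```

With a budget of two labels, the selection first takes the most separated pair and then a pair that adds information along the direction not yet covered, skipping the near-duplicate comparison.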

3/5 Artificial Intelligence (cs.AI) 19 Jun 2026

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava

ArXiv ID: 2606.19509

Abstract:

Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.

Insights

Contribution: The paper identifies that LLMs lack epistemic self-awareness on clinical tabular data and proposes a cross-model attribution divergence method to calibrate reliability. It demonstrates that verbalized confidence is vacuous and provides a way to estimate model reliability without internal access or repeated inference.

Core Idea: LLMs exhibit 'epistemic blind spots' where they provide high confidence despite low accuracy, particularly when a specialized model (XGBoost) is highly certain. By measuring the divergence in feature attribution between an LLM and a gradient-boosted tree, one can quantify the LLM's uncertainty.

Technique: Cross-model attribution divergence analysis comparing the LLM's feature attributions against SHAP values from an XGBoost model. The authors also combine two interventions, few-shot examples and SHAP-derived feature evidence, whose effects they report as super-additive.

Pipeline: Clinical tabular data → XGBoost (SHAP values) & LLM (Attribution) → Attribution Disagreement Score (ADS) → Cross-model Calibrator → Patient-specific reliability estimate

Methodology: The researchers compared Qwen 2.5 7B and XGBoost on a clinical prediction task, analyzing the correlation between verbalized confidence, prediction accuracy, and attribution divergence. They tested interventions like few-shot prompting and SHAP-derived evidence to mitigate these blind spots.

Results: Verbalized confidence was found to be vacuous (0.856-0.937) regardless of accuracy. The cross-model calibrator reduced Expected Calibration Error (ECE) from 0.254 to 0.080. Few-shot and SHAP interventions improved accuracy from 49% to 75.3% and reduced ADS from 1.54 to 0.38.

Limitations: The study focuses on a specific clinical tabular task and does not explore the scalability of attribution divergence across diverse unstructured domains or the computational overhead of generating SHAP values for the calibrator.
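
For readers unfamiliar with the metric, expected calibration error (ECE) can be computed as below. This is the standard equal-width binned definition; the bin count and the toy inputs are illustrative choices, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# A model that always says 0.9 but is right half the time is badly calibrated.
constant = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)

# A model that says 0.75 and is right three times out of four is well calibrated.
matched = expected_calibration_error([0.75] * 4, [True, True, True, False])

print(round(constant, 3), round(matched, 3))
```

The first case mirrors the "near-constant verbalized confidence" pattern: when confidence does not move with accuracy, ECE reduces to the gap between the two.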

NLP

3/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL) 19 Jun 2026

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa

ArXiv ID: 2606.19626

Abstract:

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as a triple of an ontology, a classification function, and an instantiator family: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction (ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction) over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

Insights

Contribution: The paper introduces TOTEN, a knowledge-based ontological tokenization framework that preserves the semantic integrity of physical quantities and technical notations in Brazilian Portuguese. Against eight state-of-the-art baselines, it achieves unit ontological atomicity and higher numerical reconstruction of engineering entities.

Core Idea: Replace statistically derived subword tokenization (like BPE) with a declarative, ontology-grounded classification system to prevent the fragmentation of structured technical data.

Technique: TOTEN utilizes a triple framework consisting of a formal ontology of engineering entities (OEE), a classification function for typed regions, and an instantiator family for structured representation.

Pipeline: Raw Brazilian Portuguese technical text → Ontological classification (OEE) + External Oracles (Pint, Unicode, RSLP) → Self-descriptive structured tokens

Methodology: The authors developed a deterministic system coupled with three external oracles and evaluated it against eight baselines using an internal benchmark (EngQuant) and four external Brazilian Portuguese corpora.

Results: TOTEN achieved unit ontological atomicity in all contrasts and significantly higher numerical reconstruction scores (0.775-0.904) compared to the best baseline (0.627-0.703) on external corpora, and 0.780 vs. 0.340 on the EngQuant benchmark.

Limitations: The study focuses specifically on Brazilian Portuguese and engineering entities; the paper also distinguishes between detection recall and conditional atomicity, suggesting coverage remains a factor.
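
The idea of emitting one typed token per quantity, instead of subword fragments, can be sketched with a regular expression and a small unit table. This is not TOTEN: the `UNITS` table is a hand-written, hypothetical substitute for the Pint oracle, and the pattern covers only plain numbers with the Brazilian decimal comma and dot thousands separator.

```python
import re

# Hypothetical mini-ontology: unit symbol -> (dimension, factor to the SI base unit).
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "mm": ("length", 0.001),
    "kg": ("mass", 1.0),
    "g": ("mass", 0.001),
    "kN": ("force", 1000.0),
    "N": ("force", 1.0),
}

# Longest symbols first so that "km" is tried before "m".
_UNIT_ALT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))
QUANTITY = re.compile(
    r"(?<![\w,.])(\d{1,3}(?:\.\d{3})+|\d+)(?:,(\d+))?\s?(" + _UNIT_ALT + r")(?!\w)"
)


def parse_quantities(text):
    """Return one typed token per quantity found in Brazilian Portuguese text."""
    tokens = []
    for match in QUANTITY.finditer(text):
        integer, fraction, unit = match.groups()
        # "1.200" -> "1200"; "2" + "," + "5" -> 2.5
        value = float(integer.replace(".", "") + "." + (fraction or "0"))
        dimension, factor = UNITS[unit]
        tokens.append({
            "surface": match.group(0),
            "type": "physical_quantity",
            "value": value,
            "unit": unit,
            "dimension": dimension,
            "si_value": value * factor,
        })
    return tokens


tokens = parse_quantities("A viga de 2,5 m suporta 1.200 kg e uma carga de 12,5 kN.")
for token in tokens:
    print(token["surface"], token["dimension"], token["si_value"])
```

Because each token carries its value and unit, numerical reconstruction is a field lookup, and dimensional equivalence (for example "2,5 km" versus "2500 m") reduces to comparing `si_value` within the same `dimension`.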

RL

3/5 Artificial Intelligence (cs.AI), Multiagent Systems (cs.MA), Systems and Control (cs.SY), eess.SY 19 Jun 2026

Exit-and-Join Dynamics for Decentralized Coalition Formation

Quanyan Zhu

ArXiv ID: 2606.19683

Abstract:

This paper studies coalition formation as a decentralized dynamical process driven by unilateral exit-and-join decisions. Agents evaluate local moves using the Aumann-Dreze value, so payoffs are computed within the agent's current coalition rather than through a globally negotiated coalition structure. The resulting model links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. We establish equilibrium characterizations, identify conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyze how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.

Insights

Contribution: The paper introduces a decentralized dynamical model for coalition formation where agents make unilateral exit-and-join decisions based on local payoff evaluations. It establishes equilibrium characterizations and identifies conditions for Lyapunov and potential-based stability in these dynamics.

Core Idea: Coalition formation is modeled as a noncooperative best-response process where agents move to maximize their own payoffs using the Aumann-Dreze value, leading to a terminal partition where no profitable moves remain.

Technique: The study employs game-theoretic analysis to link cooperative payoff allocation with decentralized dynamics, utilizing Lyapunov functions and potential theory to analyze stability.

Pipeline: Initial coalition structure → Agent-level local payoff evaluation (Aumann-Dreze) → Unilateral exit-and-join decisions → Terminal stable partition

Methodology: The author develops a theoretical framework for decentralized dynamics, establishes equilibrium characterizations, and conducts numerical experiments on finite-time stabilization, cost sensitivity, and a convex-game benchmark.

Results: The dynamics reach a terminal partition equivalent to a stable coalition structure; the model demonstrates how switching and acceptance costs influence local stability and convergence.

Limitations: The paper focuses on unilateral moves and local evaluations, potentially leaving open questions regarding multi-agent coordinated negotiations or complex global constraints.
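
A minimal simulation of the dynamics, under simplifying assumptions: the Aumann-Dreze value is computed as the Shapley value of the game restricted to the agent's coalition (by brute-force permutation averaging), every move is treated as admissible, acceptance costs are omitted, and the characteristic function `v(S) = |S|^2` is just an illustrative convex game.

```python
from itertools import permutations


def shapley_within(agent, coalition, v):
    """Aumann-Dreze value: the agent's Shapley value in the game restricted
    to its own coalition, averaged over all orderings of the members."""
    orders = list(permutations(sorted(coalition)))
    total = 0.0
    for order in orders:
        before = frozenset(order[: order.index(agent)])
        total += v(before | {agent}) - v(before)
    return total / len(orders)


def exit_and_join(n_agents, v, switching_cost=0.0, max_rounds=100):
    """Apply unilateral exit-and-join moves until no agent can profit."""
    partition = [frozenset({i}) for i in range(n_agents)]
    for _ in range(max_rounds):
        moved = False
        for agent in range(n_agents):
            home = next(c for c in partition if agent in c)
            current = shapley_within(agent, home, v)
            # Destinations: any other coalition, or going solo if not already alone.
            options = [c for c in partition if c != home]
            if len(home) > 1:
                options.append(frozenset())
            best, best_gain = None, 0.0
            for dest in options:
                gain = shapley_within(agent, dest | {agent}, v) - current - switching_cost
                if gain > best_gain + 1e-12:
                    best, best_gain = dest, gain
            if best is not None:
                partition = [c for c in partition if c not in (home, best)]
                partition += [c for c in (home - {agent}, best | {agent}) if c]
                moved = True
        if not moved:
            break  # terminal partition: no profitable unilateral deviation
    return partition


def v(coalition):
    return len(coalition) ** 2


free = exit_and_join(3, v)
costly = exit_and_join(3, v, switching_cost=5.0)
print(free, costly)
```

With free switching the three agents merge into the grand coalition; with a switching cost larger than any available gain, the initial singletons are already terminal, which illustrates how costs shape local stability.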

Personal Interests

Papers discovered through your interest topics.

Multi-Agent Systems

3/5 q-fin.RM, Artificial Intelligence (cs.AI), nlin.AO, physics.soc-ph 18 Jun 2026

Optimal Order of Multi-Agent and General Many-Body Systems

Jake J. Xia

ArXiv ID: 2606.20485

Abstract:

This paper develops a general framework for analyzing multi-agent systems with feedback loops between agents' actions and collective observations. The framework is built on two fundamental agent-level variables: power, which measures agent influence on collective outcomes, and response functions, which determine how agents react to observations. We derive how macroscopic properties, including total power, useful power, entropy, order, fragility, and mobility, emerge from these two variables of heterogeneous agents. To study the trade-off between growth and resilience, we introduce a system-level utility function parameterized by a risk-appetite coefficient and derive an optimal degree of order that balances productivity, stability, and adaptability. The analysis suggests that stronger synchronization can increase collective output but may also increase systemic fragility and reduce mobility. We further argue that order, entropy, information, and useful energy are task-dependent and system-relative concepts whose meanings depend on the objectives of the system. By measuring and designing agent power distributions and response functions, it may be possible to better understand, predict, and optimize collective behavior and identify the conditions under which collective intelligence and optimal order emerge.

Insights

Contribution: The paper develops a general framework to analyze multi-agent systems by deriving macroscopic properties from two fundamental agent-level variables: power and response functions. It introduces a system-level utility function to identify the optimal balance between collective productivity, stability, and adaptability.

Core Idea: Collective behaviors like order, entropy, and fragility emerge from the interplay between an agent's influence (power) and its reaction to collective observations (response functions). The study posits that order is task-dependent and that there is a fundamental trade-off between synchronization-driven growth and systemic resilience.

Technique: The author employs a mathematical framework to derive macroscopic properties from heterogeneous agent variables and uses a risk-appetite parameterized utility function to optimize system states.

Pipeline: Agent-level variables (power and response functions) → Framework analysis of macroscopic properties (order, entropy, fragility) → Utility function optimization → Optimal degree of order

Methodology: The research uses a theoretical framework to model feedback loops between individual actions and collective observations, deriving analytical expressions for system-level metrics.

Results: The analysis suggests that stronger synchronization can increase collective output but may also increase systemic fragility and reduce mobility; the optimal degree of order depends on the risk appetite and objectives of the system.

Limitations: The paper notes that concepts like entropy and useful energy are system-relative and task-dependent, implying that the 'optimal' state varies significantly depending on the specific goals of the system.
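The trade-off described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's model: the functional forms for output and fragility, the way the risk-appetite coefficient enters the utility, and all function names are assumptions chosen only to show how an optimal degree of order could shift with risk appetite.

```python
import numpy as np

def collective_output(order):
    # Assumed form: output grows with synchronization, with diminishing returns.
    return np.sqrt(order)

def fragility(order):
    # Assumed form: fragility grows faster than output as order approaches 1.
    return order ** 2

def utility(order, risk_appetite):
    # Assumed form: a higher risk appetite puts less weight on fragility.
    return collective_output(order) - (1.0 - risk_appetite) * fragility(order)

def optimal_order(risk_appetite, grid_size=1001):
    # Grid search over the degree of order in [0, 1].
    grid = np.linspace(0.0, 1.0, grid_size)
    return float(grid[np.argmax(utility(grid, risk_appetite))])

for rho in (0.0, 0.5, 0.9):
    print(rho, optimal_order(rho))
```

Under these assumed forms, a more risk-tolerant system settles on a higher degree of order, which matches the qualitative claim in the abstract; the specific numbers carry no meaning beyond the toy setup.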

3/5 Multiagent Systems (cs.MA) 18 Jun 2026

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

Kun Zeng, Yu Huo, Siyu Zhang, Yuecheng Zhuo, Yuquan Lu, Haoyue Liu, Siyue Chen, Xiaoying Tang

ArXiv ID: 2606.19758

Abstract:

Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.

Insights

Contribution: The paper introduces SIGMA, a framework that shifts multi-agent system (MAS) design from optimizing communication topologies over fixed agents and roles to composing agents as task-conditioned bundles of reusable skills. Its results suggest that compositional node construction is a complementary and important design axis, particularly for tasks that require unseen combinations of capabilities.

Core Idea: Instead of treating agents as closed-set entities, SIGMA treats them as task-conditioned bundles of skills derived from a shared library. This allows the system to generalize to new tasks by dynamically constructing the necessary agent capabilities.

Technique: The framework utilizes a skill-incidence graph to predict which skills are needed for a task, composes agent embeddings from these skills, and decodes a communication topology. It employs skill-specific mailboxes to route messages directly to the relevant capabilities during execution.

Pipeline: Task and skill library → Skill-agent incidence matrix prediction → Agent node embedding composition → Communication topology decoding → Skill-specific message routing

Methodology: The authors evaluated SIGMA across six reasoning and coding benchmarks using three base LLMs, comparing it against non-compositional topology-based baselines like CARD. They also tested the framework's robustness against unseen skill libraries.

Results: SIGMA achieved the best average performance over the six benchmarks with each of the three base LLMs, improving over the strongest non-compositional baseline (CARD) by 2.06, 2.36, and 1.75 points, respectively. It was also more robust to unseen skill libraries, with an average performance drop of only 0.96 points.

Limitations: The paper focuses on the composition of agents from a skill library but does not deeply explore the optimal way to automatically discover or refine the underlying skill library itself.
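The pipeline above can be sketched in a few lines. This is an illustrative sketch only, not the SIGMA implementation: the skill names, the hand-written incidence matrix (SIGMA predicts it from the task), the mean-pooling composition rule, the cosine-similarity topology decoder, and every function name here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

skills = ["plan", "code", "test", "review"]    # hypothetical skill library
skill_emb = rng.normal(size=(len(skills), 8))  # one embedding per skill

# Skill-agent incidence matrix: rows are skills, columns are agents.
# Written by hand here; in SIGMA it is predicted from the task.
incidence = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 1],
])

def compose_agents(incidence, skill_emb):
    """Mean-pool the embeddings of each agent's skills (assumed rule).

    Every agent must hold at least one skill, or the division fails.
    """
    counts = incidence.sum(axis=0, keepdims=True)  # shape (1, n_agents)
    return (incidence.T @ skill_emb) / counts.T    # shape (n_agents, dim)

def decode_topology(agent_emb, threshold=0.0):
    """Stand-in decoder: link agents whose cosine similarity exceeds a threshold."""
    unit = agent_emb / np.linalg.norm(agent_emb, axis=1, keepdims=True)
    adj = (unit @ unit.T > threshold).astype(int)
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj

def route(skill, incidence, skills):
    """Skill-specific mailbox: return the agents that hold the addressed skill."""
    row = incidence[skills.index(skill)]
    return [int(a) for a in np.flatnonzero(row)]

agent_emb = compose_agents(incidence, skill_emb)
adjacency = decode_topology(agent_emb)
print(route("code", incidence, skills))
```

The point of the sketch is that the incidence matrix does double duty: it determines how agent embeddings are built and, at execution time, which agents receive a message addressed to a given skill.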

Tech News

Agentic AI

Reddit r/DeepLearning 2026-06-19

Does anyone know how to make a small language model use tools like websearch while avoiding "catastrophic forgetness" i think its called .. this my first attempt to make my own model by training it on my own data

A user is seeking technical guidance on integrating web search tools into a small, custom-trained language model. They are specifically looking for methods to enable tool-use capabilities while preventing 'catastrophic forgetting' during the training process.

Computing Systems

Hacker News Fri, 19 Ju

Let's Encrypt has been down most of today

Let's Encrypt, a major certificate authority, experienced significant downtime throughout the day. This outage impacts the issuance and renewal of SSL/TLS certificates, potentially disrupting secure connections for numerous websites and services.

Hacker News Fri, 19 Ju

Datasette Apps: Host custom HTML applications inside Datasette

Datasette has introduced a new feature allowing users to host custom HTML applications directly within the platform. This enables developers to build interactive front-ends and custom interfaces for their data without needing a separate web server. It simplifies the process of creating data-driven tools and dashboards.

Reddit r/DeepLearning 2026-06-18

pragmatiq: open-source implementation of PRAGMA-style banking event-sequence models

The pragmatiq project provides an open-source implementation of PRAGMA-style banking event-sequence models. It enables the conversion of timestamped key-value user histories into embeddings for applications like AML graph experiments, LoRA fine-tuning, and explainability. The repository includes synthetic data, PyTorch encoders, and CPU-first training tools to make the research path more accessible.

LLM

Reddit r/DeepLearning 2026-06-19

[Article] Gemma 4 – Inference, Architecture, and Practical Insights

This article provides a deep dive into Google DeepMind's Gemma 4 model, focusing on its architectural improvements and enhanced open-source capabilities. It includes practical insights on inference, technical developments, and a functional Gradio application for deployment.

Trending repositories on GitHub filtered and scored for relevance to your interests.
