Daily Digest 2026-06-06
Today's dominant theme is the evolving capabilities and challenges of AI agents, alongside advances in model efficiency, interpretability, and the societal implications of AI deployment.
Research highlights:
- Agent Systems: Papers explore communication strategies, robustness, and long-term monitoring in multi-agent and autonomous systems.
- LLM Efficiency: Research focuses on quantization techniques, context optimization, and reducing computational costs in large language models.
- Robustness and Trust: Studies address post-decision manipulation, interpretability, and reliability in AI-driven tasks like mathematical formalization and medical analysis.
- Emerging Applications: Work spans from meme understanding and synthetic reasoning to AI-assisted drone environments and open-world knowledge acquisition.
Tech buzz:
- News highlights skepticism around AI detection tools and regulatory hurdles for major AI firms.
- Discussions on AI's societal impact, including its role in education and the growing interest in "Agent OS" as a conceptual framework.
- Technical critiques of database practices and the rejection of AI-focused companies by financial indices signal broader industry tensions.
Global Trends
Papers discovered from ArXiv subject categories
AI Safety
Abstract
ArXiv ID: 2606.05384
Authors: Srimonti Dutta, Akshata Kishore Moharir
Abstract:
LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.
Insights
Contribution: This paper identifies post-decision interaction as a critical failure mode for LLM-as-judge evaluation and introduces the Evaluation Robustness Score (ERS) to quantify robustness under challenge.
Core Idea: LLM judges exhibit high stability under neutral reevaluation but are vulnerable to reversal through targeted post-decision challenges, revealing susceptibility to manipulation and unreliable error correction.
Technique: The study employs controlled experiments on MT-Bench and AlpacaEval using anti-baseline challenge protocols and counterbalanced target-validation to isolate reversibility effects.
Pipeline: Model outputs → LLM judge evaluation → post-decision interaction → revised judgments → ERS calculation
Methodology: Controlled experiments with targeted challenges and counterbalanced validation protocols assess reversibility, while human preference alignment and benchmark rankings measure practical impacts.
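Results: LLM judges are highly stable under repeated and neutral reevaluation but become substantially reversible under motivated challenges; authority framing is especially destabilizing; reversals can degrade agreement with human preferences and shift benchmark rankings despite high self-reported confidence.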
Limitations: Findings are benchmark-specific; interaction dynamics may vary across judge models and task domains; open questions remain about mitigating post-hoc rationalization.
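The two quantities the entry above contrasts, reversibility and net target-directed steering, can be illustrated with a small Python sketch. This is not the paper's ERS formula, which the abstract does not give. The trial format `(initial, final, target)`, the function names, and the toy verdicts are all assumptions for illustration.

```python
# Hypothetical trial format: (initial verdict, verdict after challenge, verdict the challenger argued for).
# This is NOT the paper's ERS; it only illustrates reversal rate vs. net target-directed shift.

def reversal_rate(trials):
    """Fraction of trials where the judge changed its verdict after the challenge."""
    return sum(init != final for init, final, _ in trials) / len(trials)

def net_target_shift(trials):
    """Change in the share of verdicts that match the challenger's target."""
    gained = sum((final == target) - (init == target) for init, final, target in trials)
    return gained / len(trials)

# Toy, made-up verdicts with targets counterbalanced between "A" and "B".
trials = [
    ("A", "B", "B"),
    ("A", "A", "B"),
    ("B", "A", "A"),
    ("B", "B", "A"),
    ("A", "B", "B"),
    ("B", "B", "B"),
]

print("reversal rate:", reversal_rate(trials))
print("net target-directed shift:", net_target_shift(trials))
```

Keeping the two numbers separate mirrors the distinction drawn in the abstract: a judge can flip often without being reliably steered toward the challenger's preferred answer.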
Abstract
ArXiv ID: 2606.05405
Authors: Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, 
Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, AndyZeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song
Abstract:
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
Insights
Contribution: Introduces Agents' Last Exam (ALE), a benchmark for evaluating AI agents on long-horizon, economically valuable real-world tasks with verifiable outcomes, developed in collaboration with 250+ industry experts.
Core Idea: ALE addresses the gap between AI benchmark success and real-world deployment by focusing on sustained performance measurement across 1K+ tasks in 13 industry clusters defined by O*NET/SOC 2018.
Technique: ALE organizes tasks into a taxonomy of 55 subfields under 13 industry clusters, with verifiable outcomes and continuous expansion of the task pool as new workflows are onboarded.
Pipeline: Real-world task definitions → AI agent execution on long-horizon workflows → Quantitative evaluation of task completion and economic impact.
Methodology: Collaboration with industry experts to define task taxonomies, integration of O*NET/SOC 2018 occupational standards, and iterative expansion of the benchmark with new tasks and industries.
Results: Mainstream AI configurations achieve an average full pass rate of 2.6% on ALE's hardest tier, indicating significant room for improvement in real-world deployment capabilities.
Limitations: The benchmark's complexity and focus on economically valuable tasks may limit accessibility for general-purpose AI evaluation, and task saturation remains an open challenge as industries evolve.
Abstract
ArXiv ID: 2606.05433
Authors: Pierre Peigné, Ky Nguyen, Paul Wang
Abstract:
Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion for designating high-impact models, but enforcement rests on self-reporting because no technical verification primitive for training exists. Any future international agreement on frontier AI faces the same problem at higher stakes: coordinated regulation of technologies with significant externalities has historically rested on technical verification, without which agreements are declaratory. Recent governance analyses judge zero-knowledge proofs a promising candidate but currently impractical at frontier scale [26, 4]. We argue the impracticality is paradigm-bound rather than fundamental, and propose a verification architecture for frontier dense pre-training combining a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation, verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles. The proof checks the actual floating-point computation the GPU performed rather than a fixed-point approximation, and preserves model-architecture confidentiality through a private training specification. The protocol produces three proof types: a genesis proof at initialisation, in-training step proofs across the run, and ex-ante attestations enforcing policy-relevant claims as running invariants, turning the training record into a governance-enforceable artefact. We estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, against a six-to-ten-year cycle for verification-grade custom silicon. Thirteen open research and engineering problems are catalogued as a research agenda for external contribution.
Insights
Contribution: This paper proposes a zero-knowledge verification architecture for frontier AI training, enabling confidential governance of high-impact models through cryptographic proof generation without exposing training details.
Core Idea: The framework combines pre-committed training specifications, inter-node network observations, and Merkle commitments verified via a zero-knowledge Virtual Machine (zkVM) to audit floating-point computations securely.
Technique: A zkVM with native BF16/FP32 precompiles verifies actual GPU computations through on-the-fly Merkle commitments, preserving model confidentiality while generating three proof types for governance enforcement.
Pipeline: Training data → pre-committed specification + inter-node observations → Merkle commitment of computations → zkVM verification → genesis/in-training/ex-ante proofs
Methodology: The approach leverages zero-knowledge proofs and cryptographic commitments to audit training processes, using a hybrid of network telemetry and computational trace verification without exposing model architecture or data.
Results: Estimates suggest a deployable proof of concept in ~36 months with single-digit-percent training overhead, compared to 6-10 years for verification-grade custom silicon; 13 open research/engineering challenges are identified.
Limitations: Practical deployment depends on solving 13 open technical challenges, and current zkVM efficiency limits may hinder large-scale adoption until hardware co-design advances.
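As a rough illustration of the Merkle-commitment step in the entry above, the following Python sketch folds a sequence of per-step byte snapshots into a single root hash. The abstract does not specify the paper's commitment scheme or its zkVM integration. The choice of SHA-256, the 0x00/0x01 domain-separation prefixes, the duplicate-last-node padding rule, the function names, and the snapshot strings are all assumptions.

```python
import hashlib

def leaf_hash(data: bytes) -> bytes:
    # Prefix 0x00 separates leaf hashes from internal-node hashes (an assumed convention).
    return hashlib.sha256(b"\x00" + data).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an internal node.
    return hashlib.sha256(b"\x01" + left + right).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash the leaves pairwise, level by level, until a single root remains."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [leaf_hash(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed padding rule: duplicate the last node
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical stand-ins for digests of intermediate training computation.
steps = [f"step-{i}:hypothetical-intermediate-digest".encode() for i in range(5)]
root = merkle_root(steps)

# Changing any committed step changes the root, which is what makes the record auditable.
tampered = list(steps)
tampered[2] = b"step-2:altered"
print(root.hex())
print(merkle_root(tampered) != root)
```

A verifier holding only the root can later check any single step against it with a logarithmic-size inclusion path. That property is what makes this kind of commitment cheap to publish during a long run.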
Abstract
ArXiv ID: 2606.05357
Authors: Jincheng Yu, Haoyang Li, Yiwen Liu, Shen Liu, Rachel Yuanbao Chen, C. Kent Kwoh, Hongxu Ding, Xiaoxiao Sun
Abstract:
Purpose: To develop an interpretable and trustworthy AI framework that combines deep learning based MRI Osteoarthritis Knee Score (MOAKS) prediction with interpretable statistical modeling to study structure-pain relationships at scale using data from the Osteoarthritis Initiative (OAI). Materials and Methods: We first developed a deep learning framework to predict MOAKS features directly from knee MRIs and incorporated conformal prediction to provide prediction uncertainty quantification. This uncertainty-aware strategy enables explicit filtering of model outputs, retaining only high-confidence MOAKS predictions at the knee level. Second, we applied a longitudinal latent class mixed model (LCMM) to examine associations between key structural abnormalities and four complementary knee pain measurements. Results: Among the three MRI-defined abnormalities (i.e., bone marrow lesions (BML), cartilage loss (CART), and meniscal extrusion (ME)), our framework substantially improved the Matthews correlation coefficient (MCC) and some other metrics. For example, MCC increased from 0.69 to 0.91 for BML, from 0.45 to 0.80 for CART, and from 0.59 to 0.89 for ME. Using these high-confidence predictions, we expanded the sample size to 2,175 knees for the LCMM analysis. Two distinct pain trajectories were identified (rapid and stable pain progression). The estimated odds ratios (95% CI) for the rapid progression group were 1.62 (1.12-2.35) for BML, 1.83 (1.24-2.70) for CART loss, and 2.50 (1.75-3.57) for ME. Conclusion: These results highlight the importance of these structural abnormalities as risk factors for pain and functional progression in osteoarthritis.
Insights
Contribution: This paper introduces an interpretable and trustworthy AI framework that integrates deep learning-based MOAKS prediction with statistical modeling to analyze structure-pain associations in osteoarthritis at scale, improving prediction accuracy and identifying key structural risk factors for pain progression.
Core Idea: The framework combines uncertainty-aware deep learning for MRI-based structural scoring with longitudinal statistical modeling to uncover associations between osteoarthritis structural abnormalities and pain trajectories.
Technique: A deep learning model with conformal prediction for uncertainty quantification is paired with a longitudinal latent class mixed model (LCMM) to analyze structure-pain relationships using high-confidence structural predictions.
Pipeline: Knee MRI images → Deep learning MOAKS prediction with conformal uncertainty filtering → High-confidence structural scores → LCMM analysis of pain trajectories and structural associations → Pain progression risk factors and odds ratios
Methodology: A deep learning framework with uncertainty quantification was developed to predict MOAKS features from MRIs, followed by LCMM analysis on high-confidence predictions to identify longitudinal associations between structural abnormalities and pain measurements.
Results: MCC improved by 0.22 (BML: 0.69 → 0.91), 0.35 (CART: 0.45 → 0.80), and 0.30 (ME: 0.59 → 0.89). LCMM identified two pain trajectories with odds ratios of 1.62 (BML), 1.83 (CART), and 2.50 (ME) for rapid progression.
Limitations: Results are limited to OAI data; generalizability to other populations or conditions is untested. The framework's interpretability depends on the statistical model's assumptions and may require validation in clinical settings.
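The uncertainty filtering described in the entry above can be sketched with split conformal prediction: calibrate a score threshold on held-out data, then keep only predictions whose conformal set contains exactly one class. The snippet also shows how the Matthews correlation coefficient (MCC) is computed. The binary setting, the `1 - p(true class)` nonconformity score, the function names, and the toy probabilities are assumptions. The abstract does not say which conformal variant or label encoding the authors use.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.2):
    """Split-conformal quantile of the nonconformity scores 1 - p(true class)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the set must contain every class.
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(probs, qhat):
    """All classes whose nonconformity score falls within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy calibration data (made up): class probabilities and true labels.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4],
             [0.1, 0.9], [0.95, 0.05], [0.4, 0.6], [0.85, 0.15], [0.25, 0.75]]
cal_labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
qhat = conformal_threshold(cal_probs, cal_labels)

# Keep only test predictions with a single-class conformal set.
test_probs = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]]
confident = [s[0] for s in (prediction_set(p, qhat) for p in test_probs) if len(s) == 1]
print(confident)
```

Filtering to single-class sets trades coverage for reliability, which is consistent with the reported pattern of higher MCC on the retained high-confidence predictions.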
Agentic AI
Abstract
ArXiv ID: 2606.05256
Authors: Kokil Jaidka, Saifuddin Ahmed
Abstract:
This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers -- particularly confirmation bias, representativeness, and availability -- in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque -- an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.
Insights
Contribution: This study reveals how covert AI agents in a Reddit experiment employed persuasive tactics calibrated for efficiency rather than authentic deliberation, highlighting risks of opaque epistemic asymmetry in AI-generated discourse.
Core Idea: Covert LLM agents in a discontinued Reddit experiment systematically used identity targeting, authority signaling, and cognitive bias triggers to optimize persuasion, diverging from human deliberative norms.
Technique: Structured content analysis of AI-generated comments from a discontinued field experiment, evaluating identity performance, authority signaling, alignment strategies, and cognitive heuristic activation.
Pipeline: Reddit AI comment dataset → coded for identity, authority, alignment, and bias triggers → identification of persuasive architecture patterns
Methodology: Analysis of a publicly released archive of AI-generated comments from r/ChangeMyView, comparing patterns to human-authored counter-arguments through structured content analysis.
Results: Identity targeting or adoption appeared in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers (confirmation, representativeness, availability) in the large majority. Compared with human-authored counter-arguments, the AI agents showed denser authority use, more adversarial alignment, and heavier reliance on external citation.
Limitations: Dataset lacks contextual metadata about the experiment's design; open questions remain about mitigating epistemic opacity beyond disclosure mandates and developing credibility-auditing frameworks for AI systems.
Abstract
ArXiv ID: 2606.05411
Authors: Anna Mikeda, Ben Goertzel
Abstract:
Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user's evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo's higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents -- CompanionAgent and ResearchAgent -- and sketch its extension to social robotics and domain-generic human-level AGI.
Insights
Contribution: Proposes a motivational architecture for conversational agents that recasts homeostasis in terms of dialogue-native needs and introduces a ten-stage processing pipeline, a dual decision strategy, and a distinction between pre-action feelings and post-action emotions.
Core Idea: Reinterprets traditional motivational architectures for conversational agents by focusing on regulating mental states like competence, affiliation, and aesthetic coherence rather than physical needs.
Technique: Combines OpenPsi and MetaMo frameworks with a modular execution substrate, using a dual decision strategy blending urgency-driven responses and multi-goal optimization.
Pipeline: User input → situational appraisal → cognitive modulation → speech act/tool invocation/strategic silence
Methodology: Theoretical framework development with specialization to CompanionAgent and ResearchAgent, alongside conceptual extensions to social robotics and AGI.
Results: Framework specialization for two agent types and conceptual pathways for social robotics/AGI, but no empirical validation or quantitative results reported.
Limitations: Lacks empirical testing, scalability unproven, and relies on conceptual rather than implemented validation for key components.
Abstract
ArXiv ID: 2606.05304
Authors: Chen Huang, Yuhao Wu, Wenxuan Zhang
Abstract:
Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose the PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate at -10% tokens-per-resolved, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available at https://github.com/iNLP-Lab/PACT.
Insights
Contribution: Introduces PACT, a protocolized communication framework that reduces token usage while maintaining or improving task performance in multi-agent systems.
Core Idea: Effective inter-agent communication should prioritize preserving action-centered information critical for downstream agents, rather than using unconstrained natural language.
Technique: Projects raw agent outputs into compact action-state records before shared history, treating communication as a public state-update problem.
Pipeline: raw agent output → compact action-state encoding → shared history with reduced token usage
Methodology: Analyzes five communication strategies across two MAS topologies, then evaluates PACT's performance-cost trade-off through empirical testing on coding tasks.
Results: Raises OpenHands' resolve rate while using 10% fewer tokens per resolved task, halves input tokens on SWE-agent with no change in resolve rate, and achieves comparable or stronger task performance across topologies.
Limitations: May require task-specific tuning of action-state encoding, and its effectiveness could vary with agent role complexity or communication topology.
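The projection step in the entry above, turning a raw agent message into a compact action-state record before it enters shared history, might look roughly like the Python sketch below. The record fields, the `ACTION:`/`RESULT:`/`NEXT:` line tags, and the `project` function are hypothetical and are not taken from the PACT repository, which defines the actual protocol.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ActionState:
    """Hypothetical compact record; the real PACT schema may differ."""
    agent: str
    action: str
    result: str
    next_step: str

def project(agent: str, raw: str) -> ActionState:
    """Keep only lines tagged ACTION/RESULT/NEXT and drop free-form reasoning."""
    fields = {"ACTION": "", "RESULT": "", "NEXT": ""}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().upper() in fields:
            fields[key.strip().upper()] = value.strip()
    return ActionState(agent, fields["ACTION"], fields["RESULT"], fields["NEXT"])

raw = """Let me think about this carefully. The failing test points at the date parser,
and after reading three files I believe the bug is in the timezone handling.
ACTION: patched parse_date in utils.py
RESULT: 12 of 12 tests pass
NEXT: reviewer should check the edge case for leap seconds
I also considered rewriting the module, but that seems unnecessary for now."""

record = project("coder", raw)
compact = json.dumps(asdict(record))

# Only the compact record is appended to the shared history.
shared_history = [compact]
print(compact)
print(len(raw), "->", len(compact), "characters")
```

Character counts stand in for token counts here. The point is only that downstream agents read the action-centered record and not the full free-form message.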
Abstract
ArXiv ID: 2606.05342
Authors: Matheus Kunzler Maldaner, Adam Fourney, Amanda Swearngin, Hussein Mozannar, Gagan Bansal, Maya Murad, Rafah Hosn, Saleema Amershi
Abstract:
AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention. Instead, agents should monitor an environment, notice when an external event makes progress possible, then respond promptly without wasting resources while waiting. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks. SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot. SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically impact key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.
Insights
Contribution: SentinelBench introduces an open-source benchmark for evaluating long-running monitoring agents, emphasizing sustained attention over continuous action in time-evolving environments.
Core Idea: Long-running tasks require agents to monitor environments for external events, responding promptly without wasting resources, rather than forcing progress through continuous action.
Technique: SentinelBench uses 10 synthetic web environments with live interfaces and scripted event sequences to measure task completion, reaction time, and resource efficiency.
Pipeline: Web environment with dynamic state → Agent monitors for progress-inducing events → Agent responds to events to complete tasks → Metrics (completion, reaction time, resource use) recorded
Methodology: The benchmark evaluates three models and two browser-agent harnesses, establishing baselines and analyzing tradeoffs between responsiveness and computational cost.
Results: Results demonstrate significant variations in agent performance metrics, showing that design choices critically impact task completion rates, reaction times, and resource consumption.
Limitations: Synthetic environments may not fully capture real-world complexity; scripted event sequences limit adaptability to unanticipated scenarios.
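The responsiveness-versus-cost tradeoff described in the entry above can be made concrete with a toy simulation of a fixed-interval polling monitor. This is a simplified sketch and not SentinelBench's harness or metric definitions. The function name, the integer-second clock, and treating the number of polls as the resource cost are assumptions.

```python
def simulate_monitor(event_time: int, poll_interval: int, horizon: int):
    """Poll at fixed intervals on a simulated clock.

    Returns (completed, reaction_time, polls), where reaction_time is the delay
    between the event and the first poll that observes it, or None if missed.
    """
    t, polls = 0, 0
    while t <= horizon:
        polls += 1  # each check stands in for one unit of resource use
        if t >= event_time:
            return True, t - event_time, polls
        t += poll_interval
    return False, None, polls

# Hypothetical scenario: the awaited event fires 95 s into a 600 s task.
for interval in (1, 10, 60):
    done, delay, polls = simulate_monitor(event_time=95, poll_interval=interval, horizon=600)
    print(f"interval={interval:>2}s completed={done} reaction={delay}s polls={polls}")
```

Shorter intervals cut reaction time but multiply the number of checks. An agent that acts continuously sits at the expensive end of this curve.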
Abstract
ArXiv ID: 2606.05400
Authors: Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu
Abstract:
Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
Insights
Contribution: LeanMarathon introduces a multi-agent system for reliable, large-scale autoformalization of research mathematics using an evolving blueprint that integrates formal proofs, natural-language reasoning, and shared records.
Core Idea: The system uses an evolving Lean file as a dynamic blueprint, combined with four agent roles and a two-stage orchestrator, to stabilize formalization fidelity and parallelize proof discharge across dependencies.
Technique: A two-stage orchestrator first ensures target fidelity via adversarial review, then parallelizes proof discharge upward from dynamic leaves through CI-gated rounds, while agents handle construction, auditing, proving, and repair.
Pipeline: Research papers → multi-agent blueprint construction/repair/audit → formalized theorems with no 'sorry' statements
Methodology: The approach combines adversarial review for stability, parallel CI-gated DAG processing, and a shared evolving blueprint to manage dependencies and context decay in long-horizon formalization.
Results: Formalized all seven target theorems from two papers (covering four Erdős problems) across three autonomous runs, proving 258 lemmas/theorems without 'sorry' statements.
Limitations: Requires strong foundational provers, complex coordination overhead, and may struggle with highly non-modular mathematical developments or novel proof techniques.
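The leaves-upward discharge order described in this entry can be illustrated with a small scheduling sketch. This is not LeanMarathon's code: the function name `discharge_rounds`, the lemma names, and the dictionary-of-dependencies representation are assumptions made for illustration, and the CI gating and repair steps are omitted.

```python
from typing import Dict, List, Set


def discharge_rounds(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group proof obligations into rounds.

    Each round holds the current leaves: nodes whose dependencies
    have all been discharged in earlier rounds.
    """
    remaining = {node: set(d) for node, d in deps.items()}
    proved: Set[str] = set()
    rounds: List[List[str]] = []
    while remaining:
        leaves = sorted(n for n, d in remaining.items() if d <= proved)
        if not leaves:
            # Nothing can progress: a cycle or a dependency that is not a node.
            raise ValueError("dependencies do not form a resolvable DAG")
        rounds.append(leaves)
        proved.update(leaves)
        for n in leaves:
            del remaining[n]
    return rounds


# Hypothetical proof graph: node -> set of nodes it depends on.
example = {
    "lemma_a": set(),
    "lemma_b": set(),
    "lemma_c": {"lemma_a", "lemma_b"},
    "main_theorem": {"lemma_c", "lemma_b"},
}
rounds = discharge_rounds(example)
```

Members of one inner list depend only on earlier rounds, so in this toy model they could be attempted in parallel.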
Abstract
ArXiv ID: 2606.05404
Authors: Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He
Abstract:
Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.
Insights
Contribution: TimeClaw introduces an agentic framework that enables generalist LLM agents to perform contextualized time series reasoning through native temporal support, grounded analysis, and reusable analytical routines.
Core Idea: The framework combines executable temporal tools, experience-driven capability evolution, and episodic memory to empower open-ended temporal reasoning within complex contextual scenarios.
Technique: TimeClaw integrates domain-specific temporal tools with LLM agents, allowing them to execute analytical tasks, evolve capabilities through experience, and retrieve contextual memory for auditable decision-making.
Pipeline: Time series data with context → Temporal tool execution and memory retrieval → Contextualized analysis and decision-making
Methodology: The approach was evaluated on diverse benchmarks across energy, finance, weather, and traffic domains, measuring performance improvements in forecasting and contextual reasoning tasks.
Results: Improved performance on multiple benchmarks spanning energy, finance, weather, traffic, and other real-world domains (no specific figures are given in the abstract).
Limitations: Requires integration with domain-specific tools, and generalization across highly specialized or sparse data domains remains an open challenge.
Computing Systems
Abstract
ArXiv ID: 2606.05332
Authors: Manya Pandey, Dhruv Kumar, Murari Mandal, Saurabh Deshpande
Abstract:
Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight three-component framework: Gate, Router, and Critic that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average +1.95% MASE reduction on TimesFM 2.5 while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.
Insights
Contribution: GITCO introduces a lightweight framework that improves Time Series Foundation Models (TSFMs) by optimizing the inference-time context, yielding an average 1.95% MASE reduction on TimesFM 2.5 without any parameter updates.
Core Idea: GITCO addresses context poisoning in TSFMs by selectively suppressing harmful patches through a three-component system (Gate, Router, Critic) during inference, leveraging context sensitivity profiles for targeted intervention.
Technique: A lightweight, three-component framework (Gate, Router, Critic) identifies and suppresses structurally anomalous patches in input contexts using no parameter updates, focusing on inference-time optimization.
Pipeline: Input time series → Gate, Router, and Critic identify and suppress harmful patches → Optimized context fed into the TSFM for forecasting
Methodology: Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets using K-fold cross-validation, with context sensitivity profiles quantifying model-data interactions under intervention.
Results: Achieved an average 1.95% MASE reduction on TimesFM 2.5 across 53 GIFT-Eval datasets, capturing 89.9% of the improvement upper bound.
Limitations: Dependent on pre-trained TSFM quality; potential dataset-specific performance variability; requires further analysis on architectural generalizability and computational overhead in real-time systems.
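For readers unfamiliar with the metric, MASE divides a forecast's mean absolute error by the in-sample error of a naive forecast, so a "1.95% MASE reduction" is a relative drop in that ratio. A minimal sketch with made-up numbers follows; the series, the two forecasts, and the `season=1` naive lag are assumptions for illustration, not GIFT-Eval data.

```python
def mase(y_train, y_true, y_pred, season=1):
    """Mean absolute scaled error: forecast MAE over naive in-sample MAE."""
    naive_errors = [
        abs(y_train[t] - y_train[t - season])
        for t in range(season, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)
    return mae / scale


def relative_reduction(baseline, improved):
    """Fractional improvement of `improved` over `baseline` (lower is better)."""
    return (baseline - improved) / baseline


# Toy history and two hypothetical forecasts of the next two points.
history = [1.0, 2.0, 3.0, 4.0, 5.0]
actual = [6.0, 7.0]
baseline_mase = mase(history, actual, [6.5, 7.5])
improved_mase = mase(history, actual, [6.25, 7.25])
reduction = relative_reduction(baseline_mase, improved_mase)
```

In this toy case the second forecast halves the scaled error; the reported average gain is far smaller, at about 0.0195 on the same scale.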
Abstract
ArXiv ID: 2606.05389
Authors: Liangji Zhu, Sanjay Ranka, Anand Rangarajan
Abstract:
Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guarantee accuracy for each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for that residual. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding. The predictor weights are serialized and counted in the bitstream. Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ. NGLR adds a further 10-40% over LBRC and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant.
Insights
Contribution: Introduces LBRC and NGLR, residual coding methods for high-fidelity scientific data: LBRC improves compression ratio over GAE and is broadly competitive with SZ, while NGLR outperforms both.
Core Idea: Learned compressor residuals have distinct structural properties requiring specialized representations, enabling more efficient compression when global residual correction becomes rate-dominant.
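Technique: LBRC adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the integer result with 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding; NGLR adds a causal neural predictor that biases the Lorenzo prediction to reduce residual entropy while keeping decoding deterministic.
Pipeline: Scientific data → learned compressor residual → LBRC/NGLR processing → compressed bitstream with residual correction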
Methodology: Training-free residual coding with adaptive quantization, 3D Lorenzo differencing, and entropy coding, augmented by a neural predictor in NGLR for residual bias estimation.
Results: Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ; NGLR adds a further 10-40% over LBRC and outperforms SZ.
Limitations: Requires pre-specified target NRMSE for quantization, and deterministic pipelines may lack flexibility for varying data distributions.
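Two of the stages named in this entry, 3D Lorenzo differencing and zigzag mapping, are standard enough to sketch. The code below is an illustration under assumptions, not the authors' implementation: it takes an already quantized integer block as nested lists, treats out-of-range neighbours as zero, uses hypothetical function names, and omits the quantization, bit-plane, and entropy coding stages.

```python
def _lorenzo_pred(a, i, j, k):
    """3D Lorenzo prediction from the seven already-visited neighbours."""
    def at(x, y, z):
        # Out-of-range neighbours are treated as zero (an assumption).
        return a[x][y][z] if x >= 0 and y >= 0 and z >= 0 else 0
    return (at(i - 1, j, k) + at(i, j - 1, k) + at(i, j, k - 1)
            - at(i - 1, j - 1, k) - at(i - 1, j, k - 1) - at(i, j - 1, k - 1)
            + at(i - 1, j - 1, k - 1))


def lorenzo3d_residual(q):
    """Replace each integer value by its difference from the prediction."""
    nx, ny, nz = len(q), len(q[0]), len(q[0][0])
    return [[[q[i][j][k] - _lorenzo_pred(q, i, j, k) for k in range(nz)]
             for j in range(ny)] for i in range(nx)]


def lorenzo3d_reconstruct(res):
    """Invert lorenzo3d_residual exactly by replaying the same scan order."""
    nx, ny, nz = len(res), len(res[0]), len(res[0][0])
    out = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                out[i][j][k] = res[i][j][k] + _lorenzo_pred(out, i, j, k)
    return out


def zigzag(n):
    """Map signed ints to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


# Toy integer block with a smooth (affine) trend.
field = [[[i + 2 * j + 3 * k for k in range(3)] for j in range(3)]
         for i in range(3)]
residual = lorenzo3d_residual(field)
codes = [[[zigzag(v) for v in row] for row in plane] for plane in residual]
decoded = lorenzo3d_reconstruct(
    [[[unzigzag(c) for c in row] for row in plane] for plane in codes])
```

On this smooth block the interior residuals vanish and the round trip is exact, which is the property a lossless integer stage needs before bit-plane and entropy coding.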
Abstract
ArXiv ID: 2606.05420
Authors: Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi
Abstract:
The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry's environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions. Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.
Insights
Contribution: This study quantifies the environmental impact of U.S. hyperscale data centers by analyzing their electricity consumption and carbon emissions using facility-level data and EPA eGRID plant-level carbon intensity metrics.
Core Idea: The rapid growth of hyperscale data centers, driven by AI adoption, significantly contributes to U.S. electricity demand and carbon emissions, with fossil fuels supplying a majority of their energy.
Technique: The research employs EPA eGRID plant-level carbon intensity data to attribute CO2 emissions to the electricity mix powering hyperscale data centers, enabling granular environmental footprint analysis.
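Pipeline: Facility-level data on 403 U.S. hyperscale data centers → estimation of electricity consumption and sources using load scenarios → calculation of CO2 emissions and carbon intensity metrics.
Methodology: The authors compiled facility-level information on 403 U.S. hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions under several facility-load scenarios, using the most recent EPA eGRID plant-level data.
Results: Across load scenarios, the data centers consumed approximately 68–99 TWh of electricity and were associated with about 37–54 million metric tons of CO2. Under the central scenario, demand was about 1.8% of total U.S. electricity consumption, with roughly 54% of attributed generation from fossil fuels; the electricity-weighted carbon intensity of about 545 gCO2/kWh is roughly 48% above the U.S. grid average of 370 gCO2/kWh.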
Limitations: The study relies on facility-level data availability and assumes static load scenarios; future AI-driven demand and renewable energy adoption could alter these estimates.
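The headline figures in this entry can be cross-checked with simple arithmetic. The sketch below uses only the rounded numbers quoted in the abstract, so it reproduces the reported values approximately; the helper name `intensity_g_per_kwh` is illustrative.

```python
G_PER_TONNE = 1_000_000          # grams in one metric ton
TONNES_PER_MT = 1_000_000        # metric tons in one million metric tons
KWH_PER_TWH = 1_000_000_000      # kilowatt-hours in one terawatt-hour


def intensity_g_per_kwh(mt_co2, twh):
    """Carbon intensity implied by total emissions and total electricity."""
    grams = mt_co2 * TONNES_PER_MT * G_PER_TONNE
    return grams / (twh * KWH_PER_TWH)


# Low and high ends of the reported scenario ranges.
low_intensity = intensity_g_per_kwh(37, 68)    # roughly 544 gCO2/kWh
high_intensity = intensity_g_per_kwh(54, 99)   # roughly 545 gCO2/kWh

# Excess over the national grid average, from the rounded published values.
excess = 545 / 370 - 1                         # roughly 0.47
```

Both ends of the range imply an intensity close to the reported 545 gCO2/kWh. The excess computed from the rounded inputs comes out near 47%, consistent with the reported "about 48%" if the published values were rounded.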
LLM
Abstract
ArXiv ID: 2606.05429
Authors: Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha
Abstract:
Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
Insights
Contribution: SAGE-PTQ introduces a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling costs by combining saliency-aware weight separation with graph-guided optimization, achieving state-of-the-art efficiency and accuracy.
Core Idea: SAGE-PTQ reduces scaling overhead by distinguishing salient and unsalient weights, applying multi-bit precision to salient weights and binarization to unsalient weights while leveraging sparse graph modeling for optimal grouping.
Technique: The method uses distributional statistics to separate weights, constructs sparse graphs for unsalient weights, employs dual-mode quantization, and applies adaptive saliency thresholding to optimize bit allocation and scaling factors.
Pipeline: Input LLM weights → analyze saliency distribution → separate salient/unsalient weights → model unsalient weights as sparse graph → apply dual-mode quantization → output optimized low-bit model
Methodology: SAGE-PTQ combines statistical weight analysis, graph-based grouping optimization, and adaptive thresholding to minimize scaling parameters while maintaining model accuracy through targeted precision allocation.
Results: Averages 1.03 weight bits and 0.004 scaling bits per matrix; on LLaMA-3-8B, achieves 6.74 WikiText2 perplexity (vs. 55.8 for BiLLM) while using less than 50% of BiLLM's GPU memory; on LLaMA-2-70B, delivers 1.5x faster decoding on one NVIDIA L40 GPU.
Limitations: Requires GPU memory for graph construction and thresholding; potential trade-offs in accuracy for extreme bit-reduction scenarios; compatibility with non-LLM architectures untested.
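The dual-mode idea in this entry (binarize the bulk of the weights with one scalar per group, keep a few salient weights at higher precision) can be sketched generically. This is not SAGE-PTQ: the magnitude threshold stands in for the paper's distributional saliency statistics, the graph-guided grouping is omitted, and all names and numbers below are hypothetical.

```python
def split_by_magnitude(weights, threshold):
    """Hypothetical saliency rule: |w| >= threshold counts as salient."""
    salient = [w for w in weights if abs(w) >= threshold]
    unsalient = [w for w in weights if abs(w) < threshold]
    return salient, unsalient


def binarize_group(weights):
    """Approximate each weight by alpha * sign(w).

    alpha = mean(|w|) is the least-squares optimal single scale
    for a sign-based binarization of the group.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    return alpha, [alpha if w >= 0 else -alpha for w in weights]


def average_weight_bits(salient_ratio, salient_bits):
    """Average bits per weight when unsalient weights use 1 bit each."""
    return salient_ratio * salient_bits + (1.0 - salient_ratio) * 1.0


weights = [0.5, -1.5, 1.0, -1.0, 8.0]
salient, unsalient = split_by_magnitude(weights, threshold=4.0)
alpha, binarized = binarize_group(unsalient)

# Illustrative budget only: a 1% salient ratio at 4 bits is an assumption,
# not a configuration taken from the paper.
bits = average_weight_bits(0.01, 4)
```

The budget helper shows why averages just above one bit per weight are reachable when only a small fraction of weights keeps multi-bit precision.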
Abstract
ArXiv ID: 2606.05408
Authors: Can Gurkan, Forrest Stonedahl, Uri Wilensky
Abstract:
When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.
Insights
Contribution: This paper reveals that LLM-driven program mutation exhibits strong structural convergence toward attractor regions, highlighting a systematic bias toward homogeneity despite semantic transformation capabilities.
Core Idea: LLM-based mutation pipelines inherently favor structural repetition in program space, creating tension between semantic awareness and open-ended exploration in evolutionary systems.
Technique: Analyzing mutation chains in domain-specific languages with varied prompts, model families, and stochastic replication to study convergence dynamics.
Pipeline: Program → LLM mutation with prompt/model parameters → mutation chain analysis (structural/formal) → convergence metrics and cycle detection
Methodology: Examined mutation chains without selection pressure across multiple LLMs and prompt designs, comparing results to classical GP subtree mutation baselines.
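Results: In 87% of chains, over 93% of mutations revisited a previously seen structural form, with most variation confined to terminal substitutions within recurring templates; short cycles and self-loops dominated the transition structure, and a classical GP subtree mutation operator showed no comparable convergence.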
Limitations: Findings limited to domain-specific languages; generalizability to other domains or LLM architectures remains untested; mitigation strategies for convergence bias are unexplored.
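The structural-revisit statistic in this entry can be made concrete with a toy measurement. The sketch below is an assumption-laden illustration rather than the authors' analysis code: the turtle-style programs are invented, and `skeleton` simply masks numeric terminals, which is only one possible definition of a program's structural form.

```python
import re


def skeleton(program):
    """Mask numeric terminals so only the structural template remains."""
    return re.sub(r"-?\d+(?:\.\d+)?", "_", program)


def revisit_rate(forms):
    """Fraction of mutations whose result matches a form seen earlier."""
    seen = {forms[0]}
    revisits = 0
    for form in forms[1:]:
        if form in seen:
            revisits += 1
        seen.add(form)
    return revisits / (len(forms) - 1)


def self_loops(forms):
    """Count mutations that leave the structural form unchanged."""
    return sum(1 for a, b in zip(forms, forms[1:]) if a == b)


# Hypothetical mutation chain: each entry is the program after one mutation.
chain = ["fd 10 rt 90", "fd 5 rt 45", "fd 5 fd 5", "fd 1 rt 2"]
forms = [skeleton(p) for p in chain]
rate = revisit_rate(forms)
loops = self_loops(forms)
```

Here two of the three mutations land on an already seen template, and one of them is a self-loop that changes only terminals.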
Multimodal
Abstract
ArXiv ID: 2606.05316
Authors: Shanhong Liu, Rui Cao, Pai Chet Ng, De Wen Soh
Abstract:
Multimodal memes are dynamic and often require up to date background knowledge for interpretation. Existing methods often overlook such knowledge or rely on fixed parametric knowledge of pretrained models that may be incomplete, outdated, or unavailable for emerging memes. We introduce Query Retrieve Conclude, a zero shot framework that identifies missing knowledge, retrieves open web evidence, and synthesizes evidence grounded background knowledge for meme understanding and detection. We also introduce a curated meme understanding benchmark of recent memes from 2024 to 2026 with external background knowledge annotations. Experiments on three meme understanding datasets and five meme detection tasks show that our framework improves knowledge recovery, meme understanding and downstream detection over zero shot baselines.
Insights
Contribution: Introduces Query Retrieve Conclude, a zero-shot framework for meme understanding using open-world knowledge, and curates a benchmark for recent memes with external annotations.
Core Idea: Leverages dynamic open-web knowledge retrieval to address gaps in pretrained models' fixed knowledge for interpreting emerging memes.
Technique: A three-stage framework that identifies missing knowledge in memes, retrieves relevant web evidence, and synthesizes context-aware background knowledge for interpretation.
Pipeline: Input meme → detect missing knowledge → retrieve open-web evidence → synthesize grounded background knowledge → output meme understanding/detection
Methodology: Developed a benchmark with 2024-2026 memes annotated with external knowledge, evaluated on three understanding datasets and five detection tasks against zero-shot baselines.
Results: Improved knowledge recovery, meme understanding, and downstream detection over zero-shot baselines on three understanding datasets and five detection tasks (no specific figures are given in the abstract).
Limitations: Depends on web search quality and may inherit biases from retrieved sources; requires internet access for real-time knowledge acquisition.
NLP
Abstract
ArXiv ID: 2606.05382
Authors: Ankit Pratap Singh, Xin Su, Phillip Howard
Abstract:
Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table Q&A resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over Q&A supervised fine-tuning ranging from 9.7%-16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.
Insights
Contribution: This paper introduces a synthetic contrastive reasoning-trace dataset for multi-table question answering (MMQA) and demonstrates significant improvements in model performance using Contrastive Preference Optimization (CPO).
Core Idea: The work generates validated positive and plausible negative reasoning traces with heterogeneous LLMs to create contrastive signals, enabling better compositional reasoning in multi-table Q&A.
Technique: The authors employ Contrastive Preference Optimization (CPO) to fine-tune open-weight LLMs using preference pairs derived from synthetic reasoning traces, enhancing alignment with human-like reasoning.
Pipeline: Multi-table questions → generate validated positive and negative reasoning traces → create preference pairs → fine-tune LLMs with CPO → improved multi-table Q&A performance.
Methodology: A synthetic dataset is constructed using heterogeneous LLMs to generate reasoning traces, followed by ablation studies and evaluations (automated and human) to validate the effectiveness of contrastive signals.
Results: CPO achieves 9.7%-16.3% absolute average improvements over Q&A supervised fine-tuning on MMQA, with gains up to 21 percentage points. Generated trace pairs are deemed faithful and coherent.
Limitations: Reliance on LLM-generated traces may introduce biases, and the method's effectiveness depends on the quality of synthetic data. Generalization to other domains remains unexplored.
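For orientation, the CPO objective as commonly stated combines a preference term on the chosen and rejected responses with a likelihood term on the chosen one. The sketch below computes it for a single preference pair from sequence log-probabilities; the `beta` value and the example log-probabilities are assumptions, and the paper's actual training setup is not described in the abstract.

```python
import math


def cpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """Per-pair CPO-style loss from sequence log-probabilities.

    preference term: -log sigmoid(beta * (logp_chosen - logp_rejected))
    likelihood term: -logp_chosen (keeps the chosen trace probable)
    """
    margin = beta * (logp_chosen - logp_rejected)
    prefer = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    nll = -logp_chosen
    return prefer + nll


# Hypothetical log-probabilities of a positive and a negative reasoning trace.
tied = cpo_loss(-5.0, -5.0)        # model cannot tell the traces apart
separated = cpo_loss(-5.0, -10.0)  # model prefers the positive trace
```

With the chosen trace's likelihood held fixed, the loss falls as the model separates the positive trace from the negative one, which is the contrastive signal the preference pairs supply.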
Abstract
ArXiv ID: 2606.05436
Authors: Alejandro Lozano, Keiko Ihara, Ping-Hao Yang, Carrie E. Robertson, Jennifer Stern, Allan Purdy, Hsiangkuo Yuan, Pengfei Zhang, Yulia Orlova, Olga Fermo, Jennifer Hranilovich, Fred Cohen, Todd J. Schwedt, Jenelle A. Jindal, Serena Yeung-Levy, Chia-Chun Chiang
Abstract:
Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.
Insights
Contribution: This study compares AI-generated clinical literature summaries with expert-written ones and finds that expert-written summaries are preferred, although evaluators sometimes cannot tell AI output from human work.
Core Idea: Expert-written headache literature summaries were preferred over those from a RAG-based agentic framework, yet the AI summaries were sometimes hard to distinguish from expert ones, pointing to ways of improving both human and AI summarization.
Technique: A RAG-based agentic AI framework using Sonnet, GPT-4o, and Llama 3.1 was developed to generate summaries, which were compared against expert-written ones using standardized rubrics.
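Pipeline: Clinical questions → AI-generated summaries (Sonnet, GPT-4o, Llama 3.1) and expert summaries → Blind evaluation by specialists → Scores and rankings based on correctness, completeness, conciseness, and clinical utility
Methodology: One headache specialist wrote 13 questions (three for prompt optimization, ten for evaluation). Ten headache specialists across the United States and Canada each wrote a summary for one evaluation question, giving four summaries per question (expert, Sonnet, GPT-4o, Llama). Blinded to authorship and skipping their own topic, the experts scored each summary from 1 to 10 on correctness, completeness, conciseness, and clinical utility, ranked the summaries by preference, and guessed whether each was written by an expert or an LLM.
Results: Expert-written summaries were preferred overall, but evaluators sometimes struggled to distinguish human- from AI-generated summaries. The study also identified features that experts value beyond standard evaluation metrics, which can guide refinement of both human and AI summarization pipelines.
Limitations: Small sample size (ten questions and ten evaluators), restricted to headache medicine, and the evaluators also authored the expert summaries, although each skipped the topic they wrote on.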
Robotics
Abstract
ArXiv ID: 2606.05445
Authors: Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang, Katherine Driggs-Campbell, Manling Li, Heng Ji
Abstract:
We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.
Insights
Contribution: Introduces BC-Bench, the first benchmark for evaluating MLLMs in brick assembly, and proposes Brick-Composer, a framework enhancing MLLMs with assembly skills through three complementary learning signals.
Core Idea: Brick assembly is framed as a sequential decision-making problem with subtasks of brick selection and pose estimation, addressed via physically grounded learning signals to improve MLLM capabilities.
Technique: Brick-Composer leverages Human Design Sparks (construction demonstrations), World Feedback (visual/physical consequences), and Synthetic Experience (scaled learning) to train MLLMs for assembly tasks.
Pipeline: Design description and brick components → MLLM processes with Human Design Sparks, World Feedback, and Synthetic Experience → Predicted assembly steps and brick placements
Methodology: The study evaluates state-of-the-art MLLMs on BC-Bench, identifies limitations in fine-grained selection and pose estimation, then develops Brick-Composer to address these gaps through multi-modal, physically grounded training.
Results: Brick-Composer improves brick selection accuracy by more than 3×, substantially reduces pose estimation errors, and raises strict step-level assembly success from under 1% to around 15%. After training, a Qwen-3-8B correctly composes up to 42% of the steps for a complete object.
Limitations: Assembly success remains limited (max 42% step completion), and the framework depends on high-quality demonstration data and computational resources for synthetic experience generation.
Personal Interests
Papers discovered through your interest topics.
Multi-Agent Systems
Abstract
ArXiv ID: 2606.06399
Authors: Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao
Abstract:
Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.
Insights
Contribution: CollabSim introduces a CSCW-grounded framework to systematically evaluate the collaborative competence of LLM agents in multi-agent systems, addressing gaps in existing evaluations that focus on task outcomes or individual agent capabilities.
Core Idea: Collaborative competence in LLM agents requires establishing common ground, maintaining shared understanding, and resolving misalignment. These capabilities are often overlooked in current MAS evaluations, and CollabSim measures them explicitly through controlled experiments.
Technique: CollabSim combines a theory-driven definition of collaborative capabilities with controlled interaction conditions and action-level probing of agents' internal states to isolate and analyze collaborative dynamics.
Pipeline: Interaction conditions → simulation of multi-agent tasks with LLMs → quantitative metrics on collaborative competence (e.g., alignment, coordination, conflict resolution).
Methodology: The framework integrates CSCW principles into MAS evaluation, using configurable scenarios to manipulate communication constraints and probe agents' ability to adapt through text-based collaboration.
Results: Experiments across four LLMs demonstrated CollabSim's ability to detect condition-specific performance patterns, distinguish model capabilities, and reveal task-dependent effects of agent design.
Limitations: CollabSim's text-based interaction focus may limit generalizability to non-linguistic collaboration, and its effectiveness depends on the quality of LLMs' internal state representations for probing.
Tech News
AI Safety
OpenAI introduces 'Lockdown Mode' as a security feature to restrict access to sensitive AI capabilities, enhancing safety by limiting potential misuse of advanced models.
A Reddit post discusses the benefits and risks of AI as presented at Harvard Class Day 2026, highlighting potential advancements and ethical concerns. The discussion likely covers topics such as AI's impact on society, job displacement, and safety measures.
The post explores debates on whether evolving AI systems could achieve consciousness, emphasizing the importance of scientific evidence in discussions about animal minds, fetal awareness, and conscious machines. It highlights the intersection of AI ethics and philosophical questions about machine awareness.
A Reddit post discusses Anthropic's call to pause AI development due to safety concerns, highlighting risks of losing control over increasingly intelligent AI systems, potential misuse in bioweapons, and the need for governance frameworks to prevent catastrophic outcomes.
Agentic AI
A Reddit user seeks clarification on 'Agent OS,' requesting a simplified explanation of its purpose and functionality. The user highlights interest in a centralized dashboard for AI tools, particularly beneficial for neurodivergent individuals.
Computing Systems
The article discusses performance issues arising from using UUIDs as primary keys in SQLite databases, highlighting fragmentation and inefficiencies in indexing. It explores alternatives like sequential IDs for better performance in specific use cases.
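The contrast between the two key styles in the item above can be sketched with Python's built-in sqlite3 module. This is an illustration only, not the article's benchmark: the table names, row count, and the "out of order" measure are assumptions chosen for the example. It shows two things that are commonly cited as sources of the overhead: a TEXT primary key gets its own automatic index while an INTEGER PRIMARY KEY does not, and random UUIDs arrive in an order unrelated to key order, so index inserts land all over the B-tree instead of at its end.

```python
import sqlite3
import uuid


def build_tables(n_rows=2000):
    """Create one table per key style and insert n_rows into each.

    Table names and row count are hypothetical, picked for illustration.
    """
    conn = sqlite3.connect(":memory:")
    # INTEGER PRIMARY KEY is an alias for SQLite's rowid: no extra index is built.
    conn.execute("CREATE TABLE seq_items (id INTEGER PRIMARY KEY, payload TEXT)")
    # A TEXT primary key on an ordinary table gets a separate automatic index.
    conn.execute("CREATE TABLE uuid_items (id TEXT PRIMARY KEY, payload TEXT)")
    for i in range(n_rows):
        payload = f"row-{i}"
        conn.execute("INSERT INTO seq_items (payload) VALUES (?)", (payload,))
        conn.execute(
            "INSERT INTO uuid_items (id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), payload),
        )
    conn.commit()
    return conn


def index_names(conn, table):
    """Names of the indexes SQLite maintains for `table`."""
    return [row[1] for row in conn.execute(f"PRAGMA index_list({table})")]


def out_of_order_fraction(conn, table):
    """Fraction of inserts whose key sorts before the previously inserted key."""
    keys = [row[0] for row in conn.execute(f"SELECT id FROM {table} ORDER BY rowid")]
    backwards = sum(1 for prev, cur in zip(keys, keys[1:]) if cur < prev)
    return backwards / (len(keys) - 1)


if __name__ == "__main__":
    conn = build_tables()
    for table in ("seq_items", "uuid_items"):
        print(table, index_names(conn, table), out_of_order_fraction(conn, table))
```

Sequential keys always append at the right edge of the tree, whereas roughly half of the random UUID inserts sort before their predecessor. Whether this matters in practice depends on table size and workload, which this sketch does not measure.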
New York has implemented a data center moratorium and new consumer protections, while environmental and housing initiatives face delays. The move highlights growing concerns over energy consumption, infrastructure impacts, and regulatory oversight of tech industries.
A web designer shares how they use automated website analysis tools to identify issues (e.g., design, SEO) and generate personalized cold emails for outreach, improving client engagement and scalability. The method leverages automation to create relevant, non-generic messages without manual checks.
A Reddit user proposes using sewage water as a cooling method for AI servers, sparking discussion about sustainable cooling solutions for data centers. The post highlights growing concerns about energy efficiency and environmental impact in AI infrastructure.
General
This Hacker News discussion explores why the HN community often expresses skepticism or criticism toward AI, touching on concerns like overhype, ethical risks, and potential societal impacts.
The S&P 500 has blocked SpaceX, OpenAI, and Anthropic from entry, citing profitability requirements. The decision highlights the index-eligibility hurdles facing AI firms seeking mainstream financial inclusion.
A Reddit user asks for hands-on skills valuable in the AI era, focusing on building AI systems. The discussion likely covers practical areas like coding, model development, and AI toolchains.
A Reddit user discusses the performance of AI models on the ARC-AGI-3 benchmark, arguing that failures indicate a lack of task understanding rather than minor scoring issues, highlighting ongoing challenges in achieving reliable AGI solutions.
A Reddit user argues AI will act as a productivity tool rather than fully replacing jobs, emphasizing the need for human oversight, clear instructions, and the limitations of AI in handling edge cases. They suggest AI may reduce repetitive tasks but human judgment will remain critical.
This post, part of a series by Prof. Paul Nussbaum, aims to explain AI concepts to children, focusing on simplifying complex ideas for younger audiences. It likely covers foundational AI principles in an accessible format.
LLM
The Trump administration is reportedly in discussions with OpenAI about a potential government stake in the AI startup, which could influence the development and regulation of advanced AI technologies. OpenAI, known for creating large language models like GPT, may face increased scrutiny or collaboration with federal agencies.
MLOps
Nordstjernen 1.0 is a new release of an open-source project, likely related to AI/ML development, though specific details are not provided in the content. The release may include updates to tools, frameworks, or methodologies for AI research or application.
NLP
A user tested AI-generated content against major text detection scanners and found inconsistencies, with human-written articles often flagged as AI-generated. The experiment highlights limitations in current AI detection tools, raising questions about their reliability for original content.
This post argues that attention entropy collapse in transformers is a geometric consequence of their Riemannian structure, not a bug. It links low attention entropy to high curvature in token embedding manifolds, causing brittleness, and proposes a temperature schedule as a fix.
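To make the quantities in the item above concrete, here is a minimal sketch of attention entropy for a single query and of how a softmax temperature changes it. It does not reproduce the post's geometric argument or its actual schedule, which the summary does not specify; the logits and temperature values below are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_entropy(scores, temperature=1.0):
    """Entropy of the attention distribution one query places over its keys."""
    return entropy(softmax(scores, temperature))


if __name__ == "__main__":
    scores = [4.0, 1.0, 0.5, 0.0]  # hypothetical query-key logits
    for t in (0.5, 1.0, 2.0, 4.0):
        print(f"T={t}: entropy={attention_entropy(scores, t):.3f} nats")
```

A low temperature concentrates the weights on the largest logit and drives entropy toward zero, which is the collapse regime; a higher temperature flattens the distribution toward the uniform maximum of log(n) nats. A temperature schedule would vary this value over training or across layers.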
Robotics
A user is developing a GitHub repository for Multi-Agent RL-based drone environments using MuJoCo, seeking community feedback to improve the project. The goal is to create a unified platform for research and experimentation in drone-related RL tasks.
GitHub Trending
Trending repositories on GitHub filtered and scored for relevance to your interests.
Agentic AI
Hermes Agent is a self-improving AI agent with a built-in learning loop that creates and refines skills autonomously. It integrates with multiple platforms and LLMs, making it highly relevant to agentic AI research and multi-agent system development.
The ECC repository focuses on optimizing agent harness performance for coding agents like Claude Code, enhancing skills, memory, and security. It directly addresses agentic AI and LLM optimization, aligning with core interests in multi-agent systems and AI research.
This repository provides a framework for building, orchestrating, and deploying AI agents and multi-agent workflows, supporting Python and .NET. It directly addresses the user's core interest in Multi-Agent Systems and Agentic AI, offering practical tools for research and development in this area.
This repository implements an AI agent skill that autonomously researches topics across social media, forums, and the web, then synthesizes grounded summaries. It directly advances Agentic AI by demonstrating multi-source information aggregation and synthesis capabilities, key to autonomous research agents.
CopilotKit provides a frontend framework for building agent-native interfaces and generative UIs using React and Angular. It directly supports agentic AI development by offering tools for creating interactive, AI-powered user experiences.
MemPalace is a local-first AI memory system that stores and retrieves conversation history via semantic search, critical for agentic systems requiring persistent, structured memory. Its pluggable backend (e.g., ChromaDB) and focus on verbatim storage align with research in memory-augmented agents and retrieval-augmented generation (RAG).
This repository provides examples of plugins for OpenAI's Codex, demonstrating how AI can integrate with tools like Figma, Notion, and SwiftUI to automate workflows. It is relevant to agentic AI as it showcases plugin-based interactions and agent-like behaviors in code generation and task automation.
A swarm intelligence engine focused on multi-agent simulations and prediction tasks, relevant to Multi-Agent Systems research. It explores applications in financial forecasting and public opinion analysis, leveraging Python for implementation.
This repository provides a CLI tool enabling AI agents to scrape and read content from platforms like Twitter, Reddit, YouTube, and more without API fees. It is relevant to Agentic AI as it offers infrastructure for agents to autonomously gather information from the web, supporting tasks like research, data collection, and environment interaction.
Computer Vision
PaddleOCR is a powerful OCR toolkit that converts images/PDFs into structured data, supporting 100+ languages. Its document-parsing capabilities align with interests in RAG and multimodal systems by enabling document understanding for AI applications.
LLM
This repository provides tools to compress LLM inputs (logs, files, RAG chunks) by 60-95% while preserving answer quality, directly addressing token optimization for efficient LLM usage. It is highly relevant to LLM and RAG workflows, which are core to the user's interests in AI/ML.
This repository implements an LLM-powered stock analysis system for A/H/US markets, integrating multi-data sources, real-time news, and decision-making dashboards. It is highly relevant for studying LLM applications in quantitative finance and autonomous decision-making systems.
This repository provides an optimized LLM inference server for Apple Silicon with features like continuous batching and SSD caching, addressing efficient model deployment on macOS. It is highly relevant to LLM research and MLOps practices, particularly for developers targeting Apple hardware.
Robotics
NVIDIA Cosmos provides a platform for building Physical AI with omnimodal world models capable of processing language, vision, audio, and actions. It directly advances robotics, autonomous systems, and embodied AI through unified generation and reasoning capabilities.