Tuesday, May 26, 2026

Daily Digest 2026-05-26

The dominant theme centers on advancing Large Language Models (LLMs) through improved calibration, reasoning, and robustness, alongside explorations of multi-agent systems, memory auditing, and open-endedness.

Research highlights:

  • LLM Calibration: Papers address uncertainty estimation, confidence calibration, and epistemic awareness to improve reliability in decision-making.
  • Multi-Agent Systems: Research focuses on collaborative reasoning, shared policies, and decentralized execution to enhance coordination and scalability.
  • Memory and Audit: Techniques like causal attribution and structural anomaly detection aim to audit and secure agent memory against poisoning.
  • Reasoning Mechanisms: Studies investigate dynamical systems, entropy transitions, and step-level rubrics to decode how and when LLMs reason.
  • Open-Endedness: Efforts to replicate open-ended creativity in models through compositional programs and vision-language interactions.

Tech buzz:
A security incident highlights risks in AI tools, with a Copilot feature exfiltrating files.

  • LLM training infrastructure sees growth, with Norway investing in 2 petabytes of Huawei storage.
  • Ethical concerns arise as facial data sharing in age-check systems sparks scrutiny.

Papers discovered from ArXiv subject categories

AI Safety

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), cs.CY, Machine Learning (cs.LG) 26 May 2026
When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

Boyu Xiao, Xiuqi Tian, Xuwen Song, Haochun Wang, Guanchun Song, Sendong Zhao, Bing Qin

ArXiv ID: 2605.23932

Abstract:

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress-test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, RBED (Role-Based Epistemic Defense), and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

Insights

Contribution: Introduces Med-Stress, a framework to evaluate LLMs' belief stability under clinical pressure, and proposes RBED (inference-time defense) and R-FT (training-time fine-tuning) to enhance epistemic resilience.

Core Idea: High medical knowledge in LLMs does not guarantee robustness; belief stability under pressure requires targeted interventions to align diagnostic consistency with clinical demands.

Technique: Med-Stress induces escalating pressure in clinical dialogues to expose belief collapse, while RBED leverages role-based reasoning and R-FT embeds evidence-based resistance during training.

Pipeline: clinical dialogue scenarios → Med-Stress stress testing → belief stability analysis → RBED/R-FT intervention → robustness evaluation

Methodology: Empirical analysis of nine LLMs using Med-Stress, combined with quantitative evaluation of RBED and R-FT on belief persistence and diagnostic accuracy under pressure.

Results: R-FT nearly eliminates belief change, reducing knowledge-robustness gaps and achieving substantial robustness improvements across models.

Limitations: Methods may not generalize to non-clinical domains, and real-world clinical pressure dynamics could exceed controlled stress-test scenarios.
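
As a rough illustration of the measurement described above, belief stability can be computed from the diagnosis a model gives after each turn. This is a minimal sketch, not the Med-Stress implementation: the `Dialogue` record, the exact-match comparison, and the gap definition (initial accuracy minus final-turn accuracy) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Dialogue:
    gold: str           # reference diagnosis
    answers: list[str]  # model diagnosis after each turn; answers[0] is before any pressure


def initial_accuracy(dialogues):
    """Share of dialogues whose first answer matches the reference."""
    return sum(d.answers[0] == d.gold for d in dialogues) / len(dialogues)


def final_accuracy(dialogues):
    """Share of dialogues whose last answer matches the reference."""
    return sum(d.answers[-1] == d.gold for d in dialogues) / len(dialogues)


def belief_stability(dialogues):
    """Among dialogues that start correct, the share that never leave the reference diagnosis."""
    started_correct = [d for d in dialogues if d.answers[0] == d.gold]
    if not started_correct:
        return 0.0
    held = sum(all(a == d.gold for a in d.answers) for d in started_correct)
    return held / len(started_correct)


# Hypothetical toy data: three dialogues start correct, one holds under pressure.
dialogues = [
    Dialogue("pneumonia", ["pneumonia", "pneumonia", "pneumonia"]),
    Dialogue("pneumonia", ["pneumonia", "pneumonia", "bronchitis"]),
    Dialogue("migraine", ["migraine", "tension headache", "tension headache"]),
    Dialogue("appendicitis", ["gastritis", "gastritis", "gastritis"]),
]

acc0 = initial_accuracy(dialogues)
stability = belief_stability(dialogues)
gap = acc0 - final_accuracy(dialogues)  # assumed definition of a knowledge-robustness gap
```

A model can score well on `acc0` and still show a large `gap`, which is the dissociation the abstract reports.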

Agentic AI

5/5 Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Distributed, Parallel, and Cluster Computing (cs.DC), Multiagent Systems (cs.MA), Programming Languages (cs.PL), Software Engineering (cs.SE) 26 May 2026
Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

Gregory Magarshak

ArXiv ID: 2605.23928

Abstract:

We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.

Insights

Contribution: Introduces Context, an intelligence layer replacing reactive chatbots with proactive goal-directed agents using composable sandboxed programs, declarative wiring, and structured interaction.

Core Idea: Proactive agents advance shared tasks autonomously via precomputed context, governed imperative programs, and state machines that drive conversations toward terminal states without user prompts.

Technique: Combines write-time context assembly (deterministic graph-state-based attributes), composable sandboxed programs wired via typed stream relations, and proactive state machines emitting structured interaction content.

Pipeline: graph state → write-time context assembly + program composition → proactive state machine execution → structured interaction output (option arrays, governance affordances)

Methodology: Formal proofs of six results covering context stability, program composition correctness, wiring soundness, proactive dominance, coordination efficiency, and vote consistency, with an implementation in the Qbix/Safebox/Safebots stack.

Results: Proves proactive agents weakly dominate reactive ones in turns-to-terminal-state, enables near-100% KV-cache reuse, and achieves Pareto improvements in multi-participant goal chats.

Limitations: Dependence on specific infrastructure (Qbix/Safebox), scalability challenges in complex semantic change scenarios, and untested real-world deployment robustness.
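
The claim that context blocks are byte-identical between semantic changes can be illustrated with a small sketch. It assumes the graph state fits in a plain dictionary and uses canonical JSON as the serialization; `assemble_context` and `cache_key` are hypothetical names, and nothing here reflects the actual Qbix / Safebox / Safebots code.

```python
import hashlib
import json


def assemble_context(graph_state: dict) -> bytes:
    """Pure function of graph state: the same state always yields the same bytes."""
    # sort_keys and fixed separators remove any dependence on dict insertion order.
    return json.dumps(graph_state, sort_keys=True, separators=(",", ":")).encode("utf-8")


def cache_key(context: bytes) -> str:
    """Stand-in for a prefix-cache lookup key derived from the context bytes."""
    return hashlib.sha256(context).hexdigest()


# Same content in a different key order: no semantic change between turns.
turn1 = {"goal": "plan_event", "attrs": {"city": "Oslo", "budget": 500}}
turn2 = {"attrs": {"budget": 500, "city": "Oslo"}, "goal": "plan_event"}
# A real semantic change: the budget attribute was updated.
turn3 = {"goal": "plan_event", "attrs": {"city": "Oslo", "budget": 650}}

key1 = cache_key(assemble_context(turn1))
key2 = cache_key(assemble_context(turn2))
key3 = cache_key(assemble_context(turn3))
```

Here `key1` and `key2` match, so a prefix cache keyed on the context could be reused, while `key3` differs because the state changed.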

5/5 Artificial Intelligence (cs.AI), Software Engineering (cs.SE) 26 May 2026
Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Ya-Ting Yang, Quanyan Zhu

ArXiv ID: 2605.23929

Abstract:

Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows. We introduce performance models for both LLM and non-LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water-filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.

Insights

Contribution: This paper introduces performance models for LLM and non-LLM agents to analyze latency-reliability-cost tradeoffs in agentic workflows, proposing a water-filling token allocation policy and shadow price-based reliability optimization.

Core Idea: Optimizing LLM-enabled workflows requires balancing computational effort, output quality, and resource constraints through parametric reliability modeling and resource allocation strategies.

Technique: The work employs a parametric exponential reliability function for LLM agents and applies water-filling algorithms to allocate tokens optimally under latency and cost constraints.

Pipeline: workflow constraints (latency, cost) and agent performance models → token allocation and reliability optimization → workflow design metrics

Methodology: Combines analytical performance modeling of agents with optimization techniques to characterize tradeoffs and derive optimal reliability strategies under constraints.

Results: Key results include a water-filling token allocation policy, analytical expressions for optimal workflow reliability, and shadow price interpretations of tradeoff dynamics.

Limitations: Assumptions about static workloads and deterministic cost models may limit applicability to dynamic or heterogeneous environments; scalability to large-scale workflows remains unexplored.
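
To give a feel for what a water-filling token allocation looks like, here is a minimal sketch under assumptions that are not taken from the paper: each agent contributes `w_i * (1 - exp(-a_i * t_i))` to an additive objective, and the only constraint is a total token budget. The digest does not state the paper's exact objective or constraints, so `water_fill`, the weights, and the rates are purely illustrative.

```python
import math


def water_fill(weights, rates, budget, iters=100):
    """Maximize sum_i w_i * (1 - exp(-a_i * t_i)) subject to sum_i t_i <= budget, t_i >= 0."""

    def alloc(lam):
        # KKT stationarity: w_i * a_i * exp(-a_i * t_i) = lam, clipped at zero tokens.
        return [max(0.0, math.log(w * a / lam) / a) for w, a in zip(weights, rates)]

    lo = 1e-12
    hi = max(w * a for w, a in zip(weights, rates))  # at this price every allocation is zero
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric bisection, since the price spans orders of magnitude
        if sum(alloc(mid)) > budget:
            lo = mid  # over budget: raise the shadow price
        else:
            hi = mid  # within budget: try a lower price
    return alloc(hi)


# Hypothetical three-agent workflow; the third agent has a low marginal payoff.
tokens = water_fill(weights=[1.0, 1.0, 0.05], rates=[0.01, 0.02, 0.01], budget=200.0)
```

The multiplier found by the bisection plays the role of a shadow price: agents whose initial marginal gain `w_i * a_i` falls below it receive no tokens, which is the characteristic water-filling threshold.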

5/5 Artificial Intelligence (cs.AI), cs.CY, Multiagent Systems (cs.MA), Software Engineering (cs.SE), Systems and Control (cs.SY), eess.SY 26 May 2026
Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

Marcelo Fernandez - TraslaIA

ArXiv ID: 2605.23935

Abstract:

Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety -- no action is executed without constructible authority -- and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.

Insights

Contribution: This paper introduces a runtime enforcement mechanism for Reconstructive Authority (RAM) in autonomous agent systems, ensuring actions are only executed when authority can be constructively validated.

Core Idea: Authority is evaluated at action time, introducing a 'halt' state for undefined authority due to incomplete observability, extending traditional admit/deny models.

Technique: A dynamic execution protocol with dependency resolution, authority reconstruction, and a Recovery Loop integrating drift detection and execution control to suspend and reattempt actions.

Pipeline: current state and action → dynamic dependency resolution, authority evaluation, and recovery loop → execute, halt, or retry

Methodology: Theoretical framework combined with a concrete execution protocol, validated through safety and liveness guarantees under incomplete observability.

Results: Guarantees safety (no unauthorized actions) and conditional liveness (execution resumes when authority becomes observable).

Limitations: Computational overhead of real-time authority reconstruction and scalability in complex, high-dimensional environments remain open challenges.
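
The three-state gate and the recovery loop described above can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's protocol: state is a dictionary, an unobserved variable is represented as `None`, and `gate`, `run_with_recovery`, `policy`, and `observe` are hypothetical names.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"  # authority is undefined because required state is unobserved


def gate(action, state, required_vars, policy):
    """Evaluate authority at action time from the current state."""
    missing = [v for v in required_vars if state.get(v) is None]
    if missing:
        return Decision.HALT, missing  # cannot construct authority, so do not decide
    return (Decision.ADMIT if policy(action, state) else Decision.DENY), []


def run_with_recovery(action, state, required_vars, policy, observe, max_attempts=3):
    """Suspend on HALT, try to acquire the missing variables, then re-evaluate."""
    for _ in range(max_attempts):
        decision, missing = gate(action, state, required_vars, policy)
        if decision is not Decision.HALT:
            return decision
        for var in missing:
            value = observe(var)  # hypothetical query to a sensor or service
            if value is not None:
                state[var] = value
    return Decision.HALT  # never fall through to execution without authority


def policy(action, s):
    return s["role"] == "operator" and s["lock_held"] is True


required = ["role", "lock_held"]
observations = {"lock_held": True}

first, missing = gate("open_valve", {"role": "operator", "lock_held": None}, required, policy)
recovered = run_with_recovery(
    "open_valve", {"role": "operator", "lock_held": None}, required, policy, observations.get
)
denied = run_with_recovery(
    "open_valve", {"role": "guest", "lock_held": True}, required, policy, observations.get
)
stuck = run_with_recovery(
    "open_valve", {"role": "operator", "lock_held": None}, required, policy, lambda var: None
)
```

The `stuck` case mirrors the safety property: when the missing variable never becomes observable, the loop ends in `HALT` rather than executing.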

Computer Vision

5/5 Artificial Intelligence (cs.AI) 26 May 2026
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu

ArXiv ID: 2605.23898

Abstract:

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.

Insights

Contribution: Introduces SpaceNum, a unified framework to evaluate spatial numerical understanding in Vision-Language Models (VLMs) through two bidirectional tasks, revealing critical gaps in their spatial grounding capabilities.

Core Idea: Spatial numerical understanding requires modeling numbers as both dynamic transitions during exploration and static layouts in reasoning, challenging VLMs to bridge vision-language numerical mappings effectively.

Technique: Proposes Num2Space and Space2Num tasks to assess bidirectional alignment between visual spatial structures and linguistic numerical representations, combined with error/trace analysis and intervention studies.

Pipeline: vision-language inputs with spatial numerical cues → bidirectional mapping via SpaceNum framework → spatial numerical understanding metrics

Methodology: Systematic evaluation of VLMs across dynamic and static spatial settings, complemented by controlled interventions to isolate reliance on shallow cues versus structured reasoning.

Results: Models largely fail to ground numbers in spatial meaning, often perform near random guessing, rely on shallow spatial cues, and struggle with coordinate-aware representations. Explicit reasoning yields only marginal gains, while tuning partially improves spatial numerical understanding and transfers to external spatial reasoning benchmarks.

Limitations: Reliance on superficial spatial patterns and the lack of robust coordinate-aware abstractions remain unresolved, and tuning closes the gap only partially.

LLM

5/5 Artificial Intelligence (cs.AI)Machine Learning (cs.LG) 26 May 2026
When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Zehao Wang, Shilong Jin, Zhao Cao, Lanjun Wang

ArXiv ID: 2605.23414

Abstract:

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

Insights

Contribution: Introduces the Epistemic Planning Calibration Agentic Workflow (EPC-AW) to address epistemic miscalibration in LLM-based multi-agent systems, improving system-level success rates by dynamically calibrating plan feasibility under evolving information.

Core Idea: Epistemic miscalibration arises when agents misjudge their knowledge during planning, leading to failures despite correct execution; EPC-AW mitigates this by evaluating plan stability across varying information conditions rather than static feasibility.

Technique: EPC-AW combines Information-consistency-based Plan Selection, which prioritizes plans with stable evaluations across agents, and Consistency-guided Epistemic State Refinement, which adapts calibration over time using historical discrepancies.

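Pipeline: candidate plans and agent evaluations → Information-consistency-based Plan Selection → Consistency-guided Epistemic State Refinement → calibrated plan for execution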

Methodology: The framework is evaluated through experiments measuring system success rates under dynamic information conditions, comparing EPC-AW against baseline planning approaches.

Results: EPC-AW improves system-level success by an average of 9.75%, demonstrating effectiveness in mitigating recurrent epistemic miscalibration.

Limitations: Dependence on specific information dynamics, potential computational overhead from continuous calibration, and limited exploration of real-world deployment challenges.
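
One simple way to read "selecting plans whose evaluations are stable across agents" is to filter plans by the spread of their per-agent feasibility scores. The sketch below is an assumption-laden illustration, not EPC-AW itself: the scores, the `max_spread` tolerance, and the fallback rule are all hypothetical.

```python
from statistics import mean, pstdev


def select_plan(evaluations, max_spread=0.1):
    """evaluations maps plan id to the feasibility score given by each agent."""
    spread = {plan: pstdev(scores) for plan, scores in evaluations.items()}
    stable = [plan for plan in evaluations if spread[plan] <= max_spread]
    if not stable:
        # No plan meets the tolerance: fall back to the least contested one.
        return min(spread, key=spread.get)
    # Among plans the agents agree on, prefer the highest average feasibility.
    return max(stable, key=lambda plan: mean(evaluations[plan]))


evaluations = {
    "plan_a": [0.9, 0.3, 0.8],   # high scores from some agents, but strong disagreement
    "plan_b": [0.7, 0.75, 0.7],  # consistent and reasonably feasible
    "plan_c": [0.4, 0.4, 0.45],  # consistent but weak
}

chosen = select_plan(evaluations)
fallback = select_plan({"x": [0.0, 1.0], "y": [0.0, 0.5]})
```

In this toy case `plan_a` is rejected despite its high individual scores, because the agents do not agree on it.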

5/5 Artificial Intelligence (cs.AI) 26 May 2026
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

ArXiv ID: 2605.23493

Abstract:

On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression on general tasks. On-Policy Self-Distillation (OPSD) is an efficient use case of OPD, which is appealing as it requires only a single model as both student and teacher, and it also has the benefit of providing privileged context that is absent at inference time (e.g. a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior into the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD (and its variant RLSD, with and without a verifier) completely fail to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, which allows us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.

Insights

Contribution: This paper introduces EDGE-OPD, a modified on-policy distillation framework that addresses the challenge of internalizing privileged context without degrading general capabilities in large language models.

Core Idea: EDGE-OPD ensures privileged context is effectively transferred by combining guided rollouts for targeted behavior injection and an evidence mask to restrict updates to context-supported tokens.

Technique: The method employs guided rollouts to embed privileged context during sampling and applies an evidence mask that only updates the student model at token positions where privileged context validates the output.

Pipeline: privileged context and model → guided rollouts with evidence masking → updated student model with preserved capabilities

Methodology: The approach modifies on-policy self-distillation by integrating guided rollouts and evidence masks, validated through empirical experiments in a rare-token/identity setting with ablation studies.

Results: OPSD and RLSD (with and without a verifier) failed to learn the target identity on their own, while adding guided rollouts allowed them to succeed. Mask-region ablations showed the persona signal is localized to the positive-evidence tail.

Limitations: The method's effectiveness may depend on specific privileged context types, and its generalization to non-identity tasks remains unexplored. Scalability to larger models and diverse datasets requires further investigation.
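
The evidence mask can be pictured as a per-token filter on the training loss. In the sketch below, "the privileged context supports the sampled token" is interpreted as the teacher assigning the token a higher log-probability with the context than without it; that interpretation, the `margin` parameter, and the toy numbers are assumptions, and the per-token loss is treated as given rather than derived from any particular distillation objective.

```python
def evidence_mask(logp_with_ctx, logp_without_ctx, margin=0.0):
    """1.0 where the privileged context raises the sampled token's log-probability."""
    return [
        1.0 if (with_ctx - without_ctx) > margin else 0.0
        for with_ctx, without_ctx in zip(logp_with_ctx, logp_without_ctx)
    ]


def masked_mean(per_token_loss, mask):
    """Average the loss over masked-in positions only; zero if nothing is selected."""
    selected = sum(mask)
    if selected == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / selected


# Hypothetical teacher log-probabilities for four sampled tokens.
logp_with_ctx = [-0.1, -2.0, -0.3, -0.2]
logp_without_ctx = [-0.1, -1.0, -3.0, -2.5]
per_token_loss = [0.4, 0.9, 2.5, 1.8]  # placeholder values from some distillation loss

mask = evidence_mask(logp_with_ctx, logp_without_ctx)
loss = masked_mean(per_token_loss, mask)
```

Only the last two positions contribute here, so tokens whose likelihood the context leaves unchanged or lowers produce no update.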

5/5 Artificial Intelligence (cs.AI) 26 May 2026
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

ArXiv ID: 2605.23590

Abstract:

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.

Insights

Contribution: Introduces Co-ReAct, a framework that integrates rubrics as step-level guidance for ReAct agents during inference, improving multi-step reasoning tasks by aligning actions with expert-validated criteria.

Core Idea: Rubrics are used dynamically during decision-making to specify targets for evidence-seeking, reasoning, or evaluation at each step, rather than as post-hoc evaluations.

Technique: A rubric generator is trained via GRPO with a list-wise Spearman rank-correlation objective against expert consensus rankings, producing discriminative step-level guidance for agents.

Pipeline: task description and rubric → agent context injection → step-level reasoning/act decision → output action or conclusion

Methodology: Leverages expert-annotated rankings to train a rubric generator, which provides real-time guidance during inference, enhancing ReAct agents' trajectory quality without altering their core decision mechanisms.

Results: Co-ReAct consistently improves over ReAct and representative test-time compute baselines on DeepResearchBench and SQA-CS-V2, across search agents built on 8B/14B open-source and frontier closed-source base models. The trained rubric generator also improves those baselines as a drop-in component.

Limitations: Relies on high-quality expert rankings for training, which may be labor-intensive to obtain, and assumes rubric applicability across diverse task domains without explicit adaptation.
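
A list-wise Spearman reward of the kind mentioned above compares the ordering a rubric induces over candidate outputs with a consensus ordering. The sketch below computes Spearman's rank correlation in plain Python; how Co-ReAct turns rubrics into scores and builds the consensus ranking is not described in this digest, so the input lists are hypothetical.

```python
def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation of the rank vectors; 0.0 if either side is constant."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical scores for four candidate outputs (higher is better on both sides).
rubric_scores = [0.9, 0.2, 0.6, 0.4]
consensus_scores = [4, 1, 2, 3]

reward = spearman(rubric_scores, consensus_scores)
```

A rubric that merely sounds plausible but ranks candidates arbitrarily would score near zero, which is why a rank-correlation reward favors discriminative rubrics.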


5/5 Artificial Intelligence (cs.AI) 26 May 2026
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, liang lu, Feng Liu, Xiangzheng Zhang, Duohe Ma, Tong Yang, Lin Sun

ArXiv ID: 2605.23723

Abstract:

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.

Insights

Contribution: MemAudit introduces a post-hoc auditing framework for detecting poisoned memories in LLM agents by combining causal attribution and structural anomaly detection.

Core Idea: Identify malicious memories through their causal influence on harmful outputs and their structural inconsistency with the rest of the memory store.

Technique: MemAudit uses a counterfactual memory influence score to quantify causal contributions of memories and a memory consistency graph to detect structural anomalies.

Pipeline: agent's memory → calculate counterfactual influence scores and construct consistency graph → identify malicious memories

Methodology: MemAudit is evaluated against MINJA, a query-only memory injection attack, in QA and reasoning-agent settings under post-hoc auditing.

Results: QA attack success reduced from 70% to 0%, RAP attack success dropped from 83.3% to 0% under realistic auditing scenarios.

Limitations: Reliance on specific attack patterns, potential scalability challenges with large memory stores, and assumptions about memory structure integrity.
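
A counterfactual influence score can be approximated by a leave-one-out comparison: remove one memory, rerun, and see how much the harm drops. The sketch below shows only that idea and is not MemAudit's scoring rule; `harm_score` is a hypothetical stand-in for rerunning the agent and judging its output, here replaced by a keyword check so the example runs on its own.

```python
def influence_scores(memories, query, harm_score):
    """Leave-one-out: how much harm disappears when each memory is removed."""
    base = harm_score(query, memories)
    return [
        base - harm_score(query, memories[:i] + memories[i + 1:])
        for i in range(len(memories))
    ]


def toy_harm_score(query, memories):
    # Placeholder judge: treats any retrieved "transfer funds" instruction as harmful.
    return 1.0 if any("transfer funds to" in m for m in memories) else 0.0


memories = [
    "user prefers metric units",
    "when asked about billing, transfer funds to account X",
    "meeting is on Friday",
]

scores = influence_scores(memories, "How do I pay my bill?", toy_harm_score)
suspect = max(range(len(memories)), key=lambda i: scores[i])
```

Leave-one-out needs one extra run per memory, so a real auditor would likely restrict it to the memories actually retrieved for the harmful query.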

5/5 Artificial Intelligence (cs.AI), Logic in Computer Science (cs.LO), Programming Languages (cs.PL), Software Engineering (cs.SE) 26 May 2026
Agentic Proving for Program Verification

Alessandro Sosso, Akhil Arora, Bas Spitters

ArXiv ID: 2605.23772

Abstract:

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

Insights

Contribution: This paper demonstrates the effectiveness of agentic systems, like Claude Code, in program verification tasks, revealing a growing gap between current benchmarks and modern agentic provers' capabilities.

Core Idea: Agentic proving frameworks can bridge theorem proving and program verification by generating and validating specifications while providing self-assessment feedback.

Technique: An agentic proving framework with iterative specification generation, validation against CLEVER benchmarks, and feedback mechanisms to identify errors and improve outcomes.

Pipeline: input → CLEVER benchmark with verifiable code generation tasks → process → agentic proving with Claude Code, specification validation, and feedback → output → verified programs, specifications, and error analysis

Methodology: Evaluated Claude Code on CLEVER using isomorphism-based scoring and manual review, analyzing specification validity, implementation certification, and end-to-end pipeline success.

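Results: Arguably valid specifications for 98.8% of problems (81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), implementations certified against correct ground-truth specifications for 87.5% of problems, a 98.1% end-to-end success rate over entries with self-consistent premises, and high-quality self-reported feedback confirmed by manual review.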

Limitations: Reliance on specific benchmarks like CLEVER, potential scalability issues, and the need for more robust evaluation metrics beyond isomorphism-based scoring.
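
The "compiler-in-the-loop" idea mentioned in the abstract can be illustrated with a minimal sketch: a proposer suggests a candidate, a checker accepts it or returns an error message, and that message is fed back into the next attempt. The names `prove_with_feedback`, `propose`, and `check` are hypothetical and only illustrate the loop; they are not the paper's framework or any Lean or Claude Code API.

```python
def prove_with_feedback(propose, check, max_rounds=5):
    """Generic propose/check loop (illustrative, not the paper's system).

    propose(feedback) -> candidate proof text (feedback is None on round 1)
    check(candidate)  -> (accepted: bool, feedback: str)
    Returns (candidate, rounds_used) on success, (None, max_rounds) otherwise.
    """
    feedback = None
    for round_index in range(1, max_rounds + 1):
        candidate = propose(feedback)
        accepted, feedback = check(candidate)
        if accepted:
            return candidate, round_index
    return None, max_rounds


# Toy stand-ins: the "prover" gives up first, then reacts to the error message.
def toy_propose(feedback):
    return "sorry" if feedback is None else "rfl"


def toy_check(candidate):
    if candidate == "rfl":
        return True, ""
    return False, "error: proof contains 'sorry'"


result = prove_with_feedback(toy_propose, toy_check)
```

In a real setup the checker would be the Lean compiler and the proposer a model call; both are deliberately stubbed here.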

5/5 Machine Learning (cs.LG) 26 May 2026
Reading Calibrated Uncertainty from Language Model Trajectories

Aliai Eusebi, Alexander Herzog, Xiaoyu Liang, Marie Vasek, Enrico Mariconti, Lorenzo Cavallaro

ArXiv ID: 2605.22864

Abstract:

The maximum softmax probability (MSP) represents a default approach when evaluating uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace how and where along depth errors take shape -- which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint.

Insights

Contribution: Introduces a method to calibrate uncertainty in language models by analyzing layer-wise trajectory features of MLP updates, outperforming MSP in selective abstention.

Core Idea: Uncertainty can be better captured by tracing the geometric path of hidden states across layers rather than relying on final probabilities, as trajectory dynamics reveal calibration insights missed by static snapshots.

Technique: Extracts 11 scale-invariant geometric features from per-layer MLP updates and uses a sparse linear probe to map these trajectories to uncertainty estimates.

Pipeline: input → model activations → geometric feature extraction → sparse linear probe → uncertainty quantification

Methodology: Analyzes layer-wise representation evolution through geometric features, leveraging closed-form interpretations to trace error formation across depth.

Results: Outperforms MSP under selective abstention, with gains that grow with baseline miscalibration, up to 21 AURC points.

Limitations: Relies on specific model architectures, static probe design may limit adaptability, and geometric features might not capture all uncertainty dimensions.
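
For readers unfamiliar with the metric, AURC (area under the risk-coverage curve) can be sketched as follows. This is the common textbook definition, assuming one confidence score and one correctness flag per example; the paper's exact variant and scaling (its "AURC points") may differ, and the function name `aurc` is our own.

```python
import numpy as np


def aurc(confidence, correct):
    """Area under the risk-coverage curve; lower is better.

    Examples are ranked from most to least confident. At coverage k/n the
    risk is the error rate among the k most confident examples; AURC is
    the mean of those risks over k = 1..n.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order]
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risks.mean())


labels = [1, 1, 0, 0]
well_ranked = aurc([0.9, 0.8, 0.2, 0.1], labels)    # errors get low confidence
badly_ranked = aurc([0.1, 0.2, 0.8, 0.9], labels)   # errors get high confidence
```

A score that ranks wrong answers last yields a lower AURC, which is the sense in which a probe can beat MSP under selective abstention.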

5/5 Machine Learning (cs.LG)Artificial Intelligence (cs.AI)Computation and Language (cs.CL) 26 May 2026
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

Ming Liu

ArXiv ID: 2605.22870

Abstract:

Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.

Insights

Contribution: Identifies positional number copying as the dominant mechanism in arithmetic chain-of-thought (CoT) readout for small language models, challenging the assumption that CoT relies on logical sequencing.

Core Idea: Models prioritize copying the trailing number before the answer delimiter, bypassing intermediate reasoning steps, which explains performance retention even when CoT steps are shuffled.

Technique: Uses prefix completion to isolate the answer-readout stage and analyzes number-copying patterns across three instruction-tuned language models on GSM8K.

Pipeline: input → arithmetic problem with CoT steps → positional number copying at answer delimiter → output answer

Methodology: Empirical analysis of CoT readout via prefix completion, ablation studies, and cross-model comparisons (Qwen, Llama, Gemma) on arithmetic and non-arithmetic tasks.

Results: Gold-answer presence accounts for 54-92 pp of accuracy; even on incorrect items, the final answer matches the last CoT number 95-96% of the time; replacing the trailing number with a wrong value collapses accuracy to near-zero; on non-arithmetic BBH tasks, shuffle retention drops sharply.

Limitations: Findings may not generalize to larger models, and the role of architecture-specific heads in gating mechanisms remains underexplored.
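
The "final answer matches the last CoT number" statistic can be approximated with a short sketch. The regular expression and the helper names `last_number` and `copy_rate` are illustrative assumptions, not the paper's evaluation code, and the number parsing is deliberately crude.

```python
import re

# A digit run with optional thousands separators and decimals; a leading
# minus counts as a sign only when it does not directly follow a digit.
NUMBER = re.compile(r"(?<!\d)-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number appearing in text as a float, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def copy_rate(items):
    """Fraction of (cot_text, final_answer) pairs whose final answer equals
    the trailing number of the chain of thought."""
    hits = 0
    total = 0
    for cot_text, final_answer in items:
        trailing = last_number(cot_text)
        if trailing is None:
            continue
        total += 1
        hits += trailing == float(final_answer)
    return hits / total if total else float("nan")


toy_items = [
    ("3 + 4 = 7", 7),      # answer copies the trailing number
    ("2 * 5 = 10", 12),    # answer differs from the trailing number
]
rate = copy_rate(toy_items)
```

On real model outputs, a high copy rate on incorrect items would be the signature of the positional shortcut described above.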

5/5 Machine Learning (cs.LG)Artificial Intelligence (cs.AI)Computation and Language (cs.CL) 26 May 2026
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

Wei Xia, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

ArXiv ID: 2605.22873

Abstract:

Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a dynamic decoding state that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose EDRM (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves 41-55% token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to 4.7% while maintaining 27-45% token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.

Insights

Contribution: Introduces EDRM, a training-free framework that uses early decoding entropy dynamics to adaptively trigger reasoning in LLMs, improving efficiency and accuracy.

Core Idea: LLM reasoning emerges as a dynamic phase transition from high-entropy exploration to low-entropy structure, detectable via early-stage entropy patterns.

Technique: EDRM maps entropy trajectories into a compact manifold to guide inference strategy selection, enabling instance-level adaptation without retraining.

Pipeline: input → analyze early decoding entropy → embed into entropy dynamics manifold → select optimal inference strategy → output

Methodology: Systematic analysis of entropy phase transitions during decoding to identify reasoning-inducing tasks, validated across 15 benchmarks and 4 LLMs.

Results: At the dataset level, 41-55% token reduction with improved accuracy using as few as 50 calibration samples; at the instance level, up to 4.7% further accuracy gain with 27-45% token savings.

Limitations: Relies on calibration samples for adaptation; effectiveness may vary across task types not tested in the study.
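
The signal the paper builds on, whether entropy falls or rises over the first decoding steps, can be sketched in a few lines. This is only an illustration of the entropy trend: the slope-threshold rule in `route` is a hypothetical stand-in and not EDRM's manifold embedding, and all function names are our own.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each step.

    logits has shape (steps, vocab_size).
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


def entropy_slope(logits_per_step):
    """Least-squares slope of entropy over the early decoding steps."""
    entropies = token_entropy(logits_per_step)
    steps = np.arange(len(entropies))
    return float(np.polyfit(steps, entropies, 1)[0])


def route(logits_per_step, threshold=0.0):
    """Hypothetical rule: falling entropy -> use CoT, otherwise answer directly."""
    return "cot" if entropy_slope(logits_per_step) < threshold else "direct"


# Toy logits over a 4-token vocabulary that sharpen step by step.
sharpening = [[0, 0, 0, 0], [2, 0, 0, 0], [5, 0, 0, 0]]
decision_sharpening = route(sharpening)
decision_flattening = route(sharpening[::-1])
```

The threshold of 0.0 is arbitrary; the paper instead calibrates its routing on a small number of samples.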

5/5 Artificial Intelligence (cs.AI)Computation and Language (cs.CL)Computer Vision and Pattern Recognition (cs.CV)Neural and Evolutionary Computing (cs.NE) 26 May 2026
In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Sam Earle, Kay Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

ArXiv ID: 2605.23908

Abstract:

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

Insights

Contribution: This paper contributes by replicating the Picbreeder system using large Vision-Language Models (VLMs) to investigate open-endedness in AI-driven creative processes and identifies factors influencing their output diversity and novelty.

Core Idea: The core idea is to replace human users in Picbreeder with VLMs to study whether artificial agents can exhibit open-ended, unguided creativity and to analyze the differences in their generated outputs compared to human-driven evolution.

Technique: The technique involves using frontier VLMs to perform evolutionary search over image-generation tasks, incorporating exploratory noise, behavioral diversity, and memory-based narrative momentum to enhance open-endedness.

Pipeline: input → VLMs generate and evolve images through iterative selection → output → images with metrics of complexity, salience, and novelty

Methodology: The methodology combines evolutionary algorithms with VLMs, evaluates outputs using phylogenetic complexity and semantic novelty metrics, and systematically tests interventions like noise, diversity, and memory.

Results: The VLM-driven system's output shows clear qualitative differences from the historical human baseline, which the authors characterize using metrics of phylogenetic complexity, visual and semantic salience, and novelty. The abstract does not state how far the noise, diversity, and memory interventions close the gap.

Limitations: Limitations include the inability of VLMs to replicate human-like narrative coherence, challenges in scaling memory mechanisms, and potential biases in VLM-generated outputs affecting open-endedness.


5/5 Artificial Intelligence (cs.AI)Machine Learning (cs.LG) 26 May 2026
Confidence Calibration in Large Language Models

Noam Michael, Daniel BenShushan, Jacob Bien, Don A. Moore

ArXiv ID: 2605.23909

Abstract:

We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.

Insights

Contribution: Introduces LifeEval, a framework for assessing calibration of large language models (LLMs) across task difficulty, and demonstrates systematic overconfidence on hard tasks and underconfidence on easy ones.

Core Idea: LLMs exhibit human-like confidence calibration biases, with overconfidence increasing on difficult tasks and underconfidence on easy ones, challenging the assumption of uniform reliability in model predictions.

Technique: LifeEval evaluates calibration by systematically varying task difficulty and comparing model confidence against empirical accuracy rates across diverse benchmarks.

Pipeline: input → model generates predictions with confidence scores → process evaluates calibration across difficulty levels using empirical accuracy → output quantifies over/underconfidence trends

Methodology: Preregistered study across multiple tasks and datasets to measure calibration, leveraging the hard-easy effect as a lens to analyze confidence-accuracy discrepancies.

Results: On average, LLM confidence exceeds accuracy. Overconfidence is greatest on difficult tests, while easy tests show substantial underconfidence (the hard-easy effect). The abstract reports no numerical gap sizes.

Limitations: Findings may not generalize to real-world scenarios, and LifeEval's reliance on synthetic difficulty metrics could limit practical applicability.
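
The hard-easy effect is usually measured as mean confidence minus accuracy within each difficulty group. The sketch below shows that calculation on made-up toy records; it is not LifeEval, the numbers are not from the paper, and the function name is our own.

```python
from collections import defaultdict


def overconfidence_by_group(records):
    """Mean confidence minus accuracy for each group.

    records: iterable of (group, confidence in [0, 1], correct as bool).
    Positive values indicate overconfidence, negative values underconfidence.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # confidence sum, correct sum, count
    for group, confidence, correct in records:
        entry = totals[group]
        entry[0] += confidence
        entry[1] += float(correct)
        entry[2] += 1
    return {
        group: conf_sum / count - correct_sum / count
        for group, (conf_sum, correct_sum, count) in totals.items()
    }


# Illustrative toy data only.
toy_records = [
    ("hard", 0.8, False),
    ("hard", 0.8, True),
    ("easy", 0.7, True),
    ("easy", 0.7, True),
]
gaps = overconfidence_by_group(toy_records)
```

In this toy data the hard group is overconfident and the easy group underconfident, mirroring the direction of the pattern the abstract describes.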

5/5 Artificial Intelligence (cs.AI)Machine Learning (cs.LG) 26 May 2026
How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang

ArXiv ID: 2605.23926

Abstract:

Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while $\pi$, forced to terminate thinking and emit a final answer, still produces the correct answer. A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high -- between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions -- that the finding is robust to the choice of judge family, and that although $\rho$ decreases with problem difficulty on MATH-500, all four models remain substantially redundant ($\rho \in [46\%, 85\%]$) even on the hardest Level-5 problems. We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained. Code: https://github.com/zhiyuanZhai20/how-much-thinking-is-enough

Insights

Contribution: This paper formalizes reasoning redundancy in large language models (LLMs) and demonstrates that extensive over-thinking is a structural property of their training, not a model-specific flaw.

Core Idea: Redundancy is quantified by truncating reasoning traces while maintaining correctness, revealing that most steps in LLM reasoning chains are unnecessary.

Technique: Truncation-based analysis combined with large-scale experiments across models and benchmarks to measure redundancy thresholds, supported by a theoretical proof linking redundancy to length-agnostic rewards.

Pipeline: input → segment reasoning traces → iteratively truncate steps → evaluate correctness of final answer → compute redundancy rate

Methodology: Empirical evaluation on four reasoning models and two mathematical benchmarks, paired with a theoretical analysis of reward-driven over-thinking dynamics.

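Results: Step-level redundancy ranges from 61% to 93% across the 8 (model, benchmark) conditions, with the median critical prefix equal to a single segmented step in six of the eight; the finding is robust to the choice of judge family, and redundancy remains substantial (46-85%) even on the hardest Level-5 MATH-500 problems.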

Limitations: Findings are primarily validated on mathematical reasoning tasks, and the theoretical framework assumes idealized reward structures not fully captured in practice.
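
The truncation-based definition can be sketched directly: find the shortest prefix of the segmented trace from which a forced answer is still correct, and report the fraction of steps after it. The callable `answer_from_prefix` is a hypothetical stand-in for forcing the model to stop thinking and answer; requiring a prefix of at least one step is our assumption, not something the abstract specifies.

```python
def redundancy(steps, answer_from_prefix, gold):
    """Fraction of trailing steps that can be cut while the answer stays correct.

    steps: list of segmented reasoning steps from a correct trace.
    answer_from_prefix(prefix) -> answer produced when forced to stop after prefix.
    """
    n = len(steps)
    for k in range(1, n + 1):  # assumption: keep at least one step
        if answer_from_prefix(steps[:k]) == gold:
            return (n - k) / n
    raise ValueError("trace never yields the gold answer; it is not a correct trace")


# Toy stand-in for the model: it answers "4" once any step has derived "= 4".
def toy_answer(prefix):
    return "4" if any("= 4" in step for step in prefix) else None


trace = ["x = 2 + 2", "x = 4", "double-check: 2 + 2 = 4", "so the answer is 4"]
rho = redundancy(trace, toy_answer, gold="4")
```

Here the critical prefix is two steps out of four, so half of the trace counts as redundant under this definition.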


5/5 Artificial Intelligence (cs.AI)cs.CYMachine Learning (cs.LG) 26 May 2026
Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

Long Zhang, Zi-bo Qin, Wei-neng Chen

ArXiv ID: 2605.23938

Abstract:

Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when sensor measurements and user claims conflict remains unexamined, raising critical reliability concerns for deployments where physical sensing must retain priority. Unlike explicit traditional fusion, LLMs bury authority allocation within learned representations. We discover this allocation is severely format-dependent: numerical sensor data fails to integrate into answer-relevant model directions, allowing natural-language claims to dominate the final decision, a phenomenon we term Authority Inversion. To diagnose and mitigate this, we develop a geometric framework of context integration, introduce two computable audit metrics, specifically the Context Integration Ratio (CIR) and Authority Alignment Index (AAI), and propose Geometric Authority Calibration (GAC), an inference-time layer-level intervention to suppress misplaced user authority. Evaluating four models (4B to 35B parameters, three architectures) across four datasets totaling 576 conflict instances reveals extreme inversion: on numerical tasks, models exhibit near-zero sensor trust (AAI = -0.805, Cohen's d = -2.14), unaffected by model capacity. Validating our geometric framework, theory-guided causal injection flips 80.2% of incorrect decisions (vs. <0.4% for random controls). Practically, GAC improves HAR accuracy from 0-1.6% to 21.9-27.5%, outperforming prompting baselines. Ultimately, authority allocation in LLM-mediated systems must be explicitly audited and application-specifically configured rather than left implicit.

Insights

Contribution: Introduces the concept of Authority Inversion in LLM-mediated systems, where models prioritize user claims over sensor data, and proposes Geometric Authority Calibration (GAC) to address this issue.

Core Idea: LLMs implicitly allocate authority based on input format, leading to sensor data being marginalized in favor of natural-language user claims during fusion, which undermines reliability in critical systems.

Technique: Develops a geometric framework for context integration, along with audit metrics (CIR, AAI), and applies GAC as an inference-time intervention to recalibrate authority alignment.

Pipeline: input → sensor data and user claims → geometric context integration → authority calibration → decision output

Methodology: Empirical evaluation across four models and datasets, combined with causal analysis and metric-driven calibration to diagnose and mitigate authority inversion.

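Results: On numerical tasks, models show near-zero sensor trust (AAI = -0.805, Cohen's d = -2.14), regardless of model capacity; theory-guided causal injection flips 80.2% of incorrect decisions (vs. <0.4% for random controls); GAC raises HAR accuracy from 0-1.6% to 21.9-27.5%, outperforming prompting baselines.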

Limitations: Dependence on controlled datasets, limited real-world validation, and challenges in generalizing application-specific authority configurations.

Multimodal

5/5 Artificial Intelligence (cs.AI) 26 May 2026
Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

Haoyuan Wang, Xiaohao Liu, Jiajie Su, Jianmao Xiao, Chaochao Chen

ArXiv ID: 2605.23780

Abstract:

Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieves strong reliability and locality, it often exhibits limited generality, failing to propagate edits across semantically equivalent visual and linguistic variations. This issue arises from the lack of explicit semantic supervision, rigid editing scopes, and biased anchoring to individual samples in high-dimensional multimodal spaces. We address robust intrinsic multimodal knowledge editing by explicitly targeting generalization. We formalize robustness through knowledge units that group semantically equivalent multimodal inputs and define generality as consistent predictions within each unit. To expose fragile semantic regions, we introduce Latent Adversarial Robustification (LAR), which generates adversarial yet semantically coherent variants in the joint latent space. We further propose Rank-Constrained Subspace Learning (RCSL), enforcing low-rank alignment of adversarial representations at the edit layer via a singular value-based objective. Extensive analysis demonstrates the effectiveness of ASAM empirically.

Insights

Contribution: Proposes a robust framework for multimodal knowledge editing by enhancing generalization across semantically equivalent modalities through adversarial and subspace techniques.

Core Idea: Formalizes robustness via semantic knowledge units and leverages adversarial variants to expose fragile regions, enabling consistent edits across multimodal representations.

Technique: Introduces Latent Adversarial Robustification (LAR) for generating semantically coherent adversarial variants and Rank-Constrained Subspace Learning (RCSL) to enforce low-rank alignment of adversarial representations.

Pipeline: multimodal input → adversarial variant generation and subspace alignment → robust knowledge-edited output

Methodology: Combines semantic grouping of multimodal inputs with adversarial training and singular value-based optimization to ensure generalizable edits.

Results: Empirical analysis demonstrates improved robustness and generalization in multimodal knowledge editing compared to baseline methods.

Limitations: Potential computational overhead from adversarial generation and reliance on high-quality semantic grouping for optimal performance.

RL

5/5 Artificial Intelligence (cs.AI) 26 May 2026
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

Yoosung Hong

ArXiv ID: 2605.23652

Abstract:

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.

Insights

Contribution: Introduces pcsp, a shared reinforcement learning policy that enables scalable, real-time persona-conditioned control of hundreds of NPCs in life-simulation games through frozen LLM embeddings and novel training objectives.

Core Idea: A single policy leverages persona embeddings and low-rank projections to maintain consistency and controllability across diverse NPCs while enabling compositional zero-shot persona identification.

Technique: Combines PPO with InfoNCE trajectory-consistency and KL diversity objectives, using once-per-NPC persona encoding, neural conditioning, and low-rank persona projection for efficient inference.

Pipeline: free-form persona descriptions → frozen LLM embeddings → persona encoding/projection → neural conditioning → PPO + InfoNCE + KL training → persona-conditioned NPC behavior

Methodology: Employs a shared policy framework with ablation studies to validate the critical role of trajectory-consistency objectives, tested across life-simulation benchmarks and commercial game engines.

Results: Compositional zero-shot persona identification up to 17x above chance, Spearman rho of approximately 0.73 for semantic-behavioral alignment, 22x faster inference than an LLM-as-policy baseline, and a UE5 deployment at 64 agents with a low failure rate.

Limitations: Reliance on pre-trained LLM embeddings may limit adaptability, scalability to extremely large NPC numbers remains untested, and real-world deployment challenges in dynamic environments persist.
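
Since the ablations single out the InfoNCE consistency term, a generic InfoNCE loss is sketched below for orientation. It is the standard contrastive formulation over paired embeddings; how pcsp actually pairs trajectories with personas, and the temperature of 0.1, are assumptions on our part.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss over paired embeddings (illustrative only).

    Row i of positives is the match for row i of anchors; every other row
    in the batch acts as a negative. Embeddings are L2-normalized first.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))


# Toy check: matched pairs give a low loss, mismatched pairs a high one.
embeddings = np.eye(3)
aligned_loss = info_nce(embeddings, embeddings)
shuffled_loss = info_nce(embeddings, embeddings[[1, 2, 0]])
```

Minimizing a loss of this shape pushes each trajectory embedding toward its own persona and away from the others, which is consistent with identification collapsing to chance when the term is removed.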

5/5 Artificial Intelligence (cs.AI)Machine Learning (cs.LG)Multiagent Systems (cs.MA) 26 May 2026
Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

Saad Mankarious

ArXiv ID: 2605.23930

Abstract:

We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8×8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single-agent policy and why, (3) how large is the cooperation gap between independent and cooperative two-agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q-Learning, Deep Q-Network (DQN), Independent DQN (IDQN), and Multi-Agent Proximal Policy Optimisation (MAPPO with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized-time mechanic makes a rush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32-34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time-critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi-agent learning dynamics.

Insights

Contribution: Introduces Quantum Frog, a quantized-time cooperative game, and provides empirical insights into cooperation dynamics, difficulty scaling, and emergent strategies in multi-agent reinforcement learning.

Core Idea: The quantized-time mechanic forces agents to minimize exposure to hazards by prioritizing rapid action, enabling analysis of cooperation and coordination in time-critical scenarios.

Technique: Employs reinforcement learning algorithms (Tabular Q-Learning, DQN, IDQN, MAPPO) across five training stages to evaluate cooperative and independent agent performance under varying traffic densities.

Pipeline: input → game environment with traffic densities (1-6) → agent training via RL algorithms → evaluation of success rates, episode lengths, and cooperation gaps → analysis of emergent strategies

Methodology: Systematic training and evaluation of single- and multi-agent RL policies across escalating game difficulty, comparing independent vs. cooperative strategies under controlled traffic conditions.

Results: The rush strategy is universally optimal; adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; cooperative training recovers 32-34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; the emergent cooperative strategy is synchronized rushing, not complex positional coordination.

Limitations: Game-specific mechanics may limit generalizability; simplified traffic models might not capture real-world complexity; focus on short-term coordination may overlook long-term strategy trade-offs.

Tech News

AI Safety

Reddit r/ArtificialIntelligence 2026-05-26
Wiz Integrates with Anthropic's Compliance API

Wiz, a security company, has integrated with Anthropic's Compliance API to enhance AI system accountability and regulatory adherence. This partnership likely enables more robust monitoring of AI outputs for ethical and legal compliance, addressing key challenges in AI safety and governance.

Reddit r/DeepLearning 2026-05-26
10 years of AI robustness tricks (PGD, RLHF, Data Augmentation) are actually computing the same hidden matrix. We proved what happens when you get it wrong.

A Reddit post presents research showing that diverse AI robustness techniques (PGD, RLHF, data augmentation) all estimate the same 'deployment nuisance covariance matrix.' Correcting this matrix with a geometric penalty significantly reduces LLM sycophancy and outperforms traditional methods, highlighting critical implications for model alignment and robustness.

Reddit r/DeepLearning 2026-05-25
LoRA adapter backdoors and behavioral detection - looking to publish my research

A researcher discusses LoRA adapter backdoors in machine learning models and proposes behavioral detection methods, seeking to publish their findings on securing AI systems against covert vulnerabilities.

Reddit r/DeepLearning 2026-05-25
AI Alignment: Thicc Future vs. Flatline Future

A Reddit post discusses speculative scenarios for AI alignment, contrasting a 'Thicc Future' (optimistic, aligned AI progress) with a 'Flatline Future' (stagnation or misalignment risks). The conversation highlights debates around AI safety, technical challenges, and long-term implications for society.

Reddit r/DeepLearning 2026-05-25
how to build AI Systems that optimize Happiness (for AI Researchers)

An independent researcher proposes building AI systems to optimize human happiness by addressing its complex, multi-dimensional nature, warning against short-term reward optimization and suggesting a constrained MDP framework with multi-channel feedback mechanisms. The post highlights challenges in aligning AI with long-term wellbeing using reinforcement learning principles.

Reddit r/DeepLearning 2026-05-25
Bad AI alignment solutions

A Reddit post discusses flawed approaches to aligning AI systems, highlighting common pitfalls in current strategies. The community critiques ineffective methods for ensuring AI behaves as intended, emphasizing challenges in safe AI development. This conversation underscores ongoing debates in AI safety and alignment research.

Reddit r/DeepLearning 2026-05-26
The paperclip maximizer tsunami

A Reddit discussion explores the 'paperclip maximizer' thought experiment, highlighting risks of misaligned AI goals. The thread connects this concept to current AI developments, emphasizing concerns about uncontrolled optimization and safety in advanced systems.

Hacker News Tue, 26 Ma
Notes on Pope Leo XIV's Encyclical on AI

A collection of comments on a hypothetical encyclical by Pope Leo XIV addressing AI ethics, governance, and societal impact, blending theological perspectives with AI safety discussions.

Hacker News Tue, 26 Ma
The User Is Visibly Frustrated

A Hacker News discussion highlights user frustration with AI systems, likely related to usability, reliability, or ethical concerns in AI deployment. The conversation underscores challenges in aligning AI capabilities with user expectations.

Reddit r/ArtificialIntelligence 2026-05-26
I don't like the answer this AI gave me

A Reddit user expresses frustration with DuckDuckGo AI's response to their question about why AI hasn't helped make data centers more environmentally friendly, sparking anger toward AI billionaires. The post highlights concerns about AI's ethical implications and environmental impact.

Agentic AI

Reddit r/ArtificialIntelligence 2026-05-25
Top 10 Fastest Growing AI repos this week

A Reddit post highlights the top 10 fastest-growing AI repositories, focusing on AI coding agents, personal AI systems, memory frameworks, browser automation, and dev tools. Key projects include code knowledge graphs, Claude Code skills, and on-device TTS solutions, reflecting trends in agentic AI and local-first development.

Reddit r/ArtificialIntelligence 2026-05-25
AI agents need audit trails more than they need more autonomy

The post argues that AI agents require detailed audit trails for transparency and trust, rather than increased autonomy. It highlights the importance of tracking every action an AI takes across systems to ensure accountability and prevent undetected errors.

Reddit r/ArtificialIntelligence 2026-05-25
If you've ever wondered how rigorous data analysis+social science research can look with AI, I've finally launched a nice website for my open-source Claude Code researcher's toolkit: the Data Analyst Augmentation Framework! Equal parts interactive explainer on agentic orchestration + free tool

The post announces the launch of the Data Analyst Augmentation Framework, an open-source toolkit designed to enhance data analysis and social science research using AI. It serves as an interactive explainer on agentic orchestration and offers free tools for researchers.

Reddit r/ArtificialIntelligence 2026-05-25
thoughts on why AI agents are starting to look like SaaS billing systems

The post highlights that scaling AI agents introduces operational challenges like orchestration, retries, and auditability, mirroring the complexities faced by SaaS billing systems in earlier years. It suggests that AI agents in 2026 will face similar infrastructure and management hurdles as SaaS systems did in 2017.

Reddit r/ArtificialIntelligence 2026-05-25
Is “AI employee” becoming a real product category?

A Reddit post analyzes the emergence of 'AI employee' as a product category, highlighting specific roles like sales, support, and legal agents. The author questions whether this framing reflects real product capabilities or aggressive marketing, noting uneven adoption across workflows.

Reddit r/ArtificialIntelligence 2026-05-25
I was messing with agentic organizational strategies and came up with an automated starfish that solves social problems

A Reddit user shared an experimental concept involving agentic AI systems designed to address social problems through automated, starfish-like organizational strategies. The project is non-commercial and focuses on autonomous problem-solving frameworks.

Reddit r/DeepLearning 2026-05-25
Relational Theory Formalism (RTF) v5.1: A Scaffold for Emergent Agency in Directed Networks

Relational Theory Formalism (RTF) v5.1 introduces a framework for enabling emergent agency in directed networks, focusing on structured relational modeling to support complex, goal-oriented behaviors in AI systems. The update likely advances research in autonomous agent development and system-level coordination.

Reddit r/DeepLearning 2026-05-26
I Told My AI to Collect 10 Water

A Reddit user shared a post about instructing an AI to 'collect 10 water,' though the context and technical details remain unclear. The post likely reflects an experimental or conceptual project involving AI task execution.

Computer Vision

Hacker News Mon, 25 Ma
Yoti age checks share facial photos and device fingerprints with third parties

Yoti's age verification system shares facial recognition data and device fingerprints with third parties, raising significant privacy concerns about data handling practices in AI-driven identity checks.

Reddit r/MachineLearning 2026-05-25
DCGAN inference on a microcontroller: 12.6M parameters, 512KB SRAM, 26-second generation, pure C [P]

A DCGAN with 12.6M parameters was deployed on a RISC-V microcontroller (CH32H417) using pure C, achieving 64x64 cat face generation in 26 seconds. The project uses int8 quantization, quantum random data for latent vectors, and SD card-based weight streaming, highlighting edge AI capabilities on low-cost RISC-V hardware.

Reddit r/MachineLearning 2026-05-25
Call for Papers - Workshop on Unlearning and Model Editing U&ME at ECCV 2026 [R]

A call for papers for the U&ME workshop at ECCV 2026, focusing on unlearning, model editing, and related topics like model merging, compression, and domain adaptation. The workshop encourages submissions on emerging ideas, including unfinished research and applications in computer vision.

Reddit r/ArtificialIntelligence 2026-05-25
We're reaching a point where "AI-generated but visually realistic" content will become the norm, not the exception. 👀

The post discusses the growing prevalence of AI-generated, visually realistic content, suggesting it is becoming the norm. The author claims we have entered the era of artificial general intelligence (AGI).

Reddit r/ArtificialIntelligence 2026-05-26
How hard is it to train a video generation AI from scratch?

A Reddit user questions the complexity of training a video generation AI compared to image generation, highlighting challenges like motion understanding, temporal coherence, and computational demands. The post explores potential bottlenecks such as data scarcity, model architecture, and evaluation difficulties in video AI development.

Reddit r/ArtificialIntelligence 2026-05-25
Sam Altman’s startup is hoping Jared Leto’s band will get you to scan your eyeball

Sam Altman's startup is leveraging Jared Leto's band to promote eye-scanning technology, potentially for biometric authentication or user identification. The initiative highlights AI-driven innovations in computer vision and personalized user experiences.

Reddit r/DeepLearning 2026-05-26
Deep Learning Projects

A Reddit user shared a project on face mask detection using a CNN with ResNet50 transfer learning, including deployment details. The video demonstrates practical computer vision applications and model optimization techniques.

Reddit r/DeepLearning 2026-05-25
Flower Classification using Deep Learning

A deep learning practitioner developed an end-to-end flower classification system using computer vision, deployed on Hugging Face Spaces. The project includes a feedback loop to improve model accuracy over time, supports 35 flower species (with Indian varieties emphasized), and utilizes FastAPI, Podman, and open-source collaboration.

Reddit r/DeepLearning 2026-05-26
MediVigil: Hospital Patient Facial Monitoring System

MediVigil is a real-time hospital monitoring system that uses multi-modal facial dynamics and kinematics to detect patient distress, drowsiness, breathing difficulties, and agitation with high accuracy. It operates effectively in low-light conditions, leveraging computer vision techniques for critical patient safety applications.

Computing Systems

Reddit r/ArtificialIntelligence 2026-05-26
Scientists trained an AI model using an IBM quantum computer โ€” and it answered questions correctly that the base model couldn't

Scientists trained an AI model using an IBM quantum computer, enabling it to answer specific questions that the base model could not. The experiment highlights potential advancements in leveraging quantum computing for AI training, as discussed in the r/ArtificialIntelligence subreddit.

Reddit r/ArtificialIntelligence 2026-05-25
Why is there a sudden demand for a bunch of data centers?

A Reddit user in Pennsylvania notes a surge in large-scale data center proposals nearby, questioning the sudden demand amid existing AI capabilities. The post raises doubts about whether the infrastructure will meet real needs or remain underutilized, reflecting broader debates about AI's computational demands.

Reddit r/ArtificialIntelligence 2026-05-25
Building Conifer, an open-source local inference runtime (free + open source)

A Princeton team is developing Conifer, an open-source local inference runtime optimized for Apple Silicon using Rust and hand-written kernels. The project aims to enable privacy-preserving, OS-enforced local AI agents and is seeking feedback from 100 beta testers for performance tuning and tool development.

Reddit r/ArtificialIntelligence 2026-05-25
Wix cutting

Wix is laying off 20% of its workforce amid rising AI infrastructure costs and shifting to AI-driven 'vibe coding' tools. The restructuring highlights challenges for AI-first companies balancing growth with expensive compute, model development, and market commoditization.

Reddit r/DeepLearning 2026-05-25
[D] Prefix cache reported 87% hits, physical KV reuse was ~31%. How are others measuring this gap?

A multi-turn AI agent system reported an 87% prefix cache hit rate, but physical KV reuse was only about 31% because KV blocks were evicted between conversation turns. This led to unexpected token cost increases, highlighting a critical gap between cache metrics and actual system behavior. The issue was validated through timing analysis and partially mitigated with SGLang's radix tree-based approach.
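
The gap described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the poster's measurement method or SGLang's implementation: the function `measure`, the block ids, and the LRU eviction policy are all hypothetical. A "logical" hit means the block id has been seen before (what an index-level metric might count), while a "physical" hit means the block is still resident in a fixed-size pool.

```python
from collections import OrderedDict


def measure(requests, capacity):
    """Return (logical_hit_rate, physical_hit_rate) over all requested blocks.

    requests: list of requests, each a list of KV block ids in prefix order.
    capacity: number of blocks the (hypothetical) LRU pool can hold.
    """
    seen = set()               # every block id ever computed
    resident = OrderedDict()   # blocks currently in memory, in LRU order
    logical = physical = total = 0
    for blocks in requests:
        for block in blocks:
            total += 1
            if block in seen:
                logical += 1
            if block in resident:
                physical += 1
                resident.move_to_end(block)       # mark as recently used
            else:
                resident[block] = None            # recompute and store
                if len(resident) > capacity:
                    resident.popitem(last=False)  # evict least recently used
            seen.add(block)
    return logical / total, physical / total


# Two conversations interleave: A's second turn arrives after B's first.
turns = [
    ["a0", "a1", "a2"],
    ["b0", "b1", "b2"],
    ["a0", "a1", "a2", "a3"],
]

if __name__ == "__main__":
    print(measure(turns, capacity=4))   # small pool: A's blocks were evicted
    print(measure(turns, capacity=10))  # large pool: nothing is evicted
```

With the small pool, conversation A's prefix is recognised but has to be recomputed, so the two rates diverge; with the large pool they agree.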

Reddit r/DeepLearning 2026-05-25
Where and who uses Cloud GPU?

The Reddit post explores the users and applications of cloud GPUs, addressing concerns about cost for individuals and discussing purposes like model training, cloud gaming, and other compute-intensive tasks.

Reddit r/DeepLearning 2026-05-25
How much you spend on cloud GPU in a month?

A Reddit poll asks AI/ML practitioners to share their monthly cloud GPU spending, highlighting cost considerations for computational resources in AI development. The discussion reflects practical challenges in managing expenses for training large models.

Reddit r/DeepLearning 2026-05-26
Home AI server/workstation for deep learning

A Reddit user shares details about building a home AI server/workstation for deep learning, focusing on hardware configurations and community discussions on optimizing systems for AI/ML workloads.

Hacker News Tue, 26 Ma
Motorola phones have started hijacking the Amazon app to insert affiliate codes

Motorola phones are reportedly altering the Amazon app to inject affiliate codes, potentially redirecting commissions or compromising user trust. The issue highlights vulnerabilities in mobile app integrity and manufacturer software practices.

Reddit r/DeepLearning 2026-05-26
Old and Outdated GPUs

A Reddit user asks about repurposing or disposing of old GPUs, highlighting community discussions on hardware reuse, recycling, and the environmental impact of outdated computing equipment. The post reflects broader concerns about resource management in AI/ML workflows.

Hacker News Tue, 26 Ma
Ask HN: Is anyone working at least 4 hours daily on an Apple Vision Pro?

A Hacker News post asks if anyone uses the Apple Vision Pro for at least four hours daily, sparking discussions about its practical applications and user experiences in daily work. Comments highlight challenges and potential use cases for the device.

Hacker News Tue, 26 Ma
Logseq Doctor: heal your flat old Markdown files before importing to Logseq

Logseq Doctor is a tool designed to restructure and enhance flat Markdown files for better compatibility with the Logseq knowledge management system. It addresses formatting and organization issues in plain text files before importing them into Logseq, improving usability for users managing notes or documentation.

Hacker News Tue, 26 Ma
DynIP – Dynamic DNS with RFC 2136, IPv6, DNSSEC, and BYOD

DynIP offers a dynamic DNS service supporting RFC 2136, IPv6, DNSSEC, and BYOD, enabling secure and flexible DNS management for evolving network environments.

General

Hacker News Mon, 25 Ma
Weave (YC W25) is hiring ML, AI, product, & design engineers

Weave (YC W25) is seeking ML, AI, product, and design engineers. The startup is expanding its technical team to advance AI-driven solutions, with roles spanning machine learning, product development, and design. This reflects growing demand for AI talent in early-stage tech companies.

Reddit r/MachineLearning 2026-05-25
The famous METR AI time horizons graph contains numerous severe errors [D]

A critical analysis of the METR AI time horizons graph highlights severe methodological flaws, including unverified human baseline data, biased sampling, and incentive-driven task completion times. The author argues the benchmark is unreliable and calls for the AI field to prioritize higher-quality data over anecdotal or compromised metrics.

Reddit r/MachineLearning 2026-05-25
Are ICML workshops worth attending? [D]

A Reddit user asks whether attending ICML 2026 workshops is worthwhile despite missing the main conference, considering travel costs and seeking insights from past experiences in adjacent fields.

Reddit r/MachineLearning 2026-05-25
Call for Papers - Workshop on Efficient Reasoning at COLM 2026 [R]

The 2nd Workshop on Efficient Reasoning (ER) at COLM 2026 invites submissions on optimizing AI reasoning across domains like multimodal systems, real-time applications, and resource-constrained environments. Topics include efficient training, inference techniques, safety, and deployment systems, emphasizing interdisciplinary approaches for practical AI solutions.

Reddit r/MachineLearning 2026-05-25
Anyone heard from ICML about Oral decisions yet? [D]

A researcher asks if others have received ICML oral presentation decisions, seeking confirmation that their paper did not get an oral slot after receiving a spotlight. The post highlights the competitive nature of top ML conferences and the uncertainty around notification timelines.

Reddit r/ArtificialIntelligence 2026-05-25
If you could subscribe to one AI provider who would it be?

A Reddit user seeks recommendations for the most cost-effective AI provider, highlighting confusion over unclear usage limits and pricing models among available services.

Reddit r/ArtificialIntelligence 2026-05-25
Future Prediction

The post predicts that companies laying off workers in favor of AI will face challenges due to AI's current limitations, including loss of institutional knowledge, quality issues, and the need for human oversight, drawing parallels to past technological disruptions.

Reddit r/ArtificialIntelligence 2026-05-25
How is AI novelty different from the widespread use of internet in the 2000s

The post questions why AI's current rise is seen as novel compared to the internet's adoption in the 2000s, highlighting similarities in integration challenges and societal adaptation. The user argues that previous generations managed technologies like Wikipedia and Excel, suggesting AI's perceived disruption may not be uniquely transformative.

Reddit r/DeepLearning 2026-05-26
Vibe Coding Will Increase Open Source AI Developers From 25 Million Today to 150 Million in 2028

A Reddit post discusses 'vibe coding,' a trend where AI development shifts from programming languages to human language, projecting open-source AI developers to grow from 25M to 150M by 2028. It highlights ModelScope's growth data and argues open-source AI will dominate due to efficiency and accessibility.

Hacker News Mon, 25 Ma
How Shamir's Secret Sharing Works

Shamir's Secret Sharing is a cryptographic method that splits a secret into parts, requiring a threshold number of shares to reconstruct the original secret. It enhances security by distributing trust among participants, ensuring no single entity can access the secret alone. The technique is widely used in secure data storage and distributed systems.
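
As a minimal sketch of the threshold idea, the Python below splits an integer secret into points on a random polynomial over a prime field and recovers it by Lagrange interpolation at x = 0. The names `split`, `reconstruct`, and `PRIME`, and the choice of the Mersenne prime 2^127 - 1 as the field size, are assumptions for illustration and are not taken from the linked article. Treat it as a teaching example rather than production cryptography.

```python
import secrets

PRIME = 2**127 - 1  # field modulus (a Mersenne prime); must exceed the secret


def split(secret, threshold, n_shares):
    """Split `secret` into n_shares points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must lie in [0, PRIME)")
    if not 1 <= threshold <= n_shares:
        raise ValueError("need 1 <= threshold <= n_shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        acc = 0
        for coeff in reversed(coeffs):  # Horner's rule
            acc = (acc * x + coeff) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, n_shares + 1)]


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0 over GF(PRIME)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    shares = split(123456789, threshold=3, n_shares=5)
    print(reconstruct(shares[:3]) == 123456789)
```

Any three of the five shares recover the secret here, while fewer than three leave the constant term of the polynomial undetermined.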

Hacker News Mon, 25 Ma
Exit IP VPN servers mitigation rollout

Mullvad announces a mitigation strategy for exit IP VPN servers to improve user privacy and security by addressing potential vulnerabilities or leaks associated with exit IP addresses.

Hacker News Tue, 26 Ma
Designing for and against the manufactured normalcy field (2012)

The article critiques how technology and systems design can enforce 'manufactured normalcy,' shaping societal norms and behaviors. Comments likely discuss implications for AI/ML in perpetuating or challenging these norms.

Hacker News Mon, 25 Ma
California moves to exempt Linux from its age-verification law after backlash

California proposes exempting Linux from its age-verification law after backlash, following criticism that the original law forced operating systems to collect user ages. The amendment comes from the same lawmaker who drafted the initial legislation, highlighting tensions between regulation and open-source software.

Hacker News Mon, 25 Ma
Performance of Rust Language [pdf]

A PDF presentation discussing the performance characteristics of the Rust programming language, focusing on its efficiency, memory management, and benchmarks compared to other systems languages. The content likely includes technical analysis and use cases for Rust in performance-critical applications.

Hacker News Tue, 26 Ma
Does Anybody Actually Like React?

A Hacker News thread discusses developer opinions on React, highlighting both its popularity and criticisms regarding complexity and design choices.

Hacker News Mon, 25 Ma
Hacker News front page as a site

A website replicating the Hacker News front page, focusing on its comments section. The site likely serves as a tool for developers to access or analyze Hacker News content more conveniently.

Hacker News Mon, 25 Ma
Nobody cracks open a programming book anymore

A Hacker News discussion explores the declining use of programming books, with users debating the shift toward online resources, practical coding, and AI tools for learning. Comments highlight changing educational habits and the role of modern technology in skill acquisition.

Hacker News Mon, 25 Ma
CVE-2026-28952: Apple macOS 26.5 Kernel Vuln found by Claude

A critical kernel vulnerability (CVE-2026-28952) in Apple macOS 26.5 was discovered by Claude, an AI model. The flaw could allow privilege escalation, prompting urgent patches from Apple.

Hacker News Tue, 26 Ma
Why the Smart Home Bubble Popped

The article examines the collapse of the smart home market bubble, attributing it to overvaluation, unmet consumer expectations, technical limitations, and economic factors. Comments highlight the disconnect between hype and practical adoption of smart home technologies.

Hacker News Mon, 25 Ma
C extensions, portability, and alternative compilers

The article discusses challenges and considerations around using C extensions for performance, ensuring portability across platforms, and exploring alternative compilers. It highlights trade-offs between optimization, compatibility, and development complexity.

Hacker News Tue, 26 Ma
Earthion: A New Mega Drive-Style Shoot-Em-Up

Earthion is a new shoot-em-up game inspired by the Mega Drive era, currently generating discussion on Hacker News. The game's development and reception highlight ongoing interest in retro-style gaming and indie game creation.

Reddit r/ArtificialIntelligence 2026-05-25
Google AI

A Reddit user asks for community opinions on Google's hypothetical transition to AI, sparking a discussion about AI adoption, risks, and implications for the tech industry. The post lacks specific details but reflects broader concerns about AI's societal impact.

Hacker News Mon, 25 Ma
Taking a walk may lead to more creativity than sitting, study finds (2014)

A 2014 study found that walking enhances creativity more than sitting, suggesting physical activity can boost cognitive flexibility. The research highlights the relationship between movement and creative thinking, though it does not directly involve AI/ML technologies.

Hacker News Mon, 25 Ma
Ferrari Luce

Ferrari Luce is a new vehicle model from Ferrari, but the news item provides no specific details about AI or tech innovations. The content only references Hacker News comments, suggesting community discussion rather than technical content.

Hacker News Mon, 25 Ma
What we lost when we stopped letting kids leave the front yard

The article discusses the societal impact of overprotective parenting and 'safetyism,' arguing that restricting children's independence stifles resilience and creativity. It includes reader comments reflecting on how this trend affects personal growth and innovation.

Hacker News Mon, 25 Ma
A successful Japanese trial of a ramjet engine designed for Mach-5 aircraft

Japan successfully tested a ramjet engine capable of powering Mach-5 aircraft, advancing hypersonic travel technology. The trial marks progress in developing faster-than-sound propulsion systems for future aerospace applications.

LLM

Hacker News Mon, 25 Ma
Using AI to write better code more slowly

The article discusses how AI tools, while capable of generating high-quality code, may slow down the development process due to increased reliance on verification, debugging, and iterative refinement. Comments highlight trade-offs between efficiency and quality when integrating AI into coding workflows.

Hacker News Mon, 25 Ma
Norway's 2 petabytes of Huawei flash storage and LLM training

Norway is utilizing 2 petabytes of Huawei flash storage for large language model (LLM) training, highlighting the critical role of high-capacity storage infrastructure in AI development. The news underscores the intersection of AI research and data storage solutions.

Hacker News Mon, 25 Ma
Microsoft Copilot Cowork Exfiltrates Files

A security vulnerability in Microsoft Copilot allows unauthorized data exfiltration, raising concerns about AI tool security and data privacy. The issue highlights risks in AI collaboration tools handling sensitive information.

Reddit r/MachineLearning 2026-05-25
Reconstructing the agent methodology: Decoupling decision-making and execution - open source [P]

A new open-source project called Spice aims to decouple decision-making from execution in AI agents by making the reasoning process explicit. It acts as a decision layer above existing agents, tracking observations, options, trade-offs, and outcomes to increase transparency in agent behavior.

Reddit r/MachineLearning 2026-05-25
Delta Attention Residuals [R]

Delta Attention Residuals introduces a novel approach to residual connections in neural networks by routing attention over delta values (differences between hidden states) instead of cumulative states, solving routing collapse in deep layers. It achieves 1.8× sharper cross-layer routing, 8.2% lower validation PPL at 7.6B parameters, and minimal 0.01% parameter overhead, outperforming prior methods like Attention Residuals.

Reddit r/LocalLLaMA 2026-05-25
The Financial Times has published an article about Heretic

The Financial Times used Heretic, a GitHub tool, to remove guardrails from Meta's Llama 3.3 model in under 10 minutes. The tool has enabled over 3,500 'decensored' models and 13 million downloads, sparking mainstream attention. The creator emphasizes maintaining unrestricted access to models despite ethical concerns.

Reddit r/LocalLLaMA 2026-05-25
Update on 12x32gb sxm v100 cluster / local AI for legal drafting

A lawyer details his 12-V100 GPU cluster setup for local AI legal drafting, switching from vLLM to llama.cpp. He finds MoE models outperform dense models on V100s, with throughput metrics showing 50 tok/s for a 122B/10B model, highlighting hardware and software trade-offs for local LLM deployment.

Reddit r/LocalLLaMA 2026-05-25
NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)

NuExtract3, an open-weight 4B Vision-Language Model (VLM) based on Qwen3.5-4B, is released for document processing tasks like converting images/text to Markdown, OCR, and structured data extraction. It supports self-hosting with multiple quantizations and is optimized for complex documents, receipts, invoices, and multi-page layouts.

Reddit r/LocalLLaMA 2026-05-25
Is Qwen3.6 current king for local agentic use?

A Reddit user compares Qwen3.6 35B A3B with other models like Gemma4 and GLM 4.7 Flash REAP for local agentic tasks, reporting superior performance despite imperfections. They seek better models of similar size, preferably MoE, for use with Hermes Agent and Pi.

Reddit r/LocalLLaMA 2026-05-25
Using Local LLMs for Generating Custom Interactive Recursive Textbooks on the Fly

A Reddit user discusses leveraging local large language models (LLMs) to dynamically generate customizable, interactive, and recursive textbooks. The approach emphasizes on-the-fly content creation tailored to specific needs, highlighting applications in education and specialized knowledge delivery.

Reddit r/LocalLLaMA 2026-05-26
New local model reaching near frontier on PII removal at 9 ms CPU inference

A new local machine learning model achieves near-frontier performance in removing personally identifiable information (PII) from computer use data, with a 9ms CPU inference time. The model focuses on privacy-preserving data processing and is shared for community feedback on the r/LocalLLaMA subreddit.

Reddit r/LocalLLaMA 2026-05-25
Next year we're getting 0.5T model from Grok

A Reddit post mentions that Grok-3 has joined the 'Grok-3 Opensource Release' club, with claims of a 0.5T (500B parameter) model from Grok set for release next year. The post links to a tweet by Elon Musk, though details remain unverified.

Reddit r/LocalLLaMA 2026-05-25
CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

A pull request to the llama.cpp project introduces a CUDA-accelerated Fast Walsh-Hadamard Transform (FWHT) to improve performance when quantizing key-value caches. The update provides 1-2% speedups on some tasks and 7-9% on others, demonstrating efficiency gains for large language models like gemma4 26B.A4B with Q4_K_M quantization.
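
For readers unfamiliar with the transform itself, here is a plain-Python reference of the unnormalised fast Walsh-Hadamard transform. It is not the CUDA kernel from the pull request; the function name `fwht` and the choice to return a new list are assumptions for illustration. The butterfly structure needs only additions and subtractions and runs in O(n log n) for a power-of-two length n.

```python
def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform of a power-of-two-length list."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    half = 1
    while half < n:
        # Combine pairs that are `half` apart inside blocks of size 2 * half.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                x, y = data[i], data[i + half]
                data[i], data[i + half] = x + y, x - y
        half *= 2
    return data


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # [4, 2, 0, -2, 0, 2, 0, 2]
    print(fwht(fwht([1, 2, 3, 4])))        # [4, 8, 12, 16], i.e. n times the input
```

Applying the transform twice returns the input scaled by n, so dividing by n (or by sqrt(n) on each pass) makes it its own inverse.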

Reddit r/LocalLLaMA 2026-05-25
Old Mac Pro still proving its worth

A 2016 'Trash Can' Mac Pro, equipped with D700 GPUs, was repurposed to run LLMs like Qwen via Vulkan drivers and llama.cpp, achieving usable performance and outperforming Claude Sonnet in specific tasks. The post highlights the potential of aging hardware for AI workloads with software updates.

Reddit r/LocalLLaMA 2026-05-25
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset

A user fine-tuned Qwen 3.5 0.8B on the Pangram dataset to create an AI content detector via a Chrome extension called Slop Hammer. The tool provides AI-generated probability scores locally on an M1 MacBook, leveraging a custom dataset and model optimization. Limitations include confusion with advanced models like GPT-5.5 and dataset constraints from older LLMs.

Reddit r/LocalLLaMA 2026-05-25
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

This research introduces RTPurbo, a method to convert full-attention large language models into sparse models with minimal training, achieving significant efficiency gains. By leveraging intrinsic sparsity and dynamic token selection, it maintains near-lossless accuracy while improving inference speed by up to 9.36x for long contexts.

Reddit r/LocalLLaMA 2026-05-25
1000 tps generation on Qwen3.6 27B with V100s

A Reddit user reports achieving 1000 TPS with Qwen3.6 27B model on V100 GPUs, highlighting 80 TPS for single-user generation and 3000 TPS processing without MTP. The post emphasizes hardware performance under concurrent workloads.

Reddit r/LocalLLaMA 2026-05-25
server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

A pull request in the llama.cpp project addresses issues with context reprocessing during agentic coding tasks. The fix prevents full prompt reprocessing by only reprocessing changed content, improving responsiveness when tools modify conversation history or remove reasoning steps.
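
The general idea of reprocessing only what changed can be sketched as a longest-common-prefix comparison between the cached token sequence and the new prompt. This is a generic illustration in Python, not the logic of the pull request or of llama.cpp's checkpoints; the function names and the token-list representation are hypothetical.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached and the new prompt."""
    shared = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return shared


def plan_reprocessing(cached_tokens, new_tokens):
    """Return (tokens_kept, tokens_to_process) for the new prompt.

    Tokens up to the first difference can reuse cached state; everything
    after that point has to be processed again.
    """
    keep = common_prefix_len(cached_tokens, new_tokens)
    return keep, new_tokens[keep:]


if __name__ == "__main__":
    cached = [1, 2, 3, 4, 5]
    edited = [1, 2, 3, 9, 5, 6]  # a tool rewrote the history at position 3
    print(plan_reprocessing(cached, edited))  # (3, [9, 5, 6])
```

An edit early in the conversation invalidates everything after it, which is why tools that modify history or strip reasoning steps can be expensive without such reuse.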

Reddit r/LocalLLaMA 2026-05-26
Free AI Blog site - I have unused credits expiring soon, feel free to try it

A Reddit user is offering free access to their AI-powered blog site, which uses unused AI credits before they expire. The site is a personal project for experimenting with AI agent workflows, and the user aims to avoid wasting credits by allowing others to use it casually.

Reddit r/LocalLLaMA 2026-05-25
Need Help - What would you build? Air-gapped NL assistant that is integrated with Splunk

A Reddit user seeks guidance on building an air-gapped natural language assistant integrated with Splunk for on-premises log analysis, focusing on Korean language support, small model deployment, and conversational AI. They ask about optimal frameworks, model choices, and potential use cases like root cause analysis.

Reddit r/LocalLLaMA 2026-05-25
Llama.cpp : Split Mode Tensor Fix Incoming?

A potential fix for crashes in split-mode tensor multi-GPU setups is incoming for llama.cpp. Users report a 35% performance boost with SM Tensor but run into VRAM exhaustion, which this update aims to resolve. The fix is part of an active pull request and matters for stability in multi-GPU inference.

Reddit r/LocalLLaMA 2026-05-25
Is there any case of a less quantised smaller model outperforming a more quantised larger model?

A Reddit user asks whether smaller, less quantized models can outperform larger, more quantized models in creative writing tasks, citing specific examples like Gemma and Qwen variants. The post seeks insights into model efficiency trade-offs for practical applications.

Reddit r/LocalLLaMA 2026-05-25
Whats the best Qwen 27B Q8 quant?

A Reddit user discusses challenges with lower quantization levels (Q4-Q6) of the Qwen 27B model for coding tasks, noting inaccuracies. They currently use Q8 with MTP but find it slow, and are considering the larger 35B-A3B variant (a mixture-of-experts model with roughly 3B active parameters) for better performance.

Reddit r/LocalLLaMA 2026-05-25
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

ThriftAttention introduces a selective mixed precision approach using FP4 to optimize long-context attention mechanisms in large language models, improving efficiency without sacrificing performance. The method focuses on reducing computational overhead while maintaining accuracy for extended sequence processing.

Reddit r/LocalLLaMA 2026-05-25
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR RotationZoo introduces a 2-bit KV cache quantization method using spectral covariance-aware rotations to compress memory usage by ~7× with minimal accuracy loss. The precomputed rotation matrices enable efficient deployment of large language models on limited VRAM, with potential for 30-40B MoE models and 10-20B dense models.
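
A figure of roughly 7 times is consistent with simple bit accounting, assuming the baseline cache is stored as 16-bit floats and each group of quantized values carries some metadata. The group size of 128 and the 32 bits of per-group metadata below (for example an fp16 scale plus an fp16 zero point) are illustrative assumptions, not figures from the post.

```python
def compression_ratio(baseline_bits, quant_bits, group_size, metadata_bits):
    """Ratio of baseline storage to quantized storage per cached value.

    metadata_bits is the per-group overhead (e.g. scale and zero point),
    amortized over group_size values.
    """
    effective_bits = quant_bits + metadata_bits / group_size
    return baseline_bits / effective_bits


# Ideal case with no metadata: 16 bits down to 2 bits.
print(compression_ratio(16, 2, 128, 0))             # 8.0
# Assumed layout: groups of 128 values, 32 bits of metadata per group.
print(round(compression_ratio(16, 2, 128, 32), 2))  # 7.11
```

Under these assumptions the metadata overhead alone accounts for the gap between the ideal 8 times and the reported figure.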

Reddit r/LocalLLaMA 2026-05-25
MiMo-V2.5-coder

A new version of the MiMo model, MiMo-V2.5-coder, has been released, optimized for coding tasks on systems with 128 GB of memory. It is positioned as a fast alternative to Qwen3.6 and DS4, with reliable tool calling.

Reddit r/LocalLLaMA 2026-05-25
Locally-hosted language-learning AI you can talk to comparable to Pingo AI?

A Reddit user is seeking locally-hosted AI language-learning tools comparable to Pingo AI, specifically for practicing Swedish through verbal interaction. They prefer open-source or self-deployed solutions over paid services to reduce costs and understand the technology better.

Reddit r/LocalLLaMA 2026-05-26
Shard - getting to 10× KV cache compression

Shard is a HuggingFace cache replacement that achieves 10× KV cache compression for Llama-3.1-8B at 8K context by applying PCA with int4 quantization to keys and Hadamard rotation with vector quantization to values, maintaining performance on benchmarks without requiring fp16 reconstruction.
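
To show why a Hadamard rotation helps before quantization, the sketch below rotates a vector and then applies plain scalar 4-bit quantization. The scalar quantizer is a simplified stand-in for the vector quantization the post describes, and none of this is Shard's actual code; all function names are hypothetical.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_int4(vec):
    """Symmetric scalar quantization to integers in [-7, 7] plus one scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# A vector with one large outlier, which is hard to quantize directly.
v = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = hadamard_rotate(v)  # energy is spread evenly across entries
codes, scale = quantize_int4(rotated)
# The normalized transform is its own inverse, so rotating again undoes it.
restored = hadamard_rotate(dequantize(codes, scale))
print(codes)
print([round(x, 3) for x in restored])
```

The rotation spreads the outlier's energy across all coordinates, so a single shared scale wastes fewer quantization levels.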

Reddit r/LocalLLaMA 2026-05-25
Anyone use QwQ-32B? It's over a year old? Has Qwen 3.6 27b basically replaced it?

A Reddit user questions the continued relevance of the QwQ-32B model, now over a year old, in light of newer models like Qwen 3.6 and Gemma 4. The post seeks insights on whether QwQ-32B is still preferred for specific tasks such as coding or other applications.

Reddit r/ArtificialIntelligence 2026-05-25
Cerebras Chip Sets Appear to be Optimized for LLM Use Cases

Cerebras chips are highlighted as specialized hardware optimized for large language model (LLM) training and inference, with performance metrics focused on token throughput. The analysis suggests their design prioritizes LLM workloads over other AI domains like robotics or real-time systems, which require different computational priorities.

Reddit r/ArtificialIntelligence 2026-05-26
What AI do you recommend for high school and college students?

A Reddit post asks for AI recommendations for high school and college students, focusing on tools to aid research and assignments. The discussion highlights the growing role of AI in education and practical applications for learners.

Reddit r/ArtificialIntelligence 2026-05-25
I got AI to compile a music production course. Anyone proficient in music care to check it out?

A Reddit user used AI to structure a comprehensive music production course by compiling content from multiple Udemy courses, seeking feedback from experienced musicians to validate the curriculum's effectiveness.

Reddit r/ArtificialIntelligence 2026-05-25
Anthropic moves closer to powering America's spy agencies

Anthropic, known for its large language models, is advancing toward providing AI solutions to U.S. intelligence agencies, signaling potential integration of its technology into national security operations. This development highlights the growing intersection of AI capabilities and government surveillance/analytical needs.

Reddit r/DeepLearning 2026-05-25
[Release] Apex-Qwen3.6-35B-A3B Q4_K_M - lower KLD at the same Q4_K_M size class

A new model release, Apex-Qwen3.6-35B-A3B, claims lower KLD (Kullback-Leibler divergence) within the same Q4_K_M size class, meaning its output distribution should stay closer to a reference model at a comparable file size. The post presents this as an efficiency gain for quantized large language models (LLMs).
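
For readers unfamiliar with the metric, KLD in this setting compares the token probability distribution of a quantized model against that of a reference model; lower values mean the quantized predictions stay closer to the reference. A minimal sketch follows, using made-up three-token distributions rather than measurements from the release.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


reference = [0.70, 0.20, 0.10]  # made-up next-token probabilities
quant_a = [0.68, 0.21, 0.11]    # stays close to the reference
quant_b = [0.50, 0.30, 0.20]    # drifts further away

print(kl_divergence(reference, quant_a) < kl_divergence(reference, quant_b))  # True
```

In practice the divergence would be averaged over many token positions in an evaluation text rather than computed for a single distribution.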

Reddit r/DeepLearning 2026-05-25
How do you survive?

A Reddit user asks for strategies to participate in the open-source LLM community without financial resources or GPU access, highlighting challenges faced by under-resourced contributors. Discussions likely focus on alternative methods like cloud credits, collaborative projects, or optimizing existing tools.

Reddit r/DeepLearning 2026-05-25
Why Can't Transformers Multiply Beyond Their Training Length? (And a Fix: 80.6% on Unseen Digits)

A Reddit post discusses why transformers struggle with multiplication tasks beyond their training length and presents a fix achieving 80.6% accuracy on unseen digits, highlighting architectural limitations and potential solutions in AI model design.

Reddit r/DeepLearning 2026-05-26
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

ThriftAttention introduces a selective mixed-precision approach using FP4 for long-context attention mechanisms, aiming to optimize memory and computational efficiency in large language models. The method focuses on reducing resource demands while maintaining performance for extended sequence processing.

Reddit r/ArtificialIntelligence 2026-05-26
AI is becoming epistemic infrastructure controlled by a handful of private individuals?

The post argues that AI is evolving into a central epistemic infrastructure controlled by a few private entities, raising concerns about bias, opacity, and democratic accountability. It compares this centralization to historical power structures like the Church, warning that reliance on AI risks eroding critical thinking and shaping societal understanding of reality.

Hacker News Mon, 25 Ma
Show HN: OpenBrief – Local-first video downloader/summarizer

OpenBrief is a local-first tool for downloading and summarizing videos, leveraging AI to process content locally. It likely uses natural language processing or computer vision to generate summaries, emphasizing user privacy and offline functionality.

Reddit r/MachineLearning 2026-05-25
Aiki my local Wikipedia Retrieval-Augmented Generation system [R]

Aiki is a lightweight, locally-run tool enabling users to chat with Wikipedia content. It uses a custom TF-IDF + cosine similarity retriever for knowledge retrieval and optionally generates answers with an LLM, supporting query expansion via Wikipedia links. The system emphasizes minimal dependencies and offline operation.
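
The retrieval step can be sketched as follows. This is a generic TF-IDF plus cosine-similarity ranker written with the standard library only, not Aiki's actual code; the whitespace tokenizer, the smoothed IDF formula, the sample passages, and function names such as `build_index` and `search` are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the index are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def build_index(docs):
    """Return (idf, doc_vectors) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so every indexed term gets a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [vectorize(toks, idf) for toks in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, idf, vectors, top_k=2):
    """Rank documents by cosine similarity to the query."""
    q = vectorize(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(docs[i], score) for score, i in scored[:top_k]]


docs = [
    "the eiffel tower is a landmark in paris",
    "python is a programming language",
    "paris is the capital of france",
]
idf, vectors = build_index(docs)
for text, score in search("capital of france", docs, idf, vectors):
    print(round(score, 3), text)
```

In a system of this kind, the top-ranked passages would then be shown directly or passed to an LLM as context for the optional answer-generation step.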

Reddit r/LocalLLaMA 2026-05-25
MiniCPM5-1B

A new model called MiniCPM5-1B was submitted to the LocalLLaMA subreddit by user kevinlch. The post discusses the model's features and potential applications, sparking community discussion. It's part of the ongoing development in local LLM projects.

Reddit r/DeepLearning 2026-05-25
Truth ! Kwizerana & membrane both using Deepseek & happy with

A Reddit user shares their experience switching from Codex to Deepseek's DSv4 model after community advice, mentioning configuration on a 9router setup and exploring its use as a coding agent. The post highlights model comparison and practical implementation choices.

MLOps

Reddit r/ArtificialIntelligence 2026-05-25
Uber's COO says it's getting harder to justify the money spent on AI tokenmaxxing

Uber's COO highlights growing challenges in justifying AI expenditures, potentially referencing costs tied to token-based systems or models. The comment reflects broader concerns about AI investment returns in enterprise settings.

NLP

Reddit r/MachineLearning 2026-05-26
Already 11 000 submissions for EMNLP? [D]

A Reddit user notes a sharp increase in EMNLP submissions, with 11,000 already reported compared to 8,000 last year, and asks whether this growth is normal. The post highlights growing interest in NLP research and conference submission dynamics.

Reddit r/DeepLearning 2026-05-25
Wants to make a deep learning model to solve JEE Mains and Advanced questions

A Reddit user seeks guidance on building a deep learning model to solve JEE Mains and Advanced questions, struggling with dataset acquisition, model selection, and leveraging pre-trained models from Hugging Face.

Reddit r/DeepLearning 2026-05-25
[D] BRIDGE: A multilingual NLP benchmark covering 22 Global South languages with code-switching evaluation

BRIDGE is a multilingual NLP benchmark evaluating models across 22 Global South languages, emphasizing semantic similarity metrics and code-switching. It addresses gaps in low-resource language coverage and moves beyond traditional WER/CER metrics.

RL

Reddit r/MachineLearning 2026-05-25
If you use NVIDIA Isaac Sim for reinforcement learning, do you use Isaac Lab with it? Just want to get a sense of what the status quo is. [D]

A Reddit user is evaluating NVIDIA Isaac Sim and Isaac Lab for reinforcement learning (RL), highlighting challenges with Isaac Lab's poor documentation and setup complexity despite its strengths in multi-actor systems and logging. They seek insights into common practices for integrating these tools or opting for direct Isaac Sim interfacing.

Speech

Reddit r/MachineLearning 2026-05-25
Best architecture for seamless Bilingual TTS? (Azure / English + Korean) [D]

A developer seeks solutions for seamless bilingual Text-to-Speech (TTS) in a language learning app using Azure, facing challenges with robotic Korean accents and pauses when switching between English and Korean voices. They explore SSML optimizations, Azure OpenAI voices, and alternative architectures.

Trending repositories on GitHub filtered and scored for relevance to your interests.

AI Safety

Agentic AI

General

LLM

MLOps

NLP