Tuesday, June 23, 2026

Daily Digest 2026-06-23

Today’s digest highlights a significant shift toward agentic workflows, focusing on autonomous research, industrial-scale diagnostics, and the optimization of multi-agent systems. There is a clear emphasis on bridging the gap between high-level reasoning and physical embodiment through spatial memory and 3D perception.

Research highlights:

Agentic Systems and Research: New frameworks explore autonomous research pipelines, multi-agent prompt optimization, and human-in-the-loop refinement for scientific discovery.
Embodied AI and Robotics: Research focuses on 3D spatial memory, stereo-based occupancy datasets, and 6D pose estimation for more accurate physical interaction.
Medical AI: Developments include dual-stream vision-language models for 3D CT diagnosis and parameter-efficient adaptation techniques to prevent forgetting in medical contexts.
Video and Image Generation: Innovations include procedural video representation learning, controllable interaction generation, and diversity-focused semantic browsing for image synthesis.
Optimization and Reasoning: Papers introduce counterfactual policy optimization for multimodal reasoning and robust Nash Equilibrium seeking under partial information.

Tech buzz:

The industry is seeing a surge in “agentic” applications, ranging from autonomous telecommunications networks to specialized coding tools.
Model Efficiency: A new 3B parameter model is demonstrating high-level reasoning capabilities through novel supervised fine-tuning and reinforcement learning techniques.
Computer Vision: The release of unified real-time end-to-end vision models marks a step forward in standardized object detection and tracking.

Global Trends

Personal Interests

Papers discovered through your interest topics.

Embodied AI

3/5 Computer Vision and Pattern Recognition (cs.CV) 22 Jun 2026

IMAGIN-4D: Image-Guided Controllable Interaction Generation

Sai Kumar Dwivedi, Federica Bogo, Buğra Tekin, Chenhongyi Yang, Nadine Bertsch, Tomas Hodan, Michael J. Black, Dimitrios Tzionas, Shreyas Hampali

ArXiv ID: 2606.23675

Abstract:

Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image as a visual specification of the desired interaction snapshot. However, a single global image representation conflates distinct cues and conditions all frames on identical visual evidence. We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at https://imagin4d.github.io.

Insights

Contribution: The paper introduces IMAGIN-4D, a diffusion-based framework that enables fine-grained control over human-object interactions (HOI) by decomposing a reference image into spatio-temporal conditioning tokens.

Core Idea: To resolve the ambiguity of underspecified text and waypoint prompts, the model uses a reference image to specify precise interaction snapshots, avoiding the conflation of cues by decomposing the image into spatial states and temporal frame-aware tokens.

Technique: The method employs a diffusion model with role-aware conditioning, utilizing separate AdaLN streams for text, waypoints, and interaction-state tokens, while using cross-attention for frame-aware visual tokens.

Pipeline: Text prompt + Object geometry + Waypoints + Reference image → Spatio-temporal token extraction & Role-aware conditioning → Diffusion-based motion generation

Methodology: The authors developed a synthetic motion-to-image rendering pipeline to create a training dataset and introduced a new image-adherence metric to evaluate how well the generated motion matches the reference snapshot.

Results: IMAGIN-4D outperforms single-token and uniformly image-conditioned baselines in fine-grained interaction control while maintaining high motion quality and waypoint adherence on the FBM and BEHAVE datasets.

Limitations: The model relies on a synthetic motion-to-image rendering pipeline for training due to the lack of paired images in existing HOI datasets.
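The role-aware conditioning named under Technique can be illustrated with a minimal adaptive layer norm (AdaLN) sketch. This is not the authors' code. It assumes the common AdaLN form (normalize, then apply a condition-dependent scale and shift). The function names, the toy per-role scale/shift values, and the choice to apply the three streams one after another are all hypothetical, since the abstract does not say how the streams are combined.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln(x, scale, shift):
    """Adaptive layer norm: normalized features modulated by a condition.

    In a real model `scale` and `shift` are predicted from a conditioning
    embedding by a small network; here they are passed in directly.
    """
    normed = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

# Hypothetical modulation parameters, one (scale, shift) pair per role.
streams = {
    "text": ([0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]),
    "waypoints": ([0.0, 0.2, 0.0, 0.2], [0.5, 0.0, 0.5, 0.0]),
    "interaction_state": ([0.3, 0.0, 0.0, 0.3], [0.0, 0.1, 0.1, 0.0]),
}

motion_token = [1.0, 2.0, 3.0, 4.0]
h = motion_token
for role, (scale, shift) in streams.items():
    # Assumption: the streams modulate the token sequentially.
    h = adaln(h, scale, shift)
```

Keeping a separate (scale, shift) pair per role means each conditioning signal modulates the motion features through its own parameters instead of being merged into one embedding first.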

3/5 Robotics (cs.RO), Computer Vision and Pattern Recognition (cs.CV) 22 Jun 2026

HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory

Xiaolin Zhou, Liu Liu, Tingyang Xiao, Wei Feng, Fa Fu, Xinrui Meng, Xinjie Wang, Jialiang Han, Boyang Yu, Yun Du, Wei Sui, Zhizhong Su

ArXiv ID: 2606.23565

Abstract:

LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.

Insights

Contribution: The paper introduces HoloAgent-0, a unified framework that integrates 3D spatial memory with a closed-loop operating system to enable complex, long-horizon robot tasks.

Core Idea: To bridge the gap between digital LLM reasoning and physical robot execution, the framework unifies heterogeneous robot controllers through a structured OS that manages skill graphs and spatial grounding.

Technique: The framework employs a three-layer architecture consisting of Embodied AgentOS for execution logic, 3D spatial memory for environmental grounding, and a library of embodied skills for physical actions.

Pipeline: Natural language instructions → Embodied AgentOS (Skill Graph conversion & Resource Scheduling) → 3D Spatial Memory Grounding → Embodied Skills Execution → Runtime Feedback → Re-planning/Clarification

Methodology: The authors developed a modular system that converts high-level goals into executable graphs, deployed it on real hardware, and evaluated it across navigation, object search, and mobile manipulation tasks.

Results: The framework successfully demonstrated capabilities in long-horizon navigation, cross-robot coordination, and closed-loop mobile manipulation on real-world hardware.

Limitations: The paper does not explicitly detail the scalability of the 3D spatial memory in extremely large-scale dynamic environments or the latency overhead of the closed-loop re-planning cycle.
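The skill-graph execution loop in the Pipeline line above can be sketched as a dependency-ordered run with retry on failure. This is an illustration under assumptions, not the Embodied AgentOS interface: the skill names, the `run_skill` stand-in, and the retry policy are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical skill graph: each skill maps to the skills it depends on.
skill_graph = {
    "locate_cup": set(),
    "navigate_to_cup": {"locate_cup"},
    "grasp_cup": {"navigate_to_cup"},
    "navigate_to_table": {"grasp_cup"},
    "place_cup": {"navigate_to_table"},
}

def run_skill(name, attempt):
    """Stand-in for a robot skill; 'grasp_cup' fails on its first attempt."""
    return not (name == "grasp_cup" and attempt == 0)

def execute(graph, max_retries=2):
    """Run skills in dependency order, retrying when feedback reports failure."""
    log = []
    for skill in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            ok = run_skill(skill, attempt)
            log.append((skill, attempt, ok))
            if ok:
                break
        else:
            # A fuller system would re-plan or ask for clarification here.
            raise RuntimeError(f"skill {skill!r} failed after retries")
    return log

log = execute(skill_graph)
```

The log records one failed and one successful attempt for `grasp_cup`, which is the simplest form of the runtime feedback the abstract describes.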

3/5 Computer Vision and Pattern Recognition (cs.CV), Robotics (cs.RO) 22 Jun 2026

Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation

Mingyu Mei, Li Zhang, Zibo Dai, Han Sun, Xinyue Zhao, Huiliang Shen, Zaixing He

ArXiv ID: 2606.23293

Abstract:

6D pose estimation is a key task in computer vision and embodied AI, widely used in robotic manipulation, augmented reality, etc. Existing methods directly regress in a high-dimensional continuous space, facing two key challenges in category-level pose estimation: limited accuracy due to noise and local optima, and inefficient search over an infinite space that hinders real-time performance. This paper proposes Flow6D, a hierarchical flow matching framework with a two-stage discrete latent space localization-continuous pose regression strategy. Rotation and translation parameters are first discretized into bins, with a discrete flow matching model locking the latent space around the true pose to reduce search complexity. Then, by sampling in the latent space, a continuous flow matching model predicts local pose residuals to optimize the estimate and regress to an accurate pose. The framework also naturally extends to articulated objects, outperforming state-of-the-art methods on synthetic and real datasets with real-time inference at 70 FPS. Project website: https://flow6d.github.io/.

Insights

Contribution: The paper introduces Flow6D, a hierarchical flow matching framework that addresses the accuracy and efficiency challenges of category-level 6D pose estimation by combining discrete latent space localization with continuous pose regression.

Core Idea: The core idea is to decompose the 6D pose estimation task into a two-stage process: first narrowing down the search space using discrete flow matching and then refining the pose using continuous flow matching.

Technique: The method utilizes a discrete-to-continuous flow matching strategy where rotation and translation are discretized into bins to lock the latent space before regressing local residuals.

Pipeline: Input Image → Discrete Flow Matching (Latent Space Localization) → Continuous Flow Matching (Local Residual Regression) → Final 6D Pose

Methodology: The framework discretizes pose parameters into bins to reduce search complexity and employs a hierarchical flow matching model to predict accurate pose residuals from the localized latent space.

Results: Flow6D outperforms state-of-the-art methods on synthetic and real datasets, achieves real-time inference at 70 FPS, and naturally extends to articulated objects.

Limitations: The paper does not explicitly detail the sensitivity of the binning strategy to different object scales or the computational overhead of the hierarchical flow matching compared to single-stage regressors.
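The coarse-to-fine decomposition behind the two stages can be shown on a single rotation angle: a discrete bin index narrows the search, and a small continuous residual recovers the exact value. This sketch covers only that decomposition, not flow matching itself. The bin count and function names are assumptions for illustration.

```python
def to_bin_and_residual(angle_deg, num_bins=12):
    """Split an angle in [0, 360) into a coarse bin index and a residual
    measured from the bin center."""
    width = 360.0 / num_bins
    idx = int(angle_deg // width) % num_bins
    center = (idx + 0.5) * width
    return idx, angle_deg - center

def from_bin_and_residual(idx, residual, num_bins=12):
    """Recombine a bin index and residual into a continuous angle."""
    width = 360.0 / num_bins
    return ((idx + 0.5) * width + residual) % 360.0

idx, res = to_bin_and_residual(97.0)   # 30-degree bins: bin 3, center 105
recovered = from_bin_and_residual(idx, res)
```

Once the bin is fixed, the residual is bounded by half the bin width (15 degrees here), so the continuous stage only has to search a small local range.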

3/5 Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI) 22 Jun 2026

P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

Felix Tristram, Stefano Gasperini, Benjamin Killeen, Marcel Walch, Christian Benz, Nassir Navab, Ghazal Ghazaei

ArXiv ID: 2606.23256

Abstract:

The increasing maturity of embodied AI platforms has driven a growing interest in procedural video representation learning to support intelligent assistance systems for complex, multi-step tasks. Leveraging large-scale latent predictive training, video foundation models capture video dynamics, enabling downstream tasks such as activity understanding, spatiotemporal localization, and predictive control. However, procedural videos include actions with long-range dependencies that these models do not support, due to the quadratic complexity of self-attention. Distinct actions, for example, may be visually similar despite appearing at different points in the procedure, such as turning the stove on versus off. Here, we propose a backbone-agnostic approach that learns long-duration video representations by reducing the problem to a dense, frame-aligned action space and predicting pooled masked latent vectors. This approach allows our Procedural Joint Embedding Predictive Architecture (P-JEPA) to ingest videos over 30 minutes long, enabling effective long-form understanding of procedural steps. We evaluate P-JEPA using features extracted with VJEPA2.1, TSM, and I3D over the EgoExo4D, EgoProceL, and Assembly101 datasets, finding that it consistently improves linear separability, streaming inference, and temporal action segmentation performance, achieving state-of-the-art results on EgoExo4D fine-grained action classification while using an order of magnitude fewer parameters than LLM-based methods and running in real time.

Insights

Contribution: The paper introduces P-JEPA, a backbone-agnostic framework for learning long-duration procedural video representations that overcomes the quadratic complexity of self-attention in long-form videos.

Core Idea: The authors propose reducing long-duration video understanding to a dense, frame-aligned action space to capture long-range dependencies and distinguish visually similar actions occurring at different procedural stages.

Technique: P-JEPA utilizes a Joint Embedding Predictive Architecture that predicts pooled masked latent vectors, allowing the model to process videos exceeding 30 minutes in length.

Pipeline: Long-form procedural video → Feature extraction (VJEPA2.1, TSM, or I3D) → Dense frame-aligned action space mapping → Pooled masked latent vector prediction → Long-form representation learning

Methodology: The model is evaluated across EgoExo4D, EgoProceL, and Assembly101 datasets using various backbones to measure linear separability, streaming inference, and temporal action segmentation.

Results: P-JEPA consistently improves linear separability, streaming inference, and temporal action segmentation, and achieves state-of-the-art results on EgoExo4D fine-grained action classification while using an order of magnitude fewer parameters than LLM-based methods and running in real time.

Limitations: The paper does not explicitly detail the specific performance trade-offs when using different backbone architectures or the scalability of the dense action space for extremely high-frequency actions.
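One way to read "predicting pooled masked latent vectors" is sketched below: hide a contiguous span of frame features, mean-pool the hidden span into one target vector, and score a prediction made from the remaining context. The abstract does not give the masking scheme, pooling operator, or predictor, so all three are assumptions here, and the context-mean "predictor" is only a placeholder for a learned model.

```python
def mean_pool(vectors):
    """Average a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def masked_pooled_target(frames, start, length):
    """Hide frames[start:start+length]; return (context, pooled target)."""
    masked = frames[start:start + length]
    context = frames[:start] + frames[start + length:]
    return context, mean_pool(masked)

def l2_loss(pred, target):
    """Squared error between a predicted and a target latent vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

# Toy per-frame features standing in for backbone outputs.
frames = [[float(t), float(t % 2)] for t in range(8)]
context, target = masked_pooled_target(frames, start=1, length=3)
prediction = mean_pool(context)        # placeholder for a learned predictor
loss = l2_loss(prediction, target)
```

Pooling the masked span gives one target per span instead of one per frame, which is one plausible way to keep the prediction cost low for very long videos.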

3/5 Robotics (cs.RO), Computer Vision and Pattern Recognition (cs.CV) 22 Jun 2026

Humanoid-OmniOcc: Stereo-Based Full-View Occupancy Dataset for Embodied AI

Xianda Guo, Bohao Zhang, Chenwei Huang, Shiyuan Chen, Ruilin Wang, Yiqun Duan, Cong Yang, Qin Zou, Wei Sui

ArXiv ID: 2606.22971

Abstract:

Occupancy prediction at voxel-level granularity is essential for safe robotic navigation and interaction in complex environments. Existing occupancy datasets, however, are predominantly designed for autonomous driving with vehicle-centric biases (forward-facing cameras, far-field geometry, and static road priors), limiting their applicability to embodied humanoid perception. We present Humanoid-OmniOcc, a large-scale panoramic stereo-based occupancy dataset tailored for humanoid robots. The dataset encompasses 15 diverse simulated indoor scenes and 5 real-world environments, yielding over 155K samples with broad scene and style diversity. Importantly, the dataset is designed around a Real2Sim2Real closed-loop paradigm: real sensor specifications drive physically accurate simulation, simulation produces large-scale annotated training data, and models trained in simulation are directly evaluated on real-world captures, enabling iterative refinement of the sim-to-real pipeline. We further propose a Humanoid Surround Stereo-guided Occupancy model (Humanoid-OmniOcc) that exploits robust depth priors for accurate 2D-to-3D lifting. Extensive experiments show that Humanoid-OmniOcc consistently outperforms monocular baselines and generalizes well to both unseen simulated test scenes and real-world environments, validating the effectiveness of the Real2Sim2Real design. Code and data will be available upon acceptance at https://d-robotics-ai-lab.github.io/humanoid-omniocc.

Insights

Contribution: The paper introduces Humanoid-OmniOcc, a large-scale panoramic stereo-based occupancy dataset and a corresponding model specifically designed for humanoid robots to overcome vehicle-centric biases in existing datasets.

Core Idea: The research leverages a Real2Sim2Real closed-loop paradigm to bridge the gap between simulation and reality, providing high-granularity voxel-level occupancy for embodied AI.

Technique: The authors propose the Humanoid Surround Stereo-guided Occupancy (Humanoid-OmniOcc) model, which utilizes robust depth priors from stereo vision for accurate 2D-to-3D lifting.

Pipeline: Real sensor specifications → Physically accurate simulation → Large-scale annotated training data → Model training → Real-world evaluation

Methodology: The authors build a dataset of 15 simulated indoor scenes and 5 real-world environments (over 155K samples), then train a stereo-guided model to predict 3D occupancy from panoramic views.

Results: The Humanoid-OmniOcc model consistently outperforms monocular baselines and demonstrates strong generalization across unseen simulated scenes and real-world environments.

Limitations: The dataset covers 15 simulated indoor scenes and 5 real-world environments; how well the Real2Sim2Real pipeline scales to highly dynamic scenes or large-scale outdoor humanoid tasks remains open.
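The 2D-to-3D lifting that stereo depth enables can be sketched with textbook geometry: rectified-stereo depth from disparity, pinhole back-projection, and quantization into a voxel grid. This is generic geometry, not the Humanoid-OmniOcc model. The camera intrinsics, baseline, grid origin, and voxel size below are made-up values for illustration.

```python
import math

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Rectified-stereo relation: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def backproject(u, v, depth, focal_px, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth."""
    return ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px, depth)

def voxel_index(point, origin, voxel_size):
    """Map a 3D point to integer voxel coordinates in a regular grid."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))

# Hypothetical camera and grid parameters.
focal_px, baseline_m, cx, cy = 500.0, 0.1, 320.0, 240.0
depth = disparity_to_depth(25.0, focal_px, baseline_m)      # 2.0 m
point = backproject(420.0, 240.0, depth, focal_px, cx, cy)  # (0.4, 0.0, 2.0)
voxel = voxel_index(point, origin=(-4.0, -4.0, 0.0), voxel_size=0.25)
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more accuracy at long range than at the near-field distances typical of indoor humanoid operation.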

Multi-Agent Systems

3/5 Machine Learning (cs.LG), Multiagent Systems (cs.MA) 22 Jun 2026

MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?

Juyang Bai, Laixi Shi

ArXiv ID: 2606.23664

Abstract:

Multi-agent systems (MAS) offer a scalable path forward for agentic AI, comprising multiple LLM-based agents, each assigned a system prompt and a position within a workflow that governs inter-agent coordination and output aggregation. System prompts thus form a critical and accessible optimization surface: they specify agents' roles and behaviors, enabling system-level improvements without model finetuning. Although prompt optimization has shown substantial potential for single LLMs, extending it to MAS poses distinct challenges, notably an exponentially growing search space. It remains unclear whether, when, and by how much prompt optimization improves MAS performance, and how sensitive such gains are to system configuration. In this work, we systematically study system-prompt optimization across a broad range of MAS setups varying in task, workflow, communication protocol, and team size, benchmarking two prompt optimizers that naturally extend state-of-the-art single-agent methods. The results reveal its potential to unlock significant gains while exposing open challenges, characterizing when and how much prompt optimization helps across diverse MAS settings.

Insights

Contribution: The paper systematically investigates the impact of system-prompt optimization on Multi-Agent Systems (MAS), identifying when and how much optimization improves performance across diverse configurations.

Core Idea: While prompt optimization is effective for single LLMs, MAS introduces an exponentially growing search space, making it unclear if optimizing individual agent prompts translates to significant system-level gains.

Technique: The authors benchmark two prompt optimizers extended from state-of-the-art single-agent methods to evaluate their effectiveness across various MAS workflows.

Pipeline: MAS configurations (task, workflow, protocol, team size) → Prompt optimization of individual agent system prompts → Performance evaluation of the aggregated MAS output.

Methodology: The study employs a systematic benchmarking approach across a broad range of MAS setups, varying communication protocols and team sizes to measure sensitivity to system configuration.

Results: The results reveal that prompt optimization can unlock significant gains in some MAS settings but also expose challenges regarding the sensitivity of these gains to specific system configurations.

Limitations: The research highlights the challenge of the exponentially growing search space in MAS and leaves open questions regarding the optimal balance of optimization across complex workflows.
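The "exponentially growing search space" is easy to make concrete. With k candidate prompts per agent and n agents, an exhaustive joint search covers k^n combinations, while optimizing one agent at a time costs about n·k evaluations per sweep. The coordinate-wise strategy is shown only as a contrast. It is an assumption for illustration, not one of the two optimizers the paper benchmarks.

```python
def joint_search_space(num_agents, candidates_per_agent):
    """Number of joint prompt assignments across all agents."""
    return candidates_per_agent ** num_agents

def coordinate_wise_evals(num_agents, candidates_per_agent, sweeps=1):
    """Evaluations needed when optimizing one agent's prompt at a time."""
    return sweeps * num_agents * candidates_per_agent

for n in (2, 4, 8):
    print(n, joint_search_space(n, 10), coordinate_wise_evals(n, 10))
```

The cheaper strategy ignores interactions between agents' prompts, which is one reason gains could depend on workflow and team size.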

3/5 eess.SY 22 Jun 2026

Robust Data-Driven Nash Equilibrium Seeking under Partial-Decision Information

Linqi Wang, Yifei Li, Wenjie Liu, Yuzhou Wei, Gang Wang, Lihua Xie

ArXiv ID: 2606.23011

Abstract:

This paper presents a data-driven framework for decentralized Nash equilibrium (NE) seeking in multi-agent systems with unknown linear dynamics subject to exogenous disturbances, operating under partial-decision information (where agents lack direct access to the decisions of all others) and equality constraints. The proposed framework integrates an NE model, a distributed communication protocol, an internal model for disturbance rejection, and a data-driven stabilization strategy. By reformulating the problem as a cooperative output regulation problem, we synthesize controllers directly from noisy input-state data via semi-definite programs (SDPs), providing formal guarantees for closed-loop stability and asymptotic convergence to the NE. The approach is further extended to a class of nonlinear systems with constant disturbances by leveraging integral control and describing nonlinearities via quadratic constraints. Numerical simulations involving unmanned aerial vehicle networks and a rotary-wing aerial vehicle formation validate the efficacy and robustness of the proposed method.

Insights

Contribution: The paper proposes a robust data-driven framework for decentralized Nash equilibrium (NE) seeking in multi-agent systems with unknown linear dynamics, exogenous disturbances, and partial-decision information.

Core Idea: The problem is reformulated as a cooperative output regulation problem, allowing for the synthesis of controllers directly from noisy input-state data while ensuring stability and convergence.

Technique: The framework utilizes semi-definite programs (SDPs) to synthesize controllers and employs an internal model for disturbance rejection and quadratic constraints for nonlinear extensions.

Pipeline: Noisy input-state data → Cooperative output regulation reformulation → Semi-definite programming (SDP) controller synthesis → Decentralized Nash equilibrium seeking

Methodology: The authors integrate an NE model with a distributed communication protocol and a data-driven stabilization strategy, extending the linear approach to nonlinear systems via integral control.

Results: The method provides formal guarantees for closed-loop stability and asymptotic convergence, validated through numerical simulations of UAV networks and rotary-wing aerial vehicle formations.

Limitations: The main results assume linear dynamics; the nonlinear extension is restricted to constant disturbances and to nonlinearities that can be described by quadratic constraints.
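For readers new to Nash equilibrium seeking, a toy example may help. The sketch below runs plain gradient play on a two-player quadratic game with known costs and full information. It is deliberately much simpler than the paper's setting (no unknown dynamics, disturbances, partial-decision information, or SDP-based synthesis), and the cost functions are invented for illustration.

```python
# Player 1 minimizes J1 = (x1 - 1)^2 + 0.5 * x1 * x2 over x1.
# Player 2 minimizes J2 = (x2 + 1)^2 + 0.5 * x1 * x2 over x2.

def grad1(x1, x2):
    """Partial derivative of J1 with respect to x1."""
    return 2.0 * (x1 - 1.0) + 0.5 * x2

def grad2(x1, x2):
    """Partial derivative of J2 with respect to x2."""
    return 2.0 * (x2 + 1.0) + 0.5 * x1

def gradient_play(x1=0.0, x2=0.0, step=0.1, iters=200):
    """Each player descends its own cost; updates are simultaneous."""
    for _ in range(iters):
        g1, g2 = grad1(x1, x2), grad2(x1, x2)
        x1, x2 = x1 - step * g1, x2 - step * g2
    return x1, x2

x1, x2 = gradient_play()
```

Setting both partial derivatives to zero gives the equilibrium x1 = 4/3, x2 = -4/3, and the iteration converges to it because the game's pseudo-gradient is strongly monotone.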

3/5 Artificial Intelligence (cs.AI), astro-ph.IM, physics.soc-ph 22 Jun 2026

AI Scientists as Engines of Discovery: A Case for Development within Reformed Institutions

Raul Jimenez, Boris Bolliet, Francisco Villaescusa-Navarro, Rabih Zbib, Benjamin Wandelt, David N. Spergel, Thomas Meier, Jessica Montgomery, Hana Aliee, Licia Verde

ArXiv ID: 2606.22859

Abstract:

Agentic artificial intelligence (AI) systems are beginning to assist, accelerate, and partially automate scientific discovery, performing tasks that span literature synthesis, code generation, data analysis, hypothesis proposal, and model criticism. We argue that this transition is qualitative rather than incremental, and that suitably designed multi-agent systems may evolve from passive computational tools into "AI scientists" that can expand the hypothesis-generating and verification capacity of science. Such systems must be developed and deployed within a scientific ecosystem fit for purpose: institutions must be redesigned for verification, accountability, interpretability, and dual-use safety. We sketch how multi-agent architectures, illustrated by the prototype framework Denario, accelerate the discovery cycle and traverse model spaces beyond human reach; examine what this implies for authorship, peer review, and the enduring role of human scientists; and close with recommendations for governing AI as an epistemic actor rather than a mere instrument.

Insights

Contribution: The paper argues for a paradigm shift from viewing AI as a passive tool to an 'AI scientist' capable of autonomous hypothesis generation and verification. It proposes a framework for integrating these agentic systems into reformed scientific institutions that prioritize accountability and safety.

Core Idea: Agentic multi-agent systems can qualitatively expand the capacity of scientific discovery by traversing model spaces beyond human reach. This transition requires a fundamental redesign of scientific infrastructure to manage AI as an epistemic actor.

Technique: The authors utilize multi-agent architectures, specifically the 'Denario' prototype framework, to automate and accelerate the scientific discovery cycle.

Pipeline: Scientific data and literature → Multi-agent system (hypothesis generation, code execution, data analysis, model criticism) → Accelerated discovery and expanded model spaces

Methodology: The paper combines conceptual framework development with a technical sketch of the Denario prototype to illustrate how multi-agent systems perform complex scientific tasks.

Results: This is a position paper rather than an empirical study. It sketches how multi-agent architectures such as Denario can accelerate the discovery cycle and explore model spaces beyond human reach, and it examines the implications for authorship, peer review, and the role of human scientists.

Limitations: The paper leaves open questions regarding the specific governance of AI as an epistemic actor and the practicalities of ensuring dual-use safety in autonomous discovery.

3/5 Artificial Intelligence (cs.AI) 21 Jun 2026

PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement

Weiwei Ye, Hangchen Liu, Dongyuan Li, Renhe Jiang

ArXiv ID: 2606.22610

Abstract:

Large language models have become capable reasoners and tool users that write and run code and search the literature, which makes automating the research process itself a realistic goal. We present PAPERCLAW, a harnessed multi-agent system that carries a project autonomously, from a field of study to a finished paper. PAPERCLAW curates a domain from a field's live literature, datasets, and code; brainstorms it into an idea with a pre-registered main-result contract; and drives a stoppable hypothesis map through an iterative propose, test, reflect loop that grows only from measured verdicts and halts once the evidence supports the idea, at which point it writes a venue-compliant paper. A full-lifecycle memory keeps each stage in a single living record, so a long run can be paused, inspected, and resumed without losing context. At the centre is an in-cycle research assistant with research tools and skills: it can drive the whole pipeline on its own, while the same interface lets a person step in at any stage, turning a first autonomous draft into a stronger paper through human-in-the-loop refinement. Throughout, PAPERCLAW keeps its output grounded and checkable, citing only references validated against open scholarly indexes and reporting results that genuinely ran. An evaluation with an LLM judge finds that PAPERCLAW produces strong papers both fully autonomously and with human-in-the-loop refinement.

Insights

Contribution: The paper introduces PAPERCLAW, a multi-agent system capable of autonomously conducting research from initial domain curation to the production of a venue-compliant paper. It features a full-lifecycle memory system and a human-in-the-loop interface for collaborative refinement.

Core Idea: Automating the research lifecycle by using LLM agents to brainstorm, test hypotheses through iterative loops, and write papers while maintaining groundedness through validated citations and executed code.

Technique: A harnessed multi-agent architecture utilizing a 'propose, test, reflect' loop, a pre-registered main-result contract, and a persistent memory record to maintain context across long-running tasks.

Pipeline: Field of study → Literature/Dataset/Code curation → Idea brainstorming & Contract registration → Iterative hypothesis testing → Paper writing → Final venue-compliant manuscript

Methodology: The system uses an in-cycle research assistant to drive a stoppable hypothesis map, where each step is validated against open scholarly indexes and actual code execution results.

Results: Evaluation using an LLM judge demonstrated that PAPERCLAW produces high-quality papers in both fully autonomous modes and through human-in-the-loop refinement.

Limitations: The study relies on an LLM judge for evaluation and the system's success is dependent on the availability of open scholarly indexes and executable code environments.
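The "propose, test, reflect" loop with a stopping rule can be sketched in a few lines. This is a toy illustration, not PAPERCLAW's implementation: the candidate hypotheses, the lookup table standing in for real experiments, and the support threshold are all hypothetical.

```python
CANDIDATES = ["h1", "h2", "h3", "h4"]
# Stand-in for measured experiment outcomes (e.g. effect sizes).
MEASURED = {"h1": 0.01, "h2": 0.12, "h3": 0.30, "h4": 0.25}

def propose(hypothesis_map):
    """Return the next untested hypothesis, or None when none remain."""
    for h in CANDIDATES:
        if h not in hypothesis_map:
            return h
    return None

def run_experiment(h):
    """Placeholder for actually running code and measuring a result."""
    return MEASURED[h]

def research_loop(threshold=0.1, needed=2, max_steps=10):
    """Grow the map only from measured verdicts; stop once enough
    hypotheses are supported."""
    hypothesis_map = {}
    for _ in range(max_steps):
        h = propose(hypothesis_map)
        if h is None:
            break
        effect = run_experiment(h)
        hypothesis_map[h] = {"effect": effect, "supported": effect >= threshold}
        # Reflect: check whether the evidence so far is sufficient.
        supported = [k for k, v in hypothesis_map.items() if v["supported"]]
        if len(supported) >= needed:
            return hypothesis_map, True
    return hypothesis_map, False

hmap, done = research_loop()
```

The loop halts after `h3` without testing `h4`, mirroring the idea that the run stops once the evidence supports the main claim.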

3/5 Artificial Intelligence (cs.AI), Software Engineering (cs.SE) 20 Jun 2026

Holmes: Multimodal Agentic Diagnosis for Mixed-Language Mobile Crashes at Industrial Scale

Jia Li, Wenyuan Ma, Ting Peng, Haibin Zheng, Yuetang Deng

ArXiv ID: 2606.21963

Abstract:

Diagnosing mobile crashes in ultra-large-scale industrial applications is a formidable challenge due to the sheer volume of code, the complexity of mixed-language environments, and the inability to reproduce failures locally. Traditional static analysis struggles with scalability, while existing LLM-based agents often rely on reproducible environments unavailable in post-mortem scenarios. We present Holmes, a multi-agent system that automates root cause analysis by synthesizing multimodal runtime signals--stack traces, logs, and thread states--to reconstruct failure contexts without reproduction. Holmes introduces a hierarchical Retrieve-Explore-Reason architecture that leverages low-level artifacts (e.g., registers, assembly) to bridge the semantic gap between open-source business logic and closed-source system frameworks. By dynamically compressing the search space using runtime clues, Holmes precisely navigates 70-million-line codebases to identify non-local defects. Evaluated on real-world crashes from WeChat, Holmes achieves 87.6% accuracy in function-level fault localization and reduces average investigation time by over 98% (to ~77 seconds), demonstrating its effectiveness in transforming labor-intensive debugging into an efficient verification workflow.

Insights

Contribution: The paper introduces Holmes, a multi-agent system designed to automate root cause analysis for mobile crashes in ultra-large-scale, mixed-language industrial environments without requiring local reproduction.

Core Idea: Holmes reconstructs failure contexts by synthesizing multimodal runtime signals and navigating massive codebases using a hierarchical architecture that bridges the gap between business logic and system frameworks.

Technique: The system employs a hierarchical Retrieve-Explore-Reason architecture that utilizes low-level artifacts like registers and assembly to dynamically compress the search space.

Pipeline: Multimodal runtime signals (stack traces, logs, thread states) → Retrieve-Explore-Reason multi-agent synthesis → Function-level fault localization

Methodology: The authors developed a multi-agent framework that processes real-world crash data from a 70-million-line codebase, using runtime clues to navigate complex dependencies and identify non-local defects.

Results: Achieved 87.6% accuracy in function-level fault localization and reduced average investigation time by over 98% (to approximately 77 seconds).

Limitations: The paper focuses on post-mortem analysis where reproduction is impossible, potentially leaving open questions regarding its performance in scenarios where dynamic reproduction is feasible.
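
The two headline numbers can be combined into a rough sanity check. The sketch below back-computes the manual investigation time implied by the abstract's figures. That baseline is not reported in this digest, so the result is only a lower-bound estimate under the assumption that the reduction is measured relative to the manual average.

```python
# Figures quoted in the abstract.
reduced_seconds = 77.0   # approximate average investigation time with Holmes
min_reduction = 0.98     # "over 98%" reduction

# If t_after = t_before * (1 - r), then t_before = t_after / (1 - r).
implied_baseline_seconds = reduced_seconds / (1.0 - min_reduction)
implied_baseline_minutes = implied_baseline_seconds / 60.0

print(f"Implied manual baseline: at least ~{implied_baseline_minutes:.0f} minutes")
```

Because the reduction is stated as "over" 98%, the implied manual baseline is at least about 64 minutes per crash.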

Vision-Language Models

3/5 Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Graphics (cs.GR), Machine Learning (cs.LG) 22 Jun 2026

Semantic Browsing: Controllable Diversity for Image Generation

Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik, Daniel Cohen-Or

Abstract

ArXiv ID: 2606.23679

Abstract:

Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.

Insights

Contribution: The paper introduces 'Semantic Browsing,' a method for controllable diversity in image generation that allows users to navigate structured galleries based on meaningful, interpretable axes of variation.

Core Idea: Instead of relying on stochastic noise for diversity, the authors induce variation directly at the text level by decoupling semantic decision-making from pixel generation.

Technique: The authors employ an agentic workflow using a Vision Language Model (VLM) to manipulate rich textual representations and enforce structured variations attuned to the original prompt.

Pipeline: User Prompt → VLM Agentic Workflow (Semantic Variation) → Elaborated Textual Descriptions → Text-to-Image Model → Structured Image Gallery

Methodology: The method leverages the fact that modern models are trained on elaborated captions, using a VLM to systematically traverse scene contexts and generate diverse but semantically linked prompts.

Results: The authors report navigable design spaces in which every variation corresponds to a specific, user-understandable semantic decision, countering the tendency of text-to-image samples to collapse into a single visual interpretation.

Limitations: The approach relies on the quality of the VLM's scene understanding and the ability of the text-to-image model to faithfully translate complex, elaborated captions into pixels.
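
The idea of inducing diversity at the text level can be illustrated with a toy gallery builder. This is a minimal sketch, not the authors' workflow: in the paper a VLM-driven agentic workflow proposes the axes, whereas here the axes, their values, and the function name `build_prompt_gallery` are hard-coded, hypothetical placeholders.

```python
from itertools import product

def build_prompt_gallery(base_prompt, axes):
    """Return one elaborated prompt per combination of axis values."""
    names = list(axes)
    gallery = {}
    for combo in product(*(axes[name] for name in names)):
        details = ", ".join(f"{name}: {value}" for name, value in zip(names, combo))
        gallery[combo] = f"{base_prompt}, {details}"
    return gallery

# Hypothetical axes of variation; a real system would derive these from the prompt.
axes = {"era": ["medieval", "futuristic"], "time of day": ["dawn", "night"]}
gallery = build_prompt_gallery("a castle on a hill", axes)

for key, prompt in gallery.items():
    print(key, "->", prompt)
```

Each gallery cell is addressed by its axis values, so moving along one axis changes exactly one semantic decision while the others stay fixed.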

3/5 Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Machine Learning (cs.LG) 22 Jun 2026

Data Selection Through Iterative Self-Filtering for Vision-Language Settings

Andrei Liviu Nicolicioiu, Sarvjeet Singh Ghotra, Morgane M. Moss, Aaron Courville

Abstract

ArXiv ID: 2606.23611

Abstract:

The availability of large amounts of clean data is paramount to training neural networks. However, at large scales, manual oversight is impractical, resulting in sizeable datasets that can be very noisy. Attempts to mitigate this obstacle to producing performant vision-language models have so far involved heuristics, curated reference datasets, and using pre-trained models. Here we propose a novel, bootstrapped method in which a CLIP model is trained on an evolving, self-selected dataset. This evolving dataset constitutes a balance of filtered, highly probable clean samples as well as diverse samples from the entire distribution. Our proposed Self-Filtering method iterates between training the model and selecting a subsequently improved data mixture. Training on vision-language datasets filtered by the proposed approach improves downstream performance without the need for additional data or pre-trained models.

Insights

Contribution: The paper introduces a bootstrapped self-filtering method for vision-language data selection that improves model performance without requiring manual oversight, curated reference sets, or external pre-trained models.

Core Idea: The authors propose an iterative process where a model trains on a dataset it helps curate, balancing high-probability clean samples with diverse samples from the broader distribution.

Technique: The method employs an iterative self-filtering loop where a CLIP model is trained on an evolving mixture of data, refining the selection criteria in each cycle.

Pipeline: Raw noisy vision-language dataset → Iterative CLIP training and self-filtering → Balanced mixture of clean and diverse samples → Improved downstream model performance

Methodology: The researchers developed a bootstrapping framework that alternates between model training and data selection, ensuring the training set evolves to include both high-quality filtered data and representative distribution diversity.

Results: The proposed Self-Filtering method improves downstream performance on vision-language tasks without the need for additional data or pre-trained models.

Limitations: The paper does not explicitly detail the potential for model collapse or the specific risks of over-filtering diversity during the iterative bootstrapping process.
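
The train-then-select loop can be sketched on synthetic data. This is an illustration under strong assumptions, not the paper's method: a least-squares map stands in for CLIP training, cosine similarity stands in for the selection score, and the 50% top fraction and 10% random "diverse" draw are hypothetical settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 8
true_map = rng.normal(size=(d, d))
images = rng.normal(size=(n, d))
texts = images @ true_map                    # clean pairs follow a shared mapping
is_noisy = rng.random(n) < 0.4
# Replace the captions of noisy pairs with unrelated embeddings.
texts[is_noisy] = rng.normal(size=(is_noisy.sum(), d)) * texts.std()

def train(idx):
    """Toy stand-in for CLIP training: least-squares map from image to text space."""
    W, *_ = np.linalg.lstsq(images[idx], texts[idx], rcond=None)
    return W

def score(W):
    """Cosine similarity between predicted and actual text embedding, per pair."""
    pred = images @ W
    num = (pred * texts).sum(axis=1)
    return num / (np.linalg.norm(pred, axis=1) * np.linalg.norm(texts, axis=1) + 1e-12)

selected = np.arange(n)                      # round 0: train on everything
for _ in range(3):
    W = train(selected)
    top = np.argsort(-score(W))[: n // 2]    # highest-scoring, likely-clean pairs
    diverse = rng.choice(n, size=n // 10, replace=False)  # random draw from the full pool
    selected = np.union1d(top, diverse)

noise_before = float(is_noisy.mean())
noise_after = float(is_noisy[selected].mean())
print(f"noisy fraction: {noise_before:.2f} -> {noise_after:.2f}")
```

The random draw keeps some samples from the full distribution in every round, which is the sketch's analogue of balancing clean samples against diversity.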

3/5 Computer Vision and Pattern Recognition (cs.CV) 22 Jun 2026

Brain-Adapter: A Dual-Stream Vision-Language MIL Framework for Comprehensive 3D CT Diagnosis of Acute Intracranial Pathologies

Zhenyu Yi, Zhiyun Song, Yusong Sun, Zelin Liu, Manman Fei, Zhenhao Li, Jiaxuan Zhao, Xu Han, Lichi Zhang

Abstract

ArXiv ID: 2606.23494

Abstract:

Automated diagnosis of 3D brain CT scans is essential for critical care, yet it remains challenging due to the heavy reliance on manual annotations and the limited semantic understanding of conventional models. While 2D foundation vision-language models (VLMs) have shown remarkable generalization, effectively transferring their representational power to 3D volumes remains an open problem. In this paper, we propose Brain-Adapter, a novel dual-stream multiple instance learning (MIL) framework that leverages pre-trained 2D biomedical VLMs and raw diagnostic reports for robust scan-level multi-label classification. Specifically, we introduce a Text-Conditioned Attention (TCA) mechanism, utilizing raw diagnostic sentences as semantic queries to dynamically align visual cues with specific disease concepts. Concurrently, a parallel visual MIL stream captures global scan characteristics, supervised by structured labels extracted via a Large Language Model (LLM). To ensure representation coherence, a consistency constraint enforces synergy between the two streams. During inference, an Uncertainty-Aware Refinement (UAR) module dynamically calibrates and fuses these dual-stream predictions to resolve ambiguous cases. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art 3D models and standard MIL approaches. By eliminating the reliance on dense annotations, Brain-Adapter provides a highly scalable and clinically viable solution for 3D acute intracranial pathology analysis.

Insights

Contribution: The paper introduces Brain-Adapter, a dual-stream Multiple Instance Learning (MIL) framework that enables robust 3D brain CT multi-label classification by leveraging 2D vision-language models and raw diagnostic reports.

Core Idea: The framework bridges the gap between 2D foundation models and 3D medical volumes by using a dual-stream approach that aligns visual cues with semantic text queries while maintaining global scan consistency.

Technique: The primary techniques include a Text-Conditioned Attention (TCA) mechanism for semantic alignment, a consistency constraint for stream synergy, and an Uncertainty-Aware Refinement (UAR) module for final prediction fusion.

Pipeline: 3D Brain CT Scans & Raw Diagnostic Reports → Dual-Stream MIL (TCA-based Semantic Stream + Global Visual Stream) → Consistency Constraint & Uncertainty-Aware Refinement → Multi-label Pathology Diagnosis

Methodology: The model uses a Text-Conditioned Attention mechanism to query visual features using raw sentences, a parallel MIL stream supervised by LLM-extracted labels, and a consistency loss to ensure coherent representations.

Results: Brain-Adapter significantly outperforms state-of-the-art 3D models and standard MIL approaches, providing a scalable solution that eliminates the need for dense manual annotations.

Limitations: The study does not explicitly detail the computational overhead of the dual-stream architecture or the potential sensitivity of the UAR module to specific types of noise in raw diagnostic reports.
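
A text query attending over per-slice features can be sketched as follows. The digest does not give the exact TCA formulation, so scaled dot-product attention, the feature shapes, and the function name `text_conditioned_pool` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def text_conditioned_pool(slice_feats, text_query):
    """Attention-pool per-slice features using a sentence embedding as the query.

    slice_feats: (num_slices, dim) features from a 2D encoder, one row per CT slice.
    text_query:  (dim,) embedding of one diagnostic sentence.
    """
    scale = np.sqrt(slice_feats.shape[1])
    weights = softmax(slice_feats @ text_query / scale)   # one weight per slice
    return weights @ slice_feats, weights

rng = np.random.default_rng(0)
slice_feats = rng.normal(size=(32, 16)) * 0.1
text_query = rng.normal(size=16)
slice_feats[20] = 3.0 * text_query          # make slice 20 strongly match the sentence
pooled, weights = text_conditioned_pool(slice_feats, text_query)

print("most attended slice:", int(weights.argmax()))
```

In multiple instance learning terms, the slices are the instances and the attention weights decide which instances contribute to the scan-level representation for a given disease concept.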

3/5 Artificial Intelligence (cs.AI) 22 Jun 2026

CADRE: Stable, Parameter Efficient Adaptation of Medical Vision Language Models with Bounded Forgetting and Prior Drift

Amrita Singh, Rishabh Jha

Abstract

ArXiv ID: 2606.23487

Abstract:

Medical vision-language models (VLMs) such as BiomedCLIP generalize broadly, but adapting them to a clinical service is as much a safety problem as an accuracy one. Updating a deployed model for a new imaging modality can fail silently in two ways that harm patients: it can forget modalities it already handled (catastrophic forgetting), and it can drift from its trustworthy pretrained prior toward modality-specific shortcuts. We study parameter-efficient continual adaptation through these two properties rather than leaderboard accuracy, presenting CADRE: a frozen-backbone framework combining low-rank adaptation (LoRA) with an online, self-scaling, similarity-aware elastic weight consolidation term that bounds retained-competence loss, and an anchor-to-prior penalty bounding embedding drift from the frozen prior. Two short guarantees, a bound on total consolidation mass and a scale-invariance property, remove the scale-related sources of vanilla EWC's order fragility. Using breast cancer across three maximally dissimilar modalities (histopathology, ultrasound, chest radiography) as a controlled cross-modality stress test, under a multi-seed, multi-order protocol with paired significance testing and training approximately 0.23% of parameters, CADRE attains the highest accuracy, SPQ, and backward transfer and the lowest forgetting among adapting methods, reducing forgetting roughly sevenfold versus the strongest regularized baseline (0.075 to 0.011; paired p=0.023) and achieving positive backward transfer where every baseline is negative. We frame these as stability properties aligned with clinical-safety desiderata, not a deployment guarantee; robustness to distribution shift and adversarial inputs is out of scope.

Insights

Contribution: The paper introduces CADRE, a parameter-efficient framework for adapting medical vision-language models that simultaneously mitigates catastrophic forgetting and prior drift to ensure clinical safety.

Core Idea: Instead of optimizing solely for leaderboard accuracy, the authors prioritize stability by bounding retained-competence loss and preventing the model from drifting toward modality-specific shortcuts.

Technique: CADRE combines Low-Rank Adaptation (LoRA) with a self-scaling, similarity-aware Elastic Weight Consolidation (EWC) term and an anchor-to-prior penalty.

Pipeline: Medical images and text → Frozen backbone with LoRA layers → Similarity-aware EWC and anchor-to-prior regularization → Stable, multi-modality adapted VLM

Methodology: The authors conducted a cross-modality stress test using breast cancer data across histopathology, ultrasound, and chest radiography, evaluating performance using multi-seed, multi-order protocols.

Results: Among adapting methods, CADRE achieved the highest accuracy, SPQ, and backward transfer and the lowest forgetting, reducing forgetting roughly sevenfold (0.075 to 0.011; paired p=0.023) compared to the strongest regularized baseline and achieving positive backward transfer where all baselines were negative.

Limitations: The framework does not provide deployment guarantees and does not address robustness to distribution shifts or adversarial inputs.
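
The way the two regularizers combine with a task loss can be sketched as below. This is not CADRE's actual formulation: normalizing the Fisher weights to unit total mass is one hypothetical way to obtain a bounded, scale-invariant consolidation term, and the coefficients `lam` and `mu` are placeholders.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher):
    """Quadratic consolidation penalty with Fisher weights normalized to sum to one (assumption)."""
    w = fisher / (fisher.sum() + 1e-12)
    return float((w * (theta - theta_star) ** 2).sum())

def anchor_penalty(emb, emb_prior):
    """Mean squared distance between adapted embeddings and frozen-prior embeddings."""
    return float(((emb - emb_prior) ** 2).sum(axis=1).mean())

def total_loss(task_loss, theta, theta_star, fisher, emb, emb_prior, lam=1.0, mu=1.0):
    return task_loss + lam * ewc_penalty(theta, theta_star, fisher) + mu * anchor_penalty(emb, emb_prior)

rng = np.random.default_rng(0)
theta_star = rng.normal(size=10)                  # adapter weights after the previous modality
theta = theta_star + 0.1 * rng.normal(size=10)    # adapter weights during the new modality
fisher = rng.random(10)

p_small = ewc_penalty(theta, theta_star, fisher)
p_large = ewc_penalty(theta, theta_star, 1000.0 * fisher)   # rescaled Fisher estimate
forgetting_ratio = 0.075 / 0.011                  # figures quoted in the abstract

print(f"penalty unchanged under rescaling: {np.isclose(p_small, p_large)}")
print(f"forgetting reduction: {forgetting_ratio:.1f}x")
```

With normalized weights the penalty is a weighted average of squared deviations, so it cannot exceed the largest squared deviation, and rescaling the Fisher estimate leaves it unchanged. The last line reproduces the "roughly sevenfold" figure from the reported numbers (about 6.8x).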

3/5 Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL) 22 Jun 2026

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

Zhangyuan Yu, Wanran Sun, Guangjing Yang, Xiaohu Wu, Qicheng Lao

Abstract

ArXiv ID: 2606.23206

Abstract:

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or exhibiting hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model's predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO). Code is available at https://github.com/Raven-July/CFPO.

Insights

Contribution: The paper introduces CFPO, a novel reinforcement learning framework that enforces causal consistency between visual perception and textual reasoning in Large Vision-Language Models (LVLMs).

Core Idea: The core idea is to mitigate grounding failures and hallucinations by regularizing the model to distinguish between actual visual evidence and language priors through counterfactual reasoning.

Technique: The technique involves a cross-modal counterfactual enhancement mechanism that maximizes the prediction discrepancy between the original state and a state where critical visual cues are suppressed.

Pipeline: Multimodal Input (Image + Text) → Counterfactual State Generation (Visual Cue Suppression) → Policy Optimization (GRPO/DAPO with CFPO Regularization) → Causal-Consistent Reasoning Output

Methodology: CFPO integrates with standard RL algorithms by penalizing the model if its reasoning remains unchanged when key visual information is removed, thereby forcing the model to rely on actual visual evidence.

Results: CFPO achieved consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO).

Limitations: The paper does not explicitly detail the computational overhead of generating counterfactual states or the specific criteria for identifying which visual cues are 'critical' for suppression.
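
The counterfactual regularizer can be sketched as a discrepancy bonus subtracted from a policy loss. The digest does not specify the discrepancy measure, so KL divergence, the weight `beta`, and the toy logits below are assumptions; the released CFPO code may differ.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def regularized_loss(policy_loss, logits_full, logits_masked, beta=0.1):
    """Subtract a discrepancy bonus: a larger gap between the original and the
    visually suppressed predictions lowers the loss."""
    gap = kl_divergence(softmax(logits_full), softmax(logits_masked))
    return policy_loss - beta * gap, gap

logits_full = np.array([4.0, 0.5, 0.2])        # prediction with the full image
grounded_masked = np.array([0.3, 0.4, 0.2])    # prediction changes once visual cues are suppressed
ungrounded_masked = logits_full.copy()         # prediction ignores the image entirely

loss_grounded, gap_grounded = regularized_loss(1.0, logits_full, grounded_masked)
loss_ungrounded, gap_ungrounded = regularized_loss(1.0, logits_full, ungrounded_masked)

print(f"grounded: gap={gap_grounded:.3f}, loss={loss_grounded:.3f}")
print(f"ungrounded: gap={gap_ungrounded:.3f}, loss={loss_ungrounded:.3f}")
```

A model whose answer is unchanged after the visual cues are removed gets no bonus, which is the sense in which the regularizer penalizes reliance on language priors.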

3/5 Robotics (cs.RO) 22 Jun 2026

Bridging Semantics and Kinematics: A Modular Framework for Zero-Shot Robotic Manipulation

Ali Alabbas, Dipshikha Das, Camillo Murgia, Sainul Ansary, Alaa Elkamash, Philip Long

Abstract

ArXiv ID: 2606.23157

Abstract:

This paper presents a modular training-free framework for zero-shot, language-guided robotic manipulation in semi-structured environments. The architecture bridges the gap between high-level reasoning and low-level kinematics by decomposing the vision-action pipeline into three stages: visual perception, semantic interpretation, and task execution. To overcome the spatial ambiguity and semantic hallucinations inherent in standard Vision-Language Models (VLMs), the perception module employs FastSAM and Set-of-Mark (SoM) prompting to dynamically generate grounded, alphanumeric visual anchors. The same foundation model then operates purely as a Large Language Model (LLM) to act as a semantic router, translating unconstrained human directives into verifiable, reconfigurable configurations. Finally, these configurations are dynamically parsed by a Task Orchestrator into MoveIt Task Constructor (MTC) to generate collision-free trajectories. The framework is evaluated across two zero-shot experimental setups: unconstrained open-world sequential manipulation and dense relational spatial reasoning, achieving a 62% end-to-end task success rate across both scenarios, demonstrating its capacity to reliably execute complex physical actions without domain-specific training or manual coordinate programming.

Insights

Contribution: The paper introduces a modular, training-free framework that enables zero-shot, language-guided robotic manipulation by bridging high-level semantic reasoning with low-level kinematic execution.

Core Idea: The framework decomposes the vision-action pipeline into three distinct stages (visual perception, semantic interpretation, and task execution) to reduce the spatial ambiguity and semantic hallucinations common in standard Vision-Language Models.

Technique: It utilizes FastSAM and Set-of-Mark (SoM) prompting for grounded visual anchoring, an LLM as a semantic router for configuration generation, and MoveIt Task Constructor (MTC) for trajectory planning.

Pipeline: Unconstrained human directives → Visual perception (FastSAM/SoM) & Semantic routing (LLM) → Task Orchestration (MTC) → Collision-free trajectories

Methodology: The authors evaluated the framework in two zero-shot scenarios: open-world sequential manipulation and dense relational spatial reasoning, testing the system's ability to generalize without domain-specific training.

Results: The framework achieved a 62% end-to-end task success rate across both zero-shot experimental setups.

Limitations: The framework operates in semi-structured environments and may face challenges in highly dynamic or unstructured settings where visual anchors are difficult to maintain.
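
One way to read "verifiable configurations" is that the semantic router's output is checked against the detected visual anchors before any motion planning. The sketch below illustrates that check only; the JSON schema, the action vocabulary, and the anchor labels are hypothetical and not taken from the paper.

```python
import json

ALLOWED_ACTIONS = {"pick", "place"}   # hypothetical action vocabulary

def validate_plan(raw_json, anchors):
    """Return the parsed steps if every step uses a known action and a detected anchor."""
    steps = json.loads(raw_json)
    for step in steps:
        if step.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {step.get('action')!r}")
        if step.get("target") not in anchors:
            raise ValueError(f"unknown anchor: {step.get('target')!r}")
    return steps

# Anchor labels as a Set-of-Mark style perception stage might emit them (made up here).
anchors = {"A1", "A2", "B1"}
llm_output = '[{"action": "pick", "target": "A1"}, {"action": "place", "target": "B1"}]'
steps = validate_plan(llm_output, anchors)

print(f"{len(steps)} validated steps ready for the task orchestrator")
```

Rejecting a plan that references an anchor the perception stage never produced is a cheap guard against hallucinated targets reaching the planner.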

3/5 Computer Vision and Pattern Recognition (cs.CV) 22 Jun 2026

T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models

Jaehyuk Jang, Minseok Seo, Seungju Cho, Kangwook Ko, Changick Kim

Abstract

ArXiv ID: 2606.23132

Abstract:

Vision-language models (VLMs) achieve strong zero-shot recognition, but they remain highly vulnerable to adversarial perturbations. Recent test-time adaptations improve robustness without retraining, but they do not directly adapt the corrupted visual representation itself. Prompt-based methods adapt the learnable text prompts, while input-space methods optimize pixels or padding at test time. These approaches can improve predictions, but they do so through an indirect and expensive optimization path. We propose Test-time Visual Subspace Steering (T-VSS), a lightweight defense that performs test-time adaptation directly in the visual feature space. T-VSS first builds a sample-specific low-rank subspace from multi-view feature residuals anchored at the attacked image. It then learns a shared feature correction within this subspace using reliability-weighted entropy minimization. By constraining adaptation to a compact visual geometry, T-VSS steers attacked features toward more stable and discriminative predictions while avoiding noisy full-space updates. Experiments on fine-grained, ImageNet, and ImageNet-OOD benchmarks show that T-VSS improves adversarial robustness while maintaining competitive clean accuracy and better efficiency than prior test-time adaptations.

Insights

Contribution: The paper introduces T-VSS, a lightweight test-time defense that improves the adversarial robustness of Vision-Language Models (VLMs) by directly steering visual features in a low-rank subspace.

Core Idea: Instead of optimizing pixels or text prompts, the method performs direct feature-space adaptation by constraining updates to a sample-specific low-rank subspace to avoid noisy full-space updates.

Technique: T-VSS utilizes multi-view feature residuals to construct a low-rank subspace and applies reliability-weighted entropy minimization to learn a shared feature correction.

Pipeline: Attacked Image → Multi-view Feature Extraction → Low-rank Subspace Construction → Reliability-weighted Entropy Minimization → Corrected Visual Features → Robust Prediction

Methodology: The approach builds a sample-specific subspace from feature residuals anchored at the attacked image and optimizes a correction vector within this compact geometry to steer features toward stable predictions.

Results: T-VSS improves adversarial robustness across fine-grained, ImageNet, and ImageNet-OOD benchmarks while maintaining competitive clean accuracy and superior efficiency compared to prior test-time adaptations.

Limitations: The paper does not explicitly detail the computational overhead of multi-view feature extraction or the scalability of subspace construction for extremely high-dimensional feature spaces.
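
The two steps (build a low-rank subspace from view residuals, then minimize entropy inside it) can be sketched with NumPy. This is a simplified illustration rather than the authors' algorithm: it uses an SVD for the subspace, finite-difference gradients with step halving, and it omits the reliability weighting across views. All shapes and hyperparameters are assumptions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

def build_subspace(anchor_feat, view_feats, rank):
    """Top right-singular vectors of the view-minus-anchor residuals, shape (rank, dim)."""
    residuals = view_feats - anchor_feat
    _, _, vt = np.linalg.svd(residuals, full_matrices=False)
    return vt[:rank]

def steer(anchor_feat, basis, class_protos, steps=50, lr=0.5, h=1e-4):
    """Minimize prediction entropy over coefficients in the low-rank basis."""
    def loss(c):
        return entropy(softmax(class_protos @ (anchor_feat + c @ basis)))

    coef = np.zeros(basis.shape[0])
    start = loss(coef)
    for _ in range(steps):
        # Central finite differences keep the sketch dependency-free.
        grad = np.array([(loss(coef + h * e) - loss(coef - h * e)) / (2 * h)
                         for e in np.eye(len(coef))])
        candidate = coef - lr * grad
        if loss(candidate) < loss(coef):
            coef = candidate        # accept only steps that reduce entropy
        else:
            lr *= 0.5               # otherwise shrink the step size
    return anchor_feat + coef @ basis, start, loss(coef)

rng = np.random.default_rng(0)
dim, num_classes = 16, 5
class_protos = rng.normal(size=(num_classes, dim))            # stand-in for text class embeddings
anchor_feat = rng.normal(size=dim) * 0.1                      # feature of the attacked image
view_feats = anchor_feat + 0.2 * rng.normal(size=(8, dim))    # features of augmented views
basis = build_subspace(anchor_feat, view_feats, rank=3)
steered, entropy_before, entropy_after = steer(anchor_feat, basis, class_protos)

print(f"entropy: {entropy_before:.3f} -> {entropy_after:.3f}")
```

Only three coefficients are optimized here instead of all 16 feature dimensions, which is the sketch's version of avoiding noisy full-space updates.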

3/5 Machine Learning (cs.LG) 22 Jun 2026

FlowTrain: Flow-Based Decoupled Training for Industrial-Grade Vision-Language Models

Zhida Jiang, Zhaolong Xing, Yang Pei, Xiaolong Chen, Yuanhang Xiao, Chengzhi Huang, Xiyu Liu, Haopeng Liu, Qingyuan Sang, Lingfeng Zhou, Jiaxing Wang, Zicheng Zhang, Wenzhe Wang, Xinyu Liu, Yan Li, Zhen Chen, Ke Zhang

Abstract

ArXiv ID: 2606.23087

Abstract:

Industrial-grade distributed training of vision-language models (VLMs) remains far less efficient than that of unimodal LLMs. Existing solutions either follow a monolithic design that assigns uniform parallelism to heterogeneous modules or adopt a disaggregated deployment that separates modules while executing them as a batch-synchronized pipeline. In this paper, we highlight that the above solutions are still not sufficient, and VLM training can be further decoupled. To this end, we present FlowTrain, a flow-based decoupled training framework that reformulates VLM training as a producer-consumer dataflow coordinated through a unified memory pool. The encoder and backbone can progress independently over a global virtual address space. Since this execution decoupling fundamentally changes the optimization objective of allocation and scheduling, FlowTrain further introduces a heterogeneous parallel allocator that assigns module-specific parallelism strategies by solving a throughput matching problem. The dynamic packing scheduler is used to construct balanced microbatches at runtime according to the actual LLM-side computation cost. Extensive experiments on real-world workloads show that FlowTrain achieves over 50% MFU and up to 1.7x throughput improvement, narrowing the efficiency gap to LLM-only training.

Insights

Contribution: The paper introduces FlowTrain, a flow-based decoupled training framework that significantly improves the training efficiency of vision-language models (VLMs) by decoupling the execution of heterogeneous modules.

Core Idea: VLM training is reformulated as a producer-consumer dataflow where the encoder and backbone progress independently over a unified memory pool rather than as a batch-synchronized pipeline.

Technique: The framework employs a heterogeneous parallel allocator to solve a throughput matching problem and a dynamic packing scheduler to construct balanced microbatches based on real-time computation costs.

Pipeline: Raw data → Unified memory pool → Independent encoder/backbone execution → Throughput-matched parallel allocation → Dynamic microbatch packing → Optimized VLM training

Methodology: FlowTrain replaces monolithic parallelism with a decoupled dataflow architecture, utilizing a global virtual address space and a dynamic scheduling mechanism to handle the varying computational demands of different VLM components.

Results: Achieved over 50% Model FLOPs Utilization (MFU) and up to a 1.7x throughput improvement on real-world workloads, narrowing the efficiency gap between VLM and LLM training.

Limitations: The paper does not explicitly detail the overhead of managing the unified memory pool or the scalability limits of the dynamic packing scheduler under extreme hardware heterogeneity.
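
Balancing microbatches by computation cost can be sketched with a greedy longest-first heuristic. The digest does not describe FlowTrain's scheduler in detail, so both the heuristic and the per-sample costs below (for example, token counts) are assumptions for illustration.

```python
import heapq

def pack_microbatches(costs, num_batches):
    """Greedy longest-first packing: assign each sample to the currently lightest batch."""
    heap = [(0, b) for b in range(num_batches)]   # (current load, batch index)
    heapq.heapify(heap)
    batches = [[] for _ in range(num_batches)]
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, b = heapq.heappop(heap)
        batches[b].append(idx)
        heapq.heappush(heap, (load + costs[idx], b))
    loads = [sum(costs[i] for i in batch) for batch in batches]
    return batches, loads

# Hypothetical per-sample LLM-side costs.
costs = [900, 100, 400, 350, 50, 600, 300, 300]
batches, loads = pack_microbatches(costs, 3)

print("batches:", batches)
print("loads:", loads)
```

For these costs the three loads come out at 1000, 950, and 1050, so no batch waits long for a straggler. Packing samples in arrival order would give no such balance guarantee.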

3/5 Computer Vision and Pattern Recognition (cs.CV) 22 Jun 2026

Black-Box Continual Learning for Vision-Language Models

Yuting Li, Weihang Fang, Haoyuan Gao, Linghe Kong, Yexin Li, Lichao Sun, Weiran Huang

Abstract

ArXiv ID: 2606.22999

Abstract:

The rapid deployment of Vision-Language Models (VLMs) in dynamic environments necessitates the ability to learn continuously without forgetting. However, traditional continual learning (CL) settings often rely on white-box paradigms, which are increasingly invalidated by the shift toward cloud-hosted models. In this paper, we introduce Black-CL, a more realistic benchmark for VLMs that enforces three primary real-world challenges: weight and architecture inaccessibility, constrained computation, and task-agnostic inference. The learner can query only output embeddings or logits, with no gradient flow through or structural modification of the backbone. Current CL methodologies, which rely on backbone backpropagation or complex parameter expansion, are fundamentally incompatible with these constraints. Under this setting, we propose BETA, a simple yet effective baseline built on the key insight that solely optimizing textual prototypes can navigate the complexities of CL. BETA integrates three core components: Semantic Projection Accumulation (SPA) for incremental knowledge acquisition, Latent Distribution Replay (LDR) for anchoring the embedding space against catastrophic forgetting, and Test-Time Prototype Adaptation (TTPA) for dynamic, instance-aware boundary refinement. Extensive experiments across ten diverse datasets and various backbones demonstrate that BETA significantly outperforms existing black-box tuners. Remarkably, with only 0.05M trainable parameters, a 180-3000× reduction compared to competitive methods, BETA achieves performance on par with or even exceeding white-box CL methods. We believe Black-CL and BETA provide a foundational framework for future advancements in continual learning and accelerate the transition of continual learning from academia to real-world systems.

Insights

Contribution: The paper introduces Black-CL, a realistic black-box continual learning benchmark for Vision-Language Models (VLMs), and proposes BETA, a parameter-efficient baseline that performs on par with white-box methods under these constraints.

Core Idea: Continual learning for VLMs should be evaluated in a black-box setting where the backbone is inaccessible, focusing on optimizing textual prototypes rather than model weights.

Technique: BETA utilizes Semantic Projection Accumulation (SPA), Latent Distribution Replay (LDR), and Test-Time Prototype Adaptation (TTPA) to manage knowledge acquisition and prevent catastrophic forgetting.

Pipeline: Input (New Task Data) → Process (Embedding Extraction → SPA Knowledge Accumulation → LDR Space Anchoring → TTPA Boundary Refinement) → Output (Updated Textual Prototypes)

Methodology: The authors developed a framework that operates solely on output embeddings or logits, avoiding gradient flow through the backbone while using a minimal set of trainable parameters.

Results: BETA outperformed existing black-box tuners across ten datasets, achieving performance on par with white-box methods while using only 0.05M trainable parameters (a 180-3000x reduction).

Limitations: The study focuses on black-box constraints which may not account for scenarios where partial weight access or architectural modifications are permitted.
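
The black-box constraint (only output embeddings are visible, only prototypes are trained) can be sketched as follows. This toy omits BETA's SPA, LDR, and TTPA components entirely: it trains class prototypes with plain softmax regression on synthetic vectors standing in for frozen backbone outputs, and all sizes are made up.

```python
import numpy as np

def train_prototypes(feats, labels, protos, lr=0.5, epochs=100):
    """Softmax-regression updates on class prototypes only; the features stay fixed."""
    protos = protos.copy()
    onehot = np.eye(protos.shape[0])[labels]
    for _ in range(epochs):
        logits = feats @ protos.T
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot).T @ feats / len(feats)   # gradient w.r.t. the prototypes
        protos -= lr * grad
    return protos

rng = np.random.default_rng(0)
num_classes, dim, n = 3, 16, 300
centers = rng.normal(size=(num_classes, dim))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
labels = rng.integers(0, num_classes, size=n)
feats = centers[labels] + 0.1 * rng.normal(size=(n, dim))   # stand-in for queried embeddings
init_protos = 0.01 * rng.normal(size=(num_classes, dim))    # stand-in for text-derived prototypes

protos = train_prototypes(feats, labels, init_protos)
accuracy = float(((feats @ protos.T).argmax(axis=1) == labels).mean())

print(f"trainable parameters: {protos.size}, accuracy: {accuracy:.2f}")
```

No gradient ever reaches the model that produced `feats`, and the trainable state is just `num_classes * dim` numbers, which is why prototype-only tuning can stay so small.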

3/5 Computer Vision and Pattern Recognition (cs.CV), Computer Science and Game Theory (cs.GT) 22 Jun 2026

Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

Abstract

ArXiv ID: 2606.22918

Abstract:

Maintaining physical consistency in video generators and world models increasingly relies on vision-language models (VLMs) as automated judges that provide reward signals, ranking decisions, and data-filtering criteria. Yet VLMs differ substantially in training data and architecture, encoding physical phenomena through distinct internal representations. A single global evaluation schema therefore gives every VLM the same axes of competence, regardless of what each can actually perceive. We propose JudgeFit, an iterative refinement procedure that discovers a per-VLM evaluation taxonomy. An initial taxonomy is constructed by prompting the target VLM to enumerate physics errors on a small set of videos and clustering the resulting descriptions. The taxonomy is then refined through a diagnostic step: we calibrate the VLM's per-dimension scores to human physical-commonsense ratings, diagnose which dimensions it scores unreliably or redundantly, and prompt an LLM to repair them, iterating until convergence. We further instantiate this procedure as a benchmark and apply it to 16 VLMs spanning eight model families. The refined taxonomy outperforms the global-schema baseline on held-out videos for every VLM tested, with a mean relative improvement of approximately 32%. Beyond aggregate accuracy, the per-VLM profiles expose model-specific blind spots that overall rankings cannot anticipate, with reliability patterns differing markedly across model families.

Insights

Contribution: The paper introduces JudgeFit, a framework to discover and refine model-specific evaluation taxonomies for physical video consistency, moving beyond one-size-fits-all evaluation schemas.

Core Idea: Different Vision-Language Models (VLMs) possess unique internal representations of physics; therefore, evaluation criteria should be tailored to each model's specific perceptual capabilities.

Technique: An iterative refinement procedure that combines VLM-generated error descriptions, clustering, human-aligned calibration, and LLM-based taxonomy repair.

Pipeline: Small set of videos → VLM error enumeration → Clustering → Human-aligned calibration → LLM-based taxonomy repair → Iterative convergence → Per-VLM evaluation taxonomy

Methodology: The authors applied JudgeFit to 16 VLMs across eight families, comparing the performance of per-VLM taxonomies against a global-schema baseline using human physical-commonsense ratings.

Results: The per-VLM taxonomy outperformed the global-schema baseline on held-out videos for all 16 models, achieving a mean relative improvement of approximately 32%.

Limitations: The study focuses on physical consistency and does not explore how these taxonomies might adapt to other non-physical video attributes or real-time evaluation constraints.

Tech News

AI Safety

Hacker News Tue, 23 Jun 2026

OpenAI DayBreak – GPT-5.5-Cyber

OpenAI introduced 'DayBreak,' a specialized initiative focused on securing the digital landscape against cyber threats. The announcement highlights the development of GPT-5.5-Cyber, a model specifically optimized for cybersecurity tasks and defense.

Reddit r/MachineLearning 2026-06-22

Non-deterministic Vulnerability Detection Benchmark System [P]

A developer is seeking feedback on a benchmark system designed to test LLM vulnerability detection. The project uses 'hidden' Juliet code to remove known CWE patterns and employs LLM-generated comments to study how sentiment and plain English can manipulate an AI's ability to identify security flaws.

Agentic AI

Hacker News Tue, 23 Jun 2026

Ask HN: Anthropic banned me from using Claude Code and I don't know what to do

A user reported being banned from using Anthropic's Claude Code tool, sparking a discussion on platform policies and usage limits. The thread explores potential reasons for account restrictions and the challenges of using agentic coding tools. It highlights the friction between automated tool usage and provider safety guardrails.

NVIDIA Technical Blog 2026-06-23

How Telcos Build Autonomous Networks with Agentic AI

Telecom operators are transitioning from basic AI integration to autonomous network management using Agentic AI. The blog explores how these agents can automate complex network operations, customer care, and back-office workflows to improve efficiency. It highlights the shift toward self-healing and self-optimizing infrastructure.

Computer Vision

Hacker News Tue, 23 Jun 2026

An Introduction to YOLO26

This article introduces YOLO26, the latest iteration of the You Only Look Once (YOLO) object detection framework. It likely covers architectural improvements, performance benchmarks, and practical applications for real-time computer vision tasks.

Hacker News Tue, 23 Jun 2026

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Ultralytics has introduced YOLO26, a unified real-time end-to-end vision model designed for high-performance tasks. The model aims to streamline complex vision pipelines by integrating multiple capabilities into a single architecture. It represents a significant advancement in efficient, real-time object detection and spatial understanding.

Reddit r/MachineLearning 2026-06-23

Just landed a Computer Vision internship, here's the preparation list I used [D]

A user shared a comprehensive preparation checklist for landing Computer Vision internships, covering core math, ML fundamentals, and specialized CV topics. The resource is designed to be actionable and can be compressed into a 7-day study plan for job seekers.

Computing Systems

Hacker News Tue, 23 Jun 2026

The new HTTP QUERY method explained

This article explains the new HTTP QUERY method, a safe and idempotent request method that, unlike GET, carries the query in the request body. It covers the technical details and implementation of the method and how it fits into request handling in modern web architectures.
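As a rough sketch of what a QUERY exchange can look like, the example below starts a throwaway local server and sends it one QUERY request, using only Python's standard library. The handler class, the `/search` path, the JSON payload, and the helper name `run_query` are illustrative assumptions and are not taken from the article.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class QueryHandler(BaseHTTPRequestHandler):
    """Toy handler: http.server dispatches a QUERY request to do_QUERY."""

    def do_QUERY(self):
        # Unlike GET, the query travels in the request body.
        length = int(self.headers.get("Content-Length", 0))
        query = json.loads(self.rfile.read(length))
        body = json.dumps({"received": query}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


def run_query(payload):
    """Send one QUERY request to a local server; return (status, parsed body)."""
    server = HTTPServer(("127.0.0.1", 0), QueryHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request(
            "QUERY",
            "/search",  # hypothetical path
            body=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        result = (response.status, json.loads(response.read()))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, answer = run_query({"q": "cache", "limit": 10})
```

Because QUERY is defined as safe and idempotent, a client or intermediary may retry it like a GET, while the body avoids squeezing a complex query into the URL.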

Hacker News Tue, 23 Jun 2026

In praise of memcached

The article explores the enduring utility and architectural simplicity of Memcached in modern infrastructure. It highlights how its minimalist design provides high-performance caching that remains relevant despite the rise of more complex distributed systems.

General

Reddit r/DeepLearning 2026-06-23

Alignment processes in neural networks?

A researcher proposes a method to test if ReLU activation decisions align with training data by comparing a standard neural network against one where ReLU gates are replaced by Locality Sensitive Hashing (LSH). Preliminary results on toy models showed unexpected behavior, potentially suggesting complex internal dynamics in how neural networks gate information. The post includes links to research directions and FreeBasic code for further exploration.
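One way to read the comparison described above is sketched below in plain Python. It assumes that "replacing ReLU gates with LSH" means each unit's on/off decision comes from the sign of a fixed random hyperplane projection of the input rather than from the unit's own pre-activation. That interpretation, and every function name here, is an assumption for illustration and is not taken from the post or its FreeBasic code.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def relu_layer(weights, x):
    """Standard layer: each unit is gated by the sign of its own pre-activation."""
    pre = [dot(row, x) for row in weights]
    return [p if p > 0 else 0.0 for p in pre]


def make_lsh_planes(n_units, dim, seed=0):
    """Random hyperplanes; the sign of each projection is one LSH bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_units)]


def lsh_gated_layer(weights, planes, x):
    """Same linear map, but the gate comes from a fixed hyperplane, not the weights."""
    pre = [dot(row, x) for row in weights]
    gates = [dot(plane, x) > 0 for plane in planes]
    return [p if g else 0.0 for p, g in zip(pre, gates)]


def gate_agreement(weights, planes, inputs):
    """Fraction of (input, unit) pairs where the ReLU and LSH gates agree."""
    agree = total = 0
    for x in inputs:
        for row, plane in zip(weights, planes):
            agree += (dot(row, x) > 0) == (dot(plane, x) > 0)
            total += 1
    return agree / total


weights = [[1.0, -2.0], [0.5, 0.5]]
planes = make_lsh_planes(n_units=2, dim=2)
inputs = [[1.0, 1.0], [-1.0, 2.0], [3.0, 0.5]]
agreement = gate_agreement(weights, planes, inputs)
```

If the hyperplanes are set equal to the weight rows, the LSH-gated layer reduces to the ordinary ReLU layer, which gives a simple sanity check before comparing trained networks.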

LLM

Hacker News Tue, 23 Jun 2026

Will It Mythos?

The post discusses the 'Will It Mythos' project, which explores the capabilities and limitations of AI models in generating complex, consistent mythological frameworks. It touches upon the creative boundaries of generative models and how they handle deep world-building.

Hacker News Tue, 23 Jun 2026

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

VibeThinker is a 3B-parameter model that reportedly outperforms Claude Opus 4.5 on reasoning. The result is attributed to a novel combination of Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), suggesting that efficient RL techniques can substantially boost reasoning capabilities in smaller, more accessible models.
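For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative advantage computation as it is commonly described. It is not VibeThinker's training recipe; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive advantages and the rest negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal.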

Trending repositories on GitHub filtered and scored for relevance to your interests.
