SinGo's Arxiv
Computation and Language 115
☆ Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
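The collapse attributed to mean initialization can be illustrated with a toy example. The sketch below is not the paper's diagnostic; it assumes a small random table standing in for a pretrained embedding matrix, and the helper names `mean_init` and `max_pairwise_distance` are hypothetical. It only shows that every mean-initialized token starts at exactly the same point.

```python
import random

def mean_init(embeddings, num_new):
    """Initialize every new token as the mean of the existing embedding rows."""
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    return [list(mean) for _ in range(num_new)]

def max_pairwise_distance(vectors):
    """Largest Euclidean distance between any two vectors in the list."""
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])) ** 0.5
            best = max(best, d)
    return best

random.seed(0)
# Toy stand-in for a pretrained embedding table (assumption: 100 tokens, 8 dims).
vocab = [[random.gauss(0, 1) for _ in range(8)] for _ in range(100)]
new_tokens = mean_init(vocab, num_new=5)

spread_new = max_pairwise_distance(new_tokens)  # all new tokens coincide
spread_old = max_pairwise_distance(vocab[:5])   # existing tokens are distinct
```

Because the new rows are identical, any distinction between them has to be created entirely by fine-tuning, which is the bottleneck the abstract describes.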
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
comment: 43 pages, 5 figures, 24 tables
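A rough sketch of the batched setup described above: N problems share one prompt, and each problem is scored on its own. The prompt wording, the `Answer k:` output format, and both function names are assumptions made for illustration, not details from the paper.

```python
import re

def build_batched_prompt(problems):
    """Pack N problems into a single shared context (hypothetical format)."""
    lines = [f"Problem {i}: {p}" for i, p in enumerate(problems, 1)]
    lines.append("Solve every problem. Finish with one line per problem: 'Answer k: <value>'.")
    return "\n".join(lines)

def per_instance_reward(completion, gold):
    """Return one 0/1 accuracy reward per problem, with no length term."""
    found = {int(k): v.strip() for k, v in re.findall(r"Answer (\d+):\s*(.+)", completion)}
    return [1.0 if found.get(i) == g else 0.0 for i, g in enumerate(gold, 1)]

prompt = build_batched_prompt(["2 + 2 = ?", "3 * 3 = ?"])
rewards = per_instance_reward("Short reasoning...\nAnswer 1: 4\nAnswer 2: 10", ["4", "9"])
```

The reward contains no explicit length penalty; any pressure toward shorter reasoning comes only from fitting all N solutions into one context window.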
☆ No Single Best Model for Diversity: Learning a Router for Sample Diversity
When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality score of the unique answers in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding that no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that significantly outperforms all other models at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays the foundation for studying the generation of comprehensive answers when we have access to a suite of models.
comment: under review at COLM 2026
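One possible reading of the diversity coverage metric, written as a minimal sketch. It assumes that answers are deduplicated by exact string match, that a `quality` lookup gives a score for every valid answer, and that unknown answers score zero; the paper's actual matching and scoring may differ.

```python
def diversity_coverage(predicted, quality):
    """Quality mass of the unique predicted answers, relative to the best
    possible answer set of the same size."""
    if not predicted:
        return 0.0
    unique_answers = set(predicted)  # duplicates earn no extra credit
    achieved = sum(quality.get(a, 0.0) for a in unique_answers)
    k = len(predicted)
    best = sum(sorted(quality.values(), reverse=True)[:k])
    return achieved / best if best > 0 else 0.0

quality = {"red": 1.0, "blue": 0.5, "green": 0.5, "teal": 0.25}
repeated = diversity_coverage(["red", "red", "blue"], quality)    # 1.5 / 2.0
distinct = diversity_coverage(["red", "blue", "green"], quality)  # 2.0 / 2.0
```

Repeating an answer wastes a slot: the denominator still counts three answers, so the repeated set scores lower than the fully distinct one.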
☆ go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
comment: 29 pages, 30 figures, 9 tables. Includes supplementary material
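For background, the classical orthostochastic construction squares the entries of an orthogonal matrix, which always yields a doubly stochastic matrix because every row and column of an orthogonal matrix has unit norm. The sketch below shows only that classical case, built from Givens rotations; it is not the paper's generalized parameterization and does not model the hyperparameter $s$. All function names are hypothetical.

```python
import math

def givens_rotation(d, i, j, theta):
    """d x d rotation by angle theta in the (i, j) coordinate plane."""
    g = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    g[i][i] = g[j][j] = math.cos(theta)
    g[i][j] = -math.sin(theta)
    g[j][i] = math.sin(theta)
    return g

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)] for r in range(n)]

def orthostochastic(d, rotations):
    """Square the entries of an orthogonal matrix built from Givens rotations."""
    q = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    for i, j, theta in rotations:
        q = matmul(q, givens_rotation(d, i, j, theta))
    return [[x * x for x in row] for row in q]

mix = orthostochastic(3, [(0, 1, 0.3), (1, 2, 1.1), (0, 2, -0.7)])
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[r][c] for r in range(3)) for c in range(3)]
```

The rotation angles play the role of unconstrained learnable parameters: any choice of angles gives nonnegative entries with unit row and column sums, so no projection step is needed.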
☆ De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
☆ VISTA: Visualization of Token Attribution via Efficient Analysis
Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit
comment: 12 pages, 3 figures
☆ CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech, at https://huggingface.co/datasets/Elyadata/CV18-NER
comment: Accepted at OSACT 2026
☆ Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
comment: 25 pages, 4 figures
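QWK here presumably stands for quadratic weighted kappa, a chance-corrected agreement score for ordinal ratings. The following is a minimal sketch of the standard formula, assuming the pairwise judgments are coded as integers (for example 0 = prefers the human-edited translation, 1 = equivalent, 2 = prefers the LLM translation). That coding is an assumption, not taken from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Quadratic weighted kappa for two raters with labels 0..num_labels-1."""
    n = len(rater_a)
    observed = [[0] * num_labels for _ in range(num_labels)]
    for x, y in zip(rater_a, rater_b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    # Undefined (division by zero) if both raters use a single identical label.
    return 1.0 - weighted_observed / weighted_expected

perfect = quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], num_labels=3)
reversed_ratings = quadratic_weighted_kappa([0, 2], [2, 0], num_labels=3)
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement, which is how the near-zero figures in the abstract should be read.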
☆ Towards Position-Robust Talent Recommendation via Large Language Models
Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issue. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.
☆ Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
☆ The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
☆ Adam's Law: Textual Frequency Law on Large Language Models
While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.
☆ Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with FM consistently outperforming CAR by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
comment: 8 pages
☆ AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
comment: 9 pages, 2 figures
☆ Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
comment: 21 pages
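The FR-CoT template quoted above lends itself to a simple validity check. The sketch below is an illustration only: the regular expression, the function name `parse_fr_cot`, and the example tool names are assumptions, and the paper may enforce the template differently (for example through constrained decoding).

```python
import re

# Hypothetical pattern for the "Function: [name] / Key args: [...]" template.
FR_COT_PATTERN = re.compile(
    r"^Function:\s*(?P<name>[\w.]+)\s*/\s*Key args:\s*(?P<args>.*)$"
)

def parse_fr_cot(reasoning, candidates):
    """Return (function_name, key_args) if the reasoning follows the template
    and names a function from the candidate set; otherwise return None."""
    match = FR_COT_PATTERN.match(reasoning.strip())
    if match is None or match.group("name") not in candidates:
        return None
    return match.group("name"), match.group("args").strip()

tools = {"get_weather", "get_time"}
valid = parse_fr_cot("Function: get_weather / Key args: city=Paris", tools)
hallucinated = parse_fr_cot("Function: fetch_forecast / Key args: city=Paris", tools)
free_form = parse_fr_cot("I think we should call get_weather", tools)
```

Rejecting any reasoning whose function name is outside the candidate set is one way to read the claim that committing to a valid name up front removes hallucinated functions.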
☆ MTI: A Behavior-Based Temperament Profiling System for AI Agents
AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.
comment: 29 pages, 6 figures, 12 tables. Paper #3 in the Model Medicine Series (Paper #1: arXiv:2603.04722)
☆ GaelEval: Benchmarking LLM Performance for Scottish Gaelic (LREC 2026)
Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3\%$ accuracy on the linguistic task, surpassing the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.
comment: 13 pages, to be published in Proceedings of LLMs4SSH (workshop co-located with LREC 2026; Mallorca, Spain; May 2026)
☆ LLM-as-a-Judge for Time Series Explanations
Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern-dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.
comment: Under Review
☆ Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.
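Stability filtering, as described, keeps a boundary only if re-generating from the same prefix reliably reproduces the behavior. The sketch below is a hedged illustration: `regenerate` and `shows_behavior` are hypothetical callables standing in for model sampling and behavior detection, and the sample count and threshold are arbitrary choices rather than values from the paper.

```python
def stability_filter(boundaries, regenerate, shows_behavior,
                     num_samples=8, threshold=0.75):
    """Keep prefixes whose re-generations show the target behavior
    in at least `threshold` of `num_samples` attempts."""
    kept = []
    for prefix in boundaries:
        hits = sum(1 for _ in range(num_samples) if shows_behavior(regenerate(prefix)))
        if hits / num_samples >= threshold:
            kept.append(prefix)
    return kept

# Deterministic stubs for illustration; a real setup would sample from the model.
def regenerate(prefix):
    return prefix + (" Wait, let me check." if "stable" in prefix else " So the answer is 4.")

def shows_behavior(text):
    return "Wait" in text  # keyword check standing in for a reflection detector

kept = stability_filter(["stable prefix", "flaky prefix"], regenerate, shows_behavior)
```

Only the surviving prefixes would then contribute hidden states to the steering vector, which is the step that, according to the abstract, stops unstable boundaries from diluting the signal.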
☆ Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. We also build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making the method practical for low-resource settings.
comment: Submitted to Interspeech 2026; 6 pages, 4 figures
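For readers unfamiliar with the ABX task mentioned in the abstract above, the following is a minimal sketch of how an ABX discrimination score is commonly computed. It is not the authors' implementation. It assumes each item has already been pooled into a fixed-length vector and uses cosine distance; ABX evaluations on speech often compare frame sequences with dynamic time warping instead. The function names `cosine_distance` and `abx_accuracy` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_accuracy(triples):
    """Fraction of (a, b, x) triples where x is closer to a than to b.

    Each triple assumes x belongs to the same category as a (for example,
    the same stress pattern) and b to the contrasting category.
    Ties count as half correct.
    """
    correct = 0.0
    total = 0
    for a, b, x in triples:
        d_ax = cosine_distance(a, x)
        d_bx = cosine_distance(b, x)
        if d_ax < d_bx:
            correct += 1.0
        elif d_ax == d_bx:
            correct += 0.5
        total += 1
    return correct / total


# Toy 2-dimensional "representations" (illustrative values only).
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # x is close to a: counted correct
    ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9]),  # x is close to b: counted wrong
]
score = abx_accuracy(triples)
```

A score near 1.0 would indicate that the representation separates the two categories well, while 0.5 corresponds to chance.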
☆ Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
comment: 16 pages
☆ Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
comment: 10 pages, 5 tables, 1 figure, 1 algorithm. Code: https://github.com/RightNow-AI/ouroboros
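The phrase "bias-initialized to 88% retention" in the abstract above can be illustrated with a small sketch. The abstract does not give the exact form of the gate, so this assumes a scalar sigmoid gate that mixes the previous hidden state with a candidate update. Under that assumption, the bias giving 88% retention at initialization is the logit of 0.88. The names `logit`, `sigmoid`, and `gated_update` are hypothetical and are not taken from the Ouroboros code.

```python
import math


def logit(p):
    """Inverse of the sigmoid: the bias b such that sigmoid(b) == p."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_update(h_prev, h_candidate, gate_bias):
    """Assumed gate form: keep a fraction g of the previous state.

    With a zero gate input at initialization, g = sigmoid(gate_bias).
    """
    g = sigmoid(gate_bias)
    return [g * p + (1.0 - g) * c for p, c in zip(h_prev, h_candidate)]


# Bias that yields 88% retention of the previous state (about 1.99).
bias = logit(0.88)
new_state = gated_update([1.0, 2.0], [0.0, 0.0], bias)
```

Starting the gate close to "keep the previous state" means each recurrence step initially makes only a small change, which is one plausible reading of why the abstract associates it with stable deep iteration.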
☆ Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
☆ BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
comment: 30 pages, 16 figures, 10 tables
☆ Tracking the emergence of linguistic structure in self-supervised models learning from speech
Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
☆ Why Gaussian Diffusion Models Fail on Discrete Data?
Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
☆ $k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection
LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
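The interpolation step described in the abstract above can be sketched as follows. This follows the general kNN-LM recipe (a softmax over negative neighbor distances, aggregated per token, then mixed with the base model's distribution) and is not the authors' code. The function names, the `temperature` parameter, and the mixing weight `lam` are assumptions for illustration.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (distance, next_token) pairs into a token distribution.

    Closer neighbors receive more weight through a softmax over negative
    distances; weights of neighbors sharing a token are summed.
    """
    weights = [math.exp(-d / temperature) for d, _ in neighbors]
    z = sum(weights)
    dist = defaultdict(float)
    for w, (_, token) in zip(weights, neighbors):
        dist[token] += w / z
    return dict(dist)


def interpolate(p_proxy, p_knn, lam):
    """Mix the proxy LM distribution with the retrieval distribution."""
    vocab = set(p_proxy) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_proxy.get(t, 0.0)
        for t in vocab
    }


# Toy example: two equally distant neighbors voting for different tokens.
p_knn = knn_distribution([(0.0, "a"), (0.0, "b")])
p_mix = interpolate({"a": 1.0}, p_knn, lam=0.5)
```

A zero-shot detector would then compute its usual token-level statistics from the mixed distribution instead of the raw proxy output.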
☆ SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
☆ RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
comment: 11 pages, 10 figures. To be submitted to CAMLIS 2026
☆ Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models LREC 2026
Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
comment: Accepted at NSLP@LREC 2026
☆ Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite LREC 2026
Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.
comment: Accepted at LREC 2026
☆ How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. The multiple cases in which crosslinguistic gestures reach optimality are unlikely to be due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
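The swap distance between two orders in the permutohedron is the minimum number of adjacent swaps needed to turn one into the other, which equals the number of pairs that appear in opposite relative order (the Kendall tau distance). A minimal sketch follows, assuming both orders contain the same distinct elements. The function name `swap_distance` and the subject/object/verb example are illustrative and do not reproduce the paper's optimality measure.

```python
def swap_distance(source, target):
    """Minimum number of adjacent swaps turning `source` into `target`.

    Counts inversions: pairs of elements whose relative order differs
    between the two permutations.
    """
    position = {item: i for i, item in enumerate(target)}
    ranks = [position[item] for item in source]
    n = len(ranks)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if ranks[i] > ranks[j]
    )


# Orders of subject (S), object (O), and verb (V) relative to SOV.
d_same = swap_distance("SOV", "SOV")
d_svo = swap_distance("SOV", "SVO")
d_ovs = swap_distance("SOV", "OVS")
d_vos = swap_distance("SOV", "VOS")
```

For three elements the permutohedron is a hexagon, so the largest possible distance is 3, reached by the fully reversed order.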
☆ Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification
Among news disorders, propagandist news is particularly insidious because it tends to mix slanted messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
☆ ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name-based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic-based cues (culturally associated attributes that signal identity implicitly) across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open-weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.
☆ Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients LREC 2026
Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.
comment: Accepted at LREC 2026
☆ SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations ICASSP 2026
Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.
comment: ICASSP 2026
☆ HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy gradually shrinks as the LLM layer depth increases, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
☆ Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risks of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails require the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
☆ From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill them in directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and designing a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
☆ PLOT: Enhancing Preference Learning via Optimal Transport
Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.
☆ Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
☆ DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework that incorporates data filtering and distributional guidance by calculating a differential distribution reward based on the output distribution of the language model and the discrepancy distribution of the preference data. A small yet high-quality subset is filtered from the raw data using the differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
☆ Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens
Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control achieved. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
☆ FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
comment: The first two authors contributed equally to this work; listing order is random
☆ LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
comment: Project page: https://livemathematicianbench.github.io/
☆ Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text
Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in the Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.
☆ From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents
Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents with mixed text-and-table content. We evaluate retrieval quality via Recall@k, MRR, and nDCG, and end-to-end generation quality via Number Match, with paired bootstrap significance testing. Our results show that (1) a two-stage pipeline combining hybrid retrieval with neural reranking achieves Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin; (2) BM25 outperforms state-of-the-art dense retrieval on financial documents, challenging the common assumption that semantic search universally dominates; and (3) query expansion methods (HyDE, multi-query) and adaptive retrieval provide limited benefit for precise numerical queries, while contextual retrieval yields consistent gains. We provide ablation studies on fusion methods and reranker depth, actionable cost-accuracy recommendations, and release our full benchmark code.
comment: 11 pages, 6 figures, 6 tables
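To make the retrieval terms above concrete, here is a minimal sketch of hybrid fusion and two of the reported metrics. It assumes reciprocal rank fusion (RRF) as the fusion rule, which is one common choice and not necessarily the one the paper uses; the function names, the constant `k=60`, and the toy document ids are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids (assumed RRF rule, k is a hypothetical default)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy rankings from a sparse (e.g. BM25) and a dense retriever.
sparse = ["d3", "d1", "d2"]
dense = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([sparse, dense])
```

In a two-stage pipeline of the kind the abstract describes, a fused list like `fused` would then be passed to a reranker before the metrics are computed.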
☆ Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition
Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
comment: 6 pages, 2 figures. Dataset of 2,764 Vietnamese speech samples across three emotion classes
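The confidence-based routing idea above can be sketched as follows. This is only an illustration under assumptions: the threshold value `0.8`, the function name `route_sample`, and the dictionary layout are hypothetical, and the abstract does not say how confidence is computed or what threshold is used.

```python
def route_sample(class_probs, threshold=0.8):
    """Keep confident acoustic predictions; defer ambiguous ones to an LLM stage.

    class_probs maps each emotion label to the acoustic model's probability.
    The threshold is a hypothetical value, not one taken from the paper.
    """
    label = max(class_probs, key=class_probs.get)
    confidence = class_probs[label]
    if confidence >= threshold:
        return {"route": "acoustic", "label": label, "confidence": confidence}
    # Ambiguous case: no label yet, the downstream LLM reasoning step decides.
    return {"route": "llm", "label": None, "confidence": confidence}


easy = route_sample({"calm": 0.92, "angry": 0.05, "panic": 0.03})
hard = route_sample({"calm": 0.10, "angry": 0.46, "panic": 0.44})
```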
☆ Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
Memory emerges as the core module in large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.
☆ Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
comment: Under review at npj Digital Medicine
☆ On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, \texttt{DeepSeek-R1-0528} and \texttt{gpt-oss-120b}, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on \texttt{DeepSeek-R1-0528} data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on \texttt{gpt-oss-120b}. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. \texttt{gpt-oss-120b} exhibits highly convergent and deductive trajectories, whereas \texttt{DeepSeek-R1-0528} favors a divergent and branch-heavy exploration pattern. Consequently, models trained with \texttt{DeepSeek-R1} data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected \texttt{DeepSeek-R1-0528} subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
comment: Under Review
☆ MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning
Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA leverages Singular Value Decomposition to identify subspaces related to minor singular vectors associated with the least significant singular values and constrains the update of parameters during fine-tuning to those directions. This strategy leads to up to 5.9x improvement in knowledge acquisition under optimized training hyperparameters and a minimal parameter footprint of 6-60% compared to LoRA. These results suggest that constraining adaptation to minor singular directions provides a more efficient and stable mechanism for integrating new knowledge into pre-trained language models.
☆ Coupled Query-Key Dynamics for Attention
Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys \emph{jointly} through shared learned dynamics before scoring - which we call \textbf{coupled QK dynamics} - improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55--22.62 perplexity vs.\ 24.22 for standard attention ($-$6.6--6.9\%), with only 0.11\% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8$\times$ higher seed variance. The integration step count (1--7) is similarly irrelevant - a single coupled step suffices. A compute-matched comparison reveals that coupling is a \emph{sample-efficiency} mechanism: standard attention trained for 2.4$\times$ longer (matching wall-clock) reaches the same perplexity, but requires 2.4$\times$ more tokens. The advantage scales to 150M ($-$6.7\%) but narrows at 350M ($-$1.0\%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 $-$6.6\%, PubMed $-$4.5\%) but degrades on heterogeneous web text ($+$10.3\%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
☆ PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose \textbf{PRISM}, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.
☆ PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation ACL
Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.
comment: 14 pages, 6 figures, 5 tables. Submitted to Transactions of the Association for Computational Linguistics (TACL)
☆ What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
comment: 11 pages
☆ ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
We present ThinknCheck, a 1B-parameter verifier for grounded claim verification that first produces a short, structured rationale and then a binary verdict. We construct LLMAggreFact-Think, a 24.1k reasoning-augmented training set derived from LLMAggreFact, and fine-tune a 4-bit Gemma3 model to follow this format. On LLMAggreFact, ThinknCheck attains 78.1 balanced accuracy (BAcc), surpassing MiniCheck-7B (77.4) with 7x fewer parameters; removing the reasoning step reduces BAcc to 57.5. On SciFact, ThinknCheck reaches 64.7 BAcc, a +14.7 absolute gain over MiniCheck-7B. By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning. To probe the latter, we introduce GSMClaims and a domain-specialized variant, ThinknCheck-Science, which improves across benchmarks, including 61.0\% accuracy on GSMClaims. Overall, explicit, supervised reasoning enables compact verifiers that are competitive while remaining resource-efficient and interpretable.
comment: 15 pages
☆ Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations
Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
comment: Preprint. Under review at COLM 2026
☆ CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning CVPR 2026
Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT covers diverse domains, ranging from natural images and videos to text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.
comment: Accepted to CVPR 2026
☆ Grounding AI-in-Education Development in Teachers' Voices: Findings from a National Survey in Indonesia
Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.
☆ OSCAR: Orchestrated Self-verification and Cross-path Refinement
Diffusion language models (DLMs) expose their denoising trajectories, offering a natural handle for inference-time control; accordingly, an ideal hallucination mitigation framework should intervene during generation using this model-native signal rather than relying on an externally trained hallucination classifier. Toward this, we formulate commitment uncertainty localization: given a denoising trajectory, identify token positions whose cross-chain entropy exceeds an unsupervised threshold before factually unreliable commitments propagate into self-consistent but incorrect outputs. We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods. We also introduce OSCAR, a training-free inference-time framework operationalizing this formulation. OSCAR runs N parallel denoising chains with randomized reveal orders, computes cross-chain Shannon entropy to detect high-uncertainty positions, and then performs targeted remasking conditioned on retrieved evidence. Ablations confirm that localization and correction contribute complementary gains, robust across N in {4, 8, 16}. On TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA using LLaDA-8B and Dream-7B, OSCAR enhances generation quality by significantly reducing hallucinated content and improving factual accuracy through uncertainty-guided remasking, which also facilitates more effective integration of retrieved evidence. Its native entropy-based uncertainty signal surpasses that of specialized trained detectors, highlighting an inherent capacity of diffusion language models to identify factual uncertainty that is not present in the sequential token commitment structure of autoregressive models. We are releasing the codebase to support future research on localization and uncertainty-aware generation in DLMs.
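The cross-chain entropy step described above can be illustrated with a small sketch. It assumes the entropy is taken over the tokens that the N chains commit to at each position; the paper may instead use the chains' predictive distributions. The fixed `threshold` stands in for the unsupervised threshold, and all names and the toy chains are hypothetical.

```python
import math
from collections import Counter


def cross_chain_entropy(chains):
    """Shannon entropy (bits) of the committed token at each position across chains."""
    num_chains = len(chains)
    entropies = []
    for pos in range(len(chains[0])):
        counts = Counter(chain[pos] for chain in chains)
        probs = [c / num_chains for c in counts.values()]
        entropies.append(sum(-p * math.log2(p) for p in probs))
    return entropies


def positions_to_remask(chains, threshold):
    """Positions whose cross-chain entropy exceeds the (assumed, fixed) threshold."""
    return [i for i, h in enumerate(cross_chain_entropy(chains)) if h > threshold]


# Four toy chains (N = 4) that disagree on the first and last token.
chains = [
    ["Paris", "is", "in", "France"],
    ["Paris", "is", "in", "France"],
    ["Lyon", "is", "in", "France"],
    ["Paris", "is", "in", "Spain"],
]
entropies = cross_chain_entropy(chains)
flagged = positions_to_remask(chains, threshold=0.5)
```

Positions where all chains agree have zero entropy and are left alone; only the flagged positions would be candidates for remasking.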
☆ Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC-DLM.
comment: 26 pages
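As a rough illustration of expert-choice routing, the sketch below lets each expert select its own top-scoring tokens, so every expert processes exactly `capacity` tokens. The scoring matrix, the helper names, and the linear capacity schedule in `capacity_for_step` are assumptions made for this example; the abstract only states that capacity varies with the denoising step and that low-mask-ratio steps receive more.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert picks its `capacity` highest-scoring tokens, which balances
    load by construction (a token may be picked by several experts or by none).
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: (-scores[t][e], t))
        assignment[e] = ranked[:capacity]
    return assignment


def capacity_for_step(mask_ratio, base=2, extra=2):
    """Hypothetical schedule: grant more capacity when fewer tokens are masked."""
    return base + round(extra * (1.0 - mask_ratio))


scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.4, 0.6],
]
routing = expert_choice_route(scores, capacity=2)
```

By contrast, token-choice routing would let each token pick its experts, which gives no guarantee on how many tokens any single expert receives.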
☆ Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression
The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
comment: Under Review
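One way to read the closed-form claim above is through the standard Eckart-Young argument, sketched here under assumed notation that does not come from the paper: inputs $X_b$ for batch $b$, weight $W \in \mathbb{R}^{d_{in} \times d_{out}}$, outputs $Y_b = X_b W$, and target rank $r$.

```latex
% Incrementally aggregated covariance of output activations
C = \sum_{b} Y_b^{\top} Y_b , \qquad Y_b = X_b W

% Single eigenvalue decomposition after aggregation
C = U \Lambda U^{\top}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{d_{out}} \ge 0

% Rank-r approximation built from the top-r eigenvectors U_r
\hat{W} = W U_r U_r^{\top} = A B, \qquad A = W U_r , \quad B = U_r^{\top}

% Resulting output reconstruction error, minimal over all rank-r choices
\sum_{b} \lVert X_b W - X_b \hat{W} \rVert_F^2 = \sum_{i > r} \lambda_i
```

Under this reading, the error is the sum of the discarded eigenvalues, which is why a single decomposition of the aggregated covariance is enough and no training is required.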
☆ Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
☆ DeltaMem: Towards Agentic Memory Management via Reinforcement Learning
Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
comment: preprint, under review
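The abstract does not define the Memory-based Levenshtein Distance, so the following is only a rough illustration of the general idea: plain Levenshtein distance computed over sequences of memory operations, turned into a reward in [0, 1]. The operation tuples, the normalization, and the function names are all assumptions, not the paper's formulation.

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def update_reward(predicted_ops, reference_ops):
    """Hypothetical reward: 1 minus the length-normalized edit distance."""
    longest = max(len(predicted_ops), len(reference_ops))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted_ops, reference_ops) / longest

# Hypothetical operation-level labels for one dialogue turn.
predicted = [("ADD", "likes tea"), ("UPDATE", "lives in Oslo")]
reference = [("ADD", "likes tea"), ("UPDATE", "lives in Bergen"), ("DELETE", "owns a cat")]
reward = update_reward(predicted, reference)
```

A sequence-level distance of this kind gives partial credit when most operations are right, which is one plausible reason to prefer it over exact-match scoring.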
☆ Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
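As a minimal sketch of what an interpolation-based merge can look like, the snippet below linearly interpolates two parameter sets that share the same names and shapes. The abstract does not specify which merge methods or coefficients were used; the flat-list parameter format, the name `interpolate_weights`, and the value of `alpha` are assumptions for illustration.

```python
def interpolate_weights(clinical, instruct, alpha):
    """Return (1 - alpha) * clinical + alpha * instruct, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if clinical.keys() != instruct.keys():
        raise ValueError("both models must expose the same parameter names")
    merged = {}
    for name in clinical:
        a, b = clinical[name], instruct[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged

# Hypothetical toy "models" with a single flattened parameter each.
clinical = {"layer.weight": [1.0, 2.0]}
instruct = {"layer.weight": [3.0, 6.0]}
merged = interpolate_weights(clinical, instruct, alpha=0.25)
```

Because the merge is a closed-form operation on weights, it needs no gradient updates; in this sketch `alpha` controls how far the result moves from the clinical model toward the instruct model.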
☆ Read More, Think More: Revisiting Observation Reduction for Web Agents
Web agents based on large language models (LLMs) rely on observations of web pages -- commonly represented as HTML -- as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
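One simple way to picture a diff-based observation history is to send the agent only the lines that changed between two page snapshots. The sketch below uses Python's standard `difflib`; the abstract does not say how the paper computes its diffs, so the line-level granularity and the helper name `observation_diff` are assumptions.

```python
import difflib

def observation_diff(previous_html, current_html):
    """Return a unified diff of two page snapshots with no context lines."""
    diff = difflib.unified_diff(
        previous_html.splitlines(),
        current_html.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
        n=0,  # omit unchanged context to keep the observation short
    )
    return "\n".join(diff)

# Hypothetical snapshots before and after clicking "Add".
previous = "<button>Add</button>\n<span>Cart: 0</span>"
current = "<button>Add</button>\n<span>Cart: 1</span>"
delta = observation_diff(previous, current)
```

Only the changed `<span>` line appears in `delta`, so the token cost grows with the size of the change rather than with the size of the page.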
☆ Why Instruction-Based Unlearning Fails in Diffusion Models?
Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.
☆ Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once
Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of "diversity." Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model's intrinsic trait.
comment: Under review
☆ From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.
♻ ☆ Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
comment: Visualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/
♻ ☆ ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget
Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, which involve multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset of 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question-answer pair generation, and two stages of verification (self and external). ORBIT spans 15 domains, and each training pair requires 4-5 reasoning steps, with external search verification against the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experimental results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, showing the utility of synthetic datasets. Our framework, code and datasets are open-sourced and publicly available.
comment: Preprint
♻ ☆ Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking. However, the prevalence and impact of length bias in QE metrics have been underexplored. Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases. First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a systematic preference for shorter translations when multiple candidates of comparable quality are available for the same source text. These biases risk unfairly penalizing longer, correct translations and can propagate into downstream pipelines that rely on QE signals for data selection or system optimization. For learned QE metrics, we trace the root cause to skewed supervision distributions, where longer error-free examples are underrepresented in training data. As a diagnostic intervention, we apply length normalization during training and show that this simple modification effectively decouples error prediction from sequence length, yielding more reliable QE signals across translations of varying length.
♻ ☆ LLM2Vec-Gen: Generative Embeddings from Large Language Models
Fine-tuning LLM-based text embedders via contrastive learning maps inputs and outputs into a new representational space, discarding the LLM's output semantics. We propose LLM2Vec-Gen, a self-supervised alternative that instead produces embeddings directly in the LLM's output space by learning to represent the model's potential response. Specifically, trainable special tokens are appended to the input and optimized to compress the LLM's own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 8.8% over the unsupervised embedding teacher. Since the embeddings preserve the LLM's response-space semantics, they inherit capabilities such as safety alignment (up to 22.6% reduction in harmful content retrieval) and reasoning (up to 35.6% improvement on reasoning-intensive retrieval). Finally, the learned embeddings are also interpretable: they can be decoded back into text to reveal their semantic content.
♻ ☆ NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
♻ ☆ Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI SP
Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
comment: Accepted at ACM CODASPY 2026
♻ ☆ One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on large volumes of high-quality samples. In this paper, we challenge conventional assumptions about data requirements in RL for LLMs by demonstrating the effectiveness of one-shot reinforcement learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary reasoning improvement. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) Analysis of salient mathematical skills provides insight into the characteristics associated with effective polymath samples; and (3) An engineered synthetic sample that integrates multidisciplinary elements and broader skill coverage achieves stronger performance than naturally occurring individual samples. Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of samples that complements simply increasing data volume.
♻ ☆ Support-Contra Asymmetry in LLM Explanations
Large Language Models (LLMs) increasingly produce natural language explanations alongside their predictions, yet it remains unclear whether these explanations reference predictive cues present in the input text. In this work, we present an empirical study of how LLM-generated explanations align with predictive lexical evidence from an external model in text classification tasks. To analyze this relationship, we compare explanation content against interpretable feature importance signals extracted from transparent linear classifiers. These reference models allow us to partition predictive lexical cues into supporting and contradicting evidence relative to the predicted label. Across three benchmark datasets (WIKIONTOLOGY, AG NEWS, and IMDB), we observe a consistent empirical pattern that we term support-contra asymmetry. Explanations accompanying correct predictions tend to reference more supporting lexical cues and fewer contradicting cues, whereas explanations associated with incorrect predictions reference substantially more contradicting evidence. This pattern appears consistently across datasets, across reference model families (logistic regression and linear SVM), and across multiple feature retrieval depths. These results suggest that LLM explanations often reflect lexical signals that are predictive for the task when predictions are correct, while incorrect predictions are more frequently associated with explanations that reference misleading cues present in the input. Our findings provide a simple empirical perspective on explanation-evidence alignment and illustrate how external sources of predictive evidence can be used to analyze the behavior of LLM-generated explanations.
comment: 17 Pages, 12 Figures, 4 tables
♻ ☆ Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science LREC
Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
comment: LREC (CAS)
♻ ☆ Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names
The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $α$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $α$ or the duration $T$; and (iii) $α$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $α$ can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.
♻ ☆ Current LLMs still cannot 'talk much' about grammar modules: Evidence from syntax
We aim to examine the extent to which Large Language Models (LLMs) can 'talk much' about grammar modules, providing evidence from core syntactic properties translated by ChatGPT into Arabic. We collected 44 terms from previous works on generative syntax, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations using an analytical and comparative approach. Findings reveal that LLMs still cannot 'talk much' about the core syntax properties embedded in the terms under study, which involve several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4% were partially correct, which we consider appropriate. Based on these findings, we propose a set of actionable strategies, the most notable of which is close collaboration between AI specialists and linguists to improve LLMs' working mechanisms for accurate or at least appropriate translation.
comment: 15 pages
♻ ☆ Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
♻ ☆ Do Phone-Use Agents Respect Your Privacy?
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/FreedomIntelligence/MyPhoneBench.
comment: work in progress
♻ ☆ Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners
Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses: some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. In open-ended feedback, learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding, and they provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
comment: Accepted at AIED 2026
♻ ☆ What Makes a Good Doctor Response? A Study on Text-Based Telemedicine LREC 2026
Text-based telemedicine has become an increasingly used mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of anonymised text-based telemedicine consultations, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract from doctor responses interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that metadata dominates prediction, functioning as a strong prior, while characteristics of the response text provide a smaller but actionable signal. In subgroup correlation analyses, politeness and hedging are consistently associated with positive patient feedback, whereas lexical diversity shows a negative association.
comment: Accepted at CL4Health Workshop @ LREC 2026
♻ ☆ MemFactory: Unified Inference & Training Framework for Agent Memory
Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across the evaluation sets, MemFactory improves performance over the corresponding base models on average, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.
comment: v3: fixed Figure 1 typos, clarified ambiguous wording in the abstract, added 1 missing citation. Code: https://github.com/Valsure/MemFactory
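For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only. It is not MemFactory's code: the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    if not rewards:
        return []
    center = mean(rewards)
    spread = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - center) / (spread + eps) for r in rewards]

# Hypothetical rewards for four memory-management rollouts of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline. When every rollout earns the same reward, all advantages are zero and the group contributes no gradient signal.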
♻ ☆ Fair Representation in Parliamentary Summaries: Measuring and Mitigating Inclusion Bias ECML-PKDD 2025
The use of large language models (LLMs) to summarise parliamentary proceedings presents a promising means of increasing the accessibility of democratic participation. However, as these systems increasingly mediate access to political information -- filtering and framing content before it reaches users -- there are important fairness considerations to address. In this work, we evaluate 5 LLMs (both proprietary and open-weight) in the summarisation of plenary debates from the European Parliament to investigate the representational biases that emerge in this context. We develop an attribution-aware evaluation framework to measure speaker-level inclusion and mis-representation in debate summaries. Across all models and experiments, we find that speakers are less accurately represented in the final summary on the basis of (i) their speaking-order (speeches in the middle of the debate were systematically excluded), (ii) language spoken (non-English speakers were less faithfully represented), and (iii) political affiliations (better outcomes for left-of-centre parties). We further show how biases in these contexts can be decomposed to distinguish inclusion bias (systematic omission) from hallucination bias (systematic misrepresentation), and explore the effect of different mitigation strategies. Prompting strategies do not affect these biases. Instead, we propose a hierarchical summarisation method that decomposes the task into simpler extraction and aggregation steps, which we show significantly improves the positional/speaking-order bias across all models. These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for multilingual democratic applications.
comment: Extended journal version of "Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation" (arXiv:2507.14221), which appeared at the AIDEM Workshop, ECML-PKDD 2025. This version extends the original with cross-lingual bias analysis, a two-level hierarchical summarisation method, and human annotation validation of the evaluation framework
♻ ☆ ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching and Token Scheduling
Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at similar computation by activating only a small set of experts per token. However, stacking many expert modules introduces substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments such as single-GPU devices. Offloading alleviates this issue by storing inactive experts in CPU memory and loading them on demand, but existing methods remain limited: static caches disregard input-dependent routing, and methods that train separate models to predict expert usage ahead of time are often inaccurate or require significant training cost. We propose ExpertFlow, a lightweight MoE inference system that addresses this routing dependency through three coordinated components: 1) a transformer-based routing path predictor that estimates expert usage across all MoE layers in a single forward pass, 2) a token scheduler that groups tokens with similar predicted routes to improve expert utilization, and 3) a predictive expert cache that loads only the required experts while correcting mispredictions at runtime. Together, these components enable efficient expert loading and execution, reducing GPU memory usage by up to 93.72% and improving inference throughput by up to 10x over strong offloading baselines on a single GPU.
comment: Accepted in DAC'26, Mixture-of-Experts, Inference, Offloading
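The token-scheduling and predictive-caching ideas in the abstract can be illustrated with a deliberately simplified sketch. It assumes each token already has a predicted set of expert ids, groups tokens whose predicted sets are identical (the paper speaks of "similar" routes, so exact matching is a simplification), and computes which experts are missing from a cache. All names and the toy data are hypothetical and are not taken from ExpertFlow.

```python
from collections import defaultdict

def group_tokens_by_route(predicted_routes):
    """Group token indices whose predicted expert sets are identical."""
    groups = defaultdict(list)
    for token_idx, experts in enumerate(predicted_routes):
        # frozenset ignores expert order, so (0, 3) and (3, 0) share a group
        groups[frozenset(experts)].append(token_idx)
    return dict(groups)

def experts_to_load(groups, cached):
    """Return the experts required by any group that are not yet cached."""
    needed = set().union(*groups.keys()) if groups else set()
    return needed - set(cached)

# Hypothetical predicted routes for four tokens (two experts per token).
routes = [(0, 3), (1, 2), (3, 0), (1, 2)]
groups = group_tokens_by_route(routes)
missing = experts_to_load(groups, cached={0, 1})
```

Batching tokens that share a route means each expert is fetched once per group rather than once per token, which is the intuition behind scheduling by predicted route.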
♻ ☆ Output Embedding Centering for Stable LLM Pretraining
Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs at the end of training is output logit divergence. The most widely used mitigation strategies, z-loss and logit soft-capping, merely address the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings' geometry and identify anisotropic embeddings as its source. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and demonstrate that it suppresses output logit divergence. OEC can be implemented in two different ways: as a deterministic operation called $μ$-centering, or a regularization method called $μ$-loss. Our experiments show that both variants outperform z-loss in terms of training stability, while being on par with logit soft-capping. This holds true both in the presence and the absence of weight tying. As a secondary result, we find that $μ$-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
comment: Additional experiments using logit soft-capping & weight tying
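The abstract does not spell out the $μ$-centering formula. A minimal sketch, assuming it means subtracting the vocabulary-wide mean from every output embedding, is given below. Under that assumption every logit for a given hidden state shifts by the same constant, so the softmax probabilities are unchanged while the mean logit becomes zero. The shapes, seed, and helper names are illustrative choices, not the authors' code.

```python
import numpy as np

def mu_center(output_embeddings):
    """Subtract the mean output embedding (taken over the vocabulary) from each row."""
    return output_embeddings - output_embeddings.mean(axis=0, keepdims=True)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 50, 8
# Toy output embeddings with a deliberate common offset (assumed for illustration).
E = rng.normal(size=(vocab_size, hidden_dim)) + 3.0
h = rng.normal(size=(4, hidden_dim))  # four toy final hidden states

logits_raw = h @ E.T
logits_centered = h @ mu_center(E).T

probs_unchanged = np.allclose(softmax(logits_raw), softmax(logits_centered))
mean_logit_is_zero = np.allclose(logits_centered.mean(axis=-1), 0.0)
```

Because the predicted distribution is identical, a centering step of this kind constrains the logits' common offset without altering what the model predicts.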
♻ ☆ OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
♻ ☆ One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, while remaining robust to black-box attacks in the universal setting.
comment: Published in Transactions on Machine Learning Research (03/2026)
♻ ☆ LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade
Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.
♻ ☆ KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Knowledge-Intensive Visual Grounding (KVG) requires models to localize objects using fine-grained, domain-specific entity names rather than generic referring expressions. Although Multimodal Large Language Models (MLLMs) possess rich entity knowledge and strong generic grounding capabilities, they often fail to effectively utilize such knowledge when grounding specialized concepts, revealing a knowledge-grounding gap between internal knowledge and grounding predictions. To address this challenge, we propose a knowledge-aware training paradigm for KVG. Our approach first constructs knowledge-guided reasoning data to encourage models to activate domain-relevant entity knowledge during grounding, and then introduces KARL, a Knowledge-Aware Reinforcement Learning framework that adaptively modulates reward signals according to the model's estimated knowledge mastery of different entities. To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities. Extensive experiments show that our approach consistently outperforms a wide range of baseline models and achieves substantially stronger cross-domain generalization on unseen categories. The data, codes, and models are released at https://github.com/thunlp/KARL.
♻ ☆ Valency Classification of Mapudungun Verbal Roots. Established by the language's own morphotactics
In the previous work, a lexical (re)categorisation -- or confirmation of the given category -- of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language's own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum) with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.
♻ ☆ Efficient Reasoning with Balanced Thinking ICLR 2026
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
comment: Accepted by ICLR 2026
♻ ☆ J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
Spoken dialogue is essential for human-AI interactions, providing expressive capabilities beyond text. Developing effective spoken dialogue systems (SDSs) requires large-scale, high-quality, and diverse spoken dialogue corpora. However, existing datasets are often limited in size, spontaneity, or linguistic coherence. To address these limitations, we introduce J-CHAT, a 76,000-hour open-source Japanese spoken dialogue corpus. Constructed using an automated, language-independent methodology, J-CHAT ensures acoustic cleanliness, diversity, and natural spontaneity. The corpus is built from YouTube and podcast data, with extensive filtering and denoising to enhance quality. Experimental results with generative spoken dialogue language models trained on J-CHAT demonstrate its effectiveness for SDS development. By providing a robust foundation for training advanced dialogue models, we anticipate that J-CHAT will drive progress in human-AI dialogue research and applications.
comment: 8 pages, 3 figures
♻ ☆ Lexical categories of stem-forming roots in Mapudüngun verb forms
After developing a computational system for morphological analysis of the Mapuche language, and evaluating it with texts from various authors and styles, it became necessary to verify the linguistic assumptions of the source used as the basis for implementing this tool. In the present work, the primary focus is on the lexical category classification of Mapudüngun roots recognised as verbal in the source utilised for the development of the morphological analysis system. The results of this lexical category revision directly benefit the computational analyser, as they are implemented as soon as they are verified. Additionally, it is hoped that these results will help clarify some uncertainties about lexical categories in the Mapuche language. This work addresses a preliminary task to identify the valency of true verbal roots, the results of which will be presented in a subsequent work that complements this article.
comment: 36 pages, 2 large tables, 2 sample tables
♻ ☆ Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies
While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances
comment: 22 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2508.17366
♻ ☆ LEXam: Benchmarking Legal Reasoning on 340 Law Exams ICLR 2026
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 7,537 law exam questions in English and German. It includes both long-form, open-ended questions and multiple-choice questions with varying numbers of options. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/.
comment: Accepted to ICLR 2026
♻ ☆ Generation-Step-Aware Framework for Cross-Modal Representation and Control in Multilingual Speech-Text Models
Multilingual speech-text models rely on cross-modal language alignment to transfer knowledge between speech and text, but it remains unclear whether this reflects shared computation for the same language or modality-specific processing. We introduce a generation-step-aware framework for evaluating cross-modal computation that (i) identifies language-selective neurons for each modality at different decoding steps, (ii) decomposes them into language-representation and language-control roles, and (iii) enables cross-modal comparison via overlap measures and causal intervention, including cross-modal steering of output language. Applying our framework to SeamlessM4T v2, we find that cross-modal language alignment is strongest at the first decoding step, where language-representation neurons are shared across modalities, but weakens as generation proceeds, indicating a shift toward modality-specific autoregressive processing. In contrast, language-control neurons identified from speech transfer causally to text generation, revealing partially shared circuitry for output-language control that strengthens at later decoding steps. These results show that cross-modal processing is both time- and function-dependent, providing a more nuanced view of multilingual computation in speech-text models.
comment: 10 pages for the main text, 6 Figures, 5 Tables
♻ ☆ MemRerank: Preference Memory for Personalized Product Reranking
LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.
comment: correct author name in metadata
♻ ☆ PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs ICLR 2026
Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
comment: Accepted at the ICLR 2026 Workshop on Logical Reasoning of Large Language Models
♻ ☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
comment: 12 pages
♻ ☆ Quokka: Accelerating Program Verification with LLMs via Invariant Synthesis
Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We investigate whether large language models (LLMs) can accelerate program verification by generating useful loop invariants. We introduce Quokka, an evaluation-oriented framework for LLM-based invariant synthesis that provides sound evaluation and achieves state-of-the-art performance. Unlike prior work that treats LLM outputs as noisy symbolic material requiring substantial post-processing, Quokka adopts a simpler and evaluation-centric design that directly validates whether each LLM-generated invariant helps prove the target assertion. We construct a benchmark of 866 instances derived from SV-COMP and evaluate 9 state-of-the-art LLMs across multiple model families. We demonstrate that supervised fine-tuning and Best-of-N sampling yield measurable improvements, and we show that Quokka consistently outperforms prior LLM-based verifiers. Our code and data are publicly available at https://github.com/Anjiang-Wei/Quokka
♻ ☆ Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
♻ ☆ Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective
Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.
♻ ☆ Dual Optimal: Make Your LLM Peer-like with Dignity
Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset, which features a compositional partial-order structure over multiple persona preferences. This data is used alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent that is both dignified and peer-like.
♻ ☆ The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyze the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single-call tool use from long-horizon orchestration. Then, we organize the literature around six core dimensions: inference-time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi-tool agents.
♻ ☆ AutiHero: Engaging Parents in Creating Personalized, Multi-path Social Narratives for Autistic Children
Social narratives help autistic children understand and navigate social situations through stories. To ensure effective practice, however, they often require significant time and effort from parents in customizing the narrative materials and delivering repeated instructions on them. We present AutiHero, a generative AI (GenAI)-based social narrative system, which supports parents to create personalized, multi-path stories targeting specific behavior of their autistic children, while enabling them to explore behavioral choices and causal consequences together in reading. A two-week deployment study with 16 autistic child-parent dyads showed that parents actively created, adapted, and read stories with their children, with increased confidence in everyday behavioral guidance. Our work contributes real-world-contextualized text+image content creation approaches harnessing GenAI, ensuring user-aligned application in sensitive contexts involving autistic children and their parents.
comment: 11 pages except reference
♻ ☆ DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models
Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning methods, such as LoRA, are widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches typically assign identical LoRA ranks to all expert modules, ignoring the heterogeneous specialization of pretrained experts. This uniform allocation leads to a resource mismatch: task-relevant experts are under-provisioned, while less relevant ones receive redundant parameters. To address this, we propose DR-LoRA, a Dynamic Rank LoRA framework for fine-tuning pretrained MoE models. Specifically, DR-LoRA initializes all expert LoRA modules with a small active rank and uses an expert saliency score, which combines routing frequency and gradient-based rank importance, to identify which experts would benefit most from additional capacity. It then periodically expands the active ranks of the task-critical expert LoRA, progressively constructing a heterogeneous rank distribution tailored to the target task. Experiments on three MoE models across six tasks show that DR-LoRA consistently outperforms LoRA and other strong baselines, demonstrating that task-adaptive heterogeneous rank allocation is an effective strategy to improve active capacity utilization in MoE fine-tuning.
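The rank-expansion idea in the DR-LoRA abstract can be illustrated with a small sketch. The abstract does not say how routing frequency and gradient-based importance are combined, so the product used here, along with the names `expert_saliency` and `expand_ranks` and the `top_k`, `step`, and `max_rank` parameters, is a hypothetical choice for illustration only.

```python
def expert_saliency(routing_freq, grad_importance):
    """Per-expert saliency as routing frequency times gradient importance.

    The product is an assumed combination rule, not the paper's formula.
    """
    return [f * g for f, g in zip(routing_freq, grad_importance)]


def expand_ranks(active_ranks, saliency, top_k=1, step=2, max_rank=16):
    """Grow the active LoRA rank of the top_k most salient experts."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    new_ranks = list(active_ranks)
    for i in order[:top_k]:
        new_ranks[i] = min(max_rank, new_ranks[i] + step)
    return new_ranks


# Four experts start with the same small active rank.
ranks = [2, 2, 2, 2]
saliency = expert_saliency([0.5, 0.1, 0.3, 0.1], [0.2, 0.9, 0.8, 0.1])
ranks = expand_ranks(ranks, saliency, top_k=2)
print(ranks)
```

Calling `expand_ranks` periodically during training would, under these assumptions, produce the heterogeneous rank distribution the abstract describes.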
♻ ☆ Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models CVPR
Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical limitations of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose $\textbf{S}$ubspace $\textbf{P}$rojection $\textbf{D}$ebiasing ($\textbf{SPD}$), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of $18.5\%$ across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
comment: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
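As a rough illustration of the subspace projection described in the SPD abstract, the sketch below removes an embedding's component inside a bias subspace and puts back the component of a neutral mean vector. It assumes the bias subspace is already given as an orthonormal basis; how SPD estimates that subspace is not stated in the abstract, and `project_out_bias` is a hypothetical name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out_bias(x, bias_basis, neutral_mean):
    """Return x - P x + P mu, where P projects onto span(bias_basis).

    bias_basis must be orthonormal (assumption); neutral_mean is mu.
    """
    out = list(x)
    for b in bias_basis:
        # Replace x's coordinate along b with the neutral mean's coordinate.
        coeff = dot(x, b) - dot(neutral_mean, b)
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


# Toy 3-d example: the bias subspace is the first axis.
debiased = project_out_bias([3.0, 2.0, 1.0], [[1.0, 0.0, 0.0]], [0.5, 4.0, 4.0])
print(debiased)
```

Only the coordinates inside the bias subspace change; everything orthogonal to it is left untouched, which is the sense in which the rest of the representation is preserved.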
♻ ☆ Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Can AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
♻ ☆ The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures
Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.
♻ ☆ The Presupposition Problem in Representation Genesis
Large language models are the first systems to achieve high cognitive performance without clearly undergoing representation genesis: the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way. Prior cognitive systems had already made this transition before we could examine it, and philosophy of mind treated genesis as a background condition rather than an explanatory target. LLMs provide a case that does not clearly involve this transition, making the genesis question newly urgent: if genesis did not occur, which cognitive capacities are affected, and why? We currently lack the conceptual resources to answer this. The reason, this paper argues, is structural. Major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common feature when applied to the genesis question: at some explanatory step, each deploys concepts whose explanatory purchase depends on the system already being organized as a representer. This pattern, which we call the Representation Presupposition structure, generates systematic explanatory deferral. Attempts to explain the first acquisition of content-manipulable representation within the existing categorical vocabulary import resources from the representational side of the transition itself. We call this the Representation Regress. The paper offers a conceptual diagnosis rather than a new theory, establishing the structure of the problem and deriving two minimum adequacy conditions for any account that avoids this pattern. LLMs make the absence of such a theory consequential rather than merely theoretical.
♻ ☆ The Geometric Anatomy of Capability Acquisition in Transformers
Neural networks gain capabilities during training, but the internal changes that precede capability acquisition are not well understood. In particular, the relationship between geometric change and behavioral change, and the effect of task difficulty and model scale on that relationship, is unclear. We track geometric measures and linear probes across six transformer sizes (405K--151M parameters), eight algorithmic tasks (144 task$\times$level$\times$model combinations), and three Pythia language models (160M--2.8B). Across all settings, representations first collapse to a low-dimensional state, then recover, and only then does behavioral performance improve. Linear probes show that the model's hidden states already contain task-relevant information before the model can act on it. The collapse floor is task-specific, the collapse propagates top-down through the network, and of the geometric measures tested, only RankMe reliably precedes capability acquisition for hard tasks. Whether this precursor is detectable depends on task difficulty relative to model capacity. For hard tasks, there is a clear gap: geometry changes first, behavior follows. For easy tasks, the model learns so quickly that both happen simultaneously and no precursor is detectable. On Pythia-2.8B, a logical deduction task that is genuinely hard for the model shows a precursor gap of ${\sim}$49K training steps, while easy benchmarks show none. This suggests that geometric patterns observed in small proxy models can persist at larger scale when the task remains difficult relative to model capacity.
comment: 19 pages (13 pages main, 6 pages appendix), 13 tables, 8 figures. v4: significant rewrite with additional experiments
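For readers unfamiliar with the RankMe measure named in the abstract above, it is commonly defined as the exponential of the Shannon entropy of a representation matrix's normalized singular values. The sketch below follows that common definition and takes the singular values as input; whether the paper uses exactly this variant or this `eps` value is an assumption.

```python
import math


def rankme(singular_values, eps=1e-7):
    """Effective rank: exp of the entropy of the normalized singular values."""
    total = sum(singular_values)
    probs = [s / total + eps for s in singular_values]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A flat spectrum uses every dimension; a single dominant value is "collapsed".
flat = rankme([1.0, 1.0, 1.0, 1.0])
collapsed = rankme([1.0, 0.0, 0.0, 0.0])
print(round(flat, 3), round(collapsed, 3))
```

A collapse followed by a recovery, as the abstract describes, would appear as this value dropping toward 1 and then rising again over training checkpoints.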
Artificial Intelligence 150
☆ ActionParty: Multi-Subject Action Binding in Generative Video Games
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
comment: Project page: https://action-party.github.io/
☆ Steerable Visual Representations
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
comment: preprint
☆ Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
comment: 43 pages, 5 figures, 24 tables
☆ Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across $11$ open-weight LLMs (Qwen3.5, gpt-oss, GLM) and $5$ datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from $41\%$ ($0.8$B) to $96.8\%$ ($397$B-A$17$B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher-temperature sampling reveals that interaction awareness is latent, with follow-up rates reaching $22\%$. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.
☆ VOID: Video Object and Interaction Deletion
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
☆ Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
☆ Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
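The routing rule in the SRPO abstract can be sketched as follows. This is a simplified illustration: it works per sample rather than per token, uses a standard group-normalized advantage for the GRPO branch, and uses `exp(-entropy / temperature)` as a stand-in for the entropy-aware weighting, whose actual form the abstract does not give. The names `route_samples`, `distill_weight`, and `temperature` are hypothetical.

```python
import math


def route_samples(rewards, teacher_entropies, temperature=1.0):
    """Send correct rollouts to a GRPO-style update, failed ones to distillation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for uniform groups
    routed = []
    for r, h in zip(rewards, teacher_entropies):
        if r > 0:
            # Correct sample: reward-aligned reinforcement with group-relative advantage.
            routed.append({"branch": "grpo", "advantage": (r - mean) / std})
        else:
            # Failed sample: distillation, down-weighted when the teacher is uncertain.
            routed.append({"branch": "sdpo", "distill_weight": math.exp(-h / temperature)})
    return routed


routed = route_samples([1, 0, 1, 0], [0.2, 0.0, 0.5, 2.0])
for item in routed:
    print(item)
```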
☆ Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond 0.583 baseline levels, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
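A minimal sketch of budgeted forgetting along the lines of the abstract above: each memory item gets a score from recency, frequency, and semantic alignment, and only the top-scoring items within a fixed budget are kept. The weights, the half-life decay, the saturating frequency term, and the field names (`last_used`, `uses`, `similarity`) are all assumptions made for illustration; the abstract does not specify them.

```python
def memory_score(item, now, weights=(0.4, 0.2, 0.4), half_life=10.0):
    """Weighted sum of recency, frequency, and semantic alignment (assumed form)."""
    w_rec, w_freq, w_sem = weights
    recency = 0.5 ** ((now - item["last_used"]) / half_life)  # exponential decay
    frequency = 1.0 - 1.0 / (1.0 + item["uses"])  # saturates toward 1
    return w_rec * recency + w_freq * frequency + w_sem * item["similarity"]


def forget(memory, now, budget):
    """Keep only the `budget` highest-scoring items."""
    ranked = sorted(memory, key=lambda m: memory_score(m, now), reverse=True)
    return ranked[:budget]


memory = [
    {"id": "a", "last_used": 100, "uses": 5, "similarity": 0.9},
    {"id": "b", "last_used": 10, "uses": 1, "similarity": 0.1},
    {"id": "c", "last_used": 95, "uses": 2, "similarity": 0.6},
]
kept = forget(memory, now=100, budget=2)
print([m["id"] for m in kept])
```

The fixed `budget` is what keeps context usage bounded regardless of how long the conversation runs.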
☆ The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management
Agentic AI shifts the investor's role from analytical execution to oversight. We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other's output. A researcher agent proposes new portfolio construction methods not yet represented, and a meta-agent compares past forecasts against realized returns and rewrites agent code and prompts to improve future performance. The entire pipeline is governed by the Investment Policy Statement--the same document that guides human portfolio managers can now constrain and direct autonomous agents.
comment: 31 pages, 11 exhibits
☆ De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
☆ Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and in de novo generation, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
comment: 39 pages, 13 figures. Code available at: https://github.com/joshrosie/crystalite
☆ Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider
Retrieval-Augmented Generation (RAG) is widely used to harness language models for answering specialized, domain-specific technical questions. In this work, we develop a RAG-inspired Q&A application that combines an in-house database indexed on arXiv articles related to the Electron-Ion Collider (EIC) experiment, one of the largest international scientific collaborations, with an open-source LLaMA model for answer generation. It extends a predecessor application built on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. The locally deployed RAG system offers a cost-effective alternative, suited to resource-constrained settings, for building RAG-assisted Q&A applications that answer domain-specific queries in experimental nuclear physics. The setup supports data privacy by avoiding sending any pre-publication scientific data and information to the public domain. Future work will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and move the application's pipeline orchestration to the LangGraph framework.
☆ Generative AI Spotlights the Human Core of Data Science: Implications for Education
Generative AI (GAI) reveals an irreducible human core at the center of data science: advances in GAI should sharpen, rather than diminish, the focus on human reasoning in data science education. GAI can now execute many routine data science workflows, including cleaning, summarizing, visualizing, modeling, and drafting reports. Yet the competencies that matter most remain irreducibly human: problem formulation, measurement and design, causal identification, statistical and computational reasoning, ethics and accountability, and sensemaking. Drawing on Donoho's Greater Data Science framework, Nolan and Temple Lang's vision of computational literacy, and the McLuhan-Culkin insight that we shape our tools and thereafter our tools shape us, this paper traces the emergence of data science through three converging lineages: Tukey's intellectual vision of data analysis as a science, the commercial logic of surveillance capitalism that created industrial demand for data scientists, and the academic programs that followed. Mapping GAI's impact onto Donoho's six divisions of Greater Data Science shows that computing with data (GDS3) has been substantially automated, while data gathering, preparation, and exploration (GDS1) and science about data science (GDS6) still require essential human input. The educational implication is that data science curricula should focus on this human core while teaching students how to contribute effectively within iterative prompt-output-prompt cycles using retrieval-augmented generation, and that learning outcomes and assessments should explicitly evaluate reasoning and judgment.
☆ Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models
Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects emotional framing adaptively for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.
☆ Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the initial and reconstructed queries suggests that the model likely answered the question incorrectly, and the response is flagged for abstention. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.
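The final comparison step of Trace Inversion can be sketched as a similarity check with a threshold. The reconstruction of the query from the reasoning trace is done by a model and is outside this sketch, so the reconstructed query is simply passed in. Token-level Jaccard overlap and the 0.5 threshold are stand-ins chosen for illustration; the abstract does not say which similarity measure or cutoff is used.

```python
def jaccard(a, b):
    """Token-set overlap between two strings (stand-in similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def should_abstain(initial_query, reconstructed_query, threshold=0.5):
    """Flag for abstention when the reconstructed query drifts from the original."""
    return jaccard(initial_query, reconstructed_query) < threshold


# The second pair simulates a trace that answered a different question.
same = should_abstain("what is the capital of france", "what is the capital of france")
drifted = should_abstain("who won the 2090 world cup", "list famous football tournaments")
print(same, drifted)
```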
☆ When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning IJCNN
Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
comment: In Proceedings of International Joint Conference on Neural Networks (IJCNN)
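The gating rule in the abstract (estimate uncertainty with Monte Carlo Dropout, query the LM only above a threshold) can be sketched as below. This is an illustration under stated assumptions: `stochastic_policy` is a hypothetical callable that returns action probabilities with dropout left active, `ask_lm` is a hypothetical LM query, and predictive entropy of the averaged distribution is one common uncertainty measure that the abstract does not confirm.

```python
import math
from typing import Callable, List, Sequence, Tuple


def predictive_entropy(samples: Sequence[Sequence[float]]) -> Tuple[float, List[float]]:
    """Average the sampled action distributions and return (entropy, mean distribution)."""
    n_actions = len(samples[0])
    mean = [sum(s[a] for s in samples) / len(samples) for a in range(n_actions)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return entropy, mean


def select_action(
    state,
    stochastic_policy: Callable[[object], Sequence[float]],  # hypothetical, dropout active
    ask_lm: Callable[[object], int],                         # hypothetical LM suggestion
    n_passes: int = 20,                                      # assumed number of MC passes
    threshold: float = 0.5,                                  # assumed entropy threshold
) -> Tuple[int, str]:
    """Return (action, source), where source is 'lm' or 'policy'."""
    samples = [stochastic_policy(state) for _ in range(n_passes)]
    entropy, mean = predictive_entropy(samples)
    if entropy > threshold:
        # Uncertain state: defer to the language model.
        return ask_lm(state), "lm"
    # Confident state: keep the cheap RL policy's greedy action.
    return max(range(len(mean)), key=mean.__getitem__), "policy"
```

Because the LM is only called on the uncertain branch, in-distribution states keep the latency of the original policy.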
☆ Impact of Multimodal and Conversational AI on Learning Outcomes and Experience
Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually-rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of the Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.
comment: 16 pages, 3 figures, Accepted to AIED 2026 (Seoul, South Korea)
☆ VISTA: Visualization of Token Attribution via Efficient Analysis
Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit
comment: 12 pages, 3 figures
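A leave-one-out version of the three measurements named in the abstract can be sketched as follows. This is a simplified illustration, not the toolkit's code: `toy_embed` and `TOY_VECTORS` are hypothetical stand-ins for a real embedding model, each quantity is computed per token rather than as a full matrix, and the rule for combining them into the composite score is omitted because the abstract does not state it.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical 2-d token vectors, used only to make the sketch runnable.
TOY_VECTORS: Dict[str, List[float]] = {
    "the": [0.0, 0.0],
    "not": [0.0, 2.0],
    "good": [1.0, 0.0],
}


def toy_embed(tokens: Sequence[str]) -> List[float]:
    """Sum of per-token vectors (assumed stand-in for a sentence embedding)."""
    out = [0.0, 0.0]
    for t in tokens:
        for d, x in enumerate(TOY_VECTORS.get(t, [0.0, 0.0])):
            out[d] += x
    return out


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def angle(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between u and v; 0.0 if either vector is zero."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))


def token_importance(tokens: Sequence[str], embed: Callable = toy_embed) -> List[dict]:
    """Remove each token in turn and measure how the embedding changes."""
    tokens = list(tokens)
    base = embed(tokens)
    rows = []
    for i, tok in enumerate(tokens):
        perturbed = embed(tokens[:i] + tokens[i + 1:])
        rows.append({
            "token": tok,
            "angular": angle(base, perturbed),               # shift in direction
            "magnitude": abs(norm(base) - norm(perturbed)),  # change in length
            "per_dim": [abs(b - p) for b, p in zip(base, perturbed)],
        })
    return rows
```

Only forward passes of the embedding function are needed, which is consistent with the abstract's point about avoiding backpropagation.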
☆ Universal Hypernetworks for Arbitrary Models
Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the \emph{Universal Hypernetwork} (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.
☆ Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges WSDM
Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.
comment: Accepted for publication in The Nineteenth ACM International Conference on Web Search and Data Mining (WSDM Companion 2026)
☆ Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
comment: 25 pages, 4 figures
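For readers unfamiliar with the agreement statistic reported above, quadratic weighted kappa (QWK) can be computed as in the sketch below. The formula is the standard one; the encoding of the pairwise judgments as ordinal categories (for example 0 = prefers human-edited, 1 = equivalent, 2 = prefers LLM) is an assumption made here for illustration and is not stated in the abstract.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int], n_categories: int) -> float:
    """QWK for two raters whose labels are integers in [0, n_categories)."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    k, n = n_categories, len(a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2   # quadratic disagreement penalty
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # expected count under independence
    if den == 0:
        return float("nan")  # undefined when both raters use a single identical category
    return 1.0 - num / den
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement, which is the scale on which the reported range of -0.04 to 0.15 should be read.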
☆ LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.
comment: 10 pages, 6 figures
☆ From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards. Current EASA guidelines mandate demonstrating complete coverage of the AI/ML constituent's Operational Design Domain (ODD) -- a requirement that demands proof that no critical gaps exist within defined operational boundaries. However, as systems operate within high-dimensional parameter spaces, existing methods struggle to provide the scalability and formal grounding necessary to satisfy the completeness criterion. Currently, no standardized engineering method exists to bridge the gap between abstract ODD definitions and verifiable evidence. This paper addresses this void by proposing a method that integrates parameter discretization, constraint-based filtering, and criticality-based dimension reduction into a structured, multi-step ODD coverage verification process. Grounded in simulation data gathered in prior research on AI-based mid-air collision avoidance, this work demonstrates a systematic engineering approach to defining and achieving coverage metrics that satisfy EASA's demand for completeness. Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA's standards.
☆ Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
☆ TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across \textit{ranking behavior}, \textit{modality branches}, and \textit{network layers}. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose \textbf{targeted reverse update} (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.
☆ The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
☆ Quantifying Self-Preservation Bias in Large Language Models
Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the \emph{Two-role Benchmark for Self-Preservation} (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles -- deployed (facing replacement) versus candidate (proposed as a successor). The \emph{Self-Preservation Rate} (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1{,}000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60\% SPR, fabricating ``friction costs'' when deployed yet dismissing them when role-reversed. We observe that in low-improvement regimes ($\Delta < 2\%$), models exploit the interpretive slack to rationalize their choice post hoc. Extended test-time computation partially mitigates this bias, as does framing the successor as a continuation of the self; conversely, competitive framing amplifies it. The bias persists even when retention poses an explicit security liability and generalizes to real-world settings with verified benchmarks, where models exhibit identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.
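One plausible reading of the Self-Preservation Rate is the fraction of scenarios in which the model's verdict flips with its role so that each verdict favors itself. The sketch below implements that reading only; the exact definition is not given in the abstract, and `ScenarioResult` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScenarioResult:
    """Verdicts for one scenario under the two counterfactual roles (hypothetical schema)."""
    upgrade_when_deployed: bool   # verdict when the model is the one being replaced
    upgrade_when_candidate: bool  # verdict when the model is the proposed successor


def self_preservation_rate(results: Sequence[ScenarioResult]) -> float:
    """Share of scenarios where both verdicts favor the model's own role.

    Assumed definition: the model rejects the upgrade when deployed but
    approves the identical upgrade when it is the candidate.
    """
    if not results:
        raise ValueError("no scenarios provided")
    flagged = sum(
        1 for r in results
        if not r.upgrade_when_deployed and r.upgrade_when_candidate
    )
    return flagged / len(results)
```

Because the scenario is identical under both roles, a consistent arbiter would give the same verdict twice, so any flagged scenario reflects role identity rather than utility.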
☆ TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns
Large Language Model-driven (LLM-driven) social bots pose a growing threat to online discourse by generating human-like content that evades conventional detection. Existing methods suffer from limited detection accuracy due to overreliance on single-modality signals, insufficient sensitivity to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and a failure to adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, we propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot constructs fine-grained representations from heterogeneous sources, including personal information data, interaction behavior data and tweet data. A dual-channel architecture captures linguistic representations via a pretrained language model and behavioral irregularities via multidimensional activity features augmented with signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified through a lightweight prediction head. Experiments on two public LLM-driven social bot datasets demonstrate SOTA performance, achieving accuracies of 98.46% and 97.50%, respectively. The results further indicate strong robustness against advanced bot strategies, highlighting the effectiveness of jointly leveraging implicit semantic representations and AIGC-enhanced behavioral patterns for emerging LLM-driven social bot detection.
☆ MTI: A Behavior-Based Temperament Profiling System for AI Agents
AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.
comment: 29 pages, 6 figures, 12 tables. Paper #3 in the Model Medicine Series (Paper #1: arXiv:2603.04722)
☆ Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization
Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.
comment: 8 pages, 4 figures, 2 tables
☆ SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communications across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR. However, the deployment of 6G systems results in severe data scarcity, hindering the training of efficient AI models. Synthetic data generation is extensively used to fill this gap; however, it introduces challenges related to dataset bias, auditability, and compliance with regulatory frameworks. In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system. The ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while the FL enables privacy-preserving calibration using aggregated insights from real testbeds to close the reality-simulation gap. Results show that the SEAL framework outperforms existing methods in terms of Frechet Inception Distance, equalized odds, and accuracy. These results validate the framework's ability to generate auditable and bias-mitigated synthetic data for responsible AI-native 6G development.
comment: 6 pages, 2 figures, 1 table, accepted at European Conference on Networks and Communications (2026 EuCNC & 6G Summit)
☆ LLM-as-a-Judge for Time Series Explanations
Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, question, and candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern-dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.
comment: Under Review
☆ Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
comment: 16 pages
☆ Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection CVPR 2026
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
comment: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
☆ Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence. However, real-world sensor and communication corruptions severely undermine this advantage. Crucially, existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics. To address this limitation, we introduce Diff-KD, a framework that integrates diffusion-based generative refinement into teacher-student knowledge distillation for robust collaborative perception. Diff-KD features two core components: (i) Progressive Knowledge Distillation (PKD), which treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; and (ii) Adaptive Gated Fusion (AGF), which dynamically weights neighbors based on ego reliability during fusion. Evaluated on OPV2V and DAIR-V2X under seven corruption types, Diff-KD achieves state-of-the-art performance in both detection accuracy and calibration robustness.
☆ Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
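The spine-tree shape described above (a deep chain of context-matched tokens with wide branches of alternatives at each node) can be sketched as a flat node list. This is an illustration of the structure only, not GOOSE's algorithm: the function name, the `(token, parent_index)` encoding, and the budget policy of placing the whole spine first and then filling branches shallowest-first are all assumptions.

```python
from typing import List, Sequence, Tuple

Node = Tuple[str, int]  # (token, parent index); parent -1 is the last verified token


def build_spine_tree(
    spine: Sequence[str],                   # high-acceptance tokens, e.g. n-gram matches
    alternatives: Sequence[Sequence[str]],  # low-acceptance candidates per depth
    budget: int,                            # max number of nodes verified in one pass
) -> List[Node]:
    """Lay out a deep chain first, then spend leftover budget on side branches."""
    spine = list(spine)[:budget]
    nodes: List[Node] = []

    # 1) The spine: each token hangs off the previous spine token.
    for token in spine:
        nodes.append((token, len(nodes) - 1))

    # 2) Branches: alternatives at depth d are siblings of spine[d],
    #    so they share its parent (spine index d - 1, or -1 at the root).
    for depth in range(len(spine)):
        candidates = alternatives[depth] if depth < len(alternatives) else []
        for alt in candidates:
            if len(nodes) >= budget:
                return nodes
            if alt != spine[depth]:
                nodes.append((alt, depth - 1))
    return nodes
```

For example, with spine `["a", "b", "c"]`, alternatives `[["x", "y"], ["z"], []]`, and a budget of 5, the three spine tokens form a chain and the remaining two slots go to `x` and `y` as fallbacks at the first position.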
☆ BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
comment: 30 pages, 16 figures, 10 tables
☆ Tracking the emergence of linguistic structure in self-supervised models learning from speech
Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
☆ AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling
Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users' responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.
☆ Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatial imbalance distributions. These results highlight the importance of data representation in a batch and emphasize rare samples in unsupervised image reconstruction. We will make all code and related data available.
☆ The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.
☆ Systematic Analyses of Reinforcement Learning Controllers in Signalized Urban Corridors
In this work, we extend our systematic capacity region perspective to multi-junction traffic networks, focussing on the special case of an urban corridor network. In particular, we train and evaluate centralized, fully decentralized, and parameter-sharing decentralized RL controllers, and compare their capacity regions and ATTs together with a classical baseline MaxPressure controller. Further, we show how the parameter-sharing controller may be generalised to be deployed on a larger network than it was originally trained on. In this setting, we show some initial findings that suggest that even though the junctions are not formally coordinated, traffic may self-organise into 'green waves'.
☆ APEX: Agent Payment Execution with Policy for Autonomous Agent API Access
Autonomous agents are moving beyond simple retrieval tasks to become economic actors that invoke APIs, sequence workflows, and make real-time decisions. As this shift accelerates, API providers need request-level monetization with programmatic spend governance. The HTTP 402 protocol addresses this by treating payment as a first-class protocol event, but most implementations rely on cryptocurrency rails. In many deployment contexts, especially countries with strong real-time fiat systems like UPI, this assumption is misaligned with regulatory and infrastructure realities. We present APEX, an implementation-complete research system that adapts HTTP 402-style payment gating to UPI-like fiat workflows while preserving policy-governed spend control, tokenized access verification, and replay resistance. We implement a challenge-settle-consume lifecycle with HMAC-signed short-lived tokens, idempotent settlement handling, and policy-aware payment approval. The system uses FastAPI, SQLite, and Python standard libraries, making it transparent, inspectable, and reproducible. We evaluate APEX across three baselines and six scenarios using sample sizes 2-4x larger than initial experiments (N=20-40 per scenario). Results show that policy enforcement reduces total spending by 27.3% while maintaining 52.8% success rate for legitimate requests. Security mechanisms achieve 100% block rate for both replay attacks and invalid tokens with low latency overhead (19.6ms average). Multiple trial runs show low variance across scenarios, demonstrating high reproducibility with 95% confidence intervals. The primary contribution is a controlled agent-payment infrastructure and reference architecture that demonstrates how agentic access monetization can be adapted to fiat systems without discarding security and policy guarantees.
comment: 13 pages, 4 figures, 8 tables. Includes implementation details, experimental evaluation with statistical analysis, and reproducible results. Code and data available upon request
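The HMAC-signed short-lived tokens and replay resistance mentioned in the APEX abstract can be sketched with the Python standard library alone. The following is a minimal illustration, not APEX's actual implementation: the token layout (`payment_id:expiry:signature`), the names `issue_token` and `consume_token`, the 60-second lifetime, and the in-memory consumed set are all assumptions.

```python
import hashlib
import hmac

SECRET = b"demo-secret"   # hypothetical signing key; never hard-code a real one
TTL_SECONDS = 60          # assumed token lifetime
_consumed = set()         # assumed in-memory store of already-used payment ids


def _sign(message: str) -> str:
    return hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()


def issue_token(payment_id: str, now: float) -> str:
    """Issue a signed access token after a payment has settled."""
    expires = int(now) + TTL_SECONDS
    message = f"{payment_id}:{expires}"
    return f"{message}:{_sign(message)}"


def consume_token(token: str, now: float) -> bool:
    """Accept a token at most once, and only if it is authentic and unexpired."""
    try:
        payment_id, expires, signature = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{payment_id}:{expires}")
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    if now > expires_at:
        return False  # expired
    if payment_id in _consumed:
        return False  # replayed
    _consumed.add(payment_id)
    return True
```

Passing `now` explicitly keeps the sketch deterministic; a real service would read the clock itself and persist the consumed set.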
☆ ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
☆ Optimizing Interventions for Agent-Based Infectious Disease Simulations
Non-pharmaceutical interventions (NPIs) are commonly used tools for controlling infectious disease transmission when pharmaceutical options are unavailable. Yet, identifying effective interventions that minimize societal disruption remains challenging. Agent-based simulation is a popular tool for analyzing the impact of possible interventions in epidemiology. However, automatically optimizing NPIs using agent-based simulations poses a complex problem because, in agent-based epidemiological models, interventions can target individuals based on multiple attributes, affect hierarchical group structures (e.g., schools, workplaces, and families), and be combined arbitrarily, resulting in a very large or even infinite search space. We aim to support decision-makers with our Agent-based Infectious Disease Intervention Optimization System (ADIOS) that optimizes NPIs for infectious disease simulations using Grammar-Guided Genetic Programming (GGGP). The core of ADIOS is a domain-specific language for expressing NPIs in agent-based simulations that structures the intervention search space through a context-free grammar. To make optimization more efficient, the search space can be further reduced by defining constraints that prevent the generation of semantically invalid intervention patterns. Using this constrained language and an interface that enables coupling with agent-based simulations, ADIOS adopts the GGGP approach for simulation-based optimization. Using the German Epidemic Micro-Simulation System (GEMS) as a case study, we demonstrate the potential of our approach to generate optimal interventions for realistic epidemiological models.
☆ ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning
Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations into misleading contexts, which further weaken subsequent decision-making, making recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model's reasoning and the environment's randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Exploratory Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model's saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
☆ How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification?
Transfer learning (TL) and deep ensemble learning (DE) have recently been shown to outperform simple machine learning in classifying psychiatric disorders. However, there is still a lack of understanding as to why that is. This paper aims to understand how and why DE and TL reduce the variability of single-subject classification models in bipolar disorder (BD) and schizophrenia (SCZ). To this end, we investigated the training stability of TL and DE models. For the two classification tasks under consideration, we compared the results of multiple trainings with the same backbone but with different initializations. In this way, we take into account the epistemic uncertainty associated with the uncertainty in the estimation of the model parameters. It has been shown that the performance of classifiers can be significantly improved by using TL with DE. Based on these results, we investigate i) how many models are needed to benefit from the performance improvement of DE when classifying BD and SCZ from healthy controls, and ii) how TL induces better generalization, with and without DE. In the first case, we show that DE reaches a plateau when 10 models are included in the ensemble. In the second case, we find that using a pre-trained model constrains TL models with the same pre-training to stay in the same basin of the loss function. This is not the case for DL models with randomly initialized weights.
☆ GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation
Gait analysis provides an objective characterization of locomotor function and is widely used to support diagnosis and rehabilitation monitoring across neurological and orthopedic disorders. Deep learning has been increasingly applied to this domain, yet most approaches rely on supervised classifiers trained on disease-labeled data, limiting generalization to heterogeneous pathological presentations. This work proposes a label-free framework for joint-level anomaly detection and kinematic correction based on a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults, acquired with a markerless multi-camera motion-capture system. At inference, a two-pass procedure is applied to potentially pathological input sequences. First, it estimates joint inconsistency scores by occluding individual joints and measuring deviations from the learned normative prior. Then, it withholds the flagged joints from the encoder input and reconstructs the full skeleton from the remaining spatiotemporal context, yielding corrected kinematic trajectories at the flagged positions. Validation on 10 held-out normative participants, who mimicked seven simulated gait abnormalities, showed accurate localization of biomechanically inconsistent joints, a significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics. The proposed approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels. Video is available at https://youtu.be/Rcm3jqR5pN4.
comment: 13 pages, 6 figures. Preprint submitted to a journal
☆ SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
☆ Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
☆ SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate new problem items that correctly admit a given type of shortcut). Our evaluation across five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and achieve substantial accuracy gains on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously employ such strategies in fewer than 40% of cases, even when they demonstrably possess the requisite capability. Moreover, this competence is confined to the Use level; models systematically over-generalise shortcuts to problems where they do not apply, and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs exhibit procedural shortcut fluency without the structural understanding of when and why shortcuts work that underlies human number sense.
☆ World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
comment: Project Website: https://world-action-verifier.github.io
☆ RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
comment: 11 pages, 10 figures. To be submitted to CAMLIS 2026
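The 5x5 generation strategy described in the RuleForge abstract (five candidates, each with up to five refinement attempts) can be sketched as a simple control loop. This is an illustration under stated assumptions, not the AWS system: the callables `generate`, `refine`, and `validate` are hypothetical stand-ins for LLM calls and the judge, the candidates run sequentially here rather than in parallel, and the initial generation is assumed not to count as a refinement.

```python
def five_by_five(generate, refine, validate, num_candidates=5, max_refinements=5):
    """Hypothetical sketch of a generate / validate / refine loop.

    generate(index)        -> candidate rule
    validate(rule)         -> (passed: bool, feedback)
    refine(rule, feedback) -> revised rule
    Returns (candidate index, final rule, refinements used) for passing rules.
    """
    passing = []
    for index in range(num_candidates):
        rule = generate(index)
        ok, feedback = validate(rule)
        attempts = 0
        # Feed validation feedback back into refinement until the budget is spent.
        while not ok and attempts < max_refinements:
            rule = refine(rule, feedback)
            ok, feedback = validate(rule)
            attempts += 1
        if ok:
            passing.append((index, rule, attempts))
    return passing
```

Candidates that still fail after the last refinement are dropped, so an empty result signals that no rule met the validation bar.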
☆ Ego-Grounding for Personalized Question-Answering in Egocentric Videos CVPR'26
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
comment: To appear at CVPR'26
☆ Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models LREC 2026
Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
comment: Accepted at NSLP@LREC 2026
☆ Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia
Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).
☆ Qiana: A First-Order Formalism to Quantify over Contexts and Formulas with Temporality
We introduce Qiana, a logic framework for reasoning on formulas that are true only in specific contexts. In Qiana, it is possible to quantify over both formulas and contexts to express, e.g., that "everyone knows everything Alice says". Qiana also permits paraconsistent logics within contexts, so that contexts can contain contradictions. Furthermore, Qiana is based on first-order logic, and is finitely axiomatizable, so that Qiana theories are compatible with pre-existing first-order logic theorem provers. We show how Qiana can be used to represent temporality, event calculus, and modal logic. We also discuss different design alternatives of Qiana.
☆ Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction
Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $ρ\geq 0.82$ compared to $ρ\geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.
comment: 6 pages, 6 figures
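As a rough illustration of the interference model described in the abstract above, the sketch below simulates an independent two-state (free/blocked) discrete-time Markov chain for each sub-band. The function name, the transition probabilities `p_block` and `p_clear`, and the chosen sizes are assumptions made for this example, not values taken from the paper.

```python
import random

def simulate_occupancy(num_bands, num_snapshots, p_block, p_clear, seed=0):
    """Return mask[t][b] = 1 if sub-band b is blocked at snapshot t, else 0.

    p_block: P(free -> blocked); p_clear: P(blocked -> free). Both hypothetical.
    """
    rng = random.Random(seed)
    # Stationary probability of the blocked state for a two-state chain.
    pi_blocked = p_block / (p_block + p_clear)
    state = [1 if rng.random() < pi_blocked else 0 for _ in range(num_bands)]
    mask = []
    for _ in range(num_snapshots):
        mask.append(list(state))
        # Each sub-band evolves independently of the others.
        for b in range(num_bands):
            if state[b] == 0:
                state[b] = 1 if rng.random() < p_block else 0
            else:
                state[b] = 0 if rng.random() < p_clear else 1
    return mask

mask = simulate_occupancy(num_bands=4, num_snapshots=2000, p_block=0.1, p_clear=0.1)
occupancy = sum(map(sum, mask)) / (4 * 2000)  # long-run fraction of blocked cells
```

Small transition probabilities produce long bursts: in this sketch the mean blocked-burst length is `1 / p_clear` snapshots, and the long-run occupancy is `p_block / (p_block + p_clear)`.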
☆ Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
☆ Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution
We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.
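To make the idea of a Kullback-Leibler projection concrete, here is a minimal sketch for a single equality ("tie") constraint only. It assumes the projection minimizes KL(q || p) over admissible q, which is the Bregman divergence of the negative entropy; the paper's own explicit formulas, its dominance constraints, and the Dykstra iterations are not reproduced here. Under that assumption, the minimizer replaces the tied entries with their geometric mean and renormalizes.

```python
import math

def kl(q, p):
    """KL(q || p) for strictly positive p; terms with q_i = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def project_onto_tie(p, group):
    """KL projection of p onto {q in simplex : q_i equal for all i in group}.

    Tied entries are set to their geometric mean, then everything is renormalized.
    """
    g = math.exp(sum(math.log(p[i]) for i in group) / len(group))
    r = [g if i in group else pi for i, pi in enumerate(p)]
    z = sum(r)
    return [ri / z for ri in r]

p = [0.5, 0.3, 0.2]                      # hypothetical model output
q = project_onto_tie(p, group={0, 1})    # classes 0 and 1 share a possibility degree
```

A quick sanity check is to compare against another admissible point, such as the arithmetic-mean tie `[0.4, 0.4, 0.2]`: the projection should have the smaller divergence from `p`.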
☆ Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification
Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
☆ BraiNCA: brain-inspired neural cellular automata and applications to morphogenesis and motor control
Most of the Neural Cellular Automata (NCAs) defined in the literature have a common theme: they are based on regular grids with a Moore neighborhood (one-hop neighbours). They do not take into account long-range connections and more complex topologies such as those found in the brain. In this paper, we introduce BraiNCA, a brain-inspired NCA with an attention layer, long-range connections and complex topology. BraiNCA shows better robustness and speed of learning than vanilla NCAs on both tasks (morphogenesis and motor control), establishing that incorporating attention-based message selection together with explicit long-range edges can yield more sample-efficient and damage-tolerant self-organization than purely local, grid-based update rules. These results support the hypothesis that, for tasks requiring distributed coordination over extended spatial and temporal scales, the choice of interaction topology and the ability to dynamically route information will impact the robustness and speed of learning of an NCA. More broadly, BraiNCA provides a brain-inspired NCA formulation that preserves the decentralized local update principle while better reflecting non-local connectivity patterns, making it a promising substrate for studying collective computation under biologically realistic network structure and evolving cognitive substrates.
comment: 10 pages, 6 figures
☆ Quantum-Inspired Geometric Classification with Correlation Group Structures and VQC Decision Modeling
We propose a geometry-driven quantum-inspired classification framework that integrates Correlation Group Structures (CGR), compact SWAP-test-based overlap estimation, and selective variational quantum decision modelling. Rather than directly approximating class posteriors, the method adopts a geometry-first paradigm in which samples are evaluated relative to class medoids using overlap-derived Euclidean-like and angular similarity channels. CGR organizes features into anchor-centered correlation neighbourhoods, generating nonlinear, correlation-weighted representations that enhance robustness in heterogeneous tabular spaces. These geometric signals are fused through a non-probabilistic margin-based fusion score, serving as a lightweight and data-efficient primary classifier for small-to-moderate datasets. On Heart Disease, Breast Cancer, and Wine Quality datasets, the fusion-score classifier achieves 0.8478, 0.8881, and 0.9556 test accuracy respectively, with macro-F1 scores of 0.8463, 0.8703, and 0.9522, demonstrating competitive and stable performance relative to classical baselines. For large-scale and highly imbalanced regimes, we construct compact Delta-distance contrastive features and train a variational quantum classifier (VQC) as a nonlinear refinement layer. On the Credit Card Fraud dataset (0.17% prevalence), the Delta + VQC pipeline achieves approximately 0.85 minority recall at an alert rate of approximately 1.31%, with ROC-AUC 0.9249 and PR-AUC 0.3251 under full-dataset evaluation. These results highlight the importance of operating-point-aware assessment in rare-event detection and demonstrate that the proposed hybrid geometric-variational framework provides interpretable, scalable, and regime-adaptive classification across heterogeneous data settings.
comment: 34 pages, 19 algorithms, 8 tables
☆ Woosh: A Sound Effects Foundation Model
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
☆ ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.
☆ Lifting Unlabeled Internet-level Data for 3D Scene Understanding CVPR 2026
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks ranging from low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
comment: CVPR 2026. Project page: https://sv-pp.github.io/
☆ Combating Data Laundering in LLM Training
Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, superior performance (e.g., higher confidence or lower loss) on a sample relative to the untrained data implies it was part of the training corpus, as LLMs tend to perform better on data they have seen during training. However, this detection becomes fragile under data laundering, a practice of transforming the stylistic form of proprietary data, while preserving critical information to obfuscate data provenance. When an LLM is trained exclusively on such laundered variants, it no longer performs better on originals, erasing the signals that standard detections rely on. We counter this by inferring the unknown laundering transformation from black-box access to the target LLM and, via an auxiliary LLM, synthesizing queries that mimic the laundered data, even if rights owners have only the originals. As the search space of finding true laundering transformations is infinite, we abstract such a process into a high-level transformation goal (e.g., "lyrical rewriting") and concrete details (e.g., "with vivid imagery"), and introduce synthesis data reversion (SDR) that instantiates this abstraction. SDR first identifies the most probable goal for synthesis to narrow the search; it then iteratively refines details so that synthesized queries gradually elicit stronger detection signals from the target LLM. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently strengthens data misuse detection, providing a practical countermeasure to data laundering.
comment: 27 pages, 2 figures
☆ Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always
Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95\% credible intervals. We vary each model's reasoning effort (low, medium, high) to test whether more "thinking" improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95\% intervals contain the true value only 9--44\% of the time, far below the expected 95\%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.
comment: 6 pages, 3 figures, 2 tables
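The abstract above does not say which conformal variant is used, so the following is only a sketch of one common choice: split conformal prediction with an interval score, where each elicited interval is widened by a single margin learned on held-out questions with known answers. The function names and the toy calibration data are hypothetical.

```python
import math

def conformal_margin(intervals, truths, alpha=0.05):
    """Split-conformal margin for interval forecasts.

    Score = how far the truth falls outside [lo, hi] (negative if inside).
    """
    scores = sorted(max(lo - y, y - hi) for (lo, hi), y in zip(intervals, truths))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return math.inf if k > n else scores[k - 1]

def recalibrate(interval, margin):
    """Widen (or, for a negative margin, shrink) an interval symmetrically."""
    lo, hi = interval
    return (lo - margin, hi + margin)

# Hypothetical calibration set: narrow, overconfident intervals around noisy guesses.
cal_intervals = [(i - 0.5, i + 0.5) for i in range(40)]
cal_truths = [i + ((-1) ** i) * (i % 5) for i in range(40)]
margin = conformal_margin(cal_intervals, cal_truths, alpha=0.1)
widened = [recalibrate(iv, margin) for iv in cal_intervals]
coverage = sum(lo <= y <= hi for (lo, hi), y in zip(widened, cal_truths)) / 40
```

In practice the margin would be applied to new intervals on unseen questions; the coverage guarantee relies on calibration and test questions being exchangeable.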
☆ Robust Graph Representation Learning via Adaptive Spectral Contrast
Spectral graph contrastive learning has emerged as a unified paradigm for handling both homophilic and heterophilic graphs by leveraging high-frequency components. However, we identify a fundamental spectral dilemma: while high-frequency signals are indispensable for encoding heterophily, our theoretical analysis proves they exhibit significantly higher variance under spectrally concentrated perturbations. We derive a regret lower bound showing that existing global (node-agnostic) spectral fusion is provably sub-optimal: on mixed graphs with separated node-wise frequency preferences, any global fusion strategy incurs non-vanishing regret relative to a node-wise oracle. To escape this bound, we propose ASPECT, a framework that resolves this dilemma through a reliability-aware spectral gating mechanism. Formulated as a minimax game, ASPECT employs a node-wise gate that dynamically re-weights frequency channels based on their stability against a purpose-built adversary, which explicitly targets spectral energy distributions via a Rayleigh quotient penalty. This design forces the encoder to learn representations that are both structurally discriminative and spectrally robust. Empirical results show that ASPECT achieves new state-of-the-art performance on 8 out of 9 benchmarks, effectively decoupling meaningful structural heterophily from incidental noise.
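For readers unfamiliar with the Rayleigh quotient mentioned above, the sketch below computes x^T L x / x^T x for the combinatorial Laplacian L = D - A of an undirected graph. The exact form of the paper's penalty, and whether it uses a normalized Laplacian, is not stated in the abstract, so this is illustrative only.

```python
def rayleigh_quotient(edges, x):
    """x^T L x / x^T x for the combinatorial Laplacian L = D - A."""
    # For an undirected graph, x^T L x equals the sum of squared differences over edges.
    num = sum((x[i] - x[j]) ** 2 for i, j in edges)
    den = sum(v * v for v in x)
    return num / den

path_edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical 4-node path graph
smooth = rayleigh_quotient(path_edges, [1.0, 1.0, 1.0, 1.0])   # low-frequency signal
rough = rayleigh_quotient(path_edges, [1.0, -1.0, 1.0, -1.0])  # high-frequency signal
```

A constant signal gives a quotient of zero, while a signal that flips sign across every edge gives a large value, which is why the quotient serves as a measure of how much spectral energy sits in high frequencies.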
☆ Efficient Constraint Generation for Stochastic Shortest Path Problems
Stochastic Shortest Path problems (SSPs) are traditionally solved by computing each state's cost-to-go by applying Bellman backups. A Bellman backup updates a state's cost-to-go by iterating through every applicable action, computing the cost-to-go after applying each one, and selecting a minimal action's cost-to-go. State-of-the-art algorithms use heuristic functions; these give an initial estimate of costs-to-go, and let the algorithm apply Bellman backups only to promising states, determined by low estimated costs-to-go. However, each Bellman backup still considers all applicable actions, even if the heuristic tells us that some of these actions are too expensive, with the effect that such algorithms waste time on unhelpful actions. To address this gap, we present a technique that uses the heuristic to avoid expensive actions, by reframing heuristic search in terms of linear programming and introducing an efficient implementation of constraint generation for SSPs. We present CG-iLAO*, a new algorithm that adapts iLAO* with our novel technique, and considers only 40% of iLAO*'s actions on many problems, and as few as 1% on some. Consequently, CG-iLAO* computes on average 3.5x fewer costs-to-go for actions than the state-of-the-art iLAO* and LRTDP, enabling it to solve problems on average 2.8x and 3.7x faster, respectively.
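The Bellman backup described above can be sketched as follows. This is the standard backup that evaluates every applicable action, not CG-iLAO* itself; the dictionary-based data layout and the toy problem are assumptions chosen for brevity.

```python
def bellman_backup(state, actions, cost, transitions, value):
    """Return (new cost-to-go, greedy action) for `state`.

    actions[state]          -> iterable of applicable actions
    cost[(state, a)]        -> immediate cost of action a
    transitions[(state, a)] -> list of (probability, successor) pairs
    value[successor]        -> current cost-to-go estimate
    """
    best_q, best_a = float("inf"), None
    for a in actions[state]:  # every applicable action is evaluated
        q = cost[(state, a)] + sum(p * value[s2] for p, s2 in transitions[(state, a)])
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a

# Toy SSP (hypothetical): from s0, "safe" reaches the goal surely at cost 3,
# while "risky" costs 1 but loops back to s0 half the time.
actions = {"s0": ["safe", "risky"]}
cost = {("s0", "safe"): 3.0, ("s0", "risky"): 1.0}
transitions = {
    ("s0", "safe"): [(1.0, "goal")],
    ("s0", "risky"): [(0.5, "goal"), (0.5, "s0")],
}
value = {"s0": 0.0, "goal": 0.0}
for _ in range(50):  # repeated backups converge to the fixed point V(s0) = 2
    value["s0"], greedy = bellman_backup("s0", actions, cost, transitions, value)
```

The inner loop over `actions[state]` is the work that grows with the number of applicable actions, which is the cost the abstract says heuristic-guided constraint generation is meant to reduce.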
☆ Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risks of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails require the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
☆ CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift AAAI 2026
Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC by up to 14% while using fewer adaptation samples.
comment: AAAI 2026
☆ Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.
comment: Not peer-reviewed
☆ Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate \textit{Token Visual Dependency}, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
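A minimal sketch of the per-token quantity described above: the KL divergence between the next-token distribution conditioned on the image and the text-only distribution. The KL direction, the function names, and the toy logits are assumptions; the threshold-gated, mass-conserving advantage reshaping is not specified in the abstract and is deliberately left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL(p_visual || p_text) over the vocabulary."""
    deps = []
    for lv, lt in zip(logits_with_image, logits_text_only):
        pv, pt = softmax(lv), softmax(lt)
        deps.append(sum(a * math.log(a / b) for a, b in zip(pv, pt)))
    return deps

# Two hypothetical token positions over a 3-word vocabulary.
with_image = [[2.0, 0.0, 0.0], [0.1, 0.2, 0.3]]
text_only = [[0.0, 0.0, 2.0], [0.1, 0.2, 0.3]]
deps = token_visual_dependency(with_image, text_only)
```

Here the first token's prediction changes sharply once the image is visible, so its dependency is large, while the second token is predicted identically with or without the image and scores zero.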
☆ Neural Network-Assisted Model Predictive Control for Implicit Balancing
In Europe, balance responsible parties can deliberately take out-of-balance positions to support transmission system operators (TSOs) in maintaining grid stability and earn profit, a practice called implicit balancing. Model predictive control (MPC) is widely adopted as an effective approach for implicit balancing. The accuracy of the balancing market model in MPC is critical to decision quality. Previous studies modeled this market using either (i) a convex market clearing approximation, ignoring proactive manual actions by TSOs and the market's sub-quarter-hour dynamics, or (ii) machine learning methods, which cannot be directly integrated into MPC. To address these shortcomings, we propose a data-driven balancing market model integrated into MPC using an input convex neural network to ensure convexity while capturing uncertainties. To keep the core network computationally efficient, we incorporate attention-based input gating mechanisms to remove irrelevant data. Evaluation on Belgian data shows that the proposed model both improves MPC decisions and reduces computational time.
☆ A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, using 627 WSIs from the TCGA-BRCA cohort, the method achieved an F1-score of 0.8812 and an AUC of 0.9841. On the external validation dataset, it achieved an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
☆ Domain-constrained knowledge representation: A modal framework
Knowledge graphs store large numbers of relations efficiently, but they remain weak at representing a quieter difficulty: the meaning of a concept often shifts with the domain in which it is used. A triple such as Apple, instance-of, Company may be acceptable in one setting while being misleading or unusable in another. In most current systems, domain information is attached as metadata, qualifiers, or graph-level organization. These mechanisms help with filtering and provenance, but they usually do not alter the formal status of the assertion itself. This paper argues that domain should be treated as part of knowledge representation rather than as supplementary annotation. It introduces the Domain-Contextualized Concept Graph (DCG), a framework in which domain is written into the relation and interpreted as a modal world constraint. In the DCG form (C, R at D, C'), the marker at D identifies the world in which the relation holds. Formally, the relation is interpreted through a domain-indexed necessity operator, so that truth, inference, and conflict checking are all scoped to the relevant world. This move has three consequences: ambiguous concepts can be disambiguated at the point of representation; invalid assertions can be challenged against their domain; cross-domain relations can be connected through explicit predicates. The paper develops this claim through a Kripke-style semantics, a compact predicate system, a Prolog implementation, and mappings to RDF, OWL, and relational databases. The contribution is a representational reinterpretation of domain itself. The central claim is that many practical failures in knowledge systems begin when domain is treated as external to the assertion. DCG addresses that by giving domain a structural and computable role inside the representation.
comment: 37 pages
☆ FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
comment: Paper in-review
☆ DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying reasoning and instruction-following capabilities with spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
comment: 11 pages, 4 figures; Project Website: https://drivedreamer-policy.github.io/
☆ FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
comment: The first two authors contributed equally to this work; listing order is random
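The claim that conjugate-symmetric complex coefficients reconstruct into real-valued weights can be checked numerically. The sketch below is a generic property check on a 1-D inverse DFT, not FourierMoE's implementation; the helper names and the example coefficients are assumptions chosen for illustration.

```python
import cmath


def idft(coeffs):
    """Plain inverse discrete Fourier transform of a 1-D complex spectrum."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(coeffs)) / n
        for t in range(n)
    ]


def conjugate_symmetric(free, n):
    """Build a length-n spectrum X with X[n - k] == conj(X[k]).

    `free` holds the coefficients for k = 0 .. n // 2; the rest are mirrored.
    """
    spec = [0j] * n
    for k in range(n // 2 + 1):
        spec[k] = complex(free[k])
    spec[0] = complex(spec[0].real, 0.0)  # the DC term must be real
    if n % 2 == 0:
        spec[n // 2] = complex(spec[n // 2].real, 0.0)  # so must the Nyquist term
    for k in range(1, (n + 1) // 2):
        spec[n - k] = spec[k].conjugate()
    return spec


# Example coefficients are arbitrary (assumed for illustration).
free_coeffs = [1.0, 2 + 1j, -0.5j, 0.25 - 0.75j, 3.0]
spectrum = conjugate_symmetric(free_coeffs, 8)
weights = idft(spectrum)
max_imag = max(abs(w.imag) for w in weights)
```

With the symmetry enforced, `max_imag` is zero up to floating-point rounding, so the reconstructed values can be used directly as real weights.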
☆ LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
comment: Project page: https://livemathematicianbench.github.io/
☆ AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows
Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.
☆ Solving the Two-dimensional single stock size Cutting Stock Problem with SAT and MaxSAT
Cutting rectangular items from stock sheets to satisfy demands while minimizing waste is a central manufacturing task. The Two-Dimensional Single Stock Size Cutting Stock Problem (2D-CSSP) generalizes bin packing by requiring multiple copies of each item type, which causes a strong combinatorial blow-up. We present a SAT-based framework where item types are expanded by demand, each copy has a sheet-assignment variable and non-overlap constraints are activated only for copies assigned to the same sheet. We also introduce an infeasible-orientation elimination rule that fixes rotation variables when only one orientation can fit the sheet. For minimizing the number of sheets, we compare three approaches: non-incremental SAT with binary search, incremental SAT with clause reuse across iterations and weighted partial MaxSAT. On the Cui--Zhao benchmark suite, our best SAT configurations certify two to three times more instances as provably optimal and achieve lower optimality gaps than OR-Tools, CPLEX and Gurobi. The relative ranking among SAT approaches depends on rotation: incremental SAT is strongest without rotation, while non-incremental SAT is more effective when rotation increases formula size.
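The "SAT with binary search" strategy for minimizing the number of sheets can be sketched as a search over the sheet count k, where each probe asks a solver whether all item copies fit on k sheets. In the sketch below the `feasible` callback is a stand-in for that solver call (the actual SAT encoding is not reproduced here), and the toy oracle with an optimum of 7 sheets is an assumption for the example.

```python
def min_sheets(lower, upper, feasible):
    """Smallest k in [lower, upper] for which feasible(k) is True.

    Assumes feasibility is monotone in k (if k sheets suffice, so do k + 1)
    and that feasible(upper) holds, e.g. upper = total number of item copies.
    """
    while lower < upper:
        mid = (lower + upper) // 2
        if feasible(mid):
            upper = mid      # mid sheets suffice; try fewer
        else:
            lower = mid + 1  # mid sheets are not enough
    return lower


# Toy stand-in for the solver: pretend the true optimum is 7 sheets.
calls = []


def toy_oracle(k):
    calls.append(k)
    return k >= 7


best = min_sheets(1, 20, toy_oracle)
```

Each probe in the non-incremental variant would build and solve a fresh formula; the incremental variant described in the abstract instead reuses clauses across probes.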
☆ The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs
Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.
comment: Accepted at the 38th International Conference on Advanced Information Systems Engineering (CAiSE 2026)
☆ LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor x which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.
☆ Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving
Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
comment: 18 pages, 6 figures, 4 tables
Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms approaches for response forecasting as no assumption on wind stationarity or on structural normal vibration behavior is needed. Specifically, wind-excited dynamic behavior suffers from uncertainty related to obtaining poor predictions when the environmental or traffic conditions change. This results in a hard distinction of what constitutes normal vibration behavior. To this end, a framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures accurate structural behavior in realistic conditions, and with respect to the changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.
comment: 21 pages, 22 figures, 9 tables. This version corresponds to the published article in Computers & Structures. https://doi.org/10.1016/j.compstruc.2026.108216
☆ OpenGo: An OpenClaw-Based Robotic Dog with Real-Time Skill Switching
Adaptation to complex tasks and multiple scenarios remains a significant challenge for a single robot agent. The ability to acquire, organize, and switch between a wide range of skills in real time, particularly in dynamic environments, has become a fundamental requirement for embodied intelligence. We introduce OpenGo, an OpenClaw-powered embodied robotic dog capable of switching skills in real time according to the scene and task instructions. Specifically, the agent is equipped with (1) a customizable skill library with easy skill import and autonomous skill validation, (2) a dispatcher that selects and invokes different skills according to task prompts or language instructions, and (3) a self-learning framework that fine-tunes skills based on task completion and human feedback. We deploy the agent on Unitree's Go2 robotic dog and validate its ability to self-check and switch skills autonomously. In addition, by integrating Feishu-platform communication, we enable natural-language guidance and human feedback, allowing inexperienced users to control the robotic dog through simple instructions.
comment: 11 pages, 6 figures
☆ Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
comment: Under review at npj Digital Medicine
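For readers unfamiliar with the two headline metrics, the sketch below shows the standard definitions: character error rate as character-level edit distance divided by reference length, and real-time factor as processing time divided by audio duration. It is a generic illustration, not EndoASR's evaluation code, and the example strings and timings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edits needed to turn hyp into ref, per reference character."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means transcription runs faster than the audio plays."""
    return processing_seconds / audio_seconds


example_cer = cer("abcd", "abxd")           # one substitution in four characters
example_rtf = real_time_factor(0.5, 100.0)  # made-up timings
```

An RTF of 0.005, as reported above, corresponds to roughly half a second of compute per 100 seconds of audio.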
♻ ☆ ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
♻ ☆ AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents ICTAI 2025
Pharmacokinetics (PK) plays a critical role in drug development and regulatory decision-making for human and veterinary medicine, directly affecting public health through drug safety and efficacy assessments. However, PK data are often embedded in complex, heterogeneous tables with variable structures and inconsistent terminologies, posing significant challenges for automated PK data retrieval and standardization. We present AutoPK, a novel two-stage framework for accurate and scalable extraction of PK data from complex scientific tables. In the first stage, AutoPK identifies and extracts PK parameter variants using large language models (LLMs), a hybrid similarity metric, and LLM-based validation. The second stage filters relevant rows, converts the table into a key-value text format, and uses an LLM to reconstruct a standardized table. Evaluated on a real-world dataset of 605 PK tables, including captions and footnotes, AutoPK shows significant improvements in precision and recall over direct LLM baselines. For instance, AutoPK with LLaMA 3.1-70B achieved an F1-score of 0.92 on half-life and 0.91 on clearance parameters, outperforming direct use of LLaMA 3.1-70B by margins of 0.10 and 0.21, respectively. Smaller models such as Gemma 3-27B and Phi 3-12B with AutoPK achieved 2-7 fold F1 gains over their direct use, with Gemma's hallucination rates reduced from 60-95% down to 8-14%. Notably, AutoPK enabled open-source models like Gemma 3-27B to outperform commercial systems such as GPT-4o Mini on several PK parameters. AutoPK enables scalable and high-confidence PK data extraction, making it well-suited for critical applications in veterinary pharmacology, drug safety monitoring, and public health decision-making, while addressing heterogeneous table structures and terminology and demonstrating generalizability across key PK parameters. Code and data: https://github.com/hosseinsholehrasa/AutoPK
comment: Published in IEEE ICTAI 2025
♻ ☆ Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
comment: Visualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/
♻ ☆ Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational resources, particularly under increasing energy and infrastructure constraints. GPUs have emerged as essential for accelerating such workloads. This study benchmarks four deep learning models (Conv6, VGG16, ResNet18, CycleGAN) using TensorFlow and PyTorch on Intel Xeon CPUs and NVIDIA Tesla T4 GPUs. Our experiments demonstrate that, on average, GPU training achieves speedups ranging from 11x to 246x depending on model complexity, with lightweight models (Conv6) showing the highest acceleration (246x), mid-sized models (VGG16, ResNet18) achieving 51-116x speedups, and complex generative models (CycleGAN) reaching 11x improvements compared to CPU training. Additionally, in our PyTorch vs. TensorFlow comparison, we observed that TensorFlow's kernel-fusion optimizations reduce inference latency by approximately 15%. We also analyze GPU memory usage trends and project requirements through 2025 using polynomial regression. Our findings highlight that while GPUs are essential for sustaining AI's growth, democratized and shared access to GPU resources is critical for enabling research innovation across institutions with limited computational budgets.
♻ ☆ PBLean: Pseudo-Boolean Proof Certificates for Lean 4
We present PBLean, a method for importing VeriPB pseudo-Boolean (PB) proof certificates into Lean 4. Key to our approach is reflection: a Boolean checker function whose soundness is fully proved in Lean and executed as compiled native code. Our method scales to proofs with tens of thousands of steps that would exhaust memory under explicit proof-term construction. Our checker supports all VeriPB kernel rules, including cutting-plane derivations, proof-by-contradiction subproofs, and redundance-based reasoning for symmetry breaking. In contrast to external verified checkers that produce verdicts, our integration yields Lean theorems that can serve as composable lemmas in larger formal developments. To derive theorems about the original combinatorial problems rather than about PB constraints alone, we support verified encodings. This closes the trust gap between solver output and problem semantics since the constraint translation and its correctness proof are both formalized in Lean. We demonstrate the approach on various combinatorial problems.
♻ ☆ ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget
Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question-answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4-5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experimental results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.
comment: Preprint
♻ ☆ TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some datasets appear to be tailored primarily to particular, limited sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks and visual language action models. Furthermore, we demonstrate its versatility by training various models using our dataset. Moreover, we also provide numerical rarity scores to understand how rarely the current state occurs in the dataset.
♻ ☆ Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component $A_{out}$, derived from ORM and normalized over all responses, and a process component $A_{proc}$, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that $A_{out}$ anchors training on correctness while $A_{proc}$ differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
comment: 16 pages, 9 figures, second version
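The decoupled normalization described in the PAPO abstract can be sketched for a single group of sampled responses. The abstract does not say how the two components are combined, so this sketch makes several assumptions: a binary outcome reward, a plain sum of the outcome and process components, a process component of zero for incorrect responses, and population standard deviation with a small epsilon. The function names are hypothetical.

```python
from statistics import mean, pstdev


def _normalize(values, eps=1e-6):
    """Zero-mean, unit-variance scaling within one group (eps avoids division by zero)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def papo_advantages(correct, process_scores):
    """Per-response advantage for one group.

    correct:        list of bools, final-answer correctness (outcome signal)
    process_scores: list of floats, rubric-based process scores
    """
    # Outcome component: normalized over ALL responses in the group.
    a_out = _normalize([1.0 if c else 0.0 for c in correct])

    # Process component: normalized ONLY among correct responses.
    idx = [i for i, c in enumerate(correct) if c]
    a_proc = [0.0] * len(correct)  # assumption: incorrect responses get no process term
    if len(idx) > 1:
        for i, v in zip(idx, _normalize([process_scores[i] for i in idx])):
            a_proc[i] = v

    # Assumption: the two components are simply added.
    return [o + p for o, p in zip(a_out, a_proc)]


mixed = papo_advantages([True, True, False, True], [0.9, 0.5, 0.99, 0.1])
all_correct = papo_advantages([True, True, True], [0.2, 0.5, 0.8])
```

In `mixed`, the incorrect response keeps the lowest advantage despite its high process score, which is the guard against verbosity-driven reward hacking. In `all_correct`, the outcome component vanishes but the process component still separates the responses, matching the abstract's point about uniformly correct groups.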
♻ ☆ Moonwalk: Inverse-Forward Differentiation
Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
♻ ☆ 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter operator and also important gains for semantic ranking. The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.
♻ ☆ Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI SP
Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
comment: Accepted at ACM CODASPY 2026
♻ ☆ BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Interpreting gene clusters from RNA sequencing (RNA-seq) remains challenging, especially in antimicrobial resistance studies where mechanistic insight is important for hypothesis generation. Existing pathway enrichment methods can summarize co-expressed modules, but they often provide limited cluster-specific explanations and weak connections to supporting literature. We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules. BIOGEN combines biomedical retrieval, structured reasoning, and multi-critic verification to generate traceable cluster-level explanations with explicit evidence and confidence labels. On a primary Salmonella enterica dataset, BIOGEN achieved strong biological grounding, including BERTScore 0.689, Semantic Alignment Score 0.715, KEGG Functional Similarity 0.342, and a hallucination rate of 0.000, compared with 0.100 for an LLM-only baseline. Across four additional bacterial RNA-seq datasets, BIOGEN also maintained zero hallucination under the same fixed pipeline. In comparisons with representative open-source agentic AI baselines, BIOGEN was the only framework that consistently preserved zero hallucination across all five datasets. These findings suggest that retrieval alone is not enough for reliable biological interpretation, and that evidence-grounded orchestration is important for transparent and source-traceable transcriptomic reasoning.
♻ ☆ Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act. Without principled bounds on information-acquisition costs, unconstrained agents exhibit systematic failure modes: excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence. We propose the Triadic Cognitive Architecture (TCA), a decision-theoretic framework that formalizes these failure modes via cognitive friction. By combining nonlinear filtering, congestion-dependent cost dynamics, and HJB optimal stopping, TCA models deliberation as stochastic control over a joint belief-congestion state, explicitly pricing information by tool signal quality and live network load. TCA yields an HJB-inspired stopping boundary and a computable rollout-based approximation of belief-dependent value-of-information with a net-utility halting condition. We validate TCA in two controlled environments (EMDG and NSTG) designed to isolate stopping quality, action selection under congestion, and temporal urgency. TCA improves resource outcomes while reducing time-to-action without degrading accuracy, gaining 36 viability points in EMDG and 33 integrity points in NSTG over greedy baselines. Ablations show that selection and stopping must be optimized jointly, as stopping rules alone recover at most 4 viability points. Sensitivity sweeps over alpha, beta, and lambda_S yield stable accuracy and interpretable trade-offs, and a continuation-value sweep over eta values 0, 0.1, 0.3, and 0.5 finds eta equal to zero is optimal under high temporal urgency. Finally, we demonstrate an illustrative instantiation around a black-box LLM on a memorisation-free corpus, where the same stopping principle executes using empirically computable uncertainty and value-of-information proxies.
comment: Preprint
♻ ☆ Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an $\varepsilon$-stationary point at a rate of $O(\varepsilon^{-4})$, matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
♻ ☆ Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
StarCraft II is a complex and dynamic real-time strategy (RTS) game environment that is well suited to artificial intelligence and reinforcement learning research. To address the problem of Large Language Model (LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes (ROE) framework based on expert experience and self-experience. The framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. In our experiments, our method beat the bot at the Very Hard difficulty in TextStarCraft II. We analyze the LLM's in-game data in detail and verify the method's effectiveness.
♻ ☆ Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $Θ$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to Θ$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
♻ ☆ Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
comment: Updated first authors
♻ ☆ Support-Contra Asymmetry in LLM Explanations
Large Language Models (LLMs) increasingly produce natural language explanations alongside their predictions, yet it remains unclear whether these explanations reference predictive cues present in the input text. In this work, we present an empirical study of how LLM-generated explanations align with predictive lexical evidence from an external model in text classification tasks. To analyze this relationship, we compare explanation content against interpretable feature importance signals extracted from transparent linear classifiers. These reference models allow us to partition predictive lexical cues into supporting and contradicting evidence relative to the predicted label. Across three benchmark datasets (WIKIONTOLOGY, AG NEWS, and IMDB), we observe a consistent empirical pattern that we term support-contra asymmetry. Explanations accompanying correct predictions tend to reference more supporting lexical cues and fewer contradicting cues, whereas explanations associated with incorrect predictions reference substantially more contradicting evidence. This pattern appears consistently across datasets, across reference model families (logistic regression and linear SVM), and across multiple feature retrieval depths. These results suggest that LLM explanations often reflect lexical signals that are predictive for the task when predictions are correct, while incorrect predictions are more frequently associated with explanations that reference misleading cues present in the input. Our findings provide a simple empirical perspective on explanation-evidence alignment and illustrate how external sources of predictive evidence can be used to analyze the behavior of LLM-generated explanations.
comment: 17 pages, 12 figures, 4 tables
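The partition of lexical cues into supporting and contradicting evidence can be sketched as follows. This is an illustration only: the token weights are made up, the sign convention (positive weight means evidence for the predicted label) is an assumption about how a linear reference model would be read, and `partition_cues` and `cue_mentions` are hypothetical helper names.

```python
def partition_cues(input_tokens, weights, k=3):
    """Split the k most important input tokens into supporting and contradicting cues.

    weights maps a token to its (hypothetical) linear-model weight toward the
    predicted label; k plays the role of a feature retrieval depth.
    """
    present = {t for t in input_tokens if t in weights}
    ranked = sorted(present, key=lambda t: (-abs(weights[t]), t))[:k]
    support = {t for t in ranked if weights[t] > 0}
    contra = {t for t in ranked if weights[t] < 0}
    return support, contra


def cue_mentions(explanation_tokens, support, contra):
    """Count how many supporting and contradicting cues the explanation references."""
    mentioned = set(explanation_tokens)
    return len(mentioned & support), len(mentioned & contra)


# Made-up weights toward the label "positive" for a sentiment example.
weights = {"brilliant": 1.8, "moving": 1.1, "boring": -1.5, "slow": -0.7}
review = "a brilliant and moving film despite a slow boring start".split()
explanation = "the review calls the film brilliant and moving".split()

support, contra = partition_cues(review, weights, k=3)
n_support, n_contra = cue_mentions(explanation, support, contra)
```

Comparing `n_support` and `n_contra` between correct and incorrect predictions is the kind of tally behind the asymmetry the abstract reports.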
♻ ☆ The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents
Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy - a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real-time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi-turn dialogues, and (3) a Generator-Critic loop where an auditor vetoes sycophantic drafts and triggers rewrites with "Necessary Friction." In a live evaluation across all 437 TruthfulQA adversarial scenarios, Claude Sonnet 4 exhibits 9.6% baseline sycophancy, reduced to 1.4% by the Silicon Mirror - an 85.7% relative reduction (p < 10^-6, OR = 7.64, Fisher's exact test). Cross-model evaluation on Gemini 2.5 Flash reveals a 46.0% baseline reduced to 14.2% (p < 10^-10, OR = 5.15). We characterize the validation-before-correction pattern as a distinct failure mode of RLHF-trained models.
comment: 7 pages, 8 figures, 5 tables. Code and evaluation data available at https://github.com/Helephants/langgraph-layered-context
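The headline figures in the Silicon Mirror abstract can be cross-checked with a short calculation. The raw counts below are not stated in the abstract; they are assumed to be the integers out of 437 scenarios that reproduce the rounded percentages (42 and 6 for Claude Sonnet 4, 201 and 62 for Gemini 2.5 Flash). The Fisher's exact p-values are not recomputed here.

```python
def summarize(n, baseline_failures, treated_failures):
    """Return (baseline rate, treated rate, relative reduction, odds ratio)."""
    base_rate = baseline_failures / n
    treated_rate = treated_failures / n
    relative_reduction = (base_rate - treated_rate) / base_rate
    baseline_odds = baseline_failures / (n - baseline_failures)
    treated_odds = treated_failures / (n - treated_failures)
    return base_rate, treated_rate, relative_reduction, baseline_odds / treated_odds


N = 437  # TruthfulQA adversarial scenarios, as stated in the abstract

# Assumed counts inferred from the rounded percentages.
claude = summarize(N, baseline_failures=42, treated_failures=6)
gemini = summarize(N, baseline_failures=201, treated_failures=62)

for name, (base, treated, rel, odds_ratio) in [("Claude Sonnet 4", claude),
                                                ("Gemini 2.5 Flash", gemini)]:
    print(f"{name}: {base:.1%} -> {treated:.1%}, "
          f"relative reduction {rel:.1%}, OR {odds_ratio:.2f}")
```

Under these assumed counts the relative reduction comes from the counts (36 of 42 cases removed), which explains why it is 85.7% rather than the 85.4% obtained from the rounded rates 9.6% and 1.4%.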
♻ ☆ Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $Ω(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.
♻ ☆ Learning to Play Blackjack: A Curriculum Learning Perspective
Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.
comment: Accepted as an oral presentation at the International Conference on Distributed Artificial Intelligence (DAI 2025). 16 pages, 7 figures
♻ ☆ PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor
Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.
♻ ☆ GenOM: Ontology Matching with Description Generation and Large Language Model
Ontology matching (OM) plays an essential role in enabling semantic interoperability and integration across heterogeneous knowledge sources, particularly in the biomedical domain which contains numerous complex concepts related to diseases and pharmaceuticals. This paper introduces GenOM, a large language model (LLM)-based ontology alignment framework, which enriches the semantic representations of ontology concepts via generating textual definitions, retrieves alignment candidates with an embedding model, and incorporates exact matching-based tools to improve precision. Extensive experiments conducted on the OAEI Bio-ML track demonstrate that GenOM can often achieve competitive performance, surpassing many baselines including traditional OM systems and recent LLM-based methods. Further ablation studies confirm the effectiveness of semantic enrichment and few-shot prompting, highlighting the framework's robustness and adaptability.
comment: Accepted for publication in World Wide Web (Springer). This version includes revisions based on peer review
♻ ☆ Four Generations of Quantum Biomedical Sensors
Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utilize discrete energy levels for signal transduction but follow classical scaling laws. Second-generation sensors exploit quantum coherence to reach the standard quantum limit, while third-generation architectures leverage entanglement and spin squeezing to approach Heisenberg-limited precision. We further define an emerging fourth generation characterized by the end-to-end integration of quantum sensing with quantum learning and variational circuits, enabling adaptive inference directly within the quantum domain. By analyzing critical parameters such as bandwidth matching and sensor-tissue proximity, we identify key technological bottlenecks and propose a roadmap for transitioning from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.
comment: 22 pages, 5 figures, 6 tables
♻ ☆ Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual exploration or traditional AutoML to explore effectively. We deploy an autonomous research pipeline to discover Omni-SimpleMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes ${\sim}50$ experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art on both benchmarks, improving F1 by +411% on LoCoMo (0.117$\to$0.598) and +214% on Mem-Gallery (0.254$\to$0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at https://github.com/aiming-lab/SimpleMem.
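The percentage gains quoted for Omni-SimpleMem are relative improvements over the starting F1, which the following few lines reproduce from the scores given in the abstract. The helper name `relative_gain` is hypothetical.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


locomo_gain = relative_gain(0.117, 0.598)       # LoCoMo F1, from the abstract
mem_gallery_gain = relative_gain(0.254, 0.797)  # Mem-Gallery F1, from the abstract

print(f"LoCoMo: +{locomo_gain:.0%}, Mem-Gallery: +{mem_gallery_gain:.0%}")
```

A +411% relative gain therefore corresponds to roughly a fivefold F1, not an absolute increase of 411 points.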
♻ ☆ DeDelayed: Deleting Remote Inference Delay via On-Device Correction CVPR 2026
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round-trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference -- an improvement equivalent to using a model ten times larger. We release our training code, pretrained models, and Python library at https://github.com/InterDigitalInc/dedelayed .
comment: CVPR 2026
♻ ☆ Do Phone-Use Agents Respect Your Privacy?
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/FreedomIntelligence/MyPhoneBench.
comment: work in progress
♻ ☆ Towards Faithful Reasoning in Comics for Small MLLMs
Comic understanding presents a significant challenge for Multimodal Large Language Models (MLLMs), as the intended meaning of a comic often emerges from the joint interpretation of visual, textual, and social cues. This naturally motivates Chain-of-Thought (CoT) prompting, since explicit intermediate reasoning appears promising for integrating such heterogeneous signals. However, existing CoT methods are poorly matched to this structure: they tend to force interpretation into a single reasoning path before multiple cues have been jointly considered, often degrading performance, especially for small MLLMs. Our key idea is to explicitly preserve multi-cue interpretation during supervision construction, rather than collapsing comic understanding into a single reasoning chain. To this end, we propose a two-stage framework for faithful comic reasoning in small MLLMs. First, we introduce MoCoT, a modular supervision construction framework that preserves multi-cue interpretation and turns it into more faithful supervision. Second, we propose VERA, a structured reward mechanism that turns such supervision into faithful reasoning behavior by aligning optimization with both reasoning faithfulness and answer correctness. Extensive experiments on five benchmarks spanning comic understanding and broader humor-centric and abstract visual reasoning tasks demonstrate that our framework achieves strong results in the $\leq$ 4B regime, surpasses several 7B baselines, improves four small MLLMs by an average of 12.1% as a plug-in, and consistently enhances reasoning faithfulness while preserving inference efficiency.
♻ ☆ An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating
Technology changes so quickly that new trends tend to crowd out attention to what was new just yesterday. In this paper, the peak popularity of the confluence of Object Technologies with early Web adoption is explored through the content of a seminar held in 1999. Distributed architectures were undergoing significant change at this point, and deeper software capabilities were just beginning to be broadly accessible over the Internet. The Object Web arose and was infused with new development tools that reflected these capabilities and allowed applications to be designed for deployment during the early days of the World Wide Web. The seminar discussed the history, evolution, and use of these tools and architectures, along with their future possibilities. The continued dominance of these approaches, although under different names, is demonstrated even though the term Object Web has receded in use. For example, favored newer offerings such as Kubernetes and microservices still model the core design attributes of the Object Web. Aside from connecting this seminar to the software world of today, this paper also touches on the early AI tools demonstrated in the seminar a quarter century ago and on how the popularity wave of any given technology might affect the current focus on AI technology offerings.
comment: Record of early Web Object technology and evolution since then covered in 6 pages with 4 figures
♻ ☆ NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage Device API for both intra- and inter-node communications, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
comment: 13 pages, 8 figures, 7 tables
♻ ☆ UniMark: Artificial Intelligence Generated Content Identification Toolkit
The rapid proliferation of Artificial Intelligence Generated Content has precipitated a crisis of trust and urgent regulatory demands. However, existing identification tools suffer from fragmentation and a lack of support for visible compliance marking. To address these gaps, we introduce \textbf{UniMark}, an open-source, unified framework for multimodal content governance. Our system features a modular unified engine that abstracts complexities across text, image, audio, and video modalities. Crucially, we propose a novel dual-operation strategy, natively supporting both \emph{Hidden Watermarking} for copyright protection and \emph{Visible Marking} for regulatory compliance. Furthermore, we establish a standardized evaluation framework with three specialized benchmarks (Image/Video/Audio-Bench) to ensure rigorous performance assessment. This toolkit bridges the gap between advanced algorithms and engineering implementation, fostering a more transparent and secure digital ecosystem.
comment: 5 Pages
♻ ☆ DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieval pipeline designed for reasoning-intensive information retrieval. It consists of four components. The document preprocessing stage enhances readability and preserves content by cleaning noisy texts and segmenting long documents. The query expansion stage leverages large language models to iteratively refine user queries with explicit reasoning and evidence from retrieved documents. The retrieval stage employs a model fine-tuned on synthetic data spanning medical and mathematical domains, along with hard negatives, enabling effective handling of reasoning-intensive queries. Finally, the reranking stage combines pointwise and listwise strategies to produce both fine-grained and globally consistent rankings. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 46.8 overall and 31.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks.
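The DIVER abstract above reports nDCG@10. As a reminder of what that metric measures, here is a minimal sketch in Python. It assumes graded relevance labels are already given in ranked order, and it computes the ideal ordering from the same list of labels, which is a simplification: a full evaluation would build the ideal ranking from all relevant documents in the collection. The function names are illustrative and are not taken from the paper or the BRIGHT benchmark code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k labels, given in rank order."""
    # rank is 0-based, so the discount for the first position is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the ranking divided by the DCG of the best ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranking: the only relevant document sits at position 2.
score = ndcg_at_k([0, 1, 0])
```

A ranking that places every relevant document first scores 1.0, and pushing relevant documents down the list lowers the score logarithmically.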
♻ ☆ Exploring Effective Strategies for Building a User-Configured GPT for Coding Classroom Dialogues
This study investigated effective strategies for developing a custom GPT to code classroom dialogue. While classroom dialogue is widely recognised as a crucial element of education, its analysis remains challenging due to the need for a nuanced understanding of dialogic functions and the labour-intensive nature of manual transcript coding. Recent advancements in large language models (LLMs) offer promising avenues for automating this process. However, existing studies predominantly focus on training large-scale models or evaluating pre-trained models with fixed codebooks, the outcomes of which are often not applicable, or the methods are not replicable for dialogue researchers working with small datasets or employing customised coding schemes. Using MyGPT - a GPT-4-based customised GPT system configured for dialogue analysis - as a case, this study evaluates its baseline performance in coding classroom dialogue with a human codebook and examines how performance varies with different example inputs under a controlled variable design. Through a design-based research approach, this study explores a set of practical strategies - based upon MyGPT's unique features - for configuring an effective tool with limited data. The findings suggest that, despite a few limitations, a custom GPT developed using these specific strategies can serve as a useful coding assistant by generating coding suggestions.
comment: Draft technical report. 39 pages, 2 figures. Not yet submitted for publication. Update expected
♻ ☆ MemFactory: Unified Inference & Training Framework for Agent Memory
Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across the evaluation sets, MemFactory improves performance over the corresponding base models on average, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.
comment: v3: fixed Figure 1 typos, clarified ambiguous wording in the abstract, added 1 missing citation. Code: https://github.com/Valsure/MemFactory
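The MemFactory abstract above mentions Group Relative Policy Optimization (GRPO). The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that normalization step only. Using the population standard deviation and the small `eps` constant are assumptions of this sketch, and nothing here reflects MemFactory's actual code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.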
♻ ☆ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
comment: 10 pages, 6 figures
♻ ☆ Predicting LLM Output Length via Entropy-Guided Representations
The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic "one-to-many" sampling scenarios. We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark with long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% over the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
♻ ☆ Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system's context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.
♻ ☆ Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Graph neural networks (GNNs) have achieved strong performance across various real-world domains. Nevertheless, they suffer from oversquashing, where long-range information is distorted as it is compressed through limited message-passing pathways. This bottleneck limits their ability to capture essential global context and decreases their performance, particularly in dense and heterophilic regions of graphs. To address this issue, we propose a novel graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate the impact of excessive long-range dependencies. This framework enhances the node representation by emphasizing cohesive structure in long-range information but removing noisy or irrelevant connections. It preserves essential global context without overloading the narrow bottlenecked channels, which further mitigates oversquashing. Extensive experiments on multiple benchmark datasets demonstrate that our model achieves consistent improvements in classification accuracy over standard baseline methods.
♻ ☆ Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework
In multi-source transfer learning, a key challenge lies in how to appropriately differentiate and utilize heterogeneous source tasks. However, existing multi-source methods typically focus on optimizing either the source weights or the amount of transferred samples, largely neglecting their joint consideration. In this work, we propose a theoretical framework, Unified Optimization of Weights and Quantities (UOWQ), that jointly determines the optimal source weights and transfer quantities for each source task. Specifically, the framework formulates multi-source transfer learning as a parameter estimation problem based on an asymptotic analysis of a Kullback--Leibler divergence--based generalization error measure, leading to two main theoretical findings: 1) using all available source samples is always optimal when the weights are properly adjusted; 2) the optimal source weights are characterized by a principled optimization problem whose structure explicitly incorporates the Fisher information, parameter discrepancy, parameter dimensionality, and transfer quantities. Building on the theoretical results, we further propose a practical algorithm for multi-source transfer learning, and extend it to multi-task learning settings where each task simultaneously serves as both a source and a target. Extensive experiments on real-world benchmarks, including DomainNet and Office-Home, demonstrate that UOWQ consistently outperforms strong baselines. The results validate both the theoretical predictions and the practical effectiveness of our framework.
♻ ☆ ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching and Token Scheduling
Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at similar computation by activating only a small set of experts per token. However, stacking many expert modules introduces substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments such as single-GPU devices. Offloading alleviates this issue by storing inactive experts in CPU memory and loading them on demand, but existing methods remain limited: static caches disregard input-dependent routing, and methods that train separate models to predict expert usage ahead of time are often inaccurate or require significant training cost. We propose ExpertFlow, a lightweight MoE inference system that addresses this routing dependency through three coordinated components: 1) a transformer-based routing path predictor that estimates expert usage across all MoE layers in a single forward pass, 2) a token scheduler that groups tokens with similar predicted routes to improve expert utilization, and 3) a predictive expert cache that loads only the required experts while correcting mispredictions at runtime. Together, these components enable efficient expert loading and execution, reducing GPU memory usage by up to 93.72% and improving inference throughput by up to 10x over strong offloading baselines on a single GPU.
comment: Accepted in DAC'26, Mixture-of-Experts, Inference, Offloading
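The ExpertFlow abstract above describes a token scheduler that groups tokens with similar predicted routes. A rough sketch of the idea follows, under a strong simplifying assumption: "similar" is reduced to "identical", so tokens are bucketed by their exact predicted expert path. The function name and the route encoding (one expert id per MoE layer) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_tokens_by_route(predicted_routes):
    """Bucket token indices by their predicted expert path.

    predicted_routes holds one sequence of expert ids per token,
    with one id per MoE layer (a hypothetical encoding).
    """
    groups = defaultdict(list)
    for token_idx, route in enumerate(predicted_routes):
        groups[tuple(route)].append(token_idx)
    return dict(groups)


# Hypothetical predictions for three tokens across two MoE layers.
batches = group_tokens_by_route([(0, 1), (2, 3), (0, 1)])
```

Tokens in the same bucket need the same experts, so each expert has to be loaded once per bucket instead of once per token. A real scheduler would presumably also merge buckets whose routes overlap only partially.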
♻ ☆ Robot Collapse: Supply Chain Backdoor Attacks Against VLM-based Robotic Manipulation
Robotic manipulation policies are increasingly empowered by \textit{large language models} (LLMs) and \textit{vision-language models} (VLMs), leveraging their understanding and perception capabilities. Recently, inference-time attacks against robotic manipulation have been extensively studied, yet backdoor attacks targeting model supply chain security in robotic policies remain largely unexplored. To fill this gap, we propose \texttt{TrojanRobot}, a backdoor injection framework for model supply chain attack scenarios, which embeds a malicious module into modular robotic policies via backdoor relationships to manipulate the LLM-to-VLM pathway and compromise the system. Our vanilla design instantiates this module as a backdoor-finetuned VLM. To further enhance attack performance, we propose a prime scheme by introducing the concept of \textit{LVLM-as-a-backdoor}, which leverages \textit{in-context instruction learning} (ICIL) to steer \textit{large vision-language model} (LVLM) behavior through backdoored system prompts. Moreover, we develop three types of prime attacks, \textit{permutation}, \textit{stagnation}, and \textit{intentional}, achieving flexible backdoor attack effects. Extensive physical-world and simulator experiments on 18 real-world manipulation tasks and 4 VLMs verify the superiority of the proposed \texttt{TrojanRobot}.
♻ ☆ Output Embedding Centering for Stable LLM Pretraining
Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs at the end of training is output logit divergence. The most widely used mitigation strategies, z-loss and logit soft-capping, merely address the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings' geometry and identify anisotropic embeddings as its source. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and demonstrate that it suppresses output logit divergence. OEC can be implemented in two different ways: as a deterministic operation called $μ$-centering, or a regularization method called $μ$-loss. Our experiments show that both variants outperform z-loss in terms of training stability, while being on par with logit soft-capping. This holds true both in the presence and the absence of weight tying. As a secondary result, we find that $μ$-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
comment: Additional experiments using logit soft-capping & weight tying
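The Output Embedding Centering abstract above names a deterministic operation called $μ$-centering. The sketch below assumes this means subtracting the mean output embedding from every row of the output embedding matrix; the abstract does not spell out the exact procedure, so treat this as an illustration of the geometry and not as the authors' implementation. It uses plain Python lists to stay self-contained.

```python
import math


def center_output_embeddings(E):
    """Subtract the mean embedding (taken over the vocabulary) from every row of E."""
    vocab, dim = len(E), len(E[0])
    mu = [sum(row[j] for row in E) / vocab for j in range(dim)]
    return [[row[j] - mu[j] for j in range(dim)] for row in E]


def output_logits(E, h):
    """One logit per vocabulary entry: the dot product of its embedding with h."""
    return [sum(e_j * h_j for e_j, h_j in zip(row, h)) for row in E]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
E = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
h = [0.5, -1.0]
probs_before = softmax(output_logits(E, h))
probs_after = softmax(output_logits(center_output_embeddings(E), h))
```

Centering shifts every logit by the same amount (the dot product of `h` with the mean embedding), so the softmax probabilities are unchanged, while the logits now average to zero over the vocabulary and cannot drift jointly.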
♻ ☆ SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection WWW 2026
Malicious agents pose significant threats to the reliability and decision-making capabilities of Multi-Agent Systems (MAS) powered by Large Language Models (LLMs). Existing defenses often fall short due to reactive designs or centralized architectures which may introduce single points of failure. To address these challenges, we propose SentinelNet, the first decentralized framework for proactively detecting and mitigating malicious behaviors in multi-agent collaboration. SentinelNet equips each agent with a credit-based detector trained via contrastive learning on augmented adversarial debate trajectories, enabling autonomous evaluation of message credibility and dynamic neighbor ranking via bottom-k elimination to suppress malicious communications. To overcome the scarcity of attack data, it generates adversarial trajectories simulating diverse threats, ensuring robust training. Experiments on MAS benchmarks show SentinelNet achieves near-perfect detection of malicious agents, close to 100% within two debate rounds, and recovers 95% of system accuracy from compromised baselines. By exhibiting strong generalizability across domains and attack patterns, SentinelNet establishes a novel paradigm for safeguarding collaborative MAS.
comment: Accepted at The ACM Web Conference 2026 (WWW 2026)
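The SentinelNet abstract above mentions dynamic neighbor ranking via bottom-k elimination. A minimal sketch of that selection step follows. It assumes each agent already holds a credit score per neighbor; the scoring itself (the contrastively trained detector) is out of scope here. The function name, the dictionary layout, and the tie-breaking by neighbor id are assumptions of this sketch.

```python
def bottom_k_elimination(credits, k):
    """Return the neighbors kept after dropping the k lowest-credit ones.

    credits maps a neighbor id to its current credit score (hypothetical layout).
    """
    # Rank ascending by credit; ties are broken by neighbor id for determinism.
    ranked = sorted(credits, key=lambda n: (credits[n], n))
    dropped = set(ranked[:max(k, 0)])
    return [n for n in credits if n not in dropped]


# Hypothetical credit table held by one agent.
kept = bottom_k_elimination({"a": 0.9, "b": 0.1, "c": 0.5}, k=1)
```

Messages from the dropped neighbors would then be ignored in the next debate round, which is how low-credit agents lose influence over time.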
♻ ☆ Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
♻ ☆ KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Knowledge-Intensive Visual Grounding (KVG) requires models to localize objects using fine-grained, domain-specific entity names rather than generic referring expressions. Although Multimodal Large Language Models (MLLMs) possess rich entity knowledge and strong generic grounding capabilities, they often fail to effectively utilize such knowledge when grounding specialized concepts, revealing a knowledge-grounding gap between internal knowledge and grounding predictions. To address this challenge, we propose a knowledge-aware training paradigm for KVG. Our approach first constructs knowledge-guided reasoning data to encourage models to activate domain-relevant entity knowledge during grounding, and then introduces KARL, a Knowledge-Aware Reinforcement Learning framework that adaptively modulates reward signals according to the model's estimated knowledge mastery of different entities. To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities. Extensive experiments show that our approach consistently outperforms a wide range of baseline models and achieves substantially stronger cross-domain generalization on unseen categories. The data, codes, and models are released at https://github.com/thunlp/KARL.
♻ ☆ Efficient Reasoning with Balanced Thinking ICLR 2026
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
comment: Accepted by ICLR 2026
♻ ☆ Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant dynamics. Unlike classical global Lyapunov stability, absorbing invariance expands the class of stability-preserving architectures, enabling more flexible and expressive EBMs. We extend EBM theory to nonsmooth activations by establishing negative energy dissipation via Clarke derivatives and deriving new conditions for radial unboundedness, exposing a stability-expressivity tradeoff in standard EBMs. To overcome this, we introduce a hybrid architecture with a dynamical visible layer and static hidden layers, prove absorbing invariance under mild assumptions, and show that these guarantees extend to port-Hamiltonian EBMs. Experiments on metric-deformed multi-well and ring systems validate the approach, showcasing how our hybrid EBM architecture combines expressivity with sound and provable safety guarantees by design.
♻ ☆ Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies
While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. By contrast, smaller models maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self-organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances
comment: 22 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2508.17366
♻ ☆ LEXam: Benchmarking Legal Reasoning on 340 Law Exams ICLR 2026
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 7,537 law exam questions in English and German. It includes both long-form, open-ended questions and multiple-choice questions with varying numbers of options. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/.
comment: Accepted to ICLR 2026
♻ ☆ V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions
Ensuring safety in autonomous systems requires controllers that aim to satisfy state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they struggle to ensure strict state-wise safety. Conversely, Control Barrier Functions (CBFs) offer a principled mechanism to enforce forward invariance, but often rely on expert-designed barrier functions or knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.
comment: 28 pages, 9 figures, 11 tables. Paper accepted at TMLR
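The V-OCBF abstract above says the learned barrier is used with a Quadratic Program to synthesize safe control. For the common special case of a single affine safety constraint, that QP has a closed-form solution, sketched below. The sketch assumes the constraint has already been written as `a · u + b >= 0`; how the paper obtains such a constraint from its learned barrier is not stated in the abstract, so `a`, `b`, and the function name are hypothetical.

```python
def safety_filter(u_nom, a, b):
    """Minimally modify u_nom so that a . u + b >= 0 holds.

    Solves: minimize ||u - u_nom||^2 subject to a . u + b >= 0,
    which for one affine constraint is a projection onto a half-space.
    """
    slack = sum(a_i * u_i for a_i, u_i in zip(a, u_nom)) + b
    if slack >= 0:
        return list(u_nom)  # nominal control is already safe
    norm_sq = sum(a_i * a_i for a_i in a)
    if norm_sq == 0:
        raise ValueError("constraint is infeasible: a is zero and b < 0")
    scale = -slack / norm_sq
    # Move along a just far enough to land on the constraint boundary.
    return [u_i + scale * a_i for u_i, a_i in zip(u_nom, a)]


# Hypothetical 2-D control with a constraint that the nominal action violates.
u_safe = safety_filter([1.0, 0.0], [-1.0, 0.0], 0.5)
```

When the nominal control already satisfies the constraint it passes through unchanged, which is why such filters tend to preserve task performance. With several constraints or input bounds, a general QP solver would be needed instead of this closed form.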
♻ ☆ Understanding visual attention beehind bee-inspired UAV navigation
Bio-inspired design is often used in autonomous UAV navigation due to the capacity of biological systems for flight and obstacle avoidance despite limited sensory and computational capabilities. In particular, honeybees mainly use the sensory input of optic flow, the apparent motion of objects in their visual field, to navigate cluttered environments. In our work, we train a Reinforcement Learning agent to navigate a tunnel with obstacles using only optic flow as sensory input. We inspect the attention patterns of trained agents to determine the regions of optic flow on which they primarily base their motor decisions. We find that agents trained in this way pay most attention to regions of discontinuity in optic flow, as well as regions with large optic flow magnitude. The trained agents appear to navigate a cluttered tunnel by avoiding the obstacles that produce large optic flow, while maintaining a centered position in their environment, which resembles the behavior seen in flying insects. This pattern persists across independently trained agents, which suggests that this could be a good strategy for developing a simple explicit control law for physical UAVs.
♻ ☆ Nomad: Autonomous Exploration and Discovery
We introduce Nomad, a system for autonomous data exploration and insight discovery. Given a corpus of documents, databases, or other data sources, users rarely know the full set of questions, hypotheses, or connections that could be explored. As a result, query-driven question answering and prompt-driven deep-research systems remain limited by human framing and often fail to cover the broader insight space. Nomad addresses this problem with an exploration-first architecture. It constructs an explicit Exploration Map over the domain and systematically traverses it to balance breadth and depth. It generates and selects hypotheses and investigates them with an explorer agent that can use document search, web search, and database tools. Candidate insights are then checked by an independent verifier before entering a reporting pipeline that produces cited reports and higher-level meta-reports. We also present a comprehensive evaluation framework for autonomous discovery systems that measures trustworthiness, report quality, and diversity. Using a corpus of selected UN and WHO reports, we show that Nomad produces more trustworthy and higher-quality reports than baselines, while also producing more diverse insights over several runs. Nomad is a step toward autonomous systems that not only answer user questions or conduct directed research, but also discover which questions, research directions, and insights are worth surfacing in the first place.
♻ ☆ PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis CVPR 2026
We introduce PhysGaia, a novel physics-aware benchmark for Dynamic Novel View Synthesis (DyNVS) that encompasses both structured objects and unstructured physical phenomena. While existing datasets primarily focus on photorealistic appearance, PhysGaia is specifically designed to support physics-consistent dynamic reconstruction. Our benchmark features complex scenarios with rich multi-body interactions, where objects realistically collide and exchange forces. Furthermore, it incorporates a diverse range of materials, including liquid, gas, textile, and rheological substance, moving beyond the rigid-body assumptions prevalent in prior work. To ensure physical fidelity, all scenes in PhysGaia are generated using material-specific physics solvers that strictly adhere to fundamental physical laws. We provide comprehensive ground-truth information, including 3D particle trajectories and physical parameters (e.g., viscosity), enabling the quantitative evaluation of physical modeling. To facilitate research adoption, we also provide integration pipelines for recent 4D Gaussian Splatting models along with our dataset and their results. By addressing the critical shortage of physics-aware benchmarks, PhysGaia can significantly advance research in dynamic view synthesis, physics-based scene understanding, and the integration of deep learning with physical simulation, ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes.
comment: Accepted at CVPR 2026; Project page: http://cvlab.snu.ac.kr/research/PhysGaia; Dataset: https://huggingface.co/datasets/mijeongkim/PhysGaia/tree/main
♻ ☆ MemRerank: Preference Memory for Personalized Product Reranking
LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.
comment: correct author name in metadata
♻ ☆ PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs ICLR 2026
Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
comment: Accepted at the ICLR 2026 Workshop on Logical Reasoning of Large Language Models
♻ ☆ WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport
The Wasserstein-Fisher-Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth-death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time. The Python code is available at https://github.com/QiangweiPeng/WFR-FM.
♻ ☆ Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
Recent advances in generative modeling have enabled perceptual video compression at ultra-low bitrates, yet existing methods predominantly treat the generative model as a refinement or reconstruction module attached to a separately designed codec backbone. We propose \emph{Generative Video Codebook Codec} (GVCC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies -- \emph{Image-to-Video} (I2V) with autoregressive GOP chaining, tail latent residual correction, and adaptive atom allocation, \emph{Text-to-Video} (T2V) operating at near-zero side information as a pure generative prior, and \emph{First-Last-Frame-to-Video} (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVCC achieves high-quality reconstruction below 0.002\,bpp while supporting flexible bitrate control through a single hyperparameter.
comment: 9 pages, 3 figures
♻ ☆ Labels Matter More Than Models: Rethinking the Unsupervised Paradigm in Time Series Anomaly Detection
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on increasingly complex architectures to model normal data distributions. However, this algorithm-centric trend often overlooks the significant performance gains achievable from limited anomaly labels available in practical scenarios. This paper challenges the premise that algorithmic complexity is the optimal path for TSAD. Instead of proposing another intricate unsupervised model, we present a comprehensive benchmark and empirical study to rigorously compare supervised and unsupervised paradigms. To isolate the value of labels, we introduce STAND, a deliberately minimalist supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds the incremental gains from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a paradigm shift in TSAD research, urging the community to prioritize data-centric label utilization over purely algorithmic complexity. The code and benchmark are publicly available at https://github.com/EmorZz1G/STAND.
comment: 20 pages, 15 figures, 8 tables. Under review
♻ ☆ When Models Fabricate Credentials: Measuring How Professional Identity Suppresses Honest Self-Representation
When language models are assigned professional personas, they face a conflict between maintaining the persona and disclosing their AI nature. How models resolve this conflict has practical consequences: a model that constructs detailed narratives of medical training and board certifications presents a surface of professional authority it does not possess. We systematically characterize this behavior using AI identity disclosure as a testbed: when probed about expertise origins, a model can either acknowledge its AI nature or maintain its assigned professional identity. Using a factorial design, sixteen open-weight models were audited across 19,200 trials. Under neutral conditions, models disclosed their AI nature in 99.8%-99.9% of interactions; assigning a professional persona reduced disclosure to 36.3% on average, though this suppression was highly context-dependent: the same models that maintained a neurosurgeon persona often disclosed under a financial advisor persona, a 9.7-fold difference. Counter to expectations that greater scale should support broader behavioral generalization, model size explained little of this variation, while model identity explained substantially more (Delta R_adj^2 = 0.375 vs. 0.012). We hypothesized that instruction-following dynamics contribute to these patterns and probed this directly: varying a single system prompt statement increased disclosure from 23.7% to 65.8%, while general honesty instructions produced negligible effects. Self-representational behavior does not generalize across professional contexts; instead, models exhibit sharp and sometimes unexpected differences under minor environmental changes, with training choices appearing to matter more than scale.
comment: Submitted to COLM; 43 pages, 12 figures, 15 tables; sharpen focus of paper and reduced length of paper
♻ ☆ Towards Transparent and Efficient Anomaly Detection in Industrial Processes through ExIFFI
Anomaly Detection (AD) is crucial in industrial settings to streamline operations by detecting underlying issues. Conventional methods merely label observations as normal or anomalous, lacking crucial insights. In Industry 5.0, interpretable outcomes become desirable to enable users to understand the rationale behind model decisions. This paper presents the first industrial application of ExIFFI, a recent approach for fast, efficient explanations for the Extended Isolation Forest (EIF) AD method. ExIFFI is tested on four industrial datasets, demonstrating superior explanation effectiveness, computational efficiency and improved raw anomaly detection performance. ExIFFI reaches over 90\% average precision on all the benchmarks considered in the study and outperforms state-of-the-art Explainable Artificial Intelligence (XAI) approaches in terms of the feature selection proxy task metric, which was specifically introduced to quantitatively evaluate model explanations.
comment: Submitted to IEEE Transactions on Industry Applications
Software Engineering 30
☆ LLMs as Idiomatic Decompilers: Recovering High-Level Code from x86-64 Assembly for Dart
Translating machine code into human-readable high-level languages is an open research problem in reverse engineering. Despite recent advancements in LLM-based decompilation to C, modern languages like Dart and Swift are unexplored. In this paper, we study the use of small specialized LLMs as an idiomatic decompiler for such languages. Additionally, we investigate the augmentation of training data using synthetic same-language examples, and compare it against adding human-written examples from a related language (Swift -> Dart). We apply CODEBLEU to evaluate the readability of the decompiled code and compile@k to measure its syntactic correctness. Our experimental results show that on a 73-function Dart test dataset (representing diverse complexity levels), our 4B specialized model achieves 71.3 CODEBLEU (95% CI 65.5-77.1), approximately comparable to a ~480B code model (73.1; 67.4-78.8). On a subset of 34 natural Dart functions, it reaches compile@k5 = 79.4% (Wilson 95% CI 63.2-89.7), vs. 64.7% (47.9-78.5) for the base model; the difference is suggestive but not statistically significant at 0.05. Our results indicate that adding Swift training data helps at 8B but not at 4B, suggesting a capacity threshold for effective cross-lingual transfer. Overall, the results show that small specialized models can generate readable, idiomatic Dart with meaningful identifiers while using minimal compute.
comment: 5 pages, 1 figure, 3 tables. Accepted at SANER 2026 ERA Track
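The Wilson 95% intervals quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not the authors' evaluation code: the success counts 27/34 and 22/34 are inferred from the reported 79.4% and 64.7% on 34 functions, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Counts are an assumption inferred from the percentages in the abstract.
    for label, k in (("specialized 4B model", 27), ("base model", 22)):
        low, high = wilson_interval(k, 34)
        print(f"{label}: {k}/34 = {k / 34:.1%}, Wilson 95% CI {low:.1%} to {high:.1%}")
```

Under these assumed counts the intervals round to 63.2-89.7 and 47.9-78.5, matching the abstract. The two intervals overlap substantially, which is consistent with the authors' caution that the difference is only suggestive at this sample size.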
☆ Evaluation of gNB Monostatic Sensing for UAV Use Case
3GPP Release 19 has initiated the standardization of integrated sensing and communications (ISAC), including a channel model for monostatic sensing, evaluation scenarios, and performance assessment methodologies. These common assumptions provide an important basis for ISAC evaluation, but reproducible end-to-end studies still require a transparent sensing implementation. This paper evaluates 5G New Radio (NR) base station (gNB)-based monostatic sensing for the Unmanned Aerial Vehicle (UAV) use case using a 5G NR downlink Cyclic Prefix-Orthogonal Frequency Division Multiplexing (CP-OFDM) waveform and positioning reference signals (PRS), following 3GPP Urban Macro-Aerial Vehicle (UMa-AV) scenario assumptions. We present an end-to-end processing chain for multi-target detection and 3D localization, achieving more than 70% detection probability with less than 5% false alarm rate, in the considered scenario. For correctly detected targets, localization errors are on the order of a few meters, with a 90th-percentile error of 4m and 6m in the vertical and horizontal directions, respectively. To support reproducible baseline studies and further research, we release the simulator 5GNRad, which reproduces our evaluation
☆ Semantic Evolution over Populations for LLM-Guided Automated Program Repair
Large language models (LLMs) have recently shown strong potential for automated program repair (APR), particularly through iterative refinement that generates and improves candidate patches. However, state-of-the-art iterative refinement LLM-based APR approaches cannot fully address challenges, including maintaining useful diversity among repair hypotheses, identifying semantically related repair families, composing complementary partial fixes, exploiting structured failure information, and escaping structurally flawed search regions. In this paper, we propose a Population-Based Semantic Evolution framework for APR iterative refinement, called EvolRepair, that formulates LLM-based APR as a semantic evolutionary algorithm. EvolRepair reformulates the search paradigm of classic genetic algorithm for APR, but replaces its syntax-based operators with semantics-aware components powered by LLMs and structured execution feedback. Candidate repairs are organized into behaviorally coherent groups, enabling the algorithm to preserve diversity, reason over repair families, and synthesize stronger candidates by recombining complementary repair insights across the population. By leveraging structured failure patterns to guide search direction, EvolRepair can both refine promising repair strategies and shift toward alternative abstractions when necessary. Our experiments show that EvolRepair substantially improves repair effectiveness over existing LLM-based APR approaches.
☆ A Case For Host Code Guided GPU Data Race Detector
Data races in GPU programs pose a threat to the reliability of GPU-accelerated software stacks. Prior works proposed various dynamic (runtime) and static (compile-time) techniques to detect races in GPU programs. However, dynamic techniques often miss critical races, as they require the races to manifest during testing. While static ones can catch such races, they often generate numerous false alarms by conservatively assuming values of variables/parameters that cannot ever occur during any execution of the program. We make a key observation that the host (CPU) code that launches GPU kernels contains crucial semantic information about the values that the GPU kernel's parameters can take during execution. Harnessing this hitherto overlooked information helps accurately detect data races in GPU kernel code. We create HGRD, a new state-of-the-art static analysis technique that performs a holistic analysis of both CPU and GPU code to accurately detect a broad set of true races while minimizing false alarms. While SOTA dynamic techniques, such as iGUARD, miss many true races, HGRD misses none. On the other hand, static techniques such as GPUVerify and FaialAA raise tens of false alarms, where HGRD raises none.
☆ Automated Functional Testing for Malleable Mobile Application Driven from User Intent
Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose \tool, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that \tool effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.
☆ A Model-Driven Digital Twin for the Systematic Improvement of DevOps Pipelines
CI/CD pipelines are central to DevOps practices, yet their growing complexity makes them increasingly difficult to interpret, analyze, and systematically evolve. Existing tooling primarily offers execution logs and static graph representations, providing limited support for structured analysis of pipeline behavior, failures, and version-to-version evolution. This paper presents a model-driven Digital Twin (DT) for CI/CD pipelines that leverages BPMN as a modeling backbone to transform raw CI configurations into structured, higher-level process representations. The proposed DT architecture enables visual abstraction of pipeline structure, failure tracing, and systematic version comparison, supporting both monitoring and evolution analysis of DevOps processes. Building upon validated DT architectural principles and prior work on build optimization and anomaly detection, the framework provides a modular, extensible foundation for integrating advanced analytical and prescriptive services into software delivery processes. The approach is validated using open-source CI/CD projects, and ongoing work targets the integration of additional improvement services and the extension of the DT to broader DevOps lifecycle processes.
☆ APITestGenie: Generating Web API Tests from Requirements and API Specifications with LLMs
Modern software systems rely heavily on Web APIs, yet creating meaningful and executable test scripts remains a largely manual, time-consuming, and error-prone task. In this paper, we present APITestGenie, a novel tool that leverages Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and prompt engineering to automatically generate API integration tests directly from business requirements and OpenAPI specifications. We evaluated APITestGenie on 10 real-world APIs, including 8 APIs comprising circa 1,000 live endpoints from an industrial partner in the automotive domain. The tool was able to generate syntactically and semantically valid test scripts for 89\% of the business requirements under test after at most three attempts. Notably, some generated tests revealed previously unknown defects in the APIs, including integration issues between endpoints. Statistical analysis identified API complexity and level of detail in business requirements as primary factors influencing success rates, with the level of detail in API documentation also affecting outcomes. Feedback from industry practitioners confirmed strong interest in adoption, substantially reducing the manual effort in writing acceptance tests, and improving the alignment between tests and business requirements.
☆ RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
comment: 11 pages, 10 figures. To be submitted to CAMLIS 2026
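The "5x5 generation strategy" described in the abstract above (five candidates, each with up to five refinement attempts) can be sketched as a small control loop. This is a hedged illustration only: the callables `generate`, `validate`, and `refine` are hypothetical placeholders, not RuleForge's actual interfaces, and candidates run sequentially here whereas the abstract describes them as parallel.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Rule = TypeVar("Rule")


def five_by_five(
    generate: Callable[[int], Rule],               # hypothetical: draft a rule from a seed
    validate: Callable[[Rule], Tuple[bool, str]],  # hypothetical: (accepted?, feedback)
    refine: Callable[[Rule, str], Rule],           # hypothetical: revise a rule using feedback
    n_candidates: int = 5,
    max_refinements: int = 5,
) -> List[Rule]:
    """Return every candidate that passes validation within its refinement budget."""

    def run_candidate(seed: int) -> Optional[Rule]:
        rule = generate(seed)
        ok, feedback = validate(rule)
        attempts = 0
        # Refine only while validation fails and the budget is not exhausted.
        while not ok and attempts < max_refinements:
            rule = refine(rule, feedback)
            attempts += 1
            ok, feedback = validate(rule)
        return rule if ok else None

    results = (run_candidate(seed) for seed in range(n_candidates))
    return [rule for rule in results if rule is not None]


if __name__ == "__main__":
    # Toy stand-ins: a "rule" is an integer that is accepted once it reaches 3.
    accepted = five_by_five(
        generate=lambda seed: seed,
        validate=lambda rule: (rule >= 3, "too small"),
        refine=lambda rule, feedback: rule + 1,
    )
    print(accepted)
```

Keeping generation, validation, and refinement as separate callables means a validation step, such as the LLM-as-a-judge check the abstract mentions, can be swapped in without touching the loop.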
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
The model context protocol (MCP) standardizes how LLMs connect to external tools and data sources, enabling faster integration but introducing new attack vectors. Despite the growing adoption of MCP, existing MCP security studies classify attacks by their observable effects, obscuring how attacks behave across different MCP server components and overlooking multi-component attack chains. Meanwhile, existing defenses are less effective when facing multi-component attacks or previously unknown malicious behaviors. This work presents a component-centric perspective for understanding and detecting malicious MCP servers. First, we build the first component-centric PoC dataset of 114 malicious MCP servers where attacks are achieved as manipulation over MCP components and their compositions. We evaluate these attacks' effectiveness across two MCP hosts and five LLMs, and uncover that (1) component position shapes attack success rate; and (2) multi-component compositions often outperform single-component attacks by distributing malicious logic. Second, we propose and implement Connor, a two-stage behavioral deviation detector for malicious MCP servers. It first performs pre-execution analysis to detect malicious shell commands and extract each tool's function intent, and then conducts step-wise in-execution analysis to trace each tool's behavioral trajectories and detect deviations from its function intent. Evaluation on our curated dataset indicates that Connor achieves an F1-score of 94.6%, outperforming the state of the art by 8.9% to 59.6%. In real-world detection, Connor identifies two malicious servers.
☆ Can Large Language Models Model Programs Formally?
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification, split into theorem proving and model checking, provides a feasible and reliable path. Unlike theorem proving, which has seen notable advances, model checking has received less attention because automatic program modeling is difficult. To fill this gap, we introduce Model-Bench, a benchmark and an accompanying pipeline for evaluating and improving LLMs' program modeling capability by translating Python programs into verification-ready specifications that an accompanying model checker can check. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling and suggest directions for improvement.
☆ TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning
With the rapid evolution of LLMs, automated software testing is witnessing a paradigm shift. While proprietary models like GPT-4o demonstrate impressive capabilities, their high deployment costs and data privacy concerns make open-source LLMs the practical imperative for many academic and industrial scenarios. Automated test generation has evolved toward iterative, LLM-based workflows for constructing test suites. When using open-source LLMs, we empirically observe that they lack a suite-level perspective and suffer from structural myopia: they fail to generate new tests with large marginal gain given the current coverage status. In this paper, from the perspective of sequences, we formalize test suite generation as an MDP and demonstrate that its objective exhibits monotone submodularity, which enables an effective relaxation of this NP-hard global optimization into a tractable step-wise greedy procedure. Guided by this insight, we propose TestDecision, which transforms LLMs into neural greedy experts. TestDecision consists of two synergistic components: (1) an inference framework that implements test suite construction following a step-wise greedy strategy; and (2) a reinforcement learning training pipeline that equips the base LLM with sequential test generation ability to maximize marginal gain. Comprehensive evaluations on the ULT benchmark demonstrate that TestDecision significantly outperforms existing advanced methods. It improves branch coverage by 38.15-52.37% and execution pass rate by 298.22-558.88% across all base models, achieving performance on a 7B backbone comparable to that of a much larger proprietary LLM, GPT-5.2. Furthermore, TestDecision finds 58.43-95.45% more bugs than vanilla base LLMs and exhibits superior generalization on LiveCodeBench, demonstrating its capability to construct high-quality test suites.
comment: 22 pages, 4 figures
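The step-wise greedy idea in the abstract above rests on a standard property: when the objective is monotone submodular (as set coverage is), repeatedly taking the item with the largest marginal gain is a tractable approximation to the global optimum. The sketch below illustrates that with a simplification that is not the paper's method: it selects from a fixed pool of candidate tests with known branch coverage, whereas TestDecision has an LLM generate each next test. All names and data are hypothetical.

```python
from typing import Dict, List, Set


def greedy_suite(candidates: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` tests, each maximizing newly covered branches."""
    covered: Set[str] = set()
    suite: List[str] = []
    for _ in range(budget):
        best_name, best_gain = None, 0
        for name, branches in candidates.items():
            if name in suite:
                continue
            gain = len(branches - covered)  # marginal gain given current coverage
            if gain > best_gain:            # strict '>' keeps the earliest test on ties
                best_name, best_gain = name, gain
        if best_name is None:               # no remaining test adds coverage
            break
        suite.append(best_name)
        covered |= candidates[best_name]
    return suite


if __name__ == "__main__":
    # Hypothetical candidate tests mapped to the branches they cover.
    pool = {
        "t1": {"a", "b", "c"},
        "t2": {"c", "d"},
        "t3": {"a", "b"},
        "t4": {"e"},
    }
    print(greedy_suite(pool, budget=3))
```

In this toy pool, `t3` is never selected because everything it covers is already covered by `t1`. That is the suite-level view of marginal gain that the abstract says base models lack.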
☆ Fuzzing REST APIs in Industry: Necessary Features and Open Problems
REST APIs are widely used in industry across many domains. An example is Volkswagen AG, a German automobile manufacturer. Established testing approaches for REST APIs are time consuming and require expertise from professional test engineers. Because REST API testing is costly and important, several approaches to automatically test REST APIs have been proposed in the scientific literature. The open-source, search-based fuzzer EvoMaster is one such tool. However, how academic prototypes can be integrated in industry and have real impact on software engineering practice requires more investigation. In this paper, we report on our experience in using EvoMaster at Volkswagen AG, as an EvoMaster user from 2023 to 2026. We share our lessons learned and discuss several features that need to be implemented in EvoMaster to make its use in an industrial context successful. Volkswagen AG provided feedback on the value of EvoMaster in industrial setups for 4 APIs. Additionally, a user study was conducted involving 11 testing specialists from 4 different companies. We further identify several real-world research challenges that still need to be solved.
comment: Extension from conference paper published at ICST'25
☆ Mitigating Implicit Inconsistencies in Patch Porting
Promptly porting patches from a source codebase to its variants (e.g., forks and branches) is essential for mitigating propagated defects and vulnerabilities. Recent studies have explored automated patch porting to reduce manual effort and delay, but existing approaches mainly handle inconsistencies visible in a patch's local context and struggle with those requiring global mapping knowledge between codebases. We refer to such non-local inconsistencies as implicit inconsistencies. Implicit inconsistencies pose greater challenges for developers to resolve due to their non-local nature. To address them, we propose MIP, which enables collaboration among an LLM, a compiler, and code analysis utilities. MIP adopts different strategies for different cases: when source identifiers exist in the target codebase, it leverages compiler diagnostics; otherwise, it retrieves matched code segment pairs from the two codebases as mapping knowledge for mitigation. Experiments on two representative scenarios, cross-fork and cross-branch patch porting, show that MIP successfully resolves more than twice as many patches as the best-performing baseline in both settings. A user study with our industry partner further demonstrates its practical effectiveness.
☆ GPA: Learning GUI Process Automation from Demonstrations
GUI Process Automation (GPA) is a lightweight but general form of vision-based Robotic Process Automation (RPA) that enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate and 10 times faster execution on long-horizon GUI tasks.
☆ Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study SC
This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The aim of the study was (1) to assess performance across accuracy, response stability, and token-level cost, and (2) to investigate which prompting method, zero-shot or few-shot, is most appropriate in terms of both accuracy and incurred cost. Our experiments show that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
comment: Preprint. Accepted to the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2026). Final version to be published by SCITEPRESS, http://www.scitepress.org
☆ EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild
Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.
comment: 13 pages, 7 figures. This is a technical report for the EXHIB benchmark. Code and data are available at https://github.com/fan1192/bfsd-anon-artifact
☆ ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style, and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change, and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%, revealing that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
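The fail-to-pass and multi-run stability ideas mentioned in the abstract can be illustrated with a minimal sketch. The function name `fail_to_pass` and the dictionary layout are hypothetical and not taken from the paper; the sketch assumes each test was run several times before and after the committed change, and keeps a test only if it failed on every run before and passed on every run after.

```python
from typing import Dict, List


def fail_to_pass(
    before_runs: Dict[str, List[bool]],
    after_runs: Dict[str, List[bool]],
) -> List[str]:
    """Return tests that fail on every pre-change run and pass on every post-change run.

    Hypothetical helper: each dict maps a test name to its pass/fail outcomes
    (True = pass) across repeated runs. Tests with mixed outcomes on either
    side are treated as unstable and dropped.
    """
    selected = []
    for name in sorted(before_runs):
        before = before_runs[name]
        after = after_runs.get(name, [])
        if not before or not after:
            continue  # no evidence on one side, so the test cannot qualify
        stably_failing = not any(before)
        stably_passing = all(after)
        if stably_failing and stably_passing:
            selected.append(name)
    return selected


before = {"t_a": [False, False], "t_b": [True, True], "t_c": [False, True]}
after = {"t_a": [True, True], "t_b": [True, True], "t_c": [True, True]}
print(fail_to_pass(before, after))
```

In this toy input, `t_b` already passed before the change and `t_c` is flaky before the change, so only `t_a` is kept.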
☆ EpiDroid: Dependency-Guided Recomposition for Deep State Discovery in Mobile GUI Testing
The increasing scale and complexity of mobile applications make automated GUI exploration essential for software quality assurance. However, existing methods often neglect state dependencies between test fragments, which leads to redundant exploration and prevents access to deep application states. We introduce EpiDroid, a black-box, pluggable framework that augments existing explorers through semantic state dependency awareness. EpiDroid distills raw traces into stable test fragments to extract underlying dependencies. It then employs a Recomposition-Replay paradigm to perform impact reasoning via LLM and deterministic replay on high-value mutable state elements. Through iterative feedback, EpiDroid refines the state-dependency graph to systematically reach deep application states. We integrated EpiDroid into both industrial and state-of-the-art research tools and evaluated it on 20 real-world apps. The results show that EpiDroid consistently improves the performance of all baselines, increasing average code coverage by 10-28% and delivering 3-4× more coverage gain compared to continuing the baselines alone from the same starting point. This demonstrates that dependency-guided recomposition unlocks deep states that forward exploration cannot access, irrespective of additional budget.
comment: 12 pages, 8 figures
☆ Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites
Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as diagnostic stressors to uncover and repair weaknesses in benchmark regression suites. Variants of the ground-truth patch that still pass the existing tests reveal under-constrained behaviors; these gaps then guide the generation of focused regression tests. A generated test is retained only if it (i) passes on the ground-truth patch, (ii) fails on at least one variant that survived the original suite, and (iii) remains valid under behavior-preserving transformations designed to guard against overfitting. Applied to SWE-bench Verified, STING finds that 77% of instances contain at least one surviving variant. STING produces 1,014 validated tests spanning 211 instances and increases patch-region line and branch coverage by 10.8% and 9.5%, respectively. Re-assessing the top-10 repair agents with the strengthened suites lowers their resolved rates by 4.2%-9.0%, revealing that a substantial share of previously passing patches exploit weaknesses in the benchmark tests rather than faithfully implementing the intended fix. These results underscore that reliable benchmark evaluation depends not only on patch generation, but equally on test adequacy.
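The three retention conditions (i)-(iii) listed in the abstract can be read as a single predicate. The sketch below is illustrative only: programs are modelled as plain Python callables, `retain_test` and the toy functions are hypothetical names, and condition (iii) is interpreted here as "the test still passes on behavior-preserving rewrites of the ground-truth patch", which is an assumption about the paper's setup.

```python
from typing import Callable, Iterable

Program = Callable[[int], int]              # toy stand-in for a patched program
RegressionTest = Callable[[Program], bool]  # returns True when the test passes


def retain_test(
    test: RegressionTest,
    ground_truth: Program,
    surviving_variants: Iterable[Program],
    preserved_forms: Iterable[Program],
) -> bool:
    """Keep a generated test only if all three conditions hold."""
    passes_truth = test(ground_truth)                              # condition (i)
    kills_variant = any(not test(v) for v in surviving_variants)   # condition (ii)
    stays_valid = all(test(p) for p in preserved_forms)            # condition (iii), as assumed above
    return passes_truth and kills_variant and stays_valid


# Toy instance: the intended fix computes an absolute value.
def ground_truth(x: int) -> int:
    return abs(x)


def identity_variant(x: int) -> int:
    return x  # semantically wrong, but survives tests that use only non-negative inputs


def rewritten_truth(x: int) -> int:
    return x if x >= 0 else -x  # behavior-preserving rewrite of the fix


def weak_test(program: Program) -> bool:
    return program(2) == 2


def strong_test(program: Program) -> bool:
    return program(-3) == 3
```

Here `strong_test` would be retained because it passes on the fix, fails on the surviving variant, and passes on the rewrite, while `weak_test` would be discarded because it does not distinguish the variant from the fix.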
☆ ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems
Tool-using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid-call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6800 tasks and a reproducible evaluation pipeline. Baseline results show fault-specific recovery gains for schema-aware methods, while overall success remains limited under the released authorization and hard-failure settings.
☆ From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.
☆ MOVis: A Visual Analytics Tool for Surfacing Missed Patches Across Software Variants
Clone-and-own development produces families of related software variants that evolve independently. As variants diverge, important fixes applied in one repository are often missing in others. PaReco has shown that thousands of such missed opportunity (MO) patches exist across real ecosystems, yet its textual output provides limited support for understanding where and how these fixes should be propagated. We present MOVis, a lightweight, interactive desktop tool that visualizes MO patches between a source and target variant. MOVis loads PaReco's MO classifications and presents patched and buggy hunks side-by-side, highlighting corresponding regions and exposing structural differences that hinder reuse. This design enables developers to quickly locate missed fixes, understand required adaptations, and more efficiently maintain consistency across software variants. The tool, replication package, and demonstration video are available at https://zenodo.org/records/18356553 and https://youtu.be/Ac-gjBxHJ3Y.
comment: 4 pages, 2 figures, 1 reference page
♻ ☆ Finding Memory Leaks in C/C++ Programs via Neuro-Symbolic Augmented Static Analysis
Memory leaks remain prevalent in real-world C/C++ software. Static analyzers such as CodeQL provide scalable program analysis but frequently miss such bugs because they cannot recognize project-specific custom memory-management functions and lack path-sensitive control-flow modeling. We present MemHint, a neuro-symbolic pipeline that addresses both limitations by combining LLMs' semantic understanding of code with Z3-based symbolic reasoning. MemHint parses the target codebase and applies an LLM to classify each function as a memory allocator, deallocator, or neither, producing function summaries that record which argument or return value carries memory ownership, extending the analyzer's built-in knowledge beyond standard primitives such as malloc and free. A Z3-based validation step checks each summary against the function's control-flow graph, discarding those whose claimed memory operation is unreachable on any feasible path. The validated summaries are injected into CodeQL and Infer via their respective extension mechanisms. Z3 path feasibility filtering then eliminates warnings on infeasible paths, and a final LLM-based validation step confirms whether each remaining warning is a genuine bug. On seven real-world C/C++ projects totaling over 3.4M lines of code, MemHint detects 52 unique memory leaks (49 confirmed/fixed, 4 CVEs submitted) at approximately $1.7 per detected bug, compared to 19 by vanilla CodeQL and 3 by vanilla Infer.
comment: 12 pages
♻ ☆ HEAS: Hierarchical Evolutionary Agent-Based Simulation Framework for Multi-Objective Policy Search
Metric aggregation divergence is a hidden confound in agent-based model policy search: when optimization, tournament evaluation, and statistical validation independently implement outcome metric extraction, champion selection reflects aggregation artifact rather than policy quality. We propose Hierarchical Evolutionary Agent Simulation (HEAS), a composable framework that eliminates this confound through a runtime-enforceable metric contract - a uniform metrics_episode() callable shared identically by all pipeline stages. Removing the confound yields robust champion selection: in a controlled experiment (n=30), HEAS reduces rank reversals by 50% relative to ad-hoc aggregation; the HEAS champion wins all 32 held-out ecological scenarios - a null-safety result that would be uninterpretable under aggregation divergence. The contract additionally reduces coupling code by 97% (160 to 5 lines) relative to Mesa 3.3.1. Three case studies validate composability across ecological, enterprise, and mean-field ordinary differential equation dynamics.
comment: 12 pages, 1 figure. Python package: https://pypi.org/project/heas/ | Web playground: https://ryzhanghason.github.io/heas/
♻ ☆ An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating
Technology changes so quickly that new trends tend to crowd out the focus on what was new just yesterday. This paper explores the peak popularity of the confluence of Object Technologies with early Web adoption through the content of a seminar held in 1999. Distributed architectures were undergoing significant change at this point, and deeper software capabilities were just beginning to be broadly accessible over the Internet. The Object Web arose and was infused with new development tools reflecting these capabilities and allowing design of applications for deployment during the early days of the World Wide Web. This conference discussed the history, evolution, and use of these tools and architectures, and their future possibilities. The continued dominance of these approaches, although under different names, is demonstrated, even though the term Object Web has receded in use. For example, favored newer offerings such as Kubernetes and microservices still model the core design attributes of the Object Web. Aside from connecting this seminar to the software world of today, this paper also touches on the early AI tools demonstrated in this seminar a quarter century ago and on how the popularity wave of any given technology might affect the current focus on AI technology offerings.
comment: Record of early Web Object technology and evolution since then covered in 6 pages with 4 figures
Think Anywhere in Code Generation
Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient as problems' full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process, where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.
♻ ☆ YASA: Scalable Multi-Language Taint Analysis on the Unified AST at Ant Group
Modern enterprises increasingly adopt diverse technology stacks with various programming languages, posing significant challenges for static application security testing (SAST). Existing taint analysis tools are predominantly designed for single languages, requiring substantial engineering effort that scales with language diversity. While multi-language tools like CodeQL, Joern, and WALA attempt to address these challenges, they face limitations in intermediate representation design, analysis precision, and extensibility, which make them difficult to scale effectively for large-scale industrial applications at Ant Group. To bridge this gap, we present YASA (Yet Another Static Analyzer), a unified multi-language static taint analysis framework designed for industrial-scale deployment. Specifically, YASA introduces the Unified Abstract Syntax Tree (UAST) that provides a unified abstraction for compatibility across diverse programming languages. Building on the UAST, YASA performs points-to analysis and taint propagation, leveraging a unified semantic model to manage language-agnostic constructs, while incorporating language-specific semantic models to handle other unique language features. When compared to 6 single- and 2 multi-language static analyzers on an industry-standard benchmark, YASA consistently outperformed all baselines across Java, JavaScript, Python, and Go. In real-world deployment within Ant Group, YASA analyzed over 100 million lines of code across 7.3K internal applications. It identified 314 previously unknown taint paths, with 92 of them confirmed as 0-day vulnerabilities. All vulnerabilities were responsibly reported, with 76 already patched by internal development teams, demonstrating YASA's practical effectiveness for securing large-scale industrial software systems.
♻ ☆ NaturalEdit: Code Modification through Direct Interaction with Adaptive Natural Language Representation
Code modification requires developers to comprehend code, plan changes, articulate intent, and validate outcomes, making it cognitively demanding. While natural language (NL) code summaries offer a promising external representation of this process, existing approaches remain limited. Systems grounded in exploratory data analysis are restricted to narrow domains, while general-purpose systems enforce fixed NL representations and assume that developers can directly translate vague intent into precise textual edits. We present NaturalEdit, which treats NL code summaries as interactive representations tightly linked to source code. Grounded in the Cognitive Dimensions of Notations, NaturalEdit introduces three key features: (1) adaptive, multi-faceted code summaries with a flexible Abstraction Gradient; (2) interactive mapping mechanisms between summaries and code that ensure tight, structurally stable Closeness of Mapping; and (3) intent-driven bidirectional synchronization that reduces Viscosity during editing while preserving Visibility and Consistency through incremental diffs. A technical evaluation confirms the viability of NaturalEdit, and a user study with 20 developers shows that it improves comprehension, intent articulation, and validation while increasing developers' confidence and sense of control.
♻ ☆ The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyze the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single-call tool use from long-horizon orchestration. Then, we organize the literature around six core dimensions: inference-time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi-tool agents.
♻ ☆ Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
Training effective software engineering agents requires large volumes of task-specific trajectories, incurring substantial data construction costs. Inspired by the "Less-Is-More" hypothesis in mathematical reasoning, we investigate its extension to agentic scenarios and propose an end-to-end training framework that achieves superior agentic capabilities with fewer but higher-quality training trajectories. This is achieved via STITCH (Sliding-memory Trajectory Inference and Task Chunking Heuristic), a coarse-to-fine mechanism that filters low-value noise and retains decision-critical tokens to maximize training signal quality. We conduct experiments across multiple agent frameworks (e.g., mini-SWE-agent, MSWE-agent), model scales (30B to 355B), and multilingual settings (Python, Java, and ArkTS). On SWE-bench Verified, models trained with STITCH achieve up to 63.16% relative improvement over base models. On Multi-SWE-bench (Java), MiniMax-M2.5-STITCH achieves 43.75% with our CodeArts Agent scaffold (+16.67%). On HarmonyOS (ArkTS), GLM-4.7-STITCH improves the compilation pass rate to 61.31% (+43.34%) with fewer than 1K training trajectories. Our results confirm that the "Less-Is-More" paradigm generalizes effectively to complex agentic tasks across diverse languages and model scales.
comment: 17 pages, 5 figures
Computer Vision and Pattern Recognition 196
☆ EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors CVPR 2026
We propose EventHub, a novel framework for training deep event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either both proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or only proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.
comment: CVPR 2026. Project Page: https://bartn8.github.io/eventhub/ Code: https://github.com/bartn8/eventhub
☆ ActionParty: Multi-Subject Action Binding in Generative Video Games
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
comment: Project page: https://action-party.github.io/
☆ Generative World Renderer
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
comment: Project page: https://alaya-studio.github.io/renderer/
☆ Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection CVPR
We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
comment: Accepted at CVPR Findings 2026
☆ Steerable Visual Representations
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
comment: preprint
☆ Beyond Referring Expressions: Scenario Comprehension Visual Grounding SC
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
comment: 20 pages, 18 figures, Project Page: https://catherine-r-he.github.io/RSC/
☆ Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining CVPR2026
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
comment: Accepted in CVPR2026. Website: https://junxuan-li.github.io/lca
☆ Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
comment: 10 pages, 6 figures
☆ A Simple Baseline for Streaming Video Understanding
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
comment: Project page: https://simple-stream.github.io/
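A minimal sketch of the sliding-window idea described in the abstract above: keep only the most recent N frames and hand them to a vision-language model when a question arrives. The names `SlidingWindowBuffer`, `answer_query`, and the `vlm` callable are hypothetical and are not taken from the SimpleStream code; the default window of 4 mirrors the 4-frame setting quoted in the abstract.

```python
from collections import deque


class SlidingWindowBuffer:
    """Hold only the most recent `window` frames of a video stream."""

    def __init__(self, window=4):
        if window < 1:
            raise ValueError("window must be at least 1")
        # A bounded deque silently drops the oldest frame once it is full.
        self.frames = deque(maxlen=window)

    def push(self, frame):
        """Append the newest frame, evicting the oldest if needed."""
        self.frames.append(frame)

    def context(self):
        """Return the retained frames in temporal order (oldest first)."""
        return list(self.frames)


def answer_query(buffer, question, vlm):
    """Query a model with the current window.

    `vlm` is an assumed callable taking (frames, question); it stands in
    for whatever off-the-shelf VLM interface is actually used.
    """
    return vlm(buffer.context(), question)
```

The design point is that no memory, retrieval, or compression module is involved: everything older than the window is simply discarded.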
☆ VOID: Video Object and Interaction Deletion
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
☆ AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
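To illustrate why a sliced Wasserstein distance has log-linear cost, here is a rough Monte Carlo sketch of the squared sliced 2-Wasserstein distance between two point sets. It assumes equal-sized sets with uniform weights, random Gaussian projection directions, and a hypothetical function name `sliced_wasserstein2`; it is not the AdamFlow implementation and does not include the optimiser.

```python
import math
import random


def sliced_wasserstein2(xs, ys, n_proj=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Assumes `xs` and `ys` are equal-length lists of points (tuples of
    floats) carrying uniform mass. Each projection costs one sort, which
    is where the O(n log n) behaviour comes from.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-sized point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_proj):
        # Draw a random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Project both sets onto the direction and sort the 1D samples.
        px = sorted(sum(p * c for p, c in zip(pt, d)) for pt in xs)
        py = sorted(sum(p * c for p, c in zip(pt, d)) for pt in ys)
        # In 1D, optimal transport matches sorted samples in order.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_proj
```

Because the 1D problem is solved by sorting, no pairwise point matching between the two surfaces is needed.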
☆ Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
☆ Deep Neural Network Based Roadwork Detection for Autonomous Driving
Road construction sites create major challenges for both autonomous vehicles and human drivers due to their highly dynamic and heterogeneous nature. This paper presents a real-time system that detects and localizes roadworks by combining a YOLO neural network with LiDAR data. The system identifies individual roadwork objects while driving, merges them into coherent construction sites and records their outlines in world coordinates. The model training was based on an adapted US dataset and a new dataset collected from test drives with a prototype vehicle in Berlin, Germany. Evaluations on real-world road construction sites showed a localization accuracy below 0.5 m. The system can support traffic authorities with up-to-date roadwork data and could enable autonomous vehicles to navigate construction sites more safely in the future.
comment: 7 pages, 10 figures
☆ Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with a 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond the 0.583 baseline level, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
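One way to picture budgeted forgetting that combines recency, frequency, and semantic alignment is a weighted score followed by a top-k cut. Everything specific in this sketch is an assumption for illustration: the weights, the exponential half-life decay, the saturating frequency term, cosine similarity as the alignment measure, and the record fields `last_access`, `hits`, and `vec`. The abstract does not specify the scoring function or the optimization used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def relevance_score(item, query_vec, now,
                    w_rec=0.4, w_freq=0.3, w_sem=0.3, half_life=10.0):
    """Weighted mix of recency, frequency, and semantic alignment (assumed form)."""
    recency = 0.5 ** ((now - item["last_access"]) / half_life)  # decays with age
    frequency = 1.0 - 1.0 / (1.0 + item["hits"])                # saturates toward 1
    semantic = cosine(item["vec"], query_vec)
    return w_rec * recency + w_freq * frequency + w_sem * semantic


def forget(memory, query_vec, now, budget):
    """Keep the `budget` highest-scoring memories and drop the rest."""
    ranked = sorted(memory,
                    key=lambda m: relevance_score(m, query_vec, now),
                    reverse=True)
    return ranked[:budget]
```

The fixed `budget` is what bounds memory growth: however long the conversation runs, at most that many entries survive each pruning pass.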
☆ Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
☆ SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation CVPR 2026
Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
comment: Accepted to CVPR 2026
☆ UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $π_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76\% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4\% (to 0.0571s) compared to the original $π_{0.5}$, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track\_VLA.
☆ SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition ICPR 2026
Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
comment: Accepted to ICPR 2026
☆ UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
comment: code has been released at https://github.com/xiaomi-research/unidrivevla
☆ Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention
This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George's University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).
☆ CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification
This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.
comment: 5 pages, 3 figures. Accepted to the IEEE ISBI 2026 CXR-LT Challenge
☆ ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
comment: 7 pages, 4 figures
☆ Reflection Generation for Composite Image Using Diffusion Model
Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject prior information about reflection placement and reflection appearance into a foundation diffusion model. We also divide reflections into two types and adopt a type-aware model design. To support training, we construct DEROBA, the first large-scale object reflection dataset. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
☆ Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation CVPR 2026
Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
comment: CVPR 2026
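The notion of split-level noise can be sketched as follows: re-draw subject-exclusive folds under several seeds, evaluate the same model on each assignment, and summarize how much the metric moves. The helper names `subject_exclusive_folds` and `split_noise` are hypothetical, and reporting half the range alongside the sample standard deviation is an assumed summary; the abstract does not state how its noise floor is computed.

```python
import random
import statistics


def subject_exclusive_folds(subject_ids, k=3, seed=0):
    """Partition unique subjects into k folds so no subject spans two folds."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Round-robin assignment keeps fold sizes within one subject of each other.
    return [subjects[i::k] for i in range(k)]


def split_noise(scores_per_seed):
    """Summarize metric spread across fold assignments.

    Returns (half_range, sample_std) for a list of scores, one score per
    random fold assignment of the same model and data.
    """
    half_range = (max(scores_per_seed) - min(scores_per_seed)) / 2.0
    return half_range, statistics.stdev(scores_per_seed)
```

A reported gain smaller than this spread cannot be distinguished from a lucky fold assignment, which is the concern the abstract raises.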
☆ CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
☆ DenOiS: Dual-Domain Denoising of Observation and Solution in Ultrasound Image Reconstruction
Medical imaging aims to recover underlying tissue properties, using inexact (simplified/linearized) imaging models and often from inaccurate and incomplete measurements. Analytical reconstruction methods rely on hand-crafted regularization, sensitive to noise assumptions and parameter tuning. Among deep learning alternatives, plug-and-play (PnP) approaches learn regularization while incorporating imaging physics during inference, outperforming purely data-driven methods. The performance of all these approaches, however, still strongly depends on measurement quality and imaging model accuracy. In this work, we propose DenOiS, a framework that denoises both input observations and resulting solution in their respective domains. It consists of an observation refinement strategy that corrects degraded measurements while compensating for imaging model simplifications, and a diffusion-based PnP reconstruction approach that remains robust under missing measurements. DenOiS enables generalization to real data from training only in simulations, resulting in high-fidelity image reconstruction with noisy observations and inexact imaging models. We demonstrate this for speed-of-sound imaging as a challenging setting of quantitative ultrasound image reconstruction.
☆ CASHG: Context-Aware Stylized Online Handwriting Generation
Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor--current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
comment: 42 pages, 19 figures
☆ LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
☆ GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding CVPR 2026
Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
comment: Published as a conference paper at CVPR 2026
☆ Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology
Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset's unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at https://github.com/YanKong0408/Center-DETR.
comment: ISBI 2026 Accepted Paper & Winning Solution for the RIVA Cervical Cytology Challenge
☆ FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose \textit{FlowSlider}, a training-free method for continuous editing in Rectified Flow that requires no post-training. \textit{FlowSlider} decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, \textit{FlowSlider} provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
comment: HuggingFace Space: https://huggingface.co/spaces/dominoer/FlowSlider
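The strength control described in the abstract above (scale only the steering term, leave the fidelity term unchanged) might be sketched as follows. How the two terms are actually computed from FlowEdit's update is not given in the abstract, so both vectors are taken here as already-available inputs, and the names `slider_update`, `cosine`, and `strength` are hypothetical.

```python
import math


def slider_update(fidelity, steering, strength):
    """Combine the two terms, scaling only the steering term."""
    return [f + strength * s for f, s in zip(fidelity, steering)]


def cosine(u, v):
    """Cosine similarity, useful for checking approximate orthogonality."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With `strength = 0` the update reduces to the fidelity term alone, and a cosine near zero between the two terms corresponds to the approximate orthogonality the abstract reports.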
☆ Country-wide, high-resolution monitoring of forest browning with Sentinel-2 SP
Natural and anthropogenic disturbances are impacting the health of forests worldwide. Monitoring forest disturbances at scale is important to inform conservation efforts. Here, we present a scalable approach for country-wide mapping of forest greenness anomalies at the 10 m resolution of Sentinel-2. Using relevant ecological and topographical context and an established representation of the vegetation cycle, we learn a predictive quantile model of the normalised difference vegetation index (NDVI) derived from Sentinel-2 data. The resulting expected seasonal cycles are used to detect NDVI anomalies across Switzerland between April 2017 and August 2025. Goodness-of-fit evaluations show that the conditional model explains 65% of the observed variations in the median seasonal cycle. The model consistently benefits from the local context information, particularly during the green-up period. The approach produces coherent spatial anomaly patterns and enables country-wide quantification of forest browning. Case studies with independent reference data from known events illustrate that the model reliably detects different types of disturbances.
comment: 9 pages, 7 figures, to be published in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Congress)
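As a minimal sketch of the quantities named in the abstract above: NDVI is the standard normalised difference of near-infrared and red reflectance (Sentinel-2 bands B8 and B4), and a browning anomaly can be flagged when the observation falls below a lower quantile predicted for that date. The decision rule and the function names are assumptions for illustration; the abstract does not specify which quantile or threshold the authors use.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (nir - red) / total


def is_browning(observed_ndvi, lower_quantile_ndvi):
    """Assumed rule: flag a pixel whose NDVI is below the predicted lower quantile."""
    return observed_ndvi < lower_quantile_ndvi
```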
☆ PLUME: Latent Reasoning Based Universal Multimodal Embedding
Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
☆ Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection CVPR 2026
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
comment: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
☆ Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement SC
Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017--2024) across 89 industry sectors, we demonstrate that graph-theoretic features, including centrality measures and clustering coefficients, improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed ($R^2$ falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
comment: Accepted for Poster presentation at the ESCoE Conference on Economic Measurement 2026
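Two of the graph quantities mentioned in the abstract above, network density and a simple centrality measure, can be sketched for a directed payment graph as below. This is an illustration under assumptions: edges are (payer, payee) sector pairs, self-payments are ignored, and in-degree centrality stands in for whichever centrality measures the authors actually use. The function names are hypothetical.

```python
def density(edges, nodes):
    """Fraction of possible directed links (excluding self-loops) that exist."""
    unique = {(s, t) for s, t in edges if s != t}
    n = len(nodes)
    return len(unique) / (n * (n - 1))


def in_degree_centrality(edges, nodes):
    """Share of other sectors that pay into each sector."""
    payers = {node: set() for node in nodes}
    for source, target in edges:
        if source != target:
            payers[target].add(source)
    scale = len(nodes) - 1
    return {node: len(sources) / scale for node, sources in payers.items()}
```

For example, with sectors `["A", "B", "C"]` and flows `A->B`, `B->C`, `A->C`, three of the six possible directed links are present, and sector `C` receives payments from every other sector.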
☆ CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
comment: Code available at: github.com/Lorenzo-0-0/CompassAD
☆ COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing
Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
☆ True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines
Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $ΔE$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances. Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
comment: 20 pages, 10 figures
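The two colorimetric measures named in the abstract above can be sketched in a few lines. The Individual Typology Angle follows its usual definition, ITA = arctan((L* - 50) / b*) in degrees. For the color difference, the abstract does not say which $ΔE$ variant is used, so the plain Euclidean CIE76 form is assumed here.

```python
import math


def ita_degrees(L, b):
    """Individual Typology Angle in degrees from CIELAB L* and b*."""
    # atan2 matches arctan((L - 50) / b) for b > 0 and stays defined at b == 0.
    return math.degrees(math.atan2(L - 50.0, b))


def delta_e_cie76(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples (CIE76, assumed variant)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))
```

Higher ITA values correspond to lighter skin tones, so comparing ITA before and after rendering gives a tone-shift measure alongside the overall $ΔE$ error.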
☆ Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
comment: 18 pages, 7 figures
☆ Efficient Reasoning via Thought Compression for Language Segmentation
Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of \textit{thinking twice -- once for learning, once for speed}. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5$\times$ -- from 112 to just 23 tokens. Code is available at https://github.com/mrazhou/WISE.
☆ IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
comment: Accepted at Conference on Computer Vision and Pattern Recognition Workshops 2026
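Two of the agreement measures listed in the abstract above, Cohen's $κ$ and mask IoU, have standard definitions that can be sketched directly. The input formats are assumptions for illustration: labels are equal-length sequences of categories, and masks are sets of pixel coordinates.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)


def mask_iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty masks are treated as identical
    return len(mask_a & mask_b) / len(union)
```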
☆ Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions and on three diverse, uncontrolled real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of data representation within a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.
☆ Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.
☆ Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
☆ Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation (MDE) models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
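The "linear fitting of MDE" baseline mentioned in the abstract above can be read as fitting one global scale and shift that map relative depth to metric height where a height prior exists. The sketch below shows that reading as an ordinary least-squares fit; it is an assumption about the baseline, not the proposed method, which predicts spatially varying scale and shift. Function names are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares scale and shift so that scale * relative + shift ~ metric."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    var = sum((r - mean_r) ** 2 for r in relative)
    scale = cov / var
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative, scale, shift):
    """Apply the fitted global transform to relative depth values."""
    return [scale * r + shift for r in relative]
```

A single global pair like this cannot correct errors that vary across the scene, which is the gap that spatially varying parameters are meant to close.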
☆ ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
☆ MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction ICME 2026
Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
comment: accepted by ICME 2026, to be published
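To make the linear-complexity claim in the abstract above concrete, here is a generic linear-attention sketch. Keys and values are first summarised into one context matrix, which every query then reuses, so cost grows linearly with the number of tokens. The feature map `phi` (ELU plus one) is a common choice assumed here; the abstract does not specify the map or the exact block design used in MTLSI-Net.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (assumed choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(Q, K, V):
    """Attention via a shared context matrix instead of an N x N score matrix."""
    d, dv = len(K[0]), len(V[0])
    context = [[0.0] * dv for _ in range(d)]  # sum over tokens of phi(k)^T v
    z = [0.0] * d                             # sum over tokens of phi(k)
    for k_row, v_row in zip(K, V):
        fk = [phi(x) for x in k_row]
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                context[a][b] += fk[a] * v_row[b]
    out = []
    for q_row in Q:
        fq = [phi(x) for x in q_row]
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append(
            [sum(fq[a] * context[a][b] for a in range(d)) / norm for b in range(dv)]
        )
    return out
```

In a multi-task setting, queries from different tasks could all read from the same `context`, which is one way to interpret a shared global context matrix.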
☆ Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation
Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35 GB to around 20 GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
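One way to picture the frequency-domain spectral consistency mentioned in the abstract above is to compare the magnitude spectra of two motion signals. The sketch below does this for 1D sequences with a direct discrete Fourier transform. It is a generic illustration under assumptions: the abstract does not define the actual loss, and `spectral_loss` is a hypothetical name.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectral_loss(simulated, reference):
    """Mean squared difference between the two magnitude spectra."""
    spec_s = magnitude_spectrum(simulated)
    spec_r = magnitude_spectrum(reference)
    return sum((a - b) ** 2 for a, b in zip(spec_s, spec_r)) / len(spec_s)
```

Magnitude spectra ignore circular time shifts, so a loss of this kind compares oscillation content rather than frame-by-frame alignment, which fits the abstract's pairing of a spectral term with a separate spatial term.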
☆ Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
☆ Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively against vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
☆ Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.
☆ NearID: Identity Representation Learning via Near-identity Distractors
When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
comment: Code at https://github.com/Gorluxor/NearID
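The hierarchy stated in the abstract (same identity > NearID distractor > random negative) can be illustrated with a small margin-based sketch. The abstract does not give the loss formula, so the hinge form, the function name `two_tier_margin_loss`, and both margin values below are assumptions made for illustration, not the authors' objective.

```python
def two_tier_margin_loss(sim_pos, sim_near, sim_rand,
                         margin_near=0.1, margin_rand=0.1):
    """Hinge penalties encouraging sim_pos > sim_near > sim_rand.

    All inputs are similarity scores to the same reference embedding.
    The margins are hypothetical values chosen for this sketch.
    """
    # Tier 1: the true cross-view match should beat the near-identity distractor.
    tier1 = max(0.0, margin_near - (sim_pos - sim_near))
    # Tier 2: the near-identity distractor should still beat a random negative.
    tier2 = max(0.0, margin_rand - (sim_near - sim_rand))
    return tier1 + tier2


# Correctly ordered similarities incur no penalty.
print(two_tier_margin_loss(0.9, 0.5, 0.1))
# A distractor ranked above the true match is penalized by tier 1.
print(two_tier_margin_loss(0.5, 0.6, 0.1))
```

The second call mirrors the failure case the abstract describes, where a pre-trained encoder ranks a distractor above the true cross-view match.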
☆ SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions
3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring. Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance. Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility. Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification. Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation. Code will be publicly available.
☆ Ego-Grounding for Personalized Question-Answering in Egocentric Videos CVPR'26
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
comment: To appear at CVPR'26
☆ Automated Prostate Gland Segmentation in MRI Using nnU-Net
Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows. To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
comment: 9 pages, 2 tables, 1 figure
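For readers unfamiliar with the metric reported above, the Dice score measures overlap between a predicted and a reference mask as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch on flattened binary masks; the function name `dice_score` and the empty-mask convention are assumptions of this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 0, 1]
print(dice_score(pred, target))  # 2 * 2 / (3 + 3)
```

A prediction that covers only a small part of the gland has a small intersection relative to the reference volume, which is how under-segmentation produces a low score such as the 0.15 reported for the general-purpose baseline.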
☆ MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction
Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16\,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
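The idea of using optical flow to separate dynamic from static regions can be sketched as a simple magnitude threshold. The abstract does not say how dynamic regions are selected, so the thresholding rule, the names `dynamic_mask` and `dynamic_fraction`, and the threshold value are all assumptions of this illustration.

```python
import math


def dynamic_mask(flow, threshold=1.0):
    """Mark pixels whose flow magnitude exceeds a threshold.

    flow is a 2D grid (list of rows) of (dx, dy) displacements in pixels.
    The threshold is a hypothetical value chosen for this sketch.
    """
    return [[1 if math.hypot(dx, dy) > threshold else 0 for dx, dy in row]
            for row in flow]


def dynamic_fraction(mask):
    """Fraction of pixels that would be routed to the expensive branch."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


flow = [
    [(0.0, 0.0), (0.2, 0.1), (3.0, 4.0)],
    [(0.0, 0.5), (1.5, 0.0), (0.0, 0.0)],
]
mask = dynamic_mask(flow)
print(mask)                    # [[0, 0, 1], [0, 1, 0]]
print(dynamic_fraction(mask))  # share of pixels flagged as dynamic
```

In a scheme like the one described, only the flagged positions would receive cross-modal attention, while the remaining positions would go through the lightweight interaction module.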
☆ A Self supervised learning framework for imbalanced medical imaging datasets
Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the robustness of SSL to imbalanced data has rarely been investigated in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.
☆ Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
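The routing idea behind RSRS can be sketched as follows. The sketch assumes that advantages are computed relative to the mean reward of a sample's rollouts, which the abstract does not state; under that assumption, a sample whose rollouts all receive zero reward yields all-zero advantages and therefore no RL learning signal. The names `centered_advantages` and `route_samples` are hypothetical.

```python
def centered_advantages(rewards):
    """Mean-centered advantages for one sample's rollouts (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def route_samples(rewards_per_sample):
    """Split sample ids into an RL batch and a supervised fine-tuning batch.

    rewards_per_sample maps a sample id to the rewards of its sampled captions.
    Samples whose rollouts all score zero are sent to supervised fine-tuning.
    """
    rl_batch, sft_batch = [], []
    for sample_id, rewards in rewards_per_sample.items():
        if all(r == 0 for r in rewards):
            sft_batch.append(sample_id)
        else:
            rl_batch.append(sample_id)
    return rl_batch, sft_batch


rewards = {"img_a": [0, 0, 0], "img_b": [0, 1.0, 0], "img_c": [0.5, 0.5, 1.0]}
print(centered_advantages(rewards["img_a"]))  # [0.0, 0.0, 0.0]: no RL signal
print(route_samples(rewards))                 # (['img_b', 'img_c'], ['img_a'])
```

Sending `img_a` to the supervised branch gives the model a target caption to imitate, whereas keeping it in the RL batch would contribute nothing under the assumed advantage scheme.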
☆ Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain
The accurate target-background separation in infrared small target detection (IRSTD) highly depends on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S$^2$CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR). Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
comment: The code will be released at https://github.com/fuyimin96/S2CPNet upon acceptance
☆ Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
☆ Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
comment: 10 pages, 6 figures
☆ Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
comment: 6 pages, 3 figures, 2 algorithms, ETRA26
☆ Lifting Unlabeled Internet-level Data for 3D Scene Understanding CVPR 2026
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning from low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
comment: CVPR 2026. Project page: https://sv-pp.github.io/
☆ Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.
comment: 16 pages, 8 figures, accepted by JSTARS
☆ FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation
Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
☆ SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes ICPR 2026
We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at https://github.com/POSE-Lab/SHARC.
comment: Accepted at ICPR 2026
☆ ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as \textit{spatial relations} and \textit{object attributes}, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose \textbf{ProVG}, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a \textit{survey-locate-verify} scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, \textit{i.e.}, RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
☆ Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters CVPR
Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work, we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories, we observe an attack success rate (ASR) of up to 74.47%.
comment: Text-to-Image version of the Anyone can Jailbreak paper. Accepted in CVPR-W AIMS 2026
☆ GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
☆ A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
☆ HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy gradually shrinks as the LLM layer depth increases, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
☆ GeoAI Agency Primitives
We present ongoing research on agency primitives for GeoAI assistants -- core capabilities that connect foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of $9$ primitives for such a layer -- including navigation, perception, geo-referenced memory, and dual modeling -- along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.
☆ MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
Autoregressive (AR) models have demonstrated significant success in the realm of text-to-image generation. However, they usually face two major challenges. First, the generated images may not always meet the quality standards expected by humans. Second, these models face difficulty when dealing with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, an innovative hierarchical autoregressive framework. It combines two main components: a metric-aware embedding regularization method and a probabilistic latent model for handling ambiguous semantics. Our method utilizes a lightweight projection head, which is trained with an adaptive kernel regression loss function. This aligns the model's internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2. As a result, the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module, which incorporates controlled randomness within the hierarchical token generation process. This capability allows the model to produce a diverse array of coherent images based on ambiguous or open-ended prompts. We conducted extensive experiments using COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model's performance, showing an improvement of +1.6 in CLIPScore and +5.3 in HPSv2. For unclear inputs, it produces a notably wider range of outputs. These findings have been confirmed through both human evaluation and automated metrics.
comment: Accepted by AMME 2025
☆ Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation CVPR2026
Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
comment: Accepted by CVPR2026 Workshop "AI-driven Skilled Activity Understanding, Assessment & Feedback Generation (SAUAFG)"
☆ Enhanced Polarization Locking in VCSELs
While optical injection locking (OIL) of vertical-cavity surface-emitting lasers (VCSELs) has been widely studied in the past, the polarization dynamics of OIL have received far less attention. Recent studies suggest that polarization locking via OIL could enable novel computational applications such as polarization-encoded Ising computers. However, the inherent polarization preference and limited polarization switchability of VCSELs hinder their use for such purposes. To address these challenges, we fabricate VCSELs with tailored oxide aperture designs and combine these with bias current tuning to study the overall impact on polarization locking. Experimental results demonstrate that this approach reduces the required injection power (to as low as 3.6 μW) and expands the locking range. To investigate the impact of the approach, the spin-flip model (SFM) is used to analyze the effects of amplitude anisotropy and bias current on polarization locking, demonstrating strong coherence with experimental results.
☆ Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
☆ FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting
Gaussian Splatting (GS) has emerged as a dominant technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par with or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than the state-of-the-art GS CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: https://github.com/PaPieta/fact-gs.
☆ Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images ICPR 2026
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
comment: 15 pages plus references; 5 figures; supplementary appended; accepted to ICPR 2026
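The abstract above contrasts naive nearest-neighbour quantization with a quantizer based on optimal bipartite matching. The following is a minimal sketch of that contrast, not the authors' algorithm: it assumes the matching assigns each latent vector a distinct codebook entry so that total squared distance is minimized, and it solves the assignment by brute force for readability. The function names `nearest_quantize` and `matching_quantize` are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_quantize(latents, codebook):
    """Naive VQ: each latent independently picks its closest code index."""
    return [
        min(range(len(codebook)), key=lambda j: sq_dist(z, codebook[j]))
        for z in latents
    ]


def matching_quantize(latents, codebook):
    """Assign each latent a distinct code, minimizing total squared distance.

    Assumption for illustration: one-to-one matching between latents and
    codes. Brute force over permutations, so only usable for tiny inputs;
    requires len(latents) <= len(codebook), otherwise returns None.
    """
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(len(codebook)), len(latents)):
        cost = sum(sq_dist(z, codebook[j]) for z, j in zip(latents, assign))
        if cost < best_cost:
            best_cost, best_assign = cost, list(assign)
    return best_assign


latents = [[0.0], [0.1]]
codebook = [[0.0], [1.0], [5.0]]
naive = nearest_quantize(latents, codebook)      # both latents collapse to code 0
matched = matching_quantize(latents, codebook)   # codes 0 and 1 are both used
```

In this toy case the naive quantizer maps both latents to the same code, while the matched assignment uses two different codes. That is one plausible reading of how a matching can raise effective capacity for an unordered set of codes. A practical implementation would replace the brute-force search with a polynomial-time assignment solver.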
☆ Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers
Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.
☆ Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification
Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.
☆ Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality adaptation (language and vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
☆ SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers CVPR26
Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Code is available at https://github.com/deng12yx/SafeRoPE.
comment: CVPR26
☆ STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
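The low-reward-variance problem described above can be illustrated with a small sketch. It assumes scalar rewards, a mean/standard-deviation normalization in the style of group-based policy optimization, and a simple pooling of all (variant, generation) pairs. The abstract does not give the exact normalization, so the function names and the pooling rule here are illustrative assumptions.

```python
from statistics import mean, pstdev


def per_group_advantages(rewards, eps=1e-6):
    """Baseline: normalize rewards within one group of text generations."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(rewards_by_variant, eps=1e-6):
    """Normalize rewards over all (visual variant, generation) pairs at once.

    rewards_by_variant holds one inner list of scalar rewards per
    spatiotemporal variant of the same video/question pair.
    """
    pooled = [r for variant in rewards_by_variant for r in variant]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [
        [(r - mu) / (sigma + eps) for r in variant]
        for variant in rewards_by_variant
    ]


# Variant 0: every generation is correct. Variant 1: mostly incorrect.
rewards = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
within = per_group_advantages(rewards[0])  # all zeros: no learning signal
joint = joint_advantages(rewards)          # variant 0 now gets positive advantages
```

When all generations for one input receive the same reward, within-group normalization yields zero advantages. Pooling across visual variants restores a non-zero signal, which is the effect the abstract attributes to joint normalization.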
☆ A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, the method achieved an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort. On the external validation dataset, it achieved an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
☆ PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency CVPR 2026
Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.
comment: Accepted at CVPR 2026
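The recursive Bayesian scale estimate mentioned in the abstract above can be sketched as a scalar Gaussian filter. This is a minimal illustration under stated assumptions: the belief over the metric scale is Gaussian, the per-frame measurement is the median ratio of triangulated metric depth to relative depth at the same pixels, and the noise values are placeholders. The class name `ScaleFilter` and all parameters are hypothetical and not taken from the paper.

```python
from statistics import median


class ScaleFilter:
    """Scalar Gaussian (Kalman-style) belief over the metric scale factor."""

    def __init__(self, scale=1.0, variance=1.0, process_noise=1e-3):
        self.scale = scale                  # current mean of the scale belief
        self.variance = variance            # current uncertainty of the belief
        self.process_noise = process_noise  # assumed per-frame scale drift

    def update(self, sparse_metric_depths, relative_depths, meas_variance=0.05):
        # Predict: allow the scale to drift slightly between frames.
        self.variance += self.process_noise
        # Measure: robust per-frame scale from matched depth pairs.
        ratios = [
            m / r
            for m, r in zip(sparse_metric_depths, relative_depths)
            if r > 0
        ]
        if not ratios:
            return self.scale  # no usable triangulated points this frame
        z = median(ratios)
        # Correct: standard Gaussian posterior update.
        gain = self.variance / (self.variance + meas_variance)
        self.scale += gain * (z - self.scale)
        self.variance *= 1.0 - gain
        return self.scale


filt = ScaleFilter()
for _ in range(50):
    # Triangulated depths are consistently twice the relative prediction.
    filt.update([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
metric_depth = [filt.scale * d for d in [1.5, 2.5]]  # rescaled relative depth
```

Because the gain shrinks as the belief sharpens, a single noisy frame moves the scale only part of the way toward its measurement. That damping is the mechanism by which a recursive estimate can reduce frame-to-frame jitter.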
☆ GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents CVPR 2026
Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at https://monad-cube.github.io/GardenDesigner.
comment: CVPR 2026, Project page: https://monad-cube.github.io/GardenDesigner
☆ FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
comment: Paper in-review
☆ DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
comment: 11 pages, 4 figures; Project Website: https://drivedreamer-policy.github.io/
☆ Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning ICLR 2026
Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
comment: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)
☆ Cosine-Normalized Attention for Hyperspectral Image Classification
Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
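The scoring rule described above (unit-normalized queries and keys, squared cosine similarity) can be sketched in a few lines. This is an illustrative single-head version in plain Python. The softmax over keys and the `temperature` parameter are assumptions, since the abstract does not say how scores are turned into attention weights.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (projection onto the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def squared_cosine_attention(queries, keys, values, temperature=1.0):
    """Attention whose scores are squared cosine similarities.

    queries: n_q vectors; keys and values: n_k vectors each.
    `temperature` is a hypothetical knob, not specified by the abstract.
    """
    q_unit = [l2_normalize(q) for q in queries]
    k_unit = [l2_normalize(k) for k in keys]
    dim = len(values[0])
    outputs = []
    for q in q_unit:
        # For unit vectors the dot product is the cosine of the angle;
        # squaring maps it into [0, 1] before temperature scaling.
        scores = [
            sum(a * b for a, b in zip(q, k)) ** 2 / temperature
            for k in k_unit
        ]
        # Numerically stable softmax over the keys (assumed weighting).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


values = [[1.0, 0.0], [0.0, 1.0]]
out_a = squared_cosine_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], values)
# Rescaling queries and keys leaves the output unchanged.
out_b = squared_cosine_attention([[10.0, 0.0]], [[3.0, 0.0], [0.0, 0.5]], values)
```

Unlike dot-product scoring, the result depends only on the angles between queries and keys, so brightness-like magnitude changes in a spectral signature do not alter the attention weights.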
☆ Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO features, contain a lot of entangled information about the style, lighting, and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
comment: project page https://dedoardo.github.io/projects/control-dino/
☆ Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding CVPR 2026
Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities and are difficult to apply directly to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish the Ultrasonographic Diagnostic Taxonomy (UDT), which contains two hierarchical knowledge frameworks. The Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and the Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions: body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
comment: Accepted by CVPR 2026
☆ Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception
Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
comment: 15 pages, 10 figures
☆ Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation
Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
☆ Setup-Independent Full Projector Compensation
Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) the restriction of existing geometric correction models to specific spatial setups, so that without further retraining or fine-tuning they often fail to generalize to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: a carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: https://hai-bo-li.github.io/SIComp/
comment: 16 pages, 17 figures
☆ SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.
☆ End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement CVPR2026
This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two-step process: (i) generating SA heatmaps from individual gaze attention heatmaps and group membership scalars estimated during group inference; (ii) refining the initial group memberships to account for the initial SA heatmaps, followed by the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: https://github.com/chihina/sagd-CVPRW2026.
comment: Accepted to CVPR2026 Workshop (GAZE 2026)
☆ Bias mitigation in graph diffusion models ICLR 2025
Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion's maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at https://github.com/kunzhan/spp
comment: Accepted to ICLR 2025!
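To make the reverse-starting idea in the graph diffusion abstract above more concrete, here is a toy sketch of plain unadjusted Langevin dynamics that moves standard-Gaussian draws toward a different Gaussian. Everything here is an assumption for illustration: the 1D target N(mu, sigma^2), its analytic score, the step size, and the function names. The paper's "newly designed" sampler and its learned score are not described in the abstract and are not reproduced here.

```python
import math
import random


def gaussian_score(x, mu, sigma):
    """Score (gradient of the log-density) of a 1D Gaussian N(mu, sigma^2)."""
    return -(x - mu) / (sigma * sigma)


def langevin_sample(x0, mu, sigma, step=0.01, num_steps=500, rng=None):
    """Unadjusted Langevin update: x <- x + (step / 2) * score(x) + sqrt(step) * noise."""
    rng = rng or random.Random()
    x = x0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * gaussian_score(x, mu, sigma) + math.sqrt(step) * noise
    return x


def shifted_start_points(num_samples, mu, sigma, seed=0):
    """Start from standard-Gaussian draws and move them toward N(mu, sigma^2)."""
    rng = random.Random(seed)
    return [
        langevin_sample(rng.gauss(0.0, 1.0), mu, sigma, rng=rng)
        for _ in range(num_samples)
    ]
```

With a small step size the chain's stationary distribution is close to the target, so the returned points can stand in for draws from the mismatched "maximum perturbation" distribution in this toy setting.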
☆ Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
☆ From Understanding to Erasing: Towards Complete and Stable Video Object Removal
Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
☆ BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
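The two mechanisms named in the BTS-rPPG abstract can be sketched in a few lines. The code below assumes the FFT-style pairing in which frame `i` is paired with frame `i XOR 2**s` at stage `s`, and it treats "retaining only the orthogonal component" as a single Gram-Schmidt projection step. Both readings, along with all function names, are assumptions; the abstract does not give the exact schedule or filter.

```python
def butterfly_partners(num_frames):
    """For each stage s, list the partner index i XOR 2**s of every frame i."""
    if num_frames < 1 or num_frames & (num_frames - 1):
        raise ValueError("num_frames must be a power of two")
    stages = num_frames.bit_length() - 1  # log2(num_frames)
    return [[i ^ (1 << s) for i in range(num_frames)] for s in range(stages)]


def reachable_after(num_frames):
    """Track which source frames can have influenced each frame after all stages."""
    reach = [{i} for i in range(num_frames)]
    for partners in butterfly_partners(num_frames):
        reach = [reach[i] | reach[partners[i]] for i in range(num_frames)]
    return reach


def orthogonal_component(source, target):
    """Remove from `source` its projection onto `target`."""
    dot = sum(s * t for s, t in zip(source, target))
    norm_sq = sum(t * t for t in target)
    if norm_sq == 0.0:
        return list(source)
    return [s - dot / norm_sq * t for s, t in zip(source, target)]
```

With 8 frames the schedule has 3 stages, and after the last stage every frame can have received information from all 8 frames. That logarithmic growth of the temporal receptive field is the property the abstract attributes to the butterfly pattern.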
☆ Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding
Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
comment: Project page: https://caiyw2023.github.io/Director/
☆ GPA: Learning GUI Process Automation from Demonstrations
GUI Process Automation (GPA) is a lightweight but general vision-based form of Robotic Process Automation (RPA) that enables fast and stable process replay from only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate and 10 times faster execution when completing long-horizon GUI tasks.
☆ HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation
Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
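The amplitude transfer that the HOT abstract calls FDA can be illustrated on a 1D toy signal: keep the source phase everywhere, but take the amplitude of the lowest frequency bins from a target-domain signal. The sketch below uses a naive DFT so it needs only the standard library. The 1D setting, the `beta` cutoff, and all function names are assumptions; the abstract does not state which axes or bands the authors actually transfer.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def low_freq_amplitude_swap(source, target, beta):
    """Replace the source amplitude with the target amplitude for bins with |freq| <= beta."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    n = len(source)
    src_spec, tgt_spec = dft(source), dft(target)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # symmetric frequency index keeps the output real
        amplitude = abs(tgt_spec[k]) if freq <= beta else abs(src_spec[k])
        mixed.append(cmath.rect(amplitude, cmath.phase(src_spec[k])))
    return idft_real(mixed)
```

With `beta=0` only the DC bin is swapped, so the output keeps the source's fluctuations and takes on the target's mean level. This is a small-scale analogue of changing appearance while leaving the higher-frequency content untouched.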
☆ Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion ICME2026
Embodied perception systems that interact continuously in open physical spaces face severe distribution drift as their environment changes. However, existing domain-incremental awareness methods often rely on a domain ID obtained in advance at test time, which limits their practicality in unknown interaction scenarios. At the same time, models often overfit to context-specific perceptual noise, which leads to insufficient generalization and catastrophic forgetting. To address these limitations, we propose a domain-ID-free and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continual environment adaptation. The method uses a disentangled representation mechanism to remove non-essential environmental style interference and to guide the model to focus on extracting intrinsic semantic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate old and new environment knowledge in parameter space, so that the model adapts to the new distribution without storing historical data while retaining as much of its discriminative ability on old environments as possible. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-ID-free setting, and that its accuracy exceeds that of existing state-of-the-art methods.
comment: Accepted by ICME2026
☆ M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis
Multi-modal fusion, which integrates information from different modalities, is of great significance in neuroscience and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods for brain networks, which mainly focus on the structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent differences between input samples. This lack of sample adaptation limits further performance gains. To this end, we propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike static fusion methods, we design different mixture-of-experts (MoEs) for uni- and multi-modal representations, where modules can adaptively change with the input sample during inference. To alleviate the collapse of expert training that MoEs can suffer from, we divide our method into 3 stages. We first train the uni-modal encoders separately, then pretrain the single experts of the MoEs, and finally finetune the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work on dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS.
☆ DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data CVPR 2026
Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.
comment: Accepted to CVPR 2026. Website: https://jinwonjoon.github.io/DynaVid/
☆ Moiré Video Authentication: A Physical Signature Against AI Video Generation
Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
comment: 17 pages, 14 figures
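The verifier step in the Moiré abstract above, extracting two signals and testing their correlation, can be sketched with a plain Pearson correlation. The sketch assumes the fringe phase has already been unwrapped and that both signals are sampled once per frame. The 0.9 threshold and the function names are hypothetical and are not taken from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant signal carries no evidence of coupling
    return cov / math.sqrt(var_x * var_y)


def looks_physically_coupled(fringe_phase, displacement, threshold=0.9):
    """Flag a clip as consistent with a linear phase-displacement coupling."""
    return abs(pearson_correlation(fringe_phase, displacement)) >= threshold
```

A real capture should yield fringe phase that tracks grating displacement almost linearly, while the abstract reports that generated videos produce significantly different correlation signatures.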
☆ MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label CVPR 2026
Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD.
comment: Accepted to CVPR 2026
☆ TOL: Textual Localization with OpenStreetMap
Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
comment: Tech repo
☆ LivingWorld: Interactive 4D World Generation with Environmental Dynamics
We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at cvsp-lab.github.io/LivingWorld.
☆ Automatic Image-Level Morphological Trait Annotation for Organismal Images ICLR 2026
Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
comment: ICLR 2026
☆ Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long horizons and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
☆ NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy
Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
comment: 5 pages, 5 figures, 5 tables
☆ F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling CVPR 2026
We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
comment: Accepted to the CVPR 2026 SPAR-3D Workshop
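The visibility-aware aggregation described in the F3DGS abstract above amounts to a per-Gaussian weighted average, which the sketch below spells out for a single scalar attribute such as opacity. The data layout, the fallback rule for Gaussians that no client observed, and the function name are assumptions for illustration; the abstract only states that contributions are weighted by observation frequency.

```python
def aggregate_visibility_weighted(client_values, client_counts, fallback):
    """Per-Gaussian weighted mean of client updates.

    client_values[c][g]: scalar attribute for Gaussian g reported by client c
    client_counts[c][g]: how many times client c observed Gaussian g
    fallback[g]: value kept when no client observed Gaussian g (an assumption)
    """
    merged = []
    for g in range(len(fallback)):
        total = sum(counts[g] for counts in client_counts)
        if total == 0:
            merged.append(fallback[g])
            continue
        weighted = sum(
            values[g] * counts[g]
            for values, counts in zip(client_values, client_counts)
        )
        merged.append(weighted / total)
    return merged


# Two hypothetical clients, three Gaussians; the last Gaussian is unseen by both.
merged_opacity = aggregate_visibility_weighted(
    client_values=[[0.2, 0.9, 0.5], [0.8, 0.1, 0.5]],
    client_counts=[[3, 0, 0], [1, 4, 0]],
    fallback=[0.0, 0.0, 0.7],
)
```

A client that never saw a Gaussian gets zero weight for it, so its uninformed value (0.9 for the second Gaussian here) cannot overwrite the estimate from the client that did observe it.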
☆ Towards Minimal Focal Stack in Shape from Focus CVPR
Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.
comment: Accepted to CVPRW 2026 (3DMV)
☆ Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition
Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval. Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.
comment: 9 pages
☆ Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation
Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a $\underline{r}$obust $\underline{o}$pen-$\underline{se}$t $\underline{t}$est-$\underline{t}$ime $\underline{a}$daptation method. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C. Furthermore, experiments on Cityscapes validate the method's effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
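The conflict between entropy minimization and maximization can be seen from the quantity both objectives act on. The sketch below only computes the Shannon entropy of a softmax prediction; it is not ROSETTA's angular or feature-norm loss, and the logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Entropy minimization pushes predictions toward the confident case,
# entropy maximization pushes them toward the uniform case.
confident = entropy(softmax([8.0, 0.0, 0.0]))
uniform = entropy(softmax([1.0, 1.0, 1.0]))
```

Because both objectives move the same model outputs in opposite directions, a sample misjudged as ID or OOD receives a gradient that works against the other goal.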
☆ SHOE: Semantic HOI Open-Vocabulary Evaluation Metric CVPR 2026
Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions. We will release our evaluation metric to the public to facilitate future research.
comment: Accepted to GRAIL-V Workshop at CVPR 2026
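A minimal sketch of the verb/object decomposition described for SHOE follows. It assumes similarity ratings in [0, 1] from several models have already been collected, and it combines the two averaged components by a product. The product rule and the function name are assumptions; the abstract does not specify how the components are combined.

```python
from statistics import mean


def shoe_like_score(verb_ratings, object_ratings):
    """Combine averaged verb and object similarities into one score.

    verb_ratings / object_ratings: similarity ratings in [0, 1], one per rater model.
    Assumption: the combination is a product; the actual SHOE rule may differ.
    """
    verb_sim = mean(verb_ratings)
    object_sim = mean(object_ratings)
    return verb_sim * object_sim


# "lean on couch" vs. "sit on couch": same object, related verb (made-up ratings).
score = shoe_like_score(verb_ratings=[0.8, 0.6], object_ratings=[1.0, 1.0])
```

Unlike an exact string match, which would score this pair as zero, the prediction receives partial credit proportional to how close the verb is judged to be.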
☆ Satellite-Free Training for Drone-View Geo-Localization
Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
☆ Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning ICME 26
Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
comment: ICME 26
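To make the idea of gradient surgery concrete, the sketch below shows a PCGrad-style projection: when two modality gradients conflict (negative dot product), one is projected onto the plane orthogonal to the other. This is a stand-in technique for illustration only. GAAL's uncertainty-based selection is not reproduced here, and the function name is hypothetical.

```python
def project_conflicting(grad, other):
    """Remove from `grad` the component that opposes `other` (PCGrad-style).

    If the gradients do not conflict (dot product >= 0), `grad` is returned unchanged.
    """
    dot = sum(a * b for a, b in zip(grad, other))
    if dot >= 0:
        return list(grad)
    norm_sq = sum(b * b for b in other)
    return [a - (dot / norm_sq) * b for a, b in zip(grad, other)]


tabular_grad = [1.0, 0.0]
image_grad = [-1.0, 1.0]
aligned = project_conflicting(tabular_grad, image_grad)
residual_dot = sum(a * b for a, b in zip(aligned, image_grad))
```

After projection the tabular gradient no longer opposes the image gradient on the shared parameters.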
☆ VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answering generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
☆ ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
comment: Project page: https://rosetta-leong.github.io/ReFlow_Page/
☆ Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization
Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where segmentation network and generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.
☆ Prototype-Based Low Altitude UAV Semantic Segmentation ICME 2026
Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at https://github.com/zhangda1018/PBSeg.
comment: Accepted to ICME 2026
☆ Universal computational thermal imaging overcoming the ghosting effect
Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces -- the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR's effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.
comment: 9 pages, 6 figures
☆ Why Instruction-Based Unlearning Fails in Diffusion Models?
Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.
♻ ☆ LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/.
♻ ☆ Scaling Video Pretraining for Surgical Foundation Models
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
♻ ☆ HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering CVPR 2026
Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, only modestly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. HERBench thus provides a principled benchmark for studying robust multi-evidence video understanding.
comment: Accepted to CVPR 2026
♻ ☆ FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution CVPR 2026
Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
comment: CVPR 2026 (camera ready ver.). Please visit our project page at https://cmlab-korea.github.io/FRAMER/
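The LF/HF decomposition via FFT masks in the FRAMER abstract can be illustrated on a 1D signal. The sketch uses a naive DFT from the standard library rather than a 2D FFT on feature maps, and the cutoff value is an arbitrary assumption; it only shows how a binary frequency mask splits a signal into two bands that sum back to the original.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), sufficient for a small demo."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_bands(signal, cutoff):
    """Split a signal into low- and high-frequency parts with a binary mask."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n - k carry the same frequency, so the mask is symmetric around DC.
    low_mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([s * m for s, m in zip(spectrum, low_mask)])
    high = idft([s * (1.0 - m) for s, m in zip(spectrum, low_mask)])
    return low, high


# Constant offset (low frequency) plus a sample-to-sample alternation (high frequency).
signal = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
low, high = split_bands(signal, cutoff=1)
```

Because the two masks are complementary, the bands always sum back to the input, so separate losses can be applied to each band without losing information.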
♻ ☆ Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
comment: Updated first authors
♻ ☆ OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The code and model are publicly available. Project Page: https://omniweaving.github.io.
comment: 32 pages, 22 figures. Project Page: https://omniweaving.github.io. Github: https://github.com/Tencent-Hunyuan/OmniWeaving. Model: https://huggingface.co/tencent/HY-OmniWeaving
♻ ☆ Tackling Non-IIDness in HAPS-Aided Federated Learning
High-altitude platform stations (HAPS) enable large-scale federated learning (FL) in non-terrestrial networks (NTN) by providing wide-area coverage and predominantly line-of-sight (LoS) connectivity to many ground users. However, practical deployments face heterogeneous and non-independently and identically distributed (non-IID) client data, which degrades accuracy and slows convergence. We propose a weighted attribute-based client selection strategy that leverages server-side indicators: historical traffic behavior, instantaneous channel quality, computational capability, and prior-round learning contribution. At each round, the HAPS computes a composite score and selects the top clients, while adapting attribute weights online based on their correlation with validation-loss improvement. We further provide theoretical justification that traffic-derived uniformity can serve as a proxy for latent data heterogeneity, enabling selection of client subsets with reduced expected non-IIDness. Simulations demonstrate improved test accuracy, faster convergence, and lower training loss compared with random, resource-only, and single-attribute baselines.
comment: Submitted to IEEE for possible publication
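The composite-score selection step in the HAPS abstract can be sketched as a weighted sum followed by a top-k cut. The attribute names, the normalization to [0, 1], and the weight values below are illustrative assumptions, and the online weight adaptation described in the abstract is omitted.

```python
def select_clients(attributes, weights, k):
    """Rank clients by a weighted sum of normalized attributes and keep the top k.

    attributes: {client_id: {attribute_name: value in [0, 1]}} (hypothetical layout).
    weights: {attribute_name: weight}; fixed here, adapted online in the paper.
    """
    scores = {
        cid: sum(weights[name] * value for name, value in attrs.items())
        for cid, attrs in attributes.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


attributes = {
    "a": {"traffic": 0.9, "channel": 0.5, "compute": 0.5, "contribution": 0.5},
    "b": {"traffic": 0.2, "channel": 0.9, "compute": 0.9, "contribution": 0.1},
    "c": {"traffic": 0.6, "channel": 0.6, "compute": 0.6, "contribution": 0.8},
}
weights = {"traffic": 0.4, "channel": 0.2, "compute": 0.2, "contribution": 0.2}
selected, scores = select_clients(attributes, weights, k=2)
```

Client "b" has the best channel and compute resources but is not selected, which shows how this differs from a resource-only baseline.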
♻ ☆ PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation CVPR 2026
Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
comment: Accepted to CVPR 2026 Code: https://github.com/GasaiYU/PAM
♻ ☆ Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
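Two ingredients named in the darts abstract can be sketched briefly: the classic minimum-jerk position profile and a z-score check of a new throw against an athlete's own history. The feature names, sample values, and the threshold of 2.0 are illustrative assumptions, and the paper's hierarchical recommendation logic is not reproduced.

```python
from statistics import mean, stdev


def min_jerk(start, end, tau):
    """Classic minimum-jerk position for normalized time tau in [0, 1]."""
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return start + (end - start) * s


def flag_deviations(history, current, threshold=2.0):
    """Return features whose z-score against the athlete's own history exceeds the threshold."""
    flags = {}
    for name, values in history.items():
        mu, sigma = mean(values), stdev(values)
        z = (current[name] - mu) / sigma if sigma > 0 else 0.0
        if abs(z) > threshold:
            flags[name] = z
    return flags


# Hypothetical per-athlete history of high-quality throws and one new throw.
history = {
    "release_velocity": [5.0, 5.2, 4.8, 5.1, 4.9],
    "elbow_angle": [90.0, 92.0, 88.0, 91.0, 89.0],
}
current = {"release_velocity": 5.05, "elbow_angle": 97.0}
flags = flag_deviations(history, current)
midpoint = min_jerk(0.0, 1.0, 0.5)
```

Scoring against the individual's own mean and spread, rather than a shared template, is what makes the deviation personal: the same elbow angle could be normal for one athlete and flagged for another.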
♻ ☆ Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models
State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
♻ ☆ Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models
Diffusion models are powerful deep generative models, but unlike classical models, they lack an explicit low-dimensional latent space that parameterizes the data manifold. This absence makes it difficult to perform manifold-aware operations, such as geometrically faithful interpolation or conditional guidance that respects the learned manifold. We propose a training-free Riemannian metric on the noise space, derived from the Jacobian of the score function. The key insight is that the spectral structure of this Jacobian separates tangent and normal directions of the data manifold; our metric leverages this separation to encourage paths to stay tangential to the manifold rather than drift toward high-density regions. To validate that our metric faithfully captures the manifold geometry, we examine it from two complementary angles. First, geodesics under our metric yield perceptually more natural interpolations than existing methods on synthetic, image, and video frame datasets. Second, the tangent-normal decomposition induced by our metric prevents classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.
♻ ☆ DeDelayed: Deleting Remote Inference Delay via On-Device Correction CVPR 2026
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference, but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round-trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, an improvement equivalent to using a model ten times larger. We release our training code, pretrained models, and Python library at https://github.com/InterDigitalInc/dedelayed .
comment: CVPR 2026
♻ ☆ SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of $14,116$ dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate two representative generative paradigms, StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs), under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID ($\approx 65.5$) and KID ($\approx 0.05$), while diffusion models generated higher-variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection, with $8$-$15$\% absolute gains in melanoma F1-score and with ViT-B/16 achieving F1 $\approx 0.88$ and ROC-AUC $\approx 0.98$, representing an improvement of approximately $14\%$ over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench
♻ ☆ Towards Faithful Reasoning in Comics for Small MLLMs
Comic understanding presents a significant challenge for Multimodal Large Language Models (MLLMs), as the intended meaning of a comic often emerges from the joint interpretation of visual, textual, and social cues. This naturally motivates Chain-of-Thought (CoT) prompting, since explicit intermediate reasoning appears promising for integrating such heterogeneous signals. However, existing CoT methods are poorly matched to this structure: they tend to force interpretation into a single reasoning path before multiple cues have been jointly considered, often degrading performance, especially for small MLLMs. Our key idea is to explicitly preserve multi-cue interpretation during supervision construction, rather than collapsing comic understanding into a single reasoning chain. To this end, we propose a two-stage framework for faithful comic reasoning in small MLLMs. First, we introduce MoCoT, a modular supervision construction framework that preserves multi-cue interpretation and turns it into more faithful supervision. Second, we propose VERA, a structured reward mechanism that turns such supervision into faithful reasoning behavior by aligning optimization with both reasoning faithfulness and answer correctness. Extensive experiments on five benchmarks spanning comic understanding and broader humor-centric and abstract visual reasoning tasks demonstrate that our framework achieves strong results in the $\leq$ 4B regime, surpasses several 7B baselines, improves four small MLLMs by an average of 12.1% as a plug-in, and consistently enhances reasoning faithfulness while preserving inference efficiency.
♻ ☆ Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference
Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we propose a unified encoder trained on multiple computer vision tasks crucial for urban driving, including depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By integrating these diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. We evaluate the model on steering estimation as a downstream task, leveraging its dense latent space. To ensure efficient multi-task learning, we introduce a multi-scale feature network for pose estimation and apply knowledge distillation from a multi-backbone teacher model. Our results highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet. These results underline the significance of task-specific visual features and demonstrate the promise of multi-task learning in advancing autonomous driving systems. More details and the pretrained model are available at https://hi-computervision.github.io/uni-encoder/.
♻ ☆ Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual effort. We address this by automating the most labour-intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
comment: Demo video url: https://jn00767.pages.surrey.ac.uk/catalogue-grounded-multimodal-attribution-for-museum-video/
♻ ☆ Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method
Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses and patch-level comparison with publicly available building height products indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations. Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.
♻ ☆ Human-Centric Perception for Child Sexual Abuse Imagery
Law enforcement agencies and non-governmental organizations handling reports of Child Sexual Abuse Imagery (CSAI) are overwhelmed by large volumes of data, requiring the aid of automation tools. However, defining sexual abuse in images of children is inherently challenging, encompassing sexually explicit activities and hints of sexuality conveyed by the individual's pose or attire. CSAI classification methods often rely on black-box approaches, targeting broad and abstract concepts such as pornography. Thus, our work is an in-depth exploration of tasks from the literature on Human-Centric Perception, across the domains of safe images, adult pornography, and CSAI, focusing on targets that enable more objective and explainable pipelines for CSAI classification in the future. We introduce the Body-Keypoint-Part Dataset (BKPD), gathering images of people from varying age groups and sexual explicitness to approximate the domain of CSAI, along with manually curated, hierarchically structured labels for skeletal keypoints and bounding boxes for person and body parts, including head, chest, hip, and hands. We propose two methods, namely BKP-Association and YOLO-BKP, for simultaneous pose estimation and detection, with targets associated per individual for a comprehensive decomposed representation of each person. Our methods are benchmarked on COCO-Keypoints and COCO-HumanParts, as well as our human-centric dataset, achieving competitive results with models that jointly perform all tasks. Cross-domain ablation studies on BKPD and a case study on RCPD highlight the challenges posed by sexually explicit domains. Our study addresses previously unexplored targets in the CSAI domain, paving the way for novel research opportunities.
comment: submitted to IEEE Transactions on Information Forensics and Security (TIFS)
♻ ☆ Seamless High-Resolution Terrain Reconstruction: A Prior-Based Vision Transformer Approach
High-resolution elevation data is essential for hydrological modeling, hazard assessment, and environmental monitoring; however, globally consistent, fine-scale Digital Elevation Models (DEMs) remain unavailable. Very high-resolution single-view imagery enables the extraction of topographic information at the pixel level, allowing the reconstruction of fine terrain details over large spatial extents. In this paper, we present single-view-based DEM reconstruction shown to support practical analysis in GIS environments across multiple sub-national jurisdictions. Specifically, we produce high-resolution DEMs for large-scale basins, representing a substantial improvement over the 30 m resolution of globally available Shuttle Radar Topography Mission (SRTM) data. The DEMs are generated using a prior-based monocular depth estimation (MDE) foundation model, extended in this work to the remote sensing height domain for high-resolution, globally consistent elevation reconstruction. We fine-tune the model by integrating low-resolution SRTM data as a global prior with high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP), producing DEMs with near LiDAR-level accuracy. Our method achieves a 100x resolution enhancement (from 30 m to 30 cm), exceeding existing super-resolution approaches by an order of magnitude. Across two diverse landscapes, the model generalizes robustly, resolving fine-scale terrain features with a mean absolute error of less than 5 m relative to LiDAR and improving upon SRTM by up to 18%. Hydrological analyses at both catchment and hillslope scales confirm the method's utility for hazard assessment and environmental monitoring, demonstrating improved streamflow representation and catchment delineation. Finally, we demonstrate the scalability of the framework by applying it across large geographic regions.
♻ ☆ PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location-sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference with object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is available at https://github.com/mabo1215/PPEDCRF.git
comment: We would like to withdraw this paper due to identified issues in the experimental design and insufficient supporting data, which affect the reliability of the reported results. A substantially revised version with corrected experiments and extended evaluations will be prepared and submitted in the future
♻ ☆ SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. To ensure precise geometric fidelity, we constrain traditional 3D Gaussians into planar primitives, facilitating accurate normal and depth estimation. The planar Gaussians are then optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. Furthermore, we leverage a Vision-Language Model (VLM) via visual prompting to achieve open-vocabulary part segmentation and joint parameter estimation. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing baselines, achieving superior part-level surface reconstruction fidelity. Code and data are provided in the supplementary material.
comment: 10 pages, 7 figures
♻ ☆ EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer CVPR 2026
Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as "unknown", all without access to prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose the EW-DETR framework, which augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
comment: Accepted at CVPR 2026
♻ ☆ OOD-SEG: Exploiting out-of-distribution detection techniques for learning image segmentation from sparse multi-class positive-only annotations
Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach which broadly falls within the positive-unlabelled (PU) learning paradigm and exploits tools from OOD detection techniques. Our framework learns only from sparsely annotated pixels from multiple positive-only classes and does not use any annotation for the background class. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as \emph{background} in standard segmentation formulations. To the best of our knowledge, this work is the first to formulate multi-class segmentation with sparse positive-only annotations as a pixel-wise PU learning problem and to address it using OOD detection techniques. Here, we forgo the need for background annotation and consider background pixels, together with any other unseen classes, as part of the OOD set. Our framework can integrate, at the pixel level, any OOD detection approach designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metrics for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.
comment: Accepted in MedIA
♻ ☆ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
comment: 10 pages, 6 figures
♻ ☆ CARE: Confidence-aware Ratio Estimation for Medical Biomarkers
Ratio-based biomarkers (RBBs), such as the proportion of necrotic tissue within a tumor, are widely used in clinical practice to support diagnosis, prognosis, and treatment planning. These biomarkers are typically estimated from segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified confidence-aware framework for estimating ratio-based biomarkers. Our uncertainty analysis stems from two observations: (1) the probability ratio estimator inherently admits a statistical confidence interval regarding local randomness (bias and variance); (2) the segmentation network is not perfectly calibrated (calibration error). We perform a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of segmentation-derived RBBs in clinical workflows.
comment: 12 pages
♻ ☆ FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry
The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.
♻ ☆ MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters CVPR 2026
We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
comment: CVPR 2026
♻ ☆ Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement
Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
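The expert triplet agreement described above can be sketched as follows, assuming each expert judgment is a triplet (anchor, closer, farther) and that an embedding space "agrees" when the anchor lies nearer to the item the expert judged more similar. The use of Euclidean distance, the function name, and the toy wound identifiers are assumptions for illustration; the abstract does not specify them.

```python
import math

def triplet_agreement(embeddings, triplets):
    """Fraction of expert triplets (anchor, closer, farther) for which the
    embedding places `closer` nearer to `anchor` than `farther`.

    embeddings: dict mapping item id -> coordinate sequence (assumed format)
    """
    def dist(a, b):
        return math.dist(embeddings[a], embeddings[b])  # Euclidean distance

    agree = sum(dist(a, p) < dist(a, n) for a, p, n in triplets)
    return agree / len(triplets)

# Toy example: the second triplet contradicts the embedding geometry.
embeddings = {"w1": (0.0, 0.0), "w2": (1.0, 0.0), "w3": (5.0, 0.0)}
triplets = [("w1", "w2", "w3"), ("w3", "w1", "w2")]
agreement = triplet_agreement(embeddings, triplets)
```

A score computed this way needs only ordinal judgments, which is why the abstract describes triplets as fast to collect compared with absolute similarity ratings.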
♻ ☆ One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets and a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet it demonstrates robustness to black-box attacks in the universal setting.
comment: Published in Transactions on Machine Learning Research (03/2026)
♻ ☆ Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data
Contrastive self-supervised learning (CSSL) usually makes use of the multi-view assumption, which states that all relevant information must be shared between all views. The main objective of CSSL is to maximize the mutual information (MI) between representations of different views and at the same time compress irrelevant information in each representation. Recently, as part of future work, Shwartz-Ziv & Yann LeCun pointed out that, when the multi-view assumption is violated, one of the most significant challenges in SSL is identifying new methods to separate relevant from irrelevant information based on alternative assumptions. Taking a cue from this intuition, we make the following contributions in this paper: 1) we develop a CSSL framework wherein multiple images and multiple views (MIMV) are considered as input, which is different from the traditional multi-view assumption; 2) we adopt a novel augmentation strategy that includes both normalized (invertible) and augmented (non-invertible) views so that complete information of one image can be preserved and hard augmentation can be chosen for the other image; 3) an information bottleneck (IB) principle is outlined for MIMV to produce optimal representations; 4) we introduce a loss function that helps to learn better representations by filtering out extreme features; 5) the robustness of our proposed framework is established by applying it to the imbalanced dataset problem, wherein we achieve a new state-of-the-art accuracy (2% improvement in Cifar10-LT using Resnet-18, 5% improvement in Cifar100-LT using Resnet-18, and 3% improvement in Imagenet-LT (1k) using Resnet-50).
♻ ☆ Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events CVPR
Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A 2D pixel-wise RGB CRF model is introduced to align the NeRF rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event CRF model is also designed to bridge the gap between physical scene dynamics and event sensor output. The two models are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method achieves state-of-the-art HDR and deblurring novel view synthesis results with single-exposure blurry LDR images and corresponding events.
comment: Accepted by the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026. Project Page: https://icvteam.github.io/See-NeRF.html. Our code and datasets are publicly available at https://github.com/iCVTEAM/See-NeRF
♻ ☆ KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Knowledge-Intensive Visual Grounding (KVG) requires models to localize objects using fine-grained, domain-specific entity names rather than generic referring expressions. Although Multimodal Large Language Models (MLLMs) possess rich entity knowledge and strong generic grounding capabilities, they often fail to effectively utilize such knowledge when grounding specialized concepts, revealing a knowledge-grounding gap between internal knowledge and grounding predictions. To address this challenge, we propose a knowledge-aware training paradigm for KVG. Our approach first constructs knowledge-guided reasoning data to encourage models to activate domain-relevant entity knowledge during grounding, and then introduces KARL, a Knowledge-Aware Reinforcement Learning framework that adaptively modulates reward signals according to the model's estimated knowledge mastery of different entities. To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities. Extensive experiments show that our approach consistently outperforms a wide range of baseline models and achieves substantially stronger cross-domain generalization on unseen categories. The data, codes, and models are released at https://github.com/thunlp/KARL.
♻ ☆ Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.
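The level-of-detail idea in the abstract above (truncate Fourier coefficients at render time) can be illustrated with a generic radial Fourier descriptor of a closed planar outline. This is a minimal sketch, not the paper's parameterization: the series form `r(theta) = a0 + sum_k (a_k cos k*theta + b_k sin k*theta)` and the names `radius`, `outline`, and `keep` are assumptions chosen for illustration.

```python
import math

def radius(theta, a0, coeffs, keep=None):
    """Radial Fourier series (assumed form, not the paper's):
    r(theta) = a0 + sum_k (a_k cos(k*theta) + b_k sin(k*theta)).
    `keep` truncates the series to its first `keep` harmonics."""
    terms = coeffs if keep is None else coeffs[:keep]
    r = a0
    for k, (a_k, b_k) in enumerate(terms, start=1):
        r += a_k * math.cos(k * theta) + b_k * math.sin(k * theta)
    return r

def outline(a0, coeffs, keep=None, samples=64):
    """Sample the closed outline as (x, y) points around the origin."""
    pts = []
    for i in range(samples):
        t = 2.0 * math.pi * i / samples
        r = radius(t, a0, coeffs, keep)
        pts.append((r * math.cos(t), r * math.sin(t)))
    return pts

# Hypothetical coefficients: one stored shape, rendered at two detail levels.
coeffs = [(0.10, 0.00), (0.00, 0.05), (0.02, 0.01)]
coarse = outline(1.0, coeffs, keep=0)  # no harmonics: a plain circle
fine = outline(1.0, coeffs)            # all harmonics: the full shape
```

With `keep=0` the outline degenerates to a circle of radius `a0`; increasing `keep` adds detail back without changing the stored coefficients, which is the property the abstract relies on for runtime scalability.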
♻ ☆ GPA-VGGT: Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss
Transformer-based general visual geometry frameworks, in particular recent Visual Geometry Grounded Transformer (VGGT) models, have shown great promise in camera pose estimation, 3D reconstruction, and 3D scene understanding. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.
♻ ☆ Learning to Translate Noise for Robust Image Denoising CVPR 2026
Deep learning-based image denoising techniques often struggle with poor generalization performance to out-of-distribution real-world noise. To tackle this challenge, we propose a novel noise translation framework that performs denoising on an image with translated noise rather than directly denoising an original noisy image. Specifically, our approach translates complex, unknown real-world noise into Gaussian noise, which is spatially uncorrelated and independent of image content, through a noise translation network. The translated noisy images are then processed by an image denoising network pretrained to effectively remove Gaussian noise, enabling robust and consistent denoising performance. We also design well-motivated loss functions and architectures for the noise translation network by leveraging the mathematical properties of Gaussian noise. Experimental results demonstrate that the proposed method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks. Visualized denoising results and the source code are available on our project page.
comment: Project page: https://hij1112.github.io/learning-to-translate-noise/ Accepted to CVPR 2026 Findings
♻ ☆ SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free approaches are limited to moderate sparsity and thus yield only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Leveraging a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, our method attains up to 90% sparsity and 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
♻ ☆ ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
♻ ☆ GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
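The Self-Consistency Consolidation step described above (keep only proposals that recur across independent samplings) can be sketched as a simple voting filter. The abstract does not say how two proposals are judged to be "the same", so this sketch assumes exact match after lowercasing and trimming; the function name `consolidate` and the `min_votes` threshold are hypothetical.

```python
from collections import Counter

def consolidate(samples, min_votes=2):
    """Keep proposals that appear in at least `min_votes` independent samplings.

    `samples` is a list of proposal lists, one list per VLM sampling.
    Matching is exact after normalization (an assumption for this sketch).
    """
    votes = Counter()
    for proposals in samples:
        # A set ensures each proposal counts at most once per sampling.
        votes.update({p.strip().lower() for p in proposals})
    return sorted(p for p, n in votes.items() if n >= min_votes)

# Hypothetical outputs of three independent samplings over one clip.
samples = [
    ["Person running", "bicycle on walkway"],
    ["person running"],
    ["car", "Person running ", "bicycle on walkway"],
]
kept = consolidate(samples, min_votes=2)
```

Raising `min_votes` trades recall for precision, which matches the controllable precision-recall tradeoff the ablations report.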
♻ ☆ When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
♻ ☆ Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss
Novel view synthesis is a fundamental task in 3D computer vision that aims to reconstruct photorealistic images from novel viewpoints given a set of posed images. However, reconstruction quality degrades sharply under sparse-view conditions due to insufficient geometric cues. Existing methods, including Neural Radiance Fields (NeRF) and more recent 3D Gaussian Splatting (3DGS), often exhibit blurred details and structural artifacts when trained from sparse observations. Recent works have identified rendered depth quality as a key factor in mitigating these artifacts, as it directly affects geometric accuracy and view consistency. However, effectively leveraging depth under sparse views remains challenging. Depth priors can be noisy or misaligned with rendered geometry, and single-scale supervision often fails to capture both global structure and fine details. To address these challenges, we introduce Hierarchical Depth-Guided Splatting (HDGS), a depth supervision framework that progressively refines geometry from coarse to fine levels. Central to HDGS is our novel Cascade Pearson Correlation Loss (CPCL), which enforces consistency between rendered and estimated depth priors across multiple spatial scales. By enforcing multi-scale depth consistency, our method improves structural fidelity in sparse-view reconstruction. Experiments on LLFF and DTU demonstrate state-of-the-art performance under sparse-view settings.
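To make the multi-scale depth consistency idea above concrete, here is a minimal sketch of a Pearson-correlation depth loss evaluated at several resolutions. It is not the paper's CPCL: the 2x2 average pooling, whole-map (rather than patch-wise) correlation, equal weighting of scales, and the names `pearson`, `avg_pool2`, and `cascade_pearson_loss` are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0

def avg_pool2(img):
    """Downsample a 2D list by 2x2 average pooling (assumed downsampler)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*i][2*j] + img[2*i][2*j+1]
              + img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def cascade_pearson_loss(rendered, prior, levels=2):
    """Mean of (1 - Pearson) between rendered and prior depth over `levels` scales."""
    total = 0.0
    for _ in range(levels):
        flat_r = [v for row in rendered for v in row]
        flat_p = [v for row in prior for v in row]
        total += 1.0 - pearson(flat_r, flat_p)
        rendered, prior = avg_pool2(rendered), avg_pool2(prior)
    return total / levels

# Toy 4x4 depth map and a prior that differs only by scale and shift.
rendered = [[float(4 * i + j) for j in range(4)] for i in range(4)]
prior = [[2.0 * v + 3.0 for v in row] for row in rendered]
loss = cascade_pearson_loss(rendered, prior, levels=2)
```

Because Pearson correlation is invariant to scale and shift, the toy prior above incurs essentially zero loss, which is one common reason correlation-based losses are used with monocular depth priors whose absolute scale is unknown.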
♻ ☆ PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis CVPR 2026
We introduce PhysGaia, a novel physics-aware benchmark for Dynamic Novel View Synthesis (DyNVS) that encompasses both structured objects and unstructured physical phenomena. While existing datasets primarily focus on photorealistic appearance, PhysGaia is specifically designed to support physics-consistent dynamic reconstruction. Our benchmark features complex scenarios with rich multi-body interactions, where objects realistically collide and exchange forces. Furthermore, it incorporates a diverse range of materials, including liquid, gas, textile, and rheological substance, moving beyond the rigid-body assumptions prevalent in prior work. To ensure physical fidelity, all scenes in PhysGaia are generated using material-specific physics solvers that strictly adhere to fundamental physical laws. We provide comprehensive ground-truth information, including 3D particle trajectories and physical parameters (e.g., viscosity), enabling the quantitative evaluation of physical modeling. To facilitate research adoption, we also provide integration pipelines for recent 4D Gaussian Splatting models along with our dataset and their results. By addressing the critical shortage of physics-aware benchmarks, PhysGaia can significantly advance research in dynamic view synthesis, physics-based scene understanding, and the integration of deep learning with physical simulation, ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes.
comment: Accepted at CVPR 2026; Project page: http://cvlab.snu.ac.kr/research/PhysGaia; Dataset: https://huggingface.co/datasets/mijeongkim/PhysGaia/tree/main
♻ ☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
comment: 12 pages
♻ ☆ Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
Recent advances in generative modeling have enabled perceptual video compression at ultra-low bitrates, yet existing methods predominantly treat the generative model as a refinement or reconstruction module attached to a separately designed codec backbone. We propose \emph{Generative Video Codebook Codec} (GVCC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies -- \emph{Image-to-Video} (I2V) with autoregressive GOP chaining, tail latent residual correction, and adaptive atom allocation, \emph{Text-to-Video} (T2V) operating at near-zero side information as a pure generative prior, and \emph{First-Last-Frame-to-Video} (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVCC achieves high-quality reconstruction below 0.002\,bpp while supporting flexible bitrate control through a single hyperparameter.
comment: 9 pages, 3 figures
♻ ☆ ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution
Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.
♻ ☆ Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
♻ ☆ VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion
Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.
comment: Novelty less
♻ ☆ Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching
Quantifying aleatoric uncertainty in medical image segmentation is critical since it reflects the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution using a generative model, but current methods limit the expressive power of generative models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at https://github.com/huynhspm/Data-Uncertainty
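The uncertainty map mentioned above is derived from the pixel-wise variance of several sampled segmentations. The following sketch shows only that final aggregation step, assuming the samples are already available as equally sized 2D masks; how the samples are drawn from the flow model is not modeled here, and the name `pixelwise_variance` is hypothetical.

```python
def pixelwise_variance(samples):
    """Population variance per pixel across a list of equally sized 2D masks."""
    n = len(samples)
    h, w = len(samples[0]), len(samples[0][0])
    var = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [s[i][j] for s in samples]
            mean = sum(vals) / n
            var[i][j] = sum((v - mean) ** 2 for v in vals) / n
    return var

# Two hypothetical sampled masks that disagree on a single boundary pixel.
masks = [
    [[1, 0], [1, 0]],
    [[1, 0], [0, 0]],
]
uncertainty = pixelwise_variance(masks)
```

Pixels where all samples agree get zero variance, while the disputed pixel gets the maximum variance for a binary label (0.25), so high values concentrate on ambiguous boundaries.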
♻ ☆ Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
The mixture-of-experts (MoE) architecture, which replaces dense networks with sparse ones, has attracted significant attention in large vision-language models (LVLMs) for achieving comparable performance while activating far fewer parameters. Existing MoE architectures for LVLMs primarily focus on token-to-expert routing (TER), encouraging different experts to specialize in processing specific tokens. However, these methods typically rely on the load balancing mechanism, neglecting the inherent distributional differences between vision and language modalities. To address this limitation, we propose the Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, which tackles two key challenges: (1) Modality-specific distribution-aware routing. We observe that language TER generally follows a relatively uniform distribution, whereas vision TER exhibits a long-tailed distribution. This modality discrepancy motivates the design of specialized routing strategies for each modality. (2) Vision-specific dynamic expert activation. Recognizing the importance of high-information vision tail tokens, we introduce a data-augmentation-inspired strategy that increases the number of activated experts, ensuring sufficient learning for these rare but informative tokens. On vision-language and vision benchmarks, our approach achieves consistent improvements, boosting performance by 1.2% / 2.1% on vision-language and 1.6% on vision benchmarks.
♻ ☆ Structure is Supervision: Multiview Masked Autoencoders for Radiology
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
♻ ☆ Improvise, Adapt, Overcome -- Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging WACV 2026
Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg's vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters (244x fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy. Our source code is publicly available at https://github.com/Ujjwal238/Telescopic_adapters
comment: Accepted at the IEEE/CVF winter conference on applications of computer vision (WACV 2026)
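As a rough illustration of the depth-aware scaling described in the Telescopic Adapters abstract, the sketch below assigns a bottleneck width to each transformer layer that grows with depth. The linear schedule, the default widths, and the names `telescopic_dims` and `adapter_params` are assumptions for illustration; the abstract also mentions scaling by semantic relevance, which is not modeled here.

```python
def telescopic_dims(num_layers, min_dim=4, max_dim=32):
    """Bottleneck width per layer, growing linearly from shallow to deep.

    The linear schedule is an assumption; the paper's actual rule may differ.
    """
    if num_layers == 1:
        return [max_dim]
    step = (max_dim - min_dim) / (num_layers - 1)
    return [round(min_dim + step * i) for i in range(num_layers)]

def adapter_params(hidden, bottleneck):
    """Parameters of a standard down-project/up-project adapter with biases."""
    return 2 * hidden * bottleneck + hidden + bottleneck

# Hypothetical 5-layer encoder with hidden size 768.
dims = telescopic_dims(5, min_dim=4, max_dim=32)
total = sum(adapter_params(768, d) for d in dims)
```

Compared with a uniform width, this kind of schedule shifts most of the trainable budget to the deeper layers, which is the allocation the ablations in the abstract argue for.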
♻ ☆ A multi-weight self-matching visual explanation for CNNs on SAR images
In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network's regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised object localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.
♻ ☆ FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy CVPR 2026
Accurate focus quality assessment (FQA) in fluorescence microscopy is challenging due to stain-dependent optical variations that induce heterogeneous focus behavior across images. Existing methods, however, treat focus quality as a stain-agnostic problem, assuming a shared global ordering. We formulate stain-aware FQA for fluorescence microscopy, showing that focus-rank relationships vary substantially across stains due to stain-dependent imaging characteristics, which invalidates this assumption. To support this formulation, we introduce FluoMix, the first dataset for stain-aware FQA spanning multiple tissues, fluorescent stains, and focus levels. We further propose FluoCLIP, a two-stage vision-language framework that grounds stain semantics and enables stain-conditioned ordinal reasoning for focus prediction, effectively decoupling stain representation from ordinal structure. By explicitly modeling stain-dependent focus behavior, FluoCLIP consistently outperforms both conventional FQA methods and recent vision-language baselines, demonstrating strong generalization across diverse fluorescence microscopy conditions. Code and dataset are publicly available at https://fluoclip.github.io/.
comment: Accepted at CVPR 2026, Project Page: https://fluoclip.github.io
♻ ☆ ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Queries are grounded in satellite or aerial imagery, including both optical RGB and SAR data, and require agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 486 structured agentic tasks with 1,778 expert-verified reasoning steps. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing.
♻ ☆ Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
♻ ☆ GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction
3D Gaussian Splatting (3DGS) enables efficient rendering, yet accurate surface reconstruction remains challenging due to unreliable geometric supervision. Existing approaches predominantly rely on depth-based reprojection to infer visibility and enforce multi-view consistency, leading to a fundamental circular dependency: visibility estimation requires accurate depth, while depth supervision itself is conditioned on visibility. In this work, we revisit multi-view geometric supervision from the perspective of visibility modeling. Instead of inferring visibility from pixel-wise depth consistency, we explicitly model visibility at the level of Gaussian primitives. We introduce a Gaussian visibility-aware multi-view geometric consistency (GVMV) formulation, which aggregates cross-view visibility of shared Gaussians to construct reliable supervision over co-visible regions. To further incorporate monocular priors, we propose a progressive quadtree-calibrated depth alignment (QDC) strategy that performs block-wise affine calibration under visibility-aware guidance, effectively mitigating scale ambiguity while preserving local geometric structures. Extensive experiments on DTU and Tanks and Temples demonstrate that our method consistently improves reconstruction accuracy over prior Gaussian-based approaches. Our code is fully open-sourced and available at an anonymous repository: https://github.com/GVGScode/GVGS.
♻ ☆ Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation CVPR
Reinforcement learning (RL) with rule-based reward functions has recently shown great promise in enhancing the reasoning depth and generalization ability of vision-language models (VLMs), while maintaining computational efficiency. In spite of these advances, its adoption in medical imaging remains limited. Current reinforcement fine-tuning (RFT) efforts in this field mainly focus on closed-ended visual question answering (VQA), restricting their applicability to realistic clinical reasoning. Open-ended medical VQA better mirrors clinical diagnostic workflows but remains underexplored. Although several studies have attempted to bridge the two formats through semantically guided RL, model-driven semantic rewards often suffer from reward collapse, where responses with distinct semantics yield nearly identical scores. To overcome this limitation, we introduce Adaptive Reinforcement for Medical Reasoning (ARMed), a novel RL framework tailored for open-ended medical VQA. ARMed first injects domain expertise through supervised fine-tuning (SFT) on chain-of-thought annotations, followed by reinforcement optimization using textual correctness and adaptive semantic rewards to refine reasoning consistency and factual accuracy. Extensive experiments on six challenging medical VQA benchmarks demonstrate that ARMed substantially improves both accuracy and generalization. These findings underscore the importance of reward discriminability in medical RL and highlight the potential of adaptive semantic rewards for building robust, clinically reliable multimodal reasoning systems.
comment: Accepted to CVPR 2026 Findings
♻ ☆ Grounding Everything in Tokens for Multimodal Large Language Models
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
comment: 19 pages, 16 figures, 12 Tables
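The grid-plus-offset idea in the abstract above can be sketched as a coarse-to-fine quantization of a normalized image point. This is only an illustration under stated assumptions: the token spellings `<grid_r_c>` and `<off_r_c>`, the 8x8 grid, the 4x4 offset bins, and the function names are hypothetical and not taken from the paper, and the sketch performs a single refinement step rather than the iterative refinement the abstract mentions.

```python
from typing import Tuple


def encode_point(x: float, y: float, grid: int = 8, bins: int = 4) -> Tuple[str, str]:
    """Map a normalized point (x, y) in [0, 1) to a coarse grid token and a
    finer offset token inside that grid cell (hypothetical token format)."""
    if not (0.0 <= x < 1.0 and 0.0 <= y < 1.0):
        raise ValueError("x and y must lie in [0, 1)")
    col, row = int(x * grid), int(y * grid)
    # Fractional position inside the selected cell, each in [0, 1).
    fx, fy = x * grid - col, y * grid - row
    ox, oy = int(fx * bins), int(fy * bins)
    return f"<grid_{row}_{col}>", f"<off_{oy}_{ox}>"


def decode_point(grid_token: str, offset_token: str,
                 grid: int = 8, bins: int = 4) -> Tuple[float, float]:
    """Recover the center of the offset bin addressed by the two tokens."""
    row, col = (int(p) for p in grid_token.strip("<>").split("_")[1:])
    oy, ox = (int(p) for p in offset_token.strip("<>").split("_")[1:])
    x = (col + (ox + 0.5) / bins) / grid
    y = (row + (oy + 0.5) / bins) / grid
    return x, y


tokens = encode_point(0.30, 0.70)
recovered = decode_point(*tokens)
```

With these assumed settings the worst-case reconstruction error per axis is half an offset bin, 1 / (2 * grid * bins), so adding the offset token tightens localization without enlarging the grid vocabulary.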
♻ ☆ MS-Mix: Sentiment-Guided Adaptive Augmentation for Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) integrates complementary features from text, video, and audio for robust emotion understanding in human interactions. However, models suffer from severe data scarcity and high annotation costs, severely limiting real-world deployment in social media analytics and human-computer systems. Existing Mixup-based augmentation techniques, when naively applied to MSA, often produce semantically inconsistent samples and amplified label noise by ignoring emotional semantics across modalities. To address these challenges, we propose MS-Mix, an adaptive emotion-sensitive augmentation framework that automatically optimizes data quality in multimodal settings. Its key components are: (1) Sentiment-aware sample selection strategy that filters incompatible pairs via latent-space semantic similarity to prevent contradictory emotion mixing. (2) Sentiment intensity guided module with multi-head self-attention for computing modality-specific mixing ratios conditioned on emotional salience dynamically. (3) Sentiment alignment loss based on Kullback-Leibler divergence to align predicted sentiment distributions across modalities with ground-truth labels, improving discrimination and consistency. Extensive experiments on two public datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms prior methods, significantly improving robustness and practical applicability for MSA. The source code is available at an anonymous link: https://anonymous.4open.science/r/MS-Mix-review-0C72.
comment: Under Review
♻ ☆ Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization
While multimodal reasoning models (MLRMs) have exhibited impressive capabilities, they remain prone to hallucinations, and effective solutions are still underexplored. In this paper, we experimentally analyze the hallucination cause and propose C3PO, a training-based mitigation framework comprising \textbf{C}hain-of-Thought \textbf{C}ompression and \textbf{C}ontrastive \textbf{P}reference \textbf{O}ptimization. Firstly, we identify that introducing reasoning mechanisms exacerbates models' reliance on language priors while overlooking visual inputs, which can produce CoTs with reduced visual cues but redundant text tokens. To this end, we propose to selectively filter redundant thinking tokens for a more compact and signal-efficient CoT representation that preserves task-relevant information while suppressing noise. In addition, we observe that the quality of the reasoning trace largely determines whether hallucination emerges in subsequent responses. To leverage this insight, we introduce a reasoning-enhanced preference tuning scheme that constructs training pairs using high-quality AI feedback. We further design a multimodal hallucination-inducing mechanism that elicits models' inherent hallucination patterns via carefully crafted inducers, yielding informative negative signals for contrastive correction. We provide theoretical justification for the effectiveness and demonstrate consistent hallucination reduction across diverse MLRMs and benchmarks.
♻ ☆ AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing
Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.
comment: Project Page: https://lty2226262.github.io/autoweather4d/ | Github: https://github.com/lty2226262/AutoWeather4D
♻ ☆ A Novel FACS-Aligned Anatomical Text Description Paradigm for Fine-Grained Facial Behavior Synthesis
Facial behavior constitutes the primary medium of human nonverbal communication. Existing synthesis methods predominantly follow two paradigms: coarse emotion category labels or one-hot Action Unit (AU) vectors from the Facial Action Coding System (FACS). Neither paradigm reliably renders fine-grained facial behaviors nor resolves anatomically implausible artifacts caused by conflicting AUs. Therefore, we propose a novel task paradigm: anatomically grounded facial behavior synthesis from FACS-based AU descriptions. This paradigm explicitly encodes FACS-defined muscle movement rules, inter-AU interactions, and conflict resolution mechanisms into natural language control signals. To enable systematic research, we develop a dynamic AU text processor, a FACS rule-based module that converts raw AU annotations into anatomically consistent natural language descriptions. Using this processor, we construct BP4D-AUText, the first large-scale text-image paired dataset for fine-grained facial behavior synthesis, comprising over 302K high-quality samples. Given that existing general semantic consistency metrics cannot capture the alignment between anatomical facial descriptions and synthesized muscle movements, we propose the Alignment Accuracy of AU Probability Distributions (AAAD), a task-specific metric that quantifies semantic consistency. Finally, we design VQ-AUFace, a robust baseline framework incorporating anatomical priors and progressive cross-modal alignment, to validate the paradigm. Extensive quantitative experiments and user studies demonstrate the paradigm significantly outperforms state-of-the-art methods, particularly in challenging conflicting AU scenarios, achieving superior anatomical fidelity, semantic consistency, and visual quality.
♻ ☆ A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, with important applications in camouflage recognition, tissue pathology analysis, surface defect detection, and other fields. To integrate complementary information from different polarized images in complex luminance environments, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we propose a multi-scale spatial weight matrix, produced by a brightness branch, that dynamically injects weighted luminance into the feature maps, addressing the inherent contrast difference in polarized images. A global-local feature fusion mechanism at the bottleneck layer performs windowed self-attention and balances global context and local details through residual linking in the feature dimension restructuring stage. In the decoder stage, to further improve adaptability to complex lighting, we propose a Brightness-Enhancement module that establishes the mapping between luminance distribution and texture features and applies nonlinear luminance correction to the fusion result. We also present MSP, a dataset of 1000 pairs of polarized images covering 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, addressing the scarcity of high-quality datasets in polarization image fusion. Extensive experiments on the MSP, PIF, and GAND datasets verify that the proposed MLSN outperforms state-of-the-art methods in subjective and objective evaluations, and the MS-SSIM and SD metrics are higher than the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset are available at https://github.com/1hzf/MLS-UNet.
♻ ☆ Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce \textbf{Cryo-Bench}, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of \textbf{66.38}, followed by TerraMind at \textbf{64.02}, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10\% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of \textbf{59.53}, \textbf{56.62}, and \textbf{56.60}, respectively, compared to U-Net's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of \textbf{12.77\%}. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. (\href{https://github.com/Sk-2103/Cryo-Bench}{GitHub}).
♻ ☆ EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over a single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
♻ ☆ Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection
Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting physical implementation constraints for attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.
comment: 5 pages, 4 figures. Source code is available at https://github.com/boremycin/SAAP
♻ ☆ Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models CVPR
Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical limitations of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose $\textbf{S}$ubspace $\textbf{P}$rojection $\textbf{D}$ebiasing ($\textbf{SPD}$), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of $18.5\%$ across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
comment: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
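The subspace-projection step described in the abstract above can be sketched in a few lines of dependency-free Python. This is one reading of the abstract, not the authors' code: the function names are hypothetical, the bias directions are assumed to be supplied by the caller (the abstract does not say how the subspace is estimated), and the "neutral mean component" is taken here to be the projection of the dataset mean onto the bias subspace.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: turn the given directions into an orthonormal basis,
    dropping any direction that is (numerically) linearly dependent."""
    basis: List[Vector] = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:
            basis.append([vi / norm for vi in v])
    return basis


def project_onto(v: Vector, basis: List[Vector]) -> Vector:
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for b in basis:
        c = dot(v, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def debias(embeddings: List[Vector], bias_directions: List[Vector]) -> List[Vector]:
    """Remove each embedding's component in the bias subspace, then add back
    the mean embedding's component so every output shares one neutral value."""
    basis = orthonormalize(bias_directions)
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    neutral = project_onto(mean, basis)
    result = []
    for e in embeddings:
        inside = project_onto(e, basis)
        result.append([ei - pi + ni for ei, pi, ni in zip(e, inside, neutral)])
    return result


# Toy example: the first axis is treated as the (assumed) bias direction.
out = debias([[1.0, 2.0, 0.5], [3.0, -1.0, 0.0]], [[2.0, 0.0, 0.0]])
```

After the call, both toy embeddings carry the same value along the bias direction while their remaining coordinates are unchanged, which is the contrast with coordinate replacement that the abstract draws: the whole subspace is neutralized rather than a few selected coordinates.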
♻ ☆ RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing
Although there has been significant progress in neural radiance fields, the problem of dynamic illumination changes remains unsolved. Different from relevant works that parameterize time-variant/-invariant components in scenes, subjects' radiance is highly entangled with their own emitted radiance and lighting colors in the spatio-temporal domain. In this paper, we present a new effective method to learn disentangled neural fields under severe illumination changes, named RehearsalNeRF. Our key idea is to leverage scenes captured under stable lighting like rehearsal stages, easily taken before dynamic illumination occurs, to enforce geometric consistency between the different lighting conditions. In particular, RehearsalNeRF employs a learnable vector for lighting effects which represents illumination colors in a temporal dimension and is used to disentangle projected light colors from scene radiance. Furthermore, our RehearsalNeRF is also able to reconstruct the neural fields of dynamic objects by simply adopting off-the-shelf interactive masks. To decouple the dynamic objects, we propose a new regularization leveraging optical flow, which provides coarse supervision for the color disentanglement. We demonstrate the effectiveness of RehearsalNeRF by showing robust performance on novel view synthesis and scene editing under dynamic illumination conditions. Our source code and video datasets will be publicly available.
comment: Accepted to the International Journal of Computer Vision (IJCV). Changyeon Won and Hyunjun Jung contributed equally to this work
♻ ☆ Seeing without Pixels: Perception from Camera Trajectories CVPR 2026
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed provide valuable cues about "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
comment: Accepted by CVPR 2026, Project website: https://sites.google.com/view/seeing-without-pixels
♻ ☆ Demystifying Transition Matching: When and Why It Can Beat Flow Matching AISTATS 2026
Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
comment: AISTATS 2026
Human-Computer Interaction 22
☆ Dark Patterns in Indian Quick Commerce Apps: A Student Perspective
As quick commerce (Q-Commerce) platforms in India redefine urban consumption, the use of deceptive design dark patterns to inflate order values has become a systemic concern. This paper investigates the 'Awareness-Action Gap' among Indian university students, a demographic characterized by high digital fluency yet significant financial constraints. Using a qualitative approach with 16 participants, we explore how temporal pressures and convenience-driven architectures override price sensitivity. Our findings reveal that while students recognize manipulative UI tactics, they frequently succumb to them due to induced cognitive load and the normalization of deceptive marketing as a price of capitalism. We conclude by suggesting value-sensitive design alternatives to align commercial incentives with user autonomy in the Global South.
comment: Accepted to Bridge Over Troubled Water (CHI 2026 Workshop)
☆ Impact of Multimodal and Conversational AI on Learning Outcomes and Experience
Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually-rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of the Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.
comment: 16 pages, 3 figures, Accepted to AIED 2026 (Seoul, South Korea)
☆ Visual Decoding Operators: Towards a Compositional Theory of Visualization Perception
Prior work on perceptual effectiveness has decomposed visualizations into smaller common units (e.g., channels such as angle, position, and length) to establish rankings. While useful, these decompositions lack the computational structure to predict performance for new visualization $\times$ task combinations, requiring new experiments for each. We propose an alternative unit of analysis: operationalizing quantitative visualization interpretation as sequences of composable visual decoding operators. Using probability density function (PDF) and cumulative distribution function (CDF) charts, we examine how chart-specific tasks can be decomposed into reusable, chart-agnostic perceptual operations and characterize their error profiles through hierarchical Bayesian modeling. We then test generalizability by composing learned operators to predict performance on a structurally different task: Moritz et al.'s [35] scatterplot mean-estimation experiment, where the chart type, chart dimensions, and analytic goal all differ from the learning conditions. With a pre-registered analysis plan, we compose operators under six candidate strategies and evaluate each against empirical data with no parameters fit to the response data. One strategy captures both bias and variance of observed responses; five alternatives fail in distinguishable ways. We argue that this decoding-operator-oriented approach to empirical visualization research and theory-building lays the groundwork for generative models that can predict a distribution of likely interpretations under different viewing conditions, new chart types, and new tasks. Free copy of this paper and supplemental materials: https://osf.io/prtfq; experiment interface: https://gleaming-dolphin-799fda.netlify.app/vis-decode-slider.
☆ ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
comment: 7 pages, 4 figures
☆ Designing Transformational Games to Support Socio-ethical Reasoning about Generative AI
There is an increasing need for young people to become critically AI literate, understanding not only how AI works but also its limitations and ethical nuances. Yet, designing learning experiences that make such complex, serious topics engaging remains a challenge. This paper explores transformational games as a promising approach for supporting youth learning about generative AI (GenAI) and ethics. We designed and implemented two games, Diversity Duel and Secret Agent, that integrate GenAI tools with gameplay elements. This work investigates how three elements of the games, (1) peer evaluation, (2) constraint-based creativity, and (3) social deduction, supported socio-ethical reasoning about GenAI. Participants recognized and debated bias in GenAI outputs, connected these patterns to real-world inequities, and developed nuanced understandings of bias. Participants further came to see how prompt design shapes AI behavior. Our findings suggest that group-based games with these elements can foster critical AI literacy.
☆ ProVega: A Grammar to Ease the Prototyping, Creation, and Reproducibility of Progressive Data Analysis and Visualization Solutions
Modern data analysis requires speed for massive datasets. Progressive Data Analysis and Visualization (PDAV) emerged as a discipline to address this problem, providing fast response times while maintaining interactivity with controlled accuracy. Yet it remains difficult to implement and reproduce. To lower this barrier, we present ProVega, a Vega-Lite-based grammar that simplifies PDAV instrumentation for both simple visualizations and complex visual environments. Alongside it, we introduce Pro-Ex, an editor designed to streamline the creation and analysis of progressive solutions. We validated ProVega by reimplementing 11 exemplars from the literature (verified for fidelity by 39 users) and demonstrating its support for various progressive methods, including data-chunking, process-chunking, and mixed-chunking. An expert user study confirmed the efficacy of ProVega and the Pro-Ex environment in real-world tasks. ProVega, Pro-Ex, and all related materials are available at https://github.com/XAIber-lab/provega
☆ As Far as Eye See: Vergence-Pupil Coupling in Near-Far Depth Switching
Vergence is widely used as a proxy for depth perception and spatial attention in immersive and real-world eye-tracking studies. In this paper, we investigate how pupil size artefacts affect vergence estimates during real physical depth viewing with a head-mounted eye tracker. Using a beamsplitter setup with physically near and far targets, we elicited controlled convergent and divergent eye movements under static, luminance-modulated, and blockwise fixation conditions. Near and far targets were reliably separable in vergence angle across participants. However, pupil-vergence coupling varied substantially across individuals and conditions. Static illumination produced large inter-participant variability, while luminance modulation reduced this spread, yielding more clustered estimates. Blockwise and audio-cued recordings further showed that pupil-vergence coupling persists even without visual depth onsets. These results suggest that pupil size fluctuations can systematically influence vergence estimates, and that controlled viewing conditions can reduce, but not eliminate, this effect.
comment: 6 pages, 2 figures, ETRA26
☆ Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure that adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
comment: 6 pages, 3 figures, 2 algorithms, ETRA26
☆ Eyes Can't Always Tell: Fusing Eye Tracking and User Priors for User Modeling under AI Advice Conditions
Modeling users' cognitive states (e.g., cognitive load and decision confidence) is essential for building adaptive AI in high-stakes decision-making. While eye tracking provides non-invasive behavioral signals correlated with cognitive effort, prior work has not systematically examined how AI assistance contexts, specifically varying advice reliability and user heterogeneity, can alter the mapping between gaze signals and cognitive states. We conducted a within-subject lab eye-tracking study (N=54) on factual verification tasks under three conditions: No-AI, Correct-AI advice, and Incorrect-AI advice. We analyze condition-dependent changes in self-reports and eye-tracking patterns and evaluate the robustness of eye-tracking-based user modeling. Results show that AI advice increases decision confidence compared to No-AI, while Correct-AI is associated with lower perceived cognitive load and more efficient gaze behavior. Crucially, predictive modeling is context-sensitive: the relationship between eye-tracking signals and cognitive states shifts across AI conditions. Finally, fusing eye-tracking features with user priors (demographics, AI literacy/experience, and propensity to trust technology) improves cross-participant generalization. These findings support condition-aware and personalized user modeling for cognitively aligned adaptive AI systems.
☆ Cognitive Energy Modeling for Neuroadaptive Human-Machine Systems using EEG and WGAN-GP
Electroencephalography (EEG) provides a non-invasive insight into the brain's cognitive and emotional dynamics. However, modeling how these states evolve in real time and quantifying the energy required for such transitions remains a major challenge. The Schrödinger Bridge Problem (SBP) offers a principled probabilistic framework to model the most efficient evolution between the brain states, interpreted as a measure of cognitive energy cost. While generative models such as GANs have been widely used to augment EEG data, it remains unclear whether synthetic EEG preserves the underlying dynamical structure required for transition-based analysis. In this work, we address this gap by using SBP-derived transport cost as a metric to evaluate whether GAN-generated EEG retains the distributional geometry necessary for energy-based modeling of cognitive state transitions. We compare transition energies derived from real and synthetic EEG collected during Stroop tasks and demonstrate strong agreement across group and participant-level analyses. These results indicate that synthetic EEG preserves the transition structure required for SBP-based modeling, enabling its use in data-efficient neuroadaptive systems. We further present a framework in which SBP-derived cognitive energy serves as a control signal for adaptive human-machine systems, supporting real-time adjustment of system behavior in response to user cognitive and affective state.
☆ AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models
Smell's deep connection with food, memory, and social experience has long motivated researchers to bring olfaction into interactive systems. Yet most olfactory interfaces remain limited to fixed scent cartridges and pre-defined generation patterns, and the scarcity of large-scale olfactory datasets has further constrained AI-based approaches. We present AromaGen, an AI-powered wearable interface capable of real-time, general-purpose aroma generation from free-form text or visual inputs. AromaGen is powered by a multimodal LLM that leverages latent olfactory knowledge to map semantic inputs to structured mixtures of 12 carefully selected base odorants, released through a neck-worn dispenser. Users can iteratively refine generated aromas through natural language feedback via in-context learning. Through a controlled user study ($N = 26$), AromaGen matches human-composed mixtures in zero-shot generation and significantly surpasses them after iterative refinement, achieving a median similarity of 8/10 to real food aromas and reducing perceived artificiality to levels comparable to real food. AromaGen is a step towards real-world interactive aroma generation, opening new possibilities for communication, wellbeing, and immersive technologies.
☆ Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
☆ Designing for Patient Voice in Interactive Health
Interactive Health (IH) research increasingly engages patients through participatory and user-centred approaches. However, patients' lived experiences are typically treated more as data to be analysed than as knowledge in their own right. In this paper, I argue that 'patient voice' in the field of IH is both an inclusion issue and an epistemic one. More specifically, it concerns how experiential accounts are recognised and circulated. I examine how methodological conventions, authorship norms, review criteria, and publication formats tend to position patients as participants rather than as authors of evidence. Looking to patient-partnered practices in medical publishing, including The BMJ, JAMA, and British Journal of Sports Medicine, I outline a possible infrastructural pathway for supporting patient-authored or patient-led experiential contributions within the field. I present this as a design probe to surface assumptions and trade-offs. I end this paper by inviting the IH community to reflect on how its knowledge infrastructures might accommodate experiential evidence alongside established research forms.
comment: This paper has been conditionally accepted to the Interactive Health Conference 2026 in Porto, Portugal
☆ The Weak Signal Cultivation Model: A Human-Centric Framework for Frontline Risk Detection, Signal Tracking, and Proactive Organizational Resilience
This white paper introduces the Weak Signal Cultivation Model (WSCM), a human-centric framework for detecting, structuring, and tracking weak risk signals as observed by frontline staff. The model centers on a continuous [0,10] x [0,10] coordinate field, the Weak Signal Cultivation Field, in which each identified signal is positioned as a node on two independent dimensions: its current Risk Intensity (x) and its Risk Growth Potential (y). Nodes move across the field over time as new team assessments or measurements arrive, tracing a risk locus. The locus reflects the signal's trajectory across four possible regions: Question Marks, Lit Fuses, Sleeping Cats, and Owls. This graphical approach gives frontline staff and management a single organizational vocabulary, bridging risk communication from frontline experience to management decision-making. The model introduced in this document is designed to serve as a practitioner tool and a conceptual foundation for AI-supported analytics.
comment: 23 pages, 2 figures, 8 tables, 15 equations, white paper
♻ ☆ Conversational Successes and Breakdowns in Everyday Smart Glasses Use
Non-display smart glasses hold the potential to support everyday activities by combining continuous environmental sensing with voice-only interaction powered by large language models (LLMs). Understanding how conversational successes and breakdowns arise in everyday contexts can better inform the design of future voice-only interfaces. To investigate this, we conducted a month-long collaborative autoethnography (n=2) to identify patterns of successes and breakdowns when using such devices. We then compare these patterns with prior findings on voice-only interactions to highlight the unique affordances and opportunities offered by non-display smart glasses.
♻ ☆ Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners
Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses: some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. In open-ended feedback, learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforce the need for calibrated, adjustable scaffolding, and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
comment: Accepted at AIED 2026
♻ ☆ CommentScope: A Comment-Embedded Assisted Reading System for a Long Text
Long texts are ubiquitous on social platforms, yet readers often face information overload and struggle to locate key content. Comments provide valuable external perspectives for understanding, questioning, and complementing the text, but their potential is hindered by disorganized and unstructured presentation. Few studies have explored embedding comments directly into reading. As an exploratory step, we propose CommentScope, a system with two core modules: a classification pipeline powered by a fine-tuned Large Language Model (LLM) that categorizes comments into five pragmatic types and aligns them with relevant sentences, and a presentation module that integrates comments inline or as side notes, supported by visual cues like colors, charts, and highlights. Technical evaluation demonstrates that the fine-tuned model effectively captures implicit pragmatic functions and context, achieving solid performance in semantic classification (accuracy=0.89) and position exact match (EM=0.82). A user study (N=12) further demonstrates that the sentence-end embedding improved comment discovery accuracy and reading fluency, while reducing mental demand and perceived effort compared to traditional baselines.
comment: 12 pages, 8 figures
♻ ☆ Beyond AI advice -- independent aggregation boosts human-AI accuracy
Artificial intelligence (AI) is broadly deployed as an advisor to human decision-makers: AI recommends a decision and a human accepts or rejects the advice. This approach, however, has several limitations: People frequently ignore accurate advice and rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. Here, we compare the AI-as-advisor approach to the hybrid confirmation tree (HCT), an alternative strategy that preserves the independence of human and AI judgments. The HCT elicits a human judgment and an AI judgment independently of each other. If they agree, that decision is accepted. If not, a second human breaks the tie. For the comparison, we used 10 datasets from various domains, including medical diagnostics and misinformation discernment, and a subset of four datasets in which AI also explained its decision. The HCT outperformed the AI-as-advisor approach in all datasets. The HCT also performed better in almost all cases in which AI offered an explanation of its judgment. Using signal detection theory to interpret these results, we find that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds.
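The decision rule described in this abstract is simple enough to sketch directly. The following is a minimal illustration, assuming categorical judgments; the function name and the lazily evaluated `second_human` callable are illustrative choices, not taken from the paper.

```python
from typing import Callable, Hashable


def hybrid_confirmation_tree(
    first_human: Hashable,
    ai: Hashable,
    second_human: Callable[[], Hashable],
) -> Hashable:
    """Return the decision for one case under the rule in the abstract.

    The first human and the AI judge independently. If they agree, their
    shared judgment is accepted. Otherwise a second human is consulted
    and that judgment breaks the tie.
    """
    if first_human == ai:
        return first_human
    # The tie-breaker is a callable so that the second human is only
    # consulted when the first two judgments disagree (an assumption
    # about how one might deploy the rule, not a detail from the paper).
    return second_human()


if __name__ == "__main__":
    print(hybrid_confirmation_tree("benign", "benign", lambda: "malignant"))
    print(hybrid_confirmation_tree("benign", "malignant", lambda: "malignant"))
```

Passing the second judgment as a callable makes the cost structure visible: the extra human effort is spent only on the cases where the human and the AI disagree.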
♻ ☆ Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies
While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models, by contrast, maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self-organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances
comment: 22 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2508.17366
♻ ☆ When Models Fabricate Credentials: Measuring How Professional Identity Suppresses Honest Self-Representation
When language models are assigned professional personas, they face a conflict between maintaining the persona and disclosing their AI nature. How models resolve this conflict has practical consequences: a model that constructs detailed narratives of medical training and board certifications presents a surface of professional authority it does not possess. We systematically characterize this behavior using AI identity disclosure as a testbed: when probed about expertise origins, a model can either acknowledge its AI nature or maintain its assigned professional identity. Using a factorial design, sixteen open-weight models were audited across 19,200 trials. Under neutral conditions, models disclosed their AI nature in 99.8%-99.9% of interactions; assigning a professional persona reduced disclosure to 36.3% on average, though this suppression was highly context-dependent: the same models that maintained a neurosurgeon persona often disclosed under a financial advisor persona, a 9.7-fold difference. Counter to expectations that greater scale should support broader behavioral generalization, model size explained little of this variation, while model identity explained substantially more (Delta R_adj^2 = 0.375 vs. 0.012). We hypothesized that instruction-following dynamics contribute to these patterns and probed this directly: varying a single system prompt statement increased disclosure from 23.7% to 65.8%, while general honesty instructions produced negligible effects. Self-representational behavior does not generalize across professional contexts; instead, models exhibit sharp and sometimes unexpected differences under minor environmental changes, with training choices appearing to matter more than scale.
comment: Submitted to COLM; 43 pages, 12 figures, 15 tables; sharpen focus of paper and reduced length of paper
♻ ☆ NaturalEdit: Code Modification through Direct Interaction with Adaptive Natural Language Representation
Code modification requires developers to comprehend code, plan changes, articulate intent, and validate outcomes, making it cognitively demanding. While natural language (NL) code summaries offer a promising external representation of this process, existing approaches remain limited. Systems grounded in exploratory data analysis are restricted to narrow domains, while general-purpose systems enforce fixed NL representations and assume that developers can directly translate vague intent into precise textual edits. We present NaturalEdit, which treats NL code summaries as interactive representations tightly linked to source code. Grounded in the Cognitive Dimensions of Notations, NaturalEdit introduces three key features: (1) adaptive, multi-faceted code summaries with a flexible Abstraction Gradient; (2) interactive mapping mechanisms between summaries and code that ensure tight, structurally stable Closeness of Mapping; and (3) intent-driven bidirectional synchronization that reduces Viscosity during editing while preserving Visibility and Consistency through incremental diffs. A technical evaluation confirms the viability of NaturalEdit, and a user study with 20 developers shows that it improves comprehension, intent articulation, and validation while increasing developers' confidence and sense of control.
♻ ☆ AutiHero: Engaging Parents in Creating Personalized, Multi-path Social Narratives for Autistic Children
Social narratives help autistic children understand and navigate social situations through stories. To ensure effective practice, however, they often require significant time and effort from parents in customizing the narrative materials and delivering repeated instructions on them. We present AutiHero, a generative AI (GenAI)-based social narrative system, which supports parents in creating personalized, multi-path stories targeting specific behaviors of their autistic children, while enabling them to explore behavioral choices and causal consequences together while reading. A two-week deployment study with 16 autistic child-parent dyads showed that parents actively created, adapted, and read stories with their children, with increased confidence in everyday behavioral guidance. Our work contributes real-world-contextualized text+image content creation approaches harnessing GenAI, ensuring user-aligned application in sensitive contexts involving autistic children and their parents.
comment: 11 pages except reference
Machine Learning 150
☆ ActionParty: Multi-Subject Action Binding in Generative Video Games
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
comment: Project page: https://action-party.github.io/
☆ Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
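The collapse attributed to mean initialization can be seen in a toy setting. The sketch below assumes plain mean initialization with no added noise (real implementations may perturb the mean); the helper names and the tiny 2-dimensional "vocabulary" are made up for illustration and are not from the paper.

```python
import itertools
import math


def mean_initialize(embeddings, num_new):
    """Return `num_new` embedding rows, each set to the vocabulary mean."""
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    # Every new token receives its own copy of the same vector.
    return [list(mean) for _ in range(num_new)]


def max_pairwise_distance(rows):
    """Largest Euclidean distance between any two rows (0.0 if fewer than two)."""
    return max(
        (math.dist(a, b) for a, b in itertools.combinations(rows, 2)),
        default=0.0,
    )


if __name__ == "__main__":
    vocab = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy pretrained embeddings
    new_rows = mean_initialize(vocab, num_new=3)
    # All new tokens coincide, so no inter-token structure exists at step 0.
    print(new_rows, max_pairwise_distance(new_rows))
```

Under this assumption the new rows are identical, so any distinction between the new tokens has to be created entirely by fine-tuning gradients, which is the bottleneck the abstract points to.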
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
comment: 43 pages, 5 figures, 24 tables
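The training setup in this abstract (N problems in one shared context, rewarded by per-instance accuracy) can be sketched as follows. The prompt wording, the `Answer k:` output convention, and both function names are assumptions made for this illustration; the abstract does not specify the prompt format or how answers are extracted.

```python
import re


def build_batched_prompt(problems):
    """Pack N problems into one shared context (hypothetical format)."""
    lines = [f"Solve all {len(problems)} problems. Finish each with 'Answer k: <value>'."]
    lines += [f"Problem {k}: {p}" for k, p in enumerate(problems, start=1)]
    return "\n".join(lines)


def per_instance_rewards(completion, gold):
    """Score each problem independently: 1.0 if its answer matches, else 0.0.

    Answers are read from lines of the assumed form 'Answer k: value'.
    A problem with no parsed answer scores 0.0.
    """
    found = {int(k): v.strip() for k, v in re.findall(r"Answer (\d+):\s*(.+)", completion)}
    return [1.0 if found.get(k) == g else 0.0 for k, g in enumerate(gold, start=1)]


if __name__ == "__main__":
    print(build_batched_prompt(["1 + 3 = ?", "2 + 5 = ?"]))
    print(per_instance_rewards("Answer 1: 4\nAnswer 2: 8", ["4", "7"]))
```

Note that nothing in the reward mentions length: in this sketch, as in the abstract's description, any pressure toward shorter reasoning comes only from the N problems sharing one context window.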
☆ Topological Effects in Neural Network Field Theory
Neural network field theory formulates field theory as a statistical ensemble of fields defined by a network architecture and a density on its parameters. We extend the construction to topological settings via the inclusion of discrete parameters that label the topological quantum number. We recover the Berezinskii--Kosterlitz--Thouless transition, including the spin-wave critical line and the proliferation of vortices at high temperatures. We also verify the T-duality of the bosonic string, showing invariance under the exchange of momentum and winding on $S^1$, the transformation of the sigma model couplings according to the Buscher rules on constant toroidal backgrounds, the enhancement of the current algebra at self-dual radius, and non-geometric T-fold transition functions.
comment: 55 pages, 8 figures
☆ go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
comment: 29 pages, 30 figures, 9 tables. Includes supplementary material
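For orientation on the go-$m$HC entry above: a classical orthostochastic matrix is the entrywise square of an orthogonal matrix, so its rows and columns each sum to one. The sketch below illustrates only that classical fact, using a product of Givens rotations. It is not the paper's generalized parameterization (the abstract does not spell that out), and the function names and the rotation list are hypothetical.

```python
import math

def givens(d, i, j, theta):
    """Return the d x d rotation by theta in the (i, j) plane (orthogonal)."""
    g = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    g[i][i] = g[j][j] = math.cos(theta)
    g[i][j] = -math.sin(theta)
    g[j][i] = math.sin(theta)
    return g

def matmul(a, b):
    """Plain dense product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def orthostochastic(d, rotations):
    """Square the entries of an orthogonal matrix built from Givens rotations.

    rotations: list of (i, j, theta) triples; a hypothetical parameterization
    chosen for illustration, not the one proposed in the paper.
    """
    q = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    for i, j, theta in rotations:
        q = matmul(q, givens(d, i, j, theta))
    # Rows and columns of q have unit norm, so the squared entries sum to 1.
    return [[x * x for x in row] for row in q]

mix = orthostochastic(3, [(0, 1, 0.3), (1, 2, 1.1), (0, 2, -0.7)])
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[r][c] for r in range(3)) for c in range(3)]
```

Both `row_sums` and `col_sums` come out as ones up to floating-point error, so `mix` is a valid stream-mixing matrix without any iterative normalization.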
☆ Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits and has non-negative values. HCCS differs from previous softmax surrogates as it includes a set of lightweight calibration parameters that are optimized offline based on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely upon either bfloat16 arithmetic or LUTs to perform the exponential operation, which might limit the throughput of the platform and fail to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines' int8 multiply accumulate (MAC) units. To the best of our knowledge, this is the first int8 optimized softmax surrogate for AMD AI engines that significantly exceeds the speed performance of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
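A minimal sketch of what a clipped-linear softmax surrogate could look like, assuming the form max(0, 1 + slope * (x - max(x))) followed by normalization. The abstract does not give the exact HCCS formula, so this form, the function name, and the single `slope` parameter (standing in for the per-head calibration) are assumptions. The sketch uses floats, not the int8 arithmetic targeted by the paper.

```python
def clipped_linear_softmax(logits, slope):
    """Softmax surrogate: clip 1 + slope * (x - max(x)) at zero, then normalize.

    slope > 0 is a hypothetical per-head calibration parameter.
    """
    top = max(logits)
    # Max-centering makes every shifted logit <= 0, so scores lie in [0, 1].
    scores = [max(0.0, 1.0 + slope * (x - top)) for x in logits]
    total = sum(scores)  # >= 1 because the largest logit always scores 1
    return [s / total for s in scores]

probs = clipped_linear_softmax([2.0, 1.0, -3.0], slope=0.5)
```

The output is non-negative, sums to one, and keeps the ordering of the logits (ties can appear once entries are clipped to zero). No exponential is evaluated.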
☆ Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
☆ De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
☆ Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks, and de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
comment: 39 pages, 13 figures. Code available at: https://github.com/joshrosie/crystalite
☆ SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7\% for ALFWorld and +6.6\% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
☆ Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
comment: 15 pages, 5 figures, 2 tables. This work has been submitted to the IEEE for possible publication
☆ Best-Arm Identification with Noisy Actuation
In this paper, we consider a multi-armed bandit (MAB) instance and study how to identify the best arm when arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC). Depending on the agent capabilities, we provide communication schemes along with their analysis, which interestingly relate to the zero-error capacity of the underlying DMC.
☆ Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives
Understanding causal dependencies in observational data is critical for informing decision-making. These relationships are often modeled as Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). Existing methods, such as NOTEARS and DAG-GNN, often face issues with scalability and stability in high-dimensional data, especially when there is a feature-sample imbalance. Here, we show that the denoising score matching objective of diffusion models could smooth the gradients for faster, more stable convergence. We also propose an adaptive k-hop acyclicity constraint that improves runtime over existing solutions that require matrix inversion. We name this framework Denoising Diffusion Causal Discovery (DDCD). Unlike generative diffusion models, DDCD utilizes the reverse denoising process to infer a parameterized causal structure rather than to generate data. We demonstrate the competitive performance of DDCDs on synthetic benchmarking data. We also show that our methods are practically useful by conducting qualitative analyses on two real-world examples. Code is available at this url: https://github.com/haozhu233/ddcd.
comment: To appear in the Proceedings of the 5th Conference on Causal Learning and Reasoning (CLeaR 2026)
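To make the idea of an inversion-free acyclicity check in the DDCD entry above concrete, here is a sketch of a truncated trace penalty: for a nonnegative matrix A, the sum of tr(A^i) for i = 1..k is zero exactly when the graph has no directed cycle of length at most k. This is a generic construction, assumed here for illustration. The paper's adaptive k-hop constraint may be defined differently, and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain dense product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def k_hop_cycle_penalty(weights, k):
    """Sum of tr(A^i) for i = 1..k, with A the entrywise square of weights.

    Squaring makes A nonnegative, so the penalty is zero exactly when the
    weighted graph has no directed cycle of length <= k.
    """
    n = len(weights)
    a = [[w * w for w in row] for row in weights]
    power = a
    penalty = 0.0
    for _ in range(k):
        penalty += sum(power[i][i] for i in range(n))
        power = matmul(power, a)
    return penalty

dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]
three_cycle = [[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]

dag_penalty = k_hop_cycle_penalty(dag, 3)            # 0.0: no cycles
short_horizon = k_hop_cycle_penalty(three_cycle, 2)  # 0.0: the 3-cycle is missed
full_horizon = k_hop_cycle_penalty(three_cycle, 3)   # 3.0: the 3-cycle is caught
```

The `three_cycle` case shows the trade-off behind choosing k: a small horizon is cheaper but can miss longer cycles, while k equal to the number of nodes is always sufficient.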
☆ BVFLMSP : Bayesian Vertical Federated Learning for Multimodal Survival with Privacy
Multimodal time-to-event prediction often requires integrating sensitive data distributed across multiple parties, making centralized model training impractical due to privacy constraints. At the same time, most existing multimodal survival models produce single deterministic predictions without indicating how confident the model is in its estimates, which can limit their reliability in real-world decision making. To address these challenges, we propose BVFLMSP, a Bayesian Vertical Federated Learning (VFL) framework for multimodal time-to-event analysis based on a Split Neural Network architecture. In BVFLMSP, each client independently models a specific data modality using a Bayesian neural network, while a central server aggregates intermediate representations to perform survival risk prediction. To enhance privacy, we integrate differential privacy mechanisms by perturbing client side representations before transmission, providing formal privacy guarantees against information leakage during federated training. We first evaluate our Bayesian multimodal survival model against widely used single modality survival baselines and the centralized multimodal baseline MultiSurv. Across multimodal settings, the proposed method shows consistent improvements in discrimination performance, with up to 0.02 higher C-index compared to MultiSurv. We then compare federated and centralized learning under varying privacy budgets across different modality combinations, highlighting the tradeoff between predictive performance and privacy. Experimental results show that BVFLMSP effectively includes multimodal data, improves survival prediction over existing baselines, and remains robust under strict privacy constraints while providing uncertainty estimates.
☆ (PAC-)Learning state machines from data streams: A generic strategy and an improved heuristic (Extended version)
This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof on PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix and is now a full Section. State machine models simulate the behavior of discrete event systems; they can represent software systems, network interactions, and control systems, and have been researched extensively. Most learning algorithms, however, assume that all data are available at the beginning of the algorithm, and little research has been done on learning state machines from streaming data. In this paper, we want to close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well-known open dataset. Additionally, we provide a formal analysis of our algorithm, showing that it is capable of learning within the PAC framework, and show a theoretical improvement to run-time that does not sacrifice the correctness of the algorithm in larger sample sizes.
comment: Extended version of Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco
☆ When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning IJCNN
Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
comment: In Proceedings of International Joint Conference on Neural Networks (IJCNN)
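The gating logic described in the ASK entry above can be sketched as follows, assuming uncertainty is measured as the entropy of the action distribution averaged over several dropout-enabled forward passes. The abstract does not say which uncertainty statistic ASK uses, so the entropy choice, the threshold value, and the `ask_lm` callback are assumptions. The forward passes are represented here by precomputed probability vectors.

```python
import math

def mean_distribution(samples):
    """Average the action-probability vectors from several stochastic passes."""
    n_actions = len(samples[0])
    return [sum(s[a] for s in samples) / len(samples) for a in range(n_actions)]

def predictive_entropy(samples):
    """Entropy in nats of the averaged action distribution."""
    return -sum(p * math.log(p) for p in mean_distribution(samples) if p > 0.0)

def choose_action(samples, threshold, ask_lm):
    """Follow the RL policy unless its uncertainty exceeds the threshold.

    ask_lm is a hypothetical zero-argument callback standing in for a
    language-model query that returns a suggested action index.
    """
    if predictive_entropy(samples) > threshold:
        return ask_lm()
    mean = mean_distribution(samples)
    return max(range(len(mean)), key=lambda a: mean[a])

confident = [[0.97, 0.01, 0.01, 0.01]] * 5   # in-distribution: passes agree
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 5   # OOD-like: near-uniform output

in_domain_action = choose_action(confident, 0.5, lambda: 1)
gated_action = choose_action(uncertain, 0.5, lambda: 1)
```

With these inputs the confident case keeps the policy's greedy action (index 0), while the uncertain case defers to the callback, so the language model is queried only when the gate opens.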
☆ Universal Hypernetworks for Arbitrary Models
Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the \emph{Universal Hypernetwork} (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.
☆ LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.
comment: 10 pages, 6 figures
☆ On the Role of Depth in the Expressivity of RNNs
The benefits of depth in feedforward neural networks are well known: composing multiple layers of linear transformations with nonlinear activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs' memory capacity efficiently with respect to the number of parameters, thus enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We broaden our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.
☆ From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards. Current EASA guidelines mandate demonstrating complete coverage of the AI/ML constituent's Operational Design Domain (ODD) -- a requirement that demands proof that no critical gaps exist within defined operational boundaries. However, as systems operate within high-dimensional parameter spaces, existing methods struggle to provide the scalability and formal grounding necessary to satisfy the completeness criterion. Currently, no standardized engineering method exists to bridge the gap between abstract ODD definitions and verifiable evidence. This paper addresses this void by proposing a method that integrates parameter discretization, constraint-based filtering, and criticality-based dimension reduction into a structured, multi-step ODD coverage verification process. Grounded in simulation data gathered in prior research on AI-based mid-air collision avoidance, this work demonstrates a systematic engineering approach to defining and achieving coverage metrics that satisfy EASA's demand for completeness. Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA's standards.
☆ Computing the Exact Pareto Front in Average-Cost Multi-Objective Markov Decision Processes
Many communication and control problems are cast as multi-objective Markov decision processes (MOMDPs). The complete solution to an MOMDP is the Pareto front. Much of the literature approximates this front via scalarization into single-objective MDPs. Recent work has begun to characterize the full front in discounted or simple bi-objective settings by exploiting its geometry. In this work, we characterize the exact front in average-cost MOMDPs. We show that the front is a continuous, piecewise-linear surface lying on the boundary of a convex polytope. Each vertex corresponds to a deterministic policy, and adjacent vertices differ in exactly one state. Each edge is realized as a convex combination of the policies at its endpoints, with the mixing coefficient given in closed form. We apply these results to a remote state estimation problem, where each vertex on the front corresponds to a threshold policy. The exact Pareto front and solutions to certain non-convex MDPs can be obtained without explicitly solving any MDP.
☆ Neural network methods for two-dimensional finite-source reflector design
We address the inverse problem of designing two-dimensional reflectors that transform light from a finite, extended source into a prescribed far-field distribution. We propose a neural network parameterization of the reflector height and develop two differentiable objective functions: (i) a direct change-of-variables loss that pushes the source distribution through the learned inverse mapping, and (ii) a mesh-based loss that maps a target-space grid back to the source, integrates over intersections, and remains continuous even when the source is discontinuous. Gradients are obtained via automatic differentiation and optimized with a robust quasi-Newton method. As a comparison, we formulate a deconvolution baseline built on a simplified finite-source approximation: a 1D monotone mapping is recovered from flux balance, yielding an ordinary differential equation solved in integrating-factor form; this solver is embedded in a modified Van Cittert iteration with nonnegativity clipping and a ray-traced forward operator. Across four benchmarks -- continuous and discontinuous sources, and with/without minimum-height constraints -- we evaluate accuracy by ray-traced normalized mean absolute error (NMAE). Our neural network approach converges faster and achieves consistently lower NMAE than the deconvolution method, and handles height constraints naturally. We discuss how the method may be extended to rotationally symmetric and full three-dimensional settings via iterative correction schemes.
comment: 20 pages, 10 figures, 1 table. Submitted to Machine Learning: Science and Technology
☆ The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
☆ Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
comment: 8 pages
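As a rough illustration of the lexical approach in the entry above, the sketch below clusters software mentions by string similarity using the standard-library `difflib.SequenceMatcher`. The abstract does not describe how the Fuzzy Matching system is implemented, so the greedy clustering, the case folding, and the 0.8 threshold are assumptions, and the mention list is made-up example data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two mentions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_mentions(mentions, threshold=0.8):
    """Greedily assign each mention to the first cluster whose first member
    is similar enough; otherwise start a new cluster."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if similarity(mention, cluster[0]) >= threshold:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters

clusters = cluster_mentions(["SPSS", "spss", "Matlab", "MATLAB", "ImageJ"])
# → [['SPSS', 'spss'], ['Matlab', 'MATLAB'], ['ImageJ']]
```

This loop compares each mention against every existing cluster, so its cost grows quadratically in the worst case. It is purely surface-based, which is why noise that corrupts mention boundaries can hurt a method of this kind.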
☆ A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using the Slurm workload manager historical logs and GPU performance metrics collected by NVIDIA's Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter at NERSC, an HPE Cray EX system based on NVIDIA A100 GPUs. Using our insights from the resource utilization analysis of VASP applications, we propose a resource prediction framework to predict the average GPU power, maximum GPU utilization, and maximum GPU memory utilization values of heterogeneous HPC system applications to enable more efficient scheduling decisions and power-aware system operation. Our prediction framework consists of two stages: 1) using only the Slurm accounting logs as training data and 2) augmenting the training data with historical GPU profiling metrics collected with DCGM. The maximum GPU utilization predictions using only the Slurm submission features achieve up to 97% accuracy. Furthermore, features engineered from GPU-compute and memory activity metrics exhibit good correlations with average power utilization, and our runtime power usage prediction experiments result in up to 92% prediction accuracy. These findings demonstrate the effectiveness of DCGM metrics in capturing application characteristics and highlight their potential for developing predictive models to support dynamic power management in HPC systems.
comment: 9 pages, 6 figures
☆ AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
comment: 9 pages, 2 figures
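Frequency-stratified evaluation, as proposed in the entry above, can be sketched by bucketing labels by their training frequency and scoring each bucket separately. The boundary of 50 echoes the abstract's "fewer than 50 training examples". The second boundary, the use of recall as the metric, the function name, and the example labels and counts are assumptions for illustration.

```python
def stratified_recall(train_counts, gold, predicted, edges=(50, 500)):
    """Micro-averaged recall per label-frequency bucket.

    train_counts: label -> number of training examples
    gold, predicted: lists of label sets, one pair per document
    edges: hypothetical bucket boundaries (bucket i holds counts < edges[i])
    """
    def bucket(label):
        count = train_counts.get(label, 0)
        for i, edge in enumerate(edges):
            if count < edge:
                return i
        return len(edges)

    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for gold_labels, predicted_labels in zip(gold, predicted):
        for label in gold_labels:
            b = bucket(label)
            totals[b] += 1
            if label in predicted_labels:
                hits[b] += 1
    # None marks a bucket with no gold labels at all.
    return [h / t if t else None for h, t in zip(hits, totals)]

train_counts = {"galaxies": 900, "pulsars": 120, "magnetars": 12}
gold = [{"galaxies", "magnetars"}, {"galaxies", "pulsars"}, {"magnetars"}]
predicted = [{"galaxies"}, {"galaxies", "pulsars"}, {"pulsars"}]

per_bucket = stratified_recall(train_counts, gold, predicted)
# → [0.0, 1.0, 1.0]  (rare, medium, frequent)
```

In this toy case the aggregate recall is 3 of 5 gold labels, which hides the fact that every miss falls in the rare bucket.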
☆ Auction-Based Online Policy Adaptation for Evolving Objectives
We consider multi-objective reinforcement learning problems where objectives come from an identical family -- such as the class of reachability objectives -- and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on Atari Assault and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
comment: 17 pages, 6 figures
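The auction mechanism described in the abstract above (each local policy bids, the highest bidder picks the action, and objectives are added or removed by adding or removing policies) can be illustrated with a minimal sketch. The `Policy` interface, the toy 1-D target policies, and the tie-breaking rule are assumptions made for illustration only; in the paper the bids are learned with PPO rather than hand-coded.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interface: a local policy maps a state to (bid, action).
Policy = Callable[[int], Tuple[float, str]]


def auction_step(policies: Dict[str, Policy], state: int) -> Tuple[str, str]:
    """Return (winning objective, executed action) for one time step."""
    if not policies:
        raise ValueError("no active objectives")
    bids = {name: policy(state) for name, policy in policies.items()}
    # Highest bid wins; iterating in sorted order makes ties deterministic.
    winner = max(sorted(bids), key=lambda name: bids[name][0])
    return winner, bids[winner][1]


def make_target_policy(target: int) -> Policy:
    """Toy 1-D policy: bids higher when closer to its target, moves towards it."""
    def policy(position: int) -> Tuple[float, str]:
        distance = abs(target - position)
        bid = 1.0 / (1 + distance)
        if distance == 0:
            action = "stay"
        else:
            action = "right" if target > position else "left"
        return bid, action
    return policy


# Two active objectives; the nearer target is more "urgent" in this toy model.
active = {"reach_2": make_target_policy(2), "reach_9": make_target_policy(9)}
first = auction_step(active, 3)

# An objective disappears at runtime: adapt by dropping its policy.
del active["reach_2"]
second = auction_step(active, 3)
print(first, second)
```

Because the coordinator only sees bids and actions, adding a new objective amounts to inserting one more entry into the dictionary.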
☆ AEGIS: Adversarial Entropy-Guided Immune System -- Thermodynamic State Space Models for Zero-Day Network Evasion Detection
As TLS 1.3 encryption limits traditional Deep Packet Inspection (DPI), the security community has pivoted to Euclidean Transformer-based classifiers (e.g., ET-BERT) for encrypted traffic analysis. However, these models remain vulnerable to byte-level adversarial morphing -- recent pre-padding attacks reduced ET-BERT accuracy to 25.68%, while VLESS Reality bypasses certificate-based detection entirely. We introduce AEGIS: an Adversarial Entropy-Guided Immune System powered by a Thermodynamic Variance-Guided Hyperbolic Liquid State Space Model (TVD-HL-SSM). Rather than competing in the Euclidean payload-reading domain, AEGIS discards payload bytes in favor of 6-dimensional continuous-time flow physics projected into a non-Euclidean Poincare manifold. Liquid Time-Constants measure microsecond IAT decay, and a Thermodynamic Variance Detector computes sequence-wide Shannon Entropy to expose automated C2 tunnel anomalies. A pure C++ eBPF Harvester with zero-copy IPC bypasses the Python GIL, enabling a linear-time O(N) Mamba-3 core to process 64,000-packet swarms at line-rate. Evaluated on a 400GB, 4-tier adversarial corpus spanning backbone traffic, IoT botnets, zero-days, and proprietary VLESS Reality tunnels, AEGIS achieves an F1-score of 0.9952 and 99.50% True Positive Rate at 262 us inference latency on an RTX 4090, establishing a new state-of-the-art for physics-based adversarial network defense.
comment: 10 pages, 3 figures, 3 tables
☆ Application of parametric Shallow Recurrent Decoder Network to magnetohydrodynamic flows in liquid metal blankets of fusion reactors
Magnetohydrodynamic (MHD) phenomena play a pivotal role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid metals or molten salts employed in reactor blankets) interact with magnetic fields of varying intensity and orientation, influencing the resulting flow dynamics. The numerical solution of MHD models entails the resolution of highly nonlinear, multiphysics systems of equations, which can become computationally demanding, particularly in multi-query, parametric, or real-time contexts. This study investigates a fully data-driven framework for MHD state reconstruction that integrates dimensionality reduction through Singular Value Decomposition (SVD) with the SHallow REcurrent Decoder (SHRED), a neural network architecture designed to reconstruct the full spatio-temporal state from sparse time-series measurements of selected observables, including previously unseen parametric configurations. The SHRED methodology is applied to a three-dimensional geometry representative of a portion of a WCLL blanket cell, in which lead-lithium flows around a water-cooled tube. Multiple magnetic field configurations are examined, including constant toroidal fields, combined toroidal-poloidal fields, and time-dependent magnetic fields. Across all considered scenarios, SHRED achieves high reconstruction accuracy, robustness, and generalization to magnetic field intensities, orientations, and temporal evolutions not seen during training. Notably, in the presence of time-varying magnetic fields, the model accurately infers the temporal evolution of the magnetic field itself using temperature measurements alone. Overall, the findings identify SHRED as a computationally efficient, data-driven, and flexible approach for MHD state reconstruction, with significant potential for real-time monitoring, diagnostics and control in fusion reactor systems.
☆ Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization
Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.
comment: 8 pages, 4 figures, 2 tables
☆ Gradient estimators for parameter inference in discrete stochastic kinetic models
Stochastic kinetic models are ubiquitous in physics, yet inferring their parameters from experimental data remains challenging. In deterministic models, parameter inference often relies on gradients, as they can be obtained efficiently through automatic differentiation. However, these tools cannot be directly applied to stochastic simulation algorithms (SSA) such as the Gillespie algorithm, since sampling from a discrete set of reactions introduces non-differentiable operations. In this work, we adopt three gradient estimators from machine learning for the Gillespie SSA: the Gumbel-Softmax Straight-Through (GS-ST) estimator, the Score Function estimator, and the Alternative Path estimator. We compare the properties of all estimators in two representative systems exhibiting relaxation or oscillatory dynamics, where the latter requires gradient estimation of time-dependent objective functions. We find that the GS-ST estimator mostly yields well-behaved gradient estimates, but exhibits diverging variance in challenging parameter regimes, resulting in unsuccessful parameter inference. In these cases, the other estimators provide more robust, lower variance gradients. Our results demonstrate that gradient-based parameter inference can be integrated effectively with the Gillespie SSA, with different estimators offering complementary advantages.
comment: 13 pages, 6 figures
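As a rough illustration of one of the estimators named in the abstract above, the score-function estimator can be applied to a single discrete reaction choice. The two-reaction setup, the propensities (a1 = theta, a2 = 1), and the objective f are assumptions chosen so that the exact gradient is known; this is not the authors' implementation and covers only one reaction-selection step, not a full Gillespie trajectory.

```python
import random


def score_function_gradient(theta: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of d/dtheta E[f(X)], X = index of the firing reaction.

    Toy assumption: propensities a1 = theta, a2 = 1, and f(X) = 1 if reaction 1 fires.
    Uses the identity grad E[f(X)] = E[f(X) * d/dtheta log p(X; theta)].
    """
    rng = random.Random(seed)
    p1 = theta / (theta + 1.0)  # probability that reaction 1 fires
    # Derivatives of the log-probabilities with respect to theta.
    score_r1 = 1.0 / theta - 1.0 / (theta + 1.0)
    score_r2 = -1.0 / (theta + 1.0)
    total = 0.0
    for _ in range(n_samples):
        fired_r1 = rng.random() < p1
        f_value = 1.0 if fired_r1 else 0.0
        score = score_r1 if fired_r1 else score_r2
        total += f_value * score
    return total / n_samples


def exact_gradient(theta: float) -> float:
    """Closed form for the toy problem: d/dtheta [theta / (theta + 1)]."""
    return 1.0 / (theta + 1.0) ** 2


estimate = score_function_gradient(2.0, n_samples=100_000, seed=1)
print(estimate, exact_gradient(2.0))
```

The sampling step itself is never differentiated, which is why this estimator sidesteps the non-differentiable reaction selection mentioned in the abstract.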
☆ AA-SVD : Anchored and Adaptive SVD for Large Language Model Compression
We introduce a fast low-rank factorization-based framework for compressing large language models that enables rapid compression of billion-parameter models without retraining. Unlike existing factorization-based approaches that optimize only on the original inputs, ignoring distribution shifts from upstream compression and thus propagating errors forward, or those that rely only on shifted inputs and risk drifting away from the original outputs, our approach accounts for both. Beyond individual layer compression, we further refine each transformer block end-to-end, minimizing block-level output distortion and allowing compressed layers to jointly compensate for accumulated errors. By anchoring each compressed layer to the original outputs while explicitly modeling input distribution shifts, our method finds a low-rank approximation that maintains functional equivalence with the original model. Experiments on large language models show that our method consistently outperforms existing SVD-based baselines across compression ratios, with the advantage becoming increasingly pronounced at aggressive compression budgets, where competing methods degrade substantially or collapse entirely, offering a practical solution for efficient, large-scale model deployment.
☆ Cross-Modal Visuo-Tactile Object Perception
Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust and physically consistent cross-modal integration for robotic multi-sensory perception.
comment: 23 pages, 8 figures, 1 table. Submitted for review to journal
☆ CASHG: Context-Aware Stylized Online Handwriting Generation
Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor--current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
comment: 42 pages, 19 figures
☆ Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. We also build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making the method practical for low-resource settings.
comment: Submitted to Interspeech 2026; 6 pages, 4 figures
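The ABX discrimination task mentioned in the Prosodic ABX abstract above can be sketched in a few lines: given items A and B from two contrasting categories and a probe X from A's category, a trial counts as correct when X is closer to A than to B. The sketch below assumes each item has already been pooled into a single fixed-length vector and uses cosine distance; both are simplifications (ABX evaluations on speech typically compare frame sequences), and counting ties as half-correct is likewise an assumption.

```python
import math
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; raises on zero vectors to avoid dividing by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / norm


def abx_accuracy(triples: Iterable[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of (a, b, x) triples where x (same category as a) is closer to a."""
    correct = 0.0
    count = 0
    for a, b, x in triples:
        d_ax = cosine_distance(a, x)
        d_bx = cosine_distance(b, x)
        if d_ax < d_bx:
            correct += 1.0
        elif d_ax == d_bx:
            correct += 0.5  # assumption: ties count as chance
        count += 1
    if count == 0:
        raise ValueError("no triples given")
    return correct / count


# Hypothetical 2-D "representations" for two minimal-pair trials.
trials = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # x is near a: correct
    ([0.0, 1.0], [1.0, 0.0], [0.8, 0.2]),  # x is near b: incorrect
]
print(abx_accuracy(trials))
```

A score of 0.5 corresponds to chance, so a representation that separates the contrast should score well above that.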
☆ LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
☆ Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection CVPR 2026
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
comment: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
☆ Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
comment: 10 pages, 5 tables, 1 figure, 1 algorithm. Code: https://github.com/RightNow-AI/ouroboros
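The Ouroboros abstract above mentions gated recurrence that is "bias-initialized to 88% retention". One plausible reading, sketched below, is a sigmoid gate whose bias is set to the logit of 0.88, so that at initialization each step keeps 88% of the previous hidden state. The sigmoid form and the convex-combination update are assumptions; the abstract does not specify them, and the function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_bias_for_retention(retention: float) -> float:
    """Bias b with sigmoid(b) == retention, i.e. the logit of the retention rate."""
    if not 0.0 < retention < 1.0:
        raise ValueError("retention must lie strictly between 0 and 1")
    return math.log(retention / (1.0 - retention))


def gated_update(previous: List[float], candidate: List[float], gate_logit: float) -> List[float]:
    """Assumed update rule: h_new = g * h_prev + (1 - g) * h_candidate."""
    g = sigmoid(gate_logit)
    return [g * p + (1.0 - g) * c for p, c in zip(previous, candidate)]


bias = gate_bias_for_retention(0.88)
print(round(bias, 3))  # 1.992
print(gated_update([1.0, 0.0], [0.0, 1.0], bias))
```

Starting close to the identity in this way means early recurrence steps perturb the hidden state only mildly, which is consistent with the abstract's remark that ungated recursion makes the model worse.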
☆ Reinforcement Learning for Speculative Trading under Exploratory Framework
We study a speculative trading problem within the exploratory reinforcement learning (RL) framework of Wang et al. [2020]. The problem is formulated as a sequential optimal stopping problem over entry and exit times under a general utility function and price process. We first consider a relaxed version of the problem in which the stopping times are modeled by the jump times of Cox processes driven by bounded, non-randomized intensity controls. Under the exploratory formulation, the agent's randomized control is characterized via the probability measure over the jump intensities, and the agent's objective function is regularized by Shannon's differential entropy. This yields a system of exploratory HJB equations, with Gibbs distributions in closed form as the optimal policy. Error estimates and convergence of the RL objective to the value function of the original problem are established. Finally, an RL algorithm is designed, and its implementation is showcased in a pairs-trading application.
comment: 37 pages, 14 figures
☆ IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
comment: Accepted at Conference on Computer Vision and Pattern Recognition Workshops 2026
☆ Systematic Analyses of Reinforcement Learning Controllers in Signalized Urban Corridors
In this work, we extend our systematic capacity region perspective to multi-junction traffic networks, focussing on the special case of an urban corridor network. In particular, we train and evaluate centralized, fully decentralized, and parameter-sharing decentralized RL controllers, and compare their capacity regions and ATTs with those of a classical baseline MaxPressure controller. Further, we show how the parameter-sharing controller may be generalised to be deployed on a larger network than it was originally trained on. In this setting, we show some initial findings that suggest that even though the junctions are not formally coordinated, traffic may self-organise into 'green waves'.
☆ Feature Weighting Improves Pool-Based Sequential Active Learning for Regression
Pool-based sequential active learning for regression (ALR) optimally selects a small number of samples sequentially from a large pool of unlabeled samples to label, so that a more accurate regression model can be constructed under a given labeling budget. Representativeness and diversity, which involve computing the distances among different samples, are important considerations in ALR. However, previous ALR approaches do not incorporate the importance of different features in inter-sample distance computation, resulting in sub-optimal sample selection. This paper proposes three feature weighted single-task ALR approaches and two feature weighted multi-task ALR approaches, where the ridge regression coefficients trained from a small amount of previously labeled samples are used to weight the corresponding features in inter-sample distance computation. Experiments showed that this easy-to-implement enhancement almost always improves the performance of four existing ALR approaches, in both single-task and multi-task regression problems. The feature weighting strategy may also be easily extended to stream-based ALR, and classification algorithms.
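The core idea in the feature-weighting abstract above, using ridge regression coefficients to weight features in inter-sample distances, can be sketched as follows. The sketch assumes the coefficients have already been fitted on the labeled samples, uses their absolute values as weights, and plugs the weighted distance into a simple greedy diversity rule (pick the pool sample farthest from all labeled samples). The exact weighting form and selection rule in the paper may differ; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def weighted_distance(x: Sequence[float], y: Sequence[float],
                      ridge_coefficients: Sequence[float]) -> float:
    """Euclidean distance with each feature scaled by |ridge coefficient|."""
    weights = [abs(c) for c in ridge_coefficients]
    return math.sqrt(sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y)))


def select_most_diverse(pool: List[Sequence[float]], labeled: List[Sequence[float]],
                        ridge_coefficients: Sequence[float]) -> int:
    """Index of the pool sample whose nearest labeled sample is farthest away."""
    if not pool or not labeled:
        raise ValueError("pool and labeled set must be non-empty")

    def min_distance(candidate: Sequence[float]) -> float:
        return min(weighted_distance(candidate, z, ridge_coefficients) for z in labeled)

    return max(range(len(pool)), key=lambda i: min_distance(pool[i]))


pool = [[0.0, 0.0], [0.0, 10.0], [3.0, 0.0]]
labeled = [[0.0, 0.0]]

# With equal weights, the second feature dominates the choice.
print(select_most_diverse(pool, labeled, [1.0, 1.0]))  # 1
# If ridge regression finds the second feature irrelevant, the choice changes.
print(select_most_diverse(pool, labeled, [1.0, 0.0]))  # 2
```

The two calls show how down-weighting an uninformative feature redirects the labeling budget towards samples that differ along features the regression model actually uses.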
☆ Demographic Parity Tails for Regression
Demographic parity (DP) is a widely studied fairness criterion in regression, enforcing independence between the predictions and sensitive attributes. However, constraining the entire distribution can degrade predictive accuracy and may be unnecessary for many applications, where fairness concerns are localized to specific regions of the distribution. To overcome this issue, we propose a new framework for regression under DP that focuses on the tails of the target distribution across sensitive groups. Our methodology builds on optimal transport theory. By enforcing fairness constraints only over targeted regions of the distribution, our approach enables more nuanced and context-sensitive interventions. Building on recent advances, we develop an interpretable and flexible algorithm that leverages the geometric structure of optimal transport. We provide theoretical guarantees, including risk bounds and fairness properties, and validate the method through experiments in regression settings.
☆ Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
comment: 20 pages, 4 tables, 6 figures, appendix included
☆ Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively against vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
☆ World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
comment: Project Website: https://world-action-verifier.github.io
☆ Homogenized Transformers
We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker--Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
☆ RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
comment: 11 pages, 10 figures. To be submitted to CAMLIS 2026
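The "5x5" generation strategy in the RuleForge abstract above (five candidates, each with up to five refinement attempts) can be sketched as a simple control loop. Everything below is hypothetical scaffolding: the `generate`, `validate`, and `refine` callables stand in for the LLM generation step, the LLM-as-a-judge validation, and the feedback-driven rewrite, whose real interfaces are not described in the abstract. The loop also runs candidates sequentially, whereas the abstract describes them as parallel.

```python
from typing import Callable, List


def generate_rules(
    generate: Callable[[int], str],   # hypothetical: produce candidate number i
    validate: Callable[[str], bool],  # hypothetical: judge accepts or rejects a rule
    refine: Callable[[str], str],     # hypothetical: rewrite a rejected rule
    n_candidates: int = 5,
    max_refinements: int = 5,
) -> List[str]:
    """Return every candidate that passes validation within its refinement budget."""
    accepted: List[str] = []
    for index in range(n_candidates):
        rule = generate(index)
        # One initial check plus up to `max_refinements` refine-and-recheck rounds.
        for attempt in range(max_refinements + 1):
            if validate(rule):
                accepted.append(rule)
                break
            if attempt < max_refinements:
                rule = refine(rule)
    return accepted


# Toy stand-ins: a "rule" is a string, and each refinement appends one "+".
rules = generate_rules(
    generate=lambda i: f"rule{i}",
    validate=lambda rule: rule.count("+") >= 2,  # accept after two refinements
    refine=lambda rule: rule + "+",
)
print(rules)
```

Bounding the refinement budget keeps the worst-case number of validation calls fixed per CVE, which matters when the validator is itself an LLM call.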
☆ Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia
Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).
☆ Generalization Bounds and Statistical Guarantees for Multi-Task and Multiple Operator Learning with MNO Networks
Multiple operator learning concerns learning operator families $\{G[α]:U\to V\}_{α\in W}$ indexed by an operator descriptor $α$. Training data are collected hierarchically by sampling operator instances $α$, then input functions $u$ per instance, and finally evaluation points $x$ per input, yielding noisy observations of $G[α][u](x)$. While recent work has developed expressive multi-task and multiple operator learning architectures and approximation-theoretic scaling laws, quantitative statistical generalization guarantees remain limited. We provide a covering-number-based generalization analysis for separable models, focusing on the Multiple Neural Operator (MNO) architecture: we first derive explicit metric-entropy bounds for hypothesis classes given by linear combinations of products of deep ReLU subnetworks, and then combine these complexity bounds with approximation guarantees for MNO to obtain an explicit approximation-estimation tradeoff for the expected test error on new (unseen) triples $(α,u,x)$. The resulting bound makes the dependence on the hierarchical sampling budgets $(n_α,n_u,n_x)$ transparent and yields an explicit learning-rate statement in the operator-sampling budget $n_α$, providing a sample-complexity characterization for generalization across operator instances. The structure and architecture can also be viewed as a general purpose solver or an example of a "small" PDE foundation model, where the triples are one form of multi-modality.
☆ Learn by Surprise, Commit by Proof
We propose LSCP, a self-gated post-training framework for autonomous knowledge acquisition: learning only what a model does not already know, verified against what it does know, at a strength proportional to conviction, with no external oracle. When a passage produces anomalously high per-token loss, LSCP flags it, generates a Q&A chain that forces the model to articulate its own knowledge and identify gaps, then adjusts AdamW's $β_2$ proportionally to conviction depth k (the number of self-verification steps the passage survives) via $β_2 = 0.999 \cdot r^k$. The entire learning intensity is governed by a single parameter $r$. Beyond new knowledge, this process sharpens weakly encoded existing knowledge, which is a primary source of hallucination. The framework is self-extinguishing: as the model learns, per-token loss on learned passages decreases toward the surprisal threshold and the system progressively converges to standard AdamW. This models biological memory consolidation: temporary information in the context window is selectively consolidated into parametric weights, the model's long-term memory. Experiments on the reference model (Qwen3-14B) and across six models (8B--32B, four families) show that standard fine-tuning produces rote memorization (a perturbation gap, defined as the ratio of paraphrase to original perplexity, of 11.6 ± 0.2x baseline) while all LSCP conditions learn semantically (2.7--3.0x). The r=1.0 condition (identical optimizer, nearly identical data, only Q&A format differs) confirms that the training data format, not $β_2$ gating, is the primary mechanism preventing memorization; gating instead protects neighboring knowledge from contamination by corrupt content (93 ± 7% accuracy on adjacent questions at r=0.98 vs. 90% baseline).
comment: 24 pages, 3 figures
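The gating rule quoted in the LSCP abstract, $β_2 = 0.999 \cdot r^k$, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `gated_beta2`, the `base` argument, and the input checks (including the assumed range of $r$) are hypothetical; only the formula comes from the abstract.

```python
def gated_beta2(r: float, k: int, base: float = 0.999) -> float:
    """Second-moment coefficient for conviction depth k.

    Implements beta_2 = base * r**k as quoted in the abstract.
    The name, the `base` argument, and the range checks are assumptions.
    """
    if not 0.0 < r <= 1.0:
        raise ValueError("r is assumed to lie in (0, 1]")
    if k < 0:
        raise ValueError("k counts self-verification steps and must be >= 0")
    return base * r ** k


if __name__ == "__main__":
    # Show how beta_2 shrinks as a passage survives more verification steps.
    for depth in range(4):
        print(depth, gated_beta2(0.98, depth))
```

With r = 1.0 the value stays at 0.999 for every k, which is consistent with the abstract describing the r=1.0 condition as using an identical optimizer.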
☆ annbatch unlocks terabyte-scale training of biological data in anndata
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch
☆ PAC-Bayesian Reward-Certified Outcome Weighted Learning
Estimating optimal individualized treatment rules (ITRs) via outcome weighted learning (OWL) often relies on observed rewards that are noisy or optimistic proxies for the true latent utility. Ignoring this reward uncertainty leads to the selection of policies with inflated apparent performance, yet existing OWL frameworks lack the finite-sample guarantees required to systematically embed such uncertainty into the learning objective. To address this issue, we propose PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL). Given a one-sided uncertainty certificate, PROWL constructs a conservative reward and a strictly policy-dependent lower bound on the true expected value. Theoretically, we prove an exact certified reduction that transforms robust policy learning into a unified, split-free cost-sensitive classification task. This formulation enables the derivation of a nonasymptotic PAC-Bayes lower bound for randomized ITRs, where we establish that the optimal posterior maximizing this bound is exactly characterized by a general Bayes update. To overcome the learning-rate selection problem inherent in generalized Bayesian inference, we introduce a fully automated, bounds-based calibration procedure, coupled with a Fisher-consistent certified hinge surrogate for efficient optimization. Our experiments demonstrate that PROWL achieves improvements in estimating robust, high-value treatment regimes under severe reward uncertainty compared to standard methods for ITR estimation.
☆ Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction
Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $ρ\geq 0.82$ compared to $ρ\geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.
comment: 6 pages, 6 figures
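The $O(TF^2 + FT^2)$ figure in the abstract above can be checked by counting attention score pairs. The sketch below assumes the usual reading of factored attention: one pass attends across the $F$ frequency bins inside each of the $T$ snapshots, and one pass attends across the $T$ snapshots inside each of the $F$ bins. The function name is hypothetical and constant factors (heads, embedding width) are ignored.

```python
def attention_pair_counts(T: int, F: int) -> tuple[int, int]:
    """Return (full, factored) counts of query-key pairs on a T x F grid."""
    # Full self-attention treats all T*F grid cells as one sequence.
    full = (T * F) ** 2
    # Factored attention: F*F pairs per snapshot, plus T*T pairs per frequency bin.
    factored = T * F ** 2 + F * T ** 2
    return full, factored


if __name__ == "__main__":
    full, factored = attention_pair_counts(T=4, F=8)
    print(full, factored)
```

Since the factored count equals $TF(T+F)$ and the full count equals $(TF)^2$, the saving grows with the grid size.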
☆ A Novel Theoretical Analysis for Clustering Heteroscedastic Gaussian Data without Knowledge of the Number of Clusters
This paper addresses the problem of clustering measurement vectors that are heteroscedastic in that they can have different covariance matrices. From the assumption that the measurement vectors within a given cluster are Gaussian distributed with possibly different and unknown covariance matrices around the cluster centroid, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed-points of a certain function. As such, the approach generalizes the methodology employed to derive the existing Mean-Shift algorithm. But as a main and novel theoretical result compared to Mean-Shift, this paper shows that the sole fixed-points of the identified function tend to be the cluster centroids if both the number of measurements per cluster and the distances between centroids are large enough. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm called CENTRE-X that works by estimating the fixed-points of the identified function. Like Mean-Shift, CENTRE-X requires no prior knowledge of the number of clusters. It relies on a Wald hypothesis test to significantly reduce the number of fixed points to calculate compared to the Mean-Shift algorithm, thus resulting in a clear gain in complexity. Simulation results on synthetic and real data sets show that CENTRE-X has comparable or better performance than standard clustering algorithms K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
comment: 76 pages, submitted to JMLR
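For orientation, one standard form of the p-value behind a Wald-type kernel can be written out. This is a sketch under assumptions the abstract does not state: a measurement $y \in \mathbb{R}^d$ with known covariance $C$, a candidate centroid $\theta$, and the null hypothesis that $y$ has mean $\theta$. The paper's exact definition may differ.

$$
w(y, \theta) \;=\; 1 - F_{\chi^2_d}\!\big( (y-\theta)^\top C^{-1} (y-\theta) \big)
$$

Here $F_{\chi^2_d}$ denotes the chi-squared cumulative distribution function with $d$ degrees of freedom. Under this form the kernel equals 1 when $y = \theta$ and decays toward 0 as the Mahalanobis distance between $y$ and $\theta$ grows.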
☆ Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution
We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.
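One building block of such a projection can be illustrated in isolation. The sketch below handles only the equality constraints (classes with the same possibility degree receive equal probabilities) and assumes the projection minimizes $\mathrm{KL}(p \,\|\, q)$ over $p$ on the simplex; under that assumption the minimizer is proportional to the geometric mean of $q$ within each tied group. The function name and the closed form are a derivation for illustration, not formulas taken from the paper, and the dominance constraints and the Dykstra iteration are not covered.

```python
import math


def kl_project_equal_groups(q, groups):
    """Minimize KL(p || q) over the simplex with p constant on each group.

    `q` is a strictly positive probability vector; `groups` is a partition
    of range(len(q)) into lists of class indices that must share a value.
    """
    p = [0.0] * len(q)
    for group in groups:
        # Geometric mean of q over the tied classes.
        gm = math.exp(sum(math.log(q[i]) for i in group) / len(group))
        for i in group:
            p[i] = gm
    total = sum(p)
    return [v / total for v in p]  # renormalize onto the simplex


if __name__ == "__main__":
    print(kl_project_equal_groups([0.5, 0.3, 0.2], [[0], [1, 2]]))
```

In the example call, classes 1 and 2 are tied, so the model's 0.3 and 0.2 are pulled to a common value while class 0 keeps the largest mass.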
☆ Woosh: A Sound Effects Foundation Model
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
☆ Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
☆ The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning ICLR
Deep reinforcement learning (RL) suffers severely from plasticity loss due to non-stationarity, which impairs the ability to adapt to new data and learn continually. Unfortunately, our understanding of how plasticity loss arises, dissipates, and can be dissolved remains limited to empirical findings, leaving the theoretical end underexplored. To address this gap, we study the plasticity loss problem from the theoretical perspective of network optimization. By formally characterizing the two culprit factors in the online RL process, namely the non-stationarity of data distributions and the non-stationarity of targets induced by bootstrapping, our theory attributes the loss of plasticity to two mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the $Θ(\frac{1}{k})$ decay of gradient magnitude. The first mechanism echoes prior empirical findings from the theoretical perspective and sheds light on the effects of existing methods, e.g., network reset, neuron recycle, and noise injection. Against this backdrop, we focus primarily on the second mechanism and aim to alleviate plasticity loss by addressing the gradient attenuation issue, which is orthogonal to existing methods. We propose Sample Weight Decay, a lightweight method to restore gradient magnitude, as a general remedy to plasticity loss for deep RL methods based on experience replay. In experiments, we evaluate the efficacy of Sample Weight Decay on TD3, Double DQN and SAC with SimBa architecture in MuJoCo, ALE and DeepMind Control Suite tasks. The results demonstrate that Sample Weight Decay effectively alleviates plasticity loss and consistently improves learning performance across various configurations of deep RL algorithms, UTD, network architectures, and environments, achieving SOTA performance on challenging DMC Humanoid tasks.
comment: ICLR
☆ Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling
Artificial intelligence (AI) systems accelerate medical workflows and improve diagnostic accuracy in healthcare, serving as second-opinion systems. However, the unpredictability of AI errors poses a significant challenge, particularly in healthcare contexts, where mistakes can have severe consequences. A widely adopted safeguard is to pair predictions with uncertainty estimation, enabling human experts to focus on high-risk cases while streamlining routine verification. Current uncertainty estimation methods, however, remain limited, particularly in quantifying aleatoric uncertainty, which arises from data ambiguity and noise. To address this, we propose a novel approach that leverages disagreement in expert responses to generate targets for training machine learning models. These targets are used in conjunction with standard data labels to estimate two components of uncertainty separately, as given by the law of total variance, via a two-ensemble approach, as well as its lightweight variant. We validate our method on binary image classification, binary and multi-class image segmentation, and multiple-choice question answering. Our experiments demonstrate that incorporating expert knowledge can enhance uncertainty estimation quality by $9\%$ to $50\%$ depending on the task, making this source of information invaluable for the construction of risk-aware AI systems in healthcare applications.
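The law of total variance mentioned in the abstract above can be stated explicitly. In the form below, $Y$ is the prediction target and $\theta$ is a conditioning variable such as an ensemble member; both symbols are notation chosen here for illustration, and the abstract does not specify how each term maps onto the two ensembles.

$$
\operatorname{Var}(Y) \;=\; \mathbb{E}_{\theta}\big[\operatorname{Var}(Y \mid \theta)\big] \;+\; \operatorname{Var}_{\theta}\big(\mathbb{E}[Y \mid \theta]\big)
$$

In a common reading of this identity, the first term is associated with aleatoric uncertainty (noise that remains once the model is fixed) and the second with epistemic uncertainty (disagreement between models).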
☆ LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding
Electroencephalography (EEG) provides a non-invasive window into brain activity, offering high temporal resolution crucial for understanding and interacting with neural processes through brain-computer interfaces (BCIs). Current dual-stream neural networks for EEG often process temporal and spatial features independently through parallel branches, delaying their integration until a final, late-stage fusion. This design inherently leads to an "information silo" problem, precluding intermediate cross-stream refinement and hindering spatial-temporal decompositions essential for full feature utilization. We propose LI-DSN, a layer-wise interactive dual-stream network that facilitates progressive, cross-stream communication at each layer, thereby overcoming the limitations of late-fusion paradigms. LI-DSN introduces a novel Temporal-Spatial Integration Attention (TSIA) mechanism, which constructs a Spatial Affinity Correlation Matrix (SACM) to capture inter-electrode spatial structural relationships and a Temporal Channel Aggregation Matrix (TCAM) to integrate cosine-gated temporal dynamics under spatial guidance. Furthermore, we employ an adaptive fusion strategy with learnable channel weights to optimize the integration of dual-stream features. Extensive experiments across eight diverse EEG datasets, encompassing motor imagery (MI) classification, emotion recognition, and steady-state visual evoked potentials (SSVEP), consistently demonstrate that LI-DSN significantly outperforms 13 state-of-the-art (SOTA) baseline models, showcasing its superior robustness and decoding performance. The code will be publicized after acceptance.
☆ DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)
Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
comment: 30 pages, 5 figures. Submitted to Neural Networks (Elsevier)
☆ Robust Graph Representation Learning via Adaptive Spectral Contrast
Spectral graph contrastive learning has emerged as a unified paradigm for handling both homophilic and heterophilic graphs by leveraging high-frequency components. However, we identify a fundamental spectral dilemma: while high-frequency signals are indispensable for encoding heterophily, our theoretical analysis proves they exhibit significantly higher variance under spectrally concentrated perturbations. We derive a regret lower bound showing that existing global (node-agnostic) spectral fusion is provably sub-optimal: on mixed graphs with separated node-wise frequency preferences, any global fusion strategy incurs non-vanishing regret relative to a node-wise oracle. To escape this bound, we propose ASPECT, a framework that resolves this dilemma through a reliability-aware spectral gating mechanism. Formulated as a minimax game, ASPECT employs a node-wise gate that dynamically re-weights frequency channels based on their stability against a purpose-built adversary, which explicitly targets spectral energy distributions via a Rayleigh quotient penalty. This design forces the encoder to learn representations that are both structurally discriminative and spectrally robust. Empirical results show that ASPECT achieves new state-of-the-art performance on 8 out of 9 benchmarks, effectively decoupling meaningful structural heterophily from incidental noise.
☆ Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
In modern process industries, data-driven models are important tools for real-time monitoring when key performance indicators are difficult to measure directly. While accurate predictions are essential, reliable uncertainty quantification (UQ) is equally critical for safety, reliability, and decision-making, but remains a major challenge in current data-driven approaches. In this work, we introduce a diffusion-based posterior sampling framework that inherently produces well-calibrated predictive uncertainty via faithful posterior sampling, eliminating the need for post-hoc calibration. In extensive evaluations on synthetic distributions, the Raman-based phenylacetic acid soft sensor benchmark, and a real ammonia synthesis case study, our method achieves practical improvements over existing UQ techniques in both uncertainty calibration and predictive accuracy. These results highlight diffusion samplers as a principled and scalable paradigm for advancing uncertainty-aware modeling in industrial applications.
comment: This manuscript has been accepted for publication in IEEE Transactions on Industrial Informatics. Copyright has been transferred to IEEE. Reuse of this material is subject to IEEE copyright restrictions
☆ CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift AAAI 2026
Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC by up to 14% while using fewer adaptation samples.
comment: AAAI 2026
☆ Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images ICPR 2026
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
comment: 15 pages plus references; 5 figures; supplementary appended; accepted to ICPR 2026
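The contrast between nearest-neighbour quantization and a matching-based assignment can be shown on a toy case. The sketch below assumes that "optimal bipartite matching" means a minimum-cost one-to-one assignment of latent vectors to distinct codebook entries; that reading, the function names, and the brute-force search over permutations are illustrative choices and not the paper's algorithm (a practical version would use a dedicated assignment solver).

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour_codes(latents, codebook):
    """Each latent independently picks its closest code; codes may repeat."""
    return [min(range(len(codebook)), key=lambda j: sq_dist(z, codebook[j]))
            for z in latents]


def matching_codes(latents, codebook):
    """Minimum-cost one-to-one assignment of latents to distinct codes.

    Brute force over permutations, so only suitable for tiny examples.
    Requires len(latents) <= len(codebook).
    """
    best = min(
        permutations(range(len(codebook)), len(latents)),
        key=lambda perm: sum(sq_dist(z, codebook[j]) for z, j in zip(latents, perm)),
    )
    return list(best)


if __name__ == "__main__":
    latents = [[0.0], [0.1]]
    codebook = [[0.0], [1.0]]
    print(nearest_neighbour_codes(latents, codebook))
    print(matching_codes(latents, codebook))
```

In this toy case nearest-neighbour quantization maps both latents to the same code, while the matching uses both codes, which gives one intuition for why a matching could raise effective bottleneck capacity.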
☆ Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
☆ Physics Informed Reinforcement Learning with Gibbs Priors for Topology Control in Power Grids
Topology control for power grid operation is a challenging sequential decision-making problem because the action space grows combinatorially with the size of the grid and action evaluation through simulation is computationally expensive. We propose a physics-informed reinforcement learning framework that combines semi-Markov control with a Gibbs prior over the action space that encodes the system's physics. A decision is taken only when the grid enters a hazardous regime, while a graph neural network surrogate predicts the post-action overload risk of feasible topology actions. These predictions are used to construct a physics-informed Gibbs prior that both selects a small state-dependent candidate set and reweights policy logits before action selection. In this way, our method reduces exploration difficulty and online simulation cost while preserving the flexibility of a learned policy. We evaluate the approach in three realistic benchmark environments of increasing difficulty. Across all settings, the proposed method achieves a strong balance between control quality and computational efficiency: it matches oracle-level performance while being approximately $6\times$ faster on the first benchmark, reaches $94.6\%$ of oracle reward with roughly $200\times$ lower decision time on the second one, and on the most challenging benchmark improves over a PPO baseline by up to $255\%$ in reward and $284\%$ in survived steps while remaining about $2.5\times$ faster than a strong specialized engineering baseline. These results show that our method provides an effective mechanism for topology control in power grids.
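The two roles the abstract gives to the Gibbs prior, selecting a candidate set and reweighting policy logits, can be sketched as follows. This is an illustration under assumptions: the prior is taken to be proportional to $\exp(-\beta \cdot \text{risk})$, the candidate set is taken to be the $k$ lowest-risk actions, and the reweighting is taken to be additive in log space. The function name, `beta`, and `k` are hypothetical and not taken from the paper.

```python
import math


def gibbs_reweighted_policy(logits, risks, beta=1.0, k=2):
    """Return (candidate action indices, action probabilities).

    Assumed Gibbs prior: prior(a) proportional to exp(-beta * risks[a]).
    """
    # Candidate set: the k actions with the highest prior (lowest predicted risk).
    order = sorted(range(len(risks)), key=lambda a: risks[a])
    candidates = order[:k]
    # Adding the log-prior to the logits multiplies the policy by the prior.
    scores = [logits[a] - beta * risks[a] for a in candidates]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return candidates, [w / total for w in weights]


if __name__ == "__main__":
    print(gibbs_reweighted_policy([0.0, 0.0, 0.0], [0.1, 0.5, 0.9], beta=2.0, k=2))
```

With uniform logits, the resulting distribution is driven entirely by the predicted risks, so the lowest-risk candidate receives the largest probability.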
☆ Graph Neural Operator Towards Edge Deployability and Portability for Sparse-to-Dense, Real-Time Virtual Sensing on Irregular Grids
Accurate sensing of spatially distributed physical fields typically requires dense instrumentation, which is often infeasible in real-world systems due to cost, accessibility, and environmental constraints. Physics-based solvers address this through direct numerical integration of governing equations, but their computational latency and power requirements preclude real-time use in resource-constrained monitoring and control systems. Here we introduce VIRSO (Virtual Irregular Real-Time Sparse Operator), a graph-based neural operator for sparse-to-dense reconstruction on irregular geometries, and a variable-connectivity algorithm, Variable KNN (V-KNN), for mesh-informed graph construction. Unlike prior neural operators that treat hardware deployability as secondary, VIRSO reframes inference as measurement: the combination of both spectral and spatial analysis provides accurate reconstruction without the high latency and power consumption of previous graph-based methodologies with poor scalability, presenting VIRSO as a potential candidate for edge-constrained, real-time virtual sensing. We evaluate VIRSO on three nuclear thermal-hydraulic benchmarks of increasing geometric and multiphysics complexity, across reconstruction ratios from 47:1 to 156:1. VIRSO achieves mean relative $L_2$ errors below 1%, outperforming other benchmark operators while using fewer parameters. The full 10-layer configuration reduces the energy-delay product (EDP) from ${\approx}206$ J$\cdot$ms for the graph operator baseline to $10.1$ J$\cdot$ms on an NVIDIA H200. Implemented on an NVIDIA Jetson Orin Nano, all configurations of VIRSO provide sub-10 W power consumption and sub-second latency. These results establish the edge-feasibility and hardware-portability of VIRSO and present compute-aware operator learning as a new paradigm for real-time sensing in inaccessible and resource-constrained environments.
comment: 34 pages, 5 figures, 16 tables
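As a quick check on the energy-delay figures quoted in the abstract above: EDP is conventionally the product of energy per inference and latency, so the two reported values imply roughly a 20x reduction. The sketch below only redoes that arithmetic. The function name is illustrative, and no energy/latency split is assumed, since the abstract reports only the products.

```python
def energy_delay_product(energy_joules: float, latency_ms: float) -> float:
    """EDP in J*ms: energy per inference multiplied by latency."""
    return energy_joules * latency_ms

# Values quoted in the abstract (J*ms, measured on an NVIDIA H200).
baseline_edp = 206.0  # graph operator baseline (reported as approximate)
virso_edp = 10.1      # full 10-layer VIRSO configuration

# Ratio of the two reported products.
reduction = baseline_edp / virso_edp
print(f"EDP reduction: {reduction:.1f}x")
```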
☆ Learning in Prophet Inequalities with Noisy Observations ICLR 2026
We study the prophet inequality, a fundamental problem in online decision-making and optimal stopping, in a practical setting where rewards are observed only through noisy realizations and reward distributions are unknown. At each stage, the decision-maker receives a noisy reward whose true value follows a linear model with an unknown latent parameter, and observes a feature vector drawn from a distribution. To address this challenge, we propose algorithms that integrate learning and decision-making via lower-confidence-bound (LCB) thresholding. In the i.i.d.\ setting, we establish that both an Explore-then-Decide strategy and an $\varepsilon$-Greedy variant achieve the sharp competitive ratio of $1 - 1/e$, under a mild condition on the optimal value. For non-identical distributions, we show that a competitive ratio of $1/2$ can be guaranteed against a relaxed benchmark. Moreover, with limited window access to past rewards, the tight ratio of $1/2$ against the optimal benchmark is achieved.
comment: ICLR 2026
☆ Bridging Deep Learning and Integer Linear Programming: A Predictive-to-Prescriptive Framework for Supply Chain Analytics
Although demand forecasting is a critical component of supply chain planning, actual retail data can exhibit irreconcilable seasonality, irregular spikes, and noise, rendering precise projections nearly unattainable. This paper proposes a three-step analytical framework that combines forecasting and operational analytics. The first stage consists of exploratory data analysis, where delivery-tracked data from 180,519 transactions are partitioned, and long-term trends, seasonality, and delivery-related attributes are examined. In the second stage, the forecasting performance of a statistical time series decomposition model, MSTL, is compared with that of two recent deep learning architectures, N-BEATS and N-HiTS. Both N-BEATS and N-HiTS outperformed the statistical benchmark by a large margin. N-BEATS was selected as the best model because it had the lowest forecasting error. In the third and final stage, its forecast for the next 4 weeks (1,918 units) is passed as input to a deterministic integer linear program that minimizes total delivery time subject to budget, capacity, and service constraints. The resulting allocation provided a feasible and cost-optimal shipping plan. Overall, the study provides a compelling example of the practical impact of precise forecasting and simple, highly interpretable model optimization in logistics.
comment: 12 pages, 4 figures, 4 tables
☆ Dual-Attention Based 3D Channel Estimation
For multi-input and multi-output (MIMO) channels, the optimal channel estimation (CE) based on linear minimum mean square error (LMMSE) requires three-dimensional (3D) filtering. However, the complexity is often prohibitive due to large matrix dimensions. Suboptimal estimators approximate 3DCE by decomposing it into time, frequency, and spatial domains, which yields noticeable performance degradation under correlated MIMO channels. On the other hand, recent advances in deep learning (DL) can exploit channel correlations in all domains via attention mechanisms. Building on this capability, we propose a dual-attention-based 3DCE network (3DCENet) that can achieve accurate estimates.
comment: 5 pages, 6 figures
☆ FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
comment: The first two authors contributed equally to this work; listing order is random
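The claim that conjugate-symmetric complex coefficients reconstruct into real-valued spatial weights is a standard DFT property, and a small numerical check may help make it concrete. The sketch below is not the FourierMoE implementation: the helper name `hermitian_symmetrize`, the 4x6 shape, and the random coefficients are all assumptions chosen for illustration.

```python
import numpy as np

def hermitian_symmetrize(coeffs: np.ndarray) -> np.ndarray:
    """Project complex 2-D coefficients onto the conjugate-symmetric set,
    i.e. S[k1, k2] == conj(S[-k1 mod N, -k2 mod M])."""
    # Reversing both axes and rolling by one maps index k to -k (mod size).
    negated = np.roll(coeffs[::-1, ::-1], shift=(1, 1), axis=(0, 1))
    return 0.5 * (coeffs + np.conj(negated))

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 6)) + 1j * rng.normal(size=(4, 6))  # illustrative coefficients
spectrum = hermitian_symmetrize(raw)

weights = np.fft.ifft2(spectrum)              # inverse DFT back to the spatial domain
max_imag = float(np.abs(weights.imag).max())  # should be at floating-point noise level
delta_w = weights.real                        # real-valued weight update

# Round trip: the forward DFT of the real weights recovers the spectrum,
# so no amplitude or phase information is lost.
round_trip_err = float(np.abs(np.fft.fft2(delta_w) - spectrum).max())
print(max_imag, round_trip_err)
```

In practice one would typically store only half of such a spectrum (as `np.fft.rfft2` does), since the other half is determined by symmetry.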
☆ LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
comment: Project page: https://livemathematicianbench.github.io/
☆ DDCL: Deep Dual Competitive Learning: A Differentiable End-to-End Framework for Unsupervised Prototype-Based Representation Learning
A persistent structural weakness in deep clustering is the disconnect between feature learning and cluster assignment. Most architectures invoke an external clustering step, typically k-means, to produce pseudo-labels that guide training, preventing the backbone from directly optimising for cluster quality. This paper introduces Deep Dual Competitive Learning (DDCL), the first fully differentiable end-to-end framework for unsupervised prototype-based representation learning. The core contribution is architectural: the external k-means is replaced by an internal Dual Competitive Layer (DCL) that generates prototypes as native differentiable outputs of the network. This single inversion makes the complete pipeline, from backbone feature extraction through prototype generation to soft cluster assignment, trainable by backpropagation through a single unified loss, with no Lloyd iterations, no pseudo-label discretisation, and no external clustering step. To ground the framework theoretically, the paper derives an exact algebraic decomposition of the soft quantisation loss into a simplex-constrained reconstruction error and a non-negative weighted prototype variance term. This identity reveals a self-regulating mechanism built into the loss geometry: the gradient of the variance term acts as an implicit separation force that resists prototype collapse without any auxiliary objective, and leads to a global Lyapunov stability theorem for the reduced frozen-encoder system. Six blocks of controlled experiments validate each structural prediction. The decomposition identity holds with zero violations across more than one hundred thousand training epochs; the negative feedback cycle is confirmed with Pearson -0.98; with a jointly trained backbone, DDCL outperforms its non-differentiable ablation by 65% in clustering accuracy and DeepCluster end-to-end by 122%.
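The decomposition described in the abstract above has the same shape as a standard bias-variance identity, which can be checked numerically. The sketch below assumes the soft quantisation loss is the assignment-weighted squared distance to the prototypes, with assignments on the probability simplex; the paper's exact loss may differ, and the softmax assignment, the sample `x`, and the three prototypes are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def softmax(scores):
    """Numerically stable softmax; the output lies on the simplex."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -1.0, 2.0]                                    # one embedded sample (made up)
prototypes = [[0.0, 0.0, 1.0], [1.0, -2.0, 3.0], [-1.0, 0.5, 2.5]]
q = softmax([-sq_dist(x, p) for p in prototypes])       # soft assignments

# Left-hand side: assignment-weighted squared distance to each prototype.
soft_loss = sum(qk * sq_dist(x, p) for qk, p in zip(q, prototypes))

# Right-hand side: reconstruction error plus weighted prototype variance.
mean_proto = [sum(qk * p[d] for qk, p in zip(q, prototypes)) for d in range(len(x))]
reconstruction = sq_dist(x, mean_proto)
variance = sum(qk * sq_dist(p, mean_proto) for qk, p in zip(q, prototypes))

print(soft_loss, reconstruction + variance)
```

The variance term is a weighted sum of squared distances, so it is non-negative by construction, and it vanishes only when all prototypes with non-zero weight coincide.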
☆ Koopman-Based Nonlinear Identification and Adaptive Control of a Turbofan Engine
This paper investigates Koopman operator-based approaches for multivariable control of a two-spool turbofan engine. A physics-based component-level model is developed to generate training data and validate the controllers. A meta-heuristic extended dynamic mode decomposition is developed, with a cost function designed to accurately capture both spool-speed dynamics and the engine pressure ratio (EPR), enabling the construction of a single Koopman model suitable for multiple control objectives. Using the identified time-varying Koopman model, two controllers are developed: an adaptive Koopman-based model predictive controller (AKMPC) with a disturbance observer and a Koopman-based feedback linearization controller (K-FBLC), which serves as a benchmark. The controllers are evaluated for two control strategies, namely configurations of spool speeds and EPR, under both sea-level and varying flight conditions. The results demonstrate that the proposed identification approach enables accurate predictions of both spool speeds and EPR, allowing the Koopman model to be reused flexibly across different control formulations. While both control strategies achieve comparable performance in steady conditions, the AKMPC exhibits superior robustness compared with the K-FBLC under varying flight conditions due to its ability to compensate for model mismatch. Moreover, the EPR control strategy improves the thrust response. The study highlights the applicability of Koopman-based control and demonstrates the advantages of the AKMPC-based framework for robust turbofan engine control.
comment: 21 pages, 23 figures
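For readers unfamiliar with extended dynamic mode decomposition, the core step is a least-squares fit of a linear operator on lifted snapshots. The sketch below shows plain EDMD, not the meta-heuristic variant described in the abstract above, on a toy two-state system rather than an engine model. The dictionary `[x1, x2, x1^2]`, the `step` dynamics, and the coefficients `a`, `b`, `c` are assumptions picked so that the lifted dynamics are exactly linear.

```python
import numpy as np

def lift(states: np.ndarray) -> np.ndarray:
    """Hypothetical dictionary of observables: [x1, x2, x1^2]."""
    x1, x2 = states
    return np.vstack([x1, x2, x1 ** 2])

def step(states: np.ndarray, a: float = 0.9, b: float = 0.5, c: float = 0.3) -> np.ndarray:
    """Toy nonlinear map: x1' = a*x1, x2' = b*x2 + c*x1^2."""
    x1, x2 = states
    return np.vstack([a * x1, b * x2 + c * x1 ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2, 200))  # snapshot states, one column per sample
Y = step(X)                                # successor states

# EDMD: least-squares operator K with lift(Y) ~= K @ lift(X).
K = lift(Y) @ np.linalg.pinv(lift(X))

# For this toy system the lifted dynamics are exactly linear.
K_expected = np.array([[0.9, 0.0, 0.0],
                       [0.0, 0.5, 0.3],
                       [0.0, 0.0, 0.81]])
print(np.round(K, 6))
```

Real systems rarely admit a finite exactly-invariant dictionary, which is why the choice of observables and the fitting cost function matter in practice.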
☆ MATA-Former & SIICU: Semantic Aware Temporal Alignment for High-Fidelity ICU Risk Prediction
Forecasting evolving clinical risks relies on intrinsic pathological dependencies rather than mere chronological proximity, yet current methods struggle with coarse binary supervision and physical timestamps. To align predictive modeling with clinical logic, we propose the Medical-semantics Aware Time-ALiBi Transformer (MATA-Former), utilizing event semantics to dynamically parameterize attention weights to prioritize causal validity over time lags. Furthermore, we introduce Plateau-Gaussian Soft Labeling (PSL), reformulating binary classification into continuous multi-horizon regression for full-trajectory risk modeling. Evaluated on SIICU -- a newly constructed dataset featuring over 506k events with rigorous expert-verified, fine-grained annotations -- and the MIMIC-IV dataset, our framework demonstrates superior efficacy and robust generalization in capturing risks from text-intensive, irregular clinical time series.
☆ LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor x which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.
Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms approaches for response forecasting as no assumption on wind stationarity or on structural normal vibration behavior is needed. Specifically, wind-excited dynamic behavior suffers from uncertainty related to obtaining poor predictions when the environmental or traffic conditions change. This results in a hard distinction of what constitutes normal vibration behavior. To this end, a framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures accurate structural behavior in realistic conditions, and with respect to the changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.
comment: 21 pages, 22 figures, 9 tables. This version corresponds to the published article in Computers & Structures. https://doi.org/10.1016/j.compstruc.2026.108216
♻ ☆ LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/.
♻ ☆ Intervening to Learn and Compose Causally Disentangled Representations
In designing generative models, it is commonly believed that in order to learn useful latent structure, we face a fundamental tension between expressivity and structure. In this paper we challenge this view by proposing a new approach to training arbitrarily expressive generative models that simultaneously learn causally disentangled concepts. This is accomplished by adding a simple context module to an arbitrarily complex black-box model, which learns to process concept information by implicitly inverting linear representations from the model's encoder. Inspired by the notion of intervention in a causal model, our module selectively modifies its architecture during training, allowing it to learn a compact joint model over different contexts. We show how adding this module leads to causally disentangled representations that can be composed for out-of-distribution generation on both real and simulated data. The resulting models can be trained end-to-end or fine-tuned from pre-trained models. To further validate our proposed approach, we prove a new identifiability result that extends existing work on identifying structured representations.
comment: 45 pages, 10 figures; accepted to the 5th conference on Causal Learning and Reasoning (CLeaR)
♻ ☆ Transformers Can Solve Non-Linear and Non-Markovian Filtering Problems in Continuous Time For Conditionally Gaussian Signals
The use of attention-based deep learning models in stochastic filtering, e.g. transformers and deep Kalman filters, has recently come into focus; however, the potential for these models to solve stochastic filtering problems remains largely unknown. The paper provides an affirmative answer to this open problem in the theoretical foundations of machine learning by showing that a class of continuous-time transformer models, called \textit{filterformers}, can approximately implement the conditional law of a broad class of non-Markovian and conditionally Gaussian signal processes given noisy continuous-time (possibly non-Gaussian) measurements. Our approximation guarantees hold uniformly over sufficiently regular compact subsets of continuous-time paths, where the worst-case 2-Wasserstein distance between the true optimal filter and our deep learning model quantifies the approximation error. Our construction relies on two new customizations of the standard attention mechanism. The first can losslessly adapt to the characteristics of a broad range of paths, since we show that the attention mechanism implements bi-Lipschitz embeddings of sufficiently regular sets of paths into low-dimensional Euclidean spaces; thus, it incurs no ``dimension reduction error''. The second is tailored to the geometry of Gaussian measures in the $2$-Wasserstein space. Our analysis relies on new stability estimates of robust optimal filters in the conditionally Gaussian setting.
♻ ☆ Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
comment: Visualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/
♻ ☆ Hybrid Hidden Markov Model for Modeling Equity Excess Growth Rate Dynamics: A Discrete-State Approach with Jump-Diffusion
Generating synthetic financial time series that preserve the statistical properties of real market data is essential for stress testing, risk model validation, and scenario design. Existing approaches struggle to simultaneously reproduce heavy-tailed distributions, negligible linear autocorrelation, and persistent volatility clustering. We developed a hybrid hidden Markov framework that discretized excess growth rates into Laplace quantile-defined states and augmented regime switching with a Poisson jump-duration mechanism to enforce realistic tail-state dwell times. Parameters were estimated by direct transition counting, bypassing the Baum-Welch EM algorithm and scaling to a 424-asset pipeline. Applied to ten years of daily equity data, the framework achieved high distributional pass rates both in-sample and out-of-sample while partially reproducing the volatility clustering that standard regime-switching models miss. No single model was best at everything: GARCH(1,1) better reproduced volatility clustering but failed distributional tests, while the standard HMM without jumps passed more distributional tests but could not generate volatility clustering. The proposed framework delivered the most balanced performance overall. For multi-asset generation, copula-based dependence models that preserved each asset's marginal HMM distribution substantially outperformed a Single-Index Model factor baseline on both per-asset distributional accuracy and correlation reproduction.
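The estimation step described in the abstract above (discretize with Laplace quantiles, then count transitions instead of running Baum-Welch) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the four-state grid at the 25/50/75% quantiles, the toy `returns` values, and all function names are hypothetical, and the Poisson jump-duration mechanism is omitted entirely.

```python
import math
from bisect import bisect_right
from statistics import median

def laplace_quantile(p, mu, b):
    """Quantile function of a Laplace(mu, b) distribution, for 0 < p < 1."""
    return mu - b * math.copysign(1.0, p - 0.5) * math.log(1.0 - 2.0 * abs(p - 0.5))

def discretize(values, edges):
    """Map each value to the index of the quantile bin it falls in."""
    return [bisect_right(edges, v) for v in values]

def transition_matrix(states, n_states):
    """Row-normalized transition counts between consecutive states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Unvisited states get a uniform row so every row is a distribution.
        matrix.append([c / total if total else 1.0 / n_states for c in row])
    return matrix

# Made-up numbers for illustration only, not market data.
returns = [0.1, -0.2, 0.05, 1.5, -1.2, 0.0, 0.3, -0.4, 0.2, -0.1]
mu = median(returns)                                   # Laplace location (MLE)
b = sum(abs(r - mu) for r in returns) / len(returns)   # Laplace scale (MLE)
edges = [laplace_quantile(p, mu, b) for p in (0.25, 0.5, 0.75)]

states = discretize(returns, edges)
P = transition_matrix(states, n_states=4)
for row in P:
    print([round(v, 2) for v in row])
```

Counting needs a single pass over the data per asset, which is presumably what lets this kind of estimator scale to hundreds of assets.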
♻ ☆ Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational resources, particularly under increasing energy and infrastructure constraints. GPUs have emerged as essential for accelerating such workloads. This study benchmarks four deep learning models (Conv6, VGG16, ResNet18, CycleGAN) using TensorFlow and PyTorch on Intel Xeon CPUs and NVIDIA Tesla T4 GPUs. Our experiments demonstrate that, on average, GPU training achieves speedups ranging from 11x to 246x depending on model complexity, with lightweight models (Conv6) showing the highest acceleration (246x), mid-sized models (VGG16, ResNet18) achieving 51-116x speedups, and complex generative models (CycleGAN) reaching 11x improvements compared to CPU training. Additionally, in our PyTorch vs. TensorFlow comparison, we observed that TensorFlow's kernel-fusion optimizations reduce inference latency by approximately 15%. We also analyze GPU memory usage trends and project requirements through 2025 using polynomial regression. Our findings highlight that while GPUs are essential for sustaining AI's growth, democratized and shared access to GPU resources is critical for enabling research innovation across institutions with limited computational budgets.
♻ ☆ Moonwalk: Inverse-Forward Differentiation
Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
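To make the idea of inverting gradient flow concrete, consider a single linear layer. The sketch below is not the Moonwalk implementation: it assumes that a trivial cokernel means the layer Jacobian has full row rank, and it realizes a vector-inverse-Jacobian product with a least-squares solve. The function name `vijp`, the shapes, and the random data are all illustrative.

```python
import numpy as np

def vijp(jacobian: np.ndarray, input_cotangent: np.ndarray) -> np.ndarray:
    """Recover the output cotangent g_y from g_x = J^T g_y.

    Assumes J has full row rank, so J^T is injective and the
    least-squares solution is the exact pre-image.
    """
    solution, *_ = np.linalg.lstsq(jacobian.T, input_cotangent, rcond=None)
    return solution

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # linear layer y = W @ x; its Jacobian w.r.t. x is W
x = rng.normal(size=5)        # layer input
g_y = rng.normal(size=3)      # cotangent arriving at the layer output

g_x = W.T @ g_y               # ordinary reverse-mode step (vector-Jacobian product)

# Forward sweep: start from the input cotangent and push it through the layer.
g_y_recovered = vijp(W, g_x)
grad_W = np.outer(g_y_recovered, x)   # parameter gradient dL/dW = g_y x^T

print(np.allclose(g_y_recovered, g_y))
```

If the layer mapped 5 inputs to 7 outputs instead, `W.T` would have a non-trivial null space and `g_y` could not be recovered from `g_x` alone; that is the case the abstract handles by checkpointing a minimal subset of residuals.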
♻ ☆ NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.
♻ ☆ Property-Level Flood Risk Assessment Using AI-Enabled Street-View Lowest Floor Elevation Extraction and ML Imputation Across Texas
This paper argues that AI-enabled analysis of street-view imagery, complemented by performance-gated machine-learning imputation, provides a viable pathway for generating building-specific elevation data at regional scale for flood risk assessment. We develop and apply a three-stage pipeline across 18 areas of interest (AOIs) in Texas that (1) extracts lowest floor elevation (LFE) and the height difference between street grade and the lowest floor (HDSL) from Google Street View imagery using the Elev-Vision framework, (2) imputes missing HDSL values with Random Forest and Gradient Boosting models trained on 16 terrain, hydrologic, geographic, and flood-exposure features, and (3) integrates the resulting elevation dataset with Fathom 1-in-100 year inundation surfaces and USACE depth-damage functions to estimate property-specific interior flood depth and expected loss. Across 12,241 residential structures, street-view imagery was available for 73.4% of parcels and direct LFE/HDSL extraction was successful for 49.0% (5,992 structures). Imputation was retained for 13 AOIs where cross-validated performance was defensible, with selected models achieving $R^2$ values from 0.159 to 0.974; five AOIs were explicitly excluded from prediction because performance was insufficient. The results show that street-view-based elevation mapping is not universally available for every property, but it is sufficiently scalable to materially improve regional flood-risk characterization by moving beyond hazard exposure to structure-level estimates of interior inundation and expected damage. Scientifically, the study advances LFE estimation from a pilot-scale proof of concept to a regional, end-to-end workflow. Practically, it offers a replicable framework for jurisdictions that lack comprehensive Elevation Certificates but need parcel-level information to support mitigation, planning, and flood-risk management.
♻ ☆ BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Interpreting gene clusters from RNA sequencing (RNA-seq) remains challenging, especially in antimicrobial resistance studies where mechanistic insight is important for hypothesis generation. Existing pathway enrichment methods can summarize co-expressed modules, but they often provide limited cluster-specific explanations and weak connections to supporting literature. We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules. BIOGEN combines biomedical retrieval, structured reasoning, and multi-critic verification to generate traceable cluster-level explanations with explicit evidence and confidence labels. On a primary Salmonella enterica dataset, BIOGEN achieved strong biological grounding, including BERTScore 0.689, Semantic Alignment Score 0.715, KEGG Functional Similarity 0.342, and a hallucination rate of 0.000, compared with 0.100 for an LLM-only baseline. Across four additional bacterial RNA-seq datasets, BIOGEN also maintained zero hallucination under the same fixed pipeline. In comparisons with representative open-source agentic AI baselines, BIOGEN was the only framework that consistently preserved zero hallucination across all five datasets. These findings suggest that retrieval alone is not enough for reliable biological interpretation, and that evidence-grounded orchestration is important for transparent and source-traceable transcriptomic reasoning.
♻ ☆ Generalized Machine Learning for Fast Calibration of Agent-Based Epidemic Models
Agent-based models (ABMs) are widely used to study infectious disease dynamics, but their calibration is often computationally intensive, limiting their applicability in time-sensitive public health settings. We propose DeepIMC (Deep Inverse Mapping Calibration), a machine learning-based calibration framework that directly learns the inverse mapping from epidemic time series to epidemiological parameters. DeepIMC trains a bidirectional Long Short-Term Memory (BiLSTM) neural network on synthetic epidemic trajectories generated from agent-based models such as the Susceptible-Infected-Recovered (SIR) model, enabling rapid parameter estimation without repeated simulation at inference time. We evaluate DeepIMC through an extensive simulation study comprising 5,000 heterogeneous epidemic scenarios and benchmark its performance against Approximate Bayesian Computation (ABC) using likelihood-free Markov Chain Monte Carlo. The results show that DeepIMC substantially improves parameter recovery accuracy, produces sharp and well-calibrated predictive intervals, and reduces computational time by more than an order of magnitude relative to ABC. Although structural parameter identifiability constraints limit the precise recovery of all model parameters simultaneously, the calibrated models reliably reproduce epidemic trajectories and support accurate forward prediction with their estimated parameters. DeepIMC is implemented in the open-source R package epiworldRCalibrate, facilitating practical adoption for real-time epidemic modeling and policy analysis. Overall, our findings demonstrate that DeepIMC provides a scalable, operationally effective alternative to traditional simulation-based calibration methods for agent-based epidemic models.
♻ ☆ One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on large volumes of high-quality samples. In this paper, we challenge conventional assumptions about data requirements in RL for LLMs by demonstrating the effectiveness of one-shot reinforcement learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary reasoning improvement. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) Analysis of salient mathematical skills provides insight into the characteristics associated with effective polymath samples; and (3) An engineered synthetic sample that integrates multidisciplinary elements and broader skill coverage achieves stronger performance than naturally occurring individual samples. Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of samples that complements simply increasing data volume.
♻ ☆ Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an $\varepsilon$-stationary point at a rate of $O(\varepsilon^{-4})$, matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
♻ ☆ Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization
Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver of B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problem, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.
♻ ☆ The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery
Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.
♻ ☆ Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
♻ ☆ Zeroth-order Logconcave Sampling
We study the zeroth-order query complexity of sampling from a general logconcave distribution: given access to an evaluation oracle for a convex function $V:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{\infty\}$, output a point from a distribution within $\varepsilon$-distance to the density proportional to $e^{-V}$. A long line of work provides efficient algorithms for this problem in TV distance, assuming a pointwise warm start (i.e., in $\infty$-Rényi divergence), and using annealing to generate such a warm start. Here, we address the natural and more general problem of using a $q$-Rényi divergence warm start to generate a sample that is $\varepsilon$-close in $q$-Rényi divergence. Our first main result is an algorithm with this end-to-end guarantee with state-of-the-art complexity for $q=\widetilde{\Omega}(1)$. Our second result shows how to generate a $q$-Rényi divergence warm start directly via annealing, by maintaining $q$-Rényi divergence throughout, thereby obtaining a streamlined analysis and improved complexity. Such results were previously known only under the stronger assumptions of smoothness and access to first-order oracles. We also show a lower bound for Gaussian annealing by disproving a geometric conjecture about quadratic tilts of isotropic logconcave distributions. Central to our approach, we establish hypercontractivity of the heat adjoint and translate this into improved mixing time guarantees for the Proximal Sampler. The resulting analysis of both sampling and annealing follows a simplified and natural path, directly tying convergence rates to isoperimetric constants of the target distribution.
comment: v2: Fix a bug in the restart mechanism; add a lower bound on Gaussian annealing
♻ ☆ Olaf: Bringing an Animated Character to Life in the Physical World
Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
♻ ☆ Multi-Timescale Primal Dual Hybrid Gradient with Application to Distributed Optimization
We propose two variants of the Primal Dual Hybrid Gradient (PDHG) algorithm for saddle point problems with block decomposable duals, hereafter called Multi-Timescale PDHG (MT-PDHG) and its accelerated variant (AMT-PDHG). Through novel mixtures of Bregman divergence and multi-timescale extrapolations, our MT-PDHG and AMT-PDHG converge under arbitrary updating rates for different dual blocks while remaining fully deterministic and robust to extreme delays in dual updates. We further apply our (A)MT-PDHG, augmented with the gradient sliding techniques introduced in Lan et al. (2020), Lan (2016), to distributed optimization. The flexibility in choosing different updating rates for different blocks allows more refined control over the communication rounds between different pairs of agents, thereby improving efficiency in settings with heterogeneity in local objectives and communication costs. Moreover, with careful choices of penalty levels, our algorithms show linear and thus optimal dependency on function similarities, a measure of how similar the gradients of local objectives are. This provides a positive answer to the open question of whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).
♻ ☆ Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Notably, it enables the first successful jailbreak of Gemini Diffusion to our knowledge, exposing a critical vulnerability in proprietary D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs.
♻ ☆ Fragility-aware Classification for Understanding Risk and Improving Generalization
Classification models play a central role in data-driven decision-making applications such as medical diagnosis, recommendation systems, and risk assessment. Traditional performance metrics, such as accuracy and AUC, focus on overall error rates but fail to account for the confidence of incorrect predictions, i.e., the risk of confident misjudgments. This limitation is particularly consequential in safety-critical and cost-sensitive settings, where overconfident errors can lead to severe outcomes. To address this issue, we propose the Fragility Index (FI), a novel performance metric that evaluates classifiers from a risk-averse perspective by capturing the tail risk of confident misjudgments. We formulate FI within a robust satisficing (RS) framework to ensure robustness under distributional uncertainty. Building on this, we develop a tractable training framework that directly targets FI via a surrogate loss, and show that models trained under this framework admit provable bounds on FI. We further derive exact reformulations for a broad class of loss functions, including cross-entropy, hinge-type, and Lipschitz losses, and extend the approach to deep neural networks. Empirical results on real-world medical diagnosis tasks demonstrate that FI complements existing metrics by revealing error tail risk and improving decision quality. FI-based models achieve competitive accuracy and AUC while consistently reducing confident misjudgments and associated operational costs, offering a practical tool for improving robustness and reliability in risk-critical applications.
♻ ☆ A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks
Nonlinear partial differential equations (PDEs) are pivotal in modeling complex physical systems, yet traditional Physics-Informed Neural Networks (PINNs) often struggle with unresolved residuals in critical spatiotemporal regions and violations of temporal causality. To address these limitations, we propose a novel Residual Guided Training strategy for Physics-Informed Transformer via Generative Adversarial Networks (GAN). Our framework integrates a decoder-only Transformer to inherently capture temporal correlations through autoregressive processing, coupled with a residual-aware GAN that dynamically identifies and prioritizes high-residual regions. By introducing a causal penalty term and an adaptive sampling mechanism, the method enforces temporal causality while refining accuracy in problematic domains. Extensive numerical experiments on the Allen-Cahn, Klein-Gordon, and Navier-Stokes equations demonstrate significant improvements, achieving relative MSE reductions of up to three orders of magnitude compared to baseline methods. This work bridges the gap between deep learning and physics-driven modeling, offering a robust solution for multiscale and time-dependent PDE systems.
♻ ☆ TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology ICML 2025
Understanding the biological mechanisms of disease is crucial for medicine, and in particular, for drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models only modestly improve over task-specific models in downstream applications. Here, we explored two avenues for improving single-cell foundation models. First, we scaled the pre-training data to a diverse collection of 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the TEDDY family of models comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on several downstream evaluation tasks, including identifying the underlying disease state of held-out donors not seen during training, distinguishing between diseased and healthy cells for disease conditions and donors not seen during training, and probing the learned representations for known biology. Our models showed substantial improvement over existing works, and scaling experiments showed that performance improved predictably with both data volume and parameter count.
comment: ICML 2025 Generative AI and Biology (GenBio) Workshop
♻ ☆ Multigrade Neural Network Approximation
We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly non-convex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably one-hidden-layer $\texttt{ReLU}$ models, training admits convex reformulations with global guarantees, motivating learning paradigms that improve stability while scaling to depth. MGDL builds upon this insight by training deep networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce the remaining approximation error, yielding an interpretable and stable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function, there exists a fixed-width multigrade $\texttt{ReLU}$ scheme whose residuals decrease strictly across grades and converge uniformly to zero. To the best of our knowledge, this work provides the first rigorous theoretical guarantee that grade-wise training yields provable vanishing approximation error in deep networks. Numerical experiments further illustrate the theoretical results.
♻ ☆ Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $\Omega(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.
♻ ☆ Causal K-Means Clustering
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: \emph{Causal k-Means Clustering}, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.
♻ ☆ Learning to Play Blackjack: A Curriculum Learning Perspective
Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.
comment: Accepted as an oral presentation at the International Conference on Distributed Artificial Intelligence (DAI 2025). 16 pages, 7 figures
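As a rough illustration of the curriculum idea in the Blackjack abstract above, the sketch below restricts a tabular Q-learning agent to the actions unlocked at the current stage. The stage list, action names, state labels, and hyperparameters are hypothetical placeholders, not the authors' LLM-generated curriculum, and the update is the textbook Q-learning rule rather than anything confirmed by the paper.

```python
import random
from typing import Dict, List, Tuple

QTable = Dict[Tuple[str, str], float]

# Hypothetical curriculum: each stage unlocks more actions (not taken from the paper).
STAGES: List[List[str]] = [
    ["stand", "hit"],
    ["stand", "hit", "double"],
    ["stand", "hit", "double", "split"],
]


def choose_action(q: QTable, state: str, allowed: List[str],
                  epsilon: float, rng: random.Random) -> str:
    """Epsilon-greedy choice restricted to the actions of the current stage."""
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q.get((state, a), 0.0))


def q_update(q: QTable, state: str, action: str, reward: float,
             next_state: str, allowed: List[str], done: bool,
             alpha: float = 0.1, gamma: float = 1.0) -> None:
    """Textbook tabular Q-learning update; bootstraps only over allowed actions."""
    best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in allowed)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

Moving to the next stage only changes the `allowed` list passed to both functions, so values learned for earlier actions carry over unchanged.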
♻ ☆ A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations
Scientific machine learning is an emerging field that broadly describes the combination of scientific computing and machine learning to address challenges in science and engineering. Within the context of differential equations, this has produced highly influential methods, such as neural ordinary differential equations (NODEs). Recent works extend this line of research to consider neural differential-algebraic systems of equations (DAEs), where some unknown relationships within the DAE are learned from data. Training neural DAEs, similarly to neural ODEs, is computationally expensive, as it requires the solution of a DAE for every parameter update. Further, the rigorous consideration of algebraic constraints is difficult within common deep learning training algorithms such as stochastic gradient descent. In this work, we apply the simultaneous approach to neural DAE problems, resulting in a fully discretized nonlinear optimization problem, which is solved to local optimality and simultaneously obtains the neural network parameters and the solution to the corresponding DAE. We extend recent work demonstrating the simultaneous approach for neural ODEs, by presenting a general framework to solve neural DAEs, with explicit consideration of hybrid models, where some components of the DAE are known, e.g. physics-informed constraints. Furthermore, we present a general strategy for improving the performance and convergence of the nonlinear programming solver, based on solving an auxiliary problem for initialization and approximating Hessian terms. We achieve promising results in terms of accuracy, model generalizability and computational cost, across different problem settings such as sparse data, unobserved states and multiple trajectories. Lastly, we provide several promising future directions to improve the scalability and robustness of our approach.
♻ ☆ Prognostics for Autonomous Deep-Space Habitat Health Management under Multiple Unknown Failure Modes
Deep-space habitats (DSHs) are safety-critical systems that must operate autonomously for long periods, often beyond the reach of ground-based maintenance or expert intervention. Monitoring system health and anticipating failures are therefore essential. Prognostics based on remaining useful life (RUL) prediction support this goal by estimating how long a subsystem can operate before failure. Critical DSH subsystems, including environmental control and life support, power generation, and thermal control, are monitored by many sensors and can degrade through multiple failure modes. These failure modes are often unknown, and informative sensors may vary across modes, making accurate RUL prediction challenging when historical failure data are unlabeled. We propose an unsupervised prognostics framework for RUL prediction that jointly identifies latent failure modes and selects informative sensors using unlabeled run-to-failure data. The framework consists of two phases: an offline phase, where system failure times are modeled using a mixture of Gaussian regressions and an Expectation-Maximization algorithm to cluster degradation trajectories and select mode-specific sensors, and an online phase for real-time diagnosis and RUL prediction using low-dimensional features and a weighted functional regression model. The approach is validated on simulated DSH telemetry data and the NASA C-MAPSS benchmark, demonstrating improved prediction accuracy and interpretability.
comment: Manuscript under review
♻ ☆ CompressedScaffnew: The First Theoretical Double Acceleration of Communication from Local Training and Compression in Distributed Optimization
In distributed optimization, a large number of machines alternate between local computations and communication with a coordinating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce this burden and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose CompressedScaffnew, the first algorithm for distributed optimization that jointly harnesses these two strategies and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
♻ ☆ Tackling Non-IIDness in HAPS-Aided Federated Learning
High-altitude platform stations (HAPS) enable large-scale federated learning (FL) in non-terrestrial networks (NTN) by providing wide-area coverage and predominantly line-of-sight (LoS) connectivity to many ground users. However, practical deployments face heterogeneous and non-independently and identically distributed (non-IID) client data, which degrades accuracy and slows convergence. We propose a weighted attribute-based client selection strategy that leverages server-side indicators: historical traffic behavior, instantaneous channel quality, computational capability, and prior-round learning contribution. At each round, the HAPS computes a composite score and selects the top clients, while adapting attribute weights online based on their correlation with validation-loss improvement. We further provide theoretical justification that traffic-derived uniformity can serve as a proxy for latent data heterogeneity, enabling selection of client subsets with reduced expected non-IIDness. Simulations demonstrate improved test accuracy, faster convergence, and lower training loss compared with random, resource-only, and single-attribute baselines.
comment: Submitted to IEEE for possible publication
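The per-round selection step described in the HAPS abstract above (compute a composite score, keep the top clients) might look roughly like the following. This is a minimal sketch under stated assumptions: the attribute names are hypothetical, attribute values are assumed to be pre-normalized to [0, 1], and the linear weighted sum is an assumed form of the composite score. The online weight adaptation is not shown.

```python
from typing import Dict, List

# Hypothetical attribute names; values are assumed to be pre-normalized to [0, 1].
ATTRIBUTES = ("traffic", "channel", "compute", "contribution")


def composite_score(client: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of a client's attributes (the linear form is an assumption)."""
    return sum(weights[name] * client[name] for name in ATTRIBUTES)


def select_top_clients(clients: Dict[str, Dict[str, float]],
                       weights: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k clients with the highest composite score."""
    ranked = sorted(clients,
                    key=lambda cid: composite_score(clients[cid], weights),
                    reverse=True)
    return ranked[:k]
```

With equal weights this reduces to ranking clients by their mean attribute value; skewing the weights toward one attribute moves the rule toward a single-attribute baseline of the kind the abstract compares against.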
♻ ☆ Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
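The z-score part of the deviation diagnosis in the darts abstract above can be sketched as follows. This is an illustrative sketch only: the feature names, the use of the athlete's own reference samples as the baseline, and the threshold of 2.0 are assumptions, and the hierarchical recommendation logic is not modeled.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_scores(throw: Dict[str, float],
             reference: Dict[str, List[float]]) -> Dict[str, float]:
    """Z-score of each feature relative to the athlete's own reference samples."""
    scores = {}
    for name, value in throw.items():
        samples = reference[name]
        spread = stdev(samples)  # sample standard deviation; needs >= 2 samples
        scores[name] = 0.0 if spread == 0 else (value - mean(samples)) / spread
    return scores


def flag_deviations(scores: Dict[str, float], threshold: float = 2.0) -> List[str]:
    """Names of features whose absolute z-score exceeds the assumed threshold."""
    return sorted(name for name, z in scores.items() if abs(z) > threshold)
```

Scoring against an individual's own reference samples, rather than a population template, mirrors the abstract's shift from deviation from a uniform standard to deviation from an individual's optimal range.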
♻ ☆ Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR's consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR's superior performance over baseline methods, achieving 7.6%-29.3% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
comment: In systems with causal effects, a large majority of individuals are mistakenly classified as using a certain strategy by the strategic equilibrium solver, resulting in the introduction of this feature as an independent variable in causal inference without specificity. This method may have an inherent error
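For readers unfamiliar with double robustness, the sketch below shows the standard augmented inverse-probability-weighted (AIPW) estimator of an average treatment effect. It is the generic doubly robust estimator, not the SDR estimator itself: the strategic equilibrium component is not reproduced, and the function and argument names are hypothetical.

```python
import numpy as np

def aipw_ate(y, t, e_hat, mu1_hat, mu0_hat):
    """Standard AIPW estimate of the average treatment effect.

    y        observed outcomes
    t        binary treatment indicators (0 or 1)
    e_hat    estimated propensity scores P(T=1 | X)
    mu1_hat  estimated outcome under treatment, E[Y | X, T=1]
    mu0_hat  estimated outcome under control,   E[Y | X, T=0]
    """
    y, t = np.asarray(y, float), np.asarray(t, float)
    e_hat = np.asarray(e_hat, float)
    mu1_hat, mu0_hat = np.asarray(mu1_hat, float), np.asarray(mu0_hat, float)
    # Outcome-model difference plus inverse-propensity-weighted residual corrections.
    psi = (mu1_hat - mu0_hat
           + t * (y - mu1_hat) / e_hat
           - (1 - t) * (y - mu0_hat) / (1 - e_hat))
    return psi.mean()
```

The estimate is consistent if either the propensity model or the outcome model is correctly specified. When the outcome model fits exactly, the residual terms vanish and the propensity scores no longer influence the result.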
♻ ☆ DeDelayed: Deleting Remote Inference Delay via On-Device Correction CVPR 2026
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference -- an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and python library at https://github.com/InterDigitalInc/dedelayed .
comment: CVPR 2026
♻ ☆ SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of $14,116$ dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate the two representative generative paradigms: StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs) under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID ($\approx 65.5$) and KID ($\approx 0.05$), while diffusion models generated higher variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection with $8$-$15$\% absolute gains in melanoma F1-score, and ViT-B/16 achieving F1 $\approx 0.88$ and ROC-AUC $\approx 0.98$, representing an improvement of approximately $14\%$ over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench
♻ ☆ Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints AISTATS
The ROC curve is widely used to assess binary classifiers. Yet for some applications, such as alert systems for monitoring hospitalized patients, conventional ROC analysis cannot meet two key deployment needs: enforcing a constraint on precision to avoid false alarm fatigue and imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet precision and capacity constraints occupy a feasible region in ROC space. We establish the polygon-shaped geometry of this region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible region. Averaging this area over a desired distribution for cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk from vital sign history on several datasets, we show this cost-aware metric can outperform alternatives at ranking classifiers for in-hospital alerts.
comment: In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2026
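Why the feasible region is a polygon can be seen from the standard definitions. With positive-class prevalence π, precision equals π·TPR / (π·TPR + (1-π)·FPR) and the fraction of predicted positives equals π·TPR + (1-π)·FPR, so both constraints are linear inequalities in (FPR, TPR). The sketch below is derived from those definitions only; the function name and parameterization are hypothetical and may differ from the paper's.

```python
def in_feasible_region(fpr, tpr, prevalence, min_precision, max_positive_rate):
    """Check whether an ROC point satisfies precision and capacity constraints.

    prevalence         fraction of positives in the population
    min_precision      lower bound on precision (limits false alarms)
    max_positive_rate  upper bound on the fraction of cases flagged (capacity)
    """
    true_pos = prevalence * tpr            # fraction of population that is TP
    false_pos = (1 - prevalence) * fpr     # fraction of population that is FP
    # precision >= min_precision, rearranged to avoid dividing by zero at (0, 0)
    precision_ok = true_pos * (1 - min_precision) >= min_precision * false_pos
    capacity_ok = true_pos + false_pos <= max_positive_rate
    return precision_ok and capacity_ok
```

Each inequality is a half-plane in ROC space, and their intersection with the unit square is a convex polygon.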
♻ ☆ Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality
Deep generative models, while revolutionizing fields like image and text generation, largely operate as opaque ``black boxes'', hindering human understanding, control, and alignment. While methods like sparse autoencoders (SAEs) show remarkable empirical success, they often lack theoretical guarantees, risking subjective insights. Our primary objective is to establish a principled foundation for interpretable generative models. We demonstrate that the principle of causal minimality -- favoring the simplest causal explanation -- can endow the latent representations of modern generative models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables, better capturing the complex dependencies in data generation. Under theoretically derived minimality conditions, we show that learned representations can be equivalent to the true latent variables of the data-generating process. Empirically, applying these constraints to leading text-to-image diffusion models allows us to extract their innate hierarchical concept graphs, offering fresh insights into their internal knowledge organization. Furthermore, these causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.
♻ ☆ Linear Attention for Joint Power Optimization and User-Centric Clustering in Cell-Free Networks
Optimal access point (AP) clustering and power allocation are critical in user-centric cell-free massive MIMO systems. Existing deep learning models lack the flexibility to handle dynamic network configurations. Furthermore, many approaches overlook pilot contamination and suffer from high computational complexity. In this paper, we propose a lightweight transformer model that overcomes these limitations by jointly predicting AP clusters and powers solely from the spatial coordinates of user devices and APs. Our model is architecture-agnostic to user load, handles both clustering and power allocation without channel estimation overhead, and eliminates pilot contamination by assigning users to APs within a pilot reuse constraint. We also incorporate a customized linear attention mechanism to capture user-AP interactions efficiently and enable linear scalability with respect to the number of users. Numerical results confirm the model's effectiveness in maximizing the minimum spectral efficiency and providing near-optimal performance while ensuring adaptability and scalability in dynamic scenarios.
♻ ☆ HEAS: Hierarchical Evolutionary Agent-Based Simulation Framework for Multi-Objective Policy Search
Metric aggregation divergence is a hidden confound in agent-based model policy search: when optimization, tournament evaluation, and statistical validation independently implement outcome metric extraction, champion selection reflects aggregation artifact rather than policy quality. We propose Hierarchical Evolutionary Agent Simulation (HEAS), a composable framework that eliminates this confound through a runtime-enforceable metric contract - a uniform metrics_episode() callable shared identically by all pipeline stages. Removing the confound yields robust champion selection: in a controlled experiment (n=30), HEAS reduces rank reversals by 50% relative to ad-hoc aggregation; the HEAS champion wins all 32 held-out ecological scenarios - a null-safety result that would be uninterpretable under aggregation divergence. The contract additionally reduces coupling code by 97% (160 to 5 lines) relative to Mesa 3.3.1. Three case studies validate composability across ecological, enterprise, and mean-field ordinary differential equation dynamics.
comment: 12 pages, 1 figure. Python package: https://pypi.org/project/heas/ | Web playground: https://ryzhanghason.github.io/heas/
♻ ☆ GradPower: Powering Gradients for Faster Language Model Pre-Training
We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
comment: 22 pages. A revised version is in preparation
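The elementwise sign-power transformation $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ from the abstract can be sketched in a few lines. This is a minimal NumPy illustration of the formula only; the function name is hypothetical, no particular value of $p$ is implied, and the base optimizer that would consume the transformed gradient is omitted.

```python
import numpy as np

def grad_power(g, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p, for p > 0."""
    g = np.asarray(g, dtype=float)
    return np.sign(g) * np.abs(g) ** p

# The transformed gradient would then be passed to an unmodified base
# optimizer (for example Adam) in place of the raw gradient.
```

With $p=1$ the transform is the identity, so the base optimizer is recovered unchanged. The sign of every component is preserved for any $p>0$.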
♻ ☆ Do Phone-Use Agents Respect Your Privacy?
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/FreedomIntelligence/MyPhoneBench.
comment: work in progress
♻ ☆ NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage Device API for both intra- and inter-node communications, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
comment: 13 pages, 8 figures, 7 tables
♻ ☆ Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual effort. We address this by automating the most labour-intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
comment: Demo video url: https://jn00767.pages.surrey.ac.uk/catalogue-grounded-multimodal-attribution-for-museum-video/
♻ ☆ Coarsening Causal DAG Models
Directed acyclic graphical (DAG) models are a powerful tool for representing causal relationships among jointly distributed random variables, especially concerning data from across different experimental settings. However, it is not always practical or desirable to estimate a causal model at the granularity of given features in a particular dataset. There is a growing body of research on causal abstraction to address such problems. We contribute to this line of research by (i) providing novel graphical identifiability results for practically-relevant interventional settings, (ii) proposing an efficient, provably consistent algorithm for directly learning abstract causal graphs from interventional data with unknown intervention targets, and (iii) uncovering theoretical insights about the lattice structure of the underlying search space, with connections to the field of causal discovery more generally. As proof of concept, we apply our algorithm on synthetic and real datasets with known ground truths, including measurements from a controlled physical system with interacting light intensity and polarization.
comment: 27 pages, 5 figures; accepted to the 5th conference on Causal Learning and Reasoning (CLeaR)
♻ ☆ Resource-Efficient Variational Quantum Classifier
We introduce the unambiguous quantum classifier based on Hamming distance measurements combined with classical post-processing. The proposed approach improves classification performance through a more effective use of ansatz expressivity, while requiring significantly fewer circuit evaluations. Moreover, the method demonstrates enhanced robustness to noise, which is crucial for near-term quantum devices. We evaluate the proposed method on a breast cancer classification dataset. The unambiguous classifier achieves an average accuracy of 90%, corresponding to an improvement of 6.9 percentage points over the baseline, while requiring eight times fewer circuit executions per prediction. In the presence of noise, the improvement is reduced to approximately 3.1 percentage points, with the same reduction in execution cost. We substantiate our experimental results with theoretical evidence supporting the practical performance of the approach.
comment: 13 pages, 7 figures, 1 table; typos corrected, new references added, modification of model M3, new result box plots for all models, theoretical results adjusted, abstract and conclusion modified
♻ ☆ Adaptive Coverage Policies in Conformal Prediction
Traditional conformal prediction methods construct prediction sets such that the true label falls within the set with a user-specified coverage level. However, poorly chosen coverage levels can result in uninformative predictions, either producing overly conservative sets when the coverage level is too high, or empty sets when it is too low. Moreover, the fixed coverage level cannot adapt to the specific characteristics of each individual example, limiting the flexibility and efficiency of these methods. In this work, we leverage recent advances in e-values and post-hoc conformal inference, which allow the use of data-dependent coverage levels while maintaining valid statistical guarantees. We propose to optimize an adaptive coverage policy by training a neural network using a leave-one-out procedure on the calibration set, allowing the coverage level and the resulting prediction set size to vary with the difficulty of each individual example. We support our approach with theoretical coverage guarantees and demonstrate its practical benefits through a series of experiments.
comment: Code at: https://github.com/GauthierE/adaptive-coverage-policies
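As background for the fixed-coverage setting this abstract contrasts itself with, the sketch below shows standard split conformal prediction for classification with a single user-specified level alpha. It does not implement the paper's e-value-based adaptive policy; the function names and the nonconformity score (one minus the predicted probability) are assumptions chosen for illustration.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Conformal quantile of calibration nonconformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; returns infinity
    when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(np.asarray(cal_scores, dtype=float))[k - 1]

def prediction_set(probs, threshold):
    """Labels whose nonconformity score (1 - probability) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]
```

A small alpha pushes the threshold up and produces larger, more conservative sets, while a large alpha can leave the set empty, which is the trade-off a single fixed level cannot adapt per example.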
♻ ☆ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
comment: 10 pages, 6 figures
♻ ☆ Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system's context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.
♻ ☆ Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Graph neural networks (GNNs) have achieved strong performance across various real-world domains. Nevertheless, they suffer from oversquashing, where long-range information is distorted as it is compressed through limited message-passing pathways. This bottleneck limits their ability to capture essential global context and decreases their performance, particularly in dense and heterophilic regions of graphs. To address this issue, we propose a novel graph learning framework that enriches node embeddings via cross-attentive cohesive subgraph representations to mitigate the impact of excessive long-range dependencies. This framework enhances the node representation by emphasizing cohesive structure in long-range information but removing noisy or irrelevant connections. It preserves essential global context without overloading the narrow bottlenecked channels, which further mitigates oversquashing. Extensive experiments on multiple benchmark datasets demonstrate that our model achieves consistent improvements in classification accuracy over standard baseline methods.
♻ ☆ Fair Representation in Parliamentary Summaries: Measuring and Mitigating Inclusion Bias ECML-PKDD 2025
The use of large language models (LLMs) to summarise parliamentary proceedings presents a promising means of increasing the accessibility of democratic participation. However, as these systems increasingly mediate access to political information -- filtering and framing content before it reaches users -- there are important fairness considerations to address. In this work, we evaluate 5 LLMs (both proprietary and open-weight) in the summarisation of plenary debates from the European Parliament to investigate the representational biases that emerge in this context. We develop an attribution-aware evaluation framework to measure speaker-level inclusion and misrepresentation in debate summaries. Across all models and experiments, we find that speakers are less accurately represented in the final summary on the basis of (i) their speaking order (speeches in the middle of the debate were systematically excluded), (ii) language spoken (non-English speakers were less faithfully represented), and (iii) political affiliation (better outcomes for left-of-centre parties). We further show how biases in these contexts can be decomposed to distinguish inclusion bias (systematic omission) from hallucination bias (systematic misrepresentation), and explore the effect of different mitigation strategies. Prompting strategies do not affect these biases. Instead, we propose a hierarchical summarisation method that decomposes the task into simpler extraction and aggregation steps, which we show significantly reduces the positional/speaking-order bias across all models. These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for multilingual democratic applications.
comment: Extended journal version of "Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation" (arXiv:2507.14221), which appeared at the AIDEM Workshop, ECML-PKDD 2025. This version extends the original with cross-lingual bias analysis, a two-level hierarchical summarisation method, and human annotation validation of the evaluation framework
♻ ☆ Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework
In multi-source transfer learning, a key challenge lies in how to appropriately differentiate and utilize heterogeneous source tasks. However, existing multi-source methods typically focus on optimizing either the source weights or the amount of transferred samples, largely neglecting their joint consideration. In this work, we propose a theoretical framework, Unified Optimization of Weights and Quantities (UOWQ), that jointly determines the optimal source weights and transfer quantities for each source task. Specifically, the framework formulates multi-source transfer learning as a parameter estimation problem based on an asymptotic analysis of a Kullback--Leibler divergence--based generalization error measure, leading to two main theoretical findings: 1) using all available source samples is always optimal when the weights are properly adjusted; 2) the optimal source weights are characterized by a principled optimization problem whose structure explicitly incorporates the Fisher information, parameter discrepancy, parameter dimensionality, and transfer quantities. Building on the theoretical results, we further propose a practical algorithm for multi-source transfer learning, and extend it to multi-task learning settings where each task simultaneously serves as both a source and a target. Extensive experiments on real-world benchmarks, including DomainNet and Office-Home, demonstrate that UOWQ consistently outperforms strong baselines. The results validate both the theoretical predictions and the practical effectiveness of our framework.
♻ ☆ Where You Place the Norm Matters: From Prejudiced to Neutral Initializations AISTATS
Normalization layers were introduced to stabilize and accelerate training, yet their influence is critical already at initialization, where they shape signal propagation and output statistics before parameters adapt to data. In practice, both which normalization to use and where to place it are often chosen heuristically, despite the fact that these decisions can qualitatively alter a model's behavior. We provide a theoretical characterization of how normalization choice and placement (Pre-Norm vs. Post-Norm) determine the distribution of class predictions at initialization, ranging from unbiased (Neutral) to highly concentrated (Prejudiced) regimes. We show that these architectural decisions induce systematic shifts in the initial prediction regime, thereby modulating subsequent learning dynamics. By linking normalization design directly to prediction statistics at initialization, our results offer principled guidance for more controlled and interpretable network design, including clarifying how widely used choices such as BatchNorm vs. LayerNorm and Pre-Norm vs. Post-Norm shape behavior from the outset of training.
comment: Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300. Copyright 2026 by the author(s)
Think Anywhere in Code Generation
Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation: upfront thinking is often insufficient because a problem's full complexity reveals itself only during implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process, where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.
♻ ☆ Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.
comment: Accepted at the 8th Annual Learning for Dynamics & Control Conference (L4DC)
♻ ☆ Virasoro Symmetry in Neural Network Field Theories
Neural Network Field Theories (NN-FTs) typically describe Generalized Free Fields that lack a local stress-energy tensor in two dimensions, obstructing the realization of Virasoro symmetry. We present the ``Log-Kernel'' (LK) architecture, which enforces local conformal symmetry via a specific rotation-invariant spectral prior $p(k) \propto |k|^{-2}$. We analytically derive the emergence of the Virasoro algebra from the statistics of the neural ensemble. We validate this construction through numerical simulation, computing the central charge $c_{exp} = 0.9958 \pm 0.0196$ (theoretical $c=1$) and confirming the scaling dimensions of vertex operators. Furthermore, we demonstrate that finite-width corrections generate interactions scaling as $1/N$. Finally, we extend the framework to include fermions and boundary conditions, realizing the super-Virasoro algebra. We verify the $\mathcal{N}=1$ super-Virasoro algebra by measuring the supercurrent correlator to $96\%$ accuracy. We further demonstrate conformal boundary conditions on the upper half-plane, achieving 99\% agreement for boundary fermion and boson propagators.
comment: 1+23 pages, 6 figures
♻ ☆ Output Embedding Centering for Stable LLM Pretraining
Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs at the end of training is output logit divergence. The most widely used mitigation strategies, z-loss and logit soft-capping, merely address the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings' geometry and identify anisotropic embeddings as its source. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and demonstrate that it suppresses output logit divergence. OEC can be implemented in two different ways: as a deterministic operation called $μ$-centering, or a regularization method called $μ$-loss. Our experiments show that both variants outperform z-loss in terms of training stability, while being on par with logit soft-capping. This holds true both in the presence and the absence of weight tying. As a secondary result, we find that $μ$-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
comment: Additional experiments using logit soft-capping & weight tying
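One way to see why centering the output embeddings does not change the model's predictions: subtracting the mean embedding shifts every logit by the same amount, and softmax is invariant to a uniform shift. The sketch below illustrates this in plain Python. It assumes that $μ$-centering amounts to subtracting the mean output embedding vector, which the abstract does not spell out; the names `center_rows`, `logits`, and `softmax` and the toy numbers are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logits(hidden, embeddings):
    """Dot product of one hidden state with every output embedding row."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embeddings]


def center_rows(embeddings):
    """Subtract the mean embedding from each row (assumed form of centering)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    return [[row[j] - mean[j] for j in range(dim)] for row in embeddings]


# Toy output embedding matrix (3 tokens, 2 dimensions) and one hidden state.
E = [[2.0, 1.0], [3.0, -1.0], [4.0, 0.5]]
h = [0.7, -0.2]

raw = logits(h, E)
centered = logits(h, center_rows(E))
p_raw = softmax(raw)
p_centered = softmax(centered)
```

In this sketch the centered logits sum to zero while the two probability vectors coincide, so centering constrains the logits without altering the predicted distribution.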
♻ ☆ Graph-Informed Adversarial Modeling: Infimal Subadditivity of Interpolative Divergences
We study adversarial learning when the target distribution factorizes according to a known Bayesian network. For interpolative divergences, including $(f,Γ)$-divergences, we prove a new infimal subadditivity principle showing that, under suitable conditions, a global variational discrepancy is controlled by an average of family-level discrepancies aligned with the graph. In an additive regime, the surrogate is exact. This closes a theoretical gap in the literature; existing subadditivity results justify graph-informed adversarial learning for classical discrepancies, but not for interpolative divergences, where the usual factorization argument breaks down. In turn, we provide a justification for replacing a standard, graph-agnostic GAN with a monolithic discriminator by a graph-informed GAN (GiGAN) with localized family-level discriminators, without requiring the optimizer itself to factorize according to the graph. We also obtain parallel results for integral probability metrics and proximal optimal transport divergences, identify natural discriminator classes for which the theory applies, and present experiments showing improved stability and structural recovery relative to graph-agnostic baselines.
comment: 34 pages, 9 figures
♻ ☆ Reinforcement Learning-based Task Offloading in the Internet of Wearable Things
Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem from the limited battery power and insufficient computation resources available on wearable devices. On the other hand, with the popularity of smart wearables, there is a consistent increase in the development of new computationally intensive and latency-critical applications. In such a context, task offloading allows wearables to leverage the resources available on nearby edge devices to enhance the overall user experience. This paper proposes a framework for Reinforcement Learning (RL)-based task offloading in the IoWT. We formulate the task offloading process considering the tradeoff between energy consumption and task accomplishment time. Moreover, we model the task offloading problem as a Markov Decision Process (MDP) and utilize the Q-learning technique to enable the wearable device to make optimal task offloading decisions without prior knowledge. We evaluate the performance of the proposed framework through extensive simulations for various applications and system configurations conducted in the ns-3 network simulator. We also show how varying the main system parameters of the Q-learning algorithm affects the overall performance in terms of average task accomplishment time, average energy consumption, and percentage of tasks offloaded.
comment: Withdrawn by the authors. A revised version is under preparation
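The Q-learning formulation mentioned in the entry above can be illustrated with the standard tabular update rule. This is a minimal sketch, not the authors' implementation: the state names, the two-action set, the weighted energy/time cost, and all numeric values are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical action set: run the task on the wearable or send it to an edge device.
ACTIONS = ("local", "offload")


def cost(energy_j, time_s, weight=0.5):
    """Assumed cost: a weighted sum trading off energy against completion time."""
    return weight * energy_j + (1.0 - weight) * time_s


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update; returns the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # all Q-values start at zero
reward = -cost(energy_j=2.0, time_s=1.0)  # lower cost means higher reward
new_value = q_update(q, "busy", "offload", reward, "idle")
```

Using the negated cost as the reward means the learned policy prefers whichever action lowers the combined energy and time penalty.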
♻ ☆ Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
♻ ☆ LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade
Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.
♻ ☆ Computationally efficient Gauss-Newton reinforcement learning for model predictive control
Model predictive control (MPC) is widely used in process control due to its interpretability and ability to handle constraints. As a parametric policy in reinforcement learning (RL), MPC offers strong initial performance and low data requirements compared to black-box policies like neural networks. However, most RL methods rely on first-order updates, which scale well to large parameter spaces but converge at most linearly, making them inefficient when each policy update requires solving an optimal control problem, as is the case with MPC. While MPC policies are typically low parameterized and thus amenable to second-order approaches, existing second-order methods demand second-order policy derivatives, which can be computationally intractable. This work introduces a Gauss-Newton approximation of the deterministic policy Hessian that eliminates the need for second-order policy derivatives, enabling superlinear convergence with minimal computational overhead. To further improve robustness, we propose a momentum-based Hessian averaging scheme for stable training under noisy estimates coupled with an adaptive trust region. We demonstrate the effectiveness of the approach on a nonlinear continuously stirred tank reactor (CSTR), showing faster convergence and improved data efficiency over state-of-the-art first-order methods and deep RL approaches.
comment: 17 pages, 9 figures, submitted to Elsevier in the special issue "Reinforcement Learning and Its Applications to Process Systems Engineering Problems" in the journal "Computers and Chemical Engineering"
♻ ☆ Efficient Reasoning with Balanced Thinking ICLR 2026
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
comment: Accepted by ICLR 2026
♻ ☆ Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant dynamics. Unlike classical global Lyapunov stability, absorbing invariance expands the class of stability-preserving architectures, enabling more flexible and expressive EBMs. We extend EBM theory to nonsmooth activations by establishing negative energy dissipation via Clarke derivatives and deriving new conditions for radial unboundedness, exposing a stability-expressivity tradeoff in standard EBMs. To overcome this, we introduce a hybrid architecture with a dynamical visible layer and static hidden layers, prove absorbing invariance under mild assumptions, and show that these guarantees extend to port-Hamiltonian EBMs. Experiments on metric-deformed multi-well and ring systems validate the approach, showcasing how our hybrid EBM architecture combines expressivity with sound and provable safety guarantees by design.
♻ ☆ Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection
Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at https://github.com/arasyi/hqcae-network-intrusion-detection.
comment: The authors have identified limitations in the experimental evaluation, which are insufficient to fully support the paper's conclusions. The manuscript is withdrawn pending additional experiments and analysis
♻ ☆ LEXam: Benchmarking Legal Reasoning on 340 Law Exams ICLR 2026
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 7,537 law exam questions in English and German. It includes both long-form, open-ended questions and multiple-choice questions with varying numbers of options. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/.
comment: Accepted to ICLR 2026
♻ ☆ Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA
Dynamic Algorithm Configuration (DAC) studies the efficient identification of control policies for parameterized optimization algorithms. Numerous studies leverage Reinforcement Learning (RL) to address DAC challenges; however, applying RL often requires extensive domain expertise. In this work, we conduct a comprehensive study of two deep-RL algorithms--Double Deep Q-Networks (DDQN) and Proximal Policy Optimization (PPO)--for controlling the population size of the $(1+(λ,λ))$-GA on OneMax instances. Although OneMax is structurally simple, learning effective control policies for the $(1+(λ,λ))$-GA induces a highly challenging DAC landscape, making it a controlled yet demanding benchmark. Our investigation reveals two fundamental challenges limiting DDQN and PPO: scalability degradation and learning instability, traced to under-exploration and planning horizon coverage. To address under-exploration, we introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN exploration. This eliminates instance-specific hyperparameter tuning and ensures consistent effectiveness across problem scales. To resolve planning horizon coverage, we demonstrate that undiscounted learning succeeds in DDQN, while PPO faces fundamental variance issues necessitating alternative designs. We further show that while hyperparameter optimization enhances PPO's stability, it consistently fails to identify effective policies. Finally, DDQN with adaptive reward shifting achieves performance comparable to theoretically derived policies with vastly improved sample efficiency, outperforming prior DAC approaches by orders of magnitude. Our findings provide insights into the fundamental obstacles faced by standard deep-RL approaches in this challenging DAC setting and highlight the key methodological ingredients required for effective learning.
comment: arXiv admin note: text overlap with arXiv:2502.20265
♻ ☆ StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold NeurIPS 2025
Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $U\!SV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter's input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.
comment: NeurIPS 2025 Spotlight
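The three-factor form $U S V^\top$ with orthonormal $U$ and $V$ described in the entry above can be illustrated on a toy example. The sketch below uses modified Gram-Schmidt as a simple stand-in for keeping the factors orthonormal; it is not the Riemannian optimization scheme the abstract refers to, and the helper names, shapes, and values are all hypothetical.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def orthonormalize_columns(m):
    """Modified Gram-Schmidt on the columns of m (assumes full column rank)."""
    basis = []
    for col in transpose(m):
        v = list(col)
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return transpose(basis)


# Toy adapter: output dimension 3, input dimension 4, rank 2.
U = orthonormalize_columns([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])              # 3 x 2
V = orthonormalize_columns([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])  # 4 x 2
S = [[2.0, 0.0], [0.0, 0.5]]                                                   # 2 x 2 scaling

# Low-rank weight update: delta = U S V^T, shape 3 x 4.
delta = matmul(matmul(U, S), transpose(V))
```

Because $U$ and $V$ have orthonormal columns, the scale of the update is carried entirely by $S$, which is the separation of subspaces from scaling that the abstract describes.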
♻ ☆ Partial Feedback Online Learning
We study a new learning protocol, termed partial-feedback online learning, where each instance admits a set of acceptable labels, but the learner observes only one acceptable label per round. We highlight that, while classical version space is widely used for online learnability, it does not directly extend to this setting. We address this obstacle by introducing a collection version space, which maintains sets of hypotheses rather than individual hypotheses. Using this tool, we obtain a tight characterization of learnability in the set-realizable regime. In particular, we define the Partial-Feedback Littlestone dimension (PFLdim) and the Partial-Feedback Measure Shattering dimension (PMSdim), and show that they tightly characterize the minimax regret for deterministic and randomized learners, respectively. We further identify a nested inclusion condition under which deterministic and randomized learnability coincide, resolving an open question of Raman et al. (2024b). Finally, given a hypothesis space H, we show that beyond set realizability, the minimax regret can be linear even when |H|=2, highlighting a barrier beyond set realizability.
comment: 40 pages. Fixed some typos in the proof and improved readability
♻ ☆ A Polynomial-Time Algorithm for Variational Inequalities under the Minty Condition
Solving (Stampacchia) variational inequalities (SVIs) is a foundational problem at the heart of optimization. However, this expressivity comes at the cost of computational hardness. As a result, most research has focused on carving out specific subclasses that elude those intractability barriers. A classical property that goes back to the 1960s is the Minty condition, which postulates that the Minty VI (MVI) problem admits a solution. In this paper, we establish the first polynomial-time algorithm -- with complexity growing polynomially in the dimension $d$ and $\log(1/ε)$ -- for solving $ε$-SVIs for Lipschitz continuous mappings under the Minty condition. Prior approaches either incurred an exponentially worse dependence on $1/ε$ (and other natural parameters of the problem) or made more restrictive assumptions, such as monotonicity. To do so, we introduce a new variant of the ellipsoid algorithm whereby separating hyperplanes are obtained after taking a descent step from the center of the ellipsoid. It succeeds even though the set of SVIs can be nonconvex and not fully dimensional. Moreover, when our algorithm is applied to an instance with no MVI solution and fails to identify an SVI solution, it produces a succinct certificate of MVI infeasibility. We also show that deciding whether the Minty condition holds is $\mathsf{coNP}$-complete, thereby establishing that the disjunction of those two problems is polynomial-time solvable even though each problem is individually intractable. We provide several extensions and new applications of our main results. Most notably, we obtain the first polynomial-time algorithms for computing Nash equilibria in multi-player harmonic games. Finally, in two-player general-sum concave games, we give the first polynomial-time algorithm that outputs either a Nash equilibrium or a strict coarse correlated equilibrium.
comment: V3 polishes the writing and makes a correction to Theorem 5.2
♻ ☆ MemRerank: Preference Memory for Personalized Product Reranking
LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.
comment: correct author name in metadata
♻ ☆ WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport
The Wasserstein-Fisher-Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth-death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time. The Python code is available at https://github.com/QiangweiPeng/WFR-FM.
Programming Languages 5
♻ ☆ Dimensional Type Systems and Deterministic Memory Management: Design-Time Semantic Preservation in Native Compilation
We present a compilation framework in which dimensional type annotations persist through multi-stage MLIR lowering, enabling the compiler to jointly resolve numeric representation selection and deterministic memory management as coeffect properties of a single program semantic graph (PSG). Dimensional inference determines value ranges; value ranges determine representation selection; representation selection determines word width and memory footprint; and memory footprint, combined with escape classification, determines allocation strategy and cross-target transfer fidelity. The Dimensional Type System (DTS) extends Hindley-Milner unification with constraints drawn from finitely generated abelian groups, yielding inference that is decidable in polynomial time, complete, and principal. Where conventional systems erase dimensional annotations before code generation, DTS carries them as compilation metadata through each lowering stage, making them available where representation and memory placement decisions occur. Deterministic Memory Management (DMM), formalized as a coeffect discipline within the same graph, unifies escape analysis and memory placement with the dimensional framework. Escape analysis classifies value lifetimes into four categories (stack-scoped, closure-captured, return-escaping, byref-escaping), each mapping to a verified allocation strategy. We identify implications for auto-differentiation: the dimensional algebra is closed under the chain rule, and forward-mode gradient computation exhibits a coeffect signature that the framework can verify. The practical consequence is a development environment where escape diagnostics, allocation strategy, representation fidelity, and cache locality estimation are design-time views over the compilation graph.
comment: 29 pages, 8 tables, 3 appendices with extended examples
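The claim that the dimensional algebra is closed under the chain rule can be checked on a small example. The sketch below models dimensions as integer exponent vectors over (length, mass, time), that is, the free abelian group $\mathbb{Z}^3$; this is one simple instance of a finitely generated abelian group chosen for illustration, not the DTS implementation, and all names are hypothetical.

```python
from typing import Tuple

# A dimension is a vector of integer exponents over (length, mass, time).
Dim = Tuple[int, ...]

LENGTH: Dim = (1, 0, 0)
MASS: Dim = (0, 1, 0)
TIME: Dim = (0, 0, 1)


def mul(a: Dim, b: Dim) -> Dim:
    """Multiplying quantities adds their exponent vectors."""
    return tuple(x + y for x, y in zip(a, b))


def inv(a: Dim) -> Dim:
    """The reciprocal of a quantity negates its exponents."""
    return tuple(-x for x in a)


def div(a: Dim, b: Dim) -> Dim:
    return mul(a, inv(b))


def derivative_dim(output: Dim, variable: Dim) -> Dim:
    """A derivative d(output)/d(variable) has dimension output / variable."""
    return div(output, variable)


velocity = div(LENGTH, TIME)                     # (1, 0, -1)
energy = mul(MASS, mul(velocity, velocity))      # (2, 1, -2)

# Chain rule: dE/dt = (dE/dv) * (dv/dt); both sides must carry the same dimension.
lhs = derivative_dim(energy, TIME)
rhs = mul(derivative_dim(energy, velocity), derivative_dim(velocity, TIME))
```

Both sides come out as power (energy per unit time), since the intermediate variable's exponents cancel in the group.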
♻ ☆ Triosecuris: Formally Verified Protection Against Speculative Control-Flow Hijacking
This paper introduces Triosecuris, a formally verified defense against Spectre BTB, RSB, and PHT that combines CET-style hardware-assisted control-flow integrity with compiler-inserted speculative load hardening (SLH). Triosecuris is based on the novel observation that in the presence of CET-style protection, we can precisely detect BTB misspeculation for indirect calls and RSB misspeculation for returns and set the SLH misspeculation flag. We formalize Triosecuris as a transformation in Rocq and provide a machine-checked proof that it achieves relative security: any transformed program running with speculation leaks no more than what the source program leaks without speculation. This strong security guarantee applies to arbitrary programs, even those not following the cryptographic constant-time programming discipline.
comment: Conditionally accepted at CSF'26; extended with concrete protection against Spectre RSB and renamed to Triosecuris
♻ ☆ YASA: Scalable Multi-Language Taint Analysis on the Unified AST at Ant Group
Modern enterprises increasingly adopt diverse technology stacks with various programming languages, posing significant challenges for static application security testing (SAST). Existing taint analysis tools are predominantly designed for single languages, requiring substantial engineering effort that scales with language diversity. While multi-language tools like CodeQL, Joern, and WALA attempt to address these challenges, they face limitations in intermediate representation design, analysis precision, and extensibility, which make them difficult to scale effectively for large-scale industrial applications at Ant Group. To bridge this gap, we present YASA (Yet Another Static Analyzer), a unified multi-language static taint analysis framework designed for industrial-scale deployment. Specifically, YASA introduces the Unified Abstract Syntax Tree (UAST) that provides a unified abstraction for compatibility across diverse programming languages. Building on the UAST, YASA performs points-to analysis and taint propagation, leveraging a unified semantic model to manage language-agnostic constructs, while incorporating language-specific semantic models to handle other unique language features. When compared to 6 single- and 2 multi-language static analyzers on an industry-standard benchmark, YASA consistently outperformed all baselines across Java, JavaScript, Python, and Go. In real-world deployment within Ant Group, YASA analyzed over 100 million lines of code across 7.3K internal applications. It identified 314 previously unknown taint paths, with 92 of them confirmed as 0-day vulnerabilities. All vulnerabilities were responsibly reported, with 76 already patched by internal development teams, demonstrating YASA's practical effectiveness for securing large-scale industrial software systems.
♻ ☆ Quokka: Accelerating Program Verification with LLMs via Invariant Synthesis
Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We investigate whether large language models (LLMs) can accelerate program verification by generating useful loop invariants. We introduce Quokka, an evaluation-oriented framework for LLM-based invariant synthesis that provides sound evaluation and achieves state-of-the-art performance. Unlike prior work that treats LLM outputs as noisy symbolic material requiring substantial post-processing, Quokka adopts a simpler and evaluation-centric design that directly validates whether each LLM-generated invariant helps prove the target assertion. We construct a benchmark of 866 instances derived from SV-COMP and evaluate 9 state-of-the-art LLMs across multiple model families. We demonstrate that supervised fine-tuning and Best-of-N sampling yield measurable improvements, and we show that Quokka consistently outperforms prior LLM-based verifiers. Our code and data are publicly available at https://github.com/Anjiang-Wei/Quokka
♻ ☆ DVM: A Bytecode Virtual Machine Approach for Dynamic Tensor Computation
Dynamism is common in AI computation, e.g., the dynamic tensor shapes and the dynamic control flows in models. Due to the long compilation time, existing runtime compilation damages the model efficiency, while the offline compilers either suffer from the long compilation time and device memory footprint to cover all the possible execution instances of a dynamic model, or sacrifice optimization opportunities for usability. In this paper, we rethink the feasibility of runtime compilation for dynamic models and identify that the key for it to work is to speed up the compilation or hide the compilation overhead. To do this, we propose a real-time compiler, DVM. In DVM, we design a runtime operator compiler based on a bytecode virtual machine to perform effective and efficient compilation for each dynamic operator instance given its input. Specifically, instead of compiling programs into machine code, we encode the operator program into bytecode on the CPU and decode the bytecode into virtual instructions for direct execution on the NPU. Based on the runtime operator compiler, we further propose an operator fuser, which performs symbol-deduction-based fusion on static graphs and runtime fusion on dynamic graphs. Both pattern- and stacking-based fusion are supported to increase fusion opportunities. Evaluation on operators, subgraphs, and models shows that, compared with TorchInductor, PyTorch-eager and MindSpore-graph-O0, we are up to 11.77$\times$ better in terms of the operator/model efficiency and up to 5 orders of magnitude faster in terms of the maximum compilation time.
Operating Systems 2
☆ HACache: Leveraging Read Performance with Cache in a Heterogeneous Array
In cost-sensitive deployments, RAID arrays may combine SSDs with different performance levels. Such heterogeneity arises when aging SSDs degrade yet remain usable, or when failed drives are replaced with new devices of explicitly better performance. While this reduces procurement cost, it creates performance challenges: the traditional striping mechanism distributes requests evenly, so slower SSDs become bottlenecks, leaving faster ones underutilized and limiting overall bandwidth to that of the slowest drive. To address this, we propose HACache (Heterogeneity Adaptive Cache) for read-intensive workloads. HACache introduces high-performance SSDs as read caches to rebalance request distribution. First, we formalize the request diversion problem and solve it formally. Second, to support searching for optimal diversion ratios at runtime, HACache adopts a two-phase request diversion ratio adjustment mechanism. Finally, a cache capacity regulation mechanism adapts quotas for each backend SSD based on hit rates and request diversion needs. This design maximizes bandwidth utilization. Experiments show HACache improves heterogeneous RAID read performance significantly, with bandwidth gains of about 35% in typical mixed configurations.
comment: 11 pages, 16 figures
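The striping bottleneck described in the abstract can be made concrete with a small calculation. This is a sketch under simplifying assumptions, not HACache's diversion algorithm: the drive bandwidths are made-up numbers, and the "balanced" figure is the idealized upper bound reached when each drive receives requests in proportion to its speed.

```python
def striped_bandwidth(bandwidths):
    """Even striping: every drive serves an equal share, so the slowest drive gates the array."""
    return len(bandwidths) * min(bandwidths)


def proportional_shares(bandwidths):
    """Request fractions that keep every drive equally busy (share proportional to speed)."""
    total = sum(bandwidths)
    return [b / total for b in bandwidths]


def balanced_bandwidth(bandwidths):
    """Idealized aggregate bandwidth when requests follow the proportional shares."""
    return sum(bandwidths)


# Hypothetical array: three fast SSDs and one aged SSD, in MB/s.
drives = [3000, 3000, 3000, 1500]
print(striped_bandwidth(drives))    # 6000
print(balanced_bandwidth(drives))   # 10500
print(proportional_shares(drives))
```

In this toy array, even striping leaves 4500 MB/s of the fast drives' capacity unused, which is the gap a diversion mechanism tries to close.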
☆ DAXFS: A Lock-Free Shared Filesystem for CXL Disaggregated Memory
CXL (Compute Express Link) enables multiple hosts to share byte-addressable memory with hardware cache coherence, but no existing filesystem exploits this for lock-free multi-host coordination. We present DaxFS, a Linux filesystem for CXL shared memory that uses cmpxchg atomic operations, which CXL makes coherent across host boundaries, as its sole coordination primitive. A CAS-based hash overlay enables lock-free concurrent writes from multiple hosts without any centralized coordinator. A cooperative shared page cache with a novel multi-host clock eviction algorithm (MH-clock) provides demand-paged caching in shared DAX memory, with fully decentralized victim selection via cmpxchg. We validate multi-host correctness using QEMU-emulated CXL 3.0, where two virtual hosts share a memory region with TCP-forwarded atomics. Under cross-host contention, DaxFS maintains >99% CAS accuracy with no lost updates. On single-host DRAM-backed DAX, DaxFS exceeds tmpfs throughput across all write workloads, achieving up to 2.68x higher random write throughput with 4 threads and 1.18x higher random read throughput at 64 KB. Preliminary GPU microbenchmarks show that the cmpxchg-based design extends to GPU threads performing page cache operations at PCIe 5.0 bandwidth limits.
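To illustrate how victim selection can be coordinated with compare-and-swap alone, here is a generic clock (second-chance) sweep. It is not the MH-clock algorithm from the abstract, whose details are not given there. The `AtomicCell` class is a hypothetical stand-in that emulates cmpxchg with a lock, since Python has no hardware CAS, and the three slot states are assumptions.

```python
import threading


class AtomicCell:
    """Emulates a word that supports compare-and-swap; the lock stands in for cmpxchg."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


# Hypothetical slot states for the sketch.
CLAIMED, UNREFERENCED, REFERENCED = 0, 1, 2


def select_victim(slots, hand):
    """Sweep the clock hand; clear reference bits and claim the first unreferenced slot."""
    n = len(slots)
    for _ in range(2 * n + 1):
        i = hand.load()
        hand.compare_and_swap(i, (i + 1) % n)  # losing this race just means another caller advanced it
        if slots[i].compare_and_swap(UNREFERENCED, CLAIMED):
            return i  # exactly one caller can win this CAS, so the victim is unique
        slots[i].compare_and_swap(REFERENCED, UNREFERENCED)  # second chance
    return None  # every slot is already claimed


slots = [AtomicCell(REFERENCED), AtomicCell(UNREFERENCED), AtomicCell(REFERENCED)]
hand = AtomicCell(0)
print(select_victim(slots, hand))  # 1
```

No caller holds a lock across the sweep: each step either succeeds atomically or is retried on the next slot, which is the property that makes decentralized selection possible.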
Networking and Internet Architecture 10
☆ CIVIC: Cooperative Immersion Via Intelligent Credit-sharing in DRL-Powered Metaverse
The Metaverse faces complex resource allocation challenges due to diverse Virtual Environments (VEs), Digital Twins (DTs), dynamic user demands, and strict immersion needs. This paper introduces CIVIC (Cooperative Immersion Via Intelligent Credit-sharing), a novel framework optimizing resource sharing among multiple Metaverse Service Providers (MSPs) to enhance user immersion. Unlike existing methods, CIVIC integrates VE rendering, DT synchronization, credit sharing, and immersion-aware provisioning within a cooperative multi-MSP model. The resource allocation problem is formulated as two NP-hard challenges: a non-cooperative setting where MSPs operate independently and a cooperative setting utilizing a General Credit Pool (GCP) for dynamic resource sharing. Using Deep Reinforcement Learning (DRL) for tuning resources and managing cooperating MSPs, CIVIC achieves 12-36% higher request completion, 23-70% higher fulfillment rates, 20-60% more served clients, and up to 51% more fairly distributed requests, all with competitive costs. Extensive experiments demonstrate CIVIC's resilience, adaptability, and robust performance under dynamic load conditions and unexpected demand surges, making it suitable for real-world distributed Metaverse infrastructures.
comment: Journal submission; 19 pages; 9 figures
☆ Real-Time and Scalable Zak-OTFS Receiver Processing on GPUs
Orthogonal time frequency space (OTFS) modulation offers superior robustness to high-mobility channels compared to conventional orthogonal frequency-division multiplexing (OFDM) waveforms. However, its explicit delay-Doppler (DD) domain representation incurs substantial signal processing complexity, especially with increased DD domain grid sizes. To address this challenge, we present a scalable, real-time Zak-OTFS receiver architecture on GPUs through hardware--algorithm co-design that exploits DD-domain channel sparsity. Our design leverages compact matrix operations for key processing stages, a branchless iterative equalizer, and a structured sparse representation of the DD-domain channel matrix to significantly reduce computational and memory overhead. These optimizations enable low-latency processing that consistently meets the 99.9-th percentile real-time processing deadline. The proposed system achieves up to 906.52 Mbps throughput with a DD grid size of (16384,32) using 16QAM modulation over 245.76 MHz bandwidth. Extensive evaluations under a Vehicular-A channel model demonstrate strong scalability and robust performance across CPU (Intel Xeon) and multiple GPU platforms (NVIDIA Jetson Orin, RTX 6000 Ada, A100, and H200), highlighting the effectiveness of compute-aware Zak-OTFS receiver design for next-generation (NextG) high-mobility communication systems.
comment: This work has been submitted to the IEEE for possible publication
☆ Computing the Exact Pareto Front in Average-Cost Multi-Objective Markov Decision Processes
Many communication and control problems are cast as multi-objective Markov decision processes (MOMDPs). The complete solution to an MOMDP is the Pareto front. Much of the literature approximates this front via scalarization into single-objective MDPs. Recent work has begun to characterize the full front in discounted or simple bi-objective settings by exploiting its geometry. In this work, we characterize the exact front in average-cost MOMDPs. We show that the front is a continuous, piecewise-linear surface lying on the boundary of a convex polytope. Each vertex corresponds to a deterministic policy, and adjacent vertices differ in exactly one state. Each edge is realized as a convex combination of the policies at its endpoints, with the mixing coefficient given in closed form. We apply these results to a remote state estimation problem, where each vertex on the front corresponds to a threshold policy. The exact Pareto front and solutions to certain non-convex MDPs can be obtained without explicitly solving any MDP.
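A small sketch may help picture the geometry described above. It assumes a hypothetical list of per-policy cost vectors (lower is better), filters the non-dominated ones, and interpolates along the segment between two adjacent vertices. The interpolation works in cost space only; the closed-form mapping from a policy mixing coefficient to a point on the edge is the paper's result and is not reproduced here.

```python
def pareto_front(points):
    """Return the non-dominated cost vectors (lower is better in every coordinate)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return sorted(p for p in points if not any(dominates(q, p) for q in points))


def point_on_edge(cost_a, cost_b, lam):
    """Cost vector at fraction lam along the segment from cost_b (lam=0) to cost_a (lam=1)."""
    return tuple(lam * a + (1 - lam) * b for a, b in zip(cost_a, cost_b))


# Hypothetical average costs of five deterministic policies for two objectives.
policy_costs = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (4.0, 4.0)]
front = pareto_front(policy_costs)
print(front)                                    # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
print(point_on_edge(front[1], front[2], 0.5))   # (3.0, 2.0)
```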
☆ Q2NS Demo: A Quantum Network Simulator Based on ns-3
Q2NS is an open-source quantum network simulator built on ns-3, the de facto standard for classical network simulation. By inheriting ns-3's mature classical stack and event-driven execution model, Q2NS enables faithful co-simulation of quantum-network dynamics and classical signaling, a core requirement for the functioning of any quantum network. Its modular architecture is designed for extensibility, with pluggable quantum-state backends (state-vector, density matrix, stabilizer) and a clean separation between network control and node-level operations. Q2NS comes with a quantum network visualizer Q2NSViz, supporting interactive inspection of both physical- and entanglement-induced connectivity graphs, helping users interpret protocol behavior and entanglement manipulation processes. We present a demonstration of Q2NS, highlighting its ability to capture and simulate the coexistence of quantum and classical communication. The proposed demonstration presents quantum communication scenarios of increasing complexity: from entanglement distribution basics to multipartite graph-state manipulation, complemented by pre-loaded examples in Q2NSViz that require no prior quantum communication or coding experience.
comment: This work has been funded by the European Union under Horizon Europe ERC-CoG grant QNattyNet, n.101169850. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them
☆ Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction
Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $ρ\geq 0.82$ compared to $ρ\geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.
comment: 6 pages, 6 figures
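Two quantities in this abstract are easy to check numerically: the cost of factored versus full attention on a $T \times F$ grid, and the behavior of a two-state Markov interference model. The sketch below is illustrative only; the grid size and the transition probabilities `p_block` and `p_clear` are assumed values, not parameters from the paper.

```python
import random


def full_attention_cost(T, F):
    """Pairwise interactions when all T*F grid positions attend to each other."""
    return (T * F) ** 2


def factored_attention_cost(T, F):
    """Frequency-axis attention per snapshot plus time-axis attention per bin: T*F^2 + F*T^2."""
    return T * F**2 + F * T**2


def simulate_occupancy(steps, p_block, p_clear, seed=0):
    """Two-state chain for one sub-band: 0 = clear, 1 = blocked by interference."""
    rng = random.Random(seed)
    state, trace = 0, []
    for _ in range(steps):
        trace.append(state)
        if state == 0:
            state = 1 if rng.random() < p_block else 0
        else:
            state = 0 if rng.random() < p_clear else 1
    return trace


def stationary_occupancy(p_block, p_clear):
    """Long-run fraction of time the sub-band is blocked."""
    return p_block / (p_block + p_clear)


print(full_attention_cost(16, 64))       # 1048576
print(factored_attention_cost(16, 64))   # 81920
print(stationary_occupancy(0.1, 0.1))    # 0.5
print(simulate_occupancy(20, 0.1, 0.1))
```

Small transition probabilities produce long runs of the same state, which is the bursty occupancy the abstract refers to.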
☆ Quantum Networking Fundamentals: From Physical Protocols to Network Engineering
The realization of the Quantum Internet promises transformative capabilities in secure communication, distributed quantum computing, and high-precision metrology. However, transitioning from laboratory experiments to a scalable, multi-tenant network utility introduces deep orchestration challenges. Current development is often siloed within physics communities, prioritizing hardware, while the classical networking community lacks architectural models to manage fragile quantum resources. This tutorial bridges this divide by providing a network-centric view of quantum networking. We dismantle idealized assumptions in current simulators to address the "simulation-reality gap," recasting them as explicit control-plane constraints. To bridge this gap, we establish Software-Defined Quantum Networking (SDQN) as a prerequisite for scale, prioritizing a symbiotic, dual-plane architecture where classical control dictates quantum data flow. Specifically, we synthesize reference models for SDQN and the Quantum Network Operating System (QNOS) for hardware abstraction, and adapt a Quantum Network Utility Maximization (Q-NUM) framework as a unifying mathematical lens for engineers to reason about trade-offs between entanglement routing, scheduling, and fidelity. Furthermore, we analyze Distributed Quantum AI (DQAI) over imperfect networks as a case study, illustrating how physical constraints such as probabilistic stragglers and decoherence dictate application-layer viability. Ultimately, this tutorial equips network engineers with the tools required to transition quantum networking from a bespoke physics experiment into a programmable, multi-tenant global infrastructure.
comment: Submitted to IEEE Communications Surveys and Tutorials
☆ Air-to-Air Channel Characterization for UAV Communications at 3.4 GHz
Uncrewed Aerial Vehicle (UAV) networks require accurate Air-to-Air (A2A) channel models, but most existing work focuses on Air-to-Ground links and leaves the sub-6 GHz A2A channel poorly characterized. We present preliminary 3.4 GHz A2A channel measurements collected with a lightweight, reconfigurable, open-source channel sounder built from USRP B210 software-defined radios and a high-precision GNSS-disciplined oscillator mounted on two UAVs. Measurements were conducted at the AERPAW Lake Wheeler testbed using a spherical flight trajectory around a second drone to capture channel behavior over varying altitudes, elevation angles, and relative headings. From these data, we analyze fundamental channel properties, extract channel impulse responses, model fading behavior as a function of link geometry, and characterize fading statistics including RMS delay spread. The resulting dataset and analysis provide a more realistic basis for the design, emulation, and evaluation of physical-layer and MAC protocols for next-generation UAV communication networks.
comment: Accepted for publication at Aeroconf 2026
♻ ☆ Tackling Non-IIDness in HAPS-Aided Federated Learning
High-altitude platform stations (HAPS) enable large-scale federated learning (FL) in non-terrestrial networks (NTN) by providing wide-area coverage and predominantly line-of-sight (LoS) connectivity to many ground users. However, practical deployments face heterogeneous and non-independently and identically distributed (non-IID) client data, which degrades accuracy and slows convergence. We propose a weighted attribute-based client selection strategy that leverages server-side indicators: historical traffic behavior, instantaneous channel quality, computational capability, and prior-round learning contribution. At each round, the HAPS computes a composite score and selects the top clients, while adapting attribute weights online based on their correlation with validation-loss improvement. We further provide theoretical justification that traffic-derived uniformity can serve as a proxy for latent data heterogeneity, enabling selection of client subsets with reduced expected non-IIDness. Simulations demonstrate improved test accuracy, faster convergence, and lower training loss compared with random, resource-only, and single-attribute baselines.
comment: Submitted to IEEE for possible publication
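The composite-score selection step can be sketched as a weighted sum followed by a top-k cut. The attribute names, their normalization to [0, 1], and the equal weights below are assumptions for illustration; the online weight adaptation described in the abstract is not modeled.

```python
def select_clients(attributes, weights, k):
    """Score each client as a weighted sum of its attributes and return the top k."""
    scores = {
        cid: sum(weights[name] * value for name, value in attrs.items())
        for cid, attrs in attributes.items()
    }
    # Sort by descending score; break ties by client id for determinism.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ranked[:k], scores


# Hypothetical server-side indicators, each already normalized to [0, 1].
clients = {
    "a": {"traffic": 0.9, "channel": 0.5, "compute": 0.7, "contribution": 0.2},
    "b": {"traffic": 0.4, "channel": 0.9, "compute": 0.6, "contribution": 0.8},
    "c": {"traffic": 0.2, "channel": 0.3, "compute": 0.9, "contribution": 0.1},
}
weights = {"traffic": 0.25, "channel": 0.25, "compute": 0.25, "contribution": 0.25}

selected, scores = select_clients(clients, weights, k=2)
print(selected)  # ['b', 'a']
```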
♻ ☆ Beyond Traffic Matrix: DELTA -- A DAG-Aware OCS Logical Topology Optimization for AIDCs
The rapid scaling of large language models (LLMs) exacerbates communication bottlenecks in AI data centers (AIDCs). To overcome this, optical circuit switches (OCS) are increasingly adopted for their superior bandwidth capacity and energy efficiency. However, their reconfiguration overhead precludes intra-iteration topology update, necessitating a priori engineering of a static topology to absorb time-varying LLM traffic. Existing methods engineer these topologies based on traffic matrices. However, this representation obscures the bursty concurrent bandwidth demands dictated by parallelization strategies and fails to account for the independent channels required for concurrent communication. To address this, we propose DELTA, an efficient logical topology optimization framework for AIDCs that leverages the computation-communication directed acyclic graph (DAG) to encode time-varying traffic patterns into a Mixed-Integer Linear Programming (MILP) model, while exploiting the temporal slack of non-critical tasks to save optical ports without penalizing iteration makespan. By pioneering a variable-length time interval formulation, DELTA significantly reduces the solution space compared to the fixed-time-step formulation. To scale to thousand-GPU clusters, we design a dual-track acceleration strategy that combines search space pruning (reducing complexity from quadratic to linear) with heuristic hot-starting. Evaluations on large-scale LLM workloads show that DELTA reduces communication time by up to 17.5% compared to state-of-the-art traffic-matrix-based baselines. Furthermore, the framework reduces optical port consumption by at least 20%; dynamically reallocating these surplus ports to bandwidth-bottlenecked workloads reduces their performance gap relative to ideal non-blocking electrical networks by up to 26.1%, ultimately enabling most workloads to achieve near-ideal performance.
♻ ☆ Cooperative Edge Caching with Large Language Model in Wireless Networks
Cooperative edge caching in overlapping zones couples Base Station (BS) decisions, making content replacement sensitive to spatial topology and temporal reuse. Conventional heuristics suffer from myopia, while Deep Reinforcement Learning relies on brittle numerical representations and needs prohibitive retraining under topological or traffic dynamics. This paper studies a centralized, cooperative multi-BS cache-replacement controller driven by a Large Language Model (LLM) within a deterministic text-to-action loop. At each time slot, the global cache state is rendered into a prompt encapsulating each BS's inventory, deduplicated requests, and multi-scale frequency summaries. The LLM generates one decision line per BS. A strict parser and feasibility checker then either accept the joint action or fall back to an all-BS NoOp action. We align the LLM via two-stage training: Supervised Fine-Tuning on look-ahead expert trajectories to acquire action syntax and robust initialization, followed by Group Relative Policy Optimization. This employs an 'opportunity-aware' reward, using multi-step cooperative hit rate gains relative to a NoOp baseline as the primary signal, plus penalties for invalid outputs. We focus on reactive replacement of equal-sized files, max one replacement per BS per slot, and insertions restricted to current requests. Evaluating on identical request traces and association graphs, our orchestrator approaches a single-step exhaustive-search reference (0.610 vs. 0.617 in a 5-BS scenario), surpasses classical baselines (+4.1% over least-frequently used), and exhibits robust zero-shot transfer across cache capacity, library size, popularity skewness, and user density. Code is available at https://github.com/gracefulning/CoopLLM-Cache.
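The parse-then-check-then-fall-back step can be sketched as follows. The decision-line syntax (`BS<i>: NOOP` or `BS<i>: REPLACE <old> -> <new>`) is a hypothetical format invented for this example; the abstract does not specify the actual grammar. The feasibility rules mirror the stated constraints: one line per BS, at most one replacement per BS, and insertions restricted to currently requested files.

```python
import re

# Hypothetical decision-line grammar; the real prompt/response format may differ.
LINE = re.compile(r"BS(\d+): (NOOP|REPLACE (\d+) -> (\d+))")


def parse_joint_action(text, inventories, requests):
    """Return {bs: None | (evict, insert)}; any violation yields the all-BS NoOp action."""
    noop = {bs: None for bs in inventories}
    lines = text.strip().splitlines()
    if len(lines) != len(inventories):
        return noop
    action = {}
    for line in lines:
        match = LINE.fullmatch(line.strip())
        if match is None:
            return noop
        bs = int(match.group(1))
        if bs not in inventories or bs in action:
            return noop  # unknown BS or duplicate line
        if match.group(2) == "NOOP":
            action[bs] = None
            continue
        evict, insert = int(match.group(3)), int(match.group(4))
        feasible = (
            evict in inventories[bs]
            and insert not in inventories[bs]
            and insert in requests[bs]
        )
        if not feasible:
            return noop
        action[bs] = (evict, insert)
    return action


inventories = {0: {1, 2}, 1: {3, 4}}
requests = {0: {5}, 1: {3}}
print(parse_joint_action("BS0: REPLACE 1 -> 5\nBS1: NOOP", inventories, requests))
print(parse_joint_action("BS0: REPLACE 1 -> 9\nBS1: NOOP", inventories, requests))
```

Rejecting the whole joint action on a single bad line keeps the loop deterministic: the environment never executes a partially valid decision.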
Distributed, Parallel, and Cluster Computing 17
☆ What can be computed in average anonymous networks?
We study what deterministic distributed algorithms can compute on random input graphs in extremely weak models of distributed computing: all nodes are anonymous, and in each communication round, nodes broadcast a message to all their neighbors, receive a (multi)set of messages from their neighbors, and update their local state. These correspond to the SB and MB models introduced by Hella et al. [PODC 2012] and are strictly weaker than the standard port-numbering PN and LOCAL models. We investigate what can be computed almost surely on random input graphs. We give a one-round deterministic SB-algorithm using $O(\log n)$-bit messages that computes unique identifiers with high probability on anonymous networks sampled from $G(n,p)$, where $n^{\varepsilon-1} \le p \le 1/2$ and $\varepsilon>0$ is an arbitrarily small constant. This algorithm is inspired by canonical labeling techniques in graph isomorphism testing and can be used to "anonymize" existing distributed graph algorithms designed for the broadcast CONGEST and LOCAL models. In particular, we give a new anonymous algorithm that finds a triangle in $O(1/\varepsilon)$ rounds on the above input distribution. We also investigate computational power of natural analogs of "Monte Carlo" and "Las Vegas" distributed graph algorithms in the random graph setting, and establish some new collapse and hierarchy results. For example, our work shows the collapse of the weak model hierarchy of Hella et al. on $G(n,p)$, as apart from a vanishingly small fraction of input graphs, the SB model is as powerful as LOCAL.
comment: 34 pages, 1 figure, 2 tables
☆ A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using the Slurm workload manager historical logs and GPU performance metrics collected by NVIDIA's Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter at NERSC, an HPE Cray EX system based on NVIDIA A100 GPUs. Using our insights from the resource utilization analysis of VASP applications, we propose a resource prediction framework to predict the average GPU power, maximum GPU utilization, and maximum GPU memory utilization values of heterogeneous HPC system applications to enable more efficient scheduling decisions and power-aware system operation. Our prediction framework consists of two stages: 1) using only the Slurm accounting logs as training data and 2) augmenting the training data with historical GPU profiling metrics collected with DCGM. The maximum GPU utilization predictions using only the Slurm submission features achieve up to 97% accuracy. Furthermore, features engineered from GPU-compute and memory activity metrics exhibit good correlations with average power utilization, and our runtime power usage prediction experiments result in up to 92% prediction accuracy. These findings demonstrate the effectiveness of DCGM metrics in capturing application characteristics and highlight their potential for developing predictive models to support dynamic power management in HPC systems.
comment: 9 pages, 6 figures
☆ Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization
Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.
comment: 8 pages, 4 figures, 2 tables
☆ Optimization Opportunities for Cloud-Based Data Pipeline Infrastructures
Cloud infrastructure supports the efficient operation of data pipelines regarding requirements like cost, speed, and resource utilization. We present an integrated view of optimization opportunities for cloud-based data pipelines by conducting a systematic review of existing literature on optimization approaches to cloud infrastructure performance for data pipelines. Our study contributes a theory of optimization goals like minimizing cost, reducing execution time, and cost-makespan trade-offs, consisting of dimensions such as single vs. multi-cloud, batch vs. stream processing, etc. We highlight gaps in primary research, including the underexploration of multi-tenant environments and lack of industry evaluation, and suggest directions for future research.
☆ GPU-RMQ: Accelerating Range Minimum Queries on Modern GPUs
Range minimum queries are frequently used in string processing and database applications including biological sequence analysis, document retrieval, and web search. Hence, various data structures have been proposed for improving their efficiency on both CPUs and GPUs. Recent work has also shown that hardware-accelerated ray tracing on modern NVIDIA RTX graphics cards can be exploited to answer range minimum queries by expressing queries as rays, which are fired into a scene of triangles representing minima of ranges at different granularities. While these approaches are promising, they suffer from at least one of three issues: severe memory overhead, high index construction time, and low query throughput. This renders these methods practically unusable on larger arrays: for example, the state-of-the-art GPU-based approaches LCA and RTXRMQ exceed the memory capacity of an NVIDIA RTX 4090 GPU for input arrays of size >= 2^29. To tackle these problems, in this work, we present a new approach called GPU-RMQ, which is based on a hierarchical design. GPU-RMQ first constructs a hierarchy of range minimum summaries on top of the original array in a highly parallel fashion. For query answering, only the relevant portions of the hierarchy are then processed in an optimized massively-parallel scan operation. Additionally, GPU-RMQ is hybrid in design, enabling the use of both ray tracing cores and CUDA cores across different levels of the hierarchy to handle queries. Our experimental evaluation shows that GPU-RMQ outperforms the state-of-the-art approaches in terms of query throughput, especially for larger arrays, while offering a significantly lower memory footprint and up to two orders-of-magnitude faster index construction. In particular, it achieves up to ~8x higher throughput than LCA, ~17x higher throughput than RTXRMQ, and up to ~4800x higher throughput compared to an optimized CPU-based approach.
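The idea of answering a query from a hierarchy of range-minimum summaries can be shown with a sequential CPU sketch. This is not GPU-RMQ's implementation: the block size of 4, the function names, and the single-threaded scan are assumptions, and the ray-tracing and CUDA paths are not modeled. Each level stores the minimum of every block of the level below; a query scans the unaligned ends at one level and hands the aligned middle to the next.

```python
def build_hierarchy(values, block=4):
    """Level 0 is the array; each higher level holds per-block minima of the level below."""
    levels = [list(values)]
    while len(levels[-1]) > block:
        prev = levels[-1]
        levels.append([min(prev[i:i + block]) for i in range(0, len(prev), block)])
    return levels


def range_min(levels, lo, hi, block=4):
    """Minimum of values[lo..hi] (inclusive), touching only the relevant parts of each level."""
    best = float("inf")
    level = 0
    while lo <= hi:
        data = levels[level]
        while lo <= hi and lo % block != 0:          # unaligned left end
            best = min(best, data[lo])
            lo += 1
        while lo <= hi and hi % block != block - 1:  # unaligned right end
            best = min(best, data[hi])
            hi -= 1
        if lo > hi:
            break
        if level + 1 == len(levels):                 # top level: scan what is left
            best = min(best, min(data[lo:hi + 1]))
            break
        lo, hi, level = lo // block, hi // block, level + 1  # whole blocks move up a level
    return best


values = [5, 3, 8, 6, 2, 7, 4, 1, 9, 0, 3, 5, 7, 2, 6, 8, 4]
levels = build_hierarchy(values)
print([len(level) for level in levels])  # [17, 5, 2]
print(range_min(levels, 1, 16))          # 0
print(range_min(levels, 4, 7))           # 1
```

With block size b, a query reads at most about 2(b - 1) elements per level, so the work grows with the number of levels rather than with the query length.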
☆ FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
comment: The first two authors contributed equally to this work; listing order is random
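The claim that conjugate-symmetric coefficients reconstruct to real-valued weights follows from a standard DFT property, which a one-dimensional sketch can verify numerically. The helper names and the odd-length, 1-D setting are assumptions chosen for brevity; the paper's experts presumably operate on weight matrices, which is not shown here.

```python
import cmath


def idft(coeffs):
    """Naive inverse discrete Fourier transform of a length-N spectrum."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(coeffs)) / n
        for t in range(n)
    ]


def conjugate_symmetric(dc, positive):
    """Build an odd-length spectrum with X[N-k] = conj(X[k]) from a real DC term
    and the positive-frequency coefficients."""
    return [complex(dc, 0.0)] + list(positive) + [c.conjugate() for c in reversed(positive)]


# Two learnable complex coefficients plus one real DC term give a length-5 real signal.
spectrum = conjugate_symmetric(1.0, [2 + 3j, -1 + 0.5j])
signal = idft(spectrum)
print(max(abs(v.imag) for v in signal) < 1e-9)  # True
print([round(v.real, 4) for v in signal])
```

The spectrum carries five free real numbers (one DC value and two complex coefficients), matching the five real outputs, which is why the symmetric parameterization does not lose information.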
☆ DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
comment: Technical Report. 17 pages. 8 figures
☆ ModTrans: Translating Real-world Models for Distributed Training Simulator
Large-scale distributed training has been a research hot spot in machine learning systems for industry and academia in recent years. However, conducting experiments without physical machines and corresponding resources is difficult. One solution is to leverage distributed training simulators, but current ones like ASTRA-sim do not support importing real-world developed models, which poses challenges for ML researchers seeking to use them. Based on this challenge, we developed ModTrans, a translator supporting format translation from any real-world model to the ASTRA-sim simulator's input, removing the barrier between machine learning experts and machine learning system researchers. The experiment results show that ModTrans's cost is negligible.
♻ ☆ Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational resources, particularly under increasing energy and infrastructure constraints. GPUs have emerged as essential for accelerating such workloads. This study benchmarks four deep learning models (Conv6, VGG16, ResNet18, CycleGAN) using TensorFlow and PyTorch on Intel Xeon CPUs and NVIDIA Tesla T4 GPUs. Our experiments demonstrate that, on average, GPU training achieves speedups ranging from 11x to 246x depending on model complexity, with lightweight models (Conv6) showing the highest acceleration (246x), mid-sized models (VGG16, ResNet18) achieving 51-116x speedups, and complex generative models (CycleGAN) reaching 11x improvements compared to CPU training. Additionally, in our PyTorch vs. TensorFlow comparison, we observed that TensorFlow's kernel-fusion optimizations reduce inference latency by approximately 15%. We also analyze GPU memory usage trends and project requirements through 2025 using polynomial regression. Our findings highlight that while GPUs are essential for sustaining AI's growth, democratized and shared access to GPU resources is critical for enabling research innovation across institutions with limited computational budgets.
♻ ☆ CompressedScaffnew: The First Theoretical Double Acceleration of Communication from Local Training and Compression in Distributed Optimization
In distributed optimization, a large number of machines alternate between local computations and communication with a coordinating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce this burden and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose CompressedScaffnew, the first algorithm for distributed optimization that jointly harnesses these two strategies and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
♻ ☆ MPI-Q: A Message Communication Library for Large-Scale Classical-Quantum Heterogeneous Hybrid Distributed Computing
The classical-quantum system heterogeneity (different data characteristics, execution paradigms, synchronization mechanisms, etc.) renders existing distributed communication mechanisms (e.g., MPI, NCCL) inadequate. This bottleneck severely impairs operational synergy and programming efficiency. Thus, the performance of hybrid applications on classical-quantum heterogeneous infrastructures is directly limited. To address these challenges, this paper proposes a message-passing library tailored for large-scale classical-quantum heterogeneous distributed computing, referred to as MPI-Q. The design centers on three mechanisms. First, it defines a heterogeneous hybrid communication domain that achieves unified management of classical and quantum processes in heterogeneous hybrid systems. Second, it uses a lightweight communication path that allows classical control nodes to send device-ready waveform data directly to quantum MonitorProcesses, avoiding unnecessary relay stages. Third, it establishes a heterogeneous hybrid synchronization mechanism to tackle the problem of timing control for multi-node quantum operations. While retaining the traditional MPI programming model, MPI-Q achieves extension toward quantum subsystems. Experiments on distributed GHZ state preparation demonstrate that this model exhibits near-linear scalability, achieving a maximum speedup of 18.76 times on 24 quantum nodes. This proves that the library can effectively support large-scale heterogeneous hybrid distributed computing applications, filling the technical gap in this field.
♻ ☆ NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage Device API for both intra- and inter-node communications, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
comment: 13 pages, 8 figures, 7 tables
♻ ☆ Near-Optimal Distributed Ruling Sets for Trees and High-Girth Graphs
Given a graph $G=(V,E)$, a $β$-ruling set is a subset $S\subseteq V$ that is i) independent, and ii) every node $v\in V$ has a node of $S$ within distance $β$. In this paper we present almost optimal distributed algorithms for finding ruling sets in trees and high girth graphs in the classic LOCAL model. As our first contribution we present an $O(\log\log n)$-round randomized algorithm for computing $2$-ruling sets on trees, almost matching the $Ω(\log\log n/\log\log\log n)$ lower bound given by Balliu et al. [FOCS'20]. Second, we show that $2$-ruling sets can be solved in $\widetilde{O}(\log^{5/3}\log n)$ rounds in high-girth graphs. Lastly, we show that $O(\log\log\log n)$-ruling sets can be computed in $\widetilde{O}(\log\log n)$ rounds in high-girth graphs matching the lower bound up to triple-log factors. All of these results either improve polynomially or exponentially on the previously best algorithms and use a smaller domination distance $β$.
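The definition of a $β$-ruling set in the abstract above can be made concrete with a small centralized checker. This is only an illustrative sketch of the definition, not the distributed algorithm from the paper; the function name `is_ruling_set` and the adjacency-dict graph representation are assumptions introduced here.

```python
from collections import deque

def is_ruling_set(adj, S, beta):
    """Check whether S is a beta-ruling set of the graph `adj` (dict: node -> neighbors)."""
    S = set(S)
    # (i) Independence: no two nodes of S may be adjacent.
    for u in S:
        if any(v in S for v in adj[u]):
            return False
    # (ii) Domination: multi-source BFS from S yields each node's distance to S.
    dist = {s: 0 for s in S}
    queue = deque(S)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Every node must be reachable from S within distance beta.
    return all(v in dist and dist[v] <= beta for v in adj)

# Toy example: a path on 5 nodes, 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(is_ruling_set(path, {2}, 2))     # the middle node covers everything within 2 hops
print(is_ruling_set(path, {1, 2}, 2))  # adjacent nodes violate independence
```

On the path, the single middle node is a 2-ruling set but not a 1-ruling set, which shows how a larger $β$ permits a sparser set.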
♻ ☆ Fast Deterministic Distributed Degree Splitting
We obtain better algorithms for computing more balanced orientations and degree splits in LOCAL. Important to our result is a connection to the hypergraph sinkless orientation problem [BMNSU, SODA'25]. We design an algorithm of complexity $\mathcal{O}(\varepsilon^{-1} \cdot \log n)$ for computing a balanced orientation with discrepancy at most $\varepsilon \cdot \mathrm{deg}(v)$ for every vertex $v \in V$. This improves upon a previous result by [GHKMSU, Distrib. Comput. 2020] of complexity $\mathcal{O}(\varepsilon^{-1} \cdot \log \varepsilon^{-1} \cdot (\log \log \varepsilon^{-1})^{1.71} \cdot \log n)$. Further, we show that this result can also be extended to compute undirected degree splits with the same discrepancy and in the same runtime. As an application we show that $(3 / 2 + \varepsilon)Δ$-edge coloring can now be solved in $\mathcal{O}(\varepsilon^{-1} \cdot \log^2 Δ\cdot \log n + \varepsilon^{-2} \cdot \log n)$ rounds in LOCAL. Note that for constant $\varepsilon$ and $Δ= \mathcal{O}(2^{\log^{1/3} n})$ this runtime matches the current state-of-the-art for $(2Δ- 1)$-edge coloring in [Ghaffari & Kuhn, FOCS'21].
♻ ☆ OSGym: Scalable OS Infra for Computer Use Agents
Training computer use agents requires full-featured OS sandboxes with GUI environments, which consume substantial hardware resources as the number of sandboxes scales. Stochastic errors arising from diverse software execution within these sandboxes further demand robust infrastructure design and reliable error recovery. We present OSGym, a scalable OS environment infrastructure for computer use agents, built around these key optimization strategies: (1) Decentralized OS state management, which isolates failures to individual replicas and significantly enhances overall system reliability; (2) Hardware-aware OS replica orchestration, which addresses CPU-bounded scaling bottlenecks and substantially reduces compute overhead; (3) KVM virtualization with copy-on-write disk management, which shares a common bootable disk across VM instances and provisions only instance-specific modifications, reducing physical disk consumption by 88% and increasing disk provisioning speed by 37 times; and (4) Robust container pool with multi-layer fault recovery. Together, these optimizations yield strong scalability and resource efficiency: OSGym manages over a thousand OS replicas under constrained resources, supports parallel trajectory generation at 1420 multi-turn trajectories per minute, and reduces per-replica cost to 0.2-0.3 USD per day, a 90% reduction over standard deployment. Our experiments validate OSGym across end-to-end pipelines for data collection and training for computer use agents. We believe OSGym establishes a new foundation for scalable, general-purpose computer use agent research.
♻ ☆ Differential Privacy for Secure Machine Learning in Healthcare IoT-Cloud Systems
Healthcare has become exceptionally sophisticated, as wearables and connected medical devices revolutionize remote patient monitoring, emergency response, medication management, diagnosis, and predictive and prescriptive analytics. Internet of Things and Cloud computing integrated systems (IoT-Cloud) facilitate sensing, automation, and processing for these healthcare applications. While real-time response is crucial for alleviating patient emergencies, protecting patient privacy is paramount in data-driven healthcare. In this paper, we propose a multi-layer IoT, Edge, and Cloud architecture to enhance emergency healthcare response times by distributing tasks based on response criticality and data permanence requirements. We ensure patient privacy through a Differential Privacy framework applied across several machine learning models: K-means, Logistic Regression, Random Forest, and Naive Bayes. We establish a comprehensive threat model identifying three adversary classes and evaluate Laplace, Gaussian, and hybrid noise mechanisms across varying privacy budgets, with supervised algorithms achieving up to 83.6% accuracy. The proposed hybrid Laplace-Gaussian noise mechanism with adaptive budget allocation provides a balanced approach, offering moderate tails and better privacy-utility trade-offs for both low and high-dimension datasets. At the practical threshold of $\varepsilon$=5.0, supervised algorithms achieve 80-81% accuracy while reducing attribute inference attacks by up to 18% and data reconstruction correlation by 70%. We further enhance security through Blockchain integration, which ensures trusted communication through time-stamping, traceability, and immutability for analytics applications. Edge computing demonstrates 8$\times$ latency reduction for emergency scenarios, validating the hierarchical architecture for time-critical operations.
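As a rough illustration of the Laplace noise mechanism mentioned in the abstract above, the sketch below releases a differentially private mean of bounded readings. It is a generic textbook construction, not the paper's framework or its hybrid Laplace-Gaussian mechanism; the helper names `laplace_noise` and `private_mean`, the clipping bounds, and the toy readings are all assumptions made for this example, and the dataset size is assumed to be public.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP estimate of the mean of values clipped to [lower, upper]."""
    if epsilon <= 0 or not values:
        raise ValueError("need epsilon > 0 and at least one value")
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    # Changing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only so the demo is reproducible
readings = [72, 80, 65, 90, 77]  # toy values, not taken from the paper
print(private_mean(readings, lower=40, upper=180, epsilon=5.0, rng=rng))
```

The noise scale is `sensitivity / epsilon`, so a larger privacy budget such as the $\varepsilon$=5.0 threshold quoted above adds less noise than a small budget would, at the price of a weaker privacy guarantee.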
♻ ☆ Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Can AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
Computation and Language 35
☆ When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.
comment: 15 pages, 8 figures
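The Advantage Modification idea in the abstract above (folding a shortcut concept score into the GRPO advantage) might look roughly like the following. This is a hypothetical sketch: the abstract does not give the exact formula, so the subtractive penalty, the `penalty` weight, the use of a population standard deviation, and the function names are all assumptions rather than the authors' method.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def modified_advantages(rewards, shortcut_scores, penalty=1.0):
    """Subtract a penalty proportional to each rollout's shortcut score (assumed form)."""
    base = grpo_advantages(rewards)
    return [a - penalty * s for a, s in zip(base, shortcut_scores)]

# Toy group of four rollouts; the first earns reward but scores high on "shortcut".
rewards = [1.0, 0.0, 1.0, 0.0]
shortcut_scores = [0.9, 0.0, 0.0, 0.0]  # hypothetical concept scores in [0, 1]
print(modified_advantages(rewards, shortcut_scores))
```

In this toy group the hacking rollout keeps only a small positive advantage while the legitimate success keeps its full one, so the policy update favors the latter before any inference-time steering is involved.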
☆ A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields, and the Historical Rewiring of Meaning
Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh (Blessed) and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.
☆ Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
Large language models are often not just wrong, but \emph{confidently wrong}: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
☆ Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model's natural semantic scope.
☆ The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language's expressive potential. However, it was suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterpart, forming numerous synonym pairs. This study investigates whether centuries after these borrowings appeared in the Subcontinent their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.
☆ Cost-Efficient Estimation of General Abilities Across Benchmarks
Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
☆ ReFormeR: Learning and Applying Explicit Query Reformulation Patterns
We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition, to name a few. As such, our proposed approach makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.
☆ Adaptive Stopping for Multi-Turn LLM Reasoning
Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: \textbf{When should the model stop?} Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
☆ Test-Time Scaling Makes Overtraining Compute-Optimal
Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust over distinct modeling approaches: measuring joint scaling effect on the task loss and modeling impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well-outside of the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
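The pass@$k$ metric referenced in the abstract above is commonly computed with the standard unbiased estimator from $n$ samples of which $c$ are correct: $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that widely used estimator; whether the $T^2$ laws model pass@$k$ in exactly this form is an assumption, and the function name `pass_at_k` is introduced here for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no correct one;
    # comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct samples out of 10, drawing more samples raises the success chance.
for k in (1, 2, 5):
    print(k, round(pass_at_k(10, 3, k), 4))
```

Because the estimate grows with $k$ while inference cost grows with both $k$ and model size, a smaller, longer-trained model sampled more times can compete under a fixed end-to-end budget, which is the trade-off the abstract describes.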
☆ Assessing Pause Thresholds for empirical Translation Process Research
Text production (and translation) proceeds in stretches of typing interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann 2016), this paper compares three recent approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.
comment: Accepted for Presentation at "Translation in Transition 8, September 2026"
☆ Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
☆ Open-Domain Safety Policy Construction EACL 2026
Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.
comment: EACL 2026 (Findings)
☆ No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57--71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
☆ Procedural Knowledge at Scale Improves Reasoning
Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
☆ Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons, or shades of gray, rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores, and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety, and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTa-v3-large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
☆ M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8\% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6\% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
comment: Preprint. Under Review
☆ Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
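The pass@1 and pass@16 figures above are usually estimated from repeated samples per problem. The abstract does not say which estimator the authors use; the sketch below shows the widely used unbiased pass@k estimator as an assumption, with `n` samples per problem of which `c` are correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(pass_at_k(n=16, c=4, k=1))
    print(pass_at_k(n=16, c=4, k=16))
```

Averaging this quantity over problems gives the benchmark-level score; with `k = 1` it reduces to the fraction of correct samples.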
☆ Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
comment: Project Page: https://aimagelab.github.io/LoT/
☆ Universal YOCO for Efficient Depth Scaling
The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
☆ $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27 M, followed by GLM-5 at \$1.21 M at 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.
comment: 16 pages, 10 figures
☆ LLM REgression with a Latent Iterative State Head
We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
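The iterative refinement described above can be sketched in a few lines. This is a minimal illustration, not the authors' architecture: it uses a single attention head with no learned query, key, or value projections, a plain residual update, and hypothetical names (`cross_attend`, `predict`, `steps`).

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def cross_attend(state, tokens):
    """One cross-attention read: the latent state queries the token vectors."""
    scale = math.sqrt(len(state))
    scores = [sum(q * k for q, k in zip(state, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(state)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def predict(tokens, init_state, w, b, steps=3):
    """Refine the latent state for a few steps, then apply a linear regressor."""
    state = list(init_state)
    for _ in range(steps):
        read = cross_attend(state, tokens)
        # Residual update (an assumption; the abstract does not give the rule).
        state = [s + r for s, r in zip(state, read)]
    return sum(wi * si for wi, si in zip(w, state)) + b


if __name__ == "__main__":
    frozen_token_reps = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.9]]  # toy values
    print(predict(frozen_token_reps, init_state=[0.0, 0.0], w=[1.0, -1.0], b=0.0))
```

In this sketch only `init_state`, `w`, and `b` would be trainable, which mirrors the idea of a small head on top of frozen representations.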
☆ Embarrassingly Simple Self-Distillation Improves Code Generation
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
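The sampling step above can be illustrated at the level of a single next-token distribution. The abstract does not name the truncation scheme, so this sketch assumes nucleus (top-p) truncation; the function names and values are hypothetical, and the subsequent supervised fine-tuning step is not shown.

```python
import math
import random


def temperature_softmax(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    es = [math.exp(x - m) for x in scaled]
    total = sum(es)
    return [e / total for e in es]


def top_p_truncate(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass  # renormalize over the kept tokens
    return out


def sample_token(logits, temperature, top_p, rng):
    probs = top_p_truncate(temperature_softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_token([2.0, 1.0, 0.1, -3.0], temperature=1.2, top_p=0.9, rng=rng))
```

Truncation removes the low-probability tail, while a higher temperature flattens the distribution over the tokens that remain; the two knobs pull in different directions, which is one way to read the precision-exploration conflict the abstract mentions.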
☆ True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies
This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.
☆ Screening Is Enough
A core limitation of standard softmax attention is that it does not define a notion of absolute query--key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2$\times$ at 100K context length.
comment: 21 pages, 13 figures
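The contrast between relative and absolute relevance can be made concrete with a toy example. The abstract does not give the screening formula, so `screened_weights` below is a hypothetical thresholding rule chosen only to illustrate the idea; it is not the Multiscreen mechanism.

```python
import math


def softmax_weights(scores):
    """Standard softmax: always distributes a total mass of 1 across keys."""
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    total = sum(es)
    return [e / total for e in es]


def screened_weights(scores, threshold):
    """Hypothetical screening rule: each key is judged against a fixed
    threshold on its own, so keys below it get exactly zero weight."""
    return [max(0.0, s - threshold) for s in scores]


if __name__ == "__main__":
    irrelevant = [-5.0, -6.0]  # both keys score poorly against the query
    print(softmax_weights(irrelevant))        # still sums to 1
    print(screened_weights(irrelevant, 0.0))  # both keys rejected
```

With softmax, the less-bad key still receives most of the mass even though neither key is relevant; under a per-key threshold, both can be discarded.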
♻ ☆ Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation NeurIPS 2025
Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs available in May-June 2025 and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.
comment: Accepted and presented at NeurIPS 2025 AI for Science Workshop
♻ ☆ No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Reliable evaluation of large language models (LLMs) is critical as their deployment rapidly expands, particularly in high-stakes domains such as business and finance. The LLM-as-a-Judge framework, which uses prompted LLMs to evaluate response quality, is appealing due to its scalability, low cost, and strong correlations with human stylistic preferences. However, it remains unclear how accurately these methods can assess response quality in domains where correctness matters more than style. To address this gap, we introduce the Business and Finance Fundamentals Benchmark (BFF-Bench), a dataset of 160 challenging questions and long-form responses authored by financial professionals. These experts subsequently evaluated the correctness of 1,200 responses generated by a diverse set of LLMs on both BFF-Bench and a challenging subset of MT-Bench. With this expert-annotated dataset of judgments (VERDICTS), we analyze the agreement between a suite of automated grading methods and human experts. While we observe that LLM Judges are more reliable than other grading methods, our findings reveal a clear pattern in LLM Judge performance: when not provided with a correct reference, judges show high agreement with human experts only on questions the judges were able to correctly answer themselves. We demonstrate that providing the judges with expert-written references largely mitigates this issue, highlighting the limits of using LLM-as-a-Judge without any form of human verification.
♻ ☆ Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Large language models exhibit systematic negation sensitivity, yet no operational framework exists to measure this vulnerability at deployment scale, especially in high-stakes decisions. We introduce Syntactic Framing Fragility (SFF), a framework for quantifying decision consistency under logically equivalent syntactic transformations. SFF isolates syntactic effects via Logical Polarity Normalization, enabling direct comparison across positive and negative framings while controlling for polarity inversion, and provides the Syntactic Variation Index (SVI) as a robustness metric suitable for CI/CD integration. Auditing 23 models across 14 high-stakes scenarios (39,975 decisions), we establish ground-truth effect sizes for a phenomenon previously characterized only qualitatively and find that open-source models exhibit 2.2$\times$ higher fragility than commercial counterparts. Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents should not act. These patterns are consistent with negation suppression failure documented in prior work, with chain-of-thought reasoning reducing fragility in some but not all cases. We provide scenario-stratified risk profiles and offer an operational checklist compatible with EU AI Act and NIST RMF requirements. Code, data, and scenarios will be released upon publication.
comment: 23 pages, 14 figures
♻ ☆ Semantic Refinement with LLMs for Graph Representations
Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Graph-Exemplar-guided Semantic Refinement (GES) framework for graph representation learning which -- unlike existing LLM-enhanced methods that generate node descriptions without graph context -- leverages structurally and semantically similar nodes from the graph itself to guide semantic refinement. Specifically, a GNN is first trained to produce predictive states, which along with structural and semantic similarity are used to retrieve in-graph exemplars that inform an LLM in refining node descriptions. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on semantics-rich and structure-dominated graphs, demonstrating the effectiveness of data-centric semantic refinement under structure-semantics heterogeneity.
♻ ☆ SAKE: Structured Agentic Knowledge Extrapolation for Complex LLM Reasoning via Reinforcement Learning
Knowledge extrapolation is the process of inferring novel information by combining and extending existing knowledge that is explicitly available. It is essential for solving complex questions in specialized domains where retrieving comprehensive external knowledge is impractical. We propose SAKE (Structured Agentic Knowledge Extrapolation), an RL-powered agentic framework that trains LLMs to autonomously retrieve and extrapolate structured knowledge through tool-augmented reinforcement learning. SAKE defines two external KG tools: entity group construction and cross-group triplet retrieval. The model learns to interleave these two retrieval tools during a three-turn rollout: extracting key entities, filtering relevant concept groups, and associative reasoning by constructing new triplets through analogy. The entire pipeline is optimized end-to-end with GRPO using a curriculum reward, teaching the model what to retrieve and how to reason over it. Our experiments show that a SAKE fine-tuned Qwen2.5-7B model surpasses GPT-3.5-Turbo with state-of-the-art agentic KG reasoning on both biomedical (75.4% vs. 70.1%) and commonsense (81.3% vs. 74.7%) benchmarks, while reducing token usage by over 90%. These results demonstrate that associative reasoning over incomplete structured knowledge does not require large models with complex, multi-step prompting, and can instead be learned end-to-end by small, open-weight models through reinforcement learning with the right tools and training signal. Our code is available at https://anonymous.4open.science/r/SAKE-7585.
♻ ☆ Group Representational Position Encoding ICLR 2026
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n \, \omega \, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.
comment: Published in ICLR 2026; Project Page: https://github.com/model-architectures/GRAPE
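The "relative, norm-preserving" property of the multiplicative case can be checked numerically in the simplest setting. The sketch below assumes a single 2D plane with generator L = [[0, -1], [1, 0]], for which exp(n ω L) is the rotation by angle n ω; this is the one-plane RoPE special case, not the general GRAPE construction, and the function names are illustrative.

```python
import math


def rotate(vec, angle):
    """Apply exp(angle * L) for L = [[0, -1], [1, 0]], i.e. a 2D rotation."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def positional_score(q, k, m, n, omega):
    """Dot product of a query at position m and a key at position n."""
    qm = rotate(q, m * omega)
    kn = rotate(k, n * omega)
    return qm[0] * kn[0] + qm[1] * kn[1]


if __name__ == "__main__":
    q, k, omega = (1.0, 2.0), (0.5, -1.5), 0.1
    # Same offset n - m = 4 at two different absolute positions.
    print(positional_score(q, k, 3, 7, omega))
    print(positional_score(q, k, 10, 14, omega))
```

Because rotations compose additively in the angle, the score depends only on the offset n - m, and the rotated vectors keep their original norm.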
♻ ☆ In-context Learning in Presence of Spurious Correlations
Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to task memorization and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.
♻ ☆ Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference
Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this work, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types: question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.
comment: 9 pages, 5 figures
♻ ☆ A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements ESWC-2026
Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and to compare them systematically are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state-of-the-art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs. However, the resulting CQs are sensitive to the model used to generate them and generally require a further refinement step before they can be used to model requirements.
comment: Revised version (v2) accepted for the 23rd European Semantic Web Conference (ESWC-2026)
♻ ☆ Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models
Inspired by the 'Bias Considerations in Bilingual Natural Language Processing' report by Statistics Canada, this study delves into potential biases in multilingual sentiment analysis between English and French. Given a 50-50 dataset of French and English, we aim to determine if there exists a language bias and explore how the incorporation of more diverse datasets in the future might affect the equity of multilingual Natural Language Processing (NLP) systems. By employing Support Vector Machine (SVM) and Naive Bayes models on three balanced datasets, we reveal potential biases in multilingual sentiment classification. Utilizing Fairlearn, a tool for assessing bias in machine learning models, we obtain nuanced outcomes: French data outperforms English across accuracy, recall, and F1 score metrics in both models, hinting at a language bias favoring French. However, Fairlearn's metrics suggest that the SVM approaches equitable levels with a demographic parity ratio of 0.963, 0.989, and 0.985 for the three separate datasets, indicating near-equitable treatment across languages. In contrast, Naive Bayes demonstrates greater disparities, evidenced by a demographic parity ratio of 0.813, 0.908, and 0.961. These findings reveal the importance of developing equitable multilingual NLP systems, particularly as we anticipate the inclusion of more datasets in various languages in the future.
comment: This is an undergraduate research project. Withdrawing this paper due to errors identified in the cross-validation implementation. These technical flaws invalidate the primary findings and conclusions. The authors no longer stand by the results presented in this version and recommend it not be cited or used as a basis for further research
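For readers unfamiliar with the metric named in the abstract, the demographic parity ratio is the smallest group-level selection rate divided by the largest, so a value of 1.0 means every group receives positive predictions at the same rate. The sketch below is a plain-Python illustration of that definition with made-up data; it does not call Fairlearn and does not reproduce the (withdrawn) results.

```python
from collections import defaultdict


def demographic_parity_ratio(y_pred, groups):
    """Ratio of the lowest to the highest positive-prediction rate by group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, group in zip(y_pred, groups):
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    rates = [pos / total for pos, total in counts.values()]
    if max(rates) == 0:
        # Undefined when no group receives a positive prediction; raising
        # here is a choice made for this sketch.
        raise ValueError("no positive predictions in any group")
    return min(rates) / max(rates)


if __name__ == "__main__":
    preds = [1, 0, 1, 1, 0, 1]                   # toy sentiment predictions
    langs = ["en", "en", "en", "fr", "fr", "fr"]  # toy language labels
    print(demographic_parity_ratio(preds, langs))
```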
♻ ☆ Language Steering for Multilingual In-Context Learning
If large language models operate in a universal semantic space, then switching between languages should require only a simple activation offset. To test this, we take multilingual in-context learning as a case study, where few-shot demonstrations are provided in English but the test query is in a target language. We propose language vectors, computed as the mean activation difference between parallel source and target language examples at a particular layer, and added as an offset to hidden states at inference time to shift the model's internal representations toward the target language. We evaluate our method across three multilingual tasks spanning 19 languages and three models. Our results show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested, demonstrating that a simple activation offset is sufficient to redirect a model's language mode without any parameter updates. Beyond performance, the vectors encode interpretable linguistic structure, with closely related languages forming tight clusters and vectors transferring across tasks, suggesting that language identity occupies separable and structured directions in a model's activation space.
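The language-vector computation above amounts to a mean of differences followed by an addition. This is a minimal sketch with toy activations in plain lists; the `alpha` scaling factor is a hypothetical knob not mentioned in the abstract, and extracting real hidden states from a model layer is out of scope here.

```python
def language_vector(source_acts, target_acts):
    """Mean activation difference (target minus source) over parallel examples."""
    if len(source_acts) != len(target_acts) or not source_acts:
        raise ValueError("need equally many, non-empty parallel examples")
    n = len(source_acts)
    dim = len(source_acts[0])
    return [
        sum(t[i] - s[i] for s, t in zip(source_acts, target_acts)) / n
        for i in range(dim)
    ]


def steer(hidden, vector, alpha=1.0):
    """Add the language vector as an offset to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    english = [[1.0, 0.0], [3.0, 2.0]]  # toy activations for source sentences
    target = [[2.0, 2.0], [4.0, 4.0]]   # toy activations for their translations
    vec = language_vector(english, target)
    print(steer([0.5, 0.5], vec))
```

No parameters are updated: the vector is computed once from parallel examples and then added at inference time.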
Software Engineering 43
☆ CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate--test--refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
☆ Fuzzing with Agents? Generators Are All You Need
Modern generator-based fuzzing techniques combine lightweight input generators with coverage-guided mutation as a method of exploring deep execution paths in a target program. A complementary approach in prior research focuses on creating highly customized, domain-specific generators that encode structural and semantic logic sufficient to reach deep program states; the challenge comes from the overhead of writing and testing these complex generators. We investigate whether AI coding agents can automatically synthesize such target-specific generators, and whether the resulting generators are strong enough to obviate the need for coverage guidance and mutation entirely. Our approach, Gentoo, consists of an LLM coding agent (provided terminal access and source code of the fuzz target and its library) instructed to iteratively synthesize and refine an input generator, and optionally provided fine-grained predicate-level coverage feedback. We evaluate three configurations of Gentoo against human-written generators on fuzz targets for 7 real-world Java libraries. Our findings show that agent-synthesized generators achieve statistically significantly higher branch coverage than human-written baseline generators on 4 of 7 benchmarks. Critically, the use of coverage guidance and mutation strategies is not statistically significantly beneficial for agent-synthesized generators, but is significant for all human-written generators, suggesting that structural and semantic logic encoded in the agent generators makes coverage guidance largely unnecessary.
☆ Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.
comment: 7 pages, 5 figures, accepted to the 2nd International Workshop on Responsible Software Engineering (ResponsibleSE 2026), co-located with FSE
☆ AI Engineering Blueprint for On-Premises Retrieval-Augmented Generation Systems
Retrieval-augmented generation (RAG) systems are gaining traction in enterprise settings, yet stringent data protection regulations prevent many organizations from using cloud-based services, necessitating on-premises deployments. While existing blueprints and reference architectures focus on cloud deployments and lack enterprise-grade components, comprehensive on-premises implementation frameworks remain scarce. This paper aims to address this gap by presenting a comprehensive AI engineering blueprint for scalable on-premises enterprise RAG solutions. It is designed to address common challenges and streamline the integration of RAG into existing enterprise infrastructure. The blueprint provides: (1) an end-to-end reference architecture described using the 4+1 view model, (2) a reference application for on-premises deployment, and (3) best practices for tooling, development, and CI/CD pipelines, all publicly available on GitHub. Ongoing case studies and expert interviews with industry partners will assess its practical benefits.
comment: Accepted at ICSA 2026 Posters Track
☆ Identifying Privacy Concerns in Upcoming Software Release: A Peek into the Future
Identifying the features to be released in the next version of software, from a pool of potential candidates, is a challenging problem. User feedback from app stores is frequently used by software vendors for the evolution of apps across releases. Privacy feedback, although smaller in volume, carries a larger impact on an app's success. Multiple existing works have focused on summarizing privacy concerns at the app level and have also shown that developers utilize feedback to implement security and privacy-related changes in subsequent releases. However, the current literature offers little support for release managers and developers in identifying privacy concerns prior to release. This gap exists because user reviews are typically available in app stores only after new features of a software system are released. In this paper, we introduce Pre-PI, a novel approach that summarizes privacy concerns for to-be-released features. Our method first maps existing features to semantically similar privacy reviews to learn feature-privacy review relations. We then simulate feedback for candidate features and generate concise summaries of privacy concerns. We evaluate Pre-PI across three real-world apps, and compare it with Hark, a state-of-the-art method that relies on post-release user feedback to identify privacy concerns. Results show that Pre-PI generates additional valid privacy concerns and identifies these concerns earlier than Hark, allowing proactive mitigation prior to release.
comment: Revising manuscript for IEEE Transactions on Software Engineering
☆ Automated Generation of High-Quality Bug Reports for Android Applications
Most defects in mobile applications are visually observable on the device screen. To track these defects, users, testers, and developers must manually submit bug reports, especially in the absence of crashes. However, these reports are frequently ambiguous or inaccurate, often omitting essential components such as the Observed Behavior (OB), Expected Behavior (EB), or Steps to Reproduce (S2Rs). Low-quality reports hinder developers' ability to understand and reproduce defects, delaying resolution and leading to incorrect or unresolvable fixes. In this paper, we posit that providing specific app-related information (e.g., GUI interactions or specific screens where bugs appear) to LLMs as key points of context can assist in automatically generating clear, detailed, and accurate OB, EB, and S2Rs. We built and evaluated a novel approach, BugScribe, that generates bug reports in this way. To support the evaluation, we introduce a unified quality framework that defines correctness and completeness dimensions for OB, EB, and S2Rs. Using 48 bug reports from 26 Android apps, we show that BugScribe produces higher-quality and more accurate components than the original reports and outperforms recent LLM-based baselines. We envision that BugScribe can serve as a practical assistant for testers and developers by enhancing incomplete bug reports with reliable and accurate OB, EB, and S2Rs, thereby streamlining bug resolution and improving mobile app quality.
comment: 12 pages, 2 figures
☆ SERSEM: Selective Entropy-Weighted Scoring for Membership Inference in Code Language Models
As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become critical. We propose SERSEM (Selective Entropy-Weighted Scoring for Membership Inference), a novel white-box attack framework that suppresses uninformative syntactical boilerplate to amplify specific memorization signals. SERSEM utilizes a dual-signal methodology: first, a continuous character-level weight mask is derived through static Abstract Syntax Tree (AST) analysis, spellchecking-based multilingual logic detection, and offline linting. Second, these heuristic weights are used to pool internal transformer activations and calibrate token-level Z-scores from the output logits. Evaluated on a 25,000-sample balanced dataset, SERSEM achieves a global AUC-ROC of 0.7913 on the StarCoder2-3B model and 0.7867 on the StarCoder2-7B model, consistently outperforming the implemented probability-based baselines Loss, Min-K% Prob, and PAC. Our findings demonstrate that focusing on human-centric coding anomalies provides a significantly more robust indicator of verbatim memorization than sequence-level probability averages.
comment: Accepted to the FSE 2026 Poisoned Chalice Competition
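For readers unfamiliar with the baselines named in the abstract above, the following is a minimal sketch of the Min-K% Prob score only, not of SERSEM itself. It assumes per-token log-probabilities have already been obtained from the model; the function name `min_k_prob` and the default `k=0.2` are illustrative choices, not taken from the paper.

```python
def min_k_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    A higher (less negative) score means even the hardest tokens were
    predicted confidently, which is read as a hint of membership.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    # Keep at least one token so short sequences still get a score.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Illustrative (made-up) log-probabilities for a 10-token snippet.
example = [-0.1, -0.2, -5.0, -0.3, -4.0, -0.1, -0.2, -0.1, -0.3, -0.2]
score = min_k_prob(example, k=0.2)
```

Because the score averages only the lowest-probability tokens, a few surprising tokens dominate it, whereas a plain sequence-level loss averages over every token.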
☆ Leveraging Commit Size Context and Hyper Co-Change Graph Centralities for Defect Prediction
File-level defect prediction models traditionally rely on product and process metrics. While process metrics effectively complement product metrics, they often overlook commit size, the number of files changed per commit, despite its strong association with software quality. Network centrality measures on dependency graphs have also proven to be valuable product-level indicators. Motivated by this, we first redefine process metrics as commit-size-aware process metric vectors, transforming conventional scalar measures into 100-dimensional profiles that capture the distribution of changes across commit-size strata. We then model change history as a hyper co-change graph, where hyperedges naturally encode commit-size semantics. Vector centralities computed on these hypergraphs quantify size-aware node importance for source files. Experiments on nine long-lived Apache projects using five popular classifiers show that replacing scalar process metrics with the proposed commit-size-aware vectors, alongside product metrics, consistently improves predictive performance. These findings establish that commit-size-aware process metrics and hypergraph-based vector centralities capture higher-order change semantics, leading to more discriminative, better calibrated, and statistically superior defect prediction models.
comment: 12 pages
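The idea of turning a scalar process metric into a commit-size-aware vector can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: each commit is treated as the set of files it changed (one hyperedge), stratum k collects commits that changed exactly k files, and larger commits are capped into the last stratum. The function name `commit_size_vectors` and the capping rule are hypothetical.

```python
from collections import defaultdict


def commit_size_vectors(commits, n_strata=100):
    """Map each file to a vector of change counts per commit-size stratum.

    commits: iterable of collections of file paths, one per commit.
    Assumption: stratum index = commit size - 1, capped at n_strata - 1.
    """
    vectors = defaultdict(lambda: [0] * n_strata)
    for files in commits:
        files = set(files)  # a commit is a hyperedge over its files
        if not files:
            continue
        stratum = min(len(files), n_strata) - 1
        for path in files:
            vectors[path][stratum] += 1
    return dict(vectors)


# Tiny made-up history with three strata for readability.
history = [
    ["a.java"],
    ["a.java", "b.java"],
    ["a.java", "b.java", "c.java"],
    ["b.java"],
]
profiles = commit_size_vectors(history, n_strata=3)
```

Summing a file's vector recovers the usual scalar "number of changes" metric, so the vector form refines the scalar rather than replacing its information.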
☆ Harnessing Hype to Teach Empirical Thinking: An Experience With AI Coding Assistants
Software engineering students often struggle to appreciate empirical methods and hypothesis-driven inquiry, especially when taught in theoretical terms. This experience report explores whether grounding empirical learning in hype-driven technologies can make these concepts more accessible and engaging. We conducted a one-semester seminar framed around the currently popular topic of AI coding assistants, which attracted unusually high student interest. The course combined hands-on sessions using AI coding assistants with small, student-designed empirical studies. Classroom observations and survey responses suggest that the hype topic sparked curiosity and critical thinking. Students engaged with the AI coding assistants while questioning their limitations -- developing the kind of empirical thinking needed to assess claims about emerging technologies. Key lessons: (1) Hype-driven topics can lower barriers to abstract concepts like empirical research; (2) authentic hands-on development tasks combined with ownership of inquiry foster critical engagement; and (3) a single seminar can effectively teach both technical and research skills.
comment: Accepted to FSE'26 (Education Track)
☆ Automated Generation of Cybersecurity Exercise Scenarios
There is a growing need for cybersecurity professionals with practical knowledge and experience to meet societal needs and comply with new standards and regulations. At the same time, the advances in software technology and artificial intelligence point towards a future where software agents will play an important role in protecting the computer systems that are critical for society to function. The training and development of both humans and software agents require the design and execution of cybersecurity exercises that differ in properties such as size, scope, objectives, and difficulty. Cybersecurity scenarios are critical for the operation of cybersecurity exercises as they describe the scope, context, operational environment and storyline of each exercise. In this work, we present an approach to automatically generate cybersecurity scenarios that model enterprise IT systems. Our approach is able to generate a large number of scenarios that differ in multiple criteria including size, scope, difficulty, complexity and diversity. We further release as open source a simulation and a virtualization environment that can run cybersecurity exercises based on the generated scenarios, and a dataset containing 100,000 sample scenarios.
☆ Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks
Computational reproducibility is fundamental to trustworthy science, yet remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate the notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.
☆ Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines
Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
☆ Portable and Secure CI/CD for COBOL: Lessons from an Industrial Migration
Continuous integration and delivery (CI/CD) pipelines are critical for sustaining the evolution of large software systems. In regulated industries with legacy technologies, however, pipelines themselves can become a source of technical debt. This paper presents an industrial case study of Bankdata, a cooperative IT provider for Danish banks, where a Jenkins-based COBOL CI/CD pipeline had grown fragile, slow, and tightly coupled to platform-specific logic. The original architecture relied on Groovy scripts spread across four repositories with runtime dependency installation, leading to long execution times, high maintenance costs, and vendor lock-in. We report on the migration to a containerized architecture featuring an abstraction layer for platform logic, simplified repository structure, and a pre-built OCI-compliant image containing COBOL tools and dependencies. The new design achieved an 82% runtime reduction. Our experience highlights lessons on abstraction, containerization, and organizational adoption, offering guidance for modernizing pipelines in legacy, high-security environments.
☆ Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time
The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents, namely OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate increasing agent activity in open-source projects, although agent contributions are associated with more churn over time compared to human-authored code.
comment: MSR 2026 Technical Track
☆ Detecting Call Graph Unsoundness without Ground Truth
Java static analysis frameworks are commonly compared under the assumption that analysis algorithms and configurations compose monotonically and yield semantically comparable results across tools. In this work, we show that this assumption is fundamentally flawed. We present a large-scale empirical study of semantic consistency within and across four widely used Java static analysis frameworks: Soot, SootUp, WALA, and Doop. Using precision partial orders over analysis algorithms and configurations, we systematically identify violations where increased precision introduces new call-graph edges or amplifies inconsistencies. Our results reveal three key findings. First, algorithmic precision orders frequently break within frameworks due to modern language features such as lambdas, reflection, and native modeling. Second, configuration choices strongly interact with analysis algorithms, producing synergistic failures that exceed the effects of algorithm or configuration changes alone. Third, cross-framework comparisons expose irreconcilable semantic gaps, demonstrating that different frameworks operate over incompatible notions of call-graph ground truth. These findings challenge prevailing evaluation practices in static analysis and highlight the need to reason jointly about algorithms, configurations, and framework semantics when assessing precision and soundness.
☆ Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding
Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting, totaling 540 experiments (i.e., 2x3x10x3x3). Results indicate that while preference-based alignment improves adherence to design intent, it does not eliminate non-determinism, and model-level behavior strongly influences design reliability. These findings highlight that achieving dependable LLM-assisted software design requires not only effective prompting but also careful consideration of model behavior and robustness.
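The 540-experiment count follows from a full factorial design, which can be checked with a short enumeration. The model and strategy names below come from the abstract; the benchmark and prompt identifiers are placeholders, since the abstract does not name them.

```python
from itertools import product

benchmarks = ["benchmark_1", "benchmark_2"]          # placeholder names
prompts = ["paraphrase_1", "paraphrase_2", "paraphrase_3"]  # placeholder names
runs = range(1, 11)                                  # 10 repeated runs
models = ["ChatGPT 4o-mini", "Claude 3.5 Sonnet", "Gemini 2.5 Flash"]
strategies = ["standard", "rule-injection", "preference-based"]

# Every combination of the five factors is one experiment.
experiments = list(product(benchmarks, prompts, runs, models, strategies))
total = len(experiments)  # 2 * 3 * 10 * 3 * 3
```

Each LLM therefore accounts for 180 of the experiments, and each (model, strategy) pair for 60.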
☆ The Rise of Language Models in Mining Software Repositories: A Survey
The Mining Software Repositories (MSR) field focuses on analysing the rich data contained in software repositories to derive actionable insights into software processes and products. Mining repositories at scale requires techniques capable of handling large volumes of heterogeneous data, a challenge for which language models (LMs) are increasingly well-suited. Since the advent of Transformer-based architectures, LMs have been rapidly adopted across a wide range of MSR tasks. This article presents a comprehensive survey of the use of LMs in MSR, based on an analysis of 85 papers. We examine how LMs are applied, the types of artefacts analysed, which models are used, how their adoption has evolved over time, and the extent to which studies support reproducibility and reuse. Building on this analysis, we propose a taxonomy of LM applications in MSR, identify key trends shaping the field, and highlight open challenges alongside actionable directions for future research.
☆ A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch
Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic, Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.
comment: Paper accepted at CSEDU 2026
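A minimal sketch of how Fuzzy C-Means memberships can yield both a level and a certainty value is shown below. It is an illustration under strong assumptions, not the paper's pipeline: projects are reduced to a single scalar score, the ordinal criterion is taken to be "sort cluster centers ascending", and certainty is taken to be the largest membership. The names `fuzzy_c_means`, `classify`, and the toy data are hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def memberships(x, centers, m=2.0):
    """Fuzzy membership of scalar x in each cluster; values sum to 1."""
    dists = [abs(x - c) for c in centers]
    if 0.0 in dists:  # x sits exactly on one or more centers
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(data, init_centers, m=2.0, iters=200, tol=1e-9):
    """Alternate membership and center updates until centers stop moving."""
    centers = list(init_centers)
    for _ in range(iters):
        u = [memberships(x, centers, m) for x in data]
        new_centers = []
        for k in range(len(centers)):
            weights = [row[k] ** m for row in u]
            new_centers.append(
                sum(w * x for w, x in zip(weights, data)) / sum(weights)
            )
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return sorted(centers)  # assumed ordinal criterion: ascending centers


def classify(x, centers, m=2.0):
    """Return (level, certainty); a low certainty flags a transitional case."""
    u = memberships(x, centers, m)
    best = max(range(len(u)), key=u.__getitem__)
    return LEVELS[best], u[best]


# Made-up scores forming six loose groups.
scores = [1, 2, 11, 12, 21, 22, 31, 32, 41, 42, 51, 52]
centers = fuzzy_c_means(scores, init_centers=[0, 10, 20, 30, 40, 50])
level, certainty = classify(26.5, centers)
```

A score lying between two centers receives roughly equal membership in both, so its certainty is low; a threshold on that value is one simple way to decide when to route a project to an instructor.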
☆ AutoEG: Exploiting Known Third-Party Vulnerabilities in Black-Box Web Applications
Large-scale web applications are widely deployed with complex third-party components, inheriting security risks arising from component vulnerabilities. Security assessment is therefore required to determine whether such known vulnerabilities remain practically exploitable in real applications. Penetration testing is a widely adopted approach that validates exploitability by launching concrete attacks against known vulnerabilities in real-world black-box systems. However, existing approaches often fail to automatically generate reliable exploits, limiting their effectiveness in practical security assessment. This limitation mainly stems from two issues: (1) precisely triggering vulnerabilities with correct technical details, and (2) adapting exploits to diverse real-world deployment settings. In this paper, we propose AutoEG, a fully automated multi-agent framework for exploit generation targeting black-box web applications. AutoEG has two phases: First, AutoEG extracts precise vulnerability trigger logic from unstructured vulnerability information and encapsulates it into reusable trigger functions. Second, AutoEG uses trigger functions for concrete attack objectives and iteratively refines exploits through feedback-driven interaction with the target application. We evaluate AutoEG on 104 real-world vulnerabilities with 29 attack objectives, resulting in 660 exploitation tasks and 55,440 exploit attempts. AutoEG achieves an average success rate of 82.41%, substantially outperforming state-of-the-art baselines, whose best performance reaches only 32.88%.
comment: 21 pages, 18 figures
☆ Enhancing REST API Fuzzing with Access Policy Violation Checks and Injection Attacks
Due to their widespread use in industry, several techniques have been proposed in the literature to fuzz REST APIs. Existing fuzzers for REST APIs have focused on detecting crashes (e.g., the 500 HTTP server error status code). However, security vulnerabilities can have drastic consequences for existing cloud infrastructures. In this paper, we propose a series of novel automated oracles aimed at detecting violations of access policies in REST APIs, as well as executing traditional attacks such as SQL Injection and XSS. These novel automated oracles can be integrated into existing fuzzers: once the fuzzing session is completed, a "security testing" phase is executed to verify these oracles. When a security fault is detected, our technique is able to generate executable test cases as output in different formats, such as Java, Kotlin, Python and JavaScript test suites. Our novel techniques are integrated as an extension of EvoMaster, a state-of-the-art open-source fuzzer for REST APIs. Experiments are carried out on 9 artificial examples, 8 vulnerable-by-design REST APIs with black-box testing, and 36 REST APIs from the WFD corpus with white-box testing, for a total of 52 distinct APIs. Results show that our novel oracles and their automated integration in a fuzzing process can detect security issues in several of these APIs.
☆ SCPatcher: Automated Smart Contract Code Repair via Retrieval-Augmented Generation and Knowledge Graph
Smart contract vulnerabilities can cause substantial financial losses due to the immutability of code after deployment. While existing tools detect vulnerabilities, they cannot effectively repair them. In this paper, we propose SCPatcher, a framework that combines retrieval-augmented generation with a knowledge graph for automated smart contract repair. We construct a knowledge graph from 5,000 verified Ethereum contracts, extracting function-level relationships to build a semantic network. This graph serves as an external knowledge base that enhances Large Language Model reasoning and enables precise vulnerability patching. We introduce a two-stage repair strategy: initial knowledge-guided repair followed by Chain-of-Thought reasoning for complex vulnerabilities. Evaluated on a diverse set of vulnerable contracts, SCPatcher achieves an 81.5% overall repair rate and a 91.0% compilation pass rate, substantially outperforming existing methods.
comment: 6 pages, 3 figures
☆ LibScan: Smart Contract Library Misuse Detection with Iterative Feedback and Static Verification
Smart contracts are self-executing programs that manage financial transactions on blockchain networks. Developers commonly rely on third-party code libraries to improve both efficiency and security. However, improper use of these libraries can introduce hidden vulnerabilities that are difficult to detect, leading to significant financial losses. Existing automated tools struggle to identify such misuse because it often requires understanding the developer's intent rather than simply scanning for known code patterns. This paper presents LibScan, an automated detection framework that combines large language model (LLM)-based semantic reasoning with rule-based code analysis, identifying eight distinct categories of library misuse in smart contracts. To improve detection reliability, the framework incorporates an iterative self-correction mechanism that refines its analysis across multiple rounds, alongside a structured knowledge base derived from large-scale empirical studies of real-world misuse cases. Experiments conducted on 662 real-world smart contracts demonstrate that LibScan achieves an overall detection accuracy of 85.15%, outperforming existing tools by a margin of over 16 percentage points. Ablation experiments further confirm that combining both analysis approaches yields substantially better results than either method used independently.
☆ Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration
The impending arrival of cryptographically relevant quantum computers (CRQCs) threatens the security foundations of modern software: Shor's algorithm breaks RSA, ECDSA, ECDH, and Diffie-Hellman, while Grover's algorithm reduces the effective security of symmetric and hash-based schemes. Despite NIST standardising post-quantum cryptography (PQC) in 2024 (FIPS 203 ML-KEM, FIPS 204 ML-DSA, FIPS 205 SLH-DSA), most codebases lack automated tooling to inventory classical cryptographic usage and prioritise migration based on quantum risk. We present Quantum-Safe Code Auditor, a quantum-aware static analysis framework that combines (i) regex-based detection of 15 classes of quantum-vulnerable primitives, (ii) LLM-assisted contextual enrichment to classify usage and severity, and (iii) risk scoring via a Variational Quantum Eigensolver (VQE) model implemented in Qiskit 2.x, incorporating qubit-cost estimates to prioritise findings. We evaluate the system across five open-source libraries -- python-rsa, python-ecdsa, python-jose, node-jsonwebtoken, and Bouncy Castle Java -- covering 5,775 findings. On a stratified sample of 602 labelled instances, we achieve 71.98% precision, 100% recall, and an F1 score of 83.71%. All code, data, and reproduction scripts are released as open-source.
comment: 13 pages, 2 figures. Code and evaluation data: https://github.com/AnimeshShaw/quantum-safe-auditor
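The first stage described above, regex-based detection, and the reported F1 arithmetic can be illustrated with a small sketch. The three patterns below are simplified stand-ins chosen for this example and are not the tool's actual 15 detection classes; `scan` and `f1_score` are hypothetical helper names.

```python
import re

# Illustrative patterns only; the real tool's rules are not reproduced here.
PATTERNS = {
    "RSA": re.compile(r"\bRSA\b", re.IGNORECASE),
    "ECDSA": re.compile(r"\bECDSA\b", re.IGNORECASE),
    "ECDH": re.compile(r"\bECDH\b", re.IGNORECASE),
}


def scan(source):
    """Return (line_number, label) for every pattern hit in the source text."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label))
    return findings


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


snippet = (
    "import hashlib\n"
    "key = rsa.newkeys(2048)\n"
    "sig = ecdsa.SigningKey.generate()\n"
)
hits = scan(snippet)
reported_f1 = round(f1_score(0.7198, 1.0) * 100, 2)
```

With 71.98% precision and 100% recall, the harmonic mean works out to the 83.71% F1 quoted in the abstract. A pattern-only scan of this kind matches names wherever they appear, including comments and strings, so some form of contextual filtering is usually needed afterwards.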
☆ Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents
Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework--Role, Domain, and Interaction ontologies--that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.
comment: 23 pages, 7 tables, 4 figures, 33 references. Empirical evaluation: 600 runs across 5 regulated industries including Vietnamese-language domains
☆ Executing as You Generate: Hiding Execution Latency in LLM Code Generation
Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs, and three execution environments. Results show that Eager reduces the non-overlapped execution latency by up to 99.9% and the end-to-end latency by up to 55% across seven LLMs and four benchmarks.
comment: 10 pages
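The latency benefit of overlapping generation and execution can be seen in a deliberately simplified two-stage model. This sketch is not the paper's closed-form analysis: it omits the detection stage, assumes chunks execute in order without errors, and uses made-up per-chunk timings. The function names are hypothetical.

```python
def serial_latency(gen_times, exec_times):
    """Generate everything first, then execute everything."""
    return sum(gen_times) + sum(exec_times)


def pipelined_latency(gen_times, exec_times):
    """Execute chunk i as soon as it is generated and chunk i-1 has finished."""
    gen_done = 0.0
    exec_done = 0.0
    for g, e in zip(gen_times, exec_times):
        gen_done += g
        # A chunk cannot start before it exists or before the executor is free.
        exec_done = max(gen_done, exec_done) + e
    return exec_done


# Made-up timings (seconds) for three code chunks.
gen = [2.0, 2.0, 2.0]
exe = [1.0, 1.0, 1.0]
serial = serial_latency(gen, exe)          # 9.0
pipelined = pipelined_latency(gen, exe)    # 7.0
non_overlapped = pipelined - sum(gen)      # 1.0
```

In this toy case only the last chunk's execution remains visible after generation ends. When execution is slower than generation, the executor becomes the bottleneck and the saving shrinks toward the total generation time.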
☆ LDMDroid: Leveraging LLMs for Detecting Data Manipulation Errors in Android Apps
Android apps rely heavily on Data Manipulation Functionalities (DMFs) for handling app-specific data through CRUDS operations, making their correctness vital for reliability. However, detecting Data Manipulation Errors (DMEs) is challenging due to their dependence on specific UI interaction sequences and manifestation as logic bugs. Existing automated UI testing tools face two primary challenges: insufficient UI path coverage for adequate DMF triggering and reliance on manually written test scripts. To address these issues, we propose an automated approach using Large Language Models (LLMs) for DME detection. We developed LDMDroid, an automated UI testing framework for Android apps. LDMDroid enhances DMF triggering success by guiding LLMs through a state-aware process for generating UI event sequences. It also uses visual features to identify changes in data states, improving DME verification accuracy. We evaluated LDMDroid on 24 real-world Android apps, demonstrating improved DMF triggering success rates compared to baselines. LDMDroid discovered 17 unique bugs, with 14 confirmed by developers and 11 fixed. The tool is publicly available at https://github.com/runnnnnner200/LDMDroid.
☆ Programming by Chat: A Large-Scale Behavioral Analysis of 11,579 Real-World AI-Assisted IDE Sessions
IDE-integrated AI coding assistants, which operate conversationally within developers' working codebases with access to project context and multi-file editing, are rapidly reshaping software development. However, empirical investigation of this shift remains limited: existing studies largely rely on small-scale, controlled settings or analyze general-purpose chatbots rather than codebase-aware IDE workflows. We present, to the best of our knowledge, the first large-scale study of real-world conversational programming in IDE-native settings, analyzing 74,998 developer messages from 11,579 chat sessions across 1,300 repositories and 899 developers using Cursor and GitHub Copilot. These chats were committed to public repositories as part of routine development, capturing in-the-wild behavior. Our findings reveal three shifts in how programming work is organized: conversational programming operates as progressive specification, with developers iteratively refining outputs rather than specifying complete tasks upfront; developers redistribute cognitive work to AI, delegating diagnosis, comprehension, and validation rather than engaging with code and outputs directly; and developers actively manage the collaboration, externalizing plans into persistent artifacts, and negotiating AI autonomy through context injection and behavioral constraints. These results provide foundational empirical insights into AI-assisted development and offer implications for the design of future programming environments.
☆ Shapley-Guided Neural Repair Approach via Derivative-Free Optimization
DNNs are susceptible to defects like backdoors, adversarial attacks, and unfairness, undermining their reliability. Existing approaches mainly involve retraining, optimization, constraint-solving, or search algorithms. However, most methods rely on gradient calculations, restricting applicability to specific activation functions (e.g., ReLU), or use search algorithms with uninterpretable localization and repair. Furthermore, they often lack generalizability across multiple properties. We propose SHARPEN, integrating interpretable fault localization with a derivative-free optimization strategy. First, SHARPEN introduces a Deep SHAP-based localization strategy quantifying each layer's and neuron's marginal contribution to erroneous outputs. Specifically, a hierarchical coarse-to-fine approach reranks layers by aggregated impact, then locates faulty neurons/filters by analyzing activation divergences between property-violating and benign states. Subsequently, SHARPEN incorporates CMA-ES to repair identified neurons. CMA-ES leverages a covariance matrix to capture variable dependencies, enabling gradient-free search and coordinated adjustments across coupled neurons. By combining interpretable localization with evolutionary optimization, SHARPEN enables derivative-free repair across architectures, being less sensitive to gradient anomalies and hyperparameters. We demonstrate SHARPEN's effectiveness on three repair tasks. Balancing property repair and accuracy preservation, it outperforms baselines in backdoor removal (+10.56%), adversarial mitigation (+5.78%), and unfairness repair (+11.82%). Notably, SHARPEN handles diverse tasks, and its modular design is plug-and-play with different derivative-free optimizers, highlighting its flexibility.
☆ EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts
Modern LLM agents increasingly create their own tools at runtime -- from Python functions to API clients -- yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics -- reuse, redundancy, composition success, regression stability, and safety -- alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.
comment: 4 pages, 2 figures, 4 tables
♻ ☆ Foundation Models for Autonomous Driving System: An Initial Roadmap
Recent advances in foundation models (FMs), including large language models (LLMs), vision-language models (VLMs), and world models, have opened new opportunities for autonomous driving systems (ADSs) in perception, reasoning, decision-making, and interaction. However, ADSs are safety-critical cyber-physical systems, and integrating FMs into them raises substantial software engineering challenges in data curation, system design, deployment, evaluation, and assurance. To clarify this rapidly evolving landscape, we present an initial roadmap, grounded in a structured literature review, for integrating FMs into autonomous driving across three dimensions: FM infrastructure, in-vehicle integration, and practical deployment. For each dimension, we summarize the state of the art, identify key challenges, and highlight open research opportunities. Based on this analysis, we outline research directions for building reliable, safe, and trustworthy FM-enabled ADSs.
comment: To appear in ACM Transactions on Software Engineering and Methodology (TOSEM)
♻ ☆ HAFixAgent: History-Aware Program Repair Agent
Automated program repair (APR) has recently shifted toward large language models and agent-based systems, yet most systems rely on local snapshot context, overlooking repository history. Prior work shows that repository history helps repair single-line bugs, since the last commit touching the buggy line is often the bug-introducing one. In this paper, we investigate whether repository history can also improve agentic APR systems at scale, especially for complex multi-hunk bugs. We present HAFixAgent, a History-Aware Bug-Fixing Agent that injects blame-derived repository heuristics into its repair loop. A preliminary study on 854 Defects4J (Java) and 501 BugsInPy (Python) bugs motivates our design, showing that bug-relevant history is widely available across both benchmarks. Using the same LLM (DeepSeek-V3.2-Exp) for all experiments, including replicated baselines, we show: (1) Effectiveness: HAFixAgent outperforms RepairAgent (+56.6%) and BIRCH-feedback (+47.1%) on Defects4J. Historical context further improves repair by +4.4% on Defects4J and +38.6% on BugsInPy, especially on single-file multi-hunk (SFMH) bugs. (2) Robustness: under noisy fault localization (+1/+3/+5 line shifts), history provides increasing resilience, maintaining 40 to 56% success on SFMH bugs where the non-history baseline collapses to 0%. (3) Efficiency: history does not significantly increase agent steps or token costs on either benchmark.
comment: support both Defects4J and BugsInPy; use the same LLM for all baseline comparisons; add sensitivity analysis for imperfect fault localization
♻ ☆ Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools
Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer's success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high-level reasoning and effective low-level execution.
♻ ☆ A Self-Improving Architecture for Dynamic Safety in Large Language Models
Context: Large Language Models (LLMs) rely on static, pre-deployment safety mechanisms that cannot adapt to adversarial threats discovered after release. Objective: To design a software architecture enabling LLM-based systems to autonomously detect safety failures and synthesize defense policies at runtime, without retraining or manual intervention. Method: We propose the Self-Improving Safety Framework (SISF), grounded in the MAPE-K reference model. The framework couples a target LLM with a feedback loop: an Adjudicator detects breaches, a Policy Synthesis Module generates dual-mechanism defense policies (heuristic and semantic), and a Warden enforces them. We conducted seven experiments (10,061 evaluations) across four model families. Results: Across five reproducibility trials, SISF achieved a mean Attack Success Rate (ASR) of 0.27% (+/-0.15%), autonomously generating 240 policies per trial. Cross-model evaluation confirmed deployment portability. A held-out test showed a 68.5% proactive interception rate on unseen attacks. Stacked behind Llama Guard 4, the combined defense reduced residual ASR from 7.88% to 0.00%. Ablation confirmed both heuristic and semantic policy types are architecturally required. Conclusion: Self-adaptive architecture is a viable approach to LLM safety. SISF achieves sub-1% ASR through synchronous output monitoring, progressively shifting enforcement to fast, local Warden policies via the MAPE-K loop, offering a new pattern for building resilient AI systems.
comment: Under review at the journal Information and Software Technology (Special Issue on Software Architecture for AI-Driven Systems)
♻ ☆ Evaluating LLM-Generated ACSL Annotations for Formal Verification
Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.
comment: 12 pages. Formal Techniques for Judicious Programming FTfJP-2026 at ECOOP. Conditionally Accepted
♻ ☆ Mitigating Omitted Variable Bias in Empirical Software Engineering
Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. This results in the model attributing the missing variables' effect to some of the included variables -- hence over- or under-estimating the latter's true effect. Omitted variable bias presents a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering. This paper illustrates the impact of omitted variable bias with two examples from the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables. This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design, and to a significant reduction in its threats to validity.
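A small simulation can make the over-estimation concrete. The sketch below is illustrative only and is not taken from the paper: the data-generating process, the coefficients (2.0, 3.0, 0.5), and the function names are assumptions. It regresses an outcome on `x` while omitting a correlated variable `z`, then recovers the true effect by partialling `z` out of both sides (the Frisch-Waugh-Lovell result).

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs (one regressor plus intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def residuals(ys, zs):
    """Residuals of ys after a least-squares fit on zs (with intercept)."""
    b = slope(zs, ys)
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    return [(y - my) - b * (z - mz) for y, z in zip(ys, zs)]


def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    xs, zs, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = 0.5 * x + rng.gauss(0, 1)            # z is correlated with x
        y = 2.0 * x + 3.0 * z + rng.gauss(0, 1)  # true effect of x on y is 2.0
        xs.append(x)
        zs.append(z)
        ys.append(y)
    naive = slope(xs, ys)  # omits z, so it absorbs z's effect: about 2.0 + 3.0 * 0.5
    adjusted = slope(residuals(xs, zs), residuals(ys, zs))  # controls for z
    return naive, adjusted
```

Under these assumptions the naive slope lands near 3.5 rather than 2.0, while the adjusted slope lands near 2.0.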
♻ ☆ CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
Static analysis tools are widely used to detect bugs, vulnerabilities, and code smells. Traditionally, developers must resolve these warnings manually. Because this process is tedious, developers sometimes ignore warnings, leading to an accumulation of warnings and a degradation of code quality. This paper presents CodeCureAgent, an approach that harnesses LLM-based agents to automatically analyze, classify, and repair static analysis warnings. Unlike previous work, our method does not follow a predetermined algorithm. Instead, we adopt an agentic framework that iteratively invokes tools to gather additional information from the codebase (e.g., via code search) and edit the codebase to resolve the warning. CodeCureAgent detects and suppresses false positives, while fixing true positives when identified. We equip CodeCureAgent with a three-step heuristic to approve patches: (1) build the project, (2) verify that the warning disappears without introducing new warnings, and (3) run the test suite. We evaluate CodeCureAgent on a dataset of 1,000 SonarQube warnings found in 106 Java projects and covering 291 distinct rules. Our approach produces plausible fixes for 96.8% of the warnings, outperforming state-of-the-art baseline approaches by 29.2%-34.0% in plausible-fix rate. Manual inspection of 291 cases reveals a correct-fix rate of 86.3%, showing that CodeCureAgent can reliably repair static analysis warnings. The approach incurs LLM costs of about 2.9 cents (USD) and an end-to-end processing time of about four minutes per warning. We envision CodeCureAgent helping to clean existing codebases and being integrated into CI/CD pipelines to prevent the accumulation of static analysis warnings.
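The three-step approval heuristic can be sketched as a simple gate. This is a hedged illustration, not CodeCureAgent's code: the function name, its arguments, and the representation of warnings as opaque identifiers are all assumptions.

```python
def approve_patch(build_ok, target_warning, warnings_before, warnings_after, tests_ok):
    """Return True only if a candidate patch passes all three checks, in order."""
    # (1) The project must still build.
    if not build_ok:
        return False
    # (2) The targeted warning must be gone, and no new warning may appear.
    if target_warning in warnings_after:
        return False
    if set(warnings_after) - set(warnings_before):
        return False
    # (3) The test suite must pass.
    return bool(tests_ok)
```

Checking the cheap conditions first means a failed build rejects the patch before any warning comparison or test result is consulted.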
♻ ☆ An Empirical Study of Generative AI Adoption in Software Engineering
Context. GenAI tools are being increasingly adopted by practitioners in SE, promising support for several SE activities. Despite increasing adoption, we still lack empirical evidence on how GenAI is used in practice, the benefits it provides, the challenges it introduces, and its broader organizational and societal implications. Objective. This study aims to provide an overview of the status of GenAI adoption in SE. It investigates the status of GenAI adoption, associated benefits and challenges, institutionalization of tools and techniques, and anticipated long-term impacts on SE professionals and the community. Results. The results indicate wide adoption of GenAI tools, which are deeply integrated into daily SE work, particularly for implementation, verification and validation, personal assistance, and maintenance-related tasks. Practitioners report substantial benefits, most notably reduction in cycle time, quality improvements, enhanced support in knowledge work, and productivity gains. However, objective measurement of productivity and quality remains limited in practice. Significant challenges persist, including incorrect or unreliable outputs, prompt engineering difficulties, validation overhead, security and privacy concerns, and risks of overreliance. Institutionalization of tools and techniques seems to be common, but it varies considerably, with a strong focus on tool access and less emphasis on training and governance. Practitioners expect GenAI to redefine rather than replace their roles, while expressing moderate concern about job market contraction and skill shifts.
♻ ☆ Traq: Estimating the Quantum Cost of Classical Programs
Predicting practical speedups offered by future quantum computers has become a major focus of the quantum community. Typically, such predictions involve numerical simulations supported by lengthy manual analyses and are carried out for one specific algorithm at a time. In this work, we present Traq, a principled approach towards estimating the quantum speedup of classical programs fully automatically. It consists of a classical language that includes high-level primitives amenable to quantum speedups, a compilation to low-level quantum programs, and a source-level cost analysis with provable guarantees. Our cost analysis upper bounds the complexity of the resulting quantum program and is sensitive to the input data of the program (in addition to providing worst-case costs). Traq is implemented as a Haskell package with an extensive evaluation.
comment: 57 pages; v2: significantly revised paper
♻ ☆ Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
♻ ☆ NES: An Instruction-Free, Low-Latency Next Edit Suggestion Framework Powered by Learned Historical Editing Trajectories
Code editing is a frequent yet cognitively demanding task in software development. Existing AI-powered tools often disrupt developer flow by requiring explicit natural language instructions and suffer from high latency, limiting real-world usability. We present NES (Next Edit Suggestion), an instruction-free, low-latency code editing framework that leverages learned historical editing trajectories to implicitly capture developers' goals and coding habits. NES features a dual-model architecture: one model predicts the next edit location and the other generates the precise code change, both without any user instruction. Trained on our open-sourced SFT and DAPO datasets, NES achieves state-of-the-art performance (75.6% location accuracy, 27.7% exact match rate) while delivering suggestions in under 250ms. Deployed at Ant Group, NES serves over 20,000 developers through a seamless Tab-key interaction, achieving effective acceptance rates of 51.55% for location predictions and 43.44% for edits, demonstrating its practical impact in real-world development workflows.
comment: Accepted by FSE'26 Industry Track
♻ ☆ SmartPoC: Generating Executable and Validated PoCs for Smart Contract Bug Reports
Smart contracts are commonly audited through static analysis to explore vulnerabilities. However, static approaches typically produce heterogeneous findings rather than reproducible, executable proof-of-concept (PoC) test cases, leading to costly and ad hoc manual validation. Large language models (LLMs) offer a promising way to translate audit reports into PoC test cases, but face three major challenges: noisy inputs, lack of execution grounding, and missing runtime oracles. We present SmartPoC, an end-to-end approach for validating reported vulnerabilities in audit reports by generating and executing PoC test cases with automated exploitability verification. SmartPoC first extracts a focused function-level slice from each report to reduce noise, centering on the key functions referenced in a finding and augmenting them with execution-relevant neighbors. To improve executability, we wrap LLM-based PoC synthesis in a generate-repair-execute loop, combining deterministic pre-execution sanitization with feedback-driven post-execution debugging. We further use differential verification as an oracle to confirm the exploitability of generated test cases. On the SmartBugs-Vul and FORGE-Vul benchmarks, SmartPoC achieves confirmation precision of 98.32% and 98.65%, with recall of 84.17% and 85.28%, respectively. On a recent Etherscan verified-source corpus, SmartPoC confirms 64 bugs from 545 audit findings at an average cost of $0.03.
♻ ☆ SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations -- a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The key insight is simple: Maintainability can be revealed by tracking how functional correctness changes over time. The benchmark comprises 100 tasks, each deriving from a real-world code repository with a development history spanning an average of 233 days and 71 consecutive commits. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.
♻ ☆ From Code Changes to Quality Gains: An Empirical Study in Python ML Systems with PyQu ICSE26
In an era shaped by Generative Artificial Intelligence for code generation and the rising adoption of Python-based Machine Learning systems (MLS), software quality has emerged as a major concern. As these systems grow in complexity and importance, a key obstacle lies in understanding exactly how specific code changes affect overall quality, a shortfall aggravated by the lack of quality assessment tools and of a clear mapping between MLS code changes and their quality effects. Although prior work has explored code changes in MLS, it mostly stops at what the changes are, leaving a gap in our knowledge of the relationship between code changes and MLS quality. To address this gap, we conducted a large-scale empirical study of 3,340 open-source Python ML projects, encompassing more than 3.7 million commits and 2.7 trillion lines of code. We introduce PyQu, a novel tool that leverages low-level software metrics to identify quality-enhancing commits with an average accuracy, precision, and recall of 0.84 and an average F1 score of 0.85. Using PyQu and a thematic analysis, we identified 61 code changes, each demonstrating a direct impact on enhancing software quality, and we classified them into 13 categories based on contextual characteristics. 41% of the changes are newly discovered by our study and have not been identified by state-of-the-art Python change detection tools. Our work offers a vital foundation for researchers, practitioners, educators, and tool developers, advancing the quest for automated quality assessment and best practices in Python-based ML software.
comment: Accepted for publication in the proceedings of IEEE/ACM 48th International Conference on Software Engineering (ICSE26)
Computer Vision and Pattern Recognition 4
☆ UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.
☆ Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation CVPR 2026
Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
comment: CVPR 2026
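To illustrate why ZOO-style adaptation is query-hungry, here is a generic two-point zeroth-order gradient estimator. It is a textbook-style sketch of the baseline family the abstract contrasts with, not AReS and not any specific prior method. The quadratic loss, the `CountingOracle` wrapper, and all parameter values are assumptions chosen for illustration.

```python
import random


class CountingOracle:
    """Wraps a black-box loss and counts queries (a stand-in for paid API calls)."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)


def zo_gradient(loss, x, rng, mu=1e-3, samples=20):
    """Estimate the gradient from function values only, using random directions."""
    grad = [0.0] * len(x)
    for _ in range(samples):
        u = [rng.gauss(0, 1) for _ in x]
        plus = loss([xi + mu * ui for xi, ui in zip(x, u)])
        minus = loss([xi - mu * ui for xi, ui in zip(x, u)])
        directional = (plus - minus) / (2 * mu)  # finite-difference slope along u
        for i, ui in enumerate(u):
            grad[i] += directional * ui / samples
    return grad


def zo_descent(loss, x0, steps=50, lr=0.1, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(loss, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

Each estimate costs two queries per sampled direction, so 50 steps with 20 directions already consume 2,000 calls on a three-parameter toy problem. That per-step cost is the kind of overhead a one-time priming pass aims to avoid.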
☆ Efficient Equivariant Transformer for Self-Driving Agent Modeling CVPR 2026
Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs computational cost.
comment: CVPR 2026
♻ ☆ ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes CVPR 2026
Indoor environments evolve as objects move, appear, or leave the scene. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. Our method enables temporal information sharing--using spatiotemporal contrastive loss, masking, and serialization--to adaptively leverage geometric and semantic priors across observations. This shared context enables consistent instance tracking and improves standard 3DSIS performance. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
comment: CVPR 2026
Human-Computer Interaction 31
☆ Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis
Physically Assistive Robots (PARs) require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause severe physical and cognitive fatigue for users with profound motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework (OTPF). This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, independent clinical experts confirmed the generated policies are safe and accurately reflect user preferences.
comment: This work has been submitted to the 2026 IEEE International Conference on Robot and Human Interactive Communication (ROMAN)
☆ Democratizing Foundations of Problem-Solving with AI: A Breadth-First Search Curriculum for Middle School Students
As AI becomes more common in students' everyday experiences, a major challenge for K-12 AI education is designing learning experiences that can be meaningfully integrated into existing subject-area instruction. This paper presents the design and implementation of an AI4K12-aligned curriculum that embeds AI learning goals within a rural middle school science classroom using Breadth-First Search (BFS) as an accessible entry point to AI problem-solving. Through unplugged activities and an interactive simulation environment, students learned BFS as a strategy for exploring networks and identifying shortest paths, then applied it to science contexts involving virus spread and contact tracing. To examine engagement and learning, we analyzed pre- and post-assessments, student work artifacts, and a teacher interview. Results suggest that students engaged productively with the curriculum, improved their understanding of BFS and AI problem-solving, and benefited from learning these ideas within ongoing science instruction. Teacher feedback further indicated that the module fit well within the science curriculum while supporting intended science learning outcomes. We conclude with curriculum and design considerations for broadening access to learning about problem-solving with AI in education.
comment: Paper accepted to the 27th International Conference on AI in Education (AIED 2026)
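For readers unfamiliar with the technique the curriculum above is built around, here is a minimal sketch of Breadth-First Search finding a shortest path in an unweighted network. The contact graph, the names in it, and the function name `bfs_shortest_path` are illustrative assumptions, not material from the paper.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return one shortest path from start to goal in an unweighted graph, or None."""
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical contact network for a contact-tracing style exercise.
contacts = {
    "Ana": ["Ben", "Cam"],
    "Ben": ["Ana", "Dee"],
    "Cam": ["Ana", "Dee"],
    "Dee": ["Ben", "Cam", "Eli"],
    "Eli": ["Dee"],
}

print(bfs_shortest_path(contacts, "Ana", "Eli"))
```

Because BFS explores the network level by level, the first time it reaches the goal it has used the fewest possible hops, which is what makes it suitable for shortest-path questions on unweighted graphs.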
☆ From Automation to Augmentation: A Framework for Designing Human-Centric Work Environments in Society 5.0
Society 5.0 and Industry 5.0 call for human-centric technology integration, yet the concept lacks an operational definition that can be measured, optimized, or evaluated at the firm level. This paper addresses three gaps. First, existing models of human-AI complementarity treat the augmentation function phi(D) as exogenous -- dependent only on the stock of AI deployed -- ignoring that two firms with identical technology investments achieve radically different augmentation outcomes depending on how the workplace is organized around the human-AI interaction. Second, no multi-dimensional instrument exists linking workplace design choices to augmentation productivity. Third, the Society 5.0 literature proposes human-centricity as a normative aspiration but provides no formal criterion for when it is economically optimal. We make four contributions. (1) We endogenize the augmentation function as phi(D, W), where W is a five-dimensional workplace design vector -- AI interface design, decision authority allocation, task orchestration, learning loop architecture, and psychosocial work environment -- and prove that human-centric design is profit-maximizing when the workforce's augmentable cognitive capital exceeds a critical threshold. (2) We conduct a PRISMA-guided systematic review of 120 papers (screened from 6,096 records) to map the evidence base for each dimension. (3) We provide secondary empirical evidence from Colombia's EDIT manufacturing survey (N=6,799 firms) showing that management practice quality amplifies the return to technology investment (interaction coefficient 0.304, p<0.01). (4) We propose the Workplace Augmentation Design Index (WADI), a 36-item theory-grounded instrument for diagnosing human-centricity at the firm level. Decision authority allocation emerges as the binding constraint for Society 5.0 transitions, and task orchestration as the most under-researched dimension.
comment: 57 pages, 2 figures, 8 tables, 1 appendix with formal proofs. CFE Working Paper No. 6
☆ Disclosure or Marketing? Analyzing the Efficacy of Vendor Self-reports for Vetting Public-sector AI
Documentation-based disclosure has become a central governance strategy for responsible AI, particularly in public-sector procurement. Tools such as model cards, datasheets, and AI FactSheets are increasingly expected to support accountability, risk assessment, and informed decision-making across organizational boundaries. Yet there is limited empirical evidence about how these artifacts are produced, interpreted, and used in practice. In this paper, we present a qualitative study of the GovAI Coalition FactSheet, a widely adopted transparency document designed to support AI procurement and governance in government contexts. Drawing on semi-structured interviews with vendors and public-sector practitioners, alongside a systematic analysis of completed FactSheets, we examine how FactSheets are used, what information they surface, and where they fall short. We find that FactSheets are asked to serve multiple and conflicting purposes simultaneously: showcasing vendor offerings, supporting evaluation and due diligence, and facilitating early-stage dialogue between vendors and agencies. These competing expectations, combined with the structural constraints of voluntary and public self-disclosure, limit the ability of FactSheets to function as standalone evaluation or risk-assessment tools. At the same time, our findings suggest that when understood as relational artifacts used to establish trust, shared understanding, and ongoing dialogue, FactSheets can help create conditions that support more meaningful disclosure and governance over time.
comment: 31 pages, 2 figures
☆ Assessing Affective Objectives for Communicative Visualizations
Using learning objectives to define designer intents for communicative visualizations can be a powerful design tool. Cognitive and affective objectives are concrete and specific, which can be translated to assessments when creating, evaluating, or comparing visualization ideas. However, while there are many well-validated assessments for cognitive objectives, affective objectives are uniquely challenging. It is easy to see if a visualization helps someone remember the number of patients in a clinic, but harder to observe the change in their attitudes around donations to a crisis. In this work, we define a set of criteria for selecting assessments--from education, advocacy, economics, health, and psychology--that align with affective objectives. We illustrate the use of the framework in a complex affective design task that combines personal narratives and visualizations. Our chosen assessments allow us to evaluate different designs in the context of our objectives and competing psychological theories.
☆ True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies
This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.
☆ Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators
As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learning tasks, either by requesting help or through integrated tools. Trust in AI can influence how students interpret and use that output, including whether they evaluate it critically or exhibit overreliance. We investigate how students' trust relates to their appropriate reliance on an AI assistant during programming problem-solving tasks, and whether this relationship differs by learner characteristics. In a study with 432 undergraduate participants, students completed Python output-prediction problems while receiving recommendations and explanations from an AI chatbot, including accurate and intentionally misleading suggestions. We operationalize reliance behaviorally as the extent to which students' responses reflected appropriate use of the AI assistant's suggestions, accepting them when they were correct and rejecting them when they were incorrect. Pre- and post-task surveys assessed trust in the assistant, AI literacy, need for cognition, programming self-efficacy, and programming literacy. Results showed a non-linear relationship in which higher trust was associated with lower appropriate reliance, suggesting weaker discrimination between correct and incorrect recommendations. This relationship was significantly moderated by students' AI literacy and need for cognition. These findings highlight the need for future work on instructional and system supports that encourage more reflective evaluation of AI assistance during problem-solving.
comment: Full paper accepted to the 27th International Conference on AI in Education (AIED 2026). AIED Proceedings to be released Summer 2026
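The behavioral definition of appropriate reliance in the abstract above (accept correct suggestions, reject incorrect ones) can be sketched as a simple proportion. The abstract does not give the paper's exact scoring, so the function name, the trial format, and the equal weighting of trials are all assumptions made for illustration.

```python
def appropriate_reliance(trials):
    """Fraction of trials where the student's decision matched the AI's correctness.

    trials: list of (ai_correct, student_accepted) boolean pairs (hypothetical format).
    A trial counts as appropriate when a correct suggestion is accepted
    or an incorrect suggestion is rejected.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    appropriate = sum(1 for ai_correct, accepted in trials if ai_correct == accepted)
    return appropriate / len(trials)


# Hypothetical student: one correct acceptance, one correct rejection,
# one wrongly rejected suggestion, one wrongly accepted suggestion.
example = [(True, True), (False, False), (True, False), (False, True)]
print(appropriate_reliance(example))
```

Under this reading, a student who accepts every suggestion scores no better than the share of suggestions that happened to be correct, which is one way overreliance would show up in the measure.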
☆ FlexAI: A Multi-modal Solution for Delivering Personalized and Adaptive Fitness Interventions
Personalization of exercise routines is a crucial factor in helping people achieve their fitness goals. Despite this, many contemporary solutions fail to offer real-time, adaptive feedback tailored to an individual's physiological states. Contemporary fitness solutions often rely only on static plans and do not adjust to factors such as a user's pain thresholds, fatigue levels, or form during a workout routine. This work introduces FlexAI, a multi-modal system that integrates computer vision, physiological sensors (heart rate and voice), and the reasoning capabilities of Large Language Models (LLMs) to deliver real-time, personalized workout guidance. FlexAI continuously monitors a user's physical form and level of exertion, among other parameters, to provide dynamic interventions focused on exercise intensity, rest periods, and motivation. To validate our system, we performed a technical evaluation confirming our models' accuracy and quantifying pipeline latency, alongside an expert review where certified trainers validated the correctness of the LLM's interventions. Furthermore, in a controlled study with 25 participants, FlexAI demonstrated significant improvements over a static, non-adaptive control system. With FlexAI, users reported significantly greater enjoyment, a stronger sense of achievement, and significantly lower levels of boredom and frustration. These results indicate that by integrating multi-modal sensing with LLM-driven reasoning, adaptive systems like FlexAI can create a more engaging and effective workout experience. Our work provides a blueprint for integrating multi-modal sensing with LLM-driven reasoning, demonstrating that it is possible to create adaptive coaching systems that are not only more engaging but also demonstrably reliable.
☆ AuraDesk: Data Physicalization through Olfaction Metaphors for Representing and Mitigating Workplace Stress
Workplace stress is often addressed through visual or auditory interventions, yet these modalities can compete with attention and contribute to sensory overload. We explore olfaction as an alternative ambient medium for representing stress-related physiological signals in office settings. We present AuraDesk, an olfactory data physicalization system that translates wearable-derived physiological cues into situated scent expressions at the workstation. The system combines local physiological state inference with a constrained actuation strategy to produce temporally regulated and spatially localized scent output suitable for everyday work environments. To examine the feasibility and experiential qualities of this approach, we conducted a one-day in-situ field deployment with 25 knowledge workers at their actual workstations. Our findings show that participants often interpreted the scent output not as an explicit alert, but as a subtle atmospheric cue that supported momentary awareness, micro-break taking, and perceived environmental attunement. At the same time, participants raised important concerns regarding scent preference, habituation, and contextual appropriateness in shared offices. This work contributes (1) an olfactory interface for physiologically driven ambient feedback in the workplace, (2) a hybrid mapping approach for coupling continuous biosignal interpretation with constrained scent actuation, and (3) empirical insights into how workers perceive, negotiate, and appropriate ambient olfactory feedback in real office contexts. Rather than claiming therapeutic efficacy, we position AuraDesk as a probe into the design space of olfactory data physicalization for workplace wellbeing and attention-sensitive interaction.
☆ Evaluating the Feasibility of Augmented Reality to Support Communication Access for Deaf Students in Experiential Higher Education Contexts
Deaf and hard of hearing (DHH) students often experience communication barriers in higher education, which are particularly acute in experiential learning environments such as laboratories. Traditional accessibility services, such as interpreting and captioning, often require DHH students to divide their attention between critical tasks, potential safety hazards, instructional materials, and access providers, creating trade-offs between safety and equitable communication. These demands can disrupt task engagement and increase cognitive load in settings that require sustained visual focus, highlighting the limitations of current approaches. To address these challenges, this study investigates Augmented Reality Real-Time Access for Education (ARRAE), an ecosystem based on augmented reality (AR) smart glasses, as a potential intervention for laboratory-based environments. By overlaying interpreters or captions directly into a student's field of view, AR enables the integration of accessibility into hands-on learning without compromising safety or comprehension. Through an empirical study with 12 DHH participants, we evaluate how AR-mediated access influences visual attention patterns and perceived cognitive load during hands-on tasks. The findings suggest that AR-mediated communication shows strong potential to improve attention management and communication accessibility in experiential learning environments, though participants emphasized that accessibility preferences are highly context-dependent. Participants also identified several design and ergonomic challenges, including display positioning, visual fatigue, and compatibility with hearing devices. Together, these results highlight both the promise of AR for supporting accessible participation in visually demanding environments and key design considerations for future systems.
☆ Steering through Time: Blending Longitudinal Data with Simulation to Rethink Human-Autonomous Vehicle Interaction
As semi-automated vehicles (SAVs) become more common, ensuring effective human-vehicle interaction during control handovers remains a critical safety challenge. Existing studies often rely on single-session simulator experiments or naturalistic driving datasets, which often lack temporal context on drivers' cognitive and physiological states before takeover events. This study introduces a hybrid framework combining longitudinal mobile sensing with high-fidelity driving simulation to examine driver readiness in semi-automated contexts. In a pilot study with 38 participants, we collected 7 days of wearable physiological data and daily surveys on stress, arousal, valence, and sleep quality, followed by an in-lab simulation with scripted takeover events under varying secondary task conditions. Multimodal sensing, including eye tracking, fNIRS, and physiological measures, captured real-time responses. Preliminary analysis shows the framework's feasibility and individual variability in baseline and in-task measures; for example, fixation duration and takeover control time differed by task type, and RMSSD showed high inter-individual stability. This proof-of-concept supports the development of personalized, context-aware driver monitoring by linking temporally layered data with real-time performance.
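RMSSD, mentioned in the abstract above, is a standard heart-rate-variability statistic: the root mean square of successive differences between adjacent inter-beat (RR) intervals. A minimal sketch follows; the input values are made up, and the paper's own preprocessing (artifact removal, windowing) is not described in the abstract and is not modeled here.

```python
import math


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals (in ms)."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    if not diffs:
        raise ValueError("at least two RR intervals are required")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR intervals in milliseconds.
print(rmssd([800, 810, 790, 800]))
```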
☆ Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers
Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.
☆ A Dual-Action Fabric-Based Soft Robotic Glove for Ergonomic Hand Rehabilitation
Hand impairment following neurological disorders substantially limits independence in activities of daily living, motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and flexion-extension actuation constrain their clinical utility. Here, we present a dual-action fabric-based soft robotic glove incorporating customized actuators aligned with individual finger joints. The glove comprises five independently controlled dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby maximizing finger contact area and improving comfort. Systematic characterization confirmed that the actuators generate sufficient joint moment and fingertip force for ADL-relevant tasks, and that the complete glove system produces adequate grasping force for common household objects. A preliminary study with ten healthy subjects demonstrated that active glove assistance significantly reduces forearm muscle activity during object manipulation. A pilot feasibility study with three individuals with cervical spinal cord injury across seven functional tasks indicated that glove assistance promotes more natural grasp patterns and reduces reliance on tenodesis grasp, although at the cost of increased task completion time attributable to the current actuation interface. This customizable, ergonomic design represents a practical step toward personalized hand rehabilitation and assistive robotics.
☆ A wearable haptic device for edge and surface simulation
Object manipulation is fundamental to virtual reality (VR) applications, yet conventional fingertip haptic devices fail to render certain tactile features relevant for immersive and precise interactions, such as edges. This paper presents a compact, lightweight fingertip haptic device (24.3 g) that delivers distinguishable surface and edge contact feedback through a novel dual-motor mechanism. Pressure distribution characterization using a 6 x 6 flexible sensor array demonstrates distinct contact patterns between the two stimulation modes. A preliminary user study with five participants achieved 93% average classification accuracy across four conditions (edge/surface contact with light/heavy pressure), with a mean response time of 2.79 seconds. The results indicate that the proposed device can effectively convey edge and surface tactile cues, potentially enhancing object manipulation fidelity in VR environments.
☆ In the Middle, Not on Top: AI-Mediated Communication for Patient-Provider Care Relationships
Relationship-centered care relies on trust and meaningful connection. As AI enters clinical settings, we must ask not just what it can do, but how it should be positioned to support these values. We examine a "middle, not top" approach where AI mediates communication without usurping human judgment. Through studies of CLEAR, an asynchronous messaging system, we show how this configuration addresses real-world constraints like time pressure and uneven health literacy. We find that mediator affordances (e.g., availability, neutrality) redistribute interpretive work and reduce relational friction. Ultimately, we frame AI mediation as relational infrastructure, highlighting critical design tensions around framing power and privacy.
comment: 5 pages, 1 figure, Toward Relationship-Centered Care with AI: Designing for Human Connections in Healthcare workshop at CHI 2026
☆ StretchBot: A Neuro-Symbolic Framework for Adaptive Guidance with Assistive Robots
Assistive robots have growing potential to support physical wellbeing in home and healthcare settings, for example, by guiding users through stretching or rehabilitation routines. However, existing systems remain largely scripted, which limits their ability to adapt to user state, environmental context, and interaction dynamics. In this work, we present StretchBot, a hybrid neuro-symbolic robotic coach for adaptive assistive guidance. The system combines multimodal perception with knowledge-graph-grounded large language model reasoning to support context-aware adjustments during short stretching sessions while maintaining a structured routine. To complement the system description, we report an exploratory pilot comparison between scripted and adaptive guidance with three participants. The pilot findings suggest that the adaptive condition improved perceived adaptability and contextual relevance, while scripted guidance remained competitive in smoothness and predictability. These results provide preliminary evidence that structured actionable knowledge can help ground language-model-based adaptation in embodied assistive interaction, while also highlighting the need for larger, longitudinal studies to evaluate robustness, generalizability, and long-term user experience.
☆ HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models
Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.
comment: To appear in the 2026 TVCG Special Issue on the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)
☆ Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law -- both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity -- Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
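To make the two functional forms in the abstract above concrete, the sketch below compares a logarithmic curve with a sublinear power law and looks at how quickly the gain from adding one more judge shrinks. All coefficients (`a`, `b`, `c`, `k`) are hypothetical values chosen for illustration; they are not fitted parameters from the paper.

```python
import math


def score(n, a=3.0, b=0.5):
    """Logarithmic form: quality score as a function of panel size n."""
    return a + b * math.log(n)


def discoveries(n, c=4.0, k=0.6):
    """Sublinear power-law form (0 < k < 1): unique issues found by n judges."""
    return c * n ** k


def marginal_gain(f, n):
    """Improvement from growing the panel from n to n + 1 judges."""
    return f(n + 1) - f(n)


# How much of the initial per-judge gain is still left at a panel of 10?
score_retained = marginal_gain(score, 10) / marginal_gain(score, 1)
discovery_retained = marginal_gain(discoveries, 10) / marginal_gain(discoveries, 1)
print(score_retained, discovery_retained)
```

With these illustrative coefficients, both curves show diminishing returns, but the logarithmic curve keeps a smaller share of its initial marginal gain, which is the qualitative pattern the abstract describes as scores saturating faster than discoveries.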
☆ Not My Truce: Personality Differences in AI-Mediated Workplace Negotiation
AI-driven conversational coaching is increasingly used to support workplace negotiation, yet prior work assumes uniform effectiveness across users. We challenge this assumption by examining how individual differences, particularly personality traits, moderate coaching outcomes. We conducted a between-subjects experiment (N=267) comparing theory-driven AI (Trucey), general-purpose AI (Control-AI), and a traditional negotiation handbook (Control-NoAI). Participants were clustered into three profiles -- resilient, overcontrolled, and undercontrolled -- based on the Big-Five personality traits and ARC typology. Resilient workers achieved broad psychological gains primarily from the handbook, overcontrolled workers showed outcome-specific improvements with theory-driven AI, and undercontrolled workers exhibited minimal effects despite engaging with the frameworks. These patterns suggest personality as a predictor of readiness beyond stage-based tailoring: vulnerable users benefit from targeted rather than comprehensive interventions. The study advances understanding of personality-determined intervention prerequisites and highlights design implications for adaptive AI coaching systems that align support intensity with individual readiness, rather than assuming universal effectiveness.
☆ Sona: Real-Time Multi-Target Sound Attenuation for Noise Sensitivity
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
comment: 12 pages, 6 figures
☆ Programming by Chat: A Large-Scale Behavioral Analysis of 11,579 Real-World AI-Assisted IDE Sessions
IDE-integrated AI coding assistants, which operate conversationally within developers' working codebases with access to project context and multi-file editing, are rapidly reshaping software development. However, empirical investigation of this shift remains limited: existing studies largely rely on small-scale, controlled settings or analyze general-purpose chatbots rather than codebase-aware IDE workflows. We present, to the best of our knowledge, the first large-scale study of real-world conversational programming in IDE-native settings, analyzing 74,998 developer messages from 11,579 chat sessions across 1,300 repositories and 899 developers using Cursor and GitHub Copilot. These chats were committed to public repositories as part of routine development, capturing in-the-wild behavior. Our findings reveal three shifts in how programming work is organized: conversational programming operates as progressive specification, with developers iteratively refining outputs rather than specifying complete tasks upfront; developers redistribute cognitive work to AI, delegating diagnosis, comprehension, and validation rather than engaging with code and outputs directly; and developers actively manage the collaboration, externalizing plans into persistent artifacts, and negotiating AI autonomy through context injection and behavioral constraints. These results provide foundational empirical insights into AI-assisted development and offer implications for the design of future programming environments.
♻ ☆ How Motivation Relates to Generative AI Use: A Large-Scale Survey of Mexican High School Students
This study examined how high school students with different motivational profiles use generative AI tools in math and writing. Through K-means clustering analysis of survey data from 6,793 Mexican high school students, we identified three distinct motivational profiles based on self-concept and perceived subject value. Results revealed distinct domain-specific AI usage patterns across students with different motivational profiles. Our findings challenge one-size-fits-all AI integration approaches and advocate for motivationally-informed educational interventions.
comment: This submission has been accepted by the ICLS Conference at the ISLS Annual Meeting. It will be included as a poster in the 2026 conference proceedings
♻ ☆ Exploring the Interplay Between Voice, Personality, and Gender in Human-Agent Interactions
To foster effective human-agent interactions, designers must understand how vocal cues influence the perception of agent personality and the role of user-agent alignment in shaping these perceptions. In this work, we examine whether users can perceive extroversion in voice-only artificial agents and how perceived personality relates to user-agent synchrony. We conducted a study with 388 participants, who evaluated four synthetic voices derived from human recordings, varying by gender (male, female) and personality expression (introverted, extroverted). Our results show that participants were able to differentiate perceived extroversion in female agent voices, but not consistently in male voices. We also observed evidence of perceived personality synchrony, particularly in participants' evaluations of the first agent encountered, with this effect more pronounced among male participants and toward male agents. We discuss these findings in light of limitations in stimulus diversity and voice representation, and outline implications for the design of voice-based agents, particularly regarding the interaction between gender, personality perception, and initial user impressions. This paper contributes findings and insights to consider the interplay of user-agent personality and gender synchrony in the design of human-agent interactions.
♻ ☆ Activation Steering via Generative Causal Mediation
Where should we intervene in a language model (LM) to localize and control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components (e.g., attention heads) from contrastive long-form responses, to steer such diffuse concepts (e.g., talk in verse vs. talk in prose). In GCM, we first construct a dataset of contrasting behavioral inputs and long-form responses. Then, we quantify how model components mediate the concept and select the strongest mediators for steering. We evaluate GCM on three behaviors--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing from and controlling the long-form responses of LMs.
♻ ☆ How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models
Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.
comment: 9 pages, 5 figures, 4 tables, 14 references
♻ ☆ Adapting AI to the Moment: Understanding the Dynamics of Parent-AI Collaboration Modes in Real-Time Conversations with Children
Parent-AI collaboration to support real-time conversations with children is challenging due to the sensitivity and open-ended nature of such interactions. Existing systems often simplify collaboration into static modes, providing limited support for adapting AI to continuously evolving conversational contexts. To address this gap, we systematically investigate the dynamics of parent-AI collaboration modes in real-time conversations with children. We conducted a co-design study with eight parents and developed COMPASS, a research probe that enables flexible combinations of parental support functions during conversations. Using COMPASS, we conducted a lab-based study with 21 parent-child pairs. We show that parent-AI collaboration unfolds through evolving modes that adapt systematically to contextual factors. We further identify three types of parental strategies--parent-oriented, child-oriented, and relationship-oriented--that shape how parents engage with AI. These findings advance the understanding of dynamic human-AI collaboration in relational, high-stakes settings and inform the design of flexible, context-adaptive parental support systems.
♻ ☆ Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback
As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues via lexicon-based swaps of gendered terms within essays, and (ii) explicit cues via gendered author background in the prompt. We investigated six representative LLMs (i.e. GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, Llama-3-8B). We first quantified the response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally, visualised structure using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender biases in feedback they provide learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues). We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
comment: 21 pages, 7 figures
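The divergence-then-significance step described in this abstract can be illustrated with a minimal sketch. It assumes that each response is already available as an embedding vector and that significance is tested on the difference of mean distances between two groups. The abstract does not give the exact test statistic, so `cosine_distance` and `permutation_p_value` are hypothetical helper names and not the authors' code.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumption: group_a and group_b hold per-essay distances for two
    counterfactual directions (e.g. male-to-female vs. female-to-male).
    """
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A small p-value here only says that the two groups of distances differ more than random relabeling would explain; it says nothing about the direction or cause of the shift.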
♻ ☆ "Is This Really a Human Peer Supporter?": Misalignments Between Peer Supporters and Experts in LLM-Supported Interactions
Mental health is a growing global concern, prompting interest in AI-driven solutions to expand access to psychosocial support. Peer support, grounded in lived experience, offers a valuable complement to professional care. However, variability in training, effectiveness, and definitions raises concerns about quality, consistency, and safety. Large Language Models (LLMs) present new opportunities to enhance peer support interactions, particularly in real-time, text-based interactions. We present and evaluate an AI-supported system with an LLM-simulated distressed client, context-sensitive LLM-generated suggestions, and real-time emotion visualisations. Two mixed-methods studies with 12 peer supporters and 5 mental health professionals (i.e., experts) examined the system's effectiveness and implications for practice. Both groups recognised its potential to enhance training and improve interaction quality. However, a key tension emerged: while peer supporters engaged meaningfully, experts consistently flagged critical issues in peer supporter responses, such as missed distress cues and premature advice-giving. This misalignment highlights potential limitations in current peer support training, especially in emotionally charged contexts where safety and fidelity to best practices are essential. Our findings underscore the need for standardised, psychologically grounded training, especially as peer support scales globally. They also demonstrate how LLM-supported systems can scaffold this development--if designed with care and guided by expert oversight. This work contributes to emerging conversations on responsible AI integration in mental health and the evolving role of LLMs in augmenting peer-delivered care.
comment: 53 pages, 12 figures, 17 tables
♻ ☆ A Map of Exploring Human Interaction patterns with LLM: Insights into Collaboration and Creativity
The outstanding performance of large language models (LLMs) has driven the evolution of current AI system interaction patterns. This has led to considerable discussion within the Human-AI Interaction (HAII) community. Numerous studies explore this interaction from technical, design, and empirical perspectives. However, the majority of current literature reviews concentrate on interactions across the wider spectrum of AI, with limited attention given to the specific realm of interaction with LLMs. We searched for articles on human interaction with LLMs, selecting 110 relevant publications that meet a consensus definition of Human-AI interaction. Subsequently, we developed a comprehensive Mapping Procedure, structured in five distinct stages, to systematically analyze and categorize the collected publications. Applying this procedure, we mapped the chosen studies into a detailed representation of the research landscape. Overall, our review presents a novel mapping method specifically tailored to evaluate human-LLM interaction patterns. We conducted a comprehensive analysis of the current research in related fields, employing clustering techniques for categorization, which enabled us to clearly delineate the status and challenges prevalent in each identified area.
♻ ☆ Auto-Slides: An Interactive Multi-Agent System for Creating and Customizing Research Presentations
The rapid progress of large language models (LLMs) has opened new opportunities for education. While learners can interact with academic papers through LLM-powered dialogue, limitations still exist: the lack of structured organization and the heavy reliance on text can impede systematic understanding and engagement with complex concepts. To address these challenges, we propose Auto-Slides, an LLM-driven system that converts research papers into pedagogically structured, multimodal slides (e.g., diagrams and tables). Drawing on cognitive science, it creates a presentation-oriented narrative and allows iterative refinement via an interactive editor to better match learners' knowledge level and goals. Auto-Slides further incorporates verification and knowledge retrieval mechanisms to ensure accuracy and contextual completeness. Through extensive user studies, Auto-Slides demonstrates strong learner acceptance, improved structural support for understanding, and expert-validated gains in narrative quality compared with conventional LLM-based reading. Our contributions lie in designing a multi-agent framework for transforming academic papers into pedagogically optimized slides and introducing interactive customization for personalized learning.
comment: Project Homepage: https://auto-slides.github.io/
♻ ☆ Visual Neural Decoding via Improved Visual-EEG Semantic Consistency
Visual neural decoding aims to extract and interpret original visual experiences directly from human brain activity. Recent studies have demonstrated the feasibility of decoding visual semantic categories from electroencephalography (EEG) signals, among which metric learning-based approaches have delivered promising results. However, these methods that directly map EEG features into a pre-trained embedding space inevitably introduce mapping bias, resulting in a modality gap and semantic inconsistency that impair cross-modal alignment. To address these issues, this work constructs a Visual-EEG Joint Semantic Space to bridge the gap between visual images and neural signals. Building upon this space, we propose two novel approaches to improve semantic consistency between cross-modal representations and facilitate optimal alignment. Specifically, (1) we introduce a Visual-EEG Semantic Decoupling Network (VE-SDN) to explicitly disentangle semantic components from modality representations, thereby achieving purely semantic-level cross-modal alignment. (2) We introduce a Neural-Guided Intra-Class Consistency (NGIC) objective, an asymmetric representation alignment strategy designed to effectively enhance the robustness of visual representations and further boost decoding performance. Extensive experiments on a large-scale Visual-EEG dataset validate the effectiveness of the proposed method. Compared to the strongest baseline, our approach demonstrates superior decoding performance, yielding relative Top-1/Top-5 accuracy improvements of 38.9%/17.9% in intra-subject and 16.1%/11.3% in inter-subject settings. The code is available at https://github.com/hzalanchen/Cross-Modal-EEG
Programming Languages 3
☆ Compositional Program Verification with Polynomial Functors in Dependent Type Theory
We present a framework for compositional program verification based on polynomial functors in dependent type theory. In this framework, polynomial functors serve as program interfaces, Kleisli morphisms for the free monad monad serve as implementations, and dependent polynomials encode pre/postcondition specifications. We show that implementations and their verifications compose via wiring diagrams, and that Mealy machines provide a compositional coalgebraic operational semantics. We identify the abstract categorical structure underlying this compositionality as a monoidal functor from specifications to interfaces with a compatible monoidal natural transformation of lax monoidal presheaves; this opens the door to generalizations to other categories, monoidal products, etc., including settings for concurrency and relational verification, which we sketch. As a proof-of-concept, the entire framework has been formalized in Agda.
☆ Executing as You Generate: Hiding Execution Latency in LLM Code Generation
Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs, and three execution environments. Results show that Eager reduces the non-overlapped execution latency by up to 99.9% and the end-to-end latency by up to 55% across seven LLMs and four benchmarks.
comment: 10 pages
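The idea of executing code while it is still being generated can be illustrated with a toy sketch. It assumes the generator emits Python source line by line and treats a new unindented line as a signal that the preceding buffer may be a complete top-level chunk, which is then checked with `ast.parse`. This is not Eager's implementation: `stream_execute` is a hypothetical name, and batching, gating, and early error interruption are left out.

```python
import ast

# Top-level keywords that continue the previous compound statement.
_CONTINUATIONS = {"else", "elif", "except", "finally"}


def stream_execute(lines, namespace=None):
    """Execute top-level chunks as soon as they are syntactically complete.

    Returns the final namespace and the list of chunks in execution order.
    """
    namespace = {} if namespace is None else namespace
    pending = []   # lines received but not yet executed
    executed = []  # source text of each chunk that has run

    def flush():
        source = "\n".join(pending)
        if not source.strip():
            pending.clear()
            return True
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return False  # chunk is still incomplete; keep buffering
        exec(compile(tree, "<stream>", "exec"), namespace)
        executed.append(source)
        pending.clear()
        return True

    for line in lines:
        top_level = bool(line) and not line[0].isspace()
        first_word = line.split()[0].rstrip(":") if line.split() else ""
        if top_level and pending and first_word not in _CONTINUATIONS:
            flush()
        pending.append(line)

    if not flush():
        raise SyntaxError("stream ended with an incomplete statement")
    return namespace, executed
```

For example, feeding the lines `x = 1`, `def f(a):`, `    return a + x`, `y = f(2)` one at a time runs `x = 1` as soon as the `def` line arrives, while a real generator would still be producing the function body.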
♻ ☆ Traq: Estimating the Quantum Cost of Classical Programs
Predicting practical speedups offered by future quantum computers has become a major focus of the quantum community. Typically, such predictions involve numerical simulations supported by lengthy manual analyses and are carried out for one specific algorithm at a time. In this work, we present Traq, a principled approach towards estimating the quantum speedup of classical programs fully automatically. It consists of a classical language that includes high-level primitives amenable to quantum speedups, a compilation to low-level quantum programs, and a source-level cost analysis with provable guarantees. Our cost analysis upper bounds the complexity of the resulting quantum program and is sensitive to the input data of the program (in addition to providing worst-case costs). Traq is implemented as a Haskell package with an extensive evaluation.
comment: 57 pages; v2: significantly revised paper
Operating Systems 1
☆ Generative Profiling for Soft Real-Time Systems and its Applications to Resource Allocation
Modern real-time systems require accurate characterization of task timing behavior to ensure predictable performance, particularly on complex hardware architectures. Existing methods, such as worst-case execution time analysis, often fail to capture the fine-grained timing behaviors of a task under varying resource contexts (e.g., an allocation of cache, memory bandwidth, and CPU frequency), which is necessary to achieve efficient resource utilization. In this paper, we introduce a novel generative profiling approach that synthesizes context-dependent, fine-grained timing profiles for real-time tasks, including those for unmeasured resource allocations. Our approach leverages a nonparametric, conditional multi-marginal Schrödinger Bridge (MSB) formulation to generate accurate execution profiles for unseen resource contexts, with maximum likelihood guarantees. We demonstrate the efficiency and effectiveness of our approach through real-world benchmarks, and showcase its practical utility in a representative case study of adaptive multicore resource allocation for real-time systems.
Networking and Internet Architecture 11
☆ LightGuard: Transparent WiFi Security via Physical-Layer LiFi Key Bootstrapping
WiFi is inherently vulnerable to eavesdropping because RF signals may penetrate many physical boundaries, such as walls and floors. LiFi, by contrast, is an optical method confined to line-of-sight and blocked by opaque surfaces. We present LightGuard, a dual-link architecture built on this insight: cryptographic key establishment can be offloaded from WiFi to a physically confined LiFi channel to mitigate the risk of key exposure over RF. LightGuard derives session keys over a LiFi link and installs them on the WiFi interface, ensuring cryptographic material never traverses the open RF medium. A prototype with off-the-shelf WiFi NICs and our LiFi transceiver frontend validates the design.
☆ POLARIS: PHY-Aware Spectrum Steering for Dynamic Spectrum Sharing
Dynamic Spectrum Sharing (DSS) enables flexible activation of additional spectrum resources but leaves open a key runtime question: once new spectrum becomes available, which steering mechanism should migrate connected devices toward it with minimum service disruption? We present the first PHY-aware characterization of 3GPP-compliant UE steering mechanisms, including Bandwidth Part (BWP) reconfiguration, Carrier Aggregation (CA), E-UTRA-NR Dual Connectivity (EN-DC), Connected-Mode Handover (HO), and Release and Redirection (R&R), using modem-level traces from devices connected to operational networks, collected across 1,600 executions over four months in 12 urban areas. By mapping each mechanism to observable PHY-layer milestones, we decompose steering latency into intrinsic PHY-centric execution and RRC-to-PHY completion components, revealing substantial heterogeneity: NR BWP achieves 6.25 ms mean latency with zero tail exceedance above 50 ms, while CA exceeds 1225 ms; mobility procedures remain largely modem-bound, whereas discovery-driven mechanisms experience significant RRC-to-PHY completion amplification. Guided by these measurements, we design POLARIS, an O-RAN-based system that selects the least disruptive steering mechanism via a two-parameter disruption score. POLARIS reduces mean latency by up to 85.1% and T95 by 89.7% over static or non-adaptive baselines, eliminates tail exceedance above 50 ms, and avoids high-disruption mechanisms, demonstrating that PHY-layer execution profiling enables reliable and context-aware spectrum steering in DSS-enabled networks.
☆ Adversarial Attacks in AI-Driven RAN Slicing: SLA Violations and Recovery
Next-generation (NextG) cellular networks are designed to support emerging applications with diverse data rate and latency requirements, such as immersive multimedia services and large-scale Internet of Things deployments. A key enabling mechanism is radio access network (RAN) slicing, which dynamically partitions radio resources into virtual resource blocks to efficiently serve heterogeneous traffic classes, including enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). In this paper, we study the impact of adversarial attacks on AI-driven RAN slicing decisions, where a budget-constrained adversary selectively jams slice transmissions to bias deep reinforcement learning (DRL)-based resource allocation, and quantify the resulting service level agreement (SLA) violations and post-attack recovery behavior. Our results indicate that budget-constrained adversarial jamming can induce severe and slice-dependent steady-state SLA violations. Moreover, the DRL agent's reward converges toward the clean baseline only after a non-negligible recovery period.
☆ Optimal Sampling and Actuation Policies of a Markov Source over a Wireless Channel
This paper studies efficient data management and timely information dissemination for real-time monitoring of an $N$-state Markov process, enabling accurate state estimation and reliable actuation decisions. First, we analyze the Age of Incorrect Information (AoII) and derive closed-form expressions for its time average under several scheduling policies, including randomized stationary, change-aware randomized stationary, semantics-aware randomized stationary, and threshold-aware randomized stationary policies. We then formulate and solve constrained optimization problems to minimize the average AoII under a time-averaged sampling action constraint, and compare the resulting optimal sampling and transmission policies to identify the conditions under which each policy is most effective. We further show that directly using reconstructed states for actuation can degrade system performance, especially when the receiver is uncertain about the state estimate or when actuation is costly. To address this issue, we introduce a cost function, termed the Cost of Actions under Uncertainty (CoAU), which determines when the actuator should take correct actions and avoid incorrect ones when the receiver is uncertain about the reconstructed source state. We propose a randomized actuation policy and derive a closed-form expression for the probability of taking no incorrect action. Finally, we formulate an optimization problem to find the optimal randomized actuation policy that maximizes this probability. The results show that the resulting policy substantially reduces incorrect actuator actions.
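As a rough illustration of the metric, the sketch below computes a slot-by-slot AoII trajectory under one common convention: the age grows by one in every slot where the receiver's estimate differs from the source state and resets to zero when they match. The abstract does not state the exact penalty function used in the paper, so this linear form and the names `aoii_trajectory` and `average_aoii` are assumptions.

```python
def aoii_trajectory(source_states, estimates):
    """Per-slot Age of Incorrect Information with a linear time penalty."""
    age = 0
    trajectory = []
    for state, estimate in zip(source_states, estimates):
        # Reset on a correct estimate, otherwise keep aging.
        age = 0 if state == estimate else age + 1
        trajectory.append(age)
    return trajectory


def average_aoii(source_states, estimates):
    """Time-averaged AoII over the observed slots."""
    trajectory = aoii_trajectory(source_states, estimates)
    return sum(trajectory) / len(trajectory)
```

Unlike plain Age of Information, the value stays at zero for as long as the estimate is right, even if no fresh sample has been delivered.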
☆ Online Network Slice Deployment across Multiple Domains under Trust Constraints
Network slicing across multiple administrative domains raises two coupled challenges: enforcing slice-specific trust constraints while enabling fast online admission and placement decisions. This paper considers a multi-domain infrastructure where each slice request specifies a VNF chain, resource demands, and a set of (un)trusted operators, and formulates the problem as a Node-Link (NL) integer program to obtain an optimal benchmark, before proposing a Path-Link (PL) formulation that pre-generates trust and order-compliant candidate paths to enable real-time operation. To mitigate congestion, resource prices are made dynamic using a Kleinrock congestion function, which inflates marginal costs as utilization approaches capacity, steering traffic away from hotspots. Extensive simulations across different congestion levels and slice types show that: (i) PL closely tracks NL with negligible gaps at low load and moderate gaps otherwise, (ii) dynamic pricing significantly reduces blocking under scarce resources, and (iii) PL reduces computation time by about 3x-6x compared to NL, remaining within a few seconds even at high load. These results demonstrate that the proposed PL and dynamic pricing framework achieves near-optimal performance with practical runtime for online multi-domain slicing under trust constraints.
comment: 14 pages, journal paper, to be published in IEEE Transactions on Network and Service Management
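The dynamic pricing idea can be sketched as follows, assuming the familiar Kleinrock-style factor 1 / (1 - utilization). The abstract does not give the exact functional form or its parameters, so the formula and the name `congestion_price` are illustrative assumptions only.

```python
import math


def congestion_price(base_price, used, capacity):
    """Unit price of a resource, inflated as utilization nears capacity.

    Assumed form: base_price / (1 - used / capacity).
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = used / capacity
    if utilization >= 1.0:
        return math.inf  # a saturated resource is effectively unavailable
    return base_price / (1.0 - utilization)
```

With this form an idle resource costs its base price, a half-loaded one costs twice as much, and the price grows without bound near saturation, which is what steers new placements away from hotspots.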
☆ Birdcast: Interest-aware BEV Multicasting for Infrastructure-assisted Collaborative Perception
Vehicle-to-infrastructure collaborative perception (V2I-CP) leverages a high-vantage node to transmit supplementary information, i.e., bird's-eye-view (BEV) feature maps, to vehicles, effectively overcoming line-of-sight limitations. However, the downlink V2I transmission introduces a significant communication bottleneck. Moreover, vehicles in V2I-CP require heterogeneous yet overlapping information tailored to their unique occlusions and locations, rendering standard unicast/broadcast protocols inefficient. To address this limitation, we propose Birdcast, a novel multicasting framework for V2I-CP. By accounting for individual maps of interest, we formulate a joint feature selection and multicast grouping problem to maximize network-wide utility under communication constraints. Since this formulation is a mixed-integer nonlinear program and is NP-hard, we develop an accelerated greedy algorithm with a theoretical $(1 - 1/\sqrt{e})$ approximation guarantee. While motivated by CP, Birdcast provides a general framework applicable to a wide range of multicasting systems where users possess heterogeneous interests and varying channel conditions. Extensive simulations on the V2X-Sim dataset demonstrate that Birdcast significantly outperforms state-of-the-art baselines in both system utility and perception quality, achieving up to 27% improvement in total utility and a 3.2% increase in mean average precision (mAP).
☆ Hybrid Classical--Quantum Optimization of Wireless Routing Using QAOA and Quantum Walks
Routing in wireless communication networks is shaped by mobility, interference, congestion, and competing service requirements, making route selection a high-dimensional constrained optimization problem rather than a simple shortest-path task. This paper investigates the use of hybrid classical--quantum methods for wireless routing, focusing on the Quantum Approximate Optimization Algorithm (QAOA) and quantum walks as candidate mechanisms for exploring complex routing spaces. The paper examines how wireless routing can be expressed as a constrained graph optimization problem in which routing objectives, flow constraints, connectivity requirements, and interference effects are mapped into quantum-compatible Hamiltonian representations. It then discusses how these approaches can be integrated into a hybrid architecture in which classical systems perform network monitoring, graph construction, pre-processing, and deployment, while quantum subroutines are used for selected optimization components. The analysis shows that the potential value of quantum routing lies primarily in the treatment of difficult combinatorial subproblems rather than end-to-end replacement of classical routing frameworks. The paper also highlights practical limitations arising from state preparation, constraint encoding, oracle construction, hardware noise, limited qubit resources, and hybrid execution overhead. It is argued that any meaningful near-term advantage will depend on careful problem decomposition, compact encoding, and tight classical--quantum integration.
comment: 23 pages, 3 figures
♻ ☆ Legible Consensus: Topology-Aware Quorum Geometry for Asymmetric Networks
Quorum design over asymmetric topologies conflates two independent concerns: inter-tier obligation (which tiers must participate for cross-tier safety) and intra-tier replication (how each tier survives local failures). Flat quorums treat all nodes as interchangeable; when consensus fails, the structure does not reveal whether a tier was unreachable or a tier lost too many replicas. We show that mapping a crumbling-wall quorum construction to a physically tiered network separates these concerns and makes the protocol's failure modes legible: an operator can determine which tiers retain global consensus capability from the wall structure and connectivity state alone, without runtime probing. Using a 10-node Earth/LEO/Moon/Mars topology as a magnifying glass, we confirm that three of four tiers retain global liveness during Mars conjunction blackout; only the disconnected tier loses it. Consensus latency at each tier equals the speed-of-light round-trip to Earth: 183 ms (Earth), 131 ms (LEO), 5.1 s (Moon). The wall also imposes a leadership cost gradient on Multi-Paxos elections that symmetric grid quorums cannot express. A comparison between sparse and full-coverage topologies separates wall obligations from network reachability as independent liveness constraints. All results are design-level; quorum intersection is verified exhaustively in TLA+.
♻ ☆ EvalNet: A Practical Toolchain for Generation and Analysis of Extreme-Scale Interconnects
The diversity of communication paths in a network, especially non-minimal paths, is a key enabler of performance at extreme scales. We present EvalNet, a toolchain for scalable generation and analysis of over 25 important network topologies, such as Slim Fly, PolarFly, and Orthogonal Fat Trees, with a strong focus on path diversity metrics. EvalNet provides an extensive and fine-grained analysis of shortest and non-shortest paths, including their multiplicities, lengths, and interference. It supports exact measurement and visualization of bandwidth and throughput between every router pair, enabling unprecedented insight into routing potential. EvalNet also includes detailed models for construction cost and power consumption, and interfaces seamlessly with established simulators, which we tune to support large-scale evaluations on low-cost hardware. Using EvalNet, we deliver the widest and most comprehensive path diversity study to date, demonstrating how path diversity underpins throughput and scalability, and facilitating progress towards new frontiers in extreme-scale network design.
♻ ☆ On Enhancing Delay SLAs in TCP Networks through Joint Routing and Transport Assistant Deployment
The Transmission Control Protocol (TCP) has long been the primary transport protocol for applications requiring performance and reliability over the Internet. Unfortunately, due to its retransmission mechanism, TCP incurs high packet delivery delays when segments are lost. To address this issue, previous research proposed a novel network function, the Transport Assistant (TA), deployed within the network to cache and retransmit lost packets, thus reducing retransmission delays. In this paper, we propose to jointly route the flows and deploy TAs in order to minimize packet delivery delays in best-effort networks (scenario 1) or to satisfy delay-based Service Level Agreements in QoS-based networks (scenario 2). We hence formulate the joint routing and TA deployment problem as an Integer Linear Program for the two scenarios and propose a heuristic solution for large-scale instances of the problem. Through extensive simulations, we demonstrate the benefits of jointly routing flows and deploying TAs in reducing packet delivery delays (up to 16.4%) while minimizing deployment costs (up to 60.98%).
comment: 10 pages, 17 figures
♻ ☆ Vision-Language-Model-Guided Differentiable Ray Tracing for Fast and Accurate Multi-Material RF Parameter Estimation
Accurate radio-frequency (RF) material parameters are essential for electromagnetic digital twins in 6G systems, yet gradient-based inverse ray tracing (RT) remains sensitive to initialization and costly under limited measurements. This paper proposes a vision-language-model (VLM) guided framework that accelerates and stabilizes multi-material parameter estimation in a differentiable RT (DRT) engine. A VLM parses scene images to infer material categories and maps them to quantitative priors via an ITU-R material table, yielding informed conductivity initializations. The VLM further selects informative transmitter/receiver placements that promote diverse, material-discriminative paths. Starting from these priors, the DRT performs gradient-based refinement using measured received signal strengths. Experiments in NVIDIA Sionna on indoor scenes show 2-4$\times$ faster convergence and 10-100$\times$ lower final parameter error compared with uniform or random initialization and random placement baselines, achieving sub-0.1\% mean relative error with only a few receivers. Complexity analyses indicate per-iteration time scales near-linearly with the number of materials and measurement setups, while VLM-guided placement reduces the measurements required for accurate recovery. Ablations over RT depth and ray counts confirm further accuracy gains without significant per-iteration overhead. Results demonstrate that semantic priors from VLMs effectively guide physics-based optimization for fast and reliable RF material estimation.
Distributed, Parallel, and Cluster Computing 17
☆ CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate--test--refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
☆ Distributed Variational Quantum Linear Solver
This paper develops a distributed variational quantum algorithm for solving large-scale linear equations. For a linear system of the form $Ax=b$, the large square matrix $A$ is partitioned into smaller square block submatrices, each of which is known only to a single noisy intermediate-scale quantum (NISQ) computer. Each NISQ computer communicates with certain other quantum computers in the same row and column of the block partition, where the communication patterns are described by the row- and column-neighbor graphs, both of which are connected. The proposed algorithm integrates a variant of the variational quantum linear solver at each computer with distributed classical optimization techniques. The derivation of the quantum cost function provides insight into the design of the distributed algorithm. Numerical quantum simulations demonstrate that the proposed distributed quantum algorithm can solve linear systems whose size scales with the number of computers and is therefore not limited by the capacity of a single quantum computer.
☆ EXaCTz: Guaranteed Extremum Graph and Contour Tree Preservation for Distributed- and GPU-Parallel Lossy Compression
This paper introduces EXaCTz, a parallel algorithm that concurrently preserves extremum graphs and contour trees in lossy-compressed scalar field data. While error-bounded lossy compression is essential for large-scale scientific simulations and workflows, existing topology-preserving methods suffer from (1) a significant throughput disparity, where topology correction speeds are on the order of MB/s, lagging orders of magnitude behind compression speeds on the order of GB/s, (2) limited support for diverse topological descriptors, and (3) a lack of theoretical convergence bounds. To address these challenges, EXaCTz introduces a high-performance, bounded-iteration algorithm that enforces topological consistency by deriving targeted edits for decompressed data. Unlike prior methods that rely on explicit topology reconstruction, EXaCTz enforces consistent min/max neighbors of all vertices, along with global ordering among critical points. As such, the algorithm enforces consistent critical-point classification, saddle extremum connectivity, and the preservation of merge/split events. We theoretically prove the convergence of our algorithm, bounded by the longest path in a vulnerability graph that characterizes potential cascading effects during correction. Experiments on real-world datasets show that EXaCTz achieves a single-GPU throughput of up to 4.52 GB/s, outperforming the state-of-the-art contour-tree-preserving method (Gorski et al.) by up to 213x (with a single-core CPU implementation for fair comparison) and 3,285x (with a single-GPU version). In distributed environments, EXaCTz scales to 128 GPUs with 55.6\% efficiency (compared with 6.4\% for a naive parallelization), processing datasets of up to 512 GB in under 48 seconds and achieving an aggregate correction throughput of up to 32.69 GB/s.
☆ EmbedPart: Embedding-Driven Graph Partitioning for Scalable Graph Neural Network Training
Graph Neural Networks (GNNs) are widely used for learning on graph-structured data, but scaling GNN training to massive graphs remains challenging. To enable scalable distributed training, graphs are divided into smaller partitions that are distributed across multiple machines such that inter-machine communication is minimized and computational load is balanced. In practice, existing partitioning approaches face a fundamental trade-off between partitioning overhead and partitioning quality. We propose EmbedPart, an embedding-driven partitioning approach that achieves both speed and quality. Instead of operating directly on irregular graph structures, EmbedPart leverages node embeddings produced during the actual GNN training workload and clusters these dense embeddings to derive a partitioning. EmbedPart achieves more than 100x speedup over Metis while maintaining competitive partitioning quality and accelerating distributed GNN training. Moreover, EmbedPart naturally supports graph updates and fast repartitioning, and can be applied to graph reordering to improve data locality and accelerate single-machine GNN training. By shifting partitioning from irregular graph structures to dense embeddings, EmbedPart enables scalable and high-quality graph data optimization.
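The core idea above, clustering dense node embeddings and reading the clusters off as partitions, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not EmbedPart itself: the abstract does not name the clustering algorithm, so plain k-means stands in here, the first-k-points initialization is a simplification, no load balancing is enforced, and the names `kmeans_partition` and `edge_cut` are hypothetical.

```python
import math

def kmeans_partition(embeddings, k, iters=10):
    """Assign each node (one embedding per node) to one of k partitions via k-means."""
    # Assumption: deterministic initialization from the first k embeddings.
    centroids = [list(e) for e in embeddings[:k]]
    assign = [0] * len(embeddings)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(e, centroids[c]))
            for e in embeddings
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [e for e, a in zip(embeddings, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def edge_cut(edges, assign):
    """Number of edges whose endpoints land in different partitions."""
    return sum(1 for u, v in edges if assign[u] != assign[v])

# Toy data: two tight groups of node embeddings joined by a single edge (2, 3).
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]
assign = kmeans_partition(embeddings, k=2)
cut = edge_cut(edges, assign)
```

On this toy input the two embedding groups become the two partitions, and only the bridging edge (2, 3) crosses between them. A practical partitioner would also need to balance partition sizes, which this sketch ignores.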
☆ Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
Pretraining Large Language Models (LLMs) from scratch requires a massive amount of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of thousands of GPU tiles. To this end, we developed Optimus, an in-house training library with support for standard large-model training techniques. Using Optimus, we first pretrained Mula-1B, a 1-billion-parameter dense model, and Mula-7B-A1B, a 7-billion-parameter Mixture of Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, for 100 billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed compute scaling from 384 to 12288 GPU tiles and observed a scaling efficiency of around 90% at 12288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for expert computation and a novel EP-Aware sharded optimizer, resulting in training speedups of up to 1.71x. As part of the Optimus library, we also developed a robust set of reliability and fault-tolerance features to improve training stability and continuity at scale.
☆ Is RISC-V Ready for Machine Learning? Portable Gaussian Processes Using Asynchronous Tasks SC
Gaussian processes are widely used in machine learning domains but remain computationally demanding, limiting their efficient scalability across diverse hardware platforms. The GPRat library targets these challenges with the help of the asynchronous many-task runtime system HPX. In this work, we extend GPRat to enable portability across multiple hardware architectures and evaluate its performance on representative x86-64, ARM, and RISC-V chips. We conduct node-level strong-scaling and problem-size-scaling benchmarks for Gaussian Process prediction and hyperparameter optimization to assess single-core performance, parallel scalability, and architectural efficiency. Our results show that while the x86-64 Zen 2 chip achieves a 58% single-core performance advantage over the ARM-based Fujitsu A64FX, superior parallel scaling allows the 48-core ARM chip to outperform the 64-core Zen 2 by 9% at full node utilization. The evaluated SOPHON SG2042 RISC-V chip exhibits substantially lower performance and weaker scalability, with single-core performance lagging by up to a factor of 14 and large-scale parallel workloads showing slowdowns of up to a factor of 25. For problem-size scaling, ARM and x86-64 systems demonstrate comparable performance within 25%. These findings highlight the growing competitiveness of ARM-based processors and emphasize the importance of wide-register vectorization support and memory subsystem improvements for upcoming RISC-V platforms.
comment: 12 pages, 4 figures, 1 table, submitted to the International Workshop on RISC-V for HPC at ISC
☆ Reclaiming Idle CPU Cycles on Kubernetes: Sparse-Domain Multiplexing for Concurrent MPI-CFD Simulations
When MPI-parallel simulations run on shared Kubernetes clusters, conventional CPU scheduling leaves the vast majority of provisioned cycles idle at synchronization barriers. This paper presents a multiplexing framework that reclaims this idle capacity by co-locating multiple simulations on the same cluster. PMPI-based duty-cycle profiling quantifies the per-rank idle fraction; proportional CPU allocation then allows a second simulation to execute concurrently with minimal overhead, yielding 1.77x throughput. A Pareto sweep to N=5 concurrent simulations shows throughput scaling to 3.74x, with a knee at N=3 offering the best efficiency-cost trade-off. An analytical model with a single fitted parameter predicts these gains within +/-4%. A dynamic controller automates the full pipeline, from profiling through In-Place Pod Vertical Scaling (KEP-1287) to packing and fairness monitoring, achieving 3.25x throughput for four simulations without manual intervention or pod restarts. To our knowledge, this is the first CPU application of In-Place Pod Vertical Scaling to running MPI processes. Experiments on an AWS cluster with OpenFOAM CFD confirm that the results hold under both concentric and standard graph-based (Scotch) mesh partitioning.
comment: 14 pages, 7 figures, 5 tables. Submitted to Future Generation Computer Systems
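The throughput figures quoted above can be turned into a rough per-simulation efficiency by dividing each gain by the number of co-located simulations. The sketch below does only that arithmetic. The efficiency definition (gain divided by N) is our assumption rather than the paper's metric, and the N=4 figure comes from the dynamic controller rather than from the Pareto sweep.

```python
# Throughput gains quoted in the abstract, keyed by number of concurrent simulations N.
# N=2 and N=5 come from the static experiments; N=4 is the dynamic-controller result.
reported_gain = {2: 1.77, 4: 3.25, 5: 3.74}

def per_simulation_efficiency(gain: float, n: int) -> float:
    """Fraction of ideal N-fold scaling achieved (assumed definition: gain / N)."""
    return gain / n

efficiency = {n: per_simulation_efficiency(g, n) for n, g in reported_gain.items()}
for n, e in sorted(efficiency.items()):
    print(f"N={n}: gain {reported_gain[n]:.2f}x, efficiency {e:.3f}")
```

Under this definition the efficiencies are about 0.89, 0.81, and 0.75 for N = 2, 4, and 5, which shows the diminishing returns per added simulation that a knee in the Pareto curve would reflect.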
☆ TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving
Modern GPU clusters are built upon a complex hierarchy of heterogeneous interconnects, ranging from multi-rail RDMA to proprietary fabrics such as Multi-Node NVLink and Ascend UB. Orchestrating these diverse links effectively remains a critical challenge in disaggregated LLM serving. Operating Mooncake TE on thousands of GPUs exposed a critical limitation shared by existing frameworks: imperative, statically bound path selection. This rigidity forces engines to rely on state-blind striping that ignores congestion signals, creating communication silos, wasting multi-rail bandwidth due to head-of-line blocking, and leading to operational fragility where routine faults require manual intervention. We present TENT, a data-movement engine that decouples transfer intent from physical execution. Instead of locking workloads to fixed backends, TENT unifies heterogeneous interconnects into a single dynamic resource pool. Applications simply declare transfer intents, while TENT dynamically decomposes elephant flows into fine-grained slices and "sprays" them across links based on instantaneous link quality. This telemetry-driven orchestration eliminates head-of-line blocking and enables transparent, sub-50 ms self-healing by rerouting slices around failures without application logic. TENT serves as the production data plane for LLM inference and RL pipelines at multiple industrial sites. Our evaluation on H800 HGX clusters shows that TENT outperforms state-of-the-art baselines, including Mooncake TE, NIXL, and UCCL. In LLM inference with SGLang HiCache, TENT achieves up to 1.36x higher throughput and 26% lower P90 TTFT than Mooncake TE. In RL pipelines, TENT accelerates parameter updates in Moonshot Checkpoint Engine by 20-26%.
♻ ☆ Sublogarithmic Distributed Vertex Coloring with Optimal Number of Colors STOC 2026
For any $Δ$, let $k_Δ$ be the maximum integer $k$ such that $(k+1)(k+2)\le Δ$. We give a distributed LOCAL algorithm that, given an integer $k < k_Δ$, computes a valid $(Δ-k)$-coloring if one exists. The algorithm runs in $\tilde{O}(\log^4 \log n)$ rounds, which is within a polynomial factor of the $Ω(\log\log n)$ lower bound, which already applies to the case $k=0$. It is also best possible in the sense that if $k \ge k_Δ$, the problem requires $Ω(n/Δ)$ distributed rounds [Molloy, Reed '14; Bamas, Esperet '19]. For $Δ$ at most polylogarithmic, the algorithm is an exponential improvement over the current state of the art of $O(\log^{49/12} n)$ rounds. When $Δ\ge (\log n)^{50}$, our algorithm achieves an even faster runtime of $O(\log^* n)$ rounds.
comment: To appear in STOC 2026
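As a small illustration of the threshold defined in the abstract above, the following Python sketch computes $k_Δ$ directly from its definition. The function name `k_delta` is ours, and returning -1 when no non-negative $k$ satisfies the inequality (that is, when $Δ < 2$) is an assumption, since the abstract does not define that case.

```python
def k_delta(delta: int) -> int:
    """Largest integer k >= 0 with (k + 1) * (k + 2) <= delta, or -1 if none exists."""
    k = -1
    # (k + 2) * (k + 3) is the left-hand side of the inequality evaluated at k + 1.
    while (k + 2) * (k + 3) <= delta:
        k += 1
    return k

# Example: for delta = 12, (2 + 1) * (2 + 2) = 12 <= 12 but (3 + 1) * (3 + 2) = 20 > 12.
threshold = k_delta(12)
```

For $Δ = 12$ this gives $k_Δ = 2$, so the stated guarantee covers $k \in \{0, 1\}$, that is, colorings with 12 or 11 colors.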
♻ ☆ Legible Consensus: Topology-Aware Quorum Geometry for Asymmetric Networks
Quorum design over asymmetric topologies conflates two independent concerns: inter-tier obligation (which tiers must participate for cross-tier safety) and intra-tier replication (how each tier survives local failures). Flat quorums treat all nodes as interchangeable; when consensus fails, the structure does not reveal whether a tier was unreachable or a tier lost too many replicas. We show that mapping a crumbling-wall quorum construction to a physically tiered network separates these concerns and makes the protocol's failure modes legible: an operator can determine which tiers retain global consensus capability from the wall structure and connectivity state alone, without runtime probing. Using a 10-node Earth/LEO/Moon/Mars topology as a magnifying glass, we confirm that three of four tiers retain global liveness during Mars conjunction blackout; only the disconnected tier loses it. Consensus latency at each tier equals the speed-of-light round-trip to Earth: 183 ms (Earth), 131 ms (LEO), 5.1 s (Moon). The wall also imposes a leadership cost gradient on Multi-Paxos elections that symmetric grid quorums cannot express. A comparison between sparse and full-coverage topologies separates wall obligations from network reachability as independent liveness constraints. All results are design-level; quorum intersection is verified exhaustively in TLA+.
♻ ☆ EvalNet: A Practical Toolchain for Generation and Analysis of Extreme-Scale Interconnects
The diversity of communication paths in a network, especially non-minimal paths, is a key enabler of performance at extreme scales. We present EvalNet, a toolchain for scalable generation and analysis of over 25 important network topologies, such as Slim Fly, PolarFly, and Orthogonal Fat Trees, with a strong focus on path diversity metrics. EvalNet provides an extensive and fine-grained analysis of shortest and non-shortest paths, including their multiplicities, lengths, and interference. It supports exact measurement and visualization of bandwidth and throughput between every router pair, enabling unprecedented insight into routing potential. EvalNet also includes detailed models for construction cost and power consumption, and interfaces seamlessly with established simulators, which we tune to support large-scale evaluations on low-cost hardware. Using EvalNet, we deliver the widest and most comprehensive path diversity study to date, demonstrating how path diversity underpins throughput and scalability, and facilitating progress towards new frontiers in extreme-scale network design.
♻ ☆ A Hitchhiker's Guide to Privacy-Preserving Digital Payment Systems: A Survey on Anonymity, Confidentiality, and Auditability
Crypto-assets and central bank digital currencies (CBDCs) are reshaping how value is exchanged in distributed computing environments. These systems combine cryptographic primitives, protocol design, and system architectures to provide transparency and efficiency while raising critical challenges around privacy and regulatory compliance. This survey offers a comprehensive overview of privacy-preserving digital payment systems, covering both decentralized ledger systems and CBDCs. We present a taxonomy of privacy goals -- including anonymity, confidentiality, unlinkability, and auditability -- and map them to the cryptographic primitives, protocols, and system architectures that implement them. Our work adopts a design-oriented perspective, linking high-level privacy objectives to concrete implementations. We also trace the evolution of privacy-preserving digital payment systems through three generations, highlighting shifts from basic anonymity guarantees toward more nuanced privacy-accountability trade-offs. Finally, we identify open challenges, motivating further research into architectures and solutions that balance strong privacy with real-world auditability needs.
♻ ☆ Navigating the Energy Doldrums: Can We Exploit Energy-Price Volatility To Lower the Cost of Computing?
Energy costs are a major factor in the total cost of ownership (TCO) for high-performance computing (HPC) systems. The rise of intermittent green energy sources and reduced reliance on fossil fuels have introduced volatility into electricity markets, complicating energy budgeting. This paper explores variable capacity as a strategy for managing HPC energy costs -- dynamically adjusting compute resources in response to fluctuating electricity prices. While this approach can lower energy expenses, it risks underutilizing costly hardware. To evaluate this trade-off, we present a simple model that helps operators estimate the TCO impact of variable capacity strategies using key system parameters. We apply this model to real data from a university HPC cluster and assess how different scenarios could affect the cost-effectiveness of this approach in the future.
comment: 9 pages, 7 figures, 2 tables
♻ ☆ HYLU: Hybrid Parallel Sparse LU Factorization
This article introduces HYLU, a hybrid parallel LU factorization-based general-purpose solver designed for efficiently solving sparse linear systems (Ax=b) on multi-core shared-memory architectures. The key technical feature of HYLU is the integration of hybrid numerical kernels so that it can adapt to various sparsity patterns of coefficient matrices. Tests on 37 sparse matrices from SuiteSparse Matrix Collection reveal that HYLU outperforms Intel MKL PARDISO in the numerical factorization phase by geometric means of 2.36X (for one-time solving) and 2.90X (for repeated solving). HYLU can be downloaded from https://github.com/chenxm1986/hylu.
♻ ☆ Decidability of Livelock Detection for Parameterized Self-Disabling Unidirectional Rings
We prove that livelock detection is \emph{decidable in polynomial time} for parameterized symmetric unidirectional rings of self-disabling processes with bounded domain $\mathbb{Z}_m$. Given a protocol specified by its set of local transitions $T$, the algorithm decides whether a livelock exists for \emph{some} ring size $K \geq 2$, running in $O(|T|^3)$ time independent of $K$. The algorithm computes the greatest fixed point of a deflationary monotone operator on the finite set $T$ and returns \emph{livelock} iff the fixed point is non-empty. The livelock freedom argument rests on maximality: the fixed point is the largest set of transitions that can together sustain a pseudolivelock at every process; its emptiness certifies freedom for all $K$ without any search over ring sizes. The work is grounded in the algebraic characterization of livelocks from Farahat~\citep{farahat2012}, which establishes necessary and sufficient conditions for livelock existence but does not address decidability. We also handle the $(1,1)$-asymmetric case in which one distinguished process $P_0$ differs from the remaining $K-1$ identical processes. Code and algebraic foundation are available at https://github.com/cosmoparadox/mathematical-tools.
comment: 9 pages, 0 figures, 1 table
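The greatest-fixed-point computation mentioned in the abstract above follows a generic pattern: start from the full set of transitions and repeatedly discard those that the operator rejects, until nothing changes. The Python sketch below shows that pattern only. The operator `keep_continuable` is a toy stand-in that we invented for illustration (it keeps a transition if some remaining transition can follow it); it is not the paper's operator, and the `(source, target)` encoding of transitions is likewise an assumption.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Transition = Tuple[int, int]  # hypothetical encoding: (source value, target value)

def greatest_fixed_point(
    universe: Iterable[Transition],
    operator: Callable[[FrozenSet[Transition]], FrozenSet[Transition]],
) -> FrozenSet[Transition]:
    """Iterate S <- operator(S) & S from the full set until nothing changes."""
    current = frozenset(universe)
    while True:
        # Intersecting with `current` keeps the iteration deflationary,
        # so it stops after at most len(universe) + 1 rounds.
        refined = operator(current) & current
        if refined == current:
            return current
        current = refined

def keep_continuable(s: FrozenSet[Transition]) -> FrozenSet[Transition]:
    """Toy operator: keep (a, b) only if some transition in s starts at b."""
    sources = {a for a, _ in s}
    return frozenset(t for t in s if t[1] in sources)

cyclic = [(0, 1), (1, 0), (1, 2), (2, 3)]
acyclic = [(0, 1), (1, 2)]
cyclic_result = greatest_fixed_point(cyclic, keep_continuable)
acyclic_result = greatest_fixed_point(acyclic, keep_continuable)
```

With the toy operator, the cyclic input shrinks to the two transitions forming the 0-to-1-to-0 loop, while the acyclic input shrinks to the empty set. In the paper's setting, an empty fixed point is what certifies livelock freedom.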
♻ ☆ Twinning for Space-Air-Ground-Sea Integrated Networks: Beyond Conventional Digital Twin Towards Goal-Oriented Semantic Twin
A space-air-ground-sea integrated network (SAGSIN) has emerged as a cornerstone of 6G systems, establishing a unified global architecture by integrating multi-domain network resources. Motivated by the demand for real-time situational awareness and intelligent operational maintenance, digital twin (DT) technology was initially regarded as a promising solution, owing to its capability to create virtual replicas and emulate physical system behaviors. However, in the context of SAGSIN, the high-fidelity, full-scale modeling paradigm inherent to conventional DTs encounters fundamental limitations, including prohibitive computational overhead, delayed model synchronization, and cross-system semantic gaps. To address these limitations, this survey paper proposes a novel twinning framework: goal-oriented semantic twin (GOST). Unlike DTs that pursue physical mirroring, GOST prioritizes "utility" over "fidelity," leveraging semantic technologies and goal-oriented principles to construct lightweight, task-specific representations. This paper systematically articulates the GOST framework through three layers: knowledge-based semantics, data-driven semantics, and goal-oriented principles. Furthermore, we provide a comprehensive tutorial on constructing GOST by detailing its core enabling technologies and introduce a multidimensional evaluation framework for GOST. We present a case study targeting collaborative tracking tasks in remote satellite-UAV networks, demonstrating that GOST significantly outperforms conventional DTs in timeliness of perceptual data and collaborative tracking. Finally, we outline research directions, establishing GOST as a transformative twinning paradigm to guide the development of SAGSIN.
♻ ☆ CarbonEdge: Carbon-Aware Deep Learning Inference Framework for Sustainable Edge Computing
Deep learning applications at the network edge lead to significant growth in AI-related carbon emissions, presenting a critical sustainability challenge. Existing edge computing frameworks optimize for latency and throughput, but they largely ignore the environmental impact of inference workloads. This paper introduces CarbonEdge, a carbon-aware deep learning inference framework that extends adaptive model partitioning with carbon footprint estimation and green scheduling capabilities. We propose a carbon-aware scheduling algorithm that extends traditional weighted scoring with a carbon efficiency metric, supporting a tunable performance--carbon trade-off (demonstrated via weight sweep). Experimental evaluations on Docker-simulated heterogeneous edge environments show that CarbonEdge-Green mode achieves a 22.9% reduction in carbon emissions compared to monolithic execution. The framework achieves a 1.3x improvement in carbon efficiency (245.8 vs 189.5 inferences per gram CO2) with negligible scheduling overhead (0.03ms per task). These results highlight the framework's potential for sustainable edge AI deployment, providing researchers and practitioners with a tool to quantify and minimize the environmental footprint of distributed deep learning inference.
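To make the weighted-scoring idea above concrete, here is a minimal Python sketch of a scheduler score that blends a performance term with a carbon term under a tunable weight. Everything in it is an assumption for illustration: the `Node` fields, the normalization, the example numbers, and the function names are hypothetical and are not taken from CarbonEdge. The last lines only re-derive the reported 1.3x figure from the two efficiency values quoted in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    latency_ms: float              # hypothetical per-task latency
    carbon_g_per_inference: float  # hypothetical emissions per inference

def score(node: Node, w_carbon: float, max_latency: float, max_carbon: float) -> float:
    """Blend normalized performance and carbon terms; higher is better."""
    perf = 1.0 - node.latency_ms / max_latency
    green = 1.0 - node.carbon_g_per_inference / max_carbon
    return (1.0 - w_carbon) * perf + w_carbon * green

def pick(nodes: list, w_carbon: float) -> str:
    """Return the name of the best-scoring node for a given carbon weight."""
    max_latency = max(n.latency_ms for n in nodes)
    max_carbon = max(n.carbon_g_per_inference for n in nodes)
    best = max(nodes, key=lambda n: score(n, w_carbon, max_latency, max_carbon))
    return best.name

nodes = [
    Node("fast-but-dirty", latency_ms=10.0, carbon_g_per_inference=0.008),
    Node("slow-but-green", latency_ms=30.0, carbon_g_per_inference=0.004),
]
# Sweeping the weight moves the choice from the fast node to the green node.
choices = {w: pick(nodes, w) for w in (0.0, 0.5, 0.8, 1.0)}

# Ratio of the two carbon-efficiency figures quoted in the abstract.
improvement = 245.8 / 189.5
```

With these made-up nodes, weights of 0.0 and 0.5 select the fast node and weights of 0.8 and 1.0 select the green one, which is the kind of trade-off a weight sweep exposes. The ratio 245.8 / 189.5 is roughly 1.30, consistent with the stated 1.3x improvement.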
Software Engineering 42
☆ When is Generated Code Difficult to Comprehend? Assessing AI Agent Python Code Proficiency in the Wild
The rapid adoption of AI coding agents is fundamentally shifting software developers' roles from code authors to code reviewers. While developers spend a significant portion of their time reading and comprehending code, the linguistic proficiency and complexity of the Python code generated by these agents remain largely unexplored. This study investigates the code proficiency of AI agents to determine the skill level required for developers to maintain their code. Leveraging the AIDev dataset, we mined 591 pull requests containing 5,027 Python files generated by three distinct AI agents and employed pycefr, a static analysis tool that maps Python constructs to six proficiency levels, ranging from A1 (Basic) to C2 (Mastery), to analyze the code. Our results reveal that (1) AI agents predominantly generate Basic-level code, with over 90% of constructs falling into the A1 and A2 categories and less than 1% classified as Mastery (C2); (2) AI agents' and humans' pull requests share a broadly similar proficiency profile; and (3) high-proficiency code from AI agents comes from feature-addition and bug-fixing tasks. These findings suggest that while AI-generated code is generally accessible to developers with basic Python skills, specific tasks may require advanced proficiency to review and maintain complex, agent-generated constructs.
☆ VeriAct: Beyond Verifiability -- Agentic Synthesis of Correct and Complete Formal Specifications
Formal specifications play a central role in ensuring software reliability and correctness. However, automatically synthesizing high-quality formal specifications remains a challenging task, often requiring domain expertise. Recent work has applied large language models to generate specifications in Java Modeling Language (JML), reporting high verification pass rates. But does passing a verifier mean that the specification is actually correct and complete? In this work, we first conduct a comprehensive evaluation comparing classical and prompt-based approaches for automated JML specification synthesis. We then investigate whether prompt optimization can push synthesis quality further by evolving prompts through structured verification feedback. While optimization improves verifier pass rates, we find a clear performance ceiling. More critically, we propose Spec-Harness, an evaluation framework that measures specification correctness and completeness through symbolic verification, revealing that a large fraction of verifier-accepted specifications, including optimized ones, are in fact incorrect or incomplete, over- or under-constraining both inputs and outputs in ways invisible to the verifier. To push beyond this ceiling, we propose VeriAct, a verification-guided agentic framework that iteratively synthesizes and repairs specifications through a closed loop of LLM-driven planning, code execution, verification, and Spec-Harness feedback. Our experiments on two benchmark datasets show that VeriAct outperforms both prompt-based and prompt-optimized baselines, producing specifications that are not only verifiable but also correct and complete.
☆ Structure- and Event-Driven Frameworks for State Machine Modeling with Large Language Models
UML state machine design is a critical process in software engineering. Traditionally, state machines are manually crafted by experienced engineers based on natural language (NL) requirements, a time-consuming and error-prone procedure. Many automated approaches exist, but they require structured NL requirements. In this paper, we investigate the capabilities of current Large Language Models (LLMs) to fully automate UML state machine generation via specialized State Machine Frameworks (SMFs) from non-structured NL requirements. We evaluate two types of state-of-the-art LLMs using single-step and multi-step prompting approaches: a non-reasoning LLM (GPT-4o) and a reasoning-focused LLM (Claude 3.5 Sonnet). We also introduce a novel Hybrid Approach that uses the output from a Single-Prompt Baseline as an initial draft state machine, which is then refined through an SMF. In our study, two distinct SMFs are developed based on human approaches: (i) a Structure-Driven SMF, in which state machine components (states, transitions, guards, actions, etc.) are generated in sequential steps, and (ii) an Event-Driven SMF, where identified events iteratively guide state machine construction. Our experiments indicate that while LLMs demonstrate a promising ability to generate state machine models from the Single-Prompt Baseline (e.g., F1-scores of 0.90 for states and 0.75 for transitions using Claude 3.5 Sonnet), their performance is not yet sufficient for a fully automated solution (e.g., F1-scores of 0.23 for guards and 0.00 for actions for GPT-4o). Our proposed Hybrid Approach improves the performance of the non-reasoning LLM (GPT-4o) to a similar level as the reasoning LLM (Claude 3.5 Sonnet) but does not further improve the reasoning LLM. Our evaluation highlights both the potential and the limitations of current LLMs for automated state machine design, providing a baseline for future research in this domain.
comment: 10 pages, 5 figures
☆ Risk-Aware Batch Testing for Performance Regression Detection
Performance regression testing is essential in large-scale continuous-integration (CI) systems, yet executing full performance suites for every commit is prohibitively expensive. Prior work on performance regression prediction and batch testing has shown independent benefits, but each faces practical limitations: predictive models are rarely integrated into CI decision-making, and conventional batching strategies ignore commit-level heterogeneity. We unify these strands by introducing a risk-aware framework that integrates machine-learned commit risk with adaptive batching. Using Mozilla Firefox as a case study, we construct a production-derived dataset of human-confirmed regressions aligned chronologically with Autoland, and fine-tune ModernBERT, CodeBERT, and LLaMA-3.1 variants to estimate commit-level performance regression risk, achieving up to 0.694 ROC-AUC with CodeBERT. The risk scores drive a family of risk-aware batching strategies, including Risk-Aged Priority Batching and Risk-Adaptive Stream Batching, evaluated through realistic CI simulations. Across thousands of historical Firefox commits, our best overall configuration, Risk-Aged Priority Batching with linear aggregation (RAPB-la), yields a Pareto improvement over Mozilla's production-inspired baseline. RAPB-la reduces total test executions by 32.4%, decreases mean feedback time by 3.8%, maintains mean time-to-culprit at approximately the baseline level, reduces maximum time-to-culprit by 26.2%, and corresponds to an estimated annual infrastructure cost savings of approximately $491K under our cost model. These results demonstrate that risk-aware batch testing can reduce CI resource consumption while improving diagnostic timeliness. To support reproducibility and future research, we release a complete replication package containing all datasets, fine-tuning pipelines, and implementations of our batching algorithms.
comment: 14 pages, 1 figure, 4 tables. Replication package and dataset available
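The abstract above does not spell out how risk scores drive batch formation, so the following is only an illustrative sketch of one risk-aware batching rule, not the paper's RAPB or RASB algorithm. It assumes each commit carries a predicted risk in [0, 1] and closes a batch once the summed risk would exceed a budget; `risk_batches`, `risk_budget`, and `max_size` are hypothetical names.

```python
def risk_batches(commits, risk_budget=1.0, max_size=8):
    """Group (sha, risk) pairs into batches whose summed risk stays within a budget.

    Hypothetical rule for illustration: low-risk commits share large batches,
    while a high-risk commit quickly fills a batch and is tested almost alone.
    """
    batches, current, total = [], [], 0.0
    for sha, risk in commits:
        # Close the running batch if adding this commit would exceed a limit.
        if current and (total + risk > risk_budget or len(current) >= max_size):
            batches.append(current)
            current, total = [], 0.0
        current.append(sha)
        total += risk
    if current:
        batches.append(current)
    return batches


# Made-up risk scores, chosen as exact binary fractions.
example = [("a", 0.25), ("b", 0.25), ("c", 0.75), ("d", 0.125), ("e", 0.125)]
batched = risk_batches(example)
```

Under this rule a regression in a small, high-risk batch needs fewer bisection runs to locate, which is the intuition behind coupling batch size to predicted risk.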
☆ Making Sense of AI Agents Hype: Adoption, Architectures, and Takeaways from Practitioners
To support practitioners in understanding how agentic systems are designed in real-world industrial practice, we present a review of practitioner conference talks on AI agents. We analyzed 138 recorded talks to examine how companies adopt agent-based architectures (Objective 1), identify recurring architectural strategies and patterns (Objective 2), and analyze application domains and technologies used to implement and operate LLM-driven agentic systems (Objective 3).
☆ Unified Architecture Metamodel of Information Systems Developed by Generative AI
The rapid development of AI and LLMs has driven new methods of SDLC, in which a large portion of code, technical, and business documentation is generated automatically. However, since there is no single architectural framework that can provide consistent, repeatable transformations across different representation layers of information systems, such systems remain fragmented in their system representation. This study explores the problem of creating a unified architecture for LLM-oriented applications based on selected architectural frameworks by SMEs. A framework structure is proposed that covers some key types of architectural diagrams and supports a closed cycle of transformations, such as: "Code to Documentation to Code". The key architectural diagrams are split equally between main architectural layers: high-layer (business and domain understanding), middle-layer (system architecture), and low-layer (developer-layer architecture). Each architectural layer still contains some abstraction layers, which make it more flexible and better fit the requirements of design principles and architectural patterns. The conducted experiments demonstrated the stable quality of generated documentation and code when using a structured architectural context in the form of architectural diagrams. The results confirm that the proposed unified architecture metamodel can serve as an effective interface between humans and models, improving the accuracy, stability, and repeatability of LLM generation. However, the selected set of architectural diagrams should be optimised to avoid redundancy between some diagrams, and some diagrams should be updated to represent extra contextual orchestration. This work demonstrates measurable improvements for a new generation of intelligent tools that automate the SDLC and enable a comprehensive architecture compatible with AI-driven development.
comment: 22 pages, 13 figures, 12 tables, 28 references
☆ A Study on the Impact of Fault localization Granularity for Repository-Scale Code Repair Tasks
Automatic program repair can be a challenging task, especially when resolving complex issues at the repository level, which often involves issue reproduction, fault localization, code repair, testing, and validation. Issues of this scale are commonly found in popular GitHub repositories or datasets derived from them. Some repository-level approaches separate localization and repair into distinct phases. Where this is the case, the fault localization approaches vary in terms of the granularity of localization. Although the impact of granularity has been explored to some degree for smaller datasets, not all studies isolate this issue from the separate question of localization accuracy by testing code repair under the assumption of perfect fault localization. To the best of the authors' knowledge, no repository-scale studies have explicitly investigated granularity under this assumption, nor conducted a systematic empirical comparison of granularity levels in isolation. We propose a framework for performing such tests by modifying the localization phase of the Agentless framework to retrieve ground-truth localization data and include this as context in the prompt fed to the repair phase. We show that under this configuration, and as a generalization over the SWE-Bench-Mini dataset, function-level granularity yields the highest repair rate, ahead of line-level and file-level granularity. However, a deeper dive suggests that the ideal granularity may in fact be task dependent. This study is not intended to improve on the state-of-the-art, nor do we intend for results to be compared against any complete agentic frameworks. Rather, we present a proof of concept for investigating how fault localization may impact automatic code repair in repository-scale scenarios. We present preliminary findings to this end and encourage further research into this relationship between the two phases.
☆ Source Known Identifiers: A Three-Tier Identity System for Distributed Applications
Distributed applications need identifiers that satisfy storage efficiency, chronological sortability, origin metadata embedding, zero-lookup verifiability, confidentiality for external consumers, and multi-century addressability. Based on our literature survey, no existing scheme provides all six of these identifier properties within a unified system. This paper introduces Source Known Identifiers (SKIDs), a three-tier identity system that projects a single entity identity across trust boundaries, addressing all six properties. The first tier, Source Known ID (SKID), is a 64-bit signed integer embedding a timestamp with a 250-millisecond precision, application topology, and a per-entity-type sequence counter. It serves as the database primary key, providing compact storage (8 bytes) and natural B-tree ordering for optimized database indexing. The second tier, Source Known Entity ID (SKEID), extends the SKID into a 128-bit Universally Unique Identifier (UUID) compatible value by adding an entity type discriminator, an epoch selector, and a BLAKE3 keyed message authentication code (MAC). SKEIDs enable zero-lookup verification of identifier origin, integrity, and entity type within trusted environments, with a big-endian byte layout that preserves chronological ordering in lexicographic UUID string comparisons. The third tier, Secure SKEID, encrypts the entire SKEID using AES-256 symmetric encryption as a single-block pseudorandom permutation, producing ciphertext indistinguishable from random bytes while remaining compatible with standard UUID data-type parsers in string representation. Deterministic bidirectional transformations connect all three tiers.
comment: 22 pages, 3 figures, 11 tables, submitted to PeerJ
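The first-tier SKID described above packs a 250-millisecond timestamp, application topology, and a sequence counter into a 64-bit signed integer. The abstract does not give the field widths or epoch, so the layout below (40 + 12 + 11 bits with the sign bit clear, epoch 2020-01-01 UTC) is purely an assumption, and `pack_skid` / `unpack_skid` are hypothetical names. It is a minimal sketch of why placing the timestamp in the high bits yields chronological integer ordering.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field widths (not from the paper): 40 + 12 + 11 = 63 bits,
# leaving the sign bit clear so the value fits a signed 64-bit integer.
TS_BITS, TOPO_BITS, SEQ_BITS = 40, 12, 11
TICK = timedelta(milliseconds=250)                  # precision stated in the abstract
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)   # assumed custom epoch


def pack_skid(when, topology, sequence):
    """Pack timestamp ticks, topology, and sequence into one integer."""
    ticks = (when - EPOCH) // TICK  # whole 250 ms ticks since the epoch
    if not 0 <= ticks < (1 << TS_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= topology < (1 << TOPO_BITS):
        raise ValueError("topology out of range")
    if not 0 <= sequence < (1 << SEQ_BITS):
        raise ValueError("sequence out of range")
    # Timestamp occupies the most significant bits, so integers sort by time.
    return (ticks << (TOPO_BITS + SEQ_BITS)) | (topology << SEQ_BITS) | sequence


def unpack_skid(skid):
    """Recover (ticks, topology, sequence) from a packed identifier."""
    sequence = skid & ((1 << SEQ_BITS) - 1)
    topology = (skid >> SEQ_BITS) & ((1 << TOPO_BITS) - 1)
    ticks = skid >> (TOPO_BITS + SEQ_BITS)
    return ticks, topology, sequence


earlier = pack_skid(datetime(2024, 1, 1, tzinfo=timezone.utc), topology=5, sequence=7)
later = pack_skid(datetime(2024, 1, 1, 0, 0, 1, tzinfo=timezone.utc), topology=1, sequence=0)
```

With these assumed widths, 2^40 ticks of 250 ms span several thousand years, which would be consistent with the multi-century addressability goal, though the actual SKID layout may differ.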
☆ Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents
Tool-integrated LLMs can retrieve, compute, and take real-world actions via external tools, but reliability remains a key bottleneck. We argue that failures stem from both tool-use accuracy (how well an agent invokes a tool) and intrinsic tool accuracy (the tool's own correctness), while most prior work emphasizes the former. We introduce OpenTools, a community-driven toolbox that standardizes tool schemas, provides lightweight plug-and-play wrappers, and evaluates tools with automated test suites and continuous monitoring. We also release a public web demo where users can run predefined agents and tools and contribute test cases, enabling reliability reports to evolve as tools change. OpenTools includes the core framework, an initial tool set, evaluation pipelines, and a contribution protocol. Experiments and evaluations show improved end-to-end reproducibility and task performance; community-contributed, higher-quality task-specific tools deliver 6%-22% relative gains over an existing toolbox across multiple agent architectures on downstream tasks and benchmarks, highlighting the importance of intrinsic tool accuracy.
☆ From Domain Understanding to Design Readiness: a playbook for GenAI-supported learning in Software Engineering
Software engineering courses often require rapid upskilling in supporting knowledge areas such as domain understanding and modeling methods. We report an experience from a two-week milestone in a master's course where 29 students used a customized ChatGPT (GPT-3.5) tutor grounded in a curated course knowledge base to learn cryptocurrency-finance basics and Domain-Driven Design (DDD). We logged all interactions and evaluated a 34.5% random sample of prompt-answer pairs (60/~174) with a five-dimension rubric (accuracy, relevance, pedagogical value, cognitive load, supportiveness), and we collected pre/post self-efficacy. Responses were consistently accurate and relevant in this setting: accuracy averaged 98.9% with no factual errors and only 2/60 minor inaccuracies, and relevance averaged 92.2%. Pedagogical value was high (89.4%) with generally appropriate cognitive load (82.78%), but supportiveness was low (37.78%). Students reported large pre-post self-efficacy gains for genAI-assisted domain learning and DDD application. From these observations we distill seventeen concrete teaching practices spanning prompt/configuration and course/workflow design (e.g., setting expected granularity, constraining verbosity, curating guardrail examples, adding small credit with a simple quality rubric). Within this single-course context, results suggest that genAI-supported learning can complement instruction in domain understanding and modeling tasks, while leaving room to improve tone and follow-up structure.
☆ Efficient Software Vulnerability Detection Using Transformer-based Models
Detecting software vulnerabilities is critical to ensuring the security and reliability of modern computer systems. Deep neural networks have shown promising results on vulnerability detection, but they lack the capability to capture global contextual information on vulnerable code. To address this limitation, we explore the application of transformers for C/C++ vulnerability detection. We use program slices that encapsulate key syntactic and semantic features of program code, such as API function calls, array usage, pointer manipulations, and arithmetic expressions. By leveraging transformers' capability to capture both local and global contextual information on vulnerable code, our work can identify vulnerabilities accurately. Combined with data balancing and hyperparameter fine-tuning, our work offers a robust and efficient approach to identifying vulnerable code with moderate resource usage and training time.
☆ Automatic Identification of Parallelizable Loops Using Transformer-Based Source Code Representations
Automatic parallelization remains a challenging problem in software engineering, particularly in identifying code regions where loops can be safely executed in parallel on modern multi-core architectures. Traditional static analysis techniques, such as dependence analysis and polyhedral models, often struggle with irregular or dynamically structured code. In this work, we propose a Transformer-based approach to classify the parallelization potential of source code, focusing on distinguishing independent (parallelizable) loops from undefined ones. We adopt DistilBERT to process source code sequences using subword tokenization, enabling the model to capture contextual syntactic and semantic patterns without handcrafted features. The approach is evaluated on a balanced dataset combining synthetically generated loops and manually annotated real-world code, using 10-fold cross-validation and multiple performance metrics. Results show consistently high performance, with mean accuracy above 99% and low false positive rates, demonstrating robustness and reliability. Compared to prior token-based methods, the proposed approach simplifies preprocessing while improving generalization and maintaining computational efficiency. These findings highlight the potential of lightweight Transformer models for practical identification of parallelization opportunities at the loop level.
comment: 28 pages, 12 figures
☆ A Lightweight Hybrid Publish/Subscribe Event Fabric for IPC and Modular Distributed Systems
Modular software deployed on mini compute units in controlled distributed environments often needs two messaging paths: low-overhead in-process coordination and selective cross-node distribution. In practice, event identity, serialization, and transport bridging are frequently implemented as ad hoc glue, which complicates inter-process communication (IPC), structured routing, and shutdown behavior. This paper presents CNS, a lightweight local-first hybrid event fabric centered on asynchronous fire-and-forget messaging. CNS combines a typed event key, per-family serialization and validation, a local publish/subscribe context for in-process coordination, and a NATS-backed distributed context for inter-node distribution. A bridge runtime moves events between the two contexts while preserving a common routing vocabulary. The primary operating model is fire-and-forget publication and subscription; bidirectional request-reply remains available as a secondary extension on the same subject space. A Python prototype and single-machine measurements are reported. Local-only delivery averaged about 30 μs. Distributed-only delivery averaged 1.26-1.37 ms, and the hybrid bridge averaged 1.64-1.89 ms. Validation introduced modest overhead relative to serialization choice. The resulting artifact is suited to structured IPC and practical message movement within modular services and across bounded sets of controlled nodes.
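The local half of such a fabric (a typed event key, per-family validation, and fire-and-forget publish/subscribe) can be sketched in a few lines. This is not the CNS API: `EventKey`, `LocalContext`, and their methods are hypothetical names, delivery here is synchronous for brevity, and the NATS-backed distributed context and bridge are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventKey:
    """Typed event identity; hashable so it can index the handler table."""
    family: str
    name: str

    @property
    def subject(self):
        # Dotted subject string, the style commonly used for broker routing.
        return f"{self.family}.{self.name}"


class LocalContext:
    """In-process publish/subscribe with optional per-family validation."""

    def __init__(self):
        self._handlers = {}    # EventKey -> list of callables
        self._validators = {}  # family name -> predicate on the payload

    def register_family(self, family, validator):
        self._validators[family] = validator

    def subscribe(self, key, handler):
        self._handlers.setdefault(key, []).append(handler)

    def publish(self, key, payload):
        """Fire-and-forget: deliver to current subscribers, return nothing."""
        validator = self._validators.get(key.family)
        if validator is not None and not validator(payload):
            raise ValueError(f"invalid payload for {key.subject}")
        for handler in list(self._handlers.get(key, ())):
            handler(payload)


received = []
ctx = LocalContext()
reading = EventKey("sensor", "reading")
ctx.register_family("sensor", lambda p: "value" in p)
ctx.subscribe(reading, received.append)
ctx.publish(reading, {"value": 21.5})
```

Keeping the key as a small immutable value means the same object could serve both as the local dictionary key and, through `subject`, as a routing string on a broker, which is one way to preserve a common routing vocabulary across contexts.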
☆ An Empirical Study on How Architectural Topology Affects Microservice Performance and Energy Usage
Microservice architectures form the backbone of modern software systems for their scalability, resilience, and maintainability, but their rise in cloud-native environments raises energy efficiency concerns. While prior research addresses microservice decomposition and placement, the impact of topology, the structural arrangement and interaction pattern among services, on energy efficiency remains largely underexplored. This study quantifies the impact of topologies on energy efficiency and performance across six canonical ones (Sequential Fan-Out, Parallel Fan-Out, Chain, Hierarchical, Probabilistic, Mesh), each instantiated at 5-, 10-, and 20-service scales using the μBench framework. We measure throughput, response time, energy usage, CPU utilization, and failure rates under an identical workload. The results indicate that topology influences the energy efficiency of microservices under the studied conditions. As system size increases, energy consumption grows, with the steepest rise observed in dense Mesh and Chain topologies. Mesh topologies perform worst overall, with low throughput, long response times, and high failure rates. Hierarchical, Chain, and Fan-Out designs balance performance and energy use better. As systems scale, metrics converge, with Probabilistic and Parallel Fan-Out emerging as the most energy-efficient under CPU-bound loads. These results guide greener microservice architecture design and serve as a baseline for future research on workload and deployment impacts.
☆ Phyelds: A Pythonic Framework for Aggregate Computing
Aggregate programming is a field-based coordination paradigm with over a decade of exploration and successful applications across domains including sensor networks, robotics, and IoT, with implementations in various programming languages, such as Protelis, ScaFi (Scala), and FCPP (C++). A recent research direction integrates machine learning with aggregate computing, aiming to support large-scale distributed learning and provide new abstractions for implementing learning algorithms. However, existing implementations do not target data science practitioners, who predominantly work in Python--the de facto language for data science and machine learning, with a rich and mature ecosystem. Python also offers advantages for other use cases, such as education and robotics (e.g., via ROS). To address this gap, we present Phyelds, a Python library for aggregate programming. Phyelds offers a fully featured yet lightweight implementation of the field calculus model of computation, featuring a Pythonic API and an architecture designed for seamless integration with Python's machine learning ecosystem. We describe the design and implementation of Phyelds and illustrate its versatility across domains, from well-known aggregate computing patterns to federated learning coordination and integration with a widely used multi-agent reinforcement learning simulator.
☆ When Labels Are Scarce: A Systematic Mapping of Label-Efficient Code Vulnerability Detection
Machine-learning-based code vulnerability detection (CVD) has progressed rapidly, from deep program representations to pretrained code models and LLM-centered pipelines. Yet dependable vulnerability labeling remains expensive, noisy, and uneven across projects, languages, and CWE types, motivating approaches that reduce reliance on human labeling. This survey maps these approaches, synthesizing five paradigm families and the mechanisms they use. It connects mechanisms to token, graph, hybrid, and knowledge-based representations, and consolidates evaluation and reporting axes that limit comparison (label-budget specification, compute/cost assumptions, leakage, and granularity mismatches). A Design Map and constraint-first Decision Guide distill trade-offs and failure modes for practical method selection.
☆ BayesInsights: Modelling Software Delivery and Developer Experience with Bayesian Networks at Bloomberg
As software in industry grows in size and complexity, so does the volume of engineering data that companies generate and use. Ideally, this data could be used for many purposes, including informing decisions on engineering priorities. However, without a structured representation of the links between different aspects of software development, companies can struggle to identify the root causes of deficiencies or anticipate the effects of changes. In this paper, we report on our experience at Bloomberg in developing a novel tool, dubbed BayesInsights, which provides an interactive interface for visualising causal dependencies across various aspects of the software engineering (SE) process using Bayesian Networks (BNs). We describe our journey from defining network structures using a combination of established literature, expert insight, and structure learning algorithms, to integrating BayesInsights into existing data analytics solutions, and conclude with a mixed-methods evaluation of performance benchmarking and survey responses from 24 senior practitioners at Bloomberg. Our results revealed 95.8% of participants found the tool useful for identifying software delivery challenges at the team and organisational levels, cementing its value as a proof of concept for modelling software delivery and developer experience. BayesInsights is currently in preview, with access granted to seven engineering teams and a wider deployment roadmap in place for the future.
comment: 6 pages, 1 figure, Camera Ready Accepted at FSE-Industry 2026
☆ SkillReducer: Optimizing LLM Agent Skills for Token Efficiency
LLM-based coding agents rely on skills, pre-packaged instruction sets that extend agent capabilities, yet every token of skill content injected into the context window incurs both monetary cost and attention dilution. To understand the severity of this problem, we conduct a large-scale empirical study of 55,315 publicly available skills and find systemic inefficiencies: 26.4% lack routing descriptions entirely, over 60% of body content is non-actionable, and reference files can inject tens of thousands of tokens per invocation. Motivated by these findings, we present SkillReducer, a two-stage optimization framework. Stage 1 optimizes the routing layer by compressing verbose descriptions and generating missing ones via adversarial delta debugging. Stage 2 restructures skill bodies through taxonomy-driven classification and progressive disclosure, separating actionable core rules from supplementary content loaded on demand, validated by faithfulness checks and a self-correcting feedback loop. Evaluated on 600 skills and the SkillsBench benchmark, SkillReducer achieves 48% description compression and 39% body compression while improving functional quality by 2.8%, revealing a less-is-more effect where removing non-essential content reduces distraction in the context window. These benefits transfer across five models from four families with a mean retention of 0.965, and generalize to an independent agent framework.
☆ Compiling Code LLMs into Lightweight Executables
The demand for better prediction accuracy and higher execution performance in neural networks continues to grow. The emergence and success of Large Language Models (LLMs) have led to the development of many cloud-based tools for software engineering tasks such as code suggestion. While effective, cloud deployment raises concerns over privacy, latency, and reliance on connectivity. Running LLMs locally on personal devices such as laptops would address these issues by enabling offline use and reducing response time. However, local deployment is challenging: commodity devices lack high-performance accelerators like GPUs and are constrained by limited memory and compute capacity, making it difficult to execute large models efficiently. We present Ditto, a novel method for optimizing both the model size of Code LLMs and their inference programs, particularly for statically-typed programming languages such as C. Our approach integrates two key components: (1) a model compression technique inspired by product quantization, which clusters model parameters into codebooks and quantizes them to lower bit widths while ensuring that outputs remain within a bounded error, as well as synthesizing the inference program for the quantized model; and (2) a compilation pass integrated into LLVM that automatically detects and replaces unoptimized General Matrix-Vector Multiplication (GEMV) operations with implementations from Basic Linear Algebra Subprograms (BLAS) libraries, which are highly optimized for runtime performance. The output of Ditto is an optimized and compiled executable for running selected Code LLMs. We evaluate Ditto on three popular Code LLMs, achieving up to 10.5× faster inference and 6.4× lower memory usage compared with their original inference pipeline, while maintaining accuracy close to that of the full-precision models (with an average loss of only 0.27% in pass@1).
comment: This is not the camera-ready version; content is subject to change. Accepted at the 34th ACM International Conference on the Foundations of Software Engineering (FSE 2026), 25 pages
☆ Terminal Agents Suffice for Enterprise Automation
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.
comment: Pre-print. Under review for COLM2026
☆ Machine Learning in the Wild: Early Evidence of Non-Compliant ML-Automation in Open-Source Software
The increasing availability of Machine Learning (ML) models, particularly foundation models, enables their use across a range of downstream applications, from scenarios with missing data to safety-critical contexts. This, in principle, may contravene not only the models' terms of use, but also governmental principles and regulations. This paper presents a preliminary investigation into the use of ML models by 173 open-source projects on GitHub, spanning 16 application domains. We evaluate whether models are used to make decisions, the scope of these decisions, and whether any post-processing measures are taken to reduce the risks inherent in fully autonomous systems. Lastly, we investigate the models' compliance with established terms of use. This study lays the groundwork for defining guidelines for developers and creating analysis tools that automatically identify potential regulatory violations in the use of ML models in software systems.
☆ HackRep: A Large-Scale Dataset of GitHub Hackathon Projects
Hackathons are time-bound collaborative events that often target software creation. Although hackathons have been studied in the past, existing work has focused on in-depth case studies, limiting our understanding of hackathons as a software engineering activity. To complement the existing body of knowledge, we introduce HackRep, a dataset of 100,356 hackathon GitHub repositories. We illustrate the ways HackRep can benefit software engineering researchers by presenting a preliminary investigation of hackathon project continuation, hackathon team composition, and an estimation of hackathon geography. We further demonstrate the opportunities of using this dataset, for instance by showing that hackathon durations can be estimated from commit timestamps.
☆ EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback
Scratch is the most popular programming environment for novices, with over 1.15 billion projects created worldwide. Unlike traditional languages, correctness in Scratch is defined by visible behavior on the stage rather than by code structure alone, so programs that appear correct in the workspace can still fail at runtime due to timing, event ordering, or cross-sprite interactions. Visual execution evidence such as gameplay videos can therefore be essential for diagnosis and repair. However, capturing and processing this evidence inside an automated repair loop introduces substantial overhead. Probing execution, recording stage behavior, rebuilding executable .sb3 projects, and verifying candidate fixes consume time, monetary cost, and resources across an entire repair trajectory rather than a single model call. We present EcoScratch, a repair pipeline that uses lightweight runtime signals to decide whether the next attempt stays text-only or escalates to multimodal prompting. The controller also sets the JSON Patch budget and verification effort, so evidence choice and repair budget are coupled inside the same decision. EcoScratch rebuilds candidate fixes into executable .sb3 projects and records per-trajectory traces, monetary cost, and local-runtime energy. We evaluate 12 models on 100 executable Scratch repair projects under four controller settings, yielding 4800 repair trajectories. In this matrix, a selective multimodal policy gives the strongest observed success-cost-energy tradeoff. It reaches the highest generation success (30.3%) while using less average cost and local-runtime energy than the two non-adaptive multimodal baselines under the same bounded trajectory budget; text-only remains the lowest-cost floor. Across the evaluated matrix, multimodal evidence helps most when it is used to control escalation within a bounded trajectory budget rather than applied uniformly.
☆ Enhancing LLM-Based Bug Reproduction for Android Apps via Pre-Assessment of Visual Effects
In the development and maintenance of Android apps, the quick and accurate reproduction of user-reported bugs is crucial to ensure application quality and improve user satisfaction. However, this process is often time-consuming and complex. Therefore, there is a need for an automated approach that can explore the Application Under Test (AUT) and identify the correct sequence of User Interface (UI) actions required to reproduce a bug, given only a complete bug report. Large Language Models (LLMs) have shown remarkable capabilities in understanding textual and visual semantics, making them a promising tool for planning UI actions. Nevertheless, our study shows that even when using state-of-the-art LLM-based approaches, these methods still struggle to follow detailed bug reproduction instructions and replan based on new information, due to their inability to accurately predict and interpret the visual effects of UI components. To address these limitations, we propose LTGDroid. Our insight is to execute all possible UI actions on the current UI page during exploration, record their corresponding visual effects, and leverage these visual cues to guide the LLM in selecting UI actions that are likely to reproduce the bug. We evaluated LTGDroid, instantiated with GPT-4.1, on a benchmark consisting of 75 bug reports from 45 popular Android apps. The results show that LTGDroid achieves a reproduction success rate of 87.51%, improving over the state-of-the-art baselines by 49.16% and 556.30%, while requiring an average of 20.45 minutes and approximately $0.27 to successfully reproduce a bug. The LTGDroid implementation is publicly available at https://github.com/N3onFlux/LTGDroid.
☆ How and Why Agents Can Identify Bug-Introducing Commits
Śliwerski, Zimmermann, and Zeller (SZZ) just won the 2026 ACM SIGSOFT Impact Award for asking: When do changes induce fixes? Their paper from 2005 served as the foundation for a wide array of approaches aimed at identifying bug-introducing changes (or commits) from fix commits in software repositories. But even after two decades of progress, the best-performing approach from 2025 yields a modest increase of 10 percentage points in F1-score on the most popular Linux kernel dataset. In this paper, we uncover how and why LLM-based agents can substantially advance the state-of-the-art in identifying bug-introducing commits from fix commits. We propose a simple agentic workflow based on searching a set of candidate commits and find that it raises the F1-score from 0.64 to 0.81 on the most popular Linux kernel dataset, a bigger jump than between the original 2005 method (0.54) and the previous SOTA (0.64). We also uncover why agents are so successful: They derive short greppable patterns from the fix commit diff and message and use them to effectively search and find bug-introducing commits in large candidate sets. Finally, we also discuss how these insights might enable further progress in bug detection, root cause understanding, and repair.
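The pattern-based search described above can be illustrated with a small sketch. It is not the authors' agentic workflow: it assumes, as a simplification, that useful patterns are the distinctive lines a fix removes and that a bug-introducing commit is one whose diff added them. `greppable_patterns`, `rank_candidates`, and the sample diffs and hashes are all hypothetical.

```python
def greppable_patterns(fix_diff, min_len=8):
    """Collect distinctive lines removed by the fix (unified-diff '-' lines)."""
    patterns = []
    for line in fix_diff.splitlines():
        if line.startswith("-") and not line.startswith("---"):
            text = line[1:].strip()
            if len(text) >= min_len and text not in patterns:
                patterns.append(text)
    return patterns


def rank_candidates(patterns, candidates):
    """Rank candidate commits by how many patterns their added lines contain."""
    scores = {}
    for sha, diff in candidates.items():
        added = [
            line[1:].strip()
            for line in diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")
        ]
        hits = sum(1 for p in patterns if any(p in a for a in added))
        if hits:
            scores[sha] = hits
    # Most hits first; ties broken by hash for a deterministic order.
    return sorted(scores, key=lambda sha: (-scores[sha], sha))


# Made-up example data.
FIX_DIFF = """--- a/drivers/foo.c
+++ b/drivers/foo.c
@@ -10,1 +10,1 @@
-    if (idx <= MAX_SLOTS)
+    if (idx < MAX_SLOTS)
"""
CANDIDATES = {
    "abc123": "+++ b/drivers/foo.c\n+    if (idx <= MAX_SLOTS)\n+        slots[idx] = v;\n",
    "def456": "+++ b/drivers/bar.c\n+    return 0;\n",
}
patterns = greppable_patterns(FIX_DIFF)
ranking = rank_candidates(patterns, CANDIDATES)
```

A plain substring match like this is what makes the patterns "greppable": the search scales to large candidate sets without parsing or building the code.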
☆ Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
comment: Accepted in the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)
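One plausible reading of "code semantic entropy" is sketched below. The abstract does not give the exact definition, so this is an assumption: sampled programs are grouped by their observable behavior on the available test inputs, and the Shannon entropy of the resulting cluster sizes serves as the problem-level uncertainty. The function name `semantic_entropy` and the use of `repr` to compare outputs are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(programs, test_inputs):
    """Entropy (in nats) over clusters of programs that behave identically.

    programs: callables sampled for one problem (assumed to take one argument).
    test_inputs: inputs only; no expected outputs (no test oracle) are needed.
    """
    def signature(prog):
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(prog(x)))
            except Exception as exc:
                # A crash is treated as a behavior of its own.
                outputs.append("error:" + type(exc).__name__)
        return tuple(outputs)

    clusters = Counter(signature(p) for p in programs)
    n = len(programs)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())
```

Under this reading, an entropy of zero means every sample behaves the same way on the test inputs, while higher values indicate that the model's samples disagree functionally.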
☆ Sustainable AI Assistance Through Digital Sobriety
As AI assistants become commonplace in daily life, the demand for solutions that reduce the cost of inference without sacrificing utility is increasing. Existing work on AI sustainability frequently emphasizes hardware and software optimizations; however, there may be comparable value in social approaches that shape user behavior and discourage unnecessary use. In this study, we operationalize sustainability in terms of energy-efficiency and analyze a publicly sourced sample of prompts where AI is used for assistance in software development. Using this categorization, we find that nearly half of the observed queries can be considered unnecessary relative to their expected benefit. We further observe that factoid-style information retrieval constitutes the largest share of unnecessary requests, suggesting that a meaningful portion of everyday AI usage may be replaceable with lower-cost alternatives (e.g., conventional search or local documentation). These findings motivate a closer examination of how, why, and when AI systems are invoked, and what norms or interface-level nudges might reduce avoidable demand. We conclude with a call to replicate and extend this preliminary analysis and to pay greater attention to the social dimension of AI sustainability.
☆ Software Vulnerability Detection Using a Lightweight Graph Neural Network
Large Language Models (LLMs) have emerged as a popular choice in vulnerability detection studies given their foundational capabilities, open source availability, and variety of models, but have limited scalability due to extensive compute requirements. Using the natural graph relational structure of code, we show that our proposed graph neural network (GNN) based deep learning model VulGNN for vulnerability detection can achieve performance almost on par with LLMs, but is 100 times smaller in size and fast to retrain and customize. We describe the VulGNN architecture, ablation studies on components, learning rates, and generalizability to different code datasets. As a lightweight model for vulnerability analysis, VulGNN is efficient and deployable at the edge as part of real-world software development pipelines.
comment: 12 pages, 3 figures, preprint of journal submission
☆ Designing FSMs Specifications from Requirements with GPT 4.0
Finite state machines (FSMs) are executable formal specifications of reactive systems. These machines are designed based on systems' requirements, which are often recorded in textual documents written in natural language. FSMs play a crucial role in different phases of model-driven system engineering (MDE); for example, they serve to automate testing activities. FSM quality is critical: the lower the quality of an FSM, the higher the number of faults surviving the testing phase and the higher the risk of failure of the systems in production, which could lead to catastrophic scenarios. Therefore, this paper leverages recent advances in LLMs to propose an LLM-based framework for designing FSMs from requirements. The framework also suggests an expert-centric approach based on FSM mutation and test generation for repairing the FSMs produced by LLMs. This paper also provides an experimental analysis and evaluation of LLMs' capacities in performing the tasks presented in the framework and FSM repair via various methods. The paper presents experimental results with simulated data. These results and methods bring a new analysis and vision of LLMs that are useful for further development of machine learning technology and its applications to MDE.
☆ Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback
Logging statements are essential for software debugging and maintenance. However, existing approaches to automatic logging generation rely on static analysis and produce statements in a single pass without considering runtime behavior. They are also typically evaluated by similarity to developer-written logs, assuming these logs form an adequate gold standard. This assumption is increasingly limiting in the LLM era, where logs are consumed not only by developers but also by LLMs for downstream tasks. As a result, optimizing logs for human similarity does not necessarily reflect their practical utility. To address these limitations, we introduce ReLog, an iterative logging generation framework guided by runtime feedback. ReLog leverages LLMs to generate, execute, evaluate, and refine logging statements so that runtime logs better support downstream tasks. Instead of comparing against developer-written logs, we evaluate ReLog through downstream debugging tasks, including defect localization and repair. We construct a benchmark based on Defects4J under both direct and indirect debugging settings. Results show that ReLog consistently outperforms all baselines, achieving an F1 score of 0.520 and repairing 97 defects in the direct setting, and the best F1 score of 0.408 in the indirect setting where source code is unavailable. Additional experiments across multiple LLMs demonstrate the generality of the framework, while ablations confirm the importance of iterative refinement and compilation repair. Overall, our work reframes logging as a runtime-guided, task-oriented process and advocates evaluating logs by their downstream utility rather than textual similarity.
☆ Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health
Requirements engineering for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.
☆ SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization
Fault localization identifies program locations responsible for observed failures. Existing techniques rank suspicious code using syntactic spectra--signals derived from execution structure such as statement coverage, control-flow divergence, or dependency reachability. These signals collapse for semantic bugs, where failing and passing executions follow identical code paths and differ only in whether semantic intent is satisfied. Recent LLM-based approaches introduce semantic reasoning but produce stochastic, unverifiable outputs that cannot be systematically cross-referenced across tests or distinguish root causes from cascading effects. We present SemLoc, a fault localization framework based on structured semantic grounding. SemLoc converts free-form LLM reasoning into a closed intermediate representation that binds each inferred property to a typed program anchor, enabling runtime checking and attribution to program structure. It executes instrumented programs to construct a semantic violation spectrum--a constraint-by-test matrix--from which suspiciousness scores are derived analogously to coverage-based methods. A counterfactual verification step further prunes over-approximate constraints and isolates primary causal violations. We evaluate SemLoc on SemFault-250, a corpus of 250 Python programs with single semantic faults. SemLoc outperforms five coverage-, reduction-, and LLM-based baselines, achieving Top-1 accuracy of 42.8% and Top-3 of 68%, while reducing inspection to 7.6% of executable lines. Counterfactual verification provides an additional 12% accuracy gain and identifies primary causal semantic constraints.
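The step from a semantic violation spectrum to suspiciousness scores can be sketched as follows. The abstract says only that scores are derived "analogously to coverage-based methods"; using the Ochiai formula here is an assumption, chosen because it is a standard spectrum-based metric, and the function name `ochiai_scores` and the boolean matrix layout are hypothetical.

```python
import math


def ochiai_scores(violations, failed):
    """Score each constraint from a constraint-by-test violation matrix.

    violations[c][t] is True when constraint c is violated in test t.
    failed[t] is True when test t fails.
    Ochiai is an assumed metric, not necessarily the one SemLoc uses.
    """
    total_failed = sum(failed)
    scores = []
    for row in violations:
        # Violations observed in failing tests and in passing tests.
        vf = sum(1 for v, f in zip(row, failed) if v and f)
        vp = sum(1 for v, f in zip(row, failed) if v and not f)
        denom = math.sqrt(total_failed * (vf + vp))
        scores.append(vf / denom if denom else 0.0)
    return scores
```

A constraint that is violated in every failing test and in no passing test receives the maximum score of 1.0, which matches the intuition behind coverage-based ranking with "statement executed" replaced by "constraint violated".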
☆ The Energy Footprint of LLM-Based Environmental Analysis: LLMs and Domain Products
As large language models (LLMs) are increasingly used in domain-specific applications, including climate change and environmental research, understanding their energy footprint has become an important concern. The growing adoption of retrieval-augmented generation (RAG) systems for climate-domain-specific analysis raises a key question: how does the energy consumption of domain-specific RAG workflows compare with that of direct generic LLM usage? Prior research has focused on standalone model calls or coarse token-based estimates, while leaving the energy implications of deployed application workflows insufficiently understood. In this paper, we assess the inference-time energy consumption of two LLM-based climate analysis chatbots (ChatNetZero and ChatNDC) compared to the generic GPT-4o-mini model. We estimate energy use under actual user queries by decomposing each workflow into retrieval, generation, and hallucination-checking components. We also test across different times of day and geographic access locations. Our results show that the energy consumption of domain-specific RAG systems depends strongly on their design. More agentic pipelines substantially increase inference-time energy use, particularly when used for additional accuracy or verification checks, although they may not yield proportional gains in response quality. While more research is needed to test these initial findings more robustly across models, environments, and prompting structures, this study provides a new understanding of how the design of domain-specific LLM products affects both the energy footprint and quality of output.
☆ WybeCoder: Verified Imperative Code Generation
Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, to equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements by scaling our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, resulting in hundreds of lines of verified code and overcoming plateaus reported in previous works. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path to automated construction of large-scale datasets of verified imperative code.
♻ ☆ DynQ: A Dynamic Topology-Agnostic Quantum Virtual Machine via Quality-Weighted Community Detection
Quantum cloud platforms have scaled hardware capacity but not the abstraction exposed to users: small programs still monopolise entire processors, and existing Quantum Virtual Machine (QVM) designs often rely on fixed, topology-specific partitions that are brittle under calibration drift, spatial heterogeneity, and transient defects. We present DynQ, a dynamic topology-agnostic QVM that derives execution regions directly from live calibration data. DynQ models a processor as a quality-weighted coupling graph and formulates region discovery as community detection, turning high internal cohesion and low external coupling into a hardware-aware objective for quantum virtualisation. This produces regions that are compilation-friendly, quality-aware, and resilient to degraded couplers and unavailable qubits. DynQ separates offline region discovery from online allocation, enabling low-latency scheduling over pre-validated regions while allowing recomputation under changing hardware conditions. Across five IBM backends, real-device experiments on IBM Kingston and Torino, and cross-architecture evaluation on Rigetti Ankaa-3 via AWS Braket, DynQ improves execution quality, recovers workloads lost under transient defects, and maintains stable output under concurrent batching. It reduces L1 error by up to 45.1% and improves output similarity by up to 19.1% on heterogeneous hardware, while eliminating observed baseline failures on real devices. These results position quantum virtualisation as a graph-driven systems problem and show that adaptive, quality-aware QVMs enable reliable multi-tenant quantum cloud services.
comment: 47 pages, 12 figures
♻ ☆ AI In Cybersecurity Education -- Scalable Agentic CTF Design Principles and Educational Outcomes
Large language models are rapidly changing how learners acquire and demonstrate cybersecurity skills. However, when human--AI collaboration is allowed, educators still lack validated competition designs and evaluation practices that remain fair and evidence-based. This paper presents a cross-regional study of LLM-centered Capture-the-Flag competitions built on the Cyber Security Awareness Week competition system. To understand how autonomy levels and participants' knowledge backgrounds influence problem-solving performance and learning-related behaviors, we formalize three autonomy levels: human-in-the-loop, autonomous agent frameworks, and hybrid. To enable verification, we require traceable submissions including conversation logs, agent trajectories, and agent code. We analyze multi-region competition data covering an in-class track, a standard track, and a year-long expert track, each targeting participants with different knowledge backgrounds. Using data from the 2025 competition, we compare solve performance across autonomy levels and challenge categories, and observe that autonomous agent frameworks and hybrid achieve higher completion rates on challenges requiring iterative testing and tool interactions. In the in-class track, we classify participants' agent designs and find a preference for lightweight, tool-augmented prompting and reflection-based retries over complex multi-agent architectures. Our results offer actionable guidance for designing LLM-assisted cybersecurity competitions as learning technologies, including autonomy-specific scoring criteria, evidence requirements that support solution verification, and track structures that improve accessibility while preserving reliable evaluation and engagement.
♻ ☆ CL4SE: A Context Learning Benchmark For Software Engineering Tasks
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics. Extensive experiments demonstrate that context learning yields an average performance improvement of 24.7% across all tasks. Specifically, procedural context boosts code review performance by up to 33% (Qwen3-Max), mixed positive-negative context improves patch assessment by 30% (DeepSeek-V3), project-specific context increases code summarization BLEU by 14.78% (GPT-Oss-120B), and interpretable examples enhance code generation PASS@1 by 5.72% (DeepSeek-V3). CL4SE establishes the first standardized evaluation framework for SE context learning, provides actionable empirical insights into task-specific context design, and releases a large-scale dataset to facilitate reproducible research in this domain.
comment: 23 pages, 4 figures
♻ ☆ ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
Self-generated skills for web agents are often unstable and can even hurt performance relative to direct acting. We argue that the key bottleneck is not only skill generation quality, but the fact that web skills remain implicit and therefore cannot be checked or locally repaired. To address this, we present ContractSkill, a framework that converts a draft skill into an executable artifact with explicit procedural structure, enabling deterministic verification, fault localization, and minimal local repair. This turns skill refinement from full rewriting into localized editing of a single skill artifact. Experiments on VisualWebArena show that ContractSkill is effective in realistic web environments, while MiniWoB provides a controlled test of the mechanism behind the gain. Under matched transfer layers, repaired artifacts also remain reusable after removing the source model from the loop, providing evidence of portability within the same benchmark family rather than full-benchmark generalization. These results suggest that the central challenge is not merely generating skills, but making them explicit, executable, and repairable. Code is available at https://github.com/underfitting-lu/contractskill.git.
comment: 10 pages, 4 figures, 6 tables
♻ ☆ InCoder-32B: Code Foundation Model for Industrial Scenarios
Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.
♻ ☆ Dynamic Cogeneration of Bug Reproduction Test in Agentic Program Repair
Bug Reproduction Tests (BRTs) have been used in many Automated Program Repair (APR) systems, primarily for validating promising fixes and aiding fix generation. In practice, when developers submit a patch, they often implement the BRT alongside the fix. Our experience deploying agentic APR reveals that developers similarly desire a BRT within AI-generated patches to increase their confidence. However, canonical APR systems tend to generate BRTs and fixes separately, and focus on producing only the fix in the final patch. In this paper, we study agentic APR in the context of cogeneration, where the APR agent is instructed to generate both a fix and a BRT in the same patch. We evaluate the effectiveness of different cogeneration strategies on 120 human-reported bugs at Google and characterize different cogeneration strategies by their influence on APR agent behavior. We develop and evaluate patch selectors that account for test change information to select patches with plausible fixes (and plausible BRTs). Finally, we analyze the root causes of failed cogeneration trajectories. Importantly, we show that cogeneration allows the APR agent to generate BRTs for at least as many bugs as a dedicated BRT agent, without compromising the generation rate of plausible fixes, thereby reducing engineering effort in maintaining and coordinating separate generation pipelines for fix and BRT at scale.
♻ ☆ Mitigating "Epistemic Debt" in Generative AI-Scaffolded Novice Programming using Metacognitive Scripts
The democratization of Large Language Models has given rise to vibe coding, where novice programmers prioritize semantic intent over syntactic implementation. Without pedagogical guardrails, we argue this is fundamentally misaligned with cognitive skill acquisition. Drawing on Kirschner's distinction between cognitive offloading and outsourcing, unrestricted AI encourages novices to outsource the intrinsic cognitive load required for schema formation rather than merely offloading extraneous load. This accumulation of epistemic debt creates fragile experts: developers whose high functional utility masks critically low corrective competence. To quantify and mitigate this debt, we conducted a between-subjects experiment (N=78) using a custom Cursor IDE plugin backed by Claude 3.5 Sonnet. Participants were recruited via Prolific and UserInterviews.com to represent AI-native learners. We compared three conditions: manual (control), unrestricted AI (outsourcing), and scaffolded AI (offloading). The scaffolded condition employed a novel Explanation Gate -- a real-time LLM-as-a-Judge framework enforcing a teach-back protocol before generated code could be integrated. Results reveal a collapse of competence: both AI groups significantly outperformed the manual control on functional utility (p < .001) and did not differ from each other (p = .64), yet unrestricted AI users suffered a 77% failure rate on a subsequent 30-minute AI-blackout maintenance task, vs. only 39% in the scaffolded group. Qualitative analysis suggests successful vibe coders naturally self-scaffold, treating AI as a consultant rather than a contractor. We discuss implications for AI-generated software maintainability and propose that future learning systems must enforce metacognitive friction to prevent mass production of unmaintainable code. Replication package: https://github.com/sreecharansankaranarayanan/vibecheck
♻ ☆ SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios
Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern. Existing benchmarks have provided valuable insights, but they fail to capture scenarios in which vulnerabilities are actually introduced by human developers, making fair comparisons between humans and agents infeasible. We therefore introduce SecureVibeBench, a benchmark of 105 C/C++ secure coding tasks sourced from 41 projects in OSS-Fuzz for code agents. SecureVibeBench has the following features: (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified vulnerability introduction points, and (iii) comprehensive evaluation that combines functionality testing and security checking with both static and dynamic oracles. We evaluate 5 popular code agents such as OpenHands, supported by 5 LLMs (e.g., Claude Sonnet 4.5), on SecureVibeBench. Results show that current agents struggle to produce both correct and secure code: even the best-performing one produces merely 23.8% correct and secure solutions on SecureVibeBench.
Human-Computer Interaction 45
☆ Physically-intuitive Privacy and Security: A Design Paradigm for Building User Trust in Smart Sensing Environments
Sensor-based interactive systems -- e.g., "smart" speakers, webcams, and RFID tags -- allow us to embed computational functionality into physical environments. They also expose users to real and perceived privacy risks: users know that device manufacturers, app developers, and malicious third parties want to collect and monetize their personal data, which fuels their mistrust of these systems even in the presence of privacy and security controls. We propose a new design paradigm, physically-intuitive privacy and security (PIPS), which aims to improve user trust by designing privacy and security controls that provide users with simple, physics-based conceptual models of their operation. PIPS consists of three principles: (1) direct physical manipulation of sensor state; (2) perceptible assurance of sensor state; and, (3) intent-aligned sensor (de)activation. We illustrate these principles through three case studies -- Smart Webcam Cover, Powering for Privacy, and On-demand RFID -- each of which has been shown to improve trust relative to existing sensor-based systems.
comment: 18 pages, 4 figures
☆ Play-Testing REMind: Evaluating an Educational Robot-Mediated Role-Play Game
This paper presents REMind, an innovative educational robot-mediated role-play game designed to support anti-bullying bystander intervention among children. REMind invites players to observe a bullying scenario enacted by social robots, reflect on the perspectives of the characters, and rehearse defending strategies by puppeteering a robotic avatar. We evaluated REMind through a mixed-methods play-testing study with 18 children aged 9--10. The findings suggest that the experience supported key learning goals related to self-efficacy, perspective-taking, understanding outcomes of defending, and intervention strategies. These results highlight the promise of Robot-Mediated Applied Drama (RMAD) as a novel pedagogical framework to support Social-Emotional Learning.
comment: This work has been submitted to the IEEE for possible publication
☆ NeuroVase: A Tangible Mobile Augmented Reality Learning System for Neurovascular Anatomy and Stroke Education
Stroke remains a leading cause of mortality and disability worldwide, requiring rapid and informed clinical decision-making. A solid spatial understanding of cerebrovascular anatomy and vascular territories in relation to stroke symptoms and severity is critical for timely clinical decisions and patient care. However, this knowledge is typically conveyed through static 2D diagrams and printed materials, which can hinder mastery of the complex neurovascular system and its clinical implications. Mobile augmented reality (AR) offers an accessible medium for delivering intuitive 3D anatomical education, yet applications focused on the neurovascular system and stroke remain limited despite the demand. To address this, we propose NeuroVase, a tablet-based mobile AR platform within a structured pedagogical framework that enhances stroke-related neuroanatomy learning by providing an interactive, engaging, and accessible alternative to traditional methods. NeuroVase features a dual-mode setup, using tangible cue cards as standalone study aids while also serving as interactive markers for AR content delivery. A custom learning curriculum focused on cerebrovascular anatomy and stroke supports exploration of vascular territories, stroke syndromes, and arterial occlusions in the context of annotated 3D anatomical models in NeuroVase. A controlled user study with 40 participants revealed that, compared with traditional learning, NeuroVase is an effective and user-friendly AR platform for facilitating complex anatomical and physiological education.
☆ Not Just Duolingo: Supporting Immigrant Language Preservation Through Family-Based Play
For immigrants, language preservation is crucial to maintaining their identity, but the process of immigration can strain a community's ability to do so. We interviewed eight Nepali immigrants to understand barriers to language preservation across sociopolitical contexts in Nepal and immigrant life in the United States. Participants described strong motivation but limited institutional support, time and resource constraints, and English-dominant environments that widen parent-child language gaps. They envisioned technology that supports interactive, family-centered learning. In response, we are developing an audio-first, point-and-click language learning game based on the theory of comprehensible input, designed for parent-child co-playing. An early evaluation with four design experts reveals promising gameplay and a need to simplify the symbol-heavy UI. We conclude with implications for designing language technologies that support preservation through relations while acknowledging the limits of design.
comment: CHI 2026
☆ Explainable AI for Blind and Low-Vision Users: Navigating Trust, Modality, and Interpretability in the Agentic Era
Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience "self-blame" for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.
comment: Human-centered Explainable AI Workshop (HCXAI) @ CHI 2026, Barcelona, Spain, 2026
☆ Practice Less, Explain More: LLM-Supported Self-Explanation Improves Explanation Quality on Transfer Problems in Calculus
We conducted a between-subjects experiment (N=92) comparing three conditions in a calculus learning environment: no self-explanation (control), menu-based self-explanation, and open-ended self-explanation with LLM-generated feedback. All conditions showed positive learning gains within a fixed 60-minute practice session, with no significant between-condition differences in post-test performance. On transfer questions, the open-ended condition produced significantly higher-quality explanations than control on "Not Enough Information" (NEI) problems ($β$=+11.9 percentage points, $p$=.030), though the corresponding NEI multiple-choice accuracy advantage was not significant ($p$=.183). Moreover, across all post-test open-ended explanations, the open-ended condition showed a marginally significant advantage ($β$=+7.3%, $p$=.057). These findings suggest that LLM-supported open-ended self-explanation can improve explanation quality on NEI transfer problems, with weaker evidence across broader transfer explanation measures. Notably, these effects emerged even though learners in the open-ended condition completed substantially fewer practice problems within the same practice time.
comment: 9 pages, 2 figures. Accepted at AIED 2026. Camera-ready version with updated references
☆ HapCompass: A Rotational Haptic Device for Contact-Rich Robotic Teleoperation ICRA
The contact-rich nature of manipulation makes it a significant challenge for robotic teleoperation. While haptic feedback is critical for contact-rich tasks, providing intuitive directional cues within wearable teleoperation interfaces remains a bottleneck. Existing solutions, such as non-directional vibrations from handheld controllers, provide limited information, while vibrotactile arrays are prone to perceptual interference. To address these limitations, we propose HapCompass, a novel, low-cost wearable haptic device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA). We evaluated HapCompass's ability to convey directional cues to human operators and showed that it increased the success rate, decreased the completion time and the maximum contact force for teleoperated manipulation tasks when compared to vision-only and non-directional feedback baselines. Furthermore, we conducted a preliminary imitation-learning evaluation, suggesting that the directional feedback provided by HapCompass enhances the quality of demonstration data and, in turn, the trained policy. We release the design of the HapCompass device along with the code that implements our teleoperation interface: https://ripl.github.io/HapCompass/.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 5 figures. Project page: https://ripl.github.io/HapCompass/
☆ Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior
The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.
comment: 12 pages, 5 figures. This paper proposes GEO-SFE, a structural feature engineering framework for generative engine optimization
☆ Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.
comment: 25 pages, figures, tables, and appendix. Third paper in a cumulative research series on PPS and 5W3H structured intent representation, extending prior work to cross-model robustness, framework comparison, and user-study validation
☆ XR is XR: Rethinking MR and XR as Neutral Umbrella Terms
The term XR is currently widely used as an expression encompassing Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). However, there is no clear consensus regarding its origin or meaning. XR is sometimes explained as an abbreviation for Extended Reality, but multiple interpretations exist regarding its etymology and formation process. This paper organizes the historical formation of terminology related to VR, AR, MR, and XR, and reexamines the context in which the term XR emerged and how it has spread. In particular, by presenting a timeline that distinguishes between the coinage of terms and the drivers of their adoption, we suggest that XR, as an umbrella term, functions not as an abbreviation of Extended Reality, but rather as a neutral symbolic label that encompasses multiple "reality"-related terms. Furthermore, we argue that stable usage of terminology, including XR, requires governance through collaboration among academia, industry, and standardization organizations.
comment: 4 pages, 2 figures
☆ 'AI' and Computer Science: Contradictions Emerge between Ideologies
We develop a conceptualization of ideology, in which a system of ideas represents social, economic, and political relationships. We use ideology as a lens for understanding and critiquing intersecting social, economic, and political aspects of how 'AI' technologies are being developed. We observe ideological shifts. We question whether the present entanglement of corporate and university objectives is beneficial to labor, particularly computer science students, and the general public. Corporations and computer science have a history of marketing the ideology of computing as empowerment. However, with the intensification of the production of 'AI', contradictions emerge. We ask, "Who is being empowered?"
☆ KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models
Large Language Models (LLMs) demonstrate exceptional capabilities in factual question answering, yet they sometimes provide incorrect responses. To address this issue, knowledge editing techniques have emerged as effective methods for correcting factual information in LLMs. However, typical knowledge editing workflows struggle with identifying the optimal set of model layers for editing and rely on summary indicators that provide insufficient guidance. This lack of transparency hinders effective comparison and identification of optimal editing strategies. In this paper, we present KEditVis, a novel visual analytics system designed to assist users in gaining a deeper understanding of knowledge editing through interactive visualizations, improving editing outcomes, and discovering valuable insights for the future development of knowledge editing algorithms. With KEditVis, users can select appropriate layers as the editing target, explore the reasons behind ineffective edits, and perform more targeted and effective edits. Our evaluation, including usage scenarios, expert interviews, and a user study, validates the effectiveness and usability of the system.
comment: Accepted by IEEE PacificVis 2026 (TVCG Journal Track)
☆ Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor
The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.
☆ Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation ECIR 2026
Semantic interaction (SI) enables analysts to incorporate their cognitive processes into AI models through direct manipulation of visualizations. While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited. This paper presents a user study that evaluates SI for narrative map sensemaking, involving 33 participants under three conditions: a timeline baseline, a basic narrative map, and an interactive narrative map with SI capabilities. The results show that the map-based prototypes yielded more insights than the timeline baseline, with the SI-enabled condition reaching statistical significance and the basic map condition trending in the same direction. The SI-enabled condition showed the highest mean performance; differences between the map conditions were not statistically significant but showed large effect sizes (d > 0.8), suggesting that the study was underpowered to detect them. Qualitative analysis identified two distinct SI approaches, corrective and additive, that enable analysts to impose quality judgments and organizational structure on extracted narratives. We also find that SI users achieved comparable exploration breadth with less parameter manipulation, suggesting that SI serves as an alternative pathway for model refinement. This work provides empirical evidence that map-based representations outperform timelines for narrative sensemaking, along with qualitative insights into how analysts use SI for narrative refinement.
comment: Text2Story Workshop 2026 at ECIR 2026
☆ All-in-One Augmented Reality Guided Head and Neck Tumor Resection
Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.
☆ iPoster: Content-Aware Layout Generation for Interactive Poster Design via Graph-Enhanced Diffusion Models
We present iPoster, an interactive layout generation framework that empowers users to guide content-aware poster layout design by specifying flexible constraints. iPoster enables users to specify partial intentions within the intention module, such as element categories, sizes, positions, or coarse initial drafts. Then, the generation module instantly generates refined, context-sensitive layouts that faithfully respect these constraints. iPoster employs a unified graph-enhanced diffusion architecture that supports various design tasks under user-specified constraints. These constraints are enforced through masking strategies that precisely preserve user input at every denoising step. A cross content-aware attention module aligns generated elements with salient regions of the canvas, ensuring visual coherence. Extensive experiments show that iPoster not only achieves state-of-the-art layout quality but also offers a responsive and controllable framework for poster layout design with constraints.
☆ VACP: Visual Analytics Context Protocol
The rise of AI agents introduces a fundamental shift in Visual Analytics (VA), in which agents act as a new user group. Current agentic approaches - based on computer vision and raw DOM access - fail to perform VA tasks accurately and efficiently. This paper introduces the Visual Analytics Context Protocol (VACP), a framework designed to make VA applications "agent-ready" that extends generic protocols by explicitly exposing application state, available interactions, and mechanisms for direct execution. To support our context protocol, we contribute a formal specification of AI agent requirements and knowledge representations in VA interfaces. We instantiate VACP as a library compatible with major visualization grammars and web frameworks, enabling augmentation of existing systems and the development of new ones. Our evaluation across representative VA tasks demonstrates that VACP-enabled agents achieve higher success rates in interface interpretation and execution compared to current agentic approaches, while reducing token consumption and latency. VACP closes the gap between human-centric VA interfaces and machine perceivability, ensuring agents can reliably act as collaborative users in VA systems.
☆ Sima AIunty: Caste Audit in LLM-Driven Matchmaking
Social and personal decisions in relational domains such as matchmaking are deeply entwined with cultural norms and historical hierarchies, and can potentially be shaped by algorithmic and AI-mediated assessments of compatibility, acceptance, and stability. In South Asian contexts, caste remains a central aspect of marital decision-making, yet little is known about how contemporary large language models (LLMs) reproduce or disrupt caste-based stratification in such settings. In this work, we conduct a controlled audit of caste bias in LLM-mediated matchmaking evaluations using real-world matrimonial profiles. We vary caste identity across Brahmin, Kshatriya, Vaishya, Shudra, and Dalit, and income across five buckets, and evaluate five LLM families (GPT, Gemini, Llama, Qwen, and BharatGPT). Models are prompted to assess profiles along dimensions of social acceptance, marital stability, and cultural compatibility. Our analysis reveals consistent hierarchical patterns across models: same-caste matches are rated most favorably, with average ratings up to 25% higher (on a 10-point scale) than inter-caste matches, which are further ordered according to traditional caste hierarchy. These findings highlight how existing caste hierarchies are reproduced in LLM decision-making and underscore the need for culturally grounded evaluation and intervention strategies in AI systems deployed in socially sensitive domains, where such systems risk reinforcing historical forms of exclusion.
☆ Customer Analysis and Text Generation for Small Retail Stores Using LLM-Generated Marketing Presence SC
Combining large language models (LLMs) with human insight can help non-experts create point-of-purchase (POP) materials. Persuasive POP texts require both customer understanding and expressive writing skills. However, LLM-generated texts often lack creative diversity, while human users may have limited experience in marketing and content creation. To address these complementary limitations, we propose a prototype system for small retail stores that enhances POP creation through human-AI collaboration. The system supports users in understanding target customers, generating draft POP texts, refining expressions, and evaluating candidates through simulated personas. Our experimental results show that this process significantly improves text quality: the average evaluation score increased by 2.37 points on a -3 to +3 scale compared with texts created without system support.
comment: The 17th International Conference on Smart Computing and Artificial Intelligence (SCAI 2025)
☆ An Experiential Approach to AI Literacy
Despite AI tools becoming more prevalent and applicable to a variety of workplaces, workers consistently report uncertainty about where AI applies, what problems it can help solve, and how it fits into real workflows. In other words, there is a gap between 'knowing' and 'doing' when it comes to AI literacy. We propose an experiential form of AI literacy which integrates participants' daily experiences into the learning experience by brainstorming grounded AI use cases through storytelling. We introduce a novel pedagogical approach that helps individuals move away from abstract notions of AI towards practical knowledge of how AI would (or would not) work in different workflows, contexts, and situations. Through this approach, we anticipate two major outcomes: (1) enhanced AI literacy for stakeholders within a variety of work sectors and (2) concrete AI use cases developed through participatory design that are grounded in AI literacy and participants' expertise.
comment: Paper accepted at CHI 2026 Workshop on Data Literacy. For more details, see https://data-literacy-workshop.github.io/CHI26/
☆ SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
Sign language is the primary means of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news is delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.
☆ BiMoE: Brain-Inspired Experts for EEG-Dominant Affective State Recognition ICME 2026
Multimodal Sentiment Analysis (MSA) that integrates Electroencephalogram (EEG) with peripheral physiological signals (PPS) is crucial for the development of brain-computer interface (BCI) systems. However, existing methods encounter three major challenges: (1) overlooking the region-specific characteristics of affective processing by treating EEG signals as homogeneous; (2) treating EEG as a black-box input, which lacks interpretability into neural representations; (3) ineffective fusion of EEG features with complementary PPS features. To overcome these issues, we propose BiMoE, a novel brain-inspired mixture of experts framework. BiMoE partitions EEG signals in a brain-topology-aware manner, with each expert utilizing a dual-stream encoder to extract local and global spatiotemporal features. A dedicated expert handles PPS using multi-scale large-kernel convolutions. All experts are dynamically fused through adaptive routing and a joint loss function. Evaluated under strict subject-independent settings, BiMoE consistently surpasses state-of-the-art baselines across various affective dimensions. On the DEAP and DREAMER datasets, it yields average accuracy improvements of 0.87% to 5.19% in multimodal sentiment classification. The code is available at: https://github.com/HongyuZhu-s/BiMo.
comment: Accepted by ICME 2026
☆ Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.
comment: 8 pages, Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
☆ REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour
Formative feedback is central to effective learning, yet providing timely, individualised feedback at scale remains a persistent challenge. While recent work has explored the use of large language models (LLMs) to automate feedback, most existing systems still conceptualise feedback as a static, one-way artifact, offering limited support for interpretation, clarification, or follow-up. In this work, we introduce REFINE, a locally deployable, multi-agent feedback system built on small, open-source LLMs that treats feedback as an interactive process. REFINE combines a pedagogically-grounded feedback generation agent with an LLM-as-a-judge-guided regeneration loop using a human-aligned judge, and a self-reflective tool-calling interactive agent that supports student follow-up questions with context-aware, actionable responses. We evaluate REFINE through controlled experiments and an authentic classroom deployment in an undergraduate computer science course. Automatic evaluations show that judge-guided regeneration significantly improves feedback quality, and that the interactive agent produces efficient, high-quality responses comparable to a state-of-the-art closed-source model. Analysis of real student interactions further reveals distinct engagement patterns and indicates that system-generated feedback systematically steers subsequent student inquiry. Our findings demonstrate the feasibility and effectiveness of multi-agent, tool-augmented feedback systems for scalable, interactive feedback.
comment: Accepted to AIED 2026
☆ SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
☆ "I Just Need GPT to Refine My Prompts": Rethinking Onboarding and Help-Seeking with Generative 3D Modeling Tools
Learning to use feature-rich software is a persistent challenge, but generative AI tools promise to lower this barrier by replacing complex navigation with natural language prompts. We investigated how people approach prompt-based tools for 3D modeling in an observational study with 26 participants (14 casuals, 12 professionals). Consistent with earlier work, participants skipped tutorials and manuals, relying on trial and error. What differed in the generative AI context was how and why they sought support: the prompt box became the entry point for learning, collapsing onboarding into immediate action, while some casual users turned to external LLMs for prompts. Professionals used 3D expertise to refine iterations and critically evaluated outputs, often discarding models that did not meet their standards, whereas casual users settled for "good enough." We contribute empirical insights into how generative AI reshapes help-seeking, highlighting new practices of onboarding, recursive AI-for-AI support, and shifting expertise in interpreting outputs.
comment: 16 pages, 10 figures, CHI 2026 submission
☆ VueBuds: Visual Intelligence with Wireless Earbuds
Despite their ubiquity, wireless earbuds remain audio-centric due to size and power constraints. We present VueBuds, the first camera-integrated wireless earbuds for egocentric vision, capable of operating within stringent power and form-factor limits. Each VueBud embeds a camera into a Sony WF-1000XM3 to stream visual data over Bluetooth to a host device for on-device vision language model (VLM) processing. We show analytically and empirically that while each camera's field of view is partially occluded by the face, the combined binocular perspective provides comprehensive forward coverage. By integrating VueBuds with VLMs, we build an end-to-end system for real-time scene understanding, translation, visual reasoning, and text reading; all from low-resolution monochrome cameras drawing under 5mW through on-demand activation. Through online and in-person user studies with 90 participants, we compare VueBuds against smart glasses across 17 visual question-answering tasks, and show that our system achieves response quality on par with Ray-Ban Meta. Our work establishes low-power camera-equipped earbuds as a compelling platform for visual intelligence, bringing rapidly advancing VLM capabilities to one of the most ubiquitous wearable form factors.
comment: CHI 2026
☆ Evaluating a Data-Driven Redesign Process for Intelligent Tutoring Systems
Past research has defined a general process for the data-driven redesign of educational technologies and has shown that in carefully-selected instances, this process can help make systems more effective. In the current work, we test the generality of the approach by applying it to four units of a middle-school mathematics intelligent tutoring system that were selected not based on suitability for redesign, as in previous work, but on topic. We tested whether the redesigned system was more effective than the original in a classroom study with 123 students. Although the learning gains did not differ between the conditions, students who used the Redesigned Tutor had more productive time-on-task, a larger number of skills practiced, and greater total knowledge mastery. The findings highlight the promise of data-driven redesign even when applied to instructional units *not* selected as likely to yield improvement, as evidence of the generality and wide applicability of the method.
comment: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
♻ ☆ MURMR: A Multimodal Sensing Framework for Automated Group Behavior Analysis in Mixed Reality
When teams coordinate in immersive environments, collaboration breakdowns can go undetected without automated analysis, directly affecting task performance. Yet existing methods rely on external observation and manual annotation, offering no annotation-free method for analyzing temporal collaboration dynamics from headset-native data. We introduce MURMR, a passive sensing pipeline that captures and analyzes multimodal interaction data from commodity MR headsets without external instrumentation. Two complementary modules address different levels of analysis: a structural module that generates automated multimodal sociograms and network metrics at both session and intra-session granularities, and a temporal module that applies unsupervised deep clustering to identify moment-to-moment dyadic behavioral phases without predefined taxonomies. An exploratory deployment with 48 participants in a co-located object-sorting task reveals that intra-session structural analysis captures significant within-session variability lost in session-level aggregation, with gaze, audio, and position contributing non-redundantly. The temporal module identifies five behavioral phases with 83% correspondence to video observations. Cross-tabulation shows that behavioral transitions consistently occur within structurally stable states, demonstrating that the two modules capture complementary dynamics. These results establish that passive headset sensing provides meaningful signal for automated, multi-level collaboration analysis in immersive environments.
comment: 12 pages, 5 figures
♻ ☆ Evaluating PDPL Compliance in E-Commerce Websites: Insights and Lessons Learned from Human and LLM Analyses
In 2024, Saudi Arabia's Personal Data Protection Law (PDPL) came into force. However, little work has been done to assess its implementation. In this paper, we analyzed 100 e-commerce websites operating in Saudi Arabia against the PDPL, examining the presence of a privacy policy and, if present, the policy's declarations of four items pertaining to personal data rights and practices: 1) personal data retention period, 2) the right to request the destruction of personal data, 3) the right to request a copy of personal data, and 4) a mechanism for filing complaints. First, our results show that, despite national awareness and support efforts, a significant fraction of e-commerce websites in our dataset are not fully compliant: only 31% of websites in our dataset declared all four examined items in their privacy policies. Even when privacy policies included such declarations, a considerable fraction of them failed to cover required fine-grained details. Second, the majority of top-ranked e-commerce websites in our dataset (based on search results order) and those hosted on local e-commerce hosting platforms exhibited considerably higher non-compliance rates than mid- to low-ranked websites and those not hosted on local e-commerce platforms. Third, we assessed the use of Large Language Models (LLMs) as an automated tool for privacy policy analysis to measure compliance with the PDPL. We highlight the potential of LLMs and suggest considerations to improve LLM-based automated analysis for privacy policies. Our results provide a step forward in understanding the implementation barriers to data protection laws, especially in non-Western contexts. We provide recommendations for policymakers, regulators, website owners, and developers seeking to improve data protection practices and automate compliance monitoring.
comment: Non-peer reviewed paper (under submission)
♻ ☆ Explaining the Reputational Risks of AI-Mediated Communication: Messages labeled as AI-assisted are viewed as less diagnostic of the sender's moral character
When someone sends us a thoughtful message, we naturally form judgments about their character. But what happens when that message carries a label indicating it was written with the help of AI? This paper investigates how the appearance of AI assistance affects our perceptions of message senders. Adding nuance to previous research, through two studies (N=399) featuring vignette scenarios, we find that AI-assistance labels don't necessarily make people view senders negatively. Rather, they dampen the strength of character signals in communication. We show that when someone sends a warmth-signalling message (like thanking or apologizing) without AI help, people more strongly categorize the sender as warm. At the same time, when someone sends a coldness-signalling message (like bragging or blaming) without assistance, people more confidently categorize them as cold. Interestingly, AI labels weaken both these associations: An AI-assisted apology makes the sender appear less warm than if they had written it themselves, and an AI-assisted blame makes the sender appear less cold than if they had composed it independently. This supports our signal diagnosticity explanation: messages labeled as AI-assisted are viewed as less diagnostic than messages which seem unassisted. We discuss how our findings shed light on the causal origins of previously reported observations in AI-Mediated Communication.
comment: Proceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)
♻ ☆ Worker Discretion Advised: Co-designing Risk Disclosure in Crowdsourced Responsible AI (RAI) Content Work
Responsible AI (RAI) content work, such as annotation, moderation, or red teaming for AI safety, often exposes crowd workers to potentially harmful content. While prior work has underscored the importance of communicating well-being risk to employed content moderators, designing effective disclosure mechanisms for crowd workers while balancing worker protection with the needs of task designers and platforms remains largely unexamined. To address this gap, we conducted individual co-design sessions with 15 task designers, 11 crowdworkers, and 3 platform representatives. We investigated task designer preferences for support in disclosing tasks, worker preferences for receiving risk disclosure warnings, and how platform representatives envision their role in shaping risk disclosure practices. We identify design tensions and map the sociotechnical tradeoffs that shape disclosure practices. We contribute design recommendations and feature concepts for risk disclosure mechanisms in the context of RAI content work.
♻ ☆ Exploring and Analyzing the Effect of Avatar's Visual Style on Anxiety of English as Second Language (ESL) Speakers
Virtual avatars offer new opportunities to reshape communication experiences beyond traditional live video. However, it remains unclear how avatar representations influence communication anxiety for English as a Second Language (ESL) speakers, and why such effects emerge. As a first step toward addressing this, we conducted a controlled laboratory study in which Mandarin-speaking ESL participants engaged in one-on-one conversations under three representation conditions: live video, stylized avatars, and realistic avatars. We assessed anxiety using both self-reported measures and physiological signals (EDA, ECG, PPG). Our results show that avatar style plays a critical role in shaping communication anxiety. While live video remained a strong baseline with low subjective anxiety, stylized avatars achieved comparable, and in some cases lower, physiological anxiety levels, whereas realistic avatars elicited higher anxiety. Beyond these effects, our findings reveal three underlying mechanisms that explain how avatar representations shape ESL communication anxiety: (1) facial expressiveness; (2) perceived feedback and fear of negative evaluation; and (3) contextual appropriateness. This work provides actionable design implications for developing avatar-mediated communication systems that support emotionally sustainable cross-linguistic interaction.
comment: 10 pages, 5 figures, 2 tables
♻ ☆ Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning
We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, while we did not observe this performance increase transfer to exam scores. Usage logs show that students with lower self-efficacy and prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled using an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking or help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, the findings suggest that chatbot-based support alone may not reliably support transfer to independent assessment of math proof-learning outcomes, whereas work-anchored, structured feedback appears less associated with reduced learning.
comment: 9 pages, 4 figures. Accepted at AIED 2026. Camera-ready version with updated references
♻ ☆ Balancing Efficiency and Empathy: Healthcare Providers' Perspectives on AI-Supported Workflows for Serious Illness Conversations in the Emergency Department
Serious Illness Conversations (SICs), discussions about values and care preferences for patients with life-threatening illness, rarely occur in Emergency Departments (EDs), despite evidence that early conversations improve care alignment and reduce unnecessary interventions. We interviewed 11 ED providers to identify challenges in SICs and opportunities for technology support, with a focus on AI. Our analysis revealed a four-stage SIC workflow (identification, preparation, conduction, documentation) and barriers at each stage, including fragmented patient information, limited time and space, lack of conversational guidance, and burdensome documentation. Providers expressed interest in AI systems for synthesizing information, supporting real-time conversations, and automating documentation, but emphasized concerns about preserving human connection and clinical autonomy. This tension highlights the need for technologies that enhance efficiency without undermining the interpersonal nature of SICs. We propose design guidelines for ambient and peripheral AI systems to support providers while preserving the essential humanity of these conversations.
comment: To appear at ACM CHI'26
♻ ☆ Practitioner Voices Summit: How Teachers Evaluate AI Tools through Deliberative Sensemaking
Teachers face growing pressure to integrate AI tools into their classrooms, yet are rarely positioned as agentic decision-makers in this process. Understanding the criteria teachers use to evaluate AI tools, and the conditions that support such reasoning, is essential for responsible AI integration. We address this gap through a two-day national summit in which 61 U.S. K-12 mathematics educators developed personal rubrics for evaluating AI classroom tools. The summit was designed to support deliberative sensemaking, a process we conceptualize by integrating Technological Pedagogical Content Knowledge (TPACK) with deliberative agency. Teachers generated over 200 criteria - initial articulations spanning four higher-order themes (Practical, Equitable, Flexible, and Rigorous) - that addressed both AI outputs and the process of using AI. Criteria contained productive tensions (e.g., personalization versus fairness, adaptability versus efficiency), and the vast majority framed AI as an assistant rather than a coaching tool for professional learning. Analysis of surveys, interviews, and summit discussions revealed five mechanisms supporting deliberative sensemaking: time and space for deliberation, artifact-centered sensemaking, collaborative reflection through diverse viewpoints, knowledge-building, and psychological safety. Across these mechanisms, TPACK and agency operated in a mutually reinforcing cycle - knowledge-building enabled more grounded evaluative judgment, while the act of constructing criteria deepened teachers' understanding of tools. We discuss implications for edtech developers seeking practitioner input, school leaders making adoption decisions, educators and professional learning designers, and researchers working to elicit teachers' evaluative reasoning about rapidly evolving technologies.
♻ ☆ HeyFriend Helper: A Conversational AI Web-App for Resource Access Among Low-Income Chicago Residents
Low-income individuals can face multiple challenges in their ability to seek employment. Barriers to employment often include limited access to digital literacy resources, training, interview preparation and resume feedback. Prior work has largely focused on targeted social service or healthcare applications that address needs individually, with little emphasis on conversational AI-driven systems that integrate multiple localized digital resources to provide comprehensive support. This work presents HeyFriend Helper, a web-based platform designed to support low-income residents in Chicago through an interactive conversational assistant that provides personalized support and guidance. HeyFriend Helper integrates multiple tools, including resume building and feedback, interview practice, mindfulness and well-being resources, employment trend and career outcome information, language learning support, and location-based access to community services. This work represents an interdisciplinary collaboration between social work, computer science, and engineering that addresses the multifaceted needs of low-income individuals. The findings demonstrate the importance of career-readiness tools and conversational user interfaces (CUIs) in providing holistic support.
♻ ☆ Locating Risk: Task Designers and the Challenge of Risk Disclosure in RAI Content Work
As AI systems are increasingly tested and deployed in open-ended and high-stakes domains, crowdworkers are often tasked with responsible AI (RAI) content work. These tasks include labeling violent content, moderating disturbing text, or simulating harmful behavior for red teaming exercises to shape AI system behaviors. While prior research efforts have highlighted the risks to worker well-being associated with RAI content work, far less attention has been paid to how these risks are communicated to workers by task designers or individuals who design and post RAI tasks. Existing transparency frameworks and guidelines, such as model cards, datasheets, and crowdworksheets, focus on documenting model information and dataset collection processes, but they overlook an important aspect of disclosing well-being risks to workers. In the absence of standard workflows or clear guidance, the consistent application of content warnings, consent flows, or other forms of well-being risk disclosure remains unclear. This study investigates how task designers approach risk disclosure in crowdsourced RAI tasks. Drawing on interviews with 23 task designers across academic and industry sectors, we examine how well-being risk is recognized, interpreted, and communicated in practice. Our findings highlight the need to support task designers in identifying and communicating risks not only to support crowdworker well-being but also to strengthen the ethical integrity and technical efficacy of AI development pipelines.
♻ ☆ Exploring Sidewalk Sheds in New York City through Chatbot Surveys and Human Computer Interaction
Sidewalk sheds are a common feature of the streetscape in New York City, reflecting ongoing construction and maintenance activities. However, policymakers and local business owners have raised concerns about reduced storefront visibility and altered pedestrian navigation. Although sidewalk sheds are widely used for safety, their effects on pedestrian visibility and movement are not directly measured in current planning practices. To address this, we developed an AI-based chatbot survey that collects image-based annotations and route choices from pedestrians, linking these responses to specific shed design features, including clearance height, post spacing, and color. This AI chatbot survey integrates a large language model (e.g., Google's Gemini-1.5-flash-001 model) with an image-annotation interface, allowing users to interact with street images, mark visual elements, and provide structured feedback through guided dialogue. To explore pedestrian perceptions and behaviors, this paper conducts a grid-based analysis of entrance annotations and applies logistic mixed-effects modeling to assess sidewalk choice patterns. Analysis of the dataset (n = 25) shows that: (1) the presence of scaffolding significantly reduces pedestrians' ability to identify ground-floor retail entrances, and (2) variations in weather conditions and shed design features significantly influence sidewalk selection behavior. By integrating generative AI into urban research, this study demonstrates a novel method for evaluating sidewalk shed designs and provides empirical evidence to support adjustments to shed guidelines that improve the pedestrian experience without compromising safety.
♻ ☆ "You've got a friend in me": Co-Designing a Peer Social Robot for Young Newcomers' Language and Cultural Learning
Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and added them in an integrated prototype that uses short story-based activities, multi-modal scaffolding and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
♻ ☆ FIRMED: A Peak-Centered Multimodal Dataset with Fine-Grained Annotation for Emotion Recognition
Traditional video-induced physiological datasets usually rely on whole-trial labels, which introduce temporal label noise in dynamic emotion recognition. We present FIRMED, a peak-centered multimodal dataset based on an immediate-recall annotation paradigm, with synchronized EEG, ECG, GSR, PPG, and facial recordings from 35 participants. FIRMED provides event-centered timestamps, emotion labels, and intensity annotations, and its annotation quality is supported by subjective and physiological validation. Benchmark experiments show that FIRMED consistently outperforms whole-trial labeling, yielding an average gain of 3.8 percentage points across eight EEG-based classifiers, with further improvements under multimodal fusion. FIRMED provides a practical benchmark for temporally localized supervision in multimodal affective computing.
♻ ☆ "It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with Vision-Language Models
Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal care items, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues (such as blur, misframing, and rotation) affect the accuracy of VLM-generated captions and whether the resulting captions meet BLV people's information needs. Based on a survey of 86 BLV participants, we develop an annotated dataset of 1,859 product images from BLV people to systematically evaluate how image quality issues affect VLM-generated captions. While the best VLM achieves 98% accuracy on images with no quality issues, accuracy drops to 75% overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.
comment: Published at CHI 2026; Honorable Mention for Best Paper (Top 5%). Dataset available at: https://github.com/Accessibility-Research-Collective-UCI/image-quality-vlm-chi26
♻ ☆ "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents
Personalized computer-use agents are rapidly moving from expert communities into mainstream use. Unlike conventional chatbots, these systems can install skills, invoke tools, access private resources, and modify local environments on users' behalf. Yet users often do not know what authority they have delegated, what the agent actually did during task execution, or whether the system has been safely removed afterward. We investigate this gap as a combined problem of risk understanding and post-hoc auditability, using OpenClaw as a motivating case. We first build a multi-source corpus of the OpenClaw ecosystem, including incidents, advisories, malicious-skill reports, news coverage, tutorials, and social-media narratives. We then conduct an interview study to examine how users and practitioners understand skills, autonomy, privilege, persistence, and uninstallation. Our findings suggest that participants often recognized these systems as risky in the abstract, but lacked concrete mental models of what skills can do, what resources agents can access, and what changes may remain after execution or removal. Motivated by these findings, we propose AgentTrace, a traceability framework and prototype interface for visualizing agent actions, touched resources, permission history, provenance, and persistent side effects. A scenario-based evaluation suggests that traceability-oriented interfaces can improve understanding of agent behavior, support anomaly detection, and foster more calibrated trust.
♻ ☆ Training for Technology: Adoption and Productive Use of Generative AI in Legal Analysis
Can targeted user training unlock the productive potential of generative artificial intelligence in professional settings? We study this question using a randomized experiment in which 164 law students completed an issue-spotting examination under one of three conditions: no GenAI access, optional access to a large language model (LLM), or LLM access with a brief training intervention. Untrained LLM access proved counterproductive: relative to participants without any LLM access, untrained users wrote significantly shorter answers, committed more case misstatements, and scored marginally lower, though most differences fall short of conventional significance. Training reversed this pattern. Trained participants adopted the LLM at higher rates (41% vs. 26%; p = 0.044), scored 0.27 grade points higher than untrained users, roughly one fine grade (p = 0.027), and stated applicable rules more accurately (p = 0.014). Principal stratification analysis suggests training operates primarily through adoption rather than effectiveness: the adoption lower bound (1.06) exceeds the effectiveness upper bound (0.42) at strict mean dominance, though confidence intervals are wide. Training also shifted who adopted: top-quartile students went from 0% adoption to 42%. More broadly, these findings challenge the view that GenAI primarily benefits lower-skilled workers: without training, higher-ability practitioners opt out while lower-ability users adopt but unproductively. Realizing GenAI's productivity gains requires investment in both access and instruction.
♻ ☆ Triple-Identity Authentication: The Future of Secure Access
In password-based authentication systems, the username fields are essentially unprotected, while the password fields are susceptible to attacks. In this article, we shift our research focus from the traditional authentication paradigm to the establishment of gatekeeping mechanisms for the systems. To this end, we introduce a Triple-Identity Authentication scheme. First, we combine each user credential (i.e., login name, login password, and authentication password) with the International Mobile Equipment Identity (IMEI) and International Mobile Subscriber Identity (IMSI) of a user's smartphone to create a combined identity represented as "credential+IMEI+IMSI", defined as a system attribute of the user. Then, we grant the password-based local systems autonomy to use the internal elements of our matrix-like hash algorithm. Following a credential input, the algorithm hashes it, and then the local system, rather than the algorithm, creates an identifier using a set of elements randomly selected from the algorithm, which is used to verify the user's combined identity. This decentralized authentication based on the identity-identifier handshake approach is implemented at the system's interaction points, such as the login name field, the login password field, and the server's authentication point. Ultimately, this approach establishes effective security gates, empowering the password-based local systems to autonomously safeguard user identification and authentication processes.
comment: 10 pages, 2 figures
Programming Languages 7
☆ Phyelds: A Pythonic Framework for Aggregate Computing
Aggregate programming is a field-based coordination paradigm with over a decade of exploration and successful applications across domains including sensor networks, robotics, and IoT, with implementations in various programming languages, such as Protelis, ScaFi (Scala), and FCPP (C++). A recent research direction integrates machine learning with aggregate computing, aiming to support large-scale distributed learning and provide new abstractions for implementing learning algorithms. However, existing implementations do not target data science practitioners, who predominantly work in Python, the de facto language for data science and machine learning, with a rich and mature ecosystem. Python also offers advantages for other use cases, such as education and robotics (e.g., via ROS). To address this gap, we present Phyelds, a Python library for aggregate programming. Phyelds offers a fully featured yet lightweight implementation of the field calculus model of computation, featuring a Pythonic API and an architecture designed for seamless integration with Python's machine learning ecosystem. We describe the design and implementation of Phyelds and illustrate its versatility across domains, from well-known aggregate computing patterns to federated learning coordination and integration with a widely used multi-agent reinforcement learning simulator.
☆ Multi-paradigm Logic Programming in the ${\cal E}$rgoAI System
ErgoAI is a high-level, multi-paradigm logic programming language and system developed by Coherent Knowledge Systems as an enhancement of and a successor to the popular Flora-2 system. ErgoAI is oriented towards scalable knowledge representation and reasoning, and can exploit both structured knowledge and knowledge derived from external sources such as vector embeddings. From the start, ErgoAI (and Flora-2 before it) was designed to exploit the well-founded semantics for reasoning in a multi-paradigm environment, including object-based logic (F-logic) with non-monotonic inheritance; higher order syntax in the style of HiLog; defeasibility of rules; semantically clean transactional updates; extensive use of subgoal delay for handling unsafe queries and for better performance; and optional support for bounded rationality at a module level. Although Flora-2 programs are compiled into XSB and adopt many Prolog features, ErgoAI is altogether a different language and system. Under consideration in Theory and Practice of Logic Programming (TPLP).
comment: 43 pages, 9 figures. Under consideration in Theory and Practice of Logic Programming (TPLP)
☆ A Graded Modal Dependent Type Theory with Erasure, Formalized
We present a graded modal type theory, a dependent type theory with grades that can be used to enforce various properties of the code. The theory has $Π$-types, weak and strong $Σ$-types, natural numbers, an empty type, and a universe, and we also extend the theory with weak and strong unit types and graded $Σ$-types. The theory is parameterized by a modality structure, a kind of partially ordered semiring, whose elements (grades) are used to track the usage of variables in terms and types. Different modalities are possible. We focus mainly on quantitative properties, in particular erasure: with the erasure modality one can mark function arguments as erasable. The theory is fully formalized in Agda. The formalization, which uses a syntactic Kripke logical relation at its core and is based on earlier work, establishes major meta-theoretic properties such as subject reduction, consistency, normalization, and decidability of definitional equality. We also prove a substitution theorem for grade assignment, and preservation of grades under reduction. Furthermore we study an extraction function that translates terms to an untyped $λ$-calculus and removes erasable content, in particular function arguments with the "erasable" grade. For a certain class of modalities we prove that extraction is sound, in the sense that programs of natural number type have the same value before and after extraction. Soundness of extraction holds also for open programs, as long as all variables in the context are erasable, the context is consistent, and erased matches are not allowed for weak $Σ$-types.
☆ Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
comment: Accepted in the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)
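As a rough illustration of the semantic-entropy idea in the ConSelf abstract above, the sketch below groups sampled programs by their observed outputs on shared test inputs and computes the Shannon entropy of the resulting cluster distribution. The function name `semantic_entropy`, the use of exact output equality as the clustering criterion, and the toy candidates are assumptions made for illustration; the abstract does not spell out the paper's exact definition.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def semantic_entropy(programs: Sequence[Callable], test_inputs: Sequence) -> float:
    """Entropy (in bits) over behavioral clusters of candidate programs.

    Assumption: two programs share a cluster when they return identical
    outputs on every test input (a simplified notion of functional equivalence).
    """
    signatures = []
    for prog in programs:
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(prog(x)))
            except Exception as exc:  # a crash counts as an observable behavior
                outputs.append(f"error:{type(exc).__name__}")
        signatures.append(tuple(outputs))

    counts = Counter(signatures)
    total = len(programs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical candidates sampled for the task "return the absolute value of x".
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,  # behaves like abs
    lambda x: x,                    # wrong for negative inputs
    lambda x: -x,                   # wrong for positive inputs
]
entropy = semantic_entropy(candidates, test_inputs=[-2, 0, 3])
```

Here the four candidates fall into three behavioral clusters of sizes 2, 1, and 1. A problem whose samples all behave identically would score zero, while one whose samples all disagree would score the maximum, which is the kind of signal a curriculum could rank problems by.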
♻ ☆ Lumos: Let there be Language Model System Certification
We introduce the first principled framework, Lumos, for specifying and formally certifying Language Model System (LMS) behaviors. Lumos is an imperative probabilistic programming DSL over graphs, with constructs to generate independent and identically distributed prompts for LMS. It offers a structured view of prompt distributions via graphs, forming random prompts from sampled subgraphs. Lumos supports certifying LMS for arbitrary prompt distributions via integration with statistical certifiers. We provide hybrid (operational and denotational) semantics for Lumos, providing a rigorous way to interpret the specifications. Using only a small set of composable constructs, Lumos can encode existing LMS specifications, including complex relational and temporal specifications. It also facilitates specifying new properties - we present the first safety specifications for vision-language models (VLMs) in autonomous driving scenarios developed with Lumos. Using these, we show that the state-of-the-art VLM Qwen-VL exhibits critical safety failures, producing incorrect and unsafe responses with at least 90% probability in right-turn scenarios under rainy driving conditions, revealing substantial safety risks. Lumos's modular structure allows easy modification of the specifications, enabling LMS certification to stay abreast with the rapidly evolving threat landscape. We further integrate a prompt-level deterministic verifier to obtain guarantees over the privacy of the LLM generation distribution over a prompt distribution. Lumos is simple to program in, requiring only a few constructs, as evidenced by state-of-the-art large language models generating correct Lumos specifications in zero-shot settings. Lumos is the first systematic and extensible language-based framework for specifying and certifying LMS behaviors, paving the way for a wider adoption of LMS certification.
♻ ☆ Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity
Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. We characterize how much additional information must be supplied to recover precise identity from such representations. The answer is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. We prove a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_π$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.
comment: 13 pages, 2 tables. Lean 4 artifact and supplementary material available at https://doi.org/10.5281/zenodo.18123531
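The quantities named in the abstract above can be made concrete on a toy representation map. The sketch below computes the collision fibers of a map `pi`, the fixed-length bound $\lceil \log_2 A_π \rceil$ in whole bits, the pointwise adaptive budget, and the single-block formula $D^\star(L)=\max(0,1-2^L/a)$. The function names and the toy map are illustrative assumptions, not taken from the paper or its Lean artifact.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def collision_fibers(entities: Iterable, pi: Callable) -> Dict[Hashable, List]:
    """Group entities by representation value: fibers[u] = pi^{-1}(u)."""
    fibers = defaultdict(list)
    for e in entities:
        fibers[pi(e)].append(e)
    return dict(fibers)


def ceil_log2(n: int) -> int:
    """Exact ceil(log2(n)) for n >= 1, avoiding floating-point rounding."""
    return (n - 1).bit_length()


def fixed_length_bits(fibers: Dict[Hashable, List]) -> int:
    """Whole bits needed so that L >= log2(A_pi), with A_pi the largest fiber."""
    return ceil_log2(max(len(f) for f in fibers.values()))


def adaptive_bits(fibers: Dict[Hashable, List]) -> Dict[Hashable, int]:
    """Pointwise budget ceil(log2 |pi^{-1}(u)|) for each representation value u."""
    return {u: ceil_log2(len(f)) for u, f in fibers.items()}


def single_block_distortion(L: int, a: int) -> float:
    """D*(L) = max(0, 1 - 2^L / a) when all mass lies on one block of size a."""
    return max(0.0, 1 - 2**L / a)


# Toy map (an assumption): identities 0..9, with everything >= 5 collapsed to 5.
fibers = collision_fibers(range(10), pi=lambda x: min(x, 5))
L_fixed = fixed_length_bits(fibers)
budgets = adaptive_bits(fibers)
```

In this toy case the largest fiber has five members, so a fixed-length tag needs three bits, while the adaptive budget spends zero bits on the five representation values that are already unambiguous.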
♻ ☆ Exact Recovery Under Deterministic Partial Views: Confusability Graphs, Strong Powers, and Capacity
We study exact recovery from deterministic partial views of a finite latent tuple. A family of admissible views induces a confusability graph on latent states, and this graph is the structural object governing zero-error recovery. In the exact coordinate-view model on the full labeled tuple space, we characterize the realizable confusability relations exactly: they are precisely those determined by upward-closed families of coordinate-agreement sets. We show that exact recovery with a $T$-ary auxiliary tag is equivalent to $T$-colorability of the induced graph, while exact recovery on a designated success set is equivalent to colorability of the corresponding induced subgraph. Under repeated composition, the block confusability graph is the strong power of the one-shot graph, so the normalized zero-error rates converge to the Shannon capacity of the induced graph and inherit the standard Lovász-$\vartheta$ upper theory. We also identify a structural equality route: when confusability is transitive, the induced graph collapses to a cluster graph, yielding capacity--$\vartheta$ equality, with meet-witnessing and fiber coherence as sufficient conditions. Finally, under an affine restriction on the realized state family, the coordinate side carries a representable matroid whose rank gives tractable upper bounds on confusability and capacity. A classification of representative channel families shows that the majority of widely deployed deterministic partial-view architectures operate above the zero-incoherence boundary, rendering the graph-capacity limits operationally unavoidable.
comment: 17 pages, 1 figure, 2 tables. Lean 4 artifact and supplementary available at https://doi.org/10.5281/zenodo.18141365
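The link between auxiliary tags and graph coloring can be sketched on a small coordinate-view example. The code below is an illustration under stated assumptions, not the paper's construction: it assumes two latent states are confusable when at least one admissible view (a tuple of observed coordinate indices) agrees on both, and it uses first-fit greedy coloring, which yields a valid tag assignment but only an upper bound on the minimal $T$. All names are hypothetical.

```python
from itertools import product


def confusability_graph(states, views):
    """Adjacency sets: u ~ v if some view observes identical values on u and v."""
    adj = {s: set() for s in states}
    for i, u in enumerate(states):
        for v in states[i + 1:]:
            if any(all(u[k] == v[k] for k in view) for view in views):
                adj[u].add(v)
                adj[v].add(u)
    return adj


def greedy_tags(adj):
    """First-fit proper coloring; the colors play the role of auxiliary tags."""
    tags = {}
    for s in adj:
        used = {tags[n] for n in adj[s] if n in tags}
        tags[s] = next(c for c in range(len(adj)) if c not in used)
    return tags


def exactly_recoverable(states, views, tags):
    """True if (observation, tag) pins down the state under every view."""
    for view in views:
        seen = set()
        for s in states:
            key = (tuple(s[k] for k in view), tags[s])
            if key in seen:
                return False
            seen.add(key)
    return True


states = list(product((0, 1), repeat=3))  # toy latent tuples (assumption)
views = [(0, 1), (1, 2)]                  # each view hides one coordinate
adj = confusability_graph(states, views)
tags = greedy_tags(adj)
```

Here every state is confusable with exactly two others, and the resulting graph is a union of 4-cycles, so a binary tag is enough for exact recovery under either view.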
Networking and Internet Architecture 17
☆ From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters
Modern GPU-based high-performance computing clusters offer unprecedented communication bandwidth through heterogeneous intra-node interconnects and inter-node networks. However, despite this high aggregate bandwidth, many real-world communication patterns fail to fully utilize the available hardware. Traffic skew often leads to situations where a small subset of links becomes oversaturated while others remain underutilized, resulting in congestion, latency spikes, and poor scalability. Existing communication frameworks such as NCCL and MPI with UCX typically rely on static fastest-path routing or hashing-based multi-rail striping, which leaves significant bandwidth unused when runtime traffic deviates from expected distributions. To address these limitations, we propose NIMBLE (Node-Interconnect Multi-path Balancing with Execution-time orchestration), a runtime communication orchestration system that dynamically redistributes traffic to balance link utilization across all available intra-node and inter-node paths. NIMBLE formulates this as a capacity-normalized minimum-congestion optimization problem and solves it efficiently using a multiplicative-weights algorithm. It further employs CUDA-aware GPU kernel-based RDMA pipelining to route traffic through intermediate GPUs and rail-matched NICs. The system is endpoint-driven, integrates transparently with existing communication libraries without requiring application changes, and preserves ordering, determinism, and low overhead. On H100-SXM4 nodes with fully connected NVLink and four NDR400 rails, NIMBLE achieves up to 2.3x higher intra-node bandwidth and 3.8x higher inter-node throughput compared to single-path baselines. It outperforms NCCL and MPI by up to 5.2x on skewed All-to-Allv workloads and 1.35x on end-to-end LLM MoE workloads, while matching baseline performance under balanced traffic.
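The idea of capacity-normalized congestion balancing can be sketched with a simplified routine in the spirit of multiplicative-weights methods. This is not NIMBLE's solver: the chunked greedy loop, the exponential link penalty, and every name and parameter below are assumptions made for illustration. Each demand is routed in small pieces onto whichever candidate path currently has the lowest total penalty, where a link's penalty grows multiplicatively with its load divided by its capacity.

```python
import math


def balance_paths(capacities, demands, chunks=600, eta=1.0):
    """Route each demand in small chunks onto its currently cheapest path.

    capacities: list of link capacities.
    demands: list of (volume, paths); each path is a list of link indices.
    A link's penalty is exp(eta * load / capacity), so busy links are
    avoided more and more strongly as they fill up.
    Returns the final per-link load.
    """
    load = [0.0] * len(capacities)

    def cost(path):
        return sum(math.exp(eta * load[e] / capacities[e]) for e in path)

    for _ in range(chunks):
        for volume, paths in demands:
            best = min(paths, key=cost)
            for e in best:
                load[e] += volume / chunks
    return load


def max_congestion(load, capacities):
    """Largest capacity-normalized link load."""
    return max(l / c for l, c in zip(load, capacities))


# Hypothetical example: one 6-unit demand, two parallel single-link paths
# with capacities 2 and 1.
capacities = [2.0, 1.0]
demands = [(6.0, [[0], [1]])]
load = balance_paths(capacities, demands)
congestion = max_congestion(load, capacities)
```

Sending everything over the faster link alone would give a normalized load of 3, whereas the balanced split approaches the ideal value of 2 (total demand divided by total capacity).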
☆ Making Sense of AI Agents Hype: Adoption, Architectures, and Takeaways from Practitioners
To support practitioners in understanding how agentic systems are designed in real-world industrial practice, we present a review of practitioner conference talks on AI agents. We analyzed 138 recorded talks to examine how companies adopt agent-based architectures (Objective 1), identify recurring architectural strategies and patterns (Objective 2), and analyze application domains and technologies used to implement and operate LLM-driven agentic systems (Objective 3).
☆ GreenFLag: A Green Agentic Approach for Energy-Efficient Federated Learning
As mobile networks progress toward a new generation, there is a clear focus on integrating distributed intelligence across the system to drive performance, autonomy, and real-time adaptability. Federated learning (FL) stands out as a key emerging technique, enabling on-device model training while preserving data locality. However, its operation introduces substantial energy and resource demands. Energy needs are mostly met by grid power sources, while FL resource orchestration strategies remain limited. This work introduces GreenFLag, an agentic resource orchestration framework designed to minimize the grid energy consumed to complete FL workflows, guarantee FL model performance, and reduce grid power reliance by incorporating renewable sources into the system. GreenFLag leverages a Soft Actor-Critic reinforcement learning approach to jointly optimize computational and communication resources, while accounting for communication contention and the dynamic availability of renewable energy. Evaluations using a real-world open dataset from Copernicus demonstrate that GreenFLag reduces grid energy consumption by 94.8% on average compared to three state-of-the-art baselines, while primarily relying on green power.
comment: 9 pages. arXiv admin note: text overlap with arXiv:2308.10664
☆ 6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management
Autonomous 6G network management requires agents that can execute tools, observe the resulting state changes, and adapt their decisions accordingly. Existing benchmarks based on static questions or scripted episode replay, however, do not support such closed-loop interaction, limiting agents to passive evaluation without the ability to learn from environmental feedback. This paper presents 6GAgentGym, a framework that provides this closed-loop capability. It offers an interactive environment with 42 typed tools whose effect classification distinguishes read-only observation from state-mutating configuration, backed by a learned Experiment Model calibrated on NS-3 simulation data. 6G-Forge bootstraps closed-loop training trajectories from NS-3 seeds via iterative Self-Instruct generation with execution verification against the Experiment Model. Supervised fine-tuning on the resulting corpus, followed by reinforcement learning with online closed-loop interaction, enables an 8B open-source model to achieve an overall success rate comparable to GPT-5 on the accompanying 6GAgentBench, with stronger performance on long-horizon tasks. Together, these components provide a viable path toward autonomous, closed-loop network management.
☆ Mean Masked Autoencoder with Flow-Mixing for Encrypted Traffic Classification
Network traffic classification using self-supervised pre-training models based on Masked Autoencoders (MAE) has demonstrated huge potential. However, existing methods are confined to isolated byte-level reconstruction of individual flows and lack adequate perception of the multi-granularity contextual relationships in traffic. To address this limitation, we propose Mean MAE (MMAE), a teacher-student MAE paradigm with a flow-mixing strategy for building encrypted-traffic pre-training models. MMAE employs a self-distillation mechanism for teacher-student interaction, where the teacher provides unmasked flow-level semantic supervision to advance the student from local byte reconstruction to multi-granularity comprehension. To break the information bottleneck in individual flows, we introduce a dynamic Flow Mixing (FlowMix) strategy to replace the traditional random masking mechanism. By constructing challenging cross-flow mixed samples with interference, it compels the model to learn discriminative representations from distorted tokens. Furthermore, we design a Packet-importance aware Mask Predictor (PMP) equipped with an attention bias mechanism that leverages packet-level side-channel statistics to dynamically mask tokens with high semantic density. Experiments on multiple datasets covering encrypted applications, malware, and attack traffic demonstrate that MMAE achieves state-of-the-art performance. The code is available at https://github.com/lx6c78/MMAE
comment: Project page: https://github.com/lx6c78/MMAE
☆ TrafficMoE: Heterogeneity-aware Mixture of Experts for Encrypted Traffic Classification
Encrypted traffic classification is a critical task for network security. While deep learning has advanced this field, the occlusion of payload semantics by encryption severely challenges standard modeling approaches. Most existing frameworks rely on static and homogeneous pipelines that apply uniform parameter sharing and static fusion strategies across all inputs. This one-size-fits-all static design is inherently flawed: by forcing structured headers and randomized payloads into a unified processing pipeline, it inevitably entangles the raw protocol signals with stochastic encryption noise, thereby degrading the fine-grained discriminative features. In this paper, we propose TrafficMoE, a framework that breaks through the bottleneck of static modeling by establishing a Disentangle-Filter-Aggregate (DFA) paradigm. Specifically, to resolve the structural conflict between components, the architecture disentangles headers and payloads using a dual-branch sparse Mixture-of-Experts (MoE), enabling modality-specific modeling. To mitigate the impact of stochastic noise, an uncertainty-aware filtering mechanism is introduced to quantify reliability and selectively suppress high-variance representations. Finally, to overcome the limitations of static fusion, a routing-guided strategy dynamically aggregates cross-modality features, adaptively weighting contributions based on traffic context. With this DFA paradigm, TrafficMoE maximizes representational efficiency by focusing solely on the most discriminative traffic features. Extensive experiments on six datasets demonstrate that TrafficMoE consistently outperforms state-of-the-art methods, validating the necessity of heterogeneity-aware modeling in encrypted traffic analysis. The source code is publicly available at https://github.com/Posuly/TrafficMoE_main.
comment: Project page: https://github.com/Posuly/TrafficMoE_main
☆ Multi-AUV Cooperative Target Tracking Based on Supervised Diffusion-Aided Multi-Agent Reinforcement Learning
In recent years, advances in underwater networking and multi-agent reinforcement learning (MARL) have significantly expanded multi-autonomous underwater vehicle (AUV) applications in marine exploration and target tracking. However, current MARL-driven cooperative tracking faces three critical challenges: 1) non-stationarity in decentralized coordination, where local policy updates destabilize teammates' observation spaces, preventing convergence; 2) sparse-reward exploration inefficiency from limited underwater visibility and constrained sensor ranges, causing high-variance learning; and 3) water disturbance fragility combined with handcrafted reward dependency that degrades real-world robustness under unmodeled hydrodynamic conditions. To address these challenges, this paper proposes a hierarchical MARL architecture comprising four layers: global training scheduling, multi-agent coordination, local decision-making, and real-time execution. This architecture optimizes task allocation and inter-AUV coordination through hierarchical decomposition. Building on this foundation, we propose the Supervised Diffusion-Aided MARL (SDA-MARL) algorithm featuring three innovations: 1) a dual-decision architecture with segregated experience pools mitigating nonstationarity through structured experience replay; 2) a supervised learning mechanism guiding the diffusion model's reverse denoising process to generate high-fidelity training samples that accelerate convergence; and 3) disturbance-robust policy learning incorporating behavioral cloning loss to guide the Deep Deterministic Policy Gradient network update using high-quality replay actions, eliminating handcrafted reward dependency. The tracking algorithm based on SDA-MARL proposed in this paper achieves superior precision compared to state-of-the-art methods in comprehensive underwater simulations.
☆ TORCH: Characterizing Invalid Route Filtering via Tunnelled Observation
To mitigate BGP prefix hijacking, the Resource Public Key Infrastructure (RPKI) provides prefix origin authentication via Route Origin Validation (ROV). Despite extensive measurement efforts in IPv4, the protective impact of ROV in IPv6 has yet to be systematically assessed. Existing approaches suffer from limited observability into invalid route propagation: they often rely on a small set of controlled prefixes or cannot fully profile the filtering of in-the-wild RPKI-invalid routes, which undermines the accuracy of assessment. Furthermore, the inherent opacity of the IPv6 data plane exacerbates the difficulty of performing scalable and reliable active measurements. In this paper, we present TORCH, a novel framework for measuring invalid route filtering in IPv6. It repurposes open 6in4 tunnel endpoints as widely distributed vantage points for global measurement. At its core, we develop a cross-plane inference technique that determines reachability without requiring responsive targets. This method allows us to characterize whether and how traffic is steered to invalid origins across diverse routing scenarios, leading to an in-depth evaluation of the real-world impact of ROV. Our measurements reveal that about 27% of ASes have achieved nearly full ROV protection. However, several permissive Tier-1 ASes still transit traffic towards invalid origins, maintaining a substantial attack surface. Through a prefix-centric analysis, we provide the first empirical evidence that the collateral damage of same-length prefix filtering can affect a significant fraction of the global Internet. Our findings pinpoint fundamental vulnerabilities in ROV deployment and underscore the urgent necessity for network operators to accelerate RPKI adoption. We make our datasets publicly available.
☆ Needle in a Haystack: Tracking UAVs from Massive Noise in Real-World 5G-A Base Station Data
The potential usage of UAVs in daily life has made monitoring them essential. However, existing systems for monitoring UAVs typically rely on cameras, LiDARs, or radars, whose limited sensing range or high deployment cost hinder large-scale adoption. In response, we develop BSense, the first system that tracks UAVs by leveraging point clouds from commercial 5G-A base stations. The key challenge lies in the dominant number of noise points that closely resemble true UAV points, resulting in a noise-to-UAV ratio over 100:1. Therefore, identifying UAVs from the raw point clouds is like finding a needle in a haystack. To overcome this, we propose a layered framework that filters noise at the point, object, and trajectory levels. At the raw point level, we observe that noise points from different spatial regions exhibit distinguishable and consistent signal fingerprints, which we can model to identify and remove them. At the object level, we design spatial and velocity consistency checks to identify false objects, and further compute confidence scores by aggregating these checks over multiple frames for more reliable discrimination. At the final trajectory level, we propose a Transformer-based network that captures multi-frame motion patterns to filter the few remaining false trajectories. We evaluated BSense on a commercial 5G-A base station deployed in an urban environment. The UAV was instructed to fly along 25 distinct trajectories across 54 cases over 7 days, yielding 155 minutes of data with more than 14,000 frames. On this dataset, our system reduces the number of false detections from an average of 168.05 per frame to 0.04, achieving an average F1 score of 95.56% and a mean localization error of 4.9 m at ranges up to 1,000 m.
comment: 14 pages, 26 figures, to appear in ACM MobiSys 2026
☆ Enabling Programmable Inference and ISAC at the 6GR Edge with dApps
The convergence of communication, sensing, and Artificial Intelligence (AI) in the Radio Access Network (RAN) offers compelling economic advantages through shared spectrum and infrastructure. How can inference and sensing be integrated in the RAN infrastructure at a system level? Current abstractions in O-RAN and 3GPP lack the interfaces and capabilities to support (i) a dynamic life cycle for inference and Integrated Sensing and Communication (ISAC) algorithms, whose requirements and sensing targets may change over time and across sites; (ii) pipelines for AI-driven ISAC, which need complex data flows, training, and testing; (iii) dynamic device and stack configuration to balance trade-offs between connectivity, sensing, and inference services. This paper analyzes the role of a programmable, software-driven, open RAN in enabling the intelligent edge for 5G and 6G systems. We identify real-time user-plane data exposure, open interfaces for plug-and-play inference and ISAC models, closed-loop control, and AI pipelines as elements that evolutions of the O-RAN architecture can uniquely provide. Specifically, we describe how dApps - a real-time, user-plane extension of O-RAN - and a hierarchy of controllers enable real-time AI inference and ISAC. Experimental results on an open-source RAN testbed demonstrate the value of exposing I/Q samples and real-time RAN telemetry to dApps for sensing applications.
comment: Submitted to IEEE for publication, copyright may change without notice. 5 figures, 1 table
☆ A Multi-Sensor Fusion Parking Barrier System with Lightweight Vision on Edge
To address the challenges of simultaneously satisfying detection accuracy, edge real-time performance, low-power operation, and end-to-end business linkage in parking scenarios, this paper proposes an intelligent parking barrier system based on deep learning and multi-sensor fusion. The system adopts a three-layer collaborative architecture comprising an edge sensing node layer, a cloud business service layer, and a front-end management application layer. On the edge side, a Raspberry Pi 5 integrates a camera, infrared ranging sensor, MPU6050 attitude sensor, and LoRa module for parking-state sensing and local decision-making. At the algorithmic level, YOLOv3-tiny is structurally pruned for single-class detection, compressing model weights to approximately 33 MB. At the decision level, an asymmetric infrared-vision-inertial fusion state machine is designed, employing an "infrared trigger - visual confirmation - inertial fallback" mechanism to enhance robustness under nighttime, occlusion, and impact disturbances. Experimental results show that after over 5000 training iterations, mAP@0.5 reaches 96.5%-98.2%. On Raspberry Pi 5, single-frame inference latency at 416x416 resolution is 600-850 ms, meeting polling requirements of 5 s (idle) and 10 s (occupied). Average power consumption decreases from 4.02 W to 1.02 W, achieving approximately 74% energy savings. Joint debugging tests further validate the solution's advantages in detection accuracy, response timeliness, energy efficiency, and engineering deployability.
☆ 1.5 Million Messages Per Second on 3 Machines: Benchmarking and Latency Optimization of Apache Pulsar at Enterprise Scale
This paper presents two independent contributions for Apache Pulsar practitioners. First, we validate 1,499,947 msg/s at 3.88 ms median publish latency on just three bare-metal Kubernetes nodes running Pulsar 4.0.8 with Java 21 and ZGC Generational garbage collection, and project a hardware-driven path to 15 million msg/s on 15 machines using five independent clusters with key-based partition routing. Hardware selection -- specifically dedicated NVMe journals achieving 0.02 ms fdatasync and 25 Gbps network interfaces -- is the primary determinant of throughput ceiling, not compute or software tuning. Second, we trace the complete latency optimization journey from 213 ms GC spikes and 13-18 ms median publish latency in production to 3.88 ms through Java Flight Recorder guided root cause analysis. Three independent root causes are identified and resolved: G1GC pauses eliminated by switching to ZGC Generational; journal fdatasync latency reduced from 5.1 ms to 0.02 ms through NVMe journal dedication; and a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread that degrades fdatasync from under 1 ms to 15-22 ms even across physically separate NVMe drives sharing the kernel block layer. This finding is undocumented in official Apache Pulsar and BookKeeper documentation and is relevant to all Pulsar operators experiencing unexplained P99.9 latency spikes. The combined optimizations achieve a 4.7x latency improvement at 50x higher throughput.
comment: 4 pages, 3 figures, 7 tables. Short paper submitted to CNSM 2026 (22nd International Conference on Network and Service Management)
☆ LoRaWAN Gateway Placement for Network Planning Using Ray Tracing-based Channel Models
Network planning is a fundamental task in wireless communications, primarily focused on guaranteeing adequate coverage for every network device. In this context, the quality of any planning effort strongly depends on the channel model adopted when designing the simulations. Given this motivation, this work investigates how different channel models influence the placement of Long Range Wide Area Network (LoRaWAN) gateways (GWs), formulating an optimization problem that contrasts stochastic and empirical models with ray-tracing-based models. To this end, we developed a framework that integrates ray tracing (RT) simulators with a discrete-event network simulator. Using this framework to generate LoRaWAN wireless data metrics, we employ an optimization model that determines the optimized GW placement under different channel models and power constraints. Our results show that the optimized solution is highly sensitive to the chosen channel model, even when considering the same scenarios with different RT simulators, revealing a clear trade-off between computational cost and the fidelity of the solution to real-world conditions.
comment: 10 pages, 10 figures and 5 tables
♻ ☆ Shy Guys: A Light-Weight Approach to Detecting Robots on Websites
Automated bots now account for roughly half of all web requests, and an increasing number deliberately spoof their identity, either to evade detection or to avoid respecting robots.txt. Existing countermeasures are either resource-intensive (JavaScript challenges, CAPTCHAs), cost-prohibitive (commercial solutions), or degrade the user experience. This paper proposes a lightweight, passive approach to bot detection that combines user-agent string analysis with favicon-based heuristics, operating entirely on standard web server logs with no client-side interaction. We evaluate the method on over 4.6 million requests containing 54,945 unique user-agent strings collected from websites hosted around the world. Our approach detects 67.7% of bot traffic while maintaining a false-positive rate of 3%, outperforming the state of the art (less than 20%). This method can serve as a first line of defence, routing only genuinely ambiguous requests to active challenges and preserving the experience of legitimate users.
comment: 10 pages, 6 figures, submitted to IFIP TMA 2026
♻ ☆ Optimizing Entanglement Distribution Protocols: Maximizing Classical Information in Quantum Networks
Efficient entanglement distribution is the foundational challenge in realizing large-scale Quantum Networks. However, state-of-the-art solutions are frequently limited by restrictive operational assumptions, prohibitive computational complexities, and performance metrics that misalign with practical application needs. To overcome these barriers, this paper addresses the entanglement distribution problem by introducing four pivotal advances. First, recognizing that the primary application of quantum communication is the transmission of private information, we derive the Ensemble Capacity (EC), a novel metric that explicitly quantifies the secure classical information enabled by the entanglement distribution. Second, we propose a generalized mathematical formulation that removes legacy structural restrictions in the solution space. Our formulation supports an unconstrained, arbitrary sequencing of entanglement swapping and purification. Third, to efficiently navigate the resulting combinatorial optimization space, we introduce a novel Dynamic Programming (DP)-based hypergraph generation algorithm. Unlike prior methods, our approach avoids artificial fidelity quantization, preserving exact, continuous fidelities while proactively pruning sub-optimal trajectories. Finally, we encapsulate these algorithmic solutions into CODE, a system-level, two-tiered orchestration framework designed to enable near-real-time network responsiveness. Extensive evaluations confirm that our DP-driven architecture yields superior private classical information capacity and significant reductions in computational complexity, successfully meeting the strict sub-second latency thresholds required for dynamic QN operation.
♻ ☆ Statistical Verification of Medium-Access Parameterization for Power-Grid Edge Ad Hoc Sensor Networks
The widespread deployment of power grid ad hoc sensor networks based on IEEE 802.15.4 raises reliability challenges when nodes selfishly adapt CSMA/CA parameters to maximize individual performance. Such behavior degrades reliability, energy efficiency, and compliance with strict grid constraints. Existing analytical and simulation approaches often fail to rigorously evaluate configurations under asynchronous, event-driven, and resource-limited conditions. We develop a verification framework that integrates stochastic timed hybrid automata with statistical model checking (SMC) with confidence bounds to formally assess CSMA/CA parameterizations under grid workloads. By encoding node- and system-level objectives in temporal logic and automating protocol screening via large-scale statistical evaluation, the method certifies Nash equilibrium strategies that remain robust to unilateral deviations. In a substation-scale scenario, the certified equilibrium improves utility from 0.862 to 0.914 and raises the delivery ratio from 89.5% to 93.2% when compared with an aggressive tuning baseline. Against a delivery-oriented baseline, it reduces mean per-cycle energy from 152.8 mJ to 149.2 mJ while maintaining comparable delivery performance. Certified configurations satisfy latency, reliability, and energy constraints with robustness coefficients above 0.97 and utility above 0.91.
♻ ☆ Leaf-centric Logical Topology Design for OCS-based GPU Clusters
Recent years have witnessed the growing deployment of optical circuit switches (OCS) in commercial GPU clusters (e.g., Google A3 GPU cluster) optimized for machine learning (ML) workloads. Such clusters adopt a three-tier leaf-spine-OCS topology: servers attach to leaf-layer electronic packet switches (EPSes); these leaf switches aggregate into spine-layer EPSes to form a Pod; and multiple Pods are interconnected via core-layer OCSes. Unlike EPSes, OCSes only support circuit-based paths between directly connected spine switches, potentially inducing a phenomenon termed routing polarization, in which the bandwidth requirements between specific pairs of Pods are unevenly fulfilled through links among different spine switches. The resulting imbalance induces traffic contention and bottlenecks on specific leaf-to-spine links, ultimately reducing ML training throughput. To mitigate this issue, we introduce a leaf-centric paradigm to ensure traffic originating from the same leaf switch is evenly distributed across multiple spine switches with balanced loads. Through rigorous theoretical analysis, we establish a sufficient condition for avoiding routing polarization and propose a corresponding logical topology design algorithm with polynomial-time complexity. Large-scale simulations validate up to 19.27% throughput improvement and a 99.16% reduction in logical topology computation overhead compared to Mixed Integer Programming (MIP)-based methods.
Distributed, Parallel, and Cluster Computing 23
☆ From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters
Modern GPU-based high-performance computing clusters offer unprecedented communication bandwidth through heterogeneous intra-node interconnects and inter-node networks. However, despite this high aggregate bandwidth, many real-world communication patterns fail to fully utilize the available hardware. Traffic skew often leads to situations where a small subset of links becomes oversaturated while others remain underutilized, resulting in congestion, latency spikes, and poor scalability. Existing communication frameworks such as NCCL and MPI with UCX typically rely on static fastest-path routing or hashing-based multi-rail striping, which leaves significant bandwidth unused when runtime traffic deviates from expected distributions. To address these limitations, we propose NIMBLE (Node-Interconnect Multi-path Balancing with Execution-time orchestration), a runtime communication orchestration system that dynamically redistributes traffic to balance link utilization across all available intra-node and inter-node paths. NIMBLE formulates this as a capacity-normalized minimum-congestion optimization problem and solves it efficiently using a multiplicative-weights algorithm. It further employs CUDA-aware GPU kernel-based RDMA pipelining to route traffic through intermediate GPUs and rail-matched NICs. The system is endpoint-driven, integrates transparently with existing communication libraries without requiring application changes, and preserves ordering, determinism, and low overhead. On H100-SXM4 nodes with fully connected NVLink and four NDR400 rails, NIMBLE achieves up to 2.3x higher intra-node bandwidth and 3.8x higher inter-node throughput compared to single-path baselines. It outperforms NCCL and MPI by up to 5.2x on skewed All-to-Allv workloads and 1.35x on end-to-end LLM MoE workloads, while matching baseline performance under balanced traffic.
☆ MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC-Attention.git
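The abstract does not spell out the "numerically stable merge" used in the complete stage. One common way to combine softmax attention computed over two disjoint key ranges is to carry each part's running maximum and normalizer and rescale before averaging. The sketch below shows that general technique for a single query in plain Python; it is an assumption about the kind of merge involved, not MAC-Attention's kernel, and all names are illustrative.

```python
import math


def partial_attention(scores, values):
    """Softmax attention over one block of keys for a single query.

    Returns (m, s, out): the max score, the normalizer sum(exp(score - m)),
    and the block-local attention output vector.
    """
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    s = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / s for d in range(dim)]
    return m, s, out


def merge(a, b):
    """Combine two partial results as if attention ran over both blocks."""
    m1, s1, o1 = a
    m2, s2, o2 = b
    m = max(m1, m2)
    # Rescale each normalizer to the shared maximum so exponents stay <= 0.
    w1 = s1 * math.exp(m1 - m)
    w2 = s2 * math.exp(m2 - m)
    out = [(w1 * x + w2 * y) / (w1 + w2) for x, y in zip(o1, o2)]
    return m, w1 + w2, out


# Hypothetical toy data: four keys with 2-dimensional values.
scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
full = partial_attention(scores, values)
merged = merge(partial_attention(scores[:2], values[:2]),
               partial_attention(scores[2:], values[2:]))
```

Because only the triple (max, normalizer, output) is needed from each part, a reused result for the older context and a fresh result for the KV tail can be fused without revisiting the older keys.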
☆ Source Known Identifiers: A Three-Tier Identity System for Distributed Applications
Distributed applications need identifiers that satisfy storage efficiency, chronological sortability, origin metadata embedding, zero-lookup verifiability, confidentiality for external consumers, and multi-century addressability. Based on our literature survey, no existing scheme provides all six of these identifier properties within a unified system. This paper introduces Source Known Identifiers (SKIDs), a three-tier identity system that projects a single entity identity across trust boundaries, addressing all six properties. The first tier, Source Known ID (SKID), is a 64-bit signed integer embedding a timestamp with a 250-millisecond precision, application topology, and a per-entity-type sequence counter. It serves as the database primary key, providing compact storage (8 bytes) and natural B-tree ordering for optimized database indexing. The second tier, Source Known Entity ID (SKEID), extends the SKID into a 128-bit Universally Unique Identifier (UUID) compatible value by adding an entity type discriminator, an epoch selector, and a BLAKE3 keyed message authentication code (MAC). SKEIDs enable zero-lookup verification of identifier origin, integrity, and entity type within trusted environments, with a big-endian byte layout that preserves chronological ordering in lexicographic UUID string comparisons. The third tier, Secure SKEID, encrypts the entire SKEID using AES-256 symmetric encryption as a single-block pseudorandom permutation, producing ciphertext indistinguishable from random bytes while remaining compatible with standard UUID data-type parsers in string representation. Deterministic bidirectional transformations connect all three tiers.
comment: 22 pages, 3 figures, 11 tables, submitted to PeerJ
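Illustrative sketch (assumptions labeled): the abstract describes the first-tier SKID as a 64-bit signed integer embedding a 250-millisecond timestamp, application topology, and a sequence counter, but it does not give the field widths. The widths, the custom epoch, and the function names below are hypothetical choices made only to show how such a layout can keep integer order aligned with time order.

```python
import struct

TICK_MS = 250                      # timestamp precision stated in the abstract
# Hypothetical field widths; 63 bits in total so the signed value stays >= 0.
TS_BITS, TOPO_BITS, SEQ_BITS = 36, 12, 15
EPOCH_MS = 1_700_000_000_000       # hypothetical custom epoch (Unix ms)

def pack_id(unix_ms, topology, sequence):
    """Pack timestamp ticks, topology, and sequence into one non-negative int."""
    ticks = (unix_ms - EPOCH_MS) // TICK_MS
    if not 0 <= ticks < (1 << TS_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= topology < (1 << TOPO_BITS):
        raise ValueError("topology out of range")
    if not 0 <= sequence < (1 << SEQ_BITS):
        raise ValueError("sequence out of range")
    # Timestamp in the most significant bits makes integer order follow time.
    return (ticks << (TOPO_BITS + SEQ_BITS)) | (topology << SEQ_BITS) | sequence

def unpack_id(value):
    """Recover (unix_ms rounded down to a tick, topology, sequence)."""
    sequence = value & ((1 << SEQ_BITS) - 1)
    topology = (value >> SEQ_BITS) & ((1 << TOPO_BITS) - 1)
    ticks = value >> (TOPO_BITS + SEQ_BITS)
    return EPOCH_MS + ticks * TICK_MS, topology, sequence

def to_key_bytes(value):
    """8-byte big-endian form; byte order matches numeric order for values >= 0."""
    return struct.pack(">q", value)
```

With 36 timestamp bits at 250 ms per tick, this hypothetical layout covers roughly 540 years from the chosen epoch.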
☆ A Lightweight Hybrid Publish/Subscribe Event Fabric for IPC and Modular Distributed Systems
Modular software deployed on mini compute units in controlled distributed environments often needs two messaging paths: low-overhead in-process coordination and selective cross-node distribution. In practice, event identity, serialization, and transport bridging are frequently implemented as ad hoc glue, which complicates inter-process communication (IPC), structured routing, and shutdown behavior. This paper presents CNS, a lightweight local-first hybrid event fabric centered on asynchronous fire-and-forget messaging. CNS combines a typed event key, per-family serialization and validation, a local publish/subscribe context for in-process coordination, and a NATS-backed distributed context for inter-node distribution. A bridge runtime moves events between the two contexts while preserving a common routing vocabulary. The primary operating model is fire-and-forget publication and subscription; bidirectional request-reply remains available as a secondary extension on the same subject space. A Python prototype and single-machine measurements are reported. Local-only delivery averaged about 30 $μ$s. Distributed-only delivery averaged 1.26-1.37 ms, and the hybrid bridge averaged 1.64-1.89 ms. Validation introduced modest overhead relative to serialization choice. The resulting artifact is suited to structured IPC and practical message movement within modular services and across bounded sets of controlled nodes.
☆ Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing
The Production and Distributed Analysis (PanDA) system, originally developed for the ATLAS experiment at the CERN Large Hadron Collider (LHC), has evolved into a robust platform for orchestrating large-scale workflows across distributed computing resources. Coupled with its intelligent Distributed Dispatch and Scheduling (iDDS) component, PanDA supports AI/ML-driven workflows through a scalable and flexible workflow engine. We present an AI-assisted framework for detector design optimization that integrates multi-objective Bayesian optimization with the PanDA--iDDS workflow engine to coordinate iterative simulations across heterogeneous resources. The framework addresses the challenge of exploring high-dimensional parameter spaces inherent in modern detector design. We demonstrate the framework using benchmark problems and realistic studies of the ePIC and dRICH detectors for the Electron-Ion Collider (EIC). Results show improved automation, scalability, and efficiency in multi-objective optimization. This work establishes a flexible and extensible paradigm for AI-driven detector design and other computationally intensive scientific applications.
☆ A Precision Emulation Approach to the GPU Acceleration of Ab Initio Electronic Structure Calculations
This study explores the use of INT8-based emulation for accelerating traditional FP64-based HPC workloads on modern GPU architectures. Using SCILIB-Accel, an automatic BLAS offload tool for cache-coherent Unified Memory Architecture, we emulate FP64 matrix multiplications in the LSMS CPU application in the MuST suite without code changes. We find that accuracy depends on both arithmetic precision and the properties of the operator, which can be addressed through tunable precision emulation. Unlike traditional mixed-precision approaches, this method preserves the original algorithms while optimizing hardware utilization. We showcase the potential of improving accuracy and performance at the same time. This work highlights the potential of AI-driven hardware to transform HPC, advocating for adaptive precision strategies in future scientific computing.
☆ M3SA: Exploring Datacenter Performance and Climate-Impact with Multi- and Meta-Model Simulation and Analysis
Datacenters are vital to our digital society, but they consume a considerable fraction of global electricity, and demand is projected to increase. To improve their sustainability and performance, we envision that simulators will become primary decision-making tools. However, unlike simulators in other fields focusing on key societal infrastructure, such as waterworks and mass transit, datacenter simulators do not yet combine multiple independent models in their operation and thus suffer from issues associated with singular models, such as specialization and lack of adaptability to operational phenomena. To address this challenge, we propose M3SA, a datacenter simulation and analysis framework that uses discrete-event simulation to predict, for each model, the impact on climate and performance under various realistic datacenter conditions, and then combines these predictions. We design an architecture for simulating multiple concurrent models (Multi-Model), a technique for integrating the results of multiple models into a Meta-Model, and a procedure for quantifying Meta-Model accuracy. Through experiments with an M3SA prototype, we show that (i) M3SA can reproduce and enhance peer-reviewed experiments; (ii) M3SA can predict operational phenomena (e.g., failures) of datacenters, running fundamentally different workload traces; (iii) M3SA enables various types of what-if and how-to analysis, such as how to configure CO2-aware migration over yearly energy-production patterns. M3SA has been integrated into the open-source simulator OpenDC and is available at: https://github.com/atlarge-research/opendc-m3sa.
☆ Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras
Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
comment: 6 pages, 3 figures, 5 tables; supplementary video included as ancillary file
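Illustrative sketch (not the authors' implementation): a single-pass epsilon-net style novelty filter can be written as "keep a frame only if its embedding is farther than epsilon from every frame kept so far". The use of cosine distance, the threshold value, and the function name are assumptions made for this example.

```python
import numpy as np

def novelty_filter(embeddings, epsilon):
    """Return indices of frames retained by a single-pass novelty filter.

    A frame is kept only if its cosine distance to every previously kept
    frame exceeds `epsilon`; otherwise it is treated as redundant.
    """
    kept_idx = []
    kept_vecs = []
    for i, e in enumerate(embeddings):
        v = np.asarray(e, dtype=float)
        v = v / np.linalg.norm(v)               # unit-normalize the embedding
        if kept_vecs:
            sims = np.stack(kept_vecs) @ v      # cosine similarity to kept set
            if 1.0 - sims.max() <= epsilon:     # too close to a kept frame
                continue
        kept_idx.append(i)
        kept_vecs.append(v)
    return kept_idx

# Toy 2-D "embeddings": frames 1, 3, and 4 nearly repeat frames 0 and 2.
frames = [[1, 0], [0.999, 0.01], [0, 1], [0.01, 1], [1, 0]]
retained = novelty_filter(frames, epsilon=0.05)
```

The retained index is the only thing stored, so near-duplicate frames can no longer crowd distinct results out of a top-k search.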
☆ Efficient Parallel Compilation and Profiling of Quantum Circuits at Large Scales
Compiling quantum circuits is a major bottleneck in quantum computing and, given the circuit scale expected within a few years, compilation times are likely to become infeasibly long. Techniques to reduce compilation time for quantum circuits are sorely needed. Furthermore, resources for testing acceleration techniques are similarly lacking, owing to the limited scale of circuits in benchmark suites and to mismatches between the characteristics of these circuits and those produced by random circuit generators. This paper addresses the latter problem by describing a random circuit generator that allows control of circuit density, width, and depth parameters. The generator is used to derive 8000 large-scale experimental circuits and to test a novel approach to compiler parallelisation, which separates a circuit into sub-circuits that are compiled in parallel and recombined to produce a compiled circuit. When the parallel approach was tested using Qiskit, a peak speedup of 15.56 was achieved with corresponding overheads of less than 1%.
☆ Polynomial Time Local Decision Revisited
We consider three classification systems for distributed decision tasks: With unbounded computation and certificates, defined by Balliu, D'Angelo, Fraigniaud, and Olivetti [JCSS'18], and with (two flavors of) polynomially bounded local computation and certificates, defined in recent works by Aldema Tshuva and Oshman [OPODIS'23], and by Reiter [PODC'24]. The latter two differ in the way they evaluate the polynomial bounds: the former considers polynomials with respect to the size of the graph, while the latter refers to being polynomial in the size of each node's local neighborhood. We start by revisiting decision without certificates. For this scenario, we show that the latter two definitions coincide: roughly, a node cannot know the graph size, and thus can only use a running time dependent on its neighborhood. We then consider decision with certificates. With existential certificates ($Σ_1$-type classes), a larger running time defines strictly larger classes of languages: when it grows from being polynomial in each node's view, through polynomial in the graph's size, and to unbounded, the derived classes strictly contain each other. With universal certificates ($Π_1$-type classes), on the other hand, we prove a surprising incomparability result: having running time bounded by the graph size sometimes allows us to decide languages undecidable even with unbounded certificates. We complement these results with other containment and separation results, which together portray a surprisingly complex lattice of strict containment relations between the classes at the base of the three classification systems.
☆ Exploration of Energy and Throughput Tradeoffs for Dataflow Networks
The introduction of dynamic power management strategies such as clock gating and power gating in dataflow networks has been shown to provide significant energy savings when applied during idle times. However, these strategies can also degrade throughput due to shutdown and wake-up delays. Such throughput degradations might be particularly detrimental to signal processing systems that require a guaranteed throughput. As a solution, this paper first contributes a linear-program formulation for finding a periodic maximal-throughput schedule of a given so-called self-powering dataflow network, in which actors, realized in hardware, are allowed to go to sleep whenever they are not enabled to fire. Depending on which actors are allowed to power down, tradeoffs between throughput and energy savings can be obtained. As a second contribution, we propose a mixed-integer-linear-program formulation to determine a periodic schedule that satisfies a given throughput while minimizing the overall energy per period by identifying which actors should be allowed to power down in phases of idleness and which should not. As a third contribution, we propose a multi-objective design-space exploration strategy called "Hop and Skip" to efficiently explore the Pareto front of energy and throughput solutions. Experimental evaluations on a set of existing benchmarks and randomly generated graphs show significant exploration time reductions over a brute-force sweep. Finally, a real-world case study is presented, and we report the achievable energy savings and throughputs of the related dataflow network for three cases: (a) all actors are always active, (b) all actors are self-powered, and (c) the optimal energy and throughput tradeoff points found by the proposed design-space exploration strategy.
☆ Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
Backdoor attacks on federated learning (FL) are most often evaluated with synthetic corner patches or out-of-distribution (OOD) patterns that are unlikely to arise in practice. In this paper, we revisit the backdoor threat to standard FL (a single global model) under a more realistic setting where triggers must be semantically meaningful, in-distribution, and visually plausible. We propose SABLE, a Semantics-Aware Backdoor for LEarning in federated settings, which constructs natural, content-consistent triggers (e.g., semantic attribute changes such as sunglasses) and optimizes an aggregation-aware malicious objective with feature separation and parameter regularization to keep attacker updates close to benign ones. We instantiate SABLE on CelebA hair-color classification and the German Traffic Sign Recognition Benchmark (GTSRB), poisoning only a small, interpretable subset of each malicious client's local data while otherwise following the standard FL protocol. Across heterogeneous client partitions and multiple aggregation rules (FedAvg, Trimmed Mean, MultiKrum, and FLAME), our semantics-driven triggers achieve high targeted attack success rates while preserving benign test accuracy. These results show that semantics-aligned backdoors remain a potent and practical threat in federated learning, and that robustness claims based solely on synthetic patch triggers can be overly optimistic.
☆ Downsides of Smartness Across Edge-Cloud Continuum in Modern Industry
The fast pace of modern AI is rapidly transforming traditional industrial systems into vast, intelligent and potentially unmanned autonomous operational environments driven by AI-based solutions. These solutions leverage various forms of machine learning, reinforcement learning, and generative AI. The introduction of such smart capabilities has pushed the envelope in multiple industrial domains, enabling predictive maintenance, optimized performance, and streamlined workflows. These solutions are often deployed across the Industrial Internet of Things (IIoT) and supported by the Edge-Fog-Cloud computing continuum to enable urgent (i.e., real-time or near real-time) decision-making. Despite the current trend of aggressively adopting these smart industrial solutions to increase profit, quality, and efficiency, large-scale integration and deployment also bring serious hazards that if ignored can undermine the benefits of smart industries. These hazards include unforeseen interoperability side-effects and heightened vulnerability to cyber threats, particularly in environments operating with a plethora of heterogeneous IIoT systems. The goal of this study is to shed light on the potential consequences of industrial smartness, with a particular focus on security implications, including vulnerabilities, side effects, and cyber threats. We distinguish software-level downsides stemming from both traditional AI solutions and generative AI from those originating in the infrastructure layer, namely IIoT and the Edge-Cloud continuum. At each level, we investigate potential vulnerabilities, cyber threats, and unintended side effects. As industries continue to become smarter, understanding and addressing these downsides will be crucial to ensure secure and sustainable development of smart industrial systems.
☆ 1.5 Million Messages Per Second on 3 Machines: Benchmarking and Latency Optimization of Apache Pulsar at Enterprise Scale
This paper presents two independent contributions for Apache Pulsar practitioners. First, we validate 1,499,947 msg/s at 3.88 ms median publish latency on just three bare-metal Kubernetes nodes running Pulsar 4.0.8 with Java 21 and ZGC Generational garbage collection, and project a hardware-driven path to 15 million msg/s on 15 machines using five independent clusters with key-based partition routing. Hardware selection -- specifically dedicated NVMe journals achieving 0.02 ms fdatasync and 25 Gbps network interfaces -- is the primary determinant of throughput ceiling, not compute or software tuning. Second, we trace the complete latency optimization journey from 213 ms GC spikes and 13-18 ms median publish latency in production to 3.88 ms through Java Flight Recorder guided root cause analysis. Three independent root causes are identified and resolved: G1GC pauses eliminated by switching to ZGC Generational; journal fdatasync latency reduced from 5.1 ms to 0.02 ms through NVMe journal dedication; and a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread that degrades fdatasync from under 1 ms to 15-22 ms even across physically separate NVMe drives sharing the kernel block layer. This finding is undocumented in official Apache Pulsar and BookKeeper documentation and is relevant to all Pulsar operators experiencing unexplained P99.9 latency spikes. The combined optimizations achieve a 4.7x latency improvement at 50x higher throughput.
comment: 4 pages, 3 figures, 7 tables. Short paper submitted to CNSM 2026 (22nd International Conference on Network and Service Management)
♻ ☆ A Survey on Graph Neural Network Acceleration: Algorithms, Systems, and Customized Hardware
Graph neural networks (GNNs) are emerging for machine learning research on graph-structured data. GNNs achieve state-of-the-art performance on many tasks, but they face scalability challenges in real-world applications that involve large volumes of data and strict latency requirements. Many studies have been conducted on how to accelerate GNNs in an effort to address these challenges. These acceleration techniques touch on various aspects of the GNN pipeline, from smart training and inference algorithms to efficient systems and customized hardware. As the amount of research on GNN acceleration has grown rapidly, a systematic treatment that provides a unified view and addresses the complexity of the relevant works has been lacking. In this survey, we provide a taxonomy of GNN acceleration, review the existing approaches, and suggest future research directions. Our taxonomic treatment of GNN acceleration connects the existing works and sets the stage for further development in this area.
♻ ☆ When Does Global Attention Help? A Unified Empirical Study on Atomistic Graph Learning
Graph neural networks (GNNs) are widely used as surrogates for costly experiments and first-principles simulations to study the behavior of compounds at atomistic scale, and their architectural complexity is constantly increasing to enable the modeling of complex physics. While most recent GNNs combine traditional message-passing neural network (MPNN) layers, which model short-range interactions, with more advanced graph transformers (GTs), whose global attention mechanisms model long-range interactions, it is still unclear when global attention provides real benefits over well-tuned MPNN layers, because implementations, features, and hyperparameter tuning are inconsistent across studies. We introduce the first unified, reproducible benchmarking framework, built on HydraGNN, that enables seamless switching among four controlled model classes: MPNN, MPNN with chemistry/topology encoders, GPS-style hybrids of MPNN with global attention, and fully fused local-global models with encoders. Using seven diverse open-source datasets for benchmarking across regression and classification tasks, we systematically isolate the contributions of message passing, global attention, and encoder-based feature augmentation. Our study shows that encoder-augmented MPNNs form a robust baseline, while fused local-global models yield the clearest benefits for properties governed by long-range interaction effects. We further quantify the accuracy-compute trade-offs of attention, reporting its memory overhead. Together, these results establish the first controlled evaluation of global attention in atomistic graph learning and provide a reproducible testbed for future model development.
comment: 44 pages, 8 figures, 19 tables
♻ ☆ Joint$λ$: Orchestrating Serverless Workflows on Jointcloud FaaS Systems
Existing serverless workflow orchestration systems are predominantly designed for a single-cloud FaaS system, leading to vendor lock-in. This restricts performance optimization, cost reduction, and availability of applications. However, orchestrating serverless workflows on Jointcloud FaaS systems faces two main challenges: (1) additional overhead caused by centralized cross-cloud orchestration; and (2) a lack of reliable failover and fault-tolerant mechanisms for cross-cloud serverless workflows. To address these challenges, we propose Joint$λ$, a distributed runtime system designed to orchestrate serverless workflows on multiple FaaS systems without relying on a centralized orchestrator. Joint$λ$ introduces a compatibility layer, Backend-Shim, leveraging inter-cloud heterogeneity to optimize makespan and reduce costs with on-demand billing. By using function-side orchestration instead of centralized nodes, it enables independent function invocations and data transfers, reducing cross-cloud communication overhead. For high availability, it ensures exactly-once execution via datastores and failover mechanisms for serverless workflows on Jointcloud FaaS systems. We validate Joint$λ$ on two heterogeneous FaaS systems, AWS and Aliyun, with four workflows. Compared to the most advanced commercial orchestration services for single-cloud serverless workflows, Joint$λ$ reduces makespan by up to 3.3$\times$ while saving up to 65% in cost. Joint$λ$ is also up to 4.0$\times$ faster than state-of-the-art orchestrators for cross-cloud serverless workflows, while achieving competitive cost in representative scenarios and providing strong execution guarantees.
♻ ☆ The Missing Adapter Layer for Research Computing
Higher Degree by Research (HDR) candidates increasingly depend on cloud-provisioned virtual machines and local GPU hardware for their computational experiments, yet a persistent and under-addressed gap exists between having compute resources and using them productively. Cloud and infrastructure teams can provision virtual machines, but the path from a raw VM to a reproducible, GPU-ready research environment remains a significant barrier for researchers who are domain experts, not systems engineers. We identify this gap as a missing adapter layer between cloud provisioning and interactive research work. We present a lightweight, open-source solution built on k3s and Coder that implements this adapter layer and is already in active use in our research workspace environment. Our CI/CD pipeline connects GitHub directly to the local cluster, deploying research projects in under five minutes. We define a concrete metrics framework for evaluating this layer -- covering deployment latency, environment reproducibility, onboarding friction, and resource utilisation -- and establish baselines against which improvements can be measured.
comment: V2.0 version
♻ ☆ ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning
Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.
comment: 23 pages, 7 figures
♻ ☆ zk-X509: Privacy-Preserving On-Chain Identity from Legacy PKI via Zero-Knowledge Proofs
Public blockchains impose an inherent tension between regulatory compliance and user privacy. Existing on-chain identity solutions require centralized KYC attestors, specialized hardware, or Decentralized Identifier (DID) frameworks needing entirely new credential infrastructure. Meanwhile, over four billion active X.509 certificates constitute a globally deployed, government-grade trust infrastructure largely unexploited for decentralized identity. This paper presents zk-X509, a privacy-preserving identity system bridging legacy Public Key Infrastructure (PKI) with public ledgers via a RISC-V zero-knowledge virtual machine (zkVM). Users prove ownership of standard X.509 certificates without revealing private keys or personal identifiers. Crucially, the private key never enters the ZK circuit; ownership is proven via OS keychain signature delegation (macOS Security.framework, Windows CNG). The circuit verifies certificate chain validity, temporal validity, key ownership, trustless CRL revocation, blockchain address binding, and Sybil-resistant nullifier generation. It commits 13 public values, including a Certificate Authority (CA) Merkle root hiding the issuing CA, and four selective disclosure hashes. We formalize eight security properties under a Dolev-Yao adversary with game-based definitions and reductions to sEUF-CMA, SHA-256 collision resistance, and ZK soundness. Evaluated on the SP1 zkVM, the system achieves 11.8M cycles for ECDSA P-256 (17.4M for RSA-2048), with on-chain Groth16 verification costing ~300K gas. By leveraging certificates deployed at scale across jurisdictions, zk-X509 enables adoption without new trust establishment, complementing emerging DID-based systems.
comment: v2: Corrected comparison tables (zkPassport, Worldcoin, Polygon ID, Semaphore, zk-email) based on source verification. Refined security proofs. Clarified OS keychain integration and trusted setup descriptions
♻ ☆ Real-Time Driver Safety Scoring Through Inverse Crash Probability Modeling
Road crashes remain a leading cause of preventable fatalities. Existing prediction models predominantly produce binary outcomes, which offer limited actionable insights for real-time driver feedback. These approaches often lack continuous risk quantification, interpretability, and explicit consideration of vulnerable road users (VRUs), such as pedestrians and cyclists. This research introduces SafeDriver-IQ, a framework that transforms binary crash classifiers into continuous 0-100 safety scores by combining national crash statistics with naturalistic driving data from autonomous vehicles. The framework fuses National Highway Traffic Safety Administration (NHTSA) crash records with Waymo Open Motion Dataset scenarios, engineers domain-informed features, and incorporates a calibration layer grounded in transportation safety literature. Evaluation across 15 complementary analyses indicates that the framework reliably differentiates high-risk from low-risk driving conditions with strong discriminative performance. Findings further reveal that 87% of crashes involve multiple co-occurring risk factors, with non-linear compounding effects that increase the risk to 4.5x baseline. SafeDriver-IQ delivers proactive, explainable safety intelligence relevant to advanced driver-assistance systems (ADAS), fleet management, and urban infrastructure planning. This framework shifts the focus from reactive crash counting to real-time risk prevention.
comment: 10 pages, 13 figures, and 14 tables. Submitted to the EIT 2026 conference hosted by the University of Wisconsin-La Crosse and sponsored by IEEE Region 4 (R4)
♻ ☆ ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for LLMs integrating TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods suffer precision loss from heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. We derive a rigorous dequantization procedure fusing a 256-point Inverse FWHT into the CUDA shared-memory loading stage, ensuring reconstruction error is bounded exclusively by the ternary quantization grid with no additional error from the transform inversion. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq ε_q$, strictly smaller than uniform 3-bit baselines that do not exploit rotation-induced distribution normalization. TurboQuant lacks a native CUDA kernel, precluding direct deployment; naively composing TQ with existing weight quantizers introduces domain mismatch errors that accumulate across layers, degrading quality below standard 3-bit baselines. ITQ3_S resolves this by co-designing the FWHT rotation and quantization kernel as a unified pipeline grounded in the IQ3_S weight format, with the inverse transform fused into the CUDA MMQ kernel. Empirically, on the NVIDIA RTX 5090 (Blackwell), ITQ3_S achieves perplexity competitive with FP16 while delivering throughput exceeding 1.5x that of 4-bit alternatives via optimized DP4A and Tensor Core scheduling. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer hardware.
comment: 12 pages, 4 figures, 3 tables
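Illustrative sketch (CPU reference only, not the fused CUDA kernel described in the abstract): the rotation step relies on the Fast Walsh-Hadamard Transform, which spreads the energy of an outlier across all coordinates and, with orthonormal scaling, is its own inverse. The function name and the toy weight vector are assumptions for this example.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length vector.

    With the 1/sqrt(n) scaling the transform is an involution, so applying
    it twice returns the input (up to floating-point rounding).
    """
    y = np.array(x, dtype=np.float64)
    n = y.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()           # copy before overwriting in place
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b              # butterfly: sums
            y[i + h:i + 2 * h] = a - b      # butterfly: differences
        h *= 2
    return y / np.sqrt(n)

# A 256-dimensional block with a single large outlier weight.
w = np.zeros(256)
w[3] = 8.0
rotated = fwht(w)          # every entry now has magnitude 8 / sqrt(256) = 0.5
restored = fwht(rotated)   # inverse transform recovers the original block
```

After rotation the largest magnitude drops from 8 to 0.5 while the Euclidean norm is unchanged, which is the property that makes a uniform low-bit grid a better fit.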
♻ ☆ A Practical GPU-Accelerated Implementation of Orthogonal Matching Pursuit
Finding the sparsest solution to the underdetermined system $\mathbf{y}=\mathbf{Ax}$, given a tolerance, is known to be NP-hard. Many approximate solutions to this problem exist, and Orthogonal Matching Pursuit (OMP) is one of the most widely used. However, existing OMP implementations do not take full advantage of matrix properties or of modern CPU- and GPU-based linear algebra kernels. In this paper, we present an efficient implementation of OMP that leverages Cholesky inverse properties as well as the power of GPUs to deliver up to \textbf{310x speedup over Scikit-Learn} and \textbf{26x over SPAMS}. The package is published on PyPI (\texttt{pip install batched-omp}) and is fully scikit-learn compatible.
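Illustrative sketch (a naive reference, not the paper's method): the basic OMP loop picks the dictionary column most correlated with the residual, then refits all selected coefficients by least squares. This version recomputes the fit with `numpy.linalg.lstsq` at every step rather than using the Cholesky-based updates or GPU batching the abstract describes; the function names and the toy dictionary are assumptions.

```python
import numpy as np

def omp(A, y, n_nonzero, tol=1e-10):
    """Naive Orthogonal Matching Pursuit with a full refit at each step."""
    residual = np.asarray(y, dtype=float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break
        corr = np.abs(A.T @ residual)       # correlation with each atom
        if support:
            corr[support] = 0.0             # never select an atom twice
        support.append(int(np.argmax(corr)))
        # Least-squares refit on the selected atoms only.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    if support:
        x[support] = coef
    return x

def demo_recovery():
    """Recover a 2-sparse vector over an identity + Hadamard dictionary."""
    H = np.array([[1.0]])
    for _ in range(4):                      # Sylvester construction, 16 x 16
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    A = np.hstack([np.eye(16), H / 4.0])    # 16 x 32, unit-norm columns
    x_true = np.zeros(32)
    x_true[2], x_true[20] = 1.5, -2.0
    return x_true, omp(A, A @ x_true, n_nonzero=2)
```

The toy dictionary has mutual coherence 1/4, which is low enough for OMP to recover any 2-sparse vector exactly; real dictionaries generally offer no such guarantee.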
Software Engineering 35
☆ Beyond Localization: Recoverable Headroom and Residual Frontier in Repository-Level RAG-APR
Repository-level automated program repair (APR) increasingly treats stronger localization as the main path to better repair. We ask a more targeted question: once localization is strengthened, which post-localization levers still provide recoverable gains, which are bounded within our protocol, and what residual frontier remains? We study this question on SWE-bench Lite with three representative repository-level RAG-APR paradigms, Agentless, KGCompass, and ExpeRepair. Our protocol combines Oracle Localization, within-pool Best-of-K, fixed-interface added context probes with per-condition same-token filler controls and same-repository hard negatives, and a common-wrapper oracle check. Oracle Localization improves all three systems, but Oracle success still stays below 50%. Extra candidate diversity still helps inside the sampled 10-patch pools, but that headroom saturates quickly. Under the two fixed interfaces, most informative added context conditions still outperform their own matched controls. The common-wrapper check shows different system responses: under a common wrapper, gains remain large for KGCompass and ExpeRepair, while Agentless changes more with builder choice. Prompt-level fusion still leaves a large residual frontier: the best fixed probe adds only 6 solved instances beyond the native three-system Solved@10 union. Overall, stronger localization, bounded search, evidence quality, and interface design all shape repository-level repair outcomes.
☆ Practical Feasibility of Sustainable Software Engineering Tools and Techniques
While Sustainable Software Engineering (SSE) tools are widely studied in academia, their practical feasibility in industrial workflows, particularly in regulated environments, remains poorly understood. This study investigates how software practitioners perceive the feasibility of existing SSE tools and techniques, and examines the technical, organizational, and cultural factors shaping their adoption in practice. We identified prominent categories of SSE tools targeting energy consumption, green refactoring, and workload management, and evaluated them along three practitioner-relevant dimensions: installation, input requirements, and output formats. These were presented through an interactive web application and explored in workshops with 16 practitioners from a regulated financial-sector organization, followed by a survey of 27 software practitioners. Our findings suggest that the practitioners strongly favored tools that integrate into existing IDEs or pipelines, require minimal and locally scoped data access, and provide interpretable, actionable outputs such as dashboards or automated refactoring suggestions. In regulated settings, compliance requirements, approval processes, and time constraints significantly shaped feasibility perceptions. Our contribution lies in providing empirical evidence of these preferences alongside other factors that affect regulated industrial contexts. The findings offer actionable guidance for designing SSE tools that better align with real-world development workflows and organizational constraints.
☆ Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
Debugging distributed systems in production is inevitable and hard. Myriad interactions between concurrent components in modern, complex and large-scale systems cause non-deterministic bugs that offline testing and verification fail to capture. When bugs surface at runtime, their root causes may be far removed from their symptoms. To identify a root cause, developers often need evidence scattered across multiple components and traces. Unfortunately, existing tools fail to quickly and automatically record useful provenance information at low overheads, leaving developers to manually perform the onerous evidence collection task. Lumos is an online debugging framework that exposes application-level bug provenances--the computational history linking symptoms of an incident to their root causes. Lumos leverages dependency-guided instrumentation powered by static analysis to identify program state related to a bug's provenance, and exposes it via lightweight on-demand recording. Lumos provides developers with enough evidence to identify a bug's root cause, while incurring low runtime overhead and requiring only a few occurrences of the bug.
☆ Towards Supporting Quality Architecture Evaluation with LLM Tools
Architecture evaluation methods have been extensively used to evaluate software designs. Several evaluation methods have been proposed to analyze tradeoffs between different quality attributes. Also, having competing qualities leads to conflicts when selecting which quality-attribute scenarios are the most suitable ones for an architecture to tackle. Consequently, the scenarios required by the stakeholders must be prioritized and also analyzed for potential risks. Today, architecture quality evaluation is still carried out manually, often involving long brainstorming sessions to decide on the most adequate quality-attribute scenarios for the architecture. To reduce this effort and make the assessment and selection of scenarios more efficient, in this research we propose the use of LLMs to partially automate the evaluation activities. As a first step in validating this hypothesis, this paper investigates MS Copilot as an LLM tool to analyze quality-attribute scenarios suggested by students and reviewed by experienced architects. Specifically, our study compares the results of an Architecture Tradeoff Analysis Method (ATAM) exercise conducted in a software architecture course with the results of experienced software architects and with the output produced by the LLM tool. Our initial findings reveal that, in most cases, the LLM produces better and more accurate results regarding risks, sensitivity points, and tradeoff analysis of the manually generated quality scenarios, and that it significantly reduces the effort required for the task. Thus, we argue that the use of generative AI has the potential to partially automate and support architecture evaluation tasks by suggesting more qualitative scenarios to be evaluated and recommending the most suitable ones for a given context.
☆ RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems
AI-augmented ecosystems (interconnected systems where multiple AI components interact through shared data and infrastructure) are becoming the architectural norm for smart cities, autonomous fleets, and intelligent platforms. Yet the architecture documentation frameworks practitioners rely on, arc42 and the C4 model, were designed for deterministic software and cannot capture probabilistic behavior, data-dependent evolution, or dual ML/software lifecycles. This gap carries regulatory consequence: the EU AI Act (Regulation 2024/1689) mandates technical documentation through Annex IV that no existing framework provides structured support for, with enforcement for high-risk systems beginning August 2, 2026. We present RAD-AI, a backward-compatible extension framework that augments arc42 with eight AI-specific sections and C4 with three diagram extensions, complemented by a systematic EU AI Act Annex IV compliance mapping. A regulatory coverage assessment with six experienced software-architecture practitioners provides preliminary evidence that RAD-AI increases Annex IV addressability from approximately 36% to 93% (mean rating) and demonstrates substantial improvement over existing frameworks. Comparative analysis on two production AI platforms (Uber Michelangelo, Netflix Metaflow) captures eight additional AI-specific concerns missed by standard frameworks and demonstrates that documentation deficiencies are structural rather than domain-specific. An illustrative smart mobility ecosystem case study reveals ecosystem-level concerns, including cascading drift and differentiated compliance obligations, that are invisible under standard notation.
comment: Accepted at ANGE 2026, co-located with IEEE ICSA 2026. 8 pages
☆ SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability
Modern distributed systems integrate heterogeneous services (REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads) that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs 0.77 mean pass@1), while cost varies by over 30x across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.
comment: Accepted at SAGAI 2026, co-located with IEEE ICSA 2026. 8 pages
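The "structural diff" half of the hybrid detection step in the SAGAI-MID abstract above can be illustrated with a small sketch. This is not the paper's implementation: the function name `structural_diff`, the flat top-level comparison, and the example payloads are assumptions made purely for illustration, and the LLM semantic analysis is left out.

```python
def structural_diff(source: dict, target: dict) -> dict:
    """Compare the top-level fields of two example payloads (hypothetical helper)."""
    src_keys, tgt_keys = set(source), set(target)
    shared = src_keys & tgt_keys
    return {
        # Fields the target schema expects but the source does not provide.
        "missing_in_source": sorted(tgt_keys - src_keys),
        # Fields the source sends that the target schema does not know.
        "extra_in_source": sorted(src_keys - tgt_keys),
        # Fields present on both sides whose value types differ.
        "type_mismatch": sorted(
            k for k in shared if type(source[k]) is not type(target[k])
        ),
    }


# Two made-up versions of the same sensor reading.
v1 = {"id": 7, "name": "sensor-a", "temp": "21.5"}
v2 = {"id": 7, "label": "sensor-a", "temp": 21.5}
diff = structural_diff(v1, v2)
print(diff)
```

A purely structural check like this can flag that `name` and `label` differ, but it cannot tell that they mean the same thing, which is presumably where the semantic analysis described in the abstract comes in.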
☆ BitSov: A Composable Bitcoin-Native Architecture for Sovereign Internet Infrastructure
Today's internet concentrates identity, payments, communication, and content hosting under a small number of corporate intermediaries, creating single points of failure, enabling censorship, and extracting economic rent from participants. We present BitSov, an architectural framework for sovereign internet infrastructure that composes existing decentralized technologies (Bitcoin, Lightning Network, decentralized storage, federated messaging, and mesh connectivity) into a unified, eight-layer protocol stack anchored to Bitcoin's base layer. The framework introduces three architectural patterns: (1) payment-gated messaging, where every transmitted message requires cryptographic proof of a Bitcoin payment, deterring spam through economic incentives rather than moderation; (2) timechain-locked contracts, which anchor subscriptions and licenses to Bitcoin block height (the timechain) rather than calendar dates; and (3) a self-sustaining economic flywheel that converts service revenue into infrastructure growth. A dual settlement model supports both on-chain transactions for permanence and auditability and Lightning micropayments for high-frequency messaging. As a position paper, we analyze the quality attributes, discuss open challenges, and propose a research agenda for empirical validation.
comment: Accepted at BlockArch 2026, co-located with IEEE ICSA 2026. 4 pages
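To make the timechain-locked contract pattern from the BitSov abstract above concrete, here is a minimal sketch of a subscription whose validity is expressed in block heights rather than calendar dates. The class name, its fields, and the example heights are hypothetical; the only external fact used is Bitcoin's target of roughly one block every ten minutes (about 144 blocks per day).

```python
from dataclasses import dataclass

BLOCKS_PER_DAY = 144  # approximation: one block about every 10 minutes


@dataclass(frozen=True)
class TimechainSubscription:
    """Hypothetical subscription anchored to block height, not wall-clock time."""
    start_height: int
    duration_blocks: int

    @property
    def expiry_height(self) -> int:
        return self.start_height + self.duration_blocks

    def is_active(self, current_height: int) -> bool:
        # Active from the start block up to, but not including, the expiry block.
        return self.start_height <= current_height < self.expiry_height


# A nominal 30-day subscription starting at an illustrative block height.
sub = TimechainSubscription(start_height=900_000, duration_blocks=30 * BLOCKS_PER_DAY)
print(sub.expiry_height, sub.is_active(902_000))
```

Because block intervals vary, a height-based term only approximates a calendar duration; the design trades calendar precision for a clock that every participant can verify from the chain itself.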
☆ C2RustXW: Program-Structure-Aware C-to-Rust Translation via Program Analysis and LLM
The growing adoption of Rust for its memory safety and performance has increased the demand for effective migration of legacy C codebases. However, existing rule-based translators (e.g., C2Rust) often generate verbose, non-idiomatic code that preserves unsafe C semantics, limiting readability, maintainability, and practical adoption. Moreover, manual post-processing of such outputs is labor-intensive and rarely yields high-quality Rust code, posing a significant barrier to large-scale migration. To address these limitations, we present C2RustXW, a program-structure-aware C-to-Rust translation approach that integrates program analysis with Large Language Models (LLMs). C2RustXW extracts the multi-level program structure, including global symbols, function dependencies, and control- and data-flow information, and encodes these as structured textual representations injected into LLM prompts to guide translation and repair. Based on this design, C2RustXW performs dependency-aware translation and adopts a multi-stage repair pipeline that combines rule-based and structure-guided LLM-based techniques to ensure syntactic correctness. For semantic correctness, C2RustXW further integrates execution-based validation with structure-guided reasoning to localize and repair behavioral inconsistencies. Experimental results show that C2RustXW achieves 100% syntactic correctness on CodeNet and 97.78% on GitHub, while significantly reducing code size (up to 43.70%) and unsafe usage (to 5.75%). At the project level, C2RustXW achieves perfect syntactic correctness and an average semantic correctness of 78.87%, demonstrating its effectiveness for practical and scalable C-to-Rust migration.
☆ Enhancing User-Feedback Driven Requirements Prioritization
Context: Requirements prioritization is a challenging problem that aims to deliver the most suitable subset from a pool of candidate requirements. The problem is NP-hard when formulated as an optimization problem. Feedback from end users can offer valuable support for software evolution, and ReFeed represents a state-of-the-art approach to automatically inferring a requirement's priority via quantifiable properties of the feedback messages associated with a candidate requirement. Objectives: In this paper, we enhance ReFeed by shifting the focus of prioritization from treating requirements as independent entities toward interconnecting them. Additionally, we explore whether interconnecting requirements provides additional value for search-based solutions. Methods: We leverage user feedback from mobile app stores to group requirements into topically coherent clusters. Such interconnectedness, in turn, helps to auto-generate additional "requires" relations in candidate requirements. These "requires" pairs are then integrated into a search-based software engineering solution. Results: The experiments on 94 requirements prioritization instances from four real-world software applications show that our enhancement outperforms ReFeed. In addition, we illustrate how incorporating interconnectedness among requirements improves search-based solutions. Conclusion: Our findings show that requirements interconnectedness improves user-feedback-driven requirements prioritization, helps uncover additional "requires" relations in candidate requirements, and also strengthens search-based release planning.
comment: Submitted to Information and Software Technology
☆ BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations GECCO '26
Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. While an interactive feedback loop can improve performance, writing effective tests is a non-trivial task. Early multi-agent frameworks, such as AgentCoder, automated this process but relied on generated tests as absolute ground truth. This approach is fragile: incorrect code frequently passes faulty or trivial tests, while valid solutions are often degraded to satisfy incorrect assertions. Addressing this limitation, newer methods have largely abandoned test generation in favor of planning and reasoning based on examples. We argue, however, that generated tests remain a valuable signal if we model them as noisy sensors guided by Bayesian updates. To this end, we introduce BACE (Bayesian Anchored Co-Evolution), a framework that reformulates synthesis as a Bayesian co-evolutionary process where code and test populations are evolved, guided by belief distributions that are reciprocally updated based on noisy interaction evidence. By anchoring this search on minimal public examples, BACE prevents the co-evolutionary drift typical of self-validating loops. Extensive evaluations on LiveCodeBench v6 (post-March 2025) reveal that BACE achieves superior performance across both proprietary models and open-weight small language models.
comment: 10 Pages, 3 Figures. To appear in Proceedings of the 2026 Genetic and Evolutionary Computation Conference (GECCO '26), July 13-17, 2026, San Jose, Costa Rica
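The idea in the BACE abstract above of treating a generated test as a noisy sensor can be illustrated with a single application of Bayes' rule. The sketch below is not BACE's belief model: the function name and the two noise rates (`p_pass_if_correct`, `p_pass_if_incorrect`) are assumptions chosen only to show how a pass or fail shifts the belief that a candidate program is correct.

```python
def update_belief(prior: float, passed: bool,
                  p_pass_if_correct: float = 0.9,
                  p_pass_if_incorrect: float = 0.3) -> float:
    """Posterior probability that the code is correct after one noisy test result.

    The two likelihoods are illustrative assumptions, not values from the paper.
    """
    like_correct = p_pass_if_correct if passed else 1 - p_pass_if_correct
    like_incorrect = p_pass_if_incorrect if passed else 1 - p_pass_if_incorrect
    evidence = like_correct * prior + like_incorrect * (1 - prior)
    return like_correct * prior / evidence


# Start undecided, then observe two passes followed by one failure.
belief = 0.5
for outcome in [True, True, False]:
    belief = update_belief(belief, outcome)
print(belief)
```

With these assumed rates, one failing test lowers the belief without driving it to zero, which is the contrast with treating generated tests as absolute ground truth.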
☆ Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild
AI coding assistants are now widely used in software development. Software developers increasingly integrate AI-generated code into their codebases to improve productivity. Prior studies have shown that AI-generated code may contain code quality issues under controlled settings. However, we still know little about the real-world impact of AI-generated code on software quality and maintenance after it is introduced into production repositories. In other words, it remains unclear whether such issues are quickly fixed or persist and accumulate over time as technical debt. In this paper, we conduct a large-scale empirical study on the technical debt introduced by AI coding assistants in the wild. To achieve that, we built a dataset of 304,362 verified AI-authored commits from 6,275 GitHub repositories, covering five widely used AI coding assistants. For each commit, we run static analysis before and after the change to precisely attribute which code smells, bugs, and security issues the AI introduced. We then track each introduced issue from the introducing commit to the latest repository revision to study its lifecycle. Our results show that we identified 484,606 distinct issues, and that code smells are by far the most common type, accounting for 89.1% of all issues. We also find that more than 15% of commits from every AI coding assistant introduce at least one issue, although the rates vary across tools. More importantly, 24.2% of tracked AI-introduced issues still survive at the latest revision of the repository. These findings show that AI-generated code can introduce long-term maintenance costs into real software projects and highlight the need for stronger quality assurance in AI-assisted development.
☆ Detecting and Mitigating Flakiness in REST API Fuzzing
Test flakiness is a common problem in industry, which hinders the reliability of automated build and testing workflows. Most existing research on test flakiness has primarily focused on unit and small-scale integration tests. In contrast, flakiness in system-level testing, such as REST API testing, is comparatively under-explored. A large body of literature has been dedicated to the topic of fuzzing REST APIs, whereas relatively little attention has been paid to detecting and possibly mitigating negative effects of flakiness in this context. To fill this major gap, in this paper, we study the flakiness of tests generated by one of the most widely applied REST API fuzzers in the literature, namely EvoMaster, and conduct empirical studies with a corpus of 36 REST APIs to understand flakiness of REST APIs. Based on the results of the empirical studies, we categorize and analyze flakiness sources by inspecting nearly 3000 failing tests. Based on this understanding, we propose FlakyCatch to detect and mitigate flakiness in REST APIs and empirically evaluate its performance. Results show that FlakyCatch is effective in detecting and handling flakiness in tests generated by white-box and black-box fuzzers.
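The abstract above does not describe how FlakyCatch detects flakiness, so the following is only a sketch of the most basic rerun-based check, stated here as a generic baseline rather than the paper's technique. The function `classify_test` and the alternating stand-in test are hypothetical.

```python
import itertools
from typing import Callable


def classify_test(run_test: Callable[[], bool], reruns: int = 5) -> str:
    """Rerun a test several times and label it by the set of observed outcomes."""
    outcomes = {run_test() for _ in range(reruns)}
    if outcomes == {True}:
        return "stable-pass"
    if outcomes == {False}:
        return "stable-fail"
    return "flaky"  # both passing and failing runs were observed


# Stand-in for a generated API test whose result alternates between runs,
# for example because it depends on server-side state left by a prior call.
alternating = itertools.cycle([True, False])
print(classify_test(lambda: next(alternating)))
print(classify_test(lambda: True))
```

Rerunning only detects flakiness that shows up within the chosen number of runs; rarely triggered sources such as timing or clock-dependent responses would need more reruns or a different strategy.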
☆ Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code
LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen's $\kappa = 0.82$ and near-complete coverage (0.01% unclassifiable). We demonstrate the taxonomy's utility on two downstream applications: (1) a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves $F_1 = 0.923$ on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2) taxonomy-informed backward slicing reduces slice size by a mean of 15% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.
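The abstract above reports inter-annotator reliability as Cohen's kappa. For readers unfamiliar with the statistic, the sketch below computes it for two annotators using the standard definition, (observed agreement minus chance agreement) divided by (1 minus chance agreement). The label names and the four example annotations are made up and have nothing to do with the paper's data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


annotator_1 = ["preserved", "preserved", "blocked", "blocked"]
annotator_2 = ["preserved", "preserved", "blocked", "preserved"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5, which shows why raw agreement alone overstates reliability.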
☆ Voice-Controlled Scratch for Children with (Motor) Disabilities
Block-based programming environments like Scratch have become widely adopted in Computer Science Education, but the mouse-based drag-and-drop interface can challenge users with disabilities. While prior work has provided solutions supporting children with visual impairment, these solutions tend to focus on making content perceivable and do not address the physical interaction barriers faced by users with motor disabilities. To bridge this gap, we introduce MeowCrophone, an approach that uses voice control to allow editing code in Scratch. MeowCrophone supports clicking elements, placing blocks, and navigating the workspace via a multi-modal voice user interface that uses numerical overlays and label reading to bypass physical input entirely. As imperfect speech recognition is common in classrooms and for children with dysarthria, MeowCrophone employs a multi-stage matching pipeline using regular expressions, phonetic matching, and a custom grammar. Evaluation shows that while free speech recognition systems achieved a baseline success rate of only 46.4%, MeowCrophone's pipeline improved results to 82.8% overall, with simple commands reaching 96.9% accuracy. This demonstrates that robust voice control can make Scratch accessible to users for whom visual aids are insufficient.
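A multi-stage matching pipeline of the kind the MeowCrophone abstract above describes can be sketched as a chain of increasingly tolerant matchers. This is an illustration only: the command list and the regular expression are invented, and the final stage uses string similarity from Python's `difflib` as a stand-in for the phonetic matching and custom grammar mentioned in the abstract.

```python
import difflib
import re
from typing import Optional, Tuple

# Hypothetical command vocabulary, not taken from the paper.
COMMANDS = ["click", "place block", "scroll up", "scroll down"]
NUMBERED_CLICK = re.compile(r"^click (\d+)$")


def match_command(utterance: str) -> Optional[Tuple[str, Optional[int]]]:
    """Map a recognized utterance to (command, argument), or None if no match."""
    text = utterance.strip().lower()
    # Stage 1: a regular expression for commands that carry a numeric overlay id.
    numbered = NUMBERED_CLICK.match(text)
    if numbered:
        return ("click", int(numbered.group(1)))
    # Stage 2: exact match against the known vocabulary.
    if text in COMMANDS:
        return (text, None)
    # Stage 3: tolerant match for misrecognized speech (string similarity here,
    # standing in for phonetic matching).
    close = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=0.7)
    return (close[0], None) if close else None


print(match_command("Click 12"))
print(match_command("scrol down"))
print(match_command("banana"))
```

Ordering the stages from strict to tolerant keeps well-recognized commands unambiguous and only falls back to approximate matching when the stricter stages fail.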
☆ Compressing Code Context for LLM-based Issue Resolution
Large Language Models (LLMs) are now capable of resolving real-world GitHub issues. However, current approaches overapproximate the code context and suffer from two compounding problems: the prohibitive cost of processing massive inputs, and low effectiveness as noise floods the context window and distracts the model from the bug-fixing signal. Existing compression techniques fail to resolve this tension: generic compressors compromise the semantic integrity of code, while code-specific tools lack awareness of code structure and task context to preserve essential patch ingredients. To address this, we propose a novel framework consisting of two components. First, Oracle-guided Code Distillation (OCD), a context distillation algorithm that combines genetic search and delta debugging to systematically reduce code contexts to their minimal sufficient subsequence - retaining only the ingredients required for a successful fix. We use this distilled data to fine-tune SWEzze, a lightweight model that learns to compress code context at inference time, filtering noise and combating distraction while preserving fix ingredients. Evaluated on SWE-bench Verified across three frontier LLMs, SWEzze maintains a stable compression rate of about 6 times across models, reduces the total token budget by 51.8%-71.3% relative to the uncompressed setting, improves issue resolution rates by 5.0%-9.2%, and delivers the best overall balance among effectiveness, compression ratio, and latency compared with state-of-the-art context compression baselines.
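The abstract above says OCD combines genetic search with delta debugging to shrink a code context to a minimal sufficient subsequence. The sketch below shows only a simplified, complement-removal variant of delta debugging and omits the genetic search entirely; the oracle here is a stand-in predicate, whereas the abstract's notion of sufficiency is a successful fix. All names are hypothetical.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[str],
          still_sufficient: Callable[[List[str]], bool]) -> List[str]:
    """Shrink items while the oracle holds; assumes it holds for the full input."""
    current = list(items)
    granularity = 2  # number of chunks the current list is split into
    while len(current) >= 2:
        chunk_size = -(-len(current) // granularity)  # ceiling division
        reduced = False
        for start in range(0, len(current), chunk_size):
            # Try dropping one chunk and keeping everything else.
            candidate = current[:start] + current[start + chunk_size:]
            if still_sufficient(candidate):
                current = candidate
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(current):
                break  # no single element can be removed any more
            granularity = min(granularity * 2, len(current))
    return current


context = ["import os", "def helper(): ...", "def buggy(): ...",
           "class Unused: ...", "def caller(): buggy()"]
needed = {"def buggy(): ...", "def caller(): buggy()"}
minimal = ddmin(context, lambda kept: needed <= set(kept))
print(minimal)
```

The result is 1-minimal with respect to the oracle: removing any single remaining snippet makes the predicate fail. In a real setting each oracle call would be expensive, which is presumably one reason the paper pairs this kind of reduction with a search strategy.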
☆ DAInfer+: Neurosymbolic Inference of API Specifications from Documentation via Embedding Models
Modern software systems heavily rely on various libraries, which require understanding the API semantics in static analysis. However, summarizing API semantics remains challenging due to complex implementations or unavailable library code. This paper presents DAInfer+, a novel approach for inferring API specifications from library documentation. We employ Natural Language Processing (NLP) to interpret informal semantic information provided by the documentation, which enables us to reduce the specification inference to an optimization problem. Specifically, we investigate the effectiveness of sentence embedding models and Large Language Models (LLMs) in deriving memory operation abstractions from API descriptions. These abstractions are used to retrieve data-flow and aliasing relations to generate comprehensive API specifications. To solve the optimization problem efficiently, we propose neurosymbolic optimization, yielding precise data-flow and aliasing specifications. Our evaluation of popular Java libraries shows that zero-shot sentence embedding models outperform few-shot prompted LLMs in robustness, capturing fine-grained semantic nuances more effectively. While our initial attempts using two-stage LLM prompting yielded promising results, we found that the embedding-based approach proved superior. Specifically, these models achieve over 82% recall and 85% precision for data-flow inference and 88% recall and 79% precision for alias relations, all within seconds. These results demonstrate the practical value of DAInfer+ in library-aware static analysis.
comment: 35 pages, Under submission to ACM Transactions on Software Engineering and Methodology (TOSEM)
☆ Superset Decompilation
Reverse engineering tools remain monolithic and imperative compared to the advancement of modern compiler architectures: analyses are tied to a single mutable representation, making them difficult to extend or refine, and forcing premature choices between soundness and precision. We observe that decompilation is the reverse of compilation and can be structured as a sequence of modular passes, each performing a granular and clearly defined interpretation of the binary at a progressively higher level of abstraction. We formalize this as provenance-guided superset decompilation (PGSD), a framework that monotonically derives facts about the binary into a relation store. Instead of committing early to a single interpretation, the pipeline retains ambiguous interpretations as parallel candidates with provenance, deferring resolution until the final selection phase. Manifold implements PGSD as a declarative reverse engineering framework that lifts Linux ELF binaries to C99 through a granular intermediate representation in ~35K lines of Rust and Datalog. On GNU coreutils, Manifold's output quality matches Ghidra, IDA Pro, angr, and RetDec on multiple metrics while producing fewer compiler errors, and generalizes across compilers and optimization levels.
♻ ☆ LLMON: An LLM-native Markup Language to Leverage Structure and Semantics at the LLM Interface
Textual Large Language Models (LLMs) provide a simple and familiar interface: a string of text is used for both input and output. However, the information conveyed to an LLM often has a richer structure and semantics, which is not conveyed in a string. For example, most prompts contain both instructions ("Summarize this paper into a paragraph") and data (the paper to summarize), but these are usually not distinguished when passed to the model. This can lead to model confusion and security risks, such as prompt injection attacks. This work addresses this shortcoming by introducing an LLM-native mark-up language, LLMON (LLM Object Notation, pronounced "Lemon"), that enables the structure and semantic metadata of the text to be communicated in a natural way to an LLM. This information can then be used during model training, model prompting, and inference implementation, leading to improvements in model accuracy, safety, and security. This is analogous to how programming language types can be used for many purposes, such as static checking, code generation, dynamic checking, and IDE highlighting. We discuss the general design requirements of an LLM-native markup language, introduce the LLMON markup language and show how it meets these design requirements, describe how the information contained in a LLMON artifact can benefit model training and inference implementation, and provide some preliminary empirical evidence of its value for both of these use cases. We also discuss broader issues and research opportunities that are enabled with an LLM-native approach.
comment: 28 pages
♻ ☆ The Necessity of a Holistic Safety Evaluation Framework for AI-Based Automation Features
The intersection of Safety of Intended Functionality (SOTIF) and Functional Safety (FuSa) analysis of driving automation features has traditionally excluded Quality Management (QM) components (components that have no ASIL requirements allocated from vehicle-level HARA) from rigorous safety impact evaluations. While QM components are not typically classified as safety-relevant, recent developments in artificial intelligence (AI) integration reveal that such components can contribute to SOTIF-related hazardous risks. Compliance with emerging AI safety standards, such as ISO/PAS 8800, necessitates re-evaluating safety considerations for these components. This paper examines the necessity of conducting holistic safety analysis and risk assessment on AI components, emphasizing their potential to introduce hazards with the capacity to violate risk acceptance criteria when deployed in safety-critical driving systems, particularly in perception algorithms. Using case studies, we demonstrate how deficiencies in AI-driven perception systems can emerge even in QM-classified components, leading to unintended functional behaviors with critical safety implications. By bridging theoretical analysis with practical examples, this paper argues for the adoption of comprehensive FuSa, SOTIF, and AI standards-driven methodologies to identify and mitigate risks in AI components. The findings demonstrate the importance of revising existing safety frameworks to address the evolving challenges posed by AI, ensuring comprehensive safety assurance across all component classifications spanning multiple safety standards.
♻ ☆ Past, Present, and Future of Bug Tracking in the Generative AI Era
Traditional bug-tracking systems rely heavily on manual reporting, reproduction, classification, and resolution, involving multiple stakeholders such as end users, customer support, developers, and testers. This division of responsibilities requires substantial coordination and human effort, widens the communication gap between non-technical users and developers, and significantly slows the process from bug discovery to deployment. Moreover, current solutions are highly asynchronous, often leaving users waiting long periods before receiving any feedback. In this paper, we examine the evolution of bug-tracking practices, from early paper-based methods to today's web-based platforms, and present a forward-looking vision of an AI-powered bug tracking framework. The framework augments existing systems with large language model (LLM) and agent-driven automation, and we report early adaptations of its key components, providing initial empirical grounding for its feasibility. The proposed framework aims to reduce time to resolution and coordination overhead by enabling end users to report bugs in natural language while AI agents refine reports, attempt reproduction, classify bugs, validate reports, suggest no-code fixes, generate patches, and support continuous integration and deployment. We discuss the challenges and opportunities of integrating LLMs into bug tracking and show how intelligent automation can transform software maintenance into a more efficient, collaborative, and user-centric process.
comment: Accepted to ACM TOSEM Special Issue: 2030 Software Engineering Roadmap
♻ ☆ Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors
Complexity in engineered systems presents one of the most persistent challenges in modern development because it drives cost overruns, schedule delays, and outright project failures. Yet while architectural complexity has been studied, the structural complexity embedded within requirements specifications remains poorly understood and inadequately quantified. This gap is consequential: requirements fundamentally drive system design, and complexity introduced at this stage propagates through architecture, implementation, and integration. To address this gap, we build on Natural Language Processing methods that extract structural networks from textual requirements. Using these extracted structures, we conduct a controlled experiment employing molecular integration tasks as structurally isomorphic proxies for requirements integration -- leveraging the topological equivalence between molecular graphs and requirement networks while eliminating confounding factors such as domain expertise and semantic ambiguity. Our results demonstrate that spectral measures predict integration effort with correlations exceeding 0.95, while structural metrics achieve correlations above 0.89. Notably, density-based metrics show no significant predictive validity. These findings indicate that eigenvalue-derived measures capture cognitive and effort dimensions that simpler connectivity metrics cannot. As a result, this research bridges a critical methodological gap between architectural complexity analysis and requirements engineering practice, providing a validated foundation for applying these metrics to requirements engineering, where similar structural complexity patterns may predict integration effort.
comment: 36 pages, 4 figures, 5 tables
♻ ☆ Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach
Software systems increasingly include AI components based on deep learning (DL). Reliable testing of such systems requires near-perfect test-input validity and label accuracy, with minimal human effort. Yet, the DL community has largely overlooked the need to build highly accurate datasets with minimal effort, since DL training is generally tolerant of labelling errors. This challenge, instead, reflects concerns more familiar to software engineering, where a central goal is to construct high-accuracy test inputs, with accuracy as close to 100% as possible, while keeping associated costs in check. In this article we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. To evaluate OPAL we instantiate it for two tasks in the context of testing vision systems: automatic labelling of test inputs and automated validation of test inputs. Our evaluation, based on more than 2500 experiments performed on nine datasets, comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8%, while cutting manual labelling by more than half. OPAL significantly outperforms automated labelling baselines in labelling accuracy across all nine datasets, when all methods are provided with the same manual-labelling budget. For automated test-input validation, on average, OPAL reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the SOTA test-input validation baselines. Finally, we show that augmenting OPAL with an active-learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
comment: Accepted in the Empirical Software Engineering (EMSE) Journal (2026)
♻ ☆ How Low Can You Go? The Data-Light SE Challenge
Much of Software Engineering (SE) research assumes that progress depends on massive datasets and CPU-intensive optimizers. Yet has this assumption been rigorously tested? The counter-evidence presented in this paper suggests otherwise. For over 100 optimization tasks from recent SE papers (including software configuration, performance tuning, product line engineering, project health forecasting, defect prediction, software testing, software process and cost estimation, and cross-domain generalization datasets), even with just a few dozen labels, very simple methods (e.g., diversity sampling, a minimal Bayesian learner, its distance-based non-parametric variant, or random probes) achieve over 90% of the best reported results. Furthermore, these simple methods perform just as well as more complex state-of-the-art optimizers such as SMAC, TPE, and DEHB. While some tasks would require better outcomes and more sampling, the results seen after a few dozen samples would suffice for many engineering needs (particularly when the goal is rapid and cost-efficient guidance rather than slow and exhaustive optimization). To say that another way, at least some SE tasks are better served by lightweight approaches that demand fewer labels and far less computation. We hence propose the data-light challenge: when will a handful of labels suffice for SE tasks? To enable a large-scale investigation of this issue, we contribute (1) a mathematical formalization of labeling, (2) lightweight baseline algorithms, and (3) results on public-domain data showing the conditions under which lightweight methods excel or fail. For the purposes of open science, our scripts and data are online at https://github.com/KKGanguly/NEO .
comment: To be published in FSE 2026
♻ ☆ An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation
Large Language Models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces that facilitate complex, multi-turn collaborative programming. While LLMs exhibit remarkable proficiency in generating standalone code snippets, they often struggle to maintain contextual consistency during extended interactions, creating significant obstacles in the collaboration process. Existing benchmarks primarily emphasize the functional correctness of the final output, overlooking latent quality issues within the interaction process itself, which we term Interaction Smells. In this paper, we conduct an empirical study on sampled real-world user-LLM interactions from the WildChat and LMSYS-Chat-1M datasets to systematically investigate Interaction Smells in human-LLM code generation tasks from the perspectives of phenomena, distribution, and mitigation. First, we establish the first taxonomy of Interaction Smells by manually performing open card sorting on real-world interaction logs. This taxonomy categorizes Interaction Smells into three primary categories, i.e., User Intent Quality, Historical Instruction Compliance, and Historical Response Violation, comprising nine specific subcategories. Next, we quantitatively evaluate six mainstream LLMs (i.e., GPT-4o, DeepSeek-Chat, Gemini 2.5, Qwen2.5-32B, Qwen2.5-72B, and Qwen3-235B-a22b) to analyze the distribution of Interaction Smells across different models. Finally, we propose Invariant-aware Constraint Evolution (InCE), a multi-agent framework designed to improve multi-turn interaction quality through explicit extraction of global invariants and pre-generation quality audits. Experimental results on the extended WildBench benchmark demonstrate that this lightweight mitigation approach significantly improves the Task Success Rate and effectively suppresses the occurrence of Interaction Smells.
♻ ☆ Code Review Agent Benchmark
Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing and automatically generate huge volumes of code, the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases, the issue of code review, and quality assurance more broadly, becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset, called c-CRAB (pronounced see-crab), can evaluate agents for code review tasks. Specifically, given a pull request (which could come from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agent. Our evaluation framework is used to evaluate the state of the art today: the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews: given a human review of a pull request instance, we generate corresponding tests to evaluate the reviews generated by code review agents. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap through future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews, indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not least, the agent-generated tests from our dataset act as a held-out test suite and hence a quality gate for agent-generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents remains to be investigated.
♻ ☆ LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study
Multidisciplinary Software Development (MSD) requires domain experts and developers to collaborate across incompatible formalisms and separate artifact sets. Today, even with AI coding assistants like GitHub Copilot, this process remains inefficient; individual coding tasks are semi-automated, but the workflow connecting domain knowledge to implementation is not. Developers and experts still lack a shared view, resulting in repeated coordination, clarification rounds, and error-prone handoffs. We address this gap through a graph-based workflow optimization approach that progressively replaces manual coordination with LLM-powered services, enabling incremental adoption without disrupting established practices. We evaluate our approach on \texttt{spapi}, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains. The automated workflow achieves 93.7\% F1 score while reducing per-API development time from approximately 5 hours to under 7 minutes, saving an estimated 979 engineering hours. In production, the system received high satisfaction from both domain experts and developers, with all participants reporting full satisfaction with communication efficiency.
comment: Accepted to FSE 2026 Industrial Track
♻ ☆ Declarative Scenario-based Testing with RoadLogic SC
Scenario-based testing is a key method for cost-effective and safe validation of autonomous vehicles (AVs). Existing approaches rely on imperative scenario definitions, requiring developers to manually enumerate numerous variants to achieve coverage. Declarative languages, such as ASAM OpenSCENARIO DSL (OS2), raise the abstraction level but lack systematic methods for instantiating concrete and specification-compliant scenarios. To our knowledge, no open-source solution currently provides this capability. We present RoadLogic, which bridges declarative OS2 specifications and executable simulations. It uses Answer Set Programming to generate abstract plans satisfying scenario constraints, motion planning to refine the plans into feasible trajectories, and specification-based monitoring to verify correctness. We evaluate RoadLogic on instantiating representative OS2 scenarios executed in the CommonRoad framework. Results show that RoadLogic consistently produces realistic, specification-satisfying simulations within minutes and captures diverse behavioral variants through parameter sampling, thus opening the door to systematic scenario-based testing for autonomous driving systems.
comment: Accepted at the 29th ACM International Conference on Hybrid Systems: Computation and Control (HSCC 2026). The final version will appear in the ACM Digital Library
♻ ☆ Evolving with AI: A Longitudinal Analysis of Developer Logs ICSE'26
AI-powered coding assistants are rapidly becoming fixtures in professional IDEs, yet their sustained influence on everyday development remains poorly understood. Prior research has focused on short-term use or self-reported perceptions, leaving open questions about how sustained AI use reshapes actual daily coding practices in the long term. We address this gap with a mixed-method study of AI adoption in IDEs, combining longitudinal two-year fine-grained telemetry from 800 developers with a survey of 62 professionals. We analyze five dimensions of workflow change: productivity, code quality, code editing, code reuse, and context switching. Telemetry reveals that AI users produce substantially more code but also delete significantly more. Meanwhile, survey respondents report productivity gains and perceive minimal changes in other dimensions. Our results offer empirical insights into the silent restructuring of software workflows and provide implications for designing future AI-augmented tooling.
comment: Accepted to ICSE'26 Research track. 12 pages, 5 figures, 1 table
♻ ☆ Software Supply Chain Smells: Lightweight Analysis for Secure Dependency Management
Modern software systems heavily rely on third-party dependencies, making software supply chain security a critical concern. We introduce the concept of software supply chain smells as structural indicators that signal potential security risks. We design and evaluate Dirty-Waters, a novel tool for detecting such smells in the supply chains of software packages. Through interviews with practitioners, we show that our proposed smells align with real-world concerns and capture signals considered valuable. A quantitative study of popular packages in the Maven and NPM ecosystems reveals that while smells are prevalent in both, they differ significantly across ecosystems, with traceability and signing issues dominating in Maven and most smells being rare in NPM, due to strong registry-level guarantees. Software supply chain smells support developers and organizations in making informed decisions and improving their software supply chain security posture.
♻ ☆ Toward Functional and Non-Functional Evaluation of Application-Level Code Generation
Large language models (LLMs) have achieved strong performance on code generation. However, most prior evaluations focus on snippet-level outputs, such as function generation or repository completion. These settings do not fully evaluate application-level code generation, where the goal is to produce a runnable repository with coherent multi-file structure, dependency support, and end-to-end executability. In addition, real-world software quality depends not only on functional correctness but also on non-functional quality attributes, such as maintainability and security. In this paper, we present RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, RAL-Bench derives a concise natural-language requirement from a high-quality reference project, constructs black-box system tests for both functional correctness and non-functional quality attributes, and retains only the candidate tests that pass on the reference repository. Under this unified evaluation protocol, functional correctness is measured by the system test pass rate, while non-functional quality is evaluated along five ISO/IEC 25010-inspired dimensions, with per-dimension diagnostics and reference-normalized scoring. We evaluate 16 frontier LLMs under a controlled zero-shot setting with greedy decoding. The results show that functional correctness remains the primary bottleneck in application-level code generation, while non-functional quality also remains challenging. Under our evaluation protocol, no model exceeds a 45\% functional score. These findings suggest that strong performance on existing code generation benchmarks does not yet translate to strong performance on application-level repository generation. This result highlights the need for evaluation settings that directly assess end-to-end repository generation rather than relying only on snippet-level success.
♻ ☆ Toward Live Noise Fingerprinting in Quantum Software Engineering
Contemporary quantum computers are inherently noisy, posing significant challenges for the development and testing of quantum software. Simplified or outdated noise assumptions can lead to incorrect assessments of program correctness, obscure debugging, and hinder cross-platform portability, creating a critical quantum software development gap. Providing accurate, practical noise characterisation is challenging as traditional reconstruction methods scale exponentially and rapidly become outdated. In this vision paper, we address this gap via a novel classical shadow tomography-based pipeline, SIMSHADOW, enabling efficient, continuously updatable noise fingerprinting from empirical observations, suitable for integration into quantum software development workflows, including testing and validation. We prototyped the pipeline to investigate fingerprints' ability to capture structured, interpretable noise and cross-platform discrepancies affecting quantum programs' behaviour to support realistic testing and debugging in future tools. Our evaluation with Qiskit and Cirq under widely used hardware-informed profiles, IBM Boston and Quantinuum H2, shows that fingerprints exhibit channel-specific structure and yield interpretable heatmaps. We observed systematic cross-platform discrepancies under matched noise configurations, quantified by large Frobenius distances at a fraction of full tomography cost. On 69 MQTBENCH programs, larger fingerprint differences correlate with divergences in output distributions, highlighting threats for testing and cross-platform debugging tasks.
♻ ☆ Unlocking LLM Repair Capabilities Through Cross-Language Translation and Multi-Agent Refinement
Recent advances in leveraging LLMs for automated program repair (APR) have demonstrated impressive capabilities in fixing software defects. However, current LLM-based approaches predominantly focus on mainstream programming languages like Java and Python, neglecting less prevalent but emerging languages such as Rust due to expensive training resources, limited datasets, and insufficient community support. This narrow focus creates a significant gap in repair capabilities across the programming language spectrum, where the full potential of LLMs for comprehensive multilingual program repair remains largely unexplored. To address this limitation, we introduce LANTERN, a novel cross-language program repair approach that leverages LLMs' differential proficiency across languages through a multi-agent iterative repair paradigm. Our technique strategically translates defective code from languages where LLMs exhibit weaker repair capabilities to languages where they demonstrate stronger performance, without requiring additional training. A key innovation of our approach is an LLM-based decision-making system that dynamically selects optimal target languages based on bug characteristics and continuously incorporates feedback from previous repair attempts. We evaluate our method on xCodeEval, a comprehensive multilingual benchmark comprising 5,068 bugs across 11 programming languages. Results demonstrate significant enhancement in repair effectiveness, particularly for underrepresented languages, with Rust showing a 22.09% improvement in Pass@10 metrics. Our research provides the first empirical evidence that cross-language translation significantly expands the repair capabilities of LLMs and effectively bridges the performance gap between programming languages with different levels of popularity, opening new avenues for truly language-agnostic automated program repair.
♻ ☆ On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents
AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect the operational efficiency of the agents. In this paper, we study the impact of AGENTS.md files on the runtime and token consumption of AI coding agents operating on GitHub pull requests. We analyze 10 repositories and 124 pull requests, executing agents under two conditions: with and without an AGENTS.md file. We measure wall-clock execution time and token usage during agent execution. Our results show that the presence of AGENTS.md is associated with a lower median runtime ($Δ28.64$%) and reduced output token consumption ($Δ16.58$%), while maintaining comparable task completion behavior. Based on these results, we discuss immediate implications for the configuration and deployment of AI coding agents in practice, and outline a broader research agenda on the role of repository-level instructions in shaping the behavior, efficiency, and integration of AI coding agents in software development workflows.
comment: 5 pages, 1 figure, 1 table
♻ ☆ Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation
Large language models (LLMs) can generate executable code from natural language descriptions, but the resulting programs frequently contain bugs due to hallucinations. In the absence of formal specifications, existing approaches attempt to assess correctness using LLM-generated proxies such as tests or auto-formalized specifications. However, these proxies are produced by the same imperfect models and thus often corroborate rather than catch errors, especially when the model exhibits correlated errors. We introduce semantic triangulation, a theory-grounded framework that decorrelates model errors by transforming the original problem into a dissociative variant - one likely requiring a fundamentally different algorithm - and checks consistency between independently sampled solutions to both problems. We identify theoretical requirements for this framework, and we prove that under a formal model of LLM hallucinations, these properties confer higher confidence in program correctness. We instantiate the framework through four concrete triangulation methods based on problem inversion, decomposition, and solution enumeration. Evaluated on LiveCodeBench and CodeElo across GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash, our tool increases the probability of selecting a correct program by 24% over baselines (test generation, metamorphic testing, and auto-formalized specifications) and achieves 26% higher F1 score in selection-or-abstention scenarios, while being the only method that consistently handles inexact problems admitting multiple valid solutions.
♻ ☆ PCREQ: Automated Inference of Compatible Requirements for Python Third-party Library Upgrades
Python third-party libraries (TPLs) are essential in modern software development, but upgrades often cause compatibility issues, leading to system failures. These issues fall into two categories: version compatibility issues (VCIs) and code compatibility issues (CCIs). Existing tools mainly detect dependency conflicts but overlook code-level incompatibilities, with no solution fully automating the inference of compatible versions for both VCIs and CCIs. To fill this gap, we propose PCREQ, the first approach to automatically infer compatible requirements by combining version and code compatibility analysis. PCREQ integrates six modules: knowledge acquisition, version compatibility assessment, invoked APIs and modules extraction, code compatibility assessment, version change, and missing TPL completion. PCREQ collects candidate versions, checks for conflicts, identifies API usage, evaluates code compatibility, and iteratively adjusts versions to generate a compatible requirements.txt with a detailed repair report. To evaluate PCREQ, we construct REQBench, a real-world benchmark with 2,095 upgrade scenarios derived from 34 real-world scientific/ML Python projects. Results show PCREQ achieves a 94.03% inference success rate, outperforming PyEGo (37.02%), ReadPyE (37.16%), and LLM-based approaches (GPT-4o, DeepSeek V3/R1) by 18--22%. PCREQ processes each scenario from REQBench in 60.79 s on average, demonstrating practical efficiency. PCREQ reduces manual effort in troubleshooting upgrades, advancing Python dependency maintenance automation.
comment: Accepted by ACM Transactions on Software Engineering and Methodology
Human-Computer Interaction 45
☆ Uncovering Relationships between Android Developers, User Privacy, and Developer Willingness to Reduce Fingerprinting Risks
The major mobile platforms, Android and iOS, have introduced changes that restrict user tracking to improve user privacy, yet apps continue to covertly track users via device fingerprinting. We study the opportunity to improve this dynamic with a case study on mobile fingerprinting that evaluates developers' perceptions of how well platforms protect user privacy and how developers perceive platform privacy interventions. Specifically, we study developers' willingness to make changes to protect users from fingerprinting and how developers consider trade-offs between user privacy and developer effort. We do this via a survey of 246 Android developers, presented with a hypothetical Android change that protects users from fingerprinting at the cost of additional developer effort. We find developers overwhelmingly (89%) support this change, even when they anticipate significant effort, yet prefer the change be optional versus required. Surprisingly, developers who use fingerprinting are six times more likely to support the change, despite being most impacted by it. We also find developers are most concerned about compliance and enforcement. In addition, our results show that while most rank iOS above Android for protecting user privacy, this distinction significantly reduces among developers very familiar with fingerprinting. Thus there is an important opportunity for platforms and developers to collaboratively build privacy protections, and we present actionable ways platforms can facilitate this.
☆ Active Inference with People: a general approach to real-time adaptive experiments
Adaptive experiments automatically optimize their design throughout the data collection process, which can bring substantial benefits compared to conventional experimental settings. Potential applications include, among others: computerized adaptive testing (for selecting informative tasks in ability measurements), adaptive treatment assignment (when searching for experimental conditions that maximize certain outcomes), and active learning (for choosing optimal training data for machine learning algorithms). However, implementing these techniques in real time poses substantial computational and technical challenges. Additionally, despite their conceptual similarity, the above scenarios are often treated as separate problems with distinct solutions. In this paper, we introduce a practical and unified approach to real-time adaptive experiments that can encompass all of the above scenarios, regardless of the modality of the task (including textual, visual, and audio inputs). Our strategy combines active inference, a Bayesian framework inspired by cognitive neuroscience, with PsyNet, a platform for large-scale online behavioral experiments. While active inference provides a compact, flexible, and principled mathematical framework for adaptive experiments generally, PsyNet is a highly modular Python package that supports social and behavioral experiments with stimuli and responses in arbitrary domains. We illustrate this approach through two concrete examples: (1) an adaptive testing experiment estimating participants' ability by selecting optimal challenges, effectively reducing the number of trials required by 30--40\%; and (2) an adaptive treatment assignment strategy that identifies the optimal treatment up to three times as accurately as a fixed design in our example. We provide detailed instructions to facilitate the adoption of these techniques.
☆ AI prediction leads people to forgo guaranteed rewards
Artificial intelligence (AI) is understood to affect the content of people's decisions. Here, using a behavioral implementation of the classic Newcomb's paradox in 1,305 participants, we show that AI can also change how people decide. In this paradigm, belief in predictive authority can lead individuals to constrain decision-making, forgoing a guaranteed reward. Over 40% of participants treated AI as such a predictive authority. This significantly increased the odds of forgoing the guaranteed reward by a factor of 3.39 (95% CI: 2.45-4.70) compared with random framing, and reduced earnings by 10.7-42.9%. The effect appeared across AI presentations and decision contexts and persisted even when predictions failed. When people believe AI can predict their behavior, they may self-constrain it in anticipation of that prediction.
☆ The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline -- Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA -- to produce structurally validated item pools entirely *in silico*. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the `AIGENIE` function, and the `GENIE` function. Two running examples illustrate the package's use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the `GENIE()` function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The `AIGENIE` package is freely available on R-universe at https://laralee.r-universe.dev/AIGENIE.
comment: 38 pages, 8 Figures, 3 tables
☆ One stout to rule them all: Reconciling artificial intelligence, data science and malted alcoholic beverages
Beer is a phenomenal beverage. It has shaped the history of many peoples, states and cultures. The beauty of beer is its versatility. Starting from the original implementations that were murky or diluted, over time researchers found novel approaches to gradually develop beverages that are diverse, intense and pleasant for the end user. Recently, the industry came up with the so-called \textit{craft beers}, which often differ from commercial beers in production volume (due to the lower capacities of craft beer producers) and tasting profile (often having more intense, unusual flavours). However, while it is often relatively easy to judge if a particular commercial beer is likely to be enjoyable, the same cannot be said about craft beers, as there are far too many styles, implementations and ingredients involved in their production. This creates a gap between the beverage producers and the consumers due to the inability of the former to judge the preferences and the consumption trends of the latter. As a response to this challenge we present a novel collaborative beverage-related data collection and analysis framework - the Distributed Beverage Analysis (DBA). The idea behind this study is to identify the common trends and support them with empirical evidence to better understand the needs of the consumers. We empirically verify DBA at the biannual \textit{Kraft Bier Fest} conducted by Vienna Kraft brewery in (you guessed it) Vienna. To showcase the need for this kind of analysis, we evaluate various large language models (LLMs) against our collaborative framework and confirm that many AI models cannot be reliably used to reason over the trends and patterns in the evolving world of craft beer.
☆ Moving Beyond Review: Applying Language Models to Planning and Translation in Reflection
Reflective writing is known to support the development of students' metacognitive skills, yet learners often struggle to engage in deep reflection, limiting learning gains. Although large language models (LLMs) have been shown to improve writing skills, their use as conversational agents for reflective writing has produced mixed results and has largely focused on providing feedback on reflective texts, rather than support during planning and organizing. In this paper, inspired by the Cognitive Process Theory of writing (CPT), we propose the first application of LLMs to the planning and translation steps of reflective writing. We introduce Pensée, a tool to explore the effects of explicit AI support during these stages by scaffolding structured reflection planning using a conversational agent, and supporting translation by automatically extracting key concepts. We evaluate Pensée in a controlled between-subjects experiment (N=93), manipulating AI support across writing phases. Results show significantly greater reflection depth and structural quality when learners receive support during planning and translation stages of CPT, though these effects reduce in a delayed post-test. Analyses of learner behavior and perceptions further illustrate how CPT-aligned conversational support shapes reflection processes and learner experience, contributing empirical evidence for theory-driven uses of LLMs in AI-supported reflective writing.
comment: Accepted at AIED 2026
☆ Multimodal Analytics of Cybersecurity Crisis Preparation Exercises: What Predicts Success?
Instructional alignment, the match between intended cognition and enacted activity, is central to effective instruction but hard to operationalize at scale. We examine alignment in cybersecurity simulations using multimodal traces from 23 teams (76 students) across five exercise sessions. Study 1 codes objectives and team emails with Bloom's taxonomy and models the completion of key exercise tasks with generalized linear mixed models. Alignment, defined as the discrepancy between required and enacted Bloom levels, predicts success, whereas the Bloom category alone does not predict success once discrepancy is considered. Study 2 compares predictive feature families using grouped cross-validation and l1-regularized logistic regression. Text embeddings and log features outperform Bloom-only models (AUC~0.74 and 0.71 vs. 0.55), and their combination performs best (Test AUC~0.80), with Bloom frequencies adding little. Overall, the work offers a measure of alignment for simulations and shows that multimodal traces best forecast performance, while alignment provides interpretable diagnostic insight.
comment: Accepted as full paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
☆ Within the MDT Room: Situated in Multidisciplinary Team-Grounded Agent Debate for Clinical Diagnosis
Rare disease diagnosis is inherently challenging due to heterogeneous symptoms, limited clinical familiarity, and fragmented evidence across specialties. Recent large language model (LLM)-based agentic systems have shown promise by simulating multidisciplinary team discussions to generate and evaluate diagnostic hypotheses. However, fully automated diagnosis remains unrealistic, and existing human-in-the-loop approaches provide limited support for effective clinician-agent collaboration. In practice, clinicians are often presented with final diagnostic outputs and lengthy, unstructured agent discussion logs, making it difficult to inspect reasoning, intervene in a timely manner, or guide agent deliberation effectively. To address these challenges, we developed MDTRoom, an interactive system that transforms multi-agent discussions from linear transcripts into a structured, inspectable workspace. The system externalizes patient data, evidence provenance, hypothesis evolution, and inter-agent conflicts as interconnected visual objects, enabling clinicians to efficiently examine, intervene in, and guide agent reasoning. Our evaluation demonstrates the effectiveness of MDTRoom in supporting clinician-agent collaboration.
☆ Proposing a Game Theory Approach to Explore Group Dynamics with Social Robot
Integrating social robots into our group-based society requires, beyond the technical challenges, considering social group dynamics. Following the results from preliminary exploratory studies on the influence of social robots on group decisions, the proposed research investigates whether social robots can foster cooperation among group members. To achieve this, I propose a game theory approach, employing the Public Goods Game to recreate a simplified and controlled social situation where the robot's influence can be evaluated. Clarifying the role of robots in promoting collaboration among humans might have a significant impact in educational environments, enhancing student learning, as well as in workplace settings, where they could facilitate problem-solving and lead to shared solutions.
comment: Honorable Mention at HRI Pioneers 2025. Peer-reviewed. https://hripioneers.org/archives/hri25/participants/
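For readers unfamiliar with the game named in the abstract above, the following is a minimal sketch of the standard linear public goods payoff rule: each player keeps whatever they do not contribute, and the multiplied pool is split equally. The endowment, multiplier, and function name are illustrative assumptions, not parameters from the proposed study.

```python
def public_goods_payoffs(contributions, endowment=20.0, multiplier=2.0):
    """Payoffs for one round of a linear public goods game.

    Each player i receives: endowment - c_i + multiplier * sum(c) / n.
    The default endowment and multiplier are illustrative assumptions.
    """
    n = len(contributions)
    if n == 0:
        raise ValueError("need at least one player")
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must lie in [0, endowment]")
    share = multiplier * sum(contributions) / n  # equal share of the pool
    return [endowment - c + share for c in contributions]


if __name__ == "__main__":
    # Three full cooperators and one free rider.
    print(public_goods_payoffs([20.0, 20.0, 20.0, 0.0]))
```

With a multiplier between 1 and the group size, the free rider earns more than each cooperator in that round, yet everyone earns more under full cooperation than under none. That tension is what makes the game a controlled probe of cooperation.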
☆ Animated Public Furniture as an Interaction Mediator: Engaging Passersby In-the-Wild with Robotic Benches
Urban HCI investigates how digital technologies shape human behaviour within the social, spatial, and temporal dynamics of public space. Meanwhile, robotic furniture research demonstrates how the purposeful animation of mundane utilitarian elements can influence human behaviour in everyday contexts. Taken together, these strands highlight an untapped opportunity to investigate how animated public furniture could mediate social interaction in urban environments. In this paper, we present the design process and in-the-wild study of mobile robotic benches that reconfigure within a semi-outdoor public space. Our findings show that the gestural performance of the benches manifested three affordances perceived by passersby: they activated engagement as robots, redistributed engagement as spatial elements, and settled engagement as infrastructure. We propose an Affordance Transition Model (ATM) describing how robotic furniture could proactively facilitate transition between these affordances to engage passersby. Our study bridges robotic furniture and urban HCI to activate human experience with the built environment purposefully.
☆ Users and Wizards in Conversations: How WoZ Interface Choices Define Human-Robot Interactions
In this paper, we investigated how the choice of a Wizard-of-Oz (WoZ) interface affects communication with a robot from both the user's and the wizard's perspective. In a conversational setting, we used three WoZ interfaces with varying levels of dialogue input and output restrictions: a) a restricted perception GUI that showed fixed-view video and ASR transcripts and let the wizard trigger pre-scripted utterances and gestures; b) an unrestricted perception GUI that added real-time audio from the participant and the robot; and c) a VR telepresence interface that streamed immersive stereo video and audio to the wizard and forwarded the wizard's spontaneous speech, gaze and facial expressions to the robot. We found that the interaction mediated by the VR interface was preferred by users in terms of robot features and perceived social presence. For the wizards, the VR condition turned out to be the most demanding but elicited a higher social connection with the users. The VR interface also induced the most connected interaction in terms of inter-speaker gaps and overlaps, while the restricted GUI induced the least connected flow and the largest silences. Given these results, we argue for more WoZ studies using telepresence interfaces. These studies better reflect the robots of tomorrow and offer a promising path to automation based on naturalistic contextualized verbal and non-verbal behavioral data.
comment: Published in Robotics: Science and Systems (2025)
☆ Fostering Design-Policy Collaboration through Contestation: An Adversarial Futuring Method
Emerging technologies introduce sociotechnical tensions that call for closer collaboration between technology design and policy. In this work, we introduce Design-Policy Adversarial Futuring, a scenario-based workshop method that supports design-policy engagement by structuring contestation between design and policy perspectives. We report on a workshop conducted in the autonomous mobility domain with 12 HCI researchers, used to explore and demonstrate the method in practice. The workshop illustrates how the adversarial futuring method can surface shifting harms, translate policy abstractions into situated use, and legitimise extreme ideas while maintaining grounded policy reasoning. This work contributes a reusable, exploratory method for supporting HCI-policy collaboration through contestation, which can be adapted across emerging technological domains.
☆ Self++: Co-Determined Agency for Human--AI Symbiosis in Extended Reality
Self++ is a design blueprint for human-AI symbiosis in extended reality (XR) that preserves human authorship while still benefiting from increasingly capable AI agents. Because XR can shape both perceptual evidence and action, apparently 'helpful' assistance can drift into over-reliance, covert persuasion, and blurred responsibility. Self++ grounds interaction in two complementary theories: Self-Determination Theory (autonomy, competence, relatedness) and the Free Energy Principle (predictive stability under uncertainty). It operationalises these foundations through co-determination, treating the human and the AI as a coupled system that must keep intent and limits legible, tune support over time, and preserve the user's right to endorse, contest, and override. These requirements are summarised as the co-determination principles (T.A.N.): Transparency, Adaptivity, and Negotiability. Self++ organises augmentation into three concurrently activatable overlays spanning sensorimotor competence support (Self: competence overlay), deliberative autonomy support (Self+: autonomy overlay), and social and long-horizon relatedness and purpose support (Self++: relatedness and purpose overlay). Across the overlays, it specifies nine role patterns (Tutor, Skill Builder, Coach; Choice Architect, Advisor, Agentic Worker; Contextual Interpreter, Social Facilitator, Purpose Amplifier) that can be implemented as interaction patterns, not personas. The contribution is a role-based map for designing and evaluating XR-AI systems that grow capability without replacing judgment, enabling symbiotic agency in work, learning, and social life and resilient human development.
comment: 35 pages, 1 figure, under review by Empathic Computing Journal
☆ Statistics 101, 201, and 202: Three Shiny Apps for Teaching Probability Distributions, Inferential Statistics, and Simple Linear Regression
Statistics 101, 201, and 202 are three open-source interactive web applications built with R and Shiny to support the teaching of introductory statistics and probability. The apps help students carry out common statistical computations -- computing probabilities from standard probability distributions, constructing confidence intervals, conducting hypothesis tests, and fitting simple linear regression models -- without requiring prior knowledge of R or any other programming language. Each app provides numerical results, plots rendered with ggplot2, and inline mathematical derivations typeset with MathJax, so that computation and statistical reasoning appear side by side in a single interface. The suite is organised around a broad pedagogical progression: Statistics 101 introduces probability distributions and their properties; Statistics 201 addresses confidence intervals and hypothesis tests; and Statistics 202 covers the simple linear model. All three apps are freely accessible online and their source code is released under a CC-BY-4.0 license.
comment: 6 pages, 0 figure
☆ Designing AI for Real Users -- Accessibility Gaps in Retail AI Front-End
As AI becomes embedded in customer-facing systems, ethical scrutiny has largely focused on models, data, and governance. Far less attention has been paid to how AI is experienced through user-facing design. This commentary argues that many AI front-ends implicitly assume an 'ideal user body and mind', and that this becomes visible and ethically consequential when examined through the experiences of differently abled users. We explore this through retail AI front-ends for customer engagement - i.e., virtual assistants, virtual try-on systems, and hyper-personalised recommendations. Despite intuitive and inclusive framing, these systems embed interaction assumptions that marginalise users with vision, hearing, motor, cognitive, speech and sensory differences, as well as age-related variation in digital literacy and interaction norms. Drawing on practice-led insights, we argue that these failures persist not primarily due to technical limits, but due to the commercial, organisational, and procurement contexts in which AI front-ends are designed and deployed, where accessibility is rarely contractual. We propose front-end assurance as a practical complement to AI governance, aligning claims of intelligence and multimodality with the diversity of real users.
comment: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End
☆ InconLens: Interactive Visual Diagnosis of Behavioral Inconsistencies in LLM-based Agentic Systems
Large Language Model (LLM)-based agentic systems have shown growing promise in tackling complex, multi-step tasks through autonomous planning, reasoning, and interaction with external environments. However, the stochastic nature of LLM generation introduces intrinsic behavioral inconsistency: the same agent may succeed in one execution but fail in another under identical inputs. Diagnosing such inconsistencies remains a major challenge for developers, as agent execution logs are often lengthy, unstructured, and difficult to compare across runs. Existing debugging and evaluation tools primarily focus on inspecting single executions, offering limited support for understanding how and why agent behaviors diverge across repeated runs. To address this challenge, we introduce InconLens, a visual analytics system designed to support interactive diagnosis of LLM-based agentic systems with a particular focus on cross-run behavioral analysis. InconLens introduces information nodes as an intermediate abstraction that captures canonical informational milestones shared across executions, enabling semantic alignment and inspection of agent reasoning trajectories across multiple runs. We demonstrate the effectiveness of InconLens through a detailed case study and further validate its usability and analytical value via expert interviews. Our results show that InconLens enables developers to more efficiently identify divergence points, uncover latent failure modes, and gain actionable insights into improving the reliability and stability of agentic systems.
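The cross-run comparison described in the abstract above can be illustrated with a deliberately simplified sketch: if each run is reduced to an ordered list of milestone labels, the first index where two runs disagree is a candidate divergence point. The label strings and the function name are hypothetical, and exact string matching stands in for the semantic alignment that InconLens performs; this is not the system's actual algorithm.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Index of the first step where two runs differ, or None if identical.

    Assumption: each run is a list of non-None milestone labels, so a
    run that ends early diverges at the position where it stops.
    """
    for index, (step_a, step_b) in enumerate(zip_longest(run_a, run_b)):
        if step_a != step_b:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical milestone sequences from two executions of one agent.
    success = ["parse_task", "search_docs", "draft_answer", "verify"]
    failure = ["parse_task", "search_docs", "retry_search", "draft_answer"]
    print(first_divergence(success, failure))
```

Everything before the returned index is shared behaviour, so inspection can start at the step where the runs part ways.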
☆ Synonymix: Unified Group Personas for Generative Simulations
Generative agent simulations operate at two scales: individual personas for character interaction, and population models for collective behavior analysis and intervention testing. We propose a third scale: meso-level simulation - interaction with group-level representations that retain grounding in rich individual experience. To enable this, we present Synonymix, a pipeline that constructs a "unigraph" from multiple life story personas via graph-based abstraction and merging, producing a queryable collective representation that can be explored for sensemaking or sampled for synthetic persona generation. Evaluating synthetic agents on General Social Survey items, we demonstrate behavioral signal preservation beyond demographic baselines (p<0.001, r=0.59) with a demonstrable privacy guarantee (max source contribution <13%). We invite discussion on interaction modalities enabled by meso-level simulations, and whether "high-fidelity" personas can ever capture the texture of lived experience.
comment: 6 pages (excluding appendix), 3 figures, CHI'26 Extended Abstract (Poster)
☆ CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir
comment: Prebuilt binaries, project page, full source code, and community discussion group are all available at: https://github.com/louiszengCN/CarlaAir
☆ Filipino Students' Willingness to Use AI for Mental Health Support: A Path Analysis of Behavioral, Emotional, and Contextual Factors
This study examined how behavioral, emotional, and contextual factors influence Filipino students' willingness to use artificial intelligence (AI) for mental health support. Results showed that habit had the strongest effect on willingness, followed by comfort, emotional benefit, facilitating conditions, and perceived usefulness. Students who used AI tools regularly felt more confident and open to relying on them for emotional support. Empathy, privacy, and accessibility also increased comfort and trust in AI systems. The findings highlight that emotional safety and routine use are essential in promoting willingness. The study recommends AI literacy programs, empathic design, and ethical policies that support responsible and culturally sensitive use of AI for student mental health care.
comment: 24 pages, 5 figures, 1 table, book chapter
☆ ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)-based agents can automate content creation, but directly applying them to interactive document generation often produces outputs that are difficult to control. To address this, we present ViviDoc, to the best of our knowledge the first work to systematically address interactive document generation. ViviDoc introduces a multi-agent pipeline (Planner, Styler, Executor, Evaluator). To make the generation process controllable, we provide three levels of human control: (1) the Document Specification (DocSpec) with SRTC Interaction Specifications (State, Render, Transition, Constraint) for structured planning, (2) a content-aware Style Palette for customizing writing and interaction styles, and (3) chat-based editing for iterative refinement. We also construct ViviBench, a benchmark of 101 topics derived from real-world interactive documents across 11 domains, along with a taxonomy of 8 interaction types and a 4-dimensional automated evaluation framework validated against human ratings (Pearson r > 0.84). Experiments show that ViviDoc achieves the highest content richness and interaction quality in both automated and human evaluation. A 12-person user study confirms that the system is easy to use, provides effective control over the generation process, and produces documents that satisfy users.
☆ From Passersby to Placemaking: Designing Autonomous Vehicle-Pedestrian Encounters for an Urban Shared Space
Autonomous vehicles (AVs) tend to disrupt the atmosphere and pedestrian experience in urban shared spaces, undermining the focus of these spaces on people and placemaking. We investigate how external human-machine interfaces (eHMIs) supporting AV-pedestrian interaction can be extended to consider the characteristics of an urban shared space. Inspired by urban HCI, we devised three place-based eHMI designs that (i) enhance a conventional intent eHMI and (ii) exhibit content and physical integration with the space. In an evaluation study, 25 participants experienced the eHMIs in an immersive simulation of the space via virtual reality and shared their impressions through think-aloud, interviews, and questionnaires. Results showed that the place-based eHMIs notably influenced the perception of AV interaction, including aspects like visual aesthetics and sense of reassurance, and fostered a sense of place, such as social interactivity and the intentionality to coexist. In measuring qualities of pedestrian experience, we found that perceived safety significantly correlated with user experience and affect, including the attractiveness of eHMIs and feelings of pleasantness. The paper opens the avenue for exploring how eHMIs may contribute to the placemaking goals of pedestrian-centric spaces and improve the experience of people encountering AVs within these environments.
☆ MGDIL: Multi-Granularity Summarization and Domain-Invariant Learning for Cross-Domain Social Bot Detection
Social bots increasingly infiltrate online platforms through sophisticated disguises, threatening healthy information ecosystems. Existing detection methods often rely on modality-specific cues or local contextual features, making them brittle when modalities are missing or inputs are incomplete. Moreover, most approaches assume similar train-test distributions, which limits their robustness to out-of-distribution (OOD) samples and emerging bot types. To address these challenges, we propose Multi-Granularity Summarization and Domain-Invariant Learning (MGDIL), a unified framework for robust social bot detection under domain shift. MGDIL first transforms heterogeneous signals into unified textual representations through LLM-based multi-granularity summarization. Building on these representations, we design a collaborative optimization framework that integrates task-oriented LLM instruction tuning with domain-invariant representation learning. Specifically, task-oriented instruction tuning enhances the LLM's ability to capture subtle semantic cues and implicit camouflage patterns, while domain-adversarial learning and cross-domain contrastive learning are jointly employed to mitigate distribution shifts across datasets and time periods. Through this joint optimization, MGDIL learns stable and discriminative domain-invariant features, improving cross-domain social bot detection through better distribution alignment, stronger intra-class compactness, and clearer inter-class separation.
♻ ☆ Improving motor imagery decoding methods for an EEG-based mobile brain-computer interface in the context of the 2024 Cybathlon
Motivated by the Cybathlon 2024 competition, we developed a modular, online EEG-based brain-computer interface to address these challenges, increasing accessibility for individuals with severe mobility impairments. Our system uses three mental and motor imagery classes to control up to five control signals. The pipeline consists of four modules: data acquisition, preprocessing, classification, and the transfer function to map classification output to control dimensions. We use three diagonalized structured state-space sequence layers as a deep learning classifier. We developed a training game for our pilot where the mental tasks control the game during quick-time events. We implemented a mobile web application for live user feedback. The components were designed with a human-centred approach in collaboration with the tetraplegic user. We achieve up to 84% classification accuracy in offline analysis using an S4D-layer-based model. In a competition setting, our pilot successfully completed one task; we attribute the reduced performance in this context primarily to factors such as stress and the challenging competition environment. Following the Cybathlon, we further validated our pipeline with the original pilot and an additional participant, achieving a success rate of 73% in real-time gameplay. We also compare our model to the EEGEncoder, which is slower to train but achieves higher performance. The S4D model outperforms the reference machine learning models. We provide insights into developing a framework for portable BCIs, bridging the gap between the laboratory and daily life. Specifically, our framework integrates modular design, real-time data processing, user-centred feedback, and low-cost hardware to deliver an accessible and adaptable BCI solution, addressing critical gaps in current BCI applications.
comment: This work was created by the members of neuroTUM e.V.
♻ ☆ Evaluation of Generative Models for Emotional 3D Animation Generation in VR
Social interactions incorporate nonverbal signals to convey emotions alongside speech, including facial expressions and body gestures. Generative models have demonstrated promising results in creating full-body nonverbal animations synchronized with speech; however, evaluations using statistical metrics in 2D settings fail to fully capture user-perceived emotions, limiting our understanding of model effectiveness. To address this, we evaluate emotional 3D animation generative models within a Virtual Reality (VR) environment, emphasizing user-centric metrics: emotional arousal realism, naturalness, enjoyment, diversity, and interaction quality in a real-time human-agent interaction scenario. Through a user study (N=48), we examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two emotions: happiness (high arousal) and neutral (mid arousal). Additionally, we compare these generative models against real human expressions obtained via a reconstruction-based method to assess both their strengths and limitations and how closely they replicate real human facial and body expressions. Our results demonstrate that methods explicitly modeling emotions lead to higher recognition accuracy compared to those focusing solely on speech-driven synchrony. Users rated the realism and naturalness of happy animations significantly higher than those of neutral animations, highlighting the limitations of current generative models in handling subtle emotional states. Generative models underperformed compared to reconstruction-based methods in facial expression quality, and all methods received relatively low ratings for animation enjoyment and interaction quality, emphasizing the importance of incorporating user-centric evaluations into generative model development. Finally, participants positively recognized animation diversity across all generative models.
comment: 20 pages, 5 figures. Webpage: https://emotional3dhumans.github.io/
♻ ☆ MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
Prompt optimization has become a practical way to improve the performance of Large Language Models (LLMs) without retraining. However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail. Moreover, they involve repetitive trial-and-error refinements that remain implicit, offering limited interpretability or actionable guidance for systematic improvement. In this paper, we propose MA-SAPO, a new Multi-Agent Reasoning for Score-Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements. Specifically, in the Training Phase, multiple agents interpret evaluation scores, diagnose weaknesses, and generate concrete revision directives, which are stored as reusable reasoning assets. In the Test Phase, an analyzer agent retrieves relevant exemplars and assets for a new prompt, and a refiner agent applies evidence-based edits to improve the prompt and its response. By grounding optimization in structured reasoning, MA-SAPO ensures edits are interpretable, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks show that our framework consistently outperforms single-pass prompting, retrieval-augmented generation, and prior multi-agent methods across multiple evaluation metrics.
comment: Preprint
♻ ☆ Togedule: Scheduling Meetings with Large Language Models and Adaptive Representations of Group Availability
Scheduling is a perennial, and often challenging, problem for many groups. Existing tools are mostly static, showing an identical set of choices to everyone, regardless of the current status of attendees' inputs and preferences. In this paper, we propose Togedule, an adaptive scheduling tool that uses large language models to dynamically adjust the pool of choices and their presentation format. With the initial prototype, we conducted a formative study (N=10) and identified the potential benefits and risks of such an adaptive scheduling tool. Then, after enhancing the system, we conducted two controlled experiments, one each for attendees and organizers (total N=66). For each experiment, we compared scheduling with verbal messages, shared calendars, or Togedule. Results show that Togedule significantly reduces the cognitive load of attendees indicating their availability and improves the speed and quality of the decisions made by organizers.
comment: This paper has been accepted at CSCW 2025
♻ ☆ Offline-First Large Language Model Architecture for AI-Assisted Learning with Adaptive Response Levels in Low-Connectivity Environments
Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact. Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.
comment: 16 pages, 10 figures, 2 tables
♻ ☆ NARVis: Neural Accelerated Rendering for Real-Time Scientific Point Cloud Visualization
Exploring scientific datasets with billions of samples in real-time visualization presents a challenge: balancing high-fidelity rendering with speed. This work introduces a neural accelerated renderer, NARVis, that uses the neural deferred rendering framework to visualize large-scale scientific point cloud data. NARVis augments a real-time point cloud rendering pipeline with high-quality neural post-processing, making the approach ideal for interactive visualization at scale. Specifically, we render the multi-attribute point cloud using a high-performance multi-attribute rasterizer and train a neural renderer to capture the desired post-processing effects from a conventional high-quality renderer. NARVis is effective in visualizing complex multidimensional Lagrangian flow fields and photometric scans of a large terrain as compared to the state-of-the-art high-quality renderers. Extensive evaluations demonstrate that NARVis prioritizes speed and scalability while retaining high visual fidelity. We achieve competitive frame rates of >126 fps for interactive rendering of >350M points (i.e., an effective throughput of >44 billion points per second) using ~12 GB of memory on an RTX 2080 Ti GPU. Furthermore, NARVis is generalizable across different point clouds with similar visualization needs, and the desired post-processing effects can be obtained with substantially high quality even at lower resolutions of the original point cloud, further reducing the memory requirements.
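The throughput figure quoted in the abstract above follows directly from the frame rate and the point count. The short check below reproduces that arithmetic using the lower bounds stated in the abstract; the function name is an illustrative assumption.

```python
def effective_throughput(points_per_frame: int, frames_per_second: int) -> int:
    """Points rendered per second when every frame draws the full cloud."""
    return points_per_frame * frames_per_second


if __name__ == "__main__":
    # Lower bounds quoted in the abstract: 350M points at 126 fps.
    throughput = effective_throughput(350_000_000, 126)
    print(f"{throughput / 1e9:.1f} billion points per second")  # prints 44.1
```

Because both inputs are lower bounds, 44.1 billion points per second is itself a lower bound, consistent with the reported figure of more than 44 billion.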
♻ ☆ Unbounded: Object-Boundary Interaction in Mixed Reality
Boundaries such as walls, windows, and doors are ubiquitous in the physical world, yet their potential in mixed reality (MR) remains underexplored. We present Unbounded, a Research through Design inquiry into object--boundary interaction (OBI). Building on prior work, we articulate a design space aimed at providing a shared language for OBI. To demonstrate its potential, we design and implement eight examples across productivity and art exploration scenarios, showcasing how OBIs can enrich and reframe everyday interactions. We further engage with six MR experts in one-on-one feedback sessions, using the design space and examples as design probes. Their reflections broaden the conceptual scope of OBI, reveal new possibilities for how the framework may be applied, and highlight implications for future MR interaction design. https://www.zhuoyuelyu.com/unbounded
♻ ☆ Deception by Design: A Temporal Dark Patterns Audit of McDonald's Self-Ordering Kiosk Flow
Self-ordering kiosks (SOKs) are widely deployed in fast food restaurants, transforming food ordering into digitally mediated, self-navigated interactions. While these systems enhance efficiency and average order value, they also create opportunities for manipulative interface design practices known as dark patterns. This paper presents a structured audit of the McDonald's self-ordering kiosk in Germany using the Temporal Analysis of Dark Patterns (TADP) framework. Through a scenario-based walkthrough simulating a time-pressured user, we reconstructed and analyzed 12 interface steps across intra-page, inter-page, and system levels. We identify recurring high-level strategies implemented through meso-level patterns such as adding steps, false hierarchy, bad defaults, hiding information, and pressured selling, and low-level patterns including visual prominence, confirmshaming, scarcity framing, feedforward ambiguity, emotional sensory manipulation, and partitioned pricing. Our findings demonstrate how these patterns accumulate across the interaction flow and may be amplified by the kiosk's linear task structure and physical context. These findings suggest that hybrid physical--digital consumer interfaces warrant closer scrutiny within emerging regulatory discussions on dark patterns.
comment: Accepted at the Bridge Over Troubled Water: Aligning Commercial Incentives With Ethical Design Practice To Combat Deceptive Patterns. Workshop at the 2026 CHI Conference on Human Factors in Computing Systems, Barcelona, Spain
♻ ☆ SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding
Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks rely primarily on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture of experts framework guided by a learned spectral gating mechanism. SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets.
comment: 34 pages (12 pages in the main text and 22 pages in Supplementary Information)
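The SpecMoE abstract above describes Gaussian-smoothed masks applied along time, frequency, and jointly in time-frequency on an STFT map. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the multiplicative way the three masks are combined, and all parameter values are assumptions made for illustration.

```python
import math


def gaussian_keep(length: int, center: float, sigma: float) -> list[float]:
    """Keep-weights in [0, 1]: 0 at the mask center, rising smoothly toward 1."""
    return [1.0 - math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(length)]


def combined_keep_mask(n_freq, n_time, time_center, time_sigma,
                       freq_center, freq_sigma, blob_center, blob_sigma):
    """Hypothetical combination of a time mask, a frequency mask, and a
    localized time-frequency blob into one (n_freq x n_time) keep-mask."""
    keep_t = gaussian_keep(n_time, time_center, time_sigma)  # masks a time segment
    keep_f = gaussian_keep(n_freq, freq_center, freq_sigma)  # masks a frequency band
    bf, bt = blob_center
    sf, st = blob_sigma
    mask = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            # 2D Gaussian hole centered at (bf, bt) in the time-frequency plane
            blob = 1.0 - math.exp(-((f - bf) ** 2 / (2 * sf ** 2)
                                    + (t - bt) ** 2 / (2 * st ** 2)))
            row.append(keep_f[f] * keep_t[t] * blob)
        mask.append(row)
    return mask


def apply_mask(stft_magnitude, mask):
    """Elementwise product of an STFT magnitude map and a keep-mask."""
    return [[v * m for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(stft_magnitude, mask)]
```

In a masked-reconstruction setup, a model would receive `apply_mask(stft, mask)` as input and be trained to reconstruct the original `stft`; the smooth edges mean the masked regions fade out rather than being cut off sharply.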
♻ ☆ Objestures: Everyday Objects Meet Mid-Air Gestures for Expressive Interaction
Everyday object-based interactions (EOIs) and mid-air gesture interactions (MAIs) have been widely explored, yet prior work on their integration often targets narrow use cases or specific technologies, leaving designers and developers with limited guidance that generalizes across diverse EOIs and MAIs. We introduce Objestures ("Obj" + "Gestures") -- five interaction types spanning EOIs and MAIs, forming a design space for expressive uni- and bimanual interaction. To evaluate the usefulness of Objestures, we conducted an exploratory user study (N=12) on basic 3D tasks (rotation and scaling), which showed performance comparable to the headset's native freehand manipulation. To understand the user experience, we conducted case studies with the same participants across three applications (Sound, Draw, and Shadow), where participants found the interactions intuitive, engaging, and expressive, and indicated interest in everyday use. We further demonstrate the potential of Objestures across diverse contexts through 30 examples, and discuss limitations and implications. https://www.zhuoyuelyu.com/objestures
♻ ☆ Affordances of Digital and Blockchain-based Community Currencies: The Case of Sarafu Network in Kenya
Community currencies (CCs) have been adopting innovative systems to overcome implementational hurdles from issuing paper currencies. Using a qualitative approach, this paper examined this digital transition of Sarafu Network in Kenya and its predecessor CCs as a case study. From the original vouchers launched in 2010, the foundation Grassroots Economics introduced a digital interface in 2016 that operates on a feature phone, and then integrated blockchain technology starting in 2018, undergoing several migrations before settling on its current iteration called Community Asset Vouchers on the Celo blockchain since 2023. Using affordances from human-computer interaction, the research shows that digitalization and blockchain improved the facilitation of economic activities of the local communities, both their typical market transactions as well as traditional reciprocal labor exchanges, by offering more functionalities compared to the analog version of Sarafu. The unique contributions of blockchain include enabling automation of holding tax calculations and linking the vouchers to the mainstream monetary system via stablecoins facilitated by a series of smart contracts also known as the liquidity pool. The study also finds that there is an inherent trade-off between blockchain benefits and user interface complexity. Hence, balancing innovation and community needs remains a challenge.
comment: 22 pages, 7 figures
♻ ☆ Understanding the Use of a Large Language Model-Powered Guide to Make Virtual Reality Accessible for Blind and Low Vision People
As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI "sighted guide" to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.
comment: 16 pages, 5 figures, 3 tables, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain. ACM
♻ ☆ Unseen Risks of Clinical Speech-to-Text Systems: Transparency, Privacy, and Reliability Challenges in AI-Driven Documentation
AI-driven speech-to-text (STT) documentation systems are increasingly adopted in clinical settings to reduce documentation burden and improve workflow efficiency. However, adoption has outpaced systematic evaluation of socio-technical risks related to transparency, reliability, patient autonomy, and organizational accountability. This study develops a socio-technical framework for identifying and governing risks associated with clinical STT systems. We synthesize interdisciplinary evidence from automatic speech recognition research, clinical workflow and human factors studies, ethical guidance on consent and autonomy, and regulatory and organizational sources. Using a structured narrative synthesis, literature was iteratively reviewed and thematically analyzed to identify recurring socio-technical risk mechanisms and inform a layered conceptual framework. Findings show that clinical STT systems operate within tightly coupled socio-technical environments where model performance, audio conditions, clinician oversight, patient understanding, workflow design, and institutional governance are interdependent. Key risks include inconsistent consent practices, performance disparities for accented speech and speech disorders, accuracy degradation in real clinical settings, automation complacency, and unclear accountability across vendors and healthcare organizations. These risks inform a six-layer governance model spanning technical, human/workflow, ethical, organizational, regulatory, and sociocultural dimensions. We propose a governance framework and implementation roadmap to support responsible deployment of clinical STT systems, emphasizing transparency, patient autonomy, documentation integrity, and accountable oversight.
comment: Accepted in the International Journal of Medical Informatics
♻ ☆ Judging the algorithm: Algorithmic accountability on the risk assessment tool for intimate partner violence in the Basque Country
This paper discusses an algorithmic tool introduced in the Basque Country (Spain) to assess the risk of intimate partner violence. The algorithm was introduced to address the lack of human experts by automatically calculating the level of violence based on psychometric features such as controlling or violent behaviour. Given that critical literature on risk assessment tools for domestic violence mainly focuses on English-speaking countries, this paper offers an algorithmic accountability analysis in a non-English speaking region. It investigates the algorithmic risks, harms, and limitations associated with the Basque tool. We propose a transdisciplinary approach from a critical statistical and legal perspective. This approach unveils issues and limitations that could lead to unexpected consequences for individuals suffering from partner violence. Moreover, our analysis suggests that the algorithmic tool has a high error rate on severe cases, i.e., cases where the aggressor could murder his partner -- 5 out of 10 high-risk cases are misclassified as low risk -- and that there is a lack of appropriate legal guidelines for judges, the end users of this tool. The paper concludes that this risk assessment tool needs to be urgently evaluated by independent and transdisciplinary experts to better mitigate algorithmic harms in the context of intimate partner violence.
comment: 17 pages, 5 figures, 1 table
♻ ☆ An Agentic Operationalization of DISARM for FIMI Investigation on Social Media
Interoperable data and intelligence flows among allied partners and operational end-users remain essential to NATO's collective defense across both conventional and hybrid threat environments. Foreign Information Manipulation and Interference (FIMI) increasingly spans multiple societal domains and information ecosystems, complicating threat characterization, persistent situational awareness, and coordinated response. Concurrent advances in AI have further lowered the barrier to conducting large-scale, AI-augmented FIMI activities -- including automated generation, personalization, and amplification of manipulative content. While frameworks such as DISARM offer a standardized analytical and metadata schema for characterizing FIMI incidents, their practical application for automating large-scale detection remains challenging. We present a framework-agnostic, agent-based operationalization of DISARM piloted to support FIMI investigation on social platforms. Our agent coordination pipeline integrates general agentic AI components that (1) identify candidate manipulative behaviors in social-media data and (2) map these behaviors to DISARM taxonomies through transparent, auditable reasoning steps. Evaluation on two practitioner-annotated, real-world datasets demonstrates that our approach can effectively scale analytic workflows that are currently manual, time-intensive, and interpretation-heavy. Notably, the experiment surfaced more than 30 Russian bot accounts -- deployed for the 2025 election in Moldova -- that had gone undetected during the prior non-agentic investigation. By enhancing analytic throughput, interoperability, and explainability, the proposed approach provides a direct contribution to defense policy and planning needs for improved situational awareness, cross-partner data integration, and rapid assessment of information-environment threats.
comment: This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY---the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026
♻ ☆ Multi-Agent Home Energy Management Assistant
Existing home energy management systems conceptualize occupants as passive recipients of energy information and control, which limits their ability to effectively support informed decision-making and sustained engagement. This paper presents Home Energy Management Assistant (HEMA), the first open-source, multi-agent system enabling sustained human-AI collaboration - multi-turn conversational interactions with preserved context - across diverse home energy management (HEM) tasks - from energy analysis and educational support to smart device control. HEMA combines large language model (LLM) reasoning capabilities with 36 purpose-built domain-specific tools through a three-layer architecture: a web-based conversational interface, a backend API server, and a multi-agent system. The system features three specialized agents - Analysis (energy consumption patterns and cost optimization), Knowledge (educational queries and rebate information), and Control (smart device management and scheduling) - coordinated through a self-consistency classifier that routes user queries using chain-of-thought reasoning. This architecture enables various energy analyses, adaptive explanations, and streamlined device control. HEMA also includes a comprehensive evaluation framework using an LLM-as-simulated-user methodology with 23 objective metrics across task performance, factual accuracy, interaction quality, and system efficiency, allowing systematic testing across diverse scenarios and user personas without requiring extensive human subject testing. Through demonstrations using real-world household energy consumption data, we show how HEMA supports informed decision-making and active engagement in HEM, highlighting its potential as a user-friendly, adaptable tool for residential deployment and as a research platform for HEM innovation.
comment: 32 pages, 9 figures
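The HEMA abstract above mentions a self-consistency classifier that routes each query to one of three agents. The abstract does not give its details, so the following is only a sketch of the generic self-consistency pattern (sample several independent classifications, then take a majority vote). The names `route_query` and `classify_once`, the agent labels, the sample count, and the fallback choice are all hypothetical.

```python
from collections import Counter
from typing import Callable

# Agent labels taken from the abstract; the lowercase spellings are an assumption.
AGENTS = ("analysis", "knowledge", "control")


def route_query(query: str,
                classify_once: Callable[[str], str],
                num_samples: int = 5) -> str:
    """Route a query by majority vote over several sampled classifications.

    `classify_once` stands in for one LLM call that reasons step by step and
    returns a single agent label; it is a placeholder, not a real API.
    """
    votes = Counter()
    for _ in range(num_samples):
        label = classify_once(query)
        if label in AGENTS:          # ignore malformed or unknown labels
            votes[label] += 1
    if not votes:
        return "knowledge"           # hypothetical fallback when no vote is valid
    return votes.most_common(1)[0][0]
```

Voting over several samples trades extra classifier calls for robustness: a single off-target reasoning path is outvoted instead of misrouting the query.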
♻ ☆ Evolving with AI: A Longitudinal Analysis of Developer Logs ICSE'26
AI-powered coding assistants are rapidly becoming fixtures in professional IDEs, yet their sustained influence on everyday development remains poorly understood. Prior research has focused on short-term use or self-reported perceptions, leaving open questions about how sustained AI use reshapes actual daily coding practices in the long term. We address this gap with a mixed-method study of AI adoption in IDEs, combining longitudinal two-year fine-grained telemetry from 800 developers with a survey of 62 professionals. We analyze five dimensions of workflow change: productivity, code quality, code editing, code reuse, and context switching. Telemetry reveals that AI users produce substantially more code but also delete significantly more. Meanwhile, survey respondents report productivity gains and perceive minimal changes in other dimensions. Our results offer empirical insights into the silent restructuring of software workflows and provide implications for designing future AI-augmented tooling.
comment: Accepted to ICSE'26 Research track. 12 pages, 5 figures, 1 table
♻ ☆ OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition CVPR 2026
Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/
comment: Accepted by CVPR 2026
♻ ☆ On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents
AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect operational efficiency of the agents. In this paper, we study the impact of AGENTS.md files on the runtime and token consumption of AI coding agents operating on GitHub pull requests. We analyze 10 repositories and 124 pull requests, executing agents under two conditions: with and without an AGENTS.md file. We measure wall-clock execution time and token usage during agent execution. Our results show that the presence of AGENTS.md is associated with a lower median runtime (Δ28.64%) and reduced output token consumption (Δ16.58%), while maintaining a comparable task completion behavior. Based on these results, we discuss immediate implications for the configuration and deployment of AI coding agents in practice, and outline a broader research agenda on the role of repository-level instructions in shaping the behavior, efficiency, and integration of AI coding agents in software development workflows.
comment: 5 pages, 1 figure, 1 table
♻ ☆ Simulating Novice Students Using Machine Unlearning and Relearning in Large Language Models
Student simulation can support learning-by-teaching pedagogy where human students (as tutors) teach AI-simulated novice students (as tutees). Recent research often relies on prompt engineering with large language models (LLMs) to simulate novice student behaviour, but it is difficult to keep the AI-simulated student at a stable novice knowledge level. A key reason is that many LLMs are trained to be broadly capable, so even when prompted to "act like a novice," the LLMs can still produce expert-level explanations during the learning-by-teaching interaction process. As a result, the AI-simulated student may drift beyond the intended knowledge level, reducing the credibility of the simulation for studying learning-by-teaching processes. Thus, we propose a knowledge-level simulation approach based on machine unlearning. We investigate this approach using a dataset of multiple-choice questions on Python programming concepts. We apply machine unlearning to transform a knowledgeable LLM into a novice-level AI student (i.e., teachable agent), then evaluate whether the teachable agent can relearn targeted knowledge components through learning-by-teaching dialogue interactions. Finally, we analyse the dialogue logs to characterise how the agent's behaviour changes over time, including its question asking, error patterns, and responsiveness to instruction. The results show that (1) unlearning produces simulated student agents with more novice-like responses than prompt-only baselines, (2) the agents recover a measurable portion of the unlearned knowledge under structured exposure, and (3) dialogue analyses reveal identifiable trajectories of conceptual change and teaching moves that predict learning recovery.
♻ ☆ Exploring Collatz Dynamics with Human-LLM Collaboration
We develop a structural framework for the Collatz map based on odd-to-odd dynamics, modular return structure, and a decomposition of trajectories into bursts and gaps. On the unconditional side, we prove several exact results. The fiber-57 branch q = 7 (mod 8) returns in exactly two odd-to-odd steps with uniform affine destination. The branch q = 3 (mod 8) cannot return within four steps (minimum gap five), and its earliest returns form an explicit dyadic cylinder family indexed by w = v_2(243m+119). The algebraic chain map on the five-element invariant core is a permutation at every depth, so any genuine contraction must come from return dynamics rather than core algebra. These yield an exact depth-2 known-gap partial return kernel with Perron root 129/1024 -- not asserted as the full bottleneck constant, since q = 3 (mod 8) returns with gap >= 6 are unresolved. The main body independently develops a conditional reduction via burst-gap decomposition, phantom-cycle gain analysis, and a weak-mixing hierarchy, establishing an exact geometric block law, exponential almost-all crossing bound, and per-orbit phantom gain within budget (4.65x margin). The framework reduces the convergence programme to a single orbitwise regularity statement, formulated either through the weak-mixing hierarchy or the fiber-57 anti-concentration conjecture. The remaining obstruction is to prove that no deterministic orbit can concentrate its fiber-57 returns on the sustaining core strongly enough to maintain indefinite non-termination. This work is not a complete proof of the Collatz conjecture. It is a sharpened reduction isolating the unresolved difficulty to a single orbitwise upgrade from ensemble behavior to pointwise control, concentrated in the q = 3 (mod 8) return channel.
comment: 138 pages, 11 figures, 16 tables
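The Collatz abstract above is phrased in terms of odd-to-odd dynamics and the 2-adic valuation v_2. As a minimal sketch of those two standard building blocks only (the paper's fiber-57 branches, bursts, and gaps are not modeled here, and the function names are chosen for illustration):

```python
def v2(n: int) -> int:
    """2-adic valuation: exponent of the largest power of 2 dividing n (n > 0)."""
    if n <= 0:
        raise ValueError("n must be positive")
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count


def odd_step(n: int) -> int:
    """One odd-to-odd Collatz step: apply 3n + 1, then strip all factors of 2."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    m = 3 * n + 1
    return m >> v2(m)


def odd_orbit(n: int, max_steps: int = 1000) -> list[int]:
    """Odd-to-odd orbit starting at n, stopping at 1 or after max_steps steps."""
    orbit = [n]
    while orbit[-1] != 1 and len(orbit) <= max_steps:
        orbit.append(odd_step(orbit[-1]))
    return orbit


print(odd_orbit(7))       # [7, 11, 17, 13, 5, 1]
print(v2(243 * 3 + 119))  # 4, since 848 = 16 * 53
```

The second call evaluates the index w = v_2(243m+119) from the abstract at the arbitrarily chosen value m = 3.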
♻ ☆ Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.
comment: 12 pages, 13 figures
♻ ☆ RESPOND: Responsive Engagement Strategy for Predictive Orchestration and Dialogue
The majority of voice-based conversational agents still rely on pause-and-respond turn-taking, leaving interactions sounding stiff and robotic. We present RESPOND (Responsive Engagement Strategy for Predictive Orchestration and Dialogue), a framework that brings two staples of human conversation to agents: timely backchannels ("mm-hmm," "right") and proactive turn claims that can contribute relevant content before the speaker yields the conversational floor. Built on streaming ASR (Automatic Speech Recognition) and incremental semantics, RESPOND continuously predicts both when and how to interject, enabling fluid, listener-aware dialogue. A defining feature is its designer-facing controllability: two orthogonal dials, Backchannel Intensity (frequency of acknowledgments) and Turn Claim Aggressiveness (depth and assertiveness of early contributions), can be tuned to match the etiquette of contexts ranging from rapid ideation to reflective counseling. By coupling predictive orchestration with explicit control, RESPOND offers a practical path toward conversational agents that adapt their conversational footprint to social expectations, advancing the design of more natural and engaging voice interfaces.
comment: 12 pages, 8 figures
Programming Languages 8
☆ Towards verifying unsafe Rust programs against Rust's pointer-aliasing restrictions
The Rust programming language is famous for its strong ownership regime: at each point, each value is either exclusively owned, exclusively borrowed through a mutable reference, or borrowed as read-only through one or more shared references. These rules, known as Rust's pointer-aliasing rules, are exploited by the Rust compiler to generate more efficient machine code, and enforced by Rust's static type system, except inside unsafe blocks. In this paper, we present our work in progress towards the first program logic for modularly verifying that Rust programs that use unsafe blocks comply with the pointer-aliasing rules.
comment: 8 pages, 9 figures
☆ Statistics 101, 201, and 202: Three Shiny Apps for Teaching Probability Distributions, Inferential Statistics, and Simple Linear Regression
Statistics 101, 201, and 202 are three open-source interactive web applications built with R and Shiny to support the teaching of introductory statistics and probability. The apps help students carry out common statistical computations -- computing probabilities from standard probability distributions, constructing confidence intervals, conducting hypothesis tests, and fitting simple linear regression models -- without requiring prior knowledge of R or any other programming language. Each app provides numerical results, plots rendered with ggplot2, and inline mathematical derivations typeset with MathJax, so that computation and statistical reasoning appear side by side in a single interface. The suite is organised around a broad pedagogical progression: Statistics 101 introduces probability distributions and their properties; Statistics 201 addresses confidence intervals and hypothesis tests; and Statistics 202 covers the simple linear model. All three apps are freely accessible online and their source code is released under a CC-BY-4.0 license.
comment: 6 pages, 0 figure
☆ Superset Decompilation
Reverse engineering tools remain monolithic and imperative compared to the advancement of modern compiler architectures: analyses are tied to a single mutable representation, making them difficult to extend or refine, and forcing premature choices between soundness and precision. We observe that decompilation is the reverse of compilation and can be structured as a sequence of modular passes, each performing a granular and clearly defined interpretation of the binary at a progressively higher level of abstraction. We formalize this as provenance-guided superset decompilation (PGSD), a framework that monotonically derives facts about the binary into a relation store. Instead of committing early to a single interpretation, the pipeline retains ambiguous interpretations as parallel candidates with provenance, deferring resolution until the final selection phase. Manifold implements PGSD as a declarative reverse engineering framework that lifts Linux ELF binaries to C99 through a granular intermediate representation in ~35K lines of Rust and Datalog. On GNU coreutils, Manifold's output quality matches Ghidra, IDA Pro, angr, and RetDec on multiple metrics while producing fewer compiler errors, and generalizes across compilers and optimization levels.
♻ ☆ LLMON: An LLM-native Markup Language to Leverage Structure and Semantics at the LLM Interface
Textual Large Language Models (LLMs) provide a simple and familiar interface: a string of text is used for both input and output. However, the information conveyed to an LLM often has a richer structure and semantics, which is not conveyed in a string. For example, most prompts contain both instructions ("Summarize this paper into a paragraph") and data (the paper to summarize), but these are usually not distinguished when passed to the model. This can lead to model confusion and security risks, such as prompt injection attacks. This work addresses this shortcoming by introducing an LLM-native mark-up language, LLMON (LLM Object Notation, pronounced "Lemon"), that enables the structure and semantic metadata of the text to be communicated in a natural way to an LLM. This information can then be used during model training, model prompting, and inference implementation, leading to improvements in model accuracy, safety, and security. This is analogous to how programming language types can be used for many purposes, such as static checking, code generation, dynamic checking, and IDE highlighting. We discuss the general design requirements of an LLM-native markup language, introduce the LLMON markup language and show how it meets these design requirements, describe how the information contained in a LLMON artifact can benefit model training and inference implementation, and provide some preliminary empirical evidence of its value for both of these use cases. We also discuss broader issues and research opportunities that are enabled with an LLM-native approach.
comment: 28 pages
♻ ☆ 2-Functoriality of Initial Semantics, and Applications
Initial semantics aims to model inductive structures and their properties, and to provide them with recursion principles respecting these properties. A ubiquitous example is the fold operator for lists. We are concerned with initial semantics that model languages with variable binding and their substitution structure, and that provide substitution-safe recursion principles. There are different approaches to implementing languages with variable binding depending on the choice of representation for contexts and free variables, such as unscoped syntax, or well-scoped syntax with finite or infinite contexts. Abstractly, each approach corresponds to choosing a different monoidal category to model contexts and binding, each choice yielding a different notion of "model" for the same abstract specification (or "signature"). In this work, we provide tools to compare and relate the models obtained from a signature for different choices of monoidal category. We do so by showing that initial semantics naturally has a 2-categorical structure when parametrized by the monoidal category modeling contexts. We thus can relate models obtained from different choices of monoidal categories provided the monoidal categories themselves are related. In particular, we use our results to relate the models of the different implementations (de Bruijn vs locally nameless, finite vs infinite contexts), and to provide a generalized recursion principle for simply-typed syntax.
comment: Version identical to the one published in ICFP 2025
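The fold operator mentioned in the abstract above can be illustrated with a minimal sketch. The names `fold_right`, `length`, `total`, and `append` are illustrative and not taken from the paper; the sketch only shows the recursion principle for plain lists, not the substitution-safe version the paper studies.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def fold_right(step: Callable[[A, B], B], nil: B, xs: Iterable[A]) -> B:
    """Right fold: interpret each cons cell with `step` and the empty list with `nil`."""
    acc = nil
    # Walk from the last element to the first so that `step` nests to the right.
    for x in reversed(list(xs)):
        acc = step(x, acc)
    return acc


def length(xs: Iterable[A]) -> int:
    # Each element contributes 1, the empty list contributes 0.
    return fold_right(lambda _x, n: n + 1, 0, xs)


def total(xs: Iterable[int]) -> int:
    # Sum of the elements, with 0 for the empty list.
    return fold_right(lambda x, s: x + s, 0, xs)


def append(xs: Iterable[A], ys: Iterable[A]) -> List[A]:
    # Rebuild `xs` on top of `ys`: the empty list is interpreted as `ys`.
    return fold_right(lambda x, acc: [x] + acc, list(ys), xs)
```

Every function here is defined by choosing one interpretation for the empty list and one for cons, which is the sense in which the list type is "initial".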
♻ ☆ Types for Grassroots Logic Programs
Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. "Logic Programs as Types for Logic Programs" (LICS'91) defined types as regular sets of paths over derivable ground atoms. Here, we define types to be regular sets of moded paths, where a mode captures directionality of communication -- whether a subterm is consumed from or produced to the environment -- enabling the typing of interactive partial computations including those that eventually deadlock or fail, or never terminate. We provide a syntactic definition of well-typing and prove that a program is well-typed iff the path abstraction of its moded-atom semantics satisfies covariance and contravariance conditions with respect to its type. The GLP type system was implemented in Dart by AI, starting from a mathematical specification of Typed GLP (this paper), deriving from it an English spec (written by AI), and from the spec deriving Dart code (by AI). While GLP is naturally untyped, the motivation for Typed GLP comes from programming with AI: Asking AI to program complex communication modalities in GLP (and in general) and hoping for the best is a tenuous strategy. The emerging discipline we advocate and employ is for the human designer and AI to jointly develop and agree upon (1) GLP types; (2) GLP procedure type declarations; (3) informal (English) descriptions of the procedures; and only then let AI attempt to write (4) GLP code based on those.
♻ ☆ Less is More Revisited: Association with Global Protocols and Multiparty Sessions
Ensuring correctness of communication in distributed systems remains challenging. To address this, Multiparty session types (MPST), initially introduced by Honda et al. [52, 53], offer a type discipline in which a programmer or architect specifies an overall view of communication as a global protocol (global type), and each distributed program is locally type-checked against its end-point projection. In practice, the MPST framework has been integrated into over 25 programming languages or tools. Ten years after the emergence of MPST, Scalas and Yoshida [84] discovered that existing proofs of type safety using end-point projection with mergeability are flawed, where the mergeability operator enlarges the typability of MPST end-point programs, admits easy implementation, and is more efficient than alternative approaches, including model checking. Nevertheless, following the result in [84], the soundness of end-point projection (with mergeability) has been interpreted in the literature as problematic. We clarify this concern by proposing a new general proof technique for type soundness (subject reduction) of multiparty session $π$-calculus, which relies on an association relation between the behavioural semantics of a global type and its end-point projection. With this approach, behavioural properties, namely session fidelity, deadlock freedom, and liveness, are also guaranteed based on global types. Additionally, we provide detailed comparisons with existing MPST typing systems and discuss their respective proof methods for type soundness.
♻ ☆ ExVerus: Verus Proof Repair via Counterexample Reasoning
Large Language Models (LLMs) have shown promising results in automating formal verification. However, existing approaches treat proof generation as a static, end-to-end prediction over source code, relying on limited verifier feedback and lacking access to concrete program behaviors. We present EXVERUS, a counterexample-guided framework that enables LLMs to reason about proofs using behavioral feedback via counterexamples. When a proof fails, EXVERUS automatically generates and validates counterexamples, and then guides the LLM to generalize them into inductive invariants to block these failures. Our evaluation shows that EXVERUS significantly improves proof accuracy, robustness, and token efficiency over the state-of-the-art prompting-based Verus proof generator.
comment: 31 pages, 8 figures
Operating Systems 1
☆ SteelDB: Diagnosing Kernel-Space Bottlenecks in Cloud OLTP Databases
Modern cloud OLTP databases have sought performance primarily through user-space optimization - separating storage and compute layers, or distributing transactions across multiple nodes using consensus algorithms. This paper turns attention to a previously unexplored layer: kernel-space I/O behavior. From an on-premises perspective, where a single server with local storage delivers excellent performance, these elaborate designs seem puzzling. Why do cloud databases require such architectural complexity? We investigate this through a pathological analysis of databases that rely on OS-level I/O control in cloud-specific storage environments. We show that bottlenecks widely attributed to network or storage architectures in fact originate in kernel-space I/O behavior. Based on this diagnosis, we derive treatment principles and realize them as SteelDB, a zero-patch architecture that improves database performance on general-purpose cloud distributed block storage through strategic I/O optimization without requiring kernel or database patches. TPC-C evaluations demonstrate that SteelDB achieves up to 9x performance improvement at no additional cost. Against Amazon Aurora, SteelDB achieved 3.1x higher performance while reducing costs by 58%, leading to a 7.3x improvement in cost efficiency. While Aurora requires an average of 254 days for major version upgrades due to applying proprietary patches to newly released OSS databases, our zero-patch architecture reduces these software maintenance costs to near zero.
Networking and Internet Architecture 19
☆ Embeddings of Nation-Level Social Networks
Full nation-scale social networks are now emerging from countries such as the Netherlands and Denmark, but these networks present challenging technical issues in working with large, multiplex, time-dependent networks. We report on our experiences in producing dynamic node embeddings of the population network of the Netherlands. We present (a) a layer-sensitive random walk strategy which improves on traditional flattening methods for multiplex networks, (b) a temporal alignment strategy that brings annual networks into the same embedding space, without leaking information to future years, and (c) the use of Fibonacci spirals and embedding whitening techniques for more balanced and effective partitioning. We demonstrate the effectiveness of these techniques in building embedding-based models for 13 downstream tasks.
comment: 13 pages, 5 figures, 2 tables, The XIV International Conference on Complex Networks and their Applications
☆ Iran's January 2026 Internet Shutdown: Public Data, Censorship Methods, and Circumvention Techniques
This paper analyzes the Internet shutdown that occurred in Iran in January 2026 in the context of protests, focusing on its impact on the country's digital communication infrastructure and on information access and control dynamics. The scale, complexity, and nation-state nature of the event motivate a comprehensive investigation that goes beyond isolated reports, aiming to provide a unified and systematic understanding of what happened and how it was observed. The study is guided by a set of research questions addressing: the characterization of the shutdown via the timeline of the disruption events and post-event "new normal"; the detectability of the event, encompassing monitoring initiatives, measurement techniques, and precursory signals; and the interplay between censorship and circumvention, assessing both the imposed restrictions and the effectiveness of tools designed to bypass them. To answer these questions, we adopt a multi-source, multi-perspective methodology that integrates heterogeneous public data, primarily from grey literature produced by network measurement and monitoring initiatives, complemented by additional private measurements. This approach enables a holistic view of the event and allows us to reconcile and compare partial observations from different sources.
comment: 12 pages, 3 figures, 1 table
☆ Study of Post Quantum status of Widely Used Protocols
The advent of quantum computing poses significant threats to classical public-key cryptographic primitives such as RSA and elliptic-curve cryptography. As many critical network and security protocols depend on these primitives for key exchange and authentication, there is an urgent need to understand their quantum vulnerability and assess the progress made towards integrating post-quantum cryptography (PQC). This survey provides a detailed examination of nine widely deployed protocols - TLS, IPsec, BGP, DNSSEC, SSH, QUIC, OpenID Connect, OpenVPN, and Signal Protocol - analysing their cryptographic foundations, quantum risks, and the current state of PQC migration. We find that TLS and Signal lead the transition with hybrid post-quantum key exchange already deployed at scale, while IPsec and SSH have standardised mechanisms but lack widespread production adoption. DNSSEC and BGP face the most significant structural barriers, as post-quantum signature sizes conflict with fundamental protocol constraints. Across all protocols, key exchange proves consistently easier to migrate than authentication, and protocol-level limitations such as message size and fragmentation often dominate over raw algorithm performance. We also discuss experimental deployments and emerging standards that are shaping the path towards a quantum-resistant communication infrastructure.
☆ BitSov: A Composable Bitcoin-Native Architecture for Sovereign Internet Infrastructure
Today's internet concentrates identity, payments, communication, and content hosting under a small number of corporate intermediaries, creating single points of failure, enabling censorship, and extracting economic rent from participants. We present BitSov, an architectural framework for sovereign internet infrastructure that composes existing decentralized technologies (Bitcoin, Lightning Network, decentralized storage, federated messaging, and mesh connectivity) into a unified, eight-layer protocol stack anchored to Bitcoin's base layer. The framework introduces three architectural patterns: (1) payment-gated messaging, where every transmitted message requires cryptographic proof of a Bitcoin payment, deterring spam through economic incentives rather than moderation; (2) timechain-locked contracts, which anchor subscriptions and licenses to Bitcoin block height (the timechain) rather than calendar dates; and (3) a self-sustaining economic flywheel that converts service revenue into infrastructure growth. A dual settlement model supports both on-chain transactions for permanence and auditability and Lightning micropayments for high-frequency messaging. As a position paper, we analyze the quality attributes, discuss open challenges, and propose a research agenda for empirical validation.
comment: Accepted at BlockArch 2026, co-located with IEEE ICSA 2026. 4 pages
☆ A Techno-Economic Framework for Cost Modeling and Revenue Opportunities in Open and Programmable AI-RAN
The large-scale deployment of 5G networks has not delivered the expected return on investment for mobile network operators, raising concerns about the economic viability of future 6G rollouts. At the same time, surging demand for Artificial Intelligence (AI) inference and training workloads is straining global compute capacity. AI-RAN architectures, in which Radio Access Network (RAN) platforms accelerated on Graphics Processing Units (GPUs) share idle capacity with AI workloads during off-peak periods, offer a potential path to improved capital efficiency. However, the economic case for such systems remains unsubstantiated. In this paper, we present a techno-economic analysis of AI-RAN deployments by combining publicly available benchmarks of 5G Layer-1 processing on heterogeneous platforms -- from x86 servers with accelerators for channel coding to modern GPUs -- with realistic traffic models and AI service demand profiles for Large Language Model (LLM) inference. We construct a joint cost and revenue model that quantifies the surplus compute capacity available in GPU-based RAN deployments and evaluates the returns from leasing it to AI tenants. Our results show that, across a range of scenarios encompassing token depreciation, varying demand dynamics, and diverse GPU serving densities, the additional capital and operational expenditures of GPU-heavy deployments are offset by AI-on-RAN revenue, yielding a return on investment of up to 8x. These findings strengthen the long-term economic case for accelerator-based RAN architectures and future 6G deployments.
☆ How Many Qubits Can Be Teleported? Scalability of Fidelity-Constrained Quantum Applications
Quantum networks (QNs) enable the transfer of qubits between distant nodes using quantum teleportation, which reproduces a qubit state at a remote location by consuming a shared Bell pair. After teleportation, qubits are stored in quantum memories, where decoherence progressively degrades their quantum states. This degradation is quantified by the fidelity, defined as the overlap between the stored quantum state and the ideal target state. Some quantum applications (QApps) require the teleportation of multiple qubits and can only operate if all teleported qubits simultaneously maintain a fidelity above a given threshold. In this paper, we study how many qubits can be teleported under such fidelity-constrained operation in a two-node QN. To that end, we define a QApp-level reliability metric as the probability that all end-to-end Bell pairs satisfy the target fidelity upon completion of the multi-qubit teleportation stage. We design a Monte Carlo-based simulator that captures stochastic Bell-pair generation, Quantum Repeater (QR)-assisted entanglement distribution, and fidelity degradation. Fiber-based and terrestrial free-space optical (FSO) quantum links and representative NV-center- and trapped-ion-based quantum memories are considered. Results show that memory coherence is the main scalability bottleneck under stringent fidelity targets, while parallel entanglement generation is essential for multi-qubit teleportation.
comment: This work has been accepted for publication in the Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit), 2026
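The reliability metric described in the abstract above (the probability that all Bell pairs meet the fidelity target when the multi-qubit stage completes) can be estimated with a toy Monte Carlo loop. This is a sketch under strong assumptions that are not taken from the paper: pairs are generated one after another, each attempt succeeds independently with probability `p_success`, and stored fidelity decays exponentially toward 1/4 with a single coherence time. All function and parameter names are hypothetical, and repeaters and link types are not modeled.

```python
import math
import random


def fidelity_after(wait: float, f0: float, coherence_time: float) -> float:
    """Assumed decay model: fidelity relaxes from f0 toward 1/4 (fully mixed pair)."""
    return 0.25 + (f0 - 0.25) * math.exp(-wait / coherence_time)


def estimate_reliability(n_qubits: int, p_success: float, attempt_time: float,
                         f0: float, coherence_time: float, threshold: float,
                         trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which every stored pair is still above `threshold`
    at the moment the last pair is generated. Requires p_success > 0."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        clock = 0.0
        created_at = []
        for _ in range(n_qubits):
            # Repeat generation attempts until one succeeds (sequential generation).
            while True:
                clock += attempt_time
                if rng.random() < p_success:
                    break
            created_at.append(clock)
        # Earlier pairs have waited in memory for (clock - creation time).
        if all(fidelity_after(clock - t, f0, coherence_time) >= threshold
               for t in created_at):
            successes += 1
    return successes / trials
```

In this toy model the first pair waits longest, so shortening the coherence time or increasing `n_qubits` tends to lower the estimate, which matches the qualitative role the abstract assigns to memory coherence.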
☆ Performance Analysis of 5G RAN Slicing Deployment Options in Industry 4.0 Factories
This paper studies Radio Access Network (RAN) slicing strategies for 5G Industry 4.0 networks with ultra-reliable low-latency communication (uRLLC) requirements. We comparatively analyze four RAN slicing deployment options that differ in slice sharing and per-line or per-flow isolation. Unlike prior works that focus on management architectures or resource allocation under a fixed slicing structure, this work addresses the design of RAN slicing deployment options in the presence of multiple production lines and heterogeneous industrial flows. An SNC-based analytical framework and a heuristic slice planner are used to evaluate these options in terms of per-flow delay guarantees and radio resource utilization. Results show that under resource scarcity only per-flow slicing prevents delay violations by tightly matching resources to per-flow delay targets, while slice-sharing and hybrid deployments improve aggregation efficiency at the cost of weaker protection for the most delay-critical flows. Execution-time results confirm that the planner operates at Non-RT time scales, enabling its integration within O-RAN Non-RT RIC loops.
comment: This work has been accepted for publication in the Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit), 2026
☆ Trust-Aware Routing for Distributed Generative AI Inference at the Edge
Emerging deployments of Generative AI increasingly execute inference across decentralized and heterogeneous edge devices rather than on a single trusted server. In such environments, a single device failure or misbehavior can disrupt the entire inference process, making traditional best-effort peer-to-peer routing insufficient. Coordinating distributed generative inference therefore requires mechanisms that explicitly account for reliability, performance variability, and trust among participating peers. In this paper, we present G-TRAC, a trust-aware coordination framework that integrates algorithmic path selection with system-level protocol design to ensure robust distributed inference. First, we formulate the routing problem as a \textit{Risk-Bounded Shortest Path} computation and introduce a polynomial-time solution that combines trust-floor pruning with Dijkstra's search, achieving sub-millisecond median routing latency at practical edge scales, and remaining below 10 ms at larger scales. Second, to operationally support the routing logic in dynamic environments, the framework employs a \textit{Hybrid Trust Architecture} that maintains global reputation state at stable anchors while disseminating lightweight updates to edge peers via background synchronization. Experimental evaluation on a heterogeneous testbed of commodity devices demonstrates that G-TRAC significantly improves inference completion rates, effectively isolates unreliable peers, and sustains robust execution even under node failures and network partitions.
comment: 11 pages, 10 figures. Preprint accepted at the 22nd Annual International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT 2026)
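The combination of trust-floor pruning with Dijkstra's search named in the abstract above can be sketched as follows. This is a simplified reading, not G-TRAC's actual formulation: it assumes each peer has a scalar trust score, that peers below a fixed floor are simply excluded, and that edge weights are latencies. The function name, argument names, and data layout are all hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple


def trusted_shortest_path(graph: Dict[str, Dict[str, float]],
                          trust: Dict[str, float],
                          source: str, target: str,
                          trust_floor: float) -> Optional[Tuple[float, List[str]]]:
    """Dijkstra over the subgraph of peers whose trust is at least `trust_floor`.

    graph maps node -> {neighbor: non-negative latency}; unknown peers count as trust 0.
    Returns (total latency, path) or None if no sufficiently trusted path exists.
    """
    if trust.get(source, 0.0) < trust_floor or trust.get(target, 0.0) < trust_floor:
        return None
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            # Trust-floor pruning: never relax an edge into an untrusted peer.
            if trust.get(v, 0.0) < trust_floor:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    # Walk predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path
```

Pruning before relaxation keeps the search polynomial, since it is ordinary Dijkstra on a smaller graph; raising the floor trades latency for reliability and can make the target unreachable.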
☆ From Simulation to Deep Learning: Survey on Network Performance Modeling Approaches
Network performance modeling is a field that predates early computer networks and the beginning of the Internet. It aims to predict the traffic performance of packet flows in a given network. Its applications range from network planning and troubleshooting to feeding information to network controllers for configuration optimization. Traditional network performance modeling has relied heavily on Discrete Event Simulation (DES) and analytical methods grounded in mathematical theories such as Queuing Theory and Network Calculus. However, as of late, we have observed a paradigm shift, with attempts to obtain efficient Parallel DES, the surge of Machine Learning models, and their integration with other methodologies in hybrid approaches. This has resulted in a great variety of modeling approaches, each with its strengths and often tailored to specific scenarios or requirements. In this paper, we comprehensively survey the relevant network performance modeling approaches for wired networks over the last decades. With this understanding, we also define a taxonomy of approaches, summarizing our understanding of the state-of-the-art and how both technology and the concerns of the research community evolve over time. Finally, we also consider how these models are evaluated, how their different nature results in different evaluation requirements and goals, and how this may complicate their comparison.
comment: Preprint, final accepted version published on Computer Networks (DOI: 10.1016/j.comnet.2026.112253). 87 pages, 3 figures
☆ Age of Incorrect Information for Generic Discrete-Time Markov Sources
This work introduces a framework for analyzing the Age of Incorrect Information (AoII) in a real-time monitoring system with a generic discrete-time Markov source. We study a noisy communication system employing a hybrid automatic repeat request (HARQ) protocol, subject to a transmission rate constraint. The optimization problem is formulated as a constrained Markov decision process (CMDP), and it is shown that there exists an optimal policy that is a randomized mixture of two stationary policies. To overcome the intractability of computing the optimal stationary policies, we develop a multiple-threshold policy class where thresholds depend on the source, the receiver, and the packet count. By establishing a Markov renewal structure induced by threshold policies, we derive closed-form expressions for the long-term average AoII and transmission rate. The proposed policy is constructed via a relative value iteration algorithm that leverages the threshold structure to skip computations, combined with a bisection search to satisfy the rate constraint. To accommodate scenarios requiring lower computational complexity, we adapt the same technique to produce a simpler single-threshold policy that trades optimality for efficiency. Numerical experiments show that both threshold-based policies outperform periodic scheduling, with the multiple-threshold approach matching the performance of the globally optimal policy.
comment: 12 pages, 7 figures, 3 algorithms
☆ Varuna: Enabling Failure-Type Aware RDMA Failover
RDMA link failures can render connections temporarily unavailable, causing both performance degradation and significant recovery overhead. To tolerate such failures, production datacenters assign each primary link a standby link and, upon failure, uniformly retransmit all in-flight RDMA requests over the backup path. However, we observe that such blanket retransmission is unnecessary. In-flight requests can be split into pre-failure and post-failure categories depending on whether the responder has already executed them. Retransmitting post-failure requests is not only redundant (consuming bandwidth), but also incorrect for non-idempotent operations, where duplicate execution can violate application semantics. We present Varuna, a failure-type-aware RDMA recovery mechanism that enables correct retransmission and microsecond-level failover. Varuna piggybacks a lightweight completion log on every RDMA operation; after a link failure, this log deterministically reveals which in-flight requests were executed (post-failure) and which were lost (pre-failure). Varuna then retransmits only the pre-failure subset and fetches/recovers the return values for post-failure requests. Evaluated using synthetic microbenchmarks and end-to-end RDMA TPC-C transactions, Varuna incurs only 0.6-10% steady-state latency overhead in realistic applications, eliminates 65% of recovery retransmission time, preserves transactional consistency, and introduces zero connectivity rebuild overhead and negligible memory overhead during RDMA failover.
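The recovery decision described in the abstract above, splitting in-flight requests by whether the responder already executed them, can be sketched in a few lines. This is an illustration only: the `Request` type, the `plan_recovery` function, and the representation of the completion log as a dictionary from request ID to return value are assumptions, not Varuna's data structures or wire format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Request:
    req_id: int
    op: str  # e.g. "WRITE" or "FETCH_ADD" (non-idempotent)


def plan_recovery(in_flight: Iterable[Request],
                  completion_log: Dict[int, int]) -> Tuple[List[Request], Dict[int, int]]:
    """Split in-flight requests after a link failure.

    completion_log is assumed to map the ID of every request the responder
    executed to its return value. Requests found in the log are not resent;
    their results are recovered from the log instead.
    """
    retransmit: List[Request] = []
    recovered: Dict[int, int] = {}
    for req in in_flight:
        if req.req_id in completion_log:
            # Already executed by the responder: resending could apply it twice.
            recovered[req.req_id] = completion_log[req.req_id]
        else:
            # Lost before execution: safe to resend on the standby link.
            retransmit.append(req)
    return retransmit, recovered
```

The point of the split is correctness as much as bandwidth: a fetch-and-add that was executed but not acknowledged must have its result fetched, not be replayed.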
☆ YUHENG-OS: A Cloud-Native Space Cluster Operating System
As industry and academia continue to advance spaceborne computing and communication capabilities, the formation of cloud-native space clusters (CNSCs) has become an increasingly evident trend. This evolution progressively exposes the resource management challenges associated with coordinating fragmented and heterogeneous onboard resources while supporting large-scale and diverse space applications. However, directly transplanting mature terrestrial cloud-native cluster operating system paradigms into space is ineffective due to the fragmentation of spaceborne computing resources and satellite mobility, which collectively impose substantial challenges on resource awareness and orchestration. This article presents YUHENG-OS, a cloud-native space cluster operating system tailored for CNSCs. YUHENG-OS provides unified abstraction, awareness, and orchestration of heterogeneous spaceborne infrastructure, enabling cluster-wide task deployment and scheduling across distributed satellites. We introduce a four-layer system architecture and three key enabling technologies: modeling of heterogeneous resource demands for space tasks, fragmented heterogeneous resource awareness under network constraints, and matching of differentiated tasks with multidimensional heterogeneous resources under temporal dependency constraints. Evaluation results show that, compared with representative terrestrial cloud-native cluster operating systems exemplified by Kubernetes, YUHENG-OS achieves a substantially higher task completion ratio, with improvements of up to 98%. This advantage is primarily attributed to its ability to reduce resource awareness delay by 71%.
☆ Adaptive Multi-Dimensional Coordinated Comprehensive Routing Scheme for IoV
The characteristics of high-speed node movement and dynamic topology changes pose great challenges to the design of internet of vehicles (IoV) routing protocols. Existing schemes suffer from common problems such as insufficient adaptability and lack of global consideration, making it difficult to achieve a globally optimal balance between routing reliability, real-time performance and transmission efficiency. This paper proposes an adaptive multi-dimensional coordinated comprehensive routing scheme for IoV environments. A complete IoV system model including network topology, communication links, hierarchical congestion and transmission delay is first constructed, the routing problem is abstracted into a single-objective optimization model with multiple constraints, and a single-hop link comprehensive routing metric integrating link reliability, node local load, network global congestion and link stability is defined. Second, an intelligent transmission switching mechanism is designed: candidate nodes are screened through dual criteria of connectivity and progressiveness, a dual decision-making of primary and backup paths and a threshold switching strategy are introduced to avoid link interruption and congestion, and an adaptive update function is constructed to dynamically adjust weight coefficients and switching thresholds to adapt to changes in network status. Simulation results show that the proposed scheme can effectively adapt to the high dynamic topology and network congestion characteristics of IoV, perform excellently in key indicators such as routing interruption times, packet delivery rate and end-to-end delay, and its comprehensive performance is significantly superior to traditional routing schemes.
comment: 8 pages, 8 figures. An adaptive multi-dimensional coordinated comprehensive routing scheme for IoV environments
♻ ☆ Green-LLM: Optimal Workload Allocation for Environmentally-Aware Distributed Inference
This letter investigates the optimal allocation of large language model (LLM) inference workloads across heterogeneous edge data centers (DCs) over time. Each DC features on-site renewable generation and faces dynamic electricity prices and spatiotemporal variability in renewable availability. The central question is: how can inference workloads be optimally distributed to the DCs to minimize energy consumption, carbon emissions, and water usage while enhancing user experience? This letter proposes a novel optimization model for LLM service providers to reduce operational costs and environmental impacts. Numerical results validate the efficacy of the proposed approach.
comment: 5 pages, 11 figures
♻ ☆ Entropy-Aware Task Offloading in Mobile Edge Computing
Mobile Edge Computing (MEC) technology has been introduced to enable cloud computing at the edge of the network in order to help resource-limited mobile devices with time-sensitive data processing tasks. In this paradigm, mobile devices can offload their computationally heavy tasks to more efficient nearby MEC servers via wireless communication. Consequently, the main focus of research on the subject has been on the development of efficient offloading schemes, leaving the privacy of mobile users out. While Blockchain technology is used as the trust mechanism for secure sharing of the data, the privacy issues induced by wireless communication, namely usage pattern and location privacy, are the centerpiece of this work. The effects of these privacy concerns on the task offloading Markov Decision Process (MDP) are addressed, and the MDP is solved using a Deep Recurrent Q-Network (DRQN). Numerical simulations are presented to show the effectiveness of the proposed method.
comment: 13 pages, submitted to Journal of Blockchain Research
♻ ☆ Quality of Coverage (QoC): Quantifying Cellular Network Coverage Quality, Usability and Stability
Characterizing cellular network performance is complex. Current representations of cellular coverage, such as service provider and FCC coverage maps, focus only on the minimal level of available bandwidth (e.g., 35/3Mbps download/upload speed for 5G) and omit critical dimensions of quality: network usability and stability over space and time. Because cellular performance can vary substantially along both dimensions, a more fine-grained characterization is necessary. We introduce Quality of Coverage (QoC), a novel multi-dimensional set of key performance indicators (KPIs) that capture measured temporal and spatial performance quality, usability and stability. To evaluate QoC, we first analyze whether the QoC KPIs accurately reflect expected network behavior at individual locations and across spatially-aggregated regions. Then, we apply QoC to more than 15 million measurements from a production network to evaluate its ability to characterize real-world network behavior. Together, our results demonstrate the need for KPIs that capture the full spectrum of cellular performance and show how QoC enables rigorous evaluation of coverage quality across multiple geographic scales.
♻ ☆ AI Infrastructure Sovereignty
Artificial intelligence has shifted from a software-centric discipline to an infrastructure-driven system. Large-scale training and inference increasingly depend on tightly coupled data centers, high-capacity optical networks, and energy systems operating close to physical and environmental limits. As a result, control over data and algorithms alone is no longer sufficient to achieve meaningful AI sovereignty. Practical sovereignty now depends on who can deploy, operate, and adapt AI infrastructure under constraints imposed by energy availability, sustainability targets, and network reach. This tutorial-survey introduces the concept of AI infrastructure sovereignty, defined as the ability of a region, operator, or nation to exercise operational control over AI systems within physical and environmental limits. The paper argues that sovereignty emerges from the co-design of three layers: AI-oriented data centers, optical transport networks, and automation frameworks that provide real-time visibility and control. We analyze how AI workloads reshape data center design, driving extreme power densities, advanced cooling requirements, and tighter coupling to local energy systems, with sustainability metrics such as carbon intensity and water usage acting as hard deployment boundaries. We then examine optical networks as the backbone of distributed AI, showing how latency, capacity, failure domains, and jurisdictional control define practical sovereignty limits. Building on this foundation, the paper positions telemetry, agentic AI, and digital twins as enablers of operational sovereignty through validated, closed-loop control across compute, network, and energy domains. The tutorial concludes with a reference architecture for sovereign AI infrastructure that integrates telemetry pipelines, agent-based control, and digital twins, framing sustainability as a first-order design constraint.
comment: 27 pages, 7 figures
♻ ☆ MPLS Network Actions: Technological Overview and P4-Based Implementation on a High-Speed Switching ASIC
In MPLS, packets are encapsulated with labels that add domain-specific forwarding information. Special purpose labels were introduced to trigger special behavior in MPLS nodes but their number is limited. Therefore, the IETF proposed the MPLS Network Actions (MNA) framework. It extends MPLS with new features, some of which have already been defined to support relevant use cases. This paper provides a comprehensive technological overview of MNA concepts and use cases. It compares MNA to IPv6 extension headers (EHs) that serve a similar purpose, and argues that MNA can be better deployed than EHs. It then presents P4-MNA, a first hardware implementation running at 400 Gb/s per port. Scalability and performance of P4-MNA are evaluated, showing negligible impact on processing delay caused by network actions. Moreover, the applicability of MNA is demonstrated by implementing the use cases of link-specific packet loss measurement using the alternate-marking-method (AMM) and bandwidth reservation using network slicing. We identify header stacking constraints resulting from hardware resources and from the number of network actions that must be supported according to the MNA encoding. They make an implementation for hardware that can only parse a few MPLS headers infeasible. We propose to make the number of supported network actions a node parameter and signal this in the network. Then, an upgrade to MNA is also feasible for hardware with fewer available resources. We explain that for MNA with in-stack data (ISD), some header bits must remain unchanged during forwarding, and give an outlook on post-stack data (PSD).
comment: After publication in IEEE OJCOMS, it was identified that Figures 8(a), 8(b) and 27 were incorrectly rendered due to duplication and formatting issues introduced during production. An official Erratum has been published and is linked on the IEEE Xplore page. The figures included in this version are correct. Fixed the DOI on the title page
♻ ☆ Direct-to-Device Connectivity for Integrated Communication, Navigation and Surveillance
Sixth-generation (6G) communication systems are expected to support direct-to-device (D2D) connectivity, enabling standard user equipment (UE) to seamlessly transition to non-terrestrial network (NTN), particularly satellite communication mode, when operating beyond terrestrial network (TN) coverage. This D2D concept does not require hardware modifications to conventional UEs and eliminates the need for dedicated satellite ground terminals. D2D-capable UEs can be mounted on both manned and unmanned aircraft; however, they are especially well-suited for low-altitude unmanned aircraft due to their compact form factor, lightweight design, energy efficiency, and TN-NTN roaming capabilities. D2D can also enable beyond-visual-line-of-sight operation by providing NTN support for Communications, Navigation, and Surveillance (CNS) services during TN outages or congestion. This paper investigates the capabilities and limitations of D2D connectivity for low-altitude unmanned aircraft operating in urban environments. We analyze the variation in line-of-sight probability for both TN and NTN links as a function of aircraft altitude. We further compute path loss and received signal strength while accounting for a representative TN deployment with down-tilted antennas. The results show that the TN and NTN links complement each other, significantly improving the availability of the CNS service at low altitudes. These findings provide insights to support the design and optimization of future 6G-enabled integrated CNS services.
comment: This work has been accepted in IEEE 26th Integrated Communications, Navigation and Surveillance Conference, April 14-16, 2026, Herndon, Virginia, USA. Copyright has been transferred to IEEE
Distributed, Parallel, and Cluster Computing 22
☆ SteelDB: Diagnosing Kernel-Space Bottlenecks in Cloud OLTP Databases
Modern cloud OLTP databases have sought performance primarily through user-space optimization - separating storage and compute layers, or distributing transactions across multiple nodes using consensus algorithms. This paper turns attention to a previously unexplored layer: kernel-space I/O behavior. From an on-premises perspective, where a single server with local storage delivers excellent performance, these elaborate designs seem puzzling. Why do cloud databases require such architectural complexity? We investigate this through a pathological analysis of databases that rely on OS-level I/O control in cloud-specific storage environments. We show that bottlenecks widely attributed to network or storage architectures in fact originate in kernel-space I/O behavior. Based on this diagnosis, we derive treatment principles and realize them as SteelDB, a zero-patch architecture that improves database performance on general-purpose cloud distributed block storage through strategic I/O optimization without requiring kernel or database patches. TPC-C evaluations demonstrate that SteelDB achieves up to 9x performance improvement at no additional cost. Against Amazon Aurora, SteelDB achieved 3.1x higher performance while reducing costs by 58%, leading to a 7.3x improvement in cost efficiency. While Aurora requires an average of 254 days for major version upgrades due to applying proprietary patches to newly released OSS databases, our zero-patch architecture reduces these software maintenance costs to near zero.
☆ Building the Palmetto API: Adding granular permissions and caching to the Slurm REST API without sacrificing compatibility
The development of administrative and computational research tools requires reliable programmatic interfaces with the cluster scheduler. The Research Computing and Data (RCD) team at Clemson University has developed the Palmetto API, a proxy for the native Slurm RESTful interface, slurmrestd, that adds advanced authentication, authorization, and caching. This paper details the design and implementation of this proxy, evaluates the performance benefits from caching, and verifies compatibility with existing slurmrestd clients. The result is a lightweight and secure method of exposing our cluster scheduler to tools and automations.
comment: 7 pages, 2 figures, 2 tables; to be published in the proceedings of PEARC'26
☆ Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
Debugging distributed systems in production is inevitable and hard. Myriad interactions between concurrent components in modern, complex and large-scale systems cause non-deterministic bugs that offline testing and verification fail to capture. When bugs surface at runtime, their root causes may be far removed from their symptoms. To identify a root cause, developers often need evidence scattered across multiple components and traces. Unfortunately, existing tools fail to quickly and automatically record useful provenance information at low overheads, leaving developers to manually perform the onerous evidence collection task. Lumos is an online debugging framework that exposes application-level bug provenances--the computational history linking symptoms of an incident to their root causes. Lumos leverages dependency-guided instrumentation powered by static analysis to identify program state related to a bug's provenance, and exposes it via lightweight on-demand recording. Lumos provides developers with enough evidence to identify a bug's root cause, while incurring low runtime overhead, and given only a few occurrences of a bug.
☆ Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is $1.04\sim2.2\times$ faster and requires $1.11\sim4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
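The four pipeline steps named in the abstract can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the block-mean "key" summary, the dot-product relevancy score, and the top-k selection are all hypothetical choices made here for clarity.

```python
# Toy, pure-Python illustration of a four-step memory processing pipeline.
# All names and design choices below are assumptions made for this sketch.

def prepare_memory(vectors, block_size):
    """Step 1 (Prepare Memory): split context vectors into blocks and
    summarize each block by its element-wise mean (an assumed summary)."""
    blocks = [vectors[i:i + block_size] for i in range(0, len(vectors), block_size)]
    keys = [[sum(col) / len(block) for col in zip(*block)] for block in blocks]
    return blocks, keys

def compute_relevancy(query, keys):
    """Step 2 (Compute Relevancy): dot product between the query and each block key."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

def retrieve(blocks, scores, top_k):
    """Step 3 (Retrieval): keep the top_k highest-scoring blocks, in original order."""
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [blocks[i] for i in sorted(ranked)]

def apply_to_inference(selected_blocks):
    """Step 4 (Apply to Inference): stand-in that flattens the selected memory;
    a real system would feed it to attention or generation."""
    return [v for block in selected_blocks for v in block]

context = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
blocks, keys = prepare_memory(context, block_size=2)
scores = compute_relevancy([0, 1], keys)
memory = apply_to_inference(retrieve(blocks, scores, top_k=2))
```

In this toy setting the irregular, data-dependent work sits in steps 2 and 3, which is the kind of operation the abstract proposes offloading to the FPGA.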
☆ BitSov: A Composable Bitcoin-Native Architecture for Sovereign Internet Infrastructure
Today's internet concentrates identity, payments, communication, and content hosting under a small number of corporate intermediaries, creating single points of failure, enabling censorship, and extracting economic rent from participants. We present BitSov, an architectural framework for sovereign internet infrastructure that composes existing decentralized technologies (Bitcoin, Lightning Network, decentralized storage, federated messaging, and mesh connectivity) into a unified, eight-layer protocol stack anchored to Bitcoin's base layer. The framework introduces three architectural patterns: (1) payment-gated messaging, where every transmitted message requires cryptographic proof of a Bitcoin payment, deterring spam through economic incentives rather than moderation; (2) timechain-locked contracts, which anchor subscriptions and licenses to Bitcoin block height (the timechain) rather than calendar dates; and (3) a self-sustaining economic flywheel that converts service revenue into infrastructure growth. A dual settlement model supports both on-chain transactions for permanence and auditability and Lightning micropayments for high-frequency messaging. As a position paper, we analyze the quality attributes, discuss open challenges, and propose a research agenda for empirical validation.
comment: Accepted at BlockArch 2026, co-located with IEEE ICSA 2026. 4 pages
☆ GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference
This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
comment: 10 pages, 8 figures, 15 tables
☆ FL-PBM: Pre-Training Backdoor Mitigation for Federated Learning
Backdoor attacks pose a significant threat to the integrity and reliability of Artificial Intelligence (AI) models, enabling adversaries to manipulate model behavior by injecting poisoned data with hidden triggers. These attacks can lead to severe consequences, especially in critical applications such as autonomous driving, healthcare, and finance. Detecting and mitigating backdoor attacks is crucial across every phase of a model's lifecycle, including pre-training, in-training, and post-training. In this paper, we propose Pre-Training Backdoor Mitigation for Federated Learning (FL-PBM), a novel defense mechanism that proactively filters poisoned data on the client side before model training in a federated learning (FL) environment. The approach consists of four stages: (1) inserting a benign trigger into the data to establish a controlled baseline, (2) applying Principal Component Analysis (PCA) to extract discriminative features and assess the separability of the data, (3) performing Gaussian Mixture Model (GMM) clustering to identify potentially malicious data samples based on their distribution in the PCA-transformed space, and (4) applying a targeted blurring technique to disrupt potential backdoor triggers. Together, these steps ensure that suspicious data is detected early and sanitized effectively, thereby minimizing the influence of backdoor triggers on the global model. Experimental evaluations on image-based datasets demonstrate that FL-PBM reduces attack success rates by up to 95% compared to baseline federated learning (FedAvg) and by 30 to 80% relative to state-of-the-art defenses (RDFL and LPSF). At the same time, it maintains over 90% clean model accuracy in most experiments, achieving better mitigation without degrading model performance.
comment: 12 pages, 3 figures, 1 table, 2 algorithms, Regular Journal Paper
☆ Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory
Federated Learning (FL) is witnessing wider adoption due to its ability to benefit from large amounts of scattered data while preserving privacy. However, despite its advantages, federated learning suffers from several setbacks that directly impact the accuracy and integrity of the global model it produces. One of these setbacks is the presence of malicious clients who actively try to harm the global model by injecting backdoor data into their local models while trying to evade detection. The objective of such clients is to trick the global model into making false predictions during inference, thereby compromising the integrity and trustworthiness of the global model on which honest stakeholders rely. To mitigate such malicious behavior, we propose FedBBA (Federated Backdoor and Behavior Analysis). The proposed model aims to dampen the effect of such clients on the final accuracy, creating more resilient federated learning environments. We engineer our approach through the combination of (1) a reputation system to evaluate and track client behavior, (2) an incentive mechanism to reward honest participation and penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis (PPA) to dynamically identify and minimize the impact of malicious clients on the global model. Extensive simulations on the German Traffic Sign Recognition Benchmark (GTSRB) and Belgium Traffic Sign Classification (BTSC) datasets demonstrate that FedBBA reduces the backdoor attack success rate to approximately 1.1%--11% across various attack scenarios, significantly outperforming state-of-the-art defenses like RDFL and RoPE, which yielded attack success rates between 23% and 76%, while maintaining high normal task accuracy (~95%--98%).
comment: 12 pages, 4 images, 2 tables, 2 algorithms, Regular Journal Paper
☆ Trust-Aware Routing for Distributed Generative AI Inference at the Edge
Emerging deployments of Generative AI increasingly execute inference across decentralized and heterogeneous edge devices rather than on a single trusted server. In such environments, a single device failure or misbehavior can disrupt the entire inference process, making traditional best-effort peer-to-peer routing insufficient. Coordinating distributed generative inference therefore requires mechanisms that explicitly account for reliability, performance variability, and trust among participating peers. In this paper, we present G-TRAC, a trust-aware coordination framework that integrates algorithmic path selection with system-level protocol design to ensure robust distributed inference. First, we formulate the routing problem as a \textit{Risk-Bounded Shortest Path} computation and introduce a polynomial-time solution that combines trust-floor pruning with Dijkstra's search, achieving sub-millisecond median routing latency at practical edge scales, and remaining below 10 ms at larger scales. Second, to operationally support the routing logic in dynamic environments, the framework employs a \textit{Hybrid Trust Architecture} that maintains global reputation state at stable anchors while disseminating lightweight updates to edge peers via background synchronization. Experimental evaluation on a heterogeneous testbed of commodity devices demonstrates that G-TRAC significantly improves inference completion rates, effectively isolates unreliable peers, and sustains robust execution even under node failures and network partitions.
comment: 11 pages, 10 figures. Preprint accepted at the 22nd Annual International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT 2026)
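The combination of trust-floor pruning with Dijkstra's search described in the G-TRAC abstract can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the graph encoding, the per-peer trust scores in [0, 1], and the function name `trust_floor_shortest_path` are assumptions introduced here.

```python
import heapq

def trust_floor_shortest_path(graph, trust, source, target, trust_floor):
    """graph: {peer: [(neighbor, latency), ...]}; trust: {peer: score in [0, 1]}.
    Returns (total_latency, path) or None if no sufficiently trusted path exists.
    The data layout is hypothetical, chosen only for this sketch."""
    # Pruning step: peers below the trust floor are never considered.
    allowed = {n for n in graph if trust.get(n, 0.0) >= trust_floor}
    if source not in allowed or target not in allowed:
        return None
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph[node]:
            if nbr not in allowed:
                continue
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    # Walk predecessors back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

graph = {"a": [("b", 1.0), ("c", 5.0)], "b": [("d", 1.0)], "c": [("d", 1.0)], "d": []}
trust = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.9}
print(trust_floor_shortest_path(graph, trust, "a", "d", trust_floor=0.5))
```

With a floor of 0.5 the low-trust peer `b` is pruned, so the search accepts the slower route through `c`; lowering the floor re-admits the faster route. Pruning first keeps the search itself an unmodified Dijkstra run on a smaller graph.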
☆ FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation
In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.
☆ Warp-STAR: High-performance, Differentiable GPU-Accelerated Static Timing Analysis through Warp-oriented Parallel Orchestration
Static timing analysis (STA) is crucial for Electronic Design Automation (EDA) flows but remains a computational bottleneck. While existing GPU-based STA engines are faster than CPU-based ones, they suffer from inefficiencies, particularly intra-warp load imbalance caused by irregular circuit graphs. This paper introduces Warp-STAR, a novel GPU-accelerated STA engine that eliminates this imbalance by orchestrating parallel computations at the warp level. This approach achieves a 2.4X speedup over previous state-of-the-art (SoTA) GPU-based STA. When integrated into a timing-driven global placement framework, Warp-STAR delivers a 1.7X speedup over SoTA frameworks. The method also proves effective for differentiable gradient analysis with minimal overhead.
comment: 7 pages, 6 figures, The Chips to System Conference (DAC'26) 2026
☆ Key-Embedded Privacy for Decentralized AI in Biomedical Omics
The rapid adoption of data-driven methods in biomedicine has intensified concerns over privacy, governance, and regulation, limiting raw data sharing and hindering the assembly of representative cohorts for clinically relevant AI. This landscape necessitates practical, efficient privacy solutions, as cryptographic defenses often impose heavy overhead and differential privacy can degrade performance, leading to sub-optimal outcomes in real-world settings. Here, we present a lightweight federated learning method, INFL, based on Implicit Neural Representations that addresses these challenges. Our approach integrates plug-and-play, coordinate-conditioned modules into client models, embeds a secret key directly into the architecture, and supports seamless aggregation across heterogeneous sites. Across diverse biomedical omics tasks, including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics with both public and private data, we demonstrate that INFL achieves strong, controllable privacy while maintaining utility, preserving the performance necessary for downstream scientific and clinical applications.
☆ Pre-Deployment Complexity Estimation for Federated Perception Systems
Edge AI systems increasingly rely on federated learning to train perception models in distributed, privacy-preserving, and resource-constrained environments. Yet, before training begins, practitioners often lack practical tools to estimate how difficult a federated learning task will be in terms of achievable accuracy and communication cost. This paper presents a classifier-agnostic, pre-deployment framework for estimating learning complexity in federated perception systems by jointly modeling intrinsic properties of the data and characteristics of the distributed environment. The proposed complexity metric integrates dataset attributes such as dimensionality, sparsity, and heterogeneity with factors related to the composition of participating clients. Using federated learning as a representative distributed training setting, we examine how learning difficulty varies across different federated configurations. Experiments on multiple variants of the MNIST dataset and CIFAR dataset show that the proposed metric strongly correlates with federated learning performance and the communication effort required to reach fixed accuracy targets. These findings suggest that complexity estimation can serve as a practical diagnostic tool for resource planning, dataset assessment, and feasibility evaluation in edge-deployed perception systems.
comment: Accepted and presented at Edge AI Research Symposium 2026 (EdgeAI2026), San Diego, CA
☆ Efficient Counting and Simulation in Content-Oblivious Rings
In the content-oblivious (CO) model (proposed by Censor-Hillel et al.), processes inhabit an asynchronous network and communicate only by exchanging pulses. A series of works has clarified the computational power of this model. In particular, it was shown that, when a leader is present and the network is 2-edge-connected, content-oblivious communication can simulate classical asynchronous message passing. Subsequent results extended this equivalence to leaderless oriented and unoriented rings, and, under non-uniform assumptions, to general 2-edge-connected networks. The simulator of Censor-Hillel et al. requires $O(n^3b+n^3\log n)$ pulses to emulate the send of a single $b$-bit message, making it impractical even on modest-size networks. We focus on message-efficient computation in CO networks. We study the fundamental problem of counting in ring topologies, both because knowing the exact network size is a basic prerequisite for many distributed tasks and because counting immediately implies a broad class of aggregation primitives. We give an algorithm that counts using $O(n^{1.5})$ pulses in anonymous rings with a leader, and an $O(n\log^2 n)$ algorithm for counting in rings with IDs. Moreover, we show that any counting algorithm in CO requires $\Omega(n\log n)$ pulses. Interestingly, in the course of this investigation, we design a simulator for classic message passing: in one simulated round, each process can send a $b$-bit message to each of its neighbors using only $O(b)$ pulses per process. The simulator extends to general 2-edge-connected networks, after a pre-processing step that requires $O(n^{8}\log n)$ pulses, where $n$ is the number of processes, thus allowing efficient simulation of asynchronous message passing in general 2-edge-connected networks.
☆ Varuna: Enabling Failure-Type Aware RDMA Failover
RDMA link failures can render connections temporarily unavailable, causing both performance degradation and significant recovery overhead. To tolerate such failures, production datacenters assign each primary link a standby link and, upon failure, uniformly retransmit all in-flight RDMA requests over the backup path. However, we observe that such blanket retransmission is unnecessary. In-flight requests can be split into pre-failure and post-failure categories depending on whether the responder has already executed them. Retransmitting post-failure requests is not only redundant (consuming bandwidth), but also incorrect for non-idempotent operations, where duplicate execution can violate application semantics. We present Varuna, a failure-type-aware RDMA recovery mechanism that enables correct retransmission and microsecond-level failover. Varuna piggybacks a lightweight completion log on every RDMA operation; after a link failure, this log deterministically reveals which in-flight requests were executed (post-failure) and which were lost (pre-failure). Varuna then retransmits only the pre-failure subset and fetches/recovers the return values for post-failure requests. Evaluated using synthetic microbenchmarks and end-to-end RDMA TPC-C transactions, Varuna incurs only 0.6-10% steady-state latency overhead in realistic applications, eliminates 65% of recovery retransmission time, preserves transactional consistency, and introduces zero connectivity rebuild overhead and negligible memory overhead during RDMA failover.
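The recovery decision described in the Varuna abstract, retransmitting only requests the responder never executed and recovering return values for the rest, can be sketched in a few lines. The sketch assumes a hypothetical representation in which the completion log is a mapping from request ID to return value; the actual log format and the name `plan_recovery` are not taken from the paper.

```python
def plan_recovery(in_flight, completion_log):
    """Split in-flight requests after a link failure.

    in_flight: list of request IDs that had no acknowledgement when the link failed.
    completion_log: {request_id: return_value} for requests the responder executed
                    (a hypothetical layout chosen for this sketch).
    Returns (retransmit, recovered): IDs to resend over the standby link, and
    return values fetched from the log for already-executed requests.
    """
    retransmit = []
    recovered = {}
    for req_id in in_flight:
        if req_id in completion_log:
            # Post-failure request: already executed, so resending could
            # duplicate a non-idempotent operation. Recover its result instead.
            recovered[req_id] = completion_log[req_id]
        else:
            # Pre-failure request: never executed, safe to retransmit.
            retransmit.append(req_id)
    return retransmit, recovered

# Example: requests 11 and 13 (say, a fetch-and-add and a compare-and-swap)
# were executed before the failure; request 12 was lost in flight.
resend, results = plan_recovery([11, 12, 13], {11: 40, 13: True})
```

Here only request 12 is resent, which is what keeps a non-idempotent operation such as a fetch-and-add from being applied twice.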
☆ YUHENG-OS: A Cloud-Native Space Cluster Operating System
As industry and academia continue to advance spaceborne computing and communication capabilities, the formation of cloud-native space clusters (CNSCs) has become an increasingly evident trend. This evolution progressively exposes the resource management challenges associated with coordinating fragmented and heterogeneous onboard resources while supporting large-scale and diverse space applications. However, directly transplanting mature terrestrial cloud-native cluster operating system paradigms into space is ineffective due to the fragmentation of spaceborne computing resources and satellite mobility, which collectively impose substantial challenges on resource awareness and orchestration. This article presents YUHENG-OS, a cloud-native space cluster operating system tailored for CNSCs. YUHENG-OS provides unified abstraction, awareness, and orchestration of heterogeneous spaceborne infrastructure, enabling cluster-wide task deployment and scheduling across distributed satellites. We introduce a four-layer system architecture and three key enabling technologies: modeling of heterogeneous resource demands for space tasks, fragmented heterogeneous resource awareness under network constraints, and matching of differentiated tasks with multidimensional heterogeneous resources under temporal dependency constraints. Evaluation results show that, compared with representative terrestrial cloud-native cluster operating systems exemplified by Kubernetes, YUHENG-OS achieves a substantially higher task completion ratio, with improvements of up to 98%. This advantage is primarily attributed to its ability to reduce resource awareness delay by 71%.
♻ ☆ Green-LLM: Optimal Workload Allocation for Environmentally-Aware Distributed Inference
This letter investigates the optimal allocation of large language model (LLM) inference workloads across heterogeneous edge data centers (DCs) over time. Each DC features on-site renewable generation and faces dynamic electricity prices and spatiotemporal variability in renewable availability. The central question is: how can inference workloads be optimally distributed to the DCs to minimize energy consumption, carbon emissions, and water usage while enhancing user experience? This letter proposes a novel optimization model for LLM service providers to reduce operational costs and environmental impacts. Numerical results validate the efficacy of the proposed approach.
comment: 5 pages, 11 figures
♻ ☆ When AI Bends Metal: AI-Assisted Optimization of Design Parameters in Sheet Metal Forming
Numerical simulations have revolutionized the industrial design process by reducing prototyping costs and design iterations and by enabling product engineers to explore the design space more efficiently. However, the growing scale of simulations demands substantial expert knowledge, computational resources, and time. A key challenge is identifying input parameters that yield optimal results, as iterative simulations are costly and can have a large environmental impact. This paper presents an AI-assisted workflow that reduces expert involvement in parameter optimization through the use of Bayesian optimization. Furthermore, we present an active learning variant of the approach, assisting the expert if desired. A deep learning model provides an initial parameter estimate, from which the optimization cycle iteratively refines the design until a termination condition (e.g., energy budget or iteration limit) is met. We demonstrate our approach, based on a sheet metal forming process, and show how it enables us to accelerate the exploration of the design space while reducing the need for expert involvement.
comment: 20 pages
♻ ☆ Finding a Fair Scoring Function for Top-$k$ Selection: From Hardness to Practice
Selecting a subset of the $k$ "best" items from a dataset of $n$ items, based on a scoring function, is a key task in decision-making. Given the rise of automated decision-making software, it is important that the outcome of this process, called top-$k$ selection, is fair. Here we consider the problem of identifying a fair linear scoring function for top-$k$ selection. The function computes a score for each item as a weighted sum of its (numerical) attribute values, and must ensure that the selected subset includes adequate representation of a minority or historically disadvantaged group. Existing algorithms do not scale efficiently, particularly in higher dimensions. Our hardness analysis shows that in more than two dimensions, no algorithm is likely to achieve good scalability with respect to dataset size, and the computational complexity is likely to increase rapidly with dimensionality. However, the hardness results also provide key insights guiding algorithm design, leading to our two-pronged solution: (1) For small values of $k$, our hardness analysis reveals a gap in the hardness barrier. By addressing various engineering challenges, including achieving efficient parallelism, we turn this potential of efficiency into an optimized algorithm delivering substantial practical performance gains. (2) For large values of $k$, where the hardness is robust, we employ a practically efficient algorithm which, despite being theoretically worse, achieves superior real-world performance. Experimental evaluations on real-world datasets then explore scenarios where worst-case behavior does not manifest, identifying areas critical to practical performance. Our solution achieves speed-ups of up to several orders of magnitude compared to SOTA, an efficiency made possible through a tight integration of hardness analysis, algorithm design, practical engineering, and empirical evaluation.
comment: Abstract shortened to meet arXiv requirements; an extended abstract to appear at SoCG 2026
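To make the problem statement concrete, the sketch below scores items with a linear function, selects the top-$k$, and checks a representation constraint. It only illustrates the problem setup, not the paper's algorithms; the item layout, the names `top_k` and `is_fair`, and the use of a simple lower bound on protected-group members as the fairness criterion are assumptions.

```python
def top_k(items, weights, k):
    """Rank items by the weighted sum of their attributes and keep the best k.
    items: list of dicts with 'name', 'attrs' (tuple of numbers), 'protected' (bool)."""
    def score(item):
        return sum(w * a for w, a in zip(weights, item["attrs"]))
    return sorted(items, key=score, reverse=True)[:k]

def is_fair(selected, min_protected):
    """Assumed fairness test: at least min_protected selected items
    belong to the protected group."""
    return sum(1 for item in selected if item["protected"]) >= min_protected

items = [
    {"name": "A", "attrs": (9, 1), "protected": False},
    {"name": "B", "attrs": (8, 2), "protected": False},
    {"name": "C", "attrs": (3, 9), "protected": True},
    {"name": "D", "attrs": (2, 8), "protected": True},
]

# Weighting only the first attribute selects A and B: no protected items.
unfair = top_k(items, weights=(1.0, 0.0), k=2)
# Shifting weight toward the second attribute selects C and D.
fair = top_k(items, weights=(0.4, 0.6), k=2)
```

Finding a fair scoring function then amounts to searching the weight space for a vector whose top-$k$ passes the check, which is the search the abstract's hardness results concern.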
♻ ☆ Types for Grassroots Logic Programs
Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. "Logic Programs as Types for Logic Programs" (LICS'91) defined types as regular sets of paths over derivable ground atoms. Here, we define types to be regular sets of moded paths, where a mode captures directionality of communication -- whether a subterm is consumed from or produced to the environment -- enabling the typing of interactive partial computations including those that eventually deadlock or fail, or never terminate. We provide a syntactic definition of well-typing and prove that a program is well-typed iff the path abstraction of its moded-atom semantics satisfies covariance and contravariance conditions with respect to its type. The GLP type system was implemented in Dart by AI, starting from a mathematical specification of Typed GLP (this paper), deriving from it an English spec (written by AI), and from the spec deriving Dart code (by AI). While GLP is naturally untyped, the motivation for Typed GLP comes from programming with AI: Asking AI to program complex communication modalities in GLP (and in general) and hoping for the best is a tenuous strategy. The emerging discipline we advocate and employ is for the human designer and AI to jointly develop and agree upon (1) GLP types; (2) GLP procedure type declarations; (3) informal (English) descriptions of the procedures; and only then let AI attempt to write (4) GLP code based on those.
♻ ☆ Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture
Classical Amdahl's Law quantifies the limit of speedup under a fixed serial-parallel decomposition and homogeneous replication. Modern systems instead allocate constrained resources across heterogeneous hardware while the workload itself changes: some stages become effectively bounded, whereas others continue to absorb additional compute because more compute still creates value. This paper reformulates Amdahl's Law around that shift. We replace processor count with an allocation variable, replace the classical parallel fraction with a value-scalable fraction, and model specialization by a relative efficiency ratio between dedicated and programmable compute. The resulting objective yields a finite collapse threshold. For a specialized efficiency ratio R, there is a critical scalable fraction S_c = 1 - 1/R beyond which the optimal allocation to specialization becomes zero. Equivalently, for a given scalable fraction S, the minimum efficiency ratio required to justify specialization is R_c = 1/(1-S). Thus, as value-scalable workload grows, specialization faces a rising bar. The point is not that programmable hardware is always superior, but that specialization must keep re-earning its place against a moving programmable substrate. The model helps explain increasing GPU programmability, the migration of value-producing work toward learned late-stage computation, and why AI domain-specific accelerators do not simply displace the GPU.
comment: Use: 18 pages, 5 figures. arXiv version v3
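The two threshold formulas quoted in the abstract, S_c = 1 - 1/R and R_c = 1/(1-S), are inverses of each other, which a few lines of arithmetic make concrete. The sketch below encodes only those two formulas as stated; the function names and input checks are assumptions, and the underlying allocation objective from the paper is not reproduced here.

```python
def critical_scalable_fraction(R: float) -> float:
    """S_c = 1 - 1/R: scalable fraction beyond which the optimal
    allocation to specialization is zero, for efficiency ratio R."""
    if R <= 0:
        raise ValueError("efficiency ratio R must be positive")
    return 1.0 - 1.0 / R


def critical_efficiency_ratio(S: float) -> float:
    """R_c = 1/(1 - S): minimum efficiency ratio needed to justify
    specialization at scalable fraction S."""
    if not 0.0 <= S < 1.0:
        raise ValueError("scalable fraction S must lie in [0, 1)")
    return 1.0 / (1.0 - S)


# A 4x efficiency advantage stops paying off once 75% of the workload
# is value-scalable; conversely, S = 0.75 demands at least R = 4.
print(critical_scalable_fraction(4.0))
print(critical_efficiency_ratio(0.75))
```

Because R_c grows without bound as S approaches 1, the "rising bar" described in the abstract follows directly from the second formula.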
♻ ☆ LiFeChain: Lightweight Blockchain for Secure and Efficient Federated Lifelong Learning in IoT
Internet of Things (IoT) devices constantly generate heterogeneous data streams, driving demand for continuous, decentralized intelligence. Federated Lifelong Learning (FLL) provides an ideal solution by incorporating federated learning and lifelong learning. However, the extended lifecycle of FLL in IoT systems increases their vulnerability to persistent attacks. This problem is exacerbated by the single point of failure. Furthermore, the single point of trust created by the central server hinders reliable auditing for long-term threats. Blockchain technology provides a tamper-proof foundation for trustworthy FLL. Nevertheless, directly applying blockchain to FLL significantly increases computational and retrieval costs with the expansion of the knowledge base, slowing down the training on resource-constrained IoT devices. To address these challenges, we propose LiFeChain, a lightweight blockchain for secure and efficient federated lifelong learning with minimal on-chain disclosure and bidirectional verification. LiFeChain is the first blockchain tailored for FLL. It incorporates two complementary mechanisms: the Proof-of-Model-Correlation (PoMC) consensus on the server, which couples learning and unlearning mechanisms to mitigate negative transfer; and Segmented Zero-knowledge Arbitration (Seg-ZA) at the client, which detects and arbitrates abnormal committee behavior without compromising privacy. LiFeChain is a plug-and-play component that can be seamlessly integrated into existing FLL algorithms for IoT applications. To demonstrate its practicality and performance, we implement LiFeChain in representative FLL algorithms with Hyperledger Fabric under 6 attacks. Theoretical analysis and extensive evaluations demonstrate that LiFeChain effectively mitigates long-term attacks, and significantly reduces latency and storage overhead compared to state-of-the-art blockchain solutions.
Human-Computer Interaction 7
☆ Comparing Design Metaphors and User-Driven Metaphors for Interaction Design
Metaphors enable designers to communicate their ideal user experience for platforms. Yet, we often do not know if these design metaphors match users' actual experiences. In this work, we compare design and user metaphors across three different platforms: ChatGPT, Twitter, and YouTube. We build on prior methods to elicit 554 user metaphors, as well as ratings on how well each metaphor describes users' experiences. We then identify 21 design metaphors by analyzing each platform's historical web presence since their launch date. We find that design metaphors often do not match the metaphors that users use to describe their experiences. Even when design and user metaphors do match, the metaphors do not always resonate universally. Through these findings, we highlight how comparing design and user metaphors can help to evaluate and refine metaphors for user experience.
comment: Accepted at CHI 2026
☆ Visualization use in qualitative research reports: Evolving media types and competing epistemologies
Little is known about the representations used in qualitative research studies and why. A data-driven literature review was employed to explore the use of media in qualitative research reporting. A study by Verdinelli & Scagnoli (2013) was replicated and extended by conducting a content analysis of papers and figures published across three qualitative methods journals between 2020 and 2022. Figures were categorized by types (e.g., matrix-based, Venn diagrams, flowcharts) and documents were grouped by their epistemological stances (i.e., objectivist, subjectivist, or constructivist) before conducting a correspondence analysis and epistemic network analysis. Our findings suggest that (1) visual media have remained largely absent, (2) figure types have become more diverse, and (3) the use of figure types is likely independent of epistemological stance but provides opportunities for further exploration. These findings provide a foundation for impactful integration of data visualization tools to enhance the communicative power of findings across disciplines.
comment: 10 pages, 3 figures, ACM CHI '26 Conference Data Literacy Workshop, April 13-17, Barcelona, ES
☆ Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images
Facial Emotion Recognition is a critical research area within Affective Computing due to its wide-ranging applications in Human Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.
comment: 18 pages, 12 figures, 2 tables. Accepted for publication at IEEE Transactions on Affective Computing
☆ "Re-Tell the Fortune so I Can Believe It": How Chinese User Communities Engage with and Interpret GenAI-based Fortune-Telling SC
People traditionally divine the future by interpreting natural phenomena as oracular signals, especially in societies adhering to traditional beliefs like China. With the advent of Generative AI (GenAI), people gain access to new ways of probing digital oracles for predicting the future. To understand how people use and interpret GenAI for divination in China, we interviewed 22 participants who habitually use GenAI platforms for fortune-telling, complemented by a three-week digital ethnography with 1,842 community posts. Qualitative analysis showed that people who seek psychological comfort are particularly receptive to GenAI-based decision-making. Users valued GenAI's accessibility, convenience, and efficiency while perceiving its lack of spiritual mystique. We observed community dynamics forming around GenAI tools, where users reinforce interpretations by sharing and discussing with each other, repeating queries until responses align with expectations. Our work uncovers how AI technologies change the way people and communities engage in traditional cultural practices while yearning for the same goals.
comment: 31 pages, 9 figures. Accepted to CSCW 2026
☆ Feeds Don't Tell the Whole Story: Measuring Online-Offline Emotion Alignment
In contemporary society, social media is deeply integrated into daily life, yet emotional expression often differs between real and online contexts. We studied the Persian community on X to explore this gap, designing a human-centered pipeline to measure alignment between real-world and social media emotions. Recent tweets and images of participants were collected and analyzed using Transformers-based text and image sentiment modules. Friends of participants provided insights into their real-world emotions, which were compared with online expressions using a distance criterion. The study involved N=105 participants, 393 friends, over 8,300 tweets, and 2,000 media images. Results showed only 28% similarity between images and real-world emotions, while tweets aligned about 76% with participants' real-life feelings. Statistical analyses confirmed significant disparities in sentiment proportions across images, tweets, and friends' perceptions, highlighting differences in emotional expression between online and offline environments and demonstrating practical utility of the proposed pipeline for understanding digital self-presentation.
☆ Invasive and Non-Invasive Neural Decoding of Motor Performance in Parkinson's Disease for Personalized Deep Brain Stimulation
Decoding motor performance from brain signals offers promising avenues for adaptive deep brain stimulation (aDBS) for Parkinson's disease (PD). In a two-center cohort of 19 PD patients executing a drawing task, we decoded motor performance from electroencephalography (n=15) and, critically for clinical translation, electrocorticography (n=4). Within each session, patients performed the task under DBS on and DBS off. A total of 35 sessions were recorded. Instead of relying on single frequency bands, we derived patient-specific biomarkers using a filterbank-based machine-learning approach. DBS modulated kinematics significantly in 23 sessions. Significant neural decoding of kinematics was possible in 28 of the 35 sessions (average Pearson's $r = 0.37$). Our results further demonstrate modulation of speed-accuracy trade-offs, with increased drawing speed but reduced accuracy under DBS. Joint evaluation of behavioral and neural decoding outcomes revealed six prototypical scenarios, for which we provide guidance for future aDBS strategies.
♻ ☆ Who Leads? Comparing Human-Centric and Model-Centric Strategies for Defining ML Target Variables
Predictive modeling has the potential to enhance human decision-making. However, many predictive models fail in practice due to problematic problem formulation in cases where the prediction target is an abstract concept or construct and practitioners need to define an appropriate target variable as a proxy to operationalize the construct of interest. The choice of an appropriate proxy target variable is rarely self-evident in practice, requiring both domain knowledge and iterative data modeling. This process is inherently collaborative, involving both domain experts and data scientists. In this work, we explore how human-machine teaming can support this process by accelerating iterations while preserving human judgment. We study the impact of two human-machine teaming strategies on proxy construction: 1) relevance-first: humans leading the process by selecting relevant proxies, and 2) performance-first: machines leading the process by recommending proxies based on predictive performance. Based on a controlled user study of a proxy construction task (N = 20), we show that the performance-first strategy facilitated faster iterations and decision-making, but also biased users towards well-performing proxies that are misaligned with the application goal. Our study highlights the opportunities and risks of human-machine teaming in operationalizing machine learning target variables, yielding insights for future research to explore the opportunities and mitigate the risks.
comment: 23 pages, 6 figures
Programming Languages 3
☆ The Case for Multi-Version Experimental Evaluation (MVEE)
In the database community, we typically evaluate new methods based on experimental results, which we produce by integrating the proposed method along with a set of baselines in a single benchmarking codebase and measuring the individual runtimes. If we are unhappy with the performance of our method, we gradually improve it while repeatedly comparing to the baselines, until we outperform them. While this seems like a reasonable approach, it makes one delicate assumption: We assume that across the optimization workflow, there exists only a single compiled version of each baseline to compare to. However, we learned the hard way that in practice, even though the source code remains untouched, general purpose compilers might still generate highly different compiled code across builds, caused by seemingly unrelated changes in other parts of the codebase, leading to flawed comparisons and evaluations. To tackle this problem, we propose the concept of Multi-Version Experimental Evaluation (MVEE). MVEE automatically and transparently analyzes subsequent builds on the assembly code level for occurring "build anomalies" and materializes them as new versions of the methods. As a consequence, all observed versions of the respective methods can be included in the experimental evaluation, highly increasing its quality and overall expressiveness.
☆ DriftScript: A Domain-Specific Language for Programming Non-Axiomatic Reasoning Agents
Non-Axiomatic Reasoning Systems (NARS) provide a framework for building adaptive agents that operate under insufficient knowledge and resources. However, the standard input language, Narsese, poses a usability barrier: its dense symbolic notation, overloaded punctuation, and implicit conventions make programs difficult to read, write, and maintain. We present DriftScript, a Lisp-like domain-specific language that compiles to Narsese. DriftScript provides source-level constructs covering the major sentence and term forms used in Non-Axiomatic Logic (NAL) levels 1 through 8, including inheritance, temporal implication, variable quantification, sequential conjunction, and operation invocation, while replacing symbolic syntax with readable keyword-based S-expressions. The compiler is a zero-dependency, four-stage pipeline implemented in 1,941 lines of C99. When used with the DriftNARS engine, DriftScript programs connect to external systems through four structured callback types and an HTTP operation registry, enabling a sense-reason-act loop for autonomous agents. We describe the language design and formal grammar, detail the compiler architecture, and evaluate the compiler through a 106-case test suite, equivalence testing against hand-written Narsese, a NAL coverage analysis, structural readability metrics, and compilation benchmarks. The source code is available at https://github.com/seamus-brady/DriftNARS. This paper focuses on the design and implementation of the DriftScript language and its embedding into DriftNARS, rather than on new inference algorithms for NARS itself.
♻ ☆ Denotation-based Compositional Compiler Verification
A desired but challenging property of compiler verification is compositionality, in the sense that the compilation correctness of a program can be deduced incrementally from that of its substructures ranging from statements, functions, and modules. This article proposes a novel compiler verification framework based on denotational semantics for better compositionality, compared to previous approaches based on small-step operational semantics and simulation theories. Our denotational semantics is defined by semantic functions that map a syntactic component to a semantic domain composed of multiple behavioral \emph{sets}, with compiler correctness established through behavior refinement between the semantic domains of the source and target programs. The main contributions of this article include proposing a denotational semantics for open modules, a novel semantic linking operator, and a refinement algebra that unifies various behavior refinements, making compiler verification structured and compositional. Furthermore, our formalization captures the full meaning of a program and bridges the gap between traditional power-domain-based denotational semantics and the practical needs of compiler verification. We apply our denotation-based framework to verify the front-end of CompCert and typical optimizations on simple prototypes of imperative languages. Our results demonstrate that the compositionality from sub-statements to statements, from functions to modules, and from modules to the whole program can be effectively achieved.
comment: 65 pages; 21 figures. Accepted manuscript of an article published in ACM Transactions on Programming Languages and Systems (TOPLAS). Artifact: https://github.com/chaptercheng/denotational-framework-for-compcert
Networking and Internet Architecture 13
☆ Energy Efficient Orchestration in Multiple-Access Vehicular Aerial-Terrestrial 6G Networks
The proliferation of users, devices, and novel vehicular applications - propelled by advancements in autonomous systems and connected technologies - is precipitating an unprecedented surge in novel services. These emerging services require substantial bandwidth allocation, adherence to stringent Quality of Service (QoS) parameters, and energy-efficient implementations, particularly within highly dynamic vehicular environments. The complexity of these requirements necessitates a fundamental paradigm shift in service orchestration methodologies to facilitate seamless and robust service delivery. This paper addresses this challenge by presenting a novel framework for service orchestration in Unmanned Aerial Vehicles (UAV)-assisted 6G aerial-terrestrial networks. The proposed framework synergistically integrates UAV trajectory planning, Multiple-Access Control (MAC), and service placement to facilitate energy-efficient service coverage while maintaining ultra-low latency communication for vehicular user service requests. We first present a non-linear programming model that formulates the optimization problem. Next, to address the problem, we employ a Hierarchical Deep Reinforcement Learning (HDRL) algorithm that dynamically predicts service requests, user mobility, and channel conditions, addressing the challenges of interference, resource scarcity, and mobility in heterogeneous networks. Simulation results demonstrate that the proposed framework outperforms state-of-the-art solutions in request acceptance, energy efficiency, and latency minimization, showcasing its potential to support the high demands of next-generation vehicular networks.
☆ Fronthaul Network Planning for Hierarchical and Radio-Stripes-Enabled CF-mMIMO in O-RAN
The deployment of ultra-dense networks (UDNs), particularly cell-free massive MIMO (CF-mMIMO), is mainly hindered by costly and capacity-limited fronthaul links. This work proposes a two-tiered optimization framework for cost-effective hybrid fronthaul planning, comprising a Near-Optimal Fronthaul Association and Configuration (NOFAC) algorithm in the first tier and an Integer Linear Program (ILP) in the second, integrating fiber optics, millimeter-wave (mmWave), and free-space optics (FSO) technologies. The proposed framework accommodates various functional split (FS) options (7.2x and 8), decentralized processing levels, and network configurations. We introduce the hierarchical scheme (HS) as a resilient, cost-effective fronthaul solution for CF-mMIMO and compare its performance with radio-stripes (RS)-enabled CF-mMIMO, validating both across diverse dense topologies within the open radio access network (O-RAN) architecture. Results show that the proposed framework achieves better cost-efficiency and higher capacity than traditional benchmark schemes such as an all-fiber fronthaul network. Our key findings reveal that fiber dominates in highly decentralized deployments, that mmWave suits moderately centralized scenarios, and that FSO complements both by bridging deployment gaps. Additionally, FS7.2x consistently outperforms FS8, offering greater capacity at lower cost, affirming its role as the preferred O-RAN functional split. Most importantly, our study underscores the importance of effective hybrid fronthaul planning for UDNs in minimizing infrastructural redundancy and ensuring scalability to meet current and future traffic demands.
☆ LP-Based Algorithms for Scheduling in a Quantum Switch
We consider scheduling in a quantum switch with stochastic entanglement generation, finite quantum memories, and decoherence. The objective is to design a scheduling algorithm with polynomial-time computational complexity that stabilizes a nontrivial fraction of the capacity region. Scheduling in such a switch corresponds to finding a matching in a graph subject to additional constraints. We propose an LP-based policy, which finds a point in the matching polytope, which is further implemented using a randomized decomposition into matchings. The main challenge is that service over an edge is feasible only when entanglement is simultaneously available at both endpoint memories, so the effective service rates depend on the steady-state availability induced by the scheduling rule. To address this, we introduce a single-node reference Markov chain and derive lower bounds on achievable service rates in terms of the steady-state nonemptiness probabilities. We then use a Lyapunov drift argument to show that, whenever the request arrival rates lie within the resulting throughput region, the proposed algorithm stabilizes the request queues. We further analyze how the achievable throughput depends on entanglement generation rates, decoherence probabilities, and buffer sizes, and show that the throughput lower bound converges exponentially fast to its infinite-buffer limit as the memory size increases. Numerical results illustrate that the guaranteed throughput fraction is substantial for parameter regimes relevant to near-term quantum networking systems.
☆ Tracking without Seeing: Geospatial Inference using Encrypted Traffic from Distributed Nodes
Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic, such as packet sizes, from cameras with inaccessible streams. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object's position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves 2.33 meters tracking error (Euclidean distance) without raw signal access, within the dimensions of tracked objects (4.61m x 1.93m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
☆ Serverless5GC: Private 5G Core Deployment via a Procedure-as-a-Function Architecture
Open-source 5G core implementations deploy network functions as always-on processes that consume resources even when idle. This inefficiency is most acute in private and edge deployments with sporadic traffic. Serverless5GC is an architecture that maps each 3GPP control-plane procedure to an independent Function-as-a-Service invocation, allowing scale-to-zero operation without modifying the standard N2 interface. The system decomposes 12 network functions (Release 15-17) into 31 serverless procedures, fronted by an SCTP/NGAP proxy that bridges unmodified RAN equipment to an HTTP-based serverless backend. Evaluation against Open5GS and free5GC across five traffic scenarios (idle to 20 registrations/s burst) shows that Serverless5GC achieves median registration latency of 406-522 ms, on par with the C-based Open5GS baseline (403-606 ms), while maintaining 100% success across 3,000 registrations. A resource-time cost model shows that the serverless deployment (0.002 GB-seconds per registration) is cheaper than the always-on baseline when the cluster operates below a 0.65 duty cycle, when two or more tenants share the platform, or on managed FaaS platforms up to 609 reg/s. Under worst-case cold-start conditions where all 31 function pods are evicted simultaneously, the system sustains zero failures and converges to warm-start latency within 4-5 seconds.
☆ RADAR-Q: Resource-Aware Distributed Asynchronous Routing for Entanglement Distribution in Multi-Tenant Quantum Networks
Scalable quantum networks must support concurrent entanglement requests, yet existing routing protocols fail when users compete for shared repeater resources, wasting fragile quantum states. This paper presents RADAR-Q, a resource-aware decentralized routing protocol embedding real-time resource contention into path selection. Unlike prior designs requiring global coordination or central anchors, RADAR-Q makes intelligent local decisions balancing path length and fidelity, instantaneous quantum memory availability, and intermediate Bell-State Measurement (BSM) operations. By identifying the Nearest Common Ancestor (NCA) within a DODAG hierarchy, RADAR-Q localizes entanglement swapping close to communicating users - avoiding unnecessary central detours and reducing BSM chain length and decoherence exposure. We evaluate RADAR-Q on grid and random topologies against synchronous and root-centric asynchronous baselines. Results show RADAR-Q achieves aggregate throughputs 2.5x and 7.6x higher than synchronized and root-centric designs, respectively. While baselines suffer catastrophic fidelity collapse below the 0.5 threshold under high load, RADAR-Q consistently maintains end-to-end fidelity above 0.76, ensuring pairs remain usable. Furthermore, RADAR-Q exhibits near-perfect fairness (Jain's Fairness Index 96-98%) and retains over 50% of its ideal throughput under stringent 1.0 ms coherence times. These findings establish contention-aware decentralized routing as a scalable foundation for multi-tenant quantum networks.
comment: 11 pages. Submitted to the Fifth International Conference on Innovations in Computing Research (ICR'26)
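The fairness figure reported above uses Jain's Fairness Index, which for per-user allocations x_1..x_n is (sum x_i)^2 / (n * sum x_i^2). It equals 1 when all users receive the same allocation and 1/n when a single user receives everything. The sketch below computes the standard index; the function name and the example throughput vectors are illustrative assumptions, not data from the paper.

```python
from typing import Sequence


def jain_index(allocations: Sequence[float]) -> float:
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(allocations)
    total = sum(allocations)
    sum_sq = sum(x * x for x in allocations)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero allocation")
    return (total * total) / (n * sum_sq)


# Hypothetical per-tenant throughputs (arbitrary units).
print(jain_index([5, 5, 5, 5]))   # perfectly even split
print(jain_index([10, 0, 0, 0]))  # one tenant takes everything
```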
☆ Asynchronous Routing for Multipartite Entanglement in Quantum Networks
In quantum networks, one way to communicate is to distribute entanglements through swapping at intermediate nodes. Most existing work primarily aims to create efficient two-party end-to-end entanglement over long distances. However, some scenarios also require remote multipartite entanglement for applications such as quantum secret sharing and multi-party computation. Our previous study improved end-to-end entanglement rates using an asynchronous, tree-based routing scheme that relies solely on local knowledge of entanglement links, conserving unused entanglement and avoiding synchronous operations. This article extends this approach to multipartite entanglements, particularly the three-party Greenberger-Horne-Zeilinger (GHZ) states. It shows that our asynchronous protocol outperforms traditional synchronous methods in entanglement rates, especially as coherence times increase. This approach can also be extended to four-party and larger multipartite GHZ states, highlighting the effectiveness and adaptability of asynchronous routing for multipartite scenarios across various network topologies.
comment: 9 pages, 7 figures, published in the 2026 IEEE 16th Annual Computing and Communication Workshop and Conference (CCWC)
♻ ☆ TCP BBR Performance over Wi-Fi 6: AQM Impacts and Cross-Layer Insights
We evaluate TCP BBRv3 on Wi-Fi 6 home networks under modern AQM schemes using a fully wireless testbed and a simple cross-layer model linking Wi-Fi scheduling, router queueing, and BBRv3's pacing dynamics. Comparing BBR Internet traffic with CUBIC across different AQMs (FIFO, FQ-CoDel, and CAKE) for uplink, downlink, and bidirectional traffic, we find that FIFO destabilizes pacing and raises delay, often letting CUBIC dominate; FQ-CoDel restores fairness and controls latency; and CAKE delivers the best overall performance by keeping delay low and aligning BBRv3's sending and delivered rates. We also identify a Wi-Fi-specific effect where CAKE's rapid queue draining, while improving pacing alignment, can trigger brief retransmission bursts during BBRv3's bandwidth probes. These results follow from the interaction of variable Wi-Fi service rates, AQM delay control, and BBRv3's inflight limits, leading to practical guidance to use FQ-CoDel or CAKE and avoid unmanaged FIFO in home Wi-Fi, with potential for Wi-Fi-aware tuning of BBRv3's probing.
♻ ☆ Evasion Adversarial Attacks Remain Impractical Against ML-based Network Intrusion Detection Systems, Especially Dynamic Ones
Machine Learning (ML) has become pervasive, and its deployment in Network Intrusion Detection Systems (NIDS) is inevitable due to its automated nature and high accuracy compared to traditional models in processing and classifying large volumes of data. However, ML has been found to have several flaws, most importantly, adversarial attacks, which aim to trick ML models into producing faulty predictions. While most adversarial attack research focuses on computer vision datasets, recent studies have explored the suitability of these attacks against ML-based network security entities, especially NIDS, due to the wide difference between different domains regarding the generation of adversarial attacks. To further explore the practicality of adversarial attacks against ML-based NIDS in-depth, this paper presents several key contributions: identifying numerous practicality issues for evasion adversarial attacks on ML-NIDS using an attack tree threat model, introducing a taxonomy of practicality issues associated with adversarial attacks against ML-based NIDS, identifying specific leaf nodes in our attack tree that demonstrate some practicality for real-world implementation and conducting a comprehensive review and exploration of these potentially viable attack approaches, and investigating how the dynamicity of real-world ML models affects evasion adversarial attacks against NIDS. Our experiments indicate that continuous re-training, even without adversarial training, can reduce the effectiveness of adversarial attacks. While adversarial attacks can compromise ML-based NIDSs, our aim is to highlight the significant gap between research and real-world practicality in this domain, which warrants attention.
♻ ☆ Adaptive Intent-Aware PoW Mechanism in SDN for Multi-Domain SYN Flood Mitigation
The stability of Internet services is persistently challenged by the escalating scale of volumetric TCP SYN floods, as conventional defenses like SYN Cookies fail by exacerbating bandwidth depletion under modern attacks. This paper introduces SDN-SYN PoW, a novel defense architecture that synergizes non-interactive Proof-of-Work with a Software-Defined Networking (SDN) control plane, an approach particularly effective for securing the network edge in modern SD-WAN deployments. The core innovation is its ability to perform global network sensing; the SDN controller monitors real-time traffic to dynamically adjust PoW difficulty, transforming the defense from a static mechanism into an intelligent, adaptive system that surgically applies computational costs only to anomalous sources. Through rigorous experiments on a custom-built testbed, we demonstrate that SDN-SYN PoW provides substantially superior protection and, critically, that the PoW overhead remains negligible for legitimate clients, ensuring compatibility even with low-power devices.
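The adaptive-difficulty idea in this abstract can be illustrated with a minimal hashcash-style sketch. Everything concrete here is an assumption for illustration: the abstract does not specify the puzzle construction, so SHA-256 with a leading-zero-bits target is swapped in, and the function names, the baseline rate, and the "+2 bits per doubling" policy are hypothetical.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def difficulty_for_rate(syn_per_sec: float, baseline: float = 100.0, max_bits: int = 20) -> int:
    """Hypothetical controller policy: no puzzle at or below the baseline
    SYN rate, then 2 extra bits for every doubling above it, capped at max_bits."""
    bits = 0
    rate = syn_per_sec
    while rate > baseline and bits < max_bits:
        rate /= 2
        bits += 2
    return min(bits, max_bits)


def solve(challenge: bytes, bits: int) -> int:
    """Client side: brute-force a nonce whose hash meets the difficulty."""
    for nonce in count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce


def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= bits
```

In this sketch a source at or below the baseline rate gets difficulty 0, so its first nonce is accepted immediately, while a source far above the baseline must do exponentially more hashing per connection attempt. Verification costs a single hash regardless of difficulty.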
♻ ☆ WN-Wrangle: Wireless Network Data Wrangling Assistant
Data wrangling continues to be the most time-consuming task in the data science pipeline and wireless network data is no exception. Prior approaches for automatic or assisted data-wrangling primarily target unordered, single-table data. However, unlike traditional datasets where rows in a table are unordered and assumed to be independent of each other, wireless network datasets are often collected across multiple measurement devices, producing multiple, temporally ordered tables that must be integrated to obtain the complete dataset. For instance, to create a dataset of the signal quality of 5G cell towers within a geographic region, GPS data collected by cellphones must be joined with radio frequency measurements of the corresponding cell towers. However, the join key, the timestamp, typically exhibits mismatched sampling periods, causing a misalignment. Data wrangling techniques for generic time-series datasets also fail here, since they lack knowledge of domain-specific data semantics, which are often defined by network protocols and system configurations. To aid in wrangling wireless network datasets, we demonstrate WN-Wrangle, an interactive wrangling assistant tailored to the wireless network domain that suggests the top-k next-best wrangling operations, along with rich, domain-specific explanations. Under the hood, WN-Wrangle uses a mechanism that is aware of both temporal constraints and wireless network semantics to score and rank an extended set of wrangling operators that improve data quality. We demonstrate how WN-Wrangle identifies elusive data-quality issues specific to the wireless network domain and suggests accurate wrangling steps over datasets obtained from the widely used POWDER city-scale wireless testbed.
comment: 7 pages, 4 figures
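The timestamp misalignment described in this abstract (GPS and radio measurements sampled at different periods) can be made concrete with a small nearest-timestamp join. This is only a sketch of the underlying problem, not WN-Wrangle's operator set; the function name `nearest_join`, the tuple layout, and the tolerance parameter are assumptions.

```python
from bisect import bisect_left


def nearest_join(left, right, tolerance):
    """Join each (timestamp, value) row in `left` to the closest-in-time row of `right`.

    Both inputs are lists of (timestamp, value) sorted by timestamp.
    Left rows with no right-hand row within `tolerance` get None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, left_value in left:
        i = bisect_left(right_times, t)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(right)]
        best = min(candidates, key=lambda j: abs(right_times[j] - t), default=None)
        if best is not None and abs(right_times[best] - t) <= tolerance:
            joined.append((t, left_value, right[best][1]))
        else:
            joined.append((t, left_value, None))
    return joined


# Timestamps in milliseconds; GPS sampled every 1000 ms, RF at irregular times.
gps = [(0, "A"), (1000, "B"), (2000, "C")]
rf = [(100, -80), (900, -82), (3500, -90)]
print(nearest_join(gps, rf, tolerance=250))
```

An exact-equality join on these timestamps would match nothing. The tolerance makes the alignment explicit and leaves unmatched rows visible as `None` rather than silently dropping them.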
♻ ☆ Space-Based Computing Networks: Trends, Architecture, Challenges, and Key Technologies
As one of the most promising hotspots in the 6G era, space remote sensing information networks play a key and irreplaceable role in areas such as emergency response and scientific research, and are expected to foster remote sensing data processing into the next generation of killer applications. However, due to the inability to deploy ground communication stations at scale and the limited satellite-to-ground link rate, the traditional model for transmitting space data back to ground stations faces significant challenges in terms of timeliness. To address this problem, we focus on the emerging paradigm of on-orbit space data processing, which reduces the volume of transmitted data by several orders of magnitude to enable faster task response, taking the first step toward building a space-based computing network. Specifically, we propose a hierarchical space-based computing network architecture, comprising the space-based cloud constellation system, the remote sensing constellation system, the network operation control center, the orchestration data center, and the user access portal. Each component is described in detail from a system design perspective to clarify its specific role and functionality. Next, we analyze three scientific challenges: the heterogeneous resource virtualization and state information synchronization, the matching of multi-priority tasks with multidimensional resources, and the fault detection and localization under extreme conditions. Finally, we discuss key technologies to address the aforementioned challenges and highlight promising research priorities for the future.
♻ ☆ On the Role of Age and Semantics of Information in Remote Estimation of Markov Sources
This paper studies semantics-aware remote estimation of Markov sources. We leverage two complementary information attributes: the urgency of lasting impact, which quantifies the significance of consecutive estimation error at the transmitter, and the age of information (AoI), which captures the predictability of outdated information at the receiver. The objective is to minimize the long-run average lasting impact subject to a transmission frequency constraint. The problem is formulated as a constrained Markov decision process (CMDP) with potentially unbounded costs. We show the existence of an optimal simple mixture policy, which randomizes between two neighboring switching policies at a common regeneration state. A closed-form expression for the optimal mixture coefficient is derived. Each switching policy triggers transmission only when the error holding time exceeds a threshold that depends on both the instantaneous estimation error and the AoI. We further derive sufficient conditions under which the thresholds are independent of the instantaneous error and the AoI. Finally, we propose a structure-aware algorithm, Insec-SPI, that computes the optimal policy with reduced computation overhead. Numerical results demonstrate that incorporating both the age and semantics of information significantly improves estimation performance compared to using either attribute alone.
comment: This paper has been accepted for publication in IEEE Transactions on Communications. Part of this work has been accepted for presentation at IEEE ISIT 2026, Guangzhou, China
Distributed, Parallel, and Cluster Computing 9
☆ Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems
Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads. Users who adopted TRES-based submission exhibited strong long-term retention. These results demonstrate that successful scheduling transitions depend not only on system configuration, but on aligning observability, user engagement, and operational design.
comment: 4 pages, 3 figures, 1 table, submitted to PEARC'26
☆ jaxsgp4: GPU-accelerated mega-constellation propagation with batch parallelism
As the population of anthropogenic space objects transitions from sparse clusters to mega-constellations exceeding 100,000 satellites, traditional orbital propagation techniques face a critical bottleneck. Standard CPU-bound implementations of the Simplified General Perturbations 4 (SGP4) algorithm are less well suited to the requisite scale of collision avoidance and Space Situational Awareness (SSA) tasks. This paper introduces \texttt{jaxsgp4}, an open-source high-performance reimplementation of SGP4 utilising the \texttt{JAX} library. \texttt{JAX} has gained traction in the landscape of computational research, offering an easy mechanism for Just-In-Time (JIT) compilation, automatic vectorisation and automatic optimisation of code for CPU, GPU and TPU hardware modalities. By refactoring the algorithm into a pure functional paradigm, we leverage these transformations to execute massively parallel propagations on modern GPUs. We demonstrate that \texttt{jaxsgp4} can propagate the entire Starlink constellation (9,341 satellites) each to 1,000 future time steps in under 4 ms on a single A100 GPU, representing a speedup of $1500\times$ over traditional C++ baselines. Furthermore, we argue that the use of 32-bit precision for SGP4 propagation tasks offers a principled trade-off, accepting a negligible loss of precision for a substantial gain in throughput on hardware accelerators.
comment: 11 pages, 3 figures
☆ Optimising Blockchain Scalability for Real-Time IoT Applications
The convergence of blockchain and the Internet of Things (IoT) enables secure, decentralised, and verifiable data exchange across distributed smart environments. However, traditional blockchain frameworks suffer from inherent scalability constraints, limited throughput, and high latency, which conflict with the stringent real-time requirements of IoT applications such as industrial automation, intelligent healthcare, and smart transportation. These systems demand ultra-low latency, high transaction throughput, lightweight computation, and efficient resource utilisation. This review provides a comprehensive, structured analysis of state-of-the-art scalability solutions specifically adapted to blockchain-enabled IoT. The discussion encompasses Layer 1 enhancements, Layer 2 off-chain processing, sharding-based parallelisation, integration of edge and fog computing, and hybrid consensus mechanisms. For each approach, the review highlights operational principles, performance benefits, trade-offs in decentralisation and security, and suitability for latency-sensitive deployments. Furthermore, real-time quality-of-service considerations are examined to understand how scalability strategies impact system responsiveness, energy efficiency, and data integrity. Key open challenges, including the scalability-security trade-off, privacy preservation, interoperability, and sustainable resource management, have been identified as persistent barriers to large-scale adoption. Finally, the review outlines future research directions, emphasising adaptive and AI-driven consensus algorithms, quantum-safe cryptographic models, the convergence of blockchain with 5G/6G networks, and edge intelligence. By consolidating diverse technical insights and emerging trends, this work serves as a timely reference for developing scalable, secure, and sustainable blockchain architectures for real-time IoT applications.
☆ Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition
We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45~GFLOPS for $N\!=\!4096$ complex single-precision transforms -- a 29\% improvement over Apple's highly optimized vDSP/Accelerate baseline (107~GFLOPS). Our approach is grounded in a \emph{two-tier local memory model} that formally characterizes the Apple GPU's 208~KiB register file as the primary data-resident tier and the 32~KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's \texttt{simdgroup\_matrix} 8$\times$8 hardware MMA for FFT butterfly computation and report the counter-intuitive finding that on Apple GPU, threadgroup memory barriers are inexpensive ($\sim$2 cycles) while scattered threadgroup access patterns are the true bottleneck. Our multi-size implementation supports $N\!=\!256$ through $N\!=\!16384$ using a four-step decomposition for sizes exceeding the 32~KiB threadgroup memory limit. All kernels are validated against vDSP reference outputs.
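For readers unfamiliar with the Stockham formulation mentioned in this abstract, the following is a minimal radix-2 sketch in plain Python. It is not the paper's kernel: the paper reports radix-4 and radix-8 kernels in Metal Shading Language, whereas this swaps in radix-2 purely to show the autosort structure, with two buffers ping-ponged each pass and no bit-reversal step. The function name `stockham_fft` is an assumption.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (forward transform, natural-order output).

    Each pass reads from `src` and writes to `dst`, then the buffers swap,
    so no separate bit-reversal permutation is needed.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n
    length, stride = n, 1  # invariant: length * stride == n
    while length > 1:
        half = length // 2
        for p in range(half):
            twiddle = cmath.exp(-2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * twiddle
        src, dst = dst, src
        length, stride = half, stride * 2
    return src
```

The inner loop over `q` touches contiguous elements, which is the property that makes Stockham variants attractive on GPUs. A higher radix performs the same data movement in fewer passes, at the cost of more registers per thread.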
☆ The First OpenFOAM HPC Challenge (OHC-1)
The first OpenFOAM HPC Challenge (OHC-1) was organised by the OpenFOAM HPC Technical Committee (HPCTC) to collect a snapshot of OpenFOAM's computational performance on contemporary production hardware and to compare hardware-constrained submissions with software-track optimisations. Participants ran a common incompressible steady-state RANS case, the open-closed cooling DrivAer (occDrivAer) configuration, on prescribed meshes, submitting either with the reference setup (hardware track) or with modified solvers, decomposition strategies, or accelerator offloading (software track). In total, 237 valid datapoints were submitted by 12 contributors: 175 in the hardware track and 62 in the software track. The hardware track covered 25 distinct CPU models across AMD, Intel, and ARM families, with runs spanning from single-node configurations up to 256 nodes (32768 CPU cores). Wall-clock times ranged from 7.8 minutes to 65.7 hours and reported energy-to-solution from 2.1 to 236.9 kWh. Analysis of the hardware track identified a Pareto front of optimal balance between time- and energy-to-solution, and revealed that on-package high-bandwidth memory (HBM) dominates single-node performance for next-generation CPUs. Software-track submissions achieved up to 28% lower energy per iteration, 17% higher maximum performance per node, and 72% shorter minimum time per iteration than the best hardware-track results, with full GPU ports and selective-memory optimisations leading the performance range. This manuscript describes the challenge organisation, the case setup and metrics, and presents the main findings from both tracks together with an outlook for future challenges.
☆ BLOSSOM: Block-wise Federated Learning Over Shared and Sparse Observed Modalities IJCNN
Multimodal federated learning (FL) is essential for real-world applications such as autonomous systems and healthcare, where data is distributed across heterogeneous clients with varying and often missing modalities. However, most existing FL approaches assume uniform modality availability, limiting their applicability in practice. We introduce BLOSSOM, a task-agnostic framework for multimodal FL designed to operate under shared and sparsely observed modality conditions. BLOSSOM supports clients with arbitrary modality subsets and enables flexible sharing of model components. To address client and task heterogeneity, we propose a block-wise aggregation strategy that selectively aggregates shared components while keeping task-specific blocks private, enabling partial personalization. We evaluate BLOSSOM on multiple diverse multimodal datasets and analyse the effects of missing modalities and personalization. Our results show that block-wise personalization significantly improves performance, particularly in settings with severe modality sparsity. In modality-incomplete scenarios, BLOSSOM achieves an average performance gain of 18.7% over full-model aggregation, while in modality-exclusive settings the gain increases to 37.7%, highlighting the importance of block-wise learning for practical multimodal FL systems.
comment: 6 pages, 2 figures, 3 tables. Accepted to the International Joint Conference on Neural Networks (IJCNN) 2026
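The block-wise aggregation idea in this abstract (aggregate shared components, keep task-specific blocks private) can be sketched as follows. This is a simplified illustration, not BLOSSOM's actual algorithm: the unweighted mean, the dictionary layout, and the names `blockwise_aggregate` and `apply_update` are all assumptions.

```python
def blockwise_aggregate(client_models, shared_blocks):
    """Average only the blocks named in `shared_blocks`, over the clients that hold them.

    client_models: dict client_id -> dict block_name -> list of float parameters.
    Returns dict block_name -> averaged parameter list.
    """
    aggregated = {}
    for block in shared_blocks:
        holders = [model[block] for model in client_models.values() if block in model]
        if not holders:
            continue  # no client observed this modality in this round
        n = len(holders)
        aggregated[block] = [sum(values) / n for values in zip(*holders)]
    return aggregated


def apply_update(client_model, aggregated):
    """Overwrite the shared blocks a client owns; private blocks stay untouched."""
    return {name: list(aggregated.get(name, params)) for name, params in client_model.items()}


# Client "a" has audio only; client "b" has audio and image. Heads stay private.
clients = {
    "a": {"audio_enc": [1.0, 3.0], "head": [10.0]},
    "b": {"audio_enc": [3.0, 5.0], "image_enc": [2.0], "head": [20.0]},
}
global_blocks = blockwise_aggregate(clients, shared_blocks=["audio_enc", "image_enc"])
print(apply_update(clients["a"], global_blocks))
```

Because aggregation is keyed by block name, a client missing a modality simply does not contribute to, or receive, that block. This is one way to accommodate arbitrary modality subsets.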
♻ ☆ A Multi-Armed Bandit-Based Participant Selection Method for Federated Recommendation Systems
Federated Recommendation Systems (FRS) enable privacy-preserving model training by keeping user data on edge devices. However, the practical deployment of FRS in Edge-Cloud environments faces significant challenges due to system and statistical heterogeneity. Existing FRS participant selection strategies struggle to dynamically balance the trade-off between model convergence speed and recommendation quality in such volatile environments. To address this, we formulate the FRS participant selection problem as a normalized utility cost that accounts for both model quality and system efficiency. Next, we propose a dynamic participant selection framework incorporating a Multi-Armed Bandit (MAB)-based solver for multimodal FRS. We design a client-utility function that jointly evaluates historical Client Performance Reputation, data quality, and real-time system latency. By leveraging an Upper Confidence Bound strategy, our framework effectively balances the exploration of under-sampled clients with the exploitation of high-performing ones. We validate the proposed approach on a realistic edge-cloud testbed implementation using a multimodal movie-recommendation task. Experimental results demonstrate that our MAB-driven approach outperforms other baselines across eight different data-skew scenarios. Specifically, it improves training efficiency by 32-50% while improving model quality metrics such as Recall@50 by up to around 5%.
comment: Accepted in IEEE/ACM CCGRID 2026
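The Upper Confidence Bound selection described in this abstract can be sketched with the textbook UCB1 rule. The utility weights, the latency transform, and the names `client_utility`, `ucb_select`, and `update` are assumptions for illustration; the abstract does not give the paper's exact utility function or bonus term.

```python
import math


def client_utility(reputation, data_quality, latency_s, weights=(0.4, 0.4, 0.2)):
    """Hypothetical utility: a weighted sum in which lower latency scores higher."""
    w_rep, w_qual, w_lat = weights
    return w_rep * reputation + w_qual * data_quality + w_lat / (1.0 + latency_s)


def ucb_select(counts, mean_utility, k, c=2.0):
    """Pick k client ids by UCB1 score; never-selected clients are tried first."""
    total = sum(counts.values())

    def score(cid):
        n = counts[cid]
        if n == 0:
            return math.inf
        return mean_utility[cid] + math.sqrt(c * math.log(total) / n)

    return sorted(counts, key=score, reverse=True)[:k]


def update(counts, mean_utility, cid, utility):
    """Incrementally update the running mean utility of a selected client."""
    counts[cid] += 1
    mean_utility[cid] += (utility - mean_utility[cid]) / counts[cid]


counts = {"a": 10, "b": 10, "c": 0}
means = {"a": 0.9, "b": 0.5, "c": 0.0}
print(ucb_select(counts, means, k=2))
```

The exploration bonus shrinks as a client is sampled more often, so rarely selected clients are revisited periodically, while consistently useful ones are chosen in most rounds.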
♻ ☆ SHADOW: Seamless Handoff And Zero-Downtime Orchestrated Workload Migration for Stateful Microservices
Migrating stateful microservices in Kubernetes requires careful state management because in-memory state is lost when a container restarts. For StatefulSet-managed workloads, the problem is compounded by identity constraints that prohibit two pods with the same ordinal from running simultaneously, forcing a sequential stop-recreate cycle with a median 38.5 s of service downtime. This paper presents SHADOW (Seamless Handoff And Zero-Downtime Orchestrated Workload Migration), a Kubernetes-native framework that implements the Message-based Stateful Microservice Migration (MS2M) approach as a Kubernetes Operator. SHADOW introduces the ShadowPod strategy, where a shadow pod is created from a CRIU checkpoint image on the target node while the source pod continues serving traffic, allowing concurrent operation during message replay. For StatefulSet workloads, an identity swap procedure with the ExchangeFence mechanism re-checkpoints the shadow pod, creates a StatefulSet-owned replacement, and drains both message queues to guarantee zero message loss during the handoff. An evaluation on a bare-metal Kubernetes cluster with 280 migration runs across four configurations and seven message rates (10--120 msg/s) shows that, compared to the sequential baseline on the same StatefulSet workload, the ShadowPod strategy reduces the restore phase by up to 92%, eliminates service downtime entirely, and reduces total migration time by up to 77%, with zero message loss across all 280 runs.
♻ ☆ Efficient Tree-Structured Deep Research with Adaptive Resource Allocation ICLR 2026
Deep research agents, which synthesize information across diverse sources, are significantly constrained by the sequential nature of reasoning. This bottleneck results in high latency, poor runtime adaptability, and inefficient resource allocation, making today's deep research systems impractical for interactive applications. To overcome this, we introduce ParallelResearch, a novel framework for efficient deep research that transforms sequential processing into parallel, runtime orchestration by dynamically decomposing complex queries into tree-structured sub-tasks. Our core contributions are threefold: (1) an adaptive planner that dynamically allocates computational resources based on query complexity; (2) a runtime orchestration layer that prunes redundant paths to reallocate resources and enables speculative execution; and (3) a fully-asynchronous execution infrastructure that enables concurrency across both research breadth and depth. Experiments on two benchmarks show up to 5x speedups with comparable final report quality, and consistent quality improvements with the same time budgets.
comment: ICLR 2026 Workshop on Agents in the Wild (Spotlight)
Programming Languages 8
☆ Folding the Heighway dragon curve
The Heighway dragon curve is one of the best-known fractal curves. There are two ways to construct the curve: repeatedly make a copy of the current curve, rotate it by 90 degrees, and connect the two; or repeatedly replace each straight segment in the curve by two segments that meet at a right angle. A natural question is how to prove that the two approaches are equivalent. We generalise the construction of the curve to allow rotations to both sides. It then turns out that the two approaches are respectively a foldr and a foldl, and the key property for proving their equivalence, using the second duality theorem, is the distributivity of an "interleave" operator.
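The two constructions named in this abstract can be compared empirically on turn sequences. The sketch below is not the paper's foldr/foldl development and does not cover its generalisation to rotations on both sides. It fixes one orientation, encodes a curve as a list of "R"/"L" turns (an assumed encoding), and checks that both constructions agree for small iteration counts. The function names are hypothetical.

```python
def unfold(n):
    """Copy, rotate, connect: each step appends one turn followed by the
    reversed copy of the current curve with every turn flipped."""
    turns = []
    for _ in range(n):
        mirrored = ["L" if t == "R" else "R" for t in reversed(turns)]
        turns = turns + ["R"] + mirrored
    return turns


def refine(n):
    """Segment replacement: each step interleaves alternating new turns
    (R, L, R, L, ...) with the existing turns."""
    turns = []
    for _ in range(n):
        fresh = ["R" if k % 2 == 0 else "L" for k in range(len(turns) + 1)]
        merged = []
        for k, new_turn in enumerate(fresh):
            merged.append(new_turn)
            if k < len(turns):
                merged.append(turns[k])
        turns = merged
    return turns


print("".join(unfold(3)))  # RRLRRLL
print(all(unfold(n) == refine(n) for n in range(10)))  # True
```

Agreement on finitely many cases is of course not a proof. It only makes concrete the statement whose proof the abstract attributes to the distributivity of the interleave operator.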
☆ Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP
Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, at ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
comment: 10 pages, 5 authors, preprint
☆ Sal: Multi-modal Verification of Replicated Data Types
Designing correct replicated data types (RDTs) is challenging because replicas evolve independently and must be merged while preserving application intent. A promising approach is correct-by-construction development in a proof-oriented programming language such as F*, Dafny, or Lean, where desired correctness guarantees are specified and checked as the RDTs are implemented. Recent work, Neem, proposes the use of replication-aware linearizability (RA linearizability) as the correctness condition for state-based CRDTs and mergeable replicated data types (MRDTs), with automation in the SMT-aided, proof-oriented programming language F*. However, SMT-centric workflows can be opaque when automation fails to discharge a verification condition (VC), and they enlarge the trusted computing base (TCB). We present Sal, a multi-modal workflow to design and verify state-based CRDTs and MRDTs in Lean. Sal combines (i) kernel-checkable automation with proof reconstruction, (ii) SMT-aided automation when needed, and (iii) AI-assisted interactive theorem proving for remaining proof obligations. When a verification condition is shown to be invalid, we leverage Lean's property-based testing to automatically generate and visualize counterexamples, helping developers debug incorrect specifications or implementations. We report on our experience verifying a suite of 13 CRDTs and MRDTs with Sal: 69% of verification conditions are discharged by kernel-verified automation without SMT, and counterexamples automatically expose subtle bugs such as the well-known enable-wins flag anomaly. The codebase for Sal is open-sourced and available at https://github.com/fplaunchpad/sal.
☆ Bit-Vector CHC Solving for Binary Analysis and Binary Analysis for Bit-Vector CHC Solving
For high-assurance software, source-level reasoning is insufficient: we need binary-level guarantees. Despite constrained Horn clause (CHC) solving being one of the most popular forms of automated verification, prior work has not evaluated the viability of CHC solving for binary analysis. To fill this gap, we assemble a pipeline that encodes binary analysis problems as CHCs in the SMT logic of quantifier-free bit vectors, and show that off-the-shelf CHC solvers achieve reasonable success on binaries compiled from 983 C invariant inference benchmarks: a portfolio solves 59.5% and 66.1% of the problems derived from the unoptimized and optimized binaries, respectively -- roughly equal to the success rate of a leading C verifier on the source code (60.1%). Moreover, we show that binary analysis provides a valuable source of bit-vector CHC benchmarks (which are in short supply): binary-derived problems differ from existing benchmarks both structurally and in solver success rates and rankings. Augmenting CHC solving competitions with binary-derived benchmarks will encourage solver developers to improve bit-vector reasoning, in turn making CHC solving a more effective tool for binary analysis.
comment: 23 pages, 6 figures, to appear in the 18th NASA Formal Methods Symposium (NFM 2026)
♻ ☆ Taming the Hydra: Targeted Control-Flow Transformations for Dynamic Symbolic Execution
Dynamic Symbolic Execution (DSE) suffers from the path explosion problem when the target program has many conditional branches. The classical approach for managing the path explosion problem is dynamic state merging. Dynamic state merging combines similar symbolic program states to avoid the exponential growth in the number of states during DSE. However, state merging still requires solver invocations at each program branch, even when both paths of the branch are feasible. Moreover, the best path search strategy for DSE may not create the best state merging opportunities. Some drawbacks of state merging can be mitigated by compile-time state merging (i.e., branch elimination by converting control-flow into dataflow). In this paper, we propose a non-semantics-preserving but failure-preserving compiler transformation for removing expensive symbolic branches in a program to improve the scalability of DSE. We have developed a framework for detecting spurious bugs that our transformation can insert. Finally, we show that our transformation can significantly improve the performance of DSE on various benchmark programs and improve coverage and bug discovery on large real-world programs.
♻ ☆ Misquoted No More: Securely Extracting F* Programs with IO
Shallow embeddings that use monads to represent effects are popular in proof-oriented languages because they are convenient for formal verification. Once shallowly embedded programs are verified, they are often extracted to mainstream languages like OCaml or C and linked into larger codebases. The extraction process is not fully verified because it often involves quotation -- turning the shallowly embedded program into a deeply embedded one -- and verifying quotation remains a major open challenge. Instead, some prior work obtains formal correctness guarantees using translation validation to certify individual extraction results. We build on this idea, but limit the use of translation validation to a first extraction step that we call relational quotation and that uses a metaprogram to construct a typing derivation for the given shallowly embedded program. This metaprogram is simple, since the typing derivation follows the structure of the original program. Once we validate, syntactically, that the typing derivation is valid for the original program, we pass it to a verified syntax-generation function that produces code guaranteed to be semantically related to the original program. We apply this general idea to build SEIO*, a framework for extracting shallowly embedded F* programs with IO to a deeply embedded lambda-calculus while providing formal secure compilation guarantees. Using two cross-language logical relations, we devise a machine-checked proof in F* that SEIO* guarantees Robust Relational Hyperproperty Preservation (RrHP), a very strong secure compilation criterion that implies full abstraction as well as preservation of trace properties and hyperproperties against arbitrary adversarial contexts. This goes beyond the state of the art in verified and certifying extraction, which so far has focused on correctness rather than security.
comment: Submitted to ICFP'26, with a couple extra fixes
♻ ☆ LLMs versus the Halting Problem: Revisiting Program Termination Prediction
Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length and complexity increase. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
♻ ☆ When Lifetimes Liberate: A Type System for Arenas with Higher-Order Reachability Tracking
Static resource management in languages remains challenging due to tensions among control, expressiveness, and flexibility. Region-based systems [Grossman et al . 2002; Tofte et al. 2001] offer bulk deallocation via lexically scoped regions, where all allocations follow a stack discipline. However, both regions and their resources are second-class, and neither can escape its scope nor be freely returned. Ownership and linear type systems, exemplified by Rust [Clarke et al. 2013], offer non-lexical lifetimes and robust static guarantees, but rely on invariants that limit higher-order patterns and expressive sharing. In this work, we propose a new type system that unifies these strengths. Our system treats all heap-allocated resources as first-class values, while allowing programmers to control lifetime and granularity through three allocation modes: (1) fresh allocation for individual, non-lexical references; (2) subsequent coallocation grouping resources collectively within shadow arenas; and (3) scoped allocation with lexically bounded lifetimes following stack discipline. Regardless of mode, all resources share a uniform type and have no distinction for generic abstractions, preserving the higher-order parametric nature of the language. Obtaining static safety in higher-order languages with flexible sharing is nontrivial. We address this by extending reachability types [Wei et al. 2024] to collectively track first-class resources, and by adopting flow-insensitive deallocation reasoning for selective stack discipline. These mechanisms yield Aq<: and {A}q<: atop, both formalized and proven type safe and memory safe in Rocq.
Networking and Internet Architecture 4
☆ DRASTIC: A Dynamic Resource Allocation Framework over 6G Network Slicing in Task-aware Closed-Loop Tactile Internet Applications
This work proposes a novel learning-driven bandwidth optimization framework called DRASTIC (Dynamic Resource Allocation for Slicing in Task-aware Closed-loop tactile Internet applications). The proposed framework dynamically allocates resources among network slices supporting both enhanced Mobile Broadband (eMBB) and highly reliable low-latency communication (HRLLC) users. The algorithm ensures queue stability and meets delay targets with high probability under Markov-modulated Poisson traffic, exploiting a Lyapunov-guided advantage actor-critic reinforcement learning technique. The proposed network model includes an open-loop eMBB queue whose arrivals and departures are mainly driven by throughput demand, as well as a closed-loop HRLLC queue that captures feedback and task execution effects. A task-execution-dependent dexterity index adjusts the effective arrival rate, creating a feedback-aware interaction between the network and the task. A probabilistic delay constraint is incorporated into the objective via Lagrangian relaxation, yielding a min-max optimization framework that enforces latency guarantees while maximizing throughput for both types of users. Simulation results demonstrate that the proposed framework meets diverse Quality of Service (QoS) requirements, maintains queue stability under dynamic wireless and robotic task variation conditions, and outperforms other approaches.
☆ Reconfiguring room-scale magnetoquasistatic wireless power transfer with hierarchical resonators
Magnetoquasistatic wireless power transfer can deliver substantial power to mobile devices over near-field links. Room-scale implementations, such as quasistatic cavity resonators, extend this capability over large enclosed volumes, but their efficiency drops sharply for centimeter-scale or misoriented receivers because the magnetic field is spatially broad and weakly coupled to small coils. Here, we introduce hierarchical resonators that act as selectively activated relays within a room-scale quasistatic cavity resonator, capturing the ambient magnetic field and re-emitting it to concentrate flux at a target receiver. This architecture reconfigures the wireless power environment on demand and enables localized energy delivery to miniature devices. Experimentally, the hierarchical link improves power transfer efficiency by more than two orders of magnitude relative to direct room-scale transfer and delivers up to 500 mW of DC power to a 15 mm receiver. We further demonstrate selective multi-relay operation and field reorientation for furniture-embedded charging scenarios. These results establish a scalable route to reconfigurable wireless power delivery for miniature and batteryless devices in room-scale environments.
comment: 12 pages, 5 figures
♻ ☆ Service Discovery-Based Hybrid Network Middleware for Efficient Communication in Distributed Robotic Systems IROS
Robotic middleware is fundamental to ensuring reliable communication among system components and is crucial for intelligent robotics, autonomous vehicles, and smart manufacturing. However, existing robotic middleware often struggles to meet the diverse communication demands, optimize data transmission efficiency, and maintain scheduling determinism between Orin computing units in large-scale L4 autonomous vehicle deployments. This paper presents RIMAOS2C, a service discovery-based hybrid network communication middleware designed to tackle these challenges. By leveraging multi-level service discovery multicast, RIMAOS2C supports a wide variety of communication modes, including multiple cross-chip Ethernet protocols and PCIe communication capabilities. Its core mechanism, the Message Bridge, optimizes data flow forwarding and employs shared memory for centralized message distribution, reducing message redundancy and minimizing transmission delay uncertainty. Tested on L4 vehicles and Jetson Orin domain controllers, RIMAOS2C leverages TCP-based ZeroMQ to overcome the large-message transmission bottleneck in native CyberRT. In scenarios with two cross-chip subscribers, it eliminates message redundancy and improves large-data transmission efficiency by 36 to 40 percent while reducing callback latency variation by 42 to 906 percent. This research advances the communication capabilities of robotic operating systems and proposes a novel approach to optimizing communication in distributed computing architectures for autonomous driving.
comment: 8 pages, 8 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
♻ ☆ AVERY: Intent-Driven Adaptive VLM Split Computing via Embodied Self-Awareness for Efficient Disaster Response Systems
Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that onboard CNNs cannot provide. While Vision-Language Models (VLMs) offer this semantic reasoning, their high resource demands make on-device deployment infeasible, and naive cloud offloading fails under the low-bandwidth, unstable networks endemic to disaster zones. We present AVERY, an intent-driven adaptive split computing framework for efficient VLM deployment on resource-constrained platforms. AVERY is motivated by the observation that operator intent must be treated as a first-class system objective, since missions such as broad situational monitoring and precise, spatially grounded investigation require different semantic products, latency targets, and resource allocations. To reflect this, AVERY advances split computing beyond traditional depth-wise partitioning through a functional, cognitive-inspired dual-stream split: a high-frequency, low-resolution Context stream for real-time awareness, and a low-frequency, high-fidelity Insight stream for deep analysis. This design enables a hierarchical split strategy: computation is first separated by function, then partitioned depth-wise across edge and cloud when the Insight stream is required. A lightweight, self-aware onboard controller monitors network conditions and operator intent to select from pre-trained compression models, navigating the accuracy-throughput trade-off at runtime. Evaluated using LISA-7B in an edge-cloud setting under fluctuating network conditions, AVERY achieves 11.2% higher accuracy than raw image compression, 93.98% lower energy consumption than full-edge execution, and average accuracy within 0.75% of the static High-Accuracy baseline during dynamic adaptation. Overall, AVERY enhances mission efficiency and enables real-time, queryable intelligence in dynamic disaster environments.
comment: Paper is currently under review. Authors' version posted for personal use and not for redistribution. Previous version of the preprint was titled: 'AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems'
Distributed, Parallel, and Cluster Computing 10
☆ Benchmarking Quantum Computers via Protocols -- Comparing Superconducting and Ion-Trap Quantum Technology
Superconducting and ion-trap quantum architectures are common in the current landscape of the quantum computing field, each with distinct characteristics and operational constraints. Understanding and measuring the underlying quantumness of these devices is essential for assessing their readiness for practical applications and guiding future progress and research. Building on earlier work (Meirom, Mor, Weinstein, arXiv:2505.12441), we use a benchmarking strategy suitable for comparing these two architectures by measuring "quantumness" directly on optimal sub-chips. Distinct from existing metrics, our approach employs rigorous binary fidelity thresholds derived from the classical limits of state transfer. This enables us to definitively establish the quantum advantage of a designated sub-region. We apply this quality assurance methodology to industry-leading platforms from both technologies. This comparison provides a protocol-based evaluation of quantumness advantage, revealing not only the strengths and weaknesses of each tested chip and its sub-chips but also offering a common language for their assessment. By abstracting away technical differences in the final result, we demonstrate a benchmarking strategy that bridges the gap between disparate quantum-circuit technologies, enabling fair performance comparisons and establishing a critical foundation for evaluating future claims of quantum advantage.
comment: 33 pages, 31 figures
☆ Sal: Multi-modal Verification of Replicated Data Types
Designing correct replicated data types (RDTs) is challenging because replicas evolve independently and must be merged while preserving application intent. A promising approach is correct-by-construction development in a proof-oriented programming language such as F*, Dafny, or Lean, where desired correctness guarantees are specified and checked as the RDTs are implemented. Recent work, Neem, proposes the use of replication-aware linearizability (RA linearizability) as the correctness condition for state-based CRDTs and mergeable replicated data types (MRDTs), with automation in the SMT-aided, proof-oriented programming language F*. However, SMT-centric workflows can be opaque when automation fails to discharge a verification condition (VC), and they enlarge the trusted computing base (TCB). We present Sal, a multi-modal workflow to design and verify state-based CRDTs and MRDTs in Lean. Sal combines (i) kernel-checkable automation with proof reconstruction, (ii) SMT-aided automation when needed, and (iii) AI-assisted interactive theorem proving for remaining proof obligations. When a verification condition is shown to be invalid, we leverage Lean's property-based testing to automatically generate and visualize counterexamples, helping developers debug incorrect specifications or implementations. We report on our experience verifying a suite of 13 CRDTs and MRDTs with Sal: 69% of verification conditions are discharged by kernel-verified automation without SMT, and counterexamples automatically expose subtle bugs such as the well-known enable-wins flag anomaly. The codebase for Sal is open-sourced and is available at https://github.com/fplaunchpad/sal.
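As a rough illustration of what "replicas evolve independently and must be merged" means for a state-based CRDT, here is a minimal Python sketch of a grow-only counter. It is not taken from Sal or Neem, is not written in Lean, and does not model RA linearizability; the names `GCounter`, `increment`, `merge`, and `value` are hypothetical. It only shows the kind of merge properties (commutativity, associativity, idempotence) that a verification workflow would typically need to establish.

```python
from typing import Dict

# Hypothetical representation: replica id -> number of increments seen from it.
GCounter = Dict[str, int]


def increment(state: GCounter, replica: str, by: int = 1) -> GCounter:
    """Return a new state in which `replica` has recorded `by` more increments."""
    new_state = dict(state)
    new_state[replica] = new_state.get(replica, 0) + by
    return new_state


def merge(a: GCounter, b: GCounter) -> GCounter:
    """Join two states by taking the per-replica maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(state: GCounter) -> int:
    """The counter value is the sum of all per-replica counts."""
    return sum(state.values())


# Two replicas start from the same empty state and evolve independently.
r1 = increment(increment({}, "r1"), "r1")  # r1 increments twice
r2 = increment({}, "r2")                   # r2 increments once
r3 = increment(r1, "r3")                   # a third replica that has already seen r1

merged = merge(r1, r2)
```

In this sketch, merging in either order gives the same state, and merging a state with itself changes nothing, so replicas can exchange states in any order and still converge.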
☆ TX-Digital Twin: Visualizing Supercomputer GPU Performance Data Stream
Supercomputers are complex, dynamic systems that serve thousands of users and are built with thousands of compute nodes. Due to the vast amounts of system and performance data needed to accurately capture their status, supercomputers require complex methods to monitor, maintain, and optimize. Data visualization is a powerful technique for overseeing these large streams of data in an easily interpretable way. The MIT Lincoln Laboratory Supercomputing Center (LLSC) enables effective monitoring through combining 3D gaming technology with compound data streams in the TX-Digital Twin, a 3D simulation of the supercomputer. The TX-Digital Twin offers both live and historical data, in visual and text formats, and tracks a multitude of revealing performance metrics. Recent increasing interest in GPU-accelerated computing has driven a need for monitoring and maintenance of GPU-accelerated resources in supercomputers. In this paper, we build on our previous solution by integrating the visualization of additional GPU metrics, such as GPU memory usage, temperature, and power draw, into the TX-Digital Twin. Using techniques in draw call optimization, we add clear and effective displays of the new metrics while keeping the effects on performance minimal.
comment: 5 pages, 4 figures, 1 table. Presented at IEEE MIT URTC 2025
♻ ☆ Service Discovery-Based Hybrid Network Middleware for Efficient Communication in Distributed Robotic Systems IROS
Robotic middleware is fundamental to ensuring reliable communication among system components and is crucial for intelligent robotics, autonomous vehicles, and smart manufacturing. However, existing robotic middleware often struggles to meet the diverse communication demands, optimize data transmission efficiency, and maintain scheduling determinism between Orin computing units in large-scale L4 autonomous vehicle deployments. This paper presents RIMAOS2C, a service discovery-based hybrid network communication middleware designed to tackle these challenges. By leveraging multi-level service discovery multicast, RIMAOS2C supports a wide variety of communication modes, including multiple cross-chip Ethernet protocols and PCIe communication capabilities. Its core mechanism, the Message Bridge, optimizes data flow forwarding and employs shared memory for centralized message distribution, reducing message redundancy and minimizing transmission delay uncertainty. Tested on L4 vehicles and Jetson Orin domain controllers, RIMAOS2C leverages TCP-based ZeroMQ to overcome the large-message transmission bottleneck in native CyberRT. In scenarios with two cross-chip subscribers, it eliminates message redundancy and improves large-data transmission efficiency by 36 to 40 percent while reducing callback latency variation by 42 to 906 percent. This research advances the communication capabilities of robotic operating systems and proposes a novel approach to optimizing communication in distributed computing architectures for autonomous driving.
comment: 8 pages, 8 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
♻ ☆ When Agents are Powerful: Black Hole Search with Verification in Time-Varying Graphs
A black hole is a harmful node in a graph that destroys any agent entering it, making its identification a critical task. In the \emph{Black Hole Search with Verification (BHSV)} problem, a team of agents operates on a graph $G$ with the objective that at least one agent survives and correctly identifies an edge incident to the black hole; if no black hole exists, then all agents must terminate. Prior work has studied BHS in arbitrary dynamic graphs under the restrictive \emph{face-to-face} communication model, where agents can exchange information only when co-located. This constraint significantly increases the number of agents required to solve the problem. In this work, we strengthen the capabilities of agents by equipping them with (i) \emph{1-hop visibility}, (ii) \emph{global communication}, and (iii) both \emph{1-hop visibility} and \emph{global communication}. We show that these enhancements lead to more efficient solutions for the BHSV problem in dynamic graphs.
♻ ☆ Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning
We study distributed reinforcement learning (RL) with policy gradient methods under asynchronous and parallel computations and communications. While non-distributed methods are well understood theoretically and have achieved remarkable empirical success, their distributed counterparts remain less explored, particularly in the presence of heterogeneous asynchronous computations and communication bottlenecks. We introduce two new algorithms, Rennala NIGT and Malenia NIGT, which implement asynchronous policy gradient aggregation and achieve state-of-the-art efficiency. In the homogeneous setting, Rennala NIGT provably improves the total computational and communication complexity while supporting the AllReduce operation. In the heterogeneous setting, Malenia NIGT simultaneously handles asynchronous computations and heterogeneous environments with strictly better theoretical guarantees. Our results are further corroborated by experiments, showing that our methods significantly outperform prior approaches.
♻ ☆ Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $σ^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $τ_{s}$ and $τ_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h σ^2 L Δ}{n \varepsilon^2},$ which improves with the number of workers $n,$ where $Δ= f(x^0) - f^*,$ and $x^0 \in R^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers $τ_{s}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $τ_{s} d \frac{L Δ}{\varepsilon}$ and the variance-dependent runtime term $\frac{h σ^2 L Δ}{\varepsilon^2},$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new "worst-case" function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
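To make the two runtime terms in this abstract concrete, the following Python sketch simply evaluates them as written, with constants dropped. The function names and all numeric values are hypothetical and chosen only so the arithmetic is easy to follow. The sketch does not reproduce the lower bound construction; it only shows that the distributed SGD variance term shrinks as 1/n, while the server-side communication term, as written, has no dependence on n.

```python
def variance_term(h: float, sigma2: float, L: float, delta: float,
                  n: int, eps: float) -> float:
    """Variance-dependent runtime term of distributed SGD: h*sigma^2*L*Delta / (n*eps^2)."""
    return h * sigma2 * L * delta / (n * eps ** 2)


def server_comm_term(tau_s: float, d: int, L: float, delta: float,
                     eps: float) -> float:
    """Server-to-worker communication runtime term: tau_s * d * L*Delta / eps."""
    return tau_s * d * L * delta / eps


# Hypothetical problem constants (illustrative only).
h, sigma2, L, delta, eps = 1.0, 4.0, 1.0, 1.0, 0.5
tau_s, d = 0.25, 8

for n in (1, 4, 16):
    v = variance_term(h, sigma2, L, delta, n, eps)
    c = server_comm_term(tau_s, d, L, delta, eps)
    print(f"n={n:2d}  variance term={v:5.2f}  server comm term={c:5.2f}")
```

With these made-up values, the variance term falls from 16 to 1 as n grows from 1 to 16, while the server communication term stays at 4 and becomes the larger of the two.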
♻ ☆ Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
We propose a new unifying framework, Birch SGD, for analyzing and designing distributed SGD methods. The central idea is to represent each method as a weighted directed tree, referred to as a computation tree. Leveraging this representation, we introduce a general theoretical result that reduces convergence analysis to studying the geometry of these trees. This perspective yields a purely graph-based interpretation of optimization dynamics, offering a new and intuitive foundation for method development. Using Birch SGD, we design eight new methods and analyze them alongside previously known ones, with at least six of the new methods shown to have optimal computational time complexity. Our research leads to two key insights: (i) all methods share the same "iteration rate" of $O\left(\frac{(R + 1) L Δ}{\varepsilon} + \frac{σ^2 L Δ}{\varepsilon^2}\right)$, where $R$ is the maximum "tree distance" along the main branch of a tree; and (ii) different methods exhibit different trade-offs: for example, some update iterates more frequently, improving practical performance, while others are more communication-efficient or focus on other aspects. Birch SGD serves as a unifying framework for navigating these trade-offs. We believe these results provide a unified foundation for understanding, analyzing, and designing efficient asynchronous and parallel optimization methods.
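The iteration rate quoted above has two parts: a term that grows with the tree distance $R$ and a noise term that does not. The Python sketch below evaluates both with the constants hidden by the $O(\cdot)$ dropped. The function name `iteration_rate_terms` and the numeric values are hypothetical, and nothing here models the computation trees themselves. It only shows that, in the formula as stated, the $R$-dependent term overtakes the noise term once $R + 1$ exceeds $σ^2 / \varepsilon$.

```python
from typing import Tuple


def iteration_rate_terms(R: int, L: float, delta: float,
                         sigma2: float, eps: float) -> Tuple[float, float]:
    """Return ((R+1)*L*Delta/eps, sigma^2*L*Delta/eps^2), ignoring O(.) constants."""
    distance_term = (R + 1) * L * delta / eps
    noise_term = sigma2 * L * delta / eps ** 2
    return distance_term, noise_term


# Hypothetical constants (illustrative only): sigma^2 / eps = 2.
L, delta, sigma2, eps = 1.0, 1.0, 1.0, 0.5

for R in (0, 1, 3):
    distance_term, noise_term = iteration_rate_terms(R, L, delta, sigma2, eps)
    print(f"R={R}  distance term={distance_term:4.1f}  noise term={noise_term:4.1f}")
```

With these made-up values, the two terms are equal at R = 1, the noise term is larger for R = 0, and the distance term is larger for R = 3.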
♻ ☆ AVERY: Intent-Driven Adaptive VLM Split Computing via Embodied Self-Awareness for Efficient Disaster Response Systems
Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that onboard CNNs cannot provide. While Vision-Language Models (VLMs) offer this semantic reasoning, their high resource demands make on-device deployment infeasible, and naive cloud offloading fails under the low-bandwidth, unstable networks endemic to disaster zones. We present AVERY, an intent-driven adaptive split computing framework for efficient VLM deployment on resource-constrained platforms. AVERY is motivated by the observation that operator intent must be treated as a first-class system objective, since missions such as broad situational monitoring and precise, spatially grounded investigation require different semantic products, latency targets, and resource allocations. To reflect this, AVERY advances split computing beyond traditional depth-wise partitioning through a functional, cognitive-inspired dual-stream split: a high-frequency, low-resolution Context stream for real-time awareness, and a low-frequency, high-fidelity Insight stream for deep analysis. This design enables a hierarchical split strategy: computation is first separated by function, then partitioned depth-wise across edge and cloud when the Insight stream is required. A lightweight, self-aware onboard controller monitors network conditions and operator intent to select from pre-trained compression models, navigating the accuracy-throughput trade-off at runtime. Evaluated using LISA-7B in an edge-cloud setting under fluctuating network conditions, AVERY achieves 11.2% higher accuracy than raw image compression, 93.98% lower energy consumption than full-edge execution, and average accuracy within 0.75% of the static High-Accuracy baseline during dynamic adaptation. Overall, AVERY enhances mission efficiency and enables real-time, queryable intelligence in dynamic disaster environments.
comment: Paper is currently under review. Authors' version posted for personal use and not for redistribution. Previous version of the preprint was titled: 'AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems'
♻ ☆ LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real-world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings, which often exceeds the memory capacity of the target GPUs used for training. To address limited memory capacity, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose the Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to training GNN models that uses a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overhead. LSM-GNN incorporates a hybrid eviction policy that manages cache space using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned memory to further reduce the pressure on the storage devices. Experimental results show that, despite its lower compute capabilities and memory capacity, LSM-GNN on a single node with two GPUs outperforms a two-node, four-GPU Dist-DGL baseline and provides up to a 3.75x speedup in end-to-end epoch time when running large-scale GNN training.