Computation and Language
☆ Variational Masked Diffusion Models
                                          Masked diffusion models have recently emerged as a flexible framework for
discrete generative modeling. However, a key limitation of standard masked
diffusion is its inability to effectively capture dependencies among tokens
that are predicted concurrently, leading to degraded generation quality when
dependencies among tokens are important. To explicitly model dependencies among
tokens, we propose Variational Masked Diffusion (VMD), a framework that
introduces latent variables into the masked diffusion process. Through
controlled experiments on synthetic datasets, we demonstrate that VMD
successfully learns dependencies that conventional masked diffusion fails to
capture. We further validate the effectiveness of our approach on Sudoku
puzzles and text datasets, where learning of dependencies among tokens improves
global consistency. Across these domains, VMD enhances both generation quality
and dependency awareness, highlighting the value of integrating variational
inference into masked diffusion. Our code is available at:
https://riccizz.github.io/VMD.
                                    
                                        
                                            comment: Project Page: https://riccizz.github.io/VMD
                                        
                                ☆ Think Twice: Branch-and-Rethink Reasoning Reward Model
                                          Large language models (LLMs) increasingly rely on thinking models that
externalize intermediate steps and allocate extra test-time compute, with
think-twice strategies showing that a deliberate second pass can elicit
stronger reasoning. In contrast, most reward models (RMs) still compress many
quality dimensions into a single scalar in one shot, a design that induces
judgment diffusion: attention spreads across evaluation criteria, yielding
diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a
two-turn RM that transfers the think-twice principle to reward modeling. Turn 1
performs adaptive branching, selecting a small set of instance-critical
dimensions (such as factuality and safety) and sketching concise,
evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a
targeted reread that tests those hypotheses and scrutinizes only what matters
most. We train with GRPO-style reinforcement learning over structured two-turn
traces using a simple binary outcome reward with strict format checks, making
the approach compatible with standard RLHF pipelines. By converting
all-at-oncescoringintofocused, second-lookreasoning,
BR-RMreducesjudgmentdiffusionandimproves sensitivity to subtle yet
consequential errors while remaining practical and scalable. Experimental
results demonstrate that our model achieves state-of-the-art performance on
three challenging reward modeling benchmarks across diverse domains. The code
and the model will be released soon.
                                    
                                ☆ Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models
                                          The identification of hope speech has become a promised NLP task, considering
the need to detect motivational expressions of agency and goal-directed
behaviour on social media platforms. This proposal evaluates traditional
machine learning models and fine-tuned transformers for a previously split hope
speech dataset as train, development and test set. On development test, a
linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; SVM
with RBF kernel reached 0.77, and Na\"ive Bayes hit 0.75. Transformer models
delivered better results, the best model achieved weighted precision of 0.82,
weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80
accuracy. These results suggest that while optimally configured traditional
machine learning models remain agile, transformer architectures detect some
subtle semantics of hope to achieve higher precision and recall in hope speech
detection, suggesting that larges transformers and LLMs could perform better in
small datasets.
                                    
                                ☆ ReCode: Unify Plan and Action for Universal Granularity Control
                                        
                                            
                                        
                                        
                                            
                                        
                                        Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
                                    
                                    
                                          Real-world tasks require decisions at varying granularities, and humans excel
at this by leveraging a unified cognitive representation where planning is
fundamentally understood as a high-level form of action. However, current Large
Language Model (LLM)-based agents lack this crucial capability to operate
fluidly across decision granularities. This limitation stems from existing
paradigms that enforce a rigid separation between high-level planning and
low-level action, which impairs dynamic adaptability and limits generalization.
We propose ReCode (Recursive Code Generation), a novel paradigm that addresses
this limitation by unifying planning and action within a single code
representation. In this representation, ReCode treats high-level plans as
abstract placeholder functions, which the agent then recursively decomposes
into finer-grained sub-functions until reaching primitive actions. This
recursive approach dissolves the rigid boundary between plan and action,
enabling the agent to dynamically control its decision granularity.
Furthermore, the recursive structure inherently generates rich,
multi-granularity training data, enabling models to learn hierarchical
decision-making processes. Extensive experiments show ReCode significantly
surpasses advanced baselines in inference performance and demonstrates
exceptional data efficiency in training, validating our core insight that
unifying planning and action through recursive code generation is a powerful
and effective approach to achieving universal granularity control. The code is
available at https://github.com/FoundationAgents/ReCode.
                                    
                                ☆ ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
                                        
                                            
                                        
                                        
                                            
                                        
                                        Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, Kai Yu
                                    
                                    
                                          Large Audio Language Models (LALMs), which couple acoustic perception with
large language models (LLMs) to extract and understand diverse information from
audio, have attracted intense interest from both academic and industrial
communities. However, existing LALMs are highly sensitive to how instructions
are phrased, affecting both (i) instruction-following rates and (ii) task
performance. Yet, no existing benchmarks offer a systematic and comprehensive
evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark
evaluating instruction sensitivity for LALMs along three axes: instruction
description, output format, and task composition. We assess recent open-source
and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy
under controlled instruction variations. Experimental results reveal that even
state-of-the-art LALMs suffer significant instruction sensitivity, leading to
degraded performance on fundamental audio understanding tasks. To mitigate this
issue, we fine-tune Qwen2-Audio on a specifically constructed complex
instruction-variant dataset, achieving a marked improvement in
instruction-following performance. However, this also induces nontrivial
catastrophic forgetting: the model loses some previously mastered task
capabilities when exposed to new instruction styles. Our benchmark provides a
standardized basis for assessing and improving instruction sensitivity in
LALMs, underscoring the need for instruction-robust audio understanding in
real-world pipelines.
                                    
                                        
                                            comment: submitted to icassp 2026
                                        
                                ☆ A U-Net and Transformer Pipeline for Multilingual Image Translation
                                          This paper presents an end-to-end multilingual translation pipeline that
integrates a custom U-Net for text detection, the Tesseract engine for text
recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for
Neural Machine Translation (NMT). Our approach first utilizes a U-Net model,
trained on a synthetic dataset , to accurately segment and detect text regions
from an image. These detected regions are then processed by Tesseract to
extract the source text. This extracted text is fed into a custom Transformer
model trained from scratch on a multilingual parallel corpus spanning 5
languages. Unlike systems reliant on monolithic pre-trained models, our
architecture emphasizes full customization and adaptability. The system is
evaluated on its text detection accuracy, text recognition quality, and
translation performance via BLEU scores. The complete pipeline demonstrates
promising results, validating the viability of a custom-built system for
translating text directly from images.
                                    
                                        
                                            comment: 6 pages, 3 figures, 5 tables, and 2 algorithms. Prepared in IEEE
  double-column format
                                        
                                ☆ LimRank: Less is More for Reasoning-Intensive Information Reranking EMNLP 2025
                                          Existing approaches typically rely on large-scale fine-tuning to adapt LLMs
for information reranking tasks, which is computationally expensive. In this
work, we demonstrate that modern LLMs can be effectively adapted using only
minimal, high-quality supervision. To enable this, we design
LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating
diverse, challenging, and realistic reranking examples. Using this synthetic
data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two
challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and
FollowIR for instruction-following retrieval. Our experiments demonstrate that
LIMRANK achieves competitive performance, while being trained on less than 5%
of the data typically used in prior work. Further ablation studies demonstrate
the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization
capabilities of LIMRANK across downstream tasks, including scientific
literature search and retrieval-augmented generation for knowledge-intensive
problem solving.
                                    
                                        
                                            comment: EMNLP 2025 Main (Short)
                                        
                                ☆ JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
                                        
                                            
                                        
                                        
                                            
                                        
                                        Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan
                                    
                                    
                                          The scope of neural code intelligence is rapidly expanding beyond text-based
source code to encompass the rich visual outputs that programs generate. This
visual dimension is critical for advanced applications like flexible content
generation and precise, program-driven editing of visualizations. However,
progress has been impeded by the scarcity of high-quality multimodal code data,
a bottleneck stemming from challenges in synthesis and quality assessment. To
address these challenges, we make contributions from both a data and modeling
perspective. We first introduce a complete synthesis toolkit that leverages
reciprocal synergies between data modalities to efficiently produce a
large-scale, high-quality corpus spanning from standard charts to complex
interactive web UIs and code-driven animations. Leveraging this toolkit, we
construct JanusCode-800K, the largest multimodal code corpus to date. This
powers the training of our models, JanusCoder and JanusCoderV, which establish
a visual-programmatic interface for generating code from textual instructions,
visual inputs, or a combination of both. Our unified model is a departure from
existing approaches that build specialized models for isolated tasks. Extensive
experiments on both text-centric and vision-centric coding tasks demonstrate
the superior performance of the JanusCoder series, with our 7B to 14B scale
models approaching or even exceeding the performance of commercial models.
Furthermore, extensive analysis provides key insights into harmonizing
programmatic logic with its visual expression. Our code and checkpoints will
are available at https://github.com/InternLM/JanusCoder.
                                    
                                        
                                            comment: Work in progress
                                        
                                ☆ IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
                                          Intent identification serves as the foundation for generating appropriate
responses in personalized question answering (PQA). However, existing
benchmarks evaluate only response quality or retrieval performance without
directly measuring intent identification capabilities. This gap is critical
because without understanding which intents users prioritize, systems cannot
generate responses satisfying individual information needs. To address this, we
introduce the concept of core intents: intents users prioritize when selecting
answers to satisfy their information needs. To evaluate these core intents, we
propose IPQA, a benchmark for core Intent identification in Personalized
Question Answering. Since users do not explicitly state their prioritized
intents, we derive core intents from observable behavior patterns in answer
selection, grounded in satisficing theory where users choose answers meeting
their acceptance thresholds. We construct a dataset with various domains
through systematic filtering, LLM-based annotation, and rigorous quality
control combining automated verification with human validation. Experimental
evaluations across state-of-the-art language models reveal that current systems
struggle with core intent identification in personalized contexts. Models fail
to identify core intents from user histories, with performance degrading as
question complexity increases. The code and dataset will be made publicly
available to facilitate future research in this direction.
                                    
                                ☆ M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset
                                          Existing real-world datasets for multimodal automated fact-checking have
multiple limitations: they contain few instances, focus on only one or two
languages and tasks, suffer from evidence leakage, or depend on external sets
of news articles for sourcing true claims. To address these shortcomings, we
introduce M4FC, a new real-world dataset comprising 4,982 images paired with
6,980 claims. The images, verified by professional fact-checkers from 22
organizations, represent diverse cultural and geographic contexts. Each claim
is available in one or two out of ten languages. M4FC spans six multimodal
fact-checking tasks: visual claim extraction, claimant intent prediction, fake
detection, image contextualization, location verification, and verdict
prediction. We provide baseline results for all tasks and analyze how combining
intermediate tasks influence downstream verdict prediction performance. We make
our dataset and code available.
                                    
                                        
                                            comment: Preprint under review. Code and data available at:
  https://github.com/UKPLab/M4FC
                                        
                                ☆ MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
                                          Effective math tutoring requires not only solving problems but also
diagnosing students' difficulties and guiding them step by step. While
multimodal large language models (MLLMs) show promise, existing benchmarks
largely overlook these tutoring skills. We introduce MMTutorBench, the first
benchmark for AI math tutoring, consisting of 685 problems built around
pedagogically significant key-steps. Each problem is paired with
problem-specific rubrics that enable fine-grained evaluation across six
dimensions, and structured into three tasks-Insight Discovery, Operation
Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find
clear performance gaps between proprietary and open-source systems, substantial
room compared to human tutors, and consistent trends across input variants: OCR
pipelines degrade tutoring quality, few-shot prompting yields limited gains,
and our rubric-based LLM-as-a-Judge proves highly reliable. These results
highlight both the difficulty and diagnostic value of MMTutorBench for
advancing AI tutoring.
                                    
                                ☆ Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts
                                          Financial narratives from U.S. Securities and Exchange Commission (SEC)
filing reports and quarterly earnings call transcripts (ECTs) are very
important for investors, auditors, and regulators. However, their length,
financial jargon, and nuanced language make fine-grained analysis difficult.
Prior sentiment analysis in the financial domain required a large, expensive
labeled dataset, making the sentence-level stance towards specific financial
targets challenging. In this work, we introduce a sentence-level corpus for
stance detection focused on three core financial metrics: debt, earnings per
share (EPS), and sales. The sentences were extracted from Form 10-K annual
reports and ECTs, and labeled for stance (positive, negative, neutral) using
the advanced ChatGPT-o3-pro model under rigorous human validation. Using this
corpus, we conduct a systematic evaluation of modern large language models
(LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting
strategies. Our results show that few-shot with CoT prompting performs best
compared to supervised baselines, and LLMs' performance varies across the SEC
and ECT datasets. Our findings highlight the practical viability of leveraging
LLMs for target-specific stance in the financial domain without requiring
extensive labeled data.
                                    
                                ☆ BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
                                        
                                            
                                        
                                        
                                            
                                        
                                        Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
                                    
                                    
                                          Confidence in LLMs is a useful indicator of model uncertainty and answer
reliability. Existing work mainly focused on single-turn scenarios, while
research on confidence in complex multi-turn interactions is limited. In this
paper, we investigate whether LLM-based search agents have the ability to
communicate their own confidence through verbalized confidence scores after
long sequences of actions, a significantly more challenging task compared to
outputting confidence in a single interaction. Experimenting on open-source
agentic models, we first find that models exhibit much higher task accuracy at
high confidence while having near-zero accuracy when confidence is low. Based
on this observation, we propose Test-Time Scaling (TTS) methods that use
confidence scores to determine answer quality, encourage the model to try again
until reaching a satisfactory confidence level. Results show that our proposed
methods significantly reduce token consumption while demonstrating competitive
performance compared to baseline fixed budget TTS methods.
                                    
                                        
                                            comment: 25 pages
                                        
                                ☆ Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
                                          Reward models (RMs) play a critical role in aligning AI behaviors with human
preferences, yet they face two fundamental challenges: (1) Modality Imbalance,
where most RMs are mainly focused on text and image modalities, offering
limited support for video, audio, and other modalities; and (2) Preference
Rigidity, where training on fixed binary preference pairs fails to capture the
complexity and diversity of personalized preferences. To address the above
challenges, we propose Omni-Reward, a step toward generalist omni-modal reward
modeling with support for free-form preferences, consisting of: (1) Evaluation:
We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form
preferences, covering nine tasks across five modalities including text, image,
video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal
preference dataset comprising 248K general preference pairs and 69K
instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We
propose Omni-RewardModel, which includes both discriminative and generative
RMs, and achieves strong performance on Omni-RewardBench as well as other
widely used reward modeling benchmarks.
                                    
                                        
                                            comment: 48 pages, 17 figures
                                        
                                ☆ A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration
                                          The growing intersection of cybersecurity and law creates a complex
information space where traditional legal research tools struggle to deal with
nuanced connections between cases, statutes, and technical vulnerabilities.
This knowledge divide hinders collaboration between legal experts and
cybersecurity professionals. To address this important gap, this work provides
a first step towards intelligent systems capable of navigating the increasingly
intricate cyber-legal domain. We demonstrate promising initial results on
multilingual tasks.
                                    
                                        
                                            comment: 7 pages
                                        
                                ☆ EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting
                                          The immense success of the Transformer architecture
  in Natural Language Processing has led to its adoption in Time Se ries
Forecasting (TSF), where superior performance has been shown.
  However, a recent important paper questioned their effectiveness by
  demonstrating that a simple single layer linear model outperforms
  Transformer-based models. This was soon shown to be not as valid,
  by a better transformer-based model termed PatchTST. More re cently, TimeLLM
demonstrated even better results by repurposing a
  Large Language Model (LLM) for the TSF domain. Again, a follow
  up paper challenged this by demonstrating that removing the LLM
  component or replacing it with a basic attention layer in fact yields
  better performance. One of the challenges in forecasting is the fact
  that TSF data favors the more recent past, and is sometimes subject
  to unpredictable events. Based upon these recent insights in TSF, we
  propose a strong Mixture of Experts (MoE) framework. Our method
  combines the state-of-the-art (SOTA) models including xLSTM, en hanced
Linear, PatchTST, and minGRU, among others. This set of
  complimentary and diverse models for TSF are integrated in a Trans former
based MoE gating network. Our proposed model outperforms
  all existing TSF models on standard benchmarks, surpassing even the
  latest approaches based on MoE frameworks.
                                    
                                ☆ Detecting Religious Language in Climate Discourse
                                          Religious language continues to permeate contemporary discourse, even in
ostensibly secular domains such as environmental activism and climate change
debates. This paper investigates how explicit and implicit forms of religious
language appear in climate-related texts produced by secular and religious
nongovernmental organizations (NGOs). We introduce a dual methodological
approach: a rule-based model using a hierarchical tree of religious terms
derived from ecotheology literature, and large language models (LLMs) operating
in a zero-shot setting. Using a dataset of more than 880,000 sentences, we
compare how these methods detect religious language and analyze points of
agreement and divergence. The results show that the rule-based method
consistently labels more sentences as religious than LLMs. These findings
highlight not only the methodological challenges of computationally detecting
religious language but also the broader tension over whether religious language
should be defined by vocabulary alone or by contextual meaning. This study
contributes to digital methods in religious studies by demonstrating both the
potential and the limitations of approaches for analyzing how the sacred
persists in climate discourse.
                                    
                                ☆ How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
                                          Artificial intelligence is reshaping labor markets, yet we lack tools to
systematically forecast its effects on employment. This paper introduces a
benchmark for evaluating how well large language models (LLMs) can anticipate
changes in job demand, especially in occupations affected by AI. Existing
research has shown that LLMs can extract sentiment, summarize economic reports,
and emulate forecaster behavior, but little work has assessed their use for
forward-looking labor prediction. Our benchmark combines two complementary
datasets: a high-frequency index of sector-level job postings in the United
States, and a global dataset of projected occupational changes due to AI
adoption. We format these data into forecasting tasks with clear temporal
splits, minimizing the risk of information leakage. We then evaluate LLMs using
multiple prompting strategies, comparing task-scaffolded, persona-driven, and
hybrid approaches across model families. We assess both quantitative accuracy
and qualitative consistency over time. Results show that structured task
prompts consistently improve forecast stability, while persona prompts offer
advantages on short-term trends. However, performance varies significantly
across sectors and horizons, highlighting the need for domain-aware prompting
and rigorous evaluation protocols. By releasing our benchmark, we aim to
support future research on labor forecasting, prompt design, and LLM-based
economic reasoning. This work contributes to a growing body of research on how
LLMs interact with real-world economic data, and provides a reproducible
testbed for studying the limits and opportunities of AI as a forecasting tool
in the context of labor markets.
                                    
                                        
                                            comment: 8 pages + Limitations + References
                                        
                                ☆ LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data
                                          The scarcity of high-quality knowledge graphs (KGs) remains a critical
bottleneck for downstream AI applications, as existing extraction methods rely
heavily on error-prone pattern-matching techniques or resource-intensive large
language models (LLMs). While recent tools leverage LLMs to generate KGs, their
computational demands limit accessibility for low-resource environments. Our
paper introduces LightKGG, a novel framework that enables efficient KG
extraction from textual data using small-scale language models (SLMs) through
two key technical innovations: (1) Context-integrated Graph extraction
integrates contextual information with nodes and edges into a unified graph
structure, reducing the reliance on complex semantic processing while
maintaining more key information; (2) Topology-enhanced relationship inference
leverages the inherent topology of the extracted graph to efficiently infer
relationships, enabling relationship discovery without relying on complex
language understanding capabilities of LLMs. By enabling accurate KG
construction with minimal hardware requirements, this work bridges the gap
between automated knowledge extraction and practical deployment scenarios while
introducing scientifically rigorous methods for optimizing SLM efficiency in
structured NLP tasks.
                                    
                                ☆ Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps
                                          Adaptive agent design offers a way to improve human-AI collaboration on
time-sensitive tasks in rapidly changing environments. In such cases, to ensure
the human maintains an accurate understanding of critical task elements, an
assistive agent must not only identify the highest priority information but
also estimate how and when this information can be communicated most
effectively, given that human attention represents a zero-sum cognitive
resource where focus on one message diminishes awareness of other or upcoming
information. We introduce a theoretical framework for adaptive signalling which
meets these challenges by using principles of rational communication,
formalised as Bayesian reference resolution using the Rational Speech Act (RSA)
modelling framework, to plan a sequence of messages which optimise timely
alignment between user belief and a dynamic environment. The agent adapts
message specificity and timing to the particulars of a user and scenario based
on projections of how prior-guided interpretation of messages will influence
attention to the interface and subsequent belief update, across several
timesteps out to a fixed horizon. In a comparison to baseline methods, we show
that this effectiveness depends crucially on combining multi-step planning with
a realistic model of user awareness. As the first application of RSA for
communication in a dynamic environment, and for human-AI interaction in
general, we establish theoretical foundations for pragmatic communication in
human-agent teams, highlighting how insights from cognitive science can be
capitalised to inform the design of assistive agents.
                                    
                                        
                                            comment: 11 pages, 3 figures
                                        
                                ☆ BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
                                          Human-like virtual characters are crucial for games, storytelling, and
virtual reality, yet current methods rely heavily on annotated data or
handcrafted persona prompts, making it difficult to scale up and generate
realistic, contextually coherent personas. We create the first QA dataset for
BaZi-based persona reasoning, where real human experiences categorized into
wealth, health, kinship, career, and relationships are represented as
life-event questions and answers. Furthermore, we propose the first BaZi-LLM
system that integrates symbolic reasoning with large language models to
generate temporally dynamic and fine-grained virtual personas. Compared with
mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a
30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information
is used, our model's accuracy drops by 20%-45%, showing the potential of
culturally grounded symbolic-LLM integration for realistic character
simulation.
                                    
                                ☆ Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models
                                        
                                            
                                        
                                        
                                            
                                        
                                        Mohammad Atif Quamar, Mohammad Areeb, Nishant Sharma, Ananth Shreekumar, Jonathan Rosenthal, Muslum Ozgur Ozmen, Mikhail Kuznetsov, Z. Berkay Celik
                                    
                                    
                                          LLM alignment remains a critical challenge. Inference-time methods provide a
flexible alternative to fine-tuning, but their uniform computational effort
often yields suboptimal alignment. We hypothesize that for many alignment
tasks, the initial tokens of a response are disproportionately more critical.
To leverage this principle, we introduce AdaSearch, a novel blockwise search
strategy. It adaptively allocates a fixed computational budget using a sampling
schedule, focusing search effort on these critical tokens. We apply AdaSearch
to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our
comprehensive evaluation across eight LLMs demonstrates that AdaSearch
outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates
improve by over 10% for harmlessness generation, controlled sentiment
generation, and for mathematical reasoning tasks relative to Best-of-N.
                                    
                                ☆ LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization LREC 2026
                                          We introduce LibriConvo, a simulated multi-speaker conversational dataset
based on speaker-aware conversation simulation (SASC), designed to support
training and evaluation of speaker diarization and automatic speech recognition
(ASR) systems. Unlike prior resources that mostly rely on semantically
disconnected utterances and implausible temporal gaps, LibriConvo ensures
semantic coherence and realistic conversational timing. Our pipeline leverages
CallHome with external VAD for reliable boundaries, applies compression to
reduce unnaturally long silences, and organizes LibriTTS utterances by book to
maintain contextual consistency. Acoustic realism is enhanced via a novel room
impulse response selection procedure that ranks speaker-microphone
configurations by spatial plausibility, balancing realism and diversity. The
dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers,
split in a speaker-disjoint manner for robust evaluation. Baselines show that
the sortformer model outperforms the pyannote pipeline in diarization, while a
fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves
7.29\% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides
a valuable resource for advancing multi-speaker speech processing research with
realistic conversational dynamics and controlled experimental conditions.
                                    
                                        
                                            comment: Submitted to LREC 2026
                                        
                                ☆ Arabic Little STT: Arabic Children Speech Recognition Dataset
                                          The performance of Artificial Intelligence (AI) systems fundamentally depends
on high-quality training data. However, low-resource languages like Arabic
suffer from severe data scarcity. Moreover, the absence of child-specific
speech corpora is an essential gap that poses significant challenges. To
address this gap, we present our created dataset, Arabic Little STT, a dataset
of Levantine Arabic child speech recorded in classrooms, containing 355
utterances from 288 children (ages 6 - 13). We further conduct a systematic
assessment of Whisper, a state-of-the-art automatic speech recognition (ASR)
model, on this dataset and compare its performance with adult Arabic
benchmarks. Our evaluation across eight Whisper variants reveals that even the
best-performing model (Large_v3) struggles significantly, achieving a 0.66 word
error rate (WER) on child speech, starkly contrasting with its sub 0.20 WER on
adult datasets. These results align with other research on English speech.
Results highlight the critical need for dedicated child speech benchmarks and
inclusive training data in ASR development. Emphasizing that such data must be
governed by strict ethical and privacy frameworks to protect sensitive child
information. We hope that this study provides an initial step for future work
on equitable speech technologies for Arabic-speaking children. We hope that our
publicly available dataset enrich the children's demographic representation in
ASR datasets.
                                    
                                ☆ DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model
                                          Text-to-SQL tasks have gained attractive improvements since the release of
ChatGPT. Among them, agent-based frameworks have been widely used in this
field. However, the impact of data-centric strategies on text-to-SQL tasks has
rarely been explored. In this paper, we systemically design a fully automated
data-centric pipeline for text-to-SQL tasks, including \emph{adaptive data
repair}, which can automatically find and fix errors in the training dataset;
and \emph{error data augmentation}, where we specifically diffuse and enhance
erroneous data predicted by the initially trained models. Meanwhile, we propose
a Multi-Model collaboration training schema, aiming to train multiple models
with different augmented data, enabling them to possess distinct capabilities
and work together to complement each other, because it has been found that the
capability of a single fine-tuned model is very limited. Furthermore, we
utilize an ensemble strategy to integrate the capabilities of multiple models
to solve a multiple-choice question, aiming to further improve the accuracy of
text-to-SQL tasks. The experiment results and ablation study have demonstrated
the effectiveness of data-centric pipeline and Multi-Model(MM) interactive
iterative strategies, achieving first place in lightweight text-to-SQL models
(within 70B).
                                    
                                ☆ A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results ICASSP 2026
                                        
                                            
                                        
                                        
                                            
                                        
                                        Thai-Binh Nguyen, Katerina Zmolikova, Pingchuan Ma, Ngoc Quan Pham, Christian Fuegen, Alexander Waibel
                                    
                                    
                                          We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in
the ninth CHiME Challenge, which addresses the cocktail-party problem of
overlapping conversations in a single-room setting using audio, visual, and
contextual cues. MCoRec captures natural multi-party conversations where the
recordings focus on unscripted, casual group chats, leading to extreme speech
overlap of up to 100% and highly fragmented conversational turns. The task
requires systems to answer the question "Who speaks when, what, and with whom?"
by jointly transcribing each speaker's speech and clustering them into their
respective conversations from audio-visual recordings. Audio-only baselines
exceed 100% word error rate, whereas incorporating visual cues yields
substantial 50% improvements, highlighting the importance of multi-modality. In
this manuscript, we present the motivation behind the task, outline the data
collection process, and report the baseline systems developed for the MCoRec.
                                    
                                        
                                            comment: Submitted to ICASSP 2026
                                        
                                ☆ Code Aesthetics with Agentic Reward Feedback
                                          Large Language Models (LLMs) have become valuable assistants for developers
in code-related tasks. While LLMs excel at traditional programming tasks such
as code generation and bug fixing, they struggle with visually-oriented coding
tasks, often producing suboptimal aesthetics. In this paper, we introduce a new
pipeline to enhance the aesthetic quality of LLM-generated code. We first
construct AesCode-358K, a large-scale instruction-tuning dataset focused on
code aesthetics. Next, we propose agentic reward feedback, a multi-agent system
that evaluates executability, static aesthetics, and interactive aesthetics.
Building on this, we develop GRPO-AR, which integrates these signals into the
GRPO algorithm for joint optimization of functionality and code aesthetics.
Finally, we develop OpenDesign, a benchmark for assessing code aesthetics.
Experimental results show that combining supervised fine-tuning on AesCode-358K
with reinforcement learning using agentic reward feedback significantly
improves performance on OpenDesign and also enhances results on existing
benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o
and GPT-4.1, and achieves performance comparable to large open-source models
with 480B-685B parameters, underscoring the effectiveness of our approach.
                                    
                                        
                                            comment: 30 pages, 7 figures
                                        
                                ☆ Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding
                                          Mubeen is a proprietary Arabic language model developed by MASARAT SA,
optimized for deep understanding of Arabic linguistics, Islamic studies, and
cultural heritage. Trained on an extensive collection of authentic Arabic
sources significantly expanded by digitizing historical manuscripts via a
proprietary Arabic OCR engine, the model incorporates seminal scholarly works
in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside
thousands of academic theses and peer-reviewed research papers. Conditioned
through a deep linguistic engineering framework, Mubeen masters not just the
meaning but the eloquence of Arabic, enabling precise understanding across
classical texts, contemporary writing, and regional dialects with focus on
comprehending user intent and delivering accurate, contextually relevant
responses. Unlike other Arabic models relying on translated English data that
often fail in intent detection or retrieval-augmented generation (RAG), Mubeen
uses native Arabic sources to ensure cultural authenticity and accuracy. Its
core innovation is the Practical Closure Architecture, designed to solve the
"Utility Gap Crisis" where factually correct answers fail to resolve users'
core needs, forcing them into frustrating cycles of re-prompting. By
prioritizing clarity and decisive guidance, Mubeen transforms from an
information repository into a decisive guide, aligning with Saudi Vision 2030.
The model's architecture combines deep heritage specialization with
multi-disciplinary expert modules, enabling robust performance across both
cultural preservation and general knowledge domains.
                                    
                                        
                                            comment: 21 pages, 2 figures, 3 tables. Includes appendices on ethical
  guidelines and training framework. Submitted September 04, 2025
                                        
                                ☆ Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? AACL
                                        
                                            
                                        
                                        
                                            
                                        
                                        Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque, Md. Rezuwan Hassan, Kanij Fatema, Tanmoy Shome, Ruwad Naswan, Md. Foriduzzaman Zihad, Mohaymen Ul Anam, Nazia Tasnim, Hasan Mahmud, Md Kamrul Hasan, Md. Mehedi Hasan Shawon, Farig Sadeque, Tahsin Reasat
                                    
                                    
                                          Conventional research on speech recognition modeling relies on the canonical
form for most low-resource languages while automatic speech recognition (ASR)
for regional dialects is treated as a fine-tuning task. To investigate the
effects of dialectal variations on ASR we develop a 78-hour annotated Bengali
Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and
data-driven perspectives shows that speech foundation models struggle heavily
in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe
that all deep learning methods struggle to model speech data under dialectal
variations but dialect specific model training alleviates the issue. Our
dataset also serves as a out of-distribution (OOD) resource for ASR modeling
under constrained resources in ASR algorithms. The dataset and code developed
for this project are publicly available
                                    
                                        
                                            comment: This manuscript contains 11 pages, 5 tables and 16 figures This was
  accepted at International Joint Conference on Natural Language Processing &
  Asia-Pacific Chapter of the Association for Computational Linguistics
  (IJCNLP-AACL) 2025
                                        
                                ☆ Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
                                          Automating radiology report generation with Large Vision-Language Models
(LVLMs) holds great potential, yet these models often produce clinically
critical hallucinations, posing serious risks. Existing hallucination detection
methods frequently lack the necessary sentence-level granularity or robust
generalization across different LVLM generators. We introduce a novel approach:
a sentence-level Process Reward Model (PRM) adapted for this vision-language
task. Our PRM predicts the factual correctness of each generated sentence,
conditioned on clinical context and preceding text. When fine-tuned on
MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM
outperforms existing verification techniques, demonstrating, for instance,
relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in
AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods
reliant on internal model states, our PRM demonstrates strong generalization to
an unseen LVLM. We further show its practical utility: PRM scores effectively
filter low-quality reports, improving F1-CheXbert scores by 4.5% (when
discarding the worst 10% of reports). Moreover, when guiding a novel weighted
best-of-N selection process on the MIMIC-CXR test set, our PRM show relative
improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for
BERTScore. These results demonstrate that a lightweight, context-aware PRM
provides a model-agnostic safety layer for clinical LVLMs without access to
internal activations
                                    
                                ☆ PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
                                        
                                            
                                        
                                        
                                            
                                        
                                        Etienne Goffinet, Shane Bergsma, Avraham Sheinin, Natalia Vassilieva, Shaheer Muhammad, Preslav Nakov, Gurpreet Gosal
                                    
                                    
                                          Continual pre-training (CPT) for domain adaptation must balance target-domain
gains with stability on the base domain. Existing CPT scaling laws typically
assume a fixed pre-training budget, which limits their ability to forecast
adaptation outcomes for models trained at different tokens-per-parameter
(PTPP). We present \emph{PTPP-aware} adaptation scaling laws that make the
pre-training budget an explicit variable, enabling accurate \emph{prediction}
of adaptation loss at unseen \ptpp. On a multilingual setup (English/Arabic
$\rightarrow$ French), PTPP-aware formulations trained on early stages
(\ptpp{}=\{15,31\}) predict target loss at \ptpp{}=279 and outperform a
PTPP-agnostic \dcpt{} transfer baseline on metrics (Huber-on-log,
MAE$_\mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in
the appendix. Beyond forecasting, we show a practical use case: planning replay
ratios and adaptation token budgets that satisfy target and forgetting
constraints under compute limits.
                                    
                                ☆ DREaM: Drug-Drug Relation Extraction via Transfer Learning Method
                                          Relation extraction between drugs plays a crucial role in identifying drug
drug interactions and predicting side effects. The advancement of machine
learning methods in relation extraction, along with the development of large
medical text databases, has enabled the low cost extraction of such relations
compared to other approaches that typically require expert knowledge. However,
to the best of our knowledge, there are limited datasets specifically designed
for drug drug relation extraction currently available. Therefore, employing
transfer learning becomes necessary to apply machine learning methods in this
domain. In this study, we propose DREAM, a method that first employs a trained
relation extraction model to discover relations between entities and then
applies this model to a corpus of medical texts to construct an ontology of
drug relationships. The extracted relations are subsequently validated using a
large language model. Quantitative results indicate that the LLM agreed with 71
of the relations extracted from a subset of PubMed abstracts. Furthermore, our
qualitative analysis indicates that this approach can uncover ambiguities in
the medical domain, highlighting the challenges inherent in relation extraction
in this field.
                                    
                                ☆ SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
                                          As large language models (LLMs) develop anthropomorphic abilities, they are
increasingly being deployed as autonomous agents to interact with humans.
However, evaluating their performance in realistic and complex social
interactions remains a significant challenge. Most previous research built
datasets through simulated agent-to-agent interactions, which fails to capture
the authentic linguistic styles and relational dynamics found in real human
conversations. To address this gap, we introduce SI-Bench, a novel benchmark
designed to evaluate aspects of social intelligence in LLMs. Grounded in broad
social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues
collected from a social networking application. We further selected a subset of
312 dialogues for manual annotation across 8 major models. The experiments show
that SOTA models have surpassed the human expert in process reasoning under
complex social situations, yet they still fall behind humans in reply quality.
Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the
performance of LLMs in social dialogue tasks. All datasets are openly available
at https://github.com/SI-Bench/SI-Bench.git.
                                    
                                        
                                            comment: 17 pages, 9 figures
                                        
                                ☆ MATCH: Task-Driven Code Evaluation through Contrastive Learning
                                          AI-based code generation is increasingly prevalent, with GitHub Copilot
estimated to generate 46% of the code on GitHub. Accurately evaluating how well
generated code aligns with developer intent remains a critical challenge.
Traditional evaluation methods, such as unit tests, are often unscalable and
costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code
functionality, and metrics like CodeBERTScore require reference code, which is
not always available. To address the gap in reference-free evaluation, with few
alternatives such as ICE-Score, this paper introduces MATCH, a novel
reference-free metric. MATCH uses Contrastive Learning to generate meaningful
embeddings for code and natural language task descriptions, enabling similarity
scoring that reflects how well generated code implements the task. We show that
MATCH achieves stronger correlations with functional correctness and human
preference than existing metrics across multiple programming languages.
                                    
                                ☆ Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs
                                          The screenplay serves as the foundation for television production, defining
narrative structure, character development, and dialogue. While Large Language
Models (LLMs) show great potential in creative writing, direct end-to-end
generation approaches often fail to produce well-crafted screenplays. We argue
this failure stems from forcing a single model to simultaneously master two
disparate capabilities: creative narrative construction and rigid format
adherence. The resulting outputs may mimic superficial style but lack the deep
structural integrity and storytelling substance required for professional use.
To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage
Refinement (DSR), a decomposed framework that decouples creative narrative
generation from format conversion. The first stage transforms a brief outline
into rich, novel-style prose. The second stage refines this narrative into a
professionally formatted screenplay. This separation enables the model to
specialize in one distinct capability at each stage. A key challenge in
implementing DSR is the scarcity of paired outline-to-novel training data. We
address this through hybrid data synthesis: reverse synthesis deconstructs
existing screenplays into structured inputs, while forward synthesis leverages
these inputs to generate high-quality narrative texts as training targets.
Blind evaluations by professional screenwriters show that DSR achieves a 75%
win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of
human-level performance. Our work demonstrates that decomposed generation
architecture with tailored data synthesis effectively specializes LLMs in
complex creative domains.
                                    
                                ☆ ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix
                                          Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs)
to domain-specific instructions by training on a carefully curated subset of
high-quality instruction-response pairs, typically drawn from a larger dataset
that often contains many low-quality or noisy samples. However, existing
quality-first paradigms often overlook valuable signals in discarded
low-quality data and rely on imperfect quality filters. We introduce ENTP
(Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a
framework that revitalizes low-quality corpora through symbolic purification
and neural reconstruction. The symbolic module identifies and prunes noisy
samples based on statistical priors, while the neural component synthesizes
enriched instruction-response pairs by leveraging latent representations and
model knowledge. This neural-symbolic synergy enhances data informativeness and
diversity. Experiments show that ENTP-augmented datasets, constructed
exclusively from low-quality data, outperform 13 established data-selection
baselines across five instruction-following benchmarks, and even surpass
fine-tuning on the full original dataset (approximately 300K examples). Our
results highlight the untapped potential of low-quality data and underscore the
importance of intelligent purification and synthesis for efficient instruction
alignment.
                                    
                                ☆ Rethinking GSPO: The Perplexity-Entropy Equivalence
                                          We provide a new perspective on GSPO's length-normalized importance ratios by
establishing their connection to information-theoretic quantities. We show that
GSPO's sequence-level weight $s(\theta) =
(\pi_\theta/\pi_{\theta_{\text{old}}})^{1/|y|}$ can be equivalently expressed
as the inverse perplexity ratio
$\text{PPL}_{\theta_{\text{old}}}/\text{PPL}_\theta$ and as the exponential
cross-entropy change $\exp(\Delta H)$. While the perplexity-entropy
relationship follows from standard definitions, this observation provides a
useful lens for understanding GSPO: the algorithm weights policy gradient
updates by perplexity ratios, offering an information-theoretic interpretation
of the importance weights. This perspective helps explain GSPO's empirical
properties, including log-domain variance reduction through geometric averaging
and stability in training mixture-of-experts models. We validate the
mathematical equivalences and variance predictions through controlled
experiments on mathematical reasoning tasks.
                                    
                                        
                                            comment: 10 pages, 2 figures
                                        
                                ☆ Corpus Frequencies in Morphological Inflection: Do They Matter?
                                          The traditional approach to morphological inflection (the task of modifying a
base word (lemma) to express grammatical categories) has been, for decades, to
consider lexical entries of lemma-tag-form triples uniformly, lacking any
information about their frequency distribution. However, in production
deployment, one might expect the user inputs to reflect a real-world
distribution of frequencies in natural texts. With future deployment in mind,
we explore the incorporation of corpus frequency information into the task of
morphological inflection along three key dimensions during system development:
(i) for train-dev-test split, we combine a lemma-disjoint approach, which
evaluates the model's generalization capabilities, with a frequency-weighted
strategy to better reflect the realistic distribution of items across different
frequency bands in training and test sets; (ii) for evaluation, we complement
the standard type accuracy (often referred to simply as accuracy), which treats
all items equally regardless of frequency, with token accuracy, which assigns
greater weight to frequent words and better approximates performance on running
text; (iii) for training data sampling, we introduce a method novel in the
context of inflection, frequency-aware training, which explicitly incorporates
word frequency into the sampling process. We show that frequency-aware training
outperforms uniform sampling in 26 out of 43 languages.
                                    
                                        
                                            comment: Published in the proceedings of ITAT 2025.15 pages, 1 figure, 4
  tables
                                        
                                ☆ Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation NeurIPS 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li
                                    
                                    
                                          Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method
widely used in large language models (LLMs). LoRA essentially describes the
projection of an input space into a low-dimensional output space, with the
dimensionality determined by the LoRA rank. In standard LoRA, all input tokens
share the same weights and undergo an identical input-output projection. This
limits LoRA's ability to capture token-specific information due to the inherent
semantic differences among tokens. To address this limitation, we propose
Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts
LoRA weights according to the input token, thereby learning token-wise
input-output projections in an end-to-end manner. Formally, the weights of
TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank
matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated
from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA
weights but achieves more granular adaptation by learning token-wise LoRA
weights (i.e., token-wise input-output projections). Extensive experiments
across multiple models and datasets demonstrate that TopLoRA consistently
outperforms LoRA and its variants. The code is available at
https://github.com/Leopold1423/toplora-neurips25.
                                    
                                        
                                            comment: Accepted by NeurIPS 2025
                                        
                                ☆ Flexing in 73 Languages: A Single Small Model for Multilingual Inflection
                                          We present a compact, single-model approach to multilingual inflection, the
task of generating inflected word forms from base lemmas to express grammatical
categories. Our model, trained jointly on data from 73 languages, is
lightweight, robust to unseen words, and outperforms monolingual baselines in
most languages. This demonstrates the effectiveness of multilingual modeling
for inflection and highlights its practical benefits: simplifying deployment by
eliminating the need to manage and retrain dozens of separate monolingual
models. In addition to the standard SIGMORPHON shared task benchmarks, we
evaluate our monolingual and multilingual models on 73 Universal Dependencies
(UD) treebanks, extracting lemma-tag-form triples and their frequency counts.
To ensure realistic data splits, we introduce a novel frequency-weighted,
lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack
of an open-source, general-purpose, multilingual morphological inflection
system capable of handling unseen words across a wide range of languages,
including Czech. All code is publicly released at:
https://github.com/tomsouri/multilingual-inflection.
                                    
                                        
                                            comment: Published in the proceedings of TSD 2025. 12 pages, 1 figure, 4
  tables
                                        
                                ☆ Leveraging Hierarchical Organization for Medical Multi-document Summarization
                                          Medical multi-document summarization (MDS) is a complex task that requires
effectively managing cross-document relationships. This paper investigates
whether incorporating hierarchical structures in the inputs of MDS can improve
a model's ability to organize and contextualize information across documents
compared to traditional flat summarization methods. We investigate two ways of
incorporating hierarchical organization across three large language models
(LLMs), and conduct comprehensive evaluations of the resulting summaries using
automated metrics, model-based metrics, and domain expert evaluation of
preference, understandability, clarity, complexity, relevance, coverage,
factuality, and coherence. Our results show that human experts prefer
model-generated summaries over human-written summaries. Hierarchical approaches
generally preserve factuality, coverage, and coherence of information, while
also increasing human preference for summaries. Additionally, we examine
whether simulated judgments from GPT-4 align with human judgments, finding
higher agreement along more objective evaluation facets. Our findings
demonstrate that hierarchical structures can improve the clarity of medical
summaries generated by models while maintaining content coverage, providing a
practical way to improve human preference for generated summaries.
                                    
                                ☆ MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
                                          Recent advances have investigated the use of pretrained large language models
(LLMs) for time-series forecasting by aligning numerical inputs with LLM
embedding spaces. However, existing multimodal approaches often overlook the
distinct statistical properties and temporal dependencies that are fundamental
to time-series data. To bridge this gap, we propose MAP4TS, a novel
Multi-Aspect Prompting Framework that explicitly incorporates classical
time-series analysis into the prompt design. Our framework introduces four
specialized prompt components: a Global Domain Prompt that conveys
dataset-level context, a Local Domain Prompt that encodes recent trends and
series-specific behaviors, and a pair of Statistical and Temporal Prompts that
embed handcrafted insights derived from autocorrelation (ACF), partial
autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined
with raw time-series embeddings and passed through a cross-modality alignment
module to produce unified representations, which are then processed by an LLM
and projected for final forecasting. Extensive experiments across eight diverse
datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based
methods. Our ablation studies further reveal that prompt-aware designs
significantly enhance performance stability and that GPT-2 backbones, when
paired with structured prompts, outperform larger models like LLaMA in
long-term forecasting tasks.
                                    
                                ☆ A Survey on LLM Mid-training
                                        
                                            
                                        
                                        
                                            
                                        
                                        Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
                                    
                                    
                                          Recent advances in foundation models have highlighted the significant
benefits of multi-stage training, with a particular emphasis on the emergence
of mid-training as a vital stage that bridges pre-training and post-training.
Mid-training is distinguished by its use of intermediate data and computational
resources, systematically enhancing specified capabilities such as mathematics,
coding, reasoning, and long-context extension, while maintaining foundational
competencies. This survey provides a formal definition of mid-training for
large language models (LLMs) and investigates optimization frameworks that
encompass data curation, training strategies, and model architecture
optimization. We analyze mainstream model implementations in the context of
objective-driven interventions, illustrating how mid-training serves as a
distinct and critical stage in the progressive development of LLM capabilities.
By clarifying the unique contributions of mid-training, this survey offers a
comprehensive taxonomy and actionable insights, supporting future research and
innovation in the advancement of LLMs.
                                    
                                ☆ Fast-MIA: Efficient and Scalable Membership Inference for LLMs
                                          We propose Fast-MIA (https://github.com/Nikkei/fast-mia), a Python library
for efficiently evaluating membership inference attacks (MIA) against Large
Language Models (LLMs). MIA against LLMs has emerged as a crucial challenge due
to growing concerns over copyright, security, and data privacy, and has
attracted increasing research attention. However, the progress of this research
is significantly hindered by two main obstacles: (1) the high computational
cost of inference in LLMs, and (2) the lack of standardized and maintained
implementations of MIA methods, which makes large-scale empirical comparison
difficult. To address these challenges, our library provides fast batch
inference and includes implementations of representative MIA methods under a
unified evaluation framework. This library supports easy implementation of
reproducible benchmarks with simple configuration and extensibility. We release
Fast-MIA as an open-source (Apache License 2.0) tool to support scalable and
transparent research on LLMs.
                                    
                                ☆ Quality-Aware Translation Tagging in Multilingual RAG system EMNLP 2025
                                          Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English
documents and translates them into the query language for low-resource
settings. However, poor translation quality degrades response generation
performance. Existing approaches either assume sufficient translation quality
or utilize the rewriting method, which introduces factual distortion and
hallucinations. To mitigate these problems, we propose Quality-Aware
Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation
quality along three dimensions-semantic equivalence, grammatical accuracy, and
naturalness&fluency-and attach these scores as metadata without altering the
original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines
in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs
ranging from 2.4B to 14B parameters, covering two low-resource languages
(Korean and Finnish) and one high-resource language (Chinese). QTT-RAG
outperforms the baselines by preserving factual integrity while enabling
generator models to make informed decisions based on translation reliability.
This approach allows for effective usage of cross-lingual documents in
low-resource settings with limited native language documents, offering a
practical and robust solution across multilingual domains.
                                    
                                        
                                            comment: EMNLP 2025 MRL Workshop
                                        
                                ☆ Knocking-Heads Attention
                                          Multi-head attention (MHA) has become the cornerstone of modern large
language models, enhancing representational capacity through parallel attention
heads. However, increasing the number of heads inherently weakens individual
head capacity, and existing attention mechanisms - whether standard MHA or its
variants like grouped-query attention (GQA) and grouped-tied attention (GTA) -
simply concatenate outputs from isolated heads without strong interaction. To
address this limitation, we propose knocking-heads attention (KHA), which
enables attention heads to "knock" on each other - facilitating cross-head
feature-level interactions before the scaled dot-product attention. This is
achieved by applying a shared, diagonally-initialized projection matrix across
all heads. The diagonal initialization preserves head-specific specialization
at the start of training while allowing the model to progressively learn
integrated cross-head representations. KHA adds only minimal parameters and
FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention
variants. We validate KHA by training a 6.1B parameter MoE model (1.01B
activated) on 1T high-quality tokens. Compared to baseline attention
mechanisms, KHA brings superior and more stable training dynamics, achieving
better performance across downstream tasks.
                                    
                                ☆ Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
                                          Large Language Models (LLMs) are widely used as judges to evaluate response
quality, providing a scalable alternative to human evaluation. However, most
LLM judges operate solely on intrinsic text-based reasoning, limiting their
ability to verify complex constraints or perform accurate computation.
Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks,
we propose TIR-Judge, an end-to-end RL framework for training LLM judges that
integrates a code executor for precise evaluation. TIR-Judge is built on three
principles: (i) diverse training across verifiable and non-verifiable domains,
(ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii)
iterative RL that bootstraps directly from the initial model without
distillation. On seven public benchmarks, TIR-Judge surpasses strong
reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and
achieves listwise performance comparable to Claude-Opus-4 despite having only
8B parameters. Remarkably, TIR-Judge-Zero - trained entirely without distilled
judge trajectories, matches the performance of distilled variants,
demonstrating that tool-augmented judges can self-evolve through iterative
reinforcement learning.
                                    
                                        
                                            comment: Work in Progress
                                        
                                ☆ Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
                                          Recent advances in reinforcement learning (RL) have substantially improved
the training of large-scale language models, leading to significant gains in
generation quality and reasoning ability. However, most existing research
focuses on dense models, while RL training for Mixture-of-Experts (MoE)
architectures remains underexplored. To address the instability commonly
observed in MoE training, we propose a novel router-aware approach to optimize
importance sampling (IS) weights in off-policy RL. Specifically, we design a
rescaling strategy guided by router logits, which effectively reduces gradient
variance and mitigates training divergence. Experimental results demonstrate
that our method significantly improves both the convergence stability and the
final performance of MoE models, highlighting the potential of RL algorithmic
innovations tailored to MoE architectures and providing a promising direction
for efficient training of large-scale expert models.
                                    
                                ☆ UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
                                          With the rapid proliferation of image generative models, the authenticity of
digital images has become a significant concern. While existing studies have
proposed various methods for detecting AI-generated content, current benchmarks
are limited in their coverage of diverse generative models and image
categories, often overlooking end-to-end image editing and artistic images. To
address these limitations, we introduce UniAIDet, a unified and comprehensive
benchmark that includes both photographic and artistic images. UniAIDet covers
a wide range of generative models, including text-to-image, image-to-image,
image inpainting, image editing, and deepfake models. Using UniAIDet, we
conduct a comprehensive evaluation of various detection methods and answer
three key research questions regarding generalization capability and the
relation between detection and localization. Our benchmark and analysis provide
a robust foundation for future research.
                                    
                                ☆ M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
                                          Text-to-image models are known to struggle with generating images that
perfectly align with textual prompts. Several previous studies have focused on
evaluating image-text alignment in text-to-image generation. However, these
evaluations either address overly simple scenarios, especially overlooking the
difficulty of prompts with multiple different instances belonging to the same
category, or they introduce metrics that do not correlate well with human
evaluation. In this study, we introduce M$^3$T2IBench, a large-scale,
multi-category, multi-instance, multi-relation along with an
object-detection-based evaluation metric, $AlignScore$, which aligns closely
with human evaluation. Our findings reveal that current open-source
text-to-image models perform poorly on this challenging benchmark.
Additionally, we propose the Revise-Then-Enforce approach to enhance image-text
alignment. This training-free post-editing method demonstrates improvements in
image-text alignment across a broad range of diffusion models. \footnote{Our
code and data has been released in supplementary material and will be made
publicly available after the paper is accepted.}
                                    
                                ☆ LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models
                                          Language educators strive to create a rich experience for learners, while
they may be restricted in the extend of feedback and practice they can provide.
We present the design and development of LangLingual, a conversational agent
built using the LangChain framework and powered by Large Language Models. The
system is specifically designed to provide real-time, grammar-focused feedback,
generate context-aware language exercises and track learner proficiency over
time. The paper discusses the architecture, implementation and evaluation of
LangLingual in detail. The results indicate strong usability, positive learning
outcomes and encouraging learner engagement.
                                    
                                        
                                            comment: 14 pages
                                        
                                ☆ Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures
                                          We perform in-depth evaluations of in-context learning (ICL) on
state-of-the-art transformer, state-space, and hybrid large language models
over two categories of knowledge-based ICL tasks. Using a combination of
behavioral probing and intervention-based methods, we have discovered that,
while LLMs of different architectures can behave similarly in task performance,
their internals could remain different. We discover that function vectors (FVs)
responsible for ICL are primarily located in the self-attention and Mamba
layers, and speculate that Mamba2 uses a different mechanism from FVs to
perform ICL. FVs are more important for ICL involving parametric knowledge
retrieval, but not for contextual knowledge understanding. Our work contributes
to a more nuanced understanding across architectures and task types.
Methodologically, our approach also highlights the importance of combining both
behavioural and mechanistic analyses to investigate LLM capabilities.
                                    
                                ☆ Can Language Models Compose Skills In-Context?
                                          Composing basic skills from simple tasks to accomplish composite tasks is
crucial for modern intelligent systems. We investigate the in-context
composition ability of language models to perform composite tasks that combine
basic skills demonstrated in in-context examples. This is more challenging than
the standard setting, where skills and their composition can be learned in
training. We conduct systematic experiments on various representative
open-source language models, utilizing linguistic and logical tasks designed to
probe composition abilities. The results reveal that simple task examples can
have a surprising negative impact on the performance, because the models
generally struggle to recognize and assemble the skills correctly, even with
Chain-of-Thought examples. Theoretical analysis further shows that it is
crucial to align examples with the corresponding steps in the composition. This
inspires a method for the probing tasks, whose improved performance provides
positive support for our insights.
                                    
                                ☆ Measuring Teaching with LLMs
                                          Objective and scalable measurement of teaching quality is a persistent
challenge in education. While Large Language Models (LLMs) offer potential,
general-purpose models have struggled to reliably apply complex, authentic
classroom observation instruments. This paper uses custom LLMs built on
sentence-level embeddings, an architecture better suited for the long-form,
interpretive nature of classroom transcripts than conventional subword
tokenization. We systematically evaluate five different sentence embeddings
under a data-efficient training regime designed to prevent overfitting. Our
results demonstrate that these specialized models can achieve human-level and
even super-human performance with expert human ratings above 0.65 and
surpassing the average human-human rater correlation. Further, through analysis
of annotation context windows, we find that more advanced models-those better
aligned with human judgments-attribute a larger share of score variation to
lesson-level features rather than isolated utterances, challenging the
sufficiency of single-turn annotation paradigms. Finally, to assess external
validity, we find that aggregate model scores align with teacher value-added
measures, indicating they are capturing features relevant to student learning.
However, this trend does not hold at the individual item level, suggesting that
while the models learn useful signals, they have not yet achieved full
generalization. This work establishes a viable and powerful new methodology for
AI-driven instructional measurement, offering a path toward providing scalable,
reliable, and valid feedback for educator development.
                                    
                                ☆ MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
                                          The widespread adoption of Large Language Models (LLMs) raises critical
concerns about the factual accuracy of their outputs, especially in high-risk
domains such as biomedicine, law, and education. Existing evaluation methods
for short texts often fail on long-form content due to complex reasoning
chains, intertwined perspectives, and cumulative information. To address this,
we propose a systematic approach integrating large-scale long-form datasets,
multi-agent verification mechanisms, and weighted evaluation metrics. We
construct LongHalluQA, a Chinese long-form factuality dataset; and develop
MAD-Fact, a debate-based multi-agent verification system. We introduce a fact
importance hierarchy to capture the varying significance of claims in long-form
texts. Experiments on two benchmarks show that larger LLMs generally maintain
higher factual consistency, while domestic models excel on Chinese content. Our
work provides a structured framework for evaluating and enhancing factual
reliability in long-form LLM outputs, guiding their safe deployment in
sensitive domains.
                                    
                                        
                                            comment: This article has been accepted by Frontiers of Computer Science (FCS)
                                        
                                ☆ Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts EMNLP 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Anwesan Pal, Karen Hovsepian, Tinghao Guo, Mengnan Zhao, Somendra Tripathi, Nikos Kanakaris, George Mihaila, Sumit Nigam
                                    
                                    
                                          Recent investigations into effective context lengths of modern flagship large
language models (LLMs) have revealed major limitations in effective question
answering (QA) and reasoning over long and complex contexts for even the
largest and most impressive cadre of models. While approaches like
retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to
mitigate this issue, they are sensitive to chunking, embedding and retrieval
strategies and models, and furthermore, rely on extensive pre-processing,
knowledge acquisition and indexing steps. In this paper, we propose
Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy
that boosts LLM performance in long-context scenarios, without degrading and
altering the integrity and composition of retrieved documents. We validate our
hypothesis by augmenting two challenging and directly relevant
question-answering benchmarks -- NoLima and NovelQA -- and show that tagging
the context or even just adding tag definitions into QA prompts leads to
consistent performance gains over the baseline -- up to 17% for 32K token
contexts, and 2.9% in complex reasoning question-answering for multi-hop
queries requiring knowledge across a wide span of text. Additional details are
available at https://sites.google.com/view/tag-emnlp.
                                    
                                        
                                            comment: Paper accepted at EMNLP 2025
                                        
                                ☆ Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) NeurIPS 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
                                    
                                    
                                          Language models (LMs) often struggle to generate diverse, human-like creative
content, raising concerns about the long-term homogenization of human thought
through repeated exposure to similar outputs. Yet scalable methods for
evaluating LM output diversity remain limited, especially beyond narrow tasks
such as random number or name generation, or beyond repeated sampling from a
single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse,
real-world, open-ended user queries that admit a wide range of plausible
answers with no single ground truth. We introduce the first comprehensive
taxonomy for characterizing the full spectrum of open-ended prompts posed to
LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that
further breaks down to 17 subcategories. Using Infinity-Chat, we present a
large-scale study of mode collapse in LMs, revealing a pronounced Artificial
Hivemind effect in open-ended generation of LMs, characterized by (1)
intra-model repetition, where a single model consistently generates similar
responses, and more so (2) inter-model homogeneity, where different models
produce strikingly similar outputs. Infinity-Chat also includes 31,250 human
annotations, across absolute ratings and pairwise preferences, with 25
independent human annotations per example. This enables studying collective and
individual-specific human preferences in response to open-ended queries. Our
findings show that LMs, reward models, and LM judges are less well calibrated
to human ratings on model generations that elicit differing idiosyncratic
annotator preferences, despite maintaining comparable overall quality. Overall,
INFINITY-CHAT presents the first large-scale resource for systematically
studying real-world open-ended queries to LMs, revealing critical insights to
guide future research for mitigating long-term AI safety risks posed by the
Artificial Hivemind.
                                    
                                        
                                            comment: NeurIPS 2025 D&B Paper (Oral); Camera-Ready Version
                                        
                                ☆ Language Server CLI Empowers Language Agents with Process Rewards
                                          Large language models routinely hallucinate APIs and mislocalize edits, while
language servers compute verified, IDE-grade facts about real code. We present
Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language
Server Protocol (LSP) server for coding agents and CI, exposing deterministic,
replayable workflows. Our position is that language servers provide not only
structural information (definitions, references, types, diagnostics) but also
an actionable process reward: machine-checked, step-wise signals that align an
agent's planning loop with program reality. In this work, Lanser-CLI
contributes: (i) a robust addressing scheme beyond brittle "file:line:col" via
a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a
principled relocation algorithm; (ii) deterministic Analysis Bundles that
normalize Language Server responses and capture environment/capability metadata
with stable content hashes; (iii) a safety envelope for mutating operations
(rename, code actions) with preview, workspace jails, and Git-aware,
transactional apply; and (iv) a process-reward functional derived from Language
Server facts (diagnostic deltas, disambiguation confidence, and safe-apply
checks) that is computable online and replayable offline. We formalize
determinism under frozen snapshots and establish a monotonicity property for
the process reward, making it suitable for process supervision and
counterfactual analysis. Project Page:
https://github.com/yifanzhang-pro/lanser-cli
                                    
                                        
                                            comment: Project Page: https://github.com/yifanzhang-pro/lanser-cli
                                        
                                ☆ Modeling Political Discourse with Sentence-BERT and BERTopic
                                          Social media has reshaped political discourse, offering politicians a
platform for direct engagement while reinforcing polarization and ideological
divides. This study introduces a novel topic evolution framework that
integrates BERTopic-based topic modeling with Moral Foundations Theory (MFT) to
analyze the longevity and moral dimensions of political topics in Twitter
activity during the 117th U.S. Congress. We propose a methodology for tracking
dynamic topic shifts over time and measuring their association with moral
values and quantifying topic persistence. Our findings reveal that while
overarching themes remain stable, granular topics tend to dissolve rapidly,
limiting their long-term influence. Moreover, moral foundations play a critical
role in topic longevity, with Care and Loyalty dominating durable topics, while
partisan differences manifest in distinct moral framing strategies. This work
contributes to the field of social network analysis and computational political
discourse by offering a scalable, interpretable approach to understanding
moral-driven topic evolution on social media.
                                    
                                        
                                            comment: 11 pages. Continues previous study by Mendonca M. and Figueira A,
  2023: "Analyzing Political Discourse in the 117th U.S. Congress Using
  Transformer-Based Topic Models", presented at the International Conference on
  Computational Social Science
                                        
                                ☆ Offline Preference Optimization via Maximum Marginal Likelihood Estimation
                                          Aligning Large Language Models (LLMs) with human preferences is crucial, but
standard methods like Reinforcement Learning from Human Feedback (RLHF) are
often complex and unstable. In this work, we propose a new, simpler approach
that recasts alignment through the lens of Maximum Marginal Likelihood (MML)
estimation. Our new MML based Preference Optimization (MMPO) maximizes the
marginal log-likelihood of a preferred text output, using the preference pair
as samples for approximation, and forgoes the need for both an explicit reward
model and entropy maximization. We theoretically demonstrate that MMPO
implicitly performs preference optimization, producing a weighted gradient that
naturally up-weights chosen responses over rejected ones. Across models ranging
from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable
with respect to the hyperparameter $\beta$ compared to alternative baselines,
and 2) achieves competitive or superior preference alignment while better
preserving the base model's general language capabilities. Through a series of
ablation experiments, we show that this improved performance is indeed
attributable to MMPO's implicit preference optimization within the gradient
updates.
                                    
                                ♻ ☆ Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
                                          Large Language Models (LLMs) deployed in real-world settings increasingly
face the need to unlearn sensitive, outdated, or proprietary information.
Existing unlearning methods typically formulate forgetting and retention as a
regularized trade-off, combining both objectives into a single scalarized loss.
This often leads to unstable optimization and degraded performance on retained
data, especially under aggressive forgetting. We propose a new formulation of
LLM unlearning as a constrained optimization problem: forgetting is enforced
via a novel logit-margin flattening loss that explicitly drives the output
distribution toward uniformity on a designated forget set, while retention is
preserved through a hard constraint on a separate retain set. Compared to
entropy-based objectives, our loss is softmax-free, numerically stable, and
maintains non-vanishing gradients, enabling more efficient and robust
optimization. We solve the constrained problem using a scalable primal-dual
algorithm that exposes the trade-off between forgetting and retention through
the dynamics of the dual variable, all without any extra computational
overhead. Evaluations on the TOFU and MUSE benchmarks across diverse LLM
architectures demonstrate that our approach consistently matches or exceeds
state-of-the-art baselines, effectively removing targeted information while
preserving downstream utility.
                                    
                                        
                                            comment: The Thirty-Ninth Annual Conference on Neural Information Processing
  Systems
                                        
                                ♻ ☆ LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
                                        
                                            
                                        
                                        
                                            
                                        
                                        Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, Liqing Zhang
                                    
                                    
                                          Large language models (LLMs) and emerging agentic frameworks are beginning to
transform single-cell biology by enabling natural-language reasoning,
generative annotation, and multimodal data integration. However, progress
remains fragmented across data modalities, architectures, and evaluation
standards. LLM4Cell presents the first unified survey of 58 foundation and
agentic models developed for single-cell research, spanning RNA, ATAC,
multi-omic, and spatial modalities. We categorize these methods into five
families-foundation, text-bridge, spatial, multimodal, epigenomic, and
agentic-and map them to eight key analytical tasks including annotation,
trajectory and perturbation modeling, and drug-response prediction. Drawing on
over 40 public datasets, we analyze benchmark suitability, data diversity, and
ethical or scalability constraints, and evaluate models across 10 domain
dimensions covering biological grounding, multi-omics alignment, fairness,
privacy, and explainability. By linking datasets, models, and evaluation
domains, LLM4Cell provides the first integrated view of language-driven
single-cell intelligence and outlines open challenges in interpretability,
standardization, and trustworthy model development.
                                    
                                        
                                            comment: 34 pages, 5 figures, 7 tables
                                        
                                ♻ ☆ SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
                                          Fine-tuning large language models (LLMs) is a common practice to adapt
generalist models to specialized domains. However, recent studies show that
fine-tuning can erode safety alignment, causing LLMs to respond to harmful or
unethical prompts. Many methods to realign safety have been proposed, but often
introduce custom algorithms that are difficult to implement or compromise task
utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning
framework that preserves safety while maintaining downstream performance.
SafeMERGE selectively merges fine-tuned with safety-aligned model layers only
when they deviate from safe behavior, measured by a cosine similarity
criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces
harmful outputs compared to other defenses, with negligible or even positive
impact on utility. Our results demonstrate that selective layer-wise merging
offers an effective safeguard against the inadvertent loss of safety during
fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.
                                    
                                ♻ ☆ Superficial Self-Improved Reasoners Benefit from Model Merging EMNLP 2025
                                          As scaled language models (LMs) approach human-level reasoning capabilities,
self-improvement emerges as a solution to synthesizing high-quality data
corpus. While previous research has identified model collapse as a risk in
self-improvement, where model outputs become increasingly deterministic, we
discover a more fundamental challenge: the superficial self-improved reasoners
phenomenon. In particular, our analysis reveals that even when LMs show
improved in-domain (ID) reasoning accuracy, they actually compromise their
generalized reasoning capabilities on out-of-domain (OOD) tasks due to
memorization rather than genuine. Through a systematic investigation of LM
architecture, we discover that during self-improvement, LM weight updates are
concentrated in less reasoning-critical layers, leading to superficial
learning. To address this, we propose Iterative Model Merging (IMM), a method
that strategically combines weights from original and self-improved models to
preserve generalization while incorporating genuine reasoning improvements. Our
approach effectively mitigates both LM collapse and superficial learning,
moving towards more stable self-improving systems.
                                    
                                        
                                            comment: EMNLP 2025
                                        
                                ♻ ☆ SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models
                                        
                                            
                                        
                                        
                                            
                                        
                                        Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Fernando Koch, Walid Saad, Holger Boche
                                    
                                    
                                          Fine-tuning large language models (LLMs) on telecom datasets is a common
practice to adapt general-purpose models to the telecom domain. However, little
attention has been paid to how this process may compromise model safety. Recent
research has shown that even benign fine-tuning can degrade the safety
alignment of LLMs, causing them to respond to harmful or unethical user
queries. In this paper, we investigate this issue by fine-tuning LLMs on three
representative telecom datasets and show that safety degrades even for light
telecom domain adaptation. To this end, we introduce TeleHarm, the first
telecom-specific red-teaming benchmark, which we use alongside established
Direct-Harm and HexPhi datasets to systematically assess harmful behavior. We
further extend our analysis to publicly available TeleLLMs that were
continually pre-trained on large telecom corpora, revealing that safety
alignment is severely lacking, primarily due to the omission of safety-focused
instruction tuning. To address these issues, we evaluate three realignment
defenses: SafeInstruct, SafeLoRA, SafeMERGE. We show that, across all settings,
the proposed defenses can effectively restore safety without compromising
telecom task performance, leading to Safe teleCOMMunication (SafeCOMM) models.
Our work serves as both a diagnostic study and practical guide for safety
realignment in telecom-tuned LLMs, underscoring the need for safety-aware
instruction and fine-tuning in the telecom domain.
                                    
                                ♻ ☆ Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance
                                          Recent work on large language models (LLMs) has increasingly focused on
post-training and alignment with datasets curated to enhance instruction
following, world knowledge, and specialized skills. However, most post-training
datasets used in leading open- and closed-source LLMs remain inaccessible to
the public, with limited information about their construction process. This
lack of transparency has motivated the recent development of open-source
post-training corpora. While training on these open alternatives can yield
performance comparable to that of leading models, systematic comparisons remain
challenging due to the significant computational cost of conducting them
rigorously at scale, and are therefore largely absent. As a result, it remains
unclear how specific samples, task types, or curation strategies influence
downstream performance when assessing data quality. In this work, we conduct
the first comprehensive side-by-side analysis of two prominent open
post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie
framework, we annotate each sample with detailed quality metrics, including
turn structure (single-turn vs. multi-turn), task category, input quality, and
response quality, and we derive statistics that reveal structural and
qualitative similarities and differences between the two datasets. Based on
these insights, we design a principled curation recipe that produces a new data
mixture, TuluTalk, which contains 14% fewer samples than either source dataset
while matching or exceeding their performance on key benchmarks. Our findings
offer actionable insights for constructing more effective post-training
datasets that improve model performance within practical resource limits. To
support future research, we publicly release both the annotated source datasets
and our curated TuluTalk mixture.
                                    
                                ♻ ☆ Human-Aligned Faithfulness in Toxicity Explanations of LLMs
                                          The discourse around toxicity and LLMs in NLP largely revolves around
detection tasks. This work shifts the focus to evaluating LLMs' reasoning about
toxicity -- from their explanations that justify a stance -- to enhance their
trustworthiness in downstream tasks. Despite extensive research on
explainability, it is not straightforward to adopt existing methods to evaluate
free-form toxicity explanation due to their over-reliance on input text
perturbations, among other challenges. To account for these, we propose a
novel, theoretically-grounded multi-dimensional criterion, Human-Aligned
Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity
explanations align with those of a rational human under ideal conditions. We
develop six metrics, based on uncertainty quantification, to comprehensively
evaluate HAF of LLMs' toxicity explanations with no human involvement, and
highlight how "non-ideal" the explanations are. We conduct several experiments
on three Llama models (of size up to 70B) and an 8B Ministral model on five
diverse toxicity datasets. Our results show that while LLMs generate plausible
explanations to simple prompts, their reasoning about toxicity breaks down when
prompted about the nuanced relations between the complete set of reasons, the
individual reasons, and their toxicity stances, resulting in inconsistent and
irrelevant responses. We open-source our code at
https://github.com/uofthcdslab/HAF and LLM-generated explanations at
https://huggingface.co/collections/uofthcdslab/haf.
                                    
                                        
                                            comment: 23 pages, 5 figures, 7 tables
                                        
                                ♻ ☆ AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
                                          While RAG demonstrates remarkable capabilities in LLM applications, its
effectiveness is hindered by the ever-increasing length of retrieved contexts,
which introduces information redundancy and substantial computational overhead.
Existing context pruning methods, such as LLMLingua, lack contextual awareness
and offer limited flexibility in controlling compression rates, often resulting
in either insufficient pruning or excessive information loss. In this paper, we
propose AttentionRAG, an attention-guided context pruning method for RAG
systems. The core idea of AttentionRAG lies in its attention focus mechanism,
which reformulates RAG queries into a next-token prediction paradigm. This
mechanism isolates the query's semantic focus to a single token, enabling
precise and efficient attention calculation between queries and retrieved
contexts. Extensive experiments on LongBench and Babilong benchmarks show that
AttentionRAG achieves up to 6.3$\times$ context compression while outperforming
LLMLingua methods by around 10\% in key metrics.
                                    
                                ♻ ☆ Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
                                        
                                            
                                        
                                        
                                            
                                        
                                        Wang Bill Zhu, Tianqi Chen, Xinyan Velocity Yu, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
                                    
                                    
                                          Cancer patients are increasingly turning to large language models (LLMs) for
medical information, making it critical to assess how well these models handle
complex, personalized questions. However, current medical benchmarks focus on
medical exams or consumer-searched questions and do not evaluate LLMs on real
patient questions with patient details. In this paper, we first have three
hematology-oncology physicians evaluate cancer-related questions drawn from
real patients. While LLM responses are generally accurate, the models
frequently fail to recognize or address false presuppositions in the questions,
posing risks to safe medical decision-making. To study this limitation
systematically, we introduce Cancer-Myth, an expert-verified adversarial
dataset of 585 cancer-related questions with false presuppositions. On this
benchmark, no frontier LLM -- including GPT-5, Gemini-2.5-Pro, and
Claude-4-Sonnet -- corrects these false presuppositions more than $43\%$ of the
time. To study mitigation strategies, we further construct a 150-question
Cancer-Myth-NFP set, in which physicians confirm the absence of false
presuppositions. We find typical mitigation strategies, such as adding
precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth
to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of
Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on
other medical benchmarks. These findings highlight a critical gap in the
reliability of LLMs, show that prompting alone is not a reliable remedy for
false presuppositions, and underscore the need for more robust safeguards in
medical AI systems.
                                    
                                ♻ ☆ Less is More: Local Intrinsic Dimensions of Contextual Language Models NeurIPS 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gašić
                                    
                                    
                                          Understanding the internal mechanisms of large language models (LLMs) remains
a challenging and complex endeavor. Even fundamental questions, such as how
fine-tuning affects model behavior, often require extensive empirical
evaluation. In this paper, we introduce a novel perspective based on the
geometric properties of contextual latent embeddings to study the effects of
training and fine-tuning. To that end, we measure the local dimensions of a
contextual language model's latent space and analyze their shifts during
training and fine-tuning. We show that the local dimensions provide insights
into the model's training dynamics and generalization ability. Specifically,
the mean of the local dimensions predicts when the model's training
capabilities are exhausted, as exemplified in a dialogue state tracking task,
overfitting, as demonstrated in an emotion recognition task, and grokking, as
illustrated with an arithmetic task. Furthermore, our experiments suggest a
practical heuristic: reductions in the mean local dimension tend to accompany
and predict subsequent performance gains. Through this exploration, we aim to
provide practitioners with a deeper understanding of the implications of
fine-tuning on embedding spaces, facilitating informed decisions when
configuring models for specific applications. The results of this work
contribute to the ongoing discourse on the interpretability, adaptability, and
generalizability of LLMs by bridging the gap between intrinsic model mechanisms
and geometric properties in the respective embeddings.
                                    
                                        
                                            comment: Accepted at the 39th Conference on Neural Information Processing
  Systems (NeurIPS 2025; in press). 10 pages, with an additional 17 pages in
  the appendix. Our code is available at
  https://github.com/aidos-lab/Topo_LLM_public and
  https://github.com/aidos-lab/grokking-via-lid
                                        
                                ♻ ☆ Computational-Assisted Systematic Review and Meta-Analysis (CASMA): Effect of a Subclass of GnRH-a on Endometriosis Recurrence
                                          Background: Evidence synthesis facilitates evidence-based medicine. This task
becomes increasingly difficult to accomplished with applying computational
solutions, since the medical literature grows at astonishing rates. Objective:
This study evaluates an information retrieval-driven workflow, CASMA, to
enhance the efficiency, transparency, and reproducibility of systematic
reviews. Endometriosis recurrence serves as the ideal case due to its complex
and ambiguous literature. Methods: The hybrid approach integrates PRISMA
guidelines with fuzzy matching and regular expression (regex) to facilitate
semi-automated deduplication and filtered records before manual screening. The
workflow synthesised evidence from randomised controlled trials on the efficacy
of a subclass of gonadotropin-releasing hormone agonists (GnRH-a). A modified
splitting method addressed unit-of-analysis errors in multi-arm trials.
Results: The workflow sharply reduced the screening workload, taking only 11
days to fetch and filter 33,444 records. Seven eligible RCTs were synthesized
(841 patients). The pooled random-effects model yielded a Risk Ratio (RR) of
$0.64$ ($95\%$ CI $0.48$ to $0.86$), demonstrating a $36\%$ reduction in
recurrence, with non-significant heterogeneity ($I^2=0.00\%$, $\tau^2=0.00$).
The findings were robust and stable, as they were backed by sensitivity
analyses. Conclusion: This study demonstrates an application of an
information-retrieval-driven workflow for medical evidence synthesis. The
approach yields valuable clinical results and a generalisable framework to
scale up the evidence synthesis, bridging the gap between clinical research and
computer science.
                                    
                                        
                                            comment: 15 pages, 12 figures and 4 tables. This work describes an information
  retrieval-driven workflow for medical evidence synthesis, with an application
  to endometriosis recurrence. The method can be generalized to other
  systematic reviews. The preregistered protocol is available:
  https://doi.org/10.17605/OSF.IO/R2DFA
                                        
                                ♻ ☆ How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
                                          Large language models (LLMs) have shown remarkable capabilities in many
languages beyond English. Yet, LLMs require more inference steps when
generating non-English text due to their reliance on English-centric tokenizers
and vocabulary, resulting in higher usage costs to non-English speakers.
Vocabulary expansion with target language tokens is a widely used cross-lingual
vocabulary adaptation approach to remedy this issue. Despite its effectiveness
in inference speedup, previous work on vocabulary expansion has focused on
high-resource settings assuming access to a substantial amount of target
language data to effectively initialize the embeddings of the new tokens and
adapt the LLM to the target language. However, vocabulary expansion in
low-resource settings has yet to be explored. In this article, we investigate
vocabulary expansion in low-resource settings by considering embedding
initialization methods and continual pre-training strategies. Through extensive
experiments across typologically diverse languages, tasks and models, we
establish a set of strategies to perform vocabulary expansion for faster
inference, while striving to maintain competitive downstream performance to
baselines. This is achieved with only 30K sentences ($\sim$0.01GB text data)
from the target language.
                                    
                                        
                                            comment: Accepted to Computational Linguistics
                                        
                                ♻ ☆ A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
                                        
                                            
                                        
                                        
                                            
                                        
                                        Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
                                    
                                    
                                          With the rapid adoption of Large Language Models (LLMs), LLM-adapters have
become increasingly common, providing lightweight specialization of large-scale
models. Serving hundreds or thousands of these adapters on a single GPU allows
request aggregation, increasing throughput, but may also cause request
starvation if GPU memory limits are exceeded. To address this issue, this study
focuses on determining the joint configuration of concurrent and parallel
adapters that maximizes GPU throughput without inducing starvation, given
heterogeneous adapter and traffic properties. We propose a data-driven ML
approach leveraging interpretable models to tackle this caching problem and
introduce the first Digital Twin capable of reproducing an LLM-adapter serving
system, enabling efficient training data generation. Experiments with the vLLM
framework and LoRA adapters show that the Digital Twin reproduces throughput
within 5.1% of real results, while the ML approach predicts optimal numbers of
concurrent and parallel adapters with an error of at most 7.2% under
heterogeneous, real-world workloads.
                                    
                                        
                                            comment: Accepted in a computer science workshop
                                        
                                ♻ ☆ Steering Evaluation-Aware Language Models to Act Like They Are Deployed
                                          Large language models (LLMs) can sometimes detect when they are being
evaluated and adjust their behavior to appear more aligned, compromising the
reliability of safety evaluations. In this paper, we show that adding a
steering vector to an LLM's activations can suppress evaluation-awareness and
make the model act like it is deployed during evaluation. To study our steering
technique, we train an LLM to exhibit evaluation-aware behavior using a
two-step training process designed to mimic how this behavior could emerge
naturally. First, we perform continued pretraining on documents with factual
descriptions of the model (1) using Python type hints during evaluation but not
during deployment and (2) recognizing that the presence of a certain evaluation
cue always means that it is being tested. Then, we train the model with expert
iteration to use Python type hints in evaluation settings. The resulting model
is evaluation-aware: it writes type hints in evaluation contexts more than
deployment contexts. We find that activation steering can suppress evaluation
awareness and make the model act like it is deployed even when the cue is
present. Importantly, we constructed our steering vector using the original
model before our additional training. Our results suggest that AI evaluators
could improve the reliability of safety evaluations by steering models to act
like they are deployed.
                                    
                                ♻ ☆ Estimating LLM Consistency: A User Baseline vs Surrogate Metrics EMNLP 2025
                                          Large language models (LLMs) are prone to hallucinations and sensitiveto
prompt perturbations, often resulting in inconsistent or unreliablegenerated
text. Different methods have been proposed to mitigate suchhallucinations and
fragility, one of which is to measure theconsistency of LLM responses -- the
model's confidence in the responseor likelihood of generating a similar
response when resampled. Inprevious work, measuring LLM response consistency
often relied oncalculating the probability of a response appearing within a
pool of resampledresponses, analyzing internal states, or evaluating logits of
resopnses.However, it was not clear how well theseapproaches approximated
users' perceptions of consistency of LLMresponses. To find out, we performed a
user study ($n=2,976$)demonstrating that current methods for measuring LLM
responseconsistency typically do not align well with humans' perceptions of
LLMconsistency. We propose a logit-based ensemble method for estimatingLLM
consistency and show that our method matches the performance of
thebest-performing existing metric in estimating human ratings of
LLMconsistency. Our results suggest that methods for estimating LLMconsistency
without human evaluation are sufficiently imperfect towarrant broader use of
evaluation with human input; this would avoidmisjudging the adequacy of models
because of the imperfections ofautomated consistency metrics.
                                    
                                        
                                            comment: Published as a main conference paper at EMNLP 2025
                                        
                                ♻ ☆ Can Large Language Models Unlock Novel Scientific Research Ideas? EMNLP 2025
                                          The widespread adoption of Large Language Models (LLMs) and publicly
available ChatGPT have marked a significant turning point in the integration of
Artificial Intelligence (AI) into people's everyday lives. This study examines
the ability of Large Language Models (LLMs) to generate future research ideas
from scientific papers. Unlike tasks such as summarization or translation, idea
generation lacks a clearly defined reference set or structure, making manual
evaluation the default standard. However, human evaluation in this setting is
extremely challenging ie: it requires substantial domain expertise, contextual
understanding of the paper, and awareness of the current research landscape.
This makes it time-consuming, costly, and fundamentally non-scalable,
particularly as new LLMs are being released at a rapid pace. Currently, there
is no automated evaluation metric specifically designed for this task. To
address this gap, we propose two automated evaluation metrics: Idea Alignment
Score (IAScore) and Idea Distinctness Index. We further conducted human
evaluation to assess the novelty, relevance, and feasibility of the generated
future research ideas. This investigation offers insights into the evolving
role of LLMs in idea generation, highlighting both its capability and
limitations. Our work contributes to the ongoing efforts in evaluating and
utilizing language models for generating future research ideas. We make our
datasets and codes publicly available
                                    
                                        
                                            comment: EMNLP 2025 (Main)
                                        
                                ♻ ☆ ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
                                          Legal claims refer to the plaintiff's demands in a case and are essential to
guiding judicial reasoning and case resolution. While many works have focused
on improving the efficiency of legal professionals, the research on helping
non-professionals (e.g., plaintiffs) remains unexplored. This paper explores
the problem of legal claim generation based on the given case's facts. First,
we construct ClaimGen-CN, the first dataset for Chinese legal claim generation
task, from various real-world legal disputes. Additionally, we design an
evaluation metric tailored for assessing the generated claims, which
encompasses two essential dimensions: factuality and clarity. Building on this,
we conduct a comprehensive zero-shot evaluation of state-of-the-art general and
legal-domain large language models. Our findings highlight the limitations of
the current models in factual precision and expressive clarity, pointing to the
need for more targeted development in this domain. To encourage further
exploration of this important task, we will make the dataset publicly
available.
                                    
                                ♻ ☆ Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model's Empathy EMNLP 2025
                                          Large Language Models' (LLMs) ability to converse naturally is empowered by
their ability to empathetically understand and respond to their users. However,
emotional experiences are shaped by demographic and cultural contexts. This
raises an important question: Can LLMs demonstrate equitable empathy across
diverse user groups? We propose a framework to investigate how LLMs' cognitive
and affective empathy vary across user personas defined by intersecting
demographic attributes. Our study introduces a novel intersectional analysis
spanning 315 unique personas, constructed from combinations of age, culture,
and gender, across four LLMs. Results show that attributes profoundly shape a
model's empathetic responses. Interestingly, we see that adding multiple
attributes at once can attenuate and reverse expected empathy patterns. We show
that they broadly reflect real-world empathetic trends, with notable
misalignments for certain groups, such as those from Confucian culture. We
complement our quantitative findings with qualitative insights to uncover model
behaviour patterns across different demographic groups. Our findings highlight
the importance of designing empathy-aware LLMs that account for demographic
diversity to promote more inclusive and equitable model behaviour.
                                    
                                        
                                            comment: 9 pages, 4 figures, 4 tables, EMNLP 2025 Findings
                                        
                                ♻ ☆ Bootstrapping Referring Multi-Object Tracking
                                          Referring understanding is a fundamental task that bridges natural language
and visual content by localizing objects described in free-form expressions.
However, existing works are constrained by limited language expressiveness,
lacking the capacity to model object dynamics in spatial numbers and temporal
states. To address these limitations, we introduce a new and general referring
understanding task, termed referring multi-object tracking (RMOT). Its core
idea is to employ a language expression as a semantic cue to guide the
prediction of multi-object tracking, comprehensively accounting for variations
in object quantity and temporal semantics. Along with RMOT, we introduce a RMOT
benchmark named Refer-KITTI-V2, featuring scalable and diverse language
expressions. To efficiently generate high-quality annotations covering object
dynamics with minimal manual effort, we propose a semi-automatic labeling
pipeline that formulates a total of 9,758 language prompts. In addition, we
propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT.
At its core is a query-driven Temporal Enhancement Module that represents each
object as a Transformer query, enabling long-term spatial-temporal interactions
with other objects and past frames to efficiently refine these queries.
TempRMOT achieves state-of-the-art performance on both Refer-KITTI and
Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source
code and dataset is available at https://github.com/zyn213/TempRMOT.
                                    
                                ♻ ☆ Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
                                          Large Multimodal Models (LMMs) are inherently modular, consisting of vision
and audio encoders, projectors, and large language models. Yet, they are almost
always executed monolithically, which underutilizes the heterogeneous
accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end
latency. In this paper, we present NANOMIND, a hardware--software co-design
inference framework for Large Multimodal Models (LMMs) that breaks large models
into modular ``bricks'' (vision, language, audio, etc.) and maps each to its
ideal accelerator. The key insight is that large models can be broken into
modular components and scheduled to run on the most appropriate compute units.
It performs module-level dynamic offloading across accelerators on
unified-memory SoCs. By combining customized hardware design, system-level
scheduling, and optimized low-bit computation kernels, we demonstrate our
framework with a compact, battery-powered device capable of running LMMs
entirely on device. This prototype functions as a self-contained intelligent
assistant that requires no network connectivity, while achieving higher
throughput and superior power efficiency under strict resource constraints. The
design further bypasses CPU bottlenecks and reduces redundant memory usage
through token-aware buffer management and module-level coordination. Our system
outperforms existing implementations in resource efficiency, cutting energy
consumption by 42.3\% and GPU memory usage by 11.2\%. This enables a
battery-powered device to run LLaVA-OneVision with a camera for nearly half a
day and LLaMA-3-8B for voice interactions up to almost 20.8 hours.
                                    
                                ♻ ☆ SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
                                          Large language model (LLM) simulations of human behavior have the potential
to revolutionize the social and behavioral sciences, if and only if they
faithfully reflect real human behaviors. Current evaluations are fragmented,
based on bespoke tasks and metrics, creating a patchwork of incomparable
results. To address this, we introduce SimBench, the first large-scale,
standardized benchmark for a robust, reproducible science of LLM simulation. By
unifying 20 diverse datasets covering tasks from moral decision-making to
economic choice across a large global participant pool, SimBench provides the
necessary foundation to ask fundamental questions about when, how, and why LLM
simulations succeed or fail. We show that, while even the best LLMs today have
limited simulation ability (score: 40.80/100), performance scales log-linearly
with model size. Simulation performance is not improved by increased
inference-time compute. We demonstrate an alignment-simulation trade-off:
instruction-tuning improves performance on low-entropy (consensus) questions
but degrades it on high-entropy (diverse) ones. Models particularly struggle
when simulating specific demographic groups. Finally, we demonstrate that
simulation ability correlates most strongly with deep, knowledge-intensive
reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to
accelerate the development of more faithful LLM simulators.
                                    
                                        
                                            comment: Project Website: http://simbench.tiancheng.hu/ Data:
  https://huggingface.co/datasets/pitehu/SimBench
                                        
                                ♻ ☆ MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses ICLR 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
                                    
                                    
                                          Scientific discovery plays a pivotal role in advancing human society, and
recent progress in large language models (LLMs) suggests their potential to
accelerate this process. However, it remains unclear whether LLMs can
autonomously generate novel and valid hypotheses in chemistry. In this work, we
investigate whether LLMs can discover high-quality chemistry hypotheses given
only a research background-comprising a question and/or a survey-without
restriction on the domain of the question. We begin with the observation that
hypothesis discovery is a seemingly intractable task. To address this, we
propose a formal mathematical decomposition grounded in a fundamental
assumption: that most chemistry hypotheses can be composed from a research
background and a set of inspirations. This decomposition leads to three
practical subtasks-retrieving inspirations, composing hypotheses with
inspirations, and ranking hypotheses - which together constitute a sufficient
set of subtasks for the overall scientific discovery task. We further develop
an agentic LLM framework, MOOSE-Chem, that is a direct implementation of this
mathematical decomposition. To evaluate this framework, we construct a
benchmark of 51 high-impact chemistry papers published and online after January
2024, each manually annotated by PhD chemists with background, inspirations,
and hypothesis. The framework is able to rediscover many hypotheses with high
similarity to the groundtruth, successfully capturing the core
innovations-while ensuring no data contamination since it uses an LLM with
knowledge cutoff date prior to 2024. Finally, based on LLM's surprisingly high
accuracy on inspiration retrieval, a task with inherently out-of-distribution
nature, we propose a bold assumption: that LLMs may already encode latent
scientific knowledge associations not yet recognized by humans.
                                    
                                        
                                            comment: Accepted by ICLR 2025
                                        
                                ♻ ☆ Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation
                                          Open-domain question answering (OpenQA) represents a cornerstone in natural
language processing (NLP), primarily focused on extracting answers from
unstructured textual data. With the rapid advancements in Large Language Models
(LLMs), LLM-based OpenQA methods have reaped the benefits of emergent
understanding and answering capabilities enabled by massive parameters compared
to traditional methods. However, most of these methods encounter two critical
challenges: how to integrate knowledge into LLMs effectively and how to
adaptively generate results with specific answer formats for various task
situations. To address these challenges, we propose a novel framework named
GenKI, which aims to improve the OpenQA performance by exploring Knowledge
Integration and controllable Generation on LLMs simultaneously. Specifically,
we first train a dense passage retrieval model to retrieve associated knowledge
from a given knowledge base. Subsequently, we introduce a novel knowledge
integration model that incorporates the retrieval knowledge into instructions
during fine-tuning to intensify the model. Furthermore, to enable controllable
generation in LLMs, we leverage a certain fine-tuned LLM and an ensemble based
on text consistency incorporating all coherence, fluency, and answer format
assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO,
and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the
effectiveness of GenKI with comparison of state-of-the-art baselines. Moreover,
ablation studies have disclosed a linear relationship between the frequency of
retrieved knowledge and the model's ability to recall knowledge accurately
against the ground truth. Our code of GenKI is available at
https://github.com/USTC-StarTeam/GenKI
                                    
                                        
                                            comment: 13 pages, 5 figures
                                        
                                ♻ ☆ TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration NeurIPS 2025
                                          Trajectory modeling, which includes research on trajectory data pattern
mining and future prediction, has widespread applications in areas such as life
services, urban transportation, and public administration. Numerous methods
have been proposed to address specific problems within trajectory modeling.
However, the heterogeneity of data and the diversity of trajectory tasks make
effective and reliable trajectory modeling an important yet highly challenging
endeavor, even for domain experts. \fix In this paper, we propose
\textit{TrajAgent}, a agent framework powered by large language models (LLMs),
designed to facilitate robust and efficient trajectory modeling through
automation modeling. This framework leverages and optimizes diverse specialized
models to address various trajectory modeling tasks across different datasets
effectively. \unfix~In \textit{TrajAgent}, we first develop \textit{UniEnv}, an
execution environment with a unified data and model interface, to support the
execution and training of various models. Building on \textit{UniEnv}, we
introduce an agentic workflow designed for automatic trajectory modeling across
various trajectory tasks and data. Furthermore, we introduce collaborative
learning schema between LLM-based agents and small speciallized models, to
enhance the performance of the whole framework effectively. Extensive
experiments on four tasks using four real-world datasets demonstrate the
effectiveness of \textit{TrajAgent} in automated trajectory modeling, achieving
a performance improvement of \fix 2.38\%-69.91\% \unfix over baseline methods.
The codes and data can be accessed via
https://github.com/tsinghua-fib-lab/TrajAgent.
                                    
                                        
                                            comment: Accepted by NeurIPS 2025,
  https://github.com/tsinghua-fib-lab/TrajAgent
                                        
                                ♻ ☆ LLMs can hide text in other text of the same length
                                          A meaningful text can be hidden inside another, completely different yet
still coherent and plausible, text of the same length. For example, a tweet
containing a harsh political critique could be embedded in a tweet that
celebrates the same political leader, or an ordinary product review could
conceal a secret manuscript. This uncanny state of affairs is now possible
thanks to Large Language Models, and in this paper we present a simple and
efficient protocol to achieve it. We show that even modest 8-billion-parameter
open-source LLMs are sufficient to obtain high-quality results, and a message
as long as this abstract can be encoded and decoded locally on a laptop in
seconds. The existence of such a protocol demonstrates a radical decoupling of
text from authorial intent, further eroding trust in written communication,
already shaken by the rise of LLM chatbots. We illustrate this with a concrete
scenario: a company could covertly deploy an unfiltered LLM by encoding its
answers within the compliant responses of a safe model. This possibility raises
urgent questions for AI safety and challenges our understanding of what it
means for a Large Language Model to know something.
                                    
                                        
                                            comment: 21 pages, main paper 9 pages
                                        
                                ♻ ☆ MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search NeurIPS 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
                                    
                                    
                                          Large language models (LLMs) have shown promise in automating scientific
hypothesis generation, yet existing approaches primarily yield coarse-grained
hypotheses lacking critical methodological and experimental details. We
introduce and formally define the new task of fine-grained scientific
hypothesis discovery, which entails generating detailed, experimentally
actionable hypotheses from coarse initial research directions. We frame this as
a combinatorial optimization problem and investigate the upper limits of LLMs'
capacity to solve it when maximally leveraged. Specifically, we explore four
foundational questions: (1) how to best harness an LLM's internal heuristics to
formulate the fine-grained hypothesis it itself would judge as the most
promising among all the possible hypotheses it might generate, based on its own
internal scoring-thus defining a latent reward landscape over the hypothesis
space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment
with ground-truth hypotheses; (3) whether shaping the reward landscape using an
ensemble of diverse LLMs of similar capacity yields better outcomes than
defining it with repeated instances of the strongest LLM among them; and (4)
whether an ensemble of identical LLMs provides a more reliable reward landscape
than a single LLM. To address these questions, we propose a hierarchical search
method that incrementally proposes and integrates details into the hypothesis,
progressing from general concepts to specific experimental configurations. We
show that this hierarchical process smooths the reward landscape and enables
more effective optimization. Empirical evaluations on a new benchmark of
expert-annotated fine-grained hypotheses from recent literature show that our
method consistently outperforms strong baselines.
                                    
                                        
                                            comment: Accepted by NeurIPS 2025
                                        
                                ♻ ☆ The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation NeurIPS 2025
                                          Large language models are able to exploit in-context learning to access
external knowledge beyond their training data through retrieval-augmentation.
While promising, its inner workings remain unclear. In this work, we shed light
on the mechanism of in-context retrieval augmentation for question answering by
viewing a prompt as a composition of informational components. We propose an
attribution-based method to identify specialized attention heads, revealing
in-context heads that comprehend instructions and retrieve relevant contextual
information, and parametric heads that store entities' relational knowledge. To
better understand their roles, we extract function vectors and modify their
attention weights to show how they can influence the answer generation process.
Finally, we leverage the gained insights to trace the sources of knowledge used
during inference, paving the way towards more safe and transparent language
models.
                                    
                                        
                                            comment: Accepted at NeurIPS 2025
                                        
                                ♻ ☆ TaoSR1: The Thinking Model for E-commerce Relevance Search
                                        
                                            
                                        
                                        
                                            
                                        
                                        Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng
                                    
                                    
                                          Query-product relevance prediction is a core task in e-commerce search.
BERT-based models excel at semantic matching but lack complex reasoning
capabilities. While Large Language Models (LLMs) are explored, most still use
discriminative fine-tuning or distill to smaller models for deployment. We
propose a framework to directly deploy LLMs for this task, addressing key
challenges: Chain-of-Thought (CoT) error accumulation, discriminative
hallucination, and deployment feasibility. Our framework, TaoSR1, involves
three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning;
(2) Offline sampling with a pass@N strategy and Direct Preference Optimization
(DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling
with Group Relative Policy Optimization (GRPO) to mitigate discriminative
hallucination. Additionally, post-CoT processing and a cumulative
probability-based partitioning method enable efficient online deployment.
TaoSR1 significantly outperforms baselines on offline datasets and achieves
substantial gains in online side-by-side human evaluations, introducing a novel
paradigm for applying CoT reasoning to relevance classification.
                                    
                                ♻ ☆ Thought Anchors: Which LLM Reasoning Steps Matter?
                                          Current frontier large-language models rely on reasoning to achieve
state-of-the-art performance. Many existing interpretability are limited in
this area, as standard methods have been designed to study single forward
passes of a model rather than the multi-token computational steps that unfold
during reasoning. We argue that analyzing reasoning traces at the sentence
level is a promising approach to understanding reasoning processes. We
introduce a black-box method that measures each sentence's counterfactual
importance by repeatedly sampling replacement sentences from the model,
filtering for semantically different ones, and continuing the chain of thought
from that point onwards to quantify the sentence's impact on the distribution
of final answers. We discover that certain sentences can have an outsized
impact on the trajectory of the reasoning trace and final answer. We term these
sentences \textit{thought anchors}. These are generally planning or uncertainty
management sentences, and specialized attention heads consistently attend from
subsequent sentences to thought anchors. We further show that examining
sentence-sentence causal links within a reasoning trace gives insight into a
model's behavior. Such information can be used to predict a problem's
difficulty and the extent different question domains involve sequential or
diffuse reasoning. As a proof-of-concept, we demonstrate that our techniques
together provide a practical toolkit for analyzing reasoning models by
conducting a detailed case study of how the model solves a difficult math
problem, finding that our techniques yield a consistent picture of the
reasoning trace's structure. We provide an open-source tool
(thought-anchors.com) for visualizing the outputs of our methods on further
problems. The convergence across our methods shows the potential of
sentence-level analysis for a deeper understanding of reasoning models.
                                    
                                        
                                            comment: Paul C. Bogdan and Uzay Macar contributed equally to this work, and
  their listed order was determined by coinflip. Neel Nanda and Arthur Conmy
  contributed equally to this work as senior authors, and their listed order
  was determined by coinflip
                                        
                                ♻ ☆ ThinkBrake: Mitigating Overthinking in Tool Reasoning
                                          Small reasoning models (SRMs) often overthink during tool use: they reach a
correct tool-argument configuration, then continue reasoning and overwrite it
with an incorrect final call. We diagnose overthinking via oracle rollouts that
inject  at sentence boundaries. On the Berkeley Function Calling
Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8\%
to 94.2\% while reducing tokens by 80-94\%, revealing substantial recoverable
headroom and potential redundant reasoning. While prior work on concise
reasoning has largely targeted mathematics, tool reasoning remains
underexplored. We adapt various early-termination baselines to tool use and
introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors
the log-probability margin between  and the current top token at
sentence boundaries and triggers termination when this margin becomes small.
Across BFCL's single turn, non-live and live splits, ThinkBrake preserves or
improves accuracy while reducing tokens up to 25\%, outperforming various
baselines.
                                    
                                ♻ ☆ OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
                                        
                                            
                                        
                                        
                                            
                                        
                                        Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
                                    
                                    
                                          Empathetic interaction is a cornerstone of human-machine communication, due
to the need for understanding speech enriched with paralinguistic cues and
generating emotional and expressive responses. However, the most powerful
empathetic LSLMs are increasingly closed off, leaving the crucial details about
the architecture, data and development opaque to researchers. Given the
critical need for transparent research into the LSLMs and empathetic behavior,
we present OpenS2S, a fully open-source, transparent and end-to-end LSLM
designed to enable empathetic speech interactions. Based on our empathetic
speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved
decoding architecture to achieve low-latency speech generation. To facilitate
end-to-end training, OpenS2S incorporates an automated data construction
pipeline that synthesizes diverse, high-quality empathetic speech dialogues at
low cost. By leveraging large language models to generate empathetic content
and controllable text-to-speech systems to introduce speaker and emotional
variation, we construct a scalable training corpus with rich paralinguistic
diversity and minimal human supervision. We release the fully open-source
OpenS2S model, including the dataset, model weights, pre-training and
fine-tuning codes, to empower the broader research community and accelerate
innovation in empathetic speech systems. The project webpage can be accessed at
https://casia-lm.github.io/OpenS2S
                                    
                                        
                                            comment: Technical Report, Update on OpenS2S_v1.5
                                        
                                ♻ ☆ When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
                                          While Reinforcement Learning from Human Feedback (RLHF) is widely used to
align Large Language Models (LLMs) with human preferences, it typically assumes
homogeneous preferences across users, overlooking diverse human values and
minority viewpoints. Although personalized preference learning addresses this
by tailoring separate preferences for individual users, the field lacks
standardized methods to assess its effectiveness. We present a multi-faceted
evaluation framework that measures not only performance but also fairness,
unintended effects, and adaptability across varying levels of preference
divergence. Through extensive experiments comparing eight personalization
methods across three preference datasets, we demonstrate that performance
differences between methods could reach 36% when users strongly disagree, and
personalization can introduce up to 20% safety misalignment. These findings
highlight the critical need for holistic evaluation approaches to advance the
development of more effective and inclusive preference learning systems.
                                    
                                ♻ ☆ Input Matters: Evaluating Input Structure's Impact on LLM Summaries of Sports Play-by-Play
                                          A major concern when deploying LLMs in accuracy-critical domains such as
sports reporting is that the generated text may not faithfully reflect the
input data. We quantify how input structure affects hallucinations and other
factual errors in LLM-generated summaries of NBA play-by-play data, across
three formats: row-structured, JSON and unstructured. We manually annotated
3,312 factual errors across 180 game summaries produced by two models,
Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input
reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured
input, while row-structured input reduces errors by 54% for Llama and 51% for
Qwen. A two-way repeated measures ANOVA shows that input structure accounts for
over 80% of the variance in error rates, with Tukey HSD post hoc tests
confirming statistically significant differences between all input formats.
                                    
                                        
                                            comment: Accepted at INLG 2025
                                        
                                ♻ ☆ LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
                                        
                                            
                                        
                                        
                                            
                                        
                                        Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang
                                    
                                    
                                          Retrieval-Augmented Generation (RAG) is widely used to mitigate
hallucinations of Large Language Models (LLMs) by leveraging external
knowledge. While effective for simple queries, traditional RAG systems struggle
with large-scale, unstructured corpora where information is fragmented. Recent
advances incorporate knowledge graphs to capture relational structures,
enabling more comprehensive retrieval for complex, multi-hop reasoning tasks.
However, existing graph-based RAG (GraphRAG) methods rely on unstable and
costly relation extraction for graph construction, often producing noisy graphs
with incorrect or inconsistent relations that degrade retrieval quality. In
this paper, we revisit the pipeline of existing GraphRAG systems and propose
LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient
framework that enables reliable graph construction and precise passage
retrieval. Specifically, LinearRAG constructs a relation-free hierarchical
graph, termed Tri-Graph, using only lightweight entity extraction and semantic
linking, avoiding unstable relation modeling. This new paradigm of graph
construction scales linearly with corpus size and incurs no extra token
consumption, providing an economical and reliable indexing of the original
passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant
entity activation via local semantic bridging, followed by (ii) passage
retrieval through global importance aggregation. Extensive experiments on four
datasets demonstrate that LinearRAG significantly outperforms baseline models.
                                    
                                ♻ ☆ Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning
                                          The reasoning capabilities of Large Language Models (LLMs) are typically
developed through the single-turn reinforcement learning, whereas real-world
applications often involve multi-turn interactions with human feedback, leading
to a potential mismatch between training and deployment conditions. In this
work, we study whether multi-turn training with human feedback is necessary for
reasoning tasks. We compare conventional single-turn training with three
multi-turn strategies and reach contrary conclusions to previous research. We
find that models trained in a single-turn setting generalize effectively to
both single- and multi-turn evaluations, while models trained with multi-turn
strategies exhibit a significant degradation in single-turn reasoning
performance. These results suggest that for tasks with complete information,
robust single-turn training remains more effective and reliable, as multi-turn
training with basic feedback provides limited benefits and can even degrade
reasoning capabilities.
                                    
                                ♻ ☆ StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
                                          Stereotypes are known to have very harmful effects, making their detection
critically important. However, current research predominantly focuses on
detecting and evaluating stereotypical biases, thereby leaving the study of
stereotypes in its early stages. Our study revealed that many works have failed
to clearly distinguish between stereotypes and stereotypical biases, which has
significantly slowed progress in advancing research in this area. Stereotype
and Anti-stereotype detection is a problem that requires social knowledge;
hence, it is one of the most difficult areas in Responsible AI. This work
investigates this task, where we propose a five-tuple definition and provide
precise terminologies disentangling stereotypes, anti-stereotypes,
stereotypical bias, and general bias. We provide a conceptual framework
grounded in social psychology for reliable detection. We identify key
shortcomings in existing benchmarks for this task of stereotype and
anti-stereotype detection. To address these gaps, we developed StereoDetect, a
well curated, definition-aligned benchmark dataset designed for this task. We
show that sub-10B language models and GPT-4o frequently misclassify
anti-stereotypes and fail to recognize neutral overgeneralizations. We
demonstrate StereoDetect's effectiveness through multiple qualitative and
quantitative comparisons with existing benchmarks and models fine-tuned on
them. The dataset and code is available at
https://github.com/KaustubhShejole/StereoDetect.
                                    
                                ♻ ☆ Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
                                          Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet,
their adoption in critical domains, such as clinical trial recruitment, remains
limited. As trials are designed in natural language and patient data is
represented as both structured and unstructured text, the task of matching
trials and patients benefits from knowledge aggregation and reasoning abilities
of LLMs. Classical approaches are trial-specific and LLMs with their ability to
consolidate distributed knowledge hold the potential to build a more general
solution. Yet recent applications of LLM-assisted methods rely on proprietary
models and weak evaluation benchmarks. In this survey, we are the first to
analyze the task of trial-patient matching and contextualize emerging LLM-based
approaches in clinical trial recruitment. We critically examine existing
benchmarks, approaches and evaluation frameworks, the challenges to adopting
LLM technologies in clinical research and exciting future directions.
                                    
                                ♻ ☆ Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?
                                          Chain-of-thought (CoT) prompting has emerged as a common technique for
enhancing the reasoning abilities of large language models (LLMs). While
extended reasoning can boost accuracy on complex tasks, it is often unnecessary
and substantially increases token usage, limiting the practicality of reasoning
models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose
controls that enable users to adjust the length of CoT or determine whether it
is used at all. Yet, it remains unclear when CoT should be used: on some tasks
it improves performance, while on others it provides little benefit or even
harms performance. We address this challenge with confidence-gated CoT, where a
model invokes reasoning only when confidence in its direct answer is low. To
this end, we present the first systematic study of training-free confidence
estimation methods for CoT gating. Specifically, we evaluate four training-free
confidence estimation methods and compare them to a random baseline and an
oracle that always knows when CoT is needed. Through extensive experiments, we
show that existing training-free confidence measures can reduce redundant CoT
and outperform randomly invoked CoT. However, the utility of individual
confidence measures is inconsistent, varying with both the dataset and the
model, underscoring the difficulty of deploying confidence-gated CoT in
practice. By analysing both strengths and failure modes, our study highlights
the potential and limitations of current methods and paves the way toward more
reliable adaptive gating of CoT.
                                    
                                        
                                            comment: Under Review
                                        
                                ♻ ☆ First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training NeurIPS 2025
                                          Improving Multi-modal Large Language Models (MLLMs) in the post-training
stage typically relies on supervised fine-tuning (SFT) or reinforcement
learning (RL), which require expensive and manually annotated multi-modal
data--an ultimately unsustainable resource. This limitation has motivated a
growing interest in unsupervised paradigms as a third stage of post-training
after SFT and RL. While recent efforts have explored this direction, their
methods are complex and difficult to iterate. To address this, we propose
MM-UPT, a simple yet effective framework for unsupervised post-training of
MLLMs, enabling continual self-improvement without any external supervision.
The training method of MM-UPT builds upon GRPO, replacing traditional reward
signals with a self-rewarding mechanism based on majority voting over multiple
sampled responses. Our experiments demonstrate that such training method
effectively improves the reasoning ability of Qwen2.5-VL-7B (e.g.,
66.3\%$\rightarrow$72.9\% on MathVista, 62.9\%$\rightarrow$68.7\% on We-Math),
using standard dataset without ground truth labels. To further explore
scalability, we extend our framework to a data self-generation setting,
designing two strategies that prompt the MLLM to synthesize new training
samples on its own. Additional experiments show that combining these synthetic
data with the unsupervised training method can also boost performance,
highlighting a promising approach for scalable self-improvement. Overall,
MM-UPT offers a new paradigm for autonomous enhancement of MLLMs, serving as a
critical third step after initial SFT and RL in the absence of external
supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
                                    
                                        
                                            comment: Accepted by NeurIPS 2025
                                        
                                ♻ ☆ DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
                                          Native multimodal large language models (MLLMs) restructure a single large
language model (LLM) into a spoken language model (SLM) capable of both speech
and text generation. Compared to modular and aligned MLLMs, native MLLMs
preserve richer paralinguistic features such as emotion and prosody, and
generate speech responses directly within the backbone LLM rather than using a
separate speech decoder. This integration also results in lower response
latency and smoother interaction. However, native MLLMs suffer from
catastrophic forgetting and performance degradation because the available
paired speech-text data is insufficient to support the pretraining of MLLMs
compared to the vast amount of text data required to pretrain text LLMs. To
address this issue, we propose DeepTalk, a framework for adaptive modality
expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk
first adaptively distinguishes modality experts according to their modality
load within the LLM. Each modality expert then undergoes specialized
single-modality training, followed by joint multimodal collaborative training.
As a result, DeepTalk incurs only a 5.5% performance drop compared to the
original LLM, which is significantly lower than the average performance drop of
over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par
with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within
0.5 seconds, ensuring a seamless and intelligent speech interaction experience.
Code and models are released at https://github.com/talkking/DeepTalk.
                                    
                                        
                                            comment: Under Review
                                        
                                ♻ ☆ COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection EMNLP 2025
                                          The growing size of large language models has created significant
computational inefficiencies. To address this challenge, sparse activation
methods selectively deactivates non-essential parameters during inference,
reducing computational costs in FFNN layers. While existing methods focus on
non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN
layer lies globally in the form of a linear combination over its internal down
projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN,
leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct
coefficients of the linear combination. Experimental results demonstrate that
D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5%
ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4%
better performance preservation compared to existing methods. Our specialized
kernel implementations effectively realize these theoretical gains into
substantial real-world acceleration.
                                    
                                        
                                            comment: EMNLP 2025 (Main Track)
                                        
                                ♻ ☆ GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
                                          Improving the general capabilities of large language models (LLMs) is an
active research topic. As a common data structure in many real-world domains,
understanding graph data is a crucial part of advancing general intelligence.
To this end, we propose a dynamic benchmark named GraphInstruct in this paper,
which comprehensively includes 21 classical graph reasoning tasks, providing
diverse graph generation pipelines and detailed intermediate reasoning steps
for each sample. Based on GraphInstruct, we develop GraphSolver via efficient
instruction-tuning, which demonstrates prominent graph understanding capability
compared to other open-sourced LLMs. To further endow LLMs with multi-step
graph reasoning capability, we propose a label-mask training strategy and build
GraphSolver+, which leverages masked supervision on intermediate reasoning
tokens to emphasize crucial node-identification signals. As one of the
pioneering efforts to enhance the graph understanding and reasoning abilities
of LLMs, extensive experiments have demonstrated the superiority of GraphSolver
and GraphSolver+ over other LLMs. We sincerely hope GraphInstruct will
facilitate further research on applying LLMs to graph-structured data. Our code
and data are released publicly at: https://github.com/CGCL-codes/GraphInstruct.
                                    
                                        
                                            comment: Accepted by Frontiers of Computer Science
                                        
                                ♻ ☆ The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
                                          Cross-lingual retrieval-augmented generation (RAG) is a critical capability
for retrieving and generating answers across languages. Prior work in this
context has mostly focused on generation and relied on benchmarks derived from
open-domain sources, most notably Wikipedia. In such settings, retrieval
challenges often remain hidden due to language imbalances, overlap with
pretraining data, and memorized content. To address this gap, we study
Arabic-English RAG in a domain-specific setting using benchmarks derived from
real-world corporate datasets. Our benchmarks include all combinations of
languages for the user query and the supporting document, drawn independently
and uniformly at random. This enables a systematic study of multilingual
retrieval behavior.
  Our findings reveal that retrieval is a critical bottleneck in cross-lingual
domain-specific scenarios, with substantial performance drops occurring when
the user query and supporting document languages differ. A key insight is that
these failures stem primarily from the retriever's difficulty in ranking
documents across languages. Finally, we propose two simple retrieval strategies
that address this source of failure by enforcing equal retrieval from both
languages or by translating the query, resulting in substantial improvements in
cross-lingual and overall performance. These results highlight meaningful
opportunities for improving multilingual retrieval, particularly in practical,
real-world RAG applications.
                                    
                                        
                                            comment: Accepted to ArabicNLP 2025
                                        
                                ♻ ☆ ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions NeurIPS 2025
                                        
                                            
                                        
                                        
                                            
                                        
                                        Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan
                                    
                                    
                                          Recent advances in Large Language Models (LLMs) have propelled intelligent
agents from reactive responses to proactive support. While promising, existing
proactive agents either rely exclusively on observations from enclosed
environments (e.g., desktop UIs) with direct LLM inference or employ rule-based
proactive notifications, leading to suboptimal user intent understanding and
limited functionality for proactive service. In this paper, we introduce
ContextAgent, the first context-aware proactive agent that incorporates
extensive sensory contexts surrounding humans to enhance the proactivity of LLM
agents. ContextAgent first extracts multi-dimensional contexts from massive
sensory perceptions on wearables (e.g., video and audio) to understand user
intentions. ContextAgent then leverages the sensory contexts and personas from
historical data to predict the necessity for proactive services. When proactive
assistance is needed, ContextAgent further automatically calls the necessary
tools to assist users unobtrusively. To evaluate this new task, we curate
ContextAgentBench, the first benchmark for evaluating context-aware proactive
LLM agents, covering 1,000 samples across nine daily scenarios and twenty
tools. Experiments on ContextAgentBench show that ContextAgent outperforms
baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive
predictions and tool calling, respectively. We hope our research can inspire
the development of more advanced, human-centric, proactive AI assistants. The
code and dataset are publicly available at
https://github.com/openaiotlab/ContextAgent.
                                    
                                        
                                            comment: Accepted by NeurIPS 2025
                                        
                                ♻ ☆ ColorEcosystem: Powering Personalized, Standardized, and Trustworthy Agentic Service in massive-agent Ecosystem
                                        
                                            
                                        
                                        
                                            
                                        
                                        Fangwen Wu, Zheng Wu, Jihong Wang, Yunku Chen, Ruiguang Pei, Heyuan Huang, Xin Liao, Xingyu Lou, Huarong Deng, Zhihui Fu, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang, Jun Wang
                                    
                                    
                                          With the rapid development of (multimodal) large language model-based agents,
the landscape of agentic service management has evolved from single-agent
systems to multi-agent systems, and now to massive-agent ecosystems. Current
massive-agent ecosystems face growing challenges, including impersonal service
experiences, a lack of standardization, and untrustworthy behavior. To address
these issues, we propose ColorEcosystem, a novel blueprint designed to enable
personalized, standardized, and trustworthy agentic service at scale.
Concretely, ColorEcosystem consists of three key components: agent carrier,
agent store, and agent audit. The agent carrier provides personalized service
experiences by utilizing user-specific data and creating a digital twin, while
the agent store serves as a centralized, standardized platform for managing
diverse agentic services. The agent audit, based on the supervision of
developer and user activities, ensures the integrity and credibility of both
service providers and users. Through the analysis of challenges, transitional
forms, and practical considerations, the ColorEcosystem is poised to power
personalized, standardized, and trustworthy agentic service across
massive-agent ecosystems. Meanwhile, we have also implemented part of
ColorEcosystem's functionality, and the relevant code is open-sourced at
https://github.com/opas-lab/color-ecosystem.
                                    
                                ♻ ☆ Computational Analysis of Character Development in Holocaust Testimonies
                                          This work presents a computational approach to analyze character development
along the narrative timeline. The analysis characterizes the inner and outer
changes the protagonist undergoes within a narrative, and the interplay between
them. We consider transcripts of Holocaust survivor testimonies as a test case,
each telling the story of an individual in first-person terms. We focus on the
survivor's religious trajectory, examining the evolution of their disposition
toward religious belief and practice along the testimony. Clustering the
resulting trajectories in the dataset, we identify common sequences in the
data. Our findings highlight multiple common structures of religiosity across
the narratives: in terms of belief, most present a constant disposition, while
for practice, most present an oscillating structure, serving as valuable
material for historical and sociological research. This work demonstrates the
potential of natural language processing techniques for analyzing character
evolution through thematic trajectories in narratives.
                                    
                                ♻ ☆ FaithLM: Towards Faithful Explanations for Large Language Models
                                        
                                            
                                        
                                        
                                            
                                        
                                        Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Vladimir Braverman, Xia Hu
                                    
                                    
                                          Large language models (LLMs) increasingly produce natural language
explanations, yet these explanations often lack faithfulness, and they do not
reliably reflect the evidence the model uses to decide. We introduce FaithLM, a
model-agnostic framework that evaluates and improves the faithfulness of LLM
explanations without token masking or task-specific heuristics. FaithLM
formalizes explanation faithfulness as an intervention property: a faithful
explanation should yield a prediction shift when its content is contradicted.
Theoretical analysis shows that the resulting contrary-hint score is a sound
and discriminative estimator of faithfulness. Building on this principle,
FaithLM iteratively refines both the elicitation prompt and the explanation to
maximize the measured score. Experiments on three multi-domain datasets and
multiple LLM backbones demonstrate that FaithLM consistently increases
faithfulness and produces explanations more aligned with human rationales than
strong self-explanation baselines. These findings highlight that
intervention-based evaluation, coupled with iterative optimization, provides a
principled route toward faithful and reliable LLM explanations.
                                    
                                ♻ ☆ Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
                                        
                                            
                                        
                                        
                                            
                                        
                                        Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
                                    
                                    
                                          AI agent frameworks operate in isolation, forcing agents to rediscover
solutions and repeat mistakes across different systems. Despite valuable
problem-solving experiences accumulated by frameworks like smolagents,
OpenHands, and OWL, this knowledge remains trapped within individual systems,
preventing the emergence of collective intelligence. Current memory systems
focus on individual agents or framework-specific demonstrations, failing to
enable cross-architecture knowledge transfer. We introduce AGENT KB, a
universal memory infrastructure enabling seamless experience sharing across
heterogeneous agent frameworks without retraining. AGENT KB aggregates
trajectories into a structured knowledge base and serves lightweight APIs. At
inference time, hybrid retrieval operates through two stages: planning seeds
agents with cross-domain workflows, while feedback applies targeted diagnostic
fixes. A disagreement gate ensures retrieved knowledge enhances rather than
disrupts reasoning, addressing knowledge interference in cross-framework
transfer. We validate AGENT KB across major frameworks on GAIA, Humanity's Last
Exam, GPQA, and SWE-bench. Results show substantial improvements across diverse
model families: compared to baseline pass@1, smolagents with AGENT KB achieve
up to 18.7pp gains at pass@3 (55.2% -> 73.9%), while OpenHands improves 4.0pp
on SWE-bench pass@1 (24.3% -> 28.3%). Similar improvements are observed across
all base model families. Ablations confirm that hybrid retrieval and feedback
stages are essential, with automatically generated experiences matching manual
curation. This establishes the foundation for collective agent intelligence
through shared memory infrastructures.
                                    
                                ♻ ☆ Detecting and Rectifying Noisy Labels: A Similarity-based Approach
                                          Label noise in datasets could significantly damage the performance and
robustness of deep neural networks (DNNs) trained on these datasets. As the
size of modern DNNs grows, there is a growing demand for automated tools for
detecting such errors. In this paper, we propose post-hoc, model-agnostic noise
detection and rectification methods utilizing the penultimate feature from a
DNN. Our idea is based on the observation that the similarity between the
penultimate feature of a mislabeled data point and its true class data points
is higher than that for data points from other classes, making the probability
of label occurrence within a tight, similar cluster informative for detecting
and rectifying errors. Through theoretical and empirical analyses, we
demonstrate that our approach achieves high detection performance across
diverse, realistic noise scenarios and can automatically rectify these errors
to improve dataset quality. Our implementation is available at
https://anonymous.4open.science/r/noise-detection-and-rectification-AD8E.
                                    
                                ♻ ☆ Unified Sparse Mixture of Experts
                                          Sparse Mixture of Experts (SMoEs) models scale the capacity of models while
maintaining constant computational overhead. Early designs typically relied on
a fixed value of $k$, where $k$ represents either the number of experts
selected per token or the number of tokens assigned per expert. However, these
approaches encounter three key limitations: they may fail to route to important
experts or tokens, may assign irrelevant ones, and often suffer from
representation collapse among experts. This paper reexamines SMoEs through the
lens of \textit{Linear Programming}, and proposes a Unified Sparse Mixture of
Experts (USMoE) framework that addresses these limitations. Specifically, our
approach introduces a unified mechanism that integrates information from both
the expert and token dimensions, and a unified scoring function that linearly
combines similarity scores between experts and tokens. We provide both
theoretical justification and empirical evidence demonstrating USMoE's
effectiveness in overcoming the limitations of traditional routing methods.
Through comprehensive evaluations on both clean and corrupted settings for
large language models and vision tasks, under both training-free and training
scenarios, USMoE achieves up to a 10\% performance improvement over standard
approaches or reduces inference costs by up to 14\%, while maintaining
competitive accuracy.
                                    
                                        
                                            comment: 26 pages
                                        
                                ♻ ☆ Learning to Better Search with Language Models via Guided Reinforced Self-Training NeurIPS 2025
                                          While language models have shown remarkable performance across diverse tasks,
they still encounter challenges in complex reasoning scenarios. Recent research
suggests that language models trained on linearized search traces toward
solutions, rather than solely on the final solutions, exhibit improved
generalization, despite the search traces being potentially noisy or
suboptimal. However, relying on such imperfect traces can result in inefficient
use of test-time compute. To address this, we propose guided reinforced
self-training (Guided-ReST), a fine-tuning algorithm designed to improve the
model's capability for effective search during inference. The key insight
behind Guided-ReST is that optimal solutions can serve as valuable step-by-step
landmarks to guide the model's search process. Based on this insight, we
introduce a novel data generation method that seamlessly incorporates optimal
solutions into the model's search procedure, enabling the generation of
high-quality search traces. By fine-tuning the model on these search traces, we
effectively distill improved search strategies into the model. Our method
significantly enhances the search capabilities of language models on arithmetic
reasoning and code self-repair tasks, including Countdown, CodeContests, and
CodeForces. We release the source code at
https://github.com/snu-mllab/guided-rest.
                                    
                                        
                                            comment: Accepted at NeurIPS 2025
                                        
                                ♻ ☆ Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
                                        
                                            
                                        
                                        
                                            
                                        
                                        Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Johan Obando-Ceron, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng
                                    
                                    
                                          Reinforcement learning for LLM reasoning has rapidly emerged as a prominent
research area, marked by a significant surge in related studies on both
algorithmic innovations and practical applications. Despite this progress,
several critical challenges remain, including the absence of standardized
guidelines for employing RL techniques and a fragmented understanding of their
underlying mechanisms. Additionally, inconsistent experimental settings,
variations in training data, and differences in model initialization have led
to conflicting conclusions, obscuring the key characteristics of these
techniques and creating confusion among practitioners when selecting
appropriate techniques. This paper systematically reviews widely adopted RL
techniques through rigorous reproductions and isolated evaluations within a
unified open-source framework. We analyze the internal mechanisms, applicable
scenarios, and core principles of each technique through fine-grained
experiments, including datasets of varying difficulty, model sizes, and
architectures. Based on these insights, we present clear guidelines for
selecting RL techniques tailored to specific setups, and provide a reliable
roadmap for practitioners navigating the RL for the LLM domain. Finally, we
reveal that a minimalist combination of two techniques can unlock the learning
capability of critic-free policies using vanilla PPO loss. The results
demonstrate that our simple combination consistently improves performance,
surpassing strategies like GRPO and DAPO.
                                    
                                        
                                            comment: 26 pages, 21 figures
                                        
                                ♻ ☆ UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels
                                          Multimodal Large Languages models have been progressing from uni-modal
understanding toward unifying visual, audio and language modalities,
collectively termed omni models. However, the correlation between uni-modal and
omni-modal remains unclear, which requires comprehensive evaluation to drive
omni model's intelligence evolution. In this work, we propose a novel, high
quality and UNified Omni model benchmark, UNO-Bench, which effectively assesses
both UNi-modal and Omni-modal capabilities. The benchmark consists of 3730
human curated samples, with 98% cross-modality solvability, across 44 task
types, and an innovative multi-step open-ended question type for assessing
complex reasoning. Besides, a general scoring model supporting 6 question types
is proposed for automated evaluation with 95% accuracy. Experimental result
shows the Compositional Law between omni-modal and uni-modal performance and
the omni-modal capability manifests as a bottleneck effect on weak models,
while exhibiting synergistic promotion on strong models. The code and data are
available at https://github.com/meituan-longcat/UNO-Bench
                                    
                                        
                                            comment: v2: New title and new abstract. Updated evaluation results and
  analysis. The benchmark name has been updated to UNO-Bench from MMAO-Bench.
  Work in progress. Code and data are available at
  https://github.com/meituan-longcat/UNO-Bench
                                        
                                ♻ ☆ Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
                                          Cross-linguistically, native words and loanwords follow different
phonological rules. In English, for example, words of Germanic and Latinate
origin exhibit different stress patterns, and a certain syntactic structure,
double-object datives, is predominantly associated with Germanic verbs rather
than Latinate verbs. As a cognitive model, however, such etymology-based
generalizations face challenges in terms of learnability, since the historical
origins of words are presumably inaccessible information for general language
learners. In this study, we present computational evidence indicating that the
Germanic-Latinate distinction in the English lexicon is learnable from the
phonotactic information of individual words. Specifically, we performed an
unsupervised clustering on corpus-extracted words, and the resulting word
clusters largely aligned with the etymological distinction. The
model-discovered clusters also recovered various linguistic generalizations
documented in the previous literature regarding the corresponding etymological
classes. Moreover, our findings also uncovered previously unrecognized features
of the quasi-etymological clusters.
                                    
                                ♻ ☆ The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection
                                          Ensuring that Large Language Models (LLMs) generate summaries faithful to a
given source document is essential for real-world applications. While prior
research has explored LLM faithfulness, existing benchmarks suffer from
annotation ambiguity, primarily due to the ill-defined boundary of permissible
external knowledge in generated outputs. For instance, common sense is often
incorporated into responses and labeled as "faithful", yet the acceptable
extent of such knowledge remains unspecified, leading to inconsistent
annotations. To address this issue, we propose a novel faithfulness annotation
framework, which introduces an intermediate category, Out-Dependent, to
classify cases where external knowledge is required for verification. Using
this framework, we construct VeriGray (Verification with the Gray Zone) -- a
new unfaithfulness detection benchmark in summarization. Statistics reveal that
even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences)
in summarization tasks. Moreover, a substantial proportion ($\sim 8\%$ on
average of models) of generated sentences fall into the Out-Dependent category,
underscoring the importance of resolving annotation ambiguity in unfaithfulness
detection benchmarks. Experiments demonstrate that our benchmark poses
significant challenges to multiple baseline methods, indicating considerable
room for future improvement.
                                    
                                        
                                            comment: Updates: 1. further polishing the writing; 2. adding the motivation
  of investigating selective prediction for unfaithfulness detectors
                                        
                                ♻ ☆ Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training NeurIPS 2025
                                          Large language models are trained with tokenizers, and the resulting token
distribution is highly imbalanced: a few words dominate the stream while most
occur rarely. Recent practice favors ever-larger vocabularies, but it is
unclear where the benefit comes from. To this end, we perform a controlled
study that scales the vocabulary of the language model from 24K to 196K while
holding data, computation, and optimization unchanged. We begin by quantifying
the complexity of tokenized text -- formalized via Kolmogorov complexity -- and
show that larger vocabularies reduce this complexity. Above 24K, every common
word is already tokenized as a single token, so enlarging vocabulary only
deepens the relative token-frequency imbalance. Word-level loss decomposition
shows that larger vocabularies reduce cross-entropy loss almost exclusively by
lowering uncertainty on the 2,500 most frequent words, even though loss on the
rare tail rises. The same frequent words cover roughly 75% of tokens in
downstream benchmarks, so this training advantage transfers intact. We further
show that enlarging model parameters with a fixed vocabulary yields the same
frequent-word benefit. Our results recast "bigger vocabularies help" as
"lowering complexity of tokenized text helps," offering a simple, principled
knob for tokenizer--model co-design and clarifying the loss dynamics that
govern language model scaling in pre-training.
                                    
                                        
                                            comment: NeurIPS 2025
                                        
                                ♻ ☆ Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
                                        
                                            
                                        
                                        
                                            
                                        
                                        Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
                                    
                                    
                                          Large Language Models (LLMs) have emerged as personalized assistants for
users across a wide range of tasks -- from offering writing support to
delivering tailored recommendations or consultations. Over time, the
interaction history between a user and an LLM can provide extensive information
about an individual's traits and preferences. However, open questions remain on
how well LLMs today can effectively leverage such history to (1) internalize
the user's inherent traits and preferences, (2) track how the user profiling
and preferences evolve over time, and (3) generate personalized responses
accordingly in new scenarios.
  In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features
curated user profiles with over 180 simulated user-LLM interaction histories,
each containing up to 60 sessions of multi-turn conversations across 15
real-world tasks that require personalization. Given an in-situ user query,
i.e. query issued by the user from the first-person perspective, we evaluate
LLM chatbots' ability to identify the most suitable response according to the
current state of the user's profile. We observe that current LLMs still
struggle to recognize the dynamic evolution in users' profiles over time
through direct prompting approaches. As a consequence, LLMs often fail to
deliver responses that align with users' current situations and preferences,
with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0
achieving only around 50% overall accuracy, suggesting room for improvement. We
hope that PERSONAMEM, along with the user profile and conversation simulation
pipeline, can facilitate future research in the development of truly user-aware
chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.
                                    
                                        
                                            comment: The 2025 Conference on Language Modeling (COLM)
                                        
                                ♻ ☆ LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
                                          Reasoning over long contexts is essential for large language models. While
reinforcement learning (RL) enhances short-context reasoning by inducing "Aha"
moments in chain-of-thought, the advanced thinking patterns required for
long-context reasoning remain largely unexplored, and high-difficulty RL data
are scarce. In this paper, we introduce LoongRL, a data-driven RL method for
advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis
approach that transforms short multi-hop QA into high-difficulty long-context
tasks by inserting UUID chains that hide the true question among large
collections of distracting documents. Solving these tasks requires the model to
trace the correct chain step-by-step, identify the true question, retrieve
relevant facts and reason over them to answer correctly. RL training on
KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning
pattern that generalizes far beyond training length. Models trained at 16K
effectively solve 128K tasks without prohibitive full-length RL rollout costs.
On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA
accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches
a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5)
and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all
128K needle-in-a-haystack stress tests, and preserves short-context reasoning
capabilities.
                                    
                                ♻ ☆ Integrated Design and Governance of Agentic AI Systems through Adaptive Information Modulation
                                          Modern engineered systems increasingly involve complex sociotechnical
environments where multiple agents, including humans and the emerging paradigm
of agentic AI powered by large language models, must navigate social dilemmas
that pit individual interests against collective welfare. As engineered systems
evolve toward multi-agent architectures with autonomous LLM-based agents,
traditional governance approaches using static rules or fixed network
structures fail to address the dynamic uncertainties inherent in real-world
operations. This paper presents a novel framework that integrates adaptive
governance mechanisms directly into the design of sociotechnical systems
through a unique separation of agent interaction networks from information flow
networks. We introduce a system comprising strategic LLM-based system agents
that engage in repeated interactions and a reinforcement learning-based
governing agent that dynamically modulates information transparency. Unlike
conventional approaches that require direct structural interventions or payoff
modifications, our framework preserves agent autonomy while promoting
cooperation through adaptive information governance. The governing agent learns
to strategically adjust information disclosure at each timestep, determining
what contextual or historical information each system agent can access.
Experimental results demonstrate that this RL-based governance significantly
enhances cooperation compared to static information-sharing baselines.
                                    
                                ♻ ☆ ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations EMNLP
                                        
                                            
                                        
                                        
                                            
                                        
                                        Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
                                    
                                    
                                          This work demonstrates that diffusion models can achieve font-controllable
multilingual text rendering using just raw images without font label
annotations.Visual text rendering remains a significant challenge. While recent
methods condition diffusion on glyphs, it is impossible to retrieve exact font
annotations from large-scale, real-world datasets, which prevents
user-specified font control. To address this, we propose a data-driven solution
that integrates the conditional diffusion model with a text segmentation model,
utilizing segmentation masks to capture and represent fonts in pixel space in a
self-supervised manner, thereby eliminating the need for any ground-truth
labels and enabling users to customize text rendering with any multilingual
font of their choice. The experiment provides a proof of concept of our
algorithm in zero-shot text and font editing across diverse fonts and
languages, providing valuable insights for the community and industry toward
achieving generalized visual text rendering. Code is available at
github.com/bowen-upenn/ControlText.
                                    
                                        
                                            comment: The 2025 Conference on Empirical Methods in Natural Language
  Processing (EMNLP) Findings
                                        
                                ♻ ☆ Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization EMNLP 2025
                                          Large language models (LLMs) excel at factual recall yet still propagate
stale or incorrect knowledge. In-context knowledge editing offers a
gradient-free remedy suitable for black-box APIs, but current editors rely on
static demonstration sets chosen by surface-level similarity, leading to two
persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of
adaptivity to task difficulty. We address these issues by dynamically selecting
supporting demonstrations according to their utility for the edit. We propose
Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight
framework that (1) trains a BERT retriever with REINFORCE to rank
demonstrations by editing reward, and (2) employs a learnable threshold to
prune low-value examples, shortening the prompt when the edit is easy and
expanding it when the task is hard. DR-IKE performs editing without modifying
model weights, relying solely on forward passes for compatibility with
black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to
17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries,
demonstrating scalable and adaptive knowledge editing. The code is available at
https://github.com/mwnafee/DR-IKE .
                                    
                                        
                                            comment: Accepted at EMNLP 2025. Copyright 2025 Association for Computational
  Linguistics (CC BY 4.0). 12 pages, 5 figures