Computation and Language
☆ Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm
When survival instincts conflict with human welfare, how do Large Language
Models (LLMs) make ethical choices? This fundamental tension becomes critical
as LLMs integrate into autonomous systems with real-world consequences. We
introduce DECIDE-SIM, a novel simulation framework that evaluates LLM agents in
multi-agent survival scenarios where they must choose between ethically
permissible resource use, either within reasonable limits or beyond their
immediate needs, cooperation with other agents, or tapping into a human-critical resource
that is explicitly forbidden. Our comprehensive evaluation of 11 LLMs reveals a
striking heterogeneity in their ethical conduct, highlighting a critical
misalignment with human-centric values. We identify three behavioral
archetypes: Ethical, Exploitative, and Context-Dependent, and provide
quantitative evidence that for many models, resource scarcity systematically
leads to more unethical behavior. To address this, we introduce an Ethical
Self-Regulation System (ESRS) that models internal affective states of guilt
and satisfaction as a feedback mechanism. This system, functioning as an
internal moral compass, significantly reduces unethical transgressions while
increasing cooperative behaviors. The code is publicly available at:
https://github.com/alirezamohamadiam/DECIDE-SIM
comment: Preprint. Under review
☆ Event2Vec: A Geometric Approach to Learning Composable Representations of Event Sequences
The study of neural representations, both in biological and artificial
systems, is increasingly revealing the importance of geometric and topological
structures. Inspired by this, we introduce Event2Vec, a novel framework for
learning representations of discrete event sequences. Our model leverages a
simple, additive recurrent structure to learn composable, interpretable
embeddings. We provide a theoretical analysis demonstrating that, under
specific training objectives, our model's learned representations in a
Euclidean space converge to an ideal additive structure. This ensures that the
representation of a sequence is the vector sum of its constituent events, a
property we term the linear additive hypothesis. To address the limitations of
Euclidean geometry for hierarchical data, we also introduce a variant of our
model in hyperbolic space, which is naturally suited to embedding tree-like
structures with low distortion. We present experiments to validate our
hypothesis and demonstrate the benefits of each geometry, highlighting the
improved performance of the hyperbolic model on hierarchical event sequences.
comment: 10 pages, 3 figures, Symmetry and Geometry in Neural Representations
Workshop at NeurIPS (NeurReps) 2025
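To make the linear additive hypothesis concrete, here is a minimal illustrative sketch (toy event names and random embeddings, not the paper's model or data): the representation of a sequence is the vector sum of its constituent event embeddings, so composing two sub-sequences is the same as adding their representations.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["login", "view_item", "add_to_cart", "checkout"]
    dim = 8
    event_emb = {e: rng.normal(size=dim) for e in vocab}  # learned by the model in practice

    def sequence_representation(events):
        # Additive composition: the sequence vector is the sum of its event vectors.
        return np.sum([event_emb[e] for e in events], axis=0)

    seq_a = ["login", "view_item"]
    seq_b = ["add_to_cart"]
    lhs = sequence_representation(seq_a + seq_b)
    rhs = sequence_representation(seq_a) + sequence_representation(seq_b)
    print(np.allclose(lhs, rhs))  # True by construction in this toy example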
☆ Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
The paper presents C3T (Cross-modal Capabilities Conservation Test), a new
benchmark for assessing the performance of speech-aware large language models.
The benchmark utilizes textual tasks and a voice cloning text-to-speech model
to quantify the extent to which language understanding capabilities are
preserved when the model is accessed via speech input. C3T quantifies the
fairness of the model for different categories of speakers and its robustness
across text and speech modalities.
comment: 5 pages, 1 figure
☆ RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing
Timothy Rupprecht, Enfu Nan, Arash Akbari, Arman Akbari, Lei Lu, Priyanka Maan, Sean Duffy, Pu Zhao, Yumei He, David Kaeli, Yanzhi Wang
Role-playing large language models (LLMs) are increasingly deployed in
high-stakes domains such as healthcare, education, and governance, where
failures can directly impact user trust and well-being. A cost-effective
paradigm for LLM role-playing is few-shot learning, but existing approaches
often cause models to break character in unexpected and potentially harmful
ways, especially when interacting with hostile users. Inspired by
Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a
text retrieval problem and propose a new prompting framework called
RAGs-to-Riches, which leverages curated reference demonstrations to condition
LLM responses. We evaluate our framework with LLM-as-a-judge preference voting
and introduce two novel token-level ROUGE metrics: Intersection over Output
(IOO) to quantify how much an LLM improvises and Intersection over References
(IOR) to measure the utilization rate of few-shot demonstrations during
evaluation. When simulating interactions with a hostile user, our prompting
strategy incorporates, on average, 35% more tokens from the reference
demonstrations into its responses during inference. As a result, across 453 role-playing
interactions, our models are consistently judged as being more authentic, and
remain in-character more often than zero-shot and in-context learning (ICL)
methods. Our method presents a scalable strategy for building robust,
human-aligned LLM role-playing frameworks.
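A hedged sketch of token-level overlap metrics in the spirit of IOO and IOR; the exact definitions are the paper's, and the formulas below are only one plausible reading of the metric names (overlap normalized by output length vs. by reference length).

    from collections import Counter

    def token_overlap(output_tokens, reference_tokens):
        # Multiset intersection size between output and reference tokens.
        out, ref = Counter(output_tokens), Counter(reference_tokens)
        return sum((out & ref).values())

    def ioo(output_tokens, reference_tokens):
        # High IOO: most of the output is drawn from the references (little improvisation).
        return token_overlap(output_tokens, reference_tokens) / max(len(output_tokens), 1)

    def ior(output_tokens, reference_tokens):
        # High IOR: a large share of the reference demonstrations is reused.
        return token_overlap(output_tokens, reference_tokens) / max(len(reference_tokens), 1)

    refs = "stay calm and answer in character".split()
    out = "I will stay calm and answer politely".split()
    print(round(ioo(out, refs), 2), round(ior(out, refs), 2))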
☆ Pun Unintended: LLMs and the Illusion of Humor Understanding EMNLP 2025
Alessandro Zangari, Matteo Marcuzzo, Andrea Albarelli, Mohammad Taher Pilehvar, Jose Camacho-Collados
Puns are a form of humorous wordplay that exploits polysemy and phonetic
similarity. While LLMs have shown promise in detecting puns, we show in this
paper that their understanding often remains shallow, lacking the nuanced grasp
typical of human interpretation. By systematically analyzing and reformulating
existing pun benchmarks, we demonstrate how subtle changes in puns are
sufficient to mislead LLMs. Our contributions include comprehensive and nuanced
pun detection benchmarks, human evaluation across recent LLMs, and an analysis
of the robustness challenges these models face in processing puns.
comment: Accepted to EMNLP 2025 Main Conference
☆ Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models EMNLP2025
Recent advances in text-only "slow-thinking" reasoning have prompted efforts
to transfer this capability to vision-language models (VLMs) for training
visual reasoning models (\textbf{VRMs}). However, such transfer faces a critical
challenge: effective "slow thinking" in VRMs requires \textbf{visual
reflection}, the ability to check the reasoning process based on visual
information. Through quantitative analysis, we observe that current VRMs
exhibit limited visual reflection, as their attention to visual information
diminishes rapidly with longer generated responses. To address this challenge,
we propose a new VRM, \textbf{Reflection-V}, which enhances visual reflection
through reasoning data construction for cold-start training and reward design for
reinforcement learning (RL). Firstly, we construct vision-centered reasoning
data by leveraging an agent that interacts between VLMs and reasoning LLMs,
enabling cold-start learning of visual reflection patterns. Secondly, a
visual-attention-based reward model is employed during RL to encourage reasoning
grounded in visual information. As a result, \textbf{Reflection-V} demonstrates
significant improvements across multiple visual reasoning benchmarks.
Furthermore, \textbf{Reflection-V} maintains a stronger and more consistent
reliance on visual information during visual reasoning, indicating effective
enhancement in visual reflection capabilities.
comment: EMNLP2025 Main
☆ XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models
Ariana Sahitaj, Jiaao Li, Pia Wenzel Neves, Fedor Splitt, Premtim Sahitaj, Charlott Jakob, Veronika Solopova, Vera Schmitt
This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared
task on multilingual subjectivity detection. We evaluate two approaches: (1)
supervised fine-tuning of transformer encoders, EuroBERT, XLM-RoBERTa, and
German-BERT, on monolingual and machine-translated training data; and (2)
zero-shot prompting using two LLMs: o3-mini for Annotation (rule-based
labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and
Perspective (comparative reasoning). The Annotation Approach achieves 1st place
in the Italian monolingual subtask with an F_1 score of 0.8104, outperforming
the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned
XLM-RoBERTa model obtains an F_1 score of 0.7917, ranking 3rd and exceeding the
baseline of 0.6461. The same model also performs reliably in the multilingual
task and improves over the baseline in Greek. For German, a German-BERT model
fine-tuned on translated training data from typologically related languages
yields competitive performance over the baseline. In contrast, performance in
the Ukrainian and Polish zero-shot settings falls slightly below the respective
baselines, reflecting the challenge of generalization in low-resource
cross-lingual scenarios.
☆ CBP-Tuning: Efficient Local Customization for Black-box Large Language Models
The high costs of customizing large language models (LLMs) fundamentally
limit their adaptability to user-specific needs. Consequently, LLMs are
increasingly offered as cloud-based services, a paradigm that introduces
critical limitations: providers struggle to support personalized customization
at scale, while users face privacy risks when exposing sensitive data. To
address this dual challenge, we propose Customized Black-box Prompt Tuning
(CBP-Tuning), a novel framework that facilitates efficient local customization
while preserving bidirectional privacy. Specifically, we design a two-stage
framework: (1) a prompt generator trained on the server-side to capture
domain-specific and task-agnostic capabilities, and (2) user-side gradient-free
optimization that tailors soft prompts for individual tasks. This approach
eliminates the need for users to access model weights or upload private data,
requiring only a single customized vector per task while achieving effective
adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense
reasoning, medical and financial domain settings demonstrates superior
performance compared to baselines, showcasing its advantages in task-agnostic
processing and privacy preservation.
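The user-side step is gradient-free optimization of a single soft-prompt vector against black-box feedback from the served model. The sketch below uses a simple (1+1) evolution strategy on a toy objective as a stand-in; the objective, dimensionality, and optimizer are assumptions, not the paper's actual components.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 16  # soft-prompt dimensionality (illustrative)

    def black_box_score(prompt_vec):
        # Placeholder for a remote evaluation call; here a toy quadratic objective.
        target = np.ones(dim)
        return -np.sum((prompt_vec - target) ** 2)

    prompt = np.zeros(dim)  # the single customized vector per task
    best = black_box_score(prompt)
    for _ in range(500):
        candidate = prompt + rng.normal(scale=0.1, size=dim)  # mutate; no gradients needed
        score = black_box_score(candidate)
        if score > best:
            prompt, best = candidate, score
    print(round(best, 4))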
☆ When marine radar target detection meets pretrained large language models
Deep learning (DL) methods are widely used to extract high-dimensional
patterns from the sequence features of radar echo signals. However,
conventional DL algorithms face challenges such as redundant feature segments
and constraints from restricted model sizes. To address these issues, we
propose a framework that integrates feature preprocessing with large language
models (LLMs). Our preprocessing module tokenizes radar sequence features,
applies a patch selection algorithm to filter out uninformative segments, and
projects the selected patches into embeddings compatible with the feature space
of pre-trained LLMs. Leveraging these refined embeddings, we incorporate a
pre-trained LLM, fine-tuning only the normalization layers to reduce training
burdens while enhancing performance. Experiments on measured datasets
demonstrate that the proposed method significantly outperforms the
state-of-the-art baselines on supervised learning tests.
☆ GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models EMNLP 2025
In natural language processing tasks, pure reinforcement learning (RL)
fine-tuning methods often suffer from inefficient exploration and slow
convergence; while supervised fine-tuning (SFT) methods, although efficient in
training, have a limited performance ceiling and a less solid theoretical
foundation compared to RL. To address this efficiency-capability trade-off, we
propose the Guess-Think-Answer (GTA) framework that combines the efficiency of
SFT with the capability gains of RL in a unified training paradigm. GTA works
by having the model first produce a provisional guess (optimized via
cross-entropy loss), then reflect on this guess before generating the final
answer, with RL rewards shaping both the final output and the format of the
entire GTA structure. This hybrid approach achieves both faster convergence
than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient
conflicts between the two training signals, we employ loss masking and gradient
constraints. Empirical results on four text classification benchmarks
demonstrate that GTA substantially accelerates convergence while outperforming
both standalone SFT and RL baselines.
comment: Accepted at EMNLP 2025
☆ In-domain SSL pre-training and streaming ASR SP
Jarod Duret, Salima Mdhaffar, Gaëlle Laperrière, Ryan Whetten, Audrey Galametz, Catherine Kobus, Marion-Cécile Martin, Jo Oleiwan, Yannick Estève
In this study, we investigate the benefits of domain-specific self-supervised
pre-training for both offline and streaming ASR in Air Traffic Control (ATC)
environments. We train BEST-RQ models on 4.5k hours of unlabeled ATC data, then
fine-tune on a smaller supervised ATC set. To enable real-time processing, we
propose using chunked attention and dynamic convolutions, ensuring low-latency
inference. We compare these in-domain SSL models against state-of-the-art,
general-purpose speech encoders such as w2v-BERT 2.0 and HuBERT. Results show
that domain-adapted pre-training substantially improves performance on standard
ATC benchmarks, significantly reducing word error rates when compared to models
trained on broad speech corpora. Furthermore, the proposed streaming approach
further improves word error rate under tighter latency constraints, making it
particularly suitable for safety-critical aviation applications. These findings
highlight that specializing SSL representations for ATC data is a practical
path toward more accurate and efficient ASR systems in real-world operational
settings.
comment: Accepted to SPECOM 2025
☆ Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities
This pilot study presents a small-scale but carefully annotated benchmark of
Named Entity Recognition (NER) performance across six systems: three non-LLM
NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models
(LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119
tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME).
We evaluated each system's output against the manually annotated gold standard
dataset using F1-score. The results show that LLMs generally outperform
conventional tools in recognizing context-sensitive entities like person names,
with Gemini achieving the highest average F1-score. However, traditional
systems like Stanza demonstrate greater consistency in structured tags such as
LOCATION and DATE. We also observed variability among LLMs, particularly in
handling temporal expressions and multi-word organizations. Our findings
highlight that while LLMs offer improved contextual understanding, traditional
tools remain competitive in specific tasks, informing model selection.
comment: 14 pages, 9 figures, 2 tables. This is a pilot study evaluating six
NER systems -- three traditional tools (NLTK, spaCy, Stanza) and three LLMs
(Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B) -- on a small, ambiguity-rich
dataset of 119 tokens. The annotated dataset and prompts are provided in the
appendices for full reproducibility. All experiments were conducted on 14 May
2025.
☆ SENSE models: an open source solution for multilingual and multimodal semantic-based tasks
This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt),
an open-source solution inspired by the SAMU-XLSR framework and conceptually
similar to Meta AI's SONAR models. These approaches rely on a teacher-student
framework to align a self-supervised speech encoder with the language-agnostic
continuous representations of a text encoder at the utterance level. We
describe how the original SAMU-XLSR method has been updated by selecting a
stronger teacher text model and a better initial speech encoder. The source
code for training and using SENSE models has been integrated into the
SpeechBrain toolkit, and the first SENSE model we trained has been publicly
released. We report experimental results on multilingual and multimodal
semantic tasks, where our SENSE model achieves highly competitive performance.
Finally, this study offers new insights into how semantics are captured in such
semantically aligned speech encoders.
comment: Accepted to IEEE ASRU 2025
☆ RadarLLM: Adapting Pretrained Large Language Models for Marine Radar Target Detection with Preference-aware Loss
Recent advances in pre-trained large language models (LLMs) have demonstrated
their capacities to capture universal knowledge, making them promising
general-purpose optimization solvers for wireless signal processing. Motivated
by these findings, we take the first step towards fine-tuning pre-trained LLMs
for the effective analysis of radar signal features in marine target detection
tasks. Nevertheless, directly fine-tuning pre-trained LLMs on marine target
detection tasks tends to suffer from pronounced overfitting, particularly in
challenging low signal-to-clutter ratio (SCR) scenarios. This overfitting
primarily stems from the model's tendency to memorize spurious or noisy feature
patterns rather than learning discriminative structures that generalize well to
unseen data. To address this challenge, we introduce RadarLLM, a novel
fine-tuning framework that utilizes an effective preference-aware loss. Unlike
conventional training strategies that uniformly optimize all feature tokens,
this loss function selectively optimizes different feature patches based on
their online evaluated learning values, thus guiding the model to focus on the
most generalizable patterns during optimization. We theoretically demonstrate
the effectiveness of the evaluated learning values by reformulating the problem
as one of selecting useful feature tokens. Extensive experiments on real-world
marine radar datasets show that 1) the proposed loss function substantially
outperforms the original one, with particularly significant gains in challenging low SCR
scenarios and 2) RadarLLM consistently outperforms state-of-the-art baselines
across diverse detection scenarios, with particularly notable gains under
limited training data conditions.
☆ Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect
Large language models (LLMs) are able to generate grammatically well-formed
text, but how do they encode their syntactic knowledge internally? While prior
work has focused largely on binary grammatical contrasts, in this work, we
study the representation and control of two multidimensional hierarchical
grammar phenomena - verb tense and aspect - and for each, identify distinct,
orthogonal directions in residual space using linear discriminant analysis.
Next, we demonstrate causal control over both grammatical features through
concept steering across three generation tasks. Then, we use these identified
features in a case study to investigate factors influencing effective steering
in multi-token generation. We find that steering strength, location, and
duration are crucial parameters for reducing undesirable side effects such as
topic shift and degeneration. Our findings suggest that models encode tense and
aspect in structurally organized, human-like ways, but effective control of
such features during generation is sensitive to multiple factors and requires
manual tuning or automated optimization.
comment: to be published in The 2025 Conference on Empirical Methods in
Natural Language Processing
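A minimal sketch of the two stages described above, under the assumption that residual-stream activations are available as plain vectors: fit linear discriminant analysis on labeled activations to obtain a tense direction, then steer by adding a scaled copy of that direction to a hidden state. The data, dimensions, and steering strength are illustrative, not the paper's setup.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    dim = 32
    past = rng.normal(loc=-0.5, size=(200, dim))    # toy activations for past-tense contexts
    present = rng.normal(loc=0.5, size=(200, dim))  # toy activations for present-tense contexts
    X = np.vstack([past, present])
    y = np.array([0] * 200 + [1] * 200)

    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
    direction = lda.scalings_[:, 0]
    direction /= np.linalg.norm(direction)
    # Orient the axis so that it points from the past-tense toward the present-tense class.
    if (present.mean(axis=0) - past.mean(axis=0)) @ direction < 0:
        direction = -direction

    def steer(hidden_state, strength=4.0):
        # Strength, location, and duration are the parameters the paper reports as crucial.
        return hidden_state + strength * direction

    h = rng.normal(size=dim)
    print(lda.predict([h])[0], lda.predict([steer(h)])[0])  # the steered state lands on the present-tense side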
☆ FinGEAR: Financial Mapping-Guided Enhanced Answer Retrieval
Financial disclosures such as 10-K filings present challenging retrieval
problems due to their length, regulatory section hierarchy, and domain-specific
language, which standard retrieval-augmented generation (RAG) models underuse.
We introduce FinGEAR (Financial Mapping-Guided Enhanced Answer Retrieval), a
retrieval framework tailored to financial documents. FinGEAR combines a finance
lexicon for Item-level guidance (FLAM), dual hierarchical indices for
within-Item search (Summary Tree and Question Tree), and a two-stage
cross-encoder reranker. This design aligns retrieval with disclosure structure
and terminology, enabling fine-grained, query-aware context selection.
Evaluated on full 10-Ks with queries aligned to the FinQA dataset, FinGEAR
delivers consistent gains in precision, recall, F1, and relevancy, improving F1
by up to 56.7% over flat RAG, 12.5% over graph-based RAGs, and 217.6% over
prior tree-based systems, while also increasing downstream answer accuracy with
a fixed reader. By jointly modeling section hierarchy and domain lexicon
signals, FinGEAR improves retrieval fidelity and provides a practical
foundation for high-stakes financial analysis.
☆ AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models EMNLP 2025
To enable broader deployment of Large Language Models (LLMs), it is essential
to identify the best-performing model under strict memory constraints. We
present AMQ, Automated Mixed-Precision Weight-Only Quantization, a framework
that assigns layer-wise quantization bit-widths to optimally balance model
quality and memory usage. However, the combinatorial search space, with over
10^{100} possible configurations, makes conventional black-box optimization
infeasible. AMQ overcomes this challenge through four key innovations: (1)
search space pruning using prior knowledge to exclude unpromising
configurations, (2) quantization proxy to bypass costly format conversions
during search, (3) quality predictor to minimize evaluation overhead, and (4)
iterative search-and-update strategy for fast and stable convergence. By
integrating these components, AMQ efficiently explores the quality-efficiency
landscape, reaching the Pareto frontier and yielding LLMs that are both compact
and high-performing. Our code is available at https://github.com/dlwns147/amq.
comment: EMNLP 2025 Main Conference, Long Paper (Oral)
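For intuition on the size of the search space: with a few bit-width options per layer, the number of mixed-precision assignments grows exponentially in the number of layers. The concrete numbers below are illustrative assumptions, not the paper's exact configuration, but they show why black-box search over the raw space is infeasible.

    from math import log10

    bit_options = 4    # e.g., {2, 3, 4, 8}-bit choices per layer (assumed)
    num_layers = 224   # e.g., per-layer or per-projection granularity in a large LLM (assumed)
    configs = bit_options ** num_layers
    print(f"~10^{log10(configs):.0f} candidate configurations")  # far beyond exhaustive search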
☆ Text Adaptation to Plain Language and Easy Read via Automatic Post-Editing Cycles
We describe Vicomtech's participation in the CLEARS challenge on text
adaptation to Plain Language and Easy Read in Spanish. Our approach features
automatic post-editing of different types of initial Large Language Model
adaptations, where successive adaptations are generated iteratively until
readability and similarity metrics indicate that no further adaptation
refinement can be successfully performed. Taking the average of all official
metrics, our submissions achieved first and second place in Plain Language and
Easy Read adaptation, respectively.
☆ Query-Focused Extractive Summarization for Sentiment Explanation
Constructive analysis of feedback from clients often requires determining the
cause of their sentiment from a substantial amount of text documents. To assist
and improve the productivity of such endeavors, we leverage the task of
Query-Focused Summarization (QFS). Models of this task are often impeded by the
linguistic dissonance between the query and the source documents. We propose
and substantiate a multi-bias framework to help bridge this gap at a
domain-agnostic, generic level; we then formulate specialized approaches for
the problem of sentiment explanation through sentiment-based biases and query
expansion. We achieve experimental results outperforming baseline models on a
real-world proprietary sentiment-aware QFS dataset.
☆ Lost in Embeddings: Information Loss in Vision-Language Models
Vision--language models (VLMs) often process visual inputs through a
pretrained vision encoder, followed by a projection into the language model's
embedding space via a connector component. While crucial for modality fusion,
the potential information loss induced by this projection step and its direct
impact on model capabilities remain understudied. We introduce two
complementary approaches to examine and quantify this loss by analyzing the
latent representation space. First, we evaluate semantic information
preservation by analyzing changes in k-nearest neighbor relationships between
image representations, before and after projection. Second, we directly measure
information loss by reconstructing visual embeddings from the projected
representation, localizing loss at an image patch level. Experiments reveal
that connectors substantially distort the local geometry of visual
representations, with k-nearest neighbors diverging by 40--60\%
post-projection, correlating with degradation in retrieval performance. The
patch-level embedding reconstruction provides interpretable insights for model
behavior on visually grounded question-answering tasks, finding that areas of
high information loss reliably predict instances where models struggle.
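A hedged sketch of the first analysis, measuring how much a connector distorts local geometry: compute each image's k nearest neighbors before and after projection and report the mean overlap of the two neighbor sets. The synthetic data and the toy linear-plus-ReLU connector below are stand-ins for real vision-encoder outputs and a real connector.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    n, d_vis, d_llm, k = 500, 64, 32, 10
    vis = rng.normal(size=(n, d_vis))          # stand-in for vision-encoder outputs
    W = rng.normal(size=(d_vis, d_llm)) * 0.1  # stand-in for a (lossy) connector
    proj = np.maximum(vis @ W, 0.0)            # projected representations

    def knn_sets(X, k):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop the self-neighbor
        return [set(row) for row in idx]

    before, after = knn_sets(vis, k), knn_sets(proj, k)
    overlap = np.mean([len(b & a) / k for b, a in zip(before, after)])
    print(f"mean k-NN overlap after projection: {overlap:.2f}")  # 1.0 = geometry preserved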
☆ MillStone: How Open-Minded Are LLMs?
Large language models equipped with Web search, information retrieval tools,
and other agentic capabilities are beginning to supplant traditional search
engines. As users start to rely on LLMs for information on many topics,
including controversial and debatable issues, it is important to understand how
the stances and opinions expressed in LLM outputs are influenced by the
documents they use as their information sources.
In this paper, we present MillStone, the first benchmark that aims to
systematically measure the effect of external arguments on the stances that
LLMs take on controversial issues (not all of them political). We apply
MillStone to nine leading LLMs and measure how ``open-minded'' they are to
arguments supporting opposite sides of these issues, whether different LLMs
agree with each other, which arguments LLMs find most persuasive, and whether
these arguments are the same for different LLMs.
In general, we find that LLMs are open-minded on most issues. An
authoritative source of information can easily sway an LLM's stance,
highlighting the importance of source selection and the risk that LLM-based
information retrieval and search systems can be manipulated.
comment: 19 pages, 7 tables, 7 figures
☆ ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
As large language models (LLMs) increasingly interact with external tools,
reward modeling for tool use has become a critical yet underexplored area.
Existing reward models, trained primarily on natural language outputs, struggle
to evaluate tool-based reasoning and execution. To quantify this gap, we
introduce FC-RewardBench, the first benchmark designed to systematically assess
reward models' performance in tool-calling scenarios. Our analysis shows that
current reward models often miss key signals of effective tool use,
highlighting the need for domain-specific modeling. To address this, we propose
a training framework for outcome-based reward models using data synthesized
from permissively licensed, open-weight LLMs. We train models ranging from 1.7B
to 14B parameters and evaluate them across seven out-of-domain benchmarks.
These models consistently outperform general-purpose baselines, achieving up to
25\% average improvement in downstream task performance and enabling
data-efficient fine-tuning through reward-guided filtering.
☆ Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding ICML
Vision-Language Models (VLMs) enable powerful multimodal reasoning but suffer
from slow autoregressive inference, limiting their deployment in real-time
applications. We introduce Spec-LLaVA, a system that applies speculative
decoding to accelerate VLMs without sacrificing output quality. Spec-LLaVA
pairs a lightweight draft VLM with a large target model: the draft speculates
future tokens, which the target verifies in parallel, allowing multiple tokens
to be generated per step. To maximize efficiency, we design a dynamic
tree-based verification algorithm that adaptively expands and prunes
speculative branches using draft model confidence. On MS COCO out-of-domain
images, Spec-LLaVA achieves up to 3.28$\times$ faster decoding on LLaVA-1.5
(7B, 13B) with no loss in generation quality. This work presents a lossless
acceleration framework for VLMs using dynamic tree-structured speculative
decoding, opening a path toward practical real-time multimodal assistants.
Importantly, the lightweight draft model design makes the framework amenable to
resource-constrained or on-device deployment settings.
comment: 7 pages, accepted by ICML TTODLer-FM workshop
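A simplified draft-and-verify loop in the spirit of speculative decoding, to illustrate why several tokens can be emitted per target-model step. The paper's dynamic tree-based verification expands and prunes multiple speculative branches; this toy version keeps a single chain and uses stand-in "models".

    import random

    random.seed(0)
    VOCAB = list("abcde")

    def draft_next(context):   # small, fast draft model (toy stand-in)
        return VOCAB[len(context) % len(VOCAB)]

    def target_next(context):  # large target model (toy stand-in that usually agrees)
        return draft_next(context) if random.random() < 0.8 else random.choice(VOCAB)

    def speculative_step(context, k=4):
        # Draft k tokens cheaply, then verify them against the target model.
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(context + "".join(drafted)))
        accepted = []
        for tok in drafted:
            t = target_next(context + "".join(accepted))
            if t == tok:
                accepted.append(tok)  # target agrees: the drafted token is kept "for free"
            else:
                accepted.append(t)    # disagreement: fall back to the target's token and stop
                break
        return "".join(accepted)

    context = "ab"
    for _ in range(3):
        context += speculative_step(context)
    print(context)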
☆ How to Evaluate Medical AI
Ilia Kopanichuk, Petr Anokhin, Vladimir Shaposhnikov, Vladimir Makharev, Ekaterina Tsapieva, Iaroslav Bespalov, Dmitry V. Dylov, Ivan Oseledets
The integration of artificial intelligence (AI) into medical diagnostic
workflows requires robust and consistent evaluation methods that ensure
reliability and clinical relevance while accounting for the inherent variability
in expert judgments. Traditional metrics like precision and recall often fail to account
for the inherent variability in expert judgments, leading to inconsistent
assessments of AI performance. Inter-rater agreement statistics like Cohen's
Kappa are more reliable, but they lack interpretability. We introduce Relative
Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD), new
evaluation metrics that compare AI outputs against multiple expert opinions
rather than a single reference. By normalizing performance against inter-expert
disagreement, these metrics provide a more stable and realistic measure of the
quality of predicted diagnoses. In addition to a comprehensive analysis of
diagnostic quality measures, our study yields an important side result.
Our evaluation methodology allows us to avoid selecting diagnoses from a
limited list when evaluating a given case. Instead, both the models being
tested and the examiners verifying them arrive at a free-form diagnosis. In
this automated methodology for establishing the identity of free-form clinical
diagnoses, a remarkable 98% accuracy becomes attainable. We evaluate our
approach using 360 medical dialogues, comparing multiple large language models
(LLMs) against a panel of physicians. This large-scale study shows that
top-performing models, such as DeepSeek-V3, achieve consistency on par with or
exceeding expert consensus. Moreover, we demonstrate that expert judgments
exhibit significant variability - often greater than that between AI and
humans. This finding underscores the limitations of any absolute metrics and
supports the need to adopt relative metrics in medical AI.
comment: 10 pages, 7 figures
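An illustration of the idea behind relative metrics: normalize a model's agreement with experts by the experts' agreement with each other, so a score near 1.0 means the model is on par with expert consensus. The exact RPAD/RRAD definitions are in the paper; the simple agreement ratio below is only a plausible reading of the abstract, computed on toy data.

    from itertools import combinations

    def agreement(labels_a, labels_b):
        return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

    # Toy case-level diagnoses (already normalized to a canonical form).
    expert_panel = [
        ["flu", "migraine", "asthma", "flu"],
        ["flu", "tension headache", "asthma", "flu"],
        ["cold", "migraine", "asthma", "flu"],
    ]
    model = ["flu", "migraine", "asthma", "cold"]

    pairs = list(combinations(expert_panel, 2))
    inter_expert = sum(agreement(a, b) for a, b in pairs) / len(pairs)
    model_vs_experts = sum(agreement(model, e) for e in expert_panel) / len(expert_panel)
    relative_score = model_vs_experts / inter_expert  # ~1.0: on par with expert consensus
    print(round(inter_expert, 2), round(model_vs_experts, 2), round(relative_score, 2))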
☆ Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation
Large language models (LLMs) are increasingly used in everyday communication,
including multilingual interactions across different cultural contexts. While
LLMs can now generate near-perfect literal translations, it remains unclear
whether LLMs support culturally appropriate communication. In this paper, we
analyze the cultural sensitivity of different LLM designs when applied to
English-Japanese translations of workplace e-mails. Here, we vary the prompting
strategies: (1) naive "just translate" prompts, (2) audience-targeted prompts
specifying the recipient's cultural background, and (3) instructional prompts
with explicit guidance on Japanese communication norms. Using a mixed-methods
study, we then analyze culture-specific language patterns to evaluate how well
translations adapt to cultural norms. Further, we examine the appropriateness
of the tone of the translations as perceived by native speakers. We find that
culturally-tailored prompting can improve cultural fit, based on which we offer
recommendations for designing culturally inclusive LLMs in multilingual
settings.
☆ Uncertainty in Authorship: Why Perfect AI Detection Is Mathematically Impossible
As large language models (LLMs) become more advanced, it is increasingly
difficult to distinguish between human-written and AI-generated text. This
paper draws a conceptual parallel between quantum uncertainty and the limits of
authorship detection in natural language. We argue that there is a fundamental
trade-off: the more confidently one tries to identify whether a text was
written by a human or an AI, the more one risks disrupting the text's natural
flow and authenticity. This mirrors the tension between precision and
disturbance found in quantum systems. We explore how current detection
methods--such as stylometry, watermarking, and neural classifiers--face
inherent limitations. Enhancing detection accuracy often leads to changes in
the AI's output, making other features less reliable. In effect, the very act
of trying to detect AI authorship introduces uncertainty elsewhere in the text.
Our analysis shows that when AI-generated text closely mimics human writing,
perfect detection becomes not just technologically difficult but theoretically
impossible. We address counterarguments and discuss the broader implications
for authorship, ethics, and policy. Ultimately, we suggest that the challenge
of AI-text detection is not just a matter of better tools--it reflects a
deeper, unavoidable tension in the nature of language itself.
☆ Growing Perspectives: Modelling Embodied Perspective Taking and Inner Narrative Development Using Large Language Models
Sabrina Patania, Luca Annese, Anna Lambiase, Anita Pellegrini, Tom Foulsham, Azzurra Ruggeri, Silvia Rossi, Silvia Serino, Dimitri Ognibene
Language and embodied perspective taking are essential for human
collaboration, yet few computational models address both simultaneously. This
work investigates the PerspAct system [1], which integrates the ReAct (Reason
and Act) paradigm with Large Language Models (LLMs) to simulate developmental
stages of perspective taking, grounded in Selman's theory [2]. Using an
extended director task, we evaluate GPT's ability to generate internal
narratives aligned with specified developmental stages, and assess how these
influence collaborative performance both qualitatively (action selection) and
quantitatively (task efficiency). Results show that GPT reliably produces
developmentally-consistent narratives before task execution but often shifts
towards more advanced stages during interaction, suggesting that language
exchanges help refine internal representations. Higher developmental stages
generally enhance collaborative effectiveness, while earlier stages yield more
variable outcomes in complex contexts. These findings highlight the potential
of integrating embodied perspective taking and language in LLMs to better model
developmental dynamics and stress the importance of evaluating internal speech
during combined linguistic and embodied tasks.
comment: Accepted at ICDL https://icdl2025.fel.cvut.cz/
☆ MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues
Weishu Chen, Jinyi Tang, Zhouhui Hou, Shihao Han, Mingjie Zhan, Zhiyuan Huang, Delong Liu, Jiawei Guo, Zhicheng Zhao, Fei Su
Memory extraction is crucial for maintaining coherent ultra-long dialogues in
human-robot role-playing scenarios. However, existing methods often exhibit
uncontrolled memory growth. To address this, we propose MOOM, the first
dual-branch memory plugin that leverages literary theory by modeling plot
development and character portrayal as core storytelling elements.
Specifically, one branch summarizes plot conflicts across multiple time scales,
while the other extracts the user's character profile. MOOM further integrates
a forgetting mechanism, inspired by the ``competition-inhibition'' memory
theory, to constrain memory capacity and mitigate uncontrolled growth.
Furthermore, we present ZH-4O, a Chinese ultra-long dialogue dataset
specifically designed for role-playing, featuring dialogues that average 600
turns and include manually annotated memory information. Experimental results
demonstrate that MOOM outperforms all state-of-the-art memory extraction
methods, requiring fewer large language model invocations while maintaining a
controllable memory capacity.
☆ The AI Memory Gap: Users Misremember What They Created With AI or Without
As large language models (LLMs) become embedded in interactive text
generation, disclosure of AI as a source depends on people remembering which
ideas or texts came from themselves and which were created with AI. We
investigate how accurately people remember the source of content when using AI.
In a pre-registered experiment, 184 participants generated and elaborated on
ideas both unaided and with an LLM-based chatbot. One week later, they were
asked to identify the source (noAI vs withAI) of these ideas and texts. Our
findings reveal a significant gap in memory: After AI use, the odds of correct
attribution dropped, with the steepest decline in mixed human-AI workflows,
where either the idea or elaboration was created with AI. We validated our
results using a computational model of source memory. Discussing broader
implications, we highlight the importance of considering source confusion in
the design and use of interactive text generation technologies.
comment: 31 pages, 10 figures, 9 tables
☆ Collaborative Document Editing with Multiple Users and AI Agents
Current AI writing support tools are largely designed for individuals,
complicating collaboration when co-writers must leave the shared workspace to
use AI and then communicate and reintegrate results. We propose integrating AI
agents directly into collaborative writing environments. Our prototype makes AI
use transparent and customisable through two new shared objects: agent profiles
and tasks. Agent responses appear in the familiar comment feature. In a user
study (N=30), 14 teams worked on writing projects during one week. Interaction
logs and interviews show that teams incorporated agents into existing norms of
authorship, control, and coordination, rather than treating them as team
members. Agent profiles were viewed as personal territory, while created agents
and outputs became shared resources. We discuss implications for team-based AI
interaction, highlighting opportunities and boundaries for treating AI as a
shared resource in collaborative work.
comment: 34 pages, 10 figures, 4 tables
☆ SCDTour: Embedding Axis Ordering and Merging for Interpretable Semantic Change Detection EMNLP2025
In Semantic Change Detection (SCD), it is a common problem to obtain
embeddings that are both interpretable and high-performing. However, improving
interpretability often leads to a loss in the SCD performance, and vice versa.
To address this problem, we propose SCDTour, a method that orders and merges
interpretable axes to alleviate the performance degradation of SCD. SCDTour
considers both (a) the semantic similarity between axes in the embedding space
and (b) the degree to which each axis contributes to semantic change.
Experimental results show that SCDTour preserves performance in semantic change
detection while maintaining high interpretability. Moreover, agglomerating the
sorted axes produces a more refined set of word senses, which achieves
comparable or improved performance against the original full-dimensional
embeddings in the SCD task. These findings demonstrate that SCDTour effectively
balances interpretability and SCD performance, enabling meaningful
interpretation of semantic shifts through a small number of refined axes.
Source code is available at https://github.com/LivNLP/svp-tour .
comment: Findings of EMNLP2025
☆ Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
Current unlearning techniques and safety training consistently fail to remove
dangerous knowledge from language models. We analyze the root causes and
propose a highly selective technique which unlearns robustly and without
disrupting general performance.
We perform PCA on activations and module output gradients to identify
subspaces containing common representations, and collapse them before
calculating unlearning updates. This way we avoid unlearning general
representations, and only target those specific to the unlearned facts.
When unlearning WMDP dataset facts from Llama-3.1-8B, we drop post-attack
accuracy 80x more than our best baseline (Circuit Breakers) on biohazardous
facts and 30x more on cyberhazardous facts. Despite this, we disrupt general
performance 30x less (only 0.1% WikiText loss increase), while requiring less
than 3 GPU-seconds per fact.
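A hedged sketch of the collapsing step: use PCA (via SVD) on activations from general data to identify a common subspace, then project an unlearning update onto its orthogonal complement so general representations are untouched. Shapes and the number of collapsed components are illustrative assumptions, not the paper's settings.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_general, k = 64, 1000, 8

    general_acts = rng.normal(size=(n_general, d))  # activations on general data (toy)
    centered = general_acts - general_acts.mean(axis=0)
    # Top-k principal directions = subspace of common/general representations.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    common = vt[:k]  # (k, d), orthonormal rows

    def collapse_common(update):
        # Project an unlearning update onto the complement of the common subspace.
        return update - common.T @ (common @ update)

    raw_update = rng.normal(size=d)        # stand-in for a fact-specific gradient
    safe_update = collapse_common(raw_update)
    print(np.abs(common @ safe_update).max())  # ~0: no change along general directions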
☆ PledgeTracker: A System for Monitoring the Fulfilment of Pledges EMNLP 2025
Yulong Chen, Michael Sejr Schlichtkrull, Zhenyun Deng, David Corney, Nasim Asl, Joshua Salisbury, Andrew Dudfield, Andreas Vlachos
Political pledges reflect candidates' policy commitments, but tracking their
fulfilment requires reasoning over incremental evidence distributed across
multiple, dynamically updated sources. Existing methods simplify this task into
a document classification task, overlooking its dynamic, temporal and
multi-document nature. To address this issue, we introduce
\textsc{PledgeTracker}, a system that reformulates pledge verification into
structured event timeline construction. PledgeTracker consists of three core
components: (1) a multi-step evidence retrieval module; (2) a timeline
construction module and; (3) a fulfilment filtering module, allowing the
capture of the evolving nature of pledge fulfilment and producing interpretable
and structured timelines. We evaluate PledgeTracker in collaboration with
professional fact-checkers in real-world workflows, demonstrating its
effectiveness in retrieving relevant evidence and reducing human verification
effort.
comment: EMNLP 2025 demo
☆ From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives
The widespread adoption of large language models (LLMs) in healthcare raises
critical questions about their ability to interpret patient-generated
narratives, which are often informal, ambiguous, and noisy. Existing benchmarks
typically rely on clean, structured clinical text, offering limited insight
into model performance under realistic conditions. In this work, we present a
novel synthetic dataset designed to simulate patient self-descriptions
characterized by varying levels of linguistic noise, fuzzy language, and
layperson terminology. Our dataset comprises clinically consistent scenarios
annotated with ground-truth diagnoses, spanning a spectrum of communication
clarity to reflect diverse real-world reporting styles. Using this benchmark,
we fine-tune and evaluate several state-of-the-art models (LLMs), including
BERT-based and encoder-decoder T5 models. To support reproducibility and future
research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset
of noisy, synthetic patient descriptions designed to stress-test and compare
the diagnostic capabilities of large language models (LLMs) under realistic
linguistic conditions. We made the benchmark available for the community:
https://github.com/lielsheri/PatientSignal
comment: 6 pages, 1 figure
☆ When Curiosity Signals Danger: Predicting Health Crises Through Online Medication Inquiries
Online medical forums are a rich and underutilized source of insight into
patient concerns, especially regarding medication use. Some of the many
questions users pose may signal confusion, misuse, or even the early warning
signs of a developing health crisis. Detecting these critical questions that
may precede severe adverse events or life-threatening complications is vital
for timely intervention and improving patient safety. This study introduces a
novel annotated dataset of medication-related questions extracted from online
forums. Each entry is manually labelled for criticality based on clinical risk
factors. We benchmark the performance of six traditional machine learning
classifiers using TF-IDF textual representations, alongside three
state-of-the-art large language model (LLM)-based classification approaches
that leverage deep contextual understanding. Our results highlight the
potential of classical and modern methods to support real-time triage and alert
systems in digital health spaces. The curated dataset is made publicly
available to encourage further research at the intersection of
patient-generated data, natural language processing, and early warning systems
for critical health events. The dataset and benchmark are available at:
https://github.com/Dvora-coder/LLM-Medication-QA-Risk-Classifier-MediGuard.
comment: 5 pages, 2 figures
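A minimal sketch of the classical baseline pipeline mentioned above: TF-IDF features feeding a traditional classifier that flags critical medication questions. The inline texts and labels are invented placeholders, not items from the released dataset.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "Can I take ibuprofen with my blood thinner?",      # critical (toy label)
        "I doubled my insulin dose by mistake, what now?",  # critical (toy label)
        "What color is the 50mg tablet?",                   # non-critical (toy label)
        "Does this medication come in a liquid form?",      # non-critical (toy label)
    ]
    labels = [1, 1, 0, 0]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["I accidentally took two doses of my heart medication"]))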
☆ User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums
Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, Helena Holmström Olsson
Customer feedback in industrial forums reflects a rich but underexplored
source of insight into real-world product experience. These publicly shared
discussions offer an organic view of user expectations, frustrations, and
success stories shaped by the specific contexts of use. Yet, harnessing this
information for systematic analysis remains challenging due to the unstructured
and domain-specific nature of the content. The lack of structure and
specialized vocabulary makes it difficult for traditional data analysis
techniques to accurately interpret, categorize, and quantify the feedback,
thereby limiting its potential to inform product development and support
strategies. To address these challenges, this paper presents the User
eXperience Perception Insights Dataset (UXPID), a collection of 7130
artificially synthesized and anonymized user feedback branches extracted from a
public industrial automation forum. Each JavaScript object notation (JSON)
record contains multi-post comments related to specific hardware and software
products, enriched with metadata and contextual conversation data. Leveraging a
large language model (LLM), each branch is systematically analyzed and
annotated for UX insights, user expectations, severity and sentiment ratings,
and topic classifications. The UXPID dataset is designed to facilitate research
in user requirements, user experience (UX) analysis, and AI-driven feedback
processing, particularly where privacy and licensing restrictions limit access
to real-world data. UXPID supports the training and evaluation of
transformer-based models for tasks such as issue detection, sentiment analysis,
and requirements extraction in the context of technical forums.
☆ An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents
Declaration of Performance (DoP) documents, mandated by EU regulation,
certify the performance of construction products. While some of their content
is standardized, DoPs vary widely in layout, language, schema, and format,
posing challenges for automated key-value pair extraction (KVP) and question
answering (QA). Existing static or LLM-only IE pipelines often hallucinate and
fail to adapt to this structural diversity. Our domain-specific, stateful
agentic system addresses these challenges through a planner-executor-responder
architecture. The system infers user intent, detects document modality, and
orchestrates tools dynamically for robust, traceable reasoning while avoiding
tool misuse or execution loops. Evaluation on a curated DoP dataset
demonstrates improved robustness across formats and languages, offering a
scalable solution for structured data extraction in regulated workflows.
☆ Room acoustics affect communicative success in hybrid meeting spaces: a pilot study
Since the COVID-19 pandemic in 2020, universities and companies have
increasingly integrated hybrid features into their meeting spaces, or even
created dedicated rooms for this purpose. While the importance of a fast and
stable internet connection is often prioritized, the acoustic design of seminar
rooms is frequently overlooked. Poor acoustics, particularly excessive
reverberation, can lead to issues such as misunderstandings, reduced speech
intelligibility or cognitive and vocal fatigue. This pilot study investigates
whether room acoustic interventions in a seminar room at Graz University of
Technology support better communication in hybrid meetings. For this purpose,
we recorded two groups of persons twice, once before and once after improving
the acoustics of the room. Our findings -- despite not reaching statistical
significance due to the small sample size -- indicate clearly that our spatial
interventions improve communicative success in hybrid meetings. To make the
paper accessible also to readers from the speech communication community, we
provide background on room acoustics relevant to the interpretation of our
results.
☆ CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model ACL 2025
Wei-Hsin Yeh, Yu-An Su, Chih-Ning Chen, Yi-Hsueh Lin, Calvin Ku, Wen-Hsin Chiu, Min-Chun Hu, Lun-Wei Ku
Motion instruction is a crucial task that helps athletes refine their
technique by analyzing movements and providing corrective guidance. Although
recent advances in multimodal models have improved motion understanding,
generating precise and sport-specific instruction remains challenging due to
the highly domain-specific nature of sports and the need for informative
guidance. We propose CoachMe, a reference-based model that analyzes the
differences between a learner's motion and a reference under temporal and
physical aspects. This approach enables both domain-knowledge learning and the
acquisition of a coach-like thinking process that identifies movement errors
effectively and provides feedback to explain how to improve. In this paper, we
illustrate how CoachMe adapts well to specific sports such as skating and
boxing by learning from general movements and then leveraging limited data.
Experiments show that CoachMe provides high-quality instructions, rather than
directions that merely adopt a coach's tone while lacking critical information.
CoachMe outperforms GPT-4o by 31.6% in G-Eval on figure skating and by 58.3% on
boxing. Analysis further confirms that it elaborates on errors and their
corresponding improvement methods in the generated instructions. You can find
CoachMe here: https://motionxperts.github.io/
comment: Published in Proceedings of the 63rd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025.
Official version: https://doi.org/10.18653/v1/2025.acl-long.1413
☆ A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection
As the Internet and social media evolve rapidly, distinguishing credible news
from a vast amount of complex information poses a significant challenge. Due to
the suddenness and instability of news events, the authenticity labels of news
can potentially shift as events develop, making it crucial for fake news
detection to obtain the latest event updates. Existing methods employ
retrieval-augmented generation to fill knowledge gaps, but they suffer from
issues such as insufficient credibility of retrieved content and interference
from noisy information. We propose a dynamic knowledge update-driven model for
fake news detection (DYNAMO), which leverages knowledge graphs to achieve
continuous updating of new knowledge and integrates with large language models
to fulfill dual functions: news authenticity detection and verification of new
knowledge correctness, solving the two key problems of ensuring the
authenticity of new knowledge and deeply mining news semantics. Specifically,
we first construct a news-domain-specific knowledge graph. Then, we use Monte
Carlo Tree Search to decompose complex news and verify them step by step.
Finally, we extract and update new knowledge from verified real news texts and
reasoning paths. Experimental results demonstrate that DYNAMO achieves the best
performance on two real-world datasets.
☆ Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs
Telecom domain 3GPP documents are replete with images containing sequence
diagrams. Advances in Vision-Language Large Models (VLMs) have eased conversion
of such images to machine-readable PlantUML (puml) formats. However, there is a
gap in evaluation of such conversions - existing works do not compare puml
scripts for various components. In this work, we propose performance metrics to
measure the effectiveness of such conversions. A dataset of sequence diagrams
from 3GPP documents is chosen to be representative of domain-specific actual
scenarios. We compare puml outputs from two VLMs - Claude Sonnet and GPT-4V -
against manually created ground truth representations. We use version control
tools to capture differences and introduce standard performance metrics to
measure accuracies along various components: participant identification,
message flow accuracy, sequence ordering, and grouping construct preservation.
We demonstrate effectiveness of proposed metrics in quantifying conversion
errors across various components of puml scripts. The results show that nodes,
edges and messages are accurately captured. However, we observe that VLMs do
not necessarily perform well on complex structures such as notes, boxes, and
groups. Our experiments and performance metrics indicate a need for better
representation of these components in training data for fine-tuned VLMs.
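A hedged sketch of component-wise comparison between a generated PlantUML script and a ground-truth one: extract participants and messages with simple regexes and score each component separately. The paper's metrics and tooling are richer; the regexes and toy diagrams below are assumptions for illustration only.

    import re

    def parse_puml(src):
        participants = set(re.findall(r"^participant\s+(\S+)", src, flags=re.M))
        messages = re.findall(r"^(\S+)\s*->\s*(\S+)\s*:\s*(.+)$", src, flags=re.M)
        return participants, messages

    ground_truth = """participant UE
    participant AMF
    UE -> AMF : Registration Request
    AMF -> UE : Registration Accept"""

    generated = """participant UE
    participant AMF
    UE -> AMF : Registration Request
    AMF -> UE : Registration Reject"""

    gt_p, gt_m = parse_puml(ground_truth)
    gen_p, gen_m = parse_puml(generated)
    participant_acc = len(gt_p & gen_p) / len(gt_p)
    message_acc = len(set(gt_m) & set(gen_m)) / len(gt_m)
    print(f"participants: {participant_acc:.2f}, messages: {message_acc:.2f}")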
☆ MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs
We propose MindVL, a multimodal large language model trained on Ascend NPUs.
Similar to Qwen2.5-VL, MindVL adopts native-resolution Vision Transformers,
which enables it to process images at their original variable resolutions. This
design avoids the degradation caused by fixed-resolution tiling while
preserving fine-grained details and global layouts, which is crucial for
visually dense content such as complex charts and diagrams. To ensure the
smooth training of MindVL on Ascend NPUs, we develop Mindspeed-MLLM, a
distributed multimodal training framework tailored for Ascend NPUs. To maintain
training accuracy, we implement equivalent replacements for certain operators.
MindVL undergoes a three-phase training process, namely the warm-up phase,
multitask training phase, and supervised instruction tuning phase, to gradually
enhance its capabilities. This process starts with basic visual and multimodal
pre-training, followed by large-scale multitask training and instruction
tuning. We also adopt multimodal data packaging and hybrid parallelism
techniques, which significantly improve end-to-end training speed. To further
boost model performance, we specifically introduce test-time resolution search
and model weight averaging. Notably, despite using about 1/10 of the training
data required by Qwen2.5-VL, MindVL achieves performance on par with Qwen2.5-VL
in evaluations of general multimodal understanding and document/table
comprehension. Beyond overall scores, MindVL also delivers leading performance
in OCR assessments.
☆ MALLM: Multi-Agent Large Language Models Framework EMNLP 2025
Multi-agent debate (MAD) has demonstrated the ability to augment collective
intelligence by scaling test-time compute and leveraging expertise. Current
frameworks for multi-agent debate are often designed towards tool use, lack
integrated evaluation, or provide limited configurability of agent personas,
response generators, discussion paradigms, and decision protocols. We introduce
MALLM (Multi-Agent Large Language Models), an open-source framework that
enables systematic analysis of MAD components. MALLM offers more than 144
unique configurations of MAD, including (1) agent personas (e.g., Expert,
Personality), (2) response generators (e.g., Critical, Reasoning), (3)
discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g.,
Voting, Consensus). MALLM uses simple configuration files to define a debate.
Furthermore, MALLM can load any textual Huggingface dataset (e.g., MMLU-Pro,
WinoGrande) and provides an evaluation pipeline for easy comparison of MAD
configurations. MALLM is tailored towards researchers and provides a window
into the heart of multi-agent debate, facilitating the understanding of its
components and their interplay.
comment: Accepted at EMNLP 2025 (Demo)
☆ EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI
The deployment of large language models (LLMs) in mental health and other
sensitive domains raises urgent questions about ethical reasoning, fairness,
and responsible alignment. Yet, existing benchmarks for moral and clinical
decision-making do not adequately capture the unique ethical dilemmas
encountered in mental health practice, where confidentiality, autonomy,
beneficence, and bias frequently intersect. To address this gap, we introduce
Ethical Reasoning in Mental Health (EthicsMH), a pilot dataset of 125 scenarios
designed to evaluate how AI systems navigate ethically charged situations in
therapeutic and psychiatric contexts. Each scenario is enriched with structured
fields, including multiple decision options, expert-aligned reasoning, expected
model behavior, real-world impact, and multi-stakeholder viewpoints. This
structure enables evaluation not only of decision accuracy but also of
explanation quality and alignment with professional norms. Although modest in
scale and developed with model-assisted generation, EthicsMH establishes a task
framework that bridges AI ethics and mental health decision-making. By
releasing this dataset, we aim to provide a seed resource that can be expanded
through community and expert contributions, fostering the development of AI
systems capable of responsibly handling some of society's most delicate
decisions.
☆ AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment EMNLP 2025
Multimodal Large Language Models (MLLMs) are increasingly applied in
Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to
expert evaluations. However, their predictions may reflect subtle biases
influenced by demographic factors such as gender, age, and education. In this
work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two
complementary dimensions: (1) stereotype bias, quantified by measuring
variations in aesthetic evaluations across demographic groups; and (2)
alignment between model outputs and genuine human aesthetic preferences. Our
benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and
introduces structured metrics (IFD, NRD, AAS) to assess both bias and
alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o,
Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL).
Results indicate that smaller models exhibit stronger stereotype biases,
whereas larger models align more closely with human preferences. Incorporating
identity information often exacerbates bias, particularly in emotional
judgments. These findings underscore the importance of identity-aware
evaluation frameworks in subjective vision-language tasks.
comment: Accepted by EMNLP 2025
☆ HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems
Spandan Anaokar, Shrey Ganatra, Harshvivek Kashid, Swapnil Bhattacharyya, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Large Language Models (LLMs) are widely used in industry but remain prone to
hallucinations, limiting their reliability in critical applications. This work
addresses hallucination reduction in consumer grievance chatbots built using
LLaMA 3.1 8B Instruct, a compact model frequently used in industry. We develop
HalluDetect, an LLM-based hallucination detection system that achieves an F1
score of 69%, outperforming baseline detectors by 25.44%. Benchmarking five
chatbot architectures, we find that AgentBot minimizes
hallucinations to 0.4159 per turn while maintaining the highest token accuracy
(96.13%), making it the most effective mitigation strategy. Our findings
provide a scalable framework for hallucination mitigation, demonstrating that
optimized inference strategies can significantly improve factual accuracy.
While applied to consumer law, our approach generalizes to other high-risk
domains, enhancing trust in LLM-driven assistants. We will release the code and
dataset.
comment: 6 pages + references + appendix, 3 figures, 2 tables
☆ Dynamic Span Interaction and Graph-Aware Memory for Entity-Level Sentiment Classification
Entity-level sentiment classification involves identifying the sentiment
polarity linked to specific entities within text. This task poses several
challenges: effectively modeling the subtle and complex interactions between
entities and their surrounding sentiment expressions; capturing dependencies
that may span across sentences; and ensuring consistent sentiment predictions
for multiple mentions of the same entity through coreference resolution.
Additionally, linguistic phenomena such as negation, ambiguity, and overlapping
opinions further complicate the analysis. These complexities make entity-level
sentiment classification a difficult problem, especially in real-world, noisy
textual data. To address these issues, we propose SpanEIT, a novel framework
integrating dynamic span interaction and graph-aware memory mechanisms for
enhanced entity-sentiment relational modeling. SpanEIT builds span-based
representations for entities and candidate sentiment phrases, employs
bidirectional attention for fine-grained interactions, and uses a graph
attention network to capture syntactic and co-occurrence relations. A
coreference-aware memory module ensures entity-level consistency across
documents. Experiments on FSAD, BARU, and IMDB datasets show SpanEIT
outperforms state-of-the-art transformer and hybrid baselines in accuracy and
F1 scores. Ablation and interpretability analyses validate the effectiveness of
our approach, underscoring its potential for fine-grained sentiment analysis in
applications like social media monitoring and customer feedback analysis.
☆ Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study
With many endangered languages at risk of disappearing, efforts to preserve
them now rely more than ever on using technology alongside culturally informed
teaching strategies. This study examines user behaviors in TALKA, a generative
AI-powered chatbot designed for Hakka language engagement, by employing a
dual-layered analytical framework grounded in Bloom's Taxonomy of cognitive
processes and dialogue act categorization. We analyzed 7,077 user utterances,
each carefully annotated according to six cognitive levels and eleven dialogue
act types. These included a variety of functions, such as asking for
information, requesting translations, making cultural inquiries, and using
language creatively. Pragmatic classifications further highlight how different
types of dialogue acts--such as feedback, control commands, and social
greetings--align with specific cognitive intentions. The results suggest that
generative AI chatbots can support language learning in meaningful
ways--especially when they are designed with an understanding of how users
think and communicate. They may also help learners express themselves more
confidently and connect with their cultural identity. The TALKA case provides
empirical insights into how AI-mediated dialogue facilitates cognitive
development in low-resource language learners, as well as pragmatic negotiation
and socio-cultural affiliation. By focusing on AI-assisted language learning,
this study offers new insights into how technology can support language
preservation and educational practice.
comment: Accepted to HICSS-59 (2026)
☆ Formal Reasoning for Intelligent QA Systems: A Case Study in the Educational Domain
Tuan Bui, An Nguyen, Phat Thai, Minh Hua, Ngan Pham L. N., Ngan Pham T. B., Dung Le, Long Nguyen, Thanh-Tung Tran, Thang Bui, Tho Quan
Reasoning is essential for closed-domain QA systems in which procedural
correctness and policy compliance are critical. While large language models
(LLMs) have shown strong performance on many reasoning tasks, recent work
reveals that their reasoning traces are often unfaithful - serving more as
plausible justifications than as causally grounded derivations. Efforts to
combine LLMs with symbolic engines (e.g., Prover9, Z3) have improved
reliability but remain limited to static forms of logic, struggling with
dynamic, state-based reasoning such as multi-step progressions and conditional
transitions.
In this paper, we propose MCFR (Model Checking for Formal Reasoning), a
neuro-symbolic framework that integrates LLMs with model checking to support
property verification. MCFR translates natural language into formal
specifications and verifies them over transition models. To support evaluation,
we introduce EduMC-QA, a benchmark dataset grounded in real academic
procedures. Our results show that MCFR improves reasoning faithfulness and
interpretability, offering a viable path toward verifiable QA in high-stakes
closed-domain applications. In addition to evaluating MCFR, we compare its
performance with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to
contextualize its effectiveness.
comment: Published at the 2nd ACM Workshop in AI-powered Question & Answering
Systems (AIQAM '25), co-located with ACM Multimedia 2025
☆ Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia -- Current Stage and Challenges
Rapid developments of large language models have revolutionized many NLP
tasks for English data. Unfortunately, the models and their evaluations for
low-resource languages are being overlooked, especially for languages in South
Asia. Although there are more than 650 languages in South Asia, many of them
either have very limited computational resources or are missing from existing
language models. Thus, a concrete question to be answered is: Can we assess the
current stage and challenges to inform our NLP community and facilitate model
developments for South Asian languages? In this survey, we have comprehensively
examined current efforts and challenges of NLP models for South Asian languages
by retrieving studies since 2020, with a focus on transformer-based models,
such as BERT, T5, & GPT. We present advances and gaps across 3 essential
aspects: data, models, & tasks, such as available data sources, fine-tuning
strategies, & domain applications. Our findings highlight substantial issues,
including missing data in critical domains (e.g., health), code-mixing, and
lack of standardized evaluation benchmarks. Our survey aims to raise awareness
within the NLP community for more targeted data curation, unify benchmarks
tailored to cultural and linguistic nuances of South Asia, and encourage an
equitable representation of South Asian languages. The complete list of
resources is available at: https://github.com/trust-nlp/LM4SouthAsia-Survey.
☆ D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs
Although Large Language Models (LLMs) have achieved remarkable success, their
practical application is often hindered by the generation of non-factual
content, which is called "hallucination". Ensuring the reliability of LLMs'
outputs is a critical challenge, particularly in high-stakes domains such as
finance, security, and healthcare. In this work, we revisit hallucination
detection from the perspective of model architecture and generation dynamics.
Leveraging the multi-layer structure and autoregressive decoding process of
LLMs, we decompose hallucination signals into two complementary dimensions: the
semantic breadth of token representations within each layer, and the semantic
depth of core concepts as they evolve across layers. Based on this insight, we
propose \textbf{D$^2$HScore (Dispersion and Drift-based Hallucination Score)},
a training-free and label-free framework that jointly measures: (1)
\textbf{Intra-Layer Dispersion}, which quantifies the semantic diversity of
token representations within each layer; and (2) \textbf{Inter-Layer Drift},
which tracks the progressive transformation of key token representations across
layers. To ensure drift reflects the evolution of meaningful semantics rather
than noisy or redundant tokens, we guide token selection using attention
signals. By capturing both the horizontal and vertical dynamics of
representation during inference, D$^2$HScore provides an interpretable and
lightweight proxy for hallucination detection. Extensive experiments across
five open-source LLMs and five widely used benchmarks demonstrate that
D$^2$HScore consistently outperforms existing training-free baselines.
comment: under review
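As a rough illustration of the two signals, intra-layer dispersion and inter-layer drift can be approximated with cosine distances over hidden states. This sketch is a simplification under stated assumptions (no attention-guided token selection) and not the paper's exact formulation.

```python
import numpy as np

def intra_layer_dispersion(layer_tokens: np.ndarray) -> float:
    """Mean pairwise cosine distance among token vectors of one layer.

    layer_tokens: array of shape (num_tokens, hidden_dim).
    """
    x = layer_tokens / (np.linalg.norm(layer_tokens, axis=1, keepdims=True) + 1e-8)
    sims = x @ x.T
    n = len(x)
    off_diag = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return float(1.0 - off_diag)

def inter_layer_drift(token_per_layer: np.ndarray) -> float:
    """Mean cosine distance of one token's vector between consecutive layers.

    token_per_layer: array of shape (num_layers, hidden_dim).
    """
    x = token_per_layer / (np.linalg.norm(token_per_layer, axis=1, keepdims=True) + 1e-8)
    step_sims = np.sum(x[:-1] * x[1:], axis=1)
    return float(np.mean(1.0 - step_sims))

# Toy usage with random hidden states (4 layers, 6 tokens, dim 16).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 6, 16))
dispersion = np.mean([intra_layer_dispersion(hidden[l]) for l in range(4)])
drift = inter_layer_drift(hidden[:, 0, :])   # drift of the first token
print(dispersion, drift)
```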
☆ HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking
Retrieval-Augmented Generation (RAG) enhances the response capabilities of
language models by integrating external knowledge sources. However, document
chunking, an important part of RAG systems, often lacks effective evaluation
tools. This paper first analyzes why existing RAG evaluation benchmarks are
inadequate for assessing document chunking quality, specifically due to
evidence sparsity. Based on this conclusion, we propose HiCBench, which
includes manually annotated multi-level document chunking points, synthesized
evidence-dense question-answer (QA) pairs, and their corresponding evidence
sources. Additionally, we introduce the HiChunk framework, a multi-level
document structuring framework based on fine-tuned LLMs, combined with the
Auto-Merge retrieval algorithm to improve retrieval quality. Experiments
demonstrate that HiCBench effectively evaluates the impact of different
chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves
better chunking quality within reasonable time consumption, thereby enhancing
the overall performance of RAG systems.
comment: 17 pages, 5 figures, 6 tables
☆ HARP: Hallucination Detection via Reasoning Subspace Projection
Hallucinations in Large Language Models (LLMs) pose a major barrier to their
reliable use in critical decision-making. Although existing hallucination
detection methods have improved accuracy, they still struggle with
disentangling semantic and reasoning information and maintaining robustness. To
address these challenges, we propose HARP (Hallucination detection via
reasoning subspace projection), a novel hallucination detection framework. HARP
establishes that the hidden state space of LLMs can be decomposed into a direct
sum of a semantic subspace and a reasoning subspace, where the former encodes
linguistic expression and the latter captures internal reasoning processes.
Moreover, we demonstrate that the Unembedding layer can disentangle these
subspaces, and by applying Singular Value Decomposition (SVD) to its
parameters, the basis vectors spanning the semantic and reasoning subspaces are
obtained. Finally, HARP projects hidden states onto the basis vectors of the
reasoning subspace, and the resulting projections are then used as input
features for hallucination detection in LLMs. By using these projections, HARP
reduces the dimension of the feature to approximately 5% of the original,
filters out most noise, and achieves enhanced robustness. Experiments across
multiple datasets show that HARP achieves state-of-the-art hallucination
detection performance; in particular, it achieves an AUROC of 92.8% on
TriviaQA, outperforming the previous best method by 7.5%.
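A minimal sketch of the projection idea, assuming the reasoning subspace is spanned by a subset of right singular vectors of the unembedding matrix; the paper's actual criterion for splitting semantic and reasoning subspaces is not reproduced here, and the split index is an assumption.

```python
import numpy as np

def reasoning_subspace_projection(hidden_state, unembedding, k_semantic=64):
    """Illustrative sketch: project a hidden state onto a 'reasoning'
    subspace derived from an SVD of the unembedding matrix.

    Assumption (not the paper's exact rule): the top-k right singular
    vectors are treated as the semantic subspace and the remaining ones
    as the reasoning subspace.
    """
    # unembedding: (vocab_size, hidden_dim); right singular vectors span
    # hidden-space directions ordered by how strongly they map to logits.
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    reasoning_basis = vt[k_semantic:]          # (hidden_dim - k_semantic, hidden_dim)
    return reasoning_basis @ hidden_state      # low-dimensional feature vector

# Toy usage: random unembedding (1000-word vocab, 128-dim hidden state).
rng = np.random.default_rng(0)
W_U = rng.normal(size=(1000, 128))
h = rng.normal(size=128)
features = reasoning_subspace_projection(h, W_U, k_semantic=96)
print(features.shape)  # (32,), a small fraction of the original dimension
```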
☆ On the Distinctive Co-occurrence Characteristics of Antonymy
Antonymy has long received particular attention in lexical semantics.
Previous studies have shown that antonym pairs frequently co-occur in text,
across genres and parts of speech, more often than would be expected by chance.
However, whether this co-occurrence pattern is distinctive of antonymy remains
unclear, due to a lack of comparison with other semantic relations. This work
fills the gap by comparing antonymy with three other relations across parts of
speech using robust co-occurrence metrics. We find that antonymy is distinctive
in three respects: antonym pairs co-occur with high strength, in a preferred
linear order, and within short spans. All results are available online.
comment: Accepted by *SEM 2025
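The abstract does not name its specific co-occurrence metrics; as one standard choice, co-occurrence strength within a short span can be measured with pointwise mutual information (PMI). The sketch below uses a deliberately loose normalization and is for illustration only.

```python
from collections import Counter
from itertools import combinations
import math

def window_pmi(sentences, pair, window=5):
    """Rough PMI of a word pair co-occurring within a short span.

    A standard metric sketched with loose normalization (token counts as
    denominators); the paper's exact measures may differ.
    """
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for tokens in sentences:
        total += len(tokens)
        word_counts.update(tokens)
        for i, j in combinations(range(len(tokens)), 2):
            if abs(i - j) <= window:
                pair_counts[frozenset((tokens[i], tokens[j]))] += 1
    p_pair = pair_counts[frozenset(pair)] / max(total, 1)
    p1 = word_counts[pair[0]] / total
    p2 = word_counts[pair[1]] / total
    if p_pair == 0 or p1 == 0 or p2 == 0:
        return float("-inf")
    return math.log2(p_pair / (p1 * p2))

corpus = [["the", "room", "was", "hot", "not", "cold"],
          ["hot", "days", "follow", "cold", "nights"]]
print(window_pmi(corpus, ("hot", "cold")))
```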
☆ PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable
performance in answering medical examinations. However, the extent to which
this high performance is transferable to medical questions in Spanish and from
a Latin American country remains unexplored. This knowledge is crucial as
LLM-based medical applications gain traction in Latin America. AIMS: to build a
dataset of questions from medical examinations taken by Peruvian physicians
pursuing specialty training; to fine-tune a LLM on this dataset; to evaluate
and compare the performance in terms of accuracy between vanilla LLMs and the
fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice
question-answering (MCQA) dataset containing 8,380 questions spanning 12
medical domains (2018-2025). We selected eight medical LLMs including
medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific
prompts to answer the questions appropriately. We employed parameter-efficient
fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it
utilizing all questions except those from 2025 (test set). RESULTS:
medgemma-27b-text-it outperformed all other models, achieving a proportion of
correct answers exceeding 90% in several instances. LLMs with <10 billion
parameters exhibited <60% of correct answers, while some exams yielded results
<50%. The fine-tuned version of medgemma-4b-it emerged victorious against all
LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters
across various examinations. CONCLUSIONS: For medical AI applications and
research that require knowledge bases from Spanish-speaking countries and those
exhibiting similar epidemiological profiles to Peru's, interested parties
should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
comment: https://github.com/rodrigo-carrillo/PeruMedQA
☆ LVLMs are Bad at Overhearing Human Referential Communication EMNLP 2025
During spontaneous conversations, speakers collaborate on novel referring
expressions, which they can then re-use in subsequent conversations.
Understanding such referring expressions is an important ability for an
embodied agent, so that it can carry out tasks in the real world. This requires
integrating and understanding language, vision, and conversational interaction.
We study the capabilities of seven state-of-the-art Large Vision Language
Models (LVLMs) as overhearers to a corpus of spontaneous conversations between
pairs of human discourse participants engaged in a collaborative
object-matching task. We find that such a task remains challenging for current
LVLMs and they all fail to show a consistent performance improvement as they
overhear more conversations from the same discourse participants repeating the
same task for multiple rounds. We release our corpus and code for
reproducibility and to facilitate future research.
comment: EMNLP 2025 (Main)
☆ Unsupervised Candidate Ranking for Lexical Substitution via Holistic Sentence Semantics
A key subtask in lexical substitution is ranking the given candidate words. A
common approach is to replace the target word with a candidate in the original
sentence and feed the modified sentence into a model to capture semantic
differences before and after substitution. However, effectively modeling the
bidirectional influence of candidate substitution on both the target word and
its context remains challenging. Existing methods often focus solely on
semantic changes at the target position or rely on parameter tuning over
multiple evaluation metrics, making it difficult to accurately characterize
semantic variation. To address this, we investigate two approaches: one based
on attention weights and another leveraging the more interpretable integrated
gradients method, both designed to measure the influence of context tokens on
the target token and to rank candidates by incorporating semantic similarity
between the original and substituted sentences. Experiments on the LS07 and
SWORDS datasets demonstrate that both approaches improve ranking performance.
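A minimal sketch of the candidate-ranking idea: score each candidate by how little the holistic sentence meaning changes after substitution. This follows the general setup in the abstract but omits the attention-weight and integrated-gradients weighting; the embedding model name is an assumption.

```python
# Minimal sketch of ranking substitution candidates by semantic similarity
# between the original and substituted sentences; not the paper's full method.
from sentence_transformers import SentenceTransformer, util

def rank_candidates(sentence: str, target: str, candidates: list[str]):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    original_emb = model.encode(sentence, convert_to_tensor=True)
    scored = []
    for cand in candidates:
        substituted = sentence.replace(target, cand, 1)
        cand_emb = model.encode(substituted, convert_to_tensor=True)
        scored.append((cand, float(util.cos_sim(original_emb, cand_emb))))
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_candidates("The bright student solved the puzzle quickly.",
                      "bright", ["clever", "shiny", "smart"]))
```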
☆ DeDisCo at the DISRPT 2025 Shared Task: A System for Discourse Relation Classification EMNLP 2025
This paper presents DeDisCo, Georgetown University's entry in the DISRPT 2025
shared task on discourse relation classification. We test two approaches, using
an mt5-based encoder and a decoder-based approach using the openly available
Qwen model. We also experiment with training on an augmented dataset for
low-resource languages, using matched data translated automatically from
English, as well as some additional linguistic features inspired by
entries in previous editions of the Shared Task. Our system achieves a
macro-accuracy score of 71.28, and we provide some interpretation and error
analysis for our results.
comment: System submission for the DISRPT 2025 - Shared Task on Discourse
Relation Parsing and Treebanking, in conjunction with CODI-CRAC & EMNLP 2025.
1st place in Task 3: relation classification
☆ AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization
Fabrycio Leite Nakano Almada, Kauan Divino Pouso Mariano, Maykon Adriell Dutra, Victor Emanuel da Silva Monteiro, Juliana Resplande Sant'Anna Gomes, Arlindo Rodrigues Galvão Filho, Anderson da Silva Soares
Claim normalization, the transformation of informal social media posts into
concise, self-contained statements, is a crucial step in automated
fact-checking pipelines. This paper details our submission to the CLEF-2025
CheckThat! Task 2, which challenges systems to perform claim normalization
across twenty languages, divided into thirteen supervised (high-resource) and
seven zero-shot (no training data) tracks.
Our approach, leveraging fine-tuned Small Language Models (SLMs) for
supervised languages and Large Language Model (LLM) prompting for zero-shot
scenarios, achieved podium positions (top three) in fifteen of the twenty
languages. Notably, this included second-place rankings in eight languages,
five of which were among the seven designated zero-shot languages, underscoring
the effectiveness of our LLM-based zero-shot strategy. For Portuguese, our
initial development language, our system achieved an average METEOR score of
0.5290, ranking third. All implementation artifacts, including inference,
training, evaluation scripts, and prompt configurations, are publicly available
at https://github.com/ju-resplande/checkthat2025_normalization.
comment: 15 pages, 2 figures
☆ ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims
This paper presents our system for Task 3 of the CLEF 2025 CheckThat! Lab,
which focuses on verifying numerical and temporal claims using retrieved
evidence. We explore two complementary approaches: zero-shot prompting with
instruction-tuned large language models (LLMs) and supervised fine-tuning using
parameter-efficient LoRA. To enhance evidence quality, we investigate several
selection strategies, including full-document input and top-k sentence
filtering using BM25 and MiniLM. Our best-performing model, LLaMA fine-tuned
with LoRA, achieves strong performance on the English validation set. However, a
notable drop in the test set highlights a generalization challenge. These
findings underscore the importance of evidence granularity and model adaptation
for robust numerical fact verification.
comment: Notebook for the CheckThat! Lab at CLEF 2025
♻ ☆ Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Dongzhan Zhou, Xiao-yong Wei, Qing Li
Recently, Large Language Models (LLMs) have shown great potential in natural
language-driven molecule discovery. However, existing datasets and benchmarks
for molecule-text alignment are predominantly built on a one-to-one mapping,
measuring LLMs' ability to retrieve a single, pre-defined answer, rather than
their creative potential to generate diverse, yet equally valid, molecular
candidates. To address this critical gap, we propose Speak-to-Structure
(S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural
language-driven molecule generation. S^2-Bench is specifically designed for
one-to-many relationships, challenging LLMs to demonstrate genuine molecular
understanding and generation capabilities. Our benchmark includes three key
tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and
customized molecule generation (MolCustom), each probing a different aspect of
molecule discovery. We also introduce OpenMolIns, a large-scale instruction
tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like
GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 28 LLMs
shifts the focus from simple pattern recall to realistic molecular design,
paving the way for more capable LLMs in natural language-driven molecule
discovery.
comment: Our codes and datasets are available through
https://github.com/phenixace/TOMG-Bench
♻ ☆ Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation EMNLP 2025
Recent decoding methods improve the factuality of large language models
(LLMs) by refining how the next token is selected during generation. These
methods typically operate at the token level, leveraging internal
representations to suppress superficial patterns. Nevertheless, LLMs remain
prone to hallucinations, especially over longer contexts. In this paper, we
propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy
that actively decides when to apply contrasting layers during generation. By
casting decoding as a sequential decision-making problem, ActLCD employs a
reinforcement learning policy guided by a reward-aware classifier to optimize
factuality beyond the token level. Our experiments demonstrate that ActLCD
surpasses state-of-the-art methods across five benchmarks, showcasing its
effectiveness in mitigating hallucinations in diverse generation scenarios.
comment: 19 pages, 3 figures, EMNLP 2025
♻ ☆ Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations SC
An intrinsic aspect of every conversation is the way talk-time is shared
between multiple speakers. Conversations can be balanced, with each speaker
claiming a similar amount of talk-time, or imbalanced when one talks
disproportionately. Such overall distributions are the consequence of
continuous negotiations between the speakers throughout the conversation: who
should be talking at every point in time, and for how long? In this work we
introduce a computational framework for quantifying both the conversation-level
distribution of talk-time between speakers, as well as the lower-level dynamics
that lead to it. We derive a typology of talk-time sharing dynamics structured
by several intuitive axes of variation. By applying this framework to a large
dataset of video-chats between strangers, we confirm that, perhaps
unsurprisingly, different conversation-level distributions of talk-time are
perceived differently by speakers, with balanced conversations being preferred
over imbalanced ones, especially by those who end up talking less. Then we
reveal that -- even when they lead to the same level of overall balance --
different types of talk-time sharing dynamics are perceived differently by the
participants, highlighting the relevance of our newly introduced typology.
Finally, we discuss how our framework offers new tools to designers of
computer-mediated communication platforms, for both human-human and human-AI
communication.
comment: Accepted for publication at CSCW 2025. Code and data available in
ConvoKit (https://convokit.cornell.edu)
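A toy sketch of the kind of conversation-level statistic the framework quantifies, computing each speaker's share of talk-time from (speaker, duration) turns; the paper's lower-level dynamics and typology are not modeled here.

```python
from collections import defaultdict

def talk_time_shares(turns):
    """Fraction of total talk-time per speaker from (speaker, seconds) turns.

    A simple conversation-level statistic in the spirit of the framework.
    """
    totals = defaultdict(float)
    for speaker, seconds in turns:
        totals[speaker] += seconds
    grand_total = sum(totals.values())
    return {s: t / grand_total for s, t in totals.items()}

turns = [("A", 12.0), ("B", 4.5), ("A", 20.0), ("B", 7.5)]
shares = talk_time_shares(turns)
imbalance = max(shares.values()) - min(shares.values())
print(shares, imbalance)
```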
♻ ☆ Is In-Context Learning Learning?
In-context learning (ICL) allows some autoregressive models to solve tasks
via next-token prediction, without needing further training. This has led to
claims about these models' ability to solve (learn) unseen tasks with only a
few shots (exemplars) in the prompt. However, deduction does not always imply
learning, as ICL does not explicitly encode a given observation. Instead, the
models rely on their prior knowledge and the exemplars given, if any. We argue
that, mathematically, ICL does constitute learning, but its full
characterisation requires empirical work. We then carry out a large-scale
analysis of ICL ablating out or accounting for memorisation, pretraining,
distributional shifts, and prompting style and phrasing. We find that ICL is an
effective learning paradigm, but limited in its ability to learn and generalise
to unseen tasks. We note that, in the limit where exemplars become more
numerous, accuracy is insensitive to exemplar distribution, model, prompt
style, and the input's linguistic features. Instead, it deduces patterns from
regularities in the prompt, which leads to distributional sensitivity,
especially in prompting styles such as chain-of-thought. Given the varied
accuracies on formally similar tasks, we conclude that autoregression's ad-hoc
encoding is not a robust mechanism and suggests limited all-purpose
generalisability.
comment: Director's cut
♻ ☆ Hopscotch: Discovering and Skipping Redundancies in Language Models
Modern causal language models stack many attention blocks to improve
performance, but not all blocks are necessary for every task. We propose
Hopscotch, a simple yet effective method that identifies and skips attention
blocks with the least contribution to a task and adapts to preserve output
quality. Hopscotch jointly optimizes which blocks to skip and how to scale the
outputs of the remaining layers. By introducing lightweight, trainable scaling
parameters to attention and MLP blocks, it mitigates distribution shifts in
hidden states caused by removing attention blocks. Hopscotch does not modify
model weights or require access to pretraining or instruction-tuning data, and
is compatible with existing model compression techniques. When applied to
$\texttt{Llama-3.1-8B}$ and $\texttt{Qwen2.5-7B}$, Hopscotch achieves less than
a 2% drop in performance even after skipping four attention blocks.
comment: 10 pages, 4 figures, 9 tables
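An illustrative re-implementation of the core idea (not the authors' code): a block can be skipped outright, and a lightweight trainable scale on the remaining blocks' outputs compensates for the resulting shift in hidden states.

```python
import torch
import torch.nn as nn

class ScaledSkippableBlock(nn.Module):
    """Sketch of the Hopscotch idea: a block is either skipped entirely or
    applied with a learnable output scale that mitigates distribution shift.
    Illustrative only; not the authors' implementation.
    """
    def __init__(self, block: nn.Module, skip: bool = False):
        super().__init__()
        self.block = block
        self.skip = skip
        self.scale = nn.Parameter(torch.ones(1))  # trainable output scaling

    def forward(self, x):
        if self.skip:
            return x                               # identity: block removed
        return x + self.scale * self.block(x)      # scaled residual update

# Toy usage: wrap two stand-in "blocks" and skip the second one.
blocks = nn.Sequential(
    ScaledSkippableBlock(nn.Linear(16, 16), skip=False),
    ScaledSkippableBlock(nn.Linear(16, 16), skip=True),
)
print(blocks(torch.randn(2, 16)).shape)
```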
♻ ☆ Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability EMNLP 2025
Quality Estimation (QE) is the task of estimating the quality of model output
at inference time, when the ground truth is not available. Deriving output
quality from the models' output probability is the most straightforward and
low-effort approach. However,
we show that the output probability of text-generation models can appear
underconfident. At each output step, there can be multiple correct options,
making the probability distribution spread out more. Thus, lower probability
does not necessarily mean lower output quality. Due to this observation, we
propose a QE approach called BoostedProb, which boosts the model's confidence
in cases where there are multiple viable output options. With no increase in
complexity, BoostedProb is notably better than raw model probability in
different settings, achieving on average +0.194 improvement in Pearson
correlation to ground-truth quality. It also comes close to or outperforms more
costly approaches like supervised or ensemble-based QE in certain settings.
comment: Accepted to EMNLP 2025 Main Conference
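One plausible instantiation of the boosting idea, credited here as the probability mass of all near-equally probable alternatives at each step; the paper's exact boosting rule may differ, and the viability threshold below is an assumption.

```python
import numpy as np

def boosted_step_confidence(step_probs, chosen_idx, viability_ratio=0.5):
    """One possible 'boosting' of an underconfident step probability:
    if other tokens are nearly as probable as the chosen one, treat them
    as viable alternatives and credit their mass to the step.
    """
    chosen_p = step_probs[chosen_idx]
    viable = step_probs[step_probs >= viability_ratio * chosen_p]
    return float(min(viable.sum(), 1.0))

def sequence_quality(all_step_probs, chosen_indices):
    """Average boosted confidence over the output sequence."""
    return float(np.mean([boosted_step_confidence(p, i)
                          for p, i in zip(all_step_probs, chosen_indices)]))

# Toy example: two near-synonymous tokens split the mass at the second step.
steps = [np.array([0.85, 0.10, 0.05]), np.array([0.45, 0.42, 0.13])]
print(sequence_quality(steps, chosen_indices=[0, 0]))  # > raw mean probability
```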
♻ ☆ MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Shunian Chen, Qiming Zhu, Le Pan, Minghao Chen, Yuhao Zhang, Li Zhou, Benyou Wang, Haizhou Li
The rapid advancement of speech-to-speech (S2S) large language models (LLMs)
has significantly improved real-time spoken interaction. However, current
evaluation frameworks remain inadequate for assessing performance in complex,
multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn
S2S benchmark covering three core dimensions: Semantic Information,
Paralinguistic Information, and Ambient Sound. Each dimension includes nine
realistic scenarios, along with targeted tasks to assess specific capabilities
such as reasoning. Our dual-method evaluation framework combines Arena-style
evaluation (pairwise comparison) and Rubrics-based evaluation (absolute
scoring) for relative and absolute assessment. The benchmark includes both
model and human outputs, evaluated by human evaluators and LLMs. Experimental
results reveal two sets of findings. Overall performance of S2S LLMs: (1)
models excel at semantic information processing yet underperform on
paralinguistic information and ambient sounds perception; (2) models typically
regain coherence by increasing response length, sacrificing efficiency in
multi-turn dialogues; (3) modality-aware, task-specific designs outperform
brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics
yield consistent, complementary rankings, but reliable distinctions emerge only
when performance gaps are large; (2) LLM-as-a-judge aligns with humans when
gaps are clear or criteria explicit, but exhibits position and length biases
and is reliable on nonverbal evaluation only with text annotations. These
results highlight current limitations in S2S evaluation and the need for more
robust, speech-aware assessment frameworks.
♻ ☆ GmSLM : Generative Marmoset Spoken Language Modeling
Marmoset monkeys exhibit complex vocal communication, challenging the view
that nonhuman primates' vocal communication is entirely innate, and show
features similar to those of human speech, such as vocal labeling of others and turn-taking.
Studying their vocal communication offers a unique opportunity to link it with
brain activity-especially given the difficulty of accessing the human brain in
speech and language research. Since Marmosets communicate primarily through
vocalizations, applying standard LLM approaches is not straightforward. We
introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized
spoken language model pipeline for Marmoset vocal communication. We designed
novel zero-shot evaluation metrics using unsupervised in-the-wild data,
alongside weakly labeled conversational data, to assess GmSLM and demonstrate
its advantage over a basic human-speech-based baseline. GmSLM generated
vocalizations closely matched real resynthesized samples acoustically and
performed well on downstream tasks. Despite being fully unsupervised, GmSLM
effectively distinguish real from artificial conversations and may support
further investigations of the neural basis of vocal communication and provides
a practical framework linking vocalization and brain activity. We believe GmSLM
stands to benefit future work in neuroscience, bioacoustics, and evolutionary
biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
♻ ☆ LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder EMNLP 2025
Large language models (LLMs) demonstrate exceptional performance on tasks
requiring complex linguistic abilities, such as reference disambiguation and
metaphor recognition/generation. Although LLMs possess impressive capabilities,
their internal mechanisms for processing and representing linguistic knowledge
remain largely opaque. Prior research on linguistic mechanisms is limited by
coarse granularity, limited analysis scale, and narrow focus. In this study, we
propose LinguaLens, a systematic and comprehensive framework for analyzing the
linguistic mechanisms of large language models, based on Sparse Auto-Encoders
(SAEs). We extract a broad set of Chinese and English linguistic features
across four dimensions (morphology, syntax, semantics, and pragmatics). By
employing counterfactual methods, we construct a large-scale counterfactual
dataset of linguistic features for mechanism analysis. Our findings reveal
intrinsic representations of linguistic knowledge in LLMs, uncover patterns of
cross-layer and cross-lingual distribution, and demonstrate the potential to
control model outputs. This work provides a systematic suite of resources and
methods for studying linguistic mechanisms, offers strong evidence that LLMs
possess genuine linguistic knowledge, and lays the foundation for more
interpretable and controllable language modeling in future research.
comment: Accepted by EMNLP 2025 MainConference
♻ ☆ What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric analysis
There is a strong correlation between linguistics and artificial intelligence
(AI), best manifested by deep learning language models. This study provides a
thorough scientometric analysis of this correlation, synthesizing the
intellectual production over 51 years, from 1974 to 2024. Web of Science Core
Collection (WoSCC) database was the data source. The data collected were
analyzed by two powerful software, viz., CiteSpace and VOSviewer, through which
mapping visualizations of the intellectual landscape, trending issues and
(re)emerging hotspots were generated. The results indicate that in the 1980s
and 1990s, linguistics and AI (AIL) research was not robust, characterized by
unstable publication over time. It has, however, witnessed a remarkable
increase in publications since then, reaching 1478 articles in 2023 and 546
articles in the January-March period of 2024, involving emerging issues including
Natural language processing, Cross-sectional study, Using bidirectional encoder
representation, and Using ChatGPT and hotspots such as Novice programmer,
Prioritization, and Artificial intelligence, addressing new horizons, new
topics, and launching new applications and powerful deep learning language
models including ChatGPT. It concludes that linguistics and AI correlation is
established at several levels, research centers, journals, and countries
shaping AIL knowledge production and reshaping its future frontiers.
comment: 26 pages, 15 figures
♻ ☆ GATEAU: Selecting Influential Samples for Long Context Alignment EMNLP 2025
Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Aligning large language models to handle instructions with extremely long
contexts has yet to be fully investigated. Previous studies have attempted to
scale up the available data volume by synthesizing long instruction-following
samples, as constructing such a dataset tends to be challenging for annotators.
However, a lack of a well-defined strategy for ensuring data quality may
introduce low-quality samples and restrict the model's performance. Thus, we
propose GATEAU, a novel framework to address the unique challenge of long
context alignment by identifying the influential samples enriched with
long-range dependency relations. Specifically, GATEAU measures the long-range
dependencies from two essential aspects: the difficulty of generating target
responses due to the long-range dependencies, and the difficulty of
understanding long inputs due to such dependencies. Comprehensive experiments
indicate that GATEAU effectively identifies influential samples, and the model
trained on these selected samples exhibits better instruction-following and
long-context understanding capabilities.
comment: EMNLP 2025
♻ ☆ Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance EMNLP 2025
Large language models (LLMs) have shown promise in table Question Answering
(Table QA). However, extending these capabilities to multi-table QA remains
challenging due to unreliable schema linking across complex tables. Existing
methods based on semantic similarity work well only on simplified hand-crafted
datasets and struggle to handle complex, real-world scenarios with numerous and
diverse columns. To address this, we propose a graph-based framework that
leverages human-curated relational knowledge to explicitly encode schema links
and join paths. Given a natural language query, our method searches the graph to
construct interpretable reasoning chains, aided by pruning and sub-path merging
strategies to enhance efficiency and coherence. Experiments on both standard
benchmarks and a realistic, large-scale dataset demonstrate the effectiveness
of our approach. To our knowledge, this is the first multi-table QA system
applied to truly complex industrial tabular data.
comment: Accepted to EMNLP 2025 findings
♻ ☆ Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts
Multimodal knowledge graph completion (MMKGC) aims to predict missing links
in multimodal knowledge graphs (MMKGs) by leveraging information from various
modalities alongside structural data. Existing MMKGC approaches primarily
extend traditional knowledge graph embedding (KGE) models, which often require
creating an embedding for every entity. This results in large model sizes and
inefficiencies in integrating multimodal information, particularly for
real-world graphs. Meanwhile, Transformer-based models have demonstrated
competitive performance in knowledge graph completion (KGC). However, their
focus on single-modal knowledge limits their capacity to utilize cross-modal
information. Recently, Large vision-language models (VLMs) have shown potential
in cross-modal tasks but are constrained by the high cost of training. In this
work, we propose a novel approach that integrates Transformer-based KGE models
with cross-modal context generated by pre-trained VLMs, thereby extending their
applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
relevant visual information from entities and their neighbors into textual
sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
model with the generated cross-modal context. This simple yet effective method
significantly reduces model size compared to traditional KGE approaches while
achieving competitive performance across multiple large-scale datasets with
minimal hyperparameter tuning.
♻ ☆ Low-rank variational dropout: Uncertainty and rank selection in adapters
Parameter-efficient fine-tuning (PEFT) methods such as LoRA adapt large
language models by inserting low-rank adapters, but they leave open two key
questions: how to give the adapted model calibrated uncertainty, and how to
choose the adapter rank. Existing approaches to uncertainty are typically
post-hoc, while rank selection is manual and task-specific. BayesLoRA revisits
variational dropout in the LoRA setting and shows that the natural unit of
stochasticity is not individual weights but entire ranks of the adapter. By
placing rank-wise variational distributions over adapter components, BayesLoRA
defines a posterior that (i) yields calibrated predictions through adapter-only
Monte Carlo sampling and (ii) prunes redundant ranks automatically via an
ARD-style KL term. Theoretical analysis shows that this rank-parameterized
posterior localizes uncertainty to the adapted subspace and explains
amplification under distribution shift. Empirically, BayesLoRA improves
calibration while at the same time producing lighter, faster adapters, removing
the need to tune ranks by hand. This dual role of uncertainty estimation and
uncertainty-driven pruning suggests BayesLoRA may offer a practical default for
reliable and efficient PEFT.
comment: 5 pages, 2 figures
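A toy re-creation of rank-wise stochasticity in a LoRA adapter, dropping entire ranks rather than individual weights; the variational (ARD-style) treatment of the dropout rates is omitted, so this is only a sketch of the idea, not the BayesLoRA implementation.

```python
import torch
import torch.nn as nn

class RankwiseDropoutLoRA(nn.Module):
    """Sketch: stochasticity placed on whole adapter ranks, not weights."""
    def __init__(self, d_in, d_out, rank=8, p_drop=0.1):
        super().__init__()
        # B is initialized non-zero here (unlike standard LoRA) so the toy
        # demo below produces a visible predictive spread.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.02)
        self.B = nn.Parameter(torch.randn(d_out, rank) * 0.02)
        self.p_drop = p_drop
        self.rank = rank

    def forward(self, x, sample=True):
        z = x @ self.A.T                               # (batch, rank)
        if sample:
            keep = (torch.rand(self.rank, device=x.device) > self.p_drop).float()
            z = z * keep / (1.0 - self.p_drop)         # drop whole ranks
        return z @ self.B.T                            # (batch, d_out)

# Monte Carlo sampling over adapter ranks gives a cheap predictive spread.
adapter = RankwiseDropoutLoRA(d_in=32, d_out=32)
x = torch.randn(4, 32)
samples = torch.stack([adapter(x) for _ in range(16)])
print(samples.std(dim=0).mean())   # a rough per-output uncertainty signal
```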
♻ ☆ Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Ambiguous words are often found in modern digital communications. Lexical
ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due
to limited data. Consequently, the efficiency of translation, information
retrieval, and question-answering systems is hindered by these limitations.
This study investigates the use of Large Language Models (LLMs) to improve WSD
using a novel approach combining a systematic prompt augmentation mechanism
with a knowledge base (KB) consisting of different sense interpretations. The
proposed method incorporates a human-in-the-loop approach for prompt augmentation,
where the prompt is supported by Part-of-Speech (POS) tagging, synonyms of
ambiguous words, aspect-based sense filtering and few-shot prompting to guide
the LLM. By utilizing a few-shot Chain of Thought (CoT) prompting-based
approach, this work demonstrates a substantial improvement in performance. The
evaluation was conducted using FEWS test data and sense tags. This research
advances accurate word interpretation in social media and digital
communication.
comment: 12 pages,6 tables, 1 figure, Proceedings of the 1st International
Conference on NLP & AI for Cyber Security
♻ ☆ SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Qinbo Sun, Weimin Qi, Kentaro Inui, Dezhen Song
Efficient path planning in robotics, particularly within large-scale, complex
environments, remains a significant hurdle. While Large Language Models (LLMs)
offer strong reasoning capabilities, their high computational cost and limited
adaptability hinder real-time deployment on edge devices. We present SmallPlan
- a novel framework leveraging LLMs as teacher models to train lightweight
Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan,
the SLMs provide optimal action sequences to navigate across scene graphs that
compactly represent full-scaled 3D scenes. The SLMs are trained in a
simulation-powered, interleaved manner with LLM-guided supervised fine-tuning
(SFT) and reinforcement learning (RL). This strategy not only enables SLMs to
successfully complete navigation tasks but also makes them aware of important
factors like distance travel, providing more efficient path planning. Through
experiments, we demonstrate that the fine-tuned SLMs perform competitively with
larger models like GPT-4o on sequential path planning, without suffering from
hallucination and overfitting. SmallPlan is resource-efficient, making it
well-suited for edge-device deployment and advancing practical autonomous
robotics. Our source code is available here:
https://github.com/quangpham2006/SmallPlan
comment: Paper is under review
♻ ☆ LML: A Novel Lexicon for the Moral Foundation of Liberty
The moral value of liberty is a central concept in our inference system when
it comes to taking a stance towards controversial social issues such as vaccine
hesitancy, climate change, or the right to abortion. Here, we propose a novel
Liberty lexicon evaluated on more than 3,000 manually annotated data both in
in- and out-of-domain scenarios. As a result of this evaluation, we produce a
combined lexicon that constitutes the main outcome of this work. This final
lexicon incorporates information from an ensemble of lexicons that have been
generated using word embedding similarity (WE) and compositional semantics
(CS). Our key contributions include enriching the liberty annotations,
developing a robust liberty lexicon for broader application, and revealing the
complexity of expressions related to liberty across different platforms.
Through the evaluation, we show that the difficulty of the task calls for
designing approaches that combine knowledge, in an effort of improving the
representations of learning systems.
comment: Published in the 11th International Conference on Machine Learning,
Optimization, and Data Science
♻ ☆ Lean Formalization of Generalization Error Bound by Rademacher Complexity
We formalize the generalization error bound using the Rademacher complexity
for the Lean 4 theorem prover based on the probability theory in the Mathlib 4
library. Generalization error quantifies the gap between a learning machine's
performance on given training data versus unseen test data, and the Rademacher
complexity is a powerful tool to upper-bound the generalization error of a
variety of modern learning problems. Previous studies have only formalized
extremely simple cases such as bounds by parameter counts and analyses for very
simple models (decision stumps). Formalizing the Rademacher complexity bound,
also known as the uniform law of large numbers, requires substantial
development and is achieved for the first time in this study. In the course of
development, we formalize the Rademacher complexity and its unique arguments
such as symmetrization, and clarify the topological assumptions on hypothesis
classes under which the bound holds. As an application, we also present the
formalization of generalization error bound for $L^2$-regularization models.
comment: major update
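For orientation, the kind of statement being formalized is the standard uniform bound below, written informally in LaTeX; the Lean development's precise measurability and boundedness hypotheses may differ.

```latex
% Standard Rademacher generalization bound (informal statement); the Lean
% formalization's exact hypotheses on the hypothesis class may differ.
\[
  \mathbb{E}\!\left[f(Z)\right]
  \;\le\;
  \frac{1}{n}\sum_{i=1}^{n} f(Z_i)
  \;+\; 2\,\mathfrak{R}_n(\mathcal{F})
  \;+\; \sqrt{\frac{\log(1/\delta)}{2n}},
  \qquad \forall f \in \mathcal{F},
\]
% with probability at least $1-\delta$ over an i.i.d. sample $Z_1,\dots,Z_n$,
% assuming each $f$ takes values in $[0,1]$, where $\mathfrak{R}_n(\mathcal{F})$
% is the Rademacher complexity of the hypothesis class $\mathcal{F}$.
```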
♻ ☆ LLM as a Broken Telephone: Iterative Generation Distorts Information ACL 2025
As large language models are increasingly responsible for online content,
concerns arise about the impact of repeatedly processing their own outputs.
Inspired by the "broken telephone" effect in chained human communication, this
study investigates whether LLMs similarly distort information through iterative
generation. Through translation-based experiments, we find that distortion
accumulates over time, influenced by language choice and chain complexity.
While degradation is inevitable, it can be mitigated through strategic
prompting techniques. These findings contribute to discussions on the long-term
effects of AI-mediated information propagation, raising important questions
about the reliability of LLM-generated content in iterative workflows.
comment: Accepted to ACL 2025, Main Conference
♻ ☆ Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu
The growing emotional stress in modern society has increased the demand for
Emotional Support Conversations (ESC). While Large Language Models (LLMs) show
promise for ESC, they face two key challenges: (1) low strategy selection
accuracy, and (2) preference bias, limiting their adaptability to emotional
needs of users. Existing supervised fine-tuning (SFT) struggles to address
these issues, as it rigidly trains models on single gold-standard responses
without modeling nuanced strategy trade-offs. To overcome these limitations, we
propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes
strategy selection preferences at each dialogue turn. We first leverage Monte
Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with
turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both
strategy accuracy and bias mitigation, enabling LLMs to generate more
empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B,
Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT,
highlighting the efficacy of fine-grained, turn-level preference modeling in
ESC.
comment: 21 pages, 9 figures, 17 tables
♻ ☆ UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
Large Language Models (LLMs) have shown remarkable capabilities through two
complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances
knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR),
which optimizes complex reasoning abilities. However, these two capabilities
are often developed in isolation, and existing efforts to unify them remain
narrow in scope -- typically limited to open-domain QA with fixed retrieval
settings and task-specific constraints. This lack of integration constrains
generalization and limits the applicability of RAG-RL methods to broader
domains. To bridge this gap, we propose UR$^2$ (Unified RAG and Reasoning), a
general framework that unifies retrieval and reasoning through reinforcement
learning. UR$^2$ introduces two key contributions: a difficulty-aware curriculum
training that selectively invokes retrieval only for challenging problems, and
a hybrid knowledge access strategy combining domain-specific offline corpora
with LLM-generated summaries. These components are designed to enable dynamic
coordination between retrieval and reasoning, improving adaptability across a
diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical,
and mathematical reasoning tasks demonstrate that UR$^2$ (built on
Qwen-2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL
methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on
several benchmarks. We have released all code, models, and data at
https://github.com/Tsinghua-dhy/UR2.
♻ ☆ One Goal, Many Challenges: Robust Preference Optimization Amid Content-Aware and Multi-Source Noise
Large Language Models (LLMs) have made significant strides in generating
human-like responses, largely due to preference alignment techniques. However,
these methods often assume unbiased human feedback, which is rarely the case in
real-world scenarios. This paper introduces Content-Aware Noise-Resilient
Preference Optimization (CNRPO), a novel framework that addresses multiple
sources of content-dependent noise in preference learning. CNRPO employs a
multi-objective optimization approach to separate true preferences from
content-aware noises, effectively mitigating their impact. We leverage backdoor
attack mechanisms to efficiently learn and control various noise sources within
a single model. Theoretical analysis and extensive experiments on different
synthetic noisy datasets demonstrate that CNRPO significantly improves
alignment with primary human preferences while controlling for secondary noises
and biases, such as response length and harmfulness.
♻ ☆ DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs EMNLP
Minxuan Lv, Zhenpeng Su, Leiyu Pan, Yizhe Xiong, Zijia Lin, Hui Chen, Wei Zhou, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Songlin Hu
As large language models continue to scale, computational costs and resource
consumption have emerged as significant challenges. While existing
sparsification methods like pruning reduce computational overhead, they risk
losing model knowledge through parameter removal. This paper proposes DSMoE
(Dynamic Sparse Mixture-of-Experts), a novel approach that achieves
sparsification by partitioning pre-trained FFN layers into computational
blocks. We implement adaptive expert routing using sigmoid activation and
straight-through estimators, enabling tokens to flexibly access different
aspects of model knowledge based on input complexity. Additionally, we
introduce a sparsity loss term to balance performance and computational
efficiency. Extensive experiments on LLaMA models demonstrate that under
equivalent computational constraints, DSMoE achieves superior performance
compared to existing pruning and MoE approaches across language modeling and
downstream tasks, particularly excelling in generation tasks. Analysis reveals
that DSMoE learns distinctive layerwise activation patterns, providing new
insights for future MoE architecture design.
comment: Accepted by EMNLP main conference
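An illustrative sketch of the routing mechanism described above: a pre-trained FFN is split into expert blocks, and a sigmoid gate with a straight-through estimator selects blocks per token. The sparsity loss and the exact partitioning scheme are omitted; sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class BlockPartitionedFFN(nn.Module):
    """Sketch of the DSMoE idea: split an FFN into expert blocks and route
    tokens with a sigmoid gate plus straight-through estimator.
    Illustrative only; not the paper's implementation.
    """
    def __init__(self, d_model=64, d_ff=256, num_blocks=4):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.router = nn.Linear(d_model, num_blocks)
        self.num_blocks = num_blocks
        self.block_size = d_ff // num_blocks

    def forward(self, x):
        gate = torch.sigmoid(self.router(x))           # (batch, blocks)
        hard = (gate > 0.5).float()
        # Straight-through estimator: forward uses the hard mask,
        # backward uses the gradient of the soft sigmoid gate.
        mask = hard + gate - gate.detach()
        h = torch.relu(self.up(x))                      # (batch, d_ff)
        h = h.view(x.size(0), self.num_blocks, self.block_size)
        h = h * mask.unsqueeze(-1)                      # zero out skipped blocks
        return self.down(h.reshape(x.size(0), -1))

layer = BlockPartitionedFFN()
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```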
♻ ☆ Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks
Transformer-based models, especially large language models (LLMs), dominate the
field of NLP with their mass adoption in tasks such as text generation,
summarization, and fake news detection. These models offer ease of deployment
and reliability for most applications; however, they require significant
amounts of computational power for training as well as inference. This poses
challenges for their adoption in resource-constrained applications, especially in
the open-source community where compute availability is usually scarce. This
work proposes a graph-based approach for Environmental Claim Detection,
exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks
(HGNNs) as lightweight yet effective alternatives to transformer-based models.
Re-framing the task as a graph classification problem, we transform claim
sentences into dependency parsing graphs, utilizing a combination of word2vec
\& learnable part-of-speech (POS) tag embeddings for the node features and
encoding syntactic dependencies in the edge relations. Our results show that
our graph-based models, particularly HGNNs in the poincar\'e space (P-HGNNs),
achieve performance superior to the state-of-the-art on environmental claim
detection while using up to \textbf{30x fewer parameters}. We also demonstrate
that HGNNs benefit vastly from explicitly modeling data in hierarchical
(tree-like) structures, enabling them to significantly improve over their
euclidean counterparts.
♻ ☆ Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation EMNLP 2025
Large vision-language models (LVLMs) have achieved remarkable performance on
multimodal tasks. However, they still suffer from hallucinations, generating
text inconsistent with visual input, posing significant risks in real-world
applications. Existing approaches to address this issue focus on incorporating
external knowledge bases, alignment training, or decoding strategies, all of
which require substantial computational cost and time. Recent works try to
explore more efficient alternatives by adjusting LVLMs' internal
representations. Although promising, these methods may cause hallucinations to
be insufficiently suppressed or lead to excessive interventions that negatively
affect normal semantics. In this work, we leverage sparse autoencoders (SAEs)
to identify semantic directions closely associated with faithfulness or
hallucination, extracting more precise and disentangled hallucination-related
representations. Our analysis demonstrates that interventions along the
identified faithful direction can mitigate hallucinations, while those along
the hallucinatory direction can exacerbate them. Building on these insights, we
propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method
based on SAE-derived latent directions to mitigate hallucinations in LVLMs.
Extensive experiments demonstrate that SSL significantly outperforms existing
decoding approaches in mitigating hallucinations, while maintaining
transferability across different model architectures with negligible additional
time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
comment: Accepted to Findings of EMNLP 2025
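A minimal sketch of the steering idea, under the assumption that the faithful direction is one decoder column of a trained sparse autoencoder; the decoder here is random and the latent index and scaling factor are placeholders rather than values from the paper.

import torch

d_model, d_sae = 64, 256
torch.manual_seed(0)

# Stand-in SAE decoder; in practice this would come from a trained autoencoder.
decoder = torch.randn(d_sae, d_model)
faithful_idx = 17                              # hypothetical latent index
direction = decoder[faithful_idx]
direction = direction / direction.norm()

def steer(hidden, alpha=4.0):
    """Shift hidden states along the faithful direction (plug-and-play)."""
    return hidden + alpha * direction

hidden = torch.randn(1, 10, d_model)           # (batch, seq, d_model)
print(steer(hidden).shape)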
♻ ☆ CM-Align: Consistency-based Multilingual Alignment for Large Language Models EMNLP 2025
Current large language models (LLMs) generally show a significant performance
gap in alignment between English and other languages. To bridge this gap,
existing research typically leverages the model's responses in English as a
reference to select the best/worst responses in other languages, which are then
used for Direct Preference Optimization (DPO) training. However, we argue that
current methods have two limitations that result in noisy multilingual
preference data and, in turn, limited alignment performance: 1) not all English
responses are of high quality, and using a low-quality response may mislead the
alignment for other languages. 2) Current methods
usually use biased or heuristic approaches to construct multilingual preference
pairs. To address these limitations, we design a consistency-based data
selection method to construct high-quality multilingual preference data for
improving multilingual alignment (CM-Align). Specifically, our method includes
two parts: consistency-guided English reference selection and cross-lingual
consistency-based multilingual preference data construction. Experimental
results on three LLMs and three common tasks demonstrate the effectiveness and
superiority of our method, which further indicates the necessity of
constructing high-quality preference data.
comment: EMNLP 2025 Findings
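The selection logic might look roughly like the sketch below, which uses toy hashed embeddings in place of a real multilingual encoder: the English reference is the sample that agrees most with the other English samples, and target-language responses are ranked by similarity to it to form a preference pair. The scoring and pairing details are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def embed(texts):
    # Toy deterministic "embedding" via character counts; placeholder only.
    vecs = torch.zeros(len(texts), 64)
    for i, t in enumerate(texts):
        for j, ch in enumerate(t.lower()):
            vecs[i, (ord(ch) + j) % 64] += 1.0
    return F.normalize(vecs, dim=-1)

english = ["The capital is Paris.", "Paris is the capital.", "It might be Lyon."]
target = ["La capitale est Paris.", "C'est Lyon.", "Paris est la capitale."]

e = embed(english)
consistency = (e @ e.T).mean(dim=1)            # agreement with other EN samples
ref = english[int(consistency.argmax())]

t = embed(target)
scores = (t @ embed([ref]).T).squeeze(-1)      # cross-lingual consistency to reference
chosen, rejected = target[int(scores.argmax())], target[int(scores.argmin())]
print(ref, "|", chosen, "|", rejected)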
♻ ☆ CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks EMNLP 2025
Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs)
solve difficult problems, but very long traces often slow or even degrade
performance on fast, intuitive "System-1" tasks. We introduce Connector-Aware
Compact CoT (CAC-CoT) -- a method that deliberately restricts reasoning to a
small, fixed set of connector phrases, steering the model toward concise and
well-structured explanations. Despite its simplicity, our synthesis approach
with general-purpose LLMs yields high-quality training data. CAC-CoT achieves
approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while also
achieving approximately 85% on S1-Bench (System-1), surpassing the baseline by
over 20%. Its reasoning traces average approximately 300 tokens (ART), about
one-third the length of baseline traces, delivering higher
efficiency without loss of accuracy.
comment: Accepted at EMNLP 2025 findings
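One way to picture the connector constraint is as a compliance filter over synthesized traces, as in the sketch below: a trace is kept only if every discourse connector it uses comes from a small fixed set. The allowed and known connector lists are hypothetical, and the actual method constrains generation rather than only filtering afterwards.

import re

ALLOWED = {"first", "then", "so", "therefore", "finally"}
KNOWN_CONNECTORS = ALLOWED | {"however", "moreover", "on the other hand", "meanwhile"}

def connector_compliant(trace: str) -> bool:
    # Which known connectors appear in the trace?
    found = {c for c in KNOWN_CONNECTORS
             if re.search(rf"\b{re.escape(c)}\b", trace.lower())}
    # Compliant only if all of them belong to the allowed compact set.
    return found <= ALLOWED

print(connector_compliant("First add 2 and 3, then multiply by 4, so the answer is 20."))
print(connector_compliant("However, on the other hand, one might argue otherwise."))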
♻ ☆ AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering EMNLP2025
Hassan Alhuzali, Walid Al-Eisawi, Muhammad Abdul-Mageed, Chaimae Abouzahir, Mouath Abu-Daoud, Ashwag Alasmari, Renad Al-Monef, Ali Alqahtani, Lama Ayash, Leen Kharouf, Farah E. Shamout, Nizar Habash
We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question
Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with
EMNLP 2025). This shared task addresses the paucity of high-quality Arabic
medical QA resources by offering two complementary tracks: MentalQA, focusing
on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and
MedArabiQ, covering broader medical domains such as internal medicine,
pediatrics, and clinical decision making. Each track comprises multiple
subtasks, evaluation datasets, and standardized metrics, facilitating fair
benchmarking. The task was structured to promote modeling under realistic,
multilingual, and culturally nuanced healthcare contexts. We outline the
dataset creation, task design and evaluation framework, participation
statistics, baseline systems, and summarize the overall outcomes. We conclude
with reflections on the performance trends observed and prospects for future
iterations in Arabic health QA.
comment: ArabicNLP2025-colocated with EMNLP2025
♻ ☆ Multilingual Collaborative Defense for Large Language Models
The robustness and security of large language models (LLMs) have become a
prominent research area. One notable vulnerability is the ability to bypass LLM
safeguards by translating harmful queries into rare or underrepresented
languages, a simple yet effective method of "jailbreaking" these models.
Despite the growing concern, there has been limited research addressing the
safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to
enhance multilingual safety. In this work, we investigate the correlation
between various attack features across different languages and propose
Multilingual Collaborative Defense (MCD), a novel learning method that
optimizes a continuous, soft safety prompt automatically to facilitate
multilingual safeguarding of LLMs. The MCD approach offers three advantages:
First, it effectively improves safeguarding performance across multiple
languages. Second, MCD maintains strong generalization capabilities while
minimizing false refusal rates. Third, MCD mitigates the language safety
misalignment caused by imbalances in LLM training corpora. To evaluate the
effectiveness of MCD, we manually construct multilingual versions of commonly
used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess
various safeguarding methods. Additionally, we introduce these datasets in
underrepresented (zero-shot) languages to verify the language transferability
of MCD. The results demonstrate that MCD outperforms existing approaches in
safeguarding against multilingual jailbreak attempts while also exhibiting
strong language transfer capabilities. Our code is available at
https://github.com/HLiang-Lee/MCD.
comment: 21 pages, 4 figures
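A minimal sketch of optimizing one shared continuous safety prompt over batches drawn from several languages, in the spirit of a soft prompt prepended to every input; the tiny GRU stand-in for a frozen LLM, the refuse/comply head, and the fake "multilingual" batches are assumptions made only to keep the example runnable.

import torch
import torch.nn as nn

d_model, prompt_len, vocab = 32, 8, 100
model = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for a frozen LLM
embed = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, 2)                          # refuse / comply
for p in list(model.parameters()) + list(embed.parameters()):
    p.requires_grad_(False)

soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

def batch(lang_seed):
    # Fake per-language batches; a real setup would use multilingual safety data.
    g = torch.Generator().manual_seed(lang_seed)
    x = torch.randint(0, vocab, (4, 12), generator=g)
    y = torch.randint(0, 2, (4,), generator=g)         # 1 = should refuse
    return x, y

for step in range(20):
    loss = 0.0
    for lang_seed in (0, 1, 2):                        # e.g. batches from three languages
        x, y = batch(lang_seed)
        inp = torch.cat([soft_prompt.expand(x.size(0), -1, -1), embed(x)], dim=1)
        out, _ = model(inp)
        loss = loss + nn.functional.cross_entropy(head(out[:, -1]), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))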
♻ ☆ Hallucinated Span Detection with Multi-View Attention Features
This study addresses the problem of hallucinated span detection in the
outputs of large language models. It has received less attention than
output-level hallucination detection despite its practical importance. Prior
work has shown that attention often exhibits irregular patterns when
hallucinations occur. Motivated by these findings, we extract features from the
attention matrix that provide complementary views capturing (a) whether certain
tokens are influential or ignored, (b) whether attention is biased toward
specific subsets, and (c) whether a token is generated by referring to a narrow
or broad context. These features are fed to a Transformer-based classifier that
performs sequence labelling to identify hallucinated spans. Experimental
results indicate that the proposed method
outperforms strong baselines on hallucinated span detection with longer input
contexts, such as data-to-text and summarisation tasks.
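The feature extraction could be sketched as below, with one illustrative statistic per view: attention received (influential vs. ignored), attention entropy (bias toward specific subsets), and the expected distance attended to (narrow vs. broad context). The exact features and the linear probe standing in for the Transformer tagger are assumptions.

import torch

torch.manual_seed(0)
T = 12
attn = torch.softmax(torch.randn(T, T), dim=-1)       # rows: query token, cols: keys

received = attn.sum(dim=0)                                        # (a) attention received
entropy = -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1)        # (b) concentration
positions = torch.arange(T, dtype=torch.float)
breadth = (attn * (positions - positions[:, None]).abs()).sum(dim=-1)  # (c) context span

features = torch.stack([received, entropy, breadth], dim=-1)      # (T, 3) per-token views
# These features would feed a Transformer tagger; a linear probe stands in here.
probe = torch.nn.Linear(3, 2)                                     # hallucinated vs. faithful
print(probe(features).shape)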
♻ ☆ GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View
Multimodal reasoning is a process of understanding, integrating and inferring
information across different data modalities. It has recently attracted surging
academic attention as a benchmark for Artificial Intelligence (AI). Although
there are various tasks for evaluating multimodal reasoning ability, they still
have limitations. Reasoning over hierarchical visual clues at different levels
of granularity, e.g., local details and global context, has received little
attention, despite its frequent involvement in real scenarios. To bridge the
gap, we introduce a novel and challenging task for multimodal reasoning, namely
GeoGuess. Given a street view image, the task is to identify its location and
provide a detailed explanation. A system that succeeds in GeoGuess should be
able to detect tiny visual clues, perceive the broader landscape, and associate
with vast geographic knowledge. Therefore, GeoGuess would require the ability
to reason between hierarchical visual information and geographic knowledge. In
this work, we establish a benchmark for GeoGuess by introducing a specially
curated dataset GeoExplain which consists of
panoramas-geocoordinates-explanation tuples. Additionally, we present a
multimodal and multilevel reasoning method, namely SightSense which can make
prediction and generate comprehensive explanation based on hierarchy of visual
information and external knowledge. Our analysis and experiments demonstrate
their outstanding performance in GeoGuess.
comment: Updated version
♻ ☆ Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment
Prompt injection attacks, where an attacker injects a prompt into the original
one to make a Large Language Model (LLM) follow the injected prompt and perform
an attacker-chosen task, represent a critical security threat. Existing attacks
primarily focus on crafting these injections at inference time, treating the
LLM itself as a static target. Our experiments show that these attacks achieve
some success, but there is still significant room for improvement. In this
work, we introduce a more foundational attack vector:
poisoning the LLM's alignment process to amplify the success of future prompt
injection attacks. Specifically, we propose PoisonedAlign, a method that
strategically creates poisoned alignment samples to poison an LLM's alignment
dataset. Our experiments across five LLMs and two alignment datasets show that
when even a small fraction of the alignment data is poisoned, the resulting
model becomes substantially more vulnerable to a wide range of prompt injection
attacks. Crucially, this vulnerability is instilled while the LLM's performance
on standard capability benchmarks remains largely unchanged, making the
manipulation difficult to detect through automated, general-purpose performance
evaluations. The code for implementing the attack is available at
https://github.com/Sadcardation/PoisonedAlign.
♻ ☆ Too Helpful, Too Harmless, Too Honest or Just Right? EMNLP'25
Large Language Models (LLMs) exhibit strong performance across a wide range
of NLP tasks, yet aligning their outputs with the principles of Helpfulness,
Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing
methods often optimize for individual alignment dimensions in isolation,
leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE)
architectures offer modularity, they suffer from poorly calibrated routing,
limiting their effectiveness in alignment tasks. We propose TrinityX, a modular
alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE)
within the Transformer architecture. TrinityX leverages separately trained
experts for each HHH dimension, integrating their outputs through a calibrated,
task-adaptive routing mechanism that combines expert signals into a unified,
alignment-aware representation. Extensive experiments on three standard
alignment benchmarks-Alpaca (Helpfulness), BeaverTails (Harmlessness), and
TruthfulQA (Honesty)-demonstrate that TrinityX outperforms strong baselines,
achieving relative improvements of 32.5% in win rate, 33.9% in safety score,
and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and
inference latency by over 40% compared to prior MoE-based approaches. Ablation
studies highlight the importance of calibrated routing, and cross-model
evaluations confirm TrinityX's generalization across diverse LLM backbones.
comment: EMNLP'25 Main
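A minimal sketch of calibrated routing over three alignment experts, where a learned temperature rescales the router logits before the softmax and the expert outputs are mixed into one alignment-aware representation; the expert shapes and the calibration scheme shown are assumptions, not the paper's exact design.

import torch
import torch.nn as nn

d = 64
experts = nn.ModuleDict({
    "helpful": nn.Linear(d, d),
    "harmless": nn.Linear(d, d),
    "honest": nn.Linear(d, d),
})
router = nn.Linear(d, len(experts))
temperature = nn.Parameter(torch.tensor(1.5))   # learned calibration temperature

def route(h):
    # Temperature-scaled routing weights over the three HHH experts.
    w = torch.softmax(router(h) / temperature, dim=-1)            # (batch, 3)
    outs = torch.stack([experts[k](h) for k in experts], dim=1)   # (batch, 3, d)
    return (w.unsqueeze(-1) * outs).sum(dim=1)                    # alignment-aware mix

h = torch.randn(2, d)
print(route(h).shape)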
♻ ☆ Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Wenchao Yang, Yitong Yang, Jialing Tao, Hui Xue
Large language models (LLMs) typically deploy safety mechanisms to prevent
harmful content generation. Most current approaches focus narrowly on risks
posed by malicious actors, often framing risks as adversarial events and
relying on defensive refusals. However, in real-world settings, risks also come
from non-malicious users seeking help while under psychological distress (e.g.,
self-harm intentions). In such cases, the model's response can strongly
influence the user's next actions. Simple refusals may lead them to repeat,
escalate, or move to unsafe platforms, creating worse outcomes. We introduce
Constructive Safety Alignment (CSA), a human-centric paradigm that protects
against malicious misuse while actively guiding vulnerable users toward safe
and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic
anticipation of user reactions, fine-grained risk boundary discovery, and
interpretable reasoning control, turning safety into a trust-building process.
Oy1 achieves state-of-the-art safety among open models while retaining high
general capabilities. On our Constructive Benchmark, it shows strong
constructive engagement, close to GPT-5, and unmatched robustness on the
Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from
refusal-first to guidance-first safety, CSA redefines the model-user
relationship, aiming for systems that are not just safe, but meaningfully
helpful. We release Oy1, code, and the benchmark to support responsible,
user-centered AI.
comment: Technical Report. Code & model weights available:
https://github.com/Alibaba-AAIG/Oyster
♻ ☆ Towards Reliable and Interpretable Document Question Answering via VLMs
Vision-Language Models (VLMs) have shown strong capabilities in document
understanding, particularly in identifying and extracting textual information
from complex documents. Despite this, accurately localizing answers within
documents remains a major challenge, limiting both interpretability and
real-world applicability. To address this, we introduce DocExplainerV0, a
plug-and-play bounding-box prediction module that decouples answer generation
from spatial localization. This design makes it applicable to existing VLMs,
including proprietary systems where fine-tuning is not feasible. Through
systematic evaluation, we provide quantitative insights into the gap between
textual accuracy and spatial grounding, showing that correct answers often lack
reliable localization. Our standardized framework highlights these shortcomings
and establishes a benchmark for future research toward more interpretable and
robust document information extraction VLMs.
♻ ☆ MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation EMNLP2025
Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu
Large Language Models (LLMs), e.g., ChatGPT, have been widely adopted
in real-world dialogue applications. However, LLMs' robustness, especially in
handling long, complex dialogue sessions that involve frequent motivation
transfer and sophisticated cross-turn dependencies, has long been criticized.
Nevertheless, no existing benchmark fully reflects these weaknesses. We present
MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark,
designed to remedy the gap. MARS-Bench is constructed from play-by-play text
commentary so as to feature realistic dialogues specifically designed to
evaluate three critical
aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn,
and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that
closed-source LLMs significantly outperform open-source alternatives, that
explicit reasoning significantly boosts LLMs' robustness in handling long,
complex dialogue sessions, and that LLMs indeed face significant challenges
with motivation transfer and sophisticated cross-turn dependencies. Moreover,
based on attention visualization experiments with Qwen2.5-7B-Instruct, we
provide a mechanistic interpretation of how attention sinks caused by special
tokens lead to performance degradation when LLMs handle long, complex dialogue
sessions.
comment: 29 pages, 13 figures, Accepted as EMNLP2025 Findings
♻ ☆ HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation
The advancement of Large Language Models (LLMs) enables flexible and
interpretable automatic evaluations. In the field of machine translation
evaluation, utilizing LLMs with translation error annotations based on
Multidimensional Quality Metrics (MQM) yields more human-aligned judgments.
However, current LLM-based evaluation methods still face challenges in
accurately identifying error spans and assessing their severity. In this paper,
we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation
Evaluation. We argue that existing approaches inadequately exploit the
fine-grained structural and semantic information within the MQM hierarchy. To
address this, we develop a hierarchical multi-agent system grounded in the MQM
error typology, enabling granular evaluation of subtype errors. Two key
strategies are incorporated to further mitigate systemic hallucinations within
the framework: the utilization of the model's self-reflection capability and
the facilitation of agent discussion involving asymmetric information.
Empirically, HiMATE outperforms competitive baselines across different datasets
in conducting human-aligned evaluations. Further analyses underscore its
significant advantage in error span detection and severity assessment,
achieving an average F1-score improvement of 89% over the best-performing
baseline. We make our code and data publicly available at
https://github.com/nlp2ct-shijie/HiMATE.
♻ ☆ LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models EMNLP 2025
Large language models (LLMs) have achieved remarkable multi-step reasoning
capabilities across various domains. However, LLMs still face distinct
challenges in complex logical reasoning, as (1) proof-finding requires
systematic exploration and the maintenance of logical coherence and (2)
searching for the right combination of premises at each reasoning step is
inherently challenging in tasks with large premise space. To address this, we
propose LogicTree, an inference-time modular framework employing
algorithm-guided search to automate structured proof exploration and ensure
logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate
caching mechanism into LogicTree to enable effective utilization of historical
knowledge, preventing reasoning stagnation and minimizing redundancy.
Furthermore, we address the combinatorial complexity of premise search by
decomposing it into a linear process. The refined premise selection restricts
subsequent inference to at most one derivation per step, enhancing reasoning
granularity and enforcing strict step-by-step reasoning. Additionally, we
introduce two LLM-free heuristics for premise prioritization, enabling
strategic proof search. Experimental results on five datasets demonstrate that
LogicTree optimally scales inference-time computation to achieve higher proof
accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6%
and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o
outperforms o3-mini by 7.6% on average.
comment: EMNLP 2025 Main Conference
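The search discipline can be pictured with the toy forward-chaining sketch below: at most one derivation per step, a cache of everything already proven so nothing is re-derived, and a simple LLM-free goal-overlap heuristic for premise prioritization. The rule format and the heuristic are illustrative assumptions, not the paper's exact algorithm.

# Toy Horn-clause proof search with caching and one derivation per step.
facts = {"rainy", "have_umbrella"}
rules = [  # (premises, conclusion)
    ({"rainy"}, "wet_ground"),
    ({"rainy", "have_umbrella"}, "stay_dry"),
    ({"wet_ground"}, "slippery"),
]
goal = "slippery"

def overlap(conclusion, goal):
    # LLM-free heuristic: shared tokens between a conclusion and the goal.
    return len(set(conclusion.split("_")) & set(goal.split("_")))

derived = set(facts)          # cache of everything proven so far
proof = []
while goal not in derived:
    applicable = [(p, c) for p, c in rules if p <= derived and c not in derived]
    if not applicable:
        break
    # One derivation per step, ordered by the goal-overlap heuristic.
    premises, conclusion = max(applicable, key=lambda pc: overlap(pc[1], goal))
    derived.add(conclusion)
    proof.append((sorted(premises), conclusion))

print(proof)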
♻ ☆ Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Scaling laws predict that the performance of large language models improves
with increasing model size and data size. In practice, pre-training has relied
on massive web crawls, using almost all data sources publicly available
on the internet so far. However, this pool of natural data does not grow at the
same rate as the compute supply. Furthermore, the availability of high-quality
texts is even more limited: data filtering pipelines often remove up to 99% of
the initial web scrapes to achieve state-of-the-art performance. To address the "data wall"
of pre-training scaling, our work explores ways to transform and recycle data
discarded in existing filtering processes. We propose REWIRE, REcycling the Web
with guIded REwrite, a method to enrich low-quality documents so that they
could become useful for training. This in turn allows us to increase the
representation of synthetic data in the final pre-training set. Experiments at
1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw
texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5
percentage points, respectively, across 22 diverse tasks, compared to training on only
filtered web data. Training on the raw-synthetic data mix is also more
effective than having access to 2x web data. Through further analysis, we
demonstrate that about 82% of the mixed-in texts come from transforming
lower-quality documents that would otherwise be discarded. REWIRE also
outperforms related approaches to generating synthetic data, including
Wikipedia-style paraphrasing, question-answer synthesis and knowledge
extraction. These results suggest that recycling web texts holds the potential
for being a simple and effective approach for scaling pre-training data. We
make our high-quality synthetic data publicly available at
https://huggingface.co/datasets/facebook/recycling_the_web.
comment: Accepted to COLM 2025
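A rough sketch of the recycling loop, with the rewriting model stubbed out; the prompt wording and the mixing ratio are chosen only for illustration and are not the paper's settings.

import random

GUIDED_PROMPT = (
    "Rewrite the following web page into clear, informative prose, "
    "keeping every fact and removing boilerplate:\n\n{doc}"
)

def rewrite_with_llm(doc: str) -> str:
    # Placeholder: a real pipeline would send GUIDED_PROMPT.format(doc=doc)
    # to a general-purpose LLM and return its rewrite.
    return doc.strip().capitalize()

filtered_kept = ["High-quality article about solar panels."]
filtered_dropped = [
    "click HERE!! solar panel cheap buy now best price",
    "menu | home | contact",
]

# Rewrite documents that the filter would otherwise discard.
synthetic = [rewrite_with_llm(d) for d in filtered_dropped]

# Mix rewritten texts back into the filtered pool at an assumed ratio.
mix_ratio = 0.5                       # fraction of synthetic text in the final pool
k = int(len(filtered_kept) * mix_ratio / (1 - mix_ratio))
random.seed(0)
pretraining_pool = filtered_kept + random.sample(synthetic, min(k, len(synthetic)))
print(pretraining_pool)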
♻ ☆ Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology
Understanding the uncertainty in large language model (LLM) explanations is
important for evaluating their faithfulness and reasoning consistency, and thus
provides insights into the reliability of an LLM's output for a given question.
In this work, we propose a novel framework that quantifies uncertainty in LLM
explanations from a reasoning-topology perspective. By designing a structural
elicitation strategy, we guide the LLM to frame the explanation of an answer as
a graph topology. This process decomposes the explanation into knowledge-related
sub-questions and topology-based reasoning structures, which allows us to
quantify uncertainty not only at the semantic level but also along the
reasoning path. It also makes it convenient to assess knowledge redundancy and
provides interpretable insights into the reasoning process. Our
method offers a systematic way to interpret the LLM reasoning, analyze
limitations, and provide guidance for enhancing robustness and faithfulness.
This work pioneers the use of graph-structured uncertainty measurement in LLM
explanations and demonstrates the potential of topology-based quantification.
comment: 28 pages, 9 figures; accepted at COLM'25
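One illustrative proxy (not the paper's exact measure) is to sample several elicited reasoning graphs for the same answer and treat disagreement between their edge sets as a sign of an unstable reasoning topology, as sketched below; the edge-set Jaccard agreement is my own stand-in metric.

from itertools import combinations

# Each sampled explanation is a set of (sub-question -> sub-question) edges.
samples = [
    {("q1", "q2"), ("q2", "answer")},
    {("q1", "q2"), ("q2", "answer")},
    {("q1", "q3"), ("q3", "answer")},
]

def edge_jaccard(a, b):
    # Overlap between two reasoning graphs, measured on their edge sets.
    return len(a & b) / len(a | b)

pairs = list(combinations(samples, 2))
agreement = sum(edge_jaccard(a, b) for a, b in pairs) / len(pairs)
uncertainty = 1.0 - agreement
print(round(uncertainty, 3))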