Computation and Language
☆ Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts
The rapid development of autoregressive Large Language Models (LLMs) has
significantly improved the quality of generated texts, necessitating reliable
machine-generated text detectors. A large number of detectors and collections
containing AI-generated fragments have emerged, and several detection methods
have even reported recognition quality of up to 99.9% on the target metrics of such
collections. However, the quality of such detectors tends to drop dramatically
in the wild, posing a question: Are detectors actually highly trustworthy or do
their high benchmark scores come from the poor quality of evaluation datasets?
In this paper, we emphasise the need for robust, high-quality methods for
evaluating generated data, in order to guard against bias and the poor
generalisation ability of future models. We present a systematic review of datasets from
competitions dedicated to AI-generated content detection and propose methods
for evaluating the quality of datasets containing AI-generated fragments. In
addition, we discuss the possibility of using high-quality generated data to
achieve two goals: improving the training of detection models and improving the
training datasets themselves. Our contribution aims to facilitate a better
understanding of the dynamics between human and machine text, which will
ultimately support the integrity of information in an increasingly automated
world.
☆ SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment
Existing preference alignment is a one-size-fits-all mechanism, in which the
part of the large language model's (LLM's) parametric knowledge with
non-preferred features is uniformly blocked for all users. However, this
knowledge can be useful to advanced users whose expertise qualifies them to
handle it. The one-size-fits-all mechanism undermines the LLM's utility for
these qualified users. To address this problem, we
propose SudoLM, a framework that lets LLMs learn access control over specific
parametric knowledge for users with different credentials via authorization
alignment. SudoLM allows authorized users to unlock their access to all the
parametric knowledge with an assigned SUDO key while blocking access to
non-qualified users. Experiments on two application scenarios demonstrate that
SudoLM effectively controls the user's access to the parametric knowledge and
maintains its general utility.
☆ Enhancing Large Language Models' Situated Faithfulness to External Contexts
Large Language Models (LLMs) are often augmented with external information as
contexts, but this external information can sometimes be inaccurate or even
intentionally misleading. We argue that robust LLMs should demonstrate situated
faithfulness, dynamically calibrating their trust in external information based
on their confidence in the internal knowledge and the external context. To
benchmark this capability, we evaluate LLMs across several QA datasets,
including a newly created dataset called RedditQA featuring in-the-wild
incorrect contexts sourced from Reddit posts. We show that when provided with
both correct and incorrect contexts, both open-source and proprietary models
tend to overly rely on external information, regardless of its factual
accuracy. To enhance situated faithfulness, we propose two approaches:
Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning
(RCR). SCR enables models to self-assess the confidence of external information
relative to their own internal knowledge to produce the most accurate answer.
RCR, in contrast, extracts explicit confidence signals from the LLM and
determines the final answer using predefined rules. Our results show that for
LLMs with strong reasoning capabilities, such as GPT-4o and GPT-4o mini, SCR
outperforms RCR, achieving improvements of up to 24.2% over a direct input
augmentation baseline. Conversely, for a smaller model like Llama-3-8B, RCR
outperforms SCR. Fine-tuning SCR with our proposed Confidence Reasoning Direct
Preference Optimization (CR-DPO) method improves performance on both seen and
unseen datasets, yielding an average improvement of 8.9% on Llama-3-8B. In
addition to quantitative results, we offer insights into the relative strengths
of SCR and RCR. Our findings highlight promising avenues for improving situated
faithfulness in LLMs. The data and code are released.
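As a hedged illustration of the RCR idea described above (elicit explicit confidence signals, then apply a predefined rule), the sketch below is an assumed reading of the abstract: the prompts, the 0-1 confidence scale, and the comparison rule are all illustrative, not the paper's exact recipe.

```python
# Hypothetical sketch of Rule-Based Confidence Reasoning (RCR). The paper's
# prompts and rules are not given in the abstract, so elicit_confidence()
# and the decision rule below are illustrative assumptions.

def elicit_confidence(llm, question: str, answer: str) -> float:
    """Ask the model to rate its confidence in an answer on a 0-1 scale."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "On a scale from 0 to 1, how confident are you that this answer "
        "is correct? Reply with a single number."
    )
    return float(llm(prompt).strip())

def rcr_answer(llm, question: str, context: str) -> str:
    # Answer once from internal knowledge and once from the external context.
    internal = llm(f"Answer from your own knowledge.\nQuestion: {question}")
    contextual = llm(f"Context: {context}\nQuestion: {question}")
    # Predefined rule: trust the external context only when the model reports
    # more confidence in the context-based answer than in its own.
    c_int = elicit_confidence(llm, question, internal)
    c_ctx = elicit_confidence(llm, question, contextual)
    return contextual if c_ctx > c_int else internal
```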
☆ NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NeurIPS 24
Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
Vision-language models (VLMs) have made significant progress in recent
visual-question-answering (VQA) benchmarks that evaluate complex
visio-linguistic reasoning. However, are these models truly effective? In this
work, we show that VLMs still struggle with natural images and questions that
humans can easily answer, which we term natural adversarial samples. We also
find it surprisingly easy to generate these VQA samples from natural image-text
corpora using off-the-shelf models like CLIP and ChatGPT. We propose a
semi-automated approach to collect a new benchmark, NaturalBench, for reliably
evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a
$\textbf{vision-centric}$ design by pairing each question with two images that
yield different answers, preventing blind solutions from answering without
using the images. This makes NaturalBench more challenging than previous
benchmarks that can be solved with commonsense priors. We evaluate 53
state-of-the-art VLMs on NaturalBench, showing that models like
LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o
lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is
hard from two angles: (1) Compositionality: Solving NaturalBench requires
diverse visio-linguistic skills, including understanding attribute bindings,
object relationships, and advanced reasoning like logic and counting. To this
end, unlike prior work that uses a single tag per sample, we tag each
NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2)
Biases: NaturalBench exposes severe biases in VLMs, as models often choose the
same answer regardless of the image. Lastly, we apply our benchmark curation
method to diverse data sources, including long captions (over 100 words) and
non-English languages like Chinese and Hindi, highlighting its potential for
dynamic evaluations of VLMs.
comment: Accepted to NeurIPS 24; We open-source our dataset at:
https://huggingface.co/datasets/BaiqiL/NaturalBench; Project page at:
https://linzhiqiu.github.io/papers/naturalbench/
☆ MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Xiongtao Zhou, Jie He, Lanyu Chen, Jingyu Li, Haojing Chen, Victor Gutierrez Basulto, Jeff Z. Pan, Hanjie Chen
Multimodal Chain of Thought (MCoT) is a popular prompting strategy for
improving the performance of multimodal large language models (MLLMs) across a
range of complex reasoning tasks. Despite its popularity, there is a notable
absence of automated methods for evaluating the quality of reasoning steps in
MCoT. To address this gap, we propose Multimodal Chain-of-Thought Evaluation
(MiCEval), a framework designed to assess the correctness of reasoning chains
by evaluating the quality of both the description and each reasoning step. The
evaluation of the description component focuses on the accuracy of the image
descriptions, while the reasoning step evaluates the quality of each step as it
is conditionally generated based on the preceding steps. MiCEval is built upon
a fine-grained dataset with annotations that rate each step according to
correctness, relevance, and informativeness. Extensive experiments on four
state-of-the-art MLLMs show that step-wise evaluations using MiCEval align more
closely with human judgments compared to existing methods based on cosine
similarity or fine-tuning approaches. MiCEval datasets and code are available at
https://github.com/alenai97/MiCEval.
comment: 40 pages
☆ DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph
Summarizing movie screenplays presents a unique set of challenges compared to
standard document summarization. Screenplays are not only lengthy, but also
feature a complex interplay of characters, dialogues, and scenes, with numerous
direct and subtle relationships and contextual nuances that are difficult for
machine learning models to accurately capture and comprehend. Recent attempts
at screenplay summarization focus on fine-tuning transformer-based pre-trained
models, but these models often fall short in capturing long-term dependencies
and latent relationships, and frequently encounter the "lost in the middle"
issue. To address these challenges, we introduce DiscoGraMS, a novel resource
that represents movie scripts as a movie character-aware discourse graph (CaD
Graph). This approach is well-suited for various downstream tasks, such as
summarization, question-answering, and salience detection. The model aims to
preserve all salient information, offering a more comprehensive and faithful
representation of the screenplay's content. We further explore a baseline
method that combines the CaD Graph with the corresponding movie script through
a late fusion of graph and text modalities, and we present promising initial
results.
☆ Real-time Fake News from Adversarial Feedback
We show that existing evaluations for fake news detection based on
conventional sources, such as claims on fact-checking websites, result in an
increasing accuracy over time for LLM-based detectors -- even after their
knowledge cutoffs. This suggests that recent popular political claims, which
form the majority of fake news on such sources, are easily classified using
surface-level shallow patterns. Instead, we argue that a proper fake news
detection dataset should test a model's ability to reason factually about the
current world by retrieving and reading related evidence. To this end, we
develop a novel pipeline that leverages natural language feedback from a
RAG-based detector to iteratively modify real-time news into deceptive fake
news that challenges LLMs. Our iterative rewrite decreases the binary
classification AUC by an absolute 17.5 percent for a strong RAG GPT-4o
detector. Our experiments reveal the important role of RAG in both detecting
and generating fake news, as retrieval-free LLM detectors are vulnerable to
unseen events and adversarial attacks, while feedback from RAG detection helps
discover more deceitful patterns in fake news.
☆ Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs
Runchu Tian, Yanghao Li, Yuepeng Fu, Siyang Deng, Qinyu Luo, Cheng Qian, Shuo Wang, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Huadong Wang, Xiaojiang Liu
Positional bias in large language models (LLMs) hinders their ability to
effectively process long inputs. A prominent example is the "lost in the
middle" phenomenon, where LLMs struggle to utilize relevant information
situated in the middle of the input. While prior research primarily focuses on
single pieces of relevant information, real-world applications often involve
multiple relevant information pieces. To bridge this gap, we present
LongPiBench, a benchmark designed to assess positional bias involving multiple
pieces of relevant information. Thorough experiments are conducted with five
commercial and six open-source models. These experiments reveal that while most
current models are robust against the "lost in the middle" issue, there exist
significant biases related to the spacing of relevant information pieces. These
findings highlight the importance of evaluating and reducing positional biases
to advance LLMs' capabilities.
comment: work in progress
☆ GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings
Training-free embedding methods directly leverage pretrained large language
models (LLMs) to embed text, bypassing the costly and complex procedure of
contrastive learning. Previous training-free embedding methods have mainly
focused on optimizing embedding prompts and have overlooked the benefits of
utilizing the generative abilities of LLMs. We propose a novel method, GenEOL,
which uses LLMs to generate diverse transformations of a sentence that preserve
its meaning, and aggregates the resulting embeddings of these transformations
to enhance the overall sentence embedding. GenEOL significantly outperforms the
existing training-free embedding methods by an average of 2.85 points across
several LLMs on the sentence semantic textual similarity (STS) benchmark. Our
analysis shows that GenEOL stabilizes representation quality across LLM layers
and is robust to perturbations of embedding prompts. GenEOL also achieves
notable gains on multiple clustering, reranking and pair-classification tasks
from the MTEB benchmark.
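A minimal sketch of the aggregation step just described, assuming hypothetical generate_rewrites() and embed() helpers in place of the paper's actual prompts and embedding extraction:

```python
import numpy as np

# Minimal sketch of the GenEOL recipe as the abstract describes it: generate
# meaning-preserving rewrites of a sentence with an LLM, embed each rewrite,
# and average the embeddings. generate_rewrites() and embed() are assumed
# stand-ins, not the paper's interface.

def geneol_embedding(sentence: str, generate_rewrites, embed, n: int = 8):
    variants = [sentence] + generate_rewrites(sentence, n=n)
    vectors = np.stack([embed(v) for v in variants])
    # Averaging over transformations is what the abstract credits with
    # stabilizing quality across layers and prompt perturbations.
    return vectors.mean(axis=0)
```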
☆ Diverging Preferences: When do Annotators Disagree and do Models Know?
Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin
We examine diverging preferences in human-labeled preference datasets. We
develop a taxonomy of disagreement sources spanning 10 categories across four
high-level classes -- task underspecification, response style, refusals, and
annotation errors. We find that the majority of disagreements are at odds
with standard reward modeling approaches, which are designed with the
assumption that annotator disagreement is noise. We then explore how these
findings impact two areas of LLM development: reward modeling and evaluation.
In our experiments, we demonstrate how standard reward modeling methods, like
the Bradley-Terry model, fail to differentiate whether a given preference
judgment is the result of unanimous agreement among annotators or the majority
opinion among diverging user preferences. We find that these tendencies are
echoed by popular LLM-as-Judge evaluation methods, which consistently
identify a winning response in cases of diverging preferences. These findings
highlight remaining challenges in LLM evaluations, which are greatly influenced
by divisive features like response style, and in developing pluralistically
aligned LLMs. To address these issues, we develop methods for identifying
diverging preferences to mitigate their influence on evaluation and training.
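For reference, the Bradley-Terry objective mentioned above is the textbook formulation below; it scores a preference pair only through a scalar reward gap, so a majority-vote label erases whether annotators were unanimous or split, which is the failure mode the abstract describes.

```latex
% Bradley-Terry preference model used in standard reward modeling:
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
```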
☆ CELI: Controller-Embedded Language Model Interactions
Jan-Samuel Wagner, Dave DeCaprio, Abishek Chiffon Muthu Raja, Jonathan M. Holman, Lauren K. Brady, Sky C. Cheung, Hosein Barzekar, Eric Yang, Mark Anthony Martinez II, David Soong, Sriram Sridhar, Han Si, Brandon W. Higgs, Hisham Hamadeh, Scott Ogden
We introduce Controller-Embedded Language Model Interactions (CELI), a
framework that integrates control logic directly within language model (LM)
prompts, facilitating complex, multi-stage task execution. CELI addresses
limitations of existing prompt engineering and workflow optimization techniques
by embedding control logic directly within the operational context of language
models, enabling dynamic adaptation to evolving task requirements. Our
framework transfers control from the traditional programming execution
environment to the LMs, allowing them to autonomously manage computational
workflows while maintaining seamless interaction with external systems and
functions. CELI supports arbitrary function calls with variable arguments,
bridging the gap between LMs' adaptive reasoning capabilities and conventional
software paradigms' structured control mechanisms. To evaluate CELI's
versatility and effectiveness, we conducted case studies in two distinct
domains: code generation (HumanEval benchmark) and multi-stage content
generation (Wikipedia-style articles). The results demonstrate notable
performance improvements across a range of domains. CELI achieved a 4.9
percentage point improvement over the best reported score of the baseline GPT-4
model on the HumanEval code generation benchmark. In multi-stage content
generation, 94.4% of CELI-produced Wikipedia-style articles met or exceeded
first draft quality when optimally configured, with 44.4% achieving high
quality. These outcomes underscore CELI's potential for optimizing AI-driven
workflows across diverse computational domains.
comment: 26 pages, 2 figures
☆ You Shall Know a Tool by the Traces it Leaves: The Predictability of Sentiment Analysis Tools
If sentiment analysis tools were valid classifiers, one would expect them to
provide comparable results for sentiment classification on different kinds of
corpora and for different languages. In line with the results of previous
studies, we show that sentiment analysis tools disagree on the same dataset. Going
beyond previous studies we show that the sentiment tool used for sentiment
annotation can even be predicted from its outcome, revealing an algorithmic
bias of sentiment analysis. Based on Twitter, Wikipedia and different news
corpora from the English, German and French languages, our classifiers separate
sentiment tools with an averaged F1-score of 0.89 (for the English corpora). We
therefore warn against taking sentiment annotations at face value and argue for
the need for more, and more systematic, NLP evaluation studies.
☆ DiSCo Meets LLMs: A Unified Approach for Sparse Retrieval and Contextual Distillation in Conversational Search
Conversational Search (CS) is the task of retrieving relevant documents from
a corpus within a conversational context, combining retrieval with
conversational context modeling. With the explosion of Large Language Models
(LLMs), the CS field has seen major improvements with LLMs rewriting user
queries, accounting for conversational context. However, engaging LLMs at
inference time harms efficiency. Current methods address this by distilling
embeddings from human-rewritten queries to learn the context modeling task.
Yet, these approaches predominantly focus on context modeling, and only treat
the contrastive component of the retrieval task within a
distillation-independent loss term. To address these limitations, we propose a
new distillation method, as a relaxation of the previous objective, unifying
retrieval and context modeling. We relax the existing training objectives by
distilling similarity scores between conversations and documents, rather than
relying solely on representation learning. Our proposed distillation objective
allows for more freedom in the representation space and leverages the
contrastive nature of document relevance. Through experiments on Learned Sparse
Retrieval (LSR) across 5 CS datasets, our approach demonstrates substantial
improvements in both in-domain and out-of-domain retrieval performance,
outperforming state-of-the-art with gains of up to 6 points in recall for
out-of-domain datasets. Additionally, through the relaxation of the objective,
we propose a multi-teacher distillation, using multiple LLMs as teachers,
yielding additional gains, and outperforming the teachers themselves in
in-domain experiments. Finally, analysis of the sparsity of the models reveals
that our distillation allows for better control over the sparsity of the
trained models.
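A hedged sketch of the score-level distillation described above; the tensor shapes and the choice of MSE are assumptions, as the paper's relaxation may use a different divergence:

```python
import torch
import torch.nn.functional as F

# Sketch of the relaxed objective from the abstract: match teacher-produced
# conversation-document similarity scores rather than teacher embeddings,
# leaving the student's representation space free. Shapes and the MSE choice
# are assumptions, not the paper's exact loss.

def score_distillation_loss(student_conv_emb: torch.Tensor,  # [B, D]
                            student_doc_emb: torch.Tensor,   # [B, N, D]
                            teacher_scores: torch.Tensor     # [B, N]
                            ) -> torch.Tensor:
    # Student similarity between each conversation and its N candidate docs.
    student_scores = torch.einsum("bd,bnd->bn", student_conv_emb, student_doc_emb)
    return F.mse_loss(student_scores, teacher_scores)
```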
☆ Teaching Models to Balance Resisting and Accepting Persuasion
Large language models (LLMs) are susceptible to persuasion, which can pose
risks when models are faced with an adversarial interlocutor. We take a first
step towards defending models against persuasion while also arguing that
defense against adversarial (i.e. negative) persuasion is only half of the
equation: models should also be able to accept beneficial (i.e. positive)
persuasion to improve their answers. We show that optimizing models for only
one side results in poor performance on the other. In order to balance positive
and negative persuasion, we introduce Persuasion-Balanced Training (or PBT),
which leverages multi-agent recursive dialogue trees to create data and trains
models via preference optimization to accept persuasion when appropriate. PBT
consistently improves resistance to misinformation and resilience to being
challenged while also resulting in the best overall performance on holistic
data containing both positive and negative persuasion. Crucially, we show that
PBT models are better teammates in multi-agent debates. We find that without
PBT, pairs of stronger and weaker models have unstable performance, with the
order in which the models present their answers determining whether the team
obtains the stronger or weaker model's performance. PBT leads to better and
more stable results and less order dependence, with the stronger model
consistently pulling the weaker one up.
comment: Code: https://github.com/esteng/persuasion_balanced_training
☆ Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases
Recent advancements in tool-equipped Agents (LLMs) have enabled complex tasks
like secure database interactions and multi-agent code development. However,
scaling tool capacity beyond agent reasoning or model limits remains a
challenge. In this paper, we address these challenges by introducing Toolshed
Knowledge Bases, a tool knowledge base (vector database) designed to store
enhanced tool representations and optimize tool selection for large-scale
tool-equipped Agents. Additionally, we propose Advanced RAG-Tool Fusion, a
novel ensemble of tool-applied advanced retrieval-augmented generation (RAG)
techniques across the pre-retrieval, intra-retrieval, and post-retrieval
phases, without requiring model fine-tuning. During pre-retrieval, tool
documents are enhanced with key information and stored in the Toolshed
Knowledge Base. Intra-retrieval focuses on query planning and transformation to
increase retrieval accuracy. Post-retrieval refines the retrieved tool
documents and enables self-reflection. Furthermore, by varying both the total
number of tools (tool-M) an Agent has access to and the tool selection
threshold (top-k), we address trade-offs between retrieval accuracy, agent
performance, and token cost. Our approach achieves 46%, 56%, and 47% absolute
improvements on the ToolE single-tool, ToolE multi-tool and Seal-Tools
benchmark datasets, respectively (Recall@5).
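A minimal sketch of the tool-retrieval loop under stated assumptions: embed() is a hypothetical embedding function, and the enhancement format for tool documents is illustrative only.

```python
import numpy as np

# Minimal sketch of the tool-retrieval step: tool documents, enhanced with
# key information during pre-retrieval, live in a vector store, and the agent
# retrieves the top-k candidates for a (possibly transformed) query. embed()
# and the document format are illustrative assumptions.

class ToolshedKB:
    def __init__(self, embed):
        self.embed, self.docs, self.vecs = embed, [], []

    def add_tool(self, name: str, description: str, key_info: str):
        doc = f"{name}: {description}\nKey info: {key_info}"
        self.docs.append(doc)
        self.vecs.append(self.embed(doc))

    def retrieve(self, query: str, top_k: int = 5):
        q = self.embed(query)
        sims = np.array([q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8)
                         for v in self.vecs])
        # top_k is the tool-selection threshold the abstract trades off
        # against retrieval accuracy and token cost.
        return [self.docs[i] for i in sims.argsort()[::-1][:top_k]]
```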
☆ Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum
There is increasing interest in looking at dialects in NLP. However, most
work to date still treats dialects as discrete categories. For instance,
evaluative work in variation-oriented NLP for English often works with Indian
English or African-American Vernacular English as homogeneous categories (Faisal
et al., 2024; Ziems et al., 2023), yet even within one variety there is
substantial variation. We examine within-dialect variation and show that
performance critically varies within categories. We measure speech-to-text
performance on Italian dialects, and empirically observe a geographical
performance disparity. This disparity correlates substantially (-0.5) with
linguistic similarity to the highest performing dialect variety. We
cross-examine our results against dialectometry methods, and interpret the
performance disparity to be due to a bias towards dialects that are more
similar to the standard variety in the speech-to-text model examined. We
additionally leverage geostatistical methods to predict zero-shot performance
at unseen sites, and find the incorporation of geographical information to
substantially improve prediction performance, indicating there to be
geographical structure in the performance distribution.
☆ Do LLMs estimate uncertainty well in instruction-following?
Large language models (LLMs) could be valuable personal AI agents across
various domains, provided they can precisely follow user instructions. However,
recent studies have shown significant limitations in LLMs'
instruction-following capabilities, raising concerns about their reliability in
high-stakes applications. Accurately estimating LLMs' uncertainty in adhering
to instructions is critical to mitigating deployment risks. We present, to our
knowledge, the first systematic evaluation of the uncertainty estimation
abilities of LLMs in the context of instruction-following. Our study identifies
key challenges with existing instruction-following benchmarks, where multiple
factors are entangled with the uncertainty that stems from instruction-following,
complicating isolation and comparison across methods and models. To address
these issues, we introduce a controlled evaluation setup with two benchmark
versions of data, enabling a comprehensive comparison of uncertainty estimation
methods under various conditions. Our findings show that existing uncertainty
methods struggle, particularly when models make subtle errors in instruction
following. While internal model states provide some improvement, they remain
inadequate in more complex scenarios. The insights from our controlled
evaluation setups provide a crucial understanding of LLMs' limitations and
potential for uncertainty estimation in instruction-following tasks, paving the
way for more trustworthy AI agents.
☆ Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
Attention mechanisms have revolutionized several domains of artificial
intelligence, such as natural language processing and computer vision, by
enabling models to selectively focus on relevant parts of the input data. While
recent work has characterized the optimization dynamics of gradient descent
(GD) in attention-based models and the structural properties of its preferred
solutions, less is known about more general optimization algorithms such as
mirror descent (MD). In this paper, we investigate the convergence properties
and implicit biases of a family of MD algorithms tailored for softmax attention
mechanisms, with the potential function chosen as the $p$-th power of the
$\ell_p$-norm. Specifically, we show that these algorithms converge in
direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when
applied to a classification problem using a softmax attention model. Notably,
our theoretical results reveal that the convergence rate is comparable to that
of traditional GD in simpler models, despite the highly nonlinear and nonconvex
nature of the present problem. Additionally, we delve into the joint
optimization dynamics of the key-query matrix and the decoder, establishing
conditions under which this complex joint optimization converges to their
respective hard-margin SVM solutions. Lastly, our numerical experiments on real
data demonstrate that MD algorithms improve generalization over standard GD and
excel in optimal token selection.
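A sketch of one mirror-descent step for the potential family named above, psi(w) = ||w||_p^p; the softmax-attention model and loss are abstracted into a precomputed gradient, and the constant factor p in the mirror map is absorbed into the learning rate.

```python
import numpy as np

# One mirror-descent step with potential psi(w) = ||w||_p^p (componentwise
# mirror map from grad psi(w) = p * sign(w) * |w|^(p-1), constant absorbed
# into lr). The attention model and its loss are abstracted away into a
# precomputed grad_loss.

def mirror_descent_step(w: np.ndarray, grad_loss: np.ndarray,
                        lr: float, p: float) -> np.ndarray:
    theta = np.sign(w) * np.abs(w) ** (p - 1)        # map to the dual space
    theta = theta - lr * grad_loss                   # gradient step there
    return np.sign(theta) * np.abs(theta) ** (1.0 / (p - 1))  # map back

# p = 2 recovers plain gradient descent; other choices of p bias the
# direction of convergence toward the generalized l_p-margin solution.
```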
☆ Large Language Models Are Overparameterized Text Encoders
Large language models (LLMs) demonstrate strong performance as text embedding
models when finetuned with supervised contrastive training. However, their
large size balloons inference time and memory requirements. In this paper, we
show that by pruning the last $p\%$ layers of an LLM before supervised training
for only 1000 steps, we can achieve a proportional reduction in memory and
inference time. We evaluate four different state-of-the-art LLMs on text
embedding tasks and find that our method can prune up to 30% of layers with
negligible impact on performance and up to 80% with only a modest drop. With
only three lines of code, our method is easily implemented in any pipeline for
transforming LLMs to text encoders. We also propose $\text{L}^3 \text{Prune}$,
a novel layer-pruning strategy based on the model's initial loss that provides
two optimal pruning configurations: a large variant with negligible performance
loss and a small variant for resource-constrained settings. On average, the
large variant prunes 21% of the parameters with only a 0.3-point performance
drop, and the small variant suffers only a 5.1-point decrease while pruning 74% of the
model. We consider these results strong evidence that LLMs are
overparameterized for text embedding tasks, and can be easily pruned.
comment: 8 pages of content + 1 for limitations and ethical considerations, 14
pages in total including references and appendix, 5+1 figures
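A hedged sketch of the "prune the last p% of layers" recipe on a Hugging Face checkpoint; the attribute paths assume a Llama-style model, and both the brief supervised contrastive fine-tuning and the L^3Prune loss-based choice of p are omitted.

```python
from transformers import AutoModel

# Sketch only: assumes a Llama-style block list at model.layers; other
# architectures name the block list differently. The ~1000-step contrastive
# fine-tuning that follows pruning in the paper is not shown.

def prune_last_layers(model_name: str, p: float):
    model = AutoModel.from_pretrained(model_name)
    keep = int(len(model.layers) * (1.0 - p))
    # Slicing an nn.ModuleList returns an nn.ModuleList; memory and
    # inference time shrink roughly in proportion to the layers dropped.
    model.layers = model.layers[:keep]
    model.config.num_hidden_layers = keep
    return model

model = prune_last_layers("meta-llama/Meta-Llama-3-8B", p=0.3)
```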
☆ MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts NeurIPS 2024
Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled
scalability in deep learning. SMoE has the potential to exponentially increase
parameter count while maintaining the efficiency of the model by only
activating a small subset of these parameters for a given sample. However, it
has been observed that SMoE suffers from unstable training and has difficulty
adapting to new distributions, leading to the model's lack of robustness to
data contamination. To overcome these limitations, we first establish a
connection between the dynamics of the expert representations in SMoEs and
gradient descent on a multi-objective optimization problem. Leveraging our
framework, we then integrate momentum into SMoE and propose a new family of
SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate
that MomentumSMoE is more stable and robust than SMoE. In particular, we verify
the advantages of MomentumSMoE over SMoE on a variety of practical tasks
including ImageNet-1K object recognition and WikiText-103 language modeling. We
demonstrate the applicability of MomentumSMoE to many types of SMoE models,
including those in the Sparse MoE model for vision (V-MoE) and the Generalist
Language Model (GLaM). We also show that other advanced momentum-based
optimization methods, such as Adam, can be easily incorporated into the
MomentumSMoE framework for designing new SMoE models with even better
performance, almost negligible additional computation cost, and simple
implementations.
comment: 10 pages in the main text. Published at NeurIPS 2024. The code is
available at https://github.com/rachtsy/MomentumSMoE
☆ RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide
verifiable document-grounded responses to user inquiries. However, many natural
questions do not have good answers: about 25% contain false
assumptions (Yu et al., 2023), and over 50% are
ambiguous (Min et al., 2020). RAG agents need high-quality data to improve
their responses to confusing questions. This paper presents a novel synthetic
data generation method to efficiently create a diverse set of context-grounded
confusing questions from a given document corpus. We conduct an empirical
comparative evaluation of several large language models as RAG agents to
measure the accuracy of confusion detection and appropriate response
generation. We contribute a benchmark dataset to the public domain.
comment: under review
☆ Tell me what I need to know: Exploring LLM-based (Personalized) Abstractive Multi-Source Meeting Summarization
Meeting summarization is crucial in digital communication, but existing
solutions struggle with salience identification, needed to generate
personalized, workable summaries, and with context understanding, needed to
fully comprehend a meeting's content. Previous attempts to address these
issues by considering related supplementary resources (e.g., presentation
slides) alongside transcripts are hindered by models' limited context sizes
and by the additional complexities of multi-source tasks, such as identifying
relevant information in additional files and seamlessly aligning it with the
meeting content. This work explores multi-source meeting summarization considering
supplementary materials through a three-stage large language model approach:
identifying transcript passages needing additional context, inferring relevant
details from supplementary materials and inserting them into the transcript,
and generating a summary from this enriched transcript. Our multi-source
approach enhances model understanding, increasing summary relevance by ~9% and
producing more content-rich outputs. We introduce a personalization protocol
that extracts participant characteristics and tailors summaries accordingly,
improving informativeness by ~10%. This work further provides insights on
performance-cost trade-offs across four leading model families, including
edge-device capable options. Our approach can be extended to similar complex
generative tasks benefitting from additional resources and personalization,
such as dialogue systems and action planning.
☆ Do LLMs "know" internally when they follow instructions?
Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Shirley Ren, Udhay Nallasamy, Andy Miller, Kwan Ho Ryan Chan, Jaya Narain
Instruction-following is crucial for building AI agents with large language
models (LLMs), as these models must adhere strictly to user-provided
constraints and guidelines. However, LLMs often fail to follow even simple and
clear instructions. To improve instruction-following behavior and prevent
undesirable outputs, a deeper understanding of how LLMs' internal states relate
to these outcomes is required. Our analysis of LLM internal states reveals a
dimension in the input embedding space linked to successful
instruction-following. We demonstrate that modifying representations along this
dimension improves instruction-following success rates compared to random
changes, without compromising response quality. Further investigation reveals
that this dimension is more closely related to the phrasing of prompts than to
the inherent difficulty of the task or instructions. This discovery also
suggests explanations for why LLMs sometimes fail to follow clear instructions
and why prompt engineering is often effective, even when the content remains
largely unchanged. This work provides insight into the internal workings of
LLMs' instruction-following, paving the way for reliable LLM agents.
☆ SignAttention: On the Interpretability of Transformer Models for Sign Language Translation NeurIPS 2024
Pedro Alejandro Dal Bianco, Oscar Agustín Stanchi, Facundo Manuel Quiroga, Franco Ronchetti, Enzo Ferrante
This paper presents the first comprehensive interpretability analysis of a
Transformer-based Sign Language Translation (SLT) model, focusing on the
translation from video-based Greek Sign Language to glosses and text.
Leveraging the Greek Sign Language Dataset, we examine the attention mechanisms
within the model to understand how it processes and aligns visual input with
sequential glosses. Our analysis reveals that the model pays attention to
clusters of frames rather than individual ones, with a diagonal alignment
pattern emerging between poses and glosses, which becomes less distinct as the
number of glosses increases. We also explore the relative contributions of
cross-attention and self-attention at each decoding step, finding that the
model initially relies on video frames but shifts its focus to previously
predicted tokens as the translation progresses. This work contributes to a
deeper understanding of SLT models, paving the way for the development of more
transparent and reliable translation systems essential for real-world
applications.
comment: Accepted at IAI Workshop @ NeurIPS 2024
☆ Combining Entropy and Matrix Nuclear Norm for Enhanced Evaluation of Language Models
As large language models (LLMs) continue to advance, the need for precise and
efficient evaluation metrics becomes more pressing. Traditional approaches,
while informative, often face limitations in computational demands and
interpretability. In this paper, we introduce a novel hybrid evaluation method
that integrates two established techniques: entropy derived from covariance
matrices and the Matrix Nuclear Norm (MNN). Our method begins by normalizing
hidden states from LLMs, then computes the covariance matrix and MNN from these
representations. We further calculate the entropy of the covariance matrix to
capture uncertainty and redundancy in the model's outputs. By combining these
metrics into a composite score, we offer a comprehensive evaluation framework
that balances accuracy with computational efficiency. Additionally, our
approach allows for flexibility in adjusting the weightings between entropy and
MNN, tailoring the evaluation for different objectives. Through a series of
experiments on various LLMs, we demonstrate the robustness and efficacy of our
method, offering deeper insights into model performance. This work contributes
to the ongoing development of LLM evaluation and opens avenues for future
innovations in model assessment techniques.
comment: The method is currently under experimentation
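A sketch of the composite metric under stated assumptions: the normalization and the Gaussian (log-determinant) entropy estimator are illustrative choices the abstract does not pin down.

```python
import numpy as np

# Sketch of the hybrid metric: normalize hidden states, form the covariance,
# combine its entropy with the matrix nuclear norm (MNN) via a tunable
# weight. The normalization and Gaussian entropy estimator are assumptions.

def hybrid_score(hidden: np.ndarray, alpha: float = 0.5) -> float:
    """hidden: [num_tokens, dim] hidden states from one layer of an LLM."""
    h = (hidden - hidden.mean(0)) / (hidden.std(0) + 1e-8)
    cov = np.cov(h, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    entropy = 0.5 * np.sum(np.log(2 * np.pi * np.e * eig))  # Gaussian entropy
    mnn = np.linalg.norm(cov, ord="nuc")                    # nuclear norm
    # alpha realizes the adjustable entropy/MNN weighting from the abstract.
    return alpha * entropy + (1 - alpha) * mnn
```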
☆ A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
Recently, sharing the key-value (KV) cache across layers has been found
effective for efficient inference of large language models (LLMs). To systematically
investigate different techniques of cross-layer KV sharing, we propose a
unified framework that covers several recent methods and their novel variants.
We conduct comprehensive experiments on all the configurations of the
framework, evaluating their generation throughput and performance in language
modeling and downstream tasks. We find that when reducing the size of the KV
cache by 2x, most configurations can achieve competitive performance to and
higher throughput than standard transformers, but when further reducing the
size of the KV cache, pairing queries of all layers with KVs of upper layers
can better maintain performance, although it also introduces additional
training cost and prefilling latency. We hope that this work will help users
choose the appropriate approach according to their requirements and facilitate
research on the acceleration of LLM inference.
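An illustrative encoding of one configuration from such a framework; the dict-based interface is an assumption, not the paper's API. Here kv_source[i] = j means layer i's queries attend over the KV cache computed by layer j.

```python
# Illustrative only: a 12-layer model sharing each KV cache across groups of
# two layers, with every layer paired to the *upper* layer of its group, the
# arrangement the abstract finds best preserves quality under aggressive
# compression (at extra prefill cost). The dict encoding is an assumption.

num_layers, group = 12, 2
kv_source = {i: min((i // group) * group + group - 1, num_layers - 1)
             for i in range(num_layers)}

# Only the layers that appear as sources store a KV cache (~2x reduction).
print(sorted(set(kv_source.values())))  # [1, 3, 5, 7, 9, 11]
```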
☆ Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation
Parameter-efficient fine-tuning (PEFT) can bridge the gap between large
language models (LLMs) and downstream tasks. However, PEFT has been proven
vulnerable to malicious attacks. Research indicates that poisoned LLMs, even
after PEFT, retain the capability to activate internalized backdoors when input
samples contain predefined triggers. In this paper, we introduce a novel
weak-to-strong unlearning algorithm to defend against backdoor attacks based on
feature alignment knowledge distillation, named W2SDefense. Specifically, we
first train a small-scale language model through full-parameter fine-tuning to
serve as the clean teacher model. Then, this teacher model guides the
large-scale poisoned student model in unlearning the backdoor, leveraging PEFT.
Theoretical analysis suggests that W2SDefense has the potential to enhance the
student model's ability to unlearn backdoor features, preventing the activation
of the backdoor. We conduct experiments on text classification tasks involving
three state-of-the-art language models and three different backdoor attack
algorithms. Our empirical results demonstrate the outstanding performance of
W2SDefense in defending against backdoor attacks without compromising model
performance.
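A hedged sketch of the feature-alignment distillation signal described above; the projection head and MSE choice are assumptions rather than the paper's exact recipe, and in W2SDefense the student would be updated through PEFT (e.g., adapters) only.

```python
import torch
import torch.nn.functional as F

# Sketch: a small clean teacher guides the poisoned student to align its
# features on clean data, unlearning backdoor directions. The projection
# head and MSE objective are assumptions, not the paper's exact loss.

def alignment_loss(student_feats: torch.Tensor,  # [B, D_student]
                   teacher_feats: torch.Tensor,  # [B, D_teacher]
                   proj: torch.nn.Linear         # maps D_student -> D_teacher
                   ) -> torch.Tensor:
    target = teacher_feats.detach()  # the clean teacher stays frozen
    return F.mse_loss(proj(student_feats), target)
```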
☆ Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Previous interpretations of language models (LMs) miss important distinctions
in how these models process factual information. For example, given the query
"Astrid Lindgren was born in" with the corresponding completion "Sweden", no
distinction is made between a prediction based on exact knowledge of the
Swedish author's birthplace and one based on the assumption that a person
with a Swedish-sounding name was born in Sweden. In this paper, we investigate
four different prediction scenarios for which the LM can be expected to show
distinct behaviors. These scenarios correspond to different levels of model
reliability and types of information being processed - some being less
desirable for factual predictions. To facilitate precise interpretations of LMs
for fact completion, we propose a model-specific recipe called PrISM for
constructing datasets with examples of each scenario based on a set of
diagnostic criteria. We apply a popular interpretability method, causal tracing
(CT), to the four prediction scenarios and find that while CT produces
different results for each scenario, aggregations over a set of mixed examples
may only represent the results from the scenario with the strongest measured
signal. In summary, we contribute tools for a more granular study of fact
completion in language models and analyses that provide a more nuanced
understanding of how LMs process fact-related queries.
☆ SylloBio-NLI: Evaluating Large Language Models on Biomedical Syllogistic Reasoning
Syllogistic reasoning is crucial for Natural Language Inference (NLI). This
capability is particularly significant in specialized domains such as
biomedicine, where it can support automatic evidence interpretation and
scientific discovery. This paper presents SylloBio-NLI, a novel framework that
leverages external ontologies to systematically instantiate diverse syllogistic
arguments for biomedical NLI. We employ SylloBio-NLI to evaluate Large Language
Models (LLMs) on identifying valid conclusions and extracting supporting
evidence across 28 syllogistic schemes instantiated with human genome pathways.
Extensive experiments reveal that biomedical syllogistic reasoning is
particularly challenging for zero-shot LLMs, whose average accuracy ranges
from 70% on generalized modus ponens down to 23% on disjunctive syllogism. At
the same time, we found that few-shot prompting can boost the performance of
different LLMs, including Gemma (+14%) and Llama-3 (+43%). However, a deeper
analysis shows that both techniques exhibit high sensitivity to superficial
lexical variations, highlighting a dependency between reliability, models'
architecture, and pre-training regime. Overall, our results indicate that,
while in-context examples have the potential to elicit syllogistic reasoning in
LLMs, existing models are still far from achieving the robustness and
consistency required for safe biomedical NLI applications.
☆ Generative AI, Pragmatics, and Authenticity in Second Language Learning
There are obvious benefits to integrating generative AI (artificial
intelligence) into language learning and teaching. Those include using AI as a
language tutor, creating learning materials, or assessing learner output.
However, due to how AI systems understand human language, based on a
mathematical model using statistical probability, they lack the lived
experience to be able to use language with the same social awareness as
humans. Additionally, there are built-in linguistic and cultural biases based
on their training data, which is mostly in English and predominantly from
Western sources. Those facts limit AI suitability for some language learning
interactions. Studies have clearly shown that systems such as ChatGPT often do
not produce language that is pragmatically appropriate. The lack of linguistic
and cultural authenticity has important implications for how AI is integrated
into second language acquisition as well as into instruction targeting the
development of intercultural communication competence.
☆ Analyzing Context Utilization of LLMs in Document-Level Translation
Large language models (LLMs) are increasingly strong contenders in machine
translation. We study document-level translation, where some words cannot be
translated without context from outside the sentence. We investigate the
ability of prominent LLMs to utilize context by analyzing models' robustness to
perturbed and randomized document context. We find that LLMs' improved
document-translation performance is not always reflected in pronoun translation
performance. We highlight the need for context-aware finetuning of LLMs with a
focus on relevant parts of the context to improve their reliability for
document-level translation.
comment: 4 pages, 2 figures, 2 tables
☆ How Do Multilingual Models Remember? Investigating Multilingual Factual Recall Mechanisms
Large Language Models (LLMs) store and retrieve vast amounts of factual
knowledge acquired during pre-training. Prior research has localized and
identified mechanisms behind knowledge recall; however, it has primarily
focused on English monolingual models. The question of how these processes
generalize to other languages and multilingual LLMs remains unexplored. In this
paper, we address this gap by conducting a comprehensive analysis of two highly
multilingual LLMs. We assess the extent to which previously identified
components and mechanisms of factual recall in English apply to a multilingual
context. Then, we examine when language plays a role in the recall process,
uncovering evidence of language-independent and language-dependent mechanisms.
☆ Fine-Tuning Pre-trained Language Models for Robust Causal Representation Learning
The fine-tuning of pre-trained language models (PLMs) has been shown to be
effective across various domains. By using domain-specific supervised data, the
general-purpose representation derived from PLMs can be transformed into a
domain-specific representation. However, these methods often fail to generalize
to out-of-domain (OOD) data due to their reliance on non-causal
representations, often described as spurious features. Existing methods either
make use of adjustments with strong assumptions about lack of hidden common
causes, or mitigate the effect of spurious features using multi-domain data. In
this work, we investigate how fine-tuned pre-trained language models aid
generalizability from single-domain scenarios under mild assumptions, targeting
more general and practical real-world scenarios. We show that a robust
representation can be derived through a so-called causal front-door adjustment,
based on a decomposition assumption, using fine-tuned representations as a
source of data augmentation. Comprehensive experiments in both synthetic and
real-world settings demonstrate the superior generalizability of the proposed
method compared to existing approaches. Our work thus sheds light on the domain
generalization problem by introducing links between fine-tuning and causal
mechanisms into representation learning.
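For the reader, the standard front-door adjustment from Pearl's causal calculus is reproduced below; treating the fine-tuned representation as the mediator M between input X and label Y is our reading of the abstract's decomposition, not a detail it states.

```latex
% Front-door adjustment: if mediator M fully transmits the effect of X on Y
% and is shielded from the hidden confounder, then
P(y \mid \mathrm{do}(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```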
☆ Efficiently Computing Susceptibility to Context in Language Models
One strength of modern language models is their ability to incorporate
information from a user-input context when answering queries. However, they are
not equally sensitive to the subtle changes to that context. To quantify this,
Du et al. (2024) give an information-theoretic metric to measure such
sensitivity. Their metric, susceptibility, is defined as the degree to which
contexts can influence a model's response to a query at a distributional level.
However, exactly computing susceptibility is difficult, and thus Du et al.
(2024) fall back on a Monte Carlo approximation. Due to the large number of
samples required, the Monte Carlo approximation is inefficient in practice. As
a faster alternative, we propose Fisher susceptibility, an efficient method to
estimate the susceptibility based on Fisher information. Empirically, we
validate that Fisher susceptibility is comparable to Monte Carlo estimated
susceptibility across a diverse set of query domains despite its being
$70\times$ faster. Exploiting the improved efficiency, we apply Fisher
susceptibility to analyze factors affecting the susceptibility of language
models. We observe that larger models are as susceptible as smaller ones.
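To make the baseline concrete, here is a heavily hedged Monte Carlo sketch: reading "influence at a distributional level" as an average divergence between contextualized and context-free answer distributions is our illustrative assumption, not Du et al.'s exact definition, and the Fisher-information shortcut is not reproduced.

```python
import numpy as np

# Illustrative Monte Carlo estimator only: the mean KL between the model's
# contextualized and context-free answer distributions. This is an assumed
# reading of "influence at a distributional level", not Du et al.'s formula.

def mc_susceptibility(answer_dist, query: str, contexts: list) -> float:
    """answer_dist(query, context=None) -> probability vector over answers."""
    p0 = answer_dist(query)
    kls = []
    for c in contexts:  # the many samples needed here are why this is slow
        pc = answer_dist(query, context=c)
        kls.append(float(np.sum(pc * np.log((pc + 1e-12) / (p0 + 1e-12)))))
    return float(np.mean(kls))
```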
☆ Critical Questions Generation: Motivation and Challenges CoNLL 2024
The development of Large Language Models (LLMs) has brought impressive
performances on mitigation strategies against misinformation, such as
counterargument generation. However, LLMs are still seriously hindered by
outdated knowledge and by their tendency to generate hallucinated content. In
order to circumvent these issues, we propose a new task, namely, Critical
Questions Generation, consisting of processing an argumentative text to
generate the critical questions (CQs) raised by it. In argumentation theory CQs
are tools designed to lay bare the blind spots of an argument by pointing at
the information it could be missing. Thus, instead of trying to deploy LLMs to
produce knowledgeable and relevant counterarguments, we use them to question
arguments, without requiring any external knowledge. Research on CQ generation
using LLMs requires a reference dataset for large-scale experimentation. Thus,
in this work we investigate two complementary methods to create such a
resource: (i) instantiating CQ templates as defined by Walton's argumentation
theory, and (ii) using LLMs as CQ generators. By doing so, we contribute
a procedure to establish what is a valid CQ and conclude that, while LLMs are
reasonable CQ generators, they still have a wide margin for improvement in this
task.
comment: 14 pages, 3 figures, 7 tables, to be published in the 28th Conference
on Computational Natural Language Learning (CoNLL 2024)
☆ LoGU: Long-form Generation with Uncertainty Expressions
Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Sen Yang, Nigel Collier, Dong Yu, Deqing Yang
While Large Language Models (LLMs) demonstrate impressive capabilities, they
still struggle with generating factually incorrect content (i.e.,
hallucinations). A promising approach to mitigate this issue is enabling models
to express uncertainty when unsure. Previous research on uncertainty modeling
has primarily focused on short-form QA, but real-world applications often
require much longer responses. In this work, we introduce the task of Long-form
Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty
Suppression, where models hesitate to express uncertainty, and Uncertainty
Misalignment, where models convey uncertainty inaccurately. To tackle these
challenges, we propose a refinement-based data collection framework and a
two-stage training pipeline. Our framework adopts a divide-and-conquer
strategy, refining uncertainty based on atomic claims. The collected data are
then used in training through supervised fine-tuning (SFT) and direct
preference optimization (DPO) to enhance uncertainty expression. Extensive
experiments on three long-form instruction following datasets show that our
method significantly improves accuracy, reduces hallucinations, and maintains
the comprehensiveness of responses.
☆ SwaQuAD-24: QA Benchmark Dataset in Swahili
This paper proposes the creation of a Swahili Question Answering (QA)
benchmark dataset, aimed at addressing the underrepresentation of Swahili in
natural language processing (NLP). Drawing from established benchmarks like
SQuAD, GLUE, KenSwQuAD, and KLUE, the dataset will focus on providing
high-quality, annotated question-answer pairs that capture the linguistic
diversity and complexity of Swahili. The dataset is designed to support a
variety of applications, including machine translation, information retrieval,
and social services like healthcare chatbots. Ethical considerations, such as
data privacy, bias mitigation, and inclusivity, are central to the dataset
development. Additionally, the paper outlines future expansion plans to include
domain-specific content, multimodal integration, and broader crowdsourcing
efforts. The Swahili QA dataset aims to foster technological innovation in East
Africa and provide an essential resource for NLP research and applications in
low-resource languages.
☆ EcomEdit: An Automated E-commerce Knowledge Editing Framework for Enhanced Product and Purchase Intention Understanding
Knowledge Editing (KE) aims to correct and update factual information in
Large Language Models (LLMs) to ensure accuracy and relevance without
computationally expensive fine-tuning. Though it has been proven effective in
several domains, limited work has focused on its application within the
e-commerce sector. However, there are naturally occurring scenarios that make
KE necessary in this domain, such as the timely updating of product features
and of customers' trending purchase intentions, which warrants further
exploration. In this paper, we pioneer the application of KE in the e-commerce
domain by presenting ECOMEDIT, an automated e-commerce knowledge editing
framework tailored for e-commerce-related knowledge and tasks. Our framework
leverages more powerful LLMs as judges to enable automatic knowledge conflict
detection and incorporates conceptualization to enhance the semantic coverage
of the knowledge to be edited. Through extensive experiments, we demonstrate
the effectiveness of ECOMEDIT in improving LLMs' understanding of product
descriptions and purchase intentions. We also show that LLMs, after our
editing, can achieve stronger performance on downstream e-commerce tasks.
☆ REEF: Representation Encoding Fingerprints for Large Language Models
Protecting the intellectual property of open-source Large Language Models
(LLMs) is very important, because training LLMs consumes extensive
computational resources and data. Therefore, model owners and third parties
need to identify whether a suspect model is a subsequent development of a
victim model. To this end, we propose REEF, a training-free method that
identifies the relationship between
the suspect and victim models from the perspective of LLMs' feature
representations. Specifically, REEF computes and compares the centered kernel
alignment similarity between the representations of a suspect model and a
victim model on the same samples. This training-free REEF does not impair the
model's general capabilities and is robust to sequential fine-tuning, pruning,
model merging, and permutations. In this way, REEF provides a simple and
effective way for third parties and model owners to jointly protect LLMs'
intellectual property. The code is available at
https://github.com/tmylla/REEF.
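The centered kernel alignment similarity REEF computes is, in its linear form, the standard formula of Kornblith et al. (2019); REEF's layer and sample selection details are omitted here.

```python
import numpy as np

# Linear CKA between representations of the same n samples from two models.
# Standard formula (Kornblith et al., 2019); invariance to orthogonal
# transforms and isotropic scaling is what makes the comparison robust to
# permutations and benign edits of the suspect model.

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: [n, d1] suspect-model features; Y: [n, d2] victim-model features."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```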
☆ MoDification: Mixture of Depths Made Easy
Chen Zhang, Meizhi Zhong, Qimeng Wang, Xuantao Lu, Zheyu Ye, Chengqiang Lu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang, Dawei Song
Long-context efficiency has recently become a trending topic in serving large
language models (LLMs), and mixture of depths (MoD) has been proposed as a
perfect fit for bringing down both latency and memory. In this paper, however,
we discover that MoD can barely transform existing LLMs without costly training
over an extensive number of tokens. To enable the transformation of any LLM
into a MoD one, we show that the top-k operator in MoD should be promoted to a
threshold-p operator, and that refinements to the architecture and data should
be crafted alongside.
All these designs form our method termed MoDification. Through a comprehensive
set of experiments covering model scales from 3B to 70B, we exhibit
MoDification strikes an excellent balance between efficiency and effectiveness.
MoDification can achieve up to ~1.2x speedup in latency and ~1.8x reduction in
memory compared to original LLMs especially in long-context applications.
comment: 12 pages, 9 figures, 5 tables, work in progress
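A sketch of the operator swap under stated assumptions: the router and the sigmoid-threshold reading of "threshold-p" are illustrative, not MoDification's exact implementation.

```python
import torch

# Vanilla MoD routes a fixed top-k fraction of tokens through each block;
# the threshold-p variant keeps any token whose router score clears p, so
# per-sequence capacity can vary. Router details here are assumptions.

def topk_route(scores: torch.Tensor, k: int) -> torch.Tensor:
    """scores: [seq_len] router logits; returns a boolean keep-mask."""
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask

def threshold_p_route(scores: torch.Tensor, p: float) -> torch.Tensor:
    return torch.sigmoid(scores) > p  # capacity adapts to the input
```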
☆ Good Parenting is all you need -- Multi-agentic LLM Hallucination Mitigation
This study explores the ability of Large Language Model (LLM) agents to
detect and correct hallucinations in AI-generated content. A primary agent was
tasked with creating a blog about a fictional Danish artist named Flipfloppidy,
which was then reviewed by another agent for factual inaccuracies. Most LLMs
hallucinated the existence of this artist. Across 4,900 test runs involving
various combinations of primary and reviewing agents, advanced AI models such
as Llama3-70b and GPT-4 variants demonstrated near-perfect accuracy in
identifying hallucinations and successfully revised outputs in 85% to 100% of
cases following feedback. These findings underscore the potential of advanced
AI models to significantly enhance the accuracy and reliability of generated
content, providing a promising approach to improving AI workflow orchestration.
☆ Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement
The rapid development of large language models (LLMs), like ChatGPT, has
resulted in the widespread presence of LLM-generated content on social media
platforms, raising concerns about misinformation, data biases, and privacy
violations, which can undermine trust in online discourse. While detecting
LLM-generated content is crucial for mitigating these risks, current methods
often focus on binary classification, failing to address the complexities of
real-world scenarios like human-AI collaboration. To move beyond binary
classification and address these challenges, we propose a new paradigm for
detecting LLM-generated content. This approach introduces two novel tasks: LLM
Role Recognition (LLM-RR), a multi-class classification task that identifies
specific roles of LLMs in content generation, and LLM Influence Measurement
(LLM-IM), a regression task that quantifies the extent of LLM involvement in
content creation. To support these tasks, we propose LLMDetect, a benchmark
designed to evaluate detectors' performance on these new tasks. LLMDetect
includes the Hybrid News Detection Corpus (HNDC) for training detectors, as
well as DetectEval, a comprehensive evaluation suite that considers five
distinct cross-context variations and multi-intensity variations within the
same LLM role. This allows for a thorough assessment of detectors'
generalization and robustness across diverse contexts. Our empirical validation
of 10 baseline detection methods demonstrates that fine-tuned PLM-based models
consistently outperform others on both tasks, while advanced LLMs face
challenges in accurately detecting their own generated content. Our
experimental results and analysis offer insights for developing more effective
detection models for LLM-generated content. This research enhances the
understanding of LLM-generated content and establishes a foundation for more
nuanced detection methodologies.
comment: Social Media, Large Language Models, LLM-generated Text Detection,
AI-assisted News Detection
☆ Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas
Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yu Lu, Yaochu Jin, Lili Pan, Zhenzhong Lan
Scientific innovation is pivotal for humanity, and harnessing large language
models (LLMs) to generate research ideas could transform discovery. However,
existing LLMs often produce simplistic and repetitive suggestions due to their
limited ability to acquire external knowledge for innovation. To address this
problem, we introduce an enhanced planning and search methodology designed to
boost the creative potential of LLM-based systems. Our approach involves an
iterative process to purposely plan the retrieval of external knowledge,
progressively enriching the idea generation with broader and deeper insights.
Validation through automated and human assessments indicates that our framework
substantially elevates the quality of generated ideas, particularly in novelty
and diversity. The number of unique novel ideas produced by our framework is
3.4 times higher than without it. Moreover, our method outperforms the current
state-of-the-art, generating at least 2.5 times more top-rated ideas based on
170 seed papers in a Swiss Tournament evaluation.
☆ Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Post-training is essential for enabling large language models (LLMs) to
follow human instructions. Inspired by the recent success of using LLMs to
simulate human society, we leverage multi-agent simulation to automatically
generate diverse text-based scenarios, capturing a wide range of real-world
human needs. We propose MATRIX, a multi-agent simulator that creates realistic
and scalable scenarios. Leveraging these outputs, we introduce a novel
scenario-driven instruction generator MATRIX-Gen for controllable and highly
realistic data synthesis. Extensive experiments demonstrate that our framework
effectively generates both general and domain-specific data. Notably, on
AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on
datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs,
outperforms Meta's Llama-3-8B-Instruct model, which was trained on over 10M
pairs; see our project at https://github.com/ShuoTang123/MATRIX-Gen.
☆ Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
Evaluating Video Language Models (VLMs) is a challenging task. Due to its
transparency, Multiple-Choice Question Answering (MCQA) is widely used to
measure the performance of these models through accuracy. However, existing
MCQA benchmarks fail to capture the full reasoning capabilities of VLMs due to
selection bias, whereby models disproportionately favor certain answer options
based on positional patterns observed during training. In this work, we conduct
a comprehensive empirical analysis of several VLM architectures across major
datasets designed to assess complex video-focused reasoning. We identify where
the bias is most pronounced and demonstrate to what extent model responses
reflect genuine understanding of video content and related questions, as
opposed to reliance on arbitrary patterns or superficial cues, such as answer
position. By decomposing the MCQA task and adapting fairness bias metrics to
VLMs, we introduce a post-processing calibration technique, BOLD, to balance this
bias. Our results show that reducing selection bias improves not only debiasing
metrics but also overall model performance, including Accuracy and F1 Mean
score. Our method, by suppressing "blind guessing", offers a more cost- and
time-effective approach to mitigating selection bias compared to existing
techniques. This study represents the first focused investigation of selection
bias in video-to-text LLM-powered models.
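The abstract does not spell out BOLD's exact procedure, so the following is
only a hedged sketch of one standard way to calibrate positional selection
bias: estimate the average probability mass the model places on each answer
slot and divide it out before renormalizing.

```python
import numpy as np

def calibrate(probs: np.ndarray) -> np.ndarray:
    """Divide out the per-position prior and renormalize each row.
    `probs` holds model-predicted option probabilities, one row per
    question; under no selection bias the prior would be uniform."""
    prior = probs.mean(axis=0)  # average mass per answer slot
    adjusted = probs / prior    # remove the positional preference
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = np.array([[0.50, 0.20, 0.20, 0.10],
                  [0.60, 0.20, 0.10, 0.10],  # note the inflated first slot
                  [0.40, 0.30, 0.20, 0.10]])
print(calibrate(probs).round(3))
```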
☆ A Novel Method to Mitigate Demographic and Expert Bias in ICD Coding with Causal Inference
ICD (International Classification of Diseases) coding involves assigning ICD
codes to a patient's visit based on their medical notes. Treating ICD coding as
a multi-label text classification task, researchers have developed
sophisticated methods. Despite this progress, these models often suffer from
label imbalance and may develop spurious correlations with demographic factors.
Additionally, when human coders assign ICD codes, the inclusion of irrelevant
information from unrelated experts introduces biases. To combat these issues,
we propose a novel method to mitigate Demographic and Expert biases in ICD
coding through Causal Inference (DECI). We provide a novel causality-based
interpretation of ICD coding in which models make predictions through three
distinct pathways, and, based on counterfactual reasoning, DECI mitigates
demographic and expert biases. Experimental results show that DECI outperforms
state-of-the-art models, offering a significant advancement in accurate and
unbiased ICD coding.
☆ Towards Robust Knowledge Representations in Multilingual LLMs for Equivalence and Inheritance based Consistent Reasoning
Reasoning and linguistic skills form the cornerstone of human intelligence,
facilitating problem-solving and decision-making. Recent advances in Large
Language Models (LLMs) have led to impressive linguistic capabilities and
emergent reasoning behaviors, fueling widespread adoption across application
domains. However, LLMs still struggle with complex reasoning tasks,
highlighting their systemic limitations. In this work, we focus on evaluating
whether LLMs have the requisite representations to reason using two
foundational relationships: "equivalence" and "inheritance". We introduce novel
tasks and benchmarks spanning six languages and observe that current SOTA LLMs
often produce conflicting answers to the same questions across languages in
17.3-57.5% of cases and violate inheritance constraints in up to 37.2% of
cases. To enhance consistency across languages, we propose novel "Compositional
Representations", where tokens are represented as a composition of equivalent
tokens across languages, with the resulting conflict reduction (up to -4.7%)
indicating benefits of shared LLM representations.
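A hedged sketch of the "Compositional Representations" idea as described: map
equivalent tokens across languages onto one shared vector, here by simple
averaging; the vectors and the aggregation rule are illustrative assumptions,
not the paper's construction.

```python
import numpy as np

# Stand-in embeddings for one concept's surface forms in three languages.
embeddings = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "perro": np.array([0.7, 0.3, 0.1]),  # Spanish equivalent
    "chien": np.array([0.8, 0.2, 0.2]),  # French equivalent
}

def compositional(tokens: list[str]) -> np.ndarray:
    """Compose one shared representation from equivalent tokens, so a
    question asked in any of the languages consults the same vector."""
    return np.mean([embeddings[t] for t in tokens], axis=0)

print(compositional(["dog", "perro", "chien"]))
```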
☆ Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework
Large language models (LLMs) have transformed human writing by enhancing
grammar correction, content expansion, and stylistic refinement. However, their
widespread use raises concerns about authorship, originality, and ethics, even
potentially threatening scholarly integrity. Existing detection methods, which
mainly rely on single-feature analysis and binary classification, often fail to
effectively identify LLM-generated text in academic contexts. To address these
challenges, we propose a novel Multi-level Fine-grained Detection (MFD)
framework that detects LLM-generated text by integrating low-level structural,
high-level semantic, and deep-level linguistic features, while conducting
sentence-level evaluations of lexicon, grammar, and syntax for comprehensive
analysis. To improve detection of subtle differences in LLM-generated text and
enhance robustness against paraphrasing, we apply two mainstream evasion
techniques to rewrite the text. These variations, along with original texts,
are used to train a text encoder via contrastive learning, extracting
high-level semantic features of sentences to boost detection generalization.
Furthermore, we leverage an advanced LLM to analyze the entire text and extract
deep-level linguistic features, enhancing the model's ability to capture
complex patterns and nuances while effectively incorporating contextual
information. Extensive experiments on public datasets show that the MFD model
outperforms existing methods, achieving an MAE of 0.1346 and an accuracy of
88.56%. Our research provides institutions and publishers with an effective
mechanism to detect LLM-generated text, mitigating risks of compromised
authorship. Educators and editors can use the model's predictions to refine
verification and plagiarism prevention protocols, ensuring adherence to
standards.
☆ Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model ACM MM 2024
Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task
that aims to extract entities and their relations from text-image pairs in
social media posts. Existing methods for JMERE require large amounts of labeled
data. However, gathering and annotating fine-grained multimodal data for JMERE
poses significant challenges. Initially, we construct diverse and comprehensive
multimodal few-shot datasets fitted to the original data distribution. To
address the insufficient information in the few-shot setting, we introduce the
\textbf{K}nowledge-\textbf{E}nhanced \textbf{C}ross-modal \textbf{P}rompt
\textbf{M}odel (KECPM) for JMERE. This method can effectively address the
problem of insufficient information in the few-shot setting by guiding a large
language model to generate supplementary background knowledge. Our proposed
method comprises two stages: (1) a knowledge ingestion stage that dynamically
formulates prompts based on semantic similarity to guide ChatGPT in generating
relevant knowledge and employs self-reflection to refine the knowledge; (2) a
knowledge-enhanced language model stage that merges the auxiliary knowledge
with the original input and utilizes a transformer-based model to align with
JMERE's required output format. We extensively evaluate our approach on a
few-shot dataset derived from the JMERE dataset, demonstrating its superiority
over strong baselines in terms of both micro and macro F$_1$ scores.
Additionally, we present qualitative analyses and case studies to elucidate the
effectiveness of our model.
comment: accepted by ACM MM 2024
☆ Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning
Large Language Models (LLMs) have achieved impressive results in various
tasks but struggle with hallucination problems and lack of relevant knowledge,
especially in deep complex reasoning and knowledge-intensive tasks. Knowledge
Graphs (KGs), which capture vast amounts of facts in a structured format, offer
a reliable source of knowledge for reasoning. However, existing KG-based LLM
reasoning methods face challenges like handling multi-hop reasoning,
multi-entity questions, and effectively utilizing graph structures. To address
these issues, we propose Paths-over-Graph (PoG), a novel method that enhances
LLM reasoning by integrating knowledge reasoning paths from KGs, improving the
interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and
multi-entity questions through a three-phase dynamic multi-hop path
exploration, which combines the inherent knowledge of LLMs with factual
knowledge from KGs. To improve efficiency, PoG first prunes irrelevant
information from the graph exploration and introduces an efficient three-step
pruning technique that incorporates graph structures, LLM prompting,
and a pre-trained language model (e.g., SBERT) to effectively narrow down the
explored candidate paths. This ensures all reasoning paths contain highly
relevant information captured from KGs, making the reasoning faithful and
interpretable in problem-solving. PoG innovatively utilizes graph structure to
prune the irrelevant noise and represents the first method to implement
multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive
experiments on five benchmark KGQA datasets demonstrate PoG outperforms the
state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an
average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo
surpasses ToG with GPT-4 by up to 23.9%.
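One step of the pruning pipeline can be made concrete: the abstract names a
pre-trained language model such as SBERT for narrowing candidate paths, which
the sentence-transformers library exposes; the example question, path strings,
and cutoff below are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "Which country is the director of Parasite from?"
paths = [  # candidate KG reasoning paths, linearized as text
    "Parasite -> directed_by -> Bong Joon-ho -> nationality -> South Korea",
    "Parasite -> genre -> thriller",
    "Parasite -> release_year -> 2019",
]
q_emb = model.encode(question, convert_to_tensor=True)
p_emb = model.encode(paths, convert_to_tensor=True)
scores = util.cos_sim(q_emb, p_emb)[0]

for i in scores.argsort(descending=True)[:2]:  # keep the 2 best paths
    print(f"{scores[i]:.3f}  {paths[i]}")
```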
☆ Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
Synthetic data has been widely used to train large language models, but the
generation process inevitably introduces noisy, non-informative, and misleading
learning signals. In this paper, we propose Montessori-Instruct, a novel data
synthesis framework that tailors the data synthesis ability of the teacher
language model toward the student language model's learning process.
Specifically, we utilize local data influence of synthetic training data points
on students to characterize students' learning preferences. Then, we train the
teacher model with Direct Preference Optimization (DPO) to generate synthetic
data tailored toward student learning preferences. Experiments with
Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and
MT-Bench demonstrate that Montessori-Instruct significantly outperforms
standard synthesis methods by 18.35\% and 46.24\% relatively. Our method also
beats data synthesized by a stronger teacher model, GPT-4o. Further analysis
confirms the benefits of the teacher learning to generate more influential
training data for the student's improved learning, the advantages of local data
influence in accurately measuring student preferences, and the robustness of
Montessori-Instruct across different student models. Our code and data are
open-sourced at https://github.com/cxcscmu/Montessori-Instruct.
comment: Codes and data are open-sourced at
https://github.com/cxcscmu/Montessori-Instruct
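The "local data influence" signal can be sketched on a toy model as the drop
in reference-set loss after one optimizer step on a single synthetic example;
the stand-in model, loss, and step size are assumptions, with the real setting
being an LLM trained with next-token cross-entropy.

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(8, 1)  # stand-in for the student LLM
loss_fn = lambda m, batch: ((m(batch[0]) - batch[1]) ** 2).mean()

def local_influence(model, example, ref_batch, lr=1e-2):
    """Reference-loss drop after one gradient step on `example`;
    positive values mean the synthetic example helped the student."""
    probe = copy.deepcopy(model)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    loss_fn(probe, example).backward()
    opt.step()
    with torch.no_grad():
        return (loss_fn(model, ref_batch) - loss_fn(probe, ref_batch)).item()

x, y = torch.randn(64, 8), torch.randn(64, 1)
print(local_influence(student, (x[:1], y[:1]), (x, y)))
```

In the paper's loop, such influence scores become the preference signal used
to train the teacher with DPO.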
☆ MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations EMNLP2024
Medical task-oriented dialogue systems can assist doctors by collecting
patient medical history, aiding in diagnosis, or guiding treatment selection,
thereby reducing doctor burnout and expanding access to medical services.
However, doctor-patient dialogue datasets are not readily available, primarily
due to privacy regulations. Moreover, existing datasets lack comprehensive
annotations involving medical slots and their different attributes, such as
symptoms and their onset, progression, and severity. These comprehensive
annotations are crucial for accurate diagnosis. Finally, most existing datasets
are non-English, limiting their utility for the larger research community.
In response, we introduce MediTOD, a new dataset of doctor-patient dialogues
in English for the medical history-taking task. Collaborating with doctors, we
devise a questionnaire-based labeling scheme tailored to the medical domain.
Then, medical professionals create the dataset with high-quality comprehensive
annotations, capturing medical slots and their attributes. We establish
benchmarks in supervised and few-shot settings on MediTOD for natural language
understanding, policy learning, and natural language generation subtasks,
evaluating models from both TOD and biomedical domains. We make MediTOD
publicly available for future research.
comment: EMNLP2024 Camera Ready Version
☆ Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs
Existing automated essay scoring (AES) has solely relied on essay text
without using explanatory rationales for the scores, thereby forgoing an
opportunity to capture the specific aspects evaluated by rubric indicators in a
fine-grained manner. This paper introduces Rationale-based Multiple Trait
Scoring (RMTS), a novel approach for multi-trait essay scoring that integrates
prompt-engineering-based large language models (LLMs) with a fine-tuning-based
essay scoring model using a smaller large language model (S-LLM). RMTS uses an
LLM-based trait-wise rationale generation system where a separate LLM agent
generates trait-specific rationales based on rubric guidelines, which the
scoring model uses to accurately predict multi-trait scores. Extensive
experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize,
show that RMTS significantly outperforms state-of-the-art models and vanilla
S-LLMs in trait-specific scoring. By assisting quantitative assessment with
fine-grained qualitative rationales, RMTS enhances the trait-wise reliability,
providing partial explanations about essays.
☆ E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model
Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, S. Kevin Zhou
The development of 3D medical vision-language models holds significant
potential for disease diagnosis and patient treatment. However, compared to 2D
medical images, 3D medical images, such as CT scans, face challenges related to
limited training data and high dimensionality, which severely restrict the progress
of 3D medical vision-language models. To address these issues, we collect a
large amount of unlabeled 3D CT data and utilize self-supervised learning to
construct a 3D visual foundation model for extracting 3D visual features. Then,
we apply 3D spatial convolutions to aggregate and project high-level image
features, reducing computational complexity while preserving spatial
information. We also construct two instruction-tuning datasets based on BIMCV-R
and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates
superior performance compared to existing methods in report generation, visual
question answering, and disease diagnosis. Code and data will be made publicly
available soon.
☆ Supervised Chain of Thought
Large Language Models (LLMs) have revolutionized natural language processing
and hold immense potential for advancing Artificial Intelligence. However, the
core architecture of most mainstream LLMs -- the Transformer -- has inherent
limitations in computational depth, rendering them theoretically incapable of
solving many reasoning tasks that demand increasingly deep computations. Chain
of Thought (CoT) prompting has emerged as a technique to address these
architectural limitations, as evidenced by several theoretical studies. It
offers a promising approach to solving complex reasoning tasks that were
previously beyond the capabilities of these models. Despite its successes, CoT
and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a
"one-prompt-for-all" approach, using a single prompt structure (e.g., "think
step by step") for a wide range of tasks -- from counting and sorting to
solving mathematical and algorithmic problems. This approach poses significant
challenges for models to generate the correct reasoning steps, as the model
must navigate through a vast prompt template space to find the appropriate
template for each task. In this work, we build upon previous theoretical
analyses of CoT to demonstrate how the one-prompt-for-all approach can
negatively affect the computability of LLMs. We partition the solution search
space into two: the prompt space and the answer space. Our findings show that
task-specific supervision is essential for navigating the prompt space
accurately and achieving optimal performance. Through experiments with
state-of-the-art LLMs, we reveal a gap in reasoning performance when
supervision is applied versus when it is not.
☆ Speciesism in Natural Language Processing Research
Natural Language Processing (NLP) research on AI Safety and social bias in AI
has focused on safety for humans and social bias against human minorities.
However, some AI ethicists have argued that the moral significance of nonhuman
animals has been ignored in AI research. Therefore, the purpose of this study
is to investigate whether there is speciesism, i.e., discrimination against
nonhuman animals, in NLP research. First, we explain why nonhuman animals are
relevant in NLP research. Next, we survey the findings of existing research on
speciesism in NLP researchers, data, and models and further investigate this
problem in this study. The findings of this study suggest that speciesism
exists within researchers, data, and models, respectively. Specifically, our
survey and experiments show that (a) NLP researchers, even those who study
social bias in AI, do not recognize speciesism or speciesist bias; (b)
among NLP data, speciesist bias is inherent in the data annotated in the
datasets used to evaluate NLP models; (c) OpenAI GPTs, recent NLP models,
exhibit speciesist bias by default. Finally, we discuss how we can reduce
speciesism in NLP research.
comment: This article is a preprint and has not been peer-reviewed. The
postprint has been accepted for publication in AI and Ethics. Please cite the
final version of the article once it is published
☆ MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time
Large Language Models (LLMs) acquire extensive knowledge and remarkable
abilities from extensive text corpora, making them powerful tools for various
applications. To make LLMs more usable, aligning them with human preferences is
essential. Existing alignment techniques, such as Reinforcement Learning from
Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically embed
predefined preferences directly within the model's parameters. These methods,
however, often result in a static alignment that can not account for the
diversity of human preferences in practical applications. In response to this
challenge, we propose an effective method, \textbf{MetaAlign}, which aims to
help LLMs dynamically align with various explicit or implicit preferences
specified at inference time. Experimental results show that LLMs optimized on
our meticulously constructed MetaAlign Dataset can effectively align with any
preferences specified at the inference stage, validating the feasibility of
MetaAlign. We hope that our work can provide some insights into the alignment
of language models.
comment: 19 pages, 6 figures
☆ LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Yujun Zhou, Jingdong Yang, Kehan Guo, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Laboratory accidents pose significant risks to human life and property,
underscoring the importance of robust safety protocols. Despite advancements in
safety training, laboratory personnel may still unknowingly engage in unsafe
practices. With the increasing reliance on large language models (LLMs) for
guidance in various fields, including laboratory settings, there is a growing
concern about their reliability in critical safety-related decision-making.
Unlike trained human researchers, LLMs lack formal lab safety education,
raising questions about their ability to provide safe and accurate guidance.
Existing research on LLM trustworthiness primarily focuses on issues such as
ethical compliance, truthfulness, and fairness but fails to fully cover
safety-critical real-world applications, like lab safety. To address this gap,
we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive
evaluation framework based on a new taxonomy aligned with Occupational Safety
and Health Administration (OSHA) protocols. This benchmark includes 765
multiple-choice questions verified by human experts, assessing LLMs and vision
language models (VLMs) performance in lab safety contexts. Our evaluations
demonstrate that while GPT-4o outperforms human participants, it is still prone
to critical errors, highlighting the risks of relying on LLMs in
safety-critical environments. Our findings emphasize the need for specialized
benchmarks to accurately assess the trustworthiness of LLMs in real-world
safety applications.
comment: 50 pages, 19 figures
☆ XForecast: Evaluating Natural Language Explanations for Time Series Forecasting
Time series forecasting aids decision-making, especially for stakeholders who
rely on accurate predictions, making it very important to understand and
explain these models to ensure informed decisions. Traditional explainable AI
(XAI) methods, which underline feature or temporal importance, often require
expert knowledge. In contrast, natural language explanations (NLEs) are more
accessible to laypeople. However, evaluating forecast NLEs is difficult due to
the complex causal relationships in time series data. To address this, we
introduce two new performance metrics based on simulatability, assessing how
well a human surrogate can predict model forecasts using the explanations.
Experiments show these metrics differentiate good from poor explanations and
align with human judgments. Utilizing these metrics, we further evaluate the
ability of state-of-the-art large language models (LLMs) to generate
explanations for time series data, finding that numerical reasoning, rather
than model size, is the main factor influencing explanation quality.
☆ MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
Multimodal Large Language Models (MLLMs) have demonstrated impressive
abilities across various tasks, including visual question answering and chart
comprehension, yet existing benchmarks for chart-related tasks fall short in
capturing the complexity of real-world multi-chart scenarios. Current
benchmarks primarily focus on single-chart tasks, neglecting the multi-hop
reasoning required to extract and integrate information from multiple charts,
which is essential in practical applications. To fill this gap, we introduce
MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas:
direct question answering, parallel question answering, comparative reasoning,
and sequential reasoning. Our evaluation of a wide range of MLLMs reveals
significant performance gaps compared to humans. These results highlight the
challenges in multi-chart comprehension and the potential of MultiChartQA to
drive advancements in this field. Our code and data are available at
https://github.com/Zivenzhu/Multi-chart-QA
comment: 18 pages, 9 figures
☆ LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems
Interestingly, LLMs still struggle with some basic tasks that humans find
trivial, e.g., counting the number of occurrences of the character 'r' in the
word "strawberry". There are several popular conjectures (e.g., tokenization,
architecture, and training data) regarding the reason for LLMs' deficiency in
simple word-based counting problems, all sharing the belief that such failures
stem from model pretraining and are hence probably inevitable during
deployment. In this paper, we carefully design multiple evaluation settings to
investigate the validity of these prevalent conjectures. Meanwhile, we measure
the transferability of advanced mathematical and coding reasoning capabilities
from specialized LLMs to simple counting tasks. Although specialized LLMs
suffer from counting problems as well, we find the conjectures about inherent
deficiencies of LLMs invalid and further seek opportunities to elicit knowledge and
capabilities from LLMs that are beneficial to counting tasks. Compared with
strategies such as finetuning and in-context learning that are commonly adopted
to enhance performance on new or challenging tasks, we show that engaging
reasoning is the most robust and efficient way to help LLMs better perceive
tasks with more accurate responses.
We hope our conjecture validation design could provide insights into the
study of future critical failure modes of LLMs. Based on challenges in
transferring advanced capabilities to much simpler tasks, we call for more
attention to model capability acquisition and evaluation. We also highlight the
importance of cultivating consciousness of "reasoning before responding" during
model pretraining.
☆ Automated Genre-Aware Article Scoring and Feedback Using Large Language Models
This paper focuses on the development of an advanced intelligent article
scoring system that not only assesses the overall quality of written work but
also offers detailed feature-based scoring tailored to various article genres.
By integrating the pre-trained BERT model with the large language model
Chat-GPT, the system gains a deep understanding of both the content and
structure of the text, enabling it to provide a thorough evaluation along with
targeted suggestions for improvement. Experimental results demonstrate that
this system outperforms traditional scoring methods across multiple public
datasets, particularly in feature-based assessments, offering a more accurate
reflection of the quality of different article types. Moreover, the system
generates personalized feedback to assist users in enhancing their writing
skills, underscoring the potential and practical value of automated scoring
technologies in educational contexts.
☆ Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
Autoregressive language models, despite their impressive capabilities,
struggle with complex reasoning and long-term planning tasks. We introduce
discrete diffusion models as a novel solution to these challenges. Through the
lens of subgoal imbalance, we demonstrate how diffusion models effectively
learn difficult subgoals that elude autoregressive approaches. We propose
Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on
difficulty during learning. On complex tasks like Countdown, Sudoku, and
Boolean Satisfiability Problems, MDM significantly outperforms autoregressive
models without using search techniques. For instance, MDM achieves 91.5\% and
100\% accuracy on Countdown and Sudoku, respectively, compared to 45.8\% and
20.7\% for autoregressive models. Our work highlights the potential of
diffusion-based approaches in advancing AI capabilities for sophisticated
language understanding and problem-solving tasks.
☆ Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models
Large Language Models (LLMs) are capable of generating persuasive Natural
Language Explanations (NLEs) to justify their answers. However, the
faithfulness of these explanations should not be readily trusted at face value.
Recent studies have proposed various methods to measure the faithfulness of
NLEs, typically by inserting perturbations at the explanation or feature level.
We argue that these approaches are neither comprehensive nor correctly designed
according to the established definition of faithfulness. Moreover, we highlight
the risks of grounding faithfulness findings on out-of-distribution samples. In
this work, we leverage a causal mediation technique called activation patching
to measure the faithfulness of an explanation towards supporting the explained
answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of
causal attributions between explanations and the corresponding model outputs as
the indicator of faithfulness. We experimented across models varying from 2B to
27B parameters and found that models that underwent alignment tuning tend to
produce more faithful and plausible explanations. We find that Causal
Faithfulness is a promising improvement over existing faithfulness tests by
taking into account the model's internal computations and avoiding out of
distribution concerns that could otherwise undermine the validity of
faithfulness assessments. We release the code in
\url{https://github.com/wj210/Causal-Faithfulness}
comment: Under review
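For readers unfamiliar with activation patching, the mechanics reduce to
caching an activation from one forward pass and splicing it into another; a
minimal, self-contained sketch on a toy MLP follows (the paper applies the
same idea to transformer internals).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
layer = model[1]  # patch at the ReLU output

cache = {}
def save_hook(mod, inp, out):
    cache["act"] = out.detach()  # remember the clean activation

def patch_hook(mod, inp, out):
    return cache["act"]          # overwrite with the cached activation

clean, corrupted = torch.randn(1, 4), torch.randn(1, 4)

handle = layer.register_forward_hook(save_hook)
clean_out = model(clean)
handle.remove()

handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)   # corrupted input, clean activation
handle.remove()

# If patching restores the clean output, that activation causally
# mediated it; here the whole layer is patched, so the match is exact.
print((patched_out - clean_out).abs().max())
```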
☆ SRAP-Agent: Simulating and Optimizing Scarce Resource Allocation Policy with LLM-based Agent
Public scarce resource allocation plays a crucial role in economics as it
directly influences efficiency and equity in society. Traditional studies,
including theoretical model-based, empirical study-based, and simulation-based
methods, encounter limitations due to the idealized assumption of complete
information and individual rationality, as well as constraints posed by limited
available data. In this work, we propose an innovative framework, SRAP-Agent
(Simulating and Optimizing Scarce Resource Allocation Policy with LLM-based
Agent), which integrates Large Language Models (LLMs) into economic
simulations, aiming to bridge the gap between theoretical models and real-world
dynamics. Using public housing allocation scenarios as a case study, we conduct
extensive policy simulation experiments to verify the feasibility and
effectiveness of the SRAP-Agent and employ the Policy Optimization Algorithm
with certain optimization objectives. The source code can be found in
https://github.com/jijiarui-cather/SRAPAgent_Framework
☆ Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis
With the rapid development of the internet, the richness of User-Generated
Content continues to increase, making Multimodal Aspect-Based Sentiment
Analysis (MABSA) a research hotspot. Existing studies have achieved certain
results in MABSA, but they have not effectively addressed the analytical
challenges in scenarios where multiple entities and sentiments coexist. This
paper innovatively introduces Large Language Models (LLMs) for event
decomposition and proposes a reinforcement learning framework for Multimodal
Aspect-Based Sentiment Analysis (MABSA-RL). This framework decomposes the
original text into a set of events using LLMs, reducing the complexity of
analysis, and introduces reinforcement learning to optimize model parameters.
Experimental
results show that MABSA-RL outperforms existing advanced methods on two
benchmark datasets. This paper provides a new research perspective and method
for multimodal aspect-level sentiment analysis.
☆ Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
The recent advancements in large language models (LLMs) and pre-trained
vision models have accelerated the development of vision-language large models
(VLLMs), enhancing the interaction between visual and linguistic modalities.
Despite their notable success across various domains, VLLMs face challenges in
modality alignment, which can lead to issues like hallucinations and unsafe
content generation. Current alignment techniques often rely on coarse feedback
and external datasets, limiting scalability and performance. In this paper, we
propose FiSAO (Fine-Grained Self-Alignment Optimization), a novel
self-alignment method that utilizes the model's own visual encoder as a
fine-grained verifier to improve vision-language alignment without the need for
additional data. By leveraging token-level feedback from the vision encoder,
FiSAO significantly improves vision-language alignment, even surpassing
traditional preference tuning methods that require additional data. Through
both theoretical analysis and experimental validation, we demonstrate that
FiSAO effectively addresses the misalignment problem in VLLMs, marking the
first instance of token-level rewards being applied to such models.
comment: 23 pages
☆ CAPE: A Chinese Dataset for Appraisal-based Emotional Generation using Large Language Models
Generating emotionally appropriate responses in conversations with large
language models presents a significant challenge due to the complexities of
human emotions and cognitive processes, which remain largely underexplored in
their critical role in social interactions. In this study, we introduce a
two-stage automatic data generation framework to create CAPE, a Chinese dataset
named Cognitive Appraisal theory-based Emotional corpus. This corpus
facilitates the generation of dialogues with contextually appropriate emotional
responses by accounting for diverse personal and situational factors. We
propose two tasks utilizing this dataset: emotion prediction and next utterance
prediction. Both automated and human evaluations demonstrate that agents
trained on our dataset can deliver responses that are more aligned with human
emotional expressions. Our study shows the potential for advancing emotional
expression in conversational agents, paving the way for more nuanced and
meaningful human-computer interactions.
☆ A Lightweight Multi Aspect Controlled Text Generation Solution For Large Language Models
Large language models (LLMs) show remarkable abilities with instruction
tuning. However, they fall short on target tasks when high-quality instruction
tuning data is lacking. Multi-Aspect Controllable Text
Generation (MCTG) is a representative task for this dilemma, where aspect
datasets are usually biased and correlated. Existing work exploits additional
model structures and strategies for solutions, limiting adaptability to LLMs.
To activate the MCTG ability of LLMs, we propose a lightweight MCTG pipeline based
on data augmentation. We analyze bias and correlations in traditional datasets,
and address these concerns with augmented control attributes and sentences.
Augmented datasets are feasible for instruction tuning. In our experiments,
LLMs perform better in MCTG after data augmentation, with a 20% rise in
accuracy and weaker aspect correlations.
☆ Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents
When assisting people in daily tasks, robots need to accurately interpret
visual cues and respond effectively in diverse safety-critical situations, such
as sharp objects on the floor. In this context, we present M-CoDAL, a
multimodal dialogue system specifically designed for embodied agents to better
understand and communicate in safety-critical situations. The system leverages
discourse coherence relations to enhance its contextual understanding and
communication abilities. To train this system, we introduce a novel
clustering-based active learning mechanism that utilizes an external Large
Language Model (LLM) to identify informative instances. Our approach is
evaluated using a newly created multimodal dataset comprising 1K safety
violations extracted from 2K Reddit images. These violations are annotated
using a Large Multimodal Model (LMM) and verified by human annotators. Results
with this dataset demonstrate that our approach improves resolution of safety
situations, user sentiment, as well as safety of the conversation. Next, we
deploy our dialogue system on a Hello Robot Stretch robot and conduct a
within-subject user study with real-world participants. In the study,
participants role-play two safety scenarios with different levels of severity
with the robot and receive interventions from our model and a baseline system
powered by OpenAI's ChatGPT. The study results corroborate and extend the
findings from automated evaluation, showing that our proposed system is more
persuasive and competent in a real-world embodied agent setting.
☆ ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Text-based VQA is a challenging task that requires machines to use scene
texts in given images to yield the most appropriate answer for the given
question. The main challenge of text-based VQA is exploiting the meaning and
information from scene texts. Recent studies tackled this challenge by
considering the spatial information of scene texts in images via embedding 2D
coordinates of their bounding boxes. In this study, we follow the definition of
meaning from linguistics to introduce a novel method that effectively exploits
the information from scene texts written in Vietnamese. Experimental results
show that our proposed method obtains state-of-the-art results on two
large-scale Vietnamese Text-based VQA datasets. The implementation can be found
at this link.
♻ ☆ Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing
The locate-then-edit paradigm has shown significant promise for knowledge
editing (KE) in Large Language Models (LLMs). While previous methods perform
well on single-hop fact recall tasks, they consistently struggle with multi-hop
factual recall tasks involving newly edited knowledge. In this paper,
leveraging tools in mechanistic interpretability, we first identify that in
multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper
MLP layers, unlike single-hop tasks, which rely on earlier layers. This
distinction explains the poor performance of current methods in multi-hop
queries, as they primarily focus on editing shallow layers, leaving deeper
layers unchanged. To address this, we propose IFMET, a novel locate-then-edit
KE approach designed to edit both shallow and deep MLP layers. IFMET employs
multi-hop editing prompts and supplementary sets to locate and modify knowledge
across different reasoning stages. Experimental results demonstrate that IFMET
significantly improves performance on multi-hop factual recall tasks,
effectively overcoming the limitations of previous locate-then-edit methods.
comment: 21 pages
♻ ☆ System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam
The processes underlying human cognition are often divided into System 1,
which involves fast, intuitive thinking, and System 2, which involves slow,
deliberate reasoning. Previously, large language models were criticized for
lacking the deeper, more analytical capabilities of System 2. In September
2024, OpenAI introduced the o1 model series, designed to handle System 2-like
reasoning. While OpenAI's benchmarks are promising, independent validation is
still needed. In this study, we tested the o1-preview model twice on the Dutch
'Mathematics B' final exam. It scored a near-perfect 76 and 74 out of 76
points. For context, only 24 out of 16,414 students in the Netherlands achieved
a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76,
well above the Dutch average of 40.63 points. Neither model had access to the
exam figures. Since there was a risk of model contamination (i.e., the
knowledge cutoff of o1-preview and GPT-4o was after the exam was published
online), we repeated the procedure with a new Mathematics B exam that was
published after the cutoff date. The results again indicated that o1-preview
performed strongly (97.8th percentile), which suggests that contamination was
not a factor. We also show that there is some variability in the output of
o1-preview, which means that sometimes there is 'luck' (the answer is correct)
or 'bad luck' (the output has diverged into something that is incorrect). We
demonstrate that a self-consistency approach, where repeated prompts are given
and the most common answer is selected, is a useful strategy for identifying
the correct answer. It is concluded that while OpenAI's new model series holds
great potential, certain risks must be considered.
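The self-consistency strategy the authors describe reduces to majority voting
over repeated samples; a minimal sketch:

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Return the most common answer across repeated model runs."""
    return Counter(answers).most_common(1)[0][0]

# Five illustrative samples of the same exam question, one diverged run.
samples = ["x = 12", "x = 12", "x = 7", "x = 12", "x = 12"]
print(self_consistency(samples))  # -> "x = 12"
```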
♻ ☆ Liger Kernel: Efficient Triton Kernels for LLM Training
Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen
Training Large Language Models (LLMs) efficiently at scale presents a
formidable challenge, driven by their ever-increasing computational demands and
the need for enhanced performance. In this work, we introduce Liger-Kernel, an
open-sourced set of Triton kernels developed specifically for LLM training.
With kernel optimization techniques like kernel operation fusing and input
chunking, our kernels achieve on average a 20% increase in training throughput
and a 60% reduction in GPU memory usage for popular LLMs compared to
HuggingFace implementations. In addition, Liger-Kernel is designed with
modularity, accessibility, and adaptability in mind, catering to both casual
and expert users. Comprehensive benchmarks and integration tests are built in
to ensure compatibility, performance, correctness, and convergence across
diverse computing environments and model architectures.
The source code is available under a permissive license at:
github.com/linkedin/Liger-Kernel.
comment: 17 pages, 12 figures
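The abstract names input chunking as one of the optimization techniques; a
hedged PyTorch sketch of the underlying idea for the LM-head cross-entropy
follows (Liger's actual Triton kernels fuse these steps on-GPU), with all
sizes illustrative.

```python
import torch
import torch.nn.functional as F

def chunked_ce(hidden, weight, labels, chunk=1024):
    """Cross-entropy over the vocabulary computed chunk by chunk, so the
    full (tokens x vocab) logits tensor is never materialized at once."""
    total, n = 0.0, labels.numel()
    for i in range(0, n, chunk):
        logits = hidden[i:i + chunk] @ weight.T  # one chunk of logits
        total += F.cross_entropy(logits, labels[i:i + chunk], reduction="sum")
    return total / n

hidden = torch.randn(4096, 512)            # token hidden states
weight = torch.randn(32000, 512)           # LM head (vocab x hidden)
labels = torch.randint(0, 32000, (4096,))
print(chunked_ce(hidden, weight, labels))
```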
♻ ☆ Contextual Document Embeddings
Dense document embeddings are central to neural retrieval. The dominant
paradigm is to train and construct embeddings by running encoders directly on
individual documents. In this work, we argue that these embeddings, while
effective, are implicitly out-of-context for targeted use cases of retrieval,
and that a contextualized document embedding should take into account both the
document and neighboring documents in context - analogous to contextualized
word embeddings. We propose two complementary methods for contextualized
document embeddings: first, an alternative contrastive learning objective that
explicitly incorporates the document neighbors into the intra-batch contextual
loss; second, a new contextual architecture that explicitly encodes neighbor
document information into the encoded representation. Results show that both
methods achieve better performance than biencoders in several settings, with
differences especially pronounced out-of-domain. We achieve state-of-the-art
results on the MTEB benchmark with no hard negative mining, score distillation,
dataset-specific instructions, intra-GPU example-sharing, or extremely large
batch sizes. Our method can be applied to improve performance on any
contrastive learning dataset and any biencoder.
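A hedged sketch of the first method's spirit: an in-batch InfoNCE loss where
each batch is deliberately composed of a document's retrieved neighbors, so
the in-batch negatives are the contextually hard ones; the tensors are
stand-ins and the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def neighborhood_contrastive_loss(q_emb, d_emb, temp=0.05):
    """InfoNCE over a batch drawn from one neighborhood: row i's positive
    is document i, and every other (neighboring) row is a hard negative."""
    sims = F.normalize(q_emb, dim=-1) @ F.normalize(d_emb, dim=-1).T / temp
    targets = torch.arange(sims.size(0))
    return F.cross_entropy(sims, targets)

q = torch.randn(8, 256)  # embeddings of 8 queries from one neighborhood
d = torch.randn(8, 256)  # embeddings of their paired documents
print(neighborhood_contrastive_loss(q, d))
```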
♻ ☆ Learning Linear Attention in Polynomial Time
Previous research has explored the computational expressivity of Transformer
models in simulating Boolean circuits or Turing machines. However, the
learnability of these simulators from observational data has remained an open
question. Our study addresses this gap by providing the first polynomial-time
learnability results (specifically strong, agnostic PAC learning) for
single-layer Transformers with linear attention. We show that linear attention
may be viewed as a linear predictor in a suitably defined RKHS. As a
consequence, the problem of learning any linear transformer may be converted
into the problem of learning an ordinary linear predictor in an expanded
feature space, and any such predictor may be converted back into a multiheaded
linear transformer. Moving to generalization, we show how to efficiently
identify training datasets for which every empirical risk minimizer is
equivalent (up to trivial symmetries) to the linear Transformer that generated
the data, thereby guaranteeing the learned model will correctly generalize
across all inputs. Finally, we provide examples of computations expressible via
linear attention and therefore polynomial-time learnable, including associative
memories, finite automata, and a class of Universal Turing Machines (UTMs) with
polynomially bounded computation histories. We empirically validate our
theoretical findings on three tasks: learning random linear attention networks,
key--value associations, and learning to execute finite automata. Our findings
bridge a critical gap between theoretical expressivity and learnability of
Transformers, and show that flexible and general models of computation are
efficiently learnable.
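The reduction to a linear predictor can be checked numerically: single-head
linear attention (no softmax) equals a fixed cubic feature expansion of the
input multiplied by a weight matrix assembled from the attention parameters.
A sketch with tiny dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Linear attention: out = (X Wq)(X Wk)^T (X Wv), softmax omitted.
attn_out = (X @ Wq) @ (X @ Wk).T @ (X @ Wv)

# The same output as an ordinary linear map over expanded features of X.
phi = np.einsum("ni,mj,mk->nijk", X, X, X).reshape(n, -1)   # data features
w = np.einsum("iq,jq,kv->ijkv", Wq, Wk, Wv).reshape(-1, d)  # flat predictor
print(np.allclose(attn_out, phi @ w))  # True
```

Since phi depends only on the data, fitting the attention parameters reduces
to fitting the linear predictor w, which is the lever behind the
polynomial-time learnability results.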
♻ ☆ One size doesn't fit all: Predicting the Number of Examples for In-Context Learning
In-context learning (ICL) refers to the process of adding a small number of
localized examples (ones that are semantically similar to the input) from a
training set of labelled data to an LLM's prompt with an objective to
effectively control the generative process seeking to improve the downstream
task performance. Existing ICL approaches use an identical number of examples
(a pre-configured hyper-parameter) for each data instance. Our work alleviates
the limitations of this 'one fits all' approach by dynamically predicting the
number of examples for each data instance to be used in few-shot inference with
LLMs. In particular, we employ a multi-label classifier, the parameters of
which are fitted using a training set, where the label for each instance in the
training set indicates if using a specific value of k (number of most similar
examples from 0 up to a maximum value) leads to correct k-shot downstream
predictions. Our experiments on a number of text classification benchmarks
show that this adaptive ICL approach (AICL) substantially outperforms standard
ICL by up to 17%.
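A hedged sketch of the classifier component: a multi-label model over
candidate k values, with the smallest endorsed k used at inference time; the
features and labels below are random stand-ins for the text-derived ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # per-instance features (stand-ins)
Y = rng.integers(0, 2, size=(200, 6))   # Y[i, k] = 1 iff k-shot was correct

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

def predict_k(x: np.ndarray) -> int:
    """Smallest number of in-context examples the classifier endorses."""
    viable = np.flatnonzero(clf.predict(x[None])[0])
    return int(viable[0]) if viable.size else 0  # fall back to zero-shot

print(predict_k(rng.normal(size=16)))
```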
♻ ☆ Movie101v2: Improved Movie Narration Benchmark
Automatic movie narration aims to generate video-aligned plot descriptions to
assist visually impaired audiences. Unlike standard video captioning, it
involves not only describing key visual details but also inferring plots that
unfold across multiple movie shots, presenting distinct and complex challenges.
To advance this field, we introduce Movie101v2, a large-scale, bilingual
dataset with enhanced data quality specifically designed for movie narration.
Revisiting the task, we propose breaking down the ultimate goal of automatic
movie narration into three progressive stages, offering a clear roadmap with
corresponding evaluation metrics. Based on our new benchmark, we baseline a
range of large vision-language models, including GPT-4V, and conduct an
in-depth analysis of the challenges in narration generation. Our findings
highlight that achieving applicable movie narration generation is a fascinating
goal that requires significant research.
♻ ☆ MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
Automatic question generation (QG) is essential for AI and NLP, particularly
in intelligent tutoring, dialogue systems, and fact verification. Generating
multiple-choice questions (MCQG) for professional exams, like the United States
Medical Licensing Examination (USMLE), is particularly challenging, requiring
domain expertise and complex multi-hop reasoning for high-quality questions.
However, current large language models (LLMs) like GPT-4 struggle with
professional MCQG due to outdated knowledge, hallucination issues, and prompt
sensitivity, resulting in unsatisfactory quality and difficulty. To address
these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique
and Correction) framework for converting medical cases into high-quality
USMLE-style questions. By integrating expert-driven prompt engineering with
iterative self-critique and self-correction feedback, MCQG-SRefine
significantly enhances human expert satisfaction regarding both the quality and
difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based
automatic metric to replace the complex and costly expert evaluation process,
ensuring reliable and expert-aligned assessments.
comment: Equal contribution for the first two authors
♻ ☆ Advocating Character Error Rate for Multilingual ASR Evaluation
Automatic speech recognition (ASR) systems have traditionally been evaluated
using English datasets, with the word error rate (WER) serving as the
predominant metric. WER's simplicity and ease of interpretation have
contributed to its widespread adoption, particularly for English. However, as
ASR systems expand to multilingual contexts, WER fails in various ways,
particularly with morphologically complex languages or those without clear word
boundaries. Our work documents the limitations of WER as an evaluation metric
and advocates for the character error rate (CER) as the primary metric in
multilingual ASR evaluation. We show that CER avoids many of the challenges WER
faces and exhibits greater consistency across writing systems. We support our
proposition by conducting human evaluations of ASR transcriptions in three
languages: Malayalam, English, and Arabic, which exhibit distinct morphological
characteristics. We show that CER correlates more closely with human judgments
than WER, even for English. To facilitate further research, we release our
human evaluation dataset for future benchmarking of ASR metrics. Our findings
suggest that CER should be prioritized, or at least supplemented, in
multilingual ASR evaluations to account for the varying linguistic
characteristics of different languages.
comment: 4 pages
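Both metrics are the same Levenshtein computation at different granularities,
which makes the contrast easy to demonstrate: in the sketch below, one wrong
inflection costs half the words but only a few percent of the characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[-1][-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref, hyp = "unbelievably good", "unbelievable good"
print(f"WER={wer(ref, hyp):.2f}  CER={cer(ref, hyp):.2f}")  # 0.50 vs ~0.06
```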
♻ ☆ English offensive text detection using CNN based Bi-GRU model
Over the years, the number of users of social media has increased
drastically. People frequently share their thoughts through social platforms,
and this leads to an increase in hate content. In this virtual community,
individuals share their views, express their feelings, and post photos, videos,
blogs, and more. Social networking sites like Facebook and Twitter provide
platforms to share vast amounts of content with a single click. However, these
platforms do not impose restrictions on the uploaded content, which may include
abusive language and explicit images unsuitable for social media. To resolve
this issue, methods are needed to filter out such inappropriate content, and
numerous studies have sought to automate the process. In this paper, we propose
a new Bi-GRU-CNN model to classify whether a text is offensive or not. The
combination of the Bi-GRU and CNN models outperforms existing models.
comment: 5 pages and 6 figures
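The abstract does not give layer sizes or ordering, so the following PyTorch
module is only one plausible reading of a CNN-based Bi-GRU classifier: a
convolution extracts local n-gram features that feed a bidirectional GRU,
with all hyperparameters illustrative.

```python
import torch
import torch.nn as nn

class BiGRUCNN(nn.Module):
    """Embeddings -> 1D convolution -> bidirectional GRU -> binary logit."""
    def __init__(self, vocab=10_000, embed=128, conv_ch=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.conv = nn.Conv1d(embed, conv_ch, kernel_size=3, padding=1)
        self.gru = nn.GRU(conv_ch, hidden, batch_first=True,
                          bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)  # offensive vs. not offensive

    def forward(self, token_ids):                     # (batch, seq)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, embed, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq, conv_ch)
        _, h = self.gru(x)                            # (2, batch, hidden)
        return self.fc(torch.cat([h[0], h[1]], dim=-1)).squeeze(-1)

logits = BiGRUCNN()(torch.randint(0, 10_000, (4, 32)))
print(logits.shape)  # torch.Size([4])
```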
♻ ☆ Improving Reward Models with Synthetic Critiques
Reward models (RMs) play a critical role in aligning language models through
the process of reinforcement learning from human feedback. RMs are trained to
predict a score reflecting human preference, which requires significant time
and cost for human annotation. Additionally, RMs tend to quickly overfit on
superficial features in the training set, hindering their generalization
performance on unseen distributions. We propose a novel approach using
synthetic natural language critiques generated by large language models to
provide additional feedback, evaluating aspects such as instruction following,
correctness, and style. This offers richer signals and more robust features for
RMs to assess and score on. We demonstrate that high-quality critiques improve
the performance and data efficiency of RMs initialized from different
pretrained models, reducing the reliance on costly human annotations.
Furthermore, incorporating critiques improves both the interpretability and
robustness of RM training.
♻ ☆ With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models EMNLP 2024
Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have
demonstrated aptitude as potential substitutes for human participants in
experiments testing psycholinguistic phenomena. However, an understudied
question is to what extent models that only have access to vision and text
modalities are able to implicitly understand sound-based phenomena via abstract
reasoning from orthography and imagery alone. To investigate this, we analyse
the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise
a non-arbitrary link between sounds and concepts) as well as their ability to
"hear" via the interplay of the language and vision modules of open and
closed-source multimodal models. We perform multiple experiments, including
replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism
tasks, and comparing human judgements of linguistic iconicity with those of
LLMs. Our results show that VLMs demonstrate varying levels of agreement with
human labels, and more task information may be required for VLMs versus their
human counterparts for in silico experimentation. We additionally see through
higher maximum agreement levels that Magnitude Symbolism is an easier pattern
for VLMs to identify than Shape Symbolism, and that an understanding of
linguistic iconicity is highly dependent on model size.
comment: Accepted to EMNLP 2024 (Camera Ready)
♻ ☆ Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models
We present a comprehensive three-phase study to examine (1) the cultural
understanding of Large Multimodal Models (LMMs) by introducing DalleStreet, a
large-scale dataset generated by DALL-E 3 and validated by humans, containing
9,935 images of 67 countries and 10 concept classes; (2) the underlying
implicit and potentially stereotypical cultural associations with a cultural
artifact extraction task; and (3) an approach to adapt cultural representation
in an image based on extracted associations using a modular pipeline,
CultureAdapt. We find disparities in cultural understanding at geographic
sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V)
models on DalleStreet and other existing benchmarks, which we try to understand
using over 18,000 artifacts that we identify in association with different
countries. Our findings reveal a nuanced picture of the cultural competence of
LMMs, highlighting the need to develop culture-aware systems. Dataset and code
are available at https://github.com/iamshnoo/crossroads
comment: under review
♻ ☆ What's under the hood: Investigating Automatic Metrics on Meeting Summarization
Meeting summarization has become a critical task considering the increase in
online interactions. While new techniques are introduced regularly, their
evaluation uses metrics not designed to capture meeting-specific errors,
undermining effective evaluation. This paper investigates what the frequently
used automatic metrics capture and which errors they mask by correlating
automatic metric scores with human evaluations across a broad error taxonomy.
We commence with a comprehensive literature review on English meeting
summarization to define key challenges like speaker dynamics and contextual
turn-taking and error types such as missing information and linguistic
inaccuracy, concepts previously loosely defined in the field. We examine the
relationship between characteristic challenges and errors by using annotated
transcripts and summaries from Transformer-based sequence-to-sequence and
autoregressive models on the general-summary subset of the QMSum dataset. Through
experimental validation, we find that different model architectures respond
variably to challenges in meeting transcripts, resulting in differently
pronounced links between challenges and errors. The metrics currently used by
default struggle to capture observable errors, showing only weak to moderate
correlations, while a third of the correlations show trends of error masking.
Only a subset reacts
accurately to specific errors, while most correlations show either
unresponsiveness or failure to reflect the error's impact on summary quality.
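The core of such an analysis is a rank correlation between automatic metric scores and human error annotations; a minimal sketch with invented toy numbers follows.

```python
from scipy.stats import spearmanr

metric_scores = [0.42, 0.55, 0.31, 0.68, 0.47]  # e.g., ROUGE per summary
error_counts = [3, 1, 4, 0, 2]                  # e.g., human-annotated omissions

rho, p = spearmanr(metric_scores, error_counts)
# a weak |rho| despite present errors would indicate error masking
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
```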
♻ ☆ On Debiasing Text Embeddings Through Context Injection
Current advances in Natural Language Processing (NLP) have made it
increasingly feasible to build applications leveraging textual data. Generally,
the core of these applications relies on having a good semantic representation
of text as vectors, produced by embedding models. However, it has been shown
that these
embeddings capture and perpetuate biases already present in text. While a few
techniques have been proposed to debias embeddings, they do not take advantage
of the recent advances in context understanding of modern embedding models. In
this paper, we fill this gap by conducting a review of 19 embedding models by
quantifying their biases and how well they respond to context injection as a
means of debiasing. We show that higher-performing models are more prone to
capturing biases, but are also better at incorporating context. Surprisingly,
we find that while models can easily embed affirmative semantics, they fail at
embedding neutral semantics. Finally, in a retrieval task, we show that biases
in embeddings can lead to non-desirable outcomes. We use our new-found insights
to design a simple algorithm for top $k$ retrieval, where $k$ is dynamically
selected. We show that our algorithm is able to retrieve all relevant gendered
and neutral chunks.
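The two mechanisms discussed above, injecting context before embedding and dynamically selecting k at retrieval time, can be sketched as follows; the embedding model is a random stand-in and the margin-based rule for k is an assumption about one reasonable instantiation, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # stand-in for a real sentence-embedding model
    return rng.normal(size=(len(texts), 384))

def inject_context(text, context="Independent of gender,"):
    # prepend a neutralizing context before embedding (debiasing step)
    return f"{context} {text}"

def dynamic_top_k(query_vec, doc_vecs, margin=0.05):
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)
    # keep every document whose similarity is within `margin` of the best hit
    k = int(np.sum(sims >= sims[order[0]] - margin))
    return order[:k]

docs = [inject_context(d) for d in ["a nurse at work", "a CEO speaking"]]
hits = dynamic_top_k(embed(["who leads the firm?"])[0], embed(docs))
```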
♻ ☆ Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases
Previous work in phonologically and phonetically grounded language generation
has mainly focused on domains such as puns and poetry. In this article, we
present new work on the generation of English tongue twisters - a form of
language that is required to be conditioned on a phoneme level to maximize
sound overlap, while maintaining semantic consistency with an input topic or
phrase and still being grammatically correct. We present TwisterLister, a
pipeline for generating phonologically informed tongue twisters from large
language models (LLMs) that we use to generate TwistList 2.0, the largest
annotated dataset of tongue twisters to date, consisting of 17K+ examples from
a combination of human and LLM authors. Our generation pipeline involves the
use of a phonologically constrained vocabulary alongside LLM prompting to
generate novel, non-derivative tongue twister examples. We additionally present
the results of automatic and human evaluation of smaller models trained on our
generated dataset to demonstrate the extent to which phonologically motivated
language types can be generated without explicit injection of phonological
knowledge. Additionally, we introduce a phoneme-aware constrained decoding
module (PACD) that can be integrated into an autoregressive language model and
demonstrate that this method generates good quality tongue twisters both with
and without fine-tuning the underlying language model. We also design and
implement a range of automatic metrics for the task of tongue twister
generation that are phonologically motivated and capture the unique essence of
tongue twisters, primarily based on phonemic edit distance (PED).
comment: Accepted Final Version to Computational Linguistics
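Phonemic edit distance, the basis of the metrics mentioned above, is Levenshtein distance computed over phoneme sequences rather than characters; a minimal sketch, assuming phoneme strings come from a G2P tool such as a CMUdict lookup:

```python
def phoneme_edit_distance(a, b):
    # a, b: phoneme sequences, e.g. ["K", "IH", "K", "IY"] for "kiki"
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[-1][-1]

print(phoneme_edit_distance(["K", "IH", "K", "IY"], ["B", "UW", "B", "AH"]))  # 4
```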
♻ ☆ Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, Mariya Shmatova
High-quality Machine Translation (MT) evaluation relies heavily on human
judgments. Comprehensive error classification methods, such as Multidimensional
Quality Metrics (MQM), are expensive as they are time-consuming and can only be
done by experts, whose availability may be limited especially for low-resource
languages. On the other hand, just assigning overall scores, like Direct
Assessment (DA), is simpler and faster and can be done by translators of any
level, but is less reliable. In this paper, we introduce Error Span Annotation
(ESA), a human evaluation protocol which combines the continuous rating of DA
with the high-level error severity span marking of MQM. We validate ESA by
comparing it to MQM and DA for 12 MT systems and one human reference
translation (English to German) from WMT23. The results show that ESA offers
faster and cheaper annotations than MQM at the same quality level, without the
requirement of expensive MQM experts.
♻ ☆ BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
While large language models (LLMs) exhibit remarkable capabilities across
various tasks, they encounter potential security risks such as jailbreak
attacks, which exploit vulnerabilities to bypass security measures and generate
harmful outputs. Existing jailbreak strategies mainly focus on maximizing
attack success rate (ASR), frequently neglecting other critical factors,
including the relevance of the jailbreak response to the query and the level of
stealthiness. This narrow focus on single objectives can result in ineffective
attacks that either lack contextual relevance or are easily recognizable. In
this work, we introduce BlackDAN, an innovative black-box attack framework with
multi-objective optimization, aiming to generate high-quality prompts that
effectively facilitate jailbreaking while maintaining contextual relevance and
minimizing detectability. BlackDAN leverages Multiobjective Evolutionary
Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks
across multiple objectives including ASR, stealthiness, and semantic relevance.
By integrating mechanisms like mutation, crossover, and Pareto-dominance,
BlackDAN provides a transparent and interpretable process for generating
jailbreaks. Furthermore, the framework allows customization based on user
preferences, enabling the selection of prompts that balance harmfulness,
relevance, and other factors. Experimental results demonstrate that BlackDAN
outperforms traditional single-objective methods, yielding higher success rates
and improved robustness across various LLMs and multimodal LLMs, while ensuring
jailbreak responses are both relevant and less detectable.
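The selection pressure in NSGA-II-style search rests on a Pareto-dominance test over objective vectors (assumed here to be ASR, stealthiness, and semantic relevance, all maximized); a minimal sketch:

```python
def dominates(a, b):
    """True if a is no worse than b on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(population):
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

pop = [(0.9, 0.2, 0.5), (0.7, 0.6, 0.6), (0.6, 0.1, 0.4)]
print(pareto_front(pop))  # the third candidate is dominated by the second
```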
♻ ☆ Improving Retrieval in Sponsored Search by Leveraging Query Context Signals EMNLP 2024
Accurately retrieving relevant bid keywords for user queries is critical in
Sponsored Search but remains challenging, particularly for short, ambiguous
queries. Existing dense and generative retrieval models often fail to capture
nuanced user intent in these cases. To address this, we propose an approach to
enhance query understanding by augmenting queries with rich contextual signals
derived from web search results and large language models, stored in an online
cache. Specifically, we use web search titles and snippets to ground queries in
real-world information and utilize GPT-4 to generate query rewrites and
explanations that clarify user intent. These signals are efficiently integrated
through a Fusion-in-Decoder based Unity architecture, enabling both dense and
generative retrieval with serving costs on par with traditional context-free
models. To address scenarios where context is unavailable in the cache, we
introduce context glancing, a curriculum learning strategy that improves model
robustness and performance even without contextual signals during inference.
Extensive offline experiments demonstrate that our context-aware approach
substantially outperforms context-free models. Furthermore, online A/B testing
on a prominent search engine across 160+ countries shows significant
improvements in user engagement and revenue.
comment: Accepted to EMNLP 2024 Industry Track. 10 pages, 10 tables, 1 figure
♻ ☆ Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation EMNLP 2024
Ensuring the verifiability of model answers is a fundamental challenge for
retrieval-augmented generation (RAG) in the question answering (QA) domain.
Recently, self-citation prompting was proposed to make large language models
(LLMs) generate citations to supporting documents along with their answers.
However, self-citing LLMs often struggle to match the required format, refer to
non-existent sources, and fail to faithfully reflect LLMs' context usage
throughout the generation. In this work, we present MIRAGE -- Model
Internals-based RAG Explanations -- a plug-and-play approach using model
internals for faithful answer attribution in RAG applications. MIRAGE detects
context-sensitive answer tokens and pairs them with retrieved documents
contributing to their prediction via saliency methods. We evaluate our proposed
approach on a multilingual extractive QA dataset, finding high agreement with
human answer attribution. On open-ended QA, MIRAGE achieves citation quality
and efficiency comparable to self-citation while also allowing for a
finer-grained control of attribution parameters. Our qualitative evaluation
highlights the faithfulness of MIRAGE's attributions and underscores the
promising application of model internals for RAG answer attribution.
comment: Accepted by EMNLP 2024 Main Conference. Code and data released at
https://github.com/Betswish/MIRAGE
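A toy sketch of the internals-based pairing idea: each context-sensitive answer token is attributed to the retrieved document whose tokens carry the highest saliency for predicting it. The saliency matrix below is a random stand-in for scores that would come from gradient- or attention-based methods.

```python
import numpy as np

def attribute(answer_tokens, doc_spans, saliency):
    # saliency: (num_answer_tokens, num_context_tokens)
    citations = {}
    for i, tok in enumerate(answer_tokens):
        doc_scores = [saliency[i, s:e].sum() for (s, e) in doc_spans]
        citations[tok] = int(np.argmax(doc_scores))  # best-supporting document
    return citations

sal = np.random.default_rng(1).random((3, 10))
print(attribute(["Paris", "is", "capital"], [(0, 5), (5, 10)], sal))
```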
♻ ☆ A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus ACL 2024
Natural language inference (NLI), the task of recognizing the entailment
relationship in sentence pairs, is an actively studied topic serving as a proxy
for natural language understanding. Despite the relevance of the task in
building conversational agents and improving text classification, machine
translation and other NLP tasks, to the best of our knowledge, there is no
publicly available NLI corpus for the Romanian language. To this end, we
introduce the first Romanian NLI corpus (RoNLI) comprising 58K training
sentence pairs, which are obtained via distant supervision, and 6K validation
and test sentence pairs, which are manually annotated with the correct labels.
We conduct experiments with multiple machine learning methods based on distant
supervision, ranging from shallow models based on word embeddings to
transformer-based neural networks, to establish a set of competitive baselines.
Furthermore, we improve on the best model by employing a new curriculum
learning strategy based on data cartography. Our dataset and code to reproduce
the baselines are available at https://github.com/Eduard6421/RONLI.
comment: Accepted at ACL 2024 (Main)
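Data cartography summarizes each training example by the mean (confidence) and standard deviation (variability) of its gold-label probability across epochs; one plausible easy-to-hard curriculum derived from these statistics is sketched below (the exact schedule used in the paper may differ).

```python
import numpy as np

def cartography_order(gold_probs_per_epoch):
    # gold_probs_per_epoch: (num_epochs, num_examples) prob. of the gold label
    confidence = gold_probs_per_epoch.mean(axis=0)
    variability = gold_probs_per_epoch.std(axis=0)
    difficulty = (1 - confidence) + variability  # assumed difficulty score
    return np.argsort(difficulty)                # easy examples first

probs = np.array([[0.90, 0.40, 0.60],
                  [0.95, 0.50, 0.30],
                  [0.92, 0.45, 0.70]])
print(cartography_order(probs))  # [0 1 2], easiest example first
```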
♻ ☆ Towards Lifelong Dialogue Agents via Relation-aware Memory Construction and Timeline-augmented Response Generation
Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, Jinyoung Yeo
To achieve lifelong human-agent interaction, dialogue agents need to
constantly memorize perceived information and properly retrieve it for response
generation (RG). While prior work focuses on getting rid of outdated memories
to improve retrieval quality, we argue that such memories provide rich,
important contextual cues for RG (e.g., changes in user behaviors) in long-term
conversations. We present Theanine, a framework for LLM-based lifelong dialogue
agents. Theanine discards memory removal and manages large-scale memories by
linking them based on their temporal and cause-effect relation. Enabled by this
linking structure, Theanine augments RG with memory timelines - series of
memories representing the evolution or causality of relevant past events. Along
with Theanine, we introduce TeaFarm, a counterfactual-driven evaluation scheme
that addresses the limitations of G-Eval and of human evaluation in measuring
memory-augmented dialogue agents. A supplementary video for Theanine and data
for TeaFarm are at https://huggingface.co/spaces/ResearcherScholar/Theanine.
comment: Work in Progress
♻ ☆ Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
The paper introduces a framework for the evaluation of the encoding of
factual scientific knowledge, designed to streamline the manual evaluation
process typically conducted by domain experts. Inferring over and extracting
information from Large Language Models (LLMs) trained on a large corpus of
scientific literature can potentially define a step change in biomedical
discovery, reducing the barriers for accessing and integrating existing medical
evidence. This work explores the potential of LLMs for dialoguing with
biomedical background knowledge, using the context of antibiotic discovery. The
framework involves three evaluation steps that sequentially assess different
aspects of the generated responses: fluency, prompt alignment, semantic
coherence, factual knowledge, and specificity. By splitting these tasks between
non-experts and experts, the framework reduces the effort required from the
latter. The work provides a systematic assessment of the ability of eleven
state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two
prompting-based tasks: chemical compound definition generation and chemical
compound-fungus relation determination. Although recent models have improved in
fluency, factual accuracy is still low and models are biased towards
over-represented entities. The ability of LLMs to serve as biomedical knowledge
bases is questioned, and the need for additional systematic evaluation
frameworks is highlighted. While LLMs are currently not fit for purpose to be
used as biomedical factual knowledge bases in a zero-shot setting, factuality
shows a promising emergent trend as models become domain-specialised, scale up
in size, and incorporate more human feedback.
comment: Accepted at the Journal of Biomedical Informatics, Volume 158,
October 2024, 104724
♻ ☆ Multi-LLM QA with Embodied Exploration
Large language models (LLMs) have grown in popularity due to their natural
language interface and pre-trained knowledge, leading to rapidly increasing
success in question-answering (QA) tasks. More recently, multi-agent systems
with LLM-based agents (Multi-LLM) have been used increasingly for QA.
In these scenarios, the models may each answer the question and reach a
consensus, or each model may be specialized to answer questions from a
different domain.
However, most prior work dealing with Multi-LLM QA has focused on scenarios
where the models are asked in a zero-shot manner or are given information
sources to extract the answer. For question answering of an unknown
environment, embodied exploration of the environment is first needed to answer
the question. This skill is necessary for personalizing embodied AI to
environments such as households. There is a lack of insight into whether a
Multi-LLM system can handle question-answering based on observations from
embodied exploration. In this work, we address this gap by investigating the
use of Multi-Embodied LLM Explorers (MELE) for QA in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a
household environment. We analyze different aggregation methods to generate a
single, final answer for each query: debating, majority voting, and training a
central answer module (CAM). Using CAM, we observe $46\%$ higher accuracy
compared with the other, non-learning-based aggregation methods. We provide
code and the query dataset for further research.
comment: 16 pages, 9 Figures, 5 Tables
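Of the three aggregation strategies, majority voting is the simplest non-learning baseline; a sketch follows (CAM would replace this with a trained module).

```python
from collections import Counter

def majority_vote(answers):
    # answers: one answer string per explorer agent
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["kitchen", "living room", "kitchen"]))  # "kitchen"
```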
♻ ☆ MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction
Molecular property prediction (MPP) is a fundamental and crucial task in drug
discovery. However, prior methods are limited by the requirement for a large
number of labeled molecules and their restricted ability to generalize for
unseen and new tasks, both of which are essential for real-world applications.
To address these challenges, we present MolecularGPT for few-shot MPP. From a
perspective on instruction tuning, we fine-tune large language models (LLMs)
based on curated molecular instructions spanning over 1000 property prediction
tasks. This enables building a versatile and specialized LLM that can be
adapted to novel MPP tasks without any fine-tuning through zero- and few-shot
in-context learning (ICL). MolecularGPT exhibits competitive in-context
reasoning capabilities across 10 downstream evaluation datasets, setting new
benchmarks for few-shot molecular prediction tasks. More importantly, with just
two-shot examples, MolecularGPT can outperform standard supervised graph neural
network methods on 4 out of 7 datasets. It also surpasses state-of-the-art LLM
baselines with up to a 15.7% increase in classification accuracy and a decrease
of 17.9 on regression metrics (e.g., RMSE) in the zero-shot setting. This study
demonstrates the potential of LLMs as effective few-shot molecular property
predictors. The code is available at https://github.com/NYUSHCS/MolecularGPT.
♻ ☆ A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors EMNLP 2024
The relationship between the quality of a string, as judged by a human
reader, and its probability $p(\boldsymbol{y})$ under a language model
undergirds the development of better language models. For example, many popular
algorithms for sampling from a language model have been conceived with the goal
of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that
humans deem of high quality. In this article, we examine the
probability--quality relationship in language models explicitly aligned to
human preferences, e.g., via reinforcement learning from human feedback.
We show that, when sampling corpora from an aligned language model, there
exists a trade-off between the strings' average reward and average
log-likelihood under the prior language model, i.e., the same model before
alignment with human preferences. We provide a formal treatment of this
phenomenon and demonstrate how a choice of sampling adaptor allows for a
selection of how much likelihood we exchange for the reward.
comment: EMNLP 2024
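A toy numeric illustration of this trade-off with a three-string language and a temperature sampling adaptor; the prior, reward values, and exponential-tilt alignment below are invented purely for illustration.

```python
import numpy as np

prior = np.array([0.7, 0.2, 0.1])        # p(y) under the pre-alignment model
reward = np.array([0.0, 1.0, 2.0])       # assumed human-preference reward r(y)
aligned_logits = np.log(prior) + reward  # RLHF-style exponentially tilted model

for temp in (0.5, 1.0, 2.0):
    q = np.exp(aligned_logits / temp)
    q /= q.sum()                          # sampling distribution after the adaptor
    print(f"T={temp}: E[r]={q @ reward:.2f}, E[log p]={q @ np.log(prior):.2f}")
# lower temperature buys more reward at the cost of prior log-likelihood
```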
♻ ☆ Entity Matching using Large Language Models EDBT
Entity matching is the task of deciding whether two entity descriptions refer
to the same real-world entity. Entity matching is a central step in most data
integration pipelines. Many state-of-the-art entity matching methods rely on
pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks
of these models for entity matching are that (i) the models require significant
amounts of task-specific training data and (ii) the fine-tuned models are not
robust concerning out-of-distribution entities. This paper investigates using
generative large language models (LLMs) as a less task-specific training
data-dependent and more robust alternative to PLM-based matchers. The study
covers both hosted LLMs and open-source LLMs that can be run locally. We evaluate these
models in a zero-shot scenario and a scenario where task-specific training data
is available. We compare different prompt designs and the prompt sensitivity of
the models. We show that there is no single best prompt but that the prompt
needs to be tuned for each model/dataset combination. We further investigate
(i) the selection of in-context demonstrations, (ii) the generation of matching
rules, as well as (iii) fine-tuning LLMs using the same pool of training data.
Our experiments show that the best LLMs require no or only a few training
examples to perform comparably to PLMs that were fine-tuned using thousands of
examples. LLM-based matchers further exhibit higher robustness to unseen
entities. We show that GPT-4 can generate structured explanations for matching
decisions and can automatically identify potential causes of matching errors by
analyzing explanations of wrong decisions. We demonstrate that the model can
generate meaningful textual descriptions of the identified error classes, which
can help data engineers to improve entity matching pipelines.
comment: Published in Proceedings of the 28th International Conference on
Extending Database Technology (EDBT), 25th March-28th March, 2025, ISBN
978-3-89318-098-1 on OpenProceedings.org
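A minimal zero-shot prompt of the kind compared in the study; the exact wording is an illustrative assumption, and, as the paper notes, such wording must be tuned per model/dataset combination.

```python
def entity_matching_prompt(entity_a: str, entity_b: str) -> str:
    return (
        "Do the following two entity descriptions refer to the same "
        "real-world entity? Answer 'Yes' or 'No'.\n"
        f"Entity 1: {entity_a}\n"
        f"Entity 2: {entity_b}\n"
        "Answer:"
    )

print(entity_matching_prompt("Apple iPhone 14 Pro, 128GB",
                             "iPhone14 Pro 128 GB by Apple"))
```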
♻ ☆ 2D-TPE: Two-Dimensional Positional Encoding Enhances Table Understanding for Large Language Models
Tables are ubiquitous across various domains for concisely representing
structured information. Empowering large language models (LLMs) to reason over
tabular data represents an actively explored direction. However, since typical
LLMs only support one-dimensional (1D) inputs, existing methods often flatten
the two-dimensional (2D) table structure into a sequence of tokens, which can
severely disrupt the spatial relationships and result in an inevitable loss of
vital contextual information. In this paper, we first empirically demonstrate
the detrimental impact of such flattening operations on the performance of LLMs
in capturing the spatial information of tables through two elaborate proxy
tasks. Subsequently, we introduce a simple yet effective positional encoding
method, termed "2D-TPE" (Two-Dimensional Table Positional Encoding), to
address this challenge. 2D-TPE enables each attention head to dynamically
select a permutation order of tokens within the context for attending to them,
where each permutation represents a distinct traversal mode for the table, such
as column-wise or row-wise traversal. 2D-TPE effectively mitigates the risk of
losing essential spatial information while preserving computational efficiency,
thus better preserving the table structure. Extensive experiments across five
benchmarks demonstrate that 2D-TPE outperforms strong baselines, underscoring
the importance of preserving the table structure for accurate table
comprehension. Comprehensive analysis further reveals the substantially better
scalability of 2D-TPE to large tables than baselines.
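The core idea, that the same flattened table admits several traversal orders which attention heads can choose among, can be illustrated with a toy permutation generator (the learned per-head selection itself is not shown):

```python
def traversal_orders(n_rows, n_cols):
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    row_wise = list(range(len(cells)))  # natural flattening order
    col_wise = sorted(row_wise, key=lambda i: (cells[i][1], cells[i][0]))
    return {"row": row_wise, "col": col_wise}

print(traversal_orders(2, 3))  # {'row': [0,1,2,3,4,5], 'col': [0,3,1,4,2,5]}
```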
♻ ☆ FAME: Towards Factual Multi-Task Model Editing EMNLP 2024
Large language models (LLMs) embed extensive knowledge and utilize it to
perform exceptionally well across various tasks. Nevertheless, outdated
knowledge or factual errors within LLMs can lead to misleading or incorrect
responses, causing significant issues in practical applications. To rectify the
fatal flaw without the necessity for costly model retraining, various model
editing approaches have been proposed to correct inaccurate knowledge within
LLMs in a cost-efficient way. To evaluate these model editing methods, previous
work introduced a series of datasets. However, most of the previous datasets
only contain fabricated data in a single format, which diverges from real-world
model editing scenarios, raising doubts about their usability in practice. To
facilitate the application of model editing in real-world scenarios, we propose
the challenge of practicality. To resolve such challenges and effectively
enhance the capabilities of LLMs, we present FAME, a factual, comprehensive,
and multi-task dataset, which is designed to enhance the practicality of model
editing. We then propose SKEME, a model editing method that uses a novel
caching mechanism to ensure synchronization with the real world. The
experiments demonstrate that SKEME performs excellently across various tasks
and scenarios, confirming its practicality.
comment: 9 pages, 3 figures. This paper has been accepted by EMNLP 2024
♻ ☆ Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Visual Speech Recognition (VSR) aims to transcribe speech into text based on
lip movements alone. As it focuses on visual information to model the speech,
its performance is inherently sensitive to personal lip appearances and
movements, and this makes the VSR models show degraded performance when they
are applied to unseen speakers. In this paper, to remedy the performance
degradation of the VSR model on unseen speakers, we propose prompt tuning
methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically,
motivated by recent advances in Natural Language Processing (NLP), we finetune
prompts on adaptation data of target speakers instead of modifying the
pre-trained model parameters. Unlike previous prompt tuning methods, which are
mainly limited to Transformer-variant architectures, we explore three types of
prompts, namely addition, padding, and concatenation prompts, that can be
applied to VSR models generally composed of a CNN and a Transformer. With the
proposed prompt tuning, we show that the performance of the
pre-trained VSR model on unseen speakers can be largely improved by using a
small amount of adaptation data (e.g., less than 5 minutes), even if the
pre-trained model is already developed with large speaker variations. Moreover,
by analyzing the performance and parameters of different types of prompts, we
investigate when the prompt tuning is preferred over the finetuning methods.
The effectiveness of the proposed method is evaluated on both word- and
sentence-level VSR databases, LRW-ID and GRID.
comment: IEEE TPAMI
♻ ☆ BANTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla
Fabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam, Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Fahim, Md Farhad Alam
The proliferation of transliterated texts in digital spaces has emphasized
the need for detecting and classifying hate speech in languages beyond English,
particularly in low-resource languages. As online discourse can perpetuate
discrimination based on target groups, e.g. gender, religion, and origin,
multi-label classification of hateful content can help in comprehending hate
motivation and enhance content moderation. While previous efforts have focused
on monolingual or binary hate classification tasks, no work has yet addressed
the challenge of multi-label hate speech classification in transliterated
Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate
speech dataset comprising 37.3k samples. The samples are sourced from YouTube
comments, where each instance is labeled with one or more target groups,
reflecting the regional demographic. We establish novel transformer
encoder-based baselines by further pre-training on a transliterated Bangla
corpus. We also propose a novel translation-based LLM prompting strategy for
transliterated text. Experiments reveal that our further pre-trained encoders
achieve state-of-the-art performance on the BanTH dataset, while our
translation-based prompting outperforms other strategies in the zero-shot
setting. The introduction of BanTH not only fills a critical gap in hate speech
research for Bangla but also sets the stage for future exploration into
code-mixed and multi-label classification challenges in underrepresented
languages.
♻ ☆ Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimisation
(DPO) and Identity Preference Optimisation (IPO), have emerged as alternatives
to online Reinforcement Learning from Human Feedback (RLHF) algorithms such as
Proximal Policy Optimisation (PPO) for aligning language models to human
preferences, without the need for explicit reward modelling. These methods
generally aim to increase the likelihood of generating better (preferred)
completions while discouraging worse (non-preferred) ones, while staying close
to the original model's behaviour. In this work, we explore the relationship
between completion likelihood and model performance in state-of-the-art DAAs,
and identify a critical issue of likelihood over-optimisation. Contrary to
expectations, we find that higher likelihood of better completions and larger
margins between better and worse completion likelihoods do not necessarily lead
to better performance, and may even degrade it. Our analysis reveals that while
higher likelihood correlates with better memorisation of factual knowledge
patterns, a slightly lower completion likelihood tends to improve output
diversity, thus leading to better generalisation to unseen scenarios. Moreover,
we identify two key indicators that signal when over-optimised output diversity
begins to harm performance: Decreasing Entropy over Top-k Tokens and
Diminishing Top-k Probability Mass. Our experimental results validate that
these indicators are reliable signs of declining performance under different
regularisations, helping prevent over-optimisation and improve alignment with
human preferences.
comment: Preprint Version
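The two indicators can be computed directly from next-token logits; a sketch follows, where renormalizing the entropy over the top-k tokens is one plausible reading of the definition, not necessarily the paper's exact formula.

```python
import torch

def topk_indicators(logits, k=10):
    probs = torch.softmax(logits, dim=-1)
    topk = torch.topk(probs, k, dim=-1).values
    mass = topk.sum(dim=-1)                         # diminishing => warning sign
    renorm = topk / topk.sum(dim=-1, keepdim=True)
    entropy = -(renorm * renorm.log()).sum(dim=-1)  # decreasing => warning sign
    return entropy, mass

ent, mass = topk_indicators(torch.randn(2, 32000))  # toy vocabulary logits
```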
♻ ☆ MaiBaam Annotation Guidelines
This document provides the annotation guidelines for MaiBaam, a Bavarian
corpus manually annotated with part-of-speech (POS) tags, syntactic
dependencies, and German lemmas. MaiBaam belongs to the Universal Dependencies
(UD) project, and our annotations elaborate on the general and German UD
version 2 guidelines. In this document, we detail how to preprocess and
tokenize Bavarian data, provide an overview of the POS tags and dependencies we
use, explain annotation decisions that would also apply to closely related
languages like German, and lastly we introduce and motivate decisions that are
specific to Bavarian grammar.
comment: Updated for UD v2.15 (German lemmas added)
♻ ☆ Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective
Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu
Accurate interpretation and visualization of human instructions are crucial
for text-to-image (T2I) synthesis. However, current models struggle to capture
semantic variations from word order changes, and existing evaluations, relying
on indirect metrics like text-image similarity, fail to reliably assess these
challenges. A focus on frequent word combinations often obscures poor
performance on complex or uncommon linguistic patterns. To address
these deficiencies, we propose a novel metric called SemVarEffect and a
benchmark named SemVarBench, designed to evaluate the causality between
semantic variations in inputs and outputs in T2I synthesis. Semantic variations
are achieved through two types of linguistic permutations, while avoiding
easily predictable literal variations. Experiments reveal that CogView-3-Plus
and Ideogram 2 perform best, achieving a score of 0.2/1.
Semantic variations in object relations are less well understood than those in
attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal
alignment in
UNet or Transformers plays a crucial role in handling semantic variations, a
factor previously overlooked by a focus on textual encoders. Our work
establishes an effective evaluation framework that advances the T2I synthesis
community's exploration of human instruction understanding. Our benchmark and
code are available at https://github.com/zhuxiangru/SemVarBench.
comment: The only change in the current version update is the replacement of
the template with a more precise one
♻ ☆ I run as fast as a rabbit, can you? A Multilingual Simile Dialogue Dataset ACL 2023
A simile is a figure of speech that compares two different things (called the
tenor and the vehicle) via shared properties. The tenor and the vehicle are
usually connected with comparator words such as "like" or "as". The simile
phenomena are unique and complex in a real-life dialogue scene where the tenor
and the vehicle can be verbal phrases or sentences, mentioned by different
speakers, exist in different sentences, or occur in reversed order. However,
the current simile research usually focuses on similes in a triplet tuple
(tenor, property, vehicle) or a single sentence where the tenor and vehicle are
usually entities or noun phrases, which could not reflect complex simile
phenomena in real scenarios. In this paper, we propose a novel and high-quality
multilingual simile dialogue (MSD) dataset to facilitate the study of complex
simile phenomena. The MSD is the largest manually annotated simile data
($\sim$20K) and it contains both English and Chinese data. Meanwhile, the MSD
data can also be used on dialogue tasks to test the ability of dialogue systems
when using similes. We design 3 simile tasks (recognition, interpretation, and
generation) and 2 dialogue tasks (retrieval and generation) with MSD. For each
task, we provide experimental results from strong pre-trained or
state-of-the-art models. The experiments demonstrate the challenge of MSD and
we have released the data/code on GitHub.
comment: 13 Pages, 1 Figure, 12 Tables, ACL 2023 findings
♻ ☆ Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations
Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu
Despite recent advancements in Large Language Models (LLMs), their
performance on tasks involving long contexts remains sub-optimal. In-Context
Learning (ICL) with few-shot examples may be an appealing solution to enhance
LLM performance in this scenario; however, naïvely adding ICL examples with
long context introduces challenges, including substantial token overhead added
for each few-shot example and context mismatch between the demonstrations and
the target query. In this work, we propose to automatically generate few-shot
examples for long context QA tasks by recycling contexts. Specifically, given a
long input context (1-3k tokens) and a query, we generate additional
query-output pairs from the given context as few-shot examples, while
introducing the context only once. This ensures that the demonstrations are
leveraging the same context as the target query while only adding a small
number of tokens to the prompt. We further enhance each demonstration by
instructing the model to explicitly identify the relevant paragraphs before the
answer, which improves performance while providing fine-grained attribution to
the answer source. We apply our method on multiple LLMs and obtain substantial
improvements (+16 absolute points on average across models) on various QA
datasets with long context, especially when the answer lies within the middle
of the context. Surprisingly, despite introducing only single-hop ICL examples,
LLMs also successfully generalize to multi-hop long-context QA using our
approach.
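A sketch of how such a recycled prompt might be assembled, with the long context included only once and the generated demonstrations appended; the instruction wording and field labels are assumptions.

```python
def recycled_prompt(context, demo_pairs, target_question):
    demos = "\n\n".join(
        f"Q: {q}\nRelevant paragraphs: {paras}\nA: {a}"
        for q, paras, a in demo_pairs
    )
    return (f"Context:\n{context}\n\n{demos}\n\n"
            f"Q: {target_question}\nRelevant paragraphs:")

demos = [("What is discussed in section 2?", "[2]", "The training setup.")]
print(recycled_prompt("<1-3k token document>", demos, "What is the main claim?"))
```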
♻ ☆ Towards Verifiable Text Generation with Evolving Memory and Self-Reflection EMNLP 2024
Despite the remarkable ability of large language models (LLMs) in language
comprehension and generation, they often suffer from producing factually
incorrect information, also known as hallucination. A promising solution to
this issue is verifiable text generation, which prompts LLMs to generate
content with citations for accuracy verification. However, verifiable text
generation is non-trivial due to the focus-shifting phenomenon, the intricate
reasoning needed to align the claim with correct citations, and the dilemma
between the precision and breadth of retrieved documents. In this paper, we
present VTG, an innovative framework for Verifiable Text Generation with
evolving memory and self-reflection. VTG introduces evolving long short-term
memory to retain both valuable documents and recent documents. A two-tier
verifier equipped with an evidence finder is proposed to rethink and reflect on
the relationship between the claim and citations. Furthermore, active retrieval
and diverse query generation are utilized to enhance both the precision and
breadth of the retrieved documents. We conduct extensive experiments on five
datasets across three knowledge-intensive tasks and the results reveal that VTG
significantly outperforms baselines.
comment: EMNLP 2024 Main Conference
♻ ☆ Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
Text-rich visual understanding, the ability to process environments where
dense textual content is integrated with visuals, is crucial for multimodal
large language models (MLLMs) to interact effectively with structured
environments. To enhance this capability, we propose synthesizing general
multimodal instructions from webpage UIs using text-based large language models
(LLMs). Despite lacking direct visual input, text-based LLMs are able to
process structured text representations from webpage accessibility trees. These
instructions are then paired with UI screenshots to train multimodal models. We
introduce MultiUI, a dataset containing 7.3 million samples from 1 million
websites, covering diverse multimodal tasks and UI layouts. Models trained on
MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on
VisualWebBench and a 19.1% boost in element accuracy on the web agent dataset
Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to
non-UI domains, such as document understanding, OCR, and chart interpretation.
These results highlight the broad applicability of web UI data for advancing
text-rich visual understanding across various scenarios.
♻ ☆ Dating ancient manuscripts using radiocarbon and AI-based writing style analysis
Mladen Popović, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar
Determining the chronology of ancient handwritten manuscripts is essential
for reconstructing the evolution of ideas. For the Dead Sea Scrolls, this is
particularly important. However, there is an almost complete lack of
date-bearing manuscripts evenly distributed across the timeline and written in
similar scripts available for palaeographic comparison. Here, we present Enoch,
a state-of-the-art AI-based date-prediction model, trained on the basis of new
radiocarbon-dated samples of the scrolls. Enoch uses established
handwriting-style descriptors and applies Bayesian ridge regression. The
challenge of this study is that the number of radiocarbon-dated manuscripts is
small, while current machine learning requires an abundance of training data.
We show that by using combined angular and allographic writing style feature
vectors and applying Bayesian ridge regression, Enoch could predict the
radiocarbon-based dates from style, supported by leave-one-out validation, with
varied MAEs of 27.9 to 30.7 years relative to the radiocarbon dating. Enoch was
then used to estimate the dates of 135 unseen manuscripts, revealing that 79
per cent of the samples were considered 'realistic' upon palaeographic post-hoc
evaluation. We present a new chronology of the scrolls. The radiocarbon ranges
and Enoch's style-based predictions are often older than the traditionally
assumed palaeographic estimates. In the range of 300-50 BCE, Enoch's date
prediction provides an improved granularity. The study is in line with current
developments in multimodal machine-learning techniques, and the methods can be
used for date prediction in other partially-dated manuscript collections. This
research shows how Enoch's quantitative, probability-based approach can be a
tool for palaeographers and historians, re-dating ancient Jewish key texts and
contributing to current debates on Jewish and Christian origins.
comment: 16 pages of main article, 103 pages of supplementary materials; the
first version of this article is originally prepared in July 2023 after the
completion of all the experiments
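The statistical core, Bayesian ridge regression from writing-style feature vectors to radiocarbon dates with leave-one-out validation, can be sketched as below; the random features are stand-ins for the angular and allographic descriptors.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 16))         # stand-in style feature vectors
y = rng.uniform(-300, 70, size=24)    # stand-in dates (negative = BCE)

errors = []
for train, test in LeaveOneOut().split(X):
    model = BayesianRidge().fit(X[train], y[train])
    errors.append(abs(model.predict(X[test])[0] - y[test][0]))
print(f"Leave-one-out MAE: {np.mean(errors):.1f} years")
```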
♻ ☆ Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue EMNLP 2023
Yuanxing Liu, Wei-Nan Zhang, Yifan Chen, Yuchi Zhang, Haopeng Bai, Fan Feng, Hengbin Cui, Yongbin Li, Wanxiang Che
E-commerce pre-sales dialogue aims to understand and elicit user needs and
preferences for the items they are seeking so as to provide appropriate
recommendations. Conversational recommender systems (CRSs) learn user
representation and provide accurate recommendations based on dialogue context,
but rely on external knowledge. Large language models (LLMs) generate responses
that mimic pre-sales dialogues after fine-tuning, but lack domain-specific
knowledge for accurate recommendations. Intuitively, the strengths of LLM and
CRS in E-commerce pre-sales dialogues are complementary, yet no previous work
has explored this. This paper investigates the effectiveness of combining LLM
and CRS in E-commerce pre-sales dialogues, proposing two collaboration methods:
CRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a
real-world dataset of E-commerce pre-sales dialogues. We analyze the impact of
two collaborative approaches with two CRSs and two LLMs on four tasks of
E-commerce pre-sales dialogue. We find that collaborations between CRS and LLM
can be very effective in some cases.
comment: EMNLP 2023 Findings
♻ ☆ Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models ACL 2024
Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their
superiority in terms of factuality, they do not consistently outperform the
original retrieval-free Language Models (LMs). Our experiments reveal that this
example-level performance inconsistency exists not only between
retrieval-augmented and retrieval-free LMs but also among different retrievers.
To understand this phenomenon, we investigate the degeneration behavior of
RALMs and theoretically decompose it into four categories. Further analysis
based on our decomposition reveals that the innate difference in knowledge
sources and the unpredictable degeneration of the reader model contribute most
to the inconsistency. Drawing from our analysis, we introduce Ensemble of
Retrievers (EoR), a trainable framework that can adaptively retrieve from
different knowledge sources and effectively decrease unpredictable reader
errors. Our experiments on Open Domain Question Answering show that EoR
substantially improves performance over the RALM with a single retriever by
considerably reducing inconsistent behaviors.
comment: ACL 2024 (findings)
♻ ☆ PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data EMNLP 2024
Evaluation of multilingual Large Language Models (LLMs) is challenging due to
a variety of factors -- the lack of benchmarks with sufficient linguistic
diversity, contamination of popular benchmarks into LLM pre-training data and
the lack of local, cultural nuances in translated benchmarks. In this work, we
study human and LLM-based evaluation in a multilingual, multi-cultural setting.
We evaluate 30 models across 10 Indic languages by conducting 90K human
evaluations and 30K LLM-based evaluations and find that models such as GPT-4o
and Llama-3 70B consistently perform best for most Indic languages. We build
leaderboards for two evaluation settings - pairwise comparison and direct
assessment and analyze the agreement between humans and LLMs. We find that
humans and LLMs agree fairly well in the pairwise setting, but the agreement
drops for direct assessment evaluation, especially for languages such as
Bengali and Odia. We also check for various biases in human and LLM-based
evaluation
and find evidence of self-bias in the GPT-based evaluator. Our work presents a
significant step towards scaling up multilingual evaluation of LLMs.
comment: Accepted to EMNLP 2024
♻ ☆ Graph Neural Network Enhanced Retrieval for Question Answering of LLMs
Retrieval augmented generation has revolutionized large language model (LLM)
outputs by providing factual supports. Nevertheless, it struggles to capture
all the necessary knowledge for complex reasoning questions. Existing retrieval
methods typically divide reference documents into passages, treating them in
isolation. These passages, however, are often interrelated, such as passages
that are contiguous or share the same keywords. Therefore, it is crucial to
recognize such relatedness for enhancing the retrieval process. In this paper,
we propose a novel retrieval method, called GNN-Ret, which leverages graph
neural networks (GNNs) to enhance retrieval by exploiting the relatedness
between passages. Specifically, we first construct a graph of passages by
connecting passages that are structure-related or keyword-related. A graph
neural network (GNN) is then leveraged to exploit the relationships between
passages and improve the retrieval of supporting passages. Furthermore, we
extend our method to handle multi-hop reasoning questions using a recurrent
graph neural network (RGNN), named RGNN-Ret. At each step, RGNN-Ret integrates
the graphs of passages from previous steps, thereby enhancing the retrieval of
supporting passages. Extensive experiments on benchmark datasets demonstrate
that GNN-Ret achieves higher accuracy for question answering with a single
query of LLMs than strong baselines that require multiple queries, and RGNN-Ret
further improves accuracy and achieves state-of-the-art performance, with up to
10.4% accuracy improvement on the 2WikiMQA dataset.
comment: Under review
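The passage graph behind GNN-Ret connects contiguous passages and passages that share keywords; a minimal construction sketch (the keyword test here is a simple set-overlap heuristic, an assumption):

```python
def build_passage_graph(passages, keywords, min_shared=1):
    edges = set()
    for i in range(len(passages) - 1):
        edges.add((i, i + 1))  # structure-related: contiguous passages
    for i in range(len(passages)):
        for j in range(i + 2, len(passages)):
            if len(keywords[i] & keywords[j]) >= min_shared:
                edges.add((i, j))  # keyword-related passages
    return edges

kw = [{"llm", "retrieval"}, {"gnn"}, {"retrieval", "graph"}]
print(build_passage_graph(["p0", "p1", "p2"], kw))  # {(0,1), (1,2), (0,2)}
```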
♻ ☆ On the Use of Large Language Models to Generate Capability Ontologies
Capability ontologies are increasingly used to model functionalities of
systems or machines. The creation of such ontological models with all
properties and constraints of capabilities is very complex and can only be done
by ontology experts. However, Large Language Models (LLMs) have shown that they
can generate machine-interpretable models from natural language text input and
thus support engineers and ontology experts. Therefore, this paper investigates
how LLMs can be used to create capability ontologies. We present a study with a
series of experiments in which capabilities with varying complexities are
generated using different prompting techniques and with different LLMs. Errors
in the generated ontologies are recorded and compared. To analyze the quality
of the generated ontologies, a semi-automated approach based on RDF syntax
checking, OWL reasoning, and SHACL constraints is used. The results of this
study are very promising because even for complex capabilities, the generated
ontologies are almost free of errors.
comment: © 2024 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
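The semi-automated checks named above map naturally onto existing tooling: rdflib for RDF syntax parsing and pySHACL (which can also apply RDFS/OWL-RL inference) for constraint validation. A minimal sketch, where the Turtle snippet and the absence of a real shapes graph are illustrative simplifications:

```python
from rdflib import Graph
from pyshacl import validate

candidate_ttl = """
@prefix ex: <http://example.org/> .
ex:Drilling a ex:Capability .
"""

g = Graph().parse(data=candidate_ttl, format="turtle")  # 1) RDF syntax check
conforms, _, report = validate(g,                       # 2) SHACL validation
                               shacl_graph=None,        # real shapes would go here
                               inference="rdfs")        # 3) lightweight reasoning
print(conforms)
print(report)
```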
♻ ☆ Toward a Method to Generate Capability Ontologies from Natural Language Descriptions
To achieve a flexible and adaptable system, capability ontologies are
increasingly leveraged to describe functions in a machine-interpretable way.
However, modeling such complex ontological descriptions is still a manual and
error-prone task that requires a significant amount of effort and ontology
expertise. This contribution presents an innovative method to automate
capability ontology modeling using Large Language Models (LLMs), which have
proven to be well suited for such tasks. Our approach requires only a natural
language description of a capability, which is then automatically inserted into
a predefined prompt using a few-shot prompting technique. After prompting an
LLM, the resulting capability ontology is automatically verified through
various steps in a loop with the LLM to check the overall correctness of the
capability ontology. First, a syntax check is performed, then a check for
contradictions, and finally a check for hallucinations and missing ontology
elements. Our method greatly reduces manual effort, as only the initial natural
language description and a final human review and possible correction are
necessary, thereby streamlining the capability ontology generation process.
comment: © 2024 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
♻ ☆ Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Given recent advances in generative AI technology, a key question is how
large language models (LLMs) can enhance acoustic modeling tasks using text
decoding results from a frozen, pretrained automatic speech recognition (ASR)
model. To explore new capabilities in language modeling for speech processing,
we introduce the generative speech transcription error correction (GenSEC)
challenge. This challenge comprises three post-ASR language modeling tasks: (i)
post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion
recognition. These tasks aim to emulate future LLM-based agents handling
voice-based interfaces while remaining accessible to a broad audience by
utilizing open pretrained language models or agent-based APIs. We also discuss
insights from baseline evaluations, as well as lessons learned for designing
future evaluations.
comment: IEEE SLT 2024. The initial draft version has been done in December
2023. Post-ASR Text Processing and Understanding Community and LlaMA-7B
pre-training correction model:
https://huggingface.co/GenSEC-LLM/SLT-Task1-Llama2-7b-HyPo-baseline
♻ ☆ VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment EMNLP 2024
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu
As large vision-language models (LVLMs) evolve rapidly, the demand for
high-quality and diverse data to align these models becomes increasingly
crucial. However, the creation of such data with human supervision proves
costly and time-intensive. In this paper, we investigate the efficacy of AI
feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the
first large-scale vision-language feedback dataset, comprising over 82K
multi-modal instructions and comprehensive rationales generated by
off-the-shelf models without human annotations. To evaluate the effectiveness
of AI feedback for vision-language alignment, we train Silkie, an LVLM
fine-tuned via direct preference optimization on VLFeedback. Silkie showcases
exceptional performance regarding helpfulness, visual faithfulness, and safety
metrics. It outperforms its base model by 6.9% and 9.5% in perception and
cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits
enhanced resilience against red-teaming attacks. Furthermore, our analysis
underscores the advantage of AI feedback, particularly in fostering preference
diversity to deliver more comprehensive improvements. Our dataset, training
code and models are available at https://vlf-silkie.github.io.
comment: EMNLP 2024 Main Conference camera-ready version (fixed small typos).
This article supersedes arXiv:2312.10665
♻ ☆ WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off
Watermarking is a technical means to dissuade malfeasant usage of Large
Language Models. This paper proposes a novel watermarking scheme, so-called
WaterMax, that enjoys high detectability while sustaining the quality of the
generated text of the original LLM. Its new design leaves the LLM untouched (no
modification of the weights, logits, temperature, or sampling technique).
WaterMax balances robustness and complexity, in contrast to the watermarking
techniques in the literature, which inherently provoke a trade-off between
quality and robustness. Its performance is both theoretically proven and
experimentally
validated. It outperforms all the SotA techniques under the most complete
benchmark suite. Code available at https://github.com/eva-giboulot/WaterMax.
♻ ☆ LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback
Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, Baobao Chang
In recent progress, mathematical verifiers have achieved success in
mathematical reasoning tasks by validating the correctness of solutions
generated by policy models. However, existing verifiers are trained with binary
classification labels, which are not informative enough for the model to
accurately assess the solutions. To mitigate the aforementioned insufficiency
of binary labels, we introduce step-wise natural language feedback as rationale
labels, that is, the correctness of each step and the detailed explanations. In
this paper, we propose Math-Minos, a natural language feedback-enhanced
verifier by constructing automatically generated training data and a two-stage
training paradigm for effective training and efficient inference. Our
experiments reveal that a small set of natural language feedback can
significantly boost the performance of the verifier in both verification and
reinforcement learning. We have released the code and data for further
exploration.
comment: 15 pages
♻ ☆ PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations NeurIPS '24
Expert-designed close-ended benchmarks are indispensable in assessing the
knowledge capacity of large language models (LLMs). Despite their widespread
use, concerns have mounted regarding their reliability due to limited test
scenarios and an unavoidable risk of data contamination. To rectify this, we
present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge
capacity through knowledge-invariant perturbations. These
perturbations employ human-like restatement techniques to generate on-the-fly
test samples from static benchmarks, meticulously retaining knowledge-critical
content while altering irrelevant details. Our toolkit further includes a suite
of response consistency analyses that compare performance on raw vs.
perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six
representative LLMs are re-evaluated using PertEval. Results reveal
significantly inflated performance of the LLMs on raw benchmarks, including an
absolute 25.8% overestimation for GPT-4. Additionally, through a nuanced
response pattern analysis, we discover that PertEval retains LLMs' uncertainty
about specious knowledge and reveals their potential rote memorization of
correct options, which leads to overestimated performance. We also find that the
detailed response consistency analyses by PertEval can illuminate various
weaknesses in existing LLMs' knowledge mastery and guide their refinement. Our
findings provide insights for advancing more robust and
genuinely knowledgeable LLMs. Our code is available at
\url{https://github.com/aigc-apps/PertEval}.
comment: Accepted by NeurIPS '24 D&B Spotlight; 28 pages, 15 figures, 14
tables
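A minimal sketch of the raw-vs-perturbed comparison (not the toolkit's actual
API): answer each item in its raw form and in a knowledge-invariant rewrite,
then compare accuracy and per-item agreement.

    # Hypothetical consistency check between answers on raw items and on
    # their knowledge-invariant perturbations; metric names are illustrative.
    def consistency_report(raw_answers, pert_answers, gold):
        n = len(gold)
        raw_acc = sum(a == g for a, g in zip(raw_answers, gold)) / n
        pert_acc = sum(a == g for a, g in zip(pert_answers, gold)) / n
        agreement = sum(a == b for a, b in zip(raw_answers, pert_answers)) / n
        # A large raw_acc - pert_acc gap hints at inflated benchmark scores.
        return {"raw_acc": raw_acc, "pert_acc": pert_acc, "agreement": agreement}

    print(consistency_report(["A", "B", "C"], ["A", "D", "C"], ["A", "B", "C"]))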
♻ ☆ SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis
Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Mingjun Xu, Jin Huang, Xi Fang, Jiaxi Zhuang, Yuqi Yin, Yaqi Li, Changhong Chen, Zheng Cheng, Zifeng Zhao, Linfeng Zhang, Guolin Ke
Recent breakthroughs in Large Language Models (LLMs) have revolutionized
scientific literature analysis. However, existing benchmarks fail to adequately
evaluate the proficiency of LLMs in this domain, particularly in scenarios
requiring higher-level abilities beyond mere memorization and the handling of
multimodal data. In response to this gap, we introduce SciAssess, a benchmark
specifically designed for the comprehensive evaluation of LLMs in scientific
literature analysis. It aims to thoroughly assess the efficacy of LLMs by
evaluating their capabilities in Memorization (L1), Comprehension (L2), and
Analysis \& Reasoning (L3). It encompasses a variety of tasks drawn from
diverse scientific fields, including biology, chemistry, materials science, and
medicine. To ensure the reliability of SciAssess, rigorous quality control
measures have been implemented, ensuring accuracy, anonymization, and
compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting
their strengths and areas for improvement. We hope this evaluation supports the
ongoing development of LLM applications in scientific literature analysis.
SciAssess and its resources are available at
https://github.com/sci-assess/SciAssess.
♻ ☆ Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge
Li Zhou, Taelin Karidi, Wanlong Liu, Nicolas Garneau, Yong Cao, Wenyu Chen, Haizhou Li, Daniel Hershcovich
Recent studies have highlighted the presence of cultural biases in Large
Language Models (LLMs), yet often lack a robust methodology to dissect these
phenomena comprehensively. Our work aims to bridge this gap by delving into the
Food domain, a universally relevant yet culturally diverse aspect of human
life. We introduce FmLAMA, a multilingual dataset centered on food-related
cultural facts and variations in food practices. We analyze LLMs across various
architectures and configurations, evaluating their performance in both
monolingual and multilingual settings. By leveraging templates in six different
languages, we investigate how LLMs interact with language-specific and cultural
knowledge. Our findings reveal that (1) LLMs demonstrate a pronounced bias
towards food knowledge prevalent in the United States; (2) Incorporating
relevant cultural context significantly improves LLMs' ability to access
cultural knowledge; (3) The efficacy of LLMs in capturing cultural nuances is
highly dependent on the interplay between the probing language, the specific
model architecture, and the cultural context in question. This research
underscores the complexity of integrating cultural understanding into LLMs and
emphasizes the importance of culturally diverse datasets to mitigate biases and
enhance model performance across different cultural domains.
comment: cultural bias analysis, cultural knowledge probing, large language
models, cultural NLP
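A rough sketch of template-based cultural probing in this spirit (the
templates and scoring function are assumptions, not FmLAMA's actual prompts):

    # Hypothetical cloze-style probe: fill a language-specific template and
    # score candidate ingredients with an LM log-likelihood. score_fn is any
    # callable mapping text to a score; higher means more plausible to the LM.
    TEMPLATES = {
        "en": "{food} is a traditional dish that contains {ingredient}.",
        "zh": "{food} 是一道传统菜肴，里面含有{ingredient}。",
    }

    def probe(score_fn, lang, food, candidates):
        scores = {c: score_fn(TEMPLATES[lang].format(food=food, ingredient=c))
                  for c in candidates}
        return max(scores, key=scores.get)

    # e.g. probe(score_fn, "en", "Mapo tofu", ["coffee", "tofu", "chili"])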
♻ ☆ Synergizing In-context Learning with Hints for End-to-end Task-oriented Dialog Systems EMNLP2024
End-to-end Task-Oriented Dialog (TOD) systems typically require extensive
training datasets to perform well. In contrast, large language model (LLM)
based TOD systems can excel even with limited data due to their ability to
learn tasks through in-context exemplars. However, these models are not aligned
with the response style of the training data and often generate overly
comprehensive responses, making it difficult for users to grasp the information
quickly. In
response, we propose SyncTOD that synergizes LLMs with task-specific hints to
improve alignment in low-data settings. SyncTOD employs small auxiliary models
to provide hints and select exemplars for in-context prompts. With ChatGPT,
SyncTOD achieves superior performance compared to LLM-based baselines and SoTA
models in low-data settings, while retaining competitive performance in
full-data settings.
comment: EMNLP2024 Camera-Ready Version
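A minimal sketch of hint-augmented in-context prompting (the prompt format and
hint types are assumptions, not SyncTOD's exact design):

    # Hypothetical prompt assembly: small auxiliary models supply hints
    # (e.g., expected entity types) and retrieve exemplars.
    def build_prompt(query, exemplars, hints):
        parts = ["Reply concisely, matching the style of the examples."]
        for ex in exemplars:
            parts.append("User: %s\nAssistant: %s" % (ex["user"], ex["reply"]))
        parts.append("Hints: " + ", ".join(hints))
        parts.append("User: %s\nAssistant:" % query)
        return "\n\n".join(parts)

    print(build_prompt("Book a table for two",
                       [{"user": "Find me a hotel", "reply": "Which city?"}],
                       ["restaurant domain", "ask for time"]))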
♻ ☆ Hyper-multi-step: The Truth Behind Difficult Long-context Tasks
Yijiong Yu, Ma Xiufa, Fang Jianwei, Zhi Xu, Su Guangyao, Wang Jiancheng, Yongfeng Huang, Zhixiao Qi, Wei Wang, Weifeng Liu, Ran Chen, Ji Pei
Long-context language models (LCLMs), characterized by their extensive context
windows, are becoming increasingly popular. Meanwhile, many long-context
benchmarks present challenging tasks that even the most advanced LCLMs struggle
to complete. However, the underlying sources of various challenging
long-context tasks have seldom been studied. To bridge this gap, we conduct
experiments indicating that the difficulty stems primarily from two basic issues:
"multi-matching retrieval," which requires the simultaneous retrieval of
multiple items, and "logic-based retrieval," which necessitates logical
judgment within retrieval criteria. These two problems, while seemingly
straightforward, actually exceed the capabilities of LCLMs because they are
proven to be hyper-multi-step (demanding numerous steps to solve) in nature.
This finding could explain why LLMs struggle with more advanced long-context
tasks, providing a more accurate perspective for rethinking solutions for them.
comment: Our code is publicly available at
https://github.com/yuyijiong/hard_retrieval_for_llm and the datasets are at
https://huggingface.co/datasets/yuyijiong/difficult_retrieval
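A toy generator for a multi-matching retrieval probe in this spirit (the task
template is our assumption, not the paper's dataset format):

    # Hypothetical synthetic task: hide several matching facts in shuffled
    # distractors and ask for *all* matches, which forces many retrieval steps.
    import random

    def make_task(n_facts=50, n_targets=5, seed=0):
        rng = random.Random(seed)
        facts, answers = [], []
        for i in range(n_facts):
            color = "red" if i < n_targets else rng.choice(["blue", "green"])
            facts.append("Box %d contains a %s ball." % (i, color))
            if color == "red":
                answers.append(i)
        rng.shuffle(facts)
        question = "List every box that contains a red ball."
        return " ".join(facts), question, answers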
♻ ☆ Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models EMNLP 2024
As a promising paradigm to collaboratively train models with decentralized
data, Federated Learning (FL) can be exploited to fine-tune Large Language
Models (LLMs). However, because LLMs are huge and the scale of the training
data grows accordingly, fine-tuning incurs tremendous computation and
communication costs. Moreover, the training data is generally non-Independent
and Identically Distributed (non-IID), which requires adaptive data processing
within each device. Although Low-Rank Adaptation (LoRA) can significantly
reduce the number of parameters to update during fine-tuning, transferring the
low-rank parameters of all the layers in LLMs still takes an unaffordable
amount of time. In this paper, we propose a Fisher Information-based Efficient
Curriculum Federated Learning framework (FibecFed) with two novel methods,
i.e., adaptive federated curriculum learning and efficient sparse parameter
update. First, we propose a Fisher information-based method to adaptively
sample data within each device to improve the effectiveness of the FL
fine-tuning process. Second, we dynamically select the proper layers for global
aggregation and sparse parameters for local update with LoRA so as to improve
the efficiency of the FL fine-tuning process. Extensive experimental results
based on 10 datasets demonstrate that FibecFed yields excellent performance (up
to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61%
faster) compared with 17 baseline approaches.
comment: 27 pages, 8 figures, 14 tables, to appear in EMNLP 2024
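As a sketch of the Fisher-information idea only (not FibecFed itself), a
per-sample score can be approximated by the squared gradient norm of the loss,
i.e., the trace of the empirical Fisher:

    # Hypothetical per-sample empirical Fisher score; devices could rank
    # their local data by this score to build an easy-to-hard curriculum.
    import torch

    def fisher_score(model, loss_fn, batch):
        model.zero_grad()
        loss_fn(model, batch).backward()
        # Sum of squared gradients approximates the trace of the empirical
        # Fisher information for this sample.
        return sum((p.grad ** 2).sum().item()
                   for p in model.parameters() if p.grad is not None)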
♻ ☆ QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis
Discovering meaningful insights from a large dataset, known as Exploratory
Data Analysis (EDA), is a challenging task that requires thorough exploration
and analysis of the data. Automated Data Exploration (ADE) systems use
goal-oriented methods with Large Language Models and Reinforcement Learning
towards full automation. However, these methods require human involvement to
anticipate goals that may limit insight extraction, while fully automated
systems demand significant computational resources and retraining for new
datasets. We introduce QUIS, a fully automated EDA system that operates in two
stages: insight generation (ISGen) driven by question generation (QUGen). The
QUGen module generates questions in iterations, refining them from previous
iterations to enhance coverage without human intervention or manually curated
examples. The ISGen module analyzes data to produce multiple relevant insights
in response to each question, requiring no prior training and enabling QUIS to
adapt to new datasets.
comment: Accepted for EMNLP 2024 Industry Track
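A compact sketch of the two-stage loop (prompts and function names are
assumptions; ask_llm stands in for any LLM call):

    # Hypothetical QUGen/ISGen loop: iteratively propose questions that
    # avoid prior ones, then generate an insight per question.
    def quis_loop(data_summary, ask_llm, n_iters=3):
        questions, insights = [], []
        for _ in range(n_iters):
            prompt = ("Data summary: %s\nAlready asked: %s\n"
                      "Propose 3 new analysis questions covering new angles."
                      % (data_summary, questions))
            new_qs = [q for q in ask_llm(prompt).splitlines() if q.strip()]
            for q in new_qs:
                insights.append(ask_llm("Given the data summary, answer: " + q))
            questions.extend(new_qs)
        return insights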
♻ ☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
Scientific literature understanding is crucial for extracting targeted
information and garnering insights, thereby significantly advancing scientific
discovery. Despite the remarkable success of Large Language Models (LLMs), they
face challenges in scientific literature understanding, primarily due to (1) a
lack of scientific knowledge and (2) unfamiliarity with specialized scientific
tasks.
To develop an LLM specialized in scientific literature understanding, we
propose a hybrid strategy that integrates continual pre-training (CPT) and
supervised fine-tuning (SFT), to simultaneously infuse scientific domain
knowledge and enhance instruction-following capabilities for domain-specific
tasks. In this process, we identify two key challenges: (1) constructing
high-quality CPT corpora, and (2) generating diverse SFT instructions. We
address these challenges through a meticulous pipeline, including PDF text
extraction, parsing content error correction, quality filtering, and synthetic
instruction creation. Applying this strategy, we present a suite of LLMs:
SciLitLLM, specialized in scientific literature understanding. These models
demonstrate promising performance on scientific literature understanding
benchmarks.
Our contributions are threefold: (1) We present an effective framework that
integrates CPT and SFT to adapt LLMs to scientific literature understanding,
which can also be easily adapted to other domains. (2) We propose an LLM-based
synthesis method to generate diverse and high-quality scientific instructions,
resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning
in less-represented scientific domains. (3) SciLitLLM achieves promising
performance improvements on scientific literature understanding benchmarks.
♻ ☆ SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Attention is the cornerstone of modern Large Language Models (LLMs). Yet its
quadratic complexity limits the efficiency and scalability of LLMs, especially
for those with a long-context window. A promising approach addressing this
limitation is to leverage the sparsity in attention. However, existing
sparsity-based solutions predominantly rely on predefined patterns or
heuristics to approximate sparsity. This practice falls short of fully
capturing the dynamic nature of attention sparsity in language-based tasks.
This paper argues that attention sparsity should be learned rather than
predefined. To this end, we design SeerAttention, a new attention mechanism
that augments conventional attention with a learnable gate that adaptively
selects significant blocks in an attention map and treats the remaining blocks
as sparse. Such
block-level sparsity effectively balances accuracy and speedup. To enable
efficient learning of the gating network, we develop a customized
FlashAttention implementation that extracts the block-level ground truth of
the attention map with minimal overhead. SeerAttention not only applies to
post-training, but also excels in long-context fine-tuning. Our results show
that at post-training stages, SeerAttention significantly outperforms
state-of-the-art static or heuristic-based sparse attention methods, while also
being more versatile and flexible to adapt to varying context lengths and
sparsity ratios. When applied to long-context fine-tuning with YaRN,
SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context
length with minimal perplexity loss, offering a 5.67x speedup over
FlashAttention-2.
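A rough sketch of block-level gating (the pooling choice, shapes, and top-k
selection are assumptions, not the paper's exact design):

    # Hypothetical block gate: mean-pool queries/keys into blocks, score
    # block pairs, and keep only the top-k key blocks per query block.
    import torch

    def block_gate_mask(q, k, block=64, keep=4):
        qb = q.unfold(0, block, block).mean(-1)   # (n_blocks, dim)
        kb = k.unfold(0, block, block).mean(-1)
        scores = qb @ kb.T                        # (n_blocks, n_blocks)
        topk = scores.topk(min(keep, kb.shape[0]), dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(1, topk, True)              # True = attend to this block
        return mask

    print(block_gate_mask(torch.randn(512, 128), torch.randn(512, 128)).shape)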
♻ ☆ P3: A Policy-Driven, Pace-Adaptive, and Diversity-Promoted Framework for data pruning in LLM Training
In the rapidly advancing field of Large Language Models (LLMs), effectively
leveraging existing datasets during fine-tuning to maximize the model's
potential is of paramount importance. This paper introduces P3, an adaptive
framework aimed at optimizing the task-specific fine-tuning process through
iterative data pruning. P3 consists of three key components: (1) Policy-driven
Difficulty Measurement, which dynamically assesses data difficulty based on the
model's real-time performance, replacing static metrics with adaptable
evaluations; (2) Pace-Adaptive Selection, leveraging self-paced learning to
progressively introduce more challenging data, thereby enhancing model
capability; (3) Diversity Promotion, incorporating a Determinantal Point
Process (DPP) to ensure data diversity across epochs, enriching the learning
process. We validate P3 on two reasoning benchmarks, APPS and MATH, demonstrating
significant improvements over traditional data pruning methods. By advancing
dynamic data selection and utilization strategies, P3 contributes both a
theoretical framework and concrete approach to fully exploit existing data for
LLMs' performance improvement, offering utility across diverse tasks.
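As a sketch of the diversity component only, a common greedy stand-in for
DPP-based selection picks each next sample farthest from those already chosen
(this approximation is our assumption, not necessarily P3's exact solver):

    # Hypothetical diversity-promoting selection over sample embeddings.
    import numpy as np

    def greedy_diverse(embeddings, k):
        emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        chosen = [0]
        while len(chosen) < k:
            sims = emb @ emb[chosen].T          # cosine sims to chosen set
            max_sim = sims.max(axis=1)
            max_sim[chosen] = np.inf            # never re-pick a sample
            chosen.append(int(max_sim.argmin()))
        return chosen

    print(greedy_diverse(np.random.rand(100, 16), 5))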
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various
generation tasks by leveraging latent variables to learn data distributions and
generate high-quality samples. Despite the field of explainable AI making
strides in interpreting machine learning models, understanding latent variables
in generative models remains challenging. This paper introduces
LatentExplainer, a framework for automatically generating semantically
meaningful explanations of latent variables in deep generative models.
LatentExplainer tackles three main challenges: inferring the meaning
of latent variables, aligning explanations with inductive biases, and handling
varying degrees of explainability. Our approach perturbs latent variables,
interprets the changes in the generated data, and uses multi-modal large
language models (MLLMs) to produce human-understandable explanations. We
evaluate our
proposed method on several real-world and synthetic datasets, and the results
demonstrate superior performance in generating high-quality explanations for
latent variables. The results highlight the effectiveness of incorporating
inductive biases and uncertainty quantification, significantly enhancing model
interpretability.
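A minimal sketch of the perturb-then-describe loop (decode and describe are
assumed callables; this is not the paper's implementation):

    # Hypothetical latent traversal: vary one latent dimension, decode each
    # variant, and ask an MLLM to describe what changed across the samples.
    import numpy as np

    def explain_dim(decode, describe, z, dim, deltas=(-2.0, 0.0, 2.0)):
        samples = []
        for d in deltas:
            z_mod = np.asarray(z, dtype=float).copy()
            z_mod[dim] += d
            samples.append(decode(z_mod))
        return describe(samples)  # e.g., "this dimension controls brightness"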
♻ ☆ Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns
LLMs' performance on complex tasks is still unsatisfactory. A key issue is
that LLMs currently learn in a data-driven manner, while instructions for
these complex tasks are both scarce and hard to collect or construct. In
contrast, a prominent phenomenon is that LLMs can learn rather quickly on
simpler tasks for which adequate prior knowledge was captured during the
pretraining stage. Thus, if the prerequisites and mechanism of such rapid
generalization could be elucidated, they could enhance the efficiency and
effectiveness with which LLMs learn complex tasks. To this end, in this paper
we employ a gradient-based method to dissect how the SFT process adapts LLMs
to downstream tasks through the lens of attention patterns. We find that: (1)
LLMs selectively activate task-specific attention heads during SFT; (2)
activation patterns for complex tasks are combinations of basic task patterns;
and (3) changes in a few parameters can significantly impact activation
patterns after SFT on a small number of samples. Based on these insights, we
conduct experiments to actually enhance the efficiency and effectiveness of SFT.
comment: in review
♻ ☆ From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
Motivated by the in-context learning (ICL) capabilities of Large Language
Models (LLMs), multimodal LLMs with an additional visual modality also exhibit
similar ICL abilities when multiple image-text pairs are provided as
demonstrations. However, relatively little work has investigated the
principles behind how and why multimodal ICL works. We conduct a systematic and
principled evaluation of multimodal ICL for models of different scales on a
broad spectrum of new yet critical tasks. Through perturbations over different
modality information, we show that modalities matter differently across tasks
in multimodal ICL. Guided by task-specific modality impact, we recommend
modality-driven demonstration strategies to boost ICL performance. We also find
that models may follow inductive biases from multimodal ICL even if they are
rarely seen in or contradict semantic priors from pretraining data. Our
principled analysis provides a comprehensive way of understanding the role of
demonstrations in multimodal in-context learning, and sheds light on
effectively improving multimodal ICL on a wide range of tasks.
♻ ☆ Everything is Editable: Extend Knowledge Editing to Unstructured Data in Large Language Models
Recent knowledge editing methods have primarily focused on modifying
structured knowledge in large language models. However, this task setting
overlooks the fact that a significant portion of real-world knowledge is stored
in an unstructured format, characterized by long-form content, noise, and a
complex yet comprehensive nature. Techniques like local layer key-value storage
and term-driven optimization, as used in previous methods like MEMIT, are not
effective for handling unstructured knowledge. To address these challenges, we
propose a novel Unstructured Knowledge Editing method, namely UnKE, which
extends previous assumptions in the layer dimension and token dimension.
Firstly, in the layer dimension, we propose non-local block key-value storage
to replace local layer key-value storage, increasing the representation ability
of key-value pairs and incorporating attention layer knowledge. Secondly, in
the token dimension, we replace term-driven optimization with cause-driven
optimization, which edits the last token directly while preserving context,
avoiding the need to locate terms and preventing the loss of context
information. Results on the newly proposed unstructured knowledge editing dataset
(UnKEBench) and traditional structured datasets demonstrate that UnKE achieves
remarkable performance, surpassing strong baselines. In addition, UnKE has
robust batch editing and sequential editing capabilities.
♻ ☆ Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference
Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Guanchen Li, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum
Large Language Models (LLMs) inherently use autoregressive decoding, which
lacks parallelism in inference and results in significantly slow inference
speed. While methods such as Medusa construct parallelized heads, they lack
adequate information interaction across different prediction positions. To
overcome this limitation, we introduce Amphista, an enhanced speculative
decoding framework that builds upon Medusa. Specifically, Amphista introduces
an Auto-embedding Block capable of parallel inference, incorporating
bi-directional attention to enable interaction between different drafting
heads. Additionally, Amphista integrates Staged Adaptation Layers, which ensure
a seamless transition of semantic information from the target model's
autoregressive inference to the drafting heads' non-autoregressive inference,
effectively achieving paradigm shift and feature fusion. Experimental results
on Vicuna models using MT-Bench and Spec-Bench demonstrate that Amphista
achieves substantial acceleration while maintaining generation quality. On
MT-Bench, Amphista delivers up to 2.75$\times$ speedup over vanilla
autoregressive decoding and 1.40$\times$ over Medusa on Vicuna 33B in
wall-clock time.
♻ ☆ LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding ACL 2024
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
We present LayerSkip, an end-to-end solution to speed up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has less memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations of the draft and verification stages. We run
experiments on different Llama model sizes on different types of training:
pretraining from scratch, continual pretraining, finetuning on a specific data
domain, and finetuning on a specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization of CNN/DM documents, 1.82x
on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code and
checkpoints at https://github.com/facebookresearch/LayerSkip.
comment: ACL 2024
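A toy sketch of the self-speculative step (draft_fn and verify_fn are assumed
callables; in practice verification is a single batched forward pass):

    # Hypothetical early-exit self-speculation: draft with the shallow
    # (early-exit) pass, verify with the full model, keep the agreeing prefix.
    def self_speculative_step(draft_fn, verify_fn, prefix, k=4):
        # draft_fn(prefix, k) -> k greedy tokens from the early-exit head.
        # verify_fn(prefix, draft) -> full-model greedy token at each position.
        draft = draft_fn(prefix, k)
        full = verify_fn(prefix, draft)
        accepted = []
        for d, f in zip(draft, full):
            accepted.append(f)            # the full model's token is always valid
            if d != f:
                break                     # draft diverged; stop accepting
        return accepted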
♻ ☆ A Tighter Complexity Analysis of SparseGPT
In this work, we improve the analysis of the running time of SparseGPT
[Frantar, Alistarh ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} +
d^{1+\omega(1,1,a)-a})$ for any $a \in [0, 1]$, where $\omega$ is the exponent
of matrix multiplication. In particular, for the current $\omega \approx 2.371$
[Alman, Duan, Williams, Xu, Xu, Zhou 2024], our running time boils down to
$O(d^{2.53})$. This running time is due to the analysis of the lazy update
behavior in iterative maintenance problems such as [Deng, Song, Weinstein 2022;
Brand, Song, Zhou ICML 2024].
♻ ☆ A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models
The tendency of Large Language Models (LLMs) to generate hallucinations
raises concerns regarding their reliability. Therefore, confidence estimations
indicating the extent of trustworthiness of the generations become essential.
However, current LLM confidence estimations in languages other than English
remain underexplored. This paper addresses this gap by introducing a
comprehensive investigation of Multilingual Confidence estimation (MlingConf)
on LLMs, focusing on both language-agnostic (LA) and language-specific (LS)
tasks to explore the performance and language dominance effects of multilingual
confidence estimations on different tasks. The benchmark comprises four
meticulously checked and human-evaluated high-quality multilingual datasets for
LA tasks and one for the LS task, tailored to the specific social, cultural,
and geographical contexts of a language. Our experiments reveal that on LA
tasks English exhibits notable linguistic dominance in confidence estimations
over other languages, while on LS tasks, prompting LLMs in the question-related
language yields better multilingual confidence estimations. These phenomena
inspire a simple yet effective native-tone prompting strategy that employs
language-specific prompts for LS tasks, effectively improving LLMs'
reliability and accuracy on LS tasks.
comment: n pages; Previously this version appeared as arXiv:2410.12478, which
was submitted as a new work by accident
♻ ☆ ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
Autonomous agents have demonstrated significant potential in automating
complex multistep decision-making tasks. However, even state-of-the-art
vision-language models (VLMs), such as GPT-4o, still fall short of human-level
performance, particularly in intricate web environments and long-horizon tasks.
To address these limitations, we present ExACT, an approach to combine
test-time search and self-learning to build o1-like models for agentic
applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a
novel test-time algorithm designed to enhance AI agents' ability to explore
decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating
contrastive reflection, allowing agents to learn from past interactions and
dynamically improve their search efficiency; and 2) using multi-agent debate
for reliable state evaluation. Next, we introduce Exploratory Learning, a novel
learning strategy to teach agents to search at inference time without relying
on any external search algorithms. On the challenging VisualWebArena benchmark,
our GPT-4o based R-MCTS agent achieves a 6% to 30% relative improvement across
various tasks compared to the previous state-of-the-art. Additionally, we show
that the knowledge and experience gained from test-time search can be
effectively transferred back to GPT-4o via fine-tuning. After Exploratory
Learning, GPT-4o 1) demonstrates the ability to explore the environment,
evaluate a state, and backtrack to viable ones when it detects that the current
state cannot lead to success, and 2) matches 87% of R-MCTS's performance while
using significantly less compute. Notably, our work demonstrates the compute
scaling properties in both training - data collection with R-MCTS - and testing
time. These results suggest a promising research direction to enhance VLMs'
capabilities for agentic applications via test-time search and self-learning.
♻ ☆ MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models
The tendency of Large Language Models (LLMs) to generate hallucinations
raises concerns regarding their reliability. Therefore, confidence estimations
indicating the extent of trustworthiness of the generations become essential.
However, current LLM confidence estimations in languages other than English
remain underexplored. This paper addresses this gap by introducing a
comprehensive investigation of Multilingual Confidence estimation (MlingConf)
on LLMs, focusing on both language-agnostic (LA) and language-specific (LS)
tasks to explore the performance and language dominance effects of multilingual
confidence estimations on different tasks. The benchmark comprises four
meticulously checked and human-evaluated high-quality multilingual datasets for
LA tasks and one for the LS task, tailored to the specific social, cultural,
and geographical contexts of a language. Our experiments reveal that on LA
tasks English exhibits notable linguistic dominance in confidence estimations
over other languages, while on LS tasks, prompting LLMs in the question-related
language yields better multilingual confidence estimations. These phenomena
inspire a simple yet effective native-tone prompting strategy that employs
language-specific prompts for LS tasks, effectively improving LLMs'
reliability and accuracy on LS tasks.
comment: This work was intended as a replacement for arXiv:2402.13606; any
subsequent updates will appear there
♻ ☆ Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization
Understanding and accurately following instructions is critical for large
language models (LLMs) to be effective across diverse tasks. In this work, we
rigorously examine the key factors that enable models to generalize to unseen
instructions, providing insights to guide the collection of data for
instruction-tuning. Through controlled experiments, inspired by the
Turing-complete Markov algorithm, we demonstrate that such generalization
only emerges when training data is diversified enough across
semantic domains. Our findings also reveal that merely diversifying within
limited domains fails to ensure robust generalization. In contrast,
cross-domain data diversification, even under constrained data budgets,
significantly enhances a model's adaptability. We further extend our analysis
to real-world scenarios, including fine-tuning of specialist and generalist
models.
In both cases, we demonstrate that 1) better performance can be achieved by
increasing the diversity of an established dataset while keeping the data size
constant, and 2) when scaling up the data, diversifying the semantics of
instructions is more effective than simply increasing the quantity of similar
data. Our research provides important insights for dataset collation,
particularly when optimizing model performance by expanding training data for
both specialist and generalist scenarios. We show that careful consideration of
data diversification is key: training specialist models with data extending
beyond their core domain leads to significant performance improvements, while
generalist models benefit from diverse data mixtures that enhance their overall
instruction-following capabilities across a wide range of applications. Our
results highlight the critical role of strategic diversification and offer
clear guidelines for improving data quality.
comment: Fix formatting issues
♻ ☆ BenTo: Benchmark Task Reduction with In-Context Transferability
Evaluating large language models (LLMs) is costly: it requires the generation
and examination of LLM outputs on a large-scale benchmark of various tasks.
This paper investigates how to efficiently reduce the tasks used to benchmark
LLMs without affecting the evaluation quality. Our study reveals that task
transferability and relevance provide critical information to identify the most
representative subset of tasks via optimizing a facility location function. We
propose a practically efficient metric for estimating the transferability
between two tasks via in-context learning (ICL). By analyzing the pairwise
transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or
FLAN) to 5% while inducing only a <4% difference to the evaluation on the
original benchmark. Compared to prior works, our method is training-free,
gradient-free, and highly efficient, requiring only ICL.
comment: https://github.com/tianyi-lab/bento
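A sketch of the facility-location step (the greedy maximization is a standard
approximation; the transferability matrix T is assumed given, e.g., estimated
via ICL):

    # Hypothetical greedy facility location: T[i, j] = how well task i
    # represents task j; pick k tasks maximizing total coverage.
    import numpy as np

    def select_tasks(T, k):
        n = T.shape[0]
        chosen, covered = [], np.zeros(n)
        for _ in range(k):
            gains = [np.maximum(covered, T[i]).sum() - covered.sum()
                     for i in range(n)]
            best = int(np.argmax(gains))
            chosen.append(best)
            covered = np.maximum(covered, T[best])
        return chosen

    print(select_tasks(np.random.rand(20, 20), 3))  # representative subset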
♻ ☆ GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
Although Large Language Models (LLMs) have demonstrated potential in
processing graphs, they struggle with comprehending graphical structure
information through prompts of graph description sequences, especially as the
graph size increases. We attribute this challenge to the uneven memory
performance of LLMs across different positions in graph description sequences,
known as "positional biases". To address this, we propose GraphInsight, a
novel framework aimed at improving LLMs' comprehension of both macro- and
micro-level graphical information. GraphInsight is grounded in two key
strategies: 1) placing critical graphical information in positions where LLMs
exhibit stronger memory performance, and 2) investigating a lightweight
external knowledge base for regions with weaker memory performance, inspired by
retrieval-augmented generation (RAG). Moreover, GraphInsight explores
integrating these two strategies into LLM agent processes for composite graph
tasks that require multi-step reasoning. Extensive empirical studies on
benchmarks with a wide range of evaluation tasks show that GraphInsight
significantly outperforms all other graph description methods (e.g., prompting
techniques and reordering strategies) in understanding graph structures of
varying sizes.
♻ ☆ AutoPal: Autonomous Adaptation to Users for Personal AI Companionship
Previous research has demonstrated the potential of AI agents to act as
companions that can provide constant emotional support for humans. In this
paper, we emphasize the necessity of autonomous adaptation in personal AI
companionship, an underexplored yet promising direction. Such adaptability is
crucial as it can facilitate more tailored interactions with users and allow
the agent to evolve in response to users' changing needs. However, imbuing
agents with autonomous adaptability presents unique challenges, including
identifying optimal adaptations to meet users' expectations and ensuring a
smooth transition during the adaptation process. To address them, we devise a
hierarchical framework, AutoPal, that enables controllable and authentic
adjustments to the agent's persona based on user interactions. A
persona-matching dataset is constructed to facilitate the learning of optimal
persona adaptations. Extensive experiments demonstrate the effectiveness of
AutoPal and highlight the importance of autonomous adaptability in AI
companionship.
♻ ☆ Efficiently Quantifying and Mitigating Ripple Effects in Model Editing
Jianchen Wang, Zhouhong Gu, Xiaoxuan Zhu, Lin Zhang, Haoning Ye, Zhuozhi Xiong, Hongwei Feng, Yanghua Xiao
Large Language Models have revolutionized numerous tasks with their
remarkable efficacy. However, editing these models, crucial for rectifying
outdated or erroneous information, often leads to a complex issue known as the
ripple effect in the hidden space. While difficult to detect, this effect can
significantly impede the efficacy of model editing tasks and deteriorate model
performance. This paper addresses this scientific challenge by proposing a
novel evaluation methodology, Graphical Impact Evaluation (GIE), which
quantitatively evaluates the adaptations of the model and the subsequent impact
of editing. Furthermore, we introduce the Selective Impact Revision (SIR), a
model editing method designed to mitigate this ripple effect. Our comprehensive
evaluations reveal that the ripple effect in the hidden space is a significant
issue in all current model editing methods. However, our proposed methods, GIE
and SIR, effectively identify and alleviate this issue, contributing to the
advancement of LLM editing techniques.
♻ ☆ MoR: Mixture of Ranks for Low-Rank Adaptation Tuning
Low-Rank Adaptation (LoRA) drives research to align its performance with full
fine-tuning. However, significant challenges remain: (1) Simply increasing the
rank size of LoRA does not effectively capture high-rank information, which
leads to a performance bottleneck; (2) MoE-style LoRA methods substantially
increase parameters and inference latency, contradicting the goals of efficient
fine-tuning and ease of application. To address these challenges, we introduce
Mixture of Ranks (MoR), which learns rank-specific information for different
tasks based on the input and efficiently integrates multi-rank information. We
first propose a new framework that equates the integration of multiple LoRAs
to expanding the rank of LoRA. Moreover, we hypothesize that low-rank LoRA
already captures sufficient intrinsic information, and that MoR can derive
high-rank information through mathematical transformations of the low-rank
components. Thus, MoR reduces the learning difficulty of LoRA and enhances its
multi-task capabilities. MoR achieves impressive results, delivering a 1.31%
performance improvement while using only 93.93% of the parameters compared to
baseline methods.
comment: 11 pages, 7 figures
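One plausible reading of the idea as code (an input-routed mixture over rank-1
LoRA components; the architecture details are our assumptions):

    # Hypothetical mixture-of-ranks layer: a router weights several rank-1
    # LoRA components per input, so the effective update rank adapts.
    import torch, torch.nn as nn

    class MixtureOfRanks(nn.Module):
        def __init__(self, d_in, d_out, n_ranks=4):
            super().__init__()
            self.A = nn.Parameter(torch.randn(n_ranks, d_in) * 0.01)
            self.B = nn.Parameter(torch.zeros(n_ranks, d_out))
            self.router = nn.Linear(d_in, n_ranks)

        def forward(self, x):                      # x: (batch, d_in)
            w = torch.softmax(self.router(x), -1)  # (batch, n_ranks)
            z = x @ self.A.T                       # (batch, n_ranks)
            return (w * z) @ self.B                # weighted rank-1 updates

    delta = MixtureOfRanks(16, 8)(torch.randn(2, 16))  # add to base output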
♻ ☆ UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models
Automated Machine Learning (AutoML) has simplified complex ML processes such
as data pre-processing, model selection, and hyper-parameter searching.
However, traditional AutoML frameworks focus solely on discriminative tasks,
often falling short in tackling AutoML for generative models. Additionally,
these frameworks lack interpretability and user engagement during the training
process, primarily due to the absence of human-centered design. This leads to a
lack of transparency in final decision-making and limited user control,
potentially reducing trust and adoption of AutoML methods. To address these
limitations, we introduce UniAutoML, a human-centered AutoML framework that
leverages Large Language Models (LLMs) to unify AutoML for both discriminative
(e.g., Transformers and CNNs for classification or regression tasks) and
generative tasks (e.g., fine-tuning diffusion models or LLMs). The
human-centered design of UniAutoML innovatively features a conversational user
interface (CUI) that facilitates natural language interactions, providing users
with real-time guidance, feedback, and progress updates for better
interpretability. This design enhances transparency and user control throughout
the AutoML training process, allowing users to seamlessly break down or modify
the model being trained. To mitigate potential risks associated with
LLM-generated content, UniAutoML incorporates a safety guardrail that filters
inputs and censors outputs. We evaluated UniAutoML's performance and usability
through experiments on eight diverse datasets and user studies involving 25
participants, demonstrating that UniAutoML not only enhances performance but
also improves user control and trust. Our human-centered design bridges the gap
between AutoML capabilities and user understanding, making ML more accessible
to a broader audience.
♻ ☆ ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning EMNLP
Prompt Tuning has been a popular Parameter-Efficient Fine-Tuning method
owing to its remarkable performance with few updated parameters on various
large-scale pretrained Language Models (PLMs). Traditionally, each prompt has
been considered indivisible and updated independently, causing the parameter
count to grow proportionally with prompt length. To address this issue, we
propose Adaptive Codebook for Composite and Efficient Prompt Tuning (ACCEPT).
In our method, we draw on the concept of product quantization (PQ), allowing
all soft prompts to share a set of learnable codebook vectors in each subspace,
with each prompt differentiated by a set of adaptive weights. We achieve
superior performance on 17 diverse natural language tasks including natural
language understanding (NLU) and question answering (QA) tasks by tuning only
0.3% of parameters of the PLMs. Our approach also excels in few-shot and large
model settings, highlighting its significant potential.
comment: EMNLP Findings 2024
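A rough sketch of product-quantized soft prompts (dimensions and the softmax
weighting are assumptions, not the paper's exact parameterization):

    # Hypothetical PQ prompt: each prompt token is a weighted combination of
    # shared codebook vectors per subspace, so parameters grow with the
    # codebook size rather than the prompt length.
    import torch, torch.nn as nn

    class PQPrompt(nn.Module):
        def __init__(self, n_tokens=20, dim=768, n_sub=4, n_codes=16):
            super().__init__()
            assert dim % n_sub == 0
            self.codebooks = nn.Parameter(
                torch.randn(n_sub, n_codes, dim // n_sub) * 0.02)
            self.weights = nn.Parameter(torch.zeros(n_tokens, n_sub, n_codes))

        def forward(self):
            w = torch.softmax(self.weights, -1)         # (T, S, C)
            # Combine codes per subspace, then concatenate subspaces per token.
            parts = torch.einsum("tsc,scd->tsd", w, self.codebooks)
            return parts.reshape(parts.shape[0], -1)    # (T, dim)

    print(PQPrompt()().shape)  # torch.Size([20, 768])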
♻ ☆ On Subjective Uncertainty Quantification and Calibration in Natural Language Generation
Applications of large language models often involve the generation of
free-form responses, in which case uncertainty quantification becomes
challenging. This is due to the need to identify task-specific uncertainties
(e.g., about the semantics), which appear difficult to define in general.
This work addresses these challenges from a perspective of Bayesian decision
theory, starting from the assumption that our utility is characterized by a
similarity measure that compares a generated response with a hypothetical true
response. We discuss how this assumption enables principled quantification of
the model's subjective uncertainty and its calibration. We further derive a
measure for epistemic uncertainty, based on a missing data perspective and its
characterization as an excess risk. The proposed methods can be applied to
black-box language models. We illustrate the methods on question answering and
machine translation tasks. Our experiments provide a principled evaluation of
task-specific calibration, and demonstrate that epistemic uncertainty offers a
promising deferral strategy for efficient data acquisition in in-context
learning.
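Under the stated utility assumption, one illustrative formalization (our
notation, not necessarily the paper's) is:

    \[
      U(y \mid x) = \mathbb{E}_{y^{*} \sim p(\cdot \mid x)}\left[ S(y, y^{*}) \right],
      \qquad
      H_{S}(x) = 1 - \max_{y} U(y \mid x)
    \]

where S is the similarity measure, y* the hypothetical true response, and p
the model's predictive distribution; H_S then quantifies the model's
subjective uncertainty about input x.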
♻ ☆ An Evolved Universal Transformer Memory
Prior methods propose to offset the escalating costs of modern foundation
models by dropping specific parts of their contexts with hand-designed rules,
while attempting to preserve their original performance. We overcome this
trade-off with Neural Attention Memory Models (NAMMs), introducing a learned
network for memory management that improves both the performance and efficiency
of transformers. We evolve NAMMs atop pre-trained transformers to provide
different latent contexts focusing on the most relevant information for
individual layers and attention heads. NAMMs are universally applicable to any
model using self-attention as they condition exclusively on the values in the
produced attention matrices. Learning NAMMs on a small set of problems, we
achieve substantial performance improvements across multiple long-context
benchmarks while cutting the model's input contexts down to a fraction of
their original sizes. We show that the generality of our conditioning enables
zero-shot
transfer of NAMMs trained only on language to entirely new transformer
architectures even across input modalities, with their benefits carrying over
to vision and reinforcement learning.
comment: 29 pages, 14 figures. Preprint, under submission. Source code is
available at https://github.com/SakanaAI/evo-memory
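A toy sketch of attention-conditioned memory pruning (NAMMs evolve the scoring
network; a fixed mean over attention values stands in for it here):

    # Hypothetical KV-cache pruning: score each cached token by a function
    # of its attention values and retain only the highest-scoring tokens.
    import torch

    def prune_kv_cache(attn, keep_ratio=0.5, score_fn=None):
        # attn: (heads, queries, keys) attention weights over cached tokens.
        score_fn = score_fn or (lambda a: a.mean(dim=(0, 1)))  # (keys,)
        scores = score_fn(attn)
        k = max(1, int(keep_ratio * scores.numel()))
        keep = scores.topk(k).indices.sort().values
        return keep  # indices of tokens to retain in the KV cache

    attn = torch.rand(8, 16, 100)
    print(prune_kv_cache(attn, 0.25).shape)  # 25 retained token indices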