Computation and Language
☆ Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
We evaluate how well Large Language Models (LLMs) latently recall and compose
facts to answer multi-hop queries like "In the year Scarlett Johansson was
born, the Summer Olympics were hosted in the country of". One major challenge
in evaluating this ability is that LLMs may have developed shortcuts by
encountering the head entity "Scarlett Johansson" and the answer entity
"United States" in the same training sequences, or may merely guess the answer based
on frequency-based priors. To prevent shortcuts, we exclude test queries where
the head and answer entities co-appear in pretraining corpora. Through careful
selection of relations and facts and systematic removal of cases where models
might guess answers or exploit partial matches, we construct an evaluation
dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs
demonstrate promising latent multi-hop reasoning abilities without exploiting
shortcuts, but only for certain types of queries. For queries requiring latent
recall of countries as the intermediate answer, the best models achieve 80%
latent composability, but this drops to just 5% for the recall of years.
Comparisons with Chain-of-Thought composability highlight a significant gap
between the ability of models to reason latently versus explicitly. Analysis
reveals that latent representations of the intermediate answer are constructed
more often in queries with higher latent composability, and shows the emergence
of latent multi-hop reasoning during pretraining.
☆ DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Storytelling video generation (SVG) has recently emerged as a task to create
long, multi-motion, multi-scene videos that consistently represent the story
described in the input text script. SVG holds great potential for diverse
content creation in media and entertainment; however, it also presents
significant challenges: (1) objects must exhibit a range of fine-grained,
complex motions, (2) multiple objects need to appear consistently across
scenes, and (3) subjects may require multiple motions with seamless transitions
within a single scene. To address these challenges, we propose DreamRunner, a
novel story-to-video generation method: First, we structure the input script
using a large language model (LLM) to facilitate both coarse-grained scene
planning as well as fine-grained object-level layout and motion planning. Next,
DreamRunner presents retrieval-augmented test-time adaptation to capture target
motion priors for objects in each scene, supporting diverse motion
customization based on retrieved videos, thus facilitating the generation of
new videos with complex, scripted motions. Lastly, we propose a novel
spatial-temporal region-based 3D attention and prior injection module SR3AI for
fine-grained object-motion binding and frame-by-frame semantic control. We
compare DreamRunner with various SVG baselines, demonstrating state-of-the-art
performance in character consistency, text alignment, and smooth transitions.
Additionally, DreamRunner exhibits strong fine-grained condition-following
ability in compositional text-to-video generation, significantly outperforming
baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to
generate multi-object interactions with qualitative examples.
comment: Project website: https://dreamrunner-story2video.github.io/
☆ Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Reward modeling is crucial for aligning large language models (LLMs) with
human preferences, especially in reinforcement learning from human feedback
(RLHF). However, current reward models mainly produce scalar scores and
struggle to incorporate critiques in a natural language format. We hypothesize
that predicting both critiques and the scalar reward would improve reward
modeling ability. Motivated by this, we propose Critic-RM, a framework that
improves reward models using self-generated critiques without extra
supervision. Critic-RM employs a two-stage process: generating and filtering
high-quality critiques, followed by joint fine-tuning on reward prediction and
critique generation. Experiments across benchmarks show that Critic-RM improves
reward modeling accuracy by 3.7%-7.3% compared to standard reward models and
LLM judges, demonstrating strong performance and data efficiency. Additional
studies further validate the effectiveness of generated critiques in rectifying
flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.
comment: 20 pages
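As a rough illustration of the joint fine-tuning stage described above, the sketch below combines a next-token loss on the self-generated critique with a pairwise reward loss; the specific loss forms, the weighting, and the name lambda_critique are assumptions of this sketch, not details taken from the paper.
```python
import torch.nn.functional as F

def joint_loss(critique_logits, critique_targets, reward_chosen, reward_rejected,
               lambda_critique=0.5):
    """Illustrative joint objective: critique generation + reward prediction."""
    # Critique generation: standard next-token cross-entropy over the critique text.
    lm_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_targets.view(-1),
        ignore_index=-100,
    )
    # Reward prediction: Bradley-Terry style pairwise loss on the scalar rewards.
    rm_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return rm_loss + lambda_critique * lm_loss
```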
☆ Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective
Jean Marie Tshimula, Xavier Ndona, D'Jeff K. Nkashama, Pierre-Martin Tardif, Froduald Kabanza, Marc Frappier, Shengrui Wang
Jailbreak prompts pose a significant threat in AI and cybersecurity, as they
are crafted to bypass ethical safeguards in large language models, potentially
enabling misuse by cybercriminals. This paper analyzes jailbreak prompts from a
cyber defense perspective, exploring techniques like prompt injection and
context manipulation that allow harmful content generation, content filter
evasion, and sensitive information extraction. We assess the impact of
successful jailbreaks, from misinformation and automated social engineering to
hazardous content creation, including bioweapons and explosives. To address
these threats, we propose strategies involving advanced prompt analysis,
dynamic safety protocols, and continuous model fine-tuning to strengthen AI
resilience. Additionally, we highlight the need for collaboration among AI
researchers, cybersecurity experts, and policymakers to set standards for
protecting AI systems. Through case studies, we illustrate these cyber defense
approaches, promoting responsible AI practices to maintain system integrity and
public trust. \textbf{\color{red}Warning: This paper contains content which the
reader may find offensive.}
☆ Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Modern LLMs can now produce highly readable abstractive summaries, to the
point where traditional automated metrics for evaluating summary quality, such
as ROUGE, have become saturated. However, LLMs still sometimes introduce
unwanted content into summaries, i.e., information inconsistent with or
unsupported by their source. Measuring the occurrence of these often subtle
``hallucinations'' automatically has proved to be challenging. This in turn has
motivated development of a variety of metrics intended to measure the factual
consistency of generated summaries against their source. But are these
approaches measuring what they purport to do? In this work, we stress-test
automatic factuality metrics. Specifically, we investigate whether and to what
degree superficial attributes of summary texts suffice to predict
``factuality'', finding that a (supervised) model using only such shallow
features is reasonably competitive with SOTA factuality scoring methods. We
then evaluate how factuality metrics respond to factual corrections in
inconsistent summaries and find that only a few show meaningful improvements.
In contrast, some metrics are more sensitive to benign, non-factual edits.
Motivated by these insights, we show that one can ``game'' (most) automatic
factuality metrics, i.e., reliably inflate ``factuality'' scores by appending
innocuous sentences to generated summaries. Taken together, our results raise
questions about the degree to which we should rely on existing automated
factuality metrics and what exactly we want ``factuality metrics'' to measure.
☆ StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training
Most state-of-the-art techniques for Language Models (LMs) today rely on
transformer-based architectures and their ubiquitous attention mechanism.
However, the quadratic growth in computational requirements with longer input
sequences confines Transformers to handling short passages. Recent efforts have
aimed to address this limitation by introducing selective attention mechanisms,
notably local and global attention. While sparse attention mechanisms have been
shown to be theoretically as expressive as full attention (both are
Turing-complete), their practical impact on pre-training remains unexplored. This study focuses
on empirically assessing the influence of global attention on BERT
pre-training. The primary steps involve creating an extensive corpus of
structure-aware text through arXiv data, alongside a text-only counterpart. We
carry out pre-training on these two datasets, investigate shifts in attention
patterns, and assess their implications for downstream tasks. Our analysis
underscores the significance of incorporating document structure into language
models, demonstrating their capacity to excel in more abstract tasks, such as
document understanding.
☆ Recent Trends in Linear Text Segmentation: a Survey
Linear Text Segmentation is the task of automatically tagging text documents
with topic shifts, i.e. the places in the text where the topics change. A
well-established area of research in Natural Language Processing, drawing from
well-understood concepts in linguistics and computational linguistics research,
the field has recently seen a lot of interest as a result of the surge of text,
video, and audio available on the web, which in turn require ways of
summarising and categorizing the mass of content for which linear text
segmentation is a fundamental step. In this survey, we provide an extensive
overview of current advances in linear text segmentation, describing the state
of the art in terms of resources and approaches for the task. Finally, we
highlight the limitations of available resources and of the task itself, while
indicating ways forward based on the most recent literature and under-explored
research directions.
☆ From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
Assessment and evaluation have long been critical challenges in artificial
intelligence (AI) and natural language processing (NLP). However, traditional
methods, whether matching-based or embedding-based, often fall short of judging
subtle attributes and delivering satisfactory results. Recent advancements in
Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs
are leveraged to perform scoring, ranking, or selection across various tasks
and applications. This paper provides a comprehensive survey of LLM-based
judgment and assessment, offering an in-depth overview to advance this emerging
field. We begin by giving detailed definitions from both input and output
perspectives. Then we introduce a comprehensive taxonomy to explore
LLM-as-a-judge from three dimensions: what to judge, how to judge and where to
judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and
highlight key challenges and promising directions, aiming to provide valuable
insights and inspire future research in this promising research area. Paper
list and more resources about LLM-as-a-judge can be found at
\url{https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge} and
\url{https://llm-as-a-judge.github.io}.
comment: 32 pages, 5 figures
☆ Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
Zhiheng Xi, Dingwen Yang, Jixuan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, Wei He, Boyang Hong, Shihan Do, Wenyu Zhan, Xiao Wang, Rui Zheng, Tao Ji, Xiaowei Shi, Yitao Zhai, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Zuxuan Wu, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang
Training large language models (LLMs) to spend more time thinking and
reflecting before responding is crucial for effectively solving complex
reasoning tasks in fields such as science, coding, and mathematics. However,
the effectiveness of mechanisms like self-reflection and self-correction
depends on the model's capacity to accurately assess its own performance, which
can be limited by factors such as initial accuracy, question difficulty, and
the lack of external feedback. In this paper, we delve into a two-player
paradigm that separates the roles of reasoning and critique models, where the
critique model provides step-level feedback to supervise the reasoning (actor)
model at both test time and training time. We first propose AutoMathCritique,
an automated and scalable framework for collecting critique data, resulting in
a dataset of $76,321$ responses paired with step-level feedback. Fine-tuning
language models with this dataset enables them to generate natural language
feedback for mathematical reasoning. We demonstrate that the critique models
consistently improve the actor's performance on difficult queries at test-time,
especially when scaling up inference-time computation. Motivated by these
findings, we introduce the critique-based supervision to the actor's
self-training process, and propose a critique-in-the-loop self-improvement
method. Experiments show that the method improves the actor's exploration
efficiency and solution diversity, especially on challenging queries, leading
to a stronger reasoning model. Lastly, we take a preliminary step to explore
training self-talk reasoning models via critique supervision and showcase its
potential. Our code and datasets are at
\href{https://mathcritique.github.io/}{https://mathcritique.github.io/}.
comment: Preprint
☆ EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code
Automated detection of software vulnerabilities is critical for enhancing
security, yet existing methods often struggle with the complexity and diversity
of modern codebases. In this paper, we introduce EnStack, a novel ensemble
stacking framework that enhances vulnerability detection using natural language
processing (NLP) techniques. Our approach synergizes multiple pre-trained large
language models (LLMs) specialized in code understanding: CodeBERT for semantic
analysis, GraphCodeBERT for structural representation, and UniXcoder for
cross-modal capabilities. By fine-tuning these models on the Draper VDISC
dataset and integrating their outputs through meta-classifiers such as Logistic
Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack
effectively captures intricate code patterns and vulnerabilities that
individual models may overlook. The meta-classifiers consolidate the strengths
of each LLM, resulting in a comprehensive model that excels in detecting subtle
and complex vulnerabilities across diverse programming contexts. Experimental
results demonstrate that EnStack significantly outperforms existing methods,
achieving notable improvements in accuracy, precision, recall, and F1-score.
This work highlights the potential of ensemble LLM approaches in code analysis
tasks and offers valuable insights into applying NLP techniques for advancing
automated vulnerability detection.
comment: Accepted in 2024 IEEE International Conference on Big Data (IEEE
BigData 2024)
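The stacking step can be pictured as follows: per-model vulnerability probabilities become meta-features for a second-level classifier. This is a minimal sketch assuming the base probabilities from the fine-tuned CodeBERT, GraphCodeBERT, and UniXcoder models are computed elsewhere; a logistic regression meta-classifier stands in for the several meta-classifiers the paper evaluates.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_meta_classifier(base_probs, labels):
    """base_probs: list of (N,) vulnerability probabilities, one per base model
    (e.g. fine-tuned CodeBERT, GraphCodeBERT, UniXcoder heads, computed elsewhere)."""
    meta_features = np.column_stack(base_probs)   # shape (N, n_models)
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_features, labels)
    return meta

def predict_vulnerable(meta, base_probs, threshold=0.5):
    """Combine base-model outputs through the trained meta-classifier."""
    scores = meta.predict_proba(np.column_stack(base_probs))[:, 1]
    return scores >= threshold
```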
☆ RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Spatial understanding is a crucial capability for robots to make grounded
decisions based on their environment. This foundational skill enables robots
not only to perceive their surroundings but also to reason about and interact
meaningfully within the world. In modern robotics, these capabilities are taken
on by visual language models, and they face significant challenges when applied
to spatial reasoning context due to their training data sources. These sources
utilize general-purpose image datasets, and they often lack sophisticated
spatial scene understanding capabilities. For example, the datasets do not
address reference frame comprehension - spatial relationships require clear
contextual understanding, whether from an ego-centric, object-centric, or
world-centric perspective, which allow for effective real-world interaction. To
address this issue, we introduce RoboSpatial, a large-scale spatial
understanding dataset consisting of real indoor and tabletop scenes captured as
3D scans and egocentric images, annotated with rich spatial information
relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M
annotated spatial relationships, with paired 2D egocentric images and 3D scans
to make it both 2D and 3D ready. Our experiments show that models trained with
RoboSpatial outperform baselines on downstream tasks such as spatial affordance
prediction, spatial relationship prediction, and robotics manipulation.
☆ Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings
Large language models (LLMs) are the foundation of the current successes of
artificial intelligence (AI); however, they are unavoidably biased. To
effectively communicate the risks and encourage mitigation efforts, these models
need adequate and intuitive descriptions of their discriminatory properties,
appropriate for all audiences of AI. We suggest bias profiles with respect to
stereotype dimensions based on dictionaries from social psychology research.
Along these dimensions we investigate gender bias in contextual embeddings,
across contexts and layers, and generate stereotype profiles for twelve
different LLMs, demonstrating their intuitiveness and usefulness for exposing and
visualizing bias.
☆ Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency
We investigate the statistical and computational limits of prompt tuning for
transformer-based foundation models. Our key contributions show that prompt
tuning on \textit{single-head} transformers with only a \textit{single}
self-attention layer (i) is universal, and (ii) supports efficient (even
almost-linear time) algorithms under the Strong Exponential Time Hypothesis
(SETH). Statistically, we prove that prompt tuning on such minimal transformers
yields universal approximators for sequence-to-sequence Lipschitz functions. In
addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound
on the required soft-prompt tokens for prompt tuning to memorize any dataset
with 1-layer, 1-head transformers. Computationally, we identify a phase
transition in the efficiency of prompt tuning, determined by the norm of the
\textit{soft-prompt-induced} keys and queries, and provide an upper bound
criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for
prompt tuning exists under SETH. Within this criterion, we showcase our theory
by proving the existence of almost-linear time prompt tuning inference
algorithms. These fundamental limits provide important necessary conditions for
designing expressive and efficient prompt tuning methods for practitioners.
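For readers unfamiliar with the setting, the sketch below shows soft-prompt tuning on a frozen one-layer, one-head attention model, the regime analyzed above; the dimensions, placeholder objective, and module names are illustrative only.
```python
import torch
import torch.nn as nn

class OneHeadLayer(nn.Module):
    """Toy 1-layer, 1-head transformer block; weights stay frozen."""
    def __init__(self, d=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.ff = nn.Linear(d, d)

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return self.ff(a)

d, n_prompt, seq, batch = 32, 8, 16, 4
model = OneHeadLayer(d)
for p in model.parameters():
    p.requires_grad_(False)                          # frozen backbone
soft_prompt = nn.Parameter(torch.randn(1, n_prompt, d) * 0.02)  # only trainable part

x = torch.randn(batch, seq, d)                       # token embeddings of the input
inputs = torch.cat([soft_prompt.expand(batch, -1, -1), x], dim=1)
out = model(inputs)[:, n_prompt:, :]                 # read off the non-prompt positions
loss = out.pow(2).mean()                             # placeholder objective
loss.backward()                                      # gradients flow only to soft_prompt
```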
☆ LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation
In the current paradigm of image captioning, deep learning models are trained
to generate text from image embeddings of latent features. We challenge the
assumption that these latent features ought to be high-dimensional vectors
which require model fine-tuning to handle. Here we propose Label Boosted
Retrieval Augmented Generation (LaB-RAG), a text-based approach to image
captioning that leverages image descriptors in the form of categorical labels
to boost standard retrieval augmented generation (RAG) with pretrained large
language models (LLMs). We study our method in the context of radiology report
generation (RRG), where the task is to generate a clinician's report detailing
their observations from a set of radiological images, such as X-rays. We argue
that simple linear classifiers over extracted image embeddings can effectively
transform X-rays into text-space as radiology-specific labels. In combination
with standard RAG, we show that these derived text labels can be used with
general-domain LLMs to generate radiology reports. Without ever training our
generative language model or image feature encoder models, and without ever
directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves
better results across natural language and radiology language metrics compared
with other retrieval-based RRG methods, while attaining competitive results
compared to other fine-tuned vision-language RRG models. We further present
results of our experiments with various components of LaB-RAG to better
understand our method. Finally, we critique the use of a popular RRG metric,
arguing it is possible to artificially inflate its results without true
data-leakage.
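A minimal sketch of the pipeline stages the abstract describes, assuming frozen image embeddings are available: linear classifiers turn an embedding into radiology-style labels, which are then injected into a retrieval-augmented prompt for a general-domain LLM. Function names, the threshold, and the prompt template are illustrative, not the authors' code.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_label_classifiers(image_embs, label_matrix):
    """One binary classifier per radiology label (columns of label_matrix)."""
    return [LogisticRegression(max_iter=1000).fit(image_embs, label_matrix[:, j])
            for j in range(label_matrix.shape[1])]

def image_to_labels(clfs, emb, label_names, threshold=0.5):
    """Map a single image embedding to the labels whose probability clears the threshold."""
    probs = [c.predict_proba(emb.reshape(1, -1))[0, 1] for c in clfs]
    return [name for name, p in zip(label_names, probs) if p >= threshold]

def build_prompt(labels, retrieved_reports):
    """Assemble a RAG prompt from predicted labels and retrieved prior reports."""
    context = "\n---\n".join(retrieved_reports)
    return (f"Findings suggested by the image labels: {', '.join(labels)}.\n"
            f"Similar prior reports:\n{context}\n"
            f"Write a radiology report consistent with these labels.")
```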
☆ All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Khan
Existing Large Multimodal Models (LMMs) generally focus on only a few regions
and languages. As LMMs continue to improve, it is increasingly important to
ensure they understand cultural contexts, respect local sensitivities, and
support low-resource languages, all while effectively integrating corresponding
visual cues. In pursuit of culturally diverse global multimodal models, our
proposed All Languages Matter Benchmark (ALM-bench) represents the largest and
most comprehensive effort to date for evaluating LMMs across 100 languages.
ALM-bench challenges existing models by testing their ability to understand and
reason about culturally diverse images paired with text in various languages,
including many low-resource languages traditionally underrepresented in LMM
research. The benchmark offers a robust and nuanced evaluation framework
featuring various question formats, including true/false, multiple choice, and
open-ended questions, which are further divided into short and long-answer
categories. The ALM-bench design ensures a comprehensive assessment of a model's
ability to handle varied levels of difficulty in visual and linguistic
reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully
curates content from 13 distinct cultural aspects, ranging from traditions and
rituals to famous personalities and celebrations. Through this, ALM-bench not
only provides a rigorous testing ground for state-of-the-art open and
closed-source LMMs but also highlights the importance of cultural and
linguistic inclusivity, encouraging the development of models that can serve
diverse global populations effectively. Our benchmark is publicly available.
comment: A Multilingual Multimodal cultural benchmark for 100 languages
☆ AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning
Recent advancements in large language models (LLMs) have led to significant
improvements in various natural language processing tasks, but it is still
challenging for LLMs to perform knowledge-intensive complex question answering
due to LLMs' weaknesses in reasoning planning and their tendency to hallucinate. A
typical solution is to employ retrieval-augmented generation (RAG) coupled with
chain-of-thought (CoT) reasoning, which decomposes complex questions into
chain-like sub-questions and applies iterative RAG at each sub-question.
However, prior works exhibit sub-optimal reasoning planning and overlook
dynamic knowledge retrieval from heterogeneous sources. In this paper, we
propose AtomR, a novel heterogeneous knowledge reasoning framework that
conducts multi-source reasoning at the atomic level. Drawing inspiration from
the graph modeling of knowledge, AtomR leverages LLMs
to decompose complex questions into combinations of three atomic knowledge
operators, significantly enhancing the reasoning process at both the planning
and execution stages. We also introduce BlendQA, a novel evaluation benchmark
tailored to assess complex heterogeneous knowledge reasoning. Experiments show
that AtomR significantly outperforms state-of-the-art baselines across three
single-source and two multi-source reasoning benchmarks, with notable
performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.
☆ O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu
This paper presents a critical examination of current approaches to
replicating OpenAI's O1 model capabilities, with particular focus on the
widespread but often undisclosed use of knowledge distillation techniques.
While our previous work explored the fundamental technical path to O1
replication, this study reveals how simple distillation from O1's API, combined
with supervised fine-tuning, can achieve superior performance on complex
mathematical reasoning tasks. Through extensive experiments, we show that a
base model fine-tuned on just tens of thousands of O1-distilled long-thought
chain samples outperforms O1-preview on the American Invitational
Mathematics Examination (AIME) with minimal technical complexity. Moreover, our
investigation extends beyond mathematical reasoning to explore the
generalization capabilities of O1-distilled models across diverse tasks:
hallucination, safety and open-domain QA. Notably, despite training only on
mathematical problem-solving data, our models demonstrated strong
generalization to open-ended QA tasks and became significantly less susceptible
to sycophancy after fine-tuning. We deliberately make this finding public to
promote transparency in AI research and to challenge the current trend of
obscured technical claims in the field. Our work includes: (1) A detailed
technical exposition of the distillation process and its effectiveness, (2) A
comprehensive benchmark framework for evaluating and categorizing O1
replication attempts based on their technical transparency and reproducibility,
(3) A critical discussion of the limitations and potential risks of
over-relying on distillation approaches. Our analysis culminates in a crucial
bitter lesson: while the pursuit of more capable AI systems is important, the
development of researchers grounded in first-principles thinking is paramount.
comment: 16 pages
☆ When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets? CoNLL
We present our submission to the BabyLM challenge, aiming to push the
boundaries of data-efficient language model pretraining. Our method builds upon
deep mutual learning, introducing a student model search for diverse
initialization. We address the limitation of treating students equally by
formulating weighted mutual learning as a bi-level optimization problem. The
inner loop learns compact students through online distillation, while the outer
loop optimizes weights for better knowledge distillation from diverse students.
This dynamic weighting strategy eliminates the need for a teacher model,
reducing computational requirements. Our evaluations show that teacher-less
methods can match or surpass teacher-supervised approaches.
comment: Accepted to BabyLM challenge, CoNLL Workshop, EMNLP 2024
☆ Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval
Large language models (LLMs) are known to struggle with complicated reasoning
tasks such as math word problems (MWPs). In this paper, we present how analogy
from similarly structured questions can improve LLMs' problem-solving
capabilities for MWPs. Specifically, we rely on the retrieval of problems with
similar computational graphs to the given question to serve as exemplars in the
prompt, providing the correct reasoning path for the generation model to refer
to. Empirical results across six math word problem datasets demonstrate the
effectiveness of our proposed method, which achieves a significant improvement
of up to 6.7 absolute percentage points on average, compared to baseline
methods. These results highlight our method's potential in addressing the
reasoning challenges in current LLMs.
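One way to picture the retrieval step is sketched below, where a computational graph is crudely approximated by the multiset of arithmetic operators in a solution expression and exemplars are ranked by Jaccard overlap; the paper's actual graph representation and similarity measure are likely richer.
```python
from collections import Counter

def op_profile(expression):
    """Very rough proxy for a computational graph: count arithmetic operators."""
    return Counter(ch for ch in expression if ch in "+-*/")

def jaccard(a, b):
    """Jaccard similarity between two operator multisets."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0

def retrieve_exemplars(query_expr, corpus, k=4):
    """corpus: list of (problem_text, solution_expression) pairs; return top-k exemplars."""
    q = op_profile(query_expr)
    ranked = sorted(corpus, key=lambda item: jaccard(q, op_profile(item[1])), reverse=True)
    return ranked[:k]
```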
☆ Finding Structure in Language Models
When we speak, write or listen, we continuously make predictions based on our
knowledge of a language's grammar. Remarkably, children acquire this
grammatical knowledge within just a few years, enabling them to understand and
generalise to novel constructions that have never been uttered before. Language
models are powerful tools that create representations of language by
incrementally predicting the next word in a sentence, and they have had a
tremendous societal impact in recent years. The central research question of
this thesis is whether these models possess a deep understanding of grammatical
structure similar to that of humans. This question lies at the intersection of
natural language processing, linguistics, and interpretability. To address it,
we will develop novel interpretability techniques that enhance our
understanding of the complex nature of large-scale language models. We approach
our research question from three directions. First, we explore the presence of
abstract linguistic information through structural priming, a key paradigm in
psycholinguistics for uncovering grammatical structure in human language
processing. Next, we examine various linguistic phenomena, such as adjective
order and negative polarity items, and connect a model's comprehension of these
phenomena to the data distribution on which it was trained. Finally, we
introduce a controlled testbed for studying hierarchical structure in language
models using various synthetic languages of increasing complexity and examine
the role of feature interactions in modelling this structure. Our findings
offer a detailed account of the grammatical knowledge embedded in language
model representations and provide several directions for investigating
fundamental linguistic questions using computational methods.
comment: PhD Thesis at ILLC, University of Amsterdam
☆ Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey
Knowledge-enhanced language models (KELMs) have emerged as promising tools to
bridge the gap between large-scale language models and domain-specific
knowledge. KELMs can achieve higher factual accuracy and mitigate
hallucinations by leveraging knowledge graphs (KGs). They are frequently
combined with adapter modules to reduce the computational load and risk of
catastrophic forgetting. In this paper, we conduct a systematic literature
review (SLR) on adapter-based approaches to KELMs. We provide a structured
overview of existing methodologies in the field through quantitative and
qualitative analysis and explore the strengths and potential shortcomings of
individual approaches. We show that general knowledge and domain-specific
approaches have been frequently explored along with various adapter
architectures and downstream tasks. We particularly focus on the popular
biomedical domain, where we provide an insightful performance comparison of
existing KELMs. We outline the main trends and propose promising future
directions.
comment: 12 pages, 4 figures. Published at KEOD24 via SciTePress
☆ Human-Calibrated Automated Testing and Validation of Generative Language Models
This paper introduces a comprehensive framework for the evaluation and
validation of generative language models (GLMs), with a focus on
Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains
such as banking. GLM evaluation is challenging due to open-ended outputs and
subjective quality assessments. Leveraging the structured nature of RAG
systems, where generated responses are grounded in a predefined document
collection, we propose the Human-Calibrated Automated Testing (HCAT) framework.
HCAT integrates a) automated test generation using stratified sampling, b)
embedding-based metrics for explainable assessment of functionality, risk and
safety attributes, and c) a two-stage calibration approach that aligns
machine-generated evaluations with human judgments through probability
calibration and conformal prediction.
In addition, the framework includes robustness testing to evaluate model
performance against adversarial, out-of-distribution, and varied input
conditions, as well as targeted weakness identification using marginal and
bivariate analysis to pinpoint specific areas for improvement. This
human-calibrated, multi-layered evaluation framework offers a scalable,
transparent, and interpretable approach to GLM assessment, providing a
practical and reliable solution for deploying GLMs in applications where
accuracy, transparency, and regulatory compliance are paramount.
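The two-stage calibration component could look roughly like the sketch below, which maps raw automated scores to probabilities with isotonic regression and then derives a split-conformal threshold for a target coverage level; the binary "acceptable" framing and the parameter names are simplifying assumptions, not the framework's exact procedure.
```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate(raw_scores, human_labels):
    """Stage 1: isotonic regression from machine-generated scores to P(acceptable)."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(raw_scores, human_labels)
    return iso

def conformal_threshold(calibrated_probs, human_labels, alpha=0.1):
    """Stage 2: nonconformity = probability mass assigned away from the true label;
    the (1 - alpha) quantile gives a threshold with ~(1 - alpha) coverage."""
    nonconformity = np.where(human_labels == 1, 1 - calibrated_probs, calibrated_probs)
    k = int(np.ceil((len(nonconformity) + 1) * (1 - alpha))) - 1
    return np.sort(nonconformity)[min(k, len(nonconformity) - 1)]
```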
☆ FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web
Cheng-Wei Lin, Wan-Hsuan Hsieh, Kai-Xin Guan, Chan-Jan Hsu, Chia-Chen Kuo, Chuan-Lin Lai, Chung-Wei Chung, Ming-Jen Wang, Da-Shan Shiu
The quality and size of a pretraining dataset significantly influence the
performance of large language models (LLMs). While there have been numerous
efforts in the curation of such a dataset for English users, there is a
relative lack of similar initiatives for Traditional Chinese. Building upon
the foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored
specifically for Traditional Chinese users. We designed multiple stages of
meticulous filters to cater to the linguistic differences between English and
Traditional Chinese and to ensure comprehensiveness and quality. We assessed
effectiveness by querying dataset samples against three main objectives. Our
code and datasets are publicly available.
☆ Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines
This paper investigates an intriguing task of Multi-modal Retrieval Augmented
Multi-modal Generation (M$^2$RAG). This task requires foundation models to
browse multi-modal web pages, with mixed text and images, and generate
multi-modal responses for solving user queries, which exhibits better
information density and readability. Given the early research stage of the
M$^2$RAG task, there is a lack of systematic studies and analysis. To fill this
gap, we construct a benchmark for the M$^2$RAG task, equipped with a suite of
text-modal metrics and multi-modal metrics to analyze the capabilities of
existing foundation models. Besides, we also propose several effective methods
for foundation models to accomplish this task, based on the comprehensive
evaluation results on our benchmark. Extensive experimental results reveal
several intriguing phenomena worth further research.
☆ The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A->C
While LLMs excel at multi-hop questions (e.g. "Who is the spouse of the
performer of Imagine?") when using chain-of-thought reasoning (CoT), they
struggle when forced to reason internally (without CoT). Previous work on the
size and nature of this gap produced mixed evidence with inconclusive results.
In this paper, we introduce a controlled setting for investigating two-hop
reasoning in LLMs, where above-chance performance constitutes undeniable
evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct
and GPT-4o) on fictional facts and confirm that they generalize to answering
two-hop questions about them using CoT. We find that models can perform latent
reasoning when facts appear together during training or in the prompt. However,
to our surprise, models completely fail at two-hop reasoning without CoT when
learned facts only appear in different documents, achieving chance-level
accuracy and chance-level test loss. We call this complete failure to compose
separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier
LLMs on real-world facts, finding that models completely fail at two-hop no-CoT
reasoning for over half of question categories while maintaining partial
success with CoT across most categories. These results suggest that LLMs lack a
general capability for latent multi-hop reasoning independent of the question
type.
☆ Preference Optimization for Reasoning with Pseudo Feedback
Preference optimization techniques, such as Direct Preference Optimization
(DPO), are frequently employed to enhance the reasoning capabilities of large
language models (LLMs) in domains like mathematical reasoning and coding,
typically following supervised fine-tuning. These methods rely on high-quality
labels for reasoning tasks to generate preference pairs; however, the
availability of reasoning datasets with human-verified labels is limited. In
this study, we introduce a novel approach to generate pseudo feedback for
reasoning tasks by framing the labeling of solutions to reasoning problems as an
evaluation against associated test cases. We explore two forms of pseudo
feedback based on test cases: one generated by frontier LLMs and the other by
extending self-consistency to multiple test cases. We conduct experiments on both
mathematical reasoning and coding tasks using pseudo feedback for preference
optimization, and observe improvements across both tasks. Specifically, using
Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6,
surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and
College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3,
respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on
LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
comment: 28 pages, 11 figures
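A hedged sketch of how test-case-based pseudo feedback might be turned into preference pairs for DPO-style training is given below; the scoring rule, margin, and field names are illustrative choices rather than the paper's implementation.
```python
from itertools import combinations

def pseudo_label(solution, test_cases, run_fn):
    """Score a candidate solution by the fraction of test cases it passes.
    run_fn(solution, case) -> bool is supplied by the caller."""
    passed = sum(run_fn(solution, case) for case in test_cases)
    return passed / max(len(test_cases), 1)

def build_preference_pairs(prompt, solutions, test_cases, run_fn, margin=0.2):
    """Pair up solutions whose pseudo-scores differ by at least `margin`."""
    scored = [(s, pseudo_label(s, test_cases, run_fn)) for s in solutions]
    pairs = []
    for (a, sa), (b, sb) in combinations(scored, 2):
        if abs(sa - sb) >= margin:
            chosen, rejected = (a, b) if sa > sb else (b, a)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```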
☆ Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring
The manual assessment and grading of student writing is a time-consuming yet
critical task for teachers. Recent developments in generative AI, such as large
language models, offer potential solutions to facilitate essay-scoring tasks
for teachers. In our study, we evaluate the performance and reliability of both
open-source and closed-source LLMs in assessing German student essays,
comparing their evaluations to those of 37 teachers across 10 pre-defined
criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from
Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA
3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring
capabilities. Closed-source GPT models outperform open-source models in both
internal consistency and alignment with human ratings, particularly excelling
in language-related criteria. The novel o1 model outperforms all other LLMs,
achieving Spearman's $r = .74$ with human assessments in the overall score, and
an internal consistency of $ICC=.80$. These findings indicate that LLM-based
assessment can be a useful tool to reduce teacher workload by supporting the
evaluation of essays, especially with regard to language-related criteria.
However, due to their tendency for higher scores, the models require further
refinement to better capture aspects of content quality.
comment: Accepted at LAK '25
☆ Learning from Relevant Subgoals in Successful Dialogs using Iterative Training for Task-oriented Dialog Systems
Task-oriented Dialog (ToD) systems have to solve multiple subgoals to
accomplish user goals, whereas feedback is often obtained only at the end of
the dialog. In this work, we propose SUIT (SUbgoal-aware ITerative Training),
an iterative training approach for improving ToD systems. We sample dialogs
from the model we aim to improve and determine subgoals that contribute to
dialog success using distant supervision to obtain high-quality training
samples. We show how this data improves supervised fine-tuning or,
alternatively, preference learning results. SUIT is able to iteratively
generate more data instead of relying on fixed static sets. SUIT reaches new
state-of-the-art performance on a popular ToD benchmark.
☆ BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment
Large language models (LLMs), with their powerful generative capabilities and
vast knowledge, empower various tasks in everyday life. However, these
abilities are primarily concentrated in high-resource languages, leaving
low-resource languages with weaker generative capabilities and relatively
limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore
crucial for serving over 100 linguistic communities worldwide. An intuitive
approach to enhance the multilingual capabilities would be to construct
instruction data for various languages, but constructing instruction data for
over 100 languages is prohibitively costly. In this paper, we introduce BayLing
2, which efficiently transfers generative capabilities and knowledge from
high-resource languages to low-resource languages through language alignment.
To achieve this, we constructed a dataset of 3.2 million instructions,
comprising high-resource language instructions (Chinese and English) and
cross-lingual instructions for 100+ languages and performed instruction tuning
based on the dataset to facilitate the capability transfer between languages.
Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B,
and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For
multilingual translation across 100+ languages, BayLing shows superior
performance compared to open-source models of similar scale. For multilingual
knowledge and understanding benchmarks, BayLing achieves significant
improvements across over 20 low-resource languages, demonstrating its
capability of effective knowledge transfer from high-resource to low-resource
languages. Furthermore, results on English benchmarks indicate that BayLing
maintains high performance in high-resource languages while enhancing the
performance in low-resource languages. Demo, homepage, code and models of
BayLing are available.
comment: BayLing 2's online demo: http://nlp.ict.ac.cn/bayling/demo. BayLing
2's code and models: https://github.com/ictnlp/BayLing
☆ Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures
Large language models (LLMs) have demonstrated remarkable mathematical
capabilities, largely driven by chain-of-thought (CoT) prompting, which
decomposes complex reasoning into step-by-step solutions. This approach has
enabled significant advancements, as evidenced by performance on benchmarks
like GSM8K and MATH. However, the mechanisms underlying LLMs' ability to
perform arithmetic in a single step of CoT remain poorly understood. Existing
studies debate whether LLMs encode numerical values or rely on symbolic
reasoning, while others explore attention and multi-layered processing in
arithmetic tasks. In this work, we propose that LLMs learn arithmetic by
capturing algebraic structures, such as \emph{Commutativity} and
\emph{Identity} properties. Since these structures are observable through
input-output relationships, they can generalize to unseen data. We empirically
demonstrate that LLMs can learn algebraic structures using a custom dataset of
arithmetic problems. Our findings indicate that leveraging algebraic structures
can enhance the LLMs' arithmetic capabilities, offering insights into improving
their arithmetic performance.
☆ NormXLogit: The Head-on-Top Never Lies
The Transformer architecture has emerged as the dominant choice for building
large language models (LLMs). However, with new LLMs emerging on a frequent
basis, it is important to consider the potential value of architecture-agnostic
approaches that can provide interpretability across a variety of architectures.
Despite recent successes in the interpretability of LLMs, many existing
approaches rely on complex methods that are often tied to a specific model
design and come with a significant computational cost. To address these
limitations, we propose a novel technique, called NormXLogit, for assessing the
significance of individual input tokens. This method operates based on the
input and output representations associated with each token. First, we
demonstrate that during the pre-training of LLMs, the norms of word embeddings
capture the importance of input tokens. Second, we reveal a significant
relationship between a token's importance and the extent to which its
representation can resemble the model's final prediction. Through extensive
analysis, we show that our approach consistently outperforms existing
gradient-based methods in terms of faithfulness. Additionally, our method
achieves better performance in layer-wise explanations compared to the most
prominent architecture-specific methods.
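Reading the abstract, the score presumably combines two per-token signals: the norm of the input embedding and how closely the token's final representation matches the model's prediction. The sketch below is one plausible combination under that reading, not the authors' exact formula.
```python
import torch

def token_importance(input_embeds, final_hidden, unembed_row):
    """input_embeds: (T, d) input token embeddings; final_hidden: (T, d)
    last-layer states; unembed_row: (d,) unembedding vector of the predicted
    token. Returns one importance score per input token."""
    norm_part = input_embeds.norm(dim=-1)        # embedding-norm signal
    logit_part = final_hidden @ unembed_row      # "resembles the final prediction" signal
    return norm_part * torch.softmax(logit_part, dim=0)
```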
☆ Transparent Neighborhood Approximation for Text Classifier Explanation
Recent literature highlights the critical role of neighborhood construction
in deriving model-agnostic explanations, with a growing trend toward deploying
generative models to improve synthetic instance quality, especially for
explaining text classifiers. These approaches overcome the challenges in
neighborhood construction posed by the unstructured nature of texts, thereby
improving the quality of explanations. However, the deployed generators are
usually implemented via neural networks and lack inherent explainability,
sparking arguments over the transparency of the explanation process itself. To
address this limitation while preserving neighborhood quality, this paper
introduces a probability-based editing method as an alternative to black-box
text generators. This approach generates neighboring texts by implementing
manipulations based on in-text contexts. Substituting the generator-based
construction process with recursive probability-based editing, the resultant
explanation method, XPROB (explainer with probability-based editing), exhibits
competitive performance according to the evaluation conducted on two real-world
datasets. Additionally, XPROB's fully transparent and more controllable
construction process leads to superior stability compared to the
generator-based explainers.
comment: IEEE DSAA'24
☆ DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings
This paper presents a novel method to improve the robustness of foundation
models to group-based biases. We propose a simple yet effective method, called
DoubleCCA, that leverages random sentences and Canonical Correlation Analysis
(CCA) to enrich the text embeddings of the foundation model. First, we generate
various random sentences that augment the original prompts by extending them
with random words or character sequences. Second, we use an
additional sentence embedding model to generate different text embeddings with
respect to these random sentences. We then apply CCA twice to align the
representations and reconstruct them back to the original representation space.
We demonstrate the effectiveness of our method on a variety of tasks and
datasets, showing that it outperforms existing methods in terms of both
performance and robustness. Our method is simple to implement and can be easily
integrated into existing models, making it a practical solution for improving
the robustness of foundation models to group-based biases.
comment: 18 pages, 6 figures, 2 tables
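A minimal sketch of the double-CCA idea, assuming N augmented prompts embedded by both the foundation model and an auxiliary sentence encoder: one CCA pass aligns the two views, and a second pass maps the aligned components back to the original embedding space. The embedding dimensions and sample count must exceed the number of components, and the reconstruction step follows my reading of the abstract rather than released code.
```python
from sklearn.cross_decomposition import CCA

def double_cca_enrich(prompt_emb, aux_emb, n_components=32):
    """prompt_emb: (N, D1) foundation-model embeddings of augmented prompts.
    aux_emb: (N, D2) embeddings of the same prompts from a second sentence encoder.
    Requires N, D1, D2 >= n_components."""
    first = CCA(n_components=n_components)
    x_c, _ = first.fit_transform(prompt_emb, aux_emb)   # first alignment pass
    # Second pass: map the shared components back to the original space so the
    # enriched representation lives where the downstream head expects it.
    back = CCA(n_components=n_components).fit(x_c, prompt_emb)
    return back.predict(x_c)                             # (N, D1) reconstruction
```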
☆ MH-MoE: Multi-Head Mixture-of-Experts
Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by
using the multi-head mechanism to collectively attend to information from
various representation spaces within different experts. In this paper, we
present a novel implementation of MH-MoE that maintains both FLOPs and
parameter parity with sparse Mixture of Experts models. Experimental results on
language models show that the new implementation yields quality improvements
over both vanilla MoE and fine-grained MoE models. Additionally, our
experiments demonstrate that MH-MoE is compatible with 1-bit Large Language
Models (LLMs) such as BitNet.
comment: 7 pages, 0 figures
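The core mechanism can be sketched as below: each token vector is split into heads, every head is routed independently to a top-1 expert, and the heads are merged back. The FLOP- and parameter-parity adjustments described in the paper are not reproduced in this toy module.
```python
import torch
import torch.nn as nn

class TinyMHMoE(nn.Module):
    """Toy multi-head mixture-of-experts layer with top-1 routing per head."""
    def __init__(self, d_model=64, n_heads=4, n_experts=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dh = n_heads, d_model // n_heads
        self.router = nn.Linear(self.dh, n_experts)
        self.experts = nn.ModuleList(nn.Linear(self.dh, self.dh) for _ in range(n_experts))

    def forward(self, x):                           # x: (batch, seq, d_model)
        b, s, _ = x.shape
        heads = x.view(b, s, self.h, self.dh)       # split each token into sub-tokens
        gates = self.router(heads).softmax(-1)      # (b, s, h, n_experts)
        top = gates.argmax(-1)                      # top-1 expert per head
        out = torch.zeros_like(heads)
        for e, expert in enumerate(self.experts):   # dispatch each head to its expert
            mask = (top == e).unsqueeze(-1)
            out = out + mask * expert(heads)
        return out.view(b, s, -1)                   # merge heads back into one token vector
```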
☆ Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models
High-quality video-text preference data is crucial for Multimodal Large
Language Models (MLLMs) alignment. However, existing preference data is very
scarce. Obtaining VQA preference data for preference training is costly, and
manually annotating responses is highly unreliable, which could result in
low-quality pairs. Meanwhile, AI-generated responses controlled by temperature
adjustment lack diversity. To address these issues, we propose a high-quality
VQA preference dataset, called \textit{\textbf{M}ultiple \textbf{M}ultimodal
\textbf{A}rtificial \textbf{I}ntelligence \textbf{P}reference Datasets in
\textbf{V}QA} (\textbf{MMAIP-V}), which is constructed by sampling from the
response distribution set and using an external scoring function for response
evaluation. Furthermore, to fully leverage the preference knowledge in MMAIP-V
and ensure sufficient optimization, we propose \textit{\textbf{Iter}ative
\textbf{W}eak-to-\textbf{S}trong \textbf{R}einforcement \textbf{L}earning from
\textbf{AI} \textbf{F}eedback for video MLLMs} (\textbf{Iter-W2S-RLAIF}), a
framework that gradually enhances MLLMs' alignment capabilities by iteratively
updating the reference model and performing parameter extrapolation. Finally,
we propose an unbiased and information-complete evaluation scheme for VQA.
Experiments demonstrate that MMAIP-V is beneficial for MLLMs in
preference learning and Iter-W2S-RLAIF fully exploits the alignment information
in MMAIP-V. We believe that the proposed automatic VQA preference data
generation pipeline based on AI feedback can greatly promote future work in
MLLM alignment. \textbf{Code and dataset are available}
\href{https://anonymous.4open.science/r/MMAIP-V_Iter-W2S-RLAIF-702F}{MMAIP-V\_Iter-W2S-RLAIF-702F}.
☆ Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models
Large Language Models (LLMs) still face challenges when dealing with complex
reasoning tasks, often resulting in hallucinations, which limit the practical
application of LLMs. To alleviate this issue, this paper proposes a new method
that integrates different LLMs to expand the knowledge boundary, reduce
dependence on a single model, and promote in-depth debate among agents. The
main contributions include: 1) Introducing third-party LLMs to adjust the
attention weights of agents through uncertainty estimation and confidence
analysis, optimizing consensus formation in multi-agent systems; 2) Experiments
on arithmetic datasets have validated the effectiveness of the method,
surpassing traditional multi-agent baselines. This research provides a new
perspective for large models to alleviate hallucination phenomena when dealing
with complex tasks.
☆ LLM Augmentations to support Analytical Reasoning over Multiple Documents
Building on their demonstrated ability to perform a variety of tasks, we
investigate the application of large language models (LLMs) to enhance in-depth
analytical reasoning within the context of intelligence analysis. Intelligence
analysts typically work with massive dossiers to draw connections between
seemingly unrelated entities, and uncover adversaries' plans and motives. We
explore if and how LLMs can be helpful to analysts for this task and develop an
architecture to augment the capabilities of an LLM with a memory module called
dynamic evidence trees (DETs) to develop and track multiple investigation
threads. Through extensive experiments on multiple datasets, we highlight how
LLMs, as-is, are still inadequate to support intelligence analysts and offer
recommendations to improve LLMs for such intricate reasoning applications.
comment: 2024 IEEE International Conference on Big Data (IEEE BigData 2024)
☆ Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
Mechanistic interpretability aims to understand the inner workings of large
neural networks by identifying circuits, or minimal subgraphs within the model
that implement algorithms responsible for performing specific tasks. These
circuits are typically discovered and analyzed using a narrowly defined prompt
format. However, given the abilities of large language models (LLMs) to
generalize across various prompt formats for the same task, it remains unclear
how well these circuits generalize. For instance, it is unclear whether the
model's generalization results from reusing the same circuit components, the
components behaving differently, or the use of entirely different components.
In this paper, we investigate the generality of the indirect object
identification (IOI) circuit in GPT-2 small, which is well-studied and believed
to implement a simple, interpretable algorithm. We evaluate its performance on
prompt variants that challenge the assumptions of this algorithm. Our findings
reveal that the circuit generalizes surprisingly well, reusing all of its
components and mechanisms while only adding additional input edges. Notably,
the circuit generalizes even to prompt variants where the original algorithm
should fail; we discover a mechanism that explains this which we term S2
Hacking. Our findings indicate that circuits within LLMs may be more flexible
and general than previously recognized, underscoring the importance of studying
circuit generalization to better understand the broader capabilities of these
models.
comment: 10 pages, 8 figures
☆ Cautious Optimizers: Improving Training with One Line of Code
AdamW has been the default optimizer for transformer pretraining. For many
years, our community has searched for faster and more stable optimizers, with only
constrained positive outcomes. In this work, we propose a \textbf{single-line
modification in Pytorch} to any momentum-based optimizer, which we rename
Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that
this modification preserves Adam's Hamiltonian function and it does not break
the convergence guarantee under the Lyapunov analysis. In addition, a whole new
family of optimizers is revealed by our theoretical insight. Among them, we
pick the simplest one for empirical experiments, showing speed-up on Llama and
MAE pretraining up to $1.47\times$. Code is available at
https://github.com/kyleliang919/C-Optim
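A minimal sketch of the cautious idea as a wrapper around any PyTorch optimizer: after the step, update components whose direction goes uphill with respect to the current gradient are reverted. The released C-AdamW/C-Lion implementations apply the mask inside the optimizer and rescale by the mask density, which this sketch omits.
```python
import torch

def cautious_step(optimizer):
    """Take an optimizer step, then zero out the components of the applied update
    that disagree in sign with the descent direction (-grad). Wrapper-style sketch."""
    params = [p for group in optimizer.param_groups for p in group["params"]
              if p.grad is not None]
    before = [p.detach().clone() for p in params]
    optimizer.step()
    with torch.no_grad():
        for p, old in zip(params, before):
            update = p - old                        # what the optimizer applied
            mask = (update * (-p.grad)) > 0         # keep only descent-aligned components
            p.copy_(old + update * mask)            # revert the rest (no rescaling here)
```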
☆ SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan, Tianwei Chen, Dario Bernal, Huitian Jiao, H M Sajjad Hossain
Large Language Model (LLM) integrations into applications like the Microsoft365
suite and Google Workspace for creating/processing documents, emails,
presentations, etc. have led to considerable enhancements in productivity and
time savings. But as these integrations become more complex, it is
paramount to ensure that the quality of output from the LLM-integrated
applications is relevant and appropriate for use. Identifying the need to
develop robust evaluation approaches for natural language generation, wherein
references/ground-truth labels don't exist or aren't amply available, this paper
introduces a novel framework called "SAGEval", which utilizes a critiquing Agent
to provide feedback on scores generated by LLM evaluators. We show that the
critiquing Agent is able to rectify scores from LLM evaluators in the absence of
references/ground-truth labels, thereby reducing the need for labeled data even
for complex NLG evaluation scenarios, like the generation of JSON-structured
forms/surveys with responses in different styles like multiple choice, Likert
ratings, single-choice questions, etc.
☆ Predicting Emergent Capabilities by Finetuning
A fundamental open challenge in modern LLM scaling is the lack of
understanding around emergent capabilities. In particular, language model
pretraining loss is known to be highly predictable as a function of compute.
However, downstream capabilities are far less predictable -- sometimes even
exhibiting emergent jumps -- which makes it challenging to anticipate the
capabilities of future models. In this work, we first pose the task of
emergence prediction: given access to current LLMs that have random few-shot
accuracy on a task, can we predict whether future models (GPT-N+1) will have
non-trivial accuracy on that task? We then discover a simple insight for this
problem: finetuning LLMs on a given task can shift the point in scaling at
which emergence occurs towards less capable models. To operationalize this
insight, we can finetune LLMs with varying amounts of data and fit a parametric
function that predicts when emergence will occur (i.e., "emergence laws"). We
validate this approach using four standard NLP benchmarks where large-scale
open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and
CoLA). Using only small-scale LLMs, we find that, in some cases, we can
accurately predict whether models trained with up to 4x more compute have
emerged. Finally, we present a case study of two realistic uses for emergence
prediction.
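As a rough illustration of the "emergence law" idea, the sketch below fits a hypothetical parametric curve (a sigmoid in log-compute whose emergence point shifts with the amount of finetuning data) to toy accuracy measurements; the functional form, variable names, and numbers are illustrative assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy measurements (purely illustrative): log10 pretraining compute,
# number of finetuning examples, and task accuracy of small models.
log_compute = np.array([20.0, 20.5, 21.0, 21.5, 22.0, 20.0, 20.5, 21.0, 21.5, 22.0])
n_finetune  = np.array([1e3] * 5 + [1e4] * 5)
accuracy    = np.array([0.26, 0.28, 0.35, 0.55, 0.72, 0.27, 0.33, 0.50, 0.70, 0.82])

def emergence_law(X, floor, ceil, slope, c0, shift):
    """Sigmoid in log-compute; more finetuning data moves the emergence point
    c0 toward smaller models by shift * log10(n_finetune) (assumed form)."""
    logc, n = X
    threshold = c0 - shift * np.log10(n)
    return floor + (ceil - floor) / (1.0 + np.exp(-slope * (logc - threshold)))

params, _ = curve_fit(emergence_law, (log_compute, n_finetune), accuracy,
                      p0=[0.25, 0.9, 2.0, 22.0, 0.2], maxfev=20000)
floor, ceil, slope, c0, shift = params
print("fitted emergence point at n_finetune = 1 (log10 compute):", c0)
```

Extrapolating the fitted threshold back to the no-finetuning regime gives a prediction of the compute scale at which the capability would emerge without task-specific data.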
☆ TransCompressor: LLM-Powered Multimodal Data Compression for Smart Transportation
The incorporation of Large Language Models (LLMs) into smart transportation
systems has paved the way for improving data management and operational
efficiency. This study introduces TransCompressor, a novel framework that
leverages LLMs for efficient compression and decompression of multimodal
transportation sensor data. TransCompressor has undergone thorough evaluation
with diverse sensor data types, including barometer, speed, and altitude
measurements, across various transportation modes like buses, taxis, and MTRs.
Comprehensive evaluation illustrates the effectiveness of TransCompressor in
reconstructing transportation sensor data at different compression ratios. The
results highlight that, with well-crafted prompts, LLMs can utilize their vast
knowledge base to contribute to data compression processes, enhancing data
storage, analysis, and retrieval in smart transportation settings.
comment: 6 pages
♻ ☆ Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific
person images according to the given textual descriptions. A primary challenge
in this task is bridging the substantial representational gap between visual
and textual modalities. The prevailing methods map texts and images into a
unified embedding space for matching, while the intricate semantic
correspondences between texts and images are still not effectively modeled.
To address this issue, we propose a novel TIPR framework to build fine-grained
interactions and alignment between person images and the corresponding texts.
Specifically, via fine-tuning the Contrastive Language-Image Pre-training
(CLIP) model, a visual-textual dual encoder is first constructed to
preliminarily align the image and text features. Secondly, a Text-guided Image
Restoration (TIR) auxiliary task is proposed to map abstract textual entities
to specific image regions, improving the alignment between local textual and
visual embeddings. Additionally, a cross-modal triplet loss is presented to
handle hard samples, and further enhance the model's discriminability for minor
differences. Moreover, a pruning-based text data augmentation approach is
proposed to enhance focus on essential elements in descriptions, thereby
avoiding excessive model attention to less significant information. The
experimental results show our proposed method outperforms state-of-the-art
methods on three popular benchmark datasets, and the code will be made publicly
available at https://github.com/Delong-liu-bupt/SEN.
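The cross-modal triplet loss mentioned above can be sketched generically as a hard-negative triplet objective over normalized image and text embeddings; the margin value and hard-negative mining shown here are common choices and not necessarily the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, margin: float = 0.2):
    """Hard-negative cross-modal triplet loss over a batch in which img_emb[i]
    and txt_emb[i] form a matched pair (a generic sketch, not necessarily the
    authors' exact formulation)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                                   # cosine similarity matrix
    pos = sim.diag()                                      # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))
    hard_txt = neg.max(dim=1).values                      # hardest text for each image
    hard_img = neg.max(dim=0).values                      # hardest image for each text
    loss = F.relu(margin + hard_txt - pos) + F.relu(margin + hard_img - pos)
    return loss.mean()
```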
♻ ☆ Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
OpenAI's o1 has recently sparked a surge of interest in the study of large
reasoning models (LRMs). Building on this momentum, Marco-o1 not only focuses on
disciplines with standard answers, such as mathematics, physics, and coding --
which are well-suited for reinforcement learning (RL) -- but also places
greater emphasis on open-ended resolutions. We aim to address the question:
''Can the o1 model effectively generalize to broader domains where clear
standards are absent and rewards are challenging to quantify?'' Marco-o1 is
powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS),
reflection mechanisms, and innovative reasoning strategies -- optimized for
complex real-world problem-solving tasks.
♻ ☆ Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer
Large language models (LLMs) have been driving a new wave of interactive AI
applications across numerous domains. However, efficiently serving LLM
inference requests is challenging due to their unpredictable execution times
originating from the autoregressive nature of generative models. Existing LLM
serving systems exploit first-come-first-serve (FCFS) scheduling, suffering
from head-of-line blocking issues. To address the non-deterministic nature of
LLMs and enable efficient interactive LLM serving, we present a speculative
shortest-job-first (SSJF) scheduler that uses a light proxy model to predict
LLM output sequence lengths. Our open-source SSJF implementation does not
require changes to memory management or batching strategies. Evaluations on
real-world datasets and production workload traces show that SSJF reduces
average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x
compared to FCFS schedulers, across no batching, dynamic batching, and
continuous batching settings.
comment: Accepted at AIOps'24
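A minimal sketch of the speculative shortest-job-first idea: queued requests are ordered by a proxy model's predicted output length rather than by arrival time. The proxy interface and request format here are placeholders, not the paper's implementation:

```python
import heapq
import itertools

class SSJFScheduler:
    """Speculative shortest-job-first: dispatch the queued request whose
    predicted output length is smallest (a sketch; the real system hooks into
    the serving engine's batching and memory management, which is ignored here)."""

    def __init__(self, proxy_predict_len):
        self.predict = proxy_predict_len        # e.g. a small fine-tuned regressor
        self.queue = []
        self.counter = itertools.count()        # FIFO tie-breaker among equal predictions

    def submit(self, request: dict):
        predicted = self.predict(request["prompt"])
        heapq.heappush(self.queue, (predicted, next(self.counter), request))

    def next_request(self):
        return heapq.heappop(self.queue)[2] if self.queue else None

# Usage with a trivial proxy that guesses length from the prompt (illustrative only):
sched = SSJFScheduler(lambda prompt: len(prompt.split()) * 2)
sched.submit({"prompt": "Write a detailed essay about distributed scheduling."})
sched.submit({"prompt": "Reply with yes or no."})
print(sched.next_request()["prompt"])       # the shorter predicted job is served first
```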
♻ ☆ A Review of Mechanistic Models of Event Comprehension
This review examines theoretical assumptions and computational models of
event comprehension, tracing the evolution from discourse comprehension
theories to contemporary event cognition frameworks. The review covers key
discourse comprehension accounts, including Construction-Integration, Event
Indexing, Causal Network, and Resonance models, highlighting their
contributions to understanding cognitive processes in comprehension. I then
discuss contemporary theoretical frameworks of event comprehension, including
Event Segmentation Theory (Zacks et al., 2007), the Event Horizon Model
(Radvansky & Zacks, 2014), and Hierarchical Generative Framework (Kuperberg,
2021), which emphasize prediction, causality, and multilevel representations in
event understanding. Building on these theories, I evaluate five computational
models of event comprehension: REPRISE (Butz et al., 2019), Structured Event
Memory (SEM; Franklin et al., 2020), the Lu model (Lu et al., 2022), the
Gumbsch model (Gumbsch et al., 2022), and the Elman and McRae model (2019). The
analysis focuses on their approaches to hierarchical processing, prediction
mechanisms, and representation learning. Key themes that emerge include the use
of hierarchical structures as inductive biases, the importance of prediction in
comprehension, and diverse strategies for learning event dynamics. The review
identifies critical areas for future research, including the need for more
sophisticated approaches to learning structured representations, integrating
episodic memory mechanisms, and developing adaptive updating algorithms for
working event models. By synthesizing insights from both theoretical frameworks
and computational implementations, this review aims to advance our
understanding of human event comprehension and guide future modeling efforts in
cognitive science.
♻ ☆ A Survey of Event Causality Identification: Principles, Taxonomy, Challenges, and Assessment
Event Causality Identification (ECI) has become a crucial task in Natural
Language Processing (NLP), aimed at automatically extracting causalities from
textual data. In this survey, we systematically address the foundational
principles, technical frameworks, and challenges of ECI, offering a
comprehensive taxonomy to categorize and clarify current research
methodologies, as well as a quantitative assessment of existing models. We
first establish a conceptual framework for ECI, outlining key definitions,
problem formulations, and evaluation standards. Our taxonomy classifies ECI
methods according to the two primary tasks of sentence-level (SECI) and
document-level (DECI) event causality identification. For SECI, we examine
feature pattern-based matching, deep semantic encoding, causal knowledge
pre-training and prompt-based fine-tuning, and external knowledge enhancement
methods. For DECI, we highlight approaches focused on event graph reasoning and
prompt-based techniques to address the complexity of cross-sentence causal
inference. Additionally, we analyze the strengths, limitations, and open
challenges of each approach. We further conduct an extensive quantitative
evaluation of various ECI methods on two benchmark datasets. Finally, we
explore future research directions, highlighting promising pathways to overcome
current limitations and broaden ECI applications.
♻ ☆ Multimodal Foundation Models Exploit Text to Make Medical Image Predictions
Multimodal foundation models have shown compelling but conflicting
performance in medical image interpretation. However, the mechanisms by which
these models integrate and prioritize different data modalities, including
images and text, remain poorly understood. Here, using a diverse collection of
1014 multimodal medical cases, we evaluate the unimodal and multimodal image
interpretation abilities of proprietary (GPT-4, Gemini Pro 1.0) and open-source
(Llama-3.2-90B, LLaVA-Med-v1.5) multimodal foundational models with and without
the use of text descriptions. Across all models, image predictions were largely
driven by exploiting text, with accuracy increasing monotonically with the
amount of informative text. By contrast, human performance on medical image
interpretation did not improve with informative text. Exploitation of text is a
double-edged sword; we show that even mild suggestions of an incorrect
diagnosis in text diminish image-based classification, reducing performance
dramatically in cases the model could previously answer with images alone.
Finally, we conducted a physician evaluation of model performance on long-form
medical cases, finding that the provision of images either reduced or had no
effect on model performance when text is already highly informative. Our
results suggest that multimodal AI models may be useful in medical diagnostic
reasoning but that their accuracy is largely driven, for better and worse, by
their exploitation of text.
♻ ☆ A Comprehensive Survey of Text Classification Techniques and Their Research Applications: Observational and Experimental Insights
The exponential growth of textual data presents substantial challenges in
management and analysis, notably due to high storage and processing costs. Text
classification, a vital aspect of text mining, provides robust solutions by
enabling efficient categorization and organization of text data. These
techniques allow individuals, researchers, and businesses to derive meaningful
patterns and insights from large volumes of text. This survey paper introduces
a comprehensive taxonomy specifically designed for text classification based on
research fields. The taxonomy is structured into hierarchical levels: research
field-based category, research field-based sub-category, methodology-based
technique, methodology sub-technique, and research field applications. We
employ a dual evaluation approach: empirical and experimental. Empirically, we
assess text classification techniques across four critical criteria.
Experimentally, we compare and rank the methodology sub-techniques within the
same methodology technique and within the same overall research field
sub-category. This structured taxonomy, coupled with thorough evaluations,
provides a detailed and nuanced understanding of text classification algorithms
and their applications, empowering researchers to make informed decisions based
on precise, field-specific insights.
♻ ☆ HQP: A Human-Annotated Dataset for Detecting Online Propaganda ACL
Online propaganda poses a severe threat to the integrity of societies.
However, existing datasets for detecting online propaganda have a key
limitation: they were annotated using weak labels that can be noisy and even
incorrect. To address this limitation, our work makes the following
contributions: (1) We present HQP: a novel dataset (N = 30,000) for detecting
online propaganda with high-quality labels. To the best of our knowledge, HQP
is the first large-scale dataset for detecting online propaganda that was
created through human annotation. (2) We show empirically that state-of-the-art
language models fail in detecting online propaganda when trained with weak
labels (AUC: 64.03). In contrast, state-of-the-art language models can
accurately detect online propaganda when trained with our high-quality labels
(AUC: 92.25), which is an improvement of ~44%. (3) We show that prompt-based
learning using a small sample of high-quality labels can still achieve a
reasonable performance (AUC: 80.27) while significantly reducing the cost of
labeling. (4) We extend HQP to HQP+ to test how well propaganda across
different contexts can be detected. Crucially, our work highlights the
importance of high-quality labels for sensitive NLP tasks such as propaganda
detection.
comment: Accepted at ACL Findings 24
♻ ☆ TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs NeurIPS 2024
Zhuofeng Li, Zixing Gou, Xiangnan Zhang, Zhongyuan Liu, Sirui Li, Yuntong Hu, Chen Ling, Zheng Zhang, Liang Zhao
Text-Attributed Graphs (TAGs) augment graph structures with natural language
descriptions, facilitating detailed depictions of data and their
interconnections across various real-world settings. However, existing TAG
datasets predominantly feature textual information only at the nodes, with
edges typically represented by mere binary or categorical attributes. This lack
of rich textual edge annotations significantly limits the exploration of
contextual relationships between entities, hindering deeper insights into
graph-structured data. To address this gap, we introduce Textual-Edge Graphs
Datasets and Benchmark (TEG-DB), a comprehensive and diverse collection of
benchmark textual-edge datasets featuring rich textual descriptions on nodes
and edges. The TEG-DB datasets are large-scale and encompass a wide range of
domains, from citation networks to social networks. In addition, we conduct
extensive benchmark experiments on TEG-DB to assess the extent to which current
techniques, including pre-trained language models, graph neural networks, and
their combinations, can utilize textual node and edge information. Our goal is
to elicit advancements in textual-edge graph research, specifically in
developing methodologies that exploit rich textual node and edge descriptions
to enhance graph analysis and provide deeper insights into complex real-world
networks. The entire TEG-DB project is publicly accessible as an open-source
repository on Github, accessible at
https://github.com/Zhuofeng-Li/TEG-Benchmark.
comment: Accepted by NeurIPS 2024
♻ ☆ MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning
Contemporary embodied agents, such as Voyager in Minecraft, have demonstrated
promising capabilities in open-ended individual learning. However, when powered
by open large language models (LLMs), these agents often struggle with
rudimentary tasks, even when fine-tuned on domain-specific knowledge. Inspired
by human cultural learning, we present \collabvoyager, a novel framework that
enhances Voyager with lifelong collaborative learning through explicit
perspective-taking. \collabvoyager introduces three key innovations: (1) theory
of mind representations linking percepts, beliefs, desires, and actions; (2)
natural language communication between agents; and (3) semantic memory of task
and environment knowledge and episodic memory of collaboration episodes. These
advancements enable agents to reason about their own and others' mental states,
empirically addressing two prevalent failure modes: false beliefs and faulty
task executions. In mixed-expertise Minecraft experiments, \collabvoyager
agents outperform Voyager counterparts, significantly improving task completion
rate by $66.6\% (+39.4\%)$ for collecting one block of dirt and $70.8\%
(+20.8\%)$ for collecting one wood block. They exhibit emergent behaviors like
knowledge transfer from expert to novice agents and collaborative code
correction. \collabvoyager agents also demonstrate the ability to adapt to
out-of-distribution tasks by using their previous experiences and beliefs
obtained through collaboration. In this open-ended social learning paradigm,
\collabvoyager paves the way for the democratic development of embodied AI,
where agents learn in deployment from both peer and environmental feedback.
♻ ☆ How ChatGPT Changed the Media's Narratives on AI: A Semi-Automated Narrative Analysis Through Frame Semantics
We perform a mixed-method frame semantics-based analysis on a dataset of more
than 49,000 sentences collected from 5846 news articles that mention AI. The
dataset covers the twelve-month period centred around the launch of OpenAI's
chatbot ChatGPT and is collected from the most visited open-access
English-language news publishers. Our findings indicate that during the six
months succeeding the launch, media attention rose tenfold, from
already historically high levels. During this period, discourse has become
increasingly centred around experts and political leaders, and AI has become
more closely associated with dangers and risks. A deeper review of the data
also suggests a qualitative shift in the types of threat AI is thought to
represent, as well as the anthropomorphic qualities ascribed to it.
comment: 19 pages, 6 figures and 2 appendices (5 pages). Minds & Machines,
published in November 2024
♻ ☆ A Survey of Stance Detection on Social Media: New Directions and Perspectives
In modern digital environments, users frequently express opinions on
contentious topics, providing a wealth of information on prevailing attitudes.
The systematic analysis of these opinions offers valuable insights for
decision-making in various sectors, including marketing and politics. As a
result, stance detection has emerged as a crucial subfield within affective
computing, enabling the automatic detection of user stances in social media
conversations and providing a nuanced understanding of public sentiment on
complex issues. Recent years have seen a surge of research interest in
developing effective stance detection methods, with contributions from multiple
communities, including natural language processing, web science, and social
computing. This paper provides a comprehensive survey of stance detection
techniques on social media, covering task definitions, datasets, approaches,
and future works. We review traditional stance detection models, as well as
state-of-the-art methods based on large language models, and discuss their
strengths and limitations. Our survey highlights the importance of stance
detection in understanding public opinion and sentiment, and identifies gaps in
current research. We conclude by outlining potential future directions for
stance detection on social media, including the need for more robust and
generalizable models, and the importance of addressing emerging challenges such
as multi-modal stance detection and stance detection in low-resource languages.
♻ ☆ Learning thresholds lead to stable language coexistence
We introduce a language competition model that is based on the
Abrams-Strogatz model and incorporates the effects of memory and learning in
the language shift dynamics. On a coarse-grained time scale, the effects of
memory and learning can be expressed as thresholds on the speaker fractions of
the competing languages. In its simplest form, the resulting model is exactly
solvable. Besides the consensus on one of the two languages, the model
describes additional equilibrium states that are not present in the
Abrams-Strogatz model: a stable dynamical coexistence of the two languages and
a frozen state coinciding with the initial state. We show numerically that
these results are preserved for threshold functions of a more general shape.
The comparison of the model predictions with historical datasets demonstrates
that while the Abrams-Strogatz model fails to describe some relevant language
competition situations, the proposed model provides a good fit.
comment: 15 pages, 6 figures and 1 table
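For reference, the underlying Abrams-Strogatz dynamics can be written as follows; the threshold modification is shown only schematically, since the paper's exact threshold functions are not reproduced here:

```latex
% Abrams-Strogatz dynamics for the speaker fraction x of language X,
% with prestige s and volatility exponent a:
\frac{dx}{dt} = (1 - x)\, P_{YX}(x, s) - x\, P_{XY}(x, s),
\qquad P_{YX}(x, s) = c\, s\, x^{a}, \qquad P_{XY}(x, s) = c\, (1 - s)\, (1 - x)^{a}
% Schematic threshold modification (an assumed form): adoption of X switches on
% only once its speaker fraction exceeds a learning threshold x_c:
P^{\mathrm{thr}}_{YX}(x, s) = c\, s\, x^{a}\, \Theta(x - x_{c})
```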
♻ ☆ Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon, Eliya Habba, Royi Rassin, Yonatan Bitton, Idan Szpektor, Amir Globerson, Yuval Elovici
Imagine observing someone scratching their arm; to understand why, additional
context would be necessary. However, spotting a mosquito nearby would
immediately offer a likely explanation for the person's discomfort, thereby
alleviating the need for further information. This example illustrates how
subtle visual cues can challenge our cognitive skills and demonstrates the
complexity of interpreting visual scenarios. To study these skills, we present
Visual Riddles, a benchmark aimed at testing vision and language models on visual
riddles requiring commonsense and world knowledge. The benchmark comprises 400
visual riddles, each featuring a unique image created by a variety of
text-to-image models, along with a question, ground-truth answer, textual hint, and
attribution. Human evaluation reveals that existing models lag significantly
behind human performance, which is at 82% accuracy, with Gemini-Pro-1.5 leading
with 40% accuracy. Our benchmark comes with automatic evaluation tasks to make
assessment scalable. These findings underscore the potential of Visual Riddles
as a valuable resource for enhancing vision and language models' capabilities
in interpreting complex visual scenarios.
comment: https://visual-riddles.github.io/
♻ ☆ OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models
The advent of large language models (LLMs) has revolutionized natural
language processing, enabling unprecedented capabilities in understanding and
generating human-like text. However, the computational cost and convergence
times associated with fine-tuning these models remain significant challenges.
Low-Rank Adaptation (LoRA) has emerged as a promising method to mitigate these
issues by introducing efficient fine-tuning techniques with a reduced number of
trainable parameters. In this paper, we present OLoRA, an enhancement to the
LoRA method that leverages orthonormal matrix initialization through QR
decomposition. OLoRA significantly accelerates the convergence of LLM training
while preserving the efficiency benefits of LoRA, such as the number of
trainable parameters and GPU memory footprint. Our empirical evaluations
demonstrate that OLoRA not only converges faster but also exhibits improved
performance compared to standard LoRA across a variety of language modeling
tasks. This advancement opens new avenues for more efficient and accessible
fine-tuning of LLMs, potentially enabling broader adoption and innovation in
natural language applications.
comment: 10 pages, 5 figures
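A minimal sketch of a QR-based orthonormal initialization in the spirit described above, assuming the adapter factors are seeded from the leading factors of a QR decomposition of the pretrained weight; the exact scaling and how the seeded component interacts with the base weight follow the paper and may differ from this sketch:

```python
import torch

def olora_init(weight: torch.Tensor, r: int):
    """Seed LoRA factors from a QR decomposition of the pretrained weight
    (sketch of an orthonormal initialization; details may differ from OLoRA).
    weight has shape (out_features, in_features)."""
    q, rr = torch.linalg.qr(weight)       # q: (out, k) orthonormal, rr: (k, in)
    lora_B = q[:, :r].contiguous()        # (out, r), orthonormal columns
    lora_A = rr[:r, :].contiguous()       # (r, in)
    return lora_A, lora_B

# Usage: the adapter delta B @ A then enters the forward pass with the usual LoRA scaling.
W = torch.randn(768, 768)
A, B = olora_init(W, r=8)
```

The key property illustrated here is that one adapter factor starts with orthonormal columns rather than a random or zero initialization, which is the ingredient the abstract credits for faster convergence.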
♻ ☆ OASIS: Open Agent Social Interaction Simulations with One Million Agents
Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Wanli Ouyang, Yu Qiao, Philip Torr, Jing Shao
There has been a growing interest in enhancing rule-based agent-based models
(ABMs) for social media platforms (i.e., X, Reddit) with more realistic large
language model (LLM) agents, thereby allowing for a more nuanced study of
complex systems. As a result, several LLM-based ABMs have been proposed in the
past year. While they hold promise, each simulator is specifically designed to
study a particular scenario, making it time-consuming and resource-intensive to
explore other phenomena using the same ABM. Additionally, these models simulate
only a limited number of agents, whereas real-world social media platforms
involve millions of users. To this end, we propose OASIS, a generalizable and
scalable social media simulator. OASIS is designed based on real-world social
media platforms, incorporating dynamically updated environments (i.e., dynamic
social networks and post information), diverse action spaces (i.e., following,
commenting), and recommendation systems (i.e., interest-based and
hot-score-based). Additionally, OASIS supports large-scale user simulations,
capable of modeling up to one million users. With these features, OASIS can be
easily extended to different social media platforms to study large-scale group
phenomena and behaviors. We replicate various social phenomena, including
information spreading, group polarization, and herd effects across X and Reddit
platforms. Moreover, we provide observations of social phenomena at different
agent group scales. We observe that larger agent group scales lead to more
pronounced group dynamics and more diverse and helpful agent opinions. These
findings demonstrate OASIS's potential as a powerful tool for studying complex
systems in digital environments.
♻ ☆ Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
Retrieval-Augmented Generation (RAG), while serving as a viable complement to
large language models (LLMs), often overlooks the crucial aspect of text
chunking within its pipeline, which impacts the quality of knowledge-intensive
tasks. This paper introduces the concept of Meta-Chunking, which refers to a
granularity between sentences and paragraphs, consisting of a collection of
sentences within a paragraph that have deep linguistic logical connections. To
implement Meta-Chunking, we designed Perplexity (PPL) Chunking, which balances
performance and speed, and precisely identifies the boundaries of text chunks
by analyzing the characteristics of context perplexity distribution.
Additionally, considering the inherent complexity of different texts, we
propose a strategy that combines PPL Chunking with dynamic merging to achieve a
balance between fine-grained and coarse-grained text chunking. Experiments
conducted on eleven datasets demonstrate that Meta-Chunking can more
efficiently improve the performance of single-hop and multi-hop question
answering based on RAG. For instance, on the 2WikiMultihopQA dataset, it
outperforms similarity chunking by 1.32 while only consuming 45.8% of the time.
Furthermore, through the analysis of models of various scales and types, we
observed that PPL Chunking exhibits notable flexibility and adaptability. Our
code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
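A rough sketch of perplexity-based chunk boundary detection in the spirit of PPL Chunking described above: each sentence is scored by a small causal LM conditioned on the running chunk, and a new chunk starts where the perplexity profile spikes. The model choice, spike heuristic, and threshold are illustrative assumptions, not the paper's exact boundary rule.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def conditional_ppl(context: str, sentence: str) -> float:
    """Perplexity of `sentence` given `context` (illustrative sketch)."""
    ctx = tok(context, return_tensors="pt").input_ids
    sent = tok(" " + sentence, return_tensors="pt").input_ids
    ids = torch.cat([ctx, sent], dim=1)
    labels = ids.clone()
    labels[:, : ctx.size(1)] = -100          # score only the sentence tokens
    with torch.no_grad():
        loss = lm(ids, labels=labels).loss   # mean NLL over unmasked tokens
    return float(torch.exp(loss))

def ppl_chunk(sentences, spike=1.5):
    """Start a new chunk when a sentence's conditional perplexity jumps
    `spike`x above the running average of the current chunk (assumed heuristic)."""
    chunks, current, ppls = [], [sentences[0]], []
    for s in sentences[1:]:
        p = conditional_ppl(" ".join(current), s)
        if ppls and p > spike * (sum(ppls) / len(ppls)):
            chunks.append(" ".join(current))
            current, ppls = [], []
        current.append(s)
        ppls.append(p)
    chunks.append(" ".join(current))
    return chunks
```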
♻ ☆ Deanthropomorphising NLP: Can a Language Model Be Conscious?
This work is intended as a voice in the discussion over previous claims that
a pretrained large language model (LLM) based on the Transformer model
architecture can be sentient. Such claims have been made concerning the LaMDA
model and also concerning the current wave of LLM-powered chatbots, such as
ChatGPT. This claim, if confirmed, would have serious ramifications in the
Natural Language Processing (NLP) community due to the widespread use of similar
models. However, here we take the position that such a large language model
cannot be sentient, or conscious, and that LaMDA in particular exhibits no
advances over other similar models that would qualify it. We justify this by
analysing the Transformer architecture through Integrated Information Theory of
consciousness. We see the claims of sentience as part of a wider tendency to
use anthropomorphic language in NLP reporting. Regardless of the veracity of
the claims, we consider this an opportune moment to take stock of progress in
language modelling and consider the ethical implications of the task. In order
to make this work helpful for readers outside the NLP community, we also
present the necessary background in language modelling.
♻ ☆ Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation WACV 2025
Invoices and receipts submitted by employees are visually rich documents
(VRDs) with textual, visual and layout information. To protect against the risk
of fraud and abuse, it is crucial for organizations to efficiently extract
desired information from submitted receipts. This helps in the assessment of
key factors such as appropriateness of the expense claim, adherence to spending
and transaction policies, the validity of the receipt, as well as downstream
anomaly detection at various levels. These documents are heterogeneous, with
multiple formats and languages, uploaded with different image qualities, and
often do not contain ground truth labels for the efficient training of models.
In this paper, we propose Task Aware Instruction-based Labelling (TAIL), a
method for synthetic label generation in VRD corpora without labels, and
fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on
TAIL labels using response-based knowledge distillation without using the
teacher model's weights or training dataset to conditionally generate
annotations in the appropriate format. Using a benchmark external dataset where
ground truth labels are available, we demonstrate conditions under which our
approach performs at par with Claude 3 Sonnet through empirical studies. We
then show that the resulting model performs at par or better on the internal
expense documents of a large multinational organization than state-of-the-art
LMM (large multimodal model) Claude 3 Sonnet while being 85% less costly and
~5X faster, and outperforms layout-aware baselines by more than 10% in Average
Normalized Levenshtein Similarity (ANLS) scores due to its ability to reason
and extract information from rare formats. Finally, we illustrate the usage of
our approach in overpayment prevention.
comment: Accepted to WACV 2025
♻ ☆ Towards the Dynamics of a DNN Learning Symbolic Interactions
This study proves the two-phase dynamics of a deep neural network (DNN)
learning interactions. Despite long-standing doubts about the faithfulness
of post-hoc explanations of DNNs, a series of theorems have been proven in
recent years to show that for a given input sample, a small set of interactions
between input variables can be considered as primitive inference patterns that
faithfully represent a DNN's detailed inference logic on that sample.
Particularly, Zhang et al. have observed that various DNNs all learn
interactions of different complexities in two distinct phases, and this
two-phase dynamics well explains how a DNN changes from under-fitting to
over-fitting. Therefore, in this study, we mathematically prove the two-phase
dynamics of interactions, providing a theoretical mechanism for how the
generalization power of a DNN changes during the training process. Experiments
show that our theory well predicts the real dynamics of interactions on
different DNNs trained for various tasks.
♻ ☆ StepTool: A Step-grained Reinforcement Learning Framework for Tool Learning in LLMs
Yuanqing Yu, Zhefan Wang, Weizhi Ma, Zhicheng Guo, Jingtao Zhan, Shuai Wang, Chuhan Wu, Zhiqiang Guo, Min Zhang
Despite having powerful reasoning and inference capabilities, Large Language
Models (LLMs) still need external tools for real-time information
retrieval or domain-specific expertise to solve complex tasks, which is
referred to as tool learning. Existing tool learning methods primarily rely on
tuning with expert trajectories, focusing on token-sequence learning from a
linguistic perspective. However, there are several challenges: (1) imitating
static trajectories limits the ability to generalize to new tasks, and (2) even
expert trajectories can be suboptimal, and better solution paths may exist. In
this work, we introduce StepTool, a novel step-grained reinforcement learning
framework to improve tool learning in LLMs. It consists of two components:
Step-grained Reward Shaping, which assigns rewards at each tool interaction
based on tool invocation success and its contribution to the task, and
Step-grained Optimization, which uses policy gradient methods to optimize the
model in a multi-step manner. Experimental results demonstrate that StepTool
significantly outperforms existing methods in multi-step, tool-based tasks,
providing a robust solution for complex task environments. Codes are available
at https://github.com/yuyq18/StepTool.
comment: Ongoing Work
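A schematic of step-grained policy-gradient optimization in this spirit: each tool-call step receives its own shaped reward, and each step's log-probability is weighted by the return from that step onward. The reward definitions and policy interface are placeholders, not StepTool's exact design.

```python
import torch

def step_grained_pg_loss(step_logprobs, step_rewards, gamma=1.0):
    """REINFORCE-style loss with per-step (tool-call) rewards.

    step_logprobs: list of tensors, log-probs of the tokens emitted at each step
    step_rewards:  list of floats, shaped reward for each tool interaction
    (a sketch; StepTool's exact shaping and optimizer may differ)."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):          # reward-to-go for each step
        g = r + gamma * g
        returns.append(g)
    returns = list(reversed(returns))
    loss = 0.0
    for logp, ret in zip(step_logprobs, returns):
        loss = loss - logp.sum() * ret        # maximize expected return
    return loss / len(step_rewards)
```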
♻ ☆ From General to Specific: Utilizing General Hallucination to Benchmark Specific Role-Playing Agents
The advanced role-playing capabilities of Large Language Models (LLMs) have
paved the way for developing Role-Playing Agents (RPAs). However, existing
benchmarks in this domain, such as HPD and SocialBench, face limitations like
poor generalizability, implicit and inaccurate judgments, and the risk of model
forgetting. To address the above issues, we propose an automatic, scalable, and
generalizable paradigm. Specifically, we construct a benchmark, SHARP, by
extracting relations from a general knowledge graph and leveraging the inherent
hallucination properties of RPAs to simulate interactions across roles. We
employ ChatGPT for stance detection and define relationship hallucination along
with three related metrics based on stance transfer. Extensive experiments
validate the effectiveness and stability of our paradigm. Our findings further
explore the factors influencing these metrics and discuss the trade-off between
blind loyalty to relationships and adherence to facts in RPAs.
comment: Revised three typos in the abstract and methodology sections of the
introduction
♻ ☆ Assessing the Answerability of Queries in Retrieval-Augmented Code Generation
Thanks to the unprecedented language understanding and generation capabilities of
large language models (LLMs), Retrieval-augmented Code Generation (RaCG) has
recently been widely utilized among software developers. While this has
increased productivity, there are still frequent instances of incorrect code
being provided. In particular, there are cases where plausible yet incorrect
code is generated for user queries that cannot be answered with the
given API descriptions. This study proposes a task for evaluating
answerability, which assesses whether valid answers can be generated based on
users' queries and retrieved APIs in RaCG. Additionally, we build a benchmark
dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to
evaluate the performance of models performing this task. Experimental results
show that this task remains at a very challenging level, with baseline models
exhibiting a low performance of 46.7%. Furthermore, this study discusses
methods that could significantly improve performance.
♻ ☆ Beyond Turing Test: Can GPT-4 Sway Experts' Decisions?
In the post-Turing era, evaluating large language models (LLMs) involves
assessing generated text based on readers' reactions rather than merely its
indistinguishability from human-produced content. This paper explores how
LLM-generated text impacts readers' decisions, focusing on both amateur and
expert audiences. Our findings indicate that GPT-4 can generate persuasive
analyses affecting the decisions of both amateurs and professionals.
Furthermore, we evaluate the generated text from the aspects of grammar,
convincingness, logical coherence, and usefulness. The results highlight a high
correlation between real-world evaluation through audience reactions and the
current multi-dimensional evaluators commonly used for generative models.
Overall, this paper shows the potential and risk of using generated text to
sway human decisions and also points out a new direction for evaluating
generated text, i.e., leveraging the reactions and decisions of readers. We
release our dataset to assist future research.
♻ ☆ Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE ICLR 2024
Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
Recent studies have demonstrated that Large Language Models (LLMs) can extend
their zero-shot generalization capabilities to multimodal learning through
instruction tuning. As more modalities and downstream tasks are introduced,
negative conflicts and interference may have a worse impact on performance.
While this phenomenon has been overlooked in previous work, we propose a novel
and extensible framework, called Octavius, for comprehensive studies and
experimentation on multimodal learning with Multimodal Large Language Models
(MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) and
one of the representative PEFT techniques, i.e., LoRA, designing a novel
LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our
knowledge, we are one of the pioneering efforts to introduce MoE into MLLMs to
address this problem. The experimental results (about 20% improvement) have
shown the effectiveness and versatility of our design in various 2D and 3D
downstream tasks. Code and datasets are available at
https://openlamm.github.io/tutorial/.
comment: 22 pages, 12 figures. Accepted in ICLR 2024
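A minimal sketch of a LoRA-MoE layer in the spirit described above: a router mixes several low-rank adapter "experts" on top of a frozen base linear layer. The expert count, routing, and gating here are illustrative assumptions, not Octavius's exact design.

```python
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    """Frozen base linear layer plus a routed mixture of LoRA experts (sketch)."""
    def __init__(self, base: nn.Linear, r=8, n_experts=4, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep the pretrained weight frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, r))   # zero init: no delta at start
        self.router = nn.Linear(d_in, n_experts)
        self.scale = alpha / r

    def forward(self, x):                                         # x: (..., d_in)
        gates = torch.softmax(self.router(x), dim=-1)             # (..., E)
        h = torch.einsum('erd,...d->...er', self.A, x)            # (..., E, r)
        h = torch.einsum('eor,...er->...eo', self.B, h)           # (..., E, d_out)
        mixed = (gates.unsqueeze(-1) * h).sum(dim=-2)             # (..., d_out)
        return self.base(x) + self.scale * mixed
```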
♻ ☆ Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach
We study the problem of fine-tuning a language model (LM) for a target task
by optimally using the information from $n$ auxiliary tasks. This problem has
broad applications in NLP, such as targeted instruction tuning and data
selection in chain-of-thought fine-tuning. The key challenge of this problem is
that not all auxiliary tasks are useful to improve the performance of the
target task. Thus, choosing the right subset of auxiliary tasks is crucial.
Conventional subset selection methods, such as forward and backward stepwise
selection, are unsuitable for LM fine-tuning because they require repeated
training on subsets of auxiliary tasks. This paper introduces a new algorithm
to estimate model fine-tuning performances without repeated training. Our
algorithm first performs multitask training using the data of all the tasks to
obtain a meta initialization. Then, we approximate the model fine-tuning loss
of a subset using functional values and gradients from the meta initialization.
Empirically, we find that this gradient-based approximation holds with
remarkable accuracy for twelve transformer-based LMs. Thus, we can now estimate
fine-tuning performances on CPUs within a few seconds. Finally, we fine-tune
the pretrained base model once on the selected subset of tasks. We conduct
extensive experiments to validate this approach, delivering a speedup of
$30\times$ over conventional subset selection while incurring only $1\%$ error
of the true fine-tuning performances. In downstream evaluations involving both
instruction tuning and chain-of-thought fine-tuning, this loss-based selection
approach improves over prior gradient or representation similarity-based
methods for subset selection by up to $3.8\%$.
comment: 17 pages
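The core estimate can be illustrated with a first-order expansion around the meta initialization: the target-task loss after fine-tuning on a candidate subset is approximated from the loss and gradient at the meta initialization, without retraining. The variable names and the assumed form of the parameter shift below are made up for the sketch and may differ from the paper's construction.

```python
import numpy as np

def estimate_subset_loss(loss0, grad_target, task_updates, subset):
    """First-order estimate of the target-task loss after fine-tuning on a
    subset S of auxiliary tasks, without retraining (sketch):

        L(theta_S) ~= L(theta_0) + grad_L(theta_0) . (theta_S - theta_0)

    where the parameter shift is approximated as the average of per-task
    update directions computed once at the meta initialization theta_0."""
    shift = np.mean([task_updates[t] for t in subset], axis=0)
    return loss0 + grad_target @ shift

# Usage with toy numbers (purely illustrative):
rng = np.random.default_rng(0)
d = 1000
grad_target = rng.normal(size=d)
task_updates = {t: rng.normal(scale=0.01, size=d) for t in range(12)}
print(estimate_subset_loss(2.3, grad_target, task_updates, subset=[0, 3, 7]))
```

Because the gradients and per-task updates are computed only once, candidate subsets can then be scored with a handful of dot products, which is what makes CPU-scale estimation in seconds plausible.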
♻ ☆ Continual Learning of Large Language Models: A Comprehensive Survey
Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, Hao Wang
The recent success of large language models (LLMs) trained on static,
pre-collected, general datasets has sparked numerous research directions and
applications. One such direction addresses the non-trivial challenge of
integrating pre-trained LLMs into dynamic data distributions, task structures,
and user preferences. Pre-trained LLMs, when tailored for specific needs, often
experience significant performance degradation in previous knowledge domains --
a phenomenon known as "catastrophic forgetting". While extensively studied in
the continual learning (CL) community, it presents new manifestations in the
realm of LLMs. In this survey, we provide a comprehensive overview of the
current research progress on LLMs within the context of CL. This survey is
structured into four main sections: we first describe an overview of
continually learning LLMs, consisting of two directions of continuity: vertical
continuity (or vertical continual learning), i.e., continual adaptation from
general to specific capabilities, and horizontal continuity (or horizontal
continual learning), i.e., continual adaptation across time and domains
(Section 3). We then summarize three stages of learning LLMs in the context of
modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP),
and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of
evaluation protocols for continual learning with LLMs, along with the current
available data sources (Section 5). Finally, we discuss intriguing questions
pertaining to continual learning for LLMs (Section 6). The full list of papers
examined in this survey is available at
https://github.com/Wang-ML-Lab/llm-continual-learning-survey.
comment: 44 pages, 2 figures, 4 tables; Work in progress
♻ ☆ Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning ICDE 2025
Stress detection is a critical area of research with significant implications
for health monitoring and intervention systems. In this paper, we propose a
novel interpretable approach for video-based stress detection, leveraging
self-refine chain-of-thought reasoning to enhance both accuracy and
transparency in decision-making processes. Our method focuses on extracting
subtle behavioral and physiological cues from video sequences that indicate
stress levels. By incorporating a chain-of-thought reasoning mechanism, the
system refines its predictions iteratively, ensuring that the decision-making
process can be traced and explained. The model also learns to self-refine
through feedback loops, improving its reasoning capabilities over time.
We evaluate our approach on several public and private datasets,
demonstrating its superior performance in comparison to traditional video-based
stress detection methods. Additionally, we provide comprehensive insights into
the interpretability of the model's predictions, making the system highly
valuable for applications in both healthcare and human-computer interaction
domains.
comment: submitted to ICDE 2025 for review
♻ ☆ KBAlign: Efficient Self Adaptation on Specific Knowledge Bases
Zheni Zeng, Yuxuan Chen, Shi Yu, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Humans can utilize techniques to quickly acquire knowledge from specific
materials in advance, such as creating self-assessment questions, enabling us
to achieve related tasks more efficiently. In contrast, large language models
(LLMs) usually rely on retrieval-augmented generation to exploit knowledge
materials in an instant manner, or require external signals such as human
preference data and stronger LLM annotations to conduct knowledge adaptation.
To unleash the self-learning potential of LLMs, we propose KBAlign, an approach
designed for efficient adaptation to downstream tasks involving knowledge
bases. Our method utilizes iterative training with self-annotated data such as
Q&A pairs and revision suggestions, enabling the model to grasp the knowledge
content efficiently. Experimental results on multiple datasets demonstrate the
effectiveness of our approach, significantly boosting model performance in
downstream tasks that require specific knowledge at a low cost. Notably, our
approach achieves over 90% of the performance improvement that can be obtained
by using GPT-4-turbo annotation, while relying entirely on self-supervision. We
release our experimental data, models, and process analyses to the community
for further exploration (https://github.com/thunlp/KBAlign).
♻ ☆ The GPT-WritingPrompts Dataset: A Comparative Analysis of Character Portrayal in Short Stories EMNLP 2024
The improved generative capabilities of large language models have made them
a powerful tool for creative writing and storytelling. It is therefore
important to quantitatively understand the nature of generated stories, and how
they differ from human storytelling. We augment the Reddit WritingPrompts
dataset with short stories generated by GPT-3.5, given the same prompts. We
quantify and compare the emotional and descriptive features of storytelling
from both generative processes, human and machine, along a set of six
dimensions. We find that generated stories differ significantly from human
stories along all six dimensions, and that human and machine generations
display similar biases when grouped according to the narrative point-of-view
and gender of the main protagonist. We release our dataset and code at
https://github.com/KristinHuangg/gpt-writing-prompts.
comment: 9 pages plus appendices; published at the 6th Workshop on Narrative
Understanding, EMNLP 2024
♻ ☆ Emotion Granularity from Text: An Aggregate-Level Indicator of Mental Health EMNLP 2024
Krishnapriya Vishnubhotla, Daniela Teodorescu, Mallory J. Feldman, Kristen A. Lindquist, Saif M. Mohammad
We are united in how emotions are central to shaping our experiences; and
yet, individuals differ greatly in how we each identify, categorize, and
express emotions. In psychology, variation in the ability of individuals to
differentiate between emotion concepts is called emotion granularity
(determined through self-reports of one's emotions). High emotion granularity
has been linked with better mental and physical health; whereas low emotion
granularity has been linked with maladaptive emotion regulation strategies and
poor health outcomes. In this work, we propose computational measures of
emotion granularity derived from temporally-ordered speaker utterances in
social media (in lieu of self-reports that suffer from various biases). We then
investigate the effectiveness of such text-derived measures of emotion
granularity in functioning as markers of various mental health conditions
(MHCs). We establish baseline measures of emotion granularity derived from
textual utterances, and show that, at an aggregate level, emotion granularities
are significantly lower for people self-reporting as having an MHC than for the
control population. This paves the way towards a better understanding of the
MHCs, and specifically the role emotions play in our well-being.
comment: 9 pages plus appendices; published as a long paper at EMNLP 2024