Computation and Language
☆ Language Generation with Infinite Contamination
We study language generation in the limit, where an algorithm observes an
adversarial enumeration of strings from an unknown target language $K$ and must
eventually generate new, unseen strings from $K$. Kleinberg and Mullainathan
[KM24] proved that generation is achievable in surprisingly general settings.
However, their generator suffers from "mode collapse," producing strings from an
ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25]
require the generator's output to be "dense" in the target language. They
showed that generation with density, surprisingly, remains achievable at the
same generality.
Both results assume perfect data: no noisy insertions and no omissions. This
raises a central question: how much contamination can generation tolerate?
Recent works made partial progress on this question by studying (non-dense)
generation with either finite amounts of noise (but no omissions) or omissions
(but no noise).
We characterize robustness under contaminated enumerations:
1. Generation under Contamination: Language generation in the limit is
achievable for all countable collections iff the fraction of contaminated
examples converges to zero. When this fails, we characterize which collections
are generable.
2. Dense Generation under Contamination: Dense generation is strictly less
robust to contamination than generation.
As a byproduct, we resolve an open question of Raman and Raman [ICML25] by
showing that generation is possible with only membership oracle access under
finitely many contaminated examples.
Finally, we introduce a beyond-worst-case model inspired by curriculum
learning and prove that dense generation is achievable even with infinite
contamination provided the fraction of contaminated examples converges to zero.
This suggests curriculum learning may be crucial for learning from noisy web
data.
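To make the condition "infinite contamination with a vanishing fraction" concrete, here is a minimal sketch. The noise pattern is purely an illustrative assumption and is not taken from the paper: position i of the enumeration is treated as contaminated exactly when i is a perfect square. There are infinitely many such positions, yet the contaminated fraction of the first n examples is about 1/sqrt(n), which tends to zero.

```python
import math


def contaminated(i: int) -> bool:
    """Hypothetical noise pattern (an assumption for illustration only):
    position i (1-indexed) is contaminated iff i is a perfect square."""
    r = math.isqrt(i)
    return r * r == i


def contamination_fraction(n: int) -> float:
    """Fraction of the first n enumerated examples that are contaminated."""
    return sum(contaminated(i) for i in range(1, n + 1)) / n


# The number of contaminated positions grows without bound,
# but the fraction shrinks roughly like 1 / sqrt(n).
for n in (100, 10_000, 1_000_000):
    print(n, contamination_fraction(n))
# prints:
# 100 0.1
# 10000 0.01
# 1000000 0.001
```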
☆ DigiData: Training and Evaluating General-Purpose Mobile Control Agents
Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D'Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg, Karl Ridgeway, Joseph Tighe
AI agents capable of controlling user interfaces have the potential to
transform human interaction with digital devices. To accelerate this
transformation, two fundamental building blocks are essential: high-quality
datasets that enable agents to achieve complex and human-relevant goals, and
robust evaluation methods that allow researchers and practitioners to rapidly
enhance agent performance. In this paper, we introduce DigiData, a large-scale,
high-quality, diverse, multi-modal dataset designed for training mobile control
agents. Unlike existing datasets, which derive goals from unstructured
interactions, DigiData is meticulously constructed through comprehensive
exploration of app features, resulting in greater diversity and higher goal
complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating
mobile control agents on real-world complex tasks. We demonstrate that the
commonly used step-accuracy metric falls short in reliably assessing mobile
control agents and, to address this, we propose dynamic evaluation protocols
and AI-powered evaluations as rigorous alternatives for agent assessment. Our
contributions aim to significantly advance the development of mobile control
agents, paving the way for more intuitive and effective human-device
interactions.
comment: Website: https://facebookresearch.github.io/DigiData
☆ SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
We introduce SPOT (Stopping Points in Online Threads), the first annotated
corpus translating the sociological concept of stopping point into a
reproducible NLP task. Stopping points are ordinary critical interventions that
pause or redirect online discussions through a range of forms (irony, subtle
doubt or fragmentary arguments) that frameworks like counterspeech or social
correction often overlook. We operationalize this concept as a binary
classification task and provide reliable annotation guidelines. The corpus
contains 43,305 manually annotated French Facebook comments linked to URLs
flagged as false information by social media users, enriched with contextual
metadata (article, post, parent comment, page or group, and source). We
benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs
under various prompting strategies. Results show that fine-tuned encoders
outperform prompted LLMs in F1 score by more than 10 percentage points,
confirming the importance of supervised learning for emerging non-English
social media tasks. Incorporating contextual metadata further improves encoder
models' F1 scores from 0.75 to 0.78. We release the anonymized dataset, along
with the annotation guidelines and code in our code repository, to foster
transparency and reproducible research.
☆ SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards NeurIPS 2025
Multimodal large language models (MLLMs) have achieved remarkable progress in
vision-language tasks, but they continue to struggle with spatial
understanding. Existing spatial MLLMs often rely on explicit 3D inputs or
architecture-specific modifications, and remain constrained by large-scale
datasets or sparse supervision. To address these limitations, we introduce
SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial
grounding with multi-step reasoning. The model simulates human-like spatial
perception by constructing a scene graph of task-relevant objects and spatial
relations, and reasoning towards an answer via dense spatial rewards.
SpatialThinker builds on two key contributions: (1) a data synthesis pipeline
that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL
with a multi-objective dense spatial reward enforcing spatial grounding.
SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline
on spatial understanding and real-world VQA benchmarks, nearly doubling the
base-model gain compared to sparse RL, and surpassing GPT-4o. These results
showcase the effectiveness of combining spatial supervision with reward-aligned
reasoning in enabling robust 3D spatial understanding with limited data and
advancing MLLMs towards human-level visual reasoning.
comment: Preprint. Accepted at NeurIPS 2025 Workshops on SPACE in Vision,
Language, and Embodied AI (SpaVLE), Embodied World Models for Decision Making
(EWM), Aligning Reinforcement Learning Experimentalists and Theorists
(ARLET), and Scaling Environments for Agents (SEA)
☆ ConvFill: Model Collaboration for Responsive Conversational Voice Agents
Deploying conversational voice agents with large language models faces a
critical challenge: cloud-based foundation models provide deep reasoning and
domain knowledge but introduce latency that disrupts natural conversation,
while on-device models respond immediately but lack sophistication. We propose
conversational infill, a task where a lightweight on-device model generates
contextually appropriate dialogue while seamlessly incorporating streaming
knowledge from a powerful backend model. This approach decouples response
latency from model capability, enabling systems that feel responsive while
accessing the full power of large-scale models. We present ConvFill, a 360M
parameter model trained on synthetic multi-domain conversations. Evaluation
across multiple backend models shows that conversational infill can be
successfully learned, with ConvFill achieving accuracy improvements of 36-42%
over standalone small models of the same size while consistently retaining
sub-200ms response latencies. Our results demonstrate the promise of this
approach for building on-device conversational agents that are both immediately
responsive and knowledgeable.
☆ Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction
In da Vinci robotic surgery, surgeons' hands and eyes are fully engaged in
the procedure, making it difficult to access and manipulate multimodal patient
data without interruption. We propose a voice-directed Surgical Agent
Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework,
consisting of an orchestration agent and three task-specific agents driven by
Large Language Models (LLMs). These LLM-based agents autonomously plan, refine,
validate, and reason to map voice commands into specific tasks such as
retrieving clinical information, manipulating CT scans, or navigating 3D
anatomical models on the surgical video. We also introduce a Multi-level
Orchestration Evaluation Metric (MOEM) to comprehensively assess the
performance and robustness from command-level and category-level perspectives.
The SAOP achieves high accuracy and success rates across 240 voice commands,
while LLM-based agents improve robustness against speech recognition errors and
diverse or ambiguous free-form commands, demonstrating strong potential to
support minimally invasive da Vinci robotic surgery.
comment: 22 pages, 12 figures, 1 table, Supplementary Information,
Supplementary Data 1
☆ Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum
Recent advances in depth-recurrent language models show that recurrence can
decouple train-time compute and parameter count from test-time compute. In this
work, we study how to convert existing pretrained non-recurrent language models
into depth-recurrent models. We find that using a curriculum of recurrences to
increase the effective depth of the model over the course of training preserves
performance while reducing total computational cost. In our experiments on
mathematics, we observe that converting pretrained models to recurrent ones
results in better performance at a given compute budget than simply
post-training the original non-recurrent language model.
comment: code: https://github.com/mcleish7/retrofitting-recurrence, models:
https://huggingface.co/collections/tomg-group-umd/retrofitting-recurrence
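As a rough illustration of what a "curriculum of recurrences" could look like, the sketch below ramps the number of recurrences linearly over the first part of training and compares the total number of recurrent passes against always training at the final depth. The function names, the linear shape, the default values, and the assumption that cost scales with the recurrence count are all hypothetical; the abstract does not specify the schedule the authors use.

```python
def recurrence_schedule(step: int, total_steps: int, start: int = 1,
                        end: int = 8, ramp_frac: float = 0.5) -> int:
    """Hypothetical linear curriculum: how many times the recurrent block
    is applied at a given training step. All defaults are illustrative."""
    ramp_steps = max(1, int(total_steps * ramp_frac))
    progress = min(step / ramp_steps, 1.0)
    return start + int(progress * (end - start))


def total_recurrent_passes(total_steps: int, use_curriculum: bool,
                           end: int = 8) -> int:
    """Crude cost proxy (an assumption): sum of recurrences over all steps."""
    if not use_curriculum:
        return end * total_steps
    return sum(recurrence_schedule(s, total_steps, end=end)
               for s in range(total_steps))


steps = 1000
print(recurrence_schedule(0, steps))     # recurrences at the first step
print(recurrence_schedule(250, steps))   # partway through the ramp
print(recurrence_schedule(999, steps))   # after the ramp has finished
print(total_recurrent_passes(steps, use_curriculum=True),
      total_recurrent_passes(steps, use_curriculum=False))
```

Under this proxy the curriculum run performs fewer recurrent passes than the fixed-depth run while ending at the same effective depth, which is the kind of saving the abstract describes.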
☆ Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation
Large Language Models (LLMs) have advanced the automated generation of code
from natural language prompts. However, low-resource languages (LRLs) like
Bangla remain underrepresented due to the limited availability of
instruction-to-code datasets and evaluation benchmarks. To address this, the
BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on "Code Generation
in Bangla". In this work, we propose a method that combines instruction
prompting with a test-driven, feedback-guided iterative refinement process
using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla
instructions, tests it against unit tests, and iteratively refines any failing
outputs through three evaluation passes, using test feedback to guide each
step. This approach helped our team "Retriv" to secure 2nd place in the shared
task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla
instruction understanding and Python code generation, emphasizing the need for
targeted methods in LRLs. We make our experimental scripts publicly available to
the community.
comment: 8 pages, 1 figure, experimental scripts publicly available at
https://github.com/NafiAsib/Retriv-BLP25-Task-2
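The generate, test, refine loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: `run_tests`, `refine`, and `stub_generate` are hypothetical names, the stub stands in for the fine-tuned model, and the instruction string is an English placeholder for a Bangla prompt. Candidate code is executed with `exec`, which would need sandboxing for untrusted model output.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def run_tests(source: str, func_name: str, tests: List[TestCase]) -> List[str]:
    """Run candidate source against unit tests; return failure messages.
    An empty list means every test passed. NOTE: exec() on model output
    is unsafe outside a sandbox; this is for illustration only."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception as exc:
        return [f"could not load {func_name}: {exc!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(
                f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures


def refine(generate: Callable[[str, List[str]], str], instruction: str,
           func_name: str, tests: List[TestCase],
           max_passes: int = 3) -> Tuple[str, List[str]]:
    """Generate code, test it, and feed failures back for up to max_passes."""
    source, feedback = "", []
    for _ in range(max_passes):
        source = generate(instruction, feedback)
        feedback = run_tests(source, func_name, tests)
        if not feedback:  # all tests pass, stop early
            break
    return source, feedback


def stub_generate(instruction: str, feedback: List[str]) -> str:
    """Hypothetical stand-in for the model: buggy first, fixed after feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


source, failures = refine(stub_generate, "Add two numbers (placeholder prompt)",
                          "add", [((1, 2), 3), ((0, 0), 0)])
print(failures)  # prints: []
```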
☆ Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Large language models (LLMs) have achieved remarkable success across a wide
range of tasks, yet their application in low-resource domains remains a
significant challenge due to data scarcity and the high risk of overfitting.
While in-domain data is limited, vast amounts of similar general-domain data
exist, and our initial findings suggest that this data could serve as
auxiliary supervision for domain enhancement. This
observation leads us to our central research question: how to
effectively select the most valuable auxiliary data to maximize domain-specific
performance, particularly when traditional methods are inapplicable due to a
lack of large in-domain data pools or validation sets. To address this, we
propose NTK-Selector, a principled and efficient framework for
selecting general-domain auxiliary data to enhance domain-specific performance
via neural tangent kernels (NTK). Our method tackles two challenges of directly
applying NTK to LLMs, theoretical assumptions and prohibitive computational
cost, by empirically demonstrating a stable NTK-like behavior in LLMs during
LoRA fine-tuning and proposing a Jacobian-free approximation method. Extensive
experiments across four low-resource domains (medical, financial, legal, and
psychological) demonstrate that NTK-Selector consistently improves downstream
performance. Specifically, fine-tuning on 1,000 in-domain samples alone only
yielded +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In
contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led
to substantial gains of +8.7 and +5.1 points, which corresponds to a
10.9x and 5.7x improvement over the domain-only setting.
comment: 27 pages
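The reported improvement factors follow directly from the gains quoted in the abstract: each is the gain with auxiliary data divided by the domain-only gain. A short check, where the dictionary layout and function name are illustrative only:

```python
# Gains in points as quoted in the abstract: (domain-only, with auxiliary data).
gains = {
    "Llama3-8B-Instruct": (0.8, 8.7),
    "Qwen3-8B": (0.9, 5.1),
}


def improvement_ratio(domain_only_gain: float, augmented_gain: float) -> float:
    """Ratio of the gain with auxiliary data to the domain-only gain."""
    return augmented_gain / domain_only_gain


for model, (domain_only, augmented) in gains.items():
    print(f"{model}: {improvement_ratio(domain_only, augmented):.1f}x")
# prints:
# Llama3-8B-Instruct: 10.9x
# Qwen3-8B: 5.7x
```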
♻ ☆ Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection
Multimodal models play a key role in empathy detection, but their performance
can suffer when modalities provide conflicting cues. To understand these
failures, we examine cases where unimodal and multimodal predictions diverge.
Using fine-tuned models for text, audio, and video, along with a gated fusion
model, we find that such disagreements often reflect underlying ambiguity, as
evidenced by annotator uncertainty. Our analysis shows that dominant signals in
one modality can mislead fusion when unsupported by others. We also observe
that humans, like models, do not consistently benefit from multimodal input.
These insights position disagreement as a useful diagnostic signal for
identifying challenging examples and improving empathy system robustness.