AI Research

Latest research papers from arXiv covering machine learning, computer vision, natural language processing, and more.

Astra: General Interactive World Model with Autoregressive Denoising

Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from text or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose s...

Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng
Dec 9, 2025

Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through traini...

Youming Deng, Songyou Peng, Junyi Zhang
Dec 9, 2025

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architectu...

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula
Dec 9, 2025

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

We introduce two new benchmarks, REST and REST+ (Render-Equivalence Stress Tests), to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same t...

Angela van Sprang, Laurens Samson, Ana Lucic
Dec 9, 2025

Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration

Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the ab-...

Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim
Dec 9, 2025

OSMO: Open-Source Tactile Glove for Human-to-Robot Skill Transfer

Human video demonstrations provide abundant training data for learning robot policies, but video alone cannot capture the rich contact signals critical for mastering manipulation. We introduce OSMO, an open-source wearable tactile glove designed for human-to-robot skill transfer. The glove features ...

Jessica Yin, Haozhi Qi, Youngsun Wi
Dec 9, 2025

SAQ: Stabilizer-Aware Quantum Error Correction Decoder

Quantum Error Correction (QEC) decoding faces a fundamental accuracy-efficiency tradeoff. Classical methods like Minimum Weight Perfect Matching (MWPM) exhibit variable performance across noise models and suffer from polynomial complexity, while tensor network decoders achieve high accuracy but at p...

David Zenati, Eliya Nachmani
Dec 9, 2025

LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception

Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with hi...

Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi
Dec 9, 2025

Self-Evolving 3D Scene Generation from a Single Image

Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful st...

Kaizhi Zheng, Yue Fan, Jing Gu
Dec 9, 2025

UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation

Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable o...

Zeyang Liu, Le Wang, Sanping Zhou
Dec 9, 2025

Open Polymer Challenge: Post-Competition Report

Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benc...

Gang Liu, Sobin Alosious, Subhamoy Mahajan
Dec 9, 2025

Unsupervised Learning of Density Estimates with Topological Optimization

Kernel density estimation is a key component of a wide variety of algorithms in machine learning, Bayesian inference, stochastic dynamics and signal processing. However, the unsupervised density estimation technique requires tuning a crucial hyperparameter: the kernel bandwidth. The choice of bandwi...

Suina Tanweer, Firas A. Khasawneh
Dec 9, 2025
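The bandwidth sensitivity this abstract refers to is easy to see in a minimal Gaussian kernel density estimate. The sketch below is generic NumPy code illustrating the background problem, not the paper's topological optimization method, and the bimodal sample data is invented for illustration:

```python
import numpy as np

def gaussian_kde(samples, xs, bandwidth):
    """Evaluate a Gaussian KDE at points xs for a given bandwidth."""
    diffs = (xs[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Bimodal data: two well-separated Gaussian clusters.
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
xs = np.linspace(-5, 5, 201)

# A large bandwidth oversmooths the two modes into one broad bump;
# a small bandwidth preserves the bimodal structure.
smooth = gaussian_kde(samples, xs, bandwidth=2.0)
sharp = gaussian_kde(samples, xs, bandwidth=0.3)
```

Both estimates integrate to one, but only the small-bandwidth estimate shows a clear valley at x = 0 between the two modes, which is why bandwidth selection is the crucial hyperparameter the abstract highlights.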

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from th...

Jakub Krajewski, Amitis Shidani, Dan Busbridge
Dec 9, 2025
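As background for the scaling framing, the classic proxy-metric approach fits a power law in log-log space and extrapolates. The sketch below uses synthetic data and plain least squares; it illustrates the conventional method the paper builds on, not the paper's proposed framework:

```python
import numpy as np

# Synthetic (compute, loss) points that follow a power law L(C) = a * C^(-b).
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 4.0 * compute ** -0.05

# Power laws are linear in log-log space: log L = log a - b * log C,
# so ordinary least squares recovers the exponent and prefactor.
coeffs = np.polyfit(np.log(compute), np.log(loss), deg=1)
b_hat, log_a_hat = -coeffs[0], coeffs[1]

# Extrapolate the fitted law to a larger compute budget.
predicted = np.exp(log_a_hat) * 1e22 ** -b_hat
```

The catch the abstract points to is that this works well for smooth proxy metrics like pretraining loss, while downstream benchmark accuracy is often non-monotone or saturating, which is why predicting it directly has been considered unreliable.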

Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection method...

Guangzhi Xiong, Zhenghao He, Bohan Liu
Dec 9, 2025

No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches whi...

Damiano Marsili, Georgia Gkioxari
Dec 9, 2025

Data from arXiv.org • Updated hourly