¹ Tsinghua University   ² Tencent   † Corresponding Author
4D reconstruction from a single monocular video is an important but challenging task due to its inherently under-constrained nature. Most existing 4D reconstruction methods focus on multi-camera settings and therefore suffer from the limited multi-view information available in monocular videos. Recent studies have attempted to mitigate this ill-posed problem by incorporating data-driven priors as additional supervision. However, they require hours of optimization to align the splatted 2D feature maps of explicit Gaussians with the various priors, which limits their range of applications. To address this time-consuming issue, we propose 4D-Fly, an efficient and effective framework for reconstructing the 4D scene from a monocular video (hundreds of frames within 6 minutes), more than 20× faster than previous optimization-based methods while achieving even higher quality. Our key insight is to unleash the explicit property of Gaussian primitives and directly apply data priors to them. Specifically, we build a streaming 4D reconstruction paradigm that includes: propagating existing Gaussians to the next timestep with an anchor-based strategy, expanding the 4D scene map with a canonical Gaussian map, and an efficient 4D scene optimization process that further improves visual quality and motion accuracy. Extensive experiments demonstrate the superiority of our 4D-Fly over state-of-the-art methods in terms of speed and quality.
Pipeline of 4D-Fly. Our method takes as input a casually captured video with the camera intrinsics and extrinsics of each frame, aiming to reconstruct the dynamic 3D scene and the underlying motion of every point. We represent the underlying 4D scene as a global 4D scene map 𝒢 and construct it in a streaming way. Assume we have constructed 𝒢₁→ₜ. For the new frame at timestep t+1, we first compute its monocular depth map, segmentation mask, and 2D tracks using off-the-shelf models. Then, we extend the 4D scene map using: (1) anchor-based propagation to propagate existing dynamic Gaussians to the next timestep; (2) 4D scene expansion with a canonical Gaussian map; and (3) fast optimization for both foreground and background. After training over the entire sequence, our 4D scene map allows for novel view rendering at any queried timestep and can also be used for point tracking.
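
To make the streaming construction concrete, below is a minimal Python/NumPy sketch of the three steps in the caption. All names here (GaussianMap, lift_tracks, propagate_anchors, expand_scene, optimize_step) are hypothetical illustrations under simplifying assumptions (nearest-anchor displacement instead of the paper's anchor-based interpolation, random stand-in data, no appearance attributes, and no gradient-based optimization), not the authors' implementation.

import numpy as np
from dataclasses import dataclass, field

@dataclass
class GaussianMap:
    # Global 4D scene map 𝒢: a shared canonical set plus per-timestep
    # positions of the dynamic Gaussians.
    canonical_xyz: np.ndarray                        # (N, 3) canonical Gaussians
    dynamic_xyz: dict = field(default_factory=dict)  # timestep -> (M, 3) positions

def lift_tracks(tracks_2d, depth, K, w2c):
    # Back-project 2D track points into world space using the off-the-shelf
    # monocular depth map and the frame's camera pose.
    h, w = depth.shape
    u = np.clip(tracks_2d[:, 0].astype(int), 0, w - 1)
    v = np.clip(tracks_2d[:, 1].astype(int), 0, h - 1)
    z = depth[v, u]
    rays = (np.linalg.inv(K) @ np.c_[u, v, np.ones_like(z)].T).T
    pts_cam = rays * z[:, None]
    c2w = np.linalg.inv(w2c)
    return (c2w[:3, :3] @ pts_cam.T).T + c2w[:3, 3]

def propagate_anchors(gauss_xyz, anchors_t, anchors_t1):
    # (1) Anchor-based propagation: here each Gaussian simply inherits the
    # displacement of its nearest anchor, a crude stand-in for the paper's
    # anchor-based strategy.
    nearest = np.linalg.norm(
        gauss_xyz[:, None] - anchors_t[None], axis=-1).argmin(axis=1)
    return gauss_xyz + (anchors_t1 - anchors_t)[nearest]

def expand_scene(scene, new_xyz):
    # (2) 4D scene expansion: append Gaussians for newly revealed regions
    # to the canonical Gaussian map.
    scene.canonical_xyz = np.vstack([scene.canonical_xyz, new_xyz])

def optimize_step(scene, t):
    # (3) Fast optimization: the real system runs a few gradient steps on a
    # rendering loss for foreground and background; omitted in this sketch.
    pass

# Streaming loop over new frames (random stand-in data, shapes only).
scene = GaussianMap(canonical_xyz=np.random.randn(100, 3))
scene.dynamic_xyz[0] = np.random.randn(50, 3)
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
w2c = np.eye(4)
for t in range(5):
    depth = np.ones((480, 640))                      # monocular depth at t+1
    tracks_t = np.random.rand(20, 2) * [640, 480]    # 2D tracks at t and t+1
    tracks_t1 = tracks_t + np.random.randn(20, 2)
    a_t = lift_tracks(tracks_t, depth, K, w2c)
    a_t1 = lift_tracks(tracks_t1, depth, K, w2c)
    scene.dynamic_xyz[t + 1] = propagate_anchors(scene.dynamic_xyz[t], a_t, a_t1)
    expand_scene(scene, np.random.randn(2, 3))       # newly revealed content
    optimize_step(scene, t + 1)

The loop mirrors the caption's order: propagate existing dynamic Gaussians, expand the canonical map, then refine; in the actual method each step also updates appearance and is supervised by the precomputed depth, mask, and track priors.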
@InProceedings{Wu_2025_CVPR,
author = {Wu, Diankun and Liu, Fangfu and Hung, Yi-Hsin and Qian, Yue and Zhan, Xiaohang and Duan, Yueqi},
title = {4D-Fly: Fast 4D Reconstruction from a Single Monocular Video},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {16663-16673}
}