FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow
The dependence on SfM-computed camera poses prohibits 3D representation learning at scale.
In this work, we train 3D scene representations without camera poses.
Our method works robustly, even succeeding on the challenging CO3D dataset, on which classical SfM methods struggle.
Our key innovation is a camera pose formulation that leverages the robustness of optical flow methods. Specifically, we lift optical flow into scene flow via differentiable rendering, and differentiably solve for camera pose via a weighted Procrustes formulation.
Our method is only supervised by optical flow and re-rendering losses.
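To make the pose solve concrete, below is a minimal NumPy sketch of a weighted Procrustes (weighted Kabsch) alignment of the kind described above: given 3D points and their scene-flow-displaced correspondences, it recovers the rigid transform in closed form. Function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def weighted_procrustes(X, Y, w):
    """Closed-form rigid transform (R, t) minimizing
    sum_i w_i * ||R @ X_i + t - Y_i||^2.

    X, Y: (N, 3) corresponding 3D points (e.g. a lifted point cloud and
          its scene-flow-displaced counterpart); w: (N,) non-negative
          per-point weights (e.g. confidence).
    """
    w = w / w.sum()
    mu_x = (w[:, None] * X).sum(axis=0)      # weighted centroids
    mu_y = (w[:, None] * Y).sum(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y              # centered points
    S = (w[:, None] * Xc).T @ Yc             # 3x3 weighted covariance
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_y - R @ mu_x
    return R, t
```

Because every step (centroids, SVD, matrix products) is differentiable almost everywhere, the same solve can be expressed in an autodiff framework and trained end-to-end, which is what makes this formulation attractive for pose supervision via re-rendering.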
TL;DR: We propose to train generalizable 3D scene representations without known camera poses
Below we show generalizable pose estimation followed by generalizable view synthesis on top of a smoothed and wobbled trajectory. Since our model estimates poses and geometry on short video clips, we apply both our pose estimation and view synthesis on sliding windows of the video and trajectory. Our model predicts poses at ~20fps.
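Since poses are estimated per short clip, a full trajectory is obtained by composing the per-window relative transforms into absolute poses. A minimal sketch of that chaining step, with illustrative names (not the authors' code):

```python
import numpy as np

def chain_relative_poses(rel_poses):
    """Compose per-frame relative SE(3) transforms into absolute poses.

    rel_poses: iterable of 4x4 matrices, where rel_poses[i] maps frame i
               into frame i+1's coordinates (e.g. the output of a pose
               solve on a sliding window).
    Returns a list of 4x4 absolute poses, with frame 0 as the identity.
    """
    poses = [np.eye(4)]
    for T in rel_poses:
        # Accumulate: absolute pose of frame i+1 = absolute pose of i, then T.
        poses.append(poses[-1] @ T)
    return poses
```

Note that naive composition accumulates drift over long sequences; without a loop-closure mechanism (see Limitations below), this is why evaluation is restricted to relatively short clips.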
CO3D Hydrants
CO3D 10-Category
KITTI
RealEstate10K
Limitations
Our method does not model dynamics, does not robustly predict intrinsics, has no loop closure mechanism, and operates on relatively short clips.