RoMo: Robust Motion Segmentation Improves Structure from Motion

Google DeepMind, University of Toronto, Adobe Research, Simon Fraser University

* Equal Contribution, † Equal Advising

We introduce a zero-shot motion segmentation method for video that leverages cues from epipolar geometry and optical flow. Our predicted masks enhance SfM camera calibration in highly dynamic scenes.

Abstract

There has been extensive progress in the reconstruction and generation of 4D scenes from monocular, casually captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation that identifies the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained on synthetic data. More importantly, combining an off-the-shelf SfM pipeline with our segmentation masks establishes a new state of the art in camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
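To make the epipolar cue concrete, the sketch below is a minimal approximation, not the authors' exact pipeline (which iterates such cues with a pre-trained video segmentation model): it fits a fundamental matrix to sub-sampled optical-flow correspondences with RANSAC, then flags pixels whose Sampson distance to the recovered epipolar geometry is too large to be explained by camera motion alone. The function name epipolar_motion_mask and the step/threshold values are illustrative choices, and any dense flow estimator can supply the input.

import cv2
import numpy as np

def epipolar_motion_mask(flow, step=8, threshold=2.0):
    """Flag pixels whose optical flow violates the dominant epipolar geometry.

    flow: (H, W, 2) forward optical flow from frame t to frame t+1, in pixels.
    Returns a boolean (H, W) mask that is True for likely-dynamic pixels.
    """
    h, w = flow.shape[:2]

    # Sub-sample correspondences on a grid to keep RANSAC fast.
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    pts0 = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    pts1 = pts0 + flow[ys.ravel(), xs.ravel()]

    # Robust fit: RANSAC treats dynamic pixels as outliers, so the recovered
    # F describes the camera motion as long as static background dominates.
    F, _ = cv2.findFundamentalMat(pts0, pts1, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return np.zeros((h, w), dtype=bool)

    # Dense Sampson distance to the epipolar constraint for every pixel.
    yy, xx = np.mgrid[0:h, 0:w]
    p0 = np.stack([xx, yy, np.ones_like(xx)], axis=-1)
    p0 = p0.reshape(-1, 3).astype(np.float64)
    p1 = p0.copy()
    p1[:, :2] += flow.reshape(-1, 2)

    Fp0 = p0 @ F.T   # epipolar lines l' = F p0 in frame t+1
    Ftp1 = p1 @ F    # epipolar lines l = F^T p1 in frame t
    num = np.sum(p1 * Fp0, axis=1) ** 2
    den = Fp0[:, 0] ** 2 + Fp0[:, 1] ** 2 + Ftp1[:, 0] ** 2 + Ftp1[:, 1] ** 2
    sampson = num / np.maximum(den, 1e-9)

    # Pixels far from the epipolar constraint cannot be explained by camera
    # motion alone and are marked as dynamic.
    return np.sqrt(sampson).reshape(h, w) > threshold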

Examples of RoMo Motion Masks

Video Comparisons

Moving-object segmentation results on DAVIS16, SegTrackv2, and FBMS-59. We compare our zero-shot method against OCLR-adap, a motion segmentation approach that is fully supervised on synthetic data at training time and further adapted to each video at test time.

[Video grid: per-example comparisons of Input Video, RoMo (ours), and OCLR-adap]

Estimated Camera Trajectories and Motion Masks on Casual Motion Dataset

[Per-scene results: Input Video and RoMo (Ours) masks, with estimated camera trajectories (dashed) plotted against ground truth]

RoMo Motion Masks on MPI Sintel Dataset

Application in Distractor Removal

Our method can be applied to robust 3D reconstruction in the presence of transient distractors when the input image set is sampled from a video. We show an example of this application on the patio scene from the NeRF On-the-go dataset.

[GIF: Input video and RoMo masks]

Our masks, applied to the photometric loss while training a 3D Gaussian Splatting model, can be as effective as robust training methods such as SpotLessSplats.
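As a concrete illustration, masking the photometric loss can be as simple as the sketch below. This is a minimal, hypothetical example rather than the actual 3DGS or SpotLessSplats training code: render, gt, and static_mask are assumed tensor names, with the mask equal to 1 on static pixels and 0 on distractors so that masked pixels contribute no gradient.

import torch

def masked_photometric_loss(render, gt, static_mask):
    """L1 photometric loss restricted to static pixels.

    render, gt: (3, H, W) images; static_mask: (1, H, W) in [0, 1],
    where dynamic (distractor) pixels are 0.
    """
    per_pixel = (render - gt).abs()
    # Normalize by the number of unmasked values so the loss scale does not
    # depend on how much of the frame is masked out.
    denom = 3.0 * static_mask.sum().clamp(min=1.0)
    return (static_mask * per_pixel).sum() / denom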

[GIF: Distractor-removal qualitative comparison between vanilla 3DGS, SpotLessSplats, and RoMo]

BibTeX


@article{golisabour2024romo,
    title={{RoMo}: Robust Motion Segmentation Improves Structure from Motion},
    author={Goli, Lily and Sabour, Sara and Matthews, Mark and Brubaker, Marcus and Lagun, Dmitry and Jacobson, Alec and Fleet, David J. and Saxena, Saurabh and Tagliasacchi, Andrea},
    journal={arXiv preprint arXiv:2411.18650},
    year={2024}
}