Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning

Kangning Liu*    Shuhang Gu*    Andres Romero    Radu Timofte

Abstract

Existing unsupervised video-to-video translation methods fail to produce translated videos that are frame-wise realistic, preserve semantic information and are consistent at the video level. In this work, we propose a novel unsupervised video-to-video translation model. Our model decomposes style and content, uses a specialized encoder-decoder structure and propagates inter-frame information through bidirectional recurrent neural network (RNN) units. The style-content decomposition mechanism enables us to achieve long-term style-consistent video translation results and provides a good interface for modality-flexible translation. In addition, by changing the input frames and style codes incorporated in our translation, we propose a video interpolation loss, which captures temporal information within the sequence to train our building blocks in a self-supervised manner. Our model can produce photo-realistic, spatio-temporally consistent translated videos in a multimodal way. Subjective and objective experimental results validate the superiority of our model over existing methods.

Citing

@misc{2004.06502,
  Author = {Kangning Liu and Shuhang Gu and Andres Romero and Radu Timofte},
  Title = {Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning},
  Year = {2020},
  Eprint = {arXiv:2004.06502},
}

UVIT: Multi-subdomain & Multimodality

We exploit the temporal information in videos to produce multi-subdomain (day, night, snow, etc.) translations with different modalities per sub-domain, while keeping each translated video consistent.
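As a rough illustration of how different modalities per sub-domain can be obtained, the sketch below samples several style codes for one target sub-domain and would decode the same content sequence with each of them. It is only a hedged approximation: the module names, the style dimension and the per-sub-domain Gaussian prior are assumptions for illustration, not the released UVIT code.

# Minimal sketch (not the released UVIT code): multimodal outputs are obtained by
# decoding the same content with different style codes sampled for a target
# sub-domain. All module/class names below are illustrative assumptions.
import torch
import torch.nn as nn

class StylePrior(nn.Module):
    """Maps a sub-domain label (e.g. day/night/snow) to a Gaussian style prior."""
    def __init__(self, num_subdomains: int, style_dim: int = 8):
        super().__init__()
        self.mu = nn.Embedding(num_subdomains, style_dim)          # per-sub-domain mean
        self.log_sigma = nn.Embedding(num_subdomains, style_dim)   # per-sub-domain log-std

    def sample(self, subdomain: torch.Tensor) -> torch.Tensor:
        mu, log_sigma = self.mu(subdomain), self.log_sigma(subdomain)
        return mu + torch.randn_like(mu) * log_sigma.exp()         # reparameterized sample

# usage (hypothetical): one content sequence, several styles -> several translations
# content = content_encoder(video)               # assumed content encoder
# prior = StylePrior(num_subdomains=4)
# for _ in range(3):                             # three modalities of the same sub-domain
#     style = prior.sample(torch.tensor([1]))    # sub-domain index 1, e.g. "night" (assumed)
#     video_out = decoder(content, style)        # same content, different appearance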

UVIT: Overview

Given an input video sequence, we first decompose it into content via a Content Encoder and style via a Style Encoder. The content is then processed by special RNN units, namely TrajGRUs, to obtain, in a recurrent manner, the content used for translation and interpolation. Finally, the translation content and the interpolation content are decoded, together with the style latent variable, into the translated video and the interpolated video. We also show the video adversarial loss, the cycle consistency loss, the video interpolation loss and the style encoder loss.
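The following is a minimal, simplified sketch of that pipeline in PyTorch. It is not the authors' implementation: the bidirectional TrajGRU units are replaced by a plain convolutional GRU cell, and all module names, tensor shapes and the fusion of the two directions by channel concatenation are assumptions made for illustration.

# Simplified sketch of the translation pipeline described above (assumed names/shapes).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Simplified stand-in for a TrajGRU cell operating on content feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

def translate_sequence(frames, content_enc, fwd_cell, bwd_cell, decoder, style):
    """Encode each frame, propagate content in both temporal directions, decode with a style code."""
    contents = [content_enc(f) for f in frames]                   # per-frame content features
    h = torch.zeros_like(contents[0]); fwd = []
    for c in contents:                                            # forward pass in time
        h = fwd_cell(c, h); fwd.append(h)
    h = torch.zeros_like(contents[0]); bwd = [None] * len(contents)
    for t in reversed(range(len(contents))):                      # backward pass in time
        h = bwd_cell(contents[t], h); bwd[t] = h
    # fuse both directions (concatenation is an assumption) and decode with the shared style
    return [decoder(torch.cat([f, b], dim=1), style) for f, b in zip(fwd, bwd)]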

UVIT: Video Translation and Video Interpolation

By changing the input frames and style codes incorporated in the translation, the same building blocks also perform video interpolation within the input sequence. The resulting video interpolation loss captures temporal information within the sequence and trains the encoders, the TrajGRUs and the decoder in a self-supervised manner.
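Below is a hedged sketch of how such an interpolation loss could be computed on a single real video: a middle frame is held out, reconstructed from its neighbours with the same blocks, and compared against the real frame. The held-out-middle-frame setup, the L1 penalty and all function names are assumptions for illustration, not necessarily the paper's exact formulation.

# Sketch of a self-supervised video interpolation loss (assumed names and penalty).
import torch
import torch.nn.functional as F

def video_interpolation_loss(frames, content_enc, fwd_cell, bwd_cell, decoder, style_enc):
    """frames: list of (N, C, H, W) tensors from one real video (no translation needed)."""
    t = len(frames) // 2                                    # hold out the middle frame
    style = style_enc(frames[0])                            # style of the input video itself
    contents = [content_enc(f) for i, f in enumerate(frames) if i != t]
    h_f = torch.zeros_like(contents[0])
    for c in contents[:t]:                                  # propagate content up to frame t
        h_f = fwd_cell(c, h_f)
    h_b = torch.zeros_like(contents[0])
    for c in reversed(contents[t:]):                        # propagate content back down to frame t
        h_b = bwd_cell(c, h_b)
    pred = decoder(torch.cat([h_f, h_b], dim=1), style)     # interpolated frame
    return F.l1_loss(pred, frames[t])                       # self-supervised reconstruction target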

Videos

Table of contents

  • Compare with the baseline
  • Multi-subdomain and multimodality
  • Long style-consistent translated video
  • Translation on other datasets
  • Ablation study: translation without subdomain labels

1_LRCompare

Compare with the baseline – Video of the label-to-image qualitative comparison.

2_HRcompare

Compare with the baseline – A consistent translated video should be 1) style consistent and 2) content consistent.

3_HRcompare2

Compare with the baseline – Video of the comparison with RecycleGAN.

4_Multimodality

Multi-subdomain and multimodality

5_long_consistency

Long style-consistent translated video (1680 frames)

6_Rainandsnow

Translation on other datasets – Video of Viper Rain-and-Snow translation.

7_Sunsetandday

Translation on other datasets – Video of Viper Sunset-and-Day translation.

8_Cityscapesandviper

Translation on other datasets – Video of Cityscapes-and-Viper translation.

9_No_subdomain

Ablation study – translation without subdomain labels

Contact

Kangning Liu1,2, Shuhang Gu2, Andres Romero2, Radu Timofte2

1 Center for Data Science, New York University, USA

2 Computer Vision Lab, ETH Zurich, Switzerland

Email: kl3141@nyu.edu , {shuhang.gu,andres.romero-vergara,radu.timofte}@vision.ee.ethz.ch