Abstract
Many image-to-image (I2I) translation problems are inherently multimodal: a single input may correspond to many valid outputs. Prior works have proposed multi-modal networks that build a many-to-many mapping between two visual domains. However, most of them are guided by sampled noise, while others encode the reference image into a latent vector, which discards the semantic information of the reference. In this work, we aim to control the output semantically based on a reference image. Given a reference image and an input in another domain, we first perform semantic matching between the two visual contents and generate an auxiliary image, which explicitly encourages the semantic characteristics to be preserved. A deep network is then used for I2I translation, and the final outputs are expected to be semantically similar to both the input and the reference. However, little paired data can satisfy this dual similarity in a supervised fashion, so we build a self-supervised framework for the training stage. We improve the quality and diversity of the outputs by employing non-local blocks and a multi-task architecture. We assess the proposed method through extensive qualitative and quantitative evaluations and also present comparisons with several state-of-the-art models.
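As a concrete illustration of the components named above, here is a minimal PyTorch sketch (an assumption-laden illustration, not the paper's released code). The non-local block follows the standard embedded-Gaussian formulation of Wang et al. (2018), which the abstract names as a quality-improving module; `semantic_match` is a hypothetical instantiation of the semantic-matching step, using nearest-neighbour matching of encoder features to warp the reference into an auxiliary image aligned with the input. All names, shapes, and the choice of matching strategy are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block (Wang et al., 2018).
    Placement and channel-reduction factor are assumptions here."""
    def __init__(self, in_channels: int):
        super().__init__()
        inter = max(in_channels // 2, 1)               # reduced embedding width
        self.theta = nn.Conv2d(in_channels, inter, 1)  # query projection
        self.phi = nn.Conv2d(in_channels, inter, 1)    # key projection
        self.g = nn.Conv2d(in_channels, inter, 1)      # value projection
        self.out = nn.Conv2d(inter, in_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = F.softmax(q @ k, dim=-1)                # affinities between all position pairs
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

def semantic_match(feat_in, feat_ref, ref_img):
    """Hypothetical semantic-matching step: for each input position, copy the
    reference pixel whose encoder feature is most similar (cosine similarity),
    producing an auxiliary image aligned with the input layout.
    feat_in, feat_ref: (C, H, W) features from a shared encoder; ref_img: (3, H, W)."""
    C, H, W = feat_in.shape
    fi = F.normalize(feat_in.flatten(1), dim=0)        # (C, HW), unit-norm per position
    fr = F.normalize(feat_ref.flatten(1), dim=0)       # (C, HW)
    sim = fi.t() @ fr                                  # (HW, HW) cosine similarities
    idx = sim.argmax(dim=1)                            # best reference match per input position
    return ref_img.flatten(1)[:, idx].reshape(3, H, W)

# usage sketch
feat = torch.randn(1, 64, 32, 32)
out = NonLocalBlock(64)(feat)                          # same shape as the input feature map
aux = semantic_match(torch.randn(64, 32, 32), torch.randn(64, 32, 32),
                     torch.rand(3, 32, 32))            # (3, 32, 32) auxiliary image
```

In the pipeline the abstract describes, an auxiliary image of this kind would then be fed, together with the input, to the translation network, so the output can stay semantically close to both.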
| Original language | English |
| --- | --- |
| Article number | 9115302 |
| Pages (from-to) | 1654-1665 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 23 |
| Early online date | 11 Jun 2020 |
| DOIs | |
| Publication status | Published - 2021 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 1999-2012 IEEE.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 61672443, in part by Hong Kong GRF-RGC General Research Fund under Grants 9042322 (CityU 11200116), 9042489 (CityU 11206317), and 9042816 (CityU 11209819), in part by Hong Kong ECS under Grant 21209119, Hong Kong UGC, and in part by Start-up under Grant 7200607, CityU of Hong Kong.
Keywords
- Artificial neural networks
- image generation
- image representation