The human visual system excels at biasing stereoscopic visual signals through attention mechanisms. Traditional methods for stereoscopic video saliency prediction, which rely on low-level features and depth-related information, have fundamental limitations. For example, modeling the interactions between multiple visual cues, including spatial, temporal, and depth information, is cumbersome because of their complexity. In this paper, we argue that high-level features are crucial and adopt a deep learning framework to learn the saliency maps of stereoscopic videos. Driven by the spatio-temporal coherence of consecutive frames, the model first imitates the saliency mechanism using a 3D convolutional neural network. Subsequently, the saliency originating from intrinsic depth is derived from the correlations between the left and right views in a data-driven manner. Finally, a Convolutional Long Short-Term Memory (Conv-LSTM) based fusion network is developed to model the instantaneous interactions between spatio-temporal and depth attributes, producing the final stereoscopic saliency maps over time. Moreover, we establish a new large-scale stereoscopic video saliency dataset (SVS) comprising 175 stereoscopic video sequences and their fixation density annotations, aiming to comprehensively study the intrinsic attributes of stereoscopic video saliency detection. Extensive experiments show that the proposed model outperforms state-of-the-art methods on the newly built dataset for stereoscopic videos.
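To make the fusion step more concrete, the following is a minimal sketch of a generic ConvLSTM cell that combines a spatio-temporal saliency map and a depth saliency map frame by frame. It is not the authors' network: the single hidden channel, the 3x3 kernels, the absence of peephole connections, and all names (`conv2d_same`, `conv_lstm_step`, `fuse_sequence`, `params`) are assumptions made for illustration, and it uses pure Python lists rather than a deep learning library.

```python
import math

def conv2d_same(x, k):
    """Zero-padded 'same' 2D cross-correlation of map x with an odd-sized kernel k."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, z = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= z < w:
                        s += x[y][z] * k[a][b]
            out[i][j] = s
    return out

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def conv_lstm_step(inputs, h_prev, c_prev, params):
    """One ConvLSTM step with a single hidden channel (illustrative assumption).

    inputs: list of 2D maps, e.g. [spatio_temporal_map, depth_map]
    params: gate name ("i", "f", "o", "g") -> {"wx": [kernel per input], "wh": kernel, "b": float}
    """
    rows, cols = len(h_prev), len(h_prev[0])

    def pre_activation(gate):
        p = params[gate]
        acc = conv2d_same(h_prev, p["wh"])       # recurrent contribution
        for x, k in zip(inputs, p["wx"]):        # one kernel per input map
            cx = conv2d_same(x, k)
            for i in range(rows):
                for j in range(cols):
                    acc[i][j] += cx[i][j]
        return [[v + p["b"] for v in row] for row in acc]

    zi, zf, zo, zg = (pre_activation(g) for g in ("i", "f", "o", "g"))
    # Cell state: forget part of the old state, add the gated candidate.
    c_new = [[sigmoid(zf[i][j]) * c_prev[i][j]
              + sigmoid(zi[i][j]) * math.tanh(zg[i][j])
              for j in range(cols)] for i in range(rows)]
    # Hidden state doubles as the fused map for this frame.
    h_new = [[sigmoid(zo[i][j]) * math.tanh(c_new[i][j])
              for j in range(cols)] for i in range(rows)]
    return h_new, c_new

def fuse_sequence(st_maps, depth_maps, params):
    """Run the cell over a sequence of frames, starting from zero states."""
    rows, cols = len(st_maps[0]), len(st_maps[0][0])
    h = [[0.0] * cols for _ in range(rows)]
    c = [[0.0] * cols for _ in range(rows)]
    outputs = []
    for st, d in zip(st_maps, depth_maps):
        h, c = conv_lstm_step([st, d], h, c, params)
        outputs.append(h)
    return outputs
```

Because the hidden and cell states carry over from frame to frame, each fused map depends on both the current pair of inputs and the preceding frames, which is the property the abstract attributes to the Conv-LSTM fusion stage. In practice the kernels would be learned rather than set by hand.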
|Title of host publication|Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition|
|Publication status|Published - Jun 2019|
Bibliographical note
This work was supported in part by the National Natural Science Foundation of China under Grant 61871270, 61672443 and 61620106008, in part by the Hong Kong RGC Early Career Scheme under Grant 9048122 (CityU 21211018), in part by the Guangdong Nature Science Foundation of China under Grant 2016A030310058, in part by the Natural Science Foundation of SZU (grant no. 827000144), and in part by the National Engineering Laboratory for Big Data System Computing Technology of China.
- 3D from Multiview and Sensors
- Datasets and Evaluation
- Deep Learning
- RGBD sensors and analytics
- Video Analytics