Automatic image description is a challenging task: it aims to translate the visual information in an image into natural language that conforms to proper grammar and sentence structure. In this work, an optimal learning framework called the deep sequential fusion based long short term memory network is designed. In the proposed framework, a layer-wise strategy is introduced into the generation process of the recurrent neural network to increase the depth of the language model and produce more abstract and discriminative features. Then, a deep supervision method is developed to enrich the model capacity with extra regularization. Moreover, the prediction scores from all of the auxiliary branches in the language model are fused with the product rule to form the final decision output, which makes further use of the optimized model parameters and hence boosts performance. Experimental results on two public benchmark datasets verify the effectiveness of the proposed approaches: the consensus-based image description evaluation (CIDEr) metric reaches 103.4 on the MSCOCO dataset, and the metric for evaluation of translation with explicit ordering (METEOR) reaches 20.6 on the Flickr30K dataset.
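The product-rule fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes each auxiliary branch outputs a probability distribution over the vocabulary at a given decoding step, and that the fused scores are renormalized; the function name `product_rule_fusion`, the input layout, the renormalization step, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math


def product_rule_fusion(branch_probs):
    """Fuse per-branch word distributions with the product rule.

    branch_probs: list of lists, one probability distribution over the
    vocabulary per branch (hypothetical layout, not from the paper).
    Returns a renormalized distribution of the same length.
    """
    if not branch_probs:
        raise ValueError("need at least one branch")
    vocab_size = len(branch_probs[0])
    eps = 1e-12  # guards against log(0)
    # Summing log-probabilities is equivalent to multiplying probabilities.
    log_scores = [
        sum(math.log(max(p[i], eps)) for p in branch_probs)
        for i in range(vocab_size)
    ]
    # Renormalize (an assumption), shifting by the max for numerical stability.
    shift = max(log_scores)
    weights = [math.exp(s - shift) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: three branches scoring a three-word vocabulary.
branches = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
]
fused = product_rule_fusion(branches)
best_word = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the first two branches prefer word 0, but the third branch's strong preference for word 1 dominates the product, so the fused decision is word 1. Working in log space avoids underflow when many small probabilities are multiplied.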
Bibliographical note
This work was supported in part by National Natural Science Foundation of China under Grants 61622115 and 61472281, Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), IBM Shared University Research Awards Program, and Scientific Research Foundation of Education Bureau of Jiangxi Province (No. GJJ170643).
- Deep sequential fusion
- Deep supervision
- Image description
- Layer-wise optimization
- Long short term memory network