Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning


Abstract

The prevalent approach to image captioning is an encoder-decoder framework in which a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is the de-facto standard. CNNs have also been exploited for sequence learning [2, 16, 33] and have shown promising results, yet variants of RNNs [15, 29, 50, 48] remain dominant in image captioning because of their compelling ability to model language over time. In this work, we introduce Incep-cap, a novel convolutional language model equipped with language attributes that generates a caption for a given image. Our model learns to generate each word by treating word prediction as a multi-class classification task. We integrate both visual (‘girl’) and non-visual (‘a’) attributes as language attributes to bridge the gap between two independent convolutional blocks. In extensive experiments, our model outperforms state-of-the-art deep LSTM models on both the Flickr8k and Flickr30k datasets, achieving BLEU-4 scores of 30.08 and 39.72, respectively.
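
The sketch below illustrates, in minimal form, the core idea stated in the abstract: a convolutional language model that treats word generation as multi-class classification over the vocabulary, conditioned on image features and previously generated words. It is not the authors' code; all module names, layer sizes, and the way visual context is fused are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvCaptioner(nn.Module):
    """Toy convolutional captioner: next word = class over the vocabulary."""
    def __init__(self, vocab_size=10000, embed_dim=300, img_dim=2048,
                 hidden=512, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        # Causal 1-D convolutions over the word sequence: pad on the left
        # so position t only sees words up to t.
        self.pad = kernel - 1
        self.conv1 = nn.Conv1d(embed_dim, hidden, kernel)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel)
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, img_feats, prev_words):
        # img_feats: (B, img_dim); prev_words: (B, T) token ids
        w = self.embed(prev_words)                          # (B, T, E)
        w = w + self.img_proj(img_feats).unsqueeze(1)       # fuse visual context
        x = w.transpose(1, 2)                               # (B, E, T)
        x = torch.relu(self.conv1(nn.functional.pad(x, (self.pad, 0))))
        x = torch.relu(self.conv2(nn.functional.pad(x, (self.pad, 0))))
        return self.classifier(x.transpose(1, 2))           # (B, T, vocab) logits

# Per-step logits are trained with cross-entropy against the ground-truth
# next word, i.e. captioning cast as repeated multi-class classification.
model = ConvCaptioner()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```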

Publication
In Proceedings of the 15th European Conference on Computer Vision (ECCV), poster.