Most recent image captioning works are conducted in English as the majority of image-caption datasets are in English. However, there are a large amount of non-native English speakers worldwide. Generating image captions in different languages is …
Most of the existing deep learning based image captioning methods are fully-supervised models, which require large-scale paired image-caption datasets. However, getting large scale image-caption paired data is labor-intensive and time-consuming. In …
Scene graph generation has received growing attention with advancement image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, …
Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from the image to its natural language description. In general, the mapping function is learned from a training set …
The existing image captioning approaches typically train a one-stage sentence decoder, which is difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image caption model is hard to train due to the vanishing gradient …
Language Models based on recurrent neural networks have dominated recent image caption generation tasks. In this paper, we introduce a Language CNN model which is suitable for statistical language modeling tasks and shows competitive performance in …