The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video …
The prevalent approach to the image captioning is an encoder-decoder framework, where the combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is the de-facto. In contrast, CNNs are exploited in sequence learning …
With the rapid growth of video data and the increasing demands of various applications such as intelligent video search and assistance toward visually-impaired people, video captioning task has received a lot of attention recently in computer vision …