Vision-Language

Delving into OOD Detection with Vision-Language Representations

Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich …

Interactive Image Generation with Natural-Language Feedback

Using natural-language feedback to guide image generation and manipulation can greatly lower the required efforts and skills. This topic has received increased attention in recent years through refinement of Generative Adversarial Networks (GANs); …

Exploiting Semantic Embedding and Visual Feature for Facial Action Unit Detection

Recent study on detecting facial action units (AU) has utilized auxiliary information (i.e., facial landmarks, relationship among AUs and expressions, web facial images, etc.), in order to improve the AU detection performance. As of now, no semantic …

Self-Supervised Relationship Probing

Structured representations of images according to visual relationships are beneficial for many vision and vision-language applications. However, current human-annotated visual relationship datasets suffer from the long-tailed predicate distribution …

Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction

The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video …

Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

The prevalent approach to the image captioning is an encoder-decoder framework, where the combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is the de-facto. In contrast, CNNs are exploited in sequence learning …

Watch It Twice: Video Captioning with a Refocused Video Encoder

With the rapid growth of video data and the increasing demands of various applications such as intelligent video search and assistance toward visually-impaired people, video captioning task has received a lot of attention recently in computer vision …

Unpaired Image Captioning via Scene Graph Alignments

Most of the existing deep learning based image captioning methods are fully-supervised models, which require large-scale paired image-caption datasets. However, getting large scale image-caption paired data is labor-intensive and time-consuming. In …

Scene Graph Generation with External Knowledge and Image Reconstruction

Scene graph generation has received growing attention with advancement image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, …

Unpaired Image Captioning by Language Pivoting

Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from the image to its natural language description. In general, the mapping function is learned from a training set …