Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich …
Using natural-language feedback to guide image generation and manipulation can greatly lower the required efforts and skills. This topic has received increased attention in recent years through refinement of Generative Adversarial Networks (GANs); …
Recent study on detecting facial action units (AU) has utilized auxiliary information (i.e., facial landmarks, relationship among AUs and expressions, web facial images, etc.), in order to improve the AU detection performance. As of now, no semantic …
Structured representations of images according to visual relationships are beneficial for many vision and vision-language applications. However, current human-annotated visual relationship datasets suffer from the long-tailed predicate distribution …
The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video …
The prevalent approach to the image captioning is an encoder-decoder framework, where the combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is the de-facto. In contrast, CNNs are exploited in sequence learning …
With the rapid growth of video data and the increasing demands of various applications such as intelligent video search and assistance toward visually-impaired people, video captioning task has received a lot of attention recently in computer vision …
Most of the existing deep learning based image captioning methods are fully-supervised models, which require large-scale paired image-caption datasets. However, getting large scale image-caption paired data is labor-intensive and time-consuming. In …
Scene graph generation has received growing attention with advancement image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, …
Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from the image to its natural language description. In general, the mapping function is learned from a training set …