Publications

We introduce a new image segmentation task, called Entity Segmentation (ES), which aims to segment all visual entities (objects and …

Professional document editing tools require a certain level of expertise to perform complex edit operations. To make editing tools …

In this paper, we introduce a novel concept of user-entity differential privacy (UeDP) to provide formal privacy protection …

Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine …

Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of …

We propose a new task of synthesizing speech directly from semi-structured documents where the extracted text tokens from OCR systems …

We introduce DocTime, a novel temporal dependency graph (TDG) parser that takes as input a text document and produces a temporal …

This work presents one of the first comprehensive studies on different sparse attention patterns in Transformer models. We first …

Multilingual natural language understanding, which aims to comprehend multilingual documents, is an important task. Existing efforts …

Most recent image captioning works are conducted in English as the majority of image-caption datasets are in English. However, there …

Using natural-language feedback to guide image generation and manipulation can greatly lower the required effort and skill. This …

Document intelligence automates the extraction of information from documents and supports many business applications. Recent …

Recent studies indicate that NLU models are prone to rely on shortcut features for prediction. As a result, these models could …

We propose SelfDoc, a task-agnostic pre-training framework for document image analysis. Because documents are multimodal displays and …

In instance-level detection tasks (e.g., object detection), reducing input resolution is an easy option to improve runtime efficiency. …

Recent studies on detecting facial action units (AUs) have utilized auxiliary information (i.e., facial landmarks, relationships among AUs …

Structured representations of images according to visual relationships are beneficial for many vision and vision-language applications. …

The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people …

The prevalent approach to image captioning is an encoder-decoder framework, where the combination of convolutional neural networks …

Mobile energy storage systems (MESSs) provide mobility and flexibility to enhance distribution system resilience. The paper proposes a …

We, as humans, can easily use our vision and language capabilities to accomplish a wide variety of tasks that combine the image and the …

With the rapid growth of video data and the increasing demands of various applications such as intelligent video search and assistance …

Most existing deep-learning-based image captioning methods are fully supervised models, which require large-scale paired …

Scene graph generation has received growing attention with the advancement of image understanding tasks such as object detection, attributes …

Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping …

Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. …

Existing image captioning approaches typically train a one-stage sentence decoder, with which it is difficult to generate rich fine-grained …

Language Models based on recurrent neural networks have dominated recent image caption generation tasks. In this paper, we introduce a …

In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We detail the improvements of CNN …