Multimodal Learning
Vision-Language and Multimodal Pretraining
My research focuses on the intersection of computer vision and natural language processing, particularly in the area of multimodal learning. This includes:
- Vision-Language Pretraining: Developing efficient models that can understand both visual and textual information
- Multimodal Understanding and Reasoning: Creating systems that can reason across different modalities
- Efficient Modeling: Optimizing model architectures for better performance and reduced computational cost
- Self-Supervised Learning: Leveraging unlabeled data to improve model performance


Key Research Areas
Vision-Language Pretraining
Developing large-scale models that can understand and generate content across visual and textual modalities. This includes work on image captioning, visual question answering, and cross-modal retrieval.
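As a concrete illustration of the cross-modal retrieval setting, the sketch below shows a CLIP-style dual encoder trained with a symmetric contrastive loss. It is a minimal PyTorch example: the linear projections over precomputed image and text features, the embedding sizes, and the temperature initialization are illustrative assumptions, not the architectures used in this work.

```python
# Minimal sketch of a CLIP-style dual encoder for cross-modal retrieval.
# The encoders here are hypothetical stand-ins (linear projections over
# precomputed features), not the models used in this research.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)      # projects image features
        self.text_proj = nn.Linear(text_dim, embed_dim)        # projects text features
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable log-temperature

    def forward(self, image_feats, text_feats):
        # L2-normalize both embeddings so the dot product is cosine similarity
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return self.logit_scale.exp() * img @ txt.t()          # pairwise similarity matrix

def contrastive_loss(logits):
    # Symmetric InfoNCE: matched image-text pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage on a batch of precomputed features
model = DualEncoder()
images, texts = torch.randn(8, 2048), torch.randn(8, 768)
loss = contrastive_loss(model(images, texts))
```

At retrieval time, the same normalized embeddings can be ranked by cosine similarity to find the captions closest to an image, or vice versa.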
Efficient Modeling
Focusing on creating more efficient model architectures that reduce computational requirements while maintaining or improving performance. This includes work on model compression, knowledge distillation, and architectural innovations.
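One of the techniques mentioned above, knowledge distillation, trains a compact student model to match a larger teacher. The sketch below is a minimal PyTorch example; the toy teacher and student networks, temperature, and loss weighting are illustrative assumptions rather than the configurations used in this research.

```python
# Minimal sketch of knowledge distillation in a generic classification setting.
# The teacher/student sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the student's softened distribution to the teacher's
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a small student learning from a larger frozen teacher
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
with torch.no_grad():
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits, y)
```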
Self-Supervised Learning
Exploring ways to leverage unlabeled data to improve model performance through self-supervised learning techniques, particularly in multimodal contexts.
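A common self-supervised recipe is masked prediction: hide part of an unlabeled input and train the model to reconstruct it. The sketch below is a minimal PyTorch example of this idea; the small encoder and masking ratio are illustrative assumptions, not the multimodal models used in this work.

```python
# Minimal sketch of a masked-prediction self-supervised objective on
# unlabeled feature vectors; the encoder and masking ratio are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, x, mask_ratio=0.5):
    # Randomly zero out a fraction of input features and train the model
    # to reconstruct the original values at the masked positions.
    mask = torch.rand_like(x) < mask_ratio
    x_masked = x.masked_fill(mask, 0.0)
    recon = encoder(x_masked)
    return F.mse_loss(recon[mask], x[mask])

# Toy usage on unlabeled data (no labels required)
encoder = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
unlabeled = torch.randn(64, 256)
loss = masked_reconstruction_loss(encoder, unlabeled)
```

The representation learned this way can then be fine-tuned on the downstream multimodal tasks described above with far less labeled data.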


Applications
The research has applications in:
- Document Understanding: Processing and understanding complex documents with both text and visual elements
- Content Generation: Creating coherent multimodal content
- Information Retrieval: Finding relevant information across different modalities
- Human-Computer Interaction: Improving how humans interact with AI systems
This work is conducted at Adobe Research, where we focus on both fundamental research and practical applications that can benefit Adobe’s creative tools and services.