Multimodal Learning
Vision-Language and Multimodal Pretraining
My research focuses on the intersection of computer vision and natural language processing, particularly in the area of multimodal learning. This includes:
- Vision-Language Pretraining: Developing efficient models that can understand both visual and textual information
- Multimodal Understanding and Reasoning: Creating systems that can reason across different modalities
- Efficient Modeling: Optimizing model architectures for better performance and reduced computational cost
- Self-Supervised Learning: Leveraging unlabeled data to improve model performance


Key Research Areas
Vision-Language Pretraining
Developing large-scale models that can understand and generate content across visual and textual modalities. This includes work on image captioning, visual question answering, and cross-modal retrieval.
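As a concrete illustration of the cross-modal retrieval setting, the sketch below shows a CLIP-style dual encoder trained with a symmetric contrastive loss. It is a minimal PyTorch example: the linear projections over precomputed image and text features, the embedding sizes, and the temperature initialization are illustrative assumptions, not the architectures used in this work.

```python
# Minimal sketch of a CLIP-style dual encoder for cross-modal retrieval.
# The encoders here are hypothetical stand-ins (linear projections over
# precomputed features), not the models used in this research.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)      # projects image features
        self.text_proj = nn.Linear(text_dim, embed_dim)        # projects text features
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable log-temperature

    def forward(self, image_feats, text_feats):
        # L2-normalize both embeddings so the dot product is cosine similarity
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return self.logit_scale.exp() * img @ txt.t()          # pairwise similarity matrix

def contrastive_loss(logits):
    # Symmetric InfoNCE: matched image-text pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage on a batch of precomputed features
model = DualEncoder()
images, texts = torch.randn(8, 2048), torch.randn(8, 768)
loss = contrastive_loss(model(images, texts))
```

At retrieval time, the same normalized embeddings can be ranked by cosine similarity to find the captions closest to an image, or vice versa.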
Efficient Modeling
Focusing on creating more efficient model architectures that reduce computational requirements while maintaining or improving performance. This includes work on model compression, knowledge distillation, and architectural innovations.
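One of the techniques mentioned above, knowledge distillation, trains a compact student model to match a larger teacher. The sketch below is a minimal PyTorch example; the toy teacher and student networks, temperature, and loss weighting are illustrative assumptions rather than the configurations used in this research.

```python
# Minimal sketch of knowledge distillation in a generic classification setting.
# The teacher/student sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the student's softened distribution to the teacher's
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a small student learning from a larger frozen teacher
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
with torch.no_grad():
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits, y)
```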
Self-Supervised Learning
Exploring ways to leverage unlabeled data to improve model performance through self-supervised learning techniques, particularly in multimodal contexts.
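A common self-supervised recipe is masked prediction: hide part of an unlabeled input and train the model to reconstruct it. The sketch below is a minimal PyTorch example of this idea; the small encoder and masking ratio are illustrative assumptions, not the multimodal models used in this work.

```python
# Minimal sketch of a masked-prediction self-supervised objective on
# unlabeled feature vectors; the encoder and masking ratio are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, x, mask_ratio=0.5):
    # Randomly zero out a fraction of input features and train the model
    # to reconstruct the original values at the masked positions.
    mask = torch.rand_like(x) < mask_ratio
    x_masked = x.masked_fill(mask, 0.0)
    recon = encoder(x_masked)
    return F.mse_loss(recon[mask], x[mask])

# Toy usage on unlabeled data (no labels required)
encoder = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
unlabeled = torch.randn(64, 256)
loss = masked_reconstruction_loss(encoder, unlabeled)
```

The representation learned this way can then be fine-tuned on the downstream multimodal tasks described above with far less labeled data.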


Applications
The research has applications in:
- Document Understanding: Processing and understanding complex documents with both text and visual elements
- Content Generation: Creating coherent multimodal content
- Information Retrieval: Finding relevant information across different modalities
- Human-Computer Interaction: Improving how humans interact with AI systems
This work is conducted at Adobe Research, where we focus on both fundamental research and practical applications that can benefit Adobe’s creative tools and services.