Multimodal Learning

LoRA-Contextualizing: Adaptation of Large Multimodal Models for Multi-page Document Understanding (Proceedings of the International Conference on Learning Representations 2025)

Commit: Coordinated Instruction Tuning for Multimodal Large Language Models (arXiv 2024)

TRINS: Towards Multimodal Language Models that Can Read (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2024)

Learning the Visualness of Text Using Large Vision-Language Models (Proceedings of the Conference on Empirical Methods in Natural Language Processing 2022)

Towards Language-Free Training for Text-to-Image Generation (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2022)

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018)

Unpaired Image Captioning by Language Pivoting (Proceedings of the European Conference on Computer Vision 2018)