Publications

Publications by category in reverse chronological order. Generated by jekyll-scholar.

2026

ICLR 2026
FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

Xuan Shen, Weize Ma, Yufa Zhou, and 11 more authors
AAAI 2026
OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive

Xuan Shen, Brian Wingenroth, Zichao Wang, and 12 more authors
CVPR 2026
DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation

Zhuoling Li, Hossein Rahmani, Jiarui Zhang, and 5 more authors
CVPR 2026
Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Shufan Li, Jiuxiang Gu, Kangning Liu, and 4 more authors
2026
LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li, Wanrong Zhu, Jiuxiang Gu, and 6 more authors
2026 US Patent App. 18/777,186
Text rendering for image generation models

Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, and 4 more authors
2026 US Patent 12,524,954
Generating 3D models from a single image

Hao Tan, Yicong Hong, Kai Zhang, and 8 more authors
arXiv preprint arXiv:2601.04589, 2026
MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Zhaoyang Lin, Wanrong Zhu, Jiuxiang Gu, and 7 more authors
ICLR 2026
LaViDa-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

Shufan Li, Jiuxiang Gu, Kangning Liu, and 4 more authors

2025

EMNLP 2025
CoMMIT: Coordinated instruction tuning for multimodal large language models

Junda Wu, Xintong Li, Tong Yu, and 6 more authors
2025 US Patent App. 18/347,877
Efficient vision-language retrieval using structural pruning

Handong Zhao, Yue Bai, Zhe Lin, and 4 more authors
2025 US Patent App. 18/493,465
Generating temporal dependency graphs

Puneet Mathur, Vlad Morariu, Verena Kaynig-Fittkau, and 7 more authors
WACV 2025
ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models

Jianyi Zhang, Yufan Zhou, Jiuxiang Gu, and 5 more authors
WACV 2025
Differential privacy mechanisms in neural tangent kernel regression

Jiuxiang Gu, Yingyu Liang, Zhizhou Sha, and 2 more authors
Submitted to ACL 2025
A multi-LLM debiasing framework

Deonna M Owens, Ryan A Rossi, Sungchul Kim, and 7 more authors
ICLR 2025
ImageFolder: Autoregressive image generation with folded tokens

Xiang Li, Kai Qiu, Hao Chen, and 4 more authors
ICLR 2025
LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding

Jian Chen, Ruiyi Zhang, Yufan Zhou, and 6 more authors
TMLR, 2025
Personalization of large language models: A survey

Zhehao Zhang, Ryan A Rossi, Branislav Kveton, and 8 more authors
AAAI 2025
Numerical pruning for efficient autoregressive models

Xuan Shen, Zhao Song, Yufa Zhou, and 12 more authors
AAAI 2025
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

Xuan Shen, Zhao Song, Yufa Zhou, and 12 more authors
CVPR 2025
MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data

Hanwen Jiang, Zexiang Xu, Desai Xie, and 8 more authors
ICCV 2025
Refer to Anything with Vision-Language Prompts

Shengcao Cao, Zijun Wei, Jason Kuen, and 6 more authors
ICCV 2025
DiffIP: Representation Fingerprints for Robust IP Protection of Diffusion Models

Zhuoling Li, Haoxuan Qu, Jason Kuen, and 4 more authors
ICCV 2025
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, and 5 more authors
arXiv preprint arXiv:2501.19201, 2025
Efficient Reasoning with Hidden Thinking

Xuan Shen, Yizhou Wang, Xiangxi Shi, and 3 more authors
ACL 2025
From Selection to Generation: A Survey of LLM-based Active Learning

Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, and 8 more authors
Submitted to NeurIPS 2025
Efficient Reasoning with Hidden Thinking

Xuan Shen, Yizhou Wang, Xiangxi Shi, and 3 more authors
Submitted to NeurIPS 2025
ADOPT: A Multimodal Framework for Document Understanding and Generation

Jiuxiang Gu, Jing Shi, Wanrong Zhu, and 6 more authors
Submitted to NeurIPS 2025
DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Xuan Shen, Chenxia Han, Yufa Zhou, and 7 more authors
Submitted to NeurIPS 2025
MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Haozhe Zhao, Zefan Cai, Shuzheng Si, and 4 more authors
TMLR, 2025
From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

Zefan Cai, Haoyi Qiu, Haozhe Zhao, and 6 more authors
NeurIPS 2025
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Zefan Cai, Wen Xiao, Hanshi Sun, and 11 more authors
Submitted to NeurIPS 2025
ADOPD-Instruct: A Large-Scale Multimodal Dataset for Document Editing

Wanrong Zhu, Xiangxi Shi, Yufan Zhou, and 3 more authors
ACL 2025
METAL: A multi-agent framework for chart generation with test-time scaling

Bingxuan Li, Yiwei Wang, Jiuxiang Gu, and 2 more authors
2025 US Patent App. 18/952,023
Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback

Ruiyi Zhang, Yufan Zhou, Christopher Tensmeyer, and 3 more authors
Submitted to ICCV 2025
Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis

Kai Qiu, Xiang Li, Jason Kuen, and 7 more authors
2025 US Patent App. 18/460,747
Generating 3D models from a single image

Hao Tan, Yicong Hong, Kai Zhang, and 8 more authors
2025 US Patent App. 18/528,116
Position-based text-to-speech model

Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, and 5 more authors
CVPR 2025
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge

Xuan Shen, Weize Ma, Jing Liu, and 8 more authors
2025 US Patent App. 18/472,746
Generating an improved named entity recognition model using noisy data with a self-cleaning discriminator model

Ruiyi Zhang, Zhendong Chu, Vlad Morariu, and 4 more authors
2025
Towards Visual Text Grounding of Multimodal Large Language Model

Ming Li, Ruiyi Zhang, Jian Chen, and 6 more authors
2025 US Patent App. 19/239,469
Unified pretraining framework for document understanding

Jiuxiang Gu, Ani Nenkova, Nikolaos Barmpalios, and 4 more authors
arXiv preprint arXiv:2512.14691, 2025
MMGR: Multi-Modal Generative Reasoning

Zefan Cai, Haoyi Qiu, Tengfei Ma, and 6 more authors
arXiv preprint arXiv:2512.12487, 2025
More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

Hoang Anh Just, Yue Fan, Handong Zhao, and 6 more authors

2024

2024 US Patent 11,886,815
Self-supervised document representation learning

Jiuxiang Gu, Vlad Morariu, Varun Manjunatha, and 5 more authors
2024 US Patent 12,136,185
Multi-scale distillation for low-resolution detection

Jason Kuen, Jiuxiang Gu, and Zhe Lin
2024 US Patent 12,148,119
Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback

Ruiyi Zhang, Yufan Zhou, Christopher Tensmeyer, and 3 more authors
2024
Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Zhendong Chu, Ruiyi Zhang, Tong Yu, and 4 more authors
ICLR 2024 Oral
LRM: Large reconstruction model for single image to 3D

Yicong Hong, Kai Zhang, Jiuxiang Gu, and 7 more authors
CVPR 2024
Customization assistant for text-to-image generation

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, and 1 more author
ACL 2024
Selective reflection-tuning: Student-selected data recycling for LLM instruction-tuning

Ming Li, Lichang Chen, Jiuhai Chen, and 3 more authors
ICLR 2024
ADoPD: A large-scale document page decomposition dataset

Jiuxiang Gu, Xiangxi Shi, Jason Kuen, and 5 more authors
ICLR 2024
SOHES: Self-supervised open-world hierarchical entity segmentation

Shengcao Cao, Jiuxiang Gu, Jason Kuen, and 7 more authors
2024 US Patent App. 17/947,737
Image and semantic based table recognition

Jiuxiang Gu, Vlad Morariu, Tong Sun, and 2 more authors
2024 US Patent App. 18/048,900
Label induction

Rajiv Bhawanji Jain, Michelle Yuan, Vlad Ion Morariu, and 8 more authors
2024 US Patent App. 18/173,199
Training language models and preserving privacy

Franck Dernoncourt, Tong Sun, Thi Kim Phung Lai, and 3 more authors
arXiv preprint arXiv:2405.03251, 2024
Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond

Jiuxiang Gu, Chenyang Li, Yingyu Liang, and 2 more authors
COLING 2024
DocScript: Document-Level Script Event Prediction

Puneet Mathur, Vlad I Morariu, Aparna Garimella, and 6 more authors
2024 US Patent App. 18/055,752
Extracting document hierarchy using a multimodal, layer-wise link prediction neural network

Vlad Morariu, Puneet Mathur, Rajiv Jain, and 8 more authors
2024 US Patent 11,995,394
Language-guided document editing

Vlad Ion Morariu, Puneet Mathur, Rajiv Bhawanji Jain, and 2 more authors
CVPR 2024
TRINS: Towards multimodal language models that can read

Ruiyi Zhang, Yanzhe Zhang, Jian Chen, and 4 more authors
arXiv preprint arXiv:2406.09305, 2024
Toffee: Efficient million-scale dataset construction for subject-driven text-to-image generation

Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng, and 5 more authors
CVPR 2024
DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Sanket Biswas, Rajiv Jain, Vlad I Morariu, and 5 more authors
NAACL 2024
Self-Cleaning: Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Zhendong Chu, Ruiyi Zhang, Tong Yu, and 4 more authors
ICML 2024
Category-aware active domain adaptation

Wenxiao Xiao, Jiuxiang Gu, and Hongfu Liu
arXiv preprint arXiv:2407.19185, 2024
LLaVA-Read: Enhancing reading ability of multimodal language models

Ruiyi Zhang, Yufan Zhou, Jian Chen, and 3 more authors
arXiv preprint arXiv:2408.14594, 2024
MMR: Evaluating reading ability of large multimodal models

Jian Chen, Ruiyi Zhang, Yufan Zhou, and 3 more authors
EMNLP 2024
TextLap: Customizing Language Models for Text-to-Layout Planning

Jian Chen, Ruiyi Zhang, Yufan Zhou, and 4 more authors
EMNLP 2024
Advancing Vision-Language Models with Adapter Ensemble Strategies

Yue Bai, Handong Zhao, Zhe Lin, and 5 more authors
2024 US Patent App. 18/318,921
Text-to-image system and method

Ruiyi Zhang, Yufan Zhou, Tong Yu, and 4 more authors
2024 US Patent App. 18/339,883
Identifying visual text using vision-language models

Jiuxiang Gu, Ryan Rossi, Gaurav Verma, and 2 more authors
2024 US Patent App. 18/328,950
Efficient augmentation for multimodal machine learning

Handong Zhao, Yue Bai, Zhe Lin, and 4 more authors
arXiv preprint arXiv:2410.16400, 2024
VipAct: Visual-perception enhancement via specialized VLM agent collaboration and tool-use

Zhehao Zhang, Ryan Rossi, Tong Yu, and 7 more authors
arXiv preprint arXiv:2410.20011, 2024
A survey of small language models

Chien Van Nguyen, Xuan Shen, Ryan Aponte, and 8 more authors
arXiv preprint arXiv:2412.01762, 2024
XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Xiang Li, Kai Qiu, Hao Chen, and 5 more authors
arXiv preprint arXiv:2412.02142, 2024
Personalized Multimodal Large Language Models: A Survey

Junda Wu, Hanjia Lyu, Yu Xia, and 8 more authors
arXiv preprint arXiv:2412.10533, 2024
SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, and 3 more authors

2023

2023 US Patent 11,610,393
Knowledge distillation for neural networks using multiple augmentation strategies

Jason Wen Yong Kuen, Zhe Lin, and Jiuxiang Gu
ICCV 2023
High-Quality Entity Segmentation

Lu Qi, Jason Kuen, Tiancheng Shen, and 5 more authors
WACV 2023
LayerDoc: layer-wise extraction of spatial hierarchical structure in visually-rich documents

Puneet Mathur, Rajiv Jain, Ashutosh Mehra, and 8 more authors
ACL 2023
A critical analysis of out-of-distribution detection for document understanding

Jiuxiang Gu, Yifei Ming, Yi Zhou, and 8 more authors
ACL 2023
Learning the visualness of text using large vision-language models

Gaurav Verma, Ryan A Rossi, Christopher Tensmeyer, and 2 more authors
2023 US Patent 11,816,243
Preserving user-entity differential privacy in natural language modeling

Thi Kim Phung Lai, Tong Sun, Rajiv Jain, and 3 more authors
2023 US Patent App. 17/528,972
Enhanced document visual question answering system via hierarchical attention

Shijie Geng, Christopher Tensmeyer, Curtis Michael Wigington, and 1 more author
NeurIPS 2023
AIMS: all-inclusive multi-level segmentation for anything

Lu Qi, Jason Kuen, Weidong Guo, and 5 more authors
AAAI 2023
DocEdit: language-guided document editing

Puneet Mathur, Rajiv Jain, Jiuxiang Gu, and 3 more authors
NeurIPS Workshop 2023
LLaVAR: Enhanced visual instruction tuning for text-rich image understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, and 4 more authors
2023 US Patent App. 17/577,605
Facilitating identification of fillable regions in a form

Ashutosh Mehra, Christopher Alan Tensmeyer, Vlad Ion Morariu, and 1 more author
2023 US Patent App. 17/650,437
Open vocabulary instance segmentation

Jason Wen Yong Kuen, Dat Ba Huynh, Zhe Lin, and 1 more author
2023 US Patent App. 17/740,497
Adaptive sparse attention pattern

Jiuxiang Gu, Zihan Wang, Jason Wen Yong Kuen, and 5 more authors
2023 US Patent App. 17/664,079
Systems and methods for product retrieval

Handong Zhao, Haoyu Ma, Zhe Lin, and 5 more authors
EMNLP 2023
A critical analysis of document out-of-distribution detection

Jiuxiang Gu, Yifei Ming, Yi Zhou, and 8 more authors
2023 US Patent App. 17/746,779
Multimodal extraction across multiple granularities

Vlad Ion Morariu, Tong Sun, Nikolaos Barmpalios, and 4 more authors
2023 US Patent App. 17/806,097
Open vocabulary instance segmentation with noise estimation and robust student

Jason Wen Yong Kuen, Dat Ba Huynh, Zhe Lin, and 1 more author
NeurIPS Workshop 2023
Reflection-tuning: Data recycling improves LLM instruction-tuning

Ming Li, Lichang Chen, Jiuhai Chen, and 4 more authors

2022

2022 Patent
Generating scene graphs from digital images using external knowledge and image reconstruction

Handong Zhao, Zhe Lin, Sheng Li, and 2 more authors
AAAI 2022
UNISON: Unpaired cross-lingual image captioning

Jiahui Gao, Yi Zhou, Philip L. H. Yu, and 2 more authors
T-PAMI, 2022 Journal
Open world entity segmentation

Lu Qi, Jason Kuen, Yi Wang, and 5 more authors
CVPR 2022
Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling

Dat Huynh, Jason Kuen, Zhe Lin, and 2 more authors
CVPR 2022
Towards language-free training for text-to-image generation

Yufan Zhou, Ruiyi Zhang, Changyou Chen, and 6 more authors
ECCV 2022
CA-SSL: Class-agnostic semi-supervised learning for detection and segmentation

Lu Qi, Jason Kuen, Zhe Lin, and 7 more authors
Big Data 2022
User-entity differential privacy in learning natural language models

Phung Lai, NhatHai Phan, Tong Sun, and 4 more authors
AAAI 2022
TiGAN: Text-based interactive image generation and manipulation

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, and 5 more authors
ACM Web 2022
FedKC: Federated knowledge composition for multilingual natural language understanding

Haoyu Wang, Handong Zhao, Yaqing Wang, and 3 more authors
ACL 2022
Learning adaptive axis attentions in fine-tuning: Beyond fixed sparse attention patterns

Zihan Wang, Jiuxiang Gu, Jason Kuen, and 6 more authors
2022 US Patent App. 17/093,185
Self-supervised visual-relationship probing

Jiuxiang Gu, Vlad Ion Morariu, Tong Sun, and 2 more authors
CVPR 2022
EI-CLIP: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval

Haoyu Ma, Handong Zhao, Zhe Lin, and 6 more authors
NAACL 2022
DocTime: A document-level temporal dependency graph parser

Puneet Mathur, Vlad Morariu, Verena Kaynig-Fittkau, and 6 more authors
ECCV 2022
Meta spatio-temporal debiasing for video scene graph generation

Li Xu, Haoxuan Qu, Jason Kuen, and 2 more authors
INTERSPEECH 2022
DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis

Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, and 5 more authors
2022 US Patent App. 17/805,289
Generating scene graphs from digital images using external knowledge and image reconstruction

Handong Zhao, Zhe Lin, Sheng Li, and 2 more authors
ECCV 2022
Improving the reliability for confidence estimation

Haoxuan Qu, Yanchao Li, Lin Geng Foo, and 3 more authors
NeurIPS, 2022
Delving into out-of-distribution detection with vision-language representations

Yifei Ming, Ziyang Cai, Jiuxiang Gu, and 3 more authors
EMNLP 2022
MGDoc: Pre-training with multi-granular hierarchy for document image understanding

Zilong Wang, Jiuxiang Gu, Chris Tensmeyer, and 5 more authors

2021

NAACL 2021
Towards interpreting and mitigating shortcut learning behavior of NLU models

Mengnan Du, Varun Manjunatha, Rajiv Jain, and 5 more authors
CVPR 2021
Multi-scale aligned distillation for low-resolution detection

Lu Qi, Jason Kuen, Jiuxiang Gu, and 5 more authors
CVPR 2021
SelfDoc: Self-supervised document representation learning

Peizhao Li, Jiuxiang Gu, Jason Kuen, and 5 more authors
CVPR 2021
Exploiting semantic embedding and visual feature for facial action unit detection

Huiyuan Yang, Lijun Yin, Yi Zhou, and 1 more author
NeurIPS 2021
UniDoc: Unified pretraining framework for document understanding

Jiuxiang Gu, Jason Kuen, Vlad I Morariu, and 5 more authors

2020

PESGM 2020 Best Paper
Resilient load restoration in microgrids considering mobile energy storage fleets: A deep reinforcement learning approach

Shuhan Yao, Jiuxiang Gu, Huajun Zhang, and 3 more authors
ECCV 2020
Finding it at another side: A viewpoint-adapted matching encoder for change captioning

Xiangxi Shi, Xu Yang, Jiuxiang Gu, and 2 more authors
2020
Self-supervised relationship probing

Jiuxiang Gu, Jason Kuen, Shafiq Joty, and 4 more authors

2019

ICCV 2019
Unpaired Image Captioning via Scene Graph Alignments

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and 3 more authors
CVPR 2019
Scene graph generation with external knowledge and image reconstruction

Jiuxiang Gu, Handong Zhao, Zhe Lin, and 3 more authors
ACM MM 2019
Watch It Twice: Video Captioning with a Refocused Video Encoder

Xiangxi Shi, Jianfei Cai, Shafiq Joty, and 1 more author

2018

AAAI 2018 Oral
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Jiuxiang Gu, Jianfei Cai, Gang Wang, and 1 more author
CVPR 2018 Spotlight
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

Jiuxiang Gu, Jianfei Cai, Shafiq Joty, and 2 more authors
Pattern Recognition, 2018
Recent advances in convolutional neural networks

Jiuxiang Gu, Zhenhua Wang, Jason Kuen, and 8 more authors
Neurocomputing, 2018
Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction

Xiangxi Shi, Jianfei Cai, Jiuxiang Gu, and 1 more author
ECCV 2018
Unpaired image captioning by language pivoting

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and 1 more author
TRECVID 2018
NTU ROSE Lab at TRECVID 2018: Ad-hoc Video Search and Video to Text

Muhammet Bastan, Xiangxi Shi, Jiuxiang Gu, and 4 more authors

2017

ICCV 2017
An empirical study of language CNN for image captioning

Jiuxiang Gu, Gang Wang, Jianfei Cai, and 1 more author

2014

Journal of University of Chinese Academy of Sciences, 2014
HJ-1C real-time image processing technology based on GPU

Jiuxiang Gu, Renzhong Yang, Lu Shi, and 1 more author

2013

Microelectronics & Computer, 2013
Research of RS Decoding Technology Based on GPU

Jiuxiang Gu, Ren-zhong Yang, and Hong-wei Wei