Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- Efficient vision-language retrieval using structural pruning. 2025. US Patent App. 18/347,877.
- Generating temporal dependency graphs. 2025. US Patent App. 18/493,465.
- ARTIST: Improving the generation of text-rich images by disentanglement. In WACV, 2025.
- Differential privacy mechanisms in neural tangent kernel regression. In WACV, 2025.
- A multi-LLM debiasing framework. Submitted to ACL, 2025.
- ImageFolder: Autoregressive image generation with folded tokens. In ICLR, 2025.
- LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding. In ICLR, 2025.
- Personalization of large language models: A survey. TMLR, 2025.
- Numerical pruning for efficient autoregressive models. In AAAI, 2025.
- LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers. In AAAI, 2025.
- MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data. In CVPR, 2025.
- Efficient Reasoning with Hidden Thinking. arXiv preprint arXiv:2501.19201, 2025.
- From Selection to Generation: A Survey of LLM-based Active Learning. In ACL, 2025.
- Efficient Reasoning with Hidden Thinking. Submitted to NeurIPS, 2025.
- FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge. Submitted to NeurIPS, 2025.
- ADOPT: A Multimodal Framework for Document Understanding and Generation. Submitted to NeurIPS, 2025.
- DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance. Submitted to NeurIPS, 2025.
- MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models. Submitted to NeurIPS, 2025.
- From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models. Submitted to NeurIPS, 2025.
- R-KV: Redundancy-aware KV Cache Compression for Reasoning Models. Submitted to NeurIPS, 2025.
- OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive. Submitted to NeurIPS, 2025.
- ADOPD-Instruct: A Large-Scale Multimodal Dataset for Document Editing. Submitted to NeurIPS, 2025.
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models. In WACV, 2025.
- METAL: A multi-agent framework for chart generation with test-time scaling. In ACL, 2025.
- Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback. 2025. US Patent App. 18/952,023.
- Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis. Submitted to ICCV, 2025.
- Generating 3D models from a single image. 2025. US Patent App. 18/460,747.
- Position-based text-to-speech model. 2025. US Patent App. 18/528,116.
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge. In CVPR, 2025.
- Generating an improved named entity recognition model using noisy data with a self-cleaning discriminator model. 2025. US Patent App. 18/472,746.
- Towards Visual Text Grounding of Multimodal Large Language Model. 2025.
2024
- Self-supervised document representation learning. 2024. US Patent 11,886,815.
- Multi-scale distillation for low-resolution detection. 2024. US Patent 12,136,185.
- Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback. 2024. US Patent 12,148,119.
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances. 2024.
- LRM: Large reconstruction model for single image to 3D. 2024 (Oral).
- Customization assistant for text-to-image generation. In CVPR, 2024.
- Selective reflection-tuning: Student-selected data recycling for LLM instruction-tuning. In ACL, 2024.
- ADoPD: A large-scale document page decomposition dataset. In ICLR, 2024.
- SOHES: Self-supervised open-world hierarchical entity segmentation. In ICLR, 2024.
- Image and semantic based table recognition. 2024. US Patent App. 17/947,737.
- Label induction. 2024. US Patent App. 18/048,900.
- Training language models and preserving privacy. 2024. US Patent App. 18/173,199.
- Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond. arXiv preprint arXiv:2405.03251, 2024.
- DocScript: Document-Level Script Event Prediction. In COLING, 2024.
- Extracting document hierarchy using a multimodal, layer-wise link prediction neural network. 2024. US Patent App. 18/055,752.
- Language-guided document editing. 2024. US Patent 11,995,394.
- TRINS: Towards multimodal language models that can read. In CVPR, 2024.
- Toffee: Efficient million-scale dataset construction for subject-driven text-to-image generation. arXiv preprint arXiv:2406.09305, 2024.
- DocSynthv2: A Practical Autoregressive Modeling for Document Generation. In CVPR, 2024.
- Self-Cleaning: Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances. In NAACL, 2024.
- Category-aware active domain adaptation. In ICML, 2024.
- LLaVA-Read: Enhancing reading ability of multimodal language models. arXiv preprint arXiv:2407.19185, 2024.
- CoMMIT: Coordinated instruction tuning for multimodal large language models. arXiv preprint arXiv:2407.20454, 2024.
- MMR: Evaluating reading ability of large multimodal models. arXiv preprint arXiv:2408.14594, 2024.
- TextLap: Customizing Language Models for Text-to-Layout Planning. In EMNLP, 2024.
- Advancing Vision-Language Models with Adapter Ensemble Strategies. In EMNLP, 2024.
- Text-to-image system and method. 2024. US Patent App. 18/318,921.
- Identifying visual text using vision-language models. Dec 2024. US Patent App. 18/339,883.
- Efficient augmentation for multimodal machine learning. Dec 2024. US Patent App. 18/328,950.
- VipAct: Visual-perception enhancement via specialized VLM agent collaboration and tool-use. arXiv preprint arXiv:2410.16400, Dec 2024.
- A survey of small language models. arXiv preprint arXiv:2410.20011, Dec 2024.
- XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation. arXiv preprint arXiv:2412.01762, Dec 2024.
- Personalized Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2412.02142, Dec 2024.
- SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner. arXiv preprint arXiv:2412.10533, Dec 2024.
2023
- Knowledge distillation for neural networks using multiple augmentation strategies. Dec 2023. US Patent 11,610,393.
- High-Quality Entity Segmentation. In ICCV, Dec 2023.
- LayerDoc: layer-wise extraction of spatial hierarchical structure in visually-rich documents. In WACV, Dec 2023.
- A critical analysis of out-of-distribution detection for document understanding. In ACL, Dec 2023.
- Learning the visualness of text using large vision-language models. In ACL, Dec 2023.
- Preserving user-entity differential privacy in natural language modeling. Dec 2023. US Patent 11,816,243.
- Enhanced document visual question answering system via hierarchical attention. Dec 2023. US Patent App. 17/528,972.
- AIMS: all-inclusive multi-level segmentation for anything. In NeurIPS, Dec 2023.
- DocEdit: language-guided document editing. In AAAI, Dec 2023.
- LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. In NeurIPS Workshop, Dec 2023.
- Facilitating identification of fillable regions in a form. Jul 2023. US Patent App. 17/577,605.
- Open vocabulary instance segmentation. Jul 2023. US Patent App. 17/650,437.
- Adaptive sparse attention pattern. Jul 2023. US Patent App. 17/740,497.
- Systems and methods for product retrieval. Jul 2023. US Patent App. 17/664,079.
- A critical analysis of document out-of-distribution detection. In EMNLP, Jul 2023.
- Multimodal extraction across multiple granularities. Jul 2023. US Patent App. 17/746,779.
- Open vocabulary instance segmentation with noise estimation and robust student. Jul 2023. US Patent App. 17/806,097.
- Reflection-tuning: Data recycling improves LLM instruction-tuning. In NeurIPS Workshop, Jul 2023.
2022
- Generating scene graphs from digital images using external knowledge and image reconstruction. Jul 2022. US Patent 11,373,390.
- UNISON: Unpaired cross-lingual image captioning. In AAAI, Jul 2022.
- Open world entity segmentation. T-PAMI, Jul 2022.
- Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In CVPR, Jul 2022.
- Towards language-free training for text-to-image generation. In CVPR, Jul 2022.
- CA-SSL: Class-agnostic semi-supervised learning for detection and segmentation. In ECCV, Jul 2022.
- User-entity differential privacy in learning natural language models. In Big Data, Jul 2022.
- Bit-aware randomized response for local differential privacy in federated learning. Jul 2022.
- TiGAN: Text-based interactive image generation and manipulation. In AAAI, Jul 2022.
- FedKC: Federated knowledge composition for multilingual natural language understanding. In ACM Web, Jul 2022.
- Learning adaptive axis attentions in fine-tuning: Beyond fixed sparse attention patterns. In ACL, Jul 2022.
- Self-supervised visual-relationship probing. Jul 2022. US Patent App. 17/093,185.
- EI-CLIP: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval. In CVPR, Jul 2022.
- DocTime: A document-level temporal dependency graph parser. In NAACL, Jul 2022.
- Meta spatio-temporal debiasing for video scene graph generation. In ECCV, Jul 2022.
- DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis. In INTERSPEECH, Jul 2022.
- Generating scene graphs from digital images using external knowledge and image reconstruction. Jul 2022. US Patent App. 17/805,289.
- Improving the reliability for confidence estimation. In ECCV, Jul 2022.
- Delving into out-of-distribution detection with vision-language representations. In NeurIPS, Jul 2022.
- MGDoc: Pre-training with multi-granular hierarchy for document image understanding. In EMNLP, Jul 2022.
2021
- Towards interpreting and mitigating shortcut learning behavior of NLU models. In NAACL, Jul 2021.
- Multi-scale aligned distillation for low-resolution detection. In CVPR, Jul 2021.
- SelfDoc: Self-supervised document representation learning. In CVPR, Jul 2021.
- Exploiting semantic embedding and visual feature for facial action unit detection. In CVPR, Jul 2021.
- UniDoc: Unified pretraining framework for document understanding. In NeurIPS, Jul 2021.
2020
- Resilient load restoration in microgrids considering mobile energy storage fleets: A deep reinforcement learning approach. In PESGM, Jul 2020 (Best Paper).
- Finding it at another side: A viewpoint-adapted matching encoder for change captioning. In ECCV, Jul 2020.
- Self-supervised relationship probing. Jul 2020.
2019
- Unpaired Image Captioning via Scene Graph Alignments. In ICCV, Jul 2019.
- Scene graph generation with external knowledge and image reconstruction. In CVPR, Jul 2019.
- Watch It Twice: Video Captioning with a Refocused Video Encoder. In ACM MM, Jul 2019.
2018
- Stack-Captioning: Coarse-to-Fine Learning for Image Captioning. In AAAI, Jul 2018 (Oral).
- Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. In CVPR, Jul 2018 (Spotlight).
- Recent advances in convolutional neural networks. Pattern Recognition, Jul 2018.
- Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction. Neurocomputing, Jul 2018.
- Unpaired image captioning by language pivoting. In ECCV, Jul 2018.
- NTU ROSE Lab at TRECVID 2018: Ad-hoc Video Search and Video to Text. In TRECVID, Jul 2018.
2017
- An empirical study of language CNN for image captioning. In ICCV, Jul 2017.
2014
- HJ-1C real-time image processing technology based on GPU. Journal of University of Chinese Academy of Sciences, Jul 2014.
2013
- Research of RS Decoding Technology Based on GPU. Microelectronics & Computer, Jul 2013.