The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder, and an LLM to support various visual prompts (points, bounding boxes, circles, and free-form shapes) together with language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, covering natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions.
Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed region description and question-answering abilities.
SPHINX-V demonstrates robust capabilities in recognizing diverse visual prompts and images across various domains, enabling execution of complex tasks such as Visual Question Answering (VQA), Optical Character Recognition (OCR), Regional Captioning, and other intricate referring tasks. This formidable proficiency stems from its simple yet effective model architecture and training strategies:
SPHINX-V comprises three components: a mixed vision encoder, a versatile visual prompt encoder, and a large language model (LLM). Given an image, the coordinates of specific points or regions, and an input language instruction, we perform tokenization and encoding to derive three distinct embeddings: image embeddings \( Z_i \), visual prompt embeddings \( Z_v \), and text embeddings \( Z_t \). Subsequently, we employ two projection matrices to convert \( Z_i \) and \( Z_v \) into language embedding tokens, aligning them with the word embedding space of the language model. Finally, we integrate the image, visual prompt, and language representations and forward them collectively to the LLM to attain pixel/region-level understanding capability.
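Below is a minimal, runnable PyTorch sketch of the forward pass just described. The stub encoders, class names, and embedding widths are illustrative assumptions, not the released SPHINX-V implementation; the sketch only shows how two projections map \( Z_i \) and \( Z_v \) into the word-embedding space before being concatenated with \( Z_t \) and handed to the LLM.

import torch
import torch.nn as nn

D_IMG, D_VP, D_TXT = 1024, 256, 4096     # assumed feature widths

class StubVisionEncoder(nn.Module):       # stands in for the mixed vision encoder
    def forward(self, image):             # image: (B, 3, H, W)
        b = image.shape[0]
        return torch.randn(b, 256, D_IMG)     # Z_i: (B, N_image_tokens, D_IMG)

class StubPromptEncoder(nn.Module):       # stands in for the visual prompt encoder
    def forward(self, coords):            # coords: (B, N_prompts, 4), e.g. boxes
        b, n, _ = coords.shape
        return torch.randn(b, n, D_VP)         # Z_v: (B, N_prompts, D_VP)

class SphinxVSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = StubVisionEncoder()
        self.prompt_encoder = StubPromptEncoder()
        self.text_embed = nn.Embedding(32000, D_TXT)   # LLM word embeddings (stand-in)
        # Two projection matrices that map Z_i and Z_v into the word-embedding space.
        self.proj_image = nn.Linear(D_IMG, D_TXT)
        self.proj_prompt = nn.Linear(D_VP, D_TXT)

    def forward(self, image, prompt_coords, text_ids):
        z_i = self.proj_image(self.vision_encoder(image))        # projected image tokens
        z_v = self.proj_prompt(self.prompt_encoder(prompt_coords))  # projected prompt tokens
        z_t = self.text_embed(text_ids)                           # text tokens Z_t
        # Concatenate image, visual prompt, and text tokens; in the real model this
        # sequence is fed to the LLM for pixel/region-level understanding.
        return torch.cat([z_i, z_v, z_t], dim=1)

model = SphinxVSketch()
seq = model(torch.randn(1, 3, 448, 448),
            torch.tensor([[[0.1, 0.2, 0.5, 0.6]]]),
            torch.randint(0, 32000, (1, 16)))
print(seq.shape)   # (1, 256 + 1 + 16, 4096)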
SPHINX-V is trained in two stages:
Stage 1: We freeze both the pre-trained vision encoder and the pre-trained LLM and train the visual prompt encoder, aligning visual prompt features with the image-visual prompt-text feature space. An MLP serves as the connector between visual prompts and language, enhancing the model's multimodal prompting capabilities.
Stage 2: We load the weights trained in stage 1 and keep the vision encoder and visual prompt encoder frozen, then fine-tune the visual prompt projector and the LLM. The focus is on enhancing SPHINX-V's ability to accurately interpret user instructions and handle diverse pixel-level understanding tasks, such as detailed captioning, inter-relationship analysis, and complex reasoning.
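The following sketch shows one plausible way to realize this two-stage schedule on the SphinxVSketch module from the previous snippet. Which parameters are trainable in each stage follows the description above; the optimizer, learning rate, and the treatment of the image projector are assumptions.

import torch

def set_trainable(module, flag):
    # Toggle requires_grad for every parameter in a submodule.
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Stage 1: freeze the pre-trained vision encoder and the LLM; train the
    #          visual prompt encoder and its MLP projector for alignment.
    # Stage 2: additionally freeze the visual prompt encoder; fine-tune the
    #          visual prompt projector and the LLM.
    set_trainable(model.vision_encoder, False)       # frozen in both stages
    set_trainable(model.proj_image, False)           # assumption: not specified in the text
    set_trainable(model.prompt_encoder, stage == 1)
    set_trainable(model.proj_prompt, True)           # trained in both stages
    set_trainable(model.text_embed, stage == 2)      # stands in for the LLM in this sketch
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)     # optimizer and lr are assumptions

optimizer = configure_stage(model, stage=1)   # stage 1: image-visual prompt-text alignment
optimizer = configure_stage(model, stage=2)   # stage 2: supervised fine-tuning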
We introduce MDVP-Data, an instruction dataset designed to foster fine-grained and open-world image understanding in MLLMs, encompassing approximately 1.6 million multimodal dialogues. MDVP-Data integrates both point-level and region-level instruction data derived from public datasets. It consists of two types of data:
The diagram above displays the distribution of images drawn from various sources, including natural scenes, OCR texts, web content, mobile interfaces, documents, and multi-panel graphics. It also features a sample from the GPT-assisted MDVP dataset, emphasizing the diversity and richness of its point-based and region-based instruction-following data.
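For illustration, a single MDVP-Data record could pair an image with one or more visual prompts and an instruction-following conversation. The field names and values in the sketch below are hypothetical and meant only to convey the structure; consult the released dataset for the actual schema.

# Hypothetical MDVP-Data sample layout (not the released schema).
sample = {
    "image": "images/000000123456.jpg",   # natural, OCR, web, mobile, document, or multi-panel image
    "visual_prompts": [
        {"type": "box",   "coords": [120, 48, 310, 260]},   # bounding box (x1, y1, x2, y2)
        {"type": "point", "coords": [205, 150]},            # a single clicked point
    ],
    "conversations": [
        {"from": "human", "value": "Describe <region1> and its relationship to <region2>."},
        {"from": "gpt",   "value": "Region 1 shows a brown dog lying on a wooden deck, next to ..."},
    ],
}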
In traditional evaluation tasks, SPHINX-V significantly outperforms existing visual-prompt-based methods.
Semantic Similarity and Semantic IoU of ROC (referring object classification) on LVIS and PACO.
Accuracy of ROC on LVIS and of OCR (optical character recognition) on COCO-Text.
Detailed region-level captioning performance on the RefCOCOg validation set, using GPT-4V for evaluation.
Brief region captioning performance evaluated on RefCOCOg, using traditional evaluation metrics.
Validation accuracy on the VCR dataset. Q, A, and R stand for Question, Answer, and Rationale, respectively.
To evaluate the proficiency of MLLMs in complex pixel-level image understanding tasks and their versatility across various domains, we curated a subset of our MDVP-Data. This subset underwent a thorough manual content review and filtering process, resulting in MDVP-Bench, a comprehensive and challenging benchmark covering a wide range of tasks, including concise descriptions, elaborate narratives, analyses of interconnections among different regions, and complex reasoning. The performance of existing visual-prompt-based methods on MDVP-Bench is as follows:
@misc{lin2024drawandunderstand,
  title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
  author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
  year={2024},
  eprint={2403.20271},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}