🎨 Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

1Shanghai AI Laboratory 2Peking University 3Xi'an Jiaotong University 4University of Wisconsin-Madison 5The Chinese University of Hong Kong


Abstract

The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.

Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder, and an LLM to support various visual prompts (points, bounding boxes, circles, and free-form shapes) alongside language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, covering natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark for assessing a model's capability to understand visual prompting instructions.

Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed region description and question-answering abilities.

SPHINX-V: Understanding Arbitrary Format Visual Prompts and Images from Diverse Domains

SPHINX-V demonstrates robust capabilities in recognizing diverse visual prompts and images across various domains, enabling it to perform complex tasks such as Visual Question Answering (VQA), Optical Character Recognition (OCR), regional captioning, and other intricate referring tasks. This proficiency stems from its simple yet effective model architecture and training strategy:

Figure: Overview of the SPHINX-V architecture.

SPHINX-V comprises three components: a mixed vision encoder, a versatile visual prompt encoder, and a large language model (LLM). Given an image, the coordinates of specific points or regions, and a language instruction, we tokenize and encode the inputs to obtain three distinct embeddings: image embeddings \( Z_i \), visual prompt embeddings \( Z_v \), and text embeddings \( Z_t \). We then employ two projection matrices to convert \( Z_i \) and \( Z_v \) into language embedding tokens, aligning them with the word embedding space of the language model. Finally, we concatenate the image, visual prompt, and language representations and forward them to the LLM to attain pixel- and region-level understanding. A minimal sketch of this forward pass is shown below.
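The following pseudo-PyTorch sketch summarizes that composition, assuming generic encoder modules and a Hugging Face-style LLM interface (get_input_embeddings, inputs_embeds); all class, module, and parameter names (SphinxVSketch, proj_i, proj_v, and so on) are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class SphinxVSketch(nn.Module):
        """Illustrative sketch of the SPHINX-V forward pass (not the official code)."""

        def __init__(self, vision_encoder, prompt_encoder, llm, d_vis, d_vp, d_lm):
            super().__init__()
            self.vision_encoder = vision_encoder  # mixed vision encoder -> Z_i
            self.prompt_encoder = prompt_encoder  # visual prompt encoder -> Z_v
            self.llm = llm                        # large language model
            # Two projection matrices map Z_i and Z_v into the LLM word-embedding space.
            self.proj_i = nn.Linear(d_vis, d_lm)
            self.proj_v = nn.Linear(d_vp, d_lm)

        def forward(self, image, visual_prompts, text_ids):
            z_i = self.proj_i(self.vision_encoder(image))           # (B, N_i, d_lm)
            z_v = self.proj_v(self.prompt_encoder(visual_prompts))  # (B, N_v, d_lm)
            z_t = self.llm.get_input_embeddings()(text_ids)         # (B, N_t, d_lm)
            # Concatenate image, visual prompt, and text tokens and feed them to the LLM.
            tokens = torch.cat([z_i, z_v, z_t], dim=1)
            return self.llm(inputs_embeds=tokens)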

SPHINX-V is trained in two stages:

  • Stage 1: Image-Visual Prompt-Text Alignment Pre-training.

    In this stage, we freeze both the pre-trained vision encoder and the pre-trained LLM and train the visual prompt encoder so that visual prompt features align with the joint image-visual prompt-text feature space. An MLP serves as the connector between visual prompts and language, enhancing the model's multimodal prompting capabilities.

  • Stage 2: Multi-Task Supervised Finetuning.

    At this stage, we load the weights trained in stage 1 and keep the vision encoder and visual prompt encoder frozen. We then fine-tune the visual prompt projector and the LLM, focusing on enhancing SPHINX-V's ability to accurately interpret user instructions and handle diverse pixel-level understanding tasks, such as detailed captioning, inter-relationship analysis, and complex reasoning. A minimal sketch of this staged freezing scheme follows the list.
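The sketch below freezes and unfreezes the modules from the architecture sketch above according to the two-stage schedule. It is an assumption for illustration only; in particular, the handling of the image projector is not specified in the text.

    def configure_stage(model, stage: int):
        """Freeze/unfreeze SPHINX-V modules per training stage (illustrative sketch)."""
        def set_trainable(module, flag):
            for p in module.parameters():
                p.requires_grad = flag

        # The pre-trained vision encoder stays frozen in both stages.
        set_trainable(model.vision_encoder, False)
        # The image projector (proj_i) is not discussed in the text; left at its default here.

        if stage == 1:
            # Stage 1: train the visual prompt encoder and its MLP connector;
            # the vision encoder and the LLM remain frozen.
            set_trainable(model.prompt_encoder, True)
            set_trainable(model.proj_v, True)
            set_trainable(model.llm, False)
        else:
            # Stage 2: additionally freeze the visual prompt encoder; fine-tune
            # the visual prompt projector and the LLM on multi-task instruction data.
            set_trainable(model.prompt_encoder, False)
            set_trainable(model.proj_v, True)
            set_trainable(model.llm, True)

        # Return the parameters that the optimizer should update in this stage.
        return [p for p in model.parameters() if p.requires_grad]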

MDVP-Data: Multi-domain Visual-Prompt Instruction Dataset

We introduce MDVP-Data, an instruction dataset designed to foster fine-grained and open-world image understanding in MLLMs, encompassing approximately 1.6 million multimodal dialogues. MDVP-Data integrates both point-level and region-level instruction data derived from public datasets. It consists of two types of data:

  1. Restructured public grounding datasets, reformatted for visual prompt-based instruction-following tuning;
  2. High-quality training pairs built from meticulously crafted prompt templates and generated with GPT-4V. An illustrative sample format is sketched below.
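To make the data format concrete, here is a hypothetical MDVP-style instruction-following sample. The field names, file path, and overall schema are assumptions for illustration only, not the released data format.

    # Hypothetical MDVP-style training sample (schema and field names are assumed).
    sample = {
        "image": "images/coco/000000123456.jpg",  # placeholder path
        "domain": "natural",                      # natural / document / ocr / mobile / web / multi-panel
        "visual_prompts": [
            {"type": "box", "coords": [48, 102, 211, 305]},  # x1, y1, x2, y2 in pixels
            {"type": "point", "coords": [330, 215]},         # x, y in pixels
        ],
        "conversations": [
            {"from": "human", "value": "Describe the object marked by <region1> in detail."},
            {"from": "gpt", "value": "The box encloses a brown dog lying on the grass, ..."},
        ],
    }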
Figure: Overview of MDVP-Data.

The diagram above shows the distribution of images drawn from various sources, including natural scenes, OCR text, web content, mobile interfaces, documents, and multi-panel graphics. It also features a sample from the GPT-assisted MDVP dataset, highlighting the diversity and richness of its point-based and region-based instruction-following data.

Example visualizations from MDVP-Data across domains:

  • Natural Image
  • Multi-Panel Image
  • Document
  • Text Spotting
  • Web Page
  • Screenshot

Performance

Traditional Evaluation Tasks

In traditional evaluation tasks, SPHINX-V significantly outperforms existing visual-prompt-based methods.

Figure: Semantic similarity and semantic IoU for referring object classification (ROC) on LVIS and PACO.

Figure: Accuracy of ROC on LVIS and of optical character recognition (OCR) on COCO-Text.

Figure: Detailed region-level captioning performance on the RefCOCOg validation set, evaluated with GPT-4V.

Figure: Brief region captioning performance on RefCOCOg, evaluated with traditional metrics.

Figure: Validation accuracy on the VCR dataset. Q, A, and R stand for Question, Answer, and Rationale, respectively.
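For reference, the two ROC metrics reported above are often computed roughly as sketched below. This sketch, including the choice of a sentence-transformers encoder, is an assumption about the evaluation protocol rather than the authors' exact script.

    # Rough sketch of the ROC metrics (assumed protocol, not the official evaluation code).
    from sentence_transformers import SentenceTransformer, util

    _encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

    def semantic_similarity(pred: str, gt: str) -> float:
        """Cosine similarity between embeddings of the predicted and ground-truth labels."""
        emb = _encoder.encode([pred, gt], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    def semantic_iou(pred: str, gt: str) -> float:
        """Word-level intersection-over-union between predicted and ground-truth labels."""
        p, g = set(pred.lower().split()), set(gt.lower().split())
        return len(p & g) / max(len(p | g), 1)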

MDVP-Bench

To evaluate the proficiency of MLLMs in complex pixel-level image understanding tasks and their versatility across various domains, we curated a subset of MDVP-Data. This subset underwent a thorough manual content review and filtering process, resulting in MDVP-Bench, a comprehensive and challenging benchmark covering a wide range of tasks, including concise descriptions, elaborate narratives, analyses of interconnections among different regions, and complex reasoning. The performance of existing visual-prompt-based methods on MDVP-Bench is shown below:

Figure: Performance of visual-prompt-based methods on MDVP-Bench.

BibTeX


      @misc{lin2024drawandunderstand,
          title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
          author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
          year={2024},
          eprint={2403.20271},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
      }