Chengyao Wang

DreamOmni: Unified Image Generation and Editing

Dec 22, 2024

Bin Xia, Yuechen Zhang, Jingyao Li, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia

Abstract: The success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, and propose a unified framework that integrates both T2I models and various editing tasks. Another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline that uses sticker-like elements to synthesize accurate, high-quality datasets efficiently, which makes it possible to scale up editing data for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Dec 12, 2024

Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen(+5 more)

Abstract: As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.

* Tech report

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Dec 05, 2024

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia

Abstract: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
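
The token selection idea in this abstract can be illustrated with a minimal sketch. It assumes each visual token already carries an importance score (for example, how much attention it receives); the abstract does not state the selection criterion, so the scoring input and the function name `select_informative_tokens` are hypothetical and this is not the VisionZip implementation.

```python
def select_informative_tokens(tokens, scores, keep):
    """Return the `keep` highest-scoring tokens in their original order.

    `scores` is an assumed per-token importance value; how VisionZip
    actually scores tokens is not described in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Rank token indices from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering so positional structure is preserved.
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


visual_tokens = ["sky", "dog", "grass", "ball"]
importance = [0.1, 0.9, 0.3, 0.8]
print(select_informative_tokens(visual_tokens, importance, keep=2))
```

Only the retained tokens would be passed on to the language model, which is where the reduction in sequence length, and therefore prefilling cost, would come from.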

* 2 columns, 28 pages, 15 figures, 18 tables

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Mar 27, 2024

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia

Abstract: In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It achieves leading performance on several zero-shot benchmarks and even surpasses developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.

GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding

Mar 14, 2024

Chengyao Wang, Li Jiang, Xiaoyang Wu, Zhuotao Tian, Bohao Peng, Hengshuang Zhao, Jiaya Jia

Abstract: Self-supervised 3D representation learning aims to learn effective representations from large-scale unlabeled point clouds. Most existing approaches adopt point discrimination as the pretext task, which assigns matched points in two distinct views as positive pairs and unmatched points as negative pairs. However, this approach often results in semantically identical points having dissimilar representations, leading to a high number of false negatives and introducing a "semantic conflict" problem. To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning. Segment grouping partitions points into semantically meaningful regions, which enhances semantic coherence and provides semantic guidance for the subsequent contrastive representation learning. Semantic-aware contrastive learning augments the semantic information extracted from segment grouping and helps to alleviate the issue of "semantic conflict". We conducted extensive experiments on multiple 3D scene understanding tasks. The results demonstrate that GroupContrast learns semantically meaningful representations and achieves promising transfer learning performance.
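
The "semantic conflict" described in this abstract can be made concrete with a small sketch of the point discrimination baseline. It assumes an InfoNCE-style loss with dot-product similarity and a temperature of 0.1; the abstract does not name the loss, so these choices and the function names are illustrative assumptions, not the paper's code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def point_discrimination_loss(view_a, view_b, temperature=0.1):
    """Average InfoNCE loss over points (an assumed form of the pretext task).

    Point i in view_a is the positive for point i in view_b; every other
    point in view_b is treated as a negative, whatever its semantics.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [dot(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Two points with distinct features: the loss can approach zero.
distinct = [[1.0, 0.0], [0.0, 1.0]]
print(point_discrimination_loss(distinct, distinct))

# Two points from the same object with identical features: each is forced
# to act as a negative for the other, so the loss is stuck at log(2).
same_object = [[1.0, 0.0], [1.0, 0.0]]
print(point_discrimination_loss(same_object, same_object))
```

In the second case the only way to lower the loss is to push the two semantically identical points apart, which is the false-negative behavior the abstract attributes to point discrimination.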

* CVPR 2024

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Nov 28, 2023

Yanwei Li, Chengyao Wang, Jiaya Jia

Abstract: In this work, we present LLaMA-VID, a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is shown to surpass previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
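
The dual-token representation in this abstract can be sketched as follows. The sketch assumes the context token is an attention-weighted average of a frame's patch features, with weights driven by a text query embedding, and that the content token is a plain mean over the patches. The abstract only says what each token encodes, so both pooling choices and all function names here are hypothetical.

```python
import math


def mean_pool(features):
    """Content token (assumed): average of all patch features in a frame."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def context_token(features, query):
    """Context token (assumed): patches weighted by similarity to the query."""
    logits = [sum(q * x for q, x in zip(query, f)) for f in features]
    # Softmax over patches, shifted by the max logit for numerical stability.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def frame_to_two_tokens(features, query):
    """Compress one frame's patch features into (context, content) tokens."""
    return context_token(features, query), mean_pool(features)


patches = [[1.0, 0.0], [0.0, 1.0]]   # two toy patch features for one frame
user_query = [10.0, 0.0]             # toy embedding of the user instruction
context, content = frame_to_two_tokens(patches, user_query)
print(context, content)
```

Whatever the number of patches per frame, the language model would then see two tokens for that frame, which is what makes long videos tractable in this scheme.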

* Code is available at https://github.com/dvlab-research/LLaMA-VID

Hierarchical Dense Correlation Distillation for Few-Shot Segmentation-Extended Abstract

Jun 27, 2023

Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chengyao Wang, Shu Liu, Jingyong Su, Jiaya Jia

Abstract: Few-shot semantic segmentation (FSS) aims to build class-agnostic models that segment unseen classes with only a handful of annotations. Previous methods, limited to semantic feature and prototype representations, suffer from coarse segmentation granularity and train-set overfitting. In this work, we design the Hierarchically Decoupled Matching Network (HDMNet), which mines pixel-level support correlation based on the transformer architecture. The self-attention modules are used to assist in establishing hierarchical dense features, as a means to accomplish the cascade matching between query and support features. Moreover, we propose a matching module to reduce train-set overfitting and introduce correlation distillation, which leverages semantic correspondence from coarse resolution to boost fine-grained segmentation. Our method performs decently in experiments. We achieve 50.0% mIoU on the COCO dataset in the one-shot setting and 56.0% in the five-shot setting. The code will be available on the project website. We hope our work can benefit broader industrial applications where novel classes with limited annotations need to be decently identified.

* Accepted to CVPR 2023 VISION Workshop, Oral. The extended abstract of Hierarchical Dense Correlation Distillation for Few-Shot Segmentation. arXiv admin note: substantial text overlap with arXiv:2303.14652
