Jiazheng Liu

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Mar 10, 2025

Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu

Abstract: Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions, thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Oct 20, 2023

Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

Abstract: Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game." Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation. Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback. In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. Codes and datasets will be released.

* 19 pages, 14 figures

LLaMA Rider: Spurring Large Language Models to Explore the Open World

Oct 13, 2023

Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu

Abstract: Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, and have tried to align the LLMs' knowledge with the world conditions. Nonetheless, the capacity of LLMs to continuously acquire environmental knowledge and adapt in an open world remains uncertain. In this paper, we propose an approach to spur LLMs to explore the open world, gather experiences, and learn to improve their task-solving capabilities. In this approach, a multi-round feedback-revision mechanism is utilized to encourage LLMs to actively select appropriate revision actions guided by feedback information from the environment. This facilitates exploration and enhances the model's performance. Besides, we integrate sub-task relabeling to assist LLMs in maintaining consistency in sub-task planning and help the model learn the combinatorial nature between tasks, enabling it to complete a wider range of tasks through training based on the acquired exploration experiences. By evaluation in Minecraft, an open-ended sandbox world, we demonstrate that our approach, LLaMA-Rider, enhances the efficiency of the LLM in exploring the environment, and effectively improves the LLM's ability to accomplish more tasks through fine-tuning with merely 1.3k instances of collected data, showing minimal training costs compared to the baseline using reinforcement learning.

* 18 pages

Blind2Sound: Self-Supervised Image Denoising without Residual Noise

Mar 14, 2023

Zejin Wang, Jiazheng Liu, Hao Zhai, Hua Han

Abstract: Self-supervised blind denoising for Poisson-Gaussian noise remains a challenging task. Pseudo-supervised pairs constructed from single noisy images re-corrupt the signal and degrade the performance. The visible blindspots solve the information loss in masked inputs. However, without explicit noise sensing, mean square error as an objective function cannot adjust denoising intensities for dynamic noise levels, leading to noticeable residual noise. In this paper, we propose Blind2Sound, a simple yet effective approach to overcome residual noise in denoised images. The proposed adaptive re-visible loss senses noise levels and performs personalized denoising without noise residues while retaining the signal lossless. The theoretical analysis of intermediate medium gradients guarantees stable training, while the Cramer Gaussian loss acts as a regularization to facilitate the accurate perception of noise levels and improve the performance of the denoiser. Experiments on synthetic and real-world datasets show the superior performance of our method, especially for single-channel images.

Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots

Mar 15, 2022

Zejin Wang, Jiazheng Liu, Guoqing Li, Hua Han

Abstract: Real noisy-clean pairs on a large scale are costly and difficult to obtain. Meanwhile, supervised denoisers trained on synthetic data perform poorly in practice. Self-supervised denoisers, which learn only from single noisy images, solve the data collection problem. However, self-supervised denoising methods, especially blindspot-driven ones, suffer sizable information loss during input or network design. The absence of valuable information dramatically reduces the upper bound of denoising performance. In this paper, we propose a simple yet efficient approach called Blind2Unblind to overcome the information loss in blindspot-driven denoising methods. First, we introduce a global-aware mask mapper that enables global perception and accelerates training. The mask mapper samples all pixels at blind spots on denoised volumes and maps them to the same channel, allowing the loss function to optimize all blind spots at once. Second, we propose a re-visible loss to train the denoising network and make blind spots visible. The denoiser can learn directly from raw noise images without losing information or being trapped in identity mapping. We also theoretically analyze the convergence of the re-visible loss. Extensive experiments on synthetic and real-world datasets demonstrate the superior performance of our approach compared to previous work. Code is available at https://github.com/demonsjin/Blind2Unblind.

* Accepted to CVPR2022
