Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hang Song

Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models

May 30, 2025

Shilin Xu, Yanwei Li, Rui Yang, Tao Zhang, Yueyi Sun, Wei Chow, Linfeng Li, Hang Song, Qi Xu, Yunhai Tong(+2 more)

Abstract:Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLMs post-training, they constantly explore one specific aspect, such as grounding tasks, math problems, or chart analysis. There are no works that can leverage multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL on various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.

Via

Access Paper or Ask Questions

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Nov 14, 2024

Wei Wang, Zhaowei Li, Qi Xu, Linfeng Li, YiQing Cai, Botian Jiang, Hang Song, Xingcan Hu, Pengyu Wang, Li Xiao

Figure 1 for Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Figure 2 for Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Figure 3 for Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Figure 4 for Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Abstract:Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks. However, they often encounter significant challenges due to inadequate alignment for fine-grained knowledge, which restricts their ability to accurately capture local details and attain a comprehensive global perception. While recent advancements have focused on aligning object expressions with grounding information, they typically lack explicit integration of object images, which contain affluent information beyond mere texts or coordinates. To bridge this gap, we introduce a novel fine-grained visual knowledge alignment method that effectively aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images. This innovative method is underpinned by our multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K essential training data to enhance alignment and improve overall performance. Furthermore, we present TinyGroundingGPT, a series of compact models optimized for high-level alignments. With a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results in grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.

Via

Access Paper or Ask Questions

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Aug 05, 2024

Zhaowei Li, Wei Wang, YiQing Cai, Xu Qi, Pengyu Wang, Dong Zhang, Hang Song, Botian Jiang, Zhida Huang, Tao Wang

Figure 1 for UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Figure 2 for UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Figure 3 for UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Figure 4 for UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Abstract:Significant advancements has recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and preforming reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and an 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at \url{https://github.com/lzw-lzw/UnifiedMLLM}.

Via

Access Paper or Ask Questions

Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

Mar 19, 2024

Hang Song, Hyun Joon Kim, Mingxia Wan, Bo Wei, Takamaro Kikkawa, Jun-ichi Takada

Figure 1 for Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

Figure 2 for Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

Figure 3 for Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

Figure 4 for Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

Abstract:Measuring the complex permittivity of material is essential in many scenarios such as quality check and component analysis. Generally, measurement methods for characterizing the material are based on the usage of vector network analyzer, which is large and not easy for on-site measurement, especially in high frequency range such as millimeter wave (mmWave). In addition, some measurement methods require the destruction of samples, which is not suitable for non-destructive inspection. In this work, a small distance increment (SDI) method is proposed to non-destructively measure the complex permittivity of material. In SDI, the transmitter and receiver are formed as the monostatic radar, which is facing towards the material under test (MUT). During the measurement, the distance between radar and MUT changes with small increments and the signals are recorded at each position. A mathematical model is formulated to depict the relationship among the complex permittivity, distance increment, and measured signals. By fitting the model, the complex permittivity of MUT is estimated. To implement and evaluate the proposed SDI method, a commercial off-the-shelf mmWave radar is utilized and the measurement system is developed. Then, the evaluation was carried out on the acrylic plate. With the proposed method, the estimated complex permittivity of acrylic plate shows good agreement with the literature values, demonstrating the efficacy of SDI method for characterizing the complex permittivity of material.

Via

Access Paper or Ask Questions

GroundingGPT:Language Enhanced Multi-modal Grounding Model

Jan 30, 2024

Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu(+2 more)

Abstract:Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https: //github.com/lzw-lzw/GroundingGPT.

Via

Access Paper or Ask Questions

Pensieve 5G: Implementation of RL-based ABR Algorithm for UHD 4K/8K Content Delivery on Commercial 5G SA/NR-DC Network

Dec 29, 2022

Kasidis Arunruangsirilert, Bo Wei, Hang Song, Jiro Katto

Abstract:While the rollout of the fifth-generation mobile network (5G) is underway across the globe with the intention to deliver 4K/8K UHD videos, Augmented Reality (AR), and Virtual Reality (VR) content to the mass amounts of users, the coverage and throughput are still one of the most significant issues, especially in the rural areas, where only 5G in the low-frequency band are being deployed. This called for a high-performance adaptive bitrate (ABR) algorithm that can maximize the user quality of experience given 5G network characteristics and data rate of UHD contents. Recently, many of the newly proposed ABR techniques were machine-learning based. Among that, Pensieve is one of the state-of-the-art techniques, which utilized reinforcement-learning to generate an ABR algorithm based on observation of past decision performance. By incorporating the context of the 5G network and UHD content, Pensieve has been optimized into Pensieve 5G. New QoE metrics that more accurately represent the QoE of UHD video streaming on the different types of devices were proposed and used to evaluate Pensieve 5G against other ABR techniques including the original Pensieve. The results from the simulation based on the real 5G Standalone (SA) network throughput shows that Pensieve 5G outperforms both conventional algorithms and Pensieve with the average QoE improvement of 8.8% and 14.2%, respectively. Additionally, Pensieve 5G also performed well on the commercial 5G NR-NR Dual Connectivity (NR-DC) Network, despite the training being done solely using the data from the 5G Standalone (SA) network.

* 2023 IEEE Wireless Communications and Networking Conference (WCNC), 26-29 March 2023, Glasgow, Scotland, UK

Via

Access Paper or Ask Questions

QuDASH: Quantum-inspired rate adaptation approach for DASH video streaming

Jun 19, 2022

Bo Wei, Hang Song, Makoto Nakamura, Koichi Kimura, Nozomu Togawa, Jiro Katto

Figure 1 for QuDASH: Quantum-inspired rate adaptation approach for DASH video streaming

Figure 2 for QuDASH: Quantum-inspired rate adaptation approach for DASH video streaming

Figure 3 for QuDASH: Quantum-inspired rate adaptation approach for DASH video streaming

Figure 4 for QuDASH: Quantum-inspired rate adaptation approach for DASH video streaming

Abstract:Internet traffic is dramatically increasing with the development of network technologies. Within the total traffic, video streaming traffic accounts for a large amount, which reveals the importance to guarantee the quality of content delivery service. Based on the network conditions, adaptive bitrate (ABR) control is utilized as a common technique which can choose the proper bitrate to ensure the video streaming quality. In this paper, a new bitrate control method, QuDASH is proposed by taking advantage of the emerging quantum technology. In QuDASH, the adaptive control model is developed using the quadratic unconstrained binary optimization (QUBO), which aims at increasing the average bitrate and decreasing the video rebuffering events to maximize the user quality of experience (QoE). Then, the control model is solved by Digital Annealer, which is a quantum-Inspired computing technology. The evaluation of the proposed method is carried out by simulation with the measured throughput traces in real world. Experiment results demonstrated that the proposed QuDASH method has better performance in terms of QoE compared with other advanced ABR methods. In 68.2% of the examined cases, QuDASH achieves the highest QoE results, which shows the superiority of the QuDASH over conventional methods.

Via

Access Paper or Ask Questions

RSSI-CSI Measurement and Variation Mitigation with Commodity WiFi Device

Mar 24, 2022

Bo Wei, Hang Song, Jiro Katto, Takamaro Kikkawa

Figure 1 for RSSI-CSI Measurement and Variation Mitigation with Commodity WiFi Device

Figure 2 for RSSI-CSI Measurement and Variation Mitigation with Commodity WiFi Device

Figure 3 for RSSI-CSI Measurement and Variation Mitigation with Commodity WiFi Device

Figure 4 for RSSI-CSI Measurement and Variation Mitigation with Commodity WiFi Device

Abstract:Owing to the plentiful information released by the commodity devices, WiFi signals have been widely studied for various wireless sensing applications. In many works, both received signal strength indicator (RSSI) and the channel state information (CSI) are utilized as the key factors for precise sensing. However, the calculation and relationship between RSSI and CSI is not explained in detail. Furthermore, there are few works focusing on the measurement variation of the WiFi signal which impacts the sensing results. In this paper, the relationship between RSSI and CSI is studied in detail and the measurement variation of amplitude and phase information is investigated by extensive experiments. In the experiments, transmitter and receiver are directly connected by power divider and RF cables and the signal transmission is quantitatively controlled by RF attenuators. By changing the intensity of attenuation, the measurement of RSSI and CSI is carried out under different conditions. From the results, it is found that in order to get a reliable measurement of the signal amplitude and phase by commodity WiFi, the attenuation of the channels should not exceed 60 dB. Meanwhile, the difference between two channels should be lower than 10 dB. An active control mechanism is suggested to ensure the measurement stability. The findings and criteria of this work is promising to facilitate more precise sensing technologies with WiFi signal.

Via

Access Paper or Ask Questions

Adaptive video transmission using QUBO method and Digital Annealer based on Ising machine

Sep 25, 2021

Bo Wei, Hang Song, Jiro Katto

Figure 1 for Adaptive video transmission using QUBO method and Digital Annealer based on Ising machine

Figure 2 for Adaptive video transmission using QUBO method and Digital Annealer based on Ising machine

Abstract:With the dramatically increasing video streaming in the total network traffic, it is critical to develop effective algorithms to promote the content delivery service of high quality. Adaptive bitrate (ABR) control is the most essential technique which determines the proper bitrate to be chosen based on network conditions, thus realize high-quality video streaming. In this paper, a novel ABR strategy is proposed based on Ising machine by using the quadratic unconstrained binary optimization (QUBO) method and Digital Annealer (DA) for the first time. The proposed method is evaluated by simulation with the real-world measured throughput, and compared with other state-of-the-art methods. Experiment results show that the proposed QUBO-based method can outperform the existing methods, which demonstrating the superior of the proposed QUBO-based method.

Via

Access Paper or Ask Questions

Synchronous Maneuver Searching and Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments

Sep 17, 2019

Lilin Qian, Xin Xu, Yujun Zeng, Xiaohui Li, Zhenping Sun, Hang Song

Figure 1 for Synchronous Maneuver Searching and Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments

Figure 2 for Synchronous Maneuver Searching and Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments

Figure 3 for Synchronous Maneuver Searching and Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments

Figure 4 for Synchronous Maneuver Searching and Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments

Abstract:In the real-time decision-making and local planning process of autonomous vehicles in dynamic environments, the autonomous driving system may fail to find a reasonable policy or even gets trapped in some situation due to the complexity of global tasks and the incompatibility between upper-level maneuver decisions with the low-level lower level trajectory planning. To solve this problem, this paper presents a synchronous maneuver searching and trajectory planning (SMSTP) algorithm based on the topological concept of homotopy. Firstly, a set of alternative maneuvers with boundary limits are enumerated on a multi-lane road. Instead of sampling numerous paths in the whole spatio-temporal space, we, for the first time, propose using Trajectory Profiles (TPs) to quickly construct the topological maneuvers represented by different routes, and put forward a corridor generation algorithm based on graph-search. The bounded corridor further constrains the maneuver's space in the spatial space. A step-wise heuristic optimization algorithm is then proposed to synchronously generate a feasible trajectory for each maneuver. To achieve real-time performance, we initialize the states to be optimized with the boundary constraints of maneuvers, and we set some heuristic states as terminal targets in the quadratic cost function. The solution of a feasible trajectory is always guaranteed only if a specific maneuver is given. The simulation and realistic driving-test experiments verified that the proposed SMSTP algorithm has a short computation time which is less than 37ms, and the experimental results showed the validity and effectiveness of the SMSTP algorithm.

* This work has been accepted by IEEE Intelligent Transportation System Magazine

Via

Access Paper or Ask Questions