Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yongxin Guo

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

May 23, 2025

Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, Qiuhong Ke

Abstract:Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.

Via

Access Paper or Ask Questions

Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation

May 09, 2025

Kunpeng Qiu, Zhiqiang Gao, Zhiying Zhou, Mingjie Sun, Yongxin Guo

Abstract:Deep learning has revolutionized medical image segmentation, yet its full potential remains constrained by the paucity of annotated datasets. While diffusion models have emerged as a promising approach for generating synthetic image-mask pairs to augment these datasets, they paradoxically suffer from the same data scarcity challenges they aim to mitigate. Traditional mask-only models frequently yield low-fidelity images due to their inability to adequately capture morphological intricacies, which can critically compromise the robustness and reliability of segmentation models. To alleviate this limitation, we introduce Siamese-Diffusion, a novel dual-component model comprising Mask-Diffusion and Image-Diffusion. During training, a Noise Consistency Loss is introduced between these components to enhance the morphological fidelity of Mask-Diffusion in the parameter space. During sampling, only Mask-Diffusion is used, ensuring diversity and scalability. Comprehensive experiments demonstrate the superiority of our method. Siamese-Diffusion boosts SANet's mDice and mIoU by 3.6% and 4.4% on the Polyps, while UNet improves by 1.52% and 1.64% on the ISIC2018. Code is available at GitHub.

* Accepted by CVPR2025

Via

Access Paper or Ask Questions

FedMABA: Towards Fair Federated Learning through Multi-Armed Bandits Allocation

Oct 26, 2024

Zhichao Wang, Lin Wang, Yongxin Guo, Ying-Jun Angela Zhang, Xiaoying Tang

Abstract:The increasing concern for data privacy has driven the rapid development of federated learning (FL), a privacy-preserving collaborative paradigm. However, the statistical heterogeneity among clients in FL results in inconsistent performance of the server model across various clients. Server model may show favoritism towards certain clients while performing poorly for others, heightening the challenge of fairness. In this paper, we reconsider the inconsistency in client performance distribution and introduce the concept of adversarial multi-armed bandit to optimize the proposed objective with explicit constraints on performance disparities. Practically, we propose a novel multi-armed bandit-based allocation FL algorithm (FedMABA) to mitigate performance unfairness among diverse clients with different data distributions. Extensive experiments, in different Non-I.I.D. scenarios, demonstrate the exceptional performance of FedMABA in enhancing fairness.

Via

Access Paper or Ask Questions

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Oct 08, 2024

Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, Xi Chen

Figure 1 for TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Figure 2 for TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Figure 3 for TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Figure 4 for TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Abstract:Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To effectively handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend in employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos, which restricts their effectiveness in tackling VTG tasks. To address this issue, this paper first formally introduces causal event modeling framework, which represents videos as sequences of events, and predict the current event using previous events, video inputs, and textural instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. The TRACE processes visual frames, timestamps, salient scores, and text as distinct tasks, employing various encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are available at \url{https://github.com/gyxxyg/TRACE}.

Via

Access Paper or Ask Questions

Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Sep 10, 2024

Dingxin Cheng, Mingda Li, Jingyu Liu, Yongxin Guo, Bin Jiang, Qingbin Liu, Xi Chen, Bo Zhao

Figure 1 for Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Figure 2 for Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Figure 3 for Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Figure 4 for Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Abstract:Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most of the existing models compress diverse semantic information within the whole video and feed it into LLMs for content comprehension. While this method excels in short video understanding, it may result in a blend of multiple event information in long videos due to coarse compression, which causes information redundancy. Consequently, the semantics of key events might be obscured within the vast information that hinders the model's understanding capabilities. To address this issue, we propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos. Firstly, we design a novel adaptive sequence segmentation scheme to divide multiple events within long videos. In this way, we can perform individual memory modeling for each event to establish intra-event contextual connections, thereby reducing information redundancy. Secondly, while modeling current event, we compress and inject the information of the previous event to enhance the long-term inter-event dependencies in videos. Finally, we perform extensive experiments on various video understanding tasks and the results show that our model achieves state-of-the-art performances.

Via

Access Paper or Ask Questions

Innovative RIS Prototyping Enhancing Wireless Communication with Real-Time Spot Beam Tracking and OAM Wavefront Manipulation

Jul 28, 2024

Yufei Zhao, Yuan Feng, Afkar Mohamed Ismail, Ziyue Wang, Yong Liang Guan, Yongxin Guo, Chau Yuen

Abstract:This paper presents a sophisticated reconfigurable metasurface architecture that introduces an advanced concept of flexible full-array space-time wavefront manipulation with enhanced dynamic capabilities. The practical 2-bit phase-shifting unit cell on the RIS is distinguished by its ability to maintain four stable phase states, each with ${90^ \circ }$ differences, and features an insertion loss of less than 0.6 dB across a bandwidth of 200 MHz. All reconfigurable units are equipped with meticulously designed control circuits, governed by an intelligent core composed of multiple Micro-Controller Units (MCUs), enabling rapid control response across the entire RIS array. Owing to the capability of each unit cell on the metasurface to independently switch states, the entire RIS is not limited to controlling general beams with specific directional patterns, but also generates beams with more complex structures, including multi-focus 3D spot beams and vortex beams. This development substantially broadens its applicability across various industrial wireless transmission scenarios. Moreover, by leveraging the rapid-respond space-time coding and the full-array independent programmability of the RIS prototyping operating at 10.7 GHz, we have demonstrated that: 1) The implementation of 3D spot beams scanning facilitates dynamic beam tracking and real-time communication under the indoor near-field scenario; 2) The rapid wavefront rotation of vortex beams enables precise modulation of signals within the Doppler domain, showcasing an innovative approach to wireless signal manipulation; 3) The beam steering experiments for blocking users under outdoor far-field scenarios, verifying the beamforming capability of the RIS board.

Via

Access Paper or Ask Questions

Smart Sampling: Helping from Friendly Neighbors for Decentralized Federated Learning

Jul 05, 2024

Lin Wang, Yang Chen, Yongxin Guo, Xiaoying Tang

Figure 1 for Smart Sampling: Helping from Friendly Neighbors for Decentralized Federated Learning

Figure 2 for Smart Sampling: Helping from Friendly Neighbors for Decentralized Federated Learning

Figure 3 for Smart Sampling: Helping from Friendly Neighbors for Decentralized Federated Learning

Figure 4 for Smart Sampling: Helping from Friendly Neighbors for Decentralized Federated Learning

Abstract:Federated Learning (FL) is gaining widespread interest for its ability to share knowledge while preserving privacy and reducing communication costs. Unlike Centralized FL, Decentralized FL (DFL) employs a network architecture that eliminates the need for a central server, allowing direct communication among clients and leading to significant communication resource savings. However, due to data heterogeneity, not all neighboring nodes contribute to enhancing the local client's model performance. In this work, we introduce \textbf{\emph{AFIND+}}, a simple yet efficient algorithm for sampling and aggregating neighbors in DFL, with the aim of leveraging collaboration to improve clients' model performance. AFIND+ identifies helpful neighbors, adaptively adjusts the number of selected neighbors, and strategically aggregates the sampled neighbors' models based on their contributions. Numerical results on real-world datasets with diverse data partitions demonstrate that AFIND+ outperforms other sampling algorithms in DFL and is compatible with most existing DFL optimization algorithms.

Via

Access Paper or Ask Questions

Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

May 25, 2024

Yongxin Guo, Lin Wang, Xiaoying Tang, Tao Lin

Abstract:Federated Learning (FL) is a privacy-preserving distributed machine learning paradigm. Nonetheless, the substantial distribution shifts among clients pose a considerable challenge to the performance of current FL algorithms. To mitigate this challenge, various methods have been proposed to enhance the FL training process. This paper endeavors to tackle the issue of data heterogeneity from another perspective -- by improving FL algorithms prior to the actual training stage. Specifically, we introduce the Client2Vec mechanism, which generates a unique client index for each client before the commencement of FL training. Subsequently, we leverage the generated client index to enhance the subsequent FL training process. To demonstrate the effectiveness of the proposed Client2Vec method, we conduct three case studies that assess the impact of the client index on the FL training process. These case studies encompass enhanced client sampling, model aggregation, and local training. Extensive experiments conducted on diverse datasets and model architectures show the efficacy of Client2Vec across all three case studies. Our code is avaliable at \url{https://github.com/LINs-lab/client2vec}.

Via

Access Paper or Ask Questions

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

May 23, 2024

Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin

Abstract:The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k), resulting in significant computational overhead due to the extensive model training by searching over various hyper-parameter configurations. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate. (2) An adaptive process automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate the effectiveness of our approach to achieve competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.

* 9 pages, 21 figures

Via

Access Paper or Ask Questions

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

May 22, 2024

Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Xi Chen, Bo Zhao

Figure 1 for VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Figure 2 for VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Figure 3 for VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Figure 4 for VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Abstract:Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query, playing a vital role in downstream tasks such as video browsing and editing. While Video Large Language Models (video LLMs) have made significant progress in understanding video content, they often face challenges in accurately pinpointing timestamps within videos, which limits their performance on VTG tasks. Therefore, to improve video LLMs' ability to effectively locate timestamps, we argue that two critical aspects need to be enhanced. First, it is essential to have high-quality instructional tuning datasets that encompass mainstream VTG tasks. Second, directly incorporating timestamp knowledge into video LLMs is crucial, as it enables models to efficiently comprehend timestamp information. To address these needs, we first introduce VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval, dense video captioning, video summarization, and video highlight detection. Furthermore, we propose a specially designed video LLM model for VTG tasks, VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames. Comprehensive experiments showcase the superior performance of VTG-LLM in comparison to other video LLM methods across various VTG tasks. Our code and datasets are available at \url{https://github.com/gyxxyg/VTG-LLM}.

Via

Access Paper or Ask Questions