Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lingyun Wang

M-LLM Based Video Frame Selection for Efficient Video Understanding

Feb 27, 2025

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin(+1 more)

Abstract:Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular Multi-Modal Large Language Model (M-LLM) frameworks usually apply naive uniform sampling to reduce the number of video frames that are fed into an M-LLM, particularly for long context videos. However, it could lose crucial context in certain periods of a video, so that the downstream M-LLM may not have sufficient visual information to answer a question. To attack this pain point, we propose a light-weight M-LLM -based frame selection method that adaptively select frames that are more relevant to users' queries. In order to train the proposed frame selector, we introduce two supervision signals (i) Spatial signal, where single frame importance score by prompting a M-LLM; (ii) Temporal signal, in which multiple frames selection by prompting Large Language Model (LLM) using the captions of all frame candidates. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performances various downstream video Large Language Model (video-LLM) across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.

Via

Access Paper or Ask Questions

SitPose: Real-Time Detection of Sitting Posture and Sedentary Behavior Using Ensemble Learning With Depth Sensor

Dec 16, 2024

Hang Jin, Xin He, Lingyun Wang, Yujun Zhu, Weiwei Jiang, Xiaobo Zhou

Figure 1 for SitPose: Real-Time Detection of Sitting Posture and Sedentary Behavior Using Ensemble Learning With Depth Sensor

Figure 2 for SitPose: Real-Time Detection of Sitting Posture and Sedentary Behavior Using Ensemble Learning With Depth Sensor

Figure 3 for SitPose: Real-Time Detection of Sitting Posture and Sedentary Behavior Using Ensemble Learning With Depth Sensor

Figure 4 for SitPose: Real-Time Detection of Sitting Posture and Sedentary Behavior Using Ensemble Learning With Depth Sensor

Abstract:Poor sitting posture can lead to various work-related musculoskeletal disorders (WMSDs). Office employees spend approximately 81.8% of their working time seated, and sedentary behavior can result in chronic diseases such as cervical spondylosis and cardiovascular diseases. To address these health concerns, we present SitPose, a sitting posture and sedentary detection system utilizing the latest Kinect depth camera. The system tracks 3D coordinates of bone joint points in real-time and calculates the angle values of related joints. We established a dataset containing six different sitting postures and one standing posture, totaling 33,409 data points, by recruiting 36 participants. We applied several state-of-the-art machine learning algorithms to the dataset and compared their performance in recognizing the sitting poses. Our results show that the ensemble learning model based on the soft voting mechanism achieves the highest F1 score of 98.1%. Finally, we deployed the SitPose system based on this ensemble model to encourage better sitting posture and to reduce sedentary habits.

Via

Access Paper or Ask Questions

Real-Time Fall Detection Using Smartphone Accelerometers and WiFi Channel State Information

Dec 13, 2024

Lingyun Wang, Deqi Su, Aohua Zhang, Yujun Zhu, Weiwei Jiang, Xin He, Panlong Yang

Abstract:In recent years, as the population ages, falls have increasingly posed a significant threat to the health of the elderly. We propose a real-time fall detection system that integrates the inertial measurement unit (IMU) of a smartphone with optimized Wi-Fi channel state information (CSI) for secondary validation. Initially, the IMU distinguishes falls from routine daily activities with minimal computational demand. Subsequently, the CSI is employed for further assessment, which includes evaluating the individual's post-fall mobility. This methodology not only achieves high accuracy but also reduces energy consumption in the smartphone platform. An Android application developed specifically for the purpose issues an emergency alert if the user experiences a fall and is unable to move. Experimental results indicate that the CSI model, based on convolutional neural networks (CNN), achieves a detection accuracy of 99%, \revised{surpassing comparable IMU-only models, and demonstrating significant resilience in distinguishing between falls and non-fall activities.

Via

Access Paper or Ask Questions

Freqformer: Frequency-Domain Transformer for 3-D Visualization and Quantification of Human Retinal Circulation

Nov 17, 2024

Lingyun Wang, Bingjie Wang, Jay Chhablani, Jose Alain Sahel, Shaohua Pi

Abstract:We introduce Freqformer, a novel Transformer-based architecture designed for 3-D, high-definition visualization of human retinal circulation from a single scan in commercial optical coherence tomography angiography (OCTA). Freqformer addresses the challenge of limited signal-to-noise ratio in OCTA volume by utilizing a complex-valued frequency-domain module (CFDM) and a simplified multi-head attention (Sim-MHA) mechanism. Using merged volumes as ground truth, Freqformer enables accurate reconstruction of retinal vasculature across the depth planes, allowing for 3-D quantification of capillary segments (count, density, and length). Our method outperforms state-of-the-art convolutional neural networks (CNNs) and several Transformer-based models, with superior performance in peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS). Furthermore, Freqformer demonstrates excellent generalizability across lower scanning density, effectively enhancing OCTA scans with larger fields of view (from 3$\times$3 $mm^{2}$ to 6$\times$6 $mm^{2}$ and 12$\times$12 $mm^{2}$). These results suggest that Freqformer can significantly improve the understanding and characterization of retinal circulation, offering potential clinical applications in diagnosing and managing retinal vascular diseases.

Via

Access Paper or Ask Questions

Multimodal Instruction Tuning with Hybrid State Space Models

Nov 13, 2024

Jianing Zhou, Han Li, Shuai Zhang, Ning Xie, Ruijie Wang, Xiaohan Nie, Sheng Liu, Lingyun Wang

Figure 1 for Multimodal Instruction Tuning with Hybrid State Space Models

Figure 2 for Multimodal Instruction Tuning with Hybrid State Space Models

Figure 3 for Multimodal Instruction Tuning with Hybrid State Space Models

Figure 4 for Multimodal Instruction Tuning with Hybrid State Space Models

Abstract:Handling lengthy context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs) in applications such as processing high-resolution images or high frame rate videos. The rise in image resolution and frame rate substantially increases computational demands due to the increased number of input tokens. This challenge is further exacerbated by the quadratic complexity with respect to sequence length of the self-attention mechanism. Most prior works either pre-train models with long contexts, overlooking the efficiency problem, or attempt to reduce the context length via downsampling (e.g., identify the key image patches or frames) to decrease the context length, which may result in information loss. To circumvent this issue while keeping the remarkable effectiveness of MLLMs, we propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications. Our multimodal model can effectively process long context input exceeding 100k tokens, outperforming existing models across various benchmarks. Remarkably, our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models, with efficiency gains increasing as image resolution or video frames rise. Furthermore, our model is the first to be trained on low-resolution images or low-frame-rate videos while being capable of inference on high-resolution images and high-frame-rate videos, offering flexibility for inference in diverse scenarios.

Via

Access Paper or Ask Questions

BreakNet: Discontinuity-Resilient Multi-Scale Transformer Segmentation of Retinal Layers

Aug 26, 2024

Razieh Ganjee, Bingjie Wang, Lingyun Wang, Chengcheng Zhao, José-Alain Sahel, Shaohua Pi

Abstract:Visible light optical coherence tomography (vis-OCT) is gaining traction for retinal imaging due to its high resolution and functional capabilities. However, the significant absorption of hemoglobin in the visible light range leads to pronounced shadow artifacts from retinal blood vessels, posing challenges for accurate layer segmentation. In this study, we present BreakNet, a multi-scale Transformer-based segmentation model designed to address boundary discontinuities caused by these shadow artifacts. BreakNet utilizes hierarchical Transformer and convolutional blocks to extract multi-scale global and local feature maps, capturing essential contextual, textural, and edge characteristics. The model incorporates decoder blocks that expand pathwaproys to enhance the extraction of fine details and semantic information, ensuring precise segmentation. Evaluated on rodent retinal images acquired with prototype vis-OCT, BreakNet demonstrated superior performance over state-of-the-art segmentation models, such as TCCT-BP and U-Net, even when faced with limited-quality ground truth data. Our findings indicate that BreakNet has the potential to significantly improve retinal quantification and analysis.

Via

Access Paper or Ask Questions

Gaussian Time Machine: A Real-Time Rendering Methodology for Time-Variant Appearances

May 22, 2024

Licheng Shen, Ho Ngai Chow, Lingyun Wang, Tong Zhang, Mengqiu Wang, Yuxing Han

Figure 1 for Gaussian Time Machine: A Real-Time Rendering Methodology for Time-Variant Appearances

Figure 2 for Gaussian Time Machine: A Real-Time Rendering Methodology for Time-Variant Appearances

Figure 3 for Gaussian Time Machine: A Real-Time Rendering Methodology for Time-Variant Appearances

Figure 4 for Gaussian Time Machine: A Real-Time Rendering Methodology for Time-Variant Appearances

Abstract:Recent advancements in neural rendering techniques have significantly enhanced the fidelity of 3D reconstruction. Notably, the emergence of 3D Gaussian Splatting (3DGS) has marked a significant milestone by adopting a discrete scene representation, facilitating efficient training and real-time rendering. Several studies have successfully extended the real-time rendering capability of 3DGS to dynamic scenes. However, a challenge arises when training images are captured under vastly differing weather and lighting conditions. This scenario poses a challenge for 3DGS and its variants in achieving accurate reconstructions. Although NeRF-based methods (NeRF-W, CLNeRF) have shown promise in handling such challenging conditions, their computational demands hinder real-time rendering capabilities. In this paper, we present Gaussian Time Machine (GTM) which models the time-dependent attributes of Gaussian primitives with discrete time embedding vectors decoded by a lightweight Multi-Layer-Perceptron(MLP). By adjusting the opacity of Gaussian primitives, we can reconstruct visibility changes of objects. We further propose a decomposed color model for improved geometric consistency. GTM achieved state-of-the-art rendering fidelity on 3 datasets and is 100 times faster than NeRF-based counterparts in rendering. Moreover, GTM successfully disentangles the appearance changes and renders smooth appearance interpolation.

* 14 pages, 6 figures

Via

Access Paper or Ask Questions

Fully Automated OCT-based Tissue Screening System

May 15, 2024

Shaohua Pi, Razieh Ganjee, Lingyun Wang, Riley K. Arbuckle, Chengcheng Zhao, Jose A Sahel, Bingjie Wang, Yuanyuan Chen

Abstract:This study introduces a groundbreaking optical coherence tomography (OCT) imaging system dedicated for high-throughput screening applications using ex vivo tissue culture. Leveraging OCT's non-invasive, high-resolution capabilities, the system is equipped with a custom-designed motorized platform and tissue detection ability for automated, successive imaging across samples. Transformer-based deep learning segmentation algorithms further ensure robust, consistent, and efficient readouts meeting the standards for screening assays. Validated using retinal explant cultures from a mouse model of retinal degeneration, the system provides robust, rapid, reliable, unbiased, and comprehensive readouts of tissue response to treatments. This fully automated OCT-based system marks a significant advancement in tissue screening, promising to transform drug discovery, as well as other relevant research fields.

Via

Access Paper or Ask Questions

Sub2Full: split spectrum to boost OCT despeckling without clean data

Jan 18, 2024

Lingyun Wang, Jose A Sahel, Shaohua Pi

Abstract:Optical coherence tomography (OCT) suffers from speckle noise, causing the deterioration of image quality, especially in high-resolution modalities like visible light OCT (vis-OCT). The potential of conventional supervised deep learning denoising methods is limited by the difficulty of obtaining clean data. Here, we proposed an innovative self-supervised strategy called Sub2Full (S2F) for OCT despeckling without clean data. This approach works by acquiring two repeated B-scans, splitting the spectrum of the first repeat as a low-resolution input, and utilizing the full spectrum of the second repeat as the high-resolution target. The proposed method was validated on vis-OCT retinal images visualizing sublaminar structures in outer retina and demonstrated superior performance over conventional Noise2Noise and Noise2Void schemes. The code is available at https://github.com/PittOCT/Sub2Full-OCT-Denoising.

Via

Access Paper or Ask Questions

AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

Oct 07, 2023

Rongzhao Zhang, Zhian Bai, Ruoying Yu, Wenrao Pang, Lingyun Wang, Lifeng Zhu, Xiaofan Zhang, Huan Zhang, Weiguo Hu

Figure 1 for AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

Figure 2 for AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

Figure 3 for AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

Figure 4 for AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

Abstract:When delineating lesions from medical images, a human expert can always keep in mind the anatomical structure behind the voxels. However, although high-quality (though not perfect) anatomical information can be retrieved from computed tomography (CT) scans with modern deep learning algorithms, it is still an open problem how these automatically generated organ masks can assist in addressing challenging lesion segmentation tasks, such as the segmentation of colorectal cancer (CRC). In this paper, we develop a novel Anatomy-Guided segmentation framework to exploit the auto-generated organ masks to aid CRC segmentation from CT, namely AG-CRC. First, we obtain multi-organ segmentation (MOS) masks with existing MOS models (e.g., TotalSegmentor) and further derive a more robust organ of interest (OOI) mask that may cover most of the colon-rectum and CRC voxels. Then, we propose an anatomy-guided training patch sampling strategy by optimizing a heuristic gain function that considers both the proximity of important regions (e.g., the tumor or organs of interest) and sample diversity. Third, we design a novel self-supervised learning scheme inspired by the topology of tubular organs like the colon to boost the model performance further. Finally, we employ a masked loss scheme to guide the model to focus solely on the essential learning region. We extensively evaluate the proposed method on two CRC segmentation datasets, where substantial performance improvement (5% to 9% in Dice) is achieved over current state-of-the-art medical image segmentation models, and the ablation studies further evidence the efficacy of every proposed component.

* Submitted to Medical Image Analysis

Via

Access Paper or Ask Questions