Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yanan Guo

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

Apr 29, 2025

Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng

Abstract:Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules.To address these issues, we propose FiLA(Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

Dec 11, 2024

Shiding Zhu, Wenhui Dong, Jun Song, Yanan Guo, Bo Zheng

Abstract:Recently, there has been growing interest in the capability of multimodal large language models (MLLMs) to process high-resolution images. A common approach currently involves dynamically cropping the original high-resolution image into smaller sub-images, which are then fed into a vision encoder that was pre-trained on lower-resolution images. However, this cropping approach often truncates objects and connected areas in the original image, causing semantic breaks. To address this limitation, we introduce HyViLM, designed to process images of any resolution while retaining the overall context during encoding. Specifically, we: (i) Design a new visual encoder called Hybrid Encoder that not only encodes individual sub-images but also interacts with detailed global visual features, significantly improving the model's ability to encode high-resolution images. (ii) Propose an optimal feature fusion strategy for the dynamic cropping approach, effectively leveraging information from different layers of the vision encoder. Compared with the state-of-the-art MLLMs under the same setting, our HyViLM outperforms existing MLLMs in nine out of ten tasks. Specifically, HyViLM achieves a 9.6% improvement in performance on the TextVQA task and a 6.9% enhancement on the DocVQA task.

* 11 pages, 4 figures

Via

Access Paper or Ask Questions

Hybrid bundle-adjusting 3D Gaussians for view consistent rendering with pose optimization

Oct 17, 2024

Yanan Guo, Ying Xie, Ying Chang, Benkui Zhang, Bo Jia, Lin Cao

Abstract:Novel view synthesis has made significant progress in the field of 3D computer vision. However, the rendering of view-consistent novel views from imperfect camera poses remains challenging. In this paper, we introduce a hybrid bundle-adjusting 3D Gaussians model that enables view-consistent rendering with pose optimization. This model jointly extract image-based and neural 3D representations to simultaneously generate view-consistent images and camera poses within forward-facing scenes. The effective of our model is demonstrated through extensive experiments conducted on both real and synthetic datasets. These experiments clearly illustrate that our model can effectively optimize neural scene representations while simultaneously resolving significant camera pose misalignments. The source code is available at https://github.com/Bistu3DV/hybridBA.

* Photonics Asia 2024

Via

Access Paper or Ask Questions

Snore-GANs: Improving Automatic Snore Sound Classification with Synthesized Data

Mar 29, 2019

Zixing Zhang, Jing Han, Kun Qian, Christoph Janott, Yanan Guo, Bjoern Schuller

Figure 1 for Snore-GANs: Improving Automatic Snore Sound Classification with Synthesized Data

Figure 2 for Snore-GANs: Improving Automatic Snore Sound Classification with Synthesized Data

Figure 3 for Snore-GANs: Improving Automatic Snore Sound Classification with Synthesized Data

Figure 4 for Snore-GANs: Improving Automatic Snore Sound Classification with Synthesized Data

Abstract:One of the frontier issues that severely hamper the development of automatic snore sound classification (ASSC) associates to the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatically learn a mapping strategy from a random noise space to original data distribution. The proposed approach has the capability of well synthesizing 'realistic' high-dimensional data, while requiring no additional annotation process. To handle the mode collapse problem of GANs, we further introduce an ensemble strategy to enhance the diversity of the generated data. The systematic experiments conducted on a widely used Munich-Passau snore sound corpus demonstrate that the scGANs-based systems can remarkably outperform other classic data augmentation systems, and are also competitive to other recently reported systems for ASSC.

* accepted by IEEE JBHI

Via

Access Paper or Ask Questions

Multiview Cauchy Estimator Feature Embedding for Depth and Inertial Sensor-Based Human Action Recognition

Sep 04, 2016

Yanan Guo, Lei Li, Weifeng Liu, Jun Cheng, Dapeng Tao

Figure 1 for Multiview Cauchy Estimator Feature Embedding for Depth and Inertial Sensor-Based Human Action Recognition

Figure 2 for Multiview Cauchy Estimator Feature Embedding for Depth and Inertial Sensor-Based Human Action Recognition

Figure 3 for Multiview Cauchy Estimator Feature Embedding for Depth and Inertial Sensor-Based Human Action Recognition

Figure 4 for Multiview Cauchy Estimator Feature Embedding for Depth and Inertial Sensor-Based Human Action Recognition

Abstract:The ever-growing popularity of Kinect and inertial sensors has prompted intensive research efforts on human action recognition. Since human actions can be characterized by multiple feature representations extracted from Kinect and inertial sensors, multiview features must be encoded into a unified space optimal for human action recognition. In this paper, we propose a new unsupervised feature fusion method termed Multiview Cauchy Estimator Feature Embedding (MCEFE) for human action recognition. By minimizing empirical risk, MCEFE integrates the encoded complementary information in multiple views to find the unified data representation and the projection matrices. To enhance robustness to outliers, the Cauchy estimator is imposed on the reconstruction error. Furthermore, ensemble manifold regularization is enforced on the projection matrices to encode the correlations between different views and avoid overfitting. Experiments are conducted on the new Chinese Academy of Sciences - Yunnan University - Multimodal Human Action Database (CAS-YNU-MHAD) to demonstrate the effectiveness and robustness of MCEFE for human action recognition.

* This paper has been withdrawn by the author due to a crucial error

Via

Access Paper or Ask Questions