Yunlong Zhao

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

Apr 09, 2025

Ziyi Wang, Haoran Wu, Yiming Rong, Deyang Jiang, Yixin Zhang, Yunlong Zhao, Shuang Xu, Bo XU

Abstract: Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.

GridShow: Omni Visual Generation

Dec 17, 2024

Cong Wan, Xiangyang Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Yuhang He, Yihong Gong

Abstract: In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID transforms temporal sequences into grid layouts, enabling image generation models to process visual sequences holistically. To achieve both layout consistency and motion coherence, we develop a parallel flow-matching training strategy that combines layout matching and temporal losses, guided by a coarse-to-fine schedule that evolves from basic layouts to precise motion control. Our approach demonstrates remarkable efficiency, achieving up to 35× faster inference while using 1/1000 of the computational resources of specialized models. Extensive experiments show that GRID exhibits exceptional versatility across diverse visual generation tasks, from Text-to-Video to 3D Editing, while maintaining its foundational image generation capabilities. This dual strength in both expanded applications and preserved core competencies establishes GRID as an efficient and versatile omni-solution for visual generation.

* Code: https://github.com/Should-AI-Lab/GRID
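
The abstract above describes turning a temporal sequence into a grid layout, like a film strip. As a rough illustration only, the sketch below tiles equally sized frames into one row-major grid image using plain Python lists. The function name `frames_to_grid`, the zero padding for unused cells, and the toy 2x2 frames are assumptions made for this example; they are not taken from the GRID code base.

```python
def frames_to_grid(frames, cols):
    """Tile equally sized frames into one grid image, in row-major (reading) order.

    Each frame is a list of pixel rows. Unused cells in the last grid row are
    filled with zeros (an assumption of this sketch).
    """
    if not frames:
        raise ValueError("need at least one frame")
    if cols < 1:
        raise ValueError("cols must be positive")

    height = len(frames[0])
    width = len(frames[0][0])
    rows = -(-len(frames) // cols)  # ceiling division

    # Pad the sequence so that every grid cell holds a frame.
    blank = [[0] * width for _ in range(height)]
    padded = list(frames) + [blank] * (rows * cols - len(frames))

    grid = []
    for r in range(rows):
        row_frames = padded[r * cols:(r + 1) * cols]
        for y in range(height):
            # Concatenate scanline y of every frame in this grid row.
            line = []
            for frame in row_frames:
                line.extend(frame[y])
            grid.append(line)
    return grid


# Four toy 2x2 frames, each filled with its own time index.
frames = [[[t, t], [t, t]] for t in range(4)]
grid = frames_to_grid(frames, cols=2)
for line in grid:
    print(line)
```

With this layout, time runs left to right and then top to bottom, so a model that processes the single grid image sees all frames at once.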

MetaDD: Boosting Dataset Distillation with Neural Network Architecture-Invariant Generalization

Oct 07, 2024

Yunlong Zhao, Xiaoheng Deng, Xiu Su, Hongyan Xu, Xiuxing Li, Yijing Liu, Shan You

Abstract: Dataset distillation (DD) entails creating a refined, compact distilled dataset from a large-scale dataset to facilitate efficient training. A significant challenge in DD is the dependency between the distilled dataset and the neural network (NN) architecture used. A dataset distilled with one specific architecture often yields diminished training performance when it is used to train a different architecture. This paper introduces MetaDD, designed to enhance the generalizability of DD across various NN architectures. Specifically, MetaDD partitions distilled data into meta features (i.e., the data's common characteristics that remain consistent across different NN architectures) and heterogeneous features (i.e., the features of the data that are unique to each NN architecture). Then, MetaDD employs an architecture-invariant loss function for multi-architecture feature alignment, which increases meta features and reduces heterogeneous features in distilled data. As a low-memory consumption component, MetaDD can be seamlessly integrated into any DD methodology. Experimental results demonstrate that MetaDD significantly improves performance across various DD methods. On the Distilled Tiny-ImageNet with SRe2L (50 IPC), MetaDD achieves cross-architecture NN accuracy of up to 30.1%, surpassing the second-best method (GLaD) by 1.7%.

Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

May 13, 2024

Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Wengong Jin, Yang Yang, Lei Li

Abstract: Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and its three-dimensional (3D) coordinates based on functionally important sites and substrates corresponding to a desired catalytic function. These sites are automatically mined from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlation in an entire protein sequence and local influence from nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective, including a sequence generation loss, a position prediction loss and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset with 3157 enzyme families, covering all available enzymes within the Protein Data Bank (PDB). Experimental results show that our EnzyGen consistently achieves the best performance across all 323 testing families, surpassing the best baseline by 10.79% in terms of substrate binding affinity. These findings demonstrate EnzyGen's superior capability in designing well-folded and effective enzymes binding to specific substrates with high affinities.

Generating Games via LLMs: An Investigation with Video Game Description Language

Apr 11, 2024

Chengpeng Hu, Yunlong Zhao, Jialin Liu

Abstract: Recently, the emergence of large language models (LLMs) has unlocked new opportunities for procedural content generation. However, recent attempts mainly focus on level generation for specific games with defined game rules such as Super Mario Bros. and Zelda. This paper investigates game generation via LLMs. Based on the Video Game Description Language, this paper proposes an LLM-based framework to generate game rules and levels simultaneously. Experiments demonstrate how the framework works with prompts considering different combinations of context. Our findings extend the current applications of LLMs and offer new insights for generating new games in the area of procedural content generation.

Functional Geometry Guided Protein Sequence and Backbone Structure Co-Design

Oct 09, 2023

Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Yang Yang, Lei Li

Abstract: Proteins are macromolecules responsible for essential functions in almost all living organisms. Designing reasonable proteins with desired functions is crucial. A protein's sequence and structure are strongly correlated and they together determine its function. In this paper, we propose NAEPro, a model to jointly design protein sequence and structure based on automatically detected functional sites. NAEPro is powered by an interleaving network of attention and equivariant layers, which can capture global correlation in a whole sequence and local influence from nearest amino acids in three-dimensional (3D) space. Such an architecture facilitates effective yet economical message passing at two levels. We evaluate our model and several strong baselines on two protein datasets, β-lactamase and myoglobin. Experimental results show that our model consistently achieves the highest amino acid recovery rate, TM-score, and the lowest RMSD among all competitors. These findings prove the capability of our model to design protein sequences and structures that closely resemble their natural counterparts. Furthermore, in-depth analysis confirms our model's ability to generate highly effective proteins capable of binding to their target metallocofactors. We provide code, data and models on GitHub.

Joint Design of Protein Sequence and Structure based on Motifs

Oct 04, 2023

Zhenqiao Song, Yunlong Zhao, Yufei Song, Wenxian Shi, Yang Yang, Lei Li

Abstract: Designing novel proteins with desired functions is crucial in biology and chemistry. However, most existing work focuses on protein sequence design, leaving protein sequence and structure co-design underexplored. In this paper, we propose GeoPro, a method to design protein backbone structure and sequence jointly. Our motivation is that protein sequence and its backbone structure constrain each other, and thus joint design of both can not only avoid nonfolding and misfolding but also produce more diverse candidates with desired functions. To this end, GeoPro is powered by an equivariant encoder for three-dimensional (3D) backbone structure and a protein sequence decoder guided by 3D geometry. Experimental results on two biologically significant metalloprotein datasets, including β-lactamases and myoglobins, show that our proposed GeoPro outperforms several strong baselines on most metrics. Remarkably, our method discovers novel β-lactamases and myoglobins which are not present in the Protein Data Bank (PDB) and UniProt. These proteins exhibit stable folding and active site environments reminiscent of those of natural proteins, demonstrating their excellent potential to be biologically functional.

MOSPC: MOS Prediction Based on Pairwise Comparison

Jun 18, 2023

Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang

Abstract: As a subjective metric to evaluate the quality of synthesized speech, mean opinion score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. A MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech when the MOS scores are close. However, in practical applications, it is more important to correctly rank the quality of synthesis systems or sentences than simply predicting MOS scores. Meanwhile, as each annotator scores multiple audio samples during annotation, the score is probably a relative value based on the first or the first few speech scores given by the annotator. Motivated by the above two points, we propose a general framework for MOS prediction based on pair comparison (MOSPC), and we utilize the C-Mixup algorithm to enhance the generalization performance of MOSPC. The experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most of the correlation coefficient metrics, especially on the metric KTAU related to quality ranking. Our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework contributes to improving the ranking accuracy of speech quality.
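
The abstract singles out KTAU, a rank-correlation metric, as the measure most tied to ranking quality. To make concrete why a pairwise view matters when scores are close, the sketch below computes Kendall's tau by counting concordant and discordant pairs. It uses the simple tau-a variant, which does not correct for ties, and this may differ from the exact variant reported in the paper. The function name and the MOS values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau_a(true_scores, predicted_scores):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs.

    Tied pairs count as neither concordant nor discordant (no tie correction).
    """
    n = len(true_scores)
    if n != len(predicted_scores):
        raise ValueError("score lists must have the same length")
    if n < 2:
        raise ValueError("need at least two items to form a pair")

    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether both lists order the pair the same way.
        agreement = (true_scores[i] - true_scores[j]) * (
            predicted_scores[i] - predicted_scores[j]
        )
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical MOS values for four utterances; the first two are close in quality.
true_mos = [3.1, 3.2, 4.0, 2.5]
predicted_mos = [3.3, 3.0, 4.1, 2.4]
tau = kendall_tau_a(true_mos, predicted_mos)
print(f"Kendall tau-a: {tau:.3f}")
```

In this toy data the predictions are numerically close to the true scores, yet the one swapped pair (the two utterances with nearly equal MOS) lowers tau, which is the kind of ranking error the abstract describes.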

PolyVoice: Language Models for Speech to Speech Translation

Jun 13, 2023

Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng (+8 more)

Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese → English and English → Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.

Game-based Platforms for Artificial Intelligence Research

Apr 26, 2023

Chengpeng Hu, Yunlong Zhao, Ziqi Wang, Haocheng Du, Jialin Liu

Abstract: Games have been the perfect test-beds for artificial intelligence research because they exhibit characteristics that widely exist in real-world scenarios. Learning and optimisation, decision making in dynamic and uncertain environments, game theory, planning and scheduling, design and education are common research areas shared between games and real-world problems. Numerous open-sourced games or game-based environments have been implemented for studying artificial intelligence. In addition to single- or multi-player, collaborative or adversarial games, there has also been growing interest in implementing platforms for creative design in recent years. Those platforms provide ideal benchmarks for exploring and comparing artificial intelligence ideas and techniques. This paper reviews the game-based platforms for artificial intelligence research, discusses the research trend induced by the evolution of those platforms, and gives an outlook.
