School of Electrical and Information Engineering, The University of Sydney, Australia
Abstract:Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.
Abstract:While most existing neural image compression (NIC) and neural video compression (NVC) methodologies have achieved remarkable success, their optimization is primarily focused on human visual perception. However, with the rapid development of artificial intelligence, many images and videos will be used for various machine vision tasks. Consequently, such existing compression methodologies cannot achieve competitive performance in machine vision. In this work, we introduce an efficient adaptive compression (EAC) method tailored for both human perception and multiple machine vision tasks. Our method involves two key modules: 1), an adaptive compression mechanism, that adaptively selects several subsets from latent features to balance the optimizations for multiple machine vision tasks (e.g., segmentation, and detection) and human vision. 2), a task-specific adapter, that uses the parameter-efficient delta-tuning strategy to stimulate the comprehensive downstream analytical networks for specific machine vision tasks. By using the above two modules, we can optimize the bit-rate costs and improve machine vision performance. In general, our proposed EAC can seamlessly integrate with existing NIC (i.e., Ball\'e2018, and Cheng2020) and NVC (i.e., DVC, and FVC) methods. Extensive evaluation on various benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS) shows that our method enhances performance for multiple machine vision tasks while maintaining the quality of human vision.
Abstract:The high computational cost and slow inference time are major obstacles to deploying the video diffusion model (VDM) in practical applications. To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of \textbf{motion dynamics} e.g., coherence of the entire video, while shallower layers are more focused on \textbf{individual content} e.g., individual frames. Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an \textbf{Individual Content and Motion Dynamics (ICMD)} Consistency Loss to gain comparable generation performance as larger VDM, i.e., the teacher to VDMini i.e., the student. Particularly, we first use the Individual Content Distillation (ICD) Loss to ensure consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5 $\times$ and 1.4 $\times$ speed up for the I2V method SF-V and the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, i.e., UCF101 and VBench.
Abstract:Fast progress in 3D Gaussian Splatting (3DGS) has made 3D Gaussians popular for 3D modeling and image rendering, but this creates big challenges in data storage and transmission. To obtain a highly compact 3DGS representation, we propose a hybrid entropy model for Gaussian Splatting (HEMGS) data compression, which comprises two primary components, a hyperprior network and an autoregressive network. To effectively reduce structural redundancy across attributes, we apply a progressive coding algorithm to generate hyperprior features, in which we use previously compressed attributes and location as prior information. In particular, to better extract the location features from these compressed attributes, we adopt a domain-aware and instance-aware architecture to respectively capture domain-aware structural relations without additional storage costs and reveal scene-specific features through MLPs. Additionally, to reduce redundancy within each attribute, we leverage relationships between neighboring compressed elements within the attributes through an autoregressive network. Given its unique structure, we propose an adaptive context coding algorithm with flexible receptive fields to effectively capture adjacent compressed elements. Overall, we integrate our HEMGS into an end-to-end optimized 3DGS compression framework and the extensive experimental results on four benchmarks indicate that our method achieves about 40\% average reduction in size while maintaining the rendering quality over our baseline method and achieving state-of-the-art compression results.
Abstract:Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM's prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
Abstract:Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: https://github.com/taohan10200/WEATHER-5K.
Abstract:Point cloud compression has garnered significant interest in computer vision. However, existing algorithms primarily cater to human vision, while most point cloud data is utilized for machine vision tasks. To address this, we propose a point cloud compression framework that simultaneously handles both human and machine vision tasks. Our framework learns a scalable bit-stream, using only subsets for different machine vision tasks to save bit-rate, while employing the entire bit-stream for human vision tasks. Building on mainstream octree-based frameworks like VoxelContext-Net, OctAttention, and G-PCC, we introduce a new octree depth-level predictor. This predictor adaptively determines the optimal depth level for each octree constructed from a point cloud, controlling the bit-rate for machine vision tasks. For simpler tasks (\textit{e.g.}, classification) or objects/scenarios, we use fewer depth levels with fewer bits, saving bit-rate. Conversely, for more complex tasks (\textit{e.g}., segmentation) or objects/scenarios, we use deeper depth levels with more bits to enhance performance. Experimental results on various datasets (\textit{e.g}., ModelNet10, ModelNet40, ShapeNet, ScanNet, and KITTI) show that our point cloud compression approach improves performance for machine vision tasks without compromising human vision quality.
Abstract:Point cloud analysis has seen substantial advancements due to deep learning, although previous Transformer-based methods excel at modeling long-range dependencies on this task, their computational demands are substantial. Conversely, the Mamba offers greater efficiency but shows limited potential compared with Transformer-based methods. In this study, we introduce PoinTramba, a pioneering hybrid framework that synergies the analytical power of Transformer with the remarkable computational efficiency of Mamba for enhanced point cloud analysis. Specifically, our approach first segments point clouds into groups, where the Transformer meticulously captures intricate intra-group dependencies and produces group embeddings, whose inter-group relationships will be simultaneously and adeptly captured by efficient Mamba architecture, ensuring comprehensive analysis. Unlike previous Mamba approaches, we introduce a bi-directional importance-aware ordering (BIO) strategy to tackle the challenges of random ordering effects. This innovative strategy intelligently reorders group embeddings based on their calculated importance scores, significantly enhancing Mamba's performance and optimizing the overall analytical process. Our framework achieves a superior balance between computational efficiency and analytical performance by seamlessly integrating these advanced techniques, marking a substantial leap forward in point cloud analysis. Extensive experiments on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our approach, establishing a new state-of-the-art analysis benchmark on point cloud recognition. For the first time, this paradigm leverages the combined strengths of both Transformer and Mamba architectures, facilitating a new standard in the field. The code is available at https://github.com/xiaoyao3302/PoinTramba.
Abstract:The advent of data-driven weather forecasting models, which learn from hundreds of terabytes (TB) of reanalysis data, has significantly advanced forecasting capabilities. However, the substantial costs associated with data storage and transmission present a major challenge for data providers and users, affecting resource-constrained researchers and limiting their accessibility to participate in AI-based meteorological research. To mitigate this issue, we introduce an efficient neural codec, the Variational Autoencoder Transformer (VAEformer), for extreme compression of climate data to significantly reduce data storage cost, making AI-based meteorological research portable to researchers. Our approach diverges from recent complex neural codecs by utilizing a low-complexity Auto-Encoder transformer. This encoder produces a quantized latent representation through variance inference, which reparameterizes the latent space as a Gaussian distribution. This method improves the estimation of distributions for cross-entropy coding. Extensive experiments demonstrate that our VAEformer outperforms existing state-of-the-art compression methods in the context of climate data. By applying our VAEformer, we compressed the most popular ERA5 climate dataset (226 TB) into a new dataset, CRA5 (0.7 TB). This translates to a compression ratio of over 300 while retaining the dataset's utility for accurate scientific analysis. Further, downstream experiments show that global weather forecasting models trained on the compact CRA5 dataset achieve forecasting accuracy comparable to the model trained on the original dataset. Code, the CRA5 dataset, and the pre-trained model are available at https://github.com/taohan10200/CRA5.
Abstract:Content-adaptive compression is crucial for enhancing the adaptability of the pre-trained neural codec for various contents. Although these methods have been very practical in neural image compression (NIC), their application in neural video compression (NVC) is still limited due to two main aspects: 1), video compression relies heavily on temporal redundancy, therefore updating just one or a few frames can lead to significant errors accumulating over time; 2), NVC frameworks are generally more complex, with many large components that are not easy to update quickly during encoding. To address the previously mentioned challenges, we have developed a content-adaptive NVC technique called Group-aware Parameter-Efficient Updating (GPU). Initially, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters. This involves adopting a patch-based Group of Pictures (GoP) training strategy to segment a video into patch-based GoPs, which will be updated to facilitate a globally optimized domain-transferable solution. Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several light-weight adapters into each coding component of the encoding process by both serial and parallel configuration. Such architecture-agnostic modules stimulate the components with large parameters, thereby reducing both the update cost and the encoding time. We incorporate our GPU into the latest NVC framework and conduct comprehensive experiments, whose results showcase outstanding video compression efficiency across four video benchmarks and adaptability of one medical image benchmark.