Shuyi Zhang

Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

Jun 12, 2025

Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, Shanghang Zhang

Abstract: Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: https://video-cot.github.io/.

Attention Beats Linear for Fast Implicit Neural Representation Generation

Jul 22, 2024

Shuyi Zhang, Ke Liu, Jingjun Gu, Xiaoxu Cai, Zhihua Wang, Jiajun Bu, Haishuai Wang

Abstract: Implicit Neural Representation (INR) has gained increasing popularity as a data representation method, serving as a prerequisite for innovative generation models. Unlike gradient-based methods, which exhibit lower efficiency in inference, the adoption of a hyper-network for generating parameters in Multi-Layer Perceptrons (MLP), responsible for executing INR functions, has surfaced as a promising and efficient alternative. However, as a global continuous function, an MLP struggles to model highly discontinuous signals, resulting in slow convergence during the training phase and inaccurate reconstruction performance. Moreover, an MLP requires massive representation parameters, which implies inefficiencies in data representation. In this paper, we propose a novel Attention-based Localized INR (ANR) composed of a localized attention layer (LAL) and a global MLP that integrates coordinate features with data features and converts them to meaningful outputs. Subsequently, we design an instance representation framework that delivers a transformer-like hyper-network to represent data instances as a compact representation vector. With an instance-specific representation vector and instance-agnostic ANR parameters, the target signals are well reconstructed as a continuous function. We further address aliasing artifacts with variational coordinates when obtaining the super-resolution inference results. Extensive experimentation across four datasets showcases the notable efficacy of our ANR method, e.g. enhancing the PSNR value from 37.95dB to 47.25dB on the CelebA dataset. Code is released at https://github.com/Roninton/ANR.

* Accepted by ECCV 2024
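
The ANR abstract reports reconstruction quality as PSNR in dB. As a rough illustration of what a gain from 37.95 dB to 47.25 dB means, the sketch below uses the standard PSNR definition, 10 * log10(peak^2 / MSE). The function names `psnr` and `mse_ratio_from_gain` and the assumption that pixel values lie in [0, 1] are illustrative choices, not taken from the paper or its code release.

```python
import math


def psnr(reference, reconstruction, peak=1.0):
    """Return PSNR in dB between two equal-length sequences of pixel values.

    `peak` is the maximum possible pixel value (assumed 1.0 here).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    # Mean squared error between the two signals.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_from_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (gain_db / 10.0)


# PSNR gain reported in the abstract for CelebA.
gain = 47.25 - 37.95
ratio = mse_ratio_from_gain(gain)  # roughly 8.5
```

Under these assumptions, the reported 9.30 dB gain corresponds to a mean squared reconstruction error that is roughly 8.5 times lower.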

Bilateral Unsymmetrical Graph Contrastive Learning for Recommendation

Mar 22, 2024

Jiaheng Yu, Jing Li, Yue He, Kai Zhu, Shuyi Zhang, Wen Hu

Abstract: Recent methods apply graph contrastive learning to graph-structured user-item interaction data for collaborative filtering and have demonstrated their efficacy in recommendation tasks. However, they ignore that the differing relation density of nodes on the user and item sides causes the adaptability of graphs on bilateral nodes to differ after multi-hop graph interaction calculation, which prevents existing models from achieving ideal results. To solve this issue, we propose a novel framework for recommendation tasks called Bilateral Unsymmetrical Graph Contrastive Learning (BusGCL), which considers the bilateral unsymmetry in user-item node relation density so that sliced user and item graphs are reasoned over more effectively through bilateral slicing contrastive training. In particular, considering that the aggregation ability of a hypergraph-based graph convolutional network (GCN) in uncovering implicit similarities is better suited to user nodes, embeddings generated from three different modules (hypergraph-based GCN, GCN, and perturbed GCN) are sliced into two subviews by the user and item sides respectively, and selectively combined into subview pairs bilaterally based on the characteristics of the inter-node relation structure. Furthermore, to align the distribution of user and item embeddings after aggregation, a dispersing loss is used to adjust the mutual distance between all embeddings to maintain learning ability. Comprehensive experiments on two public datasets demonstrate the superiority of BusGCL over various recommendation methods. Other models can simply adopt our bilateral slicing contrastive learning to enhance recommendation performance without incurring extra expense.

YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception

Aug 24, 2022

Cheng Han, Qichao Zhao, Shuyi Zhang, Yinzi Chen, Zhenlin Zhang, Jinwei Yuan

Abstract: Over the last decade, multi-task learning approaches have achieved promising results in solving panoptic driving perception problems, providing both high-precision and high-efficiency performance. Multi-task learning has become a popular paradigm when designing networks for real-time practical autonomous driving systems, where computation resources are limited. This paper proposes an effective and efficient multi-task learning network that simultaneously performs traffic object detection, drivable road area segmentation, and lane detection. Our model achieves new state-of-the-art (SOTA) performance in terms of accuracy and speed on the challenging BDD100K dataset. In particular, the inference time is reduced by half compared to the previous SOTA model. Code will be released in the near future.

Snow Mask Guided Adaptive Residual Network for Image Snow Removal

Jul 11, 2022

Bodong Cheng, Juncheng Li, Ying Chen, Shuyi Zhang, Tieyong Zeng

Abstract: Image restoration under severe weather is a challenging task. Most past works focused on removing rain and haze from images. However, snow is also an extremely common atmospheric phenomenon that seriously affects the performance of high-level computer vision tasks, such as object detection and semantic segmentation. Recently, some methods have been proposed for snow removal, and most of them treat snow images directly as the optimization object. However, the distribution of snow location and shape is complex. Therefore, failure to detect snowflakes and snow streaks effectively will affect snow removal and limit model performance. To solve these issues, we propose a Snow Mask Guided Adaptive Residual Network (SMGARN). Specifically, SMGARN consists of three parts: Mask-Net, Guidance-Fusion Network (GF-Net), and Reconstruct-Net. Firstly, we build a Mask-Net with Self-pixel Attention (SA) and Cross-pixel Attention (CA) to capture the features of snowflakes and accurately localize the snow, thus predicting an accurate snow mask. Secondly, the predicted snow mask is sent into the specially designed GF-Net to adaptively guide the model to remove snow. Finally, an efficient Reconstruct-Net is used to remove the veiling effect and correct the image to reconstruct the final snow-free image. Extensive experiments show that our SMGARN numerically outperforms all existing snow removal methods, and the reconstructed images are clearer in visual contrast. All code will be made available.

Class-Aware Contrastive Semi-Supervised Learning

Mar 24, 2022

Fan Yang, Kai Wu, Shuyi Zhang, Guannan Jiang, Yong Liu, Feng Zheng, Wei Zhang, Chengjie Wang, Long Zeng

Abstract: Pseudo-label-based semi-supervised learning (SSL) has achieved great success on raw data utilization. However, its training procedure suffers from confirmation bias due to the noise contained in self-generated artificial labels. Moreover, the model's judgment becomes noisier in real-world applications with extensive out-of-distribution data. To address this issue, we propose a general method named Class-aware Contrastive Semi-Supervised Learning (CCSSL), which is a drop-in helper to improve the pseudo-label quality and enhance the model's robustness in the real-world setting. Rather than treating real-world data as a union set, our method separately handles reliable in-distribution data with class-wise clustering for blending into downstream tasks and noisy out-of-distribution data with image-wise contrastive for better generalization. Furthermore, by applying target re-weighting, we successfully emphasize clean label learning and simultaneously reduce noisy label learning. Despite its simplicity, our proposed CCSSL has significant performance improvements over the state-of-the-art SSL methods on the standard datasets CIFAR100 and STL10. On the real-world dataset Semi-iNat 2021, we improve FixMatch by 9.80% and CoMatch by 3.18%. Code is available at https://github.com/TencentYoutuResearch/Classification-SemiCLS.

* Accepted at CVPR 2022; half a page added for rebuttal information
