Zeyuan Zhang

RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface

Feb 23, 2024

Zeyuan Zhang, Tanmay Laud, Zihang He, Xiaojie Chen, Xinshuang Liu, Zhouhang Xie, Julian McAuley, Zhankui He

Abstract: We present RecWizard, a new Python toolkit for Conversational Recommender Systems (CRS). RecWizard supports the development of both models and interactive user interfaces, drawing on best practices from the Hugging Face ecosystem. CRS built with RecWizard are modular, portable, interactive, and friendly to Large Language Models (LLMs), which streamlines the learning process and reduces the extra effort required for CRS research. For more comprehensive information about RecWizard, see our GitHub repository: https://github.com/McAuley-Lab/RecWizard.

* AAAI'24 Demo Track

LVCHAT: Facilitating Long Video Comprehension

Feb 19, 2024

Yu Wang, Zeyuan Zhang, Julian McAuley, Zexue He

Abstract: Enabling large language models (LLMs) to read videos is vital for multimodal LLMs. Existing works show promise on short videos, whereas comprehension of long videos (e.g., longer than 1 minute) remains challenging. The major problem lies in the over-compression of videos, i.e., the encoded video representations are not enough to represent the whole video. To address this issue, we propose Long Video Chat (LVChat), which introduces Frame-Scalable Encoding (FSE) to dynamically adjust the number of embeddings in line with the duration of the video, so that long videos are not overly compressed into a few embeddings. To deal with videos longer than those seen during training, we propose Interleaved Frame Encoding (IFE), which repeats positional embeddings and interleaves multiple groups of videos to enable long video input, avoiding the performance degradation caused by overly long videos. Experimental results show that LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks. Our code is published at https://github.com/wangyu-ustc/LVChat.

* 17 pages; 8 figures

BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation

Mar 30, 2023

Hongxiang Cai, Zeyuan Zhang, Zhenyu Zhou, Ziyin Li, Wenbo Ding, Jiuhua Zhao

Abstract: Integrating LiDAR and camera information into Bird's-Eye-View (BEV) has become an essential topic for 3D object detection in autonomous driving. Existing methods mostly adopt an independent dual-branch framework to generate LiDAR and camera BEV, then perform an adaptive modality fusion. Since point clouds provide more accurate localization and geometry information, they can serve as a reliable spatial prior for acquiring relevant semantic information from the images. Therefore, we design a LiDAR-Guided View Transformer (LGVT) to effectively obtain the camera representation in BEV space and thus benefit the whole dual-branch fusion system. LGVT takes camera BEV as the primitive semantic query, repeatedly leveraging the spatial cue of LiDAR BEV to extract image features across multiple camera views. Moreover, we extend our framework into the temporal domain with our proposed Temporal Deformable Alignment (TDA) module, which aggregates BEV features from multiple historical frames. With these two modules, our framework, dubbed BEVFusion4D, achieves state-of-the-art results in 3D object detection, with 72.0% mAP and 73.5% NDS on the nuScenes validation set, and 73.3% mAP and 74.7% NDS on the nuScenes test set.

* 13 pages, 7 figures

Preliminary Analysis of Channel Capacity in Air to ground LoS MIMO Communication Based on A Cloud Modeling Method

Oct 19, 2022

Ning Wei, Shuangqing Tang, Zeyuan Zhang

Abstract: Since the orthogonality of the line-of-sight multiple-input multiple-output (LoS MIMO) channel is only available within the Rayleigh distance, the coverage of communication systems is restricted by the finite antenna spacing that can be implemented. However, media with different permittivity in the transmission path are likely to loosen the requirement on antenna spacing. Such a conclusion could be enlightening in an air-to-ground LoS MIMO scenario, considering the existence of clouds in the troposphere. To analyze the random phase variations in the presence of a single-layer cloud, we propose and refine a new cloud modeling method suited to LoS MIMO scenarios, based on real measurement data. A preliminary analysis of channel capacity is then conducted based on the simulation results.
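
The spacing constraint described in the abstract can be illustrated with a small free-space calculation. The sketch below is not the authors' cloud model: it assumes two parallel, broadside two-element arrays in vacuum, equal power per transmit antenna, and illustrative values for wavelength, link distance, and SNR. The function names `los_channel_2x2` and `capacity_2x2` are hypothetical. It uses the textbook capacity expression log2 det(I + (SNR/N) H H^H) and the well-known orthogonality spacing d = sqrt(wavelength * distance / N).

```python
import cmath
import math


def los_channel_2x2(spacing, distance, wavelength):
    """Unit-gain LoS channel between two parallel 2-element arrays (free space)."""
    direct = distance                         # element i to the facing element
    diagonal = math.hypot(distance, spacing)  # element i to the other element

    def phase(path_length):
        return cmath.exp(-2j * math.pi * path_length / wavelength)

    return [[phase(direct), phase(diagonal)],
            [phase(diagonal), phase(direct)]]


def capacity_2x2(h, snr):
    """Equal-power capacity log2 det(I + snr/2 * H H^H) in bit/s/Hz."""
    gram = [[(1.0 if i == k else 0.0)
             + (snr / 2) * sum(h[i][j] * h[k][j].conjugate() for j in range(2))
             for k in range(2)]
            for i in range(2)]
    det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]
    return math.log2(det.real)  # det of a Hermitian positive-definite matrix is real


wavelength = 0.01   # metres (about 30 GHz); assumed value
distance = 1000.0   # metres; assumed value
snr = 10.0          # linear SNR; assumed value

d_opt = math.sqrt(wavelength * distance / 2)  # spacing that makes H orthogonal
c_opt = capacity_2x2(los_channel_2x2(d_opt, distance, wavelength), snr)
c_small = capacity_2x2(los_channel_2x2(d_opt / 100, distance, wavelength), snr)
print(f"optimal spacing {d_opt:.3f} m: {c_opt:.2f} bit/s/Hz")
print(f"1% of optimal spacing: {c_small:.2f} bit/s/Hz")
```

With the orthogonal spacing the two spatial streams are fully separable, while the shrunken spacing collapses the channel toward rank one. Because the required spacing grows with the square root of the link distance, a fixed array loses orthogonality as the link lengthens, which is the coverage restriction the abstract refers to.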

* 14 pages

Attention-based Partial Face Recognition

Jun 14, 2021

Stefan Hörmann, Zeyuan Zhang, Martin Knoche, Torben Teepe, Gerhard Rigoll

Abstract: Photos of faces captured in unconstrained environments, such as large crowds, still pose challenges for current face recognition approaches because faces are often occluded by objects or people in the foreground. However, few studies have addressed the task of recognizing partial faces. In this paper, we propose a novel approach to partial face recognition capable of recognizing faces with different occluded areas. We achieve this by combining attentional pooling of a ResNet's intermediate feature maps with a separate aggregation module. We further adapt common losses to partial faces in order to ensure that the attention maps are diverse and handle occluded parts. Our thorough analysis demonstrates that we outperform all baselines under multiple benchmark protocols, including naturally and synthetically occluded partial faces. This suggests that our method successfully focuses on the relevant parts of the occluded face.
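
As a rough illustration of attentional pooling in general, and not of the authors' network, the sketch below pools a flattened feature map with several attention maps: each map is turned into spatial weights by a softmax, and each pooled descriptor is the weighted sum of the per-position feature vectors. The function name `attentional_pool`, the toy feature values, and the attention scores are all assumptions made for illustration.

```python
import math


def attentional_pool(features, attention_logits):
    """Pool per-position features with softmax-normalised attention maps.

    features: list of P positions, each a list of C channel values.
    attention_logits: list of K maps, each a list of P raw scores.
    Returns K descriptors, each a list of C values.
    """
    channels = len(features[0])
    pooled = []
    for logits in attention_logits:
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        pooled.append([sum(w * pos[c] for w, pos in zip(weights, features))
                       for c in range(channels)])
    return pooled


# Toy 2x2 feature map flattened to 4 positions with 2 channels (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Map 0 attends almost only to position 0; map 1 attends uniformly.
logits = [[10.0, -10.0, -10.0, -10.0], [0.0, 0.0, 0.0, 0.0]]
descriptors = attentional_pool(features, logits)
print(descriptors)
```

A position that receives a low score, for example an occluded region, contributes almost nothing to the pooled descriptor, which is the intuition behind using attention for partial faces.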

* To be published in IEEE ICIP 2021
