Abstract:3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.
Abstract:In this paper, a novel continuous-aperture array (CAPA)-based wireless communication architecture is proposed, which relies on an electrically large aperture with a continuous current distribution. First, an existing prototype of CAPA is reviewed, followed by the potential benefits and key motivations for employing CAPAs in wireless communications. Then, three practical hardware implementation approaches for CAPAs are introduced based on electronic, optical, and acoustic materials. Furthermore, several beamforming approaches are proposed to optimize the continuous current distributions of CAPAs, which are fundamentally different from those used for conventional spatially discrete arrays (SPDAs). Numerical results are provided to demonstrate the low complexity and near-optimality of these approaches. Based on the proposed approaches, the performance gains of CAPAs over SPDAs are revealed in terms of channel capacity as well as diversity-multiplexing gains. Finally, several open research problems in CAPA are highlighted.
Abstract:In contrast to a conventional RIS, the scattering matrix of a non-reciprocal RIS (NR-RIS) is non-symmetric, leading to differences between the uplink and downlink components of NR-RIS cascaded channels. In this paper, a physically consistent device model is proposed in which an NR-RIS is composed of multiple groups of two-port elements interconnected by non-reciprocal devices. The resulting non-reciprocal scattering matrix is derived for various cases, including two-element groups connected with isolators or gyrators and general three-element groups connected via circulators. Signal models are given for an NR-RIS operating in either reflecting-only or simultaneously transmitting and reflecting modes. The problem of NR-RIS design for non-reciprocal beamsteering is formulated for three-element circulator implementations, and numerical results confirm that non-reciprocal beamsteering can be achieved with minimal sidelobe power. We also show that our physically consistent NR-RIS architecture is effective in implementing channel reciprocity attacks, achieving performance similar to that of idealized NR-RIS models.
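To make the non-symmetry property mentioned in the preceding abstract concrete, the following minimal sketch writes down textbook ideal scattering matrices for an isolator, a gyrator, and a three-port circulator, and checks that each differs from its transpose. The matrices, the port ordering, the gyrator sign convention, and the helper names (`transpose`, `is_reciprocal`, `is_lossless`) are assumptions chosen for illustration; this is not the NR-RIS scattering matrix derived in the paper, which also involves the two-port elements that these devices interconnect.

```python
# Ideal textbook S-matrices (assumed for illustration, not taken from the paper).
# Entry S[i][j] is the wave leaving port i+1 per unit wave entering port j+1.

ISOLATOR = [[0, 0],
            [1, 0]]           # passes port 1 -> 2, absorbs port 2 -> 1

GYRATOR = [[0, -1],
           [1, 0]]            # the two directions differ by a 180-degree phase

CIRCULATOR = [[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]]      # routes port 1 -> 2 -> 3 -> 1


def transpose(s):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(row) for row in zip(*s)]


def is_reciprocal(s):
    """A network is reciprocal exactly when S equals its transpose."""
    return s == transpose(s)


def is_lossless(s, tol=1e-12):
    """A network is lossless when S is unitary, i.e. S^H S = I."""
    n = len(s)
    for i in range(n):
        for j in range(n):
            acc = sum(complex(s[k][i]).conjugate() * s[k][j] for k in range(n))
            if abs(acc - (1 if i == j else 0)) > tol:
                return False
    return True


if __name__ == "__main__":
    for name, s in [("isolator", ISOLATOR), ("gyrator", GYRATOR),
                    ("circulator", CIRCULATOR)]:
        print(name, "reciprocal:", is_reciprocal(s),
              "lossless:", is_lossless(s))
```

In this idealized picture all three devices are non-reciprocal, while only the gyrator and the circulator are lossless; the ideal isolator absorbs the wave travelling in the blocked direction.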
Abstract:Click-through rate (CTR) prediction, which predicts the probability of a user clicking an ad, is a fundamental task in recommender systems. The emergence of heterogeneous information, such as user profiles and behavior sequences, depicts user interests from different aspects. A mutually beneficial integration of heterogeneous information is the cornerstone of successful CTR prediction. However, most existing methods suffer from two fundamental limitations: (1) insufficient inter-mode interaction due to the unidirectional information flow between modes, and (2) aggressive information aggregation caused by early summarization, resulting in excessive information loss. To address these limitations, we propose a novel module named InterFormer to learn heterogeneous information interaction in an interleaving style. To achieve better interaction learning, InterFormer enables bidirectional information flow for mutually beneficial learning across different modes. To avoid aggressive information aggregation, we retain complete information in each data mode and use a separate bridging arch for effective information selection and summarization. InterFormer achieves state-of-the-art performance on three public datasets and a large-scale industrial dataset.
Abstract:Extremely large-scale arrays (XL-arrays) and ultra-high frequencies are two key technologies for sixth-generation (6G) networks, offering higher system capacity and expanded bandwidth resources. To effectively combine these technologies, it is necessary to consider the near-field spherical-wave propagation model, rather than the traditional far-field planar-wave model. In this paper, we explore a near-field communication system comprising a base station (BS) with hybrid analog-digital beamforming and multiple mobile users. Our goal is to maximize the system's sum-rate by optimizing the near-field codebook design for hybrid precoding. To enable fast adaptation to varying user distributions, we propose a meta-learning-based framework that integrates the model-agnostic meta-learning (MAML) algorithm with a codebook learning network. Specifically, we first design a deep neural network (DNN) to learn the near-field codebook. Then, we combine the MAML algorithm with the DNN to allow rapid adaptation to different channel conditions by leveraging a well-initialized model from the outer network. Simulation results demonstrate that our proposed framework outperforms conventional algorithms, offering improved generalization and better overall performance.
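The distinction between the spherical-wave and planar-wave models in the preceding abstract can be sketched with a uniform linear array. The code below is a minimal illustration under stated assumptions: a centred uniform linear array, a single line-of-sight path, unnormalized responses, and one particular phase sign convention. The function names (`element_positions`, `far_field_response`, `near_field_response`, `max_mismatch`) are hypothetical and do not come from the paper, whose codebook design and learning framework are not reproduced here.

```python
import cmath
import math


def element_positions(n, spacing):
    """Coordinates (in metres) of n elements of a uniform linear array centred at 0."""
    return [(i - (n - 1) / 2) * spacing for i in range(n)]


def far_field_response(n, spacing, wavelength, theta):
    """Planar-wave model: the phase is linear in the element position."""
    k = 2 * math.pi / wavelength
    return [cmath.exp(1j * k * x * math.sin(theta))
            for x in element_positions(n, spacing)]


def near_field_response(n, spacing, wavelength, theta, r):
    """Spherical-wave model: the phase follows the exact element-to-user distance."""
    k = 2 * math.pi / wavelength
    response = []
    for x in element_positions(n, spacing):
        # Law of cosines for a user at range r and angle theta from broadside.
        r_n = math.sqrt(r * r + x * x - 2 * r * x * math.sin(theta))
        response.append(cmath.exp(-1j * k * (r_n - r)))
    return response


def max_mismatch(a, b):
    """Largest per-element difference between two array responses."""
    return max(abs(u - v) for u, v in zip(a, b))


if __name__ == "__main__":
    n, wavelength, theta = 64, 0.01, 0.3      # 64 elements, 1 cm wavelength
    spacing = wavelength / 2
    far = far_field_response(n, spacing, wavelength, theta)
    for r in (5.0, 1e6):                      # a close user and a very distant one
        near = near_field_response(n, spacing, wavelength, theta, r)
        print(f"r = {r:g} m, max mismatch = {max_mismatch(near, far):.3g}")
```

For a very distant user the two responses agree closely, whereas for a user within the array's Rayleigh distance the planar-wave approximation leaves a large phase error at the edge elements, which is why a codebook indexed by angle alone is no longer sufficient in that regime.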
Abstract:This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
Abstract:This paper presents EasyAnimate, a transformer-based method for high-performance video generation. We extend the DiT framework, originally designed for 2D image synthesis, to 3D video generation by incorporating a motion module block that captures temporal dynamics, yielding consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate videos in different styles. It also supports videos with different frame rates and resolutions during both training and inference, and is suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condensing the temporal axis, which facilitates the generation of long-duration videos. Currently, EasyAnimate can generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing data pre-processing, VAE training, DiT model training (both the baseline model and the LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.
Abstract:While Large Language Models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to yield superior performance on test sets compared to their counterparts without fine-tuning. However, the comprehensive effects of fine-tuning on the LLMs' generalization ability are not fully understood. This paper delves into the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To elaborate on this, we conduct extensive experiments across five distinct language tasks on various datasets. Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks. Intriguingly, we observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability. Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs.
Abstract:Video-Language Models (VLMs), powered by advances in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. In particular, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder into a set of representative slots. To take into account both the spatial object details and the varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch focuses on extracting object-centric slots from features at high spatial resolution but a low (slow) frame sample rate, emphasizing detailed object information. Conversely, the Fast-Slots branch is engineered to learn event-centric slots from features at a high temporal sample rate but low spatial resolution. These complementary slots are combined to form the vision context, serving as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of Slot-VLM, which achieves state-of-the-art performance on video question answering.
Abstract:Multiple-antenna technologies are evolving towards large-scale aperture sizes, extremely high frequencies, and innovative antenna types. This evolution is giving rise to near-field communications (NFC) in future wireless systems. Considerable attention has been directed towards this cutting-edge technology due to its potential to enhance the capacity of wireless networks by introducing increased spatial degrees of freedom (DoFs) in the range domain. Within this context, a comprehensive review of the state of the art in NFC is presented, with a specific focus on its 1) fundamental operating principles, 2) channel modeling, 3) performance analysis, 4) signal processing, and 5) integration with other emerging technologies. Specifically, 1) the basic principles of NFC are characterized from both physics and communications perspectives, unveiling its unique properties in contrast to far-field communications. 2) Based on these principles, deterministic and stochastic near-field channel models are investigated for spatially-discrete (SPD) and continuous-aperture (CAP) antenna arrays. 3) Rooted in these models, existing contributions on near-field performance analysis are reviewed in terms of DoFs/effective DoFs (EDoFs), power scaling law, and transmission rate. 4) Existing signal processing techniques for NFC are systematically surveyed, encompassing channel estimation, beamforming design, and low-complexity beam training. 5) Major issues and research opportunities associated with the integration of NFC and other emerging technologies are identified to facilitate NFC applications in next-generation networks. Promising directions are highlighted throughout the paper to inspire future research on NFC.