Abstract:Extremely large-scale arrays (XL-arrays) and ultra-high frequencies are two key technologies for sixth-generation (6G) networks, offering higher system capacity and expanded bandwidth resources. To effectively combine these technologies, it is necessary to consider the near-field spherical-wave propagation model, rather than the traditional far-field planar-wave model. In this paper, we explore a near-field communication system comprising a base station (BS) with hybrid analog-digital beamforming and multiple mobile users. Our goal is to maximize the system's sum-rate by optimizing the near-field codebook design for hybrid precoding. To enable fast adaptation to varying user distributions, we propose a meta-learning-based framework that integrates the model-agnostic meta-learning (MAML) algorithm with a codebook learning network. Specifically, we first design a deep neural network (DNN) to learn the near-field codebook. Then, we combine the MAML algorithm with the DNN to allow rapid adaptation to different channel conditions by leveraging a well-initialized model from the outer network. Simulation results demonstrate that our proposed framework outperforms conventional algorithms, offering improved generalization and better overall performance.
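To make the meta-learning recipe concrete, below is a minimal first-order MAML sketch in PyTorch wrapped around a codebook-learning DNN. The network sizes, the task tuples, and the proxy sum-rate loss are illustrative assumptions, not the paper's implementation (which would evaluate the actual hybrid-precoding sum-rate and may use second-order MAML).

```python
import copy
import torch
import torch.nn as nn

class CodebookNet(nn.Module):
    """DNN mapping channel features to codeword-selection logits (illustrative only)."""
    def __init__(self, in_dim=64, n_beams=16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_beams))

    def forward(self, h):
        return self.body(h)

def proxy_loss(logits, channel_feats):
    # Stand-in for the paper's (negative) sum-rate objective.
    beams = torch.softmax(logits, dim=-1)
    gain = (beams * channel_feats[..., :beams.shape[-1]]).sum(dim=-1)
    return -gain.mean()

def maml_outer_step(model, meta_opt, tasks, inner_lr=1e-2, inner_steps=1):
    """One first-order meta-update over a batch of user-distribution 'tasks'."""
    meta_opt.zero_grad()
    for support, query in tasks:                     # each task: channels of one user layout
        fast = copy.deepcopy(model)                  # task-specific copy for inner adaptation
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            proxy_loss(fast(support), support).backward()
            inner_opt.step()
        query_loss = proxy_loss(fast(query), query)  # evaluate the adapted copy
        grads = torch.autograd.grad(query_loss, fast.parameters())
        for p, g in zip(model.parameters(), grads):  # accumulate first-order meta-gradients
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()                                  # updates the shared, well-initialized model

# Toy usage: two tasks, each with 32 support and 32 query channel-feature vectors.
net = CodebookNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
maml_outer_step(net, opt, [(torch.randn(32, 64), torch.randn(32, 64)) for _ in range(2)])
```

At deployment, only the cheap inner-loop adaptation would be run on the channels of a new user distribution, which is the "fast adaptation" benefit the abstract describes.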
Abstract:This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach uses vision-language models to assess image clearness and provide semantics on real data, which serve as supervision signals for training restoration models. For clearness enhancement, we exploit real-world data through a dual-step strategy that combines pseudo-labels assessed by vision-language models with weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
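As a hedged illustration of how a vision-language model could supply a clearness signal on real data, the sketch below scores an image with an off-the-shelf CLIP model from Hugging Face transformers; the text prompts and any downstream thresholding rule are assumptions rather than the paper's exact design.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clearness_score(image: Image.Image) -> float:
    """Probability CLIP assigns to the 'clean' description over weather descriptions."""
    texts = ["a clear clean photo",
             "a photo degraded by rain",
             "a photo degraded by haze",
             "a photo degraded by snow"]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return probs[0].item()

# Restored real-world outputs whose score exceeds a chosen threshold could be
# retained as pseudo-clean supervision for the restoration model.
```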
Abstract:This paper presents EasyAnimate, an advanced method for video generation that leverages the power of the transformer architecture for high-performance outcomes. We have expanded the DiT framework, originally designed for 2D image synthesis, to accommodate the complexities of 3D video generation by incorporating a motion module block. This block captures temporal dynamics, thereby ensuring consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate videos with different styles. It also supports different frame rates and resolutions during both training and inference, and is applicable to both images and videos. Moreover, we introduce Slice VAE, a novel approach to condensing the temporal axis that facilitates the generation of long-duration videos. Currently, EasyAnimate can generate videos of 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing data pre-processing, VAE training, DiT model training (both the baseline and LoRA models), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.
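For intuition, the sketch below shows a generic temporal-attention block of the kind often used as a motion module; the tensor layout, sizes, and residual structure are assumptions and this is not EasyAnimate's actual implementation.

```python
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Self-attention over the frame axis to model temporal dynamics."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, channels) -- spatial tokens per frame from the backbone
        b, f, t, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, c)   # attend across frames at each token
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h))
        out = out.reshape(b, t, f, c).permute(0, 2, 1, 3)
        return x + out                                    # residual keeps the spatial pathway

# Example: 2 clips, 16 frames, 64 spatial tokens, 320 channels
video_tokens = torch.randn(2, 16, 64, 320)
print(MotionModule(320)(video_tokens).shape)  # torch.Size([2, 16, 64, 320])
```

Because the block only mixes information along the frame axis, it can in principle be dropped into different DiT baselines without altering their spatial layers, which is the adaptability the abstract points to.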
Abstract:While Large Language Models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to yield superior performance on test sets compared to their counterparts without fine-tuning. However, the comprehensive effects of fine-tuning on the LLMs' generalization ability are not fully understood. This paper delves into the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To elaborate on this, we conduct extensive experiments across five distinct language tasks on various datasets. Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks. Intriguingly, we observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability. Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs.
Abstract:Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens that align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. In particular, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder into a set of representative slots. To account for both spatial object details and varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch extracts object-centric slots from features at high spatial resolution but a low (slow) frame sample rate, emphasizing detailed object information. Conversely, the Fast-Slots branch learns event-centric slots from features at a high temporal sample rate but low spatial resolution. These complementary slots are combined to form the vision context, which serves as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of Slot-VLM, which achieves state-of-the-art performance on video question answering.
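The dual-branch idea can be illustrated with the hedged sketch below, which uses learned-query cross-attention as a simplified stand-in for slot attention; all shapes, strides, and slot counts are assumptions, not Slot-VLM's actual design.

```python
import torch
import torch.nn as nn

class QueryPool(nn.Module):
    """Aggregate a token sequence into a fixed number of 'slots' via cross-attention."""
    def __init__(self, dim: int, num_slots: int, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        slots, _ = self.attn(q, tokens, tokens)
        return slots                                            # (B, num_slots, dim)

class SFSlots(nn.Module):
    def __init__(self, dim=768, slow_slots=8, fast_slots=8, slow_stride=8, fast_pool=4):
        super().__init__()
        self.slow_stride, self.fast_pool = slow_stride, fast_pool
        self.slow = QueryPool(dim, slow_slots)   # object-centric: few frames, full spatial tokens
        self.fast = QueryPool(dim, fast_slots)   # event-centric: all frames, pooled spatial tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, dim) dense CLIP tokens; N assumed divisible by fast_pool
        b, t, n, d = x.shape
        slow_in = x[:, ::self.slow_stride].reshape(b, -1, d)       # low frame rate branch
        fast_in = x.reshape(b, t, -1, self.fast_pool, d).mean(3)   # low spatial resolution branch
        fast_in = fast_in.reshape(b, -1, d)
        return torch.cat([self.slow(slow_in), self.fast(fast_in)], dim=1)  # vision context

print(SFSlots()(torch.randn(1, 32, 64, 768)).shape)  # torch.Size([1, 16, 768])
```

The concatenated slots play the role of the compact vision context handed to the LLM, which is how the token count is kept small relative to the dense CLIP features.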
Abstract:Multiple-antenna technologies are evolving towards large-scale aperture sizes, extremely high frequencies, and innovative antenna types. This evolution is giving rise to the emergence of near-field communications (NFC) in future wireless systems. Considerable attention has been directed towards this cutting-edge technology due to its potential to enhance the capacity of wireless networks by introducing increased spatial degrees of freedom (DoFs) in the range domain. Within this context, a comprehensive review of the state of the art on NFC is presented, with a specific focus on its 1) fundamental operating principles, 2) channel modeling, 3) performance analysis, 4) signal processing, and 5) integration with other emerging technologies. Specifically, 1) the basic principles of NFC are characterized from both physics and communications perspectives, unveiling its unique properties in contrast to far-field communications. 2) Based on these principles, deterministic and stochastic near-field channel models are investigated for spatially-discrete (SPD) and continuous-aperture (CAP) antenna arrays. 3) Rooted in these models, existing contributions on near-field performance analysis are reviewed in terms of DoFs/effective DoFs (EDoFs), power scaling law, and transmission rate. 4) Existing signal processing techniques for NFC are systematically surveyed, encompassing channel estimation, beamforming design, and low-complexity beam training. 5) Major issues and research opportunities associated with the integration of NFC and other emerging technologies are identified to facilitate NFC applications in next-generation networks. Promising directions are highlighted throughout the paper to inspire future research endeavors in the realm of NFC.
Abstract:Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into potential future directions concerning the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.
Abstract:Reconfigurable intelligent surface (RIS)-aided near-field communications is investigated. First, the necessity of investigating RIS-aided near-field communications and the advantages brought about by the unique spherical-wave-based near-field propagation are discussed. Then, the family of patch-array-based RISs and metasurface-based RISs are introduced along with their respective near-field channel models. A pair of fundamental performance limits of RIS-aided near-field communications, namely their power scaling law and effective degrees-of-freedom, are analyzed for both patch-array-based and metasurface-based RISs, which reveals the potential performance gains that can be achieved. Furthermore, the associated near-field beam training and beamforming design issues are studied, where a two-stage hierarchical beam training approach and a low-complexity element-wise beamforming design are proposed for RIS-aided near-field communications. Finally, a suite of open research problems is highlighted for motivating future research.
Abstract:The remarkable natural language understanding, reasoning, and generation capabilities of large language models (LLMs) have made them attractive for video question answering (Video QA) tasks that use video tokens as contextual input. However, employing LLMs for long video understanding presents significant challenges and remains under-explored. The extensive number of video tokens leads to considerable computational costs for LLMs, while using aggregated tokens results in a loss of visual detail. Moreover, the presence of abundant question-irrelevant tokens introduces noise into the video QA process. To address these issues, we introduce a simple yet effective retrieval-based video language model (R-VLM) for efficient and interpretable long video QA. Specifically, given a question (query) and a long video, our model identifies and selects the $K$ most relevant video chunks and uses their associated visual tokens as context for LLM inference. This effectively reduces the number of video tokens, eliminates noise interference, and enhances system performance. Our experimental results validate the effectiveness of our framework for comprehending long videos. Furthermore, because the answers are grounded in the retrieved chunks, our model is interpretable: it provides justification for where in the video the answers come from.
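A hedged sketch of the chunk-retrieval step is given below: each video chunk is scored against the question embedding and the top-$K$ chunks supply the visual tokens passed to the LLM. The encoders, the chunking, and the cosine-similarity scoring are placeholders; the paper's actual matching may differ.

```python
import torch
import torch.nn.functional as F

def retrieve_chunks(question_emb: torch.Tensor,
                    chunk_embs: torch.Tensor,
                    chunk_tokens: list,
                    k: int = 4):
    """question_emb: (d,), chunk_embs: (num_chunks, d), chunk_tokens: per-chunk visual tokens."""
    scores = F.cosine_similarity(question_emb.unsqueeze(0), chunk_embs, dim=-1)
    top = torch.topk(scores, k=min(k, chunk_embs.size(0))).indices
    # Concatenate the visual tokens of the selected chunks as the LLM's vision context,
    # and return the indices so an answer can be traced back to specific chunks.
    context = torch.cat([chunk_tokens[i] for i in top.tolist()], dim=0)
    return context, top

# Toy example: 10 chunks, 512-dim embeddings, 32 visual tokens of width 512 per chunk.
q = torch.randn(512)
chunks = torch.randn(10, 512)
tokens = [torch.randn(32, 512) for _ in range(10)]
ctx, idx = retrieve_chunks(q, chunks, tokens, k=3)
print(ctx.shape, idx)  # torch.Size([96, 512]) plus the retrieved chunk indices
```

Returning the chunk indices alongside the context is what makes the answer attributable to specific video segments, i.e., the interpretability claimed in the abstract.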
Abstract:Stable Diffusion web UI (SD-WebUI) is a comprehensive project that provides a browser interface, based on the Gradio library, for Stable Diffusion models. In this paper, we propose a novel WebUI plugin called EasyPhoto, which enables the generation of AI portraits. By training a digital doppelganger of a specific user ID on 5 to 20 relevant images, the resulting fine-tuned LoRA model allows AI photos to be generated from arbitrary templates. Our current implementation supports the modification of multiple persons and different photo styles. Furthermore, we allow users to generate fantastic template images with the strong SDXL model, enhancing EasyPhoto's ability to deliver more diverse and satisfactory results. The source code for EasyPhoto is available at: https://github.com/aigc-apps/sd-webui-EasyPhoto. We also support a webui-free version by using diffusers: https://github.com/aigc-apps/EasyPhoto. We are continuously enhancing the EasyPhoto pipeline to make it suitable for any identity (not limited to just the face), and we enthusiastically welcome any intriguing ideas or suggestions.
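For the webui-free route, a minimal diffusers sketch of attaching a user-specific LoRA and generating a portrait might look as follows; the base model ID, the LoRA path, and the trigger token are placeholders, not EasyPhoto's actual configuration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an assumed Stable Diffusion base model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# LoRA weights trained on 5 to 20 images of the target user ID (path is hypothetical).
pipe.load_lora_weights("path/to/user_id_lora")

image = pipe(
    prompt="portrait photo of <user_id>, studio lighting",  # trigger token is an assumption
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("easyphoto_portrait.png")
```

In the full plugin, template handling (face detection, cropping, and inpainting onto the template) would wrap around this generation step; the snippet only covers the LoRA-conditioned generation itself.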