Abstract:Traditional wireless image transmission methods struggle to balance rate efficiency and reconstruction quality under varying channel conditions. To address these challenges, we propose a novel semantic communication (SemCom) system that integrates entropy-aware and channel-adaptive mechanisms for wireless image transmission over multi-user multiple-input multiple-output (MU-MIMO) fading channels. Unlike existing approaches, our system dynamically adjusts transmission rates based on the entropy of feature maps, channel state information (CSI), and signal-to-noise ratio (SNR), ensuring optimal resource utilization and robust performance. The system employs feature map pruning, channel attention, spatial attention, and multihead self-attention (MHSA) mechanisms to prioritize critical semantic features and effectively reconstruct images. Experimental results demonstrate that the proposed system outperforms state-of-the-art benchmarks, including BPG+LDPC+4QAM and Deep JSCC, in terms of rate-distortion performance, flexibility, and robustness, particularly under challenging conditions such as low SNR, imperfect CSI, and inter-user interference. This work establishes a strong foundation for adaptive-rate SemCom systems and highlights their potential for real-time, bandwidthintensive applications.
Abstract:Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen3, Pika, and Sora, have significantly broadened its applicability and popularity. This progress has created a growing demand for accurate quality assessment metrics to evaluate the perceptual quality of text-generated videos and optimize video generation models. However, assessing the quality of text-generated videos remains challenging due to the presence of highly complex distortions, such as unnatural actions and phenomena that defy human cognition. To address these challenges, we constructed a large-scale benchmark dataset for \textbf{T}ext-generated \textbf{V}ideo \textbf{eval}uation, \textbf{T2VEval-Bench}, comprising 148 textual words and 1,783 videos generated by 12 models. During the subjective evaluation, we collected five key scores: overall impression, video quality, aesthetic quality, realness, and text-video consistency. For objective evaluation, we developed the \textbf{T2VEval} model, which assesses videos across three branches: quality, authenticity, and consistency. Using an attention-based fusion module, T2VEval effectively integrates features from each branch and predicts scores with the aid of a large oracle model. Additionally, we implemented a progressive training strategy, enabling each branch to learn targeted knowledge while maintaining synergy with the others. Experimental results demonstrate that T2VEval achieves state-of-the-art performance across multiple metrics. The dataset and code will be open-sourced upon completion of the follow-up work.
Abstract:This paper investigates a novel generative artificial intelligence (GAI) empowered multi-user semantic communication system called semantic feature multiple access (SFMA) for video transmission, which comprises a base station (BS) and paired users. The BS generates and combines semantic information of several frames simultaneously requested by paired users into a single signal. Users recover their frames from this combined signal and input the recovered frames into a GAI-based video frame interpolation model to generate the intermediate frame. To optimize transmission rates and temporal gaps between simultaneously transmitted frames, we formulate an optimization problem to maximize the system sum rate while minimizing temporal gaps. Since the standard signal-to-interference-plus-noise ratio (SINR) equation does not accurately capture the performance of our semantic communication system, we introduce a weight parameter into the SINR equation to better represent the system's performance. Due to its dependence on transmit power, we propose a three-step solution. First, we develop a user pairing algorithm that pairs two users with the highest preference value, a weighted combination of semantic transmission rate and temporal gap. Second, we optimize inter-group power allocation by formulating an optimization problem that allocates proper transmit power across all user groups to maximize system sum rates while satisfying each user's minimum rate requirement. Third, we address intra-group power allocation to enhance each user's performance. Simulation results demonstrate that our method improves transmission rates by up to 24.8%, 45.8%, and 66.1% compared to fixed-power non-orthogonal multiple access (F-NOMA), orthogonal joint source-channel coding (O-JSCC), and orthogonal frequency division multiple access (OFDMA), respectively.
Abstract:Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer MM-DiT architecture. However, the current video generation models predominantly focus on single-prompt, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
Abstract:Dynamic texture synthesis aims to generate sequences that are visually similar to a reference video texture and exhibit specific stationary properties in time. In this paper, we introduce a spatiotemporal generative adversarial network (DTSGAN) that can learn from a single dynamic texture by capturing its motion and content distribution. With the pipeline of DTSGAN, a new video sequence is generated from the coarsest scale to the finest one. To avoid mode collapse, we propose a novel strategy for data updates that helps improve the diversity of generated results. Qualitative and quantitative experiments show that our model is able to generate high quality dynamic textures and natural motion.
Abstract:Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
Abstract:Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial application.To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: https://zhuang2002.github.io/ColorFlow/.
Abstract:Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.
Abstract:Researches on leveraging big artificial intelligence model (BAIM) technology to drive the intelligent evolution of wireless networks are emerging. However, since the breakthrough in generalization brought about by BAIM techniques mainly occurs in natural language processing, there is still a lack of a clear technical roadmap on how to efficiently apply BAIM techniques to wireless systems with many additional peculiarities. To this end, this paper first reviews recent research works on BAIM for wireless and assesses the current research situation. Then, this paper analyzes and compares the differences between language intelligence and wireless intelligence on multiple levels, including scientific foundations, core usages, and technical details. It highlights the necessity and scientific significance of developing BAIM technology in a wireless-native way, as well as new issues that need to be considered in specific technical implementation. Finally, by synthesizing the evolutionary laws of language models with the particularities of wireless system, this paper provides several instructive methodologies for how to develop wireless-native BAIM.
Abstract:Point clouds have gained prominence in numerous applications due to their ability to accurately depict 3D objects and scenes. However, compressing unstructured, high-precision point cloud data effectively remains a significant challenge. In this paper, we propose NeRC$^{\textbf{3}}$, a novel point cloud compression framework leveraging implicit neural representations to handle both geometry and attributes. Our approach employs two coordinate-based neural networks to implicitly represent a voxelized point cloud: the first determines the occupancy status of a voxel, while the second predicts the attributes of occupied voxels. By feeding voxel coordinates into these networks, the receiver can efficiently reconstructs the original point cloud's geometry and attributes. The neural network parameters are quantized and compressed alongside auxiliary information required for reconstruction. Additionally, we extend our method to dynamic point cloud compression with techniques to reduce temporal redundancy, including a 4D spatial-temporal representation termed 4D-NeRC$^{\textbf{3}}$. Experimental results validate the effectiveness of our approach: for static point clouds, NeRC$^{\textbf{3}}$ outperforms octree-based methods in the latest G-PCC standard. For dynamic point clouds, 4D-NeRC$^{\textbf{3}}$ demonstrates superior geometry compression compared to state-of-the-art G-PCC and V-PCC standards and achieves competitive results for joint geometry and attribute compression.