Abstract:Generative frame interpolation, empowered by large-scale pre-trained video generation models, has demonstrated remarkable advantages in complex scenes. However, existing methods rely heavily on the generative model to independently infer the correspondences between input frames, an ability that is inadequately developed during pre-training. In this work, we propose a novel framework, termed Motion-aware Generative frame interpolation (MoG), to significantly enhance the model's motion awareness by integrating explicit motion guidance. Specifically, we investigate two key questions: what can serve as effective motion guidance, and how this guidance can be seamlessly embedded into the generative model. For the first question, we reveal that the intermediate flow from flow-based interpolation models can efficiently provide task-oriented motion guidance. Regarding the second, we first obtain guidance-based representations of intermediate frames by warping the input frames' representations using the guidance, and then integrate them into the model at both the latent and feature levels. To demonstrate the versatility of our method, we train MoG on both real-world and animation datasets. Comprehensive evaluations show that MoG significantly outperforms existing methods in both domains, achieving superior video quality and improved fidelity.
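The warping step mentioned in this abstract can be illustrated with a minimal NumPy sketch. It is not the MoG implementation: the function name `backward_warp_nearest`, the `(dx, dy)` flow layout, nearest-neighbour sampling, and border clamping are all assumptions chosen for brevity, and the flow is assumed to point from each intermediate-frame location to its source location in an input frame.

```python
import numpy as np

def backward_warp_nearest(feat, flow):
    """Warp an input-frame representation toward an intermediate frame.

    feat: (H, W, C) representation of one input frame.
    flow: (H, W, 2) displacement (dx, dy), assumed to map each
          intermediate-frame location to its source in the input frame.
    Returns out with out[y, x] = feat[y + dy, x + dx], using nearest-neighbour
    rounding and clamping at the border (both simplifying assumptions).
    """
    h, w, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return feat[src_y, src_x]

# Toy usage: a uniform flow of one pixel to the right.
feat = np.arange(12, dtype=float).reshape(3, 4, 1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
warped = backward_warp_nearest(feat, flow)
```

A practical system would more likely use bilinear sampling so that the warp is differentiable; nearest-neighbour sampling is used here only to keep the sketch short.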
Abstract:Recently, the remarkable success of pre-trained Vision Transformers (ViTs) from image-text matching has sparked an interest in image-to-video adaptation. However, most current approaches retain the full forward pass for each frame, leading to a high computation overhead for processing entire videos. In this paper, we present InTI, a novel approach for compressive image-to-video adaptation using dynamic Inter-frame Token Interpolation. InTI aims to softly preserve the informative tokens without disrupting their coherent spatiotemporal structure. Specifically, each token pair at identical positions within neighbor frames is linearly aggregated into a new token, where the aggregation weights are generated by a multi-scale context-aware network. In this way, the information of neighbor frames can be adaptively compressed in a point-by-point manner, thereby effectively reducing the number of processed frames by half each time. Importantly, InTI can be seamlessly integrated with existing adaptation methods, achieving strong performance without extra-complex design. On Kinetics-400, InTI reaches a top-1 accuracy of 87.1 with a remarkable 37.5% reduction in GFLOPs compared to naive adaptation. When combined with additional temporal modules, InTI achieves a top-1 accuracy of 87.6 with a 37% reduction in GFLOPs. Similar conclusions have been verified in other common datasets.
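The pairwise aggregation described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the InTI code: the function name `interpolate_token_pairs` and the tensor layout `(frames, tokens, channels)` are hypothetical, and the weights are random placeholders standing in for the output of the multi-scale context-aware network, which is not modelled here.

```python
import numpy as np

def interpolate_token_pairs(tokens, weights):
    """Blend tokens at identical positions in neighbouring frames.

    tokens:  (T, N, D) array, T frames of N tokens each; T must be even.
    weights: (T // 2, N, 1) array with values in [0, 1]; a placeholder for
             the weights a learned network would predict.
    Returns a (T // 2, N, D) array, halving the number of frames.
    """
    if tokens.shape[0] % 2 != 0:
        raise ValueError("the number of frames must be even")
    first = tokens[0::2]    # frames 0, 2, 4, ...
    second = tokens[1::2]   # frames 1, 3, 5, ...
    return weights * first + (1.0 - weights) * second

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4, 16))      # 8 frames, 4 tokens, 16 channels
weights = rng.uniform(size=(4, 4, 1))     # placeholder aggregation weights
merged = interpolate_token_pairs(tokens, weights)   # shape (4, 4, 16)
```

Applying the function repeatedly halves the frame count each time, which matches the "by half each time" behaviour the abstract describes.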
Abstract:Inter-frame modeling is pivotal in generating intermediate frames for video frame interpolation (VFI). Current approaches predominantly rely on convolution or attention-based models, which often either lack sufficient receptive fields or entail significant computational overheads. Recently, Selective State Space Models (S6) have emerged, tailored specifically for long sequence modeling, offering both linear complexity and data-dependent modeling capabilities. In this paper, we propose VFIMamba, a novel frame interpolation method for efficient and dynamic inter-frame modeling by harnessing the S6 model. Our approach introduces the Mixed-SSM Block (MSB), which initially rearranges tokens from adjacent frames in an interleaved fashion and subsequently applies multi-directional S6 modeling. This design facilitates the efficient transmission of information across frames while upholding linear complexity. Furthermore, we introduce a novel curriculum learning strategy that progressively cultivates proficiency in modeling inter-frame dynamics across varying motion magnitudes, fully unleashing the potential of the S6 model. Experimental findings showcase that our method attains state-of-the-art performance across diverse benchmarks, particularly excelling in high-resolution scenarios. In particular, on the X-TEST dataset, VFIMamba demonstrates a noteworthy improvement of 0.80 dB for 4K frames and 0.96 dB for 2K frames.
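The interleaved rearrangement of tokens from adjacent frames can be sketched in a few lines of NumPy. This covers only the token ordering, under the assumption that each frame is already flattened into a `(tokens, channels)` array; the function names are hypothetical, and the multi-directional S6 scan that would follow is not shown.

```python
import numpy as np

def interleave_tokens(frame0, frame1):
    """Merge two (N, D) token arrays into one (2N, D) sequence ordered as
    frame0[0], frame1[0], frame0[1], frame1[1], ..."""
    n, d = frame0.shape
    return np.stack([frame0, frame1], axis=1).reshape(2 * n, d)

def deinterleave_tokens(sequence):
    """Inverse of interleave_tokens: split a (2N, D) sequence back into frames."""
    return sequence[0::2], sequence[1::2]

frame0 = np.arange(6, dtype=float).reshape(3, 2)
frame1 = frame0 + 100.0
sequence = interleave_tokens(frame0, frame1)
restored0, restored1 = deinterleave_tokens(sequence)
```

With this ordering, tokens at the same spatial position in the two frames sit next to each other in the sequence, so a sequence model scanning along it sees them in adjacent steps.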
Abstract:Point-based image editing has attracted remarkable attention since the emergence of DragGAN. Recently, DragDiffusion further pushed generative quality forward by adapting this dragging technique to diffusion models. Despite this great success, the dragging scheme exhibits two major drawbacks, namely inaccurate point tracking and incomplete motion supervision, which may result in unsatisfactory dragging outcomes. To tackle these issues, we build a stable and precise drag-based editing framework, coined StableDrag, by designing a discriminative point tracking method and a confidence-based latent enhancement strategy for motion supervision. The former allows us to precisely locate the updated handle points, thereby boosting the stability of long-range manipulation, while the latter keeps the optimized latent as high-quality as possible across all manipulation steps. Thanks to these designs, we instantiate two types of image editing models, StableDrag-GAN and StableDrag-Diff, which attain more stable dragging performance, as shown by extensive qualitative experiments and quantitative assessment on DragBench.
Abstract:We consider joint beamforming and stream allocation to maximize the weighted sum rate (WSR) for non-coherent joint transmission (NCJT) in user-centric cell-free MIMO networks, where distributed access points (APs) are organized in clusters to transmit different signals to serve each user equipment (UE). For the first time, we consider the common limit on the maximum number of receive streams at UEs in practical networks, and formulate a joint beamforming and transmit stream allocation problem for WSR maximization under per-AP transmit power constraints. Since the integer number of transmit streams determines the dimension of the beamformer, the joint optimization problem is mixed-integer and nonconvex with coupled decision variables, and is inherently NP-hard. In this paper, we first propose a distributed low-interaction reduced weighted minimum mean square error (RWMMSE) beamforming algorithm for WSR maximization with fixed streams. Our proposed RWMMSE algorithm requires significantly less interaction across the network and has the lowest computational complexity currently known, scaling linearly with the number of transmit antennas, without any compromise on WSR. We draw insights on the joint beamforming and stream allocation problem to decouple the decision variables and relax the mixed-integer constraints. We then propose a joint beamforming and linear stream allocation algorithm, termed RWMMSE-LSA, which yields closed-form updates with linear stream allocation complexity and is guaranteed to converge to the stationary points of the original joint optimization problem. Simulation results demonstrate a substantial performance gain of our proposed algorithms over the current best alternatives in both WSR performance and convergence time.
Abstract:Recently, the decentralized baseband processing (DBP) paradigm and relevant detection methods have been proposed to enable extremely large-scale massive multiple-input multiple-output technology. Under the DBP architecture, base station antennas are divided into several independent clusters, each connected to a local computing fabric. However, current detection methods tailored to DBP only consider ideal white Gaussian noise scenarios, while in practice, the noise is often colored due to interference from neighboring cells. Moreover, in the DBP architecture, linear minimum mean-square error (LMMSE) detection methods rely on the estimation of the noise covariance matrix through averaging distributedly stored noise samples. This presents a significant challenge for decentralized LMMSE-based equalizer design. To address this issue, this paper proposes decentralized LMMSE equalization methods under colored noise scenarios for both star and daisy chain DBP architectures. Specifically, we first propose two decentralized equalizers for the star DBP architecture based on dimensionality reduction techniques. Then, we derive an optimal decentralized equalizer using the block coordinate descent (BCD) method for the daisy chain DBP architecture with a bandwidth reduction enhancement scheme based on decentralized low-rank decomposition. Finally, simulation results demonstrate that our proposed methods can achieve excellent detection performance while requiring much less communication bandwidth.
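For reference, the centralized LMMSE equalizer under colored noise, together with the sample-average estimate of the noise covariance that the abstract refers to, can be sketched as below. This is the textbook centralized baseline, not the decentralized star or daisy chain methods the paper proposes; it assumes unit-power, uncorrelated transmit symbols, and all names and dimensions are hypothetical.

```python
import numpy as np

def sample_noise_covariance(noise_samples):
    """Estimate R = E[n n^H] by averaging K snapshots.

    noise_samples: (M, K) complex array, one noise snapshot per column.
    """
    k = noise_samples.shape[1]
    return noise_samples @ noise_samples.conj().T / k

def lmmse_equalize(H, R, y):
    """Return x_hat = (H^H R^{-1} H + I)^{-1} H^H R^{-1} y.

    H: (M, U) channel matrix, R: (M, M) noise covariance, y: (M,) received
    vector. Assumes unit-power, uncorrelated transmit symbols.
    """
    Rinv_H = np.linalg.solve(R, H)   # R^{-1} H without forming an inverse
    Rinv_y = np.linalg.solve(R, y)   # R^{-1} y
    A = H.conj().T @ Rinv_H + np.eye(H.shape[1])
    return np.linalg.solve(A, H.conj().T @ Rinv_y)

# Toy setup: 8 antennas, 3 users, 64 noise snapshots.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 3)) + 1j * rng.normal(size=(8, 3))
noise = rng.normal(size=(8, 64)) + 1j * rng.normal(size=(8, 64))
R = sample_noise_covariance(noise)
x = np.sign(rng.normal(size=3)).astype(complex)   # BPSK-like symbols
y = H @ x + noise[:, 0]
x_hat = lmmse_equalize(H, R, y)
```

In a decentralized architecture the rows of `H`, `y`, and the noise samples are split across clusters, so forming `R` as above would require exchanging raw samples between them; that exchange is the communication cost the decentralized designs aim to avoid.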
Abstract:Precoding design for maximizing weighted sum-rate (WSR) is a fundamental problem for downlink of massive multi-user multiple-input multiple-output (MU-MIMO) systems. It is well-known that this problem is generally NP-hard due to the presence of multi-user interference. The weighted minimum mean-square error (WMMSE) algorithm is a popular approach for WSR maximization. However, its computational complexity is cubic in the number of base station (BS) antennas, which is unaffordable when the BS is equipped with a large antenna array. In this paper, we consider the WSR maximization problem with either a sum-power constraint (SPC) or per-antenna power constraints (PAPCs). For the former, we prove that any nontrivial stationary point must have a low-dimensional subspace structure, and then propose a reduced-WMMSE (R-WMMSE) with linear complexity by exploiting the solution structure. For the latter, we propose a linear-complexity WMMSE approach, named PAPC-WMMSE, by using a novel recursive design of the algorithm. Both R-WMMSE and PAPC-WMMSE have simple closed-form updates and guaranteed convergence to stationary points. Simulation results verify the efficacy of the proposed designs, especially the much lower complexity as compared to the state-of-the-art approaches for massive MU-MIMO systems.
Abstract:Conventional uplink equalization in massive MIMO systems relies on a centralized baseband processing architecture. However, as the number of base station antennas increases, centralized baseband processing architectures encounter two bottlenecks, i.e., the tremendous data interconnection and the high-dimensional computation. To tackle these obstacles, decentralized baseband processing was proposed for uplink equalization, but it is only applicable to scenarios that make the impractical white Gaussian noise assumption. This paper presents an uplink linear minimum mean-square error (L-MMSE) equalization method for the daisy chain decentralized baseband processing architecture under the colored noise assumption. The optimized L-MMSE equalizer is derived by exploiting the block coordinate descent method, and it shows near-optimal performance in both theory and simulation while significantly mitigating the bottlenecks.