Abstract:Anomaly detection plays a crucial role in quality control for industrial applications. However, ensuring robustness under unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods attempt to address domain shifts by training generalizable models, but they often rely on prior knowledge of target distributions and can hardly generalize to backbones designed for other data modalities. To overcome these limitations, we build upon memory-bank-based anomaly detection methods, optimizing a robust Sinkhorn distance on limited target training data to enhance generalization to unseen target domains. We evaluate the effectiveness of our method on both 2D and 3D anomaly detection benchmarks with simulated distribution shifts. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
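As a rough illustration of the Sinkhorn distance mentioned in the abstract above, the following pure-Python sketch computes the standard entropy-regularized optimal transport cost between two small, equally weighted feature sets. It is not the robust variant the abstract refers to. The function name `sinkhorn_distance`, the regularization strength `eps`, and the iteration count are assumptions chosen for illustration only.

```python
import math


def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between two equally weighted point sets.

    Plain Sinkhorn iterations, NOT the robust variant of the abstract.
    Returns (transport cost, transport plan).
    """
    n, m = len(x), len(y)
    # Squared Euclidean cost between every pair of points.
    cost = [[sum((a - b) ** 2 for a, b in zip(xi, yj)) for yj in y] for xi in x]
    # Gibbs kernel; very large costs relative to eps would underflow to 0.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform source weights
    b = [1.0 / m] * m  # uniform target weights
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan
```

In a memory-bank setting, `x` could hold stored normal features and `y` the features of a target-domain batch; that pairing is likewise an assumption and is not described in the abstract.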
Abstract:Multi-camera systems provide richer contextual information for industrial anomaly detection. However, traditional methods process each view independently, disregarding the complementary information across viewpoints. Existing multi-view anomaly detection approaches typically employ data-driven cross-view attention for feature fusion but fail to leverage the unique geometric properties of multi-camera setups. In this work, we introduce an epipolar geometry-constrained attention module to guide cross-view fusion, ensuring more effective information aggregation. To further enhance the potential of cross-view attention, we propose a pretraining strategy inspired by memory-bank-based anomaly detection. This approach encourages normal feature representations to form multiple local clusters and incorporates multi-view-aware negative sample synthesis to regularize pretraining. We demonstrate that our epipolar-guided multi-view anomaly detection framework outperforms existing methods on the state-of-the-art multi-view anomaly detection dataset.
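One way to picture an epipolar constraint on cross-view attention is sketched below. A query pixel in view A maps to an epipolar line in view B through a fundamental matrix `F`, and only keys lying near that line keep their attention weight. The hard distance threshold `tau`, the function names, and the convention `p_B^T F p_A = 0` are assumptions made for illustration; the module described in the abstract may weight keys differently.

```python
import math


def epipolar_line(F, pixel):
    """Line (a, b, c) in view B for a pixel in view A, assuming p_B^T F p_A = 0."""
    x, y = pixel
    p = (x, y, 1.0)
    return tuple(sum(F[r][c] * p[c] for c in range(3)) for r in range(3))


def point_line_distance(line, pixel):
    """Perpendicular distance from a pixel to the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = pixel
    norm = math.hypot(a, b)
    if norm == 0.0:
        return math.inf  # degenerate line, e.g. the query sits on the epipole
    return abs(a * x + b * y + c) / norm


def epipolar_attention(F, query_px, key_pxs, logits, tau=1.5):
    """Softmax over the logits, restricted to keys within tau pixels of the line."""
    line = epipolar_line(F, query_px)
    keep = [point_line_distance(line, k) <= tau for k in key_pxs]
    kept_logits = [s for s, ok in zip(logits, keep) if ok]
    if not kept_logits:
        return [0.0] * len(logits)  # no key satisfies the constraint
    top = max(kept_logits)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(logits, keep)]
    total = sum(exps)
    return [e / total for e in exps]
```

For a rectified stereo pair, `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]` yields horizontal epipolar lines, so a query at row 5 attends only to keys whose row is within `tau` of 5.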
Abstract:Ambient Internet of Things (AIoT), recently standardized by the 3rd Generation Partnership Project (3GPP), demands a low-power wide-area communication solution that operates several orders of magnitude below the power requirements of existing 3GPP specifications. Ambient backscatter communication (AmBC), which harvests energy from ambient RF signals, is considered a competitive candidate technique. This paper considers symbiotic AmBC integrated into the uplink of a Long Term Evolution (LTE) cellular system. Leveraging the LTE uplink channel estimation capability, the AIoT device conveys its own message to the Base Station (BS) by modulating the backscatter path. We explore the detector design, analyze the error performance of the proposed scheme, and provide an exact expression and its Gaussian approximation for the error probability. We corroborate the receiver error performance by Monte Carlo simulation. Analysis of the communication range reveals that AmBC achieves a reasonable BER on the order of $10^{-2}$ within a reading distance of four wavelengths. In addition, an AmBC prototype in the LTE uplink confirms its feasibility. The over-the-air experimental results validate the theoretical analysis. Hence, the proposed AmBC approach enables AIoT deployment with minimal changes to the LTE system.
Abstract:Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens based on their redundancy to enhance both acceleration and generation quality. We further propose matching cache to facilitate faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising the quality.
Abstract:While open-source video generation and editing models have made significant progress, individual models are typically limited to specific tasks, failing to meet the diverse needs of users. Effectively coordinating these models can unlock a wide range of video generation and editing capabilities. However, manual coordination is complex and time-consuming, requiring users to deeply understand task requirements and possess comprehensive knowledge of each model's performance, applicability, and limitations, thereby increasing the barrier to entry. To address these challenges, we propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models, enhancing the adaptability, efficiency, and overall quality of video generation and editing. Specifically, SPAgent assembles a tool library integrating state-of-the-art open-source image and video generation and editing models as tools. After fine-tuning on our manually annotated dataset, SPAgent can automatically coordinate the tools for video generation and editing through our newly designed three-step framework: (1) decoupled intent recognition, (2) principle-guided route planning, and (3) capability-based execution model selection. Additionally, we enhance SPAgent's video quality evaluation capability, enabling it to autonomously assess and incorporate new video generation and editing models into its tool library without human intervention. Experimental results demonstrate that SPAgent effectively coordinates models to generate or edit videos, highlighting its versatility and adaptability across various video tasks.
Abstract:The rapid development of diffusion models (DMs) has significantly advanced image and video applications, making "what you want is what you see" a reality. Among these, video editing has gained substantial attention and seen a swift rise in research activity, necessitating a comprehensive and systematic review of the existing literature. This paper reviews diffusion model-based video editing techniques, including theoretical foundations and practical applications. We begin with an overview of the mathematical formulation and the key methods in the image domain. Subsequently, we categorize video editing approaches by the inherent connections of their core technologies, depicting their evolutionary trajectory. This paper also dives into novel applications, including point-based editing and pose-guided human video editing. Additionally, we present a comprehensive comparison using our newly introduced V2VBench. Building on the progress achieved to date, the paper concludes with ongoing challenges and potential directions for future research.
Abstract:Existing approaches towards anomaly detection (AD) often rely on a substantial amount of anomaly-free data to train representation and density models. However, large anomaly-free datasets may not always be available before the inference stage, in which case an anomaly detection model must be trained with only a handful of normal samples, a.k.a. few-shot anomaly detection (FSAD). In this paper, we propose a novel methodology to address the challenge of FSAD which incorporates two important techniques. Firstly, we employ a model pre-trained on a large source dataset to initialize model weights. Secondly, to ameliorate the covariate shift between source and target domains, we adopt contrastive training to fine-tune on the few-shot target domain data. To learn suitable representations for the downstream AD task, we additionally incorporate cross-instance positive pairs to encourage a tight cluster of the normal samples, and negative pairs for better separation between normal and synthesized negative samples. We evaluate few-shot anomaly detection on 3 controlled AD tasks and 4 real-world AD tasks to demonstrate the effectiveness of the proposed method.
Abstract:The 3GPP has recently conducted a study on the Ambient Internet of Things (AIoT), with a particular emphasis on examining backscatter communications as one of the primary techniques under consideration. Previous investigations into Ambient Backscatter Communications (AmBC) within the Long Term Evolution (LTE) downlink have shown that it is feasible to utilize the user equipment channel estimator as a receiver for demodulating frequency shift keyed (FSK) messages transmitted by the backscatter devices. In practical deployment scenarios, the backscattered link often experiences a low signal-to-noise ratio (SNR), leading to subpar bit error rate (BER) performance in the case of uncoded transmissions. In this paper, we propose the adoption of the same convolutional coding methodology for backscatter links that is already employed for LTE downlink control signals. This approach facilitates the reuse of identical demodulation functions at the modem for both control signals and backscattered AIoT messages. To assess the performance of the proposed scheme, we conducted experiments utilizing real LTE downlink signals generated by a mobile operator within an office environment. When compared to uncoded FSK, convolutional channel coding delivers a notable gain of approximately 6 dB at a BER of $10^{-3}$. Consequently, the AmBC system demonstrates a high level of reliability, achieving a BER of $10^{-3}$ at an SNR of 5 dB.
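For orientation, the convolutional code used for LTE downlink control information is a rate-1/3, constraint-length-7 tail-biting code with generator polynomials 133, 171 and 165 (octal), as specified in 3GPP TS 36.212. The sketch below encodes a bit list with those generators. The function name `tbcc_encode`, the interleaved output ordering, and the omission of rate matching and CRC attachment are simplifications assumed here; the abstract does not state how the backscatter device implements the encoder.

```python
GENERATORS = (0o133, 0o171, 0o165)  # octal generator polynomials, rate 1/3
CONSTRAINT_LENGTH = 7


def tbcc_encode(bits):
    """Tail-biting convolutional encoding of a list of 0/1 values.

    Output is interleaved as [d0_0, d1_0, d2_0, d0_1, ...]; the standard
    keeps the three parity streams separate, so this ordering is an
    illustrative simplification.
    """
    k = CONSTRAINT_LENGTH
    if len(bits) < k - 1:
        raise ValueError("need at least 6 input bits for tail biting")
    # Tail biting: preload the shift register with the last 6 input bits
    # (most recent first) so that the start and end states coincide.
    state = list(reversed(bits[-(k - 1):]))
    out = []
    for b in bits:
        window = [b] + state  # window[i] is the input delayed by i steps
        for g in GENERATORS:
            # The most significant bit of g taps the current input.
            taps = [(g >> (k - 1 - i)) & 1 for i in range(k)]
            out.append(sum(t & w for t, w in zip(taps, window)) % 2)
        state = window[:-1]  # shift the register by one position
    return out
```

Because the encoder starts and ends in the same state, cyclically rotating the input by one bit rotates this interleaved output by three positions, which gives a quick sanity check on an implementation.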
Abstract:Long Term Evolution (LTE) signals are ubiquitous in the electromagnetic (EM) background environment, which makes them an attractive signal source for ambient backscatter communications (AmBC). In this paper, we propose a system in which a backscatter device (BD) introduces an artificial Doppler shift to the channel that is larger than the natural Doppler shift but still small enough to be tracked by the channel estimator at the User Equipment (UE). Channel estimation is done using the downlink cell-specific reference signals (CRS), which are present regardless of whether the UE is attached to the network. Frequency shift keying (FSK) was selected due to its robust operation in a fading channel. We describe the whole AmBC system and consider two receivers. Finally, numerical simulations and measurements are provided to validate the proposed FSK AmBC performance.
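The idea of reading FSK symbols from a sequence of channel estimates can be sketched with a toy model. Each estimate is taken to be a static direct path plus a backscatter path rotated by one of two artificial Doppler frequencies, and the receiver compares the energy at the two candidate frequencies. The complex-tone model, the sampling rate `fs`, the tone frequencies `f0` and `f1`, and all function names are hypothetical choices for illustration; neither of the two receivers mentioned in the abstract is specified there, and this sketch does not claim to reproduce them.

```python
import cmath
import math
import random


def simulate_channel_estimates(bit, n, fs, f0, f1, h_direct, h_back, noise_std, rng):
    """Toy channel estimates: direct path + frequency-shifted backscatter + noise."""
    f = f1 if bit else f0
    return [
        h_direct
        + h_back * cmath.exp(2j * math.pi * f * k / fs)
        + complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        for k in range(n)
    ]


def tone_energy(samples, f, fs):
    """Energy of the correlation between the samples and a tone at frequency f."""
    acc = sum(s * cmath.exp(-2j * math.pi * f * k / fs) for k, s in enumerate(samples))
    return abs(acc) ** 2


def detect_bit(samples, fs, f0, f1):
    """Noncoherent FSK decision: pick the tone carrying more energy."""
    # Subtracting the mean removes the static direct-path component.
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]
    return 1 if tone_energy(centred, f1, fs) > tone_energy(centred, f0, fs) else 0
```

With, for example, `fs=2000`, `f0=200`, `f1=400` and 100 estimates per symbol, both tones complete an integer number of cycles per symbol, so they are orthogonal and the mean subtraction removes exactly the direct path in the noise-free case.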
Abstract:Long Term Evolution (LTE) systems provide ubiquitous coverage for mobile communications, which makes them a promising signal source for ambient backscatter communications. In this paper, we propose a system in which a backscatter device modulates the ambient LTE signal by changing its reflection coefficient, and the receiver uses the LTE Cell Specific Reference Signals (CRS) to estimate the channel and demodulates the backscattered signal from the obtained channel impulse response estimates. We first outline the overall system, discuss the receiver operation, and then provide experimental evidence on the practicality of the proposed system.