College of Artificial Intelligence and Automation, Hohai University
Abstract:Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi-agent system comprising three key components: an Instruction-Parsing agent that interprets user-uploaded images and instructions, a Strategy-Planning agent that devises single-step or multi-step generation actions, and an Execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench arXiv:2412.11767, comprising 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in zero-shot adapting to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at https://github.com/ali-vilab/ChatDiT
Abstract:We present The Matrix, the first foundational realistic world simulator capable of generating continuous 720p high-fidelity real-scene video streams with real-time, responsive control in both first- and third-person perspectives, enabling immersive exploration of richly dynamic environments. Trained on limited supervised data from AAA games like Forza Horizon 5 and Cyberpunk 2077, complemented by large-scale unsupervised footage from real-world settings like Tokyo streets, The Matrix allows users to traverse diverse terrains -- deserts, grasslands, water bodies, and urban landscapes -- in continuous, uncut hour-long sequences. Operating at 16 FPS, the system supports real-time interactivity and demonstrates zero-shot generalization, translating virtual game environments to real-world contexts where collecting continuous movement data is often infeasible. For example, The Matrix can simulate a BMW X3 driving through an office setting--an environment present in neither gaming data nor real-world sources. This approach showcases the potential of AAA game data to advance robust world models, bridging the gap between simulations and real-world applications in scenarios with limited data.
Abstract:Computational protein design (CPD) offers transformative potential for bioengineering, but current deep CPD models, focused on universal domains, struggle with function-specific designs. This work introduces a novel CPD paradigm tailored for functional design tasks, particularly for enzymes-a key protein class often lacking specific application efficiency. To address structural data scarcity, we present CrossDesign, a domain-adaptive framework that leverages pretrained protein language models (PPLMs). By aligning protein structures with sequences, CrossDesign transfers pretrained knowledge to structure models, overcoming the limitations of limited structural data. The framework combines autoregressive (AR) and non-autoregressive (NAR) states in its encoder-decoder architecture, applying it to enzyme datasets and pan-proteins. Experimental results highlight CrossDesign's superior performance and robustness, especially with out-of-domain enzymes. Additionally, the model excels in fitness prediction when tested on large-scale mutation data, showcasing its stability.
Abstract:Efficient wideband spectrum sensing (WSS) is essential for managing spectrum scarcity in wireless communications. However, existing compressed sensing (CS)-based WSS methods require high sampling rates and power consumption, particularly with high-precision analog-to-digital converters (ADCs). Although 1-bit CS with low-precision ADCs can mitigate these demands, most approaches still depend on multi-user cooperation and prior sparsity information, which are often unavailable in WSS scenarios. This paper introduces a non-cooperative WSS method using multicoset sampling with 1-bit ADCs to achieve sub-Nyquist sampling without requiring sparsity knowledge. We analyze the impact of 1-bit quantization on multiband signals, then apply eigenvalue decomposition to isolate the signal subspace from noise, enabling spectrum support estimation without signal reconstruction. This approach provides a power-efficient solution for WSS that eliminates the need for cooperation and prior information.
Abstract:Federated learning (FL) is an innovative distributed artificial intelligence (AI) technique. It has been used for interdisciplinary studies in different fields such as healthcare, marketing and finance. However the application of FL in wireless networks is still in its infancy. In this work, we first overview benefits and concerns when applying FL to wireless networks. Next, we provide a new perspective on existing personalized FL frameworks by analyzing the relationship between cooperation and personalization in these frameworks. Additionally, we discuss the possibility of tuning the cooperation level with a choice-based approach. Our choice-based FL approach is a flexible and safe FL framework that allows participants to lower the level of cooperation when they feel unsafe or unable to benefit from the cooperation. In this way, the choice-based FL framework aims to address the safety and fairness concerns in FL and protect participants from malicious attacks.
Abstract:Knowledge Graph (KG) is playing an increasingly important role in various AI systems. For e-commerce, an efficient and low-cost automated knowledge graph construction method is the foundation of enabling various successful downstream applications. In this paper, we propose a novel method for constructing structured product knowledge graphs from raw product images. The method cooperatively leverages recent advances in the vision-language model (VLM) and large language model (LLM), fully automating the process and allowing timely graph updates. We also present a human-annotated e-commerce product dataset for benchmarking product property extraction in knowledge graph construction. Our method outperforms our baseline in all metrics and evaluated properties, demonstrating its effectiveness and bright usage potential.
Abstract:Embedding retrieval aims to learn a shared semantic representation space for both queries and items, thus enabling efficient and effective item retrieval using approximate nearest neighbor (ANN) algorithms. In current industrial practice, retrieval systems typically retrieve a fixed number of items for different queries, which actually leads to insufficient retrieval (low recall) for head queries and irrelevant retrieval (low precision) for tail queries. Mostly due to the trend of frequentist approach to loss function designs, till now there is no satisfactory solution to holistically address this challenge in the industry. In this paper, we move away from the frequentist approach, and take a novel \textbf{p}robabilistic approach to \textbf{e}mbedding \textbf{b}ased \textbf{r}etrieval (namely \textbf{pEBR}) by learning the item distribution for different queries, which enables a dynamic cosine similarity threshold calculated by the probabilistic cumulative distribution function (CDF) value. The experimental results show that our approach improves both the retrieval precision and recall significantly. Ablation studies also illustrate how the probabilistic approach is able to capture the differences between head and tail queries.
Abstract:Beam management is an important technique to improve signal strength and reduce interference in wireless communication systems. Recently, there has been increasing interest in using diverse sensing modalities for beam management. However, it remains a big challenge to process multi-modal data efficiently and extract useful information. On the other hand, the recently emerging multi-modal transformer (MMT) is a promising technique that can process multi-modal data by capturing long-range dependencies. While MMT is highly effective in handling multi-modal data and providing robust beam management, integrating reinforcement learning (RL) further enhances their adaptability in dynamic environments. In this work, we propose a two-step beam management method by combining MMT with RL for dynamic beam index prediction. In the first step, we divide available beam indices into several groups and leverage MMT to process diverse data modalities to predict the optimal beam group. In the second step, we employ RL for fast beam decision-making within each group, which in return maximizes throughput. Our proposed framework is tested on a 6G dataset. In this testing scenario, it achieves higher beam prediction accuracy and system throughput compared to both the MMT-only based method and the RL-only based method.
Abstract:Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces. For project webpage, see https://hexiaoxiao-cs.github.io/DICE/.
Abstract:In high-energy particle physics, extracting information from complex detector signals is crucial for energy reconstruction. Recent advancements involve using deep learning to process calorimeter images from various sub-detectors in experiments like the Large Hadron Collider (LHC) for energy map reconstruction. This paper compares classical algorithms\-MLP, CNN, U-Net, and RNN\-with variants that include self-attention and 3D convolution modules to evaluate their effectiveness in reconstructing the initial energy distribution. Additionally, a test dataset of jet events is utilized to analyze and compare models' performance in handling anomalous high-energy events. The analysis highlights the effectiveness of deep learning techniques for energy image reconstruction and explores their potential in this area.