Abstract:Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI-empowered Course), a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China's leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI-driven online education.
Abstract:Diffusion-based text-to-image generation models have significantly advanced the field of art content synthesis. However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. We observed that Latent Consistency Models employing consistency distillation can effectively extract representative Consistency Features from noisy images. To blend the Consistency Features extracted from both content and style images, we introduce a Style Enhancement Attention Control technique that meticulously merges content and style features within the attention space of the target image. Moreover, we propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control. Extensive experiments have validated the effectiveness of our proposed framework in enhancing stylization efficiency and fidelity. The code is available at \url{https://github.com/liujin112/ZePo}.
Abstract:Single-pixel imaging (SPI) using a single-pixel detector is an unconventional imaging method, which has great application prospects in many fields to realize high-performance imaging. In especial, the recent proposed catadioptric panoramic ghost imaging (CPGI) extends the application potential of SPI to high-performance imaging at a wide field of view (FOV) with recent growing demands. However, the resolution of CPGI is limited by hardware parameters of the digital micromirror device (DMD), which may not meet ultrahigh-resolution panoramic imaging needs that require detailed information. Therefore, to overcome the resolution limitation of CPGI, we propose a panoramic SPI based on rotational subdivision (RSPSI). The key of the proposed RSPSI is to obtain the entire panoramic scene by the rotation-scanning with a rotating mirror tilted 45{\deg}, so that one single pattern that only covers one sub-Fov with a small FOV can complete a uninterrupted modulation on the entire panoramic FOV during a once-through pattern projection. Then, based on temporal resolution subdivision, images sequence of sub-Fovs subdivided from the entire panoramic FOV can be reconstructed with pixels-level or even subpixels-level horizontal shifting adjacently. Experimental results using a proof-of-concept setup show that the panoramic image can be obtained with 10428*543 of 5,662,404 pixels, which is more than 9.6 times higher than the resolution limit of the CPGI using the same DMD. To our best knowledge, the RSPSI is the first to achieve a megapixel resolution via SPI, which can provide potential applications in fields requiring the imaging with ultrahigh-resolution and wide FOV.
Abstract:Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. With the advent of large language models (LLMs), shows their impressive capability of textual understanding through large-scale pretraining, which implies the great potential of extractive snippet generation. In this paper, we systematically investigated two indispensable characteristics that the LLMs-based QFS models should be harnessed, Lengthy Document Summarization and Efficiently Fine-grained Query-LLM Alignment, respectively. Correspondingly, we propose two modules called Query-aware HyperExpert and Query-focused Infini-attention to access the aforementioned characteristics. These innovations pave the way for broader application and accessibility in the field of QFS technology. Extensive experiments conducted on existing QFS benchmarks indicate the effectiveness and generalizability of the proposed approach. Our code is publicly available at https://github.com/DCDmllm/IDEAL_Summary.
Abstract:The transformer networks are extensively utilized in face forgery detection due to their scalability across large datasets.Despite their success, transformers face challenges in balancing the capture of global context, which is crucial for unveiling forgery clues, with computational complexity.To mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to efficiently process extensive visual contexts while avoiding catastrophic forgetting.Our approach empowers the target token to perceive global information by assigning differential attention levels to tokens at varying distances. We implement self-attention along both spatial axes, thereby maintaining spatial priors and easing the computational burden.Moreover, we present the adaptive frequency Band-Attention Modulation mechanism, which treats the entire Discrete Cosine Transform spectrogram as a series of frequency bands with learnable weights.Together, BAR-Net achieves favorable performance on several face forgery datasets, outperforming current state-of-the-art methods.
Abstract:We consider a wireless networked control system (WNCS) with bidirectional imperfect links for real-time applications such as smart grids. To maintain the stability of WNCS, captured by the probability that plant state violates preset values, at minimal cost, heterogeneous physical processes are monitored by multiple sensors. This status information, such as dynamic plant state and Markov Process-based context information, is then received/estimated by the controller for remote control. However, scheduling multiple sensors and designing the controller with limited resources is challenging due to their coupling, delay, and transmission loss. We formulate a Constrained Markov Decision Problem (CMDP) to minimize violation probability with cost constraints. We reveal the relationship between the goal and different updating actions by analyzing the significance of information that incorporates goal-related usefulness and contextual importance. Subsequently, a goal-oriented deterministic scheduling policy is proposed. Two sensing-assisted control strategies and a control-aware estimation policy are proposed to improve the violation probability-cost tradeoff, integrated with the scheduling policy to form a goal-oriented co-design framework. Additionally, we explore retransmission in downlink transmission and qualitatively analyze its preference scenario. Simulation results demonstrate that the proposed goal-oriented co-design policy outperforms previous work in simultaneously reducing violation probability and cost
Abstract:Driven by the development goal of network paradigm and demand for various functions in the sixth-generation (6G) mission-critical Internet-of-Things (MC-IoT), we foresee a goal-oriented integration of sensing, communication, computing, and control (GIS3C) in this paper. We first provide an overview of the tasks, requirements, and challenges of MC-IoT. Then we introduce an end-to-end GIS3C architecture, in which goal-oriented communication is leveraged to bridge and empower sensing, communication, control, and computing functionalities. By revealing the interplay among multiple subsystems in terms of key performance indicators and parameters, this paper introduces unified metrics, i.e., task completion effectiveness and cost, to facilitate S3C co-design in MC-IoT. The preliminary results demonstrate the benefits of GIS3C in improving task completion effectiveness while reducing costs. We also identify and highlight the gaps and challenges in applying GIS3C in the future 6G networks.
Abstract:Flow matching as a paradigm of generative model achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with the coupling strategy guided by diffusion model from entire distribution level. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high quality samples with fewer step. StraightFM generates visually appealing images with a lower FID among diffusion and traditional flow matching methods within 5 sampling steps when trained on pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
Abstract:Infrared small target detection based on deep learning offers unique advantages in separating small targets from complex and dynamic backgrounds. However, the features of infrared small targets gradually weaken as the depth of convolutional neural network (CNN) increases. To address this issue, we propose a novel method for detecting infrared small targets called improved dense nested attention network (IDNANet), which is based on the transformer architecture. We preserve the dense nested structure of dense nested attention network (DNANet) and introduce the Swin-transformer during feature extraction stage to enhance the continuity of features. Furthermore, we integrate the ACmix attention structure into the dense nested structure to enhance the features of intermediate layers. Additionally, we design a weighted dice binary cross-entropy (WD-BCE) loss function to mitigate the negative impact of foreground-background imbalance in the samples. Moreover, we develop a dataset specifically for infrared small targets, called BIT-SIRST. The dataset comprises a significant amount of real-world targets and manually annotated labels, as well as synthetic data and corresponding labels. We have evaluated the effectiveness of our method through experiments conducted on public datasets. In comparison to other state-of-the-art methods, our approach outperforms in terms of probability of detection (P_d), false-alarm rate (F_a), and mean intersection of union ($mIoU$). The $mIoU$ reaches 90.89 on the NUDT-SIRST dataset and 79.72 on the NUAA-SIRST dataset.
Abstract:Fourier single-pixel imaging (FSI) is a data-efficient single-pixel imaging (SPI). However, there is still a serious challenge to obtain higher imaging quality using fewer measurements, which limits the development of real-time SPI. In this work, a uniform-sampling foveated FSI (UFFSI) is proposed with three features, uniform sampling, effective sampling and flexible fovea, to achieve under-sampling high-efficiency and high-quality SPI, even in a large-scale scene. First, by flexibly using the three proposed foveated pattern structures, data redundancy is reduced significantly to only require high resolution (HR) on regions of interest (ROIs), which radically reduces the need of total data number. Next, by the non-uniform weight distribution processing, non-uniform spatial sampling is transformed into uniform sampling, then the fast Fourier transform is used accurately and directly to obtain under-sampling high imaging quality with further reduced measurements. At a sampling ratio of 0.0084 referring to HR FSI with 1024*768 pixels, experimentally, by UFFSI with 255*341 cells of 89% reduction in data redundancy, the ROI has a significantly better imaging quality to meet imaging needs. We hope this work can provide a breakthrough for future real-time SPI.