Abstract: While 3D Gaussian Splatting (3D-GS) achieves photorealistic novel view synthesis, its performance degrades when the input images contain motion blur. In scenarios with rapid motion or low-light conditions, existing RGB-based deblurring methods struggle to model camera pose and radiance changes during exposure, which reduces reconstruction accuracy. Event cameras capture continuous brightness changes during exposure and can therefore help model motion blur and improve reconstruction quality. We propose Event-driven Bundle Adjusted Deblur Gaussian Splatting (EBAD-Gaussian), which reconstructs sharp 3D Gaussians from event streams and severely blurred images. The method jointly learns the parameters of these Gaussians and recovers the camera motion trajectory during the exposure time. Specifically, we first construct a blur loss by synthesizing multiple latent sharp images within the exposure time and minimizing the difference between the real and synthesized blurred images. We then use the event stream to supervise the light intensity changes between latent sharp images at arbitrary times within the exposure period, supplying the dynamic intensity information that is lost in RGB images. Furthermore, we optimize the latent sharp images at intermediate exposure times using the event-based double integral (EDI) prior, applying consistency constraints to enhance the detail and texture of the reconstructed images. Extensive experiments on synthetic and real-world datasets show that EBAD-Gaussian achieves high-quality 3D scene reconstruction from blurred images and event streams.
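The two supervision signals described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-squared-error form of both losses, the contrast threshold value, and the use of a per-pixel signed event count are all assumptions made for illustration. The sketch averages latent sharp images into a synthetic blurred image, and compares the log-intensity change between two latent images with the change implied by accumulated events.

```python
import numpy as np

def synthesize_blur(latent_sharp):
    """Average n latent sharp images of shape (n, H, W) into one blurred image."""
    return np.mean(latent_sharp, axis=0)

def blur_loss(latent_sharp, real_blurred):
    """MSE between the synthesized and the real blurred image (assumed loss form)."""
    return float(np.mean((synthesize_blur(latent_sharp) - real_blurred) ** 2))

def event_loss(img_a, img_b, event_count, threshold=0.2, eps=1e-6):
    """MSE between the rendered and the event-implied log-intensity change.

    event_count is the per-pixel signed sum of event polarities between the
    timestamps of img_a and img_b; threshold is a hypothetical contrast threshold.
    """
    predicted = np.log(img_b + eps) - np.log(img_a + eps)
    observed = threshold * event_count
    return float(np.mean((predicted - observed) ** 2))
```

In a full pipeline the latent images would be rendered from the 3D Gaussians at poses sampled along the estimated camera trajectory, so both losses would propagate gradients to the Gaussian parameters and the poses; the sketch only shows the image-space comparison.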
Abstract: Deep neural networks have demonstrated remarkable success in machine learning; however, they remain fundamentally ill-suited for Continual Learning (CL). Recent research has increasingly focused on achieving CL without rehearsal. Among rehearsal-free approaches, parameter isolation-based methods have proven particularly effective, enhancing CL by optimizing model weights for each incremental task. Despite their success, they do not optimize architectures tailored to distinct incremental tasks. To address this limitation, updating a group of models with different architectures offers a promising alternative to the traditional CL paradigm, which relies on a single unified model. Building on this insight, this study introduces a novel Population-based Continual Learning (PCL) framework. PCL extends CL to the architectural level by maintaining and evolving a population of neural network architectures, which are continually refined for the current task through neural architecture search (NAS). Importantly, the well-evolved population for the current incremental task is naturally inherited by the subsequent one, thereby facilitating forward transfer, a crucial objective in CL. Throughout the CL process, the population evolves, yielding task-specific architectures that collectively form a robust CL system. Experimental results demonstrate that PCL outperforms state-of-the-art rehearsal-free CL methods that employ a unified model, highlighting its potential as a new paradigm for CL.
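The population inheritance described in this abstract can be sketched as a toy evolutionary loop. This is an illustration under stated assumptions, not the PCL algorithm itself: architectures are represented as tuples of layer widths, the mutation operator and elitist selection are placeholders for an actual NAS procedure, and each task is reduced to a hypothetical fitness function. The point it shows is that the population evolved for one task is the starting population for the next.

```python
import random

def mutate(arch, rng):
    """Return a copy of arch with one layer width perturbed (toy NAS mutation)."""
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = max(1, child[i] + rng.choice([-8, 8]))
    return tuple(child)

def evolve(population, fitness, rng, generations=5):
    """Elitist evolution: keep the better half, refill with mutated survivors."""
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(1, size // 2)]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(size - len(survivors))]
        population = survivors + children
    return population

def continual_search(tasks, init_population, seed=0):
    """Evolve one population across a task sequence, one architecture per task."""
    rng = random.Random(seed)
    population = list(init_population)
    chosen = []
    for fitness in tasks:
        # The population evolved for the previous task is inherited here.
        population = evolve(population, fitness, rng)
        chosen.append(max(population, key=fitness))
    return chosen, population
```

Because selection is elitist, the best fitness in the population never decreases within a task, and the search for a new task starts from architectures already refined on earlier ones rather than from scratch.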
Abstract: Vision-language pre-training is an emerging and fast-developing research topic that transfers multi-modal knowledge from resource-rich pre-training tasks to resource-limited downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities, from simple to comprehensive (an individual label, a phrase, and a natural sentence), we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.