Abstract:Concept Bottleneck Models (CBMs) map the black-box visual representations extracted by deep neural networks onto a set of interpretable concepts and use the concepts to make predictions, enhancing the transparency of the decision-making process. Multimodal pre-trained models can match visual representations with textual concept embeddings, making it possible to obtain an interpretable concept bottleneck without expert concept annotations. Recent research has focused on establishing the concept bank and selecting high-quality concepts. However, it is challenging to construct a comprehensive concept bank through humans or large language models, which severely limits the performance of CBMs. In this work, we propose the Incremental Residual Concept Bottleneck Model (Res-CBM) to address the challenge of concept completeness. Specifically, the residual concept bottleneck model employs a set of optimizable vectors to complete missing concepts, and the incremental concept discovery module then converts the complemented vectors, whose meanings are unclear, into potential concepts in the candidate concept bank. Our approach can be applied to any user-defined concept bank as a post-hoc processing method to enhance the performance of any CBM. Furthermore, to measure the descriptive efficiency of CBMs, we propose the Concept Utilization Efficiency (CUE) metric. Experiments show that Res-CBM outperforms the current state-of-the-art methods in terms of both accuracy and efficiency and achieves performance comparable to black-box models across multiple datasets.
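The two steps named in the abstract above (optimizable vectors that fill in missing concepts, then conversion of those vectors into named candidate concepts) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the toy embeddings, and in particular the use of cosine-similarity nearest-neighbor lookup for the discovery step, which the abstract does not specify. Training of the residual vectors is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def bottleneck_scores(image_emb, concept_embs, residual_vecs):
    """Concept activations: known concepts first, then residual slots.

    Assumption: activations are plain dot products between the image
    embedding and each concept (or residual) embedding.
    """
    known = [dot(image_emb, c) for c in concept_embs]
    residual = [dot(image_emb, r) for r in residual_vecs]
    return known + residual

def nearest_candidate(residual_vec, candidate_bank):
    """Name of the candidate concept most cosine-similar to a residual vector.

    Assumption: discovery is a nearest-neighbor lookup; the abstract only
    says residual vectors are converted into candidate-bank concepts.
    """
    return max(candidate_bank, key=lambda name: cosine(residual_vec, candidate_bank[name]))

# Toy data (hypothetical): one image, two known concepts, one residual slot.
image_emb = [1.0, 0.0, 2.0]
concept_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
residual_vecs = [[0.1, 0.0, 0.9]]  # would be learned in practice
candidate_bank = {"striped": [0.0, 0.0, 1.0], "furry": [1.0, 1.0, 0.0]}

scores = bottleneck_scores(image_emb, concept_embs, residual_vecs)
discovered = nearest_candidate(residual_vecs[0], candidate_bank)
```

In this toy setup the residual slot carries the signal along the third embedding axis that neither known concept covers, and the lookup maps it to the candidate concept pointing the same way.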
Abstract:In this paper, we consider a novel and practical case for talking face video generation. Specifically, we focus on scenarios involving multi-person interactions, where the talking context, such as the audience or surroundings, is present. In these situations, video generation should take the context into consideration in order to produce video content that is naturally aligned with the driving audio and spatially coherent with the context. To achieve this, we provide a two-stage, cross-modal controllable video generation pipeline that takes facial landmarks as an explicit and compact control signal to bridge the driving audio, the talking context, and the generated videos. Inside this pipeline, we devise a 3D video diffusion model that allows efficient control of both the spatial conditions (landmarks and context video) and the audio condition for temporally coherent generation. The experimental results verify the advantage of the proposed method over other baselines in terms of audio-video synchronization, video fidelity, and frame consistency.
Abstract:In this work, we focus on exploring explicit fine-grained control of generative facial image editing, all while generating faithful and consistent personalized facial appearances. We identify the key challenge of this task as the exploration of disentangled conditional control in the generation process, and accordingly propose a novel diffusion-based framework, named DisControlFace, comprising two decoupled components. Specifically, we leverage an off-the-shelf diffusion reconstruction model as the backbone and freeze its pre-trained weights, which helps to reduce identity shift and recover editing-unrelated details of the input image. Furthermore, we construct a parallel control network that is compatible with the reconstruction backbone to generate spatial control conditions based on estimated explicit face parameters. Finally, we further reformulate the training pipeline into a masked-autoencoding form to effectively achieve disentangled training of our DisControlFace. Our DisControlNet can perform robust editing on any facial image through training on large-scale 2D in-the-wild portraits and also supports low-cost fine-tuning with few additional images to further learn diverse personalized priors of a specific person. Extensive experiments demonstrate that DisControlFace can generate realistic facial images corresponding to various face control conditions, while significantly improving the preservation of the personalized facial details.
Abstract:In-context learning, i.e., learning from in-context samples, is an impressive ability of Transformers. However, the mechanism driving in-context learning is not yet fully understood. In this study, we investigate it from the underexplored perspective of representation learning. The representation is more complex in the in-context learning scenario, where it can be affected by both model weights and in-context samples. We refer to these two conceptual aspects of the representation as the in-weights component and the in-context component, respectively. To study how the two components affect in-context learning capabilities, we construct a novel synthetic task that makes it possible to devise two probes, an in-weights probe and an in-context probe, to evaluate the two components, respectively. We demonstrate that the quality of the in-context component is highly related to in-context learning performance, which indicates the entanglement between in-context learning and representation learning. Furthermore, we find that a good in-weights component can actually benefit the learning of the in-context component, indicating that in-weights learning should be the foundation of in-context learning. To further understand the in-context learning mechanism and the importance of the in-weights component, we prove by construction that a simple Transformer, which uses pattern matching and a copy-paste mechanism to perform in-context learning, can match the in-context learning performance of a more complex, best-tuned Transformer under the perfect in-weights component assumption. In short, these discoveries from the representation learning perspective shed light on new approaches to improving in-context capacity.
Abstract:Domain generalization aims to address the domain shift between training and testing data. To learn domain-invariant representations, the model is usually trained on multiple domains. It has been found that the gradients of the network weights with respect to a specific task loss can characterize the task itself. In this work, under the assumption that the gradients of a specific domain's samples under the classification task also reflect the properties of that domain, we propose a Shape Guided Gradient Voting (SGGV) method for domain generalization. First, we introduce a shape prior via extra inputs to the network to guide gradient descent towards a shape-biased direction for better generalization. Second, we propose a new gradient voting strategy to remove outliers for robust optimization in the presence of shape guidance. To provide shape guidance, we add edges/sketches extracted from the training data as an explicit form of guidance and use texture-augmented images as an implicit form. We conduct experiments on several popular domain generalization datasets for the image classification task and show that our shape-guided gradient updating strategy brings a significant improvement in generalization.
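The abstract above does not state the voting rule itself, so the following is only a sketch of one simple way to vote over per-domain gradients. It assumes a coordinate-wise sign-majority rule: for each parameter, components whose sign disagrees with the majority are treated as outliers and dropped, and the remaining components are averaged. The function name `vote_gradients` and the toy gradients are hypothetical.

```python
def vote_gradients(domain_grads):
    """Coordinate-wise sign-majority vote over per-domain gradients.

    Assumption (not from the abstract): for every parameter, keep only the
    gradient components whose sign matches the majority sign across
    domains, average them, and return 0.0 when the vote is tied.
    """
    n_params = len(domain_grads[0])
    voted = []
    for j in range(n_params):
        column = [g[j] for g in domain_grads]
        # +1 for each positive component, -1 for each negative one.
        balance = sum((x > 0) - (x < 0) for x in column)
        if balance == 0:
            voted.append(0.0)  # tie: no reliable direction for this parameter
            continue
        majority = 1 if balance > 0 else -1
        kept = [x for x in column if x * majority > 0]
        voted.append(sum(kept) / len(kept))
    return voted

# Three hypothetical domains (e.g. original, edge, texture-augmented inputs),
# each contributing a gradient over three parameters.
grads = [
    [1.0, -2.0, 0.5],
    [3.0, -4.0, -0.5],
    [-5.0, -6.0, 0.0],
]
voted = vote_gradients(grads)
```

Here the first parameter drops the dissenting third domain, the second keeps all three, and the third is zeroed because the signs cancel.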
Abstract:Recognizing elementary underlying concepts from observations (disentanglement) and generating novel combinations of these concepts (compositional generalization) are fundamental abilities that allow humans to learn knowledge rapidly and generalize to new tasks, and with which deep learning models struggle. Towards human-like intelligence, various works on disentangled representation learning have been proposed, and recently some studies on compositional generalization have been presented. However, few works study the relationship between disentanglement and compositional generalization, and the observed results are inconsistent. In this paper, we study several typical disentangled representation learning works in terms of both disentanglement and compositional generalization abilities, and we provide an important insight: vector-based representation (using a vector instead of a scalar to represent a concept) is the key to enabling both good disentanglement and strong compositional generalization. This insight also resonates with neuroscience research showing that the brain encodes information in neuron population activity rather than in individual neurons. Motivated by this observation, we further propose a method to reform the scalar-based disentanglement works ($\beta$-TCVAE and FactorVAE) to be vector-based to increase both capabilities. We investigate the impact of the dimensions of the vector-based representation and one important question: whether better disentanglement indicates higher compositional generalization. In summary, our study demonstrates that it is possible to achieve both good concept recognition and novel concept composition, contributing an important step towards human-like intelligence.
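To make the scalar-based versus vector-based distinction above concrete, the sketch below partitions a flat latent code into concept slots and builds a novel combination by swapping one slot between two codes. It assumes the latent is a flat list split into equal-length slots; the helper names `split_concepts` and `recombine` and the toy values are illustrative, and no encoder or decoder is modeled.

```python
def split_concepts(latent, n_concepts):
    """Partition a flat latent into n_concepts equal-length concept vectors."""
    if len(latent) % n_concepts != 0:
        raise ValueError("latent length must be divisible by n_concepts")
    dim = len(latent) // n_concepts
    return [latent[i * dim:(i + 1) * dim] for i in range(n_concepts)]

def recombine(concepts_a, concepts_b, take_from_b):
    """Novel combination: slots listed in take_from_b come from b, the rest from a."""
    return [
        concepts_b[i] if i in take_from_b else concepts_a[i]
        for i in range(len(concepts_a))
    ]

latent_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
latent_b = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]

# Vector-based view: 3 concepts, each represented by a 2-dimensional vector.
vec_a = split_concepts(latent_a, 3)
vec_b = split_concepts(latent_b, 3)

# Scalar-based view of the same latent: 6 concepts, one scalar each.
scalar_a = split_concepts(latent_a, 6)

# Compose a new sample: concept 1 from b, concepts 0 and 2 from a.
novel = recombine(vec_a, vec_b, {1})
```

The same six latent dimensions give either six one-dimensional concepts or three two-dimensional ones; the abstract's claim concerns which of these layouts supports recombination better, which this sketch does not evaluate.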
Abstract:The neural radiance field (NeRF) has achieved remarkable success in modeling 3D scenes and synthesizing high-fidelity novel views. However, existing NeRF-based methods focus more on making full use of the image resolution to generate novel views and less on generating details under limited input resolution. By analogy with the extensive use of image super-resolution, NeRF super-resolution is an effective way to generate a high-resolution implicit representation of 3D scenes and has great application potential. So far, this important topic remains under-explored. In this paper, we propose a NeRF super-resolution method, named Super-NeRF, to generate a high-resolution NeRF from only low-resolution inputs. Given multi-view low-resolution images, Super-NeRF constructs a consistency-controlling super-resolution module to generate view-consistent high-resolution details for NeRF. Specifically, an optimizable latent code is introduced for each low-resolution input image to control the 2D super-resolution images so that they converge to a view-consistent output. The latent code of each low-resolution image is optimized synergistically with the target Super-NeRF representation to fully utilize the view consistency constraint inherent in NeRF construction. We verify the effectiveness of Super-NeRF on synthetic, real-world, and AI-generated NeRF datasets. Super-NeRF achieves state-of-the-art NeRF super-resolution performance in high-resolution detail generation and cross-view consistency.
Abstract:Imagine an interesting multimodal interactive scenario in which you can see, hear, and chat with an AI-generated digital character who is capable of behaving like Sheldon from The Big Bang Theory, as a DEEP copy from appearance to personality. Towards this fantastic multimodal chatting scenario, we propose a novel task, named Deep Personalized Character Creation (DPCC): creating personalized multimodal chat characters from multimodal data such as TV shows. Specifically, given a single- or multi-modality input (text, audio, video), the goal of DPCC is to generate a multi-modality (text, audio, video) response that is well matched to the personality of a specific character, such as Sheldon, and is also of high quality. To support this novel task, we further collect a character-centric multimodal dialogue dataset, named Deep Personalized Character Dataset (DPCD), from TV shows. DPCD contains character-specific multimodal dialogue data of ~10k utterances and ~6 hours of audio/video per character, which is around 10 times larger than existing related datasets. On DPCD, we present a baseline method for the DPCC task and create 5 Deep personalized digital Characters (DeepCharacters) from Big Bang TV Shows. We conduct both subjective and objective experiments to evaluate the multimodal responses from DeepCharacters in terms of characterization and quality. The results demonstrate that, on our collected DPCD dataset, the proposed baseline can create personalized digital characters that generate multimodal responses. Our collected DPCD dataset, the data collection code, and our baseline will be published soon.
Abstract:Layout generation aims to synthesize realistic graphic scenes consisting of elements with different attributes including category, size, position, and between-element relation. It is a crucial task for reducing the burden on heavy-duty graphic design works for formatted scenes, e.g., publications, documents, and user interfaces (UIs). Diverse application scenarios impose a big challenge in unifying various layout generation subtasks, including conditional and unconditional generation. In this paper, we propose a Layout Diffusion Generative Model (LDGM) to achieve such unification with a single decoupled diffusion model. LDGM views a layout of arbitrary missing or coarse element attributes as an intermediate diffusion status from a completed layout. Since different attributes have their individual semantics and characteristics, we propose to decouple the diffusion processes for them to improve the diversity of training samples and learn the reverse process jointly to exploit global-scope contexts for facilitating generation. As a result, our LDGM can generate layouts either from scratch or conditional on arbitrary available attributes. Extensive qualitative and quantitative experiments demonstrate our proposed LDGM outperforms existing layout generation models in both functionality and performance.
Abstract:In this paper, to understand the explainable factors underlying observations and to model the conditional generation process on these factors, we propose a new task, disentanglement of diffusion probabilistic models (DPMs), which takes advantage of the remarkable modeling ability of DPMs. To tackle this task, we further devise an unsupervised approach named DisDiff. For the first time, we achieve disentangled representation learning in the framework of diffusion probabilistic models. Given a pre-trained DPM, DisDiff can automatically discover the inherent factors behind the image data and disentangle the gradient fields of the DPM into sub-gradient fields, each conditioned on the representation of one discovered factor. We propose a novel Disentangling Loss for DisDiff to facilitate the disentanglement of the representation and sub-gradients. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of DisDiff.