for the DeepLearnPhysics Collaboration
Abstract: Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. We present, for the first time, the challenge of emotion-evoked image generation, which aims to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce four new evaluation metrics to systematically benchmark all methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model can identify emotional cues in the original images and edit them to elicit the desired emotions while preserving the semantic structure of the original images. All code, models, and data will be made public.
Abstract: In recent years, many convolutional neural network-based models have been designed for JPEG artifact reduction and have achieved notable progress. However, few methods are suitable for reducing compression artifacts in images compressed at extremely low bitrates. The main challenge is that a highly compressed image loses too much information, which makes it difficult to reconstruct a high-quality image. To address this issue, we propose a multimodal fusion learning method for text-guided JPEG artifact reduction, in which the corresponding text description not only provides potential prior information about the highly compressed image, but also serves as supplementary information to assist in image deblocking. We fuse image features and text semantic features from both global and local perspectives, and design a contrastive loss to produce visually pleasing results. Extensive experiments, including a user study, show that our method obtains better deblocking results than state-of-the-art methods.
Abstract: Deep neural networks have greatly improved the performance of single image super-resolution (SISR). Conventional methods still restore a single high-resolution (HR) solution based only on the image modality. However, image-level information is insufficient to predict adequate details and photo-realistic visual quality at large upscaling factors (x8, x16). In this paper, we propose a new perspective that regards SISR as a semantic image detail enhancement problem, with the goal of generating semantically reasonable HR images that are faithful to the ground truth. To enhance the semantic accuracy and the visual quality of the reconstructed image, we explore multi-modal fusion learning in SISR by proposing a Text-Guided Super-Resolution (TGSR) framework, which can effectively utilize information from the text and image modalities. Unlike existing methods, the proposed TGSR can generate HR image details that match the text descriptions through a coarse-to-fine process. Extensive experiments and ablation studies demonstrate the effectiveness of TGSR, which exploits the text reference to recover realistic images.
Abstract: Few-shot learning has been proposed and is rapidly emerging as a viable means of completing various tasks. Many few-shot models have been widely used for relation learning tasks. However, each of these models falls short in capturing a certain aspect of semantic features: for example, CNNs on long-range dependencies and Transformers on local features. It is difficult for a single model to adapt to various relation learning tasks, which results in a high-variance problem. An ensemble strategy can be competitive in improving the accuracy of few-shot relation extraction and mitigating the risk of high variance. This paper explores an ensemble approach to reduce the variance and introduces fine-tuning and feature attention strategies to calibrate relation-level features. Results on several few-shot relation learning tasks show that our model significantly outperforms the previous state-of-the-art models.
Abstract: Liquid Argon Time Projection Chambers (LArTPCs) are particle imaging detectors recording 2D or 3D images of trajectories of charged particles. Identifying points of interest in these images, namely the initial and terminal points of track-like particle trajectories such as muons and protons, and the initial points of electromagnetic shower-like particle trajectories such as electrons and gamma rays, is a crucial step in identifying and analyzing these particles and impacts the inference of physics signals such as neutrino interactions. The Point Proposal Network is designed to discover these specific points of interest. The algorithm predicts their spatial location with sub-voxel precision and also determines the category of the identified points of interest. Using the PILArNet public LArTPC data sample, in which the voxel resolution is 3 mm/voxel, as a benchmark, our algorithm successfully predicted 96.8% and 97.8% of 3D points within a distance of 3 and 10 voxels from the provided true point locations, respectively. For the predicted 3D points within 3 voxels of the closest true point locations, the median distance is found to be 0.25 voxels, achieving sub-voxel precision. In addition, we report our analysis of the mistakes where our algorithm's prediction differs from the provided true point positions by more than 10 voxels. Among 50 mistakes visually scanned, 25 were due to the definition of true position location, 15 were legitimate mistakes where a physicist cannot visually disagree with the algorithm's prediction, and 10 were genuine mistakes that we wish to improve in the future. Further, using these predicted points, we demonstrate a simple algorithm to cluster 3D voxels into individual track-like particle trajectories with a clustering efficiency, purity, and Adjusted Rand Index of 96%, 93%, and 91%, respectively.
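The point-matching figures quoted in the abstract above (fraction of points within 3 and 10 voxels, and the median distance for close matches) can be illustrated with a minimal sketch. It assumes the distance is measured from each predicted point to its closest true point, in voxel units; the abstract does not spell out the matching procedure, and the function names and toy coordinates below are hypothetical, not taken from the paper.

```python
import math
from statistics import median


def nearest_distances(predicted, truth):
    """Distance (in voxels) from each predicted point to its closest true point."""
    return [min(math.dist(p, t) for t in truth) for p in predicted]


def fraction_within(distances, radius):
    """Fraction of distances that are at most `radius` voxels."""
    return sum(d <= radius for d in distances) / len(distances)


# Toy coordinates (assumed for illustration only), in voxel units.
truth = [(10.0, 10.0, 10.0), (40.0, 5.0, 20.0)]
predicted = [(10.2, 10.0, 10.0), (40.0, 5.0, 24.0), (80.0, 80.0, 80.0)]

dists = nearest_distances(predicted, truth)
frac3 = fraction_within(dists, 3.0)    # share of predictions within 3 voxels
frac10 = fraction_within(dists, 10.0)  # share of predictions within 10 voxels

# Median distance restricted to predictions within 3 voxels of a true point.
median_close = median(d for d in dists if d <= 3.0)
```

With a 3 mm voxel, multiplying any of these distances by 3 converts it to millimetres.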
Abstract: Liquid Argon Time Projection Chambers (LArTPCs) are high-resolution particle imaging detectors, employed by accelerator-based neutrino oscillation experiments for high-precision physics measurements. While images of particle trajectories are intuitive for physicists to analyze, the development of a high-quality, automated data reconstruction chain remains challenging. One of the most critical reconstruction steps is particle clustering: the task of grouping 3D image pixels into different particle instances that share the same particle type. In this paper, we propose the first scalable deep learning algorithm for particle clustering in LArTPC data using sparse convolutional neural networks (SCNNs). Building on previous work on SCNNs and proposal-free instance segmentation, we build an end-to-end trainable instance segmentation network that learns an embedding of the image pixels to perform point cloud clustering in a transformed space. We benchmark the performance of our algorithm on PILArNet, a public 3D particle imaging dataset, with respect to common clustering evaluation metrics. 3D pixels were successfully clustered into individual particle trajectories, with 90% of them having an adjusted Rand index score greater than 92% and a mean pixel clustering efficiency and purity above 96%. This work contributes to the development of an end-to-end optimizable full data reconstruction chain for LArTPCs, in particular pixel-based 3D imaging detectors including the near detector of the Deep Underground Neutrino Experiment. Our algorithm is made available in the open access repository, and we share our Singularity software container, which can be used to reproduce our work on the dataset.
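The clustering metrics named in the abstract above can be sketched as follows. The adjusted Rand index uses its standard pair-counting definition. The efficiency and purity definitions here are an assumption (mean best-overlap fraction per true cluster and per predicted cluster, respectively), since the abstract does not define them; the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Standard pair-counting adjusted Rand index between two labelings."""
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: labelings trivially agree
    same_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = same_true * same_pred / total_pairs
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every point in a single cluster
    return (same_both - expected) / (max_index - expected)


def mean_best_overlap(reference, other):
    """Mean, over clusters in `reference`, of the largest fraction of the
    cluster's points that share a single label in `other` (assumed definition)."""
    sizes = Counter(reference)
    best = Counter()
    for (ref_label, _), count in Counter(zip(reference, other)).items():
        best[ref_label] = max(best[ref_label], count)
    return sum(best[label] / sizes[label] for label in sizes) / len(sizes)


# Toy per-pixel labels (assumed for illustration only).
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]

ari = adjusted_rand_index(true_labels, pred_labels)
efficiency = mean_best_overlap(true_labels, pred_labels)  # per true cluster
purity = mean_best_overlap(pred_labels, true_labels)      # per predicted cluster
```

In this toy case one pixel of the first true cluster is absorbed by the second predicted cluster, which lowers the efficiency of the first true cluster and the purity of the second predicted cluster.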
Abstract: Liquid Argon Time Projection Chambers (LArTPCs) are a class of detectors that produce high-resolution images of charged particles within their sensitive volume. In these images, the clustering of distinct particles into superstructures is of central importance to the current and future neutrino physics program. Electromagnetic (EM) activity typically exhibits spatially detached fragments of varying morphology and orientation that are challenging to assemble efficiently using traditional algorithms. Similarly, particles that are spatially removed from each other in the detector may originate from a common interaction. Graph Neural Networks (GNNs) were developed in recent years to find correlations between objects embedded in an arbitrary space. GNNs are first studied with the goal of predicting the adjacency matrix of EM shower fragments and identifying the origin of showers, i.e. primary fragments. On the PILArNet public LArTPC simulation dataset, the algorithm developed in this paper achieves a shower clustering accuracy characterized by a mean adjusted Rand index (ARI) of 97.8% and a primary identification accuracy of 99.8%. It yields a relative shower energy resolution of $(4.1+1.4/\sqrt{E (\text{GeV})})\,\%$ and a shower direction resolution of $(2.1/\sqrt{E(\text{GeV})})^{\circ}$. The optimized GNN is then applied to the related task of clustering particle instances into interactions and yields a mean ARI of 99.2% for an interaction density of $\sim\mathcal{O}(1)\,\text{m}^{-3}$.
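To make the two resolution parameterizations in the abstract above concrete, the short sketch below evaluates them at a few shower energies. The formulas are copied from the abstract; the function names and the chosen energies are hypothetical and serve only as an illustration.

```python
import math


def shower_energy_resolution_percent(energy_gev):
    """Relative shower energy resolution in percent: 4.1 + 1.4 / sqrt(E [GeV])."""
    return 4.1 + 1.4 / math.sqrt(energy_gev)


def shower_direction_resolution_deg(energy_gev):
    """Shower direction resolution in degrees: 2.1 / sqrt(E [GeV])."""
    return 2.1 / math.sqrt(energy_gev)


# Evaluate both parameterizations at a few example energies (assumed values).
for energy in (0.25, 1.0, 4.0):
    e_res = shower_energy_resolution_percent(energy)
    d_res = shower_direction_resolution_deg(energy)
    print(f"E = {energy:.2f} GeV: energy resolution {e_res:.1f} %, "
          f"direction resolution {d_res:.2f} deg")
```

Both expressions improve with energy as $1/\sqrt{E}$, but the energy resolution approaches a constant floor of 4.1% while the direction resolution keeps decreasing toward zero.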