Abstract:Due to the lack of effective mpox detection tools, the mpox virus continues to spread worldwide and has once again been declared a public health emergency of international concern by the World Health Organization. Deep learning-based mpox detection tools are crucial to alleviating the mpox outbreak. However, existing methods struggle to achieve a good trade-off between detection performance, parameter size, and model complexity, which is crucial for practical applications and widespread deployment, especially in resource-limited scenarios. Given the success of Mamba in modeling long-range dependencies and its linear complexity, we propose a lightweight hybrid architecture called MpoxMamba. MpoxMamba utilizes depthwise separable convolutions to extract local feature representations of mpox skin lesions and greatly enhances the model's ability to capture global contextual information through grouped Mamba modules. Experimental results on two widely recognized mpox datasets demonstrate that MpoxMamba outperforms existing mpox detection methods and state-of-the-art lightweight models. We also developed a web-based online application to provide free mpox detection services to the public in epidemic areas (http://5227i971s5.goho.co:30290). The source code of MpoxMamba is available at https://github.com/YubiaoYue/MpoxMamba.
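As a concrete illustration of the local-feature branch described above, the following is a minimal PyTorch sketch of a depthwise separable convolution block; the channel sizes, normalization, and activation are illustrative assumptions, not the released MpoxMamba implementation.

```python
# Illustrative sketch (not the authors' code): a depthwise separable convolution
# block of the kind described for extracting local lesion features.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    block = DepthwiseSeparableConv(32, 64)
    print(block(torch.randn(1, 32, 224, 224)).shape)  # torch.Size([1, 64, 224, 224])
```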
Abstract:Radio Frequency Fingerprint Identification (RFFI) has emerged as a pivotal task for reliable device authentication. Despite advancements in RFFI methods, background noise and intentional modulation features result in weak energy and subtle differences in the RFF features. These challenges diminish the capability of RFFI methods in feature representation, complicating the effective identification of device identities. This paper proposes a novel Multi-Periodicity Dependency Transformer (MPDFormer) to address these challenges. MPDFormer employs a spectrum-offset-based periodic embedding representation to augment the discrepancy of intrinsic features. We delve into the intricacies of the periodicity-dependency attention mechanism, integrating both inter-period and intra-period attention. This mechanism facilitates the extraction of both long- and short-range periodicity-dependency features, accentuating feature distinctions while concurrently attenuating the perturbations caused by background noise and weak-periodicity features. Empirical results demonstrate MPDFormer's superiority over established baseline methods, achieving a 0.07s inference time on an NVIDIA Jetson Orin NX.
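To make the inter-period and intra-period attention idea more tangible, here is a hedged PyTorch sketch that splits an embedded signal into fixed-length periods and applies attention within each period (short-range) and across periods (long-range). The period length, embedding size, and use of nn.MultiheadAttention are assumptions for illustration, not MPDFormer's actual layers.

```python
# Minimal sketch of intra-/inter-period attention over an embedded RF sequence.
import torch
import torch.nn as nn

class PeriodicityAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, period: int = 16):
        super().__init__()
        self.period = period
        self.intra = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); trim to a whole number of periods.
        b, t, d = x.shape
        p = self.period
        n = t // p
        x = x[:, : n * p].reshape(b * n, p, d)            # split into periods
        x, _ = self.intra(x, x, x)                        # short-range (intra-period)
        x = x.reshape(b, n, p, d).transpose(1, 2).reshape(b * p, n, d)
        x, _ = self.inter(x, x, x)                        # long-range (inter-period)
        return x.reshape(b, p, n, d).transpose(1, 2).reshape(b, n * p, d)

if __name__ == "__main__":
    layer = PeriodicityAttention()
    print(layer(torch.randn(2, 256, 64)).shape)  # torch.Size([2, 256, 64])
```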
Abstract:Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing schemes for 3D scene editing still exhibit certain shortcomings that hinder further interactive design. Such schemes typically adhere to fixed input patterns, limiting users' flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models and require intricate pipeline designs to integrate these models into the 3D reconstruction process. To address the aforementioned issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, which is centered around a large language model that allows for arbitrary textual input from users, interprets their intentions, and subsequently facilitates the autonomous invocation of the corresponding visual expert models. Furthermore, we design a scheme utilizing a Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design achieves complete decoupling between the 2D editing and 3D reconstruction processes, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without necessitating intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse visual editing effects, possessing strong scene comprehension and multi-round dialogue capabilities. The code is available at https://sk-fun.fun/CE3D.
Abstract:In-context learning (ICL) achieves remarkable performance in various domains such as knowledge acquisition, commonsense reasoning, and semantic understanding. However, its performance deteriorates significantly on emotion detection tasks, especially fine-grained emotion recognition, and the underlying reasons remain unclear. In this paper, we identify the reasons behind ICL's poor performance from the perspective of prototype theory and propose a method to address the issue. Specifically, we conduct extensive pilot experiments and find that ICL conforms to prototype theory on fine-grained emotion recognition. Based on this theory, we uncover the following deficiencies in ICL: (1) It relies on prototypes (example-label pairs) that are semantically similar but emotionally inaccurate to predict emotions. (2) It is prone to interference from irrelevant categories, affecting the accuracy and robustness of its predictions. To address these issues, we propose an Emotion Context Learning method (E-ICL) for fine-grained emotion recognition. E-ICL relies on more emotionally accurate prototypes to predict categories by referring to emotionally similar examples with dynamic labels. Simultaneously, E-ICL employs an exclusionary emotion prediction strategy to avoid interference from irrelevant categories, thereby increasing its accuracy and robustness. Note that the entire process is accomplished with the assistance of a plug-and-play emotion auxiliary model, without additional training. Experiments on the fine-grained emotion datasets EDOS, Empathetic-Dialogues, EmpatheticIntent, and GoEmotions show that E-ICL achieves superior emotion prediction performance. Furthermore, even when the emotion auxiliary model is less than 10% of the size of the LLMs, E-ICL can still boost the performance of LLMs by over 4% on multiple datasets.
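As an illustration of the two ingredients named above, retrieving emotionally similar demonstrations and excluding irrelevant categories, the following NumPy sketch shows one plausible realization; the embedding source, similarity measure, threshold, and function names are all hypothetical and not taken from the E-ICL paper.

```python
# Hypothetical sketch: pick demonstrations by emotional similarity (cosine over
# auxiliary-model embeddings) and drop categories scored as irrelevant.
import numpy as np

def select_examples(query_emb: np.ndarray, pool_embs: np.ndarray, k: int = 5):
    """Return indices of the k pool examples most emotionally similar to the query."""
    sims = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return np.argsort(-sims)[:k]

def exclude_categories(category_scores: dict, threshold: float = 0.05):
    """Keep only categories whose auxiliary-model score meets the threshold."""
    return {c: s for c, s in category_scores.items() if s >= threshold}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(select_examples(rng.normal(size=32), rng.normal(size=(100, 32))))
    print(exclude_categories({"joy": 0.4, "grief": 0.01, "anger": 0.2}))
```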
Abstract:Empathetic response generation aims to comprehend the cognitive and emotional states expressed in dialogue utterances and to generate appropriate responses. Psychological theories posit that comprehending emotional and cognitive states necessitates iteratively capturing and understanding associated words across dialogue utterances. However, existing approaches regard dialogue utterances either as a long sequence or as independent utterances for comprehension, and are thus prone to overlooking the words associated between them. To address this issue, we propose an Iterative Associative Memory Model (IAMM) for empathetic response generation. Specifically, we employ a novel second-order interaction attention mechanism to iteratively capture vital associated words between dialogue utterances and the situation, the dialogue history, and a memory module (which stores associated words), thereby comprehending the utterances accurately and in nuance. We conduct experiments on the Empathetic-Dialogue dataset. Both automatic and human evaluations validate the efficacy of the model. Meanwhile, variant experiments on LLMs also demonstrate that attending to associated words improves empathetic comprehension and expression.
Abstract:Empathetic response generation aims to generate empathetic responses by understanding the speaker's emotional feelings from the language of the dialogue. Recent methods capture emotional words in the communicators' language and construct them as static vectors to perceive nuanced emotions. However, linguistic research has shown that emotional words in language are dynamic and correlate with other semantic roles in grammar, i.e., words that carry semantic meaning. Previous methods overlook these two characteristics, which easily leads to misunderstandings of emotions and neglect of key semantics. To address this issue, we propose a dynamic Emotion-Semantic Correlation Model (ESCM) for empathetic dialogue generation. ESCM constructs dynamic emotion-semantic vectors through the interaction of context and emotions. We introduce dependency trees to reflect the correlations between emotions and semantics. Based on the dynamic emotion-semantic vectors and dependency trees, we propose a dynamic correlation graph convolutional network to guide the model in learning contextual meanings in dialogue and generating empathetic responses. Experimental results on the EMPATHETIC-DIALOGUES dataset show that ESCM understands semantics and emotions more accurately and expresses fluent and informative empathetic responses. Our analysis also indicates that correlations between emotions and semantics are frequently used in dialogues, which is of great significance for empathetic perception and expression.
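To clarify how a dependency tree can steer message passing, here is a minimal PyTorch sketch of a single graph-convolution layer over a dependency-tree adjacency matrix; the mean-aggregation normalization and dimensions are illustrative assumptions rather than ESCM's dynamic correlation GCN.

```python
# Sketch: one GCN layer whose adjacency matrix encodes dependency-tree edges.
import torch
import torch.nn as nn

class DependencyGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_tokens, in_dim); adj: (batch, num_tokens, num_tokens),
        # with 1 where two tokens are linked in the dependency tree.
        adj = adj + torch.eye(adj.size(-1), device=adj.device)  # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)      # node degrees
        return torch.relu(self.linear((adj / deg) @ h))         # mean aggregation

if __name__ == "__main__":
    layer = DependencyGCNLayer(128, 128)
    tokens = torch.randn(2, 10, 128)
    adj = torch.zeros(2, 10, 10)
    adj[:, 0, 1] = adj[:, 1, 0] = 1.0  # one example dependency edge
    print(layer(tokens, adj).shape)    # torch.Size([2, 10, 128])
```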
Abstract:Learning an optimal behavior policy for each agent in a multi-agent system is an essential yet difficult problem. Despite fruitful progress in multi-agent reinforcement learning, the challenge of dynamically determining whether two agents should exhibit consistent behaviors remains under-explored. In this paper, we propose a new approach that enables agents to learn whether their behaviors should be consistent with those of other agents by utilizing intrinsic rewards to learn the optimal policy for each agent. We begin by defining behavior consistency as the divergence between the output actions of two agents when they are provided with the same observation. Subsequently, we introduce a dynamic consistency intrinsic reward (DCIR) to stimulate agents to be aware of others' behaviors and determine whether to be consistent with them. Lastly, we devise a dynamic scale network (DSN) that provides learnable scale factors for each agent at every time step to dynamically ascertain whether to reward consistent behavior and the magnitude of the reward. We evaluate DCIR in multiple environments, including Multi-agent Particle, Google Research Football, and StarCraft II Micromanagement, demonstrating its efficacy.
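The behavior-consistency measure defined above, the divergence between two agents' output actions for the same observation, can be written down compactly; the sketch below uses a KL divergence over categorical policy logits, which is an assumption about the divergence choice rather than the paper's exact formulation.

```python
# Sketch: per-observation divergence between two agents' action distributions.
import torch
import torch.nn.functional as F

def behavior_consistency(logits_i: torch.Tensor, logits_j: torch.Tensor) -> torch.Tensor:
    """KL(pi_i || pi_j) per observation; lower values mean more consistent behavior."""
    log_p_i = F.log_softmax(logits_i, dim=-1)
    log_p_j = F.log_softmax(logits_j, dim=-1)
    return (log_p_i.exp() * (log_p_i - log_p_j)).sum(dim=-1)

if __name__ == "__main__":
    logits_i = torch.randn(4, 5)   # agent i's action logits on 4 observations
    logits_j = torch.randn(4, 5)   # agent j's logits on the same observations
    divergence = behavior_consistency(logits_i, logits_j)
    # A learnable scale factor (the DSN's role in the abstract) could then turn
    # +divergence or -divergence into an intrinsic reward, depending on whether
    # consistency should be encouraged at that time step.
    print(divergence)
```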
Abstract:Numerous diffusion models have recently been applied to image synthesis and editing. However, editing 3D scenes is still in its early stages. It poses various challenges, such as the requirement to design specific methods for different editing types, the need to retrain new models for various 3D scenes, and the absence of convenient human interaction during editing. To tackle these issues, we introduce a text-driven editing method, termed DN2N, which allows for the direct acquisition of a NeRF model with universal editing capabilities, eliminating the requirement for retraining. Our method employs off-the-shelf text-based editing models for 2D images to modify the 3D scene images, followed by a filtering process to discard poorly edited images that disrupt 3D consistency. We then treat the remaining inconsistency as a problem of removing noise perturbation, which can be solved by generating training data with similar perturbation characteristics. We further propose cross-view regularization terms to help the generalized NeRF model mitigate these perturbations. Our text-driven method allows users to edit a 3D scene with their desired description, which is more friendly, intuitive, and practical than prior works. Empirical results show that our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer. Most importantly, our method generalizes well, with editing abilities shared among a set of model parameters and without requiring a customized editing model for specific scenes, thus inferring novel views with editing effects directly from user input. The project website is available at http://sk-fun.fun/DN2N.
Abstract:We present a new category of physics-informed neural networks called the physics-informed variational embedding generative adversarial network (PI-VEGAN), which effectively tackles the forward, inverse, and mixed problems of stochastic differential equations. In these scenarios, the governing equations are known, but only a limited number of sensor measurements of the system parameters are available. We integrate the governing physical laws into PI-VEGAN with automatic differentiation, while introducing a variational encoder to approximate the latent variables of the actual distribution of the measurements. These latent variables are integrated into the generator to facilitate accurate learning of the characteristics of the stochastic differential equations. Our model consists of three components, namely the encoder, generator, and discriminator, each of which is updated alternately using the stochastic gradient descent algorithm. We evaluate the effectiveness of PI-VEGAN in addressing forward, inverse, and mixed problems that require the concurrent calculation of system parameters and solutions. Numerical results demonstrate that the proposed method achieves satisfactory stability and accuracy in comparison with the previous physics-informed generative adversarial network (PI-WGAN).
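The alternating update of the three components can be sketched as a standard PyTorch training loop; the modules and losses below are toy placeholders (the physics-informed residuals computed via automatic differentiation are only indicated in a comment), so this is a pattern sketch, not PI-VEGAN itself.

```python
# Pattern sketch: alternate SGD updates of encoder, generator, and discriminator.
import torch
import torch.nn as nn

encoder = nn.Linear(10, 4)        # stands in for the variational encoder
generator = nn.Linear(4 + 1, 1)   # latent variables + coordinate -> solution value
discriminator = nn.Linear(1, 1)   # stands in for the critic

opt_e = torch.optim.SGD(encoder.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(generator.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=1e-3)

measurements = torch.randn(32, 10)   # toy sensor measurements
coords = torch.rand(32, 1)           # toy collocation points

for step in range(3):
    # 1) Update the discriminator on real vs. generated samples.
    z = encoder(measurements)
    fake = generator(torch.cat([z, coords], dim=-1))
    real = torch.randn(32, 1)        # placeholder "real" observations
    d_loss = discriminator(fake.detach()).mean() - discriminator(real).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Update the generator and encoder to fool the critic; a physics residual
    #    computed with torch.autograd.grad would be added to g_loss here.
    z = encoder(measurements)
    fake = generator(torch.cat([z, coords], dim=-1))
    g_loss = -discriminator(fake).mean()
    opt_e.zero_grad()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    opt_e.step()
```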
Abstract:Neural Radiance Fields (NeRF) have been widely adopted as practical and versatile representations of 3D scenes, facilitating various downstream tasks. However, different architectures, including the plain Multi-Layer Perceptron (MLP), Tensors, low-rank Tensors, Hashtables, and their compositions, have their own trade-offs. For instance, Hashtables-based representations allow for faster rendering but lack clear geometric meaning, making spatial-relation-aware editing challenging. To address this limitation and maximize the potential of each architecture, we propose Progressive Volume Distillation with Active Learning (PVD-AL), a systematic distillation method that enables any-to-any conversion between different architectures. PVD-AL decomposes each structure into two parts and progressively performs distillation from shallower to deeper volume representations, leveraging effective information retrieved from the rendering process. Additionally, a three-level active learning technique provides continuous feedback during the distillation process, resulting in high-performance results. Empirical evidence is presented to validate our method on multiple benchmark datasets. For example, PVD-AL can distill an MLP-based model from a Hashtables-based model 10~20X faster and with 0.8dB~2dB higher PSNR than training the NeRF model from scratch. Moreover, PVD-AL permits the fusion of diverse features among distinct structures, enabling models with multiple editing properties and providing a more efficient model to meet real-time requirements. Project website: http://sk-fun.fun/PVD-AL.
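The core of any-to-any conversion is a teacher-student distillation step in which a frozen source representation supervises a target representation; the sketch below assumes point-wise MSE supervision between two stand-in radiance fields and omits the progressive decomposition and active-learning feedback described above.

```python
# Sketch: distill a frozen "teacher" scene representation into a "student" one.
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Stand-in for either architecture (MLP, tensor, or hash-based)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        return self.net(pts)  # (N, 4): RGB + density

teacher = TinyRadianceField().eval()   # pretrained source representation (frozen)
student = TinyRadianceField()          # target representation being distilled
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(3):
    pts = torch.rand(1024, 3) * 2 - 1          # points sampled in the scene volume
    with torch.no_grad():
        target = teacher(pts)                  # teacher supervision signal
    loss = nn.functional.mse_loss(student(pts), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```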