Abstract:With the rapid development of Deep Neural Networks (DNNs), they have been applied in numerous fields. However, research indicates that DNNs are susceptible to adversarial examples, and this is equally true in the multi-label domain. To further investigate multi-label adversarial examples, we introduce a novel type of attacks, termed "Showing Many Labels". The objective of this attack is to maximize the number of labels included in the classifier's prediction results. In our experiments, we select nine attack algorithms and evaluate their performance under "Showing Many Labels". Eight of the attack algorithms were adapted from the multi-class environment to the multi-label environment, while the remaining one was specifically designed for the multi-label environment. We choose ML-LIW and ML-GCN as target models and train them on four popular multi-label datasets: VOC2007, VOC2012, NUS-WIDE, and COCO. We record the success rate of each algorithm when it shows the expected number of labels in eight different scenarios. Experimental results indicate that under the "Showing Many Labels", iterative attacks perform significantly better than one-step attacks. Moreover, it is possible to show all labels in the dataset.
Abstract:Hierarchical text classification (HTC) is a special sub-task of multi-label classification (MLC) whose taxonomy is constructed as a tree and each sample is assigned with at least one path in the tree. Latest HTC models contain three modules: a text encoder, a structure encoder and a multi-label classification head. Specially, the structure encoder is designed to encode the hierarchy of taxonomy. However, the structure encoder has scale problem. As the taxonomy size increases, the learnable parameters of recent HTC works grow rapidly. Recursive regularization is another widely-used method to introduce hierarchical information but it has collapse problem and generally relaxed by assigning with a small weight (ie. 1e-6). In this paper, we propose a Hierarchy-aware Light Global model with Hierarchical local conTrastive learning (HiLight), a lightweight and efficient global model only consisting of a text encoder and a multi-label classification head. We propose a new learning task to introduce the hierarchical information, called Hierarchical Local Contrastive Learning (HiLCL). Extensive experiments are conducted on two benchmark datasets to demonstrate the effectiveness of our model.
Abstract:Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, while these features are important in distinguishing the specific domain tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modalities fashion attributes and characteristics. Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine-grained fashion features, making modelling fine-grained attributes more effective. Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly. We design proper prompt templates according to the format of fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and FashionSAP gets SOTA performances for four popular fashion tasks. The ablation study also shows the proposed abstract fashion symbols, and the attribute prompt method enables the model to acquire fine-grained semantics in the fashion domain effectively. The obvious performance gains from FashionSAP provide a new baseline for future fashion task research.
Abstract:Fine-grained supervision based on object annotations has been widely used for vision and language pre-training (VLP). However, in real-world application scenarios, aligned multi-modal data is usually in the image-caption format, which only provides coarse-grained supervision. It is cost-expensive to collect object annotations and build object annotation pre-extractor for different scenarios. In this paper, we propose a fine-grained self-supervision signal without object annotations from a replacement perspective. First, we propose a homonym sentence rewriting (HSR) algorithm to provide token-level supervision. The algorithm replaces a verb/noun/adjective/quantifier word of the caption with its homonyms from WordNet. Correspondingly, we propose a replacement vision-language modeling (RVLM) framework to exploit the token-level supervision. Two replaced modeling tasks, i.e., replaced language contrastive (RLC) and replaced language modeling (RLM), are proposed to learn the fine-grained alignment. Extensive experiments on several downstream tasks demonstrate the superior performance of the proposed method.
Abstract:The security issues in DNNs, such as adversarial examples, have attracted much attention. Adversarial examples refer to the examples which are capable to induce the DNNs return completely predictions by introducing carefully designed perturbations. Obviously, adversarial examples bring great security risks to the development of deep learning. Recently, Some defense approaches against adversarial examples have been proposed, however, in our opinion, the performance of these approaches are still limited. In this paper, we propose a new ensemble defense approach named the Negative Correlation Ensemble (NCEn), which achieves compelling results by introducing gradient directions and gradient magnitudes of each member in the ensemble negatively correlated and at the same time, reducing the transferability of adversarial examples among them. Extensive experiments have been conducted, and the results demonstrate that NCEn can improve the adversarial robustness of ensembles effectively.