Zhejiang University
Abstract:Video captioning generate a sentence that describes the video content. Existing methods always require a number of captions (\eg, 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of lexically constrained pseudo-labeling module and keyword-refined captioning module. Unlike the random sampling in natural language processing that may cause invalid modifications (\ie, edit words), the former module guides the model to edit words using some actions (\eg, copy, replace, insert, and delete) by a pretrained token-level classifier, and then fine-tunes candidate sentences by a pretrained language model. Meanwhile, the former employs the repetition penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences upon a pretrained video-text model. Moreover, to keep semantic consistency between pseudo-labeled sentences and video content, we develop the transformer-based keyword refiner with the video-keyword gated fusion strategy to emphasize more on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios. The code implementation is available at https://github.com/mlvccn/PKG_VidCap
Abstract:Graph neural networks (GNNs) have attracted considerable attention due to their diverse applications. However, the scarcity and quality limitations of graph data present challenges to their training process in practical settings. To facilitate the development of effective GNNs, companies and researchers often seek external collaboration. Yet, directly sharing data raises privacy concerns, motivating data owners to train GNNs on their private graphs and share the trained models. Unfortunately, these models may still inadvertently disclose sensitive properties of their training graphs (e.g., average default rate in a transaction network), leading to severe consequences for data owners. In this work, we study graph property inference attack to identify the risk of sensitive property information leakage from shared models. Existing approaches typically train numerous shadow models for developing such attack, which is computationally intensive and impractical. To address this issue, we propose an efficient graph property inference attack by leveraging model approximation techniques. Our method only requires training a small set of models on graphs, while generating a sufficient number of approximated shadow models for attacks. To enhance diversity while reducing errors in the approximated models, we apply edit distance to quantify the diversity within a group of approximated models and introduce a theoretically guaranteed criterion to evaluate each model's error. Subsequently, we propose a novel selection mechanism to ensure that the retained approximated models achieve high diversity and low error. Extensive experiments across six real-world scenarios demonstrate our method's substantial improvement, with average increases of 2.7% in attack accuracy and 4.1% in ROC-AUC, while being 6.5$\times$ faster compared to the best baseline.
Abstract:Concept activation vector (CAV) has attracted broad research interest in explainable AI, by elegantly attributing model predictions to specific concepts. However, the training of CAV often necessitates a large number of high-quality images, which are expensive to curate and thus limited to a predefined set of concepts. To address this issue, we propose Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within the certain pre-trained vision-language models (e.g., CLIP). This method allows training any CAV without labeled data, by utilizing the corresponding concept descriptions as guidance. To bridge the gap between vision-language model and the target model, we calculate the activation values of concept descriptions on a common pool of images (probe images) with vision-language model and utilize them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose the activation sample reweighting (ASR), serving as a model correction technique, to improve the performance of the target model in return. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly superior quality to previous CAV methods given any concept, and our model correction method achieves state-of-the-art performance compared to existing concept-based methods. Our code is available at https://github.com/hqhQAQ/LG-CAV.
Abstract:Attribution-based explanations are garnering increasing attention recently and have emerged as the predominant approach towards \textit{eXplanable Artificial Intelligence}~(XAI). However, the absence of consistent configurations and systematic investigations in prior literature impedes comprehensive evaluations of existing methodologies. In this work, we introduce {Meta-Rank}, an open platform for benchmarking attribution methods in the image domain. Presently, Meta-Rank assesses eight exemplary attribution methods using six renowned model architectures on four diverse datasets, employing both the \textit{Most Relevant First} (MoRF) and \textit{Least Relevant First} (LeRF) evaluation protocols. Through extensive experimentation, our benchmark reveals three insights in attribution evaluation endeavors: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across numerous cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets. Our findings underscore the necessity for future research in this domain to conduct rigorous evaluations encompassing a broader range of models and datasets, and to reassess the assumptions underlying the empirical success of different attribution methods. Our code is publicly available at \url{https://github.com/TreeThree-R/Meta-Rank}.
Abstract:Temporal networks are effective in capturing the evolving interactions of networks over time, such as social networks and e-commerce networks. In recent years, researchers have primarily concentrated on developing specific model architectures for Temporal Graph Neural Networks (TGNNs) in order to improve the representation quality of temporal nodes and edges. However, limited attention has been given to the quality of negative samples during the training of TGNNs. When compared with static networks, temporal networks present two specific challenges for negative sampling: positive sparsity and positive shift. Positive sparsity refers to the presence of a single positive sample amidst numerous negative samples at each timestamp, while positive shift relates to the variations in positive samples across different timestamps. To robustly address these challenges in training TGNNs, we introduce Curriculum Negative Mining (CurNM), a model-aware curriculum learning framework that adaptively adjusts the difficulty of negative samples. Within this framework, we first establish a dynamically updated negative pool that balances random, historical, and hard negatives to address the challenges posed by positive sparsity. Secondly, we implement a temporal-aware negative selection module that focuses on learning from the disentangled factors of recently active edges, thus accurately capturing shifting preferences. Extensive experiments on 12 datasets and 3 TGNNs demonstrate that our method outperforms baseline methods by a significant margin. Additionally, thorough ablation studies and parameter sensitivity experiments verify the usefulness and robustness of our approach. Our code is available at https://github.com/zziyue83/CurNM.
Abstract:Recent studies have delved into constructing generalist agents for open-world embodied environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce ODYSSEY, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. ODYSSEY comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new open-world benchmark includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task. Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate the planning and exploration capabilities of agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.
Abstract:Mini-batch Graph Transformer (MGT), as an emerging graph learning model, has demonstrated significant advantages in semi-supervised node prediction tasks with improved computational efficiency and enhanced model robustness. However, existing methods for processing local information either rely on sampling or simple aggregation, which respectively result in the loss and squashing of critical neighbor information.Moreover, the limited number of nodes in each mini-batch restricts the model's capacity to capture the global characteristic of the graph. In this paper, we propose LGMformer, a novel MGT model that employs a two-stage augmented interaction strategy, transitioning from local to global perspectives, to address the aforementioned bottlenecks.The local interaction augmentation (LIA) presents a neighbor-target interaction Transformer (NTIformer) to acquire an insightful understanding of the co-interaction patterns between neighbors and the target node, resulting in a locally effective token list that serves as input for the MGT. In contrast, global interaction augmentation (GIA) adopts a cross-attention mechanism to incorporate entire graph prototypes into the target node epresentation, thereby compensating for the global graph information to ensure a more comprehensive perception. To this end, LGMformer achieves the enhancement of node representations under the MGT paradigm.Experimental results related to node classification on the ten benchmark datasets demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/l-wd/LGMformer.
Abstract:Graph Neural Networks (GNNs) have emerged as a prominent framework for graph mining, leading to significant advances across various domains. Stemmed from the node-wise representations of GNNs, existing explanation studies have embraced the subgraph-specific viewpoint that attributes the decision results to the salient features and local structures of nodes. However, graph-level tasks necessitate long-range dependencies and global interactions for advanced GNNs, deviating significantly from subgraph-specific explanations. To bridge this gap, this paper proposes a novel intrinsically interpretable scheme for graph classification, termed as Global Interactive Pattern (GIP) learning, which introduces learnable global interactive patterns to explicitly interpret decisions. GIP first tackles the complexity of interpretation by clustering numerous nodes using a constrained graph clustering module. Then, it matches the coarsened global interactive instance with a batch of self-interpretable graph prototypes, thereby facilitating a transparent graph-level reasoning process. Extensive experiments conducted on both synthetic and real-world benchmarks demonstrate that the proposed GIP yields significantly superior interpretability and competitive performance to~the state-of-the-art counterparts. Our code will be made publicly available.
Abstract:Active Voltage Control (AVC) on the Power Distribution Networks (PDNs) aims to stabilize the voltage levels to ensure efficient and reliable operation of power systems. With the increasing integration of distributed energy resources, recent efforts have explored employing multi-agent reinforcement learning (MARL) techniques to realize effective AVC. Existing methods mainly focus on the acquisition of short-term AVC strategies, i.e., only learning AVC within the short-term training trajectories of a singular diurnal cycle. However, due to the dynamic nature of load demands and renewable energy, the operation states of real-world PDNs may exhibit significant distribution shifts across varying timescales (e.g., daily and seasonal changes). This can render those short-term strategies suboptimal or even obsolete when performing continuous AVC over extended periods. In this paper, we propose a novel temporal prototype-aware learning method, abbreviated as TPA, to learn time-adaptive AVC under short-term training trajectories. At the heart of TPA are two complementary components, namely multi-scale dynamic encoder and temporal prototype-aware policy, that can be readily incorporated into various MARL methods. The former component integrates a stacked transformer network to learn underlying temporal dependencies at different timescales of the PDNs, while the latter implements a learnable prototype matching mechanism to construct a dedicated AVC policy that can dynamically adapt to the evolving operation states. Experimental results on the AVC benchmark with different PDN sizes demonstrate that the proposed TPA surpasses the state-of-the-art counterparts not only in terms of control performance but also by offering model transferability. Our code is available at https://github.com/Canyizl/TPA-for-AVC.
Abstract:Structural pruning has emerged as a promising approach for producing more efficient models. Nevertheless, the community suffers from a lack of standardized benchmarks and metrics, leaving the progress in this area not fully comprehended. To fill this gap, we present the first comprehensive benchmark, termed \textit{PruningBench}, for structural pruning. PruningBench showcases the following three characteristics: 1) PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques; 2) PruningBench systematically evaluates 16 existing pruning methods, encompassing a wide array of models (e.g., CNNs and ViTs) and tasks (e.g., classification and detection); 3) PruningBench provides easily implementable interfaces to facilitate the implementation of future pruning methods, and enables the subsequent researchers to incorporate their work into our leaderboards. We provide an online pruning platform http://pruning.vipazoo.cn for customizing pruning tasks and reproducing all results in this paper. Codes will be made publicly available.