Abstract: The self-attention mechanism is the key component of the Transformer, but it is often criticized for its computational demands. Previous token-pruning works motivate their methods from the viewpoint of computation redundancy, yet they still need to load the full network and incur the same memory costs. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers, guided by entropy considerations. We identify that, for the attention layers in the bottom blocks, their subsequent MLP layers (i.e., two feed-forward layers) can elicit the same entropy quantity. Meanwhile, these accompanying MLPs are under-exploited, since they exhibit smaller feature entropy than the MLPs in the top blocks. We therefore propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into identity mappings, leaving only the MLP in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and reducing memory usage without compromising performance. Code is available at https://github.com/sihaoevery/lambda_vit.
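Conceptually, removing an attention layer means collapsing its residual sub-layer into an identity mapping, so the block degenerates to its MLP alone. Below is a minimal sketch of that idea on a timm DeiT-B (timm is an assumed dependency); the block indices and the `SkipAttention` stand-in are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import timm  # assumed available; provides the DeiT-B model named in the abstract

class SkipAttention(nn.Module):
    """Stand-in for a removed attention layer: the residual branch emits
    zeros, so the whole attention sub-layer reduces to x -> x."""
    def forward(self, x):
        return torch.zeros_like(x)

model = timm.create_model("deit_base_patch16_224", pretrained=False)
for idx in [0, 1, 2, 3]:  # hypothetical bottom blocks flagged by an entropy criterion
    model.blocks[idx].attn = SkipAttention()  # block now computes only its MLP path
```

Because the residual branch contributes zeros, the attention sub-layer becomes an exact identity, which is what makes it removable without touching the MLP weights.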
Abstract: Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid resource-intensive neural network training, especially in Neural Architecture Search (NAS). Recent studies show that existing training-free metrics have several limitations, such as limited correlation with ground-truth performance and poor generalisation across different search spaces and tasks. Hence, we propose Sample-Wise Activation Patterns and its derivative, SWAP-Score, a novel high-performance training-free metric that measures the expressivity of networks over a batch of input samples. The SWAP-Score is strongly correlated with ground-truth performance across various search spaces and tasks, outperforming 15 existing training-free metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. It can be further enhanced by regularisation, which leads to even higher correlations in cell-based search spaces and enables model-size control during the search. For example, Spearman's rank correlation coefficient between the regularised SWAP-Score and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90, significantly higher than the 0.80 of the second-best metric, NWOT. When integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and 9 minutes of GPU time, respectively.
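As a rough illustration of what an activation-pattern expressivity metric can look like, the sketch below counts distinct sample-wise binary ReLU patterns over a batch. This is one plausible reading of the idea, not the paper's exact definition of SWAP-Score, and the toy network is purely for demonstration.

```python
import torch
import torch.nn as nn

def activation_pattern_score(model: nn.Module, batch: torch.Tensor) -> int:
    """Count distinct binary ReLU activation patterns over the batch."""
    patterns, hooks = [], []

    def hook(_module, _inputs, out):
        patterns.append((out.flatten(1) > 0).int())  # active vs inactive units

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(batch)
    for h in hooks:
        h.remove()
    codes = torch.cat(patterns, dim=1)  # one concatenated binary code per sample
    return len({tuple(row.tolist()) for row in codes})

net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
print(activation_pattern_score(net, torch.randn(16, 3, 32, 32)))
```

A more expressive network separates more samples into distinct patterns, which is the intuition such a score exploits without any training.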
Abstract: Neural architecture search (NAS) aims to automate architecture engineering for neural networks. This often incurs a high computational overhead, because many candidate networks from the search space must be evaluated during the search. Predicting the networks' performance can alleviate this overhead by removing the need to evaluate every candidate. Developing such a predictor, however, typically requires a large number of evaluated architectures, which may be difficult to obtain. We address this challenge by proposing a novel evolutionary NAS strategy, Predictor-assisted E-NAS (PRE-NAS), which performs well even with an extremely small number of evaluated architectures. PRE-NAS leverages new evolutionary search strategies and integrates high-fidelity weight inheritance over generations. Unlike one-shot strategies, which may suffer from evaluation bias due to weight sharing, offspring candidates in PRE-NAS are topologically homogeneous, which circumvents this bias and leads to more accurate predictions. Extensive experiments on the NAS-Bench-201 and DARTS search spaces show that PRE-NAS can outperform state-of-the-art NAS methods. With a single GPU searching for only 0.6 days, PRE-NAS finds competitive architectures that achieve 2.40% and 24% test error rates on CIFAR-10 and ImageNet, respectively.
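To make the predictor-assisted idea concrete, here is a self-contained toy loop in which a surrogate pre-screens offspring so that only the most promising candidate per generation is actually evaluated. Every helper (the tuple encoding, the nearest-neighbour surrogate, and the toy fitness) is an assumption for illustration, not PRE-NAS itself.

```python
import random

# Toy "architecture": a tuple of 6 operation choices in {0, 1, 2, 3}.
def mutate(arch):
    i = random.randrange(len(arch))
    return arch[:i] + (random.randrange(4),) + arch[i + 1:]

def train_and_eval(arch):  # toy ground-truth score standing in for real training
    return -sum((g - 2) ** 2 for g in arch)

def fit_predictor(evaluated):  # nearest-neighbour surrogate over evaluated archs
    def predict(arch):
        nearest = min(evaluated,
                      key=lambda a: sum(x != y for x, y in zip(a, arch)))
        return evaluated[nearest]
    return predict

population = [tuple(random.randrange(4) for _ in range(6)) for _ in range(4)]
evaluated = {a: train_and_eval(a) for a in population}
for _ in range(10):                                # generations
    predictor = fit_predictor(evaluated)
    parent = max(evaluated, key=evaluated.get)
    children = [mutate(parent) for _ in range(20)]
    best_child = max(children, key=predictor)      # predictor pre-screens offspring
    evaluated[best_child] = train_and_eval(best_child)  # only one is truly evaluated
print(max(evaluated, key=evaluated.get))
```

The saving comes from the ratio in the inner loop: twenty offspring are ranked by the surrogate, but only one incurs the cost of training.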
Abstract: Existing parking recommendation solutions mainly focus on finding and suggesting parking spaces based only on unoccupied options. However, other factors associated with parking spaces can influence someone's choice of parking, such as the fare, parking rules, walking distance to the destination, travel time, and the likelihood of being unoccupied at a given time. More importantly, these factors may change over time and conflict with each other, which makes the recommendations produced by current parking recommender systems ineffective. In this paper, we propose a novel problem called multi-objective parking recommendation. We present a solution by designing a multi-objective parking recommendation engine called MoParkeR that considers these conflicting factors together. Specifically, we utilise a non-dominated sorting technique to compute a set of Pareto-optimal solutions consisting of recommended trade-off parking spots. We conduct extensive experiments on two real-world datasets to show the applicability of our multi-objective recommendation methodology.
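The core of non-dominated sorting is a Pareto-dominance test between candidate spots. The following sketch extracts the Pareto-optimal set for a toy objective vector (fare, walking distance, occupancy likelihood); the objectives and values are illustrative, not MoParkeR's implementation.

```python
def dominates(a, b):
    """True if spot `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are to be minimised here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(spots):
    """Keep only spots that no other spot dominates."""
    return [s for s in spots
            if not any(dominates(other, s) for other in spots if other is not s)]

# (fare in $, walking distance in m, probability of being occupied)
spots = [(4.0, 300, 0.2), (2.5, 650, 0.4), (6.0, 120, 0.1), (5.0, 500, 0.5)]
print(pareto_front(spots))  # the last spot is dominated by the first and dropped
```

Each surviving spot is a defensible trade-off: improving it on one objective would necessarily worsen it on another, which is exactly what a multi-objective recommender should surface.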
Abstract: Existing research on parking availability sensing mainly relies on extensive contextual and historical information. In practice, it is challenging to have such information available, as it requires the continuous collection of sensory signals. In this paper, we design an end-to-end transfer learning framework for parking availability sensing that predicts parking occupancy in areas where the parking data is insufficient to feed data-hungry models. This framework overcomes two main challenges: 1) many real-world cases cannot provide enough data for most existing data-driven models; 2) merging sensor data with heterogeneous contextual information is difficult due to differing urban fabric and spatial characteristics. Our work adopts the widely used concept of adversarial domain adaptation to predict parking occupancy in an area without abundant sensor data by leveraging data from other areas with similar features. We utilise more than 35 million parking data records from sensors placed in two different cities, one a city centre and the other a coastal tourist town. We also utilise heterogeneous spatio-temporal contextual information from external resources, including weather and points of interest. We quantify the strength of the proposed framework in different cases and compare it to existing data-driven approaches. The results show that the proposed framework outperforms existing methods and also provides several valuable insights for parking availability prediction.
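The standard construction behind adversarial domain adaptation is a gradient reversal layer between a shared feature extractor and a domain classifier. The PyTorch sketch below shows that general construction on toy tensors; the layer sizes and heads are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # reversed gradient for the encoder

feature_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU())  # shared encoder
occupancy_head = nn.Linear(32, 1)                         # predicts occupancy
domain_head = nn.Linear(32, 2)                            # source vs target city

x = torch.randn(16, 8)             # toy batch of contextual features
feats = feature_net(x)
occ_pred = occupancy_head(feats)
dom_pred = domain_head(GradReverse.apply(feats, 1.0))
# Minimising the domain loss now pushes `feature_net` toward
# city-invariant features, so occupancy knowledge transfers.
```

The reversal makes the encoder and domain classifier adversaries: the classifier tries to tell the cities apart while the encoder is trained to make them indistinguishable.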
Abstract: Typical person re-identification (re-ID) methods train a deep CNN to extract deep features and combine them with a distance metric for the final evaluation. In this work, we focus on exploiting the full information encoded in deep features to boost re-ID performance. First, we propose a Deep Feature Fusion (DFF) method to exploit the diverse information embedded in a deep feature. DFF treats each sub-feature as an information carrier and employs a diffusion process to exchange their information. Second, we propose an Adaptive Re-Ranking (ARR) method to exploit the contextual information encoded in the features of neighbors. ARR utilizes this contextual information to re-rank the retrieval results iteratively; in particular, it automatically incorporates more contextual information after each iteration so as to consider more matches. Third, we propose a strategy that combines DFF and ARR to further enhance performance. Extensive comparative evaluations demonstrate the superiority of the proposed methods on three large benchmarks.
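For intuition, a diffusion process of the kind DFF builds on can be sketched as repeated random-walk smoothing over a pairwise affinity matrix, letting neighboring features exchange similarity information. The toy below is an assumption-laden illustration of that general mechanism, not the paper's algorithm.

```python
import numpy as np

def diffuse(affinity, alpha=0.9, iters=20):
    """Random-walk diffusion: S <- alpha * P @ S + (1 - alpha) * I,
    where P is the row-normalized transition matrix of the affinity graph."""
    p = affinity / affinity.sum(axis=1, keepdims=True)
    s = np.eye(len(affinity))
    for _ in range(iters):
        s = alpha * (p @ s) + (1 - alpha) * np.eye(len(affinity))
    return s

feats = np.random.rand(10, 128)            # 10 toy gallery features
affinity = np.exp(feats @ feats.T / 128)   # toy pairwise affinities
refined = diffuse(affinity)                # context-aware similarities
```

After diffusion, each pairwise similarity reflects not only the two features themselves but also how their neighborhoods overlap, which is the contextual signal re-ranking methods exploit.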
Abstract: Image processing and recognition are an important part of modern society, with applications in fields such as advanced artificial intelligence, smart assistants, and security surveillance. The essential first step in almost all visual tasks with a static camera is background subtraction. Ensuring that this critical step is performed efficiently would therefore improve everything downstream, including object recognition and tracking and behavior comprehension. Although background subtraction methods have been applied for many years, meeting real-time requirements remains a challenge in practice. In this letter, we present a novel approach to implementing background subtraction: the proposed method operates on compressed, low-resolution grayscale images, which were found to preserve the salient information very well. To verify the feasibility of our methodology, two prevalent methods, ViBe and GMM, are used in the experiments. The results confirm the effectiveness of our approach.
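A minimal way to try the low-resolution grayscale idea is to downsample and decolorize each frame before feeding it to an off-the-shelf GMM subtractor, as in the OpenCV sketch below. The video source, target resolution, and display logic are illustrative choices, not the letter's experimental setup.

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2()  # GMM-based subtractor
cap = cv2.VideoCapture(0)                          # any static-camera source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # drop color
    small = cv2.resize(gray, (80, 60))              # aggressive downsampling
    mask = subtractor.apply(small)                  # foreground mask on the LG frame
    cv2.imshow("foreground", cv2.resize(mask, (640, 480)))
    if cv2.waitKey(1) == 27:                        # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```

Since the GMM runs per pixel, shrinking a frame from 640x480 to 80x60 cuts its workload by roughly a factor of 64, which is where the real-time gain comes from.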
Abstract: Visual salience detection originated over 500 million years ago and is one of nature's most efficient mechanisms. In contrast, many state-of-the-art computational saliency models are complex and inefficient. Most saliency models process high-resolution color (HC) images; however, insights into the evolutionary origins of visual salience detection suggest that achromatic low-resolution vision is essential to its speed and efficiency. Previous studies showed that low-resolution color and high-resolution grayscale images preserve saliency information. However, to our knowledge, no one has investigated whether saliency is preserved in low-resolution grayscale (LG) images. In this study, we explain the biological and computational motivation for LG, and show, through a range of human eye-tracking and computational modeling experiments, that saliency information is preserved in LG images. Moreover, we show that using LG images leads to significant speedups in model training and detection times and conclude by proposing LG images for fast and efficient salience detection.
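As an illustration of how cheap saliency on LG images can be, the sketch below converts an image to low-resolution grayscale and applies the classic spectral-residual approach. The method choice, resolution, and filter parameters are assumptions for demonstration, not the study's protocol.

```python
import cv2
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray):
    """Saliency map from the spectral residual of the log-amplitude spectrum."""
    f = np.fft.fft2(gray.astype(np.float64))
    log_amp = np.log(np.abs(f) + 1e-8)
    residual = log_amp - uniform_filter(log_amp, size=3)   # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * np.angle(f)))) ** 2
    sal = gaussian_filter(sal, sigma=2.5)                  # smooth the map
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

img = cv2.imread("scene.jpg")                                     # any test image
lg = cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), (64, 48))  # LG image
saliency_map = spectral_residual_saliency(lg)
```

On a 64x48 single-channel input, the FFTs and filters above touch roughly two orders of magnitude fewer values than on a full-resolution color frame, illustrating the speedups the abstract reports.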