Abstract:Convolutional neural networks (CNNs) excel at local feature extraction, while Transformers are superior at processing global semantic information. By leveraging the strengths of both, hybrid Transformer-CNN networks have become the dominant architectures for medical image segmentation tasks. However, existing hybrid methods still suffer from deficient learning of local semantic features due to the fixed receptive fields of convolutions, and also fall short in effectively integrating local and long-range dependencies. To address these issues, we develop PARF-Net, a new method that integrates convolutions with Pixel-wise Adaptive Receptive Fields (Conv-PARF) into a hybrid network for medical image segmentation. Conv-PARF is introduced to cope with inter-pixel semantic differences and dynamically adjusts the convolutional receptive field for each pixel, providing distinguishable features that disentangle lesions of varying shapes and scales from the background. The features derived from the Conv-PARF layers are further processed by lightweight hybrid Transformer-CNN blocks to effectively capture local and long-range dependencies, thus boosting segmentation performance. By assessing PARF-Net on four widely used medical image datasets, MoNuSeg, GlaS, DSB2018, and multi-organ Synapse, we demonstrate the advantages of our method over state-of-the-art approaches. For instance, PARF-Net achieves 84.27% mean Dice on the Synapse dataset, surpassing existing methods by a large margin.
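The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of one plausible reading of a pixel-wise adaptive receptive field convolution: parallel 3x3 branches with different dilation rates, mixed per pixel by softmax gating weights. The class name, number of branches, and dilation rates are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvPARFSketch(nn.Module):
    """Illustrative pixel-wise adaptive receptive field convolution (assumption).

    Parallel branches with different dilation rates see different receptive
    fields; a 1x1 gating head predicts per-pixel softmax weights that mix the
    branch outputs, so each pixel effectively selects its own receptive field.
    """
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.gate = nn.Conv2d(in_ch, len(dilations), kernel_size=1)

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        w = F.softmax(self.gate(x), dim=1).unsqueeze(2)            # (B, K, 1, H, W)
        return (w * feats).sum(dim=1)                              # (B, C, H, W)

# usage on a dummy feature map
y = ConvPARFSketch(64, 64)(torch.randn(1, 64, 32, 32))
```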
Abstract:My research objective is to bridge the gap between the high computational performance and the low power dissipation of robot on-board hardware by designing a bio-inspired tapered-whisker neuromorphic computing (also called reservoir computing) system for off-road robot environment perception and navigation, one that centres on the interaction between a robot's body and its environment. Mobile robots performing tasks in unknown environments need to traverse a variety of complex terrains, and they must be able to identify and characterize these terrains reliably and quickly to avoid potentially challenging or catastrophic situations. To solve this problem, I drew inspiration from animals such as rats and seals, which rely solely on whiskers to perceive their surroundings and survive in dark, narrow environments, and from the human cochlea, which can separate different frequencies of sound. Based on these insights, my work addresses this need by exploring physical whisker-based reservoir computing for fast and cost-efficient mobile-robot environment perception and navigation, step by step. This research could help us understand how the compliance of these biological structures helps robots interact dynamically with the environment, and it provides a new alternative to current methods for robot environment perception and navigation under limited computational resources, for example on Mars.
Abstract:Existing differentiable channel pruning methods often attach scaling factors or masks to channels to prune filters of lower importance, and assume that all input samples contribute uniformly to filter importance. In particular, the effect of instance complexity on pruning performance has not yet been fully investigated. In this paper, we propose CAP, a simple yet effective differentiable network pruning method based on instance-complexity-aware filter importance scores. We define an instance-complexity-related weight for each sample, giving higher weights to hard samples, and compute the weighted sum of sample-specific soft masks to model the non-uniform contributions of different inputs; this encourages hard samples to dominate the pruning process and preserves model performance. In addition, we introduce a new regularizer that encourages polarization of the masks, so that a sweet spot can easily be found to identify the filters to be pruned. Performance evaluations on various network architectures and datasets demonstrate that CAP outperforms state-of-the-art methods in pruning large networks. For instance, CAP improves the accuracy of ResNet56 on the CIFAR-10 dataset by 0.33% after removing 65.64% of FLOPs, and prunes 87.75% of the FLOPs of ResNet50 on the ImageNet dataset with only 0.89% Top-1 accuracy loss.
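As a rough illustration of the ideas named in the abstract (instance-complexity weights, a weighted sum of sample-specific soft masks, and mask polarization), here is a small PyTorch sketch. The functional forms below are assumptions chosen for clarity: softmax-of-loss weights, sigmoid masks, and an m*(1-m) polarization term are stand-ins, not the paper's actual formulations.

```python
import torch

def instance_weights(per_sample_loss, tau=1.0):
    """Assumption: harder samples (higher loss) receive larger weights."""
    return torch.softmax(per_sample_loss.detach() / tau, dim=0)

def weighted_soft_mask(sample_masks, weights):
    """sample_masks: (B, C) per-sample soft masks in [0, 1];
    weights: (B,) instance-complexity weights summing to 1."""
    return (weights.unsqueeze(1) * sample_masks).sum(dim=0)  # (C,)

def polarization_reg(mask):
    """Illustrative regularizer pushing each mask entry toward 0 or 1,
    so the kept/pruned split is easy to threshold."""
    return (mask * (1.0 - mask)).sum()

# sketch of one step's pruning terms (B samples, C channels)
B, C = 8, 64
per_sample_loss = torch.rand(B)                                   # e.g. per-sample cross-entropy
sample_masks = torch.sigmoid(torch.randn(B, C, requires_grad=True))
w = instance_weights(per_sample_loss)
m = weighted_soft_mask(sample_masks, w)
reg = polarization_reg(m)                                         # added to the training loss
```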
Abstract:This paper presents analytical and experimental evidence that the vibration dynamics of a compliant whisker can be used for accurate terrain classification during steady-state motion of a mobile robot. A Hall effect sensor was used to measure whisker vibrations caused by perturbations from the ground. Analytical results predict that the whisker vibrations have a dominant frequency at the vertical perturbation frequency of the mobile robot, flanked by two less dominant but distinct frequency components. These components may arise from bifurcation of the vibration frequency due to nonlinear interaction dynamics at steady state. Experimental results likewise exhibit distinct dominant frequency components unique to the robot's speed and the terrain roughness. This nonlinear dynamic feature is used in a deep multi-layer perceptron neural network to classify terrains. We achieved an 85.6% prediction success rate for seven flat terrain surfaces with different textures.
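The pipeline implied by the abstract, frequency-domain features from the whisker signal fed to a multi-layer perceptron with seven terrain classes, can be sketched as below. The sampling rate, window length, layer sizes, and the use of a plain magnitude spectrum are assumptions for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn

def spectrum_features(signal, fs):
    """Magnitude spectrum of a whisker vibration signal (Hall-effect readings).
    Dominant peaks encode the robot's speed and the terrain roughness."""
    mag = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs, mag

# illustrative multi-layer perceptron over a 1024-sample window (513 FFT bins),
# classifying seven terrain surfaces; sizes are assumptions
mlp = nn.Sequential(
    nn.Linear(513, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 7),
)
_, mag = spectrum_features(np.random.randn(1024), fs=200.0)  # fs is a placeholder
logits = mlp(torch.tensor(mag, dtype=torch.float32))
```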
Abstract:In this paper, we propose a spatial-temporal relational reasoning network (STRRN) approach to the problem of omni-supervised face alignment in videos. Unlike existing fully supervised methods, which rely on numerous manual annotations, our learner exploits large-scale unlabeled videos together with the available labeled data to generate auxiliary plausible training annotations. Motivated by the fact that neighbouring facial landmarks are usually correlated and coherent across consecutive frames, our approach automatically reasons about discriminative spatial-temporal relationships among landmarks for stable face tracking. Specifically, we develop an interpretable and efficient network module that disentangles facial geometry relationships for every static frame and simultaneously enforces bi-directional cycle-consistency across adjacent frames, thereby modeling the intrinsic spatial-temporal relations of raw face sequences. Extensive experimental results demonstrate that our approach surpasses most fully supervised state-of-the-art methods.
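The bi-directional cycle-consistency idea mentioned above can be illustrated with a toy PyTorch snippet: track landmarks from frame t to t+1, track them back, and penalize deviation from the starting positions. The TinyTracker module and the 68-landmark setting are hypothetical stand-ins for the paper's spatial-temporal relational module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTracker(nn.Module):
    """Hypothetical stand-in for a landmark tracker: predicts per-landmark
    offsets from one frame to an adjacent frame."""
    def __init__(self, n_landmarks=68):
        super().__init__()
        self.offset = nn.Linear(n_landmarks * 2, n_landmarks * 2)

    def forward(self, pts):                                   # pts: (B, N, 2)
        b, n, _ = pts.shape
        return pts + self.offset(pts.reshape(b, -1)).reshape(b, n, 2)

def cycle_consistency_loss(fwd, bwd, pts_t):
    """Track t -> t+1 with `fwd`, back to t with `bwd`, and penalize
    deviation from the starting landmarks."""
    return F.mse_loss(bwd(fwd(pts_t)), pts_t)

# usage on random landmarks (2 faces, 68 points each)
loss = cycle_consistency_loss(TinyTracker(), TinyTracker(), torch.randn(2, 68, 2))
```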