Abstract:A global threshold (e.g., 0.5) is often applied to determine which bounding boxes should be included in the final results for an object detection task. A higher threshold reduces false positives but may result in missing a significant portion of true positives. A lower threshold can increase detection recall but may also result in more false positives. Because of this, using a preset global threshold (e.g., 0.5) applied to all the bounding box candidates may lead to suboptimal solutions. In this paper, we propose a Test-time Self-guided Bounding-box Propagation (TSBP) method, leveraging Earth Mover's Distance (EMD) to enhance object detection in histology images. TSBP utilizes bounding boxes with high confidence to influence those with low confidence, leveraging visual similarities between them. This propagation mechanism enables bounding boxes to be selected in a controllable, explainable, and robust manner, which surpasses the effectiveness of using simple thresholds and uncertainty calibration methods. Importantly, TSBP does not necessitate additional labeled samples for model training or parameter estimation, unlike calibration methods. We conduct experiments on gland detection and cell detection tasks in histology images. The results show that our proposed TSBP significantly improves detection outcomes when working in conjunction with state-of-the-art deep learning-based detection networks. Compared to other methods such as uncertainty calibration, TSBP yields more robust and accurate object detection predictions while using no additional labeled samples. The code is available at https://github.com/jwhgdeu/TSBP.
Abstract:Micro-expressions (MEs) are spontaneous, unconscious facial expressions that have promising applications in various fields such as psychotherapy and national security. Thus, micro-expression recognition (MER) has attracted more and more attention from researchers. Although various MER methods have emerged especially with the development of deep learning techniques, the task still faces several challenges, e.g. subtle motion and limited training data. To address these problems, we propose a novel motion extraction strategy (MoExt) for the MER task and use additional macro-expression data in the pre-training process. We primarily pretrain the feature separator and motion extractor using the contrastive loss, thus enabling them to extract representative motion features. In MoExt, shape features and texture features are first extracted separately from onset and apex frames, and then motion features related to MEs are extracted based on the shape features of both frames. To enable the model to more effectively separate features, we utilize the extracted motion features and the texture features from the onset frame to reconstruct the apex frame. Through pre-training, the module is enabled to extract inter-frame motion features of facial expressions while excluding irrelevant information. The feature separator and motion extractor are ultimately integrated into the MER network, which is then fine-tuned using the target ME data. The effectiveness of proposed method is validated on three commonly used datasets, i.e., CASME II, SMIC, SAMM, and CAS(ME)3 dataset. The results show that our method performs favorably against state-of-the-art methods.
Abstract:Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and costly additional computational overheads brought by the per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring the balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overheads during the inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple different tasks. Demonstrating better generalization, this framework improves the Int6 quantization of the instruction-tuning LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we have shown that the proposed framework is 1.48x faster than the FP16 implementation while reducing approximately 2x memory usage.
Abstract:Face recognition has been of great importance in many applications as a biometric for its throughput, convenience, and non-invasiveness. Recent advancements in deep Convolutional Neural Network (CNN) architectures have boosted significantly the performance of face recognition based on two-dimensional (2D) facial texture images and outperformed the previous state of the art using conventional methods. However, the accuracy of 2D face recognition is still challenged by the change of pose, illumination, make-up, and expression. On the other hand, the geometric information contained in three-dimensional (3D) face data has the potential to overcome the fundamental limitations of 2D face data. We propose a multi-Channel deep 3D face network for face recognition based on 3D face data. We compute the geometric information of a 3D face based on its piecewise-linear triangular mesh structure and then conformally flatten geometric information along with the color from 3D to 2D plane to leverage the state-of-the-art deep CNN architectures. We modify the input layer of the network to take images with nine channels instead of three only such that more geometric information can be explicitly fed to it. We pre-train the network using images from the VGG-Face \cite{Parkhi2015} and then fine-tune it with the generated multi-channel face images. The face recognition accuracy of the multi-Channel deep 3D face network has achieved 98.6. The experimental results also clearly show that the network performs much better when a 9-channel image is flattened to plane based on the conformal map compared with the orthographic projection.
Abstract:In this white paper we provide a vision for 6G Edge Intelligence. Moving towards 5G and beyond the future 6G networks, intelligent solutions utilizing data-driven machine learning and artificial intelligence become crucial for several real-world applications including but not limited to, more efficient manufacturing, novel personal smart device environments and experiences, urban computing and autonomous traffic settings. We present edge computing along with other 6G enablers as a key component to establish the future 2030 intelligent Internet technologies as shown in this series of 6G White Papers. In this white paper, we focus in the domains of edge computing infrastructure and platforms, data and edge network management, software development for edge, and real-time and distributed training of ML/AI algorithms, along with security, privacy, pricing, and end-user aspects. We discuss the key enablers and challenges and identify the key research questions for the development of the Intelligent Edge services. As a main outcome of this white paper, we envision a transition from Internet of Things to Intelligent Internet of Intelligent Things and provide a roadmap for development of 6G Intelligent Edge.