Abstract:In computer vision tasks, the ability to focus on relevant regions within an image is crucial for improving model performance, particularly when key features are small, subtle, or spatially dispersed. Convolutional neural networks (CNNs) typically treat all regions of an image equally, which can lead to inefficient feature extraction. To address this challenge, I have introduced Vision Eagle Attention, a novel attention mechanism that enhances visual feature extraction using convolutional spatial attention. The model applies convolution to capture local spatial features and generates an attention map that selectively emphasizes the most informative regions of the image. This attention mechanism enables the model to focus on discriminative features while suppressing irrelevant background information. I have integrated Vision Eagle Attention into a lightweight ResNet-18 architecture, demonstrating that this combination results in an efficient and powerful model. I have evaluated the performance of the proposed model on three widely used benchmark datasets: FashionMNIST, Intel Image Classification, and OracleMNIST, with a primary focus on image classification. Experimental results show that the proposed approach improves classification accuracy. Additionally, this method has the potential to be extended to other vision tasks, such as object detection, segmentation, and visual tracking, offering a computationally efficient solution for a wide range of vision-based applications. Code is available at: https://github.com/MahmudulHasan11085/Vision-Eagle-Attention.git
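The attention mechanism described above lends itself to a compact implementation. Below is a minimal PyTorch sketch of how a convolutional spatial attention block of this kind might be wired into a ResNet-18 backbone; the kernel size, sigmoid gating, and insertion point after the first residual stage are illustrative assumptions, not the released Vision Eagle Attention code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ConvSpatialAttention(nn.Module):
    """Illustrative convolutional spatial attention: a small convolution produces a
    per-pixel attention map that reweights the input feature map."""
    def __init__(self, in_channels, kernel_size=7):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size, padding=kernel_size // 2),
            nn.Sigmoid(),  # attention weights in [0, 1]
        )

    def forward(self, x):
        return x * self.attn(x)  # emphasize informative regions, suppress background

class AttentionResNet18(nn.Module):
    """ResNet-18 with an attention block inserted after the first residual stage
    (the placement is an assumption made for this sketch)."""
    def __init__(self, num_classes=10):
        super().__init__()
        backbone = resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1 = backbone.layer1
        self.attn = ConvSpatialAttention(64)
        self.rest = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4, backbone.avgpool)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.attn(self.layer1(self.stem(x)))
        x = self.rest(x)
        return self.fc(torch.flatten(x, 1))

logits = AttentionResNet18(num_classes=10)(torch.randn(2, 3, 224, 224))  # -> (2, 10)
```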
Abstract:Despite the strong prediction power of deep learning models, their interpretability remains an important concern. Disentanglement models increase interpretability by decomposing the latent space into interpretable subspaces. In this paper, we propose the first disentanglement method for pathology images. We focus on the task of detecting tumor-infiltrating lymphocytes (TIL). We propose several ideas, including cascading disentanglement, a novel architecture, and reconstruction branches. We achieve superior performance on complex pathology images, thus improving the interpretability and even the generalization power of TIL detection deep learning models. Our code is available at https://github.com/Shauqi/SS-cVAE.
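As an illustration of the kind of latent-space decomposition the abstract refers to, here is a minimal sketch of a conditional VAE whose latent vector is split into two interpretable subspaces (e.g., TIL-related vs. background factors). The split sizes, layer widths, and conditioning scheme are assumptions for illustration and do not reproduce the SS-cVAE architecture.

```python
import torch
import torch.nn as nn

class DisentangledCVAE(nn.Module):
    """Sketch of a conditional VAE with a latent space split into two subspaces
    (dimensions and layer sizes are illustrative assumptions)."""
    def __init__(self, in_dim=1024, z_til=16, z_bg=16, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim + n_classes, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_til + z_bg)
        self.logvar = nn.Linear(256, z_til + z_bg)
        self.dec = nn.Sequential(nn.Linear(z_til + z_bg + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))
        self.z_til = z_til

    def forward(self, x, y_onehot):
        h = self.enc(torch.cat([x, y_onehot], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        z_til, z_bg = z[:, :self.z_til], z[:, self.z_til:]        # interpretable subspaces
        recon = self.dec(torch.cat([z, y_onehot], dim=1))         # reconstruction branch
        return recon, mu, logvar, z_til, z_bg
```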
Abstract:Ophthalmic diseases represent a significant global health issue, necessitating the use of advanced, precise diagnostic tools. Optical Coherence Tomography (OCT) imagery, which offers high-resolution cross-sectional images of the retina, has become a pivotal imaging modality in ophthalmology. Traditionally, physicians have manually detected various diseases and biomarkers from such diagnostic imagery. In recent times, deep learning techniques have been extensively used for medical diagnostic tasks, enabling fast and precise diagnosis. This paper presents a novel approach for ophthalmic biomarker detection using an ensemble of a Convolutional Neural Network (CNN) and a Vision Transformer. While CNNs are good at extracting features within the local context of an image, transformers are known for their ability to extract features from the global context of the image. Using an ensemble of both techniques allows us to harness the best of both worlds. Our method has been implemented on the OLIVES dataset to detect 6 major biomarkers from OCT images and shows a significant improvement in the macro-averaged F1 score on the dataset.
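A simple way to realize such a CNN/Transformer ensemble is to let each model produce per-biomarker probabilities and fuse them. The sketch below uses off-the-shelf torchvision backbones (ResNet-50 and ViT-B/16) and probability averaging; the specific backbones, heads, and fusion rule are assumptions made for illustration, not necessarily the configuration used on OLIVES.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, vit_b_16

class CnnVitEnsemble(nn.Module):
    """Illustrative ensemble: a CNN (local features) and a ViT (global context) each
    predict the 6 biomarkers; their probabilities are averaged."""
    def __init__(self, num_biomarkers=6):
        super().__init__()
        self.cnn = resnet50(weights=None)
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, num_biomarkers)
        self.vit = vit_b_16(weights=None)
        self.vit.heads = nn.Linear(768, num_biomarkers)  # replace the classification head

    def forward(self, x):
        # multi-label detection: sigmoid per biomarker, then average the two models
        p_cnn = torch.sigmoid(self.cnn(x))
        p_vit = torch.sigmoid(self.vit(x))
        return (p_cnn + p_vit) / 2

probs = CnnVitEnsemble()(torch.randn(2, 3, 224, 224))  # -> (2, 6) biomarker probabilities
```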
Abstract:To address the increasing need for efficient and accurate content moderation, we propose an efficient and lightweight deep classification ensemble structure. Our approach is based on a combination of simple visual features, designed for high-accuracy classification of violent content with low false positives. Our ensemble architecture utilizes a set of lightweight models with narrowed-down color features, and we apply it to both images and videos. We evaluated our approach on a large dataset of explosion and blast content and compared its performance to popular deep learning models such as ResNet-50. Our evaluation results demonstrate significant improvements in prediction accuracy, while benefiting from 7.64x faster inference and lower computation cost. While our approach is tailored to explosion detection, it can be applied to other similar content moderation and violence detection use cases as well. Based on our experiments, we propose a "think small, think many" philosophy in classification scenarios. We argue that transforming a single, large, monolithic deep model into a verification-based step-model ensemble of multiple small, simple, and lightweight models with narrowed-down visual features can possibly lead to predictions with higher accuracy.
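The "think small, think many" idea can be summarized as a verification cascade: each lightweight stage inspects a narrowed-down feature view and must confirm the detection before the next stage runs. The sketch below is an assumption-based illustration of that control flow; the feature extractors and stage models are left as generic callables returning a score in [0, 1] and are placeholders, not the paper's actual modules.

```python
def verification_cascade(image, stages, threshold=0.5):
    """Illustrative 'think small, think many' ensemble (a sketch, not the paper's code).
    `stages` is a list of (extract_features, small_model) pairs, where each small model
    returns a violence/explosion score in [0, 1]; every stage must confirm."""
    score = 0.0
    for extract_features, small_model in stages:
        score = small_model(extract_features(image))
        if score < threshold:
            return False, score   # early rejection keeps inference cheap on negatives
    return True, score            # all lightweight verifiers agreed -> flag the content
```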
Abstract:Computational experiments are exploited to find a well-designed processing path that optimizes material structures for desired properties. This requires understanding the interplay between processing-(micro)structure-property linkages using a multi-scale approach that connects the macro scale (process parameters) to the meso (homogenized properties) and micro (crystallographic texture) scales. Due to the nature of the problem's multi-scale modeling setup, the number of possible processing paths can grow exponentially as the decision tree becomes deeper, and the speed of traditional simulators reaches a critical computational threshold. To lessen the computational burden of predicting microstructural evolution under given loading conditions, we develop a neural network (NN)-based method with physics-infused constraints. The NN aims to learn the evolution of microstructures under each elementary process. Our method is effective and robust in finding optimal processing paths. In this study, our NN-based method is applied to maximize the homogenized stiffness of a Copper microstructure, and it is found to be 686 times faster, with 0.053% error in the resulting homogenized stiffness, compared to the traditional finite element simulator on a 10-process experiment.
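Conceptually, the surrogate maps a texture descriptor and a chosen elementary process to the descriptor after that process, and the physics-infused constraints enter as penalty terms in the loss. The sketch below is a hypothetical illustration: the descriptor dimension, network widths, and the particular constraints (non-negativity and normalization of texture weights) are assumed for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ProcessSurrogate(nn.Module):
    """Illustrative surrogate: given the current texture descriptor and a one-hot
    elementary process, predict the descriptor after that process (shapes assumed)."""
    def __init__(self, texture_dim=32, n_processes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(texture_dim + n_processes, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, texture_dim),
        )

    def forward(self, texture, process_onehot):
        return self.net(torch.cat([texture, process_onehot], dim=1))

def physics_infused_loss(pred, target):
    """Data loss plus soft physics penalties, e.g. texture weights should stay
    non-negative and sum to one (an assumed example of an infused constraint)."""
    data = torch.mean((pred - target) ** 2)
    nonneg = torch.mean(torch.relu(-pred))               # penalize negative weights
    norm = torch.mean((pred.sum(dim=1) - 1.0) ** 2)      # weights should sum to one
    return data + 10.0 * nonneg + 10.0 * norm
```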
Abstract:The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
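The data flow of the S4A block can be sketched as follows: self-attention runs over the frames inside each shot (short-range), and a sequence operator then mixes information across shots (long-range). In this illustration a depthwise 1-D convolution stands in for the actual S4 layer, so the sketch shows only the block structure under that substitution, not the TranS4mer implementation.

```python
import torch
import torch.nn as nn

class S4ABlock(nn.Module):
    """Structural sketch of an S4A-style block (a depthwise 1-D convolution is used
    here as a stand-in for the S4 layer; dimensions are assumptions)."""
    def __init__(self, dim=256, heads=4, kernel=9):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.inter = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, shots, frames_per_shot, dim)
        b, s, f, d = x.shape
        tokens = x.reshape(b * s, f, d)
        tokens = self.norm1(tokens + self.intra(tokens, tokens, tokens)[0])  # intra-shot attention
        seq = tokens.reshape(b, s * f, d)                                    # one long frame sequence
        seq = self.norm2(seq + self.inter(seq.transpose(1, 2)).transpose(1, 2))  # inter-shot mixing
        return seq.reshape(b, s, f, d)                                       # same shape, so blocks stack

out = S4ABlock()(torch.randn(2, 12, 4, 256))  # -> (2, 12, 4, 256)
```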
Abstract:Understanding the impact of tumor biology on the composition of nearby cells often requires characterizing the impact of biologically distinct tumor regions. Biomarkers have been developed to label biologically distinct tumor regions, but challenges arise because of differences in the spatial extent and distribution of the differentially labeled regions. In this work, we present a framework for systematically investigating the impact of distinct tumor regions on cells near the tumor borders, accounting for their cross-spatial distributions. We apply the framework to multiplex immunohistochemistry (mIHC) studies of pancreatic cancer and show its efficacy in demonstrating how biologically different tumor regions impact the immune response in the tumor microenvironment. Furthermore, we show that the proposed framework can be extended to large-scale whole slide image analysis.
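One building block of such a framework is a statistic that relates immune cells to a labeled tumor region while accounting for the region's spatial extent. The snippet below is a hypothetical example of such a statistic (nearest-distance counting with a size normalization); it is not the authors' exact procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def cells_near_region(immune_xy, region_xy, max_dist=50.0):
    """Count immune cells within a distance band of one labeled tumor region,
    normalized by the region's cell count as a proxy for its spatial extent
    (an assumed, illustrative choice)."""
    tree = cKDTree(region_xy)            # coordinates of cells in one tumor region label
    d, _ = tree.query(immune_xy)         # distance of each immune cell to that region
    near = np.sum(d <= max_dist)
    return near / len(region_xy)
```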
Abstract:Virtual Reality has become a significant element of education throughout the years. To understand the quality and advantages of these techniques, it is important to understand how they were developed and evaluated. Since COVID-19, the education system has changed drastically, shifting from classrooms with whiteboards and projectors to virtual meetings attended from one's own room on a laptop. In this respect, virtual reality in the laboratory, or the Virtual Laboratory, is the main focus of this research, which is intended to comprehend the work done on quality education from a distance using VR. As per the findings of the study, adopting virtual reality in education can help students learn more effectively and can also increase their perspective, enthusiasm, and knowledge of complex notions by offering them an interactive experience in which they can engage and learn. This highlights the importance of a significant expansion of VR use in learning; the majority of the reviewed studies employ scientific comparison approaches that compare students who use VR to those who learn by traditional methods.
Abstract:The development of Automatic License Plate Recognition (ALPR) systems has received much attention for English license plates. However, despite Bengali speakers forming the sixth largest language population in the world, no significant progress can be tracked for ALPR systems in Bengali-speaking countries or states, where traffic management is all the more alarming given inadequate road-safety measures. This paper reports a computationally efficient and reasonably accurate Automatic License Plate Recognition (ALPR) system for Bengali characters with a new end-to-end DNN model that we call the Bengali License Plate Network (BLPnet). The model's cascaded architecture, which detects the vehicle region prior to the vehicle license plate (VLP), is proposed to eliminate false positives and thus achieve higher VLP detection accuracy. Besides, a smaller set of trainable parameters is used to reduce the computational cost, making the system faster and more suitable for real-time applications. With a new Convolutional Neural Network (CNN)-based Bengali OCR engine and a word-mapping process, the model is invariant to character rotation and can readily extract, detect, and output the complete license plate number of a vehicle. Fed with real-time video footage at 17 frames per second (fps), the model can detect a vehicle with a Mean Squared Error (MSE) of 0.0152 and a mean license plate character recognition accuracy of 95%. Compared with other models, improvements of 5% and 20% were recorded for BLPnet over the prominent YOLO-based ALPR model and the Tesseract model in number-plate detection accuracy and time requirement, respectively.
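The cascaded pipeline can be illustrated as three stages applied in sequence: vehicle detection, plate detection restricted to vehicle regions, and Bengali OCR. In the sketch below the detector and OCR modules are placeholder callables and the frame is assumed to be a NumPy image array; this is only the control flow described in the abstract, not BLPnet's actual code.

```python
def recognize_plates(frame, vehicle_detector, plate_detector, bengali_ocr):
    """Cascaded ALPR flow: searching for plates only inside detected vehicle regions
    suppresses false positives before OCR (all three callables are placeholders)."""
    results = []
    for (x1, y1, x2, y2) in vehicle_detector(frame):          # stage 1: vehicle regions
        vehicle = frame[y1:y2, x1:x2]
        for (px1, py1, px2, py2) in plate_detector(vehicle):  # stage 2: plates within vehicles
            plate = vehicle[py1:py2, px1:px2]
            results.append(bengali_ocr(plate))                # stage 3: Bengali character recognition
    return results
```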
Abstract:Vision-based deep learning models can be promising for speech- and hearing-impaired and secret communications. While such non-verbal communication is primarily investigated with hand gestures and facial expressions, no research endeavour has been tracked so far for a lips-state (i.e., open/close) based interpretation/translation system. In support of this development, this paper reports two new Convolutional Neural Network (CNN) models for lips-state detection. Building upon two prominent lips landmark detectors, DLIB and MediaPipe, we simplify the lips-state model with a set of six key landmarks and use their distances for lips-state classification. Both models are thereby developed to count the opening and closing of the lips, so that a symbol can be classified from the total count. Varying frame rates, lip movements, and face angles are investigated to determine the effectiveness of the models. Our early experimental results demonstrate that the model with DLIB is relatively slower, with an average of 6 frames per second (FPS), but has a higher average detection accuracy of 95.25%. In contrast, the model with MediaPipe offers faster landmark detection, with an average of 20 FPS, and a detection accuracy of 94.4%. Both models could thus effectively interpret the lips state for translating non-verbal semantics into a natural language.
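With MediaPipe, the landmark-distance idea can be sketched as computing the vertical lip gap normalized by mouth width and thresholding it. The face-mesh indices and the threshold below are illustrative assumptions (the abstract does not specify which six landmarks or distances are used), so treat this only as a sketch of the approach.

```python
import mediapipe as mp

# Illustrative landmark indices and threshold (assumptions for this sketch).
UPPER_LIP, LOWER_LIP = 13, 14        # inner upper/lower lip points in the MediaPipe face mesh
LEFT_CORNER, RIGHT_CORNER = 61, 291  # mouth corners, used to normalize for face scale

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def lips_open(rgb_frame, ratio_threshold=0.25):
    """Return True if the lips are open in this frame: vertical lip gap / mouth width."""
    result = face_mesh.process(rgb_frame)
    if not result.multi_face_landmarks:
        return None                                    # no face found in this frame
    lm = result.multi_face_landmarks[0].landmark
    gap = abs(lm[UPPER_LIP].y - lm[LOWER_LIP].y)
    width = abs(lm[LEFT_CORNER].x - lm[RIGHT_CORNER].x)
    return gap / width > ratio_threshold

# Counting open->close transitions over a video stream yields the symbol count
# described in the abstract, e.g.:
#   state = lips_open(frame)
#   if prev_state and state is False: count += 1
```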