Abstract:Graph Neural Networks (GNNs) have shown significant promise in various domains, such as recommendation systems, bioinformatics, and network analysis. However, the irregularity of graph data poses unique challenges for efficient computation, leading to the development of specialized GNN accelerator architectures that surpass traditional CPU and GPU performance. Despite this, the structural diversity of input graphs results in varying performance across different GNN accelerators, depending on their dataflows. This variability in performance due to differing dataflows and graph properties remains largely unexplored, limiting the adaptability of GNN accelerators. To address this, we propose a data-driven framework for dataflow-aware latency prediction in GNN inference. Our approach involves training regressors to predict the latency of executing specific graphs on particular dataflows, using simulations on synthetic graphs. Experimental results indicate that our regressors can predict the optimal dataflow for a given graph with up to 91.28% accuracy and a Mean Absolute Percentage Error (MAPE) of 3.78%. Additionally, we introduce an online scheduling algorithm that uses these regressors to enhance scheduling decisions. Our experiments demonstrate that this algorithm achieves up to $3.17\times$ speedup in mean completion time and $6.26\times$ speedup in mean execution time compared to the best feasible baseline across all datasets.
Abstract:In this work, we attempted to extend the thought and showcase a way forward for the Self-supervised Learning (SSL) learning paradigm by combining contrastive learning, self-distillation (knowledge distillation) and masked data modelling, the three major SSL frameworks, to learn a joint and coordinated representation. The proposed technique of SSL learns by the collaborative power of different learning objectives of SSL. Hence to jointly learn the different SSL objectives we proposed a new SSL architecture KDC-MAE, a complementary masking strategy to learn the modular correspondence, and a weighted way to combine them coordinately. Experimental results conclude that the contrastive masking correspondence along with the KD learning objective has lent a hand to performing better learning for multiple modalities over multiple tasks.
Abstract:In this work, we explore Self-supervised Learning (SSL) as an auxiliary task to blend the texture-based local descriptors into feature modelling for efficient face analysis. Combining a primary task and a self-supervised auxiliary task is beneficial for robust representation. Therefore, we used the SSL task of mask auto-encoder (MAE) as an auxiliary task to reconstruct texture features such as local patterns along with the primary task for robust and unbiased face analysis. We experimented with our hypothesis on three major paradigms of face analysis: face attribute and face-based emotion analysis, and deepfake detection. Our experiment results exhibit that better feature representation can be gleaned from our proposed model for fair and bias-less face analysis.
Abstract:Script identification plays a vital role in applications that involve handwriting and document analysis within a multi-script and multi-lingual environment. Moreover, it exhibits a profound connection with human cognition. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 and 86,655 lines and words, respectively, in the dataset. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten letters are also given. The new multi-lingual database is expected to create new script identifiers, present various challenges, including identifying handwritten and printed samples and serve as a foundation for future research in script identification based on the reported results of the three benchmarks.
Abstract:Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading in inaccurate and sparse segmentation. To address this limitation, we propose a novel \underline{P}rogressive \underline{A}ttention based \underline{M}obile \underline{UNet} (\underline{PAM-UNet}) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise \textit{Progressive Luong Attention} ($\mathcal{PLA}$) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a dice score of 82.87, while requiring only 1.32 floating-point operations per second (FLOPS) on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.
Abstract:Recent literature has witnessed significant interest towards 3D biometrics employing monocular vision for robust authentication methods. Motivated by this, in this work we seek to provide insight on recent development in the area of 3D biometrics employing monocular vision. We present the similarity and dissimilarity of 3D monocular biometrics and classical biometrics, listing the strengths and challenges. Further, we provide an overview of recent techniques in 3D biometrics with monocular vision, as well as application systems adopted by the industry. Finally, we discuss open research problems in this area of research
Abstract:This work proposes a novel process of using pen tip and tail 3D trajectory for air signature. To acquire the trajectories we developed a new pen tool and a stereo camera was used. We proposed SliT-CNN, a novel 2D spatial-temporal convolutional neural network (CNN) for better featuring of the air signature. In addition, we also collected an air signature dataset from $45$ signers. Skilled forgery signatures per user are also collected. A detailed benchmarking of the proposed dataset using existing techniques and proposed CNN on existing and proposed dataset exhibit the effectiveness of our methodology.
Abstract:Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such a behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction following data as well as strong backbone models for RS make it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can not only answer image-level queries but also accepts region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection. Our code is available at https://github.com/mbzuai-oryx/geochat.
Abstract:Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make it difficult to train them with limited data. To address this challenge, prior studies suggest training ViTs with self-supervised learning (SSL) and fine-tuning sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task (SSAT) is surprisingly beneficial when the amount of training data is limited. We explore the appropriate SSL tasks that can be optimized alongside the primary task, the training schemes for these tasks, and the data scale at which they can be most effective. Our findings reveal that SSAT is a powerful technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks, achieving better performance than typical ViTs pre-training with SSL and fine-tuning sequentially. Our experiments, conducted on 10 datasets, demonstrate that SSAT significantly improves ViT performance while reducing carbon footprint. We also confirm the effectiveness of SSAT in the video domain for deepfake detection, showcasing its generalizability. Our code is available at https://github.com/dominickrei/Limited-data-vits.
Abstract:This work explores various ways of exploring multi-task learning (MTL) techniques aimed at classifying videos as original or manipulated in cross-manipulation scenario to attend generalizability in deep fake scenario. The dataset used in our evaluation is FaceForensics++, which features 1000 original videos manipulated by four different techniques, with a total of 5000 videos. We conduct extensive experiments on multi-task learning and contrastive techniques, which are well studied in literature for their generalization benefits. It can be concluded that the proposed detection model is quite generalized, i.e., accurately detects manipulation methods not encountered during training as compared to the state-of-the-art.