Ankur Kumar

Contrasting Low and High-Resolution Features for HER2 Scoring using Deep Learning

Mar 28, 2025

Ekansh Chauhan, Anila Sharma, Amit Sharma, Vikas Nishadham, Asha Ghughtyal, Ankur Kumar, Gurudutt Gupta, Anurag Mehta, C. V. Jawahar, P. K. Vinod

Abstract: Breast cancer, the most common malignancy among women, requires precise detection and classification for effective treatment. Immunohistochemistry (IHC) biomarkers like HER2, ER, and PR are critical for identifying breast cancer subtypes. However, traditional IHC classification relies on pathologists' expertise, making it labor-intensive and subject to significant inter-observer variability. To address these challenges, this study introduces the India Pathology Breast Cancer Dataset (IPD-Breast), comprising 1,272 IHC slides (HER2, ER, and PR) aimed at automating receptor status classification. The primary focus is on developing predictive models for HER2 3-way classification (0, Low, High) to enhance prognosis. Evaluation of multiple deep learning models revealed that an end-to-end ConvNeXt network utilizing low-resolution IHC images achieved an AUC, F1, and accuracy of 91.79%, 83.52%, and 83.56%, respectively, for 3-way classification, outperforming patch-based methods by over 5.35% in F1 score. This study highlights the potential of simple yet effective deep learning techniques to significantly improve accuracy and reproducibility in breast cancer classification, supporting their integration into clinical workflows for better patient outcomes.

What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

Jan 27, 2025

Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj

Abstract: Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation of the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that the XAI results obtained from analyzing limited utterances don't necessarily hold when evaluated on large datasets.

* Accepted to ICASSP 2025

Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications

Dec 03, 2024

Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Þorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, Joao Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang (+22 more)

Abstract: This technical report presents Prithvi-EO-2.0, a new geospatial foundation model that offers significant improvements over its predecessor, Prithvi-EO-1.0. Trained on 4.2M global time series samples from NASA's Harmonized Landsat and Sentinel-2 data archive at 30 m resolution, the new 300M and 600M parameter models incorporate temporal and location embeddings for enhanced performance across various geospatial tasks. Through extensive benchmarking with GEO-Bench, the 600M version outperforms the previous Prithvi-EO model by 8% across a range of tasks. It also outperforms six other geospatial foundation models when benchmarked on remote sensing tasks from different domains and resolutions (i.e., from 0.1 m to 15 m). The results demonstrate the versatility of the model in both classical earth observation and high-resolution applications. Early involvement of end-users and subject matter experts (SMEs) was among the key factors that contributed to the project's success. In particular, SME involvement allowed for constant feedback on model and dataset design, as well as successful customization for diverse SME-led applications in disaster response, land use and crop mapping, and ecosystem dynamics monitoring. Prithvi-EO-2.0 is available on Hugging Face and IBM terratorch, with additional resources on GitHub. The project exemplifies the Trusted Open Science approach embraced by all involved organizations.

Residual vector quantization for KV cache compression in large language model

Oct 21, 2024

Ankur Kumar

Abstract: KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs). We adapt the standard recipe with minimal changes to compress the output of any key or value projection matrix in a pretrained LLM: we scale the vector by its standard deviation, divide the channels into groups, and then quantize each group with the same residual vector quantizer. We learn the codebook using an exponential moving average, and there are no other learnable parameters, including the input and output projections normally used in a vector quantization setup. We find that a residual depth of 8 recovers most of the performance of the unquantized model. We also find that grouping non-contiguous channels together works better than grouping contiguous channels for compressing the key matrix, and that the method further benefits from lightweight finetuning of the LLM together with the quantization. Overall, the proposed technique is competitive with existing quantization methods while being much simpler, and it results in 5.5x compression compared to half precision.
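The recipe in this abstract (scale by the standard deviation, split the channels into groups, quantize every group with the same residual quantizer) can be illustrated with a minimal pure-Python sketch. It is not the author's implementation. The function names, the contiguous grouping, and the randomly initialised codebooks are assumptions made for illustration; the paper learns its codebooks with an exponential moving average and reports that non-contiguous grouping works better for keys.

```python
import math
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:  # one codebook per residual stage (depth)
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entry of every stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def compress(vector, codebooks, group_size):
    """Scale by the standard deviation, split into groups, quantize each group."""
    mean = sum(vector) / len(vector)
    std = math.sqrt(sum((x - mean) ** 2 for x in vector) / len(vector)) or 1.0
    scaled = [x / std for x in vector]
    # Assumption: contiguous groups, chosen only to keep the sketch short.
    groups = [scaled[i:i + group_size] for i in range(0, len(scaled), group_size)]
    return std, [rvq_encode(group, codebooks) for group in groups]


def decompress(std, codes, codebooks):
    """Decode every group, undo the scaling, and concatenate the groups."""
    out = []
    for indices in codes:
        out.extend(x * std for x in rvq_decode(indices, codebooks))
    return out


# Hypothetical setup: depth 8, 16 entries per codebook, groups of 4 channels.
random.seed(0)
demo_codebooks = [
    [[random.gauss(0.0, 1.0 / (stage + 1)) for _ in range(4)] for _ in range(16)]
    for stage in range(8)
]
demo_vector = [random.gauss(0.0, 2.0) for _ in range(16)]
demo_std, demo_codes = compress(demo_vector, demo_codebooks, group_size=4)
demo_restored = decompress(demo_std, demo_codes, demo_codebooks)
```

Only the per-vector scale and the integer indices need to be stored per token, which is where the memory saving comes from; the codebooks are shared across all groups.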

Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Oct 09, 2024

Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj

Abstract: Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks to evaluate the generalizability and robustness of detection models. In this paper, we present Reality Defender's submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy which significantly improves generalizability while maintaining low computational cost during training. Our system SLIM learns the style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistics aspects. We evaluated our system on ASVspoof5, ASV2019, and In-the-wild. Our submission achieved a minDCF of 0.1499 and an EER of 5.5% on ASVspoof5 Track 1, and EERs of 7.4% and 10.8% on ASV2019 and In-the-wild, respectively.

* Accepted into ASVspoof5 workshop

Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

Jan 11, 2024

Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

Abstract: In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapted the FastConformer architecture for streaming applications through: (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference time, which is common for many streaming models. Furthermore, our proposed encoder works with various decoder configurations including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduced a hybrid CTC/RNNT architecture which utilizes a shared encoder with both a CTC and an RNNT decoder to boost the accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a multi-domain large-scale dataset and demonstrate that it can achieve better accuracy with lower latency and inference time compared to a conventional buffered streaming model baseline. We also showed that training a model with multiple latencies can achieve better accuracy than single-latency models while enabling us to support multiple latencies with a single model. Our experiments also showed that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models compared to single-decoder models.

* Shorter version accepted to ICASSP 2024
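The caching idea in the abstract above can be pictured with a toy example. The sketch below is a simplified analogy, not the FastConformer encoder: it uses a single causal 1-D convolution with a bounded past context and no look-ahead, and the names `causal_conv` and `StreamingConv` are hypothetical. It shows why keeping the last few inputs as a cache lets chunk-by-chunk inference reproduce the offline output exactly, so there is no mismatch between the two modes.

```python
def causal_conv(samples, kernel, left_context):
    """Causal 1-D convolution with a bounded past context.

    kernel[-1] weights the current sample and kernel[0] the oldest one.
    left_context must hold the len(kernel) - 1 samples preceding `samples`.
    """
    padded = list(left_context) + list(samples)
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(samples))
    ]


class StreamingConv:
    """Processes audio chunk by chunk, caching the trailing inputs it needs."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Start from silence, the same context the offline pass assumes.
        self.cache = [0] * (len(self.kernel) - 1)

    def step(self, chunk):
        out = causal_conv(chunk, self.kernel, self.cache)
        keep = len(self.kernel) - 1
        # Keep only the most recent inputs; older ones can no longer matter.
        self.cache = (self.cache + list(chunk))[-keep:] if keep > 0 else []
        return out


kernel = [1, 2, 3]
signal = list(range(10))

# Offline: the whole utterance at once, with zero padding on the left.
offline = causal_conv(signal, kernel, [0, 0])

# Streaming: three chunks of unequal size, state carried in the cache.
stream = StreamingConv(kernel)
streamed = stream.step(signal[:4]) + stream.step(signal[4:8]) + stream.step(signal[8:])
```

In a real encoder the cache would hold layer activations rather than raw samples, but the bookkeeping follows the same pattern: each layer stores exactly as much past as its constrained context can see.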

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

May 19, 2023

Dima Rekesh, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Ankur Kumar, Boris Ginsburg

Abstract: Conformer-based models have become the most dominant end-to-end architecture for speech processing tasks. In this work, we propose a carefully redesigned Conformer with a new down-sampling schema. The proposed model, named Fast Conformer, is 2.8x faster than the original Conformer, while preserving state-of-the-art accuracy on Automatic Speech Recognition benchmarks. We also replace the original Conformer global attention with limited context attention post-training to enable transcription of hour-long audio. We further improve long-form speech transcription by adding a global token. Fast Conformer combined with a Transformer decoder also outperforms the original Conformer in accuracy and in speed for Speech Translation and Spoken Language Understanding.

LabVIEW is faster and C is economical interfacing tool for UCT automation

May 17, 2022

Ankur Kumar, Mayank Goswami

Abstract: An in-house-developed 2D ultrasound computerized tomography (UCT) system is fully automated. A performance analysis of the software tools used for instrument interfacing, namely LabVIEW, MATLAB, C, and Python, is presented. The instrument interfacing algorithms, hardware control algorithms, signal processing, and analysis codes are written using the above-mentioned platforms. A total of eight performance indices are used to compare the ease of (a) real-time control of the electromechanical assembly, (b) sensor and instrument integration, (c) synchronized data acquisition, and (d) simultaneous raw data processing. It is found that C utilizes the least processing power and runs fewer processes to perform the same task. In runtime analysis (data acquisition and real-time control), LabVIEW performs best, taking 365.69s in comparison to MATLAB (623.83s), Python (1505.54s), and C (1252.03s) to complete the experiment. Python performs better in terms of faster interfacing and minimum RAM usage. LabVIEW is recommended for its fast process execution. C is recommended for the most economical implementation. Python is recommended for complex system automation involving a very large number of components. This article provides a methodology to select optimal software tools for instrument automation-related aspects.

* 15 pages, 9 figures, 2 tables, 23 references

Vision Transformer Compression with Structured Pruning and Low Rank Approximation

Mar 25, 2022

Ankur Kumar

Abstract: The Transformer architecture has gained popularity due to its ability to scale with large datasets. Consequently, there is a need to reduce the model size and latency, especially for on-device deployment. We focus on the vision transformer proposed for the image recognition task (Dosovitskiy et al., 2021), and explore the application of different compression techniques such as low rank approximation and pruning for this purpose. Specifically, we investigate a structured pruning method proposed recently in Zhu et al. (2021) and find that mostly feedforward blocks are pruned with this approach, and with severe degradation in accuracy. We propose a hybrid compression approach to mitigate this, where we compress the attention blocks using low rank approximation and use the previously mentioned pruning with a lower rate for feedforward blocks in each transformer layer. Our technique results in 50% compression with a 14% relative increase in classification error, whereas we obtain 44% compression with a 20% relative increase in error when only pruning is applied. We propose further enhancements to bridge the accuracy gap but leave them as future work.
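To make the low rank approximation step concrete, the arithmetic below shows how factoring a weight matrix changes its parameter count. It is a hedged illustration rather than the paper's procedure: the abstract does not say how ranks are chosen, and the 768 x 768 shape and rank 192 are assumed values picked only so the numbers are easy to follow.

```python
def dense_params(rows, cols):
    """Parameters in a full rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameters after replacing W (rows x cols) by A (rows x rank) @ B (rank x cols)."""
    return rank * (rows + cols)


def compression(rows, cols, rank):
    """Fraction of parameters removed by the factorization (negative if it grows)."""
    return 1.0 - low_rank_params(rows, cols, rank) / dense_params(rows, cols)


def break_even_rank(rows, cols):
    """Largest rank for which the factorized form is strictly smaller than the dense one."""
    limit = dense_params(rows, cols) / (rows + cols)
    rank = int(limit)
    return rank - 1 if rank == limit else rank


# Assumed example: a square 768 x 768 projection factorized at rank 192.
rows, cols, rank = 768, 768, 192
saved = compression(rows, cols, rank)
```

A factorization only pays off when the rank is below rows * cols / (rows + cols), which is why the choice of rank, and not just the decision to factorize, determines the achievable compression.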

AI and conventional methods for UCT projection data estimation

Aug 17, 2021

Ankur Kumar, Prasunika Khare, Mayank Goswami

Abstract: A 2D compact ultrasound computerized tomography (UCT) system is developed. Fully automatic post-processing tools involving signal and image processing are developed as well. Squared amplitude values are used in transmission mode, with a natural frequency of 1.5 MHz, a rise time of 10.4 ns, a fall time of 8.4 ns, and a duty cycle of 4.32%. The highest peak to corresponding trough values are considered as the transmitting wave between transducers in direct line talk. A sensitivity analysis of methods to extract the peak to corresponding trough per transducer is discussed in this paper. A total of five methods are tested. These methods are taken from two broad categories: (a) conventional and (b) artificial intelligence (AI) based methods. The conventional methods, namely (a) simple gradient-based peak detection, (b) Fourier-based, and (c) wavelet transform, are compared with the AI-based methods: (a) support vector machine (SVM) and (b) artificial neural network (ANN). A classification step was also performed to discard signals that have no contribution from the transmission wave. It is found that conventional methods have better performance. Reconstruction error, accuracy, F-score, recall, precision, specificity, and MCC are measured for 40 x 40 data (1,600 data files). Each data file contains 50,002 data points. Ten such data files are used for training the neural network. Each data file has 7/8 wave packets, and each packet corresponds to one transmission amplitude value. Reconstruction error is found to be minimum for the ANN method. Other performance indices show that the FFT method processes the UCT signal with the best recovery.

* 10 Pages, 5 Figures
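As an illustration of the simple gradient-based peak detection named in the abstract above, here is a minimal sketch that finds local extrema from sign changes in the first difference and then measures the highest peak against the trough that follows it. Reading "corresponding trough" as the first trough after the highest peak is an assumption, and the function names are hypothetical; the paper's own extraction code may differ.

```python
def local_extrema(signal):
    """Return (peak_indices, trough_indices) from first-difference sign changes."""
    peaks, troughs = [], []
    for i in range(1, len(signal) - 1):
        rising_in = signal[i] - signal[i - 1]    # slope arriving at sample i
        rising_out = signal[i + 1] - signal[i]   # slope leaving sample i
        if rising_in > 0 and rising_out <= 0:
            peaks.append(i)
        elif rising_in < 0 and rising_out >= 0:
            troughs.append(i)
    return peaks, troughs


def peak_to_trough(signal):
    """Height of the highest peak above the first trough that follows it.

    Returns None when the signal has no peak or no trough after that peak.
    """
    peaks, troughs = local_extrema(signal)
    if not peaks:
        return None
    top = max(peaks, key=lambda i: signal[i])
    later = [t for t in troughs if t > top]
    if not later:
        return None
    return signal[top] - signal[later[0]]


# Toy wave packet: a large swing followed by a smaller one.
packet = [0, 1, 3, 1, -2, 0, 2, 1, 0.5, 1]
amplitude = peak_to_trough(packet)
```

On measured data a detector like this is sensitive to noise, since every small ripple produces a sign change, which is one reason to compare it against Fourier and wavelet based alternatives as the abstract does.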
