Abstract: The advent of large language models (LLMs) has significantly advanced the field of code translation, enabling automated translation between programming languages. However, these models often struggle with complex translation tasks due to inadequate contextual understanding. This paper introduces a novel approach that enhances code translation through few-shot learning augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), substantially improves translation quality by providing contextual examples from which the model can learn at inference time. We selected RAG over traditional fine-tuning because it can draw on existing codebases or a locally stored corpus of code, allowing dynamic adaptation to diverse translation tasks without extensive retraining. Extensive experiments on diverse datasets with open LLMs such as StarCoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, as well as commercial LLMs like GPT-3.5 Turbo and GPT-4o, demonstrate our approach's superiority over traditional zero-shot methods, especially for translation between Fortran and C++. We also explored varying numbers of shots (i.e., examples provided during inference), specifically 1, 2, and 3, and different embedding models for RAG, including Nomic-Embed, StarEncoder, and CodeBERT, to assess the robustness and effectiveness of our approach.
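As a minimal sketch of the retrieval step this abstract describes (not the paper's actual implementation): assuming an embedding model such as CodeBERT supplies one vector per code snippet, the hypothetical helpers below rank a corpus of Fortran/C++ translation pairs by cosine similarity and assemble a few-shot prompt. All names (`retrieve_examples`, `build_prompt`, `corpus_pairs`) are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_examples(query_emb, corpus_embs, corpus_pairs, k=3):
    """Return the k (fortran, cpp) translation pairs most similar to the query."""
    scores = [cosine_sim(query_emb, e) for e in corpus_embs]
    top = np.argsort(scores)[::-1][:k]
    return [corpus_pairs[i] for i in top]

def build_prompt(examples, new_fortran):
    """Assemble a few-shot prompt from retrieved translation pairs."""
    parts = ["Translate the following Fortran code to C++.\n"]
    for src, tgt in examples:
        parts.append(f"Fortran:\n{src}\nC++:\n{tgt}\n")
    parts.append(f"Fortran:\n{new_fortran}\nC++:\n")
    return "\n".join(parts)
```

Varying `k` between 1 and 3 corresponds to the 1-, 2-, and 3-shot settings the abstract evaluates.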
Abstract: Wildfires present intricate challenges for prediction, necessitating the use of sophisticated machine learning techniques for effective modeling~\cite{jain2020review}. In our research, we conducted a thorough assessment of various machine learning algorithms for both classification and regression tasks relevant to predicting wildfires. We found that for classifying different types or stages of wildfires, the XGBoost model outperformed others in terms of accuracy and robustness. Meanwhile, the Random Forest regression model showed superior results in predicting the extent of wildfire-affected areas, achieving the lowest prediction error and the highest explained variance. Additionally, we developed a hybrid neural network model that integrates numerical data and image information for simultaneous classification and regression. To gain deeper insights into the decision-making processes of these models and identify key contributing features, we utilized eXplainable Artificial Intelligence (XAI) techniques, including TreeSHAP, LIME, Partial Dependence Plots (PDP), and Gradient-weighted Class Activation Mapping (Grad-CAM). These interpretability tools shed light on the significance and interplay of various features, highlighting the complex factors influencing wildfire predictions. Our study not only demonstrates the effectiveness of specific machine learning models in wildfire-related tasks but also underscores the critical role of model transparency and interpretability in environmental science applications.
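A hedged illustration of the tree-model-plus-TreeSHAP pipeline the abstract mentions, using synthetic stand-in features rather than the study's wildfire data:

```python
import numpy as np
import xgboost as xgb
import shap

# Illustrative synthetic stand-ins: 8 tabular features, 3 fire classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.integers(0, 3, size=500)

# Gradient-boosted tree classifier, as in the abstract's classification task.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

# TreeSHAP: exact SHAP values for tree ensembles, giving per-feature attributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
```

The same pattern applies to the Random Forest regressor; only the model and the SHAP explainer input change.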
Abstract: Music genre classification has become increasingly critical with the advent of various streaming applications. Searching for music solely by artist name and song title is no longer adequate in a sophisticated music app. Classifying music correctly is inherently difficult because the information linked to it, such as region, artist, and album, is highly variable. This paper presents a study on music genre classification using a combination of Digital Signal Processing (DSP) and Deep Learning (DL) techniques. A novel algorithm is proposed that utilizes both DSP and DL methods to extract relevant features from audio signals and classify them into various genres. The algorithm was tested on the GTZAN dataset and achieved high accuracy. An end-to-end deployment architecture is also proposed for integration into music-related applications. The performance of the algorithm is analyzed, and future directions for improvement are discussed. The proposed DSP- and DL-based music genre classification algorithm and deployment architecture demonstrate a promising approach to music genre classification.
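The abstract does not detail the DSP features or the network architecture; a common choice, shown here purely as an assumption, is MFCC summary statistics extracted with librosa feeding a classifier (an sklearn MLP stands in for the paper's deep model, and `track_paths`/`genre_labels` are hypothetical):

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(path, n_mfcc=20):
    """DSP stage: load a 30-second clip and summarize its MFCCs."""
    y, sr = librosa.load(path, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# DL stage (stand-in): train a classifier on per-track feature vectors.
# X = np.stack([extract_features(p) for p in track_paths])
# y = genre_labels
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500)
# clf.fit(X, y)
```

GTZAN's 30-second clips and ten genre labels fit this per-track feature-vector setup directly.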
Abstract: In this study, we propose a new method for combining in situ buoy measurements with Earth system models (ESMs) to improve the accuracy of ocean temperature predictions. The technique utilizes the dynamics and modes identified in ESMs to improve the accuracy of buoy measurements while preserving features such as seasonality. Using this technique, errors in localized temperature predictions made by the Model for Prediction Across Scales-Ocean (MPAS-O) can be corrected. We demonstrate that our approach improves accuracy compared to other interpolation and data assimilation methods, applying it to assimilate MPAS-O output with the Global Drifter Program's in situ ocean buoy dataset.
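The abstract does not specify how the ESM modes are used; one plausible reading, sketched under that assumption only, is to extract leading spatial modes (EOFs) from model snapshots and fit sparse buoy observations to them to produce a corrected full field:

```python
import numpy as np

def mode_correction(T_model, obs_idx, obs_vals, n_modes=5):
    """Project sparse buoy observations onto leading model modes (EOFs).

    T_model: (time, nloc) model temperature snapshots
    obs_idx: indices of locations with buoy observations
    obs_vals: observed temperatures at those locations
    """
    mean = T_model.mean(axis=0)
    # SVD of model anomalies yields spatial modes capturing the ESM's dynamics.
    U, s, Vt = np.linalg.svd(T_model - mean, full_matrices=False)
    modes = Vt[:n_modes]                       # (n_modes, nloc) spatial modes
    A = modes[:, obs_idx].T                    # modes sampled at buoy locations
    # Least-squares fit of observation anomalies onto the modes.
    coeffs, *_ = np.linalg.lstsq(A, obs_vals - mean[obs_idx], rcond=None)
    return mean + coeffs @ modes               # corrected full-field estimate
```

Because the modes come from the full model record, a seasonal cycle present in the data is retained in the reconstruction.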
Abstract: Deep learning-based latent representations have been widely used for numerous scientific visualization applications, such as isosurface similarity analysis, volume rendering, flow field synthesis, and data reduction, to name a few. However, existing latent representations are mostly generated from raw data in an unsupervised manner, which makes it difficult to incorporate domain interest to control the size of the latent representations and the quality of the reconstructed data. In this paper, we present a novel importance-driven latent representation to facilitate domain-interest-guided scientific data visualization and analysis. We utilize spatial importance maps to represent various scientific interests and take them as input to a feature transformation network that guides latent generation. We further reduce the latent size with a lossless entropy encoding algorithm trained jointly with the autoencoder, improving storage and memory efficiency. We qualitatively and quantitatively evaluate the effectiveness and efficiency of the latent representations generated by our method using data from multiple scientific visualization applications.
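A minimal sketch of the core idea, assuming a convolutional autoencoder with a same-resolution importance map concatenated as an extra input channel; the paper's actual feature transformation network and jointly trained entropy coder are not reproduced here:

```python
import torch
import torch.nn as nn

class ImportanceGuidedAE(nn.Module):
    """Sketch: an importance map, passed as an extra channel, steers the encoder."""
    def __init__(self, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(2, 16, 3, stride=2, padding=1), nn.ReLU(),   # data + importance
            nn.Conv3d(16, latent, 3, stride=2, padding=1))
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1))

    def forward(self, x, importance):
        z = self.encoder(torch.cat([x, importance], dim=1))
        return self.decoder(z), z

# One way domain interest can shape training: weight reconstruction error
# by the importance map so high-interest regions reconstruct more faithfully.
# loss = (importance * (recon - x) ** 2).mean()
```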
Abstract: With the increasing computational power of current supercomputers, the size of the data produced by scientific simulations is growing rapidly. To reduce the storage footprint and facilitate scalable post-hoc analyses of such scientific data sets, various data reduction/summarization methods have been proposed over the years. Different flavors of sampling algorithms exist to sample high-resolution scientific data while preserving the data properties important for subsequent analyses. However, most of these sampling algorithms are designed for univariate data and cater to post-hoc analyses of single variables. In this work, we propose a multivariate sampling strategy that preserves the original variable relationships and enables different multivariate analyses directly on the sampled data. Our strategy utilizes principal component analysis to capture the variance of multivariate data and can be built on top of any existing state-of-the-art sampling algorithm for single variables. In addition, we propose variants of different data partitioning schemes (regular and irregular) to efficiently model local multivariate relationships. Using two real-world multivariate data sets, we demonstrate the efficacy of our proposed multivariate sampling strategy with respect to its data reduction capabilities as well as the ease of performing efficient post-hoc multivariate analyses.
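A simplified sketch of how PCA-derived variance could drive per-block sample allocation; the partitioning, scoring, and allocation details below are assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np
from sklearn.decomposition import PCA

def block_importance(block_vars):
    """Score one spatial block by the total variance of its multivariate values.

    block_vars: (n_points, n_variables) values of all variables in the block.
    """
    pca = PCA(n_components=min(block_vars.shape))
    pca.fit(block_vars)
    return pca.explained_variance_.sum()

def multivariate_sample(blocks, budget):
    """Allocate a global sample budget to blocks in proportion to PCA variance."""
    scores = np.array([block_importance(b) for b in blocks])
    alloc = np.maximum(1, (budget * scores / scores.sum()).astype(int))
    samples = []
    for b, k in zip(blocks, alloc):
        idx = np.random.choice(len(b), size=min(k, len(b)), replace=False)
        samples.append(b[idx])
    return samples
```

Blocks whose variables co-vary strongly receive more samples, which is one way the local multivariate relationships can be preserved in the reduced data.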
Abstract: Data modeling and reduction are important for in situ analysis. Feature-driven methods for in situ data analysis and reduction are a priority for future exascale machines, as very few such methods currently exist. We investigate a deep-learning-based workflow that targets in situ data processing using autoencoders. We propose a residual autoencoder that integrates Residual-in-Residual Dense Blocks (RRDB) to obtain better performance. Our proposed framework compressed our test data from 2.1 MB to 66 KB per 3D volume timestep, a roughly 32x reduction.
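For concreteness, a sketch of a Residual-in-Residual Dense Block in PyTorch, following the common RRDB pattern of stacked dense blocks with scaled inner and outer skip connections; the channel counts, growth rate, and 3D setting are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connectivity: each conv sees all earlier feature maps."""
    def __init__(self, ch, growth=16):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv3d(ch + i * growth, growth, 3, padding=1) for i in range(3))
        self.fuse = nn.Conv3d(ch + 3 * growth, ch, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return x + 0.2 * self.fuse(torch.cat(feats, dim=1))  # scaled local residual

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: stacked dense blocks plus an outer skip."""
    def __init__(self, ch):
        super().__init__()
        self.blocks = nn.Sequential(DenseBlock(ch), DenseBlock(ch), DenseBlock(ch))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)  # outer residual connection
```

In an autoencoder, such blocks typically sit between the down- and up-sampling layers, letting the network learn residual detail without destabilizing training.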
Abstract: Neuromorphic computing, inspired by the brain, promises extreme efficiency for certain classes of learning tasks, such as classification and pattern recognition. The performance and power consumption of neuromorphic computing depend heavily on the choice of the neuron architecture. Digital neurons (Dig-N) are conventionally known to be accurate and efficient at high speed, but they suffer from high leakage currents owing to the large number of transistors in a large design. Analog/mixed-signal neurons, on the other hand, are prone to noise, variability, and mismatch, but can lead to extremely low-power designs. In this work, we analyze, compare, and contrast existing neuron architectures with a proposed mixed-signal neuron (MS-N) in terms of performance, power, and noise, thereby demonstrating the applicability of the proposed mixed-signal neuron for achieving extreme energy efficiency in neuromorphic computing. The proposed MS-N is implemented in 65 nm CMOS technology and exhibits >100x better energy efficiency across all frequencies than two traditional digital neurons synthesized in the same technology node. We also demonstrate that the inherent error resiliency of a fully connected or even convolutional neural network (CNN) can tolerate the noise as well as the manufacturing non-idealities of the MS-N up to certain degrees. Notably, a system-level implementation on the MNIST dataset exhibits a worst-case increase in classification error of 2.1% when the integrated noise power in the bandwidth is ~0.1 μV², along with ±3σ variation and mismatch introduced in the transistor parameters, for the proposed neuron with 8-bit precision.
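The MS-N itself is a circuit, but its system-level noise study can be mimicked behaviorally; the sketch below is an assumption rather than the paper's methodology, injecting Gaussian noise into each activation of a small MNIST-style CNN so test accuracy can be re-measured as the noise level is swept:

```python
import torch
import torch.nn as nn

class NoisyReLU(nn.Module):
    """Behavioral stand-in for MS-N non-idealities: noise added to each activation."""
    def __init__(self, sigma=0.01):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        return torch.relu(x + self.sigma * torch.randn_like(x))

# A small MNIST-style CNN whose "neurons" carry the injected analog noise.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), NoisyReLU(0.01), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), NoisyReLU(0.01), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 7 * 7, 10))

# Sweeping sigma and re-evaluating test accuracy mirrors the paper's study of
# how much integrated noise power the network's error resiliency can absorb.
```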