Abstract:The advent of large language models (LLMs) has significantly advanced the field of code translation, enabling automated translation between programming languages. However, these models often struggle with complex translation tasks due to inadequate contextual understanding. This paper introduces a novel approach that enhances code translation through Few-Shot Learning, augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), substantially improves translation quality by providing contextual examples from which the model can learn at inference time. We selected RAG over traditional fine-tuning methods because it can draw on existing codebases or a locally stored corpus of code, which allows dynamic adaptation to diverse translation tasks without extensive retraining. Extensive experiments on diverse datasets with open LLMs such as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, as well as commercial LLMs like GPT-3.5 Turbo and GPT-4o, demonstrate our approach's superiority over traditional zero-shot methods, especially in translating between Fortran and C++. We also explored varying the number of shots (i.e., examples provided during inference; specifically 1, 2, and 3 shots) and different embedding models for RAG, including Nomic-Embed, Starencoder, and CodeBERT, to assess the robustness and effectiveness of our approach.
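The retrieval step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the bag-of-tokens `embed` function is a toy stand-in for a learned embedding model such as Nomic-Embed, Starencoder, or CodeBERT, and the names `retrieve`, `build_prompt`, `CORPUS`, and the prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(code):
    """Toy embedding (assumption): counts of lowercase identifier and number tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+", code.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k):
    """Return the k (source, target) pairs whose source is most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, shots):
    """Assemble a k-shot prompt: retrieved pairs first, then the code to translate."""
    parts = ["Translate the following Fortran code to C++."]
    for src, tgt in shots:
        parts.append(f"### Fortran\n{src}\n### C++\n{tgt}")
    parts.append(f"### Fortran\n{query}\n### C++\n")
    return "\n\n".join(parts)


# Hypothetical repository of existing Fortran -> C++ translation pairs.
CORPUS = [
    ("do i = 1, n\n  s = s + a(i)\nend do",
     "for (int i = 0; i < n; ++i) {\n  s += a[i];\n}"),
    ("if (x > 0) then\n  y = sqrt(x)\nend if",
     "if (x > 0) {\n  y = std::sqrt(x);\n}"),
    ("print *, 'hello'", 'std::cout << "hello" << std::endl;'),
]

query = "do j = 1, m\n  total = total + b(j)\nend do"
shots = retrieve(query, CORPUS, k=2)   # a 2-shot setting
prompt = build_prompt(query, shots)    # this string would be sent to the LLM
```

Changing `k` to 1, 2, or 3 corresponds to the shot counts explored in the abstract. Because the example pool is searched at inference time, new translation pairs can be added to the corpus without retraining the model.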
Abstract:Recreating complex, high-dimensional global fields from limited data points is a grand challenge across various scientific and industrial domains. Given the prohibitive costs of specialized sensors and the frequent inaccessibility of certain regions of the domain, achieving full field coverage is typically not feasible. Therefore, the development of algorithms that intelligently improve sensor placement is of significant value. In this study, we introduce a general approach that employs differentiable programming to exploit sensor placement within the training of a neural network model in order to improve field reconstruction. We evaluated our method using two distinct datasets; the results show that our approach improved test scores. Ultimately, our method of differentiable placement strategies has the potential to significantly increase data collection efficiency, enable more thorough area coverage, and reduce redundancy in sensor deployment.
Abstract:The aim of this short note is to show that the Denoising Diffusion Probabilistic Model (DDPM), a non-homogeneous discrete-time Markov process, can be represented by a time-homogeneous continuous-time Markov process observed at non-uniformly sampled discrete times. Surprisingly, this continuous-time Markov process is the well-known and well-studied Ornstein-Uhlenbeck (OU) process, which was developed in the 1930s for studying Brownian particles in harmonic potentials. We establish the formal equivalence between DDPM and the OU process using its analytical solution. We further demonstrate that the design problem of the noise scheduler for non-homogeneous DDPM is equivalent to designing observation times for the OU process. We present several heuristic designs for observation times based on principled quantities such as auto-variance and Fisher Information and connect them to ad hoc noise schedules for DDPM. Interestingly, we show that the Fisher-Information-motivated schedule corresponds exactly to the cosine schedule, which was developed without any theoretical foundation but is the current state-of-the-art noise schedule.
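The correspondence between a noise schedule and observation times can be checked numerically. The sketch below assumes one particular normalization, which may differ from the paper's: an OU process dX = -X dt + sqrt(2) dW with unit stationary variance, whose exact transition over a gap Δ is X(t+Δ) = exp(-Δ) X(t) + sqrt(1 - exp(-2Δ)) ε. Matching this to the DDPM forward step x_k = sqrt(1 - β_k) x_{k-1} + sqrt(β_k) ε gives Δ_k = -(1/2) ln(1 - β_k). The function names are illustrative, and the linear schedule uses the commonly quoted DDPM defaults.

```python
import math


def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (commonly used default values)."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def observation_times(betas):
    """Cumulative OU observation times t_k = sum_i -0.5 * ln(1 - beta_i)."""
    times, t = [], 0.0
    for beta in betas:
        t += -0.5 * math.log(1.0 - beta)
        times.append(t)
    return times


def alpha_bar(betas):
    """Cumulative products prod_i (1 - beta_i), the usual DDPM signal level."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


betas = linear_betas(1000)
times = observation_times(betas)
abar = alpha_bar(betas)

# Under the assumed normalization, alpha_bar_k should equal exp(-2 * t_k).
max_gap = max(abs(a - math.exp(-2.0 * t)) for a, t in zip(abar, times))
```

Under this normalization, choosing a noise schedule and choosing a non-uniform set of observation times are the same design decision: any increasing sequence `times` maps back to a schedule through β_k = 1 - exp(-2 (t_k - t_{k-1})).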
Abstract:Spatial nonstationarity, the location variance of features' statistical distributions, is ubiquitous in many natural settings. For example, in geological reservoirs rock matrix porosity varies vertically due to geomechanical compaction trends, in mineral deposits grades vary due to sedimentation and concentration processes, in hydrology rainfall varies due to interactions between the atmosphere and topography, and in metallurgy crystalline structures vary due to differential cooling. Conventional geostatistical modeling workflows assume stationarity in order to model spatial features for geostatistical inference. This assumption is often unrealistic for nonstationary spatial data, which has motivated a variety of nonstationary spatial modeling workflows such as trend and residual decomposition, cosimulation with secondary features, and spatial segmentation with independent modeling over stationary subdomains. The advent of deep learning technologies has enabled new workflows for modeling spatial relationships. However, there is a paucity of demonstrated best practice and general guidance on mitigation of spatial nonstationarity with deep learning in the geospatial context. We demonstrate the impact of two common types of geostatistical spatial nonstationarity on deep learning model prediction performance and propose the mitigation of such impacts using self-attention (vision transformer) models. We demonstrate the utility of vision transformers for the mitigation of nonstationarity with relative errors as low as 10%, exceeding the performance of alternative deep learning methods such as convolutional neural networks. We establish best practice by demonstrating the ability of self-attention networks to model large-scale spatial relationships in the presence of commonly observed geospatial nonstationarity.
Abstract:Throughout computational science, there is a growing need to utilize the continual improvements in raw computational horsepower to achieve greater physical fidelity through scale-bridging rather than through brute-force increases in the number of mesh elements. For instance, quantitative predictions of transport in nanoporous media, critical to hydrocarbon extraction from tight shale formations, are impossible without accounting for molecular-level interactions. Similarly, inertial confinement fusion simulations rely on numerical diffusion to simulate molecular effects such as non-local transport and mixing without truly accounting for molecular interactions. With these two disparate applications in mind, we develop a novel capability that uses an active learning approach to optimize the use of local fine-scale simulations for informing coarse-scale hydrodynamics. Our approach addresses three challenges: forecasting the continuum coarse-scale trajectory to speculatively execute new fine-scale molecular dynamics calculations, dynamically updating the coarse scale from fine-scale calculations, and quantifying uncertainty in neural network models.
Abstract:Segmentation and analysis of individual pores and grains of mudrocks from scanning electron microscope (SEM) images is non-trivial because of noise, imaging artifacts, variation in pixel grayscale values across images, and overlaps in grayscale values among different physical features such as silt grains, clay grains, and pores in an image, which make their identification difficult. Because of these overlapping grayscale values, direct application of threshold-based segmentation techniques is not sufficient. Recent advances in the field of computer vision have made it easier and faster to segment images and identify multiple occurrences of such features in an image, provided that ground-truth data for training the algorithm is available. Here, we propose a deep learning SEM image segmentation model, MudrockNet, based on Google's DeepLab-v3+ architecture implemented with the TensorFlow library. The ground-truth data was obtained from an image-processing workflow applied to SEM images of uncemented muds from the Kumano Basin offshore Japan at depths < 1.1 km. The trained deep learning model obtained a pixel accuracy of about 90%, and predictions for the test data obtained a mean intersection over union (IoU) of 0.6591 for silt grains and 0.6642 for pores. We also compared our model with the random forest classifier using trainable Weka segmentation in ImageJ, and MudrockNet gave better predictions for both silt grains and pores. The size, concentration, and spatial arrangement of the silt and clay grains can affect the petrophysical properties of a mudrock, and an automated method to accurately identify the different grains and pores in mudrocks can help improve reservoir and seal characterization for petroleum exploration and anthropogenic waste sequestration.
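The two metrics quoted in this abstract, pixel accuracy and per-class intersection over union (IoU), can be computed as in the sketch below. The label encoding (0 for clay or background, 1 for silt grains, 2 for pores) and the tiny 3x4 masks are illustrative assumptions, not data from the study.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    total = correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += (p == t)
    return correct / total


def class_iou(pred, truth, label):
    """IoU for one class: |pred AND truth| / |pred OR truth| over pixels with that label."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = (p == label), (t == label)
            inter += (in_pred and in_truth)
            union += (in_pred or in_truth)
    return inter / union if union else float("nan")


# Hypothetical label encoding: 0 = clay/background, 1 = silt grain, 2 = pore.
truth = [
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 2, 2],
]
pred = [
    [1, 1, 1, 0],
    [1, 0, 0, 2],
    [0, 0, 2, 0],
]

acc = pixel_accuracy(pred, truth)      # 9 of 12 pixels match -> 0.75
silt_iou = class_iou(pred, truth, 1)   # intersection 3, union 5 -> 0.6
pore_iou = class_iou(pred, truth, 2)   # intersection 2, union 3 -> about 0.667
```

The toy masks show why the two metrics can diverge: pixel accuracy rewards every correctly labeled background pixel, while IoU for a class is penalized by both missed and spurious pixels of that class, so it is typically the lower and stricter number.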
Abstract:Predicting the spatial configuration of gas molecules in nanopores of shale formations is crucial for fluid flow forecasting and hydrocarbon reserves estimation. The key challenge in these tight formations is that the majority of the pore sizes are less than 50 nm. At this scale, the fluid properties are affected by nanoconfinement effects due to the increased fluid-solid interactions. For instance, gas adsorption to the pore walls could account for up to 85% of the total hydrocarbon volume in a tight reservoir. Although there are analytical solutions that describe this phenomenon for simple geometries, they are not suitable for describing realistic pores, where surface roughness and geometric anisotropy play important roles. To describe these, molecular dynamics (MD) simulations are used since they consider fluid-solid and fluid-fluid interactions at the molecular level. However, MD simulations are computationally expensive, and are not able to simulate scales larger than a few connected nanopores. We present a method for building and training physics-based deep learning surrogate models to carry out fast and accurate predictions of molecular configurations of gas inside nanopores. Since training deep learning models requires extensive databases that are computationally expensive to create, we employ active learning (AL). AL reduces the overhead of creating comprehensive sets of high-fidelity data by determining where the model uncertainty is greatest, and running simulations on the fly to minimize it. The proposed workflow enables nanoconfinement effects to be rigorously considered at the mesoscale where complex connected sets of nanopores control key applications such as hydrocarbon recovery and CO2 sequestration.
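The active learning loop described in this abstract (estimate where the model is most uncertain, run a new simulation there, and retrain) can be sketched as follows. Everything here is a simplified stand-in: `simulate` is a hypothetical cheap function in place of a molecular dynamics run, the surrogate is a bootstrap committee of nearest-neighbor predictors rather than a deep learning model, and committee variance is used as the uncertainty estimate.

```python
import math
import random


def simulate(x):
    """Hypothetical stand-in for an expensive high-fidelity simulation."""
    return math.sin(3.0 * x) * math.exp(-x)


def nearest_neighbor_predict(train, x):
    """Predict with the label of the closest training input."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def committee_variance(train, x, n_members, rng):
    """Uncertainty proxy: prediction variance across bootstrap-resampled surrogates."""
    preds = []
    for _ in range(n_members):
        resample = [rng.choice(train) for _ in train]
        preds.append(nearest_neighbor_predict(resample, x))
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)


def active_learning(initial_inputs, pool, n_rounds, n_members=20, seed=0):
    """Repeatedly simulate the pool candidate where the committee disagrees most."""
    rng = random.Random(seed)
    train = [(x, simulate(x)) for x in initial_inputs]
    pool = list(pool)  # copy so the caller's candidate list is not modified
    queried = []
    for _ in range(n_rounds):
        scores = [committee_variance(train, x, n_members, rng) for x in pool]
        best = max(range(len(pool)), key=scores.__getitem__)
        x_new = pool.pop(best)                    # remove so it is not queried twice
        train.append((x_new, simulate(x_new)))    # run the "simulation" on the fly
        queried.append(x_new)
    return train, queried


pool = [i / 20 for i in range(1, 40)]  # candidate inputs strictly inside (0, 2)
train, queried = active_learning([0.0, 2.0], pool, n_rounds=5)
```

The design point this illustrates is that the number of simulator calls equals the number of initial samples plus the number of rounds, instead of the size of the full candidate pool. In a real workflow the committee would be replaced by the uncertainty estimate of the trained surrogate model.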