Abstract:Cloud masking is a crucial task in meteorology and its applications in the environmental and atmospheric sciences. Its goal is, given satellite images, to accurately generate cloud masks that classify each pixel in an image as containing either cloud or clear sky. In this paper, we summarize some of the ongoing research activities in cloud masking, with a focus on the research and benchmark currently conducted in the MLCommons Science Working Group. This overview is produced with the hope that others will have an easier time getting started with, and collaborating on, the activities related to the MLCommons Cloud Mask Benchmark.
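To make the per-pixel framing concrete, the minimal NumPy sketch below shows what a binary cloud mask looks like and how it can be scored against a reference mask. The thresholding rule, the 0.3 cutoff, and the synthetic scene are hypothetical stand-ins for the benchmark's learned model, used only to illustrate the mask format.

```python
import numpy as np

def threshold_cloud_mask(reflectance: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Toy per-pixel cloud mask: 1 = cloud, 0 = clear sky.

    The benchmark uses a learned model; this threshold rule only
    illustrates the mask representation, not the actual method.
    """
    return (reflectance > threshold).astype(np.uint8)

def pixel_accuracy(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    """Fraction of pixels where the predicted and reference masks agree."""
    return float(np.mean(pred_mask == true_mask))

# Synthetic data standing in for a single-band satellite scene.
rng = np.random.default_rng(0)
scene = rng.random((256, 256))                 # fake reflectance values in [0, 1]
reference = (scene > 0.3).astype(np.uint8)     # fake ground-truth mask
predicted = threshold_cloud_mask(scene, threshold=0.3)
print("pixel accuracy:", pixel_accuracy(predicted, reference))
```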
Abstract:The Data Science domain has expanded monumentally in both the research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexity to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time in these pipelines is spent on data preprocessing, so improving its efficiency directly impacts overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations when working on even moderately large data sets. We believe there is plenty of room for improvement by looking at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and a reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
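As an illustration of the distributed dataframe operators discussed above, the sketch below shows a join executed with pycylon. It assumes the CylonEnv/MPIConfig API of the Cylon Python bindings and that the script is launched under MPI (e.g., mpirun -np 4 python join.py); the exact constructor and parameter names may differ between releases, and the tiny tables are invented for illustration.

```python
from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

# One environment per worker process; the operator pattern hides the partitioning.
env = CylonEnv(config=MPIConfig(), distributed=True)

# Each rank holds a local partition of the two tables.
left = DataFrame({"id": [1, 2, 3, 4], "value": [10, 20, 30, 40]})
right = DataFrame({"id": [3, 4, 5, 6], "score": [0.3, 0.4, 0.5, 0.6]})

# Distributed join: rows are shuffled across ranks on the join key
# before each rank performs its local join.
joined = left.merge(right=right, on=["id"], env=env)
print(env.rank, joined)

env.finalize()
```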
Abstract:In this paper we apply neural networks and Artificial Intelligence (AI) to historical records of high-risk cryptocurrency coins to train a prediction model that estimates their price. This paper's code contains Jupyter notebooks, one of which outputs a time-series graph of any cryptocurrency's price once a CSV file of the historical data is input into the program. Another Jupyter notebook trains an LSTM (long short-term memory) model to predict a cryptocurrency's closing price. The LSTM is fed the close price, which is the price the currency has at the end of the day, so it can learn from those values. The notebook creates two sets, a training set and a test set, to assess the accuracy of the results. The data is then normalized using manual min-max scaling so that the model does not experience any bias; this also enhances the performance of the model. The model is then trained using three layers, an LSTM, a dropout, and a dense layer, minimizing the loss through 50 epochs of training; from this training, a recurrent neural network (RNN) is produced and fitted to the training set. Additionally, a graph of the loss over each epoch is produced, with the loss decreasing over time. Finally, the notebook plots a line graph of the actual currency price in red and the predicted price in blue. The process is then repeated for several more cryptocurrencies to compare prediction models. The parameters of the LSTM, such as the number of epochs and the batch size, are tuned to minimize the root mean square error.
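For readers who want the gist of the notebook in one place, the condensed TensorFlow/Keras sketch below follows the steps described above: manual min-max scaling, a chronological train/test split, an LSTM-dropout-dense model trained for 50 epochs, and an RMSE check on the held-out data. The file name, the 'Close' column, the window size, the layer width, and the batch size are illustrative assumptions, not the notebook's exact settings.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

df = pd.read_csv("coin_history.csv")            # hypothetical CSV with a 'Close' column
close = df["Close"].to_numpy(dtype="float32")

# Manual min-max scaling to [0, 1].
lo, hi = close.min(), close.max()
scaled = (close - lo) / (hi - lo)

# Sliding windows: the previous `window` closes predict the next close.
window = 60
X = np.array([scaled[i - window:i] for i in range(window, len(scaled))])
y = scaled[window:]
X = X[..., np.newaxis]                          # (samples, timesteps, features)

split = int(0.8 * len(X))                       # chronological train/test split
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = Sequential([
    LSTM(50, input_shape=(window, 1)),
    Dropout(0.2),
    Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_data=(X_test, y_test))

# Undo the scaling to compare predicted and actual prices.
pred = model.predict(X_test).ravel() * (hi - lo) + lo
actual = y_test * (hi - lo) + lo
rmse = float(np.sqrt(np.mean((pred - actual) ** 2)))
print("test RMSE:", rmse)
```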
Abstract:Data-intensive applications are becoming commonplace in all science disciplines. They comprise a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often, the lack of a clear definition of data structures and operators in the field has led to implementations that do not work well together. The HPTMT architecture that we proposed recently identifies a set of data structures, operators, and an execution model for creating rich data applications that link all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together.
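A toy end-to-end sketch of the idea is shown below, with pandas standing in for the tabular data engineering operators and PyTorch for the deep learning stage. It is not the paper's actual application; the tiny table, filter, and model are invented purely to show the two parts working together.

```python
import pandas as pd
import torch
from torch import nn

# Data engineering stage: a pandas stand-in for the table operators
# (filter, column selection) that precede training.
df = pd.DataFrame({"x1": [0.1, 0.4, 0.9, 0.3],
                   "x2": [1.0, 0.2, 0.5, 0.8],
                   "label": [0.0, 1.0, 1.0, 0.0]})
df = df[df["x1"] > 0.0]                         # trivial filter operator

features = torch.tensor(df[["x1", "x2"]].to_numpy(), dtype=torch.float32)
labels = torch.tensor(df["label"].to_numpy(), dtype=torch.float32).unsqueeze(1)

# Deep learning stage: a tiny classifier consuming the engineered table.
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
print("final training loss:", float(loss))
```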
Abstract:Data-intensive applications impact many domains, and their steadily increasing size and complexity demand high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired by systems such as MPI, HPF (High Performance Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS (NVIDIA), and OneAPI (Intel). Further, it is crucial to support the different languages in everyday use in the Big Data arena, including Python, R, C++, and Java. We note the importance of Apache Arrow and Parquet for enabling language-agnostic high performance and interoperability. In this paper, we propose High-Performance Tensors, Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications, and identify the fundamental principles needed for performance and usability success. We illustrate these principles with a discussion of examples using our software environments, Cylon and Twister2, which embody HPTMT.
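The role of Apache Arrow and Parquet mentioned above can be illustrated with the short pyarrow sketch below: a columnar Arrow table is written to Parquet, read back, and handed to a dataframe, with the same file readable from C++, Java, R, and other languages. The column names and file path are illustrative, not taken from the paper.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table (columnar, language-agnostic in-memory format).
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Persist to Parquet and read it back; the same file can be consumed
# from C++, Java, R, etc.
pq.write_table(table, "example.parquet")
round_tripped = pq.read_table("example.parquet")

# Move between the table abstraction and a dataframe; the underlying
# columnar buffers are shared without copies where possible.
df = round_tripped.to_pandas()
print(df)
```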
Abstract:The COVID-19 pandemic has had profound global consequences for health, the economy, society, politics, and almost every other major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple different strategies into AICov, including the ability to use deep learning strategies based on LSTMs and event modeling. To demonstrate our approach, we have conducted a pilot that integrates population covariates from multiple sources. Thus, AICov not only includes data on COVID-19 cases and deaths but, more importantly, the population's socioeconomic, health, and behavioral risk factors at a local level. The compiled data are fed into AICov, and we obtain improved predictions by integrating these data into our model, compared to one that uses only case and death data.
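To sketch how case data and population covariates can be combined in an LSTM forecaster, the Keras example below builds windows of a multivariate series (daily cases plus static covariates repeated along the time axis) and predicts the next day's count. It is a minimal illustration on synthetic data; the window length, layer size, and covariate handling are assumptions, not AICov's actual architecture or preprocessing.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Synthetic stand-in: daily case counts plus static population covariates
# (e.g., socioeconomic and health risk factors) repeated along the time axis.
days, window, n_covariates = 200, 14, 5
rng = np.random.default_rng(1)
cases = np.cumsum(rng.poisson(5, size=days)).astype("float32")
covariates = np.tile(rng.random(n_covariates).astype("float32"), (days, 1))
series = np.column_stack([cases, covariates])   # shape: (days, 1 + n_covariates)

# Windows of the multivariate series predict the next day's case count.
X = np.stack([series[i - window:i] for i in range(window, days)])
y = cases[window:]

model = Sequential([
    LSTM(32, input_shape=(window, series.shape[1])),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=16, verbose=0)
print("next-day forecast:", float(model.predict(X[-1:])[0, 0]))
```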