Jongwook Woo

Insuring Smiles: Predicting routine dental coverage using Spark ML

Oct 13, 2023

Aishwarya Gupta, Rahul S. Bhogale, Priyanka Thota, Prathushkumar Dathuri, Jongwook Woo

Abstract: Finding suitable health insurance coverage can be challenging for individuals and small enterprises in the USA. The Health Insurance Exchange Public Use Files (Exchange PUFs) dataset provided by CMS offers valuable information on health and dental policies [1]. In this paper, we leverage machine learning algorithms to predict whether a health insurance plan covers routine dental services for adults. By analyzing plan type, region, deductibles, out-of-pocket maximums, and copayments, we employ Logistic Regression, Decision Tree, Random Forest, Gradient Boost, Factorization Model, and Support Vector Machine algorithms. Our goal is to provide a clinical strategy for individuals and families to select the most suitable insurance plan based on income and expenses.

* 4 pages, 13 figures, 5 tables

Using Spark Machine Learning Models to Perform Predictive Analysis on Flight Ticket Pricing Data

Oct 11, 2023

Philip Wong, Phue Thant, Pratiksha Yadav, Ruta Antaliya, Jongwook Woo

Abstract: This paper discusses the predictive performance, measured by R² (R-squared) and RMSE, and the processes undertaken on flight pricing data from a large dataset, originally from Expedia.com, consisting of approximately 20 million records or 4.68 gigabytes. The project aims to determine the best models usable in the real world to predict airline ticket fares for non-stop flights across the US. Therefore, good generalization capability and optimized processing times are important measures for the model. We discover key business insights using feature importance and discuss the process and tools used for our analysis. Four regression machine learning algorithms were used: Random Forest, Gradient Boost Tree, Decision Tree, and Factorization Machines, with Cross Validator and Training Validator functions for assessing performance and generalization capability.
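
The two metrics named in this abstract, RMSE and R², can be illustrated with a minimal plain-Python sketch. This is not the authors' code; the function names and the four sample fares are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical ticket fares and model predictions (illustration only).
fares = [100.0, 200.0, 300.0, 400.0]
predictions = [110.0, 190.0, 310.0, 390.0]

fare_rmse = rmse(fares, predictions)
fare_r2 = r_squared(fares, predictions)
print(f"RMSE = {fare_rmse:.3f}, R^2 = {fare_r2:.3f}")
```

RMSE is expressed in the same unit as the fare, so lower is better, while R² is unitless and approaches 1 as the predictions explain more of the variance in the actual fares.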

* 4 pages, 13 figures, 1 table

Amazon Books Rating prediction & Recommendation Model

Oct 04, 2023

Hsiu-Ping Lin, Suman Chauhan, Yougender Chauhan, Nagender Chauhan, Jongwook Woo

Abstract: This paper uses an Amazon dataset to predict the ratings of books listed on the Amazon website. As part of this project, we predicted the ratings of the books and also built a recommendation cluster. This recommendation cluster recommends books based on column values from the dataset, for instance category, description, author, price, and reviews. This paper describes a workflow for handling big data files, data engineering, building models, and providing predictions. The models predict the book ratings column using various PySpark Machine Learning APIs. Additionally, we tuned hyperparameters and parameters, and used Cross Validation and TrainValidationSplit for generalization. Finally, we compared the accuracies of Binary Classification and Multiclass Classification. We converted our label from multiclass to binary to see whether the two classifications differed. As a result, we found that binary classification yields higher accuracy than multiclass classification.
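
The multiclass-to-binary label conversion mentioned in this abstract can be sketched in plain Python. The abstract does not state how the label was split, so the 1 to 5 star scale, the threshold of 4 stars, and the sample ratings below are all assumptions made for illustration, and the helper names are hypothetical.

```python
def to_binary(rating, threshold=4):
    """Map a star rating to 1 (at or above threshold) or 0 (below).

    The threshold of 4 is an assumed cut-off, not taken from the paper.
    """
    return 1 if rating >= threshold else 0


def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Hypothetical actual and predicted star ratings (illustration only).
actual_ratings = [5, 4, 2, 1, 3]
predicted_ratings = [4, 4, 1, 1, 3]

multiclass_accuracy = accuracy(actual_ratings, predicted_ratings)
binary_accuracy = accuracy(
    [to_binary(r) for r in actual_ratings],
    [to_binary(r) for r in predicted_ratings],
)
print(multiclass_accuracy, binary_accuracy)
```

In this toy data, predictions that miss by one star but stay on the same side of the threshold count as errors in the multiclass setting and as correct in the binary one, which is one plausible reason a binary label can score higher accuracy.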

* 5 pages, 4 figures, 8 tables

Scalable Predictive Time-Series Analysis of COVID-19: Cases and Fatalities

Apr 22, 2021

Shradha Shinde, Jay Joshi, Sowmya Mareedu, Yeon Pyo Kim, Jongwook Woo

Abstract: COVID-19 is an acute disease that began spreading throughout the world in December 2019. It has affected more than 7 million people, and 200 thousand people have died from the infection as of Oct 2020. In this paper, we forecast the number of deaths and confirmed cases in Los Angeles and New York in the United States using traditional and Big Data platforms with the time-series models ARIMA and ETS. We also implemented a more sophisticated time-series forecast model using the Facebook Prophet API. Furthermore, we developed the classification models Logistic Regression and Random Forest regression to show that the weather does not affect the number of confirmed cases. The models are built and run in legacy systems (Azure ML Studio) and Big Data systems (Oracle Cloud and Databricks). We also present the accuracy of the models.

* 8 pages, 7 figures, 4 tables
