Abstract: Educational systems have traditionally been evaluated using cross-sectional studies, namely, examining a pretest, a posttest, and a single intervention. Although this approach is popular, it does not model valuable information such as confounding variables, feedback to students, and other real-world deviations of studies from ideal conditions. Moreover, learning is inherently a sequential process and should involve a sequence of interventions. In this paper, we propose various experimental and quasi-experimental designs for educational systems and quantify them using the language of graphical models and directed acyclic graphs (DAGs). We discuss the applications and limitations of each method in education. Furthermore, we propose to model the education system in terms of time-varying treatments, confounders, and time-varying treatment-confounder feedback. We show that if we control for a sufficient set of confounders and use appropriate inference techniques such as inverse probability of treatment weighting (IPTW) or the g-formula, we can close the backdoor paths and obtain an unbiased causal estimate of the effect of joint interventions on the outcome. Finally, we compare the performance of the g-formula and IPTW and discuss the pros and cons of each method.
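As a rough illustration of the two estimators named in this abstract, the sketch below applies the g-formula (standardization) and IPTW to a toy population. It is a simplification and not the paper's method: it assumes a single binary treatment A, a single binary confounder L, and a binary outcome Y, whereas the paper considers time-varying treatments. The counts in `ROWS` and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy population: (confounder L, treatment A, outcome Y, count).
# Assumption: L is the only confounder, so adjusting for it closes the backdoor path.
ROWS = [
    (0, 0, 1, 8), (0, 0, 0, 32),
    (0, 1, 1, 4), (0, 1, 0, 6),
    (1, 0, 1, 5), (1, 0, 0, 5),
    (1, 1, 1, 28), (1, 1, 0, 12),
]


def _counts(rows):
    """Return counts per L, counts per (L, A), and outcome sums per (L, A)."""
    n_l, n_la, y_la = defaultdict(float), defaultdict(float), defaultdict(float)
    for l, a, y, n in rows:
        n_l[l] += n
        n_la[(l, a)] += n
        y_la[(l, a)] += n * y
    return n_l, n_la, y_la


def naive_difference(rows):
    """Unadjusted E[Y | A=1] - E[Y | A=0]; biased when L affects both A and Y."""
    total, events = defaultdict(float), defaultdict(float)
    for _, a, y, n in rows:
        total[a] += n
        events[a] += n * y
    return events[1] / total[1] - events[0] / total[0]


def g_formula(rows):
    """Standardization: average E[Y | A=a, L=l] over the marginal distribution of L."""
    n_l, n_la, y_la = _counts(rows)
    n_total = sum(n_l.values())

    def mean_under(a):
        return sum(n_l[l] / n_total * y_la[(l, a)] / n_la[(l, a)] for l in n_l)

    return mean_under(1) - mean_under(0)


def iptw(rows):
    """Weight each unit by 1 / P(A=a | L=l), then compare weighted outcome means."""
    n_l, n_la, _ = _counts(rows)
    num, den = defaultdict(float), defaultdict(float)
    for l, a, y, n in rows:
        weight = n_l[l] / n_la[(l, a)]  # inverse of the estimated propensity
        num[a] += weight * n * y
        den[a] += weight * n
    return num[1] / den[1] - num[0] / den[0]


if __name__ == "__main__":
    print("naive:", round(naive_difference(ROWS), 4))
    print("g-formula:", round(g_formula(ROWS), 4))
    print("IPTW:", round(iptw(ROWS), 4))
```

On this toy table the unadjusted difference is 0.38, while both adjusted estimators give 0.20. The two adjusted estimates coincide here because the sketch estimates every conditional probability nonparametrically; with parametric models for the outcome or the treatment they can differ.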
Abstract: Synthetic data is widely used in various domains because many modern algorithms require large amounts of data for efficient training, and data collection and labeling are usually time-consuming and prone to errors. Furthermore, some real-world data is confidential by nature and cannot be shared. Bayesian networks are a type of probabilistic graphical model widely used to model the uncertainties in real-world processes. Dynamic Bayesian networks are a special class of Bayesian networks that model temporal and time series data. In this paper, we introduce tsBNgen, a Python library to generate time series and sequential data based on an arbitrary dynamic Bayesian network. The package, documentation, and examples can be downloaded from https://github.com/manitadayon/tsBNgen.
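To make the idea of generating sequences from a dynamic Bayesian network concrete, here is a minimal hand-written sketch using only the Python standard library. It does not use or reproduce the tsBNgen API. It assumes the simplest possible structure, one discrete hidden node and one discrete observed node per time step (a hidden Markov model), and every probability table and function name below is hypothetical.

```python
import random

# Hypothetical conditional probability tables for a two-state, two-symbol network.
INITIAL = [0.6, 0.4]                    # P(state at t=0)
TRANSITION = [[0.8, 0.2], [0.3, 0.7]]   # P(state at t+1 | state at t)
EMISSION = [[0.9, 0.1], [0.2, 0.8]]     # P(observation at t | state at t)


def sample_categorical(probs, rng):
    """Draw one index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_series(length, rng):
    """Ancestral sampling: draw each node after its parents, one time step at a time."""
    states, observations = [], []
    state = sample_categorical(INITIAL, rng)
    for _ in range(length):
        states.append(state)
        observations.append(sample_categorical(EMISSION[state], rng))
        state = sample_categorical(TRANSITION[state], rng)
    return states, observations


def generate_dataset(n_series, length, seed=0):
    """Generate several independent (states, observations) pairs reproducibly."""
    rng = random.Random(seed)
    return [generate_series(length, rng) for _ in range(n_series)]


if __name__ == "__main__":
    for states, observations in generate_dataset(n_series=2, length=8, seed=1):
        print("states:      ", states)
        print("observations:", observations)
```

Because the hidden states are returned alongside the observations, data generated this way comes with labels, which is one reason synthetic data is convenient for training and evaluation.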
Abstract: Time series and sequential data have gained significant attention recently, since many real-world processes in domains such as finance, education, biology, and engineering can be modeled as time series. Although many algorithms and methods, such as the Kalman filter, the hidden Markov model, and long short-term memory (LSTM), have been proposed to make inferences and predictions from such data, their usage depends significantly on the application, the type of problem, the available data, and the required accuracy or loss. In this paper, we compare the supervised and unsupervised hidden Markov model to LSTM in terms of the amount of data needed for training, complexity, and forecasting accuracy. Moreover, we propose various techniques to discretize the observations and convert the problem to a discrete hidden Markov model under stationary and non-stationary situations. Our results indicate that even an unsupervised hidden Markov model can outperform LSTM when a massive amount of labeled data is not available. Furthermore, we show that the hidden Markov model can still be an effective method for processing sequence data even when the first-order Markov assumption is not satisfied.
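The step of converting continuous observations into a discrete hidden Markov model problem can be sketched as follows. The abstract does not specify which discretization techniques the paper proposes, so equal-width binning is used here purely as an assumed stand-in. The forward algorithm then scores the discretized sequence under a discrete model. All function names and parameter values are hypothetical.

```python
import math


def discretize(values, n_bins):
    """Map continuous values to bin indices 0..n_bins-1 using equal-width bins.

    Assumption: equal-width binning over the observed range; the paper's own
    discretization techniques may differ.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def forward_log_likelihood(obs, initial, transition, emission):
    """Scaled forward algorithm: log P(obs) under a discrete hidden Markov model."""
    n_states = len(initial)
    alpha = [initial[s] * emission[s][obs[0]] for s in range(n_states)]
    log_likelihood = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * transition[r][s] for r in range(n_states))
                * emission[s][obs[t]]
                for s in range(n_states)
            ]
        # Rescale at every step so long sequences do not underflow.
        scale = sum(alpha)
        log_likelihood += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_likelihood


if __name__ == "__main__":
    raw = [0.1, 0.4, 2.2, 3.9, 3.5, 0.3]
    symbols = discretize(raw, n_bins=2)
    initial = [0.6, 0.4]
    transition = [[0.7, 0.3], [0.4, 0.6]]
    emission = [[0.9, 0.1], [0.2, 0.8]]
    print(symbols)
    print(forward_log_likelihood(symbols, initial, transition, emission))
```

Equal-width bins assume the range of the observations is stable over time; for non-stationary series the bin edges would need to adapt, for example by being recomputed over a sliding window.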
Abstract: Time series forecasting has gained considerable attention recently because many real-world phenomena can be modeled as time series. The massive volume of available data and recent advances in the processing power of computers enable researchers to develop more sophisticated machine learning algorithms, such as neural networks, to forecast time series data. In this paper, we propose various neural network architectures to forecast time series data using dynamic measurements; moreover, we introduce various architectures for combining static and dynamic measurements for forecasting. We also investigate the effect of techniques such as anomaly detection and clustering on forecasting accuracy. Our results indicate that clustering can reduce the overall prediction time and improve the forecasting performance of the neural network. Furthermore, we show that feature-based clustering can outperform distance-based clustering in terms of speed and efficiency. Finally, our results indicate that adding more predictors to forecast the target variable does not necessarily improve the forecasting accuracy.
Abstract: Contributions: Prior studies on education have mostly followed the model of the cross-sectional study, namely, examining the pretest and posttest scores. This paper shows that students' knowledge throughout the intervention can be estimated by time series analysis using a hidden Markov model. Background: Analyzing time series and the interaction between the students and the game data can yield valuable information that cannot be gained from cross-sectional studies of the exams alone. Research Questions: Can a hidden Markov model be used to analyze educational games? Can a hidden Markov model be used to predict students' performance? Methodology: The study was conducted on N = 854 students who played the Save Patch game. Students were divided into class 1 and class 2, where class 1 students are those who scored lower on the test than class 2 students. The analysis is done by choosing various features of the game as the observations. Findings: The state trajectories can predict the students' performance accurately for both class 1 and class 2.
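A state trajectory of the kind mentioned in the findings can be obtained from a fitted hidden Markov model with the Viterbi algorithm. The sketch below is a generic illustration and not the paper's analysis pipeline: it assumes two hidden states (read here as lower and higher knowledge), two discrete observation symbols standing in for a game feature, strictly positive probabilities, and hypothetical parameter values.

```python
import math


def viterbi(obs, initial, transition, emission):
    """Return the most likely hidden state sequence for a discrete observation sequence.

    Assumption: all probabilities are strictly positive, so every log is defined.
    """
    n_states = len(initial)
    score = [
        math.log(initial[s]) + math.log(emission[s][obs[0]])
        for s in range(n_states)
    ]
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at this time step.
            best_prev = max(
                range(n_states),
                key=lambda r: score[r] + math.log(transition[r][s]),
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev]
                + math.log(transition[best_prev][s])
                + math.log(emission[s][symbol])
            )
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full trajectory.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical parameters: state 0 = lower knowledge, state 1 = higher knowledge;
    # symbol 0 = failed attempt, symbol 1 = successful attempt.
    initial = [0.6, 0.4]
    transition = [[0.7, 0.3], [0.4, 0.6]]
    emission = [[0.9, 0.1], [0.2, 0.8]]
    print(viterbi([0, 0, 1, 1], initial, transition, emission))
```

With these hypothetical parameters, the observation sequence `[0, 0, 1, 1]` decodes to the trajectory `[0, 0, 1, 1]`, that is, a move from the first hidden state to the second partway through the sequence.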