Abstract:Kubernetes offers a powerful orchestration platform for machine learning training, but memory management can be challenging due to specialized needs and resource constraints. This paper outlines how Kubernetes handles memory requests, limits, Quality of Service classes, and eviction policies for ML workloads, with special focus on GPU memory and ephemeral storage. Common pitfalls such as overcommitment, memory leaks, and ephemeral volume exhaustion are examined. We then provide best practices for stable, scalable memory utilization to help ML practitioners prevent out-of-memory events and ensure high-performance ML training pipelines.
Abstract:Agent-based models (ABM) provide an excellent framework for modeling outbreaks and interventions in epidemiology by explicitly accounting for diverse individual interactions and environments. However, these models are usually stochastic and highly parametrized, requiring precise calibration for predictive performance. When considering realistic numbers of agents and properly accounting for stochasticity, this high dimensional calibration can be computationally prohibitive. This paper presents a random forest based surrogate modeling technique to accelerate the evaluation of ABMs and demonstrates its use to calibrate an epidemiological ABM named CityCOVID via Markov chain Monte Carlo (MCMC). The technique is first outlined in the context of CityCOVID's quantities of interest, namely hospitalizations and deaths, by exploring dimensionality reduction via temporal decomposition with principal component analysis (PCA) and via sensitivity analysis. The calibration problem is then presented and samples are generated to best match COVID-19 hospitalization and death numbers in Chicago from March to June in 2020. These results are compared with previous approximate Bayesian calibration (IMABC) results and their predictive performance is analyzed showing improved performance with a reduction in computation.