Abstract:Run-by-run variability in parallel programs caused by floating-point non-associativity (FPNA) has been known to significantly affect reproducibility in iterative algorithms, due to accumulating errors. Non-reproducibility negatively affects efficiency and effectiveness of correctness testing for stochastic programs. Recently, the sensitivity of deep learning (DL) training and inference pipelines to FPNA have been found to be extreme, and can prevent certification for commercial applications, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing applications have coupled DL models with high-performance computing (HPC) simulations, leading to an aggravation of debugging and testing challenges. Here we perform an investigation of the statistical properties of FPNA within modern parallel programming models, analyze performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs, and examine the recently-added deterministic options within the PyTorch framework within the context of GPU deployment, uncovering and quantifying the impacts of input parameters triggering run-by-run variability and reporting on the reliability and completeness of the documentation. Finally, we evaluate the strategy of exploiting automatic determinism provided by deterministic hardware, using the Groq LPU$^{TM}$ accelerator for inference portions of the DL pipeline. We demonstrate the benefits that this strategy can provide within reproducibility and correctness efforts.
Abstract:This paper presents a methodology for using LLVM-based tools to tune the DCA++ (dynamical clusterapproximation) application that targets the new ARM A64FX processor. The goal is to describethe changes required for the new architecture and generate efficient single instruction/multiple data(SIMD) instructions that target the new Scalable Vector Extension instruction set. During manualtuning, the authors used the LLVM tools to improve code parallelization by using OpenMP SIMD,refactored the code and applied transformation that enabled SIMD optimizations, and ensured thatthe correct libraries were used to achieve optimal performance. By applying these code changes, codespeed was increased by 1.98X and 78 GFlops were achieved on the A64FX processor. The authorsaim to automatize parts of the efforts in the OpenMP Advisor tool, which is built on top of existingand newly introduced LLVM tooling.