Weekly Shaarli
Week 52 (December 25, 2023)
We investigate the unusual way the memory subsystem interacts with branch prediction, and how this interaction shapes software performance.
Recently, on LinkedIn, I read a post by an engineer who was surprised that his new, optimized version of a parser was slower than the original. The optimization consisted of removing branches, which, according to common street wisdom, are the root of all evil, right? Yet his new version was slower, and a benchmark opened his eyes.
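To make the trade-off concrete, here is a minimal sketch (not the parser from the post, whose code is not shown) of what "removing branches" typically means: replacing an `if` with arithmetic on the comparison result. On predictable data, the branchy version is often just as fast or faster, because the CPU's branch predictor guesses correctly almost every time, while the branchless version pays its cost unconditionally.

```python
def count_matches_branchy(data, threshold):
    # Straightforward version: the `if` maps to a conditional branch.
    # Cheap when the branch predictor can learn the pattern in `data`.
    count = 0
    for x in data:
        if x >= threshold:
            count += 1
    return count

def count_matches_branchless(data, threshold):
    # "Optimized" version: the comparison result (0 or 1) is added
    # directly, so there is no data-dependent branch to mispredict.
    # This only wins when the branch above is genuinely unpredictable.
    count = 0
    for x in data:
        count += x >= threshold
    return count

data = [3, 7, 1, 9, 4, 8, 2, 6]
assert count_matches_branchy(data, 5) == count_matches_branchless(data, 5) == 4
```

The lesson of the post is that the second form is not automatically faster: only a benchmark on representative data can tell, because the winner depends on how predictable the branch actually is.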
We've all been there: the trains you're servicing for a customer suddenly brick themselves and the manufacturer claims that's because you...
We demonstrate a high-performance vendor-agnostic method for massively parallel solving of ensembles of ordinary differential equations (ODEs) and stochastic differential equations (SDEs) on GPUs. The method is integrated with a widely used differential equation solver library in a high-level language (Julia's DifferentialEquations.jl) and enables GPU acceleration without requiring code changes by the user. Our approach achieves state-of-the-art performance compared to hand-optimized CUDA-C++ kernels while performing 20–100× faster than the vectorizing map (vmap) approach implemented in JAX and PyTorch. Performance evaluation on NVIDIA, AMD, Intel, and Apple GPUs demonstrates performance portability and vendor-agnosticism. We show composability with MPI to enable distributed multi-GPU workflows. The implemented solvers are fully featured, supporting event handling, automatic differentiation, and incorporation of datasets via the GPU's texture memory, allowing scientists to take advantage of GPU acceleration on all major current architectures without changing their model code and without loss of performance. We distribute the software as an open-source library at https://github.com/SciML/DiffEqGPU.jl