Massive multiuser (MU) multiple-input multiple-output (MIMO) enables concurrent transmission of multiple users to a multi-antenna basestation (BS). To detect the users' data using linear equalization, the BS must perform preprocessing, which requires, among other tasks, the inversion of a matrix whose dimension equals the number of user data streams. Explicit inversion of large matrices is notoriously difficult to implement due to high complexity, stringent data dependencies that lead to high latency, and high numerical precision requirements. We propose a novel preprocessing architecture based on the block-LDL matrix factorization, which improves parallelism and, hence, reduces latency. We demonstrate the effectiveness of our architecture through (i) massive MU-MIMO system simulations with mmWave channel vectors and (ii) measurements of a 22FDX ASIC, which is, to our knowledge, the first fabricated preprocessing engine for massive MU-MIMO with 64 BS antennas and 16 single-antenna users. Our ASIC reaches a clock frequency of 870 MHz while consuming 416 mW. At its peak throughput, the ASIC preprocesses 1.44 M 64$\times$16 matrices per second at a latency of only 0.7 $\mu$s.