The Burrows-Wheeler transform (BWT) is used by the bzip2 family of compressors. In this paper, we present a hardware architecture that implements an inplace algorithm to compute the BWT. Our design does not have explicit storage for the suffix array, or output array. The performance of our implementation is fixed, and does not depend on the input string content. We use a register based character buffer in a scanchain configuration, such that the BWT is computed from right to left, as characters are loaded. Loading new characters is done every six cycles, producing a new output character from the previously computed block at the same rate. Our FGPA implementation does not use block ram instances, and achieves throughput of 66, 35, 18, and 15 MB/s for block sizes of 128 B, 1 kB, 4 kB, and 8 kB. We also report results for an ASIC implementation in 65 nm CMOS that achieves 161 MB/s when using block size of 128 B.