Abstract: This paper introduces a lightweight vision transformer for automatic sleep staging on a wearable device. The model is trained on the MASS SS3 dataset and achieves 82.9% accuracy on a 4-stage classification task with only 31.6k parameters. The model is implemented in hardware and synthesized in 65 nm CMOS. Over a 45.6 ms inference, the accelerator consumes 6.54 mW of dynamic power and 11.0 mW of leakage power; with aggressive power gating while the accelerator is idle, the effective power consumption is calculated to be 0.56 mW. The accelerator occupies only 0.754 mm² of silicon and runs at a 379 MHz clock. These metrics are enabled by a layer-dependent fixed-point format and data width, and by a window-average filter on the final softmax layer of the vision transformer.
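
The window-average filter mentioned above can be illustrated with a minimal sketch. The idea, as described, is to smooth the per-epoch softmax outputs over a sliding window so that isolated misclassifications are suppressed; the window size, function name, and causal (trailing-window) formulation below are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def window_average(probs, window=3):
    """Smooth per-epoch softmax outputs with a sliding window average.

    probs: (n_epochs, n_stages) array of softmax probabilities.
    Returns an array of the same shape where each epoch's vector is
    the mean of up to `window` epochs ending at that epoch (causal).
    Window size and causality are illustrative assumptions.
    """
    smoothed = np.empty_like(probs, dtype=float)
    for i in range(len(probs)):
        start = max(0, i - window + 1)
        smoothed[i] = probs[start:i + 1].mean(axis=0)
    return smoothed

# Toy example: 4-stage probabilities over 6 epochs with one noisy epoch.
p = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.10, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],   # transient misclassification
    [0.88, 0.07, 0.03, 0.02],
    [0.90, 0.05, 0.03, 0.02],
    [0.90, 0.05, 0.03, 0.02],
])
stages = window_average(p, window=3).argmax(axis=1)
# The smoothed argmax suppresses the single-epoch outlier at index 2.
```

Averaging across neighboring epochs is a natural fit for sleep staging, since true stage transitions are slow relative to a 30-second scoring epoch, so a brief disagreement with its neighbors is more likely noise than a real transition.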