To support modern machine-type communications, a crucial task during the random access phase is device activity detection, which identifies the active devices among a large number of potential devices based on the signal received at the access point. By exploiting the statistical properties of the channel, state-of-the-art covariance-based methods have been shown to achieve better activity detection performance than compressed-sensing-based methods. However, covariance-based methods must solve a high-dimensional nonconvex optimization problem by sequentially updating the estimated activity status of each device. Since the number of updates is proportional to the number of devices, the resulting computational complexity and delay make the iterative updates difficult to implement in real time, especially as the number of devices scales up. Inspired by the success of deep learning in real-time inference, this paper proposes a learning-based method with a customized heterogeneous transformer architecture for device activity detection. By adopting an attention mechanism in the architecture design, the proposed method is able to extract the relevance between the device pilots and the received signal, is permutation equivariant with respect to devices, and scales to different numbers of devices. Simulation results demonstrate that the proposed method achieves better activity detection performance with much shorter computation time than the state-of-the-art covariance approach, and generalizes well to different numbers of devices, numbers of base-station (BS) antennas, and signal-to-noise ratios.
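The two structural properties attributed to the attention mechanism, permutation equivariance with respect to devices and adaptability to different numbers of devices, can be illustrated with a minimal sketch. It uses a plain single-head self-attention layer with random weights and is not the heterogeneous transformer proposed in this paper. All names (`self_attention`, `rand_matrix`, the token dimension `d`) are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention; one token (row) per device.

    The projection weights are shared by all tokens, so the layer has
    no parameter that depends on the number or order of devices.
    """
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(wq[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random matrix standing in for learned weights or device features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


d = 4  # assumed token dimension, for illustration only
wq, wk, wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

# Permutation equivariance: reordering the devices reorders the outputs
# in the same way and changes nothing else.
tokens = rand_matrix(6, d)
perm = list(range(6))
rng.shuffle(perm)
out = self_attention(tokens, wq, wk, wv)
out_perm = self_attention([tokens[i] for i in perm], wq, wk, wv)
max_err = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)

# Scale adaptability: the same weights process a different number of devices.
out_large = self_attention(rand_matrix(10, d), wq, wk, wv)
```

Because the projections are applied token by token and the softmax weights depend only on pairwise scores, `max_err` stays at floating-point rounding level, and the same weights return one output row per device for any device count.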