Aggregating extra features of novel modality brings great advantages for building robust pedestrian detector under adverse illumination conditions. However, misaligned imagery still persists in multispectral scenario and will depress the performance of detector in a non-trivial way. In this paper, we first present and explore the cross-modality disparity problem in multispectral pedestrian detection, providing insights into the utilization of multimodal inputs. Then, to further address this issue, we propose a novel framework including a region feature alignment module and the region of interest (RoI) jittering training strategy. Moreover, dense, high-quality, and modality-independent color-thermal annotation pairs are provided to scrub the large-scale KAIST dataset to benefit future multispectral detection research. Extensive experiments demonstrate that the proposed approach improves the robustness of detector with a large margin and achieves state-of-the-art performance with high efficiency. Code and data will be publicly available.