Multimedia content is predominant in the modern Web era. In real-world scenarios, different modalities reveal different aspects of item attributes and usually carry different importance for users' purchase decisions. However, it is difficult for models to identify users' true preferences towards different modalities, because the modalities are strongly statistically correlated. Even worse, this strong correlation may mislead models into learning spurious preferences towards inconsequential modalities. As a result, when the distribution of the data (modal features) shifts, the learned spurious preferences are not guaranteed to be as effective on the inference set as on the training set. We propose a novel MOdality DEcorrelating STable learning framework, MODEST for brevity, to learn users' stable preferences. Inspired by sample re-weighting techniques, the proposed method estimates a weight for each item such that the features from different modalities are decorrelated in the weighted distribution. We adopt the Hilbert-Schmidt Independence Criterion (HSIC) as the independence measure; it is a kernel-based method capable of evaluating the degree of correlation between two multi-dimensional variables, including non-linear relationships. Our method can serve as a plug-and-play module for existing multimedia recommendation backbones. Extensive experiments on four public datasets and four state-of-the-art multimedia recommendation backbones show that the proposed method improves performance by a large margin.
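To make the role of HSIC under sample weights concrete, the following is a minimal sketch and not the MODEST implementation. It assumes a Gaussian kernel with a fixed bandwidth `sigma`, and it evaluates HSIC on the weighted empirical distribution by centering each Gram matrix with the normalized weights; with uniform weights this reduces to the standard biased estimator tr(KHLH)/n^2. The names `gaussian_kernel`, `weighted_hsic`, `visual`, `textual_dep`, and `textual_ind` are hypothetical, the features are synthetic, and how the weights are actually optimized is not shown.

```python
import numpy as np


def gaussian_kernel(x, sigma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel over the rows of x."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def weighted_hsic(x, y, weights=None, sigma=1.0):
    """HSIC between x and y under per-sample weights (illustrative assumption).

    x, y: arrays of shape (n, d_x) and (n, d_y), one row per item.
    weights: non-negative per-item weights; None means uniform weights.
    """
    n = x.shape[0]
    if weights is None:
        weights = np.ones(n)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()  # normalize to a probability vector

    # Centering under the weighted empirical distribution: C = I - 1 p^T.
    c = np.eye(n) - np.outer(np.ones(n), p)
    k = c @ gaussian_kernel(x, sigma) @ c.T
    l = c @ gaussian_kernel(y, sigma) @ c.T

    # Weighted average of the product of the centered kernels.
    return float(p @ (k * l) @ p)


# Synthetic stand-ins for two modality feature matrices (not real data).
rng = np.random.default_rng(0)
visual = rng.normal(size=(300, 2))
textual_dep = np.sin(visual) + 0.1 * rng.normal(size=(300, 2))  # depends on visual
textual_ind = rng.normal(size=(300, 2))                         # independent of visual

dep_score = weighted_hsic(visual, textual_dep)
ind_score = weighted_hsic(visual, textual_ind)
```

In this sketch, a re-weighting scheme in the spirit of the abstract would search for `weights` that drive `weighted_hsic` between modality features towards zero; the score for the dependent pair should come out larger than the score for the independent pair.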