Inference of three-dimensional motion from the fusion of inertial and visual sensory data has to contend with the preponderance of outliers in the latter. Robust filtering deals with the joint inference and classification task of selecting which data fits the model, and estimating its state. We derive the optimal discriminant and propose several approximations, some used in the literature, others new. We compare them analytically, by pointing to the assumptions underlying their approximations, and empirically. We show that the best performing method improves the performance of state-of-the-art visual-inertial sensor fusion systems, while retaining the same computational complexity.