Modeling dynamical systems plays a crucial role in capturing and understanding complex physical phenomena. When physical models are not sufficiently accurate or hardly describable by analytical formulas, one can use generic function approximators such as neural networks to capture the system dynamics directly from sensor measurements. As for now, current methods to learn the parameters of these neural networks are highly sensitive to the inherent instability of most dynamical systems of interest, which in turn prevents the study of very long sequences. In this work, we introduce a generic and scalable method based on multiple shooting to learn latent representations of indirectly observed dynamical systems. We achieve state-of-the-art performances on systems observed directly from raw images. Further, we demonstrate that our method is robust to noisy measurements and can handle complex dynamical systems, such as chaotic ones.