Abstract: Nanopore sequencing enables real-time analysis of long DNA sequences at low cost, opening up new applications such as early detection of cancer. Because nanopore measurements are complex and ground-truth datasets are expensive to obtain, there is a need for nanopore simulators. Existing simulators rely on handcrafted rules and parameters and do not learn an internal representation that would allow underlying biological factors of interest to be analysed. Instead, we propose VADA, a purely data-driven method for simulating nanopores based on an autoregressive latent variable model. We embed subsequences of DNA and introduce a conditional prior to address the challenge of collapsing conditioning. We introduce an auxiliary regressor on the latent variable to encourage our model to learn an informative latent representation. We empirically demonstrate that our model achieves competitive simulation performance on experimental nanopore data. Moreover, we show that the model learns an informative latent representation that is predictive of the DNA labels. We hypothesize that other biological factors of interest, beyond the DNA labels, can be extracted from such a learned latent representation.
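As a rough illustration of how a conditional prior and an auxiliary regressor could enter a latent variable objective, the following is a generic sketch and is not taken from the paper. All symbols are assumptions introduced here: x is the measured signal, c the embedded DNA subsequence, z the latent variable, q_phi the approximate posterior, p_theta the decoder and conditional prior, r_psi a hypothetical regressor from z to the DNA label y, ell a regression or classification loss, and lambda a weighting coefficient. The autoregressive structure mentioned in the abstract is omitted for brevity.

\[
\mathcal{L}(\theta,\phi,\psi)
= -\,\mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]
+ \mathrm{KL}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)
+ \lambda\,\mathbb{E}_{q_\phi(z \mid x, c)}\big[\ell\big(r_\psi(z),\, y\big)\big]
\]

In this sketch, the first two terms form a negative evidence lower bound in which the prior p_theta(z | c) depends on the conditioning c instead of being a fixed standard normal. The third term penalises latent codes from which the label cannot be predicted, which is one plausible way to encourage an informative latent representation.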
Abstract: Neural networks are emerging as a tool for scalable data-driven simulation of high-dimensional dynamical systems, especially in settings where numerical methods are infeasible or computationally expensive. Notably, it has been shown that incorporating domain symmetries in deterministic neural simulators can substantially improve their accuracy, sample efficiency, and parameter efficiency. However, to incorporate symmetries in probabilistic neural simulators that can simulate stochastic phenomena, we need a model that produces equivariant distributions over trajectories, rather than equivariant function approximations. In this paper, we propose Equivariant Probabilistic Neural Simulation (EPNS), a framework for autoregressive probabilistic modeling of equivariant distributions over system evolutions. We use EPNS to design models for a stochastic n-body system and stochastic cellular dynamics. Our results show that EPNS considerably outperforms existing neural network-based methods for probabilistic simulation. More specifically, we demonstrate that incorporating equivariance in EPNS improves simulation quality, data efficiency, rollout stability, and uncertainty quantification. We conclude that EPNS is a promising method for efficient and effective data-driven probabilistic simulation in a diverse range of domains.
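To make the distinction between an equivariant function and an equivariant distribution concrete, the standard definitions can be written as follows. The notation is an assumption introduced here and does not come from the paper: G is a symmetry group acting on states, x_t is the system state at step t, f is a deterministic one-step simulator, and p is a conditional density. The sketch further assumes a first-order Markov factorisation and a group action that preserves volume, so that no Jacobian term appears.

\[
\text{Equivariant function:}\quad f(g \cdot x_t) = g \cdot f(x_t)
\qquad \forall g \in G,
\]
\[
\text{Equivariant distribution:}\quad p(g \cdot x_{t+1} \mid g \cdot x_t) = p(x_{t+1} \mid x_t)
\qquad \forall g \in G.
\]

Under these assumptions, equivariance of the one-step conditional carries over to whole trajectories of an autoregressive model:

\[
p(g \cdot x_{1:T} \mid g \cdot x_0)
= \prod_{t=0}^{T-1} p(g \cdot x_{t+1} \mid g \cdot x_t)
= \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t)
= p(x_{1:T} \mid x_0).
\]

In words, transforming the initial state and the whole rollout by the same group element leaves the probability of that rollout unchanged.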