Abstract:Recent advancements in nanopore sequencing technology, particularly the R10 nanopore from Oxford Nanopore Technology, have necessitated the development of improved data processing methods to utilize their potential for more than 9-mer resolution fully. The processing of the ion currents predominantly utilizes neural network-based methods known for their high basecalling accuracy but face developmental bottlenecks at higher resolutions. In light of this, we introduce the Helicase Hidden Markov Model (HHMM), a novel framework designed to incorporate the dynamics of the helicase motor protein alongside the nucleotide sequence during nanopore sequencing. This model supports the analysis of millions of distinct states, enhancing our understanding of raw ion currents and their alignment with nucleotide sequences. Our findings demonstrate the utility of HHMM not only as a potent visualization tool but also as an effective base for developing advanced basecalling algorithms. This approach offers a promising avenue for leveraging the full capabilities of emerging high-resolution nanopore sequencing technologies.
Abstract:Inferring a state sequence from a sequence of measurements is a fundamental problem in bioinformatics and natural language processing. The Viterbi and the Beam Search (BS) algorithms are popular inference methods, but they have limitations when applied to Hierarchical Hidden Markov Models (HHMMs), where the interest lies in the outer state sequence. The Viterbi algorithm can not infer outer states without inner states, while the BS algorithm requires marginalization over prohibitively large state spaces. We propose two new algorithms to overcome these limitations: the greedy marginalized BS algorithm and the local focus BS algorithm. We show that they approximate the most likely outer state sequence with higher performance than the Viterbi algorithm, and we evaluate the performance of these algorithms on an explicit duration HMM with simulation and nanopore base calling data.