Abstract:As a medium for cold data storage, DNA stands out as it promises significant gains in storage capacity and lifetime. However, it comes with its own data processing challenges to overcome. Constrained codes over the DNA alphabet $\{A,T,G,C\}$ have been used to design DNA sequences that are free of long homopolymers to increase stability, yet effective error detection and error correction are required to achieve reliability in data retrieval. Recently, we introduced lexicographically-ordered constrained (LOCO) codes, namely DNA LOCO (D-LOCO) codes, with error detection. In this paper, we equip our D-LOCO codes with error correction for substitution errors via syndrome-like decoding, designated as residue decoding. We only use D-LOCO codewords of indices divisible by a suitable redundancy metric $R(m) > 0$, where $m$ is the code length, for error correction. We provide the community with a construction of constrained codes forbidding runs of length higher than fixed $\ell \in \{1,2,3\}$ and $GC$-content in $\big [0.5-\frac{1}{2K},0.5+\frac{1}{2K}\big ]$ that correct $K$ segmented substitution errors, one per codeword. We call the proposed codes error-correction (EC) D-LOCO codes. We also give a list-decoding procedure with near-quadratic time-complexity in $m$ to correct double-substitution errors within EC D-LOCO codewords, which has $> 98.20\%$ average success rate. The redundancy metric is projected to require $2\log_2(m)+O(1)$-bit allocation for a length-$m$ codeword. Hence, our EC D-LOCO codes are projected to be capacity-approaching with respect to the error-free constrained system.
Abstract:Because of their excellent asymptotic and finite-length performance, spatially-coupled (SC) codes are a class of low-density parity-check codes that is gaining increasing attention. Multi-dimensional (MD) SC codes are constructed by connecting copies of an SC code via relocations in order to mitigate various sources of non-uniformity and improve performance in many data storage and data transmission systems. As the number of degrees of freedom in the MD-SC code design increases, appropriately exploiting them becomes more difficult because of the complexity growth of the design process. In this paper, we propose a probabilistic framework for the MD-SC code design, which is based on the gradient-descent (GD) algorithm, to design better MD codes and address this challenge. In particular, we express the expected number of short cycles, which we seek to minimize, in the graph representation of the code in terms of entries of a probability-distribution matrix that characterizes the MD-SC code design. We then find a locally-optimal probability distribution, which serves as the starting point of a finite-length algorithmic optimizer that produces the final MD-SC code. We offer the theoretical analysis as well as the algorithms, and we present experimental results demonstrating that our MD codes, conveniently called GD-MD codes, have notably lower short cycle numbers compared with the available state-of-the-art. Moreover, our algorithms converge on solutions in few iterations, which confirms the complexity reduction as a result of limiting the search space via the locally-optimal GD-MD distributions.
Abstract:DNA strands serve as a storage medium for $4$-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replicability. However, information in this intriguing storage technology might be corrupted. Experiments have revealed that DNA sequences with long homopolymers and/or with low $GC$-content are notably more subject to errors upon storage. This paper investigates the utilization of the recently-introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage. This paper introduces DNA LOCO (D-LOCO) codes, over the alphabet $\{A,T,G,C\}$ with limited runs of identical symbols. These codes come with an encoding-decoding rule we derive, which provides affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balancing over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetecting errors. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates at moderate lengths.