Reconstruction algorithms for DNA-storage systems

Prof. Eitan Yaakobi | Computer Science


Information and Computer Science


A DNA storage system consists of three components: (i) DNA synthesis which produces the oligonucleotides, also called strands, that encode the data ; (ii) a storage container with compartments which stores the DNA strands, however without order ; (iii) sequencing performed to read back a representation of the strands, which are called reads. Current synthesis technologies are not able to generate a single copy for each DNA strand, but only multiple copies where the number of copies is in the order of thousands to millions. The encoding and decoding stages are two processes, that convert the user’s binary data into strands of DNA such that, even in the presence of errors, it will be possible to revert back and recover the original binary data. These two stages consist of three steps: clustering, reconstruction, and error correction. After the strands are read back by sequencing, the first task is to partition them into clusters such that all strands in the same cluster originated from the same synthesized strand. After the clustering step, the goal is to reconstruct each strand based upon all its noisy copies. Lastly, errors which were not corrected by the reconstruction step, mis-clustering errors, lost strands, and any other error mechanisms should be corrected by use of an error-correcting code. Any reconstruction algorithm for the second step is performed on each cluster to recover the original strand from the noisy copies in the cluster. Having several copies for each strand is beneficial since it allows to correct errors that may occur during this process. In fact, this setup falls under the general framework of the string reconstruction problem which refers to recovering a string based upon several noisy copies of it. The novel technology includes new algorithms for these reconstruction problems. The developed algorithms look globally on the entire sequence of the traces and use dynamic programming algorithms, which are used for the shortest common supersequence and the longest common subsequence problems, in order to decode the original sequence. These novel algorithms do not require any limitations on the input and the number of traces.


  • Less limitations on the input
  • Minimization of the error rates and maximization of the success rate
  • Support high success rate with smaller cluster size

Applications and Opportunities

  • DNA-storage systems
arrow Business Development Contacts
Motti Koren
Director of Business Development, Life Sciences