DNA gets competition as preferred molecular data storage medium

Researchers are making headway in conveniently accessing information stored in synthetic polymers.

Paul van Gerven

Storing data other than genetic code in DNA is a well-researched subject. A thimble of the stuff could contain the entire YouTube archive and, thanks to the wonders of molecular biology, all the tools to synthesize and read out the biopolymer are readily available. Add to that the relatively robust nature of DNA, and it’s clear why encoding data with nature’s genetic building blocks is thought to be particularly useful for archival storage, i.e. for long-term storage of data that doesn’t need to be accessed often.

In principle, synthetic polymers could do even better. Possible advantages include simpler synthesis, higher storage density and better stability. Their downside is the read-out: there are no off-the-shelf sequencing tools available. The only decoding method that has proven fast and trustworthy so far has been mass spectrometry (MS), a widely used analytical technique that can be used to measure the weight of a molecule (technically: the mass-to-charge ratio of the ionized molecule).

The basic principle of synthetic-polymer data storage is as follows. Two distinct molecular units are chosen to encode a 0 and a 1. These are strung together to define bytes and sequences of bytes. ‘Separator units’ are introduced at well-defined positions in the polymeric chain to mark the beginning and end of specific sequences. Sometimes, molecular labels are added to facilitate sequencing.

Sequencing based on mass is possible in the first place because ions (partially) break up and rearrange into smaller fragments inside a mass spectrometer according to well-defined rules. Through careful design of the encoding molecular units and the use of molecular labels, and by analyzing fragments of fragments (a process called tandem mass spectrometry or MS/MS), it’s possible to determine the sequence of encoding units.

Random access

When using mass spectrometry, the size of the molecules must be limited, which in turn limits the storage capacity of each polymer chain. In addition, the complete chain must be decoded in sequence, building block by building block – the bits of interest can’t be accessed directly. It’s like having to read through an entire book instead of just opening it to the relevant page. In contrast, long chains of DNA can be cut into fragments of random length, sequenced individually and then computationally reconstructed into the original sequence.

Researchers at Seoul National University have developed a method for efficiently decoding very long synthetic polymer chains whose molecular weights greatly exceed the analytical limits of MS and MS/MS. As a demonstration, the team encoded their university address in ASCII and translated this into binary code, together with a cyclic redundancy check (CRC) error detection code. This 512-bit sequence was stored in a polymer chain made of two different monomers: lactic acid to represent a 1 and phenyllactic acid to represent a 0. They also included fragmentation codes containing mandelic acid. When chemically activated, the chains break at those locations. In their demonstration, the researchers obtained 18 fragments of various sizes that could be individually decoded by MS/MS sequencing.

The name and address of the SNU researchers (496 bits) plus a 16-bit CRC were encoded in a 512-unit polymer using two different building blocks (shown as red and blue in this block diagram representation). An additional unit (orange) was introduced to enable breaking up the large polymer into fragments that a mass spectrometer can handle. Credit: Angewandte Chemie/SNU
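The encoding step described above can be illustrated with a minimal Python sketch. It is not the SNU team's software: the CRC variant (CRC-CCITT via the standard library's `binascii.crc_hqx`), the one-letter monomer codes and all function names are assumptions made for illustration. Only the bit-to-monomer assignment and the 496 + 16 = 512 bit layout come from the article.

```python
import binascii

# Bit-to-monomer assignment as reported in the article.
# The one-letter codes are hypothetical shorthand.
MONOMER_FOR_BIT = {"1": "L", "0": "P"}  # L = lactic acid, P = phenyllactic acid


def text_to_bits(text: str) -> str:
    """Return the 8-bit ASCII encoding of text as a string of '0'/'1'."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def append_crc16(bits: str) -> str:
    """Append a 16-bit CRC to a bit string whose length is a multiple of 8.

    Assumption: CRC-CCITT with initial value 0xFFFF; the article does not
    say which CRC-16 variant was actually used.
    """
    payload = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    crc = binascii.crc_hqx(payload, 0xFFFF)
    return bits + f"{crc:016b}"


def check_crc16(bits_with_crc: str) -> bool:
    """Recompute the CRC over the payload and compare with the stored one."""
    return append_crc16(bits_with_crc[:-16]) == bits_with_crc


def bits_to_monomers(bits: str) -> str:
    """Translate a bit string into a monomer sequence."""
    return "".join(MONOMER_FOR_BIT[b] for b in bits)


# A 62-character placeholder message (not the real address) gives 496 bits;
# adding the 16-bit CRC yields a 512-unit chain.
message = "x" * 62
chain = bits_to_monomers(append_crc16(text_to_bits(message)))
```

The CRC adds no information of its own. It lets the decoding software check whether a reconstructed sequence is consistent, since any single flipped bit changes the checksum.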

Specially developed software initially identified the fragments based on their mass and their end groups, as shown by the MS spectra. During the MS/MS process, previously measured molecular ions break down further, and these pieces were then also analyzed. The fragments could be sequenced based on the mass difference of the pieces. With the aid of the CRC error detection code, the software reconstructed the sequence of the entire chain, overcoming the length limit for the polymer chains.
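Sequencing by mass difference can be sketched in a few lines of Python. This is an idealized illustration, not the team's software: it assumes a clean "ladder" of fragment masses that each differ by exactly one monomer unit, and it ignores end groups, charge states, noise and missing peaks. The repeat-unit masses are approximate monoisotopic values (monomer minus water), and the function names are hypothetical.

```python
# Approximate monoisotopic masses (Da) of the polyester repeat units:
# lactic acid (C3H4O2) encodes a 1, phenyllactic acid (C9H8O2) encodes a 0.
RESIDUE_MASS = {"1": 72.021, "0": 148.052}


def ladder_from_bits(bits, end_group_mass=0.0):
    """Simulate an idealized ladder: the cumulative mass after each unit."""
    masses = [end_group_mass]
    for bit in bits:
        masses.append(masses[-1] + RESIDUE_MASS[bit])
    return masses


def bits_from_ladder(masses, tolerance=0.01):
    """Read bits from the mass gap between consecutive ladder peaks."""
    ordered = sorted(masses)
    bits = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        for bit, unit_mass in RESIDUE_MASS.items():
            if abs(delta - unit_mass) <= tolerance:
                bits.append(bit)
                break
        else:
            # No monomer matches this gap: the ladder is incomplete or noisy.
            raise ValueError(f"unassignable mass difference: {delta:.3f}")
    return "".join(bits)


decoded = bits_from_ladder(ladder_from_bits("10110"))
```

Because the two units differ by roughly 76 Da, each step in the ladder can be assigned to a 0 or a 1 without ambiguity, as long as every step is actually observed.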

The team was also able to randomly access bits without sequencing the entire polymer chain, such as the word “chemistry” in the code for their address. By taking into account that the parts of their address are all in a specific order (department, institution, city, postal code, country) and separated by commas, they were able to isolate the location where the desired information was stored within the chain and only sequence the relevant fragments.
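The bookkeeping behind this kind of random access can be sketched as follows. The example assumes that the position of every break point in the chain is known and that the desired characters have already been located. The fragment boundaries, function names and numbers are all hypothetical, and how the team actually located the fields is not detailed here.

```python
from bisect import bisect_right


def char_range_to_bits(first_char, n_chars):
    """Half-open bit range occupied by n_chars 8-bit characters."""
    return 8 * first_char, 8 * (first_char + n_chars)


def fragments_covering(start_bit, end_bit, fragment_starts, chain_length):
    """Indices of the fragments overlapping the bit range [start_bit, end_bit).

    fragment_starts: sorted start positions of the fragments, beginning with 0
    (hypothetical values; each fragment runs up to the next start position).
    """
    if not 0 <= start_bit < end_bit <= chain_length:
        raise ValueError("bit range lies outside the chain")
    first = bisect_right(fragment_starts, start_bit) - 1
    last = bisect_right(fragment_starts, end_bit - 1) - 1
    return list(range(first, last + 1))


# Hypothetical layout: a 128-bit chain cut into four fragments.
starts = [0, 30, 64, 100]
start_bit, end_bit = char_range_to_bits(first_char=5, n_chars=4)  # bits 40..71
needed = fragments_covering(start_bit, end_bit, starts, chain_length=128)
```

In this made-up layout, only the fragments listed in `needed` would have to be sequenced to read the four characters of interest.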

Main picture credit: BoliviaInteligente on Unsplash