DNA storage system: is it real and how does it work?



DNA-based data storage systems could be a way out for humankind, which generates more and more information. Compared with any other medium, DNA has a phenomenal data density. Another advantage is that, under optimal conditions, storing data in DNA requires no energy, and the information can be kept for hundreds of years. After several centuries the data can still be read without problems, provided, of course, that the appropriate technology is available.



But DNA has its downsides. For example, there are currently no standards for encoding information in a DNA strand. Synthesizing artificial molecules is quite expensive, and reading the stored information can take days or weeks. Repeatedly accessing DNA strands damages the molecules, so errors can eventually creep in. A method has now been proposed that helps solve some of these problems. The resulting data storage system (so far used only for images) is a cross between a regular file system and a metadata-based database.



More about the problems



Existing DNA data storage systems add specific sequence tags to the DNA regions that contain data. To retrieve the required information, short pieces of DNA (primers) that can form base pairs with the desired tags are added, and these are used to amplify the complete sequence. It is something like tagging each image in a collection with its own ID and then arranging things so that only one specific ID is amplified.
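
The idea of picking out one tagged strand by base pairing can be shown with a toy model. This is only a sketch: the sequences, the six-base tags, and the function names are invented for illustration, and real PCR selection involves primer pairs and chemistry that are not modeled here.

```python
# Toy model of tag-based selection. All sequences and names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the strand that would base-pair with seq."""
    return seq.translate(COMPLEMENT)[::-1]


def select_by_tag(library: list[str], primer: str) -> list[str]:
    """Return the strands containing the site this primer pairs with."""
    site = reverse_complement(primer)
    return [strand for strand in library if site in strand]


library = [
    "ACGTAC" + "GGGTTTCCCAAA",  # tag ACGTAC followed by a payload
    "TTGACA" + "CCCAAAGGGTTT",  # tag TTGACA followed by a payload
]

# A primer designed to pair with the first tag selects only the first strand.
primer = reverse_complement("ACGTAC")
selected = select_by_tag(library, primer)
```

Note that the tag occupies part of each strand, which is the space trade-off discussed next.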



The method is quite effective, but it has two limitations. First, the amplification step, which is performed with the polymerase chain reaction (PCR), limits the size of the sequence that can be amplified. Each tag takes up part of this already limited space, so adding detailed tags reduces the room left for data.



The second limitation is that amplifying particular data-bearing DNA fragments with PCR consumes part of the original DNA library. In other words, every time we read data, some of it is destroyed. Scientists compare this way of searching for information to burning a haystack to find a needle. Do it often enough and you can lose the entire database. There are ways to recover the lost regions, but they are not ideal, because they increase the likelihood of errors in the DNA and therefore in the data.





The new method separates the tag information from the main data. In addition, the researchers have created a system that makes it possible to access only the data of interest. The rest of the information is left untouched, so its DNA molecules are not damaged.



New system



The technology is based on silica (silicon dioxide) capsules, each of which stores an individual file. DNA tags attached to each capsule describe what the file contains. Each capsule is approximately 6 micrometers in size. With this system, the scientists were able to extract individual images with 100% accuracy. The set of files they created is small, only 20, but given the capabilities of DNA, such a system could be scaled up to a sextillion files.



These 20 files were encoded into DNA fragments about 3000 nucleotides long, which corresponds to about 100 bytes of data. One silica capsule can hold a file of up to a gigabyte. Once the file is encapsulated, single-stranded DNA tags are placed on the capsule's surface. Multiple tags can be attached to one capsule to serve as keywords, for example "red", "cat", "animal".
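
To make "encoding a file into nucleotides" concrete, here is a minimal sketch that uses a naive mapping of two bits per base. The mapping and the function names are assumptions for illustration, not the scheme the researchers used. Practical schemes add redundancy and avoid problematic sequences, so they store less data per nucleotide than this upper bound.

```python
# Naive, illustrative mapping: each base carries two bits (A=00, C=01, G=10, T=11).
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, four bases per byte."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # most significant bit pair first
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode(strand: str) -> bytes:
    """Invert encode(); expects a length that is a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


strand = encode(b"cat")  # "CGATCGACCTCA"
restored = decode(strand)
```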



The silica capsules tagged in this way are combined into a single data library. It is not as compact as a store made of pure DNA, but in this case the data is not damaged when it is accessed.



Search for files



Files are found using a set of keywords, or tags. For example, to find an image of a domestic cat, the tags are "orange", "cat", and "domestic"; to find a tiger, only "orange" and "cat". Search in such a system is still very slow, about 1 KB per second.



Another trick is that each tag is associated with a fluorescent molecule of a different color. During a query, any capsule carrying the requested tags therefore glows in a particular color. Devices that use lasers to sort objects by fluorescence color already exist, so separating out the required data is technically feasible.



The rest of the library is not touched in the process, so its data is not damaged. It is no longer necessary to burn a haystack to find one needle. An additional advantage is the ability to run logical searches with several criteria. Query conditions can be complex, for example: true for "cat", false for "domestic", true for "black", and so on.
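
This kind of logical search can be sketched in software as a filter over tagged items. The example below is only an analogy under stated assumptions: the `Capsule` class, the file names, and the query format are hypothetical, and in the real system the selection is done physically by fluorescence sorting, not by code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capsule:
    """Hypothetical stand-in for one tagged silica capsule."""
    name: str
    tags: frozenset


def matches(capsule: Capsule, query: dict) -> bool:
    """True if every tag's presence or absence agrees with the query."""
    return all((tag in capsule.tags) == wanted for tag, wanted in query.items())


def search(library: list, query: dict) -> list:
    """Return the names of matching capsules; the library itself is unchanged."""
    return [c.name for c in library if matches(c, query)]


library = [
    Capsule("house_cat.jpg", frozenset({"cat", "domestic", "orange"})),
    Capsule("tiger.jpg", frozenset({"cat", "orange"})),
    Capsule("panther.jpg", frozenset({"cat", "black"})),
]

# true for "cat", false for "domestic", true for "black"
result = search(library, {"cat": True, "domestic": False, "black": True})
```

Here `result` contains only `"panther.jpg"`, and the other entries are left as they were, which mirrors the non-destructive retrieval described above.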



Not only search



Finding the necessary data is only part of the job, and not even half of it. The retrieved data still has to be sequenced. That requires opening the silica shell, removing the DNA strand stored in the capsule, inserting the DNA into bacteria, and only then reading the data. This is an extremely slow process; even tape drives are very fast by comparison.



On the other hand, DNA-based systems are not meant to be fast: their main purpose is to store huge amounts of information that rarely needs to be retrieved. In addition, the technology will improve over time, so read speeds will hopefully increase.





