
Chance. Some people believe that everything that happens around us is pure chance, while others insist there is no such thing as chance at all. One can philosophize about this for hours and still arrive at many different conclusions. Moving from metaphysics to more practical matters, random numbers have found their way into many areas of our lives: from slot machines to information coding systems. The process of generating a sequence of random numbers or symbols that cannot be predicted is called random number generation (RNG). Over the long history of mankind, many RNG methods have been devised. Some are simple and intuitive: dice, coin flips (heads or tails), a deck of cards, and so on.
Others rely on far more complex physical processes: for example, because of the high-frequency thermal motion of electrons, the electrical resistance of a wire is not constant but fluctuates randomly. By measuring this background noise, a sequence of random numbers can be obtained. But RNG techniques are not limited to physics. A group of scientists from ETH Zurich (the Swiss Federal Institute of Technology in Zurich) has created a new method for generating random numbers based on DNA synthesis. How exactly was this achieved, how random are the resulting numbers, and can they be predicted? The answers to these questions await us in the researchers' paper. Let's go.
Basis of research
What is one of the main limitations of dice as a random number generator? The number of possible outcomes is small (36 combinations for two dice, roughly speaking, ignoring probabilities and other subtleties). The less variation there is, the easier it is to predict the outcome. Therefore, for more complex and, as a consequence, more secure coding based on RNG, the generated numbers need to be larger and the numbers themselves more complex. This is a very simplified explanation, but it nevertheless conveys the essence of the matter.

Possible combinations of two dice.
Therefore, physical processes that cannot be accurately predicted have become the basis of many modern RNG methods. However, it is worth remembering that there are two main branches of RNG: the generation of truly random numbers and the generation of pseudo-random numbers. In the first case, a non-deterministic (chaotic) source is used to generate random numbers. In the second, a deterministic sequence of numbers is produced that depends on an input value (the seed). If the seed is known, the entire sequence of "random" numbers can be reproduced. At first glance a pseudo-RNG seems the weaker option, but it often has better statistical properties and can generate random numbers much faster than a true RNG.
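To make that distinction concrete, here is a minimal Python sketch (my illustration, not from the paper): a pseudo-RNG with a known seed reproduces exactly the same sequence, whereas a source backed by operating-system entropy does not depend on any user-supplied seed.

```python
import random
import secrets

# Pseudo-RNG: a known seed makes the whole sequence reproducible.
random.seed(42)
run_a = [random.randint(0, 9) for _ in range(8)]
random.seed(42)
run_b = [random.randint(0, 9) for _ in range(8)]
print(run_a == run_b)  # True: identical sequences from the same seed

# Closer to true randomness: secrets draws from the OS entropy pool,
# so repeated calls cannot be reproduced from a seed.
print([secrets.randbelow(10) for _ in range(8)])
```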
It seems natural that not only physical processes and software algorithms but also chemical reactions could be suitable for generating truly random numbers. On the one hand, chemical reactions are statistical processes in which the formation of products follows a probability distribution that depends on the activation energy of the reaction. On the other hand, even though the outcome of a reaction can be predicted statistically, the ability to identify individual molecules after synthesis has traditionally been practically nil.
Research on using chemistry to generate random numbers has been done before. For example, this work describes a device that draws an impressive pool of entropy from the detectable macrostates of crystals growing in the course of chemical reactions. The problem is that the inability to identify individual molecules leads to a loss of randomness when analyzing stochastic chemical processes. In other words, chemical reactions would seem to be unsuitable for RNGs. However, as the authors of the work we are examining today point out, the situation with DNA synthesis is completely different.
The production of synthetic DNA is a stochastic chemical process with one important advantage: individual molecules in the synthesized DNA pool can be readily identified and analyzed using modern sequencing technologies (NGS, next-generation sequencing). Sequencing itself has been around since the 1970s, but current techniques make it possible to read individual molecules and thus use DNA as a source of random numbers.
Research results
It should be noted that in biology, methods for profiling microbial communities require the synthesis of random nucleotides at certain primer positions in order to target hypervariable regions (for example, of the 16S rRNA gene) for taxonomic classification. Another use of random nucleotide synthesis is molecular barcoding, where unique molecular identifiers (UMIs) help eliminate amplification bias in PCR*.
Polymerase chain reaction (PCR)* is a method that makes it possible to greatly amplify small concentrations of certain nucleic acid (DNA) fragments in biological material.

The scientists note that such random nucleotides are denoted by the letter N (according to the standards of NC-IUB, the Nomenclature Committee of the International Union of Biochemistry). Consequently, the researchers took advantage of the possibility of synthesizing a random nucleotide at every position marked with the letter N in the design of the DNA used.

Image #1

The DNA strands used in the study were designed so that a random region of 64 nucleotides is flanked by a predetermined forward primer* region at one end and a predetermined reverse primer region at the other (Fig. 1).
Primer* — a short nucleic acid fragment that serves as the starting point for DNA synthesis.

The total length of the designed DNA strand was 105 nucleotides, including the two primer regions and the random region.
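As a small illustration of this design (the primer sequences and their exact lengths below are hypothetical placeholders, chosen only so that the lengths add up to the reported 105 nucleotides), the ordered template can be written as two fixed primer regions around a run of 64 "N" positions, each of which the synthesizer fills with a random nucleotide:

```python
# Hypothetical primer regions (placeholders, not the ones used in the study);
# 20 nt + 64 random positions + 21 nt = the 105-nt design length.
FWD_PRIMER = "ACGTACGTACGTACGTACGT"    # 20 nt, illustrative only
REV_PRIMER = "TGCATGCATGCATGCATGCAT"   # 21 nt, illustrative only

template = FWD_PRIMER + "N" * 64 + REV_PRIMER
assert len(template) == 105
print(template)
```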

Image #2

The designed DNA strand was then physically produced using modern solid-phase synthesis technology (image #2).
The mixing of DNA nucleotide building blocks has also found applications in DNA data storage. Previous studies have shown that expanding the DNA alphabet* by pre-defining the mixing ratio of all four nucleotides at specific positions in the DNA sequence can increase the logical storage density, thanks to the use of composite letters in DNA synthesis.
*The DNA alphabet consists of four nucleotides: A (adenine), T (thymine), G (guanine) and C (cytosine).

The random DNA sequences were synthesized three times: twice by Microsynth and once by Eurofins Genomics. The first company was additionally asked to pre-mix the building blocks before coupling (synthesis 1); the second produced its synthesis without any additional intervention in the process (synthesis 2).
As a result, synthesis 1 yielded 204 μg of dried DNA, synthesized in the 3' to 5' direction. To assess randomness, the DNA pool was sequenced and then digitally filtered.
Looking at the composition of the DNA strands as a function of position within the random region (image #3), two general trends emerge:

- nucleotide non-equivalence: the proportions of G and T differ noticeably from those of A and C;
- positional non-equivalence: the shares of A and C stay roughly constant across the 60 positions, while the share of G decreases and the share of T increases in the 5' to 3' direction. Both trends appear in the Microsynth data and in the Eurofins data.

Image #3

The observed trends provide a first indication of the reliability of the data and can partly be explained by the chemical processes that occur during DNA synthesis. The discrepancy between the proportions of G, T and of A, C (nucleotide non-equivalence) can be caused by several factors. According to Microsynth, the volumes of the individual building blocks used during synthesis are not controlled down to the microliter.
Consequently, differences in concentration can lead to a less uniform distribution of nucleotides along the strand. In addition, the coupling efficiency differs for each building block and depends on variables such as how long the synthesis reagents have been in use by the manufacturer or the protecting groups attached to each building block. These differences in coupling efficiency are most likely responsible for the uneven distribution of the four nucleotides.
A decrease in G and an increase in T from the 5' to the 3' end (positional non-equivalence) may be the result of the chemical procedure that a DNA strand undergoes during synthesis. Since DNA synthesis proceeds in the 3' to 5' direction, the nucleotides at position 60 (image #3) are added to the strand first. Because the synthesized DNA fragments remain in the synthesis chamber until the desired strand length is reached, the nucleotides added at the beginning of synthesis stay in the synthesis environment the longest. These nucleotides therefore go through the largest number of synthesis steps and, consequently, the largest number of oxidation steps.
This feature of chemical DNA synthesis can explain trend #2 (positional non-equivalence), in which the share of G decreases and the share of T increases along the strand in the 5' to 3' direction.
Oxidation can lead to a phenomenon called G-T transversion (3e), in which a G base is chemically altered in such a way that it is replaced by a T base during the DNA replication steps.
In addition to the trends described above, differences between the curves in the graphs can be attributed to the difference in synthesis strategy (with and without pre-mixing of the building blocks).
There are two main potential sources of bias that can affect the results: coverage bias (when individual sequences in the pool are read unequally often) and bias due to errors.
The former mainly manifests itself as an error associated with the spatial arrangement of strands on the synthesis chip and with the stochastic nature of PCR. The latter is the result of insertions, deletions (loss of one or more nucleotides) or substitutions of erroneous nucleotides during synthesis, PCR and sequencing.
In this particular study, coverage bias would affect the distribution of nucleotides only if there were a significant discrepancy in the coverage of the individual random sequences. However, analysis of the data showed that this type of bias cannot be the cause of the observed nucleotide and positional non-equivalence.
As for bias due to errors, it is extremely difficult to distinguish synthesis errors from sequencing errors: the two processes cannot be completely separated, because the molecular makeup of the DNA is accessible only through sequencing. However, studies have shown that, with appropriate data processing, sequencing errors occur at random positions.
During DNA synthesis, the growth of a strand can be interrupted before the desired length is reached, introducing errors into the pool. The sequencing process, by contrast, showed no significant effect on the result (3a-3c). Therefore, the error-related bias is caused solely by DNA synthesis, not by sequencing.
Normalizing the data from synthesis 1 (3a) yields a heat map illustrating how readily pairs of nucleotides couple to each other (3d). It reveals a third source of bias: nucleotide coupling preference.
The coupling of a new base to an existing nucleotide depends in part on the nature of that existing nucleotide: when it could freely couple to A, T, C or G, a G is less likely to couple to an A and more likely to couple to another G.
From the synthesis point of view, this bias could in principle be corrected quite simply, for example by adjusting the target ratio of the A, G, T and C building blocks to compensate for the transversion-induced shift.
However, given the complexity of this process, the scientists decided not to make such "physical" corrections within the study, but instead to use a computational post-processing algorithm to remove the bias introduced during DNA synthesis, which increases the reliability and reproducibility of the whole procedure.
At the data-processing stage (i.e. in preparation for the RNG itself), the pool obtained from synthesis 1 (Microsynth) was used. Although this variant shows the strongest transversion-induced bias, its smooth curves indicate the most uniform mixing and coupling during the synthesis steps.
Extracting randomness from the synthesized DNA strands requires reading individual strands, which was done by sequencing (in this case on the iSeq 100 system). After sequencing, the output data (a digital file) was processed to select the correct (i.e. error-free) sequences from the pool.
Possible errors include deletions, insertions and substitutions. They can make a DNA strand too short, too long, or give it a corrupted base. To minimize the effect of errors (especially deletions) on randomness, all sequences were trimmed to 60 nucleotides, and only those strands whose random region had the correct length were kept.
After the computationally processed DNA pool had been restricted to sequences exactly 60 nucleotides long, the DNA nucleotides were mapped to bits using the following scheme: A → 0, C → 0, T → 1, G → 1. As a result, each digitized DNA strand was converted into a binary string.
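A minimal sketch of this post-processing step (my own illustration, not the authors' code): keep only reads whose random region has the expected length, then map each nucleotide to a bit with A, C → 0 and T, G → 1.

```python
NUC_TO_BIT = {"A": "0", "C": "0", "T": "1", "G": "1"}

def reads_to_bitstrings(reads, region_len=60):
    """Convert sequenced random regions into bitstrings.

    Reads whose random region is not exactly `region_len` nucleotides
    long (e.g. because of insertions or deletions) are discarded.
    """
    bitstrings = []
    for read in reads:
        region = read[:region_len]
        if len(region) != region_len or any(b not in NUC_TO_BIT for b in region):
            continue  # too short, or contains an unexpected symbol
        bitstrings.append("".join(NUC_TO_BIT[b] for b in region))
    return bitstrings

print(reads_to_bitstrings(["ACGT" * 15, "ACGTAC"]))  # the second read is rejected
```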
The bitstrings (bitstreams) obtained after this mapping were then checked for randomness using the NIST statistical test suite. The scientists note that their acceptance criterion was extremely strict: a sequence was considered sufficiently random only if it passed every test; failing even a single test meant the sequence was excluded.
Evaluation of the initial bitstreams with the NIST statistical test suite showed that not all tests were passed. This means the resulting bitstreams do not have the same statistical properties as a completely random sequence, i.e. they still contain some redundancy and bias. Additional bit-level processing was therefore needed to remove the bias introduced at the DNA synthesis stage.
To solve the problem of bias in the output bits (some bits occurring more often than others), the scientists decided to use the von Neumann algorithm. The algorithm considers the bits in non-overlapping pairs and then performs one of three actions: if the two bits of a pair are equal, they are discarded (ignored); the pair "1, 0" is converted to a one; the pair "0, 1" is converted to a zero.
In the context of this study, it was expected that the von Neumann algorithm would work as follows:
- if the input is "0, 1" or "1, 0", the first digit becomes the output, and the second digit is discarded;
- if the input is "0, 0" or "1, 1", there is no output, so both input digits are discarded.
One of the biggest drawbacks of this method is the large data loss: about 75% of the input bits are discarded. The input must therefore be large enough to compensate for these losses.
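A minimal implementation of the extractor described above (my own sketch): bits are consumed in non-overlapping pairs, equal pairs are dropped, and for an unequal pair the first bit is emitted.

```python
def von_neumann_debias(bits: str) -> str:
    """Von Neumann extractor: remove bias from a bitstring.

    Bits are read in non-overlapping pairs; '00' and '11' produce no
    output, while '01' -> '0' and '10' -> '1' (the first bit of the pair).
    """
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return "".join(out)

biased = "1101100111101011"        # toy input with an excess of ones
print(von_neumann_debias(biased))  # much shorter, but unbiased output
```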

Image # 4
The effect of the bias removal (diagram above) is clearly visible when comparing the raw bitstreams (which contain bias) with the processed bitstreams (which do not).
The cumulative sum of each raw bitstream (60 nucleotides long) and each processed bitstream (fewer than 60 bits long) was calculated by mapping each 0 to −1 and each 1 to +1. All unbiased bitstreams were then concatenated into a single bit block.
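This check is easy to reproduce (a sketch using the same 0 → −1, 1 → +1 convention): an unbiased bitstream behaves like a symmetric random walk, so its cumulative sum should wander around zero rather than drift steadily.

```python
def cumulative_walk(bits: str):
    """Map 0 -> -1, 1 -> +1 and return the running (cumulative) sum."""
    walk, total = [], 0
    for b in bits:
        total += 1 if b == "1" else -1
        walk.append(total)
    return walk

print(cumulative_walk("110100"))  # a biased stream would drift steadily up or down
```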
The scientists note that although the data loss is significant (more than 75% of all bits are lost) and the computational efficiency is rather low (the average output rate is four times lower than the average input rate), the bias removal works perfectly: the output contains no detectable bias.
The bit block obtained after processing with the von Neumann algorithm was re-evaluated with the NIST statistical test suite.

Table 1: NIST Statistical Test Results.
All processed bitstreams passed the NIST statistical tests, with a passing proportion of > 54/56 for each test, which exceeds the statistically required minimum (52/56). Further evaluation of the bitstream showed a P-value ≥ 0.001, which means the sequence can be considered random at a 99.9% confidence level.
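For a flavour of what the NIST suite actually checks, here is a standalone sketch of its simplest test, the frequency (monobit) test from SP 800-22; the real suite runs more than a dozen such tests, and in the study essentially all of them had to pass.

```python
import math

def monobit_p_value(bits: str) -> float:
    """NIST SP 800-22 frequency (monobit) test.

    Maps 0 -> -1, 1 -> +1, sums the values, normalizes by sqrt(n) and
    returns the p-value; a p-value above the chosen significance level
    means the stream is consistent with a fair coin.
    """
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

print(monobit_p_value("0110100101011010"))  # close to 1 for this balanced toy string
```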

Image # 5
The diagram above shows the complete process of generating random numbers by means of DNA synthesis. As we recall, the synthesis yielded 204 μg of DNA, which corresponds to approximately 4×10^15 DNA strands. Synthesizing this amount of DNA takes about 8.75 hours and costs about $100.
The dry DNA sample contains a theoretical 28 PB of entropy (assuming no bias in the data) and 7 PB of randomness once the bias is removed with the von Neumann algorithm (i.e. after the ~75% bit loss). Unlike DNA data storage, then, synthesis itself is not the bottleneck (the performance-limiting step) of random number generation: it can produce randomness at a rate of about 225 gigabytes per second at a cost of roughly $0.000014 per GB.
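The throughput and cost figures follow from simple arithmetic on the numbers quoted above; the strand-count estimate additionally assumes an average mass of roughly 330 g/mol per nucleotide, so treat the sketch below as a back-of-envelope check rather than the authors' exact calculation.

```python
AVOGADRO = 6.022e23
STRAND_LEN_NT = 105          # designed strand length
NT_MASS_G_PER_MOL = 330.0    # rough average mass of one DNA nucleotide (assumption)

# Strands in 204 ug of dried DNA: ~3.5e15, the same order as the quoted ~4e15.
strands = 204e-6 / (STRAND_LEN_NT * NT_MASS_G_PER_MOL) * AVOGADRO
print(f"{strands:.1e} strands")

# Debiased randomness: 7 PB produced in ~8.75 h of synthesis costing ~$100.
unbiased_bytes = 7e15
rate_gb_s = unbiased_bytes / (8.75 * 3600) / 1e9
cost_per_gb = 100 / (unbiased_bytes / 1e9)
print(f"~{rate_gb_s:.0f} GB/s, ~${cost_per_gb:.6f} per GB")  # ~222 GB/s, ~$0.000014/GB
```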
Sequencing, on the other hand, is the bottleneck in terms of both time and cost. The iSeq system used in this work has more productive alternatives (for example, the NovaSeq 6000), capable of up to 20 billion sequence reads in 36 hours, although the price is impressive ($22,000). Taking all RNG stages into account, random data can thus be obtained at about 300 kilobytes per second at a cost of about $600 per GB. Costs could be reduced by combining several synthesis and sequencing runs.
For a more detailed look at the nuances of the study, I recommend the researchers' paper and its supplementary materials.
Epilogue
Random number generators have existed for thousands of years (the oldest known dice, found in Iran, are about 5,200 years old), even if the people of those times did not appreciate their full potential. Modern technology and scientific progress have made it possible to create complex algorithms and devices that generate randomness no human could predict. But where humans fall behind, technology catches up: where there is a cipher, there is also a codebreaker. The gradual improvement of information coding methods that rely on random number generators therefore drives a parallel improvement in the methods used to break such systems. This endless race between locks and lock picks forces both sides to keep inventing new methods.
Many modern RNGs are based on physical processes and algorithms, while chemical reactions remained on the sidelines for many years, since it was believed they could not provide a reliable foundation for an RNG. In this work, the scientists have shown that DNA synthesis, although a chemical process, can be not just a worthy basis for an RNG but one that surpasses its "physical" competitors in several respects.
Naturally, the method is still a rough diamond that needs polishing in the form of further research aimed at increasing throughput and reducing cost. Nevertheless, generating random numbers by means of DNA is already an extremely promising direction.
Thanks for your attention, stay curious and have a good work week, guys.