Reverse engineering the source code of the BioNTech / Pfizer coronavirus vaccine

Welcome. In this post, we will analyze the source code of the BioNTech / Pfizer SARS-CoV-2 mRNA vaccine character by character.



Yes, such a statement may surprise you. A vaccine is a liquid that is injected into a person's arm. What does source code have to do with it?



Good question. We'll start with a small portion of the very source code of the BioNTech / Pfizer vaccine, also known as BNT162b2, also known as Tozinameran, also known as Comirnaty.





First 500 characters of BNT162b2 mRNA.



At the heart of the vaccine is this digital code. It is 4284 characters long, so it would fit into a handful of tweets. At the very beginning of the vaccine production process, someone uploaded this code to a DNA printer (yes, really), which then converted the bytes on the storage device into real DNA molecules.





DNA printer Codex DNA BioXp 3200



Such a machine produces a tiny amount of DNA, which, after a lot of biological and chemical processing, ends up as RNA in the vaccine vial. A 30 μg dose really does contain 30 μg of RNA. In addition, there is a clever lipid (fatty) packaging system that delivers the mRNA into our cells.



RNA is the volatile, "working memory" version of DNA. DNA is like the flash drive of biology: reliable, robust, and internally redundant. But computers do not execute code directly from a flash drive either. Before anything runs, the code is copied into a faster and more flexible, but also more fragile, system.



In computers, this is RAM; in biology, it is RNA. The resemblance is striking. Unlike flash memory, RAM degrades quickly if it is not properly cared for. The Pfizer / BioNTech mRNA vaccine has to be stored at very low temperatures for the same reason: RNA is a delicate flower.



Each RNA character weighs on the order of 0.53 × 10⁻²¹ grams, so a single 30 μg dose contains about 6 × 10¹⁶ characters. At 2 bits per character, that comes to about 14 PB, although in fact the dose consists of roughly 13,000 billion repetitions of the same 4284 characters. The actual information content of the vaccine is just over a kilobyte. SARS-CoV-2 itself weighs in at about 7.5 KB.
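The arithmetic in this paragraph can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the mass per character, the dose, and the sequence length are the figures quoted in the text, the 2 bits per character is an assumed encoding (four possible characters), and the variable names are made up for illustration.

```python
# Rough size estimate for one vaccine dose, using the figures from the text.
MASS_PER_CHAR_G = 0.53e-21   # grams per RNA character (order of magnitude)
DOSE_G = 30e-6               # 30 micrograms of RNA per dose
SEQ_LEN = 4284               # characters in the BNT162b2 mRNA
BITS_PER_CHAR = 2            # assumption: four possible characters, 2 bits each

chars_per_dose = DOSE_G / MASS_PER_CHAR_G
copies_per_dose = chars_per_dose / SEQ_LEN
dose_bytes = chars_per_dose * BITS_PER_CHAR / 8
unique_bytes = SEQ_LEN * BITS_PER_CHAR / 8

print(f"characters per dose:    {chars_per_dose:.1e}")
print(f"copies of the sequence: {copies_per_dose:.1e}")
print(f"raw size of one dose:   {dose_bytes / 1e15:.0f} PB")
print(f"unique content:         {unique_bytes:.0f} bytes")
```

The last figure is what matters for the rest of this post: everything that follows is a reading of roughly one kilobyte of data.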



Quick reference



DNA is a digital code. But unlike computers, which use 0 and 1, life uses the characters A, C, G, and U / T (“nucleotides,” “nucleosides,” or “bases”).



In computers, 0 and 1 are stored as the presence or absence of a charge, or as a current, a magnetic transition, a voltage, a modulation of a signal, or a change in reflectivity. In short, 0s and 1s are not abstractions: they live as electrons and in many other physical incarnations.



In nature, A, C, G, and U / T are molecules stored in DNA (or RNA) in chains.



In computers, 8 bits are grouped into bytes, and data is usually processed byte-wise.



Nature groups three nucleotides into codons, which are the typical units for processing genetic information. A codon contains 6 bits of information (2 bits per character, 3 characters = 6 bits). This means that a codon can take 2⁶ = 64 different values.
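To make the bit counting concrete, here is a small sketch that packs a codon into a 6-bit integer. The particular assignment of bit patterns to letters is arbitrary and chosen only for this example; nature does not number its nucleotides.

```python
# Map each nucleotide to a 2-bit value (this assignment is an arbitrary choice).
BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "U": 0b11}

def codon_to_int(codon: str) -> int:
    """Pack a three-letter codon into a single 6-bit integer."""
    value = 0
    for base in codon:
        value = (value << 2) | BITS[base]
    return value

# Enumerate every possible codon and collect the distinct 6-bit values.
all_values = {codon_to_int(a + b + c) for a in BITS for b in BITS for c in BITS}

print(len(all_values))             # how many distinct codons exist
print(bin(codon_to_int("AUG")))    # A=00, U=11, G=10
```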



So far, everything is pretty digital. Doubters can look at the document from the WHO containing the digital code.



So what does this code do?



The idea behind a vaccine is to teach our immune system to fight a pathogen without us actually getting sick. Historically, this was done by injecting a weakened or inactivated virus, together with an adjuvant that startles the immune system into action. It was mostly analog technology, involving billions of eggs (or insects). It also required a lot of luck and a lot of time. Sometimes a completely different virus was used for this.



An mRNA vaccine achieves the same result (training the immune system), but in a much smarter way, as if using a laser sight. In every sense: a narrowly focused, but powerful impact.



Here's how it works. The injection contains volatile genetic material that describes the famous SARS-CoV-2 “spike” protein. Through clever chemistry, the vaccine delivers this genetic material into some of our cells.



These cells then obediently begin to produce SARS-CoV-2 spike proteins in quantities large enough to trigger our immune system. Faced with the spike proteins and the characteristic signs of an infected cell, the immune system develops a powerful response to several aspects of the spike protein and of the process that produces it.



And this is how we get a vaccine that is 95% effective.



The source code!



Let's start from the best place - from the very beginning. The WHO document has the following helpful picture:







It is a kind of table of contents. Let's start with the "cap" item, which is actually drawn as a little hat.



Just as on a computer you can't just write opcodes into a file and run it, a biological operating system requires headers, linkers, and something like variable naming rules.



The vaccine code begins with the following two nucleotides:



GA




This is comparable to the way every DOS and Windows executable starts with the characters "MZ", or UNIX scripts start with "#!". In life, as in operating systems, these two characters are not executed. But they must be there, otherwise nothing will work.



The mRNA cap has several functions. For one, it marks the code as coming from the cell nucleus. In our case, of course, that is not true: the code comes from the vaccine. But the cell does not need to know that. The cap makes the code look legitimate, which protects it from destruction.



Also, the two original GA nucleotides are chemically slightly different from the rest of the RNA. In this sense, a kind of out-of-band signaling is built into the GA.



The five-prime untranslated region



A bit of jargon. RNA molecules can only be read in one direction. Confusingly, the end where reading starts is called 5' ("five-prime"), and the end where it stops is called 3' ("three-prime").



Life is made of proteins (and of everything that is made by them). These proteins are described in RNA. The transformation of RNA into protein is called translation.



Here is the 5' untranslated region (UTR), that is, a part that does not end up in the protein:



GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC




Here we are in for the first surprise. The usual RNA characters are A, C, G, and U. In DNA, U is known as T. But here some kind of Ψ appears. What is going on?



This is one of the exceptionally clever properties of the vaccine. Our body runs a powerful antivirus system. Thanks to it, cells are extremely skeptical of foreign RNA and try hard to destroy it before it can do anything.



This is a problem for a vaccine: it needs to slip past our immune system. Over many years of experiments, it was found that if the U in RNA is replaced with a slightly modified molecule, our immune system loses interest in it. Completely.



Therefore, in the BioNTech / Pfizer vaccine, each U is replaced by 1-methyl-3'-pseudouridine, which is denoted by Ψ. The trick here is that although such a replacement pacifies our immune system, the necessary parts of the cells perceive it as an ordinary U.
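In text form, the substitution is nothing more than a change of alphabet. The sketch below (function names are invented for this example) converts between the published Ψ notation and plain RNA notation, using the 5' UTR quoted above as input.

```python
PSI = "Ψ"

def to_vaccine_alphabet(rna: str) -> str:
    """Write every U as Ψ, the way the vaccine sequence is published."""
    return rna.replace("U", PSI)

def to_standard_alphabet(rna: str) -> str:
    """Read every Ψ as a plain U, which is how the rest of the cell treats it."""
    return rna.replace(PSI, "U")

five_prime_utr = "GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC"

plain = to_standard_alphabet(five_prime_utr)
print(plain)
print(to_vaccine_alphabet(plain) == five_prime_utr)  # the mapping is reversible
```

This only models the notation, of course. The chemistry that makes Ψ invisible to the immune system is not captured by a string replacement.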



This trick is also known in computer security. Sometimes it is possible to transmit a slightly garbled version of the message that confuses firewalls and security systems, but is accepted by backend servers. And then they can be hacked.



Today we are reaping the fruits of fundamental scientific research from the past. The people who discovered this Ψ technique had to fight to find funding and gain recognition. We should be grateful to them for this, and I am sure that the Nobel Prize will eventually find them.



Let's go back to our 5' UTR. What do these 51 characters do? Like almost everything in nature, they do not have a single clear function.



When cells need to translate RNA into proteins, a machine called the ribosome gets to work. The ribosome is like a 3D printer for proteins. It ingests a strand of RNA and, based on it, emits a chain of amino acids, which then folds into a protein.







This process is shown in the video. The dark stripe below is RNA. The stripe on the green background is the forming protein. Inbound and outbound gizmos are amino acids and adapters that allow them to fit onto RNA.



For the ribosome to work, it needs to physically sit on a piece of RNA. Once seated, it can begin to form proteins based on the RNA that follows. That means it cannot read the part on which it first has to land. Providing this landing zone is one of the functions of the UTR.



In addition, the UTR contains metadata: when should translation take place? And how much? For the vaccine, the scientists picked a UTR that says "translate as soon as possible." It comes from the alpha-globin gene, which is known for reliably producing large amounts of protein. Scientists had already found ways to optimize this UTR further, so the vaccine uses not the exact alpha-globin UTR, but something better.



S-glycoprotein signal sequence



As noted, the goal of the vaccine is to get the cell to produce the SARS-CoV-2 spike protein on an industrial scale. So far, we've mostly dealt with metadata and naming conventions in the source code. And now we are entering viral protein territory.



However, first we need to go through another layer of metadata. After the ribosome (from the great animation above) makes a protein, that protein still needs to get somewhere. This is encoded in the signal sequence (peptide) of the S-glycoprotein (the extended leader sequence).



At the beginning of the protein there is something like an address label, encoded in the same way as the rest of the protein. In this case, the signal sequence says that the protein should leave the cell via the endoplasmic reticulum. Even Star Trek didn't have such cool jargon!



The signal sequence is not very long, but the code example shows the difference between the RNA of the virus and the vaccine. For ease of comparison, I replaced Ψ with the usual U from RNA:



           3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3
Virus:   AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU
Vaccine: AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU
               !   !   !   !   ! ! ! !     !   !   !   !   !




It is no accident that I grouped the RNA into threes. Three characters form a codon, and each codon encodes a specific amino acid. The signal sequence of the vaccine consists of exactly the same amino acids as that of the virus itself.



Why is RNA different?



There can be 4³ = 64 codons, since RNA has 4 characters, three of which make up a codon. At the same time, there are only 20 different amino acids. It follows that several codons encode the same amino acid.



Life uses the following, almost universal table for mapping RNA codons to amino acids:







The table shows that vaccine modifications (UUU -> UUC) are synonymous. The RNA code of the vaccine is different, but the output is the same amino acids and proteins.
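The claim that the changes are synonymous can be checked mechanically. Below is a minimal sketch that builds the standard codon table in the conventional U, C, A, G ordering and translates both versions of the signal sequence shown above. The helper names are invented for this example, and "*" stands for a stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(rna: str) -> str:
    """Translate space-separated codons into one-letter amino acid codes."""
    return "".join(CODON_TABLE[codon] for codon in rna.split())

virus = "AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU"
vaccine = "AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU"

print(translate(virus))
print(translate(vaccine))
print(translate(virus) == translate(vaccine))
```

Both lines print the same sixteen amino acids, even though twelve of the sixteen codons differ.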



A close examination shows that most of the changes are in the third position of a codon, marked with the number 3. By checking the universal codon table, we can see that this third position often does not affect which amino acid is produced.



But if the changes are synonymous, why are they needed? If you look closely, all but one change increases the number of C and G.



Why is this necessary? As already noted, our immune system takes a very dim view of "external" RNA, that is, code that comes from outside the cell. To avoid detection, we have already replaced U with Ψ.



It turns out, however, that RNA with a higher proportion of G and C is also converted into proteins more efficiently. For this reason, many characters in the vaccine RNA have been replaced by G and C wherever possible.
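The shift toward G and C is easy to measure on the signal sequence shown earlier. This sketch simply counts G and C characters in both versions; the function name is my own.

```python
def gc_count(rna: str) -> int:
    """Count the G and C characters in an RNA string, ignoring spaces."""
    letters = rna.replace(" ", "")
    return letters.count("G") + letters.count("C")

virus = "AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU"
vaccine = "AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU"

total = len(virus.replace(" ", ""))
for name, rna in (("Virus", virus), ("Vaccine", vaccine)):
    count = gc_count(rna)
    print(f"{name}: {count} of {total} characters are G or C ({count / total:.0%})")
```

On these 48 characters the count goes from 16 to 27, without changing a single amino acid.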



The actual spike protein



The next 3777 RNA characters of the vaccine are likewise "codon-optimized" to add more C and G. I won't give the whole code here, but we will study one special piece of it. This is the part that makes the vaccine work, the part that helps us return to normal life:



                  *   *
          L   D   K   V   E   A   E   V   Q   I   D   R   L   I   T   G
Virus:   CUU GAC AAA GUU GAG GCU GAA GUG CAA AUU GAU AGG UUG AUC ACA GGC
Vaccine: CUG GAC CCU CCU GAG GCC GAG GUG CAG AUC GAC AGA CUG AUC ACA GGC
          L   D   P   P   E   A   E   V   Q   I   D   R   L   I   T   G
           !     !!! !!        !   !       !   !   !   ! !




Here again the usual synonymous RNA changes are visible. For example, in the first codon, CUU was replaced by CUG. This adds another G to the vaccine, which helps boost protein production. CUU and CUG code for the amino acid L, or leucine, so nothing changes in the protein.



If we compare the whole spike protein of the vaccine with that of the virus, all changes turn out to be synonymous in the same way, except for two. Both of them are visible in this fragment.



The third and fourth codons contain real changes. The amino acids K and V are replaced by P, or proline. In the case of K, three changes were required ('!!!'), and in the case of V, two ('!!'). It turns out that these two changes enhance the vaccine incredibly.
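The two real changes can be picked out automatically by translating both fragments and comparing them codon by codon. This is a sketch using the standard codon table; the function name and the shape of its return value are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

virus = "CUU GAC AAA GUU GAG GCU GAA GUG CAA AUU GAU AGG UUG AUC ACA GGC".split()
vaccine = "CUG GAC CCU CCU GAG GCC GAG GUG CAG AUC GAC AGA CUG AUC ACA GGC".split()

def real_changes(old_codons, new_codons):
    """Return (codon number, old amino acid, new amino acid, letters changed)
    for every codon whose amino acid differs between the two sequences."""
    changes = []
    for number, (old, new) in enumerate(zip(old_codons, new_codons), start=1):
        old_aa, new_aa = CODON_TABLE[old], CODON_TABLE[new]
        if old_aa != new_aa:
            letters_changed = sum(a != b for a, b in zip(old, new))
            changes.append((number, old_aa, new_aa, letters_changed))
    return changes

for change in real_changes(virus, vaccine):
    print(change)
```

Only the third and fourth codons of the fragment are reported, with three and two changed letters respectively, matching the '!!!' and '!!' marks in the alignment.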



So what's going on here? If you look at a real SARS-CoV-2 virus particle, you can see that the spike protein forms a bunch of spikes:



The spikes are attached to the body of the virus ("nucleocapsid protein"). But our vaccine only generates the spikes themselves, and we do not attach them to any kind of virus body.



It turns out that separately existing spike proteins collapse into a completely different structure. If they were introduced as part of a vaccine, our bodies would develop immunity to them - but only to their collapsed appearance. The real coronavirus flaunts straight spikes. In this form, the vaccine would hardly work.



So what do we do? In 2017 it was described how a double proline substitution in just the right place makes the SARS-CoV-1 and MERS S proteins take up their "original" shape, even without being attached to the virus. All thanks to the rigidity of proline. This amino acid works like a splint, stabilizing the protein in the state in which we need to present it to the immune system.

The people who discovered this should be patting themselves on the back constantly, and grinning all the while. And it would all be well deserved. After the first draft of this article was published, I spoke with people from the McLellan lab, and they said that the back-patting has been put on hold because of the pandemic, but that they are proud of their contribution to the vaccine. They also stress the importance of the other groups and volunteers who worked on it.







The end of the protein and next steps



If we scroll through the source code to the end, we see a few small changes at the end of the spike protein:



          V   L   K   G   V   K   L   H   Y   T   s
Virus:   GUG CUC AAA GGA GUC AAA UUA CAU UAC ACA UAA
Vaccine: GUG CUG AAG GGC GUG AAA CUG CAC UAC ACA UGA UGA
          V   L   K   G   V   K   L   H   Y   T   s   s
               !   !   !   !     ! !   !          !




At the end of the protein there is a "stop" codon, marked with the letter s. This is a polite way to indicate the end of the protein. The virus itself uses the UAA codon as a stop, and the vaccine uses two UGA codons. Perhaps just in case.
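A reader of the code can find the end of the protein the same way the ribosome does: by walking codon by codon until a stop codon appears. The sketch below uses the three stop codons of the standard genetic code; the function name is invented for this example.

```python
# The three stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_at_stop(rna: str):
    """Split space-separated codons into (coding part, first stop codon onward)."""
    codons = rna.split()
    for index, codon in enumerate(codons):
        if codon in STOP_CODONS:
            return codons[:index], codons[index:]
    return codons, []

virus_end = "GUG CUC AAA GGA GUC AAA UUA CAU UAC ACA UAA"
vaccine_end = "GUG CUG AAG GGC GUG AAA CUG CAC UAC ACA UGA UGA"

for name, rna in (("Virus", virus_end), ("Vaccine", vaccine_end)):
    coding, stops = split_at_stop(rna)
    print(f"{name}: {len(coding)} coding codons, then {stops}")
```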



The 3' untranslated region



Just as at the 5' end we found the 5' UTR, which the ribosome needs as a lead-in, at the end of the protein we find a similar construct, the 3' UTR.



Much could be written about it, but I'd rather quote from Wikipedia: “The 3' UTR plays a critical role in gene expression, influencing the localization, stability, export, and translation efficiency of mRNA. Despite all our current knowledge of 3' UTRs, how they work is still largely a mystery.”



We do know, however, that certain 3'-UTRs are very successful in mediating protein expression. According to a document from WHO, the 3'-UTR contained in the BioNTech / Pfizer vaccine is derived from the "amino-terminal enhancer of cleaved (AES) mRNA and mitochondrial encoded 12S ribosomal RNA to ensure RNA stability and high total protein expression." What can I say - well done.







And the end of everything, AAAAAAAAAAAAAAAAAAAAAA



The very end of the mRNA is polyadenylated. This is a fancy way of saying that it ends with a long run of AAAAAAAAAAAAAAAAAAA. Apparently, even mRNA has had enough of 2020.



mRNA can be reused many times, but each time it loses a few A's from its end. As soon as the A's run out, the mRNA stops working and is discarded. In this sense, the poly-A tail protects it from degradation. Studies have been carried out to determine the optimal number of A's at the end of mRNA vaccines. In open sources, I read that the optimum is around 120.



The BNT162b2 vaccine ends with:



                                     ****** ****
UAGCAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAGCAUAU GACUAAAAAA AAAAAAAAAA 
AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAA




30 "A", then "nucleotide linker-10" (GCAUAUGACU), followed by another 70 "A".
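The structure of the tail can be confirmed by counting. The sketch below takes the tail exactly as printed above, strips the whitespace, and uses a regular expression to split it into the two runs of A around the linker. The variable names are my own.

```python
import re

# The end of the sequence, copied as printed above (whitespace is ignored).
TAIL = """
UAGCAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAGCAUAU GACUAAAAAA AAAAAAAAAA
AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAA
"""
LINKER = "GCAUAUGACU"

sequence = "".join(TAIL.split())

# A run of A, then the linker, then a run of A up to the end of the sequence.
match = re.search(rf"(A+){LINKER}(A+)$", sequence)
first_run = len(match.group(1))
second_run = len(match.group(2))

print(first_run, len(LINKER), second_run)
```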



I suspect that proprietary optimization is again being used to improve protein expression.



Summary



Now we know exactly what the BNT162b2 vaccine contains, and for the most part, we understand why it works this way:

  • The cap, which makes the RNA look like normal mRNA.
  • A well-known, successfully tested and optimized 5' UTR.
  • A signal sequence with optimized codons that sends the spike protein to the right place (copied from the virus itself).
  • A variant of the original spike protein with optimized codons and two proline substitutions that ensure the correct spike shape.
  • A well-known, successfully tested and optimized 3' UTR.
  • A slightly mysterious poly-A tail with some kind of "linker".


Optimizing codons adds many Gs and Cs to the mRNA. Using Ψ (1-methyl-3'-pseudouridine) instead of U helps to trick the immune system. Thanks to it, mRNA exists long enough to train our immune system.


