How Bioinformatics Differs from Computational Biology - A Brief Introduction



A couple of days ago, Alsu Missarova, a graduate of the Faculty of Mechanics and Mathematics of Moscow State University, with a PhD in systems biology (functional genomics in yeast) from the Universitat Pompeu Fabra in Barcelona, spoke on our YouTube channel. Alsu is now a postdoc in John Marioni's lab (EBI, Cambridge, UK), where she works on single-cell RNA-seq and its integration with spatial transcriptomics.



Alsu gave a very brief introduction to what bioinformatics is and how it differs from computational biology. We are sharing the recording and transcript of the broadcast with you; we hope this will be the introduction to a whole series of talks by speakers working in bioinformatics.






My name is Alsu Missarova. I was asked to talk about bioinformatics: in particular, what problems I solve, what kind of data I process, and what kinds of problems computational biology offers for techies, for people with a background in computer science, data analysis, and so on.



I am not a bioinformatician myself; I am a computational biologist. These two concepts are highly correlated and the line between them is blurred, but it is important to understand the difference. For both, the goal is to answer some biological question, or to improve our understanding of how biological processes work. The approach is also similar: processing and analyzing amounts of data too large to handle by eye and by hand. The difference is in priorities. A computational biologist typically starts from a relatively specific biological question and needs to figure out what kind of data to collect. You need access to that data, and you need to be able to process, analyze, and interpret it correctly and, ultimately, answer the question. For a bioinformatician, the goal is more the creation of algorithms, tools, and methods for working with biological data. The task will most likely be handed down from above, and the data will come in a more industrial format: there is a fixed data format to process, and the processing needs to be reproducible across a large number of individuals or organisms, and so on.



You can think of it like this: a computational biologist is more of a biologist who can open some libraries and use some tools, and a bioinformatician is more of a computer scientist who doesn't care about biology and doesn't really understand it, and just works with numbers, strings, and data. Of course, in reality that isn't true. This holds in any field: when you work with data, you absolutely need to understand what kind of data you have and where the noise in it comes from. And there will be a lot of noise in the biological data you receive. Roughly speaking, it can be decomposed into technical and biological noise. Technical noise comes from the fact that the machines producing the data are imperfect and flawed. Biological noise arises because there is a lot of variation in any living system: even between two cells of the same organism, even two adjacent skin cells, there will be a biological difference. You need to distinguish technical noise from biological noise, remove the technical noise and keep the biological noise, and that requires an understanding of biology.



Let's move on to what kind of data we have in biology. First of all, when people hear "bioinformatics", they think about DNA sequencing (which, in principle, is justified). I think everyone knows what it is: roughly speaking, the ability to determine what DNA sequence an organism has. DNA is a very long molecule; for humans it is about 3.1 billion "letters". There are 4 letters, A, C, G, and T; these are the nucleotides. So people have learned to read the DNA of a living being, which is very cool. Now you can, for example, determine the sequences of two people, compare them, see how the sequences differ and how the people differ, and try to find a cause-and-effect relationship: that is, how DNA affects your phenotype, what makes two people different. Likewise, in computational biology you can take two organisms from neighboring species, sequence them in the same way to determine their DNA sequences, and try to understand what the difference between the two organisms is and what part of the DNA actually drives it.
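The comparison step can be sketched in a few lines. This is a toy illustration, not a real variant caller (real pipelines align reads to a reference genome first); the sequences here are invented:

```python
def find_differences(seq_a: str, seq_b: str):
    """Return (position, base_a, base_b) for every mismatch.

    Assumes the two sequences are already aligned and of equal length.
    """
    assert len(seq_a) == len(seq_b), "toy example requires equal lengths"
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Two made-up stretches of DNA from two "people"
person_1 = "ACGTTACGGA"
person_2 = "ACGTCACGTA"
print(find_differences(person_1, person_2))  # positions where the letters differ
```

A real analysis would then ask which of those differing positions fall inside genes and whether they plausibly explain a phenotypic difference.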



Now you can move to a different dimension and ask the following question: if you take two cells from one organism, from one person, what is the difference between them? Roughly speaking, skin epithelial cells will differ from neurons. Here DNA is no longer very suitable. There is an axiom, which is by and large erroneous, that the DNA sequence of the cells of one organism is always the same. It is erroneous because a living organism is a dynamic structure: it grows, divides, dies. In this process, mutations accumulate. DNA replication is not perfect and sometimes breaks down; DNA copies itself, but imperfectly. Mutations can be neutral, leading to nothing, or harmful, causing cell dysfunction. Still, if we abstract away from that, the DNA sequence is more or less identical between two cells, yet they function differently. Accordingly, a large number of biological questions are aimed at understanding what the difference between different cells is and what drives it. The community has a demand for this kind of data: you need to be able to extract, count, and read out this difference.



This is where we come to what I do. The main (or one of the main) data formats that people use here is RNA sequencing. Now I will briefly explain what RNA is and how RNA sequencing has evolved.



This is a very abbreviated version; in reality everything is more complicated. The two pillars that support cell biosynthesis are transcription and translation. DNA is a very long word that encodes certain information. The cell can read and process this information into functional elements.



Proteins are a prime example. These are little machines in the cell that perform specific functions and keep the cell alive and working as it should. Proteins are encoded by genes; a gene is a subword within the DNA sequence. Transcription is when a large molecular machine, the polymerase, sits on the long double helix of the DNA molecule, travels along genes, creates copies of them, and releases them into the cytoplasm of the cell. These RNA copies (actually, not exactly copies) are made in certain amounts. Accordingly, two different cells have different amounts of RNA from different genes: an epithelial cell needs more of gene A, a neuron needs more of gene B, and they are produced in different quantities. Then the RNA is processed, and when it is in a more final form, another machine "sits down" on the strand. So when people talk about RNA sequencing, they mean, roughly speaking, counting how much RNA from which genes is produced in cells. That is the RNA composition, and measuring it is RNA sequencing.
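At its core, the counting step described above is just a tally: after each sequenced read has been assigned to a gene (the hard part, done by alignment software), the expression profile is how many reads each gene received. A minimal sketch, with invented gene names:

```python
from collections import Counter

# Each element stands for one sequenced read already mapped to a gene
reads_mapped_to_genes = ["geneA", "geneB", "geneA", "geneA", "geneC", "geneB"]

# The per-gene tally is the cell's (toy) expression profile
expression = Counter(reads_mapped_to_genes)
print(expression)  # geneA was read most often, so it is most highly expressed
```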



In fact, it's very cool that people have learned to do this. For a long time, the main limitation of this technology was that it took a lot of cells to obtain enough RNA material. That is, you had to pool tens of thousands of cells together (no longer viable, naturally), extract the RNA, and sequence it.



The problem is that cells will often differ from one another. There will be a lot of biological variation, because in many processes, for example development, immunology, or oncology, there is a lot of interaction between cells with different functions. And when, say, a biopsy is taken and a lot of cells are pulled out, you get a mix. And if you take only the mean of these RNA levels across all cells, you lose the variance, and you cannot see or study those differences.
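This loss of variance is easy to demonstrate with invented numbers. Below, two very different cell populations (expression levels of one hypothetical gene) are pooled the way a bulk assay pools them; the average survives, but the two populations become invisible:

```python
from statistics import mean, pstdev

# Invented expression levels of one gene in two cell populations
immune_cells = [0, 1, 0, 2, 1]       # barely express the gene
tumour_cells = [19, 20, 18, 21, 20]  # express it strongly

# Bulk sequencing effectively mixes everything together
bulk = immune_cells + tumour_cells

print(mean(bulk))    # the single averaged value a bulk assay reports
print(pstdev(bulk))  # the per-cell spread, which the bulk average hides
```

The bulk mean sits between the two groups and describes neither; the large spread is exactly the biological variation single-cell methods were invented to capture.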



Accordingly, there was a demand from the community to do this at the single-cell level, and people learned to do it about 10 years ago. This is very cool and very important for many areas: you can look very deeply into the system and see what kinds of cells there are at the microscopic level. But there are limitations too. One of them is that you lose spatial information: roughly speaking, to do single-cell RNA-seq you need to take a piece of tissue, dissociate it into individual cells, and then run your single cell RNA-seq.



But really, a lot of the functionality lies in how cells interact with each other in space. And for this, spatial transcriptomics was invented: the ability to measure RNA without losing spatial information.



One of the main tricks here uses a microscope: you take your tissue and fix it, that is, you have a set of cells fixed under the microscope. Then you send small probes into this tissue, which contain two elements: one is very specific to your RNA and will bind only to the genes that matter; the second is a glowing fluorescent tag. You can illuminate the tissue with the microscope at a certain wavelength and count how many of these little "fireflies" light up in the cells: there will be as many of them as there are RNA molecules. The tasks I work on are at the junction of spatial transcriptomics and single-cell RNA sequencing. Roughly speaking, I work on development, looking at mouse embryos; I have both single-cell and spatial transcriptomics data, and I am trying to match the cells I see in the spatial context with those I see in the single cell RNA-seq.
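A very rough sketch of the matching idea (the methods actually used are far more sophisticated): assign each spatially measured cell to the most similar profile in a single-cell reference, here by plain Euclidean distance between expression vectors. All gene counts and cell-type names are invented:

```python
import math

# Toy reference built from single-cell RNA-seq: cell type -> expression of 3 genes
reference_profiles = {
    "epithelial": [9.0, 1.0, 0.5],
    "neuron":     [0.5, 8.0, 2.0],
}

def closest_type(profile, reference):
    """Return the reference cell type nearest to this expression profile."""
    return min(reference, key=lambda t: math.dist(profile, reference[t]))

# A cell measured in situ with fluorescent probes (invented values)
spatial_cell = [8.5, 1.2, 0.7]
print(closest_type(spatial_cell, reference_profiles))
```

In practice the two data types measure different gene sets with different noise characteristics, which is precisely what makes the integration problem interesting.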



I'll move on to problems that may be of interest to techies and ML engineers. I have identified three types of tasks that are currently in demand, and all of them are in medicine; medicine now receives a lot of resources, a lot of money, a lot of data.



The first type of task is drug discovery. There is a disease, it needs to be cured, and for that you need to find a drug. In more detail, the task is: you need to find the composition of a chemical that can be put in a pill or capsule and sent into the body, after which its molecules will bind specifically to those proteins, those targets, whose modified state will change the state of the disease; roughly speaking, cure it.



There are several stages here. One of them is target identification and validation. We must somehow be able to predict which molecules need to be bound for the state of the disease to change. For this, a large set of data is collected: you take sick people, you take healthy people, and you measure many different parameters from them. You sequence DNA and RNA, and measure transcriptomics and proteomics, the state of proteins.



Next, you try to determine which parameters of the cells are specific to sick people and which to healthy people. That is, you try to determine which molecules are potentially correlated with the disease. That is one side. On the other hand, you also need to find molecules that are druggable, that is, that have the potential to bind the active chemicals you send into the body to heal it. Here you need to measure many parameters: binding, protein folding, and so on.



Machine learning is now actively used for this. You look at different protein compounds and try to predict, based on known targets, whether a particular target will be a good one. In addition, you also have to synthesize the right drug: you need to find a chemical composition for the molecule such that it binds specifically to the protein you need to target, and such that it can actually get into the body, dissolve in water, and so on. There are many features that need to be optimized. Doing this by hand is hard, but it can be predicted: you already have known drugs, so you compare a new potential drug with the known ones and predict how successful it could be. All of this is at the level of prediction; afterwards it needs to be validated and really shown to work. But drug prediction is key to cutting down the money and time spent on research. This is very relevant.
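The "compare a new candidate with known drugs" idea can be sketched in miniature. Real models use far richer molecular representations; here each compound is reduced to an invented vector of physico-chemical features, and the candidate is scored by similarity to approved drugs:

```python
import math

# Hypothetical feature vectors, e.g. (solubility, binding affinity, stability),
# all values invented for illustration
known_drugs = {
    "drug_x": [0.8, 0.3, 0.6],
    "drug_y": [0.2, 0.9, 0.4],
}
candidate = [0.75, 0.35, 0.55]

# Score the candidate by closeness to each known drug (higher = more similar)
similarities = {name: 1 / (1 + math.dist(candidate, feats))
                for name, feats in known_drugs.items()}
best = max(similarities, key=similarities.get)
print(best)  # the known drug the candidate most resembles
```

The logic is the same as in the talk: a candidate resembling successful drugs in the properties that matter is a cheaper bet to take into the lab than a random molecule.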



The second kind of problem, related to the first, is, roughly speaking, finding biomarkers of disease. Cancer is a good example. Part of the reason it is so difficult to treat is that it is so heterogeneous: there are huge differences between two patients. In general, cancer is what happens when a certain number of mutations have accumulated and broken the cell. Instead of performing its function, the cell simply begins to divide very quickly and displace healthy cells, and this gradually kills the body. But there are many different mechanisms by which a cell can break down. One person's cancer is not another person's cancer, and a drug that works for one may not work for another. Accordingly, it is very important to be able to quickly determine which genes and other parameters to look at in order to understand that a person has a specific disease. That is, we need to find biomarkers. Databases are used for this. Data of various formats is now being actively collected from large numbers of people, healthy and sick. You need to distill the outcome: a person may or may not be cured, and you need to understand which people get sick with what. If you can quickly find exactly what broke, then you can cure it.



The third area that is currently developing is, funnily enough, text mining. There is a huge amount of literature in biology now; a very large number of labs are working on a huge number of things. People often discover things, say, a protein-protein interaction or a drug-protein interaction, independently, in different parts of the world, without knowing about each other's results. Text mining goes through the published articles and builds a database. That is, if in one place it was determined that one protein interacts with a second protein, and in another that the second protein can be acted upon by a certain drug, it follows that this drug may also affect the first protein. An interaction graph is built, and you can predict new, previously unobserved interactions.



Another type of problem that I wanted to mention, and which in my opinion is quite interesting, is image analysis. Images are a powerful data format, used very often and very widely in biology, because you can learn a lot about a cell from the way it looks.



When a large number of microscopy images accumulates, you need to analyze them quickly and be able to make predictions. A common example is, again, cancer: you take a biopsy and look at how healthy and diseased cells are arranged. You stain them, the nucleus in one color, the cytoplasm in another, and then you try to predict: is this tissue with a tumor, or not?



For more fundamental research, processing an image from a microscope is harder still: people want to look at particular organelles, molecules, or proteins, and, accordingly, trace how cells interact with each other, how they develop, and so on. People have learned to color various elements of the cell using fluorescent proteins: you take what you need and attach the tagged protein to it. If you shine a light on it, it will light up, and you will see that the organelle, protein, or RNA is in a certain place. Then you track how the cells interact. This also requires image analysis, because there are a lot of images and, as a rule, they are not of very good resolution; you need to recover a good resolution from blurry pictures. The community does not stand still: people write neural networks, tweak parameters, and so on. But data evolves, and methods must evolve with it; these things must go hand in hand.



The current trend, which many labs are thinking about, is "how to conquer time". Very often, in sequencing, image analysis, and so on, there is a problem: you have a snapshot of the system, but it is static. You take a measurement at a specific time, and you do not understand how the cells will develop further. One approach to this is live imaging: you do not kill the cells, but place them in an environment where they develop, interact, and so on, and with a microscope you take a snapshot every 10 seconds or every minute; afterwards you can reconstruct the trajectories of movement, interactions, and so on. But there is a limitation: fluorescent tags, for example, are not great for live imaging, because when you shine light on a tag, it emits radiation, and this is toxic to the cell; the cell begins to die. A compromise has to be found: on the one hand, you want to keep the cell as healthy as possible, but on the other, you want to take more snapshots, and the more you take, the faster it dies.
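The trade-off can be made concrete with a toy model: suppose each snapshot costs the cell a fixed fraction of its viability (the 2% figure below is entirely invented). More time points then mean a richer trajectory but a less healthy cell by the end of the experiment:

```python
def viability_after(n_snapshots: int, cost_per_snapshot: float = 0.02) -> float:
    """Toy phototoxicity model: viability shrinks multiplicatively per snapshot."""
    return (1 - cost_per_snapshot) ** n_snapshots

# More snapshots -> better temporal resolution, but a sicker cell
for n in (10, 100, 500):
    print(n, round(viability_after(n), 3))
```

The experimentalist's job is to pick an imaging schedule somewhere on this curve; the computational job is to squeeze as much information as possible out of the few snapshots that schedule allows.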



And here there is an approach: people are now trying to determine the fate of a cell with a minimal number of fluorescent marks, in fact, using only the contour of the nucleus and the cell. It's like face recognition: before, you could do it from visible eyes, mouth, nose, and other features, but now you have to do it from the nose alone, because the eyes may be behind sunglasses and the mouth behind a mask. The problem becomes harder, and the same is true here: biological parameters have to be computed from a small amount of information, and there are a large number of such tasks.



There are a lot of tasks and a lot of data types; all kinds of parameters of cells, organisms, and more are being measured. This is a very interesting area. I hope that if you were thinking about it before, I haven't discouraged you.












