What is a service mesh, when should you implement it, what are the alternatives to Istio: expert answers from the Slurm AMA session on service mesh





We are publishing a Q&A session on service mesh, held in preparation for the Slurm service mesh intensive. A recording is available on YouTube.



The experts answered the most popular service mesh questions as well as questions from the event participants. Key questions of the AMA session:



  • What is a service mesh?
  • When should you implement one?
  • What are the alternatives to Istio?
  • Why does the service mesh use Envoy rather than Nginx?


The event was moderated by Marsel Ibraev, CTO of Slurm. Alexander Lukyanchenko, team lead of the architecture team at Avito, and Ivan Kruglov, Staff Software Engineer at Databricks, shared their expertise.

Both engineers have experience not only operating specific service mesh implementations but also building their own, which is much cooler.



Marsel Ibraev: What is a service mesh and what tasks does it solve?



Alexander Lukyanchenko: I would start with the basic definition: a service mesh is, first of all, an approach, one with many specific implementations. The main idea is that when we have some kind of distributed system whose components interact over the network, we add an extra network layer that lets us attach features and logic to the interservice communication between network nodes.



In simpler terms, when we have a set of microservices, pieces of a system, we simply add a dedicated proxy server next to each piece and route the traffic between the microservices through those proxies. This gives us a large set of capabilities for managing traffic between the services and for collecting statistics and monitoring. There are many more cool things on top, such as security and mutual TLS. But the main point is that the service mesh is an approach.
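The sidecar idea described above can be reduced to a toy in-process model. This is a sketch, not any real mesh's implementation: every call between services passes through a proxy object that records uniform telemetry, while the "business" services themselves stay unchanged. All names here are illustrative.

```python
import time

class SidecarProxy:
    """Toy stand-in for a per-service proxy: forwards the call and
    records uniform telemetry the service knows nothing about."""

    def __init__(self, service_name, handler):
        self.service_name = service_name
        self.handler = handler   # the business logic behind the proxy
        self.metrics = []        # uniform per-request records

    def call(self, request):
        start = time.perf_counter()
        try:
            response = self.handler(request)
            status = "ok"
        except Exception:
            response, status = None, "error"
        self.metrics.append({
            "service": self.service_name,
            "status": status,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        return response

# Two "microservices" that emit no telemetry of their own:
pricing = SidecarProxy("pricing", lambda req: {"price": 42})
broken = SidecarProxy("broken", lambda req: 1 / 0)

pricing.call({"sku": "A1"})
broken.call({})

statuses = [m["status"] for m in pricing.metrics + broken.metrics]
print(statuses)  # ['ok', 'error']
```

The point is that both services report identical metric records without containing a single line of monitoring code, which is exactly the property the proxy layer buys you.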



Ivan Kruglov: I agree. We all know about microservices, and many of us take monoliths and saw them apart. My observation is that when people talk about cutting up a monolith, they picture a brick being divided into small cubes. These cubes are their services. And when we think about them, we focus on the cubes, forgetting that the complexity, the business logic, has not gone anywhere in the process of breaking up the monolith. Some of the complexity stays inside the cubes, but a lot of the difficulty remains in the space between them, in the arrows that people usually draw as thin lines. And there are many problems there.



For example, a simple function call that used to be just a function call becomes a remote call. You inherit all the problems of network interaction, plus authorization, authentication, tracing, monitoring: a whole set of concerns. The service mesh tries to solve exactly this problem, the complexity of interservice interaction, those very arrows that connect our cubes (microservices). That is my understanding of the service mesh. I also think of it as Communication-as-a-Service: a service that lets your services talk to each other while abstracting away, to some extent, the problems of that interaction in terms of reliability, monitoring, and security.



Marsel Ibraev: We can conclude that the service mesh is not a specific technology but an approach, which in practical terms means adding a proxy server to each microservice. This lets us manage both the traffic and the security of the connections between microservices more flexibly. Surely there is some kind of Control Plane or daemon that monitors and manages all of this and centrally distributes configuration to the proxy servers?



Ivan Kruglov: In general, yes. However, the proxy is an optional attribute. For me, the defining attribute of a service mesh is dynamic configuration, because in a microservice environment everything changes and scales constantly: scale up, scale down. In my opinion, dynamism is the most important criterion. With the rest, I agree.



Alexander Lukyanchenko: I would also say that the Control Plane is just one part of the service mesh. It lets us configure the mesh and tell the entire system, from a single point, what rules we need and how everything should interact. Flexible control over the whole system from one place.



Marsel Ibraev: Let's move on to the next question: when should we think about implementing a service mesh, and what are the pros and cons of using it? As far as I know, service mesh is a relatively young technology, in the sense that so far mostly large companies can afford to implement it, and they are only just starting to, in particular in Russia. When should an ordinary team start thinking about a service mesh? What factors determine this, and what will adopting it entail?



Alexander Lukyanchenko: In any case, I would start from needs and problems. It is worth remembering that this technology really is gaining popularity and many teams are starting to implement it or want to, but it solves specific problems. One very widespread use case is improving observability: understanding how all the cubes interact within our large system. With a service mesh we get a uniform way to do this immediately and for the entire system at once. That is one of the use cases.



Then there is what Ivan said about the reliability and resilience problems of networking that appear when we move to a microservice architecture. That is another class of problems a service mesh can solve, by introducing patterns such as the Circuit Breaker and automatic retries, i.e. network interaction patterns, all outside our business logic, outside our cubes, in a uniform way for the entire system.
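The two patterns named above can be sketched in a few lines. This is a deliberately simplified model, not how any particular mesh implements them: retries happen transparently to the caller, and after repeated failures the breaker "opens" and fails fast instead of hammering a dead upstream. Names and thresholds are illustrative.

```python
class CircuitBreaker:
    """Toy circuit breaker wrapping a callable, with automatic retries."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # consecutive failures before opening
        self.failures = 0
        self.open = False

    def call(self, fn, retries=2):
        if self.open:
            # Fail fast: don't even touch the network.
            raise RuntimeError("circuit open: failing fast")
        for _attempt in range(retries + 1):
            try:
                result = fn()
                self.failures = 0  # success resets the failure counter
                return result
            except ConnectionError:
                continue  # transparent retry, invisible to business logic
        self.failures += 1
        if self.failures >= self.max_failures:
            self.open = True
        raise RuntimeError("upstream failed after retries")

def flaky():
    raise ConnectionError("upstream down")

cb = CircuitBreaker(max_failures=2)
for _ in range(2):
    try:
        cb.call(flaky)
    except RuntimeError:
        pass
print(cb.open)  # True: further calls fail fast without touching the network
```

In a real mesh this logic lives in the sidecar proxy and is configured declaratively, which is precisely why it stays "outside the cubes."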



If we have these problems in our system, we can solve them by implementing a service mesh. Solving them this way is obviously a plus, both in terms of simplicity and in how a service mesh can be introduced into a system. For example, setting up unified monitoring across the whole system, making sure common metrics (latency, requests, errors, and so on) are collected from all services, can be a genuinely difficult task, especially in a heterogeneous environment. With a service mesh we can do this more easily: we add the proxy containers and gain the ability to configure everything across the system in a uniform manner. This is probably one of the key features of the service mesh.



I also like the definition that a service mesh addresses problems which could, in principle, be solved entirely inside the business applications; but because we cannot realistically add uniform logic to hundreds or thousands of components of our system to get some feature, we move it to another level of interaction, to these service proxies, and get all the capabilities that could in theory have been implemented in the business applications themselves. In heterogeneous systems this turns out to be very useful. That is a big plus.



Among the minuses I would single out performance. Before implementing, you need to assess accurately how the parts of the system interact. There are fairly narrow cases, but for example when we have a very large amount of synchronous network interaction, say over HTTP, within the system, every service proxy adds extra overhead. In most cases this is not a problem, but when introducing the technology it is worth keeping in mind that it brings a certain overhead to the system, added at every node.



The second minus, probably even more significant than the first, is the complexity of configuring and implementing the solutions that exist today. Although the technology is no longer so young, it is still not easy to roll out on a large or even medium-sized system. This is because there are a great number of capabilities and sprawling configurations, so the barrier to entry for introducing this technology is quite high.



Ivan Kruglov: I will also try to answer this question, starting from a distance. Let me explain why I decided to implement a service mesh at my company in 2018. I'm talking about Booking.com.



The problem was that we had a number of services and a library responsible for interservice communication. It could emit metrics, implement reliability patterns such as retries, and handle service discovery. Everything worked well, everything seemed fine. But there were two big problems.



Problem number one: I had 20-30 services that were not under my control. They were deployed on their own schedules by teams in other time zones, and whenever I needed to roll out a new version of the library, the upgrade process took a very long time. I had to wait, and the wait was very long.



Second, at that time we were moving toward a more microservice-oriented approach, and it was clear we would not have a single stack. Sasha talked about homogeneous and heterogeneous stacks. In plain terms: a homogeneous stack is when you have one technology, for example Java, and all your services are in Java. Heterogeneous is when you have a zoo: one service in Python, one in Java, a third in Go. At Booking.com we decided not to bet on a homogeneous stack; we assumed there would be many stacks. So it was clear that the library, in which everything was so nice, would have to be rewritten in potentially 4-5 languages, and then all of those versions would have to be maintained. Ideally they would all behave consistently, emit the same metrics, and implement the reliability patterns in the same way. In short, it was clearly going to be a huge pain, so our solution was to move to a single platform, or approach: deploying a proxy that handles metrics, reliability patterns, and discovery identically, regardless of the language the application behind it is written in.



Speaking more concretely about when to use a service mesh: I have no answer of the form "at this point you must use it, and at this point you must not." But in general, the more services you have and the more languages they are written in, the more likely a service mesh will help you. I say probability, not certainty, because there are many problems there too, and Sasha has described them. If your services respond in hundreds of milliseconds, a proxy adding a couple of milliseconds is a difference you will hardly feel.



A much bigger problem is the sprawl of the technology itself. Istio has more entities than Kubernetes; it is simply more complex. They are, of course, working on simplification, but this is still a separate technology that needs to be supported and given resources. Besides learning it yourself, you need, to some extent, to train your developers to use it.



Marsel Ibraev: About the overhead: from the infrastructure point of view, there is the proxy running alongside each service, which is what I saw, but there is also RAM consumption. Each of our pods now consumes 50-70 MB more RAM, even at idle, if I remember the metrics correctly. So you need to think about whether you really need it.



To summarize: suppose your company has a really large, sprawling cluster, a large application with a bunch of microservices, perhaps written in different languages, along with serious requirements for fault tolerance and fast incident resolution, so that when a failure happens you can quickly understand exactly where the problem arose and fix it. As Ivan said, there is no single factor or argument that says you must implement a service mesh now; but the more of these arguments add up, the more you should look toward a service mesh and prepare the ground: maybe deploy it on a test cluster, dig in, and see how it behaves. To be fair, the same Istio installs easily enough, even on production.



Ivan Kruglov: That is just the market at work. If you want to win the market, you maximize your customer base, and for that you need to make installation as simple as possible. So everything is optimized for that.



Marsel Ibraev: Yes. The next question I would like to discuss: if the service mesh is an approach, a methodology, then what specific products exist now? What is relevant, and what should you look at if a company decides to try a service mesh?



Alexander Lukyanchenko: From an implementation point of view, it makes sense to try different implementations. If we talk about specific names, the most popular solution today is Istio, which for several years has been marketed very aggressively around its feature set, pushing to be adopted everywhere. Most people probably know Istio specifically. This implementation has existed for quite a long time, has a large number of capabilities, and after the latest releases it is mature enough: it is suitable for production use and has, in principle, worked through its basic problems with performance and with initial installation and configuration. Setup has been simplified and the technology has become easier to roll out. But there are a few more technologies I would definitely pay attention to.



The first is Consul Connect, a development from HashiCorp. HashiCorp technologies and tools are of high quality, so if you already have something from their stack, it is well worth looking at their service mesh solution too. In the number of capabilities they are usually catching up with Istio, but they implement things more thoughtfully and in more detail, largely because HashiCorp has its own customers on whom they first test these products before releasing finished solutions to the public.



I would also highlight Linkerd2, which for a long time was called Conduit. This solution is worth looking at in terms of features and performance, because it is quite different from the rest: it uses its own proxy server, and thanks to this it can offer better scalability. That said, Envoy-proxy is suitable for almost any implementation and is used in large companies, including ours; in terms of overhead, primarily additional CPU time and RAM, its consumption relative to the business services themselves is acceptable. That difference goes purely to networking: the network stack, request processing, and the actual functionality Envoy-proxy carries.



So I would look, from different angles, at three solutions: Istio, Consul Connect, and Linkerd2. At the end of the day, chances are they all accomplish the tasks you need to solve in your system; which of them suits you best will depend on what is more convenient for you and what you like. If we talk about vision, it personally seems to me that in the end the leaders, if not the outright winners, will be the service mesh solutions based on Envoy-proxy, simply because it is gaining immense popularity and has a very large number of features out of the box. Most likely, most service meshes will be built on it. But it is still worth looking at everything.



Ivan Kruglov: I agree with Alexander. Istio is the largest and most popular, partly because Google is behind it; it is their technology. I have not touched the others, so I cannot judge them. But I know that at HashiCorp everything is focused on ease of use, so it makes sense to look there, and if you have service discovery built on Consul, or use Consul at all, it makes even more sense. As for Linkerd2, the third version is currently available.



Marsel Ibraev: Why is Envoy used in service meshes? Why not more established tools like Nginx or HAProxy, which have plenty of functionality? Apparently something is missing. What?



Ivan Kruglov: The main characteristic of a service mesh is dynamism, and that is precisely what both Nginx and HAProxy lack. They were born in the era of more static configuration, which you could write out in config files, and which either did not change or changed rarely. In a service mesh, changes happen every second. Envoy and Linkerd have their own protocols that allow configuration to be pushed into them dynamically: you can write a service that pushes configuration to them over HTTP/2. Both rebuild their routing tables dynamically and work really well. For me, this is the main reason Nginx and HAProxy have not taken root in the service mesh.
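The dynamism Ivan describes can be reduced to a toy: a proxy whose routing table is replaced at runtime by a control plane "push," with no restart and no config file reload. Real Envoy does this over its xDS APIs; the class and method names below are purely illustrative.

```python
class DynamicProxy:
    """Toy data plane: holds a routing table that a control plane
    can swap out at any moment."""

    def __init__(self):
        self.routes = {}  # cluster name -> list of endpoints

    def apply_config(self, new_routes):
        # Atomic swap of the whole table: the moral equivalent of
        # receiving a new configuration snapshot from the control plane.
        self.routes = dict(new_routes)

    def pick_endpoint(self, cluster):
        endpoints = self.routes.get(cluster, [])
        if not endpoints:
            raise LookupError(f"no endpoints for {cluster}")
        return endpoints[0]

proxy = DynamicProxy()
proxy.apply_config({"payments": ["10.0.0.5:8080"]})
print(proxy.pick_endpoint("payments"))  # 10.0.0.5:8080

# A scale-up event happens; the control plane pushes a new table:
proxy.apply_config({"payments": ["10.0.0.7:8080", "10.0.0.5:8080"]})
print(proxy.pick_endpoint("payments"))  # 10.0.0.7:8080
```

A static config-file proxy would need a reload between those two states; the whole point of the xDS-style design is that the second `apply_config` arrives over the wire while traffic keeps flowing.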



Alexander Lukyanchenko: I agree that dynamism is the main feature, and it was built into Envoy's design. I should say that HAProxy has also gained this capability in recent versions, and there are now active attempts to build a service mesh on top of it. But there is one more point that is very important, in my opinion. Because Envoy was created as a cloud-native solution, it has patterns built in, such as fairly extensive metrics across all network interaction. Nginx has these things too, but they are achieved with additional modules and glue, whereas in Envoy-proxy everything comes out of the box, available by default. When you put it into your system, you immediately get valuable data that would otherwise have to be collected, tested, and honed using other technologies.



Marsel Ibraev: So it turns out that with service proxies in the form of Envoy, we get certain capabilities, certain features. The next question is related to this. We have said the service mesh is genuinely hard, but it also seems very cool, so I would now like to focus on what features a service mesh offers: security, observability, deployment strategies.



Ivan Kruglov: The term observability is usually understood as a combination of three things: monitoring (classic metrics), logging, and tracing. It is what allows you, as a service operator, to understand what is happening in your service now or what happened at some point in the past. I should say right away that the service mesh does not really touch logging; in the context of a service mesh, observability is about metrics and tracing.



To answer this question, let me first rephrase it: what does introducing this technology give the business? In my opinion: consistency. Let me explain. Imagine you have, say, 100 microservices, and you need to understand how they interact with each other. You can instrument the services and make them emit HTTP metrics, but most likely the result will be inconsistent. Someone reports in seconds, someone in milliseconds; someone reports the error class 5xx, someone reports the exact 500. Inconsistency appears. Then the question arises of building a single dashboard to understand what is happening with your system, and it all becomes a big headache, because your metrics are named differently, are in different units, and so on. With a service mesh this problem is solved at once: you deploy the same Envoy everywhere, and it emits metrics with the same names, the same units, the same buckets, etc. You build just one dashboard, where you select the desired service and see everything that is happening to it.
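The consistency argument can be shown in miniature: every sidecar uses the same metric shape, the same unit (milliseconds), and the same histogram bucket boundaries, so one dashboard query works for every service. The bucket boundaries below are illustrative, not any mesh's defaults.

```python
# One shared bucket layout for every service in the mesh (upper bounds, ms).
BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]

def record(histogram, latency_ms):
    """Increment the first bucket whose upper bound fits the observation."""
    for bound in BUCKETS_MS:
        if latency_ms <= bound:
            histogram[bound] = histogram.get(bound, 0) + 1
            return
    histogram["+Inf"] = histogram.get("+Inf", 0) + 1

# Two different services, one uniform metric shape:
metrics = {"checkout": {}, "search": {}}
record(metrics["checkout"], 42)   # falls into the 50 ms bucket
record(metrics["search"], 42)     # same bucket, same unit, same scheme
record(metrics["search"], 3000)   # overflow bucket

print(metrics["checkout"])  # {50: 1}
print(metrics["search"])    # {50: 1, '+Inf': 1}
```

Because both histograms share names, units, and buckets, a single dashboard template can be pointed at any service without per-team translation, which is exactly the headache being removed.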



Tracing is a little more complicated, because it is built on headers, and those headers need to be carried between the point where a request enters the application and where it exits. So a little extra instrumentation is required, but in general the result is the same: you get a topology and the ability to track a specific request. In a sprawling microservice architecture, figuring out what is going on is a huge challenge. The Control Plane in the service mesh is a centralized control panel, so you can push certain policies centrally, starting with retry frequencies and timeouts. Imagine you have created a new service: if the defaults are configured correctly, you immediately get monitoring, tracing, retries, backoffs, and circuit breakers out of the box, implemented well. You can also push certificates and access policies (for example, this service may talk to one service but not to another).
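The "little extra instrumentation" that tracing needs can be sketched as follows. The sidecar can generate or forward a trace header on the way in, but the application itself must copy that header from the incoming request onto any outgoing calls it makes. The header name here is illustrative; real meshes use headers such as `x-request-id`, B3, or `traceparent`.

```python
import uuid

TRACE_HEADER = "x-request-id"  # illustrative choice of trace header

def sidecar_inbound(headers):
    """Sidecar side: make sure a trace id exists on the incoming request."""
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return headers

def application(in_headers):
    """Application side: the one thing the app must do is propagate the
    header from the incoming request to its outgoing calls."""
    out_headers = {TRACE_HEADER: in_headers[TRACE_HEADER]}
    return out_headers

incoming = sidecar_inbound({})
outgoing = application(incoming)
print(outgoing[TRACE_HEADER] == incoming[TRACE_HEADER])  # True
```

If the application drops the header instead of copying it, the trace breaks at that hop, which is why tracing, unlike metrics, cannot be delivered by the mesh entirely for free.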



Summing up: consistency, meaning that fixed reference points for monitoring services appear in your microservice architecture, plus the ability to centrally manage policies, settings, and so on.



Alexander Lukyanchenko: I would also add that since the service mesh inserts itself into the synchronous interaction between our microservices, there are endless possibilities for managing the traffic that flows between services. This is the opportunity to do canary deployments and blue-green deployments: things that are difficult or impossible to do out of the box in, for example, Kubernetes.



If we talk about load balancing, there is also a very rich set of capabilities in the internal settings: various balancing schemes with hash policies to pin requests to particular endpoints, weighted balancing, and protection of the connections between microservices with mutual TLS. The latter is genuinely easier to operate than a manual setup, because certificate management, distribution, and rotation can be done at the service mesh level, and the business services themselves may not even know that their interaction happens over a secure protocol. This can be a key point, especially for those who work in a multi-cloud environment, with instances in one public cloud and another, or with instances spread across distributed data centers: cases in which this technology is practically mandatory. With a service mesh this can be solved more easily than by implementing it manually without one.
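Two of the traffic-management features just mentioned, weighted splitting (e.g. a 90/10 canary) and hash-based stickiness that pins a given user to the same backend, can be sketched as below. Weights, version names, and function names are illustrative, not any mesh's API.

```python
import random

def weighted_pick(endpoints, rng):
    """endpoints: list of (name, weight) pairs. Pick proportionally to weight."""
    total = sum(w for _, w in endpoints)
    r = rng.uniform(0, total)
    for name, weight in endpoints:
        r -= weight
        if r <= 0:
            return name
    return endpoints[-1][0]

def sticky_pick(endpoints, user_id):
    """Consistent choice based on a hash of the user id: the same user
    always lands on the same backend within a process."""
    names = [name for name, _ in endpoints]
    return names[hash(user_id) % len(names)]

rng = random.Random(0)
split = [("v1", 90), ("v2", 10)]
sample = [weighted_pick(split, rng) for _ in range(1000)]
print(sample.count("v2"))  # roughly 100 of 1000 requests go to the canary

print(sticky_pick(split, "user-7") == sticky_pick(split, "user-7"))  # True
```

In a real mesh these decisions are expressed declaratively (route weights, consistent-hash policies) and executed inside the proxy, so shifting the canary from 10% to 50% is a config push, not a code change.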



This whole set of capabilities makes it a very cool tool. There are also narrower cases, such as Chaos Engineering to inject degradation into interactions, or policy enforcement, but they can be key to closing a specific need for a specific company.



Marsel Ibraev: To summarize: the service mesh, regardless of implementation, is a single centralized tool that gives us a number of features. In particular, we get uniform, standardized monitoring, or observability in service mesh terms: we collect all the necessary metrics, tracing, and everything else we need. At the same time we can close the security question, with network policies and encryption, and we get very flexible options for working with traffic: various balancing schemes, sophisticated deployment options, and so on. This may be enough to start thinking about implementation, especially on big projects.



Continuing the topic of implementation. Say we have sold the idea to the business and bought into it ourselves: what steps should specialists go through, technically and organizationally? How do you properly implement a service mesh in your own infrastructure if you already have, say, Kubernetes? How do you implement it without breaking anything?



Alexander Lukyanchenko: Here I will tell you a little about our experience. This technology can be introduced gradually. We can explicitly choose the parts of the system to which we are rolling it out, test it there, look at its impact on performance, check whether we get the desired result and the desired set of features, and then gradually roll it out further. For example, Istio lets us inject Envoy into each microservice automatically using a webhook that adds a container to our pod under the hood. And we can say explicitly that we want to see Envoy-proxy in such-and-such a namespace, or inject it into such-and-such instances. That is how it looks for a production rollout.



As for the organizational side, you need to understand how the system works technically and how the interaction between services happens, so that when problems arise you can already reason about them. And rehearse it all in a sandbox: if you have a Kubernetes cluster that mirrors production but with less load, you can implement it there, observe, and only then go to the production cluster. You need to think through all the steps so that you can quickly switch the system off if a problem occurs, and when rolling it out, make sure everything proceeds gradually, including service by service. The hard part is debugging: understanding what is actually happening at a given moment is not an easy task. Despite its high-level simplicity, this technology is difficult to debug, so to be able to recover quickly in production I would definitely recommend starting with a sandbox and building a clear understanding of what is going on, so that you can find problems fast.



Ivan Kruglov: I agree. You, Sasha, talked mainly about Istio. To be clear, when I talk about a service mesh, I am talking about a self-written one. When we started (at Booking.com), Istio was version 0.1 and I had no desire to push it into production, because the technology was terribly raw; in retrospect, that was the right call. The second reason I decided to write my own is that at the time there was no service mesh outside Kubernetes: Kubernetes was Istio's only platform. They are now slowly extending it to run outside Kubernetes, but Kubernetes is still the centerpiece; that is where the configuration is stored. Booking.com had no Kubernetes three years ago, and I had to live outside of it. Returning to the main question: yes, I agree with Sasha that the advantage here is that it can be implemented gradually, service by service. That is what I did, starting with the less critical services and moving on to the more critical ones. The last service we moved onto the service mesh was the search service, which handled a million requests per second, the most heavily loaded service in the company.



Marsel Ibraev: To summarize: we stick to a careful approach, make no sudden movements, and, as Alexander said, do not take the high-level simplicity at face value, because everything is much more complicated under the hood. Is there anything to add about service mesh outside Kubernetes, for Istio and Linkerd2?



Ivan Kruglov: The last time I looked at Istio was about a year ago, and this part was in a semi-working state; I think it is in decent shape now. They added the ability to declare, in Kubernetes, instances that live outside Kubernetes. Why couldn't Istio be stretched beyond Kubernetes before? Because Istio relied on Kubernetes primitives: Service, Endpoints. Istio drove its service discovery from there. They introduced a concept called, I believe, a virtual service or service entry, with which you can declare that you have some instance available at some IP address. But as I understand it, keeping that description up to date is your responsibility. If you have a hardware service with a fixed IP address, everything is fine; if it is something more dynamic, you have to write and maintain that description by hand.



Alexander Lukyanchenko: I understood the question this way: is a service mesh even possible where there are no Kubernetes technologies? Technically, yes. Kubernetes is just an orchestrator, and there are other solutions. Istio was originally designed to work with different platforms, and there was a component, now part of the main Control Plane body, called Galley. It converted various manifests into a single description that Istio understands for configuration, so Istio could be adapted to almost any platform. As for bare metal, or virtual machines with no automation tooling, you can in principle install Envoy there and write routing policies so that traffic goes through it. But then the question is what profit and which features we want to get from this. Technically it can be put anywhere; the only questions are ease of use and whether the solution is actually needed.



Ivan Kruglov: By the way, so that colleagues understand: Envoy is one of the two key components of all modern service meshes. It was created by Lyft. Their service mesh was very rudimentary: they configured their Envoys through configuration files, and only later developed a self-written Control Plane. Writing a Control Plane, by the way, is not really that difficult. There is a clearly declared protocol, and there are template projects, scaffolding that lets you build your own Control Plane. The world does not revolve around the current implementations; of course they have many features and community support, but if you need specific features you cannot find in what exists, there is the possibility of writing your own. That is a certain, fairly serious overhead, but the possibility is there: the technology is open, the protocols are all described. If we are asking whether a service mesh is possible outside Kubernetes in the broad sense, then yes, it is; there are no restrictions. More narrowly, can Istio or Consul be stretched outside Kubernetes? As they say, it depends. The further we go, the more possible it becomes. With Istio, I think it can work pretty well now. Consul Connect, in my opinion, entered this area without Kubernetes, so there it should work out of the box. I think in Consul Connect everything is tied to Consul Service Discovery: if your services are visible in their Discovery, they will also be visible in the service mesh.



Marsel Ibraev: At the top we now have a question from Anton, asking what will be covered in the service mesh intensive. Yes, we are planning an online intensive on service mesh for March 19-21 (editor's note: the first intensive has already taken place; the second is scheduled for September 24-26, 2021). Our experts will run it. Ivan Kruglov is the practice lead: all our educational programs are built from practice, and we pay a lot of attention to that. Alexander Lukyanchenko will be a speaker at the intensive. There will be little armchair theory; for the most part we will focus on practice and practical application. We will go straight to installing a service mesh, doing everything with Istio as the example: install it, work through the abstractions, and get hands-on with the features, deployment strategies, multicluster, mTLS, Chaos Engineering, and so on. We will touch everything in the service mesh concept with our own hands, and by the end you will have knowledge and skills that will help you implement all of this on your own. The format is live, in Zoom.



This was the first part of the transcript of the AMA session. A continuation with more questions from the event participants is in the works and will be published soon. What service mesh question concerns you?


