Kubernetes on your own infrastructure: pros and cons of private clouds

Dear readers, good day!



In this article, Igor Kotenko, Chief Architect of Neoflex, shares his experience of deploying a containerization platform on an enterprise infrastructure.







The reasons companies choose an on-premise solution are often non-technological and frequently come down to funding. Some are trying to reduce operating expenses (paying for external clouds) in favor of capitalizing the company (buying their own servers); others already have solid hardware resources and want to use them in a microservice architecture.



Before moving on to the implementation details, let's turn to the terms.



The term "clouds" is considered to be very congested. It is customary to distinguish between different types of cloud solutions:



  • Infrastructure as a Service (IaaS) - hardware (usually virtual);
  • Software as a Service (SaaS), for example DBaaS - database as a service;
  • Platform as a Service (PaaS);
  • Application as a Service (AaaS).








At the same time, nothing prevents these layers from being stacked on top of one another: obviously, there will always be infrastructure underneath a platform or software.



There is some confusion around the term "private cloud". Sometimes it refers to a cloud deployed on-premise, sometimes to one deployed on leased infrastructure with complete isolation of your network segment. There have even been mentions of virtual machines with encrypted memory and disks, where even a memory dump would not give the provider access to your information. In this article, we will discuss solutions deployed in-house: on-premise.



When introducing private clouds, people expect them to be the same as public ones, only cheaper, more secure, and more reliable. Many therefore assume that private clouds are a priori better. Often, experts simply deploy the chosen version of Kubernetes or OpenShift and believe their work is done.



What companies expect to get when implementing on-premise clouds:



  1. Low resource cost: you only pay for what you use.
  2. The ability to add and return resources as quickly as possible.
  3. Fault tolerance: a server crashes, and another automatically takes its place.
  4. Low maintenance cost.




How is this achieved in public clouds?



As a rule, through process automation, economies of scale (buying in bulk is cheaper), and sharing resources between different consumers.



Let's look at these promises in the context of private clouds.



1. Low resource cost compared to conventional infrastructure.



In reality, this is not the case. The same software is deployed on the same machines, only in containers. In our experience, the opposite holds: more resources are wasted.



2. Ability to increase and decrease resources extremely quickly.



No. To expand, you either need to keep a hot reserve of hardware and software licenses idle, or first throw out something unnecessary. You can release resources, but then they will sit idle.



3. Fault tolerance.



Yes, but with many nuances. Suppose a server goes down. Where do you get another one? How do you quickly provision it and add it to the cluster? If you are not Amazon, you do not have an infinite supply of resources.



4. Low cost of support.



We have added at least one more layer (the containerization platform) and several new systems. We need specialists with new competencies. Where will the savings come from?



Let us examine these issues in more detail. It is important to remember that private clouds must coexist with existing legacy systems: organizations are forced to maintain the infrastructure of their existing systems in parallel, ending up with a hybrid IT environment.



Naturally, in 99% of cases the system is not built from scratch. Typically, even before a PaaS solution is implemented, there is already a set of processes and automation supporting the old infrastructure. DevOps processes, resource planning and ownership, monitoring, software updates, security: all of these have to be coordinated and reworked during the implementation of a private cloud.



How DevOps processes are changing



Typically, before implementing your own PaaS, the approach to DevOps is built on configuration automation systems such as Ansible or Chef. They allow you to automate almost all routine IT processes, often using ready-made script libraries. Containerization platforms, however, promote an alternative approach: "immutable infrastructure". Its essence is not to modify the existing system, but to build a ready-made image of the system with the new settings and replace the old image with the new one. The new approach does not negate the old one; rather, it pushes configuration automation down into the infrastructure layer. And, of course, legacy systems still require the old approach.
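As an illustration, here is a minimal sketch of an immutable rollout, assuming Docker and kubectl are available; the registry address and the deployment name "my-app" are hypothetical placeholders, not something from the original article:

    import subprocess

    def run(cmd: list[str]) -> None:
        """Run a shell command and fail loudly, so a broken build never ships."""
        subprocess.run(cmd, check=True)

    def immutable_rollout(version: str) -> None:
        # Hypothetical names: the registry address and the "my-app" deployment.
        image = f"registry.example.com/my-app:{version}"
        # 1. Bake a brand-new image with the new settings instead of
        #    mutating running servers in place (the Ansible/Chef way).
        run(["docker", "build", "-t", image, "."])
        run(["docker", "push", image])
        # 2. Swap the old image for the new one; Kubernetes rolls the
        #    deployment over, and the old pods are discarded, not patched.
        run(["kubectl", "set", "image", "deployment/my-app", f"my-app={image}"])

    immutable_rollout("1.0.1")

The point is that nothing is reconfigured in place: every change ships as a complete new image.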



Let's talk about the infrastructure layer



The de facto standard in IT is the use of virtual infrastructure. As practice shows, the most common option is vSphere. There are many reasons for this, but there are also consequences. In our practice, frequent problems with resource oversubscription (trying, as the saying goes, to cut seven hats from one sheepskin) were aggravated by the almost complete lack of control over this process by those responsible for the solution's performance. The division of areas of responsibility between company departments, formalized resource request procedures, and the diverging goals of department management led to problems in the production environment and to inconsistent load testing. At some point, our development department even created a tool for measuring virtual core performance, to quickly diagnose a lack of hardware resources.



It is obvious that an attempt to place a containerization platform on such an infrastructure will bring new colors to the existing chaos.



The question of whether an on-premise containerization platform needs a virtual infrastructure, or whether it is better to install it on bare metal (physical servers), has been discussed widely and for a long time. Articles lobbied for by virtualization vendors argue that there are practically no performance losses and that the benefits are too great to pass up. On the other hand, there are independent test results reporting around 10% performance loss. And don't forget the cost of vSphere licenses. Install a free version of Kubernetes on inexpensive hardware just to save money, and then pay for vSphere? A controversial decision.
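To make the trade-off concrete, here is a back-of-the-envelope sketch; the hardware price, license cost, and the 10% loss figure are illustrative assumptions, not measurements:

    def cost_per_usable_core(hw_cost: float, cores: int,
                             perf_loss: float = 0.0, license_cost: float = 0.0) -> float:
        """Cost per core actually available to workloads: performance
        loss shrinks usable capacity, licenses add to the price."""
        usable_cores = cores * (1 - perf_loss)
        return (hw_cost + license_cost) / usable_cores

    # Hypothetical 32-core server at $9,600, plus $2,000 of virtualization licenses.
    print(cost_per_usable_core(9_600, 32))                                      # 300.0
    print(cost_per_usable_core(9_600, 32, perf_loss=0.10, license_cost=2_000))  # ~402.8

Even before licenses, the 10% loss alone raises the effective price of every core you actually get to use.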



It is worth mentioning open source infrastructure virtualization solutions, such as OpenStack. It is generally viewed as a solution requiring serious investment in the team. There are statistics circulating online according to which an OpenStack support team numbers from 20 to 60 people - and that is separate from supporting the containerization platform! There are few such specialists on the market, which drives up their cost. Investments of this kind usually pay off only at very large resource volumes.



It is also important to take into account which specialists with unique competencies the company already has. Unfortunately, bare-metal Kubernetes installations, despite the benefits of transparency and lower license costs, are often hampered by, for example, the lack of proper installation automation tools. We hope that the future belongs to this approach.

Another aspect that influences the choice between virtual and bare-metal installation is how the physical servers are organized.



Typically, an organization purchases servers for specific purposes. You can rent servers in a data center (choosing from what is on offer), you can standardize and unify the product line (simplifying spare-part stocking), you can save on hardware (many inexpensive servers), you can save rack space. Different organizations take different approaches. In general, these are either powerful servers with many cores and lots of memory, or relatively small ones, or a motley mix. But do not forget about the needs of legacy systems. At this point we again run into a clash of concepts. The Kubernetes ideology is lots of inexpensive hardware and readiness for its failure: a server dies - no matter, your services have moved to another one, and data is sharded and replicated rather than tied to a container. The legacy ideology is redundancy at the hardware level: RAID arrays, disk shelves, multiple power supplies per server, hot swap - a focus on maximum reliability. Betting on such infrastructure for Kubernetes can be unreasonably expensive.










If a company has high-performance servers with many cores on board, they may need to be split between different consumers. Here you cannot do without a virtualization system. At the same time, you must account for server failure or downtime for maintenance. The arithmetic is simple: with two servers, you need a 50% capacity reserve on each to survive the failure of one; with four servers, a 25% reserve. At first glance everything is simple: with an infinite number of servers, the failure of one does not affect total capacity, and nothing needs to be reserved. Alas, the size of a single host cannot be shrunk indefinitely. At a minimum, it must accommodate the sets of related components that Kubernetes terminology calls "pods". And, of course, splitting into smaller servers increases the overhead of the platform's own services.
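The same arithmetic, generalized: to survive the loss of one server out of N, each must keep 1/N of its capacity free. A trivial sketch:

    def required_reserve_fraction(n_servers: int) -> float:
        """Fraction of each server to keep free so the workload still
        fits on the remaining machines after one server fails."""
        if n_servers < 2:
            raise ValueError("need at least two servers to survive a failure")
        return 1 / n_servers

    for n in (2, 4, 10):
        print(f"{n} servers -> {required_reserve_fraction(n):.0%} reserve on each")
    # 2 servers -> 50% reserve on each
    # 4 servers -> 25% reserve on each
    # 10 servers -> 10% reserve on each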



For practical purposes, it is desirable to unify the host parameters for the platform. In real-world examples there are two data centers (supporting a DR scenario means reserving at least 50% of capacity). Next, you determine the organization's needs for containerization platform resources and whether they can be placed on standard bare-metal or virtual hosts. The Kubernetes recommendation is no more than 110 pods per node.



Thus, to determine the size of a node, you need to consider the following (a sizing sketch follows the list):



  • It is desirable to make the nodes equal in size;
  • The nodes must fit on your hardware (for virtual machines, a whole multiple: N virtual nodes per physical server);
  • If one node fails (for the virtual-infrastructure option, one physical server), the remaining nodes must have enough spare resources to take over its pods;
  • There cannot be too many pods (>110) on one node;
  • Other things being equal, it is desirable to make the nodes larger.
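Below is a minimal sizing sketch tying these constraints together; all the workload numbers are hypothetical, and node capacity is approximated only by CPU, memory, and the 110-pod limit - real planning would also account for the platform's own overhead:

    from dataclasses import dataclass

    MAX_PODS_PER_NODE = 110  # the Kubernetes recommendation mentioned above

    @dataclass
    class Node:
        cores: int
        ram_gb: int

    def pods_per_node(node: Node, pod_cores: float, pod_ram_gb: float) -> int:
        """How many pods fit on one node, capped by the 110-pod limit."""
        return min(int(node.cores / pod_cores),
                   int(node.ram_gb / pod_ram_gb),
                   MAX_PODS_PER_NODE)

    def nodes_needed(total_pods: int, node: Node,
                     pod_cores: float, pod_ram_gb: float) -> int:
        """Nodes for the workload, plus one node of headroom so pods can
        be rescheduled when a single node (or its physical host) fails."""
        capacity = pods_per_node(node, pod_cores, pod_ram_gb)
        full_nodes = -(-total_pods // capacity)  # ceiling division
        return full_nodes + 1

    # Hypothetical workload: 400 pods of 0.5 core / 1 GiB on 16-core, 64 GiB nodes.
    print(nodes_needed(400, Node(cores=16, ram_gb=64), pod_cores=0.5, pod_ram_gb=1.0))
    # -> 14 (13 nodes for the workload plus one node of failover headroom)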




These kinds of features have to be considered in every aspect of architecture.



Centralized logging - how to choose from several options?



Monitoring - perhaps your existing monitoring system will not work here; do you keep two systems, or migrate to a new one?



Platform updates to a new version - regularly or only when absolutely necessary?

Fault-tolerant balancing between two data centers - how?



Security architecture and interaction with legacy systems - here there are differences from public clouds. One can recommend building systems so that migration to public clouds remains possible, but this complicates the solution.



There is little difference between public and private clouds in the planning, allocation, and ownership of resources. The main difference is the limited amount of resources. While in public clouds you can obtain additional resources at any time, for example for load testing, on-premise you have to carefully plan the order in which they are used. This may mean night runs and, accordingly, more work for second- and third-line support at inconvenient hours. Nothing new for those already running their own infrastructure, but a bitter taste of disappointment for those expecting miracles from private cloud adoption.



"Cadres decide everything"







When planning the implementation of an on-premise containerization platform, you first of all need specialists with unique competencies, and there are clearly not enough of them on the current labor market. Moreover, without prior experience of such an implementation, it is difficult even to compile a list of all the specialists you will need.



For example, for the platform to work you need to select and install a storage system. Whichever system you choose (Ceph or Portworx), it will be critical for the platform. Everyone knows that a database requires an administrator to maintain it; likewise, a storage system needs a dedicated specialist. Unfortunately, no one thinks about this before implementing the system! Note that the difference between administrators of different products is significant - comparable to the difference between an Oracle DBA and an MS SQL DBA. And you will need at least two people for each role: employees go on vacation, get sick, and even, God forbid, quit. And so on for every competency - and the list of competencies required to support the solution is impressive.



I would like to warn right away against attempts to combine all competencies in a few universal soldiers: their cost, their rarity, and the risk of losing them exceed all reasonable limits.



Again, cloud maintenance is a complex process. Cloud companies do not earn their bread for nothing: behind the cloudy fog lies a great deal of technology and invested labor.



Of course, using consulting services can significantly accelerate the implementation of the platform. Experienced partners will help you avoid many mistakes, establish processes, and train your team. However, if you host business-critical services on the platform, it is equally important to ensure quality support and further development. Moreover, all the systems currently on the market are actively evolving: new technologies appear, and new platform versions may require migrating complex processes and serious testing before updating. A strong support team is required to ensure reliable operation of the system - and you will need that team on an ongoing basis.



When should you consider implementing an on-premise containerization platform?



First of all, you need to assess the ratio of investment to return: the cost of hardware and of staff. There must be either good reasons why you cannot live in public clouds, or genuinely large resource requirements. That is, if about 100 cores are enough for your organization and you are not prepared to grow the support team to dozens of people, you should most likely focus on public clouds or on simple configurations with application servers. There is a minimum team size required to support the platform, and that cost may not pay off. However, as resources scale and all processes are competently automated, the benefits of a private solution can be very significant: many hundreds of nodes can be maintained by much the same team.



Another selection criterion is how variable your need for computing resources is. If a company's business processes create a more or less even resource load, it is more profitable to develop your own infrastructure. If you need large capacities, but only rarely, consider public or hybrid clouds.



In any case, when choosing an on-premise solution, prepare for serious, systematic work and be ready to invest in the team. Go from simple to complex. Pay attention to implementation timelines, and be especially careful when upgrading to new versions of the platform. Learn from other people's mistakes, not your own.



Thanks for reading this article, we hope you find the information useful.


