"Domino Effect", or How we update the cloud software in the data center












vCloud Director



The Linxdatacenter cloud is built on the VMware technology stack, with vCloud Director serving as the control panel for the virtual infrastructure. It is deployed on Cisco components and relies on service infrastructure such as Windows Active Directory.



Toward the end of 2020 we ran into a problem: vCloud Director 9.5 was falling behind what our current tasks required, and we had never gotten around to upgrading it to version 10.1 or 10.2.



That was not a disaster in itself, but at the beginning of 2021 browsers stopped supporting Flash.



To be honest, we did not expect the makers of Flash and of the browsers to be this harsh. Everyone had known about the end of support for a long time, but the news that Flash would be physically removed from operating systems and completely blocked in browsers from January 12 came as a very unpleasant surprise.



The thing is, vCloud Director can be accessed through two portals. The first, built on Flash, was the main and, so to speak, original one, with very broad functionality.



The second, an HTML portal, has been in development since version 8.20, precisely with the retirement of Flash in mind, and has gained new functionality gradually. vCloud Director 9.5, which currently runs at our three sites, covers most customer requests in terms of functionality, but from an administration standpoint fairly significant problems started to appear.



As a stopgap, we found a browser configuration in which Flash access still works. So on the management side we remain in control of the situation, and there are no problems.



For users, however, the functionality of version 9.5 is not ideal. They are used to working in Flash, its absence is inconvenient, and they ask: "It used to work like this, so how do we do it now?" In the 10.x versions the functionality is noticeably better and comes as close as possible to Flash. So we decided that updating vCloud Director was task number one.



Heavy legacy



The situation was complicated by the fact that the cloud platform in our data center in St. Petersburg and at the partner site in Warsaw is a "legacy" inherited from the system integrator who deployed it back in 2013. Until 2017 the same company handled all maintenance and upgrades; by then we had built up enough in-house expertise to take full control ourselves.



Even a preliminary analysis showed that you cannot simply upgrade from version 9.5 straight to 10.2. Drawing up a step-by-step plan for updating every software version across the cloud's components, checked against compatibility matrices, took the architect responsible for the task more than two weeks.



The reason is the complex structure of software version dependencies, which demands a gradual, strictly sequential transition to new versions in order to keep the cloud as a whole running smoothly.
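As a rough illustration of what "strictly sequential" means in practice, the sketch below searches a table of allowed direct upgrades for the shortest chain from one version to another. Only the endpoints 9.5 and 10.2 come from this article; the intermediate names `step-a`, `step-b` and `step-c` and the hops between them are hypothetical placeholders, not VMware's actual compatibility matrix.

```python
from collections import deque

# Hypothetical table of allowed direct upgrades (version -> next versions).
# The intermediate entries are placeholders, NOT real VMware upgrade paths.
DIRECT_UPGRADES = {
    "9.5": ["step-a"],
    "step-a": ["step-b", "step-c"],
    "step-b": ["10.2"],
    "step-c": ["step-b"],
}


def upgrade_path(start, target, hops=DIRECT_UPGRADES):
    """Return the shortest chain of versions from start to target, or None."""
    queue = deque([[start]])  # breadth-first search over partial paths
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in hops.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no supported chain of upgrades exists


print(upgrade_path("9.5", "10.2"))
```

A real plan has to repeat this kind of search for every component and cross-check the results against each other, which is where most of the planning time goes.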



A seemingly innocent vCloud Director upgrade forced us to launch a complete platform update, from the Windows servers running Active Directory to every additional component. Reaching the target version of vCloud Director will require upgrading the entire system four times: the cloud platform upgrade will be performed in three full rounds, or stages.



We will start with our own cloud in Warsaw, followed by the sites in St. Petersburg and Moscow. The work is scheduled for completion in May 2021.



But first, we practice on the cloud's "digital twin".



Digital twin for the cloud



The work plan for even a single site is simply colossal, given that the project team consists of three specialists.



This constraint, together with the deadlines, the number of intermediate steps and the complexity of the infrastructure, required us to thoroughly rehearse the project on a virtual mock-up: a digital twin of the cloud in a virtual laboratory.



The digital twin gives us confidence in the quality of the changes we make to the system and in achieving the expected results. It is also very convenient for working through different platform operation scenarios, and if an error or failure occurs we can always roll back to a virtual machine snapshot and correct the mistake. This speeds up the update process and lets us carry it out without degrading the quality of the system as a whole.
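The snapshot-and-rollback routine described above can be sketched roughly as follows. This is a minimal illustration, not our actual tooling: `take_snapshot` and `restore_snapshot` are hypothetical callbacks standing in for whatever the virtualization platform provides, and each step is simply a named function.

```python
def run_with_rollback(steps, take_snapshot, restore_snapshot):
    """Run upgrade steps in order, snapshotting before each one.

    steps: list of (name, action) pairs; action is a zero-argument callable.
    take_snapshot / restore_snapshot: hypothetical platform callbacks.
    Returns (names of completed steps, name of the failed step or None).
    """
    completed = []
    for name, action in steps:
        snapshot = take_snapshot(name)  # state to return to if this step fails
        try:
            action()
        except Exception:
            restore_snapshot(snapshot)  # undo the partial change
            return completed, name      # stop so the step can be fixed and rerun
        completed.append(name)
    return completed, None
```

Stopping at the first failure rather than continuing fits a strictly sequential upgrade: later steps assume that the earlier ones succeeded.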



All updates are first performed on the digital twin. Once the step-by-step transition to the new versions of all platform elements succeeds there, the detailed work plan for the real target infrastructure is adjusted accordingly.



The digital twin replicates the platform infrastructure down to the very last component of every system. All changes to the real infrastructure, whether settings, configurations or software updates, are first tried out on it.



We see how all the elements fit together into a single picture, identify the risks, and measure how long each stage of the update takes. The detailed work plan is built on that basis.



The twin is built using nested virtualization. VMware lets you run hypervisors inside virtual machines, and those nested hypervisors can in turn run virtual machines of their own.



Resources for the twin were allocated within our Warsaw cloud, and hypervisors, virtual machines and a network were deployed on them. We rebuilt the cloud from scratch, matching all OS versions, packages, software and the architecture that ties them into a single solution.



This is not "Ctrl + C / Ctrl + V": we did not simply copy the existing system. Only the main components of the system and the logic of their interaction were reproduced, down to the bandwidth of the communication channels, NGINX as a reverse proxy and the configs for traffic registration.



The twin "eats" about $650 of site resources per month. For example, we pay VMware as a service provider for the RAM the twin consumes, and more than 20 virtual machines are deployed on it.



These are significant costs, but safely running through the upgrade scenario step by step gives us 100% protection against failures and surprises when upgrading the real infrastructure. The possible losses from a system malfunction are orders of magnitude higher than the cost of maintaining the twin.



Our expectations



According to preliminary calculations, the updated system will have enough headroom for the platform to run at the required level without any changes, let alone a global restructuring, at least until the end of 2021.



For most of the major software versions of key systems covered by the current update, the end-of-support date falls at the end of 2023. For a significant number of systems, end-of-support dates have not yet been announced.



In other words, the overall safety margin should last 1-2 years, and as far as a global update comparable to the current one is concerned, we are laying the foundation today for an even longer period.



The life cycle of a cloud platform implies the need to always have up-to-date software versions of key system elements. 



More generally, the complexity and pain of such a restructuring always depend on how badly things were neglected once version compatibility stopped being tracked, and on how diverse the "zoo" of elements, technologies, protocols and software the cloud is built on has become.



What should we strive for here? Unification: the global update now under way will ultimately make our lives much simpler and improve the reliability of the cloud as a whole.



We will be able to move away completely from the legacy of the integrator who deployed this infrastructure. We will have no blind spots or potential weak links in the cloud's value chain, whether in availability, flexibility of configuration, reliability or other parameters that affect the SLA.



Once all cloud components run the same software versions at all sites, any subsequent upgrades, extensions and integrations will become a matter of competent technical management, a routine rather than a global administrative and technical project.



We plan to keep using the digital twin of the cloud in the future. It is a handy tool that makes the infrastructure upgrade process safer and faster.


