Network-as-a-Service for a large enterprise: a non-standard case



How to upgrade network equipment in a large enterprise without stopping production ? Oleg Fedorov, Linxdatacenter Project Management Manager, talks about a large-scale project in the "open heart surgery" mode



Over the past few years, we have seen an increased demand from customers for services related to the network component of the IT infrastructure. The need for the connectivity of IT systems, services, applications, monitoring and operational management of business in almost any area are forcing companies today to pay increased attention to networks.  



The range of requests - from ensuring network resiliency to creating and managing a client autonomous system with the purchase of a block of IP addresses, configuring routing protocols and traffic control according to the policies of the organizations.



There is also a growing demand for integrated solutions for the construction and maintenance of network infrastructure, primarily from customers, whose network infrastructure is created from scratch or is morally outdated, requiring serious modification. 



This trend coincided with the development and complication of Linxdatacenter's own network infrastructure. We expanded the geography of our presence in Europe by connecting to remote sites, which in turn required the improvement of the network infrastructure. 



The company has launched a new service for clients, Network-as-a-Service: we take care of all the network tasks of our clients, allowing them to focus on their core business.



In the summer of 2020, the first large project in this direction was completed, which I would like to talk about. 



At the start 



A large industrial complex turned to us for the modernization of the network part of the infrastructure at one of its enterprises. It was required to replace old equipment with new ones, including the network core.



The last modernization of the equipment at the enterprise took place about 10 years ago. The new management of the enterprise decided to improve connectivity, starting with infrastructure upgrades at the most basic, physical level. 



The project was divided into two parts: an upgrade of the server park and network equipment. We were responsible for the second part. 



The basic requirements for the work included minimizing the downtime of the production lines of the enterprise during the work (and in some areas the complete elimination of downtime). Any stop - direct financial loss of the client, which should not have happened under any circumstances. In connection with the operating mode of the facility 24x7x365, as well as taking into account the complete absence of periods of planned downtime in the practice of the enterprise, we were tasked, in fact, to perform open heart surgery. This became the main distinguishing feature of the project.



Go



The works were planned according to the principle of movement from the nodes of the network remote from the core to the closer ones, as well as from those less influencing the operation of production lines to those directly influencing this work. 



For example, if you take a network node in a sales department, then an interruption in communication as a result of work in this department will not affect production in any way. At the same time, such an incident will help us, as a contractor, to check the correctness of the chosen approach to work on such nodes and, having adjusted the actions, work on the next stages of the project. 



It is necessary not only to replace the nodes and wires in the network, but also to correctly configure all components for the correct operation of the solution as a whole. It was the configurations that were tested in this way: starting work at a distance from the kernel, we kind of gave ourselves the "right to make mistakes" without jeopardizing critical areas for the operation of the enterprise. 



We have identified areas that do not affect the production process, as well as critical areas - workshops, loading and unloading block, warehouses, etc. At key areas, the client agreed on the permissible downtime for each node of the network separately: from 1 to 15 minutes ... It was impossible to completely avoid the disconnection of individual network nodes, since the cable must be physically switched from the old equipment to the new one, and during the switching process it is also necessary to untangle the "beard" of wires that was formed during several years of operation without proper care (one of the consequences of outsourcing work on the installation of cable lines).



The work was divided into several stages.



Stage 1- Audit. Preparation and approval of the approach to work planning and assessment of the readiness of the teams: the client, the contractor performing the installation, and our team.



Stage 2 - Development of a format for carrying out the work, with in-depth detailed analysis and planning. We chose a checklist format with an exact indication of the order and sequence of actions, up to the sequence of switching patch cords by ports.



Stage 3 - Carrying out work in cabinets that do not affect production. Estimation and adjustment of downtime for subsequent stages of work.



Stage 4 - Carrying out work in cabinets that directly affect production. Estimation and adjustment of downtime for the final stage of work.



Stage 5- Carrying out work in the server room to switch the remaining equipment. Running on routing on a new kernel.



Stage 6 - Sequential switching of the system core from old network configurations to new ones for a smooth transition of the entire system complex (VLAN, routing, etc.). At this stage, we connected all users and transferred all services to new equipment, checked the correctness of the connection, made sure that none of the enterprise services stopped, ensured that in the event of any problems they would be directly connected to the kernel, which made it easier to eliminate possible troubleshooting and final setting. 



Beard hairstyle



The project was also challenging due to the difficult initial conditions. 



Firstly, it is a huge number of nodes and sections of the network, with an intricate topology and classification of wires according to their purpose. Such "beards" had to be taken out of the closets and painstakingly "combed", figuring out which wire leads from where and where. 



It looked something like this:





So:





or like this: 





Secondly, for each such task it was necessary to prepare a file describing the process. "We take wire X from port 1 of the old equipment, we plug it into port 18 of the new equipment." It sounds simple, but when you have 48 completely clogged ports in your initial data, and there is no idle option (we remember about 24x7x365), the only way out is to work in blocks. The more wires you can pull out of old equipment at a time, the faster you can comb them and insert them into new network hardware, avoiding network disruptions and downtime. 



Therefore, at the preparatory stage, we divided the network into blocks - each of them belonged to a specific VLAN. Each port (or a subset of them) on the old equipment is one of the VLANs in the new network topology. We have grouped them as follows: the first ports of the switch house the user networks, the middle - the production networks, and the last - the access points and uplinks. 



This approach made it possible to pull out and comb out of the old equipment not 1 wire, but 10-15 wires at a time. This accelerated the workflow several times.  



By the way, here's what the wires look like in closets after combing: 





or, for example, like this: 





After the completion of the 2nd stage, we took a break to analyze errors and project dynamics. For example, minor flaws immediately came out due to inaccuracies in the network diagrams provided to us (the wrong connector in the diagram is an incorrect purchased patch cord and the need to replace it). 



The pause was necessary, since even a small failure in the process was unacceptable when working from server right. If the goal was to ensure downtime on a network section of no more than 5 minutes, then it could not be exceeded. Any possible deviation from the schedule had to be agreed with the client. 



However, preliminary planning and dividing the project into blocks made it possible to meet the planned downtime in all areas, and in most cases to do without it altogether. 



Time challenge - project under COVID 



However, there were some additional complications. Of course, the coronavirus was one of the obstacles. 



The work was complicated by the fact that a pandemic began, and it was impossible for all specialists involved in the process to be present during the work at the client's site. Only the installation staff were admitted to the site, and control was carried out through the room in Zoom - there was a network engineer from Linxdatacenter, I, as a project manager, a network engineer on the client's side responsible for the production of work, and the team performing the installation work.



During the work, unaccounted for problems arose, and it was necessary to make adjustments on the fly. So it was possible to quickly prevent the influence of the human factor (errors in the scheme, errors in determining the status of the interface activity, etc.).



Although the remote work format seemed unusual at the beginning of the project, we quickly adapted to the new conditions and entered the final stage of work. 



We ran a temporary network configuration configuration to run two network cores in parallel, the old and the new one, to ensure a smooth transition. However, it turned out that one extra line from the configuration file of the new kernel was not removed, and the transition did not occur. This made us spend some time looking for the problem. 



It turned out that the main traffic was transmitted correctly, and the control traffic did not reach the node through the new core. Due to the clear division of the project into stages, it was possible to quickly identify the network section where the difficulty arose, identify the problem and fix it. 



As a result



Technical results of the project 



First of all, a new core of the new enterprise network was created, for which we built physical / logical rings. This is done in such a way that each switch in the network has a "second shoulder". In the old network, many switches were connected to the core via one route, one shoulder (uplink). If it was torn, the switch became completely inaccessible. And if several switches were connected through one uplink, then the accident put out of action an entire department or production line at the enterprise. 



In a new network, even a rather serious network incident, under no scenario, will be able to "kill" the entire network or its significant section. 



90% of all network equipment has been updated, media converters (signal propagation media converters) have been decommissioned, and the need for dedicated power lines to power equipment has been eliminated by connecting to PoE switches, where power is supplied via Ethernet wires. 



Also, all optical connections in the server room and in the cabinets in the field are marked - at all key communication centers. This made it possible to prepare a topological diagram of equipment and connections in the network, reflecting its actual state today. 



Network diagram



The most important result from a technical point of view: fairly large-scale infrastructure work was carried out quickly, without creating any interference in the operation of the enterprise and almost invisible to its personnel. 



Business results of the project



In my opinion, this project is interesting primarily not from the technical, but from the organizational point of view. The difficulty was primarily in planning and thinking through the steps to implement the project tasks. 



The success of the project allows us to say that our initiative to develop the network direction within the Linxdatacenter portfolio of services is the right choice of the vector for the company's development. A responsible approach to project management, a competent strategy, clear planning allowed us to perform the work at the proper level. 



Confirmation of the quality of work - a request from the client to continue the provision of network modernization services at its other sites in Russia.



All Articles