Linux Switchdev Mellanox-style

This is a transcript of a talk given at Yandex NextHop 2020; the video is at the end of the page.






Greetings. My name is Alexander Zubkov, and I want to tell you about Linux Switchdev: what it is and how we live with it at Qrator Labs.







We have been using Switchdev on Mellanox switches for about two to three years now. Mellanox Spectrum-based switches are classified as “white-box”, which means that you can install different operating systems on them. Usually the vendor provides an SDK for this, and operating systems use that SDK to interact with the switch. In the case of Mellanox switches, there is an operating system from Mellanox itself, and there is Cumulus. SAI (Switch Abstraction Interface) is also supported. It is an attempt to create a standard SDK for different switches, and it is in turn used by the SONiC operating system. And, of course, Mellanox switches support Switchdev.







Switchdev is an infrastructure in the Linux kernel that lets you map the kernel's usual network settings onto the data plane, the hardware of your switch. This is called offload. In the picture, pink is the switch driver and blue is the API and the user-space configuration utilities. Switchdev acts as an intermediary here: to user space it presents a model of the switch, and to the driver it provides the infrastructure for building this mapping.







We use a fairly standard set of functions on Mellanox switches: routing, ECMP, nothing unusual overall. All of it is supported with offload to the data plane. The only thing missing is policy-based routing: the Mellanox driver does not support it.







The Mellanox driver lives in the vanilla Linux kernel with Switchdev support, so no patches or additional binary drivers are needed. You can practically take the kernel from your favorite distro, or compile the vanilla kernel yourself, and use it. The driver itself updates the firmware in the switch; you only need to provide the corresponding file, which usually ships in the linux-firmware package or something similar.







To configure the switch itself we, of course, rely heavily on standard Linux utilities: the iproute2 suite, ethtool, an LLDP daemon for QoS, and sysctl for some options.







For VRF, Linux has network namespaces. But there is also the so-called VRF subsystem, which works differently: with it, all your interfaces stay in the same namespace. To control routing, there is a special rule in ip rule that determines which VRF a packet belongs to and directs it to the corresponding routing table. To configure a VRF in Linux, you create a special interface of type vrf and bind a routing table to it at creation time. Then, if you want to add an interface to your VRF, you use the ip link command to set this special device as the master of that interface. And since all these interfaces are in the same namespace, you can explicitly specify an interface from another VRF in a route and thus set up routes between VRFs.
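The steps just described can be sketched as a small command generator. This is only an illustration and not something shown in the talk: the VRF name, table number, port names and addresses are hypothetical, and the functions only build iproute2 command strings without running them.

```python
def vrf_commands(vrf, table, members):
    """Return iproute2 commands that create a VRF device bound to a
    routing table and then enslave the given interfaces to it."""
    cmds = [
        # The routing table is bound to the VRF device at creation time.
        f"ip link add dev {vrf} type vrf table {table}",
        f"ip link set dev {vrf} up",
    ]
    for port in members:
        # An interface joins the VRF by getting the VRF device as its master.
        cmds.append(f"ip link set dev {port} master {vrf}")
    return cmds


def cross_vrf_route(prefix, gateway, dev, vrf):
    """Return a route installed in `vrf` whose egress device may belong to
    another VRF; this works because all devices share one namespace."""
    return f"ip route add {prefix} via {gateway} dev {dev} vrf {vrf}"


if __name__ == "__main__":
    # Hypothetical names and addresses, for illustration only.
    for cmd in vrf_commands("vrf-ext", 10, ["swp1"]):
        print(cmd)
    print(cross_vrf_route("192.0.2.0/24", "10.0.0.2", "swp2", "vrf-ext"))
```

Printing the commands instead of executing them makes it possible to review the sequence before applying it to a production switch.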







For example, we have a task where policy-based routing would help: we receive traffic from the uplink and want to send all of it, unconditionally, to some filtering nodes. On Cisco or Arista we would use policy route maps or some service policy. In Linux you can do it with ip rule, but unfortunately none of that will be offloaded.







So we have to work around it. What we did is split the VRF into two parts: the outer part contains the interface with our uplink, and the inner part contains the interfaces to our filtering nodes.







And this is what the routing looks like. In the internal VRF we have a more or less standard set of routes: our internal routes and a default route through our uplink. In the external VRF we have only a default route, but it goes through our filtering nodes. This gives us a pseudo policy-based routing per interface: all traffic that arrives through the uplink interface is routed along a different path.
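A minimal sketch of the two default routes, under stated assumptions: the VRF names vrf-int and vrf-ext, the port names and all addresses are made up for illustration, and spreading the external default over several filtering nodes with ECMP is an assumption rather than something the talk spells out. The code only builds command strings.

```python
def default_route(vrf, nexthops):
    """Build one 'ip route add default' command for `vrf` with one
    nexthop per (gateway, device) pair; several pairs give ECMP."""
    hops = " ".join(f"nexthop via {gw} dev {dev}" for gw, dev in nexthops)
    return f"ip route add default vrf {vrf} {hops}"


def pseudo_pbr(uplink, filters):
    """Return the default routes for the split-VRF layout.

    uplink  -- (gateway, device) of the uplink, whose port sits in the outer VRF
    filters -- list of (gateway, device) for filtering nodes in the inner VRF
    """
    return [
        # Inner VRF: the usual default route out through the uplink.
        default_route("vrf-int", [uplink]),
        # Outer VRF: everything arriving from the uplink goes to the filters.
        default_route("vrf-ext", filters),
    ]


if __name__ == "__main__":
    for cmd in pseudo_pbr(("198.51.100.1", "swp1"),
                          [("10.0.0.2", "swp2"), ("10.0.0.6", "swp3")]):
        print(cmd)
```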







In general, when you configure a switch with Switchdev, you usually have to configure the ports first, then the bonds, then attach them to the bridge, then VLANs, VRFs, and finally addresses and routes. This order is dictated mainly by the structure of interfaces in Linux, and there are also some other restrictions that do not let you change settings arbitrarily. It is rather tedious work, which in our company was initially done by one large init script. But, of course, we sometimes have to make changes at runtime, in production.
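The ordering constraint can be illustrated with a short sketch that sorts commands by configuration stage. The stage names follow the order given above; the commands and device names are hypothetical examples, and this is not how the original init script was organised.

```python
# Stages in the order in which they have to be applied.
STAGES = ["port", "bond", "bridge", "vlan", "vrf", "address", "route"]


def ordered(commands):
    """Sort (stage, command) pairs by stage and return the commands.

    The sort is stable, so commands within one stage keep their order."""
    rank = {stage: i for i, stage in enumerate(STAGES)}
    return [cmd for _, cmd in sorted(commands, key=lambda pair: rank[pair[0]])]


if __name__ == "__main__":
    # Deliberately shuffled input with hypothetical device names.
    config = [
        ("route", "ip route add default via 192.0.2.254 vrf vrf-int"),
        ("vlan", "bridge vlan add dev bond0 vid 100"),
        ("port", "ip link set dev swp1 up"),
        ("bridge", "ip link add dev br0 type bridge vlan_filtering 1"),
        ("bond", "ip link add dev bond0 type bond mode 802.3ad"),
    ]
    for cmd in ordered(config):
        print(cmd)
```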







That sometimes hurts, because you have to rework this structure almost by hand: take some interfaces apart and put them back together, which is of course error-prone. When you work with Cisco, you change the settings and the shell takes care of everything, doing the low-level work behind the scenes.







Well, thankfully we have Perl. We wrote a script, mlxrtr, that takes a config like this and generates the sets of commands for configuring the network and everything else. It also supports changes: if you edit the config, it reads the current configuration from Linux and works out what needs to be done to bring it to the state you want.
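The idea of comparing the current state with the desired one can be sketched as follows. This is not mlxrtr's actual code or algorithm (mlxrtr is written in Perl and covers far more than addresses); it is a simplified Python illustration limited to interface addresses, with hypothetical interface names and addresses.

```python
def reconcile_addresses(current, desired):
    """Return the commands that turn `current` into `desired`.

    Both arguments map an interface name to a set of CIDR addresses.
    Addresses present in both states produce no command at all."""
    cmds = []
    for dev in sorted(current.keys() | desired.keys()):
        have = current.get(dev, set())
        want = desired.get(dev, set())
        for addr in sorted(have - want):
            cmds.append(f"ip address del {addr} dev {dev}")
        for addr in sorted(want - have):
            cmds.append(f"ip address add {addr} dev {dev}")
    return cmds


if __name__ == "__main__":
    current = {"vlan100": {"192.0.2.1/24"}, "vlan200": {"198.51.100.1/24"}}
    desired = {"vlan100": {"192.0.2.1/24", "203.0.113.1/24"}}
    for cmd in reconcile_addresses(current, desired):
        print(cmd)
```

Generating only the difference keeps untouched interfaces out of the change, which matters when the change is made at runtime.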







On the first run, this configuration generates a set of commands like this, and I have even left out the repetitive ones.











There are quite a few commands, but as long as they live in your init script, this is more or less maintainable.







For example, if you need to move a port to a different bond, you have to detach the port from the old bond, detach the new bond from the bridge, attach the port to the new bond, return the bond to the bridge, and reconfigure the VLANs on it. In general, it is rather tedious work and, of course, unpleasant to do by hand. The script does all of this by itself.
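The sequence for moving a port can be written down as a command generator. This is a sketch under assumptions: device names and VLAN IDs are hypothetical, the commands are the plain iproute2 and bridge forms rather than mlxrtr's real output, and any extra steps a particular setup needs (such as bringing links down or up) are left out.

```python
def move_port(port, new_bond, bridge, vids):
    """Return the commands that move `port` into `new_bond`, which is a
    member of `bridge` and carries the VLAN IDs listed in `vids`."""
    cmds = [
        f"ip link set dev {port} nomaster",            # leave the old bond
        f"ip link set dev {new_bond} nomaster",        # take the new bond out of the bridge
        f"ip link set dev {port} master {new_bond}",   # join the new bond
        f"ip link set dev {new_bond} master {bridge}", # return the bond to the bridge
    ]
    # Leaving the bridge drops the bond's VLAN membership, so restore it.
    cmds += [f"bridge vlan add dev {new_bond} vid {vid}" for vid in vids]
    return cmds


if __name__ == "__main__":
    for cmd in move_port("swp3", "bond1", "br0", [100, 200]):
        print(cmd)
```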







Next, ACLs. You can use iptables, but it will not be offloaded, so you can only use it to filter control plane traffic. If you want to filter in the data plane, then with Switchdev you need to use tc filter. Keep in mind that tc filter applies not only to routed traffic but also to switched traffic. Also, a tc filter can only be attached to physical ports, so if you work with VLANs you need more complex constructions. But there are interesting features too: for example, you can attach a shared block to several interfaces and they will share a common filter. There is also a goto action in tc rules, which is pretty cool and allows you to build nonlinear ACLs, unlike on Cisco or Arista.
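A shared block can be sketched like this. The block number, port names and the example prefix are hypothetical, the use of clsact and skip_sw is an assumption about a typical offloaded setup rather than something stated in the talk, and the functions only build tc command strings.

```python
def attach_shared_block(block, ports):
    """Return commands that bind one shared ingress block to several ports."""
    return [f"tc qdisc add dev {port} ingress_block {block} clsact"
            for port in ports]


def drop_destination(block, pref, prefix):
    """Return one flower rule, added to the shared block, that drops IPv4
    traffic to `prefix`; skip_sw asks for a hardware-only rule."""
    return (f"tc filter add block {block} protocol ip pref {pref} "
            f"flower skip_sw dst_ip {prefix} action drop")


if __name__ == "__main__":
    for cmd in attach_shared_block(1, ["swp1", "swp2"]):
        print(cmd)
    # The rule is added once and applies to every port bound to block 1.
    print(drop_destination(1, 10, "192.0.2.0/24"))
```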







For configuring ACLs we also have a utility, mlxacl. We mainly work with VLANs at layer 3, so the utility creates a separate chain for each VLAN, and the main chain simply matches the VLAN and jumps to the corresponding chain.
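The per-VLAN chain layout might look roughly like the following sketch. It is not mlxacl's real output: the block number, the choice of using the VLAN ID as the chain number, and the example rule are all assumptions made for illustration.

```python
def vlan_dispatch(block, vids):
    """Return chain-0 rules that match a VLAN ID and jump to that VLAN's
    chain. Using the VLAN ID as the chain number is an assumption."""
    return [
        f"tc filter add block {block} chain 0 pref {pref} protocol 802.1q "
        f"flower vlan_id {vid} action goto chain {vid}"
        for pref, vid in enumerate(vids, start=1)
    ]


def vlan_drop_rule(block, vid, pref, prefix):
    """Return a rule in the VLAN's own chain that drops IPv4 traffic to `prefix`."""
    return (f"tc filter add block {block} chain {vid} pref {pref} protocol 802.1q "
            f"flower vlan_ethtype ipv4 dst_ip {prefix} action drop")


if __name__ == "__main__":
    for cmd in vlan_dispatch(1, [100, 200]):
        print(cmd)
    print(vlan_drop_rule(1, 100, 1, "192.0.2.0/24"))
```

Because each VLAN has its own chain, changing the rules for one VLAN leaves the chains of the other VLANs untouched.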







Here, too, is an example of such a configuration and the commands it produces. There are fewer of them than for the switch configuration itself, because one rule maps to roughly one command, so it is not that complicated.







But suppose you have to make a change. In this case I deleted one rule, and the utility rewrites every chain that has changed and then renumbers the references in chain zero, the main chain, so that they point to the new chains. Clearly, done by hand, this particular change could have been made with a single command.







But for that we would first need to look at the current state, and this is what the tc filter output looks like. It is quite difficult to work with.







When you work with all this, people passing by look at you like this. That is why we wrote mlxacl first, since ACLs were the most painful part to work with, and then, one thing leading to another, we wrote a utility for the rest of the settings as well.







We have published the utilities I told you about on GitLab, so you can use them. They are licensed under MIT and therefore freely available.







Naturally, without any guarantee. It is a couple of Perl scripts (anticipating your questions: because I know Perl and it just works), relatively small and with almost no dependencies: they use a couple of modules from the standard Perl distribution and, of course, Linux utilities.







Finally, for those who have worked a little with a serial console, with COM ports, I want to share a few tips. For example, if someone thought this was a way to exit Vim, you almost guessed it.







For some BIOSes, this is the equivalent of Ctrl + Alt + Del as they see it over the serial port. That is, if your bootloader hangs, for example, and you need to reboot the switch somehow, you can use it.



Further, once the kernel is running, it naturally takes over keyboard handling, so you had better make sure the kernel accepts SysRq commands; otherwise it will be difficult to restart the switch. With a keyboard and a regular terminal SysRq uses the PrintScreen key, while on a serial console, over a COM port, you need to send a special break signal (in minicom it is Ctrl + A, F; in screen it is Ctrl + A, Ctrl + B) and then press the SysRq command key.



And to get into the BIOS at boot (the BIOS of the switch, of course, because just like a regular computer it has a BIOS through which it usually boots) you can press Ctrl + B.



That's all I wanted to tell you briefly. If you have any questions, I will be happy to answer.






