Storage Performance Engineering

Hello everyone! Every day our large and friendly team of engineers solves complex problems and helps create high-tech products: data processing and storage systems. We decided to give you a closer look at their day-to-day work, so today we are starting a series of interviews with colleagues to tell you, in the first person, about all the nuances of what they do.






Performance is one of the key characteristics of good software: no other quality of a storage system will be appreciated if it is slow or unstable. Today we are talking with Sergey Kachkin (kachini), Head of the Technical Expertise Department within YADRO's Applied Research and Technical Expertise Division.



His profession goes by several names: performance analyst, performance engineer, performance tester, and all of them are quite rare in Russia. Performance engineering, meanwhile, helps create efficient computer systems that run quickly and reliably. His task is to study why a system is not behaving the way we want, to understand why it is slow or missing its target parameters, to pinpoint the problem areas, and to help eliminate them.



Sergey Kachkin talked about finding bottlenecks in the software stack, optimizing storage performance, and what his team is working on.

 

Sergey, how did you come to YADRO? Did you already have experience with OpenPOWER?

Before that, I worked for another vendor, supporting a proprietary UNIX OS on IA-64 (not to be confused with x86) processors, specifically its kernel performance. The EPIC architecture is nothing like RISC, it is completely different. So YADRO is my first experience with OpenPOWER, and the adjustment took some time. But the idea behind OpenPOWER, despite a certain minimalism, is the same, so everything can be mastered.



What do performance engineers do? What methods do you use in your work? Is it difficult to recruit new employees?



Our team's main specialization is performance engineering. It is a separate discipline aimed at ensuring that the solution being developed satisfies its non-functional requirements, performance in particular. It encompasses a set of practices, knowledge, methods, and techniques that can be applied at different stages of software development: preparation, programming, testing, and system operation.



In Russia this discipline is not very widespread, at least that is the impression our hiring searches leave. Worldwide, however, it is an established field. This IT specialization rarely involves writing code directly: we program little and, frankly, not at the level of professional programmers. What it does require is a specific skill set for localizing "hot spots" in the software that affect non-functional requirements. On the one hand, this helps create a product that meets the requirements; on the other, it avoids the cost of later optimization and rework.



How do you control quality and identify bottlenecks in the software stack?



The methods can be divided into two types. The first is the system-centric approach. It is resource-oriented: we analyze the load on individual components of the system and, based on the results, form a hypothesis about where the bottleneck is.



The second is the application-centric approach, where the object of study is the application as a whole or individual Linux processes. We look at what the application is doing and what work it performs: is that work useful, or is it doing something useless and simply wasting time? If the application is waiting, we look at what it is waiting for. Usually that is hardware or software resources, or synchronization mechanisms.



In real life you have to switch between these methods. On the one hand, we look at the resources: are there any obvious problems or errors? We draw conclusions. Then we look at the application: how it is doing. In our case the application is the storage system code, or whatever else is the object of optimization.
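
To make the two views concrete, here is a minimal Python sketch of what each one looks at on a Linux host. It is not the team's actual tooling, just an illustration: the system-centric side samples how busy one block device is via /proc/diskstats, and the application-centric side asks what state a process is in and what it is waiting on. The device name and the PID are placeholders.

```python
# A minimal illustration, not production tooling. Assumes a Linux host
# with the standard /proc filesystem; "sda" and the PID are placeholders.
import time

def system_view(disk="sda"):
    """System-centric: how busy is one resource (here, a block device)?"""
    def busy_ms():
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == disk:
                    return int(fields[12])   # column 13: ms spent doing I/O
        raise ValueError(f"device {disk} not found")

    before = busy_ms()
    time.sleep(1.0)
    util = (busy_ms() - before) / 1000.0      # busy time per wall-clock second
    print(f"{disk} utilization: {util:.0%}")  # near 100% hints at saturation

def application_view(pid):
    """Application-centric: is the process working or waiting, and on what?"""
    with open(f"/proc/{pid}/stat") as f:
        state = f.read().rsplit(")", 1)[1].split()[0]   # R, S, D, ...
    with open(f"/proc/{pid}/wchan") as f:
        wchan = f.read().strip() or "-"       # kernel function it sleeps in
    print(f"pid {pid}: state={state} (D = waiting on I/O), wchan={wchan}")
```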



How can you tell that a storage system is working "at its limit", that its performance headroom is exhausted? What parameters indicate this? What are the main metrics used to measure storage performance?



Several metrics are available to the ordinary user. The main one is response time, and its absolute value matters. Besides response time, throughput is also important. If, as the load grows, response time starts to climb while IOPS and the amount of data transferred do not increase, it means some storage resource is close to saturation. As you know, a storage system is only as fast as its slowest resource.
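
As a rough illustration of that saturation sign, the sketch below scans a series of (queue depth, IOPS, latency) measurements and reports the first point where latency jumps while IOPS barely move. The numbers and thresholds are invented for the example, not taken from any real system.

```python
# Toy illustration of the saturation sign described above: latency climbs
# while IOPS stop growing. The sample numbers are invented for the example.

samples = [
    # (queue depth, measured IOPS, avg latency in ms)
    (1,   12_000, 0.08),
    (4,   45_000, 0.09),
    (16, 150_000, 0.11),
    (32, 180_000, 0.18),
    (64, 185_000, 0.35),
]

def find_saturation(points, iops_gain=0.05, lat_growth=0.5):
    """Return the first load level where IOPS grow by less than iops_gain
    while latency grows by more than lat_growth versus the previous step."""
    for (_, iops0, lat0), (q1, iops1, lat1) in zip(points, points[1:]):
        if (iops1 - iops0) / iops0 < iops_gain and (lat1 - lat0) / lat0 > lat_growth:
            return q1
    return None

print("saturation sets in around queue depth:", find_saturation(samples))  # -> 64
```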



At the same time, different applications are sensitive either to response time or to throughput. For a database, for example, the pattern is usually random access in small blocks with many reads, so what matters is IOPS and minimal response time. For other workloads, such as streaming backups or recording from video cameras or Internet of Things devices, throughput matters more: the ability to write a large data stream.



Is the storage system optimized for a specific task, or is it created as a universal solution?



For a long time now, general-purpose storage systems have been universal. They are not tuned for any particular load and try to "please" the most common applications. After all, it is roughly known what the load profile of a database, a backup system, video surveillance, and so on looks like. The storage system must respond adequately to such loads without any additional configuration.



So general-purpose storage systems are designed from the start to suit the most common tasks. For this, synthetic tests with a set of "critical" profiles are used to simulate real situations. Most of the time it works, but reality is always far more complicated.



Synthetic tests model real loads only very approximately. This is actually a research-heavy area, because beyond IOPS, throughput, block size, and the read/write ratio, a workload has many more characteristics: the locality of the data footprint on disk, the presence of "hot" regions, the distribution of requests over time, and the uniformity of their arrival. So there is always a chance that a particular workload will not fit any of the profiles, whether because of the software's peculiarities or the specifics of the business problem itself. In that case you need to tune the system for the specific task.
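
As one example of a characteristic that plain IOPS and throughput numbers miss, the sketch below estimates how concentrated a workload's accesses are. The trace format and the 1 MiB extent size are assumptions made up for the illustration.

```python
# A workload property that IOPS/throughput alone do not capture: how
# concentrated the accesses are ("hot spots"). The trace format and the
# 1 MiB extent size are assumptions for this illustration only.
from collections import Counter

def hot_spot_share(trace, extent=1 << 20, top_fraction=0.1):
    """Fraction of all requests that land in the hottest `top_fraction`
    of the extents touched by the trace."""
    hits = Counter(offset // extent for offset, _size in trace)
    hottest = sorted(hits.values(), reverse=True)
    top_n = max(1, int(len(hottest) * top_fraction))
    return sum(hottest[:top_n]) / sum(hottest)

# Example: 900 requests hammer one extent, 100 are spread across others.
trace = [(0, 4096)] * 900 + [(i << 20, 4096) for i in range(1, 101)]
print(f"hottest 10% of extents get {hot_spot_share(trace):.0%} of requests")  # ~91%
```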



You examine the application and how it works, and it may turn out that either the application or the storage settings need to change. Sometimes it is much easier to solve the problem on the application side with some tuning than to change the storage system.



Does the system configure itself for the task automatically? Is artificial intelligence needed for this? Can the administrator or user choose the load profile themselves?



Storage systems have been doing this automatically for a long time; the administrator is not burdened with that task. Usually this is achieved without artificial intelligence, with traditional algorithms. AI, however, has great potential: if it can predict which blocks of data the application will request and when, you can prepare for that in advance.



Earlier optimization algorithms were fairly simple, like read-ahead: when data was being read sequentially, the system loaded it into the cache in advance or, conversely, freed cache memory for other data. Now the possibilities are expanding: the system will be able to prepare for a peak of requests or for a complex "hot data spot".
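
A toy version of that classic read-ahead idea might look like the sketch below: detect a sequential pattern and prefetch the next few blocks into the cache. The block-device interface, window size, and cache structure are illustrative, not how any particular storage system implements it.

```python
# Toy read-ahead cache: on a sequential access pattern, prefetch the next
# few blocks before they are requested. The device interface, window size
# and cache structure are illustrative only.

class ReadAheadCache:
    def __init__(self, device, window=4):
        self.device = device      # any object with read_block(lba) -> bytes
        self.window = window      # how many blocks to prefetch ahead
        self.cache = {}           # lba -> data already fetched
        self.last_lba = None

    def read(self, lba):
        data = self.cache.pop(lba, None)
        if data is None:                      # cache miss: go to the device
            data = self.device.read_block(lba)
        if self.last_lba is not None and lba == self.last_lba + 1:
            # Sequential pattern detected: warm the cache with the next window.
            for ahead in range(lba + 1, lba + 1 + self.window):
                if ahead not in self.cache:
                    self.cache[ahead] = self.device.read_block(ahead)
        self.last_lba = lba
        return data
```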



What should be the scale of storage optimization? Does it also cover server software / hardware, infrastructure (SAN)? Does it require tight integration of the software and hardware stacks?



From the point of view of performance engineering, the system is considered as a whole: the application, the host (server), the storage infrastructure (SAN), and the storage system. It is important to understand how the application works, because it is the application that generates the requests to the storage system. All of this, of course, is taken into account and used.



It is believed that the optimal way to combine drives of different types in a storage system is tiered data storage. Can tiering be considered a means of increasing storage performance?



Generally speaking, tiering is similar to caching; they share common elements. The difference is that with caching the data is duplicated, that is, it sits both on the SSD (in the cache) and on the disk, while with tiering it is stored in only one place. So if caching is a way to optimize performance, then tiering can also be considered an optimization method.
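
The distinction can be shown with a toy model: a cache keeps a copy of a hot block on the fast medium, while a tier moves the block so it lives in exactly one place. The dictionaries standing in for the HDD and SSD are, of course, just an illustration.

```python
# Toy model of caching vs tiering: a cache duplicates the block on the
# fast medium, a tier relocates it. Plain dicts stand in for the media.

def cache_promote(hdd, ssd_cache, key):
    """Caching: copy the block to the fast medium; the HDD keeps it too."""
    ssd_cache[key] = hdd[key]

def tier_promote(hdd, ssd_tier, key):
    """Tiering: move the block; afterwards it exists only on the SSD."""
    ssd_tier[key] = hdd.pop(key)

hdd, ssd = {"block42": b"payload"}, {}
cache_promote(hdd, ssd, "block42")
print("after caching:", "block42" in hdd, "block42" in ssd)    # True True

hdd, ssd = {"block42": b"payload"}, {}
tier_promote(hdd, ssd, "block42")
print("after tiering:", "block42" in hdd, "block42" in ssd)    # False True
```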



Where do you see the advantages / disadvantages of software-defined storage (SDS) in terms of performance analysis and system optimization? Maybe these are simpler, more flexible solutions?



In fact, quite the opposite. SDS is a distributed system made up of many servers interacting with each other. If specialized operating systems or particular file systems are involved, that adds complexity too. From an engineering point of view it is harder, but in some ways more interesting. On the other hand, SDS usually does not face strict performance requirements, while classic storage systems do. What is forgiven in a software-defined system will not be forgiven in traditional storage.



One of the company's goals is to develop optimized products for artificial intelligence, IoT, and fifth-generation networks. How difficult do you think this is? What will these products look like?



At the moment, raw data for AI is usually kept in file storage, while training and model building rely on SDS, so these are almost always distributed solutions. In my opinion, many companies currently treat AI as a kind of experiment: they look at it and try to understand how it can be useful, so the hardware requirements are not very strict. If it works, fine; if it doesn't, you can wait a day or two. As AI becomes more critical to companies, the requirements for disk subsystems will grow accordingly, and we will see new, mission-critical-class storage solutions for AI and the Internet of Things.



What role does YADRO's partnership with global technology companies play in software optimization?



From an engineering point of view, it certainly helps. Such cooperation makes it easier for engineers to communicate with each other and to access information and ready-made solutions, so the wheel does not have to be reinvented every time.



How do you see the role of virtualization in storage? Does it help remove software bottlenecks, or vice versa? And how are system performance and reliability related? Can reliability be maintained while increasing productivity?

Virtualization adds complexity, of course, but it can be useful for isolating one piece of storage functionality from another. In general, though, it means extra overhead and complication, so it should be viewed critically and with caution.



When it comes to increasing performance, it is indeed easy to lose reliability along the way; there is a kind of dualism here. With servers, for example, a high-performance (HPC) server usually puts reliability second, whereas storage systems must above all provide high availability, along with functionality and performance. As redundancy is added to raise reliability, the system becomes more complex and its elements have to be synchronized, so performance inevitably suffers. The task of development is to minimize this effect.



New memory classes such as Storage Class Memory and Persistent Memory are appearing, and flash drives keep improving. How does this affect system architecture? Is the software keeping up with these changes?



Well, it tries, at least. In general, the arrival of fast memory has significantly changed how performance engineers in the industry work. Before SSDs, the vast majority of IT performance problems were related to storage I/O, because processors were fast while mechanical disks (HDDs) were many orders of magnitude slower. So we had to smooth out the latency of slow disks by means of algorithms.



With the advent of fast memory, the algorithms must change too. Previously, even a heavyweight algorithm paid off because the disk was so much slower: if it managed to hide the latency of the mechanics, that was a win. With SSDs, software has to work differently. It must add the minimum possible latency to get the maximum speed out of the SSD. In other words, the need for complex algorithms that hide disk latency has decreased, and an I/O-intensive database that is particularly sensitive to response time can simply be migrated to an SSD.
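
A back-of-the-envelope comparison shows why: with rough, assumed latencies, a fixed software overhead that is invisible next to a mechanical seek roughly doubles the response time of an NVMe read.

```python
# Back-of-the-envelope illustration of why software overhead that was
# invisible next to an HDD dominates next to an SSD. The latency figures
# are rough, assumed orders of magnitude, not measurements of any product.

hdd_seek_ms = 5.0    # mechanical disk: ~milliseconds per random read
nvme_read_ms = 0.1   # NVMe SSD: ~100 microseconds
software_ms = 0.1    # assumed fixed cost of a heavyweight I/O path

for name, media in (("HDD", hdd_seek_ms), ("NVMe SSD", nvme_read_ms)):
    total = media + software_ms
    print(f"{name}: software adds {software_ms / media:.0%} on top of the media "
          f"({media} ms -> {total} ms total)")
# HDD: software adds 2% ...   NVMe SSD: software adds 100% ...
```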



Will this change storage architecture? Yes and no, because spinning disks have not gone anywhere. On the one hand, the code must be able to work with SSDs, that is, be very fast. On the other hand, mechanical disks are still used for loads they handle well, such as streaming. Meanwhile, disk capacity has grown many times over, but the speed has stayed roughly what it was 10 years ago.


