Building a common architecture for high performance computing, artificial intelligence and data analytics

Today, high performance computing (HPC), artificial intelligence (AI), and data analytics (DA) overlap more and more, because solving complex problems requires a combination of different techniques. Bringing AI, HPC, and DA together in traditional workflows can accelerate scientific discovery and innovation.



Data scientists and researchers are developing new compute-intensive workflows that run at massive scale on HPC systems, while AI and data analytics workloads benefit from HPC infrastructure that scales to improve performance. In this article we look at the trends in this market and at approaches to building an architecture for DA, AI, and HPC.







The trend towards convergence of modern workloads calls for a more unified architecture. Traditional HPC workloads (such as simulation) require a lot of computing power, fast network connections, and high-performance file systems. For example, building a reservoir model of a mineral deposit can take anywhere from several hours to several days.



Artificial intelligence and data analytics workloads are also resource intensive: they need data collection tools and dedicated workspaces where operators can process data. Both are interactive, iterative processes.



The differences between HPC, AI, and DA workloads might give the impression that they require three separate infrastructures, but this is not the case. A unified architecture suits data analysts as well as researchers working with artificial intelligence, without retraining or adapting to a new operating model.



However, integrating all three workloads on a single architecture does pose challenges to consider:



  • The skills of HPC, AI, and DA users vary.
  • Resource managers and job schedulers are not interchangeable.
  • Not all software and frameworks are integrated into a single platform.
  • The ecosystems require different tools and functions.
  • The workloads and their performance requirements differ.


The foundation of Dell Technologies turnkey solutions



Dell Technologies' out-of-the-box AI and data analytics solutions provide a single environment for all three workloads. They are built around four basic principles:



  1. Data availability.
  2. Simple job scheduling and resource management.
  3. Optimizing workloads.
  4. Integrated orchestration and containerization.


Data availability



Users need fast access to their data regardless of workload, and data movement between disparate storage environments should be kept to a minimum. Datasets for HPC, AI, and DA should be consolidated into a single environment to improve operational efficiency, especially when a workflow combines multiple techniques.



For example, advanced driver assistance systems (ADAS) use extreme weather models to prevent accidents when driving in bad weather. The newly collected data is then used to train the deep neural network: the output of one stage becomes the input for training the model. The results are loaded into Spark, which joins them with the customer's current dataset and selects the best data for the next round of model training. For best performance, the data produced by the workflow should sit as close as possible to the data already available.
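To make the Spark step concrete, here is a minimal sketch, assuming PySpark; the paths, column names, and selection criteria are hypothetical placeholders, not part of the reference design.

```python
# Minimal sketch: join newly collected driving data with the existing dataset
# and select samples for the next training round. Paths and columns are
# hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adas-data-selection").getOrCreate()

new_runs = spark.read.parquet("/datalake/adas/new_weather_runs/")   # freshly collected data
existing = spark.read.parquet("/datalake/adas/training_dataset/")   # customer's current dataset

# Keep only scenarios the model has not seen yet, with high label quality
# and genuinely bad weather (low visibility).
selected = (
    new_runs.join(existing, on="scenario_id", how="left_anti")
            .where((F.col("label_quality") > 0.9) & (F.col("visibility_m") < 200))
)

selected.write.mode("overwrite").parquet("/datalake/adas/selected_for_training/")
```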







Job scheduling and resource management



HPC users rely on traditional job schedulers such as SLURM. For batch scheduling, SLURM allocates hardware resources for a requested time window and provides a framework to start, execute, and monitor jobs. SLURM also manages the queue of submitted jobs to avoid resource contention in the cluster.
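As an illustration, here is a minimal sketch of submitting a batch job to SLURM from Python; it assumes the sbatch CLI is available on a login node, and the partition name, resource limits, and application command are hypothetical.

```python
# Minimal sketch: compose a SLURM batch script and submit it with sbatch.
import subprocess
import tempfile

job_script = """#!/bin/bash
#SBATCH --job-name=reservoir-sim
#SBATCH --partition=hpc            # hypothetical partition name
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=12:00:00            # wall-clock limit for the allocation

srun ./reservoir_model --input deposit.dat   # hypothetical application
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# sbatch prints "Submitted batch job <id>" and queues the job
result = subprocess.run(["sbatch", script_path], capture_output=True, text=True, check=True)
print(result.stdout.strip())
```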



Data analytics relies on its own schedulers, such as Spark Standalone and Mesos. The pre-built architecture for high performance computing and artificial intelligence uses Kubernetes to orchestrate Spark and manage resources for the jobs being run. Since no single job scheduler covers both environments, the architecture must support both, and Dell Technologies has developed an architecture that meets both requirements.
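For the Spark side, a minimal sketch of launching a job on Kubernetes with spark-submit might look like this; the API server address, container image, and jar path are hypothetical, while the flags are standard Spark-on-Kubernetes options.

```python
# Minimal sketch: submit a Spark job to a Kubernetes cluster.
import subprocess

subprocess.run([
    "spark-submit",
    "--master", "k8s://https://k8s-apiserver.example.local:6443",   # assumed API server address
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=4",                         # executors run as pods
    "--conf", "spark.kubernetes.container.image=registry.example.local/spark:3.4",  # assumed image
    "local:///opt/spark/examples/jars/spark-examples.jar",          # path inside the image
], check=True)
```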



Dell EMC's turnkey architecture for HPC, data analytics, and artificial intelligence creates a single pool of resources. Resources can be dynamically assigned either to HPC jobs managed through the HPC resource manager or to containerized AI and data analytics workloads managed by the Kubernetes container system.



Optimizing workloads



The architecture must be able to scale for one type of workload without compromising another. Programming languages, scaling needs, and management of the software stack and file systems are all important for understanding workload requirements. The table below shows examples of technologies used when building a scalable architecture:









The final design component is the integration of Kubernetes and Docker into the architecture. Kubernetes is an open-source container orchestration system used to automate deployment, scaling, and management. It helps organize a cluster of servers and schedule containers based on the available resources and the resource needs of each container. Containers are organized into groups (pods), the basic unit of work in Kubernetes, which can be scaled to the desired size.



Kubernetes also manages service discovery, load balancing, tracking of resource allocation and utilization, and health checks of individual resources. This allows applications to self-heal by automatically restarting or replicating containers.
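As a rough sketch of how this looks in practice (assuming access to a Kubernetes cluster, a kubeconfig, and the official Python client), the following declares a resource-bounded, self-healing Deployment; the image name, port, and resource figures are hypothetical.

```python
# Minimal sketch: a Deployment with resource requests/limits and a liveness
# probe, so Kubernetes schedules pods by resource needs and restarts them
# automatically when the probe fails.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig

container = client.V1Container(
    name="analytics-worker",
    image="registry.example.local/analytics-worker:1.0",        # assumed image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "8Gi"},                  # used for scheduling decisions
        limits={"cpu": "4", "memory": "16Gi"},
    ),
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        period_seconds=10,
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="analytics-worker"),
    spec=client.V1DeploymentSpec(
        replicas=3,                                              # Kubernetes keeps three pods running
        selector=client.V1LabelSelector(match_labels={"app": "analytics-worker"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "analytics-worker"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```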



Docker is a software platform for quickly building, testing, and deploying software. It packages applications into standard units called containers that include everything needed to run them: libraries, system tools, code, and the runtime environment. With Docker, you can quickly deploy and scale applications in any environment and be confident that your code will run.
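A minimal sketch with the Docker SDK for Python (assuming a local Docker daemon, a Dockerfile in the current directory, and a hypothetical image tag and command):

```python
# Minimal sketch: build an image and run it as a container.
import docker

client = docker.from_env()

# Build an image from the Dockerfile in the current directory
image, build_logs = client.images.build(path=".", tag="analytics-worker:dev")

# Run the container, capture its output, and remove it when it exits
output = client.containers.run("analytics-worker:dev", command="python train.py", remove=True)
print(output.decode())
```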



Hardware architecture blocks



Choosing the right server



The Dell EMC PowerEdge DSS 8440 is a two-socket 4U server optimized for HPC. One DSS 8440 can accommodate 4, 8, or 10 NVIDIA V100 accelerators for image recognition, or NVIDIA T4 accelerators for natural language processing (NLP). Ten NVMe drives provide fast access to training data. The server has both the performance and the flexibility to suit machine learning as well as other resource-intensive workloads, such as modeling and predictive analytics in engineering and scientific environments.







The Dell EMC PowerEdge C4140 meets the need for scalable server solutions for training neural networks. Deep learning is a computationally intensive process that demands fast GPUs, especially during the training phase. Each C4140 server supports up to four NVIDIA Tesla V100 (Volta) GPUs connected via the NVIDIA NVLink fabric and delivers up to 500 teraflops of deep learning performance; eight or more C4140s can be clustered for larger models.







The Dell EMC PowerEdge R740xd is a classic two-socket server suitable for most machine learning projects. This general-purpose 2U server can also grow into deep learning tasks, since it supports GPU accelerators and a large number of drives.







Choosing the right network



Dell EMC PowerSwitch S5232F-ON: a high-performance Ethernet switch. The S5232F-ON has 32 QSFP28 ports, each supporting 100 GbE or 10/25/40/50 GbE using breakout cables. A switching capacity of 6.4 Tbps provides high performance with low latency.



The Mellanox SB7800 InfiniBand switch is a good fit for many concurrent workloads: a high-performance, non-blocking 7.2 Tb/s switching capacity with 90 ns latency between any two ports.



Services and storage systems



Choosing the right storage service



The choice of hardware components depends on the problem being solved and the software used. Broadly speaking, data storage subsystems can be divided into three types:



  1. The storage service is built into the software and is an integral part of it. Examples are Apache Hadoop with the HDFS file system and the NoSQL database Apache Cassandra.
  2. The storage service is provided either by specialized solutions (for example, Dell EMC PowerScale) or by enterprise storage systems.
  3. Access to cloud resources, both private (Dell EMC ECS, Cloudian, Ceph) and public (Amazon, Google, Microsoft Azure). Data is usually accessed via REST protocols such as Amazon S3 or OpenStack Swift (see the sketch below). This is one of the most actively developing segments of the big data storage market.
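As a sketch of the third option (assuming the boto3 library and valid credentials; the endpoint, bucket, and object keys are hypothetical, and the same calls work against AWS S3 or an S3-compatible private object store such as ECS):

```python
# Minimal sketch: read objects from an S3-compatible store over the REST API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.local:9021",  # assumed private endpoint; omit for AWS
)

# Download one object from the data lake to local scratch space
s3.download_file(Bucket="training-data",
                 Key="adas/weather/batch-001.parquet",
                 Filename="/tmp/batch-001.parquet")

# List the rest of the dataset
response = s3.list_objects_v2(Bucket="training-data", Prefix="adas/weather/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```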


Combined approaches are also possible, in which built-in storage services or specialized systems serve as the operational storage tier while cloud systems act as long-term archival storage. The choice of storage service depends on the task being solved and on regulatory and operational requirements: disaster protection, integration with authorization and audit providers, usability.



On the one hand, built-in storage services, when the software provides them, are quick to deploy and tightly integrated with the other application services. On the other hand, they do not always meet every requirement: for example, they may lack full-fledged replication or integration with backup systems. Moreover, they create yet another dedicated data island tied to a single distribution or set of applications.







Functional requirements



The following requirements can be imposed on the storage service:



  • Linear scalability in both capacity and performance.
  • The ability to work effectively in a multi-threaded environment.
  • Tolerance to massive failures of system components.
  • Easy to upgrade and expand the system.
  • Ability to create online and archive storage tiers.
  • Advanced functionality for working with data (audit, DR tools, protection against unauthorized changes, deduplication, metadata search, etc.).


Storage performance is critical for high performance computing, machine learning, and artificial intelligence projects. Therefore, Dell Technologies offers a wide range of all-flash and hybrid storage systems to meet the most demanding customer requirements.



Dell EMC's storage portfolio includes the high-performance PowerScale (HDFS, NFS/SMB) and ECS (S3, OpenStack Swift, HDFS) storage systems, as well as NFS and Lustre distributed storage systems.



An example of a specialized system



Dell EMC PowerScale is an example of a specialized system for working effectively on big data projects; it can serve as the foundation of an enterprise data lake. The storage system contains no controllers or disk shelves: it is a set of identical nodes connected by a dedicated, redundant network. Each node contains disks, processors, memory, and network interfaces for client access. The entire disk capacity of the cluster forms a single storage pool and a single file system that can be accessed through any node.



Dell EMC PowerScale is a storage system with concurrent access over multiple file protocols. All nodes form a single resource pool and a single file system; all nodes are equal, and any node can process any request without additional overhead. A cluster scales up to 252 nodes, and node pools with different performance can coexist within one cluster: high-performance nodes with SSD/NVMe drives and 40 or 25 GbE networking for operational processing, and nodes with capacious 8-12 TB SATA disks for archive data. In addition, the least-used data can be moved to the cloud, either private or public.







Projects and applications



Dell EMC PowerScale has been used in a number of interesting big data projects, for example a suspicious activity detection system for Mastercard and Zenuity's advanced driver assistance (ADAS) work. One important point is the ability to split the storage service into a separate tier that can be scaled independently.



This means that multiple analytics platforms can be connected to a single storage platform with a single dataset: for example, a main analytics cluster running a particular Hadoop distribution directly on servers, plus a virtualized development/test environment. At the same time, only a portion of the cluster, rather than all of it, can be allocated to analytics tasks.
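As a minimal sketch (assuming PySpark and shared storage that exposes the HDFS protocol; the hostname, path, and column name are hypothetical), two independent analytics environments could read the same dataset like this:

```python
# Minimal sketch: both the bare-metal Hadoop cluster and a virtualized
# dev/test environment point at the same path on the shared storage tier.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-dataset-demo").getOrCreate()

df = spark.read.parquet("hdfs://powerscale.example.local:8020/datalake/adas/weather/")
df.groupBy("scenario").count().show()
```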



The second important point is that PowerScale provides file-system access, so, unlike traditional solutions, there is no hard limit on the amount of data that can be analyzed. The clustered architecture delivers strong performance for machine learning tasks even with large SATA drives. A good illustration is ML/DL problems, where the accuracy of the resulting model can depend on the volume and quality of the data.



Traditional systems



The Dell EMC PowerVault ME4084 (DAS) can be used as a basic storage system. It scales to 3 petabytes and delivers up to 5,500 MB/s of throughput and 320,000 IOPS.



Typical diagram of a turnkey solution for HPC, AI and data analysis







Typical AI use cases by industry







Summary



Dell Technologies turnkey solutions for HPC, AI, and data analytics provide a unified architecture that supports multiple workloads. The architecture is based on four key components: data availability, easy job scheduling and resource management, workload optimization, and integrated orchestration and containerization. It supports multiple server, networking, and storage options to best meet the needs of HPC, AI, and DA.



These solutions can be used for very different problems, and we are always ready to help customers select, deploy, configure, and maintain the equipment.



The author of this article is Alexander Koryakovsky, a consulting engineer in the compute and networking solutions department at Dell Technologies in Russia.


