Distributed DBMS for Enterprise

The CAP theorem is a cornerstone of distributed systems theory. The controversy around it, of course, never subsides: its definitions are not canonical, and there is no rigorous proof... Nevertheless, standing firmly on the positions of everyday common sense™, we intuitively understand that the theorem is true.







The only thing that is not obvious is the meaning of the letter "P". When the cluster is partitioned, it decides whether to stop responding until a quorum is reached or to serve the data it has. Depending on this choice, the system is classified as either CP or AP. Cassandra, for example, can behave either way, depending not even on the cluster settings but on the parameters of each individual request. But if a system is not "P" and its cluster splits, what then?
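
To make the Cassandra remark concrete, here is a minimal sketch using the DataStax cassandra-driver for Python; the keyspace, table, and addresses are invented for illustration. The consistency level is chosen per statement: QUORUM trades availability for consistency during a partition, while ONE does the opposite.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    session = cluster.connect("shop")   # hypothetical keyspace

    # "CP-like" read: require a quorum of replicas, fail if a partition prevents it.
    strict = SimpleStatement(
        "SELECT balance FROM accounts WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )

    # "AP-like" read: answer from any single replica, possibly with stale data.
    relaxed = SimpleStatement(
        "SELECT balance FROM accounts WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )

    row = session.execute(strict, [42]).one()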



The answer to this question is somewhat unexpected: a CA cluster cannot be split.

What kind of cluster is it that cannot be split?



An indispensable attribute of such a cluster is a shared data storage system. In the vast majority of cases this means a SAN connection, which limits the use of CA solutions to large enterprises able to host a SAN infrastructure. For multiple servers to work with the same data, a clustered file system is also required. Such file systems are available in the HPE (CFS), Veritas (VxCFS), and IBM (GPFS) portfolios.



Oracle RAC



The Real Application Clusters (RAC) option first appeared in the 2001 release of Oracle 9i. In such a cluster, multiple server instances work with the same database.

Oracle can work both with a clustered file system and with its own solution, ASM (Automatic Storage Management).



Each instance keeps its own redo log. A transaction is executed and committed within a single instance. If an instance fails, one of the surviving cluster nodes (instances) reads its log and recovers the lost data, thereby ensuring availability.



Each instance maintains its own cache, and the same pages (blocks) can reside in the caches of several instances at once. Moreover, if one instance needs a page that is in the cache of another instance, it can get it from its "neighbor" via the cache fusion mechanism instead of reading it from disk.







But what happens if one of the instances needs to change data?



A peculiarity of Oracle is that it has no dedicated lock service: if the server wants to lock a row, the lock record is placed directly on the memory page where the locked row resides. This approach makes Oracle the performance champion among monolithic databases, since the lock service never becomes a bottleneck. But in a clustered configuration, such an architecture can lead to heavy network traffic and deadlocks.



As soon as a row is locked, the instance notifies all the other instances that the page storing that row has been acquired in exclusive mode. If another instance needs to change a row on the same page, it must wait until the changes to the page are committed, that is, until the change information is written to the redo log on disk (at which point the transaction can continue). It may also happen that a page is changed by several instances in turn, and then, when the page is written to disk, they have to find out which of them holds the current version of it.



Updating the same pages through different RAC nodes dramatically degrades database performance, to the point where the cluster can become slower than a single instance.



The correct way to use Oracle RAC is to physically partition the data (for example, using the partitioned table mechanism) and access each set of partitions through a dedicated node. After all, the main purpose of RAC is not horizontal scaling but fault tolerance.
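
As a hypothetical illustration of this partition-per-node idea (not taken from the article), the sketch below hash-partitions a table using python-oracledb; all table, column, and connection names are invented.

    import oracledb  # python-oracledb, Oracle's Python driver

    # Invented table: orders are spread across hash partitions, and the application
    # tier is arranged so that each RAC node works mostly with "its own" partitions.
    ddl = """
    CREATE TABLE orders (
        client_id NUMBER,
        order_id  NUMBER,
        amount    NUMBER
    )
    PARTITION BY HASH (client_id) PARTITIONS 4
    """

    with oracledb.connect(user="app", password="secret",
                          dsn="dbhost.example.com/orclpdb") as conn:
        with conn.cursor() as cur:
            cur.execute(ddl)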



If a node stops responding to heartbeats, the node that notices this first initiates the disk voting procedure. If the missing node never checks in there either, one of the nodes takes on the responsibility of recovering its data:



  • “freezes” all pages that were in the cache of the missing node;
  • reads the redo logs of the missing node and reapplies the changes recorded in them, checking along the way whether other nodes have more recent versions of the changed pages;
  • rolls back uncommitted transactions.


To simplify switching between nodes, Oracle has the concept of a service, a kind of virtual instance. An instance can serve several services, and a service can move between nodes. An application instance serving a certain part of the database (for example, a group of clients) works with one service, and the service responsible for that part of the database moves to another node when its node fails.
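
In application code this usually just means connecting to a service name rather than to a specific instance; a minimal sketch with python-oracledb is below, with all addresses, service, and table names invented. If the node currently hosting the service fails, the service moves and new connections land on a surviving node without the application changing anything.

    import oracledb

    # The SCAN address and the service name hide the physical RAC node from the app.
    pool = oracledb.create_pool(
        user="retail_app",
        password="secret",
        dsn="rac-scan.example.com:1521/retail_svc",   # a service, not an instance
        min=2, max=10, increment=1,
    )

    with pool.acquire() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM clients WHERE region = :r", {"r": "EU"})
            print(cur.fetchone())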



IBM Pure Data Systems for Transactions



A cluster solution for its DBMS appeared in the Blue Giant's portfolio in 2009. Conceptually, it is the heir of the Parallel Sysplex cluster, but built on "conventional" hardware. In 2009, DB2 pureScale was released as a software suite, and in 2012 IBM began offering an appliance called Pure Data Systems for Transactions. It should not be confused with Pure Data Systems for Analytics, which is nothing more than rebranded Netezza.



At first glance, the pureScale architecture resembles Oracle RAC: in the same way, several nodes are connected to a shared storage system, and each node runs its own DBMS instance with its own memory areas and transaction logs. But unlike Oracle, DB2 has a dedicated lock service, represented by the db2LLM* set of processes. In a cluster configuration, this service is placed on a separate node, called the coupling facility (CF) in Parallel Sysplex and PowerHA in Pure Data.



PowerHA provides the following services:



  • lock manager;
  • global buffer cache;
  • interprocess communication area.


Remote direct memory access is used to transfer data between PowerHA and the database nodes, so the cluster interconnect must support the RDMA protocol. pureScale can use both InfiniBand and RDMA over Ethernet.







If a node needs a page and that page is not in its local cache, the node requests it from the global cache, and only if it is not there either does it read the page from disk. Unlike in Oracle, the request goes only to PowerHA and not to the neighboring nodes.
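
This lookup order can be written down as a toy model (plain Python, not IBM code; all names are invented): local buffer pool first, then the global cache standing in for PowerHA, and only then shared storage.

    class Member:
        """Toy model of a pureScale member's page lookup; not real DB2 behavior."""
        def __init__(self, name, global_cache, disk):
            self.name = name
            self.local_cache = {}             # this member's buffer pool
            self.global_cache = global_cache  # stands in for the PowerHA buffer pool
            self.disk = disk                  # stands in for shared storage

        def get_page(self, page_no):
            if page_no in self.local_cache:           # 1. local buffer pool
                return self.local_cache[page_no]
            page = self.global_cache.get(page_no)     # 2. ask PowerHA, not the neighbors
            if page is None:
                page = self.disk[page_no]             # 3. finally, read from disk
            self.local_cache[page_no] = page
            return page

    disk = {1: b"page-1"}
    global_cache = {}
    member = Member("member0", global_cache, disk)
    assert member.get_page(1) == b"page-1"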



If an instance is going to change a row, it locks the row in exclusive mode and the page where the row resides in shared mode. All locks are registered in the global lock manager. When the transaction completes, the node sends a message to the lock manager, which copies the modified page into the global cache, releases the locks, and invalidates the modified page in the caches of the other nodes.



If the page containing the row being modified is already locked, the lock manager reads the modified page from the memory of the node that made the changes, releases the lock, invalidates the modified page in the caches of the other nodes, and grants the page lock to the node that requested it.
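
A toy sketch of the commit-time flow just described, again plain Python rather than DB2 code, with invented names: the page is published to the global cache, the lock is released, and every other member's copy is invalidated.

    def commit_page(page_no, new_image, owner, local_caches, global_cache, locks):
        """Toy model: 'owner' has committed a transaction that modified page_no."""
        assert locks.get(page_no) == owner
        global_cache[page_no] = new_image              # copy the page to PowerHA
        del locks[page_no]                             # release the lock
        for member, cache in local_caches.items():
            if member != owner:
                cache.pop(page_no, None)               # invalidate other members' copies

    local_caches = {"member0": {7: b"new"}, "member1": {7: b"old"}}
    global_cache, locks = {}, {7: "member0"}
    commit_page(7, b"new", "member0", local_caches, global_cache, locks)
    assert 7 not in local_caches["member1"] and global_cache[7] == b"new"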



"Dirty", that is, changed, pages can be written to disk both from a regular node and from PowerHA (castout).



If one of the pureScale nodes fails, recovery is limited to the transactions that were still incomplete at the moment of failure: the pages modified by that node in completed transactions are already in the global cache on PowerHA. The node restarts in a stripped-down configuration on one of the cluster's servers, rolls back the uncommitted transactions, and releases the locks.



PowerHA runs on two servers, and the primary node replicates its state to the standby synchronously. If the primary PowerHA node fails, the cluster continues to operate with the standby node.

Of course, if the data set is accessed through a single node, the overall cluster performance will be higher. pureScale may even notice that a certain area of data is being processed by one node; in that case, all locks related to that area are handled locally by the node without communicating with PowerHA. But as soon as the application tries to access this data through another node, centralized lock processing resumes.



Internal IBM benchmarks with a 90% read / 10% write workload, which is very close to real production workloads, show nearly linear scaling up to 128 nodes. The test conditions, alas, were not disclosed.



HPE NonStop SQL



The Hewlett-Packard Enterprise portfolio also has its own highly available platform. This is the NonStop platform launched in 1976 by Tandem Computers. In 1997, the company was acquired by Compaq, which in turn merged with Hewlett-Packard in 2002.



NonStop is used to build mission-critical applications, for example HLR or bank card processing. The platform is delivered as a hardware and software appliance that includes compute nodes, a storage system, and networking equipment. The ServerNet network (InfiniBand in modern systems) serves both for exchange between nodes and for access to the storage system.



Early versions of the system used proprietary processors synchronized with one another: all operations were performed simultaneously by several processors, and as soon as one of them made an error, it was switched off while the second continued working. Later the system moved to conventional processors (first MIPS, then Itanium, and finally x86), and other mechanisms came to be used for synchronization:



  • messages: each system process has a “shadow” twin, to which the active process periodically sends messages about its state; if the main process fails, the shadow process takes over from the point determined by the last message;
  • voting: the storage system has a special hardware component that accepts multiple identical accesses and executes them only if they match; instead of physical synchronization, the processors operate asynchronously, and the results of their work are compared at I/O moments.


Since 1987, a relational DBMS has been running on the NonStop platform - first SQL/MP, and later SQL/MX.



The entire database is divided into parts, and each part is managed by its own Data Access Manager (DAM) process, which provides data writing, caching, and locking. Data processing is handled by Executor Server Processes running on the same nodes as the corresponding data managers. The SQL/MX scheduler divides tasks among the executors and merges the results. When consistent changes must be made, the two-phase commit protocol provided by the TMF (Transaction Management Facility) library is used.







NonStop SQL can prioritize processes so that long analytical queries do not interfere with transaction processing. Its purpose, however, is precisely the processing of short transactions rather than analytics. The developer guarantees availability of the NonStop cluster at the level of five nines, that is, downtime of about 5 minutes per year.
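
A quick sanity check of that figure, just the arithmetic behind "five nines":

    # Allowed downtime per year at 99.999% availability.
    availability = 0.99999
    minutes_per_year = 365.25 * 24 * 60
    print(round((1 - availability) * minutes_per_year, 2))   # ~5.26 minutes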



SAP HANA



The first stable release of the HANA DBMS (1.0) took place in November 2010, and the SAP ERP suite moved to HANA in May 2013. The platform is based on acquired technologies: TREX Search Engine (search over columnar storage), P*TIME, and MaxDB.



The word "HANA" itself is an acronym, High performance ANalytical Appliance. This DBMS is delivered in the form of code that can run on any x86 servers, however, industrial installations are allowed only on certified equipment. There are solutions from HP, Lenovo, Cisco, Dell, Fujitsu, Hitachi, NEC. Some Lenovo configurations even allow operation without a SAN - a GPFS cluster on local disks plays the role of shared storage.



Unlike the platforms listed above, HANA is an in-memory DBMS, that is, the primary image of the data is stored in RAM, and only logs and periodic snapshots are written to disk for recovery in case of a disaster.







Each node of the HANA cluster is responsible for its own portion of the data, and the data map is stored in a special component, the Name Server, located on the coordinator node. Data is not duplicated between nodes. Lock information is also stored on each node, but the system has a global deadlock detector.



When connecting to the cluster, a HANA client loads its topology and can thereafter access any node directly, depending on which data it needs. If a transaction affects the data of a single node, that node can execute it locally; but if the data of several nodes changes, the initiating node contacts the coordinator node, which opens and coordinates a distributed transaction, committing it with an optimized two-phase commit protocol.
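
This routing rule can be sketched as a toy model (plain Python, not SAP code, with invented partition and node names): a transaction touching one node commits locally, while one touching several nodes goes through the coordinator with a heavily simplified two-phase commit.

    class DataNode:
        def __init__(self, name):
            self.name = name

        def prepare(self, tx):                  # phase 1: vote to commit
            return True

        def commit(self, tx):                   # phase 2, or a purely local commit
            print(f"{self.name}: commit {tx}")

    class Coordinator:
        def run_distributed(self, tx, nodes):
            if all(node.prepare(tx) for node in nodes):   # all participants vote yes
                for node in nodes:
                    node.commit(tx)
            # a real implementation would also handle aborts and failures here

    # Client-side routing based on the topology ("data map") loaded at connect time.
    topology = {"clients_eu": DataNode("node1"), "clients_us": DataNode("node2")}

    def execute(tx, partitions):
        nodes = list({topology[p] for p in partitions})
        if len(nodes) == 1:
            nodes[0].commit(tx)                  # single-node transaction: local commit
        else:
            Coordinator().run_distributed(tx, nodes)

    execute("tx1", ["clients_eu"])                # stays on node1
    execute("tx2", ["clients_eu", "clients_us"])  # goes through the coordinator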



The coordinator node is duplicated, so if the coordinator fails, the backup node takes over immediately. But if a node with data fails, the only way to access its data is to restart that node. HANA clusters therefore usually keep a spare server so that the lost node can be restarted on it as quickly as possible.


