Operating experience with CEPH

When there is more data than fits on a single disk, it is time to think about RAID. As a child, I often heard from my elders: “One day RAID will be a thing of the past, object storage will fill the world, and you won't even know what CEPH is,” so the first thing I did in my independent life was to build my own cluster. The goal of the experiment was to get acquainted with the internals of ceph and understand the scope of its application: how justified is deploying ceph in a medium-sized business, and what about a small one? After several years of operation and a couple of irreversible data losses, an understanding of the subtleties emerged: not everything is so simple. The peculiarities of CEPH hinder its widespread adoption, and because of them the experiments came to a standstill. Below is a description of all the steps taken, the results obtained and the conclusions drawn. If knowledgeable people share their experience and explain some points, I will be grateful.



Note: commenters have pointed out serious errors in some of the assumptions, which require a revision of the entire article.



CEPH Strategy



A CEPH cluster combines an arbitrary number K of disks of arbitrary size and stores data on them, replicating each piece (4 MB by default) a specified number N of times.



Consider the simplest case: two identical disks. From them you can assemble either a RAID 1 or a cluster with N = 2, and the result will be the same. If there are three disks of different sizes, it is easy to assemble a cluster with N = 2: some of the data will be on disks 1 and 2, some on 1 and 3, and some on 2 and 3, while RAID will not work (you could build such a RAID, but it would be a perversion). With even more disks, RAID 5 becomes possible; CEPH has an analogue, erasure_code, which contradicts the developers' early concepts and is therefore not considered here. RAID 5 assumes that you have a small number of disks and that they are all in good condition: if one fails, the rest must hold out until the disk is replaced and the data is rebuilt onto it. CEPH with N >= 3, on the contrary, encourages the use of old disks: if you keep a few good disks for one copy of the data and store the remaining two or three copies on a large number of old disks, the information will be safe, since as long as the new disks are alive there are no problems, and if one of them breaks, a simultaneous failure of three disks with a service life of more than five years, preferably from different servers, is an extremely unlikely event.
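
For illustration, this is roughly what a replicated pool with N = 3 and an erasure-coded pool (the RAID 5 analogue mentioned above) look like; the pool names, PG counts and the k/m profile here are arbitrary examples, not values taken from this cluster:

    # replicated pool: every object is stored in 3 copies, I/O continues while 2 copies are alive
    ceph osd pool create rpool 64 64 replicated
    ceph osd pool set rpool size 3
    ceph osd pool set rpool min_size 2
    # erasure-coded pool: k data chunks + m parity chunks per object
    ceph osd erasure-code-profile set ec21 k=2 m=1
    ceph osd pool create ecpool 64 64 erasure ec21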



There is a subtlety in distributing copies. By default, the data is divided into PG placement groups (~100 per disk), each of which is duplicated on some set of disks. Suppose K = 6 and N = 2; then if any two disks fail, data is guaranteed to be lost, since, by the theory of probability, there will be at least one PG located on exactly those two disks. And the loss of one group makes all the data in the pool inaccessible. If instead the disks are divided into three pairs and data is allowed to live only on disks within one pair, then such a distribution is also resistant to the failure of any single disk, but if two disks fail, the probability of data loss is not 100% but only 3/15, and even with three failed disks it is only 12/20. Hence, entropy in data distribution does not contribute to fault tolerance. Note also that for a file server, free RAM significantly increases the response speed. The more memory in each node, and the more memory in all nodes, the faster it is. This is undoubtedly an advantage of a cluster over a single server and, even more so, over a hardware NAS, where a very small amount of memory is built in.
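
The 3/15 and 12/20 figures follow from simple counting (my own reconstruction of the arithmetic, assuming the three fixed pairs described above): data is lost only when a complete pair fails, so

    P_2 = \frac{3}{\binom{6}{2}} = \frac{3}{15}, \qquad P_3 = \frac{3 \cdot 4}{\binom{6}{3}} = \frac{12}{20}

since with three failures each of the 3 pairs can be completed by any of the 4 remaining disks.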



It follows that CEPH is a good way to build, with minimal investment and out of outdated equipment, a reliable storage system for tens of TB with the ability to scale (this will, of course, require some costs, but small ones compared to commercial storage systems).



Cluster implementation



For the experiment, we take a decommissioned computer: an Intel DQ57TM board, an Intel Core i3-540 and 16 GB of RAM. We will organize four 2 TB disks into a kind of RAID 10; after a successful test we will add a second node and the same number of disks.



Install Linux. The distribution must be customizable and stable; Debian and Suse fit the requirements. Suse has a more flexible installer that lets you disable any package; unfortunately, I could not figure out which ones can be thrown out without damaging the system. We install Debian via debootstrap buster. The minbase option installs a broken system that lacks drivers, and the difference in size compared to the full version is not big enough to bother. Since the work is carried out on a physical machine, I want to be able to take snapshots, as on virtual machines. Either LVM or btrfs (or xfs, or zfs; the difference is not great) provides such an opportunity. Snapshots are not LVM's strong point, so we go with btrfs, and the bootloader goes into the MBR. There is no point in wasting 50 MB of the disk on a FAT partition when you can push the bootloader into the 1 MB area after the partition table and allocate all the space to the system. The installed system occupied about 700 MB on disk. I do not remember how much the base SUSE installation takes; it seems about 1.1 or 1.4 GB. The full sequence of commands is in the debootstrap listing below.



Installing CEPH. We ignore version 12 from the Debian repository and connect version 15.2.3 directly from the CEPH site. Follow the instructions from the "Install CEPH Manually" section with the following reservations:



  • Before connecting the repository, you need to install gnupg, wget and ca-certificates.
  • After connecting the repository, but before installing the cluster, the packages are installed while omitting the recommended ones: apt -y --no-install-recommends install ceph-common ceph-mon ceph-osd ceph-mds ceph-mgr
  • During installation, ceph-osd will (for understandable reasons) try to pull in lvm2. There is no urgent need for it. If the package fails to install, you can do without it by removing the dependency on it in /var/lib/dpkg/status for ceph-osd.



    A less humane patch was used when writing the article:
    cat << EOF >> /var/lib/dpkg/status
    Package: lvm2
    Status: install ok installed
    Priority: important
    Section: admin
    Installed-Size: 0
    Maintainer: Debian Adduser Developers <adduser@packages.debian.org>
    Architecture: all
    Multi-Arch: foreign
    Version: 113.118
    Description: No-install
    EOF
    




Cluster overview



ceph-osd - responsible for storing data on a disk. A network service is started for each disk; it accepts and executes requests to read from or write to objects. This article discusses bluestore storage as the lowest level. Service files describing the cluster ID, the storage, its type, etc. are created in the service directory, along with the required block file; these files are not changed afterwards. If block is a regular file, osd will create a file system in it and accumulate data there. If it is a symlink, the data will go to the device the link points to. In addition to the main device, block.db (RocksDB metadata) and block.wal (the RocksDB write-ahead log) can additionally be specified. If the additional devices are not specified, the metadata and the log are stored on the main device. It is very important to keep track of free space for RocksDB, otherwise the OSD will not start!
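
By analogy with the block symlink in the OSD listing below, block.db and block.wal can also be pointed at separate (faster) partitions before running ceph-osd --mkfs. This is only a sketch I have not tested, and the PARTUUIDs are hypothetical placeholders:

    cd /var/lib/ceph/osd/ceph-$osdnum
    # hypothetical SSD partitions for RocksDB metadata and its write-ahead log
    ln -s /dev/disk/by-partuuid/<db-partuuid>  block.db
    ln -s /dev/disk/by-partuuid/<wal-partuuid> block.wal
    ceph-osd -i $osdnum --mkfs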

During standard osd creation in older versions, the disk is divided into two partitions: the first, 100 MB with xfs, is mounted in /var/lib/... and contains service information; the second is given over to the main storage. The new version uses lvm.

Theoretically, you can avoid mounting the miniature partition altogether and instead place the files in /var/lib/..., duplicate them on all nodes, and allocate the entire disk for data without creating either a GPT or an LVM header. When adding an OSD manually, you need to make sure that the ceph user has write access to the data block devices, and that the service data directory is automatically mounted in /var/lib/... if you decide to place it there. It is also advisable to specify the osd memory target parameter so that there is enough physical memory.
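
In the ceph.conf listing below, osd memory target is set globally; as a sketch, the same knob can also be changed at runtime through the configuration database (the value is in bytes, 1.5 GB here is just an example):

    ceph config set osd osd_memory_target 1610612736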



ceph-mds. At a low level, CEPH is an object store. Block storage is reduced to saving each 4 MB block as an object. File storage works on the same principle. Two pools are created: one for metadata, the other for data, and they are combined into a file system. At that moment some kind of record is created, so if you delete the file system but keep both pools, you will not be able to restore it. There is a procedure for extracting files block by block; I have not tested it. The ceph-mds service is responsible for access to the file system; a separate instance of the service is required for each file system. There is a "rank" option that allows you to create a semblance of several file systems in one - also not tested.
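
A minimal sketch of this two-pool scheme on the single node from the listings below; the pool names, PG counts and the caps are my own illustrative choices following the same manual-deployment pattern as the other daemons:

    ceph osd pool create cephfs_data 64
    ceph osd pool create cephfs_metadata 16
    ceph fs new cephfs cephfs_metadata cephfs_data
    # a keyring and a service instance for the mds daemon
    mkdir -p /var/lib/ceph/mds/ceph-rbd1
    ceph auth get-or-create mds.rbd1 mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow rwx' > /var/lib/ceph/mds/ceph-rbd1/keyring
    chown ceph:ceph -R /var/lib/ceph/mds
    systemctl enable --now ceph-mds@rbd1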



ceph-mon - this service stores the cluster map. It includes information about all OSDs, the algorithm for distributing PGs across OSDs and, most importantly, information about all objects (the details of this mechanism are not clear to me: there is a directory /var/lib/ceph/mon/.../store.db containing a large file of 26 MB, and in a cluster of 105K objects that comes out to a little more than 256 bytes per object; I think the monitor stores a list of all objects and the PGs they are located in). Damage to this directory results in the loss of all data in the cluster. From this it was concluded that CRUSH shows how PGs are placed on OSDs, while how objects are placed in PGs is stored in the database (the conclusion turned out to be incorrect; what exactly it contains requires clarification). As a consequence, firstly, we cannot install the system on a USB flash drive in RO mode, since the database is constantly being written to, and an additional disk (hardly more than 1 GB) is needed for this data; secondly, you need to have a copy of this database in real time. If there are several monitors, fault tolerance is provided by them, but if there is only one monitor, or at most two, then data protection must be ensured. There is a theoretical procedure for recovering a monitor from OSD data; at the moment it resulted in recovery at the object level, but the file system could not be restored. So far, one cannot rely on this mechanism.
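
One simple way to keep a copy of this database is an offline copy of the mon data directory; this is a sketch, not a procedure from the article (the backup path is arbitrary, and the monitor must be stopped so the copy is consistent):

    systemctl stop ceph-mon@rbd1
    tar -C /var/lib/ceph/mon -czf /root/mon-store-backup.tar.gz ceph-rbd1
    systemctl start ceph-mon@rbd1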



rados-gw - exports the object storage over S3 and the like. It creates many pools, it is not clear why. I did not experiment with it much.



ceph-mgr - when this service is installed, several modules are started. One of them is the autoscale module, which cannot be turned off. It strives to maintain the correct PG/OSD ratio. If you want to control the ratio manually, you can disable scaling for each pool, but then the module crashes with a division by zero, and the cluster status becomes ERROR. The module is written in python, and if you comment out the necessary line in it, this leads to its being disabled. I am too lazy to recall the details.
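
For reference, disabling the autoscaler per pool looks roughly like this (the pool name is a placeholder; the default for newly created pools is already switched off in the ceph.conf listing below via osd pool default pg autoscale mode = off):

    ceph osd pool set <pool> pg_autoscale_mode off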



References:



Installing CEPH

Recovering from complete monitor failure

Article about BlueStore in Ceph

Description of ceph architecture. Sage A. Weil



Script listings:



debootstrap
export blkdev=sdb1    # exported so that $blkdev is still visible inside the chroot below
mkfs.btrfs -f /dev/$blkdev
mount /dev/$blkdev /mnt
cd /mnt
# separate subvolumes for /, /var and /home so they can be snapshotted independently
for i in {@,@var,@home}; do btrfs subvolume create $i; done
mkdir snapshot @/{var,home}
for i in {var,home}; do mount -o bind @${i} @/$i; done
# bootstrap a minimal Debian buster into the @ subvolume
debootstrap buster @ http://deb.debian.org/debian; echo $?
for i in {dev,proc,sys}; do mount -o bind /$i @/$i; done
cp /etc/bash.bashrc @/etc/

chroot /mnt/@ /bin/bash
echo rbd1 > /etc/hostname
passwd
uuid=`blkid | grep $blkdev | cut -d "\"" -f 2`
cat << EOF > /etc/fstab
UUID=$uuid / btrfs noatime,nodiratime,subvol=@ 0 1
UUID=$uuid /var btrfs noatime,nodiratime,subvol=@var 0 2
UUID=$uuid /home btrfs noatime,nodiratime,subvol=@home 0 2
EOF
cat << EOF >> /var/lib/dpkg/status
Package: lvm2
Status: install ok installed
Priority: important
Section: admin
Installed-Size: 0
Maintainer: Debian Adduser Developers <adduser@packages.debian.org>
Architecture: all
Multi-Arch: foreign
Version: 113.118
Description: No-install

Package: sudo
Status: install ok installed
Priority: important
Section: admin
Installed-Size: 0
Maintainer: Debian Adduser Developers <adduser@packages.debian.org>
Architecture: all
Multi-Arch: foreign
Version: 113.118
Description: No-install
EOF

exit
grub-install --boot-directory=@/boot/ /dev/$blkdev
init 6

apt -yq install --no-install-recommends linux-image-amd64 bash-completion ed btrfs-progs grub-pc iproute2 ssh  smartmontools ntfs-3g net-tools man
exit
grub-install --boot-directory=@/boot/ /dev/$blkdev
init 6




apt -yq install --no-install-recommends gnupg wget ca-certificates
echo 'deb https://download.ceph.com/debian-octopus/ buster main' >> /etc/apt/sources.list
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt update
apt -yq install --no-install-recommends ceph-common ceph-mon

echo 192.168.11.11 rbd1 >> /etc/hosts
uuid=`cat /proc/sys/kernel/random/uuid`
cat << EOF > /etc/ceph/ceph.conf
[global]
fsid = $uuid
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
mon allow pool delete = true
mon host = 192.168.11.11
mon initial members = rbd1
mon max pg per osd = 385
osd crush update on start = false
#osd memory target = 2147483648
osd memory target = 1610612736
osd scrub chunk min = 1
osd scrub chunk max = 2
osd scrub sleep = .2
osd pool default pg autoscale mode = off
osd pool default size = 1
osd pool default min size = 1
osd pool default pg num = 1
osd pool default pgp num = 1
[mon]
mgr initial modules = dashboard
EOF

# create the keyrings: monitor key, client.admin key and the OSD bootstrap key
ceph-authtool --create-keyring ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
cp ceph.client.admin.keyring /etc/ceph/
ceph-authtool --create-keyring bootstrap-osd.ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
cp bootstrap-osd.ceph.keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph-authtool ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
# build the initial monitor map, create the monitor store and start the monitor
monmaptool --create --add rbd1 192.168.11.11 --fsid $uuid monmap
rm -R /var/lib/ceph/mon/ceph-rbd1/*
ceph-mon --mkfs -i rbd1 --monmap monmap --keyring ceph.mon.keyring
chown ceph:ceph -R /var/lib/ceph
systemctl enable ceph-mon@rbd1
systemctl start ceph-mon@rbd1
ceph mon enable-msgr2
ceph status

# dashboard

apt -yq install --no-install-recommends ceph-mgr ceph-mgr-dashboard python3-distutils python3-yaml
mkdir /var/lib/ceph/mgr/ceph-rbd1
ceph auth get-or-create mgr.rbd1 mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/ceph-rbd1/keyring
systemctl enable ceph-mgr@rbd1
systemctl start ceph-mgr@rbd1
ceph config set mgr mgr/dashboard/ssl false
ceph config set mgr mgr/dashboard/server_port 7000
ceph dashboard ac-user-create root 1111115 administrator
systemctl stop ceph-mgr@rbd1
systemctl start ceph-mgr@rbd1




OSD ()
apt install ceph-osd

# allocate a new OSD id and a service directory for it
osdnum=`ceph osd create`
mkdir -p /var/lib/ceph/osd/ceph-$osdnum
# small xfs partition for the service files; the data itself goes to the block device below
mkfs -t xfs /dev/sda1
mount -t xfs /dev/sda1 /var/lib/ceph/osd/ceph-$osdnum
cd /var/lib/ceph/osd/ceph-$osdnum
ceph auth get-or-create osd.$osdnum mon 'profile osd' mgr 'profile osd' osd 'allow *' > /var/lib/ceph/osd/ceph-$osdnum/keyring
# block points at the data partition by PARTUUID
ln -s /dev/disk/by-partuuid/d8cc3da6-02  block
ceph-osd -i $osdnum --mkfs
#chown ceph:ceph /dev/sd?2
chown ceph:ceph -R /var/lib/ceph
systemctl enable ceph-osd@$osdnum
systemctl start ceph-osd@$osdnum






The main marketing advantage of CEPH is CRUSH, an algorithm for calculating the location of data. The monitors distribute this algorithm to clients, after which clients query the desired host and the desired OSD directly. CRUSH ensures there is no centralization. It is a small file that you could even print out and hang on the wall. Practice has shown that CRUSH is not an exhaustive map: destroying and re-creating the monitors while keeping all the OSDs and CRUSH is not enough to restore the cluster. From this it was concluded that each monitor stores some metadata about the entire cluster. The insignificant amount of this metadata does not impose restrictions on the size of the cluster, but it does require ensuring its safety, which rules out saving a disk by installing the system on a USB flash drive and excludes clusters with fewer than three nodes. Add to this an aggressive developer policy regarding optional features; this is far from minimalism. The documentation is at the level of "thanks for what there is, but it is very, very scarce". The ability to interact with services at a low level is provided, but the documentation is too superficial on this topic, so more likely no than yes. There is practically no chance of recovering data in an emergency situation (thanks to explanations from the community, a small chance still remains).
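
For the record, that "small file" can indeed be extracted and read with the standard tools; a quick sketch, nothing here is specific to this cluster:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    less crushmap.txt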



Options for further action: abandon CEPH and use plain multi-disk btrfs (or xfs, zfs); learn something new about CEPH that would allow it to be used under these conditions; or try writing my own storage as a further learning exercise.


