Reed-Solomon codes in RAID-6

There are many articles on the Internet about recovering data in a RAID-6 array and about writing your own implementation of such an array. But most of these articles are crammed with mathematical formulas, and it takes a lot of time to extract the actual algorithm from them.



In this article I will try to show a simple example of my own error correction system based on RAID-6. In particular, I will consider a situation where you need to provide enough redundancy for the system to withstand the failure of one or two drives.



As a bonus, you will also see how error correction works in RAID-5, since RAID-6 is an improved version of RAID-5.



Overview



Let's say you have three disks with some data. Let's call them D1, D2 and D3. To use a RAID-6 system, you need two additional drives: PD and RS. In a few minutes I will describe what PD and RS stand for. So, a total of five drives: D1, D2, D3, PD and RS.







So the situation:



  • D1, D2 and D3 contain arbitrary data, say, photos of cats.

  • The special disk PD (Parity Drive, sometimes P in the documentation) contains parity data automatically generated from D1, D2 and D3.

  • The second special disk, RS (Reed-Solomon codes, sometimes called Q), contains codes generated from the same data as PD.


Let's see how to perform basic operations on such an array.



How recovery works



If you calculate PD and RS correctly, you can painlessly survive the failure of up to two disks. The recovery procedure depends on which particular drives fail. Seven situations are usually considered; below they are listed from the easiest to the most complex.



  1. Loss of PD (only one disk).







    A very simple case. The PD drive contains only auto-generated data, so it can be recovered using the original data on drives D1, D2, and D3.



  2. Loss of one of the data disks: D1, D2 or D3 (only one disk).







    This, by the way, is exactly how error correction works in RAID-5: the PD disk is all that is needed. The missing data is recalculated from the surviving data disks and PD. The RS disk is not used at all (it remains untouched).



  3. Loss of RS (only one disk).







    Similar to case 1: the RS disk contains only auto-generated codes, so it can simply be regenerated from the data disks.



  4. Loss of both PD and RS (two disks).







    A combination of cases 1 and 3. The user data is intact, so we first regenerate PD and then RS.



  5. Loss of RS and one data disk (two disks).







    Not difficult either. PD is still available, so the missing data disk is recovered exactly as in case 2. After that, RS is regenerated from the complete set of data disks.



  6. Loss of PD and one data disk (two disks).







    A more interesting case. The missing data disk (say, D3) cannot be recovered with PD, because PD is gone too. Instead, the RS codes and the surviving data disks (D1 and D2) are used to reconstruct D3, and then PD is regenerated in the usual way. The calculations are more involved, but quite manageable.



  7. Loss of two data disks (two disks).







    The most difficult case. Both PD and RS are needed to reconstruct the two missing data disks, and the calculations are the most involved of all, but still quite manageable.


In the next sections, we will explore these cases in more detail and see the source code (in Python) that performs the actual data recovery.



Keep in mind that real RAID-6 arrays do not actually dedicate an entire drive to PD or RS: this data is spread across all drives. Different controllers use different layouts (left asynchronous, right synchronous, and so on); the parity may be shifted relative to the data, delayed, etc. Let's leave aside the discussion of why this is done and what real RAID-6 stripes look like, and focus specifically on the Reed-Solomon codes.



Test data



Let's define "user data". For simplicity, let's set the size of each "disk" to 5 bytes.



Disk   ASCII data   Data in HEX
D1     first        0x66, 0x69, 0x72, 0x73, 0x74
D2     secnd        0x73, 0x65, 0x63, 0x6e, 0x64
D3     third        0x74, 0x68, 0x69, 0x72, 0x64


Now let's take a closer look at the mentioned scenarios.



Situation 1. Loss of PD disk



To generate the PD, only the user data disks are needed; in our case these are D1, D2 and D3. The PD disk is simply the XOR of all user data.



To generate the byte at offset 0 of PD, the bytes at offset 0 of all data disks must be XORed together. The same goes for offset 1, and so on:



PD[0] = D1[0] xor D2[0] xor D3[0]
PD[1] = D1[1] xor D2[1] xor D3[1]
PD[2] = D1[2] xor D2[2] xor D3[2]
PD[3] = D1[3] xor D2[3] xor D3[3]
PD[4] = D1[4] xor D2[4] xor D3[4]


Example:



PD[0] = 0x66 xor 0x73 xor 0x74 => 0x61
PD[1] = 0x69 xor 0x65 xor 0x68 => 0x64
PD[2] = 0x72 xor 0x63 xor 0x69 => 0x78
PD[3] = 0x73 xor 0x6e xor 0x72 => 0x6f
PD[4] = 0x74 xor 0x64 xor 0x64 => 0x74


Yes, it's very simple. Do this for the entire disks (in our case, 5 bytes each), and you will get a correctly generated PD:



Disk   Data in HEX
PD     0x61, 0x64, 0x78, 0x6f, 0x74


Thus, if only PD fails, it is rather trivial to restore it from D1, D2 and D3.
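
The same calculation can be expressed in a few lines of Python. Below is a minimal sketch of my own (the script name recover_pd.py is illustrative and not one of the article's later scripts; the variable names match the later examples):

# recover_pd.py -- a minimal sketch of scenario 1 (illustrative)

# User data disks, 5 bytes each ("first", "secnd", "third").
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
image2 = [ ord('s'), ord('e'), ord('c'), ord('n'), ord('d') ]
image3 = [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ]

# PD is the byte-wise XOR of all user data disks.
imagePD = [image1[i] ^ image2[i] ^ image3[i] for i in range(5)]

print(", ".join("0x{:02x}".format(b) for b in imagePD))
# Prints: 0x61, 0x64, 0x78, 0x6f, 0x74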



Situation 2. Loss of one of the data disks: D1, D2 or D3



By the way, this is exactly how error correction works in RAID-5. If only one user data disk fails, we can use the PD disk to recalculate the missing user data.



Let's say D2 is lost. We still have D1, D3, PD and RS. In this case we don't even touch RS: only drives D1, D3 and PD are needed. To calculate the missing data, you can use the XOR function again, as in the previous situation.



To recover the user data at offset 0, XOR the bytes at offset 0 of the remaining user data disks (D1 and D3) with the byte at offset 0 of PD. Repeat for offset 1, and so on:



D2[0] = D1[0] xor D3[0] xor PD[0]
D2[1] = D1[1] xor D3[1] xor PD[1]
D2[2] = D1[2] xor D3[2] xor PD[2]
D2[3] = D1[3] xor D3[3] xor PD[3]
D2[4] = D1[4] xor D3[4] xor PD[4]


Example:



D2[0] = 0x66 xor 0x74 xor 0x61 => 0x73 (s)
D2[1] = 0x69 xor 0x68 xor 0x64 => 0x65 (e)
D2[2] = 0x72 xor 0x69 xor 0x78 => 0x63 (c)
D2[3] = 0x73 xor 0x72 xor 0x6f => 0x6e (n)
D2[4] = 0x74 xor 0x64 xor 0x74 => 0x64 (d)


As you can see, it is very easy to recover data from a missing disk, and it doesn't matter which disk is missing: the XOR function always works.
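
The same recovery can be sketched in Python. As before, the script name recover_d2.py is my own, chosen for illustration:

# recover_d2.py -- a minimal sketch of scenario 2 (illustrative)

# Surviving disks.
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
image3 = [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ]
imagePD = [ 0x61, 0x64, 0x78, 0x6f, 0x74 ]

# D2 is dead; XORing the survivors with PD brings its data back.
image2 = [image1[i] ^ image3[i] ^ imagePD[i] for i in range(5)]

print("".join(chr(b) for b in image2))
# Prints: secnd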



Situation 3. Loss of RS disk



Now the Reed-Solomon codes and Galois field arithmetic come into play. But don't worry, you don't have to be a mathematician to use them.



When we lose only the RS drive (or create a new RAID-6 array from scratch), we just need to regenerate the codes. To do this, we use the gflog and gfilog tables, whose contents never change, together with the data on the existing drives D1, D2 and D3.



The gflog table always looks like this:



0x00, 0x00, 0x01, 0x19, 0x02, 0x32, 0x1a, 0xc6, 0x03, 0xdf, 0x33, 0xee, 0x1b, 0x68, 0xc7, 0x4b,
0x04, 0x64, 0xe0, 0x0e, 0x34, 0x8d, 0xef, 0x81, 0x1c, 0xc1, 0x69, 0xf8, 0xc8, 0x08, 0x4c, 0x71,
0x05, 0x8a, 0x65, 0x2f, 0xe1, 0x24, 0x0f, 0x21, 0x35, 0x93, 0x8e, 0xda, 0xf0, 0x12, 0x82, 0x45,
0x1d, 0xb5, 0xc2, 0x7d, 0x6a, 0x27, 0xf9, 0xb9, 0xc9, 0x9a, 0x09, 0x78, 0x4d, 0xe4, 0x72, 0xa6,
0x06, 0xbf, 0x8b, 0x62, 0x66, 0xdd, 0x30, 0xfd, 0xe2, 0x98, 0x25, 0xb3, 0x10, 0x91, 0x22, 0x88,
0x36, 0xd0, 0x94, 0xce, 0x8f, 0x96, 0xdb, 0xbd, 0xf1, 0xd2, 0x13, 0x5c, 0x83, 0x38, 0x46, 0x40,
0x1e, 0x42, 0xb6, 0xa3, 0xc3, 0x48, 0x7e, 0x6e, 0x6b, 0x3a, 0x28, 0x54, 0xfa, 0x85, 0xba, 0x3d,
0xca, 0x5e, 0x9b, 0x9f, 0x0a, 0x15, 0x79, 0x2b, 0x4e, 0xd4, 0xe5, 0xac, 0x73, 0xf3, 0xa7, 0x57,
0x07, 0x70, 0xc0, 0xf7, 0x8c, 0x80, 0x63, 0x0d, 0x67, 0x4a, 0xde, 0xed, 0x31, 0xc5, 0xfe, 0x18,
0xe3, 0xa5, 0x99, 0x77, 0x26, 0xb8, 0xb4, 0x7c, 0x11, 0x44, 0x92, 0xd9, 0x23, 0x20, 0x89, 0x2e,
0x37, 0x3f, 0xd1, 0x5b, 0x95, 0xbc, 0xcf, 0xcd, 0x90, 0x87, 0x97, 0xb2, 0xdc, 0xfc, 0xbe, 0x61,
0xf2, 0x56, 0xd3, 0xab, 0x14, 0x2a, 0x5d, 0x9e, 0x84, 0x3c, 0x39, 0x53, 0x47, 0x6d, 0x41, 0xa2,
0x1f, 0x2d, 0x43, 0xd8, 0xb7, 0x7b, 0xa4, 0x76, 0xc4, 0x17, 0x49, 0xec, 0x7f, 0x0c, 0x6f, 0xf6,
0x6c, 0xa1, 0x3b, 0x52, 0x29, 0x9d, 0x55, 0xaa, 0xfb, 0x60, 0x86, 0xb1, 0xbb, 0xcc, 0x3e, 0x5a,
0xcb, 0x59, 0x5f, 0xb0, 0x9c, 0xa9, 0xa0, 0x51, 0x0b, 0xf5, 0x16, 0xeb, 0x7a, 0x75, 0x2c, 0xd7,
0x4f, 0xae, 0xd5, 0xe9, 0xe6, 0xe7, 0xad, 0xe8, 0x74, 0xd6, 0xf4, 0xea, 0xa8, 0x50, 0x58, 0xaf


The gfilog table is also constant:



0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13, 0x26,
0x4c, 0x98, 0x2d, 0x5a, 0xb4, 0x75, 0xea, 0xc9, 0x8f, 0x03, 0x06, 0x0c, 0x18, 0x30, 0x60, 0xc0,
0x9d, 0x27, 0x4e, 0x9c, 0x25, 0x4a, 0x94, 0x35, 0x6a, 0xd4, 0xb5, 0x77, 0xee, 0xc1, 0x9f, 0x23,
0x46, 0x8c, 0x05, 0x0a, 0x14, 0x28, 0x50, 0xa0, 0x5d, 0xba, 0x69, 0xd2, 0xb9, 0x6f, 0xde, 0xa1,
0x5f, 0xbe, 0x61, 0xc2, 0x99, 0x2f, 0x5e, 0xbc, 0x65, 0xca, 0x89, 0x0f, 0x1e, 0x3c, 0x78, 0xf0,
0xfd, 0xe7, 0xd3, 0xbb, 0x6b, 0xd6, 0xb1, 0x7f, 0xfe, 0xe1, 0xdf, 0xa3, 0x5b, 0xb6, 0x71, 0xe2,
0xd9, 0xaf, 0x43, 0x86, 0x11, 0x22, 0x44, 0x88, 0x0d, 0x1a, 0x34, 0x68, 0xd0, 0xbd, 0x67, 0xce,
0x81, 0x1f, 0x3e, 0x7c, 0xf8, 0xed, 0xc7, 0x93, 0x3b, 0x76, 0xec, 0xc5, 0x97, 0x33, 0x66, 0xcc,
0x85, 0x17, 0x2e, 0x5c, 0xb8, 0x6d, 0xda, 0xa9, 0x4f, 0x9e, 0x21, 0x42, 0x84, 0x15, 0x2a, 0x54,
0xa8, 0x4d, 0x9a, 0x29, 0x52, 0xa4, 0x55, 0xaa, 0x49, 0x92, 0x39, 0x72, 0xe4, 0xd5, 0xb7, 0x73,
0xe6, 0xd1, 0xbf, 0x63, 0xc6, 0x91, 0x3f, 0x7e, 0xfc, 0xe5, 0xd7, 0xb3, 0x7b, 0xf6, 0xf1, 0xff,
0xe3, 0xdb, 0xab, 0x4b, 0x96, 0x31, 0x62, 0xc4, 0x95, 0x37, 0x6e, 0xdc, 0xa5, 0x57, 0xae, 0x41,
0x82, 0x19, 0x32, 0x64, 0xc8, 0x8d, 0x07, 0x0e, 0x1c, 0x38, 0x70, 0xe0, 0xdd, 0xa7, 0x53, 0xa6,
0x51, 0xa2, 0x59, 0xb2, 0x79, 0xf2, 0xf9, 0xef, 0xc3, 0x9b, 0x2b, 0x56, 0xac, 0x45, 0x8a, 0x09,
0x12, 0x24, 0x48, 0x90, 0x3d, 0x7a, 0xf4, 0xf5, 0xf7, 0xf3, 0xfb, 0xeb, 0xcb, 0x8b, 0x0b, 0x16,
0x2c, 0x58, 0xb0, 0x7d, 0xfa, 0xe9, 0xcf, 0x83, 0x1b, 0x36, 0x6c, 0xd8, 0xad, 0x47, 0x8e, 0x01


It is not necessary to hard-code these tables in the program; you can generate them at runtime with the following algorithm:



# gflog_tables.py

def generate_tables():
    polynomial = 0x11d
    s = 8
    gf_elements = 1 << s

    gflog = gf_elements * [0]
    gfilog = gf_elements * [0]

    b = 1
    for i in range(0, gf_elements):
        gflog[b] = i & 255
        gfilog[i] = b & 255
        b <<= 1
        if b & gf_elements:
            b ^= polynomial

    gflog[1] = 0  # b wraps back to 1 on the last iteration, so restore log(1) = 0
    return (gflog, gfilog)

def dump_table(caption, tab):
    item = 0
    print("--- {} ---".format(caption))
    for i in tab:
        print("0x{:02x}, ".format(i), end="")
        item += 1
        if item % 16 == 0:
            item = 0
            print()
    print("")

(gflog, gfilog) = generate_tables()

# Uncomment if you want to see the tables on the console:
#
# dump_table("gflog", gflog)
# dump_table("gfilog", gfilog)
      
      





After the tables are declared, some operations need to be defined. Now we are working in a finite field (Galois field), so the basic arithmetic operations have a different implementation (although the meaning is somewhat similar). You need to redefine the basic operations - addition, multiplication and division:



# rs_functions.py

from gflog_tables import *

# Addition
def gf_add(*args):
    result = 0
    for arg in args:
        result ^= arg

    return result

# Indexing
# First drive is 1, second drive is 2, etc...
def gf_drive(index):
    global gfilog

    return gfilog[index - 1]

# Multiplication
def gf_mul(a, b):
    global gflog
    global gfilog

    if a == 0 or b == 0:
        return 0
    else:
        return gfilog[(gflog[a] + gflog[b]) % 255]

# Division helper
def sub_gf8(a, b):
    if a > b:
        return a - b
    else:
        return (255 - (0 - (a - b)))

# Division
def gf_div(a, b):
    global gfilog
    global gflog

    return gfilog[sub_gf8(gflog[a], gflog[b])]
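
Before moving on, it may be worth sanity-checking these helpers. Below is a small sketch of my own (check_gf.py is not one of the article's scripts) verifying that gf_div undoes gf_mul and that gf_add is a plain XOR, which is exactly what the recovery formulas in the following scenarios rely on:

# check_gf.py -- a hypothetical sanity check, not one of the article's scripts

from rs_functions import *

# gf_div should undo gf_mul for every non-zero byte.
for byte in range(1, 256):
    product = gf_mul(gf_drive(3), byte)          # multiply by the drive-3 coefficient
    assert gf_div(product, gf_drive(3)) == byte  # division restores the original byte

# Addition in the Galois field is a plain XOR, so adding the same value twice cancels it.
assert gf_add(0x66, 0x73, 0x73) == 0x66

print("gf_mul, gf_div and gf_add behave as expected")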
      
      





Now that the helper functions are declared, let's try to generate the RS disk data.



# case 3 -- recover_rs.py

from rs_functions import *

# Here are our drives, together with their data.
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
image2 = [ ord('s'), ord('e'), ord('c'), ord('n'), ord('d') ]
image3 = [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ]

# This is a placeholder for our RS drive. It will be regenerated
# in the lines below.
imageRS = [0] * 5

# And this is our loop that generates the RS data using nothing more
# than the user data drives.
for i in range(0, 5):
    imageRS[i] = gf_add(gf_mul(gf_drive(1), image1[i]),
                        gf_mul(gf_drive(2), image2[i]),
                        gf_mul(gf_drive(3), image3[i]))

dump_table("imageRS", imageRS)
      
      





After running the recover_rs.py script, the RS disk contains the following data:



Disk   Data in HEX
RS     0x4d, 0x1e, 0x0d, 0x7a, 0x31


At the moment, D1, D2 and D3 drives are protected by the full RAID-6 error correction algorithm, since we have correctly generated PD and RS.



It is important to remember that the current RS data is only valid for D1, D2 and D3 in that particular order. The RS computed for D1, D2, D3 will differ from the RS computed for D3, D2, D1, even though the actual data on the disks is the same. This matters because when recovering RAID-6 data you need to know the correct disk sequence within the array. Fortunately, if the array is small, you can brute-force the order: regenerate the RS data for every permutation of the disks and see which one matches.
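
Such a brute-force search could look like the following sketch (the script name find_order.py and the disk labels A, B, C are hypothetical, not from the article):

# find_order.py -- a hypothetical sketch of brute-forcing the disk order

from itertools import permutations
from rs_functions import *

# Three data disks in an unknown order, plus the known RS contents.
disks = {
    "A": [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ],
    "B": [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ],
    "C": [ ord('s'), ord('e'), ord('c'), ord('n'), ord('d') ],
}
imageRS = [ 0x4d, 0x1e, 0x0d, 0x7a, 0x31 ]

# Try every ordering and regenerate RS until it matches the RS we have on disk.
for order in permutations(disks):
    candidate = []
    for i in range(5):
        candidate.append(gf_add(*[gf_mul(gf_drive(pos + 1), disks[name][i])
                                  for pos, name in enumerate(order)]))
    if candidate == imageRS:
        print("Correct order:", " -> ".join(order))  # prints: B -> C -> A
        break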



Situation 4. Loss of PD and RS



This is also a simple situation: we first execute scenario #1 and then scenario #3. To repeat, the user data is not affected in this case, so we can use it to recreate PD, and then RS. Both steps have already been described in situations 1 and 3.
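
Put together, the two steps might look like the following sketch of mine (the script name is illustrative; the PD line uses gf_add as a plain XOR, as explained later in situation 6):

# case 4 -- recover_pd_and_rs.py (a sketch combining scenarios 1 and 3)

from rs_functions import *

# The user data disks are intact.
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
image2 = [ ord('s'), ord('e'), ord('c'), ord('n'), ord('d') ]
image3 = [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ]

# Both checksum disks are dead.
imagePD = [0] * 5
imageRS = [0] * 5

for i in range(0, 5):
    # Scenario 1: PD is the XOR of the user data.
    imagePD[i] = gf_add(image1[i], image2[i], image3[i])

    # Scenario 3: RS is regenerated from the user data.
    imageRS[i] = gf_add(gf_mul(gf_drive(1), image1[i]),
                        gf_mul(gf_drive(2), image2[i]),
                        gf_mul(gf_drive(3), image3[i]))

dump_table("imagePD", imagePD)
dump_table("imageRS", imageRS)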



Situation 5. Loss of RS and one data disk



This one is not difficult either. We lost one data disk, but we still have PD, so we can run scenario #2 to recover the missing data disk. Then we use all the data disks to regenerate RS, as in scenario #3. The complete set of disks is now recovered.
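
As before, here is an illustrative sketch of mine combining those two steps (gf_add is used in place of a manual XOR):

# case 5 -- recover_d2_and_rs.py (a sketch combining scenarios 2 and 3)

from rs_functions import *

# Surviving disks.
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
image3 = [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ]
imagePD = [ 0x61, 0x64, 0x78, 0x6f, 0x74 ]

# Dead disks.
image2 = [0] * 5
imageRS = [0] * 5

for i in range(0, 5):
    # Scenario 2: recover the missing data disk by XORing the survivors with PD.
    image2[i] = gf_add(image1[i], image3[i], imagePD[i])

    # Scenario 3: regenerate RS from the now-complete set of data disks.
    imageRS[i] = gf_add(gf_mul(gf_drive(1), image1[i]),
                        gf_mul(gf_drive(2), image2[i]),
                        gf_mul(gf_drive(3), image3[i]))

dump_table("image2", image2)
dump_table("imageRS", imageRS)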



Situation 6. Loss of PD and one data disk



The general approach is to first recover the missing data disk using the remaining disks in combination with RS, and then, once all data disks have been recovered, regenerate the PD (scenario #1).



In this situation, you have to do some calculations. Suppose that along with PD we lost the data disk D2, so we are left with D1, D3 and RS.



Thanks to the RS disk, we can restore D2 by combining D1, D3 and RS, like this:



# case 6 -- recover_d2_and_pd.py

from rs_functions import *

# We have these drives...
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
image3 = [ ord('t'), ord('h'), ord('i'), ord('r'), ord('d') ]
imageRS = [ 0x4d, 0x1e, 0x0d, 0x7a, 0x31 ]

# ...and these drives are dead
imagePD = [0] * 5
image2 = [0] * 5

for i in range(0, 5):
    partialRS = gf_add(gf_mul(gf_drive(1), image1[i]),
                       imageRS[i],  # Use RS drive instead of the dead drive.
                       gf_mul(gf_drive(3), image3[i]))

    # gf_drive(2) is our dead drive.
    div_result = gf_div(1, gf_drive(2))

    # This will generate the data from the dead D2 drive.
    image2[i] = gf_mul(div_result, partialRS)

    # This will generate the data from the dead PD drive.
    imagePD[i] = gf_add(image1[i], image2[i], image3[i])

dump_table("image2", image2)
dump_table("imagePD", imagePD)
      
      





First, you need to generate the partialRS value by adding (gf_add) the gf_mul results for the bytes of all surviving data disks, together with the RS value taken in place of the missing data disk (in our case, D2).



We then use the partialRS value to regenerate the D2 data: divide one by the dead disk index (gf_drive(2)) and multiply the result by partialRS. The argument gf_drive(2) denotes the index of our dead disk; if D1 had failed, we would use gf_drive(1) here instead.



After regenerating D2, all data disks are restored. We then regenerate PD as in scenario #1: in the code above this is done by adding (gf_add) the data from all data disks. Remember that addition in a Galois field is a simple XOR operation, so instead of manually XORing the bytes from all data disks you can simply use gf_add.



Situation 7. Loss of two data disks



This is the most interesting and most difficult scenario. Suppose disks D2 and D3 have failed. In this case, you need to somehow use the D1, PD and RS disks to regenerate the missing disks.



The approach here differs from the previous cases. The general idea is to first recover the data for D2 and then use the same calculation as in scenario #2 to recover the data for D3. Here is the code:



# case 7 -- recover_d2_and_d3.py

from rs_functions import *

# These drives are still alive.
image1 = [ ord('f'), ord('i'), ord('r'), ord('s'), ord('t') ]
imagePD = [ 0x61, 0x64, 0x78, 0x6f, 0x74 ]
imageRS = [ 0x4d, 0x1e, 0x0d, 0x7a, 0x31 ]

# These drives are dead, we can't read from them.
image2 = [0] * 5
image3 = [0] * 5

for i in range(0, 5):
    partialPD = gf_add(image1[i]) # add other drives if they exist
    partialRS = gf_add(gf_mul(gf_drive(1), image1[i])) # add other drives if they exist

    g = gf_div(1, gf_add(gf_drive(2), gf_drive(3)))
    xoredPD = gf_add(partialPD, imagePD[i])
    xoredRS = gf_add(partialRS, imageRS[i])
    mid = gf_add(gf_mul(gf_drive(3), xoredPD), xoredRS) # gf_drive(3) is the second drive we've lost

    # Regenerate data for D2.
    data = gf_mul(mid, g)
    image2[i] = data

    # Regenerate data for D3.
    image3[i] = gf_add(image1[i], image2[i], imagePD[i])

    # or:
    #
    # image3[i] = gf_add(data, xoredPD)

dump_table("image2", image2)
dump_table("image3", image3)
      
      





First, you need to add all the bytes from all surviving data disks to generate partialPD. In this example we only have one data disk left, so partialPD is simply the contents of D1. But RAID-6 arrays span multiple drives, so if we had more than one surviving data disk, for example three live data disks, the partialPD calculation would look like this:



partialPD = gf_add(image1[i], image2[i], image3[i])
      
      





Next we need the partialRS value. It can be calculated by adding the data from the existing disks as follows:



partialRS = gf_add(A, B, C, ..., Z)

where A = gf_mul(gf_drive(1), image1[i])
      B = gf_mul(gf_drive(2), image2[i]) if we have drive 2
      C = gf_mul(gf_drive(3), image3[i]) if we have drive 3

etc.
      
      





In our case there is only one surviving data disk (D1), so our partialRS is simply gf_mul(gf_drive(1), image1[i]).



Then you need to compute the parameter g by dividing one by the sum of the dead disk indices (D2 and D3).



Next comes the parameter xoredPD; it is computed by adding the contents of PD to the previously computed partialPD. The next parameter, xoredRS, is calculated in a similar way, by adding partialRS to the RS contents.



Now for the tricky part: calculating the data of the first broken disk, D2. To do this, multiply the index of the second broken disk (D3) by xoredPD and add xoredRS to the result. Then, after multiplying that result by g, we get the contents of the D2 disk.



Since we have just recovered the data for D2, this case is now no different from scenario #2, loss of one data disk (D3). To recreate the D3 drive, all live data drives (D1 and D2) must be added to the PD.



Done! We have recovered the complete set of disks.



Epilogue



I chose Python to demonstrate that correcting errors with Reed-Solomon codes does not require much programming or processing power. Everything is very fast and the implementation can be quite compact. Of course, a more efficient implementation should be written with parallel processing in mind. Since each byte is calculated independently of the others, parallelization is not difficult.
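
As an illustration of that last point, here is a hypothetical sketch of mine (not one of the article's scripts) that computes RS for larger images chunk by chunk using a process pool; the chunk size and the fake image contents are arbitrary:

# parallel_rs.py -- a hypothetical sketch of chunk-level parallelism

from concurrent.futures import ProcessPoolExecutor
from rs_functions import *

def rs_chunk(chunk1, chunk2, chunk3):
    # Each RS byte depends only on the bytes at the same offset,
    # so chunks can be processed completely independently.
    return [gf_add(gf_mul(gf_drive(1), a),
                   gf_mul(gf_drive(2), b),
                   gf_mul(gf_drive(3), c))
            for a, b, c in zip(chunk1, chunk2, chunk3)]

if __name__ == "__main__":
    # Pretend these are large disk images.
    image1 = bytes(range(256)) * 1024
    image2 = bytes(reversed(range(256))) * 1024
    image3 = bytes([0x55]) * (256 * 1024)

    size = 64 * 1024  # bytes per chunk

    def chunks(img):
        return [img[i:i + size] for i in range(0, len(img), size)]

    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(rs_chunk, chunks(image1), chunks(image2), chunks(image3)))

    imageRS = [byte for part in parts for byte in part]
    print("RS bytes computed:", len(imageRS))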



It should be noted that the described data recovery method does not have to be applied to separate physical disks. The "disks" can be thought of as buffers in the process of transmitting data over an unreliable channel, and the error correction remains just as effective. It requires more intensive computation than Hamming codes, but it can recover two lost streams, which is a powerful resiliency feature.



Of course, RAID-6 is far from a new invention, and Reed-Solomon codes are even older. They were used back in the Voyager 2 mission, which is pretty cool.



Among the more modern alternatives to Reed-Solomon codes are turbo codes; I hope I will have the opportunity to dig into them too.


