Since all of this is written in C#, there are language-specific nuances. The standard HashSet<int> spends about 16 bytes to store a single number, and the fill factor matters too. There are more efficient implementations (I'll write about them someday), but on the other hand you can simply store the values in arrays at 4 bytes per number (we are storing ints), which is already quite efficient. But can we go lower?
I'll say right away that I have no definitive answer on how best to do this; perhaps one doesn't exist, because so much depends on the distribution of the specific data. But I do have ideas to share about what the options for saving memory are. I also recommend thinking it over yourself before reading on; after all, it's good exercise for the mind. For definiteness, I'll state the problem like this:
There is a set of unique non-negative ints (32 bits). It must be stored efficiently in RAM; the only operations are creating a set and enumerating all of its elements. There is no need to access elements by index, add new ones, or delete them.
The article will contain many letters and numbers and not a single picture (except for the packed cat on the cover image).
I deliberately don't say what the project or the concrete task is, because in general it doesn't matter. Which solutions work well depends heavily on the data: some fit one distribution, some another, and don't forget about speed either. In some places it's worth squeezing out every byte; in others, keeping a balance.
I also don't consider solutions of the form "just store everything on disk and cache the hot data"; that's a separate problem.
Just to give a sense of the data volumes I've dealt with: several million sets, each with anywhere from one element to two million. In memory, it all takes about 10 GB.
So, our baseline is an array of ints: 4 bytes (32 bits) per number. That's the figure we'll measure everything against.
To begin, let me state a brilliant idea: for a number to occupy less than 32 bits in memory, you must store it using fewer bits. Cool idea, huh? People get fame and recognition for such things, so why should I be any worse?
A lyrical digression: a few years ago, specialists from Russian Railways discovered that if you make the wheels round and all the same size, the train runs faster and quieter.
Separating numbers by size
A simple solution to start with: numbers from 0 to 255 can be stored using 1 byte per number, up to 65535 with two, up to 16777215 with three. Hence the first solution:
We create 4 arrays: in the first we store numbers at 1 byte each, in the second at 2 bytes, in the third at 3, and what goes in the fourth I'll let you guess on your own.
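A minimal sketch of this split (the class and method names are mine, purely illustrative):

```csharp
using System;
using System.Collections.Generic;

static class BucketedStorage
{
    // Split values into four buckets by how many bytes each value needs.
    // Within a bucket we keep only the low bytes; the bucket index tells
    // us how many bytes to read back per number.
    public static List<byte>[] Pack(IEnumerable<uint> values)
    {
        var buckets = new List<byte>[4];
        for (int i = 0; i < 4; i++) buckets[i] = new List<byte>();

        foreach (uint v in values)
        {
            int size = v <= 0xFF ? 1 : v <= 0xFFFF ? 2 : v <= 0xFFFFFF ? 3 : 4;
            for (int b = 0; b < size; b++)
                buckets[size - 1].Add((byte)(v >> (8 * b)));
        }
        return buckets;
    }

    public static IEnumerable<uint> Unpack(List<byte>[] buckets)
    {
        for (int size = 1; size <= 4; size++)
        {
            var bytes = buckets[size - 1];
            for (int i = 0; i < bytes.Count; i += size)
            {
                uint v = 0;
                for (int b = 0; b < size; b++)
                    v |= (uint)bytes[i + b] << (8 * b);
                yield return v;
            }
        }
    }
}
```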
Snap, and we're already saving. But why stop there? Let's use 32 arrays and store numbers at 1 bit, 2 bits, and so on. Even more economical.
On the other hand, what is an array? It's a pointer to a block of memory (8 bytes), a length, and, in C#, memory for the array object itself (20 bytes). In total, each array costs us 32 bytes (strictly speaking, in C# an object takes at least 24 bytes in increments of 8, of which 20 bytes go to the object itself and 4 to whatever is left over, or simply to alignment). All calculations here and below are for a 64-bit system; on 32 bits, pointers are half the size and alignment is 4, so almost everything is twice as cheap.
Why this digression? Because 32 arrays will devour 1 KB of memory just for themselves. What to do about it? Simple: we'll store these 32 arrays in one array!
In the first element we store the length of the one-bit section, then the section itself, then the length of the two-bit section, and so on. The result is only 32 bytes of overhead plus efficient storage.
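Continuing the sketch above, the flattening could look like this (again illustrative, and at byte rather than bit granularity for readability):

```csharp
using System;
using System.Collections.Generic;

static class FlatPacker
{
    // Glue the per-size buckets into a single byte[].
    // Layout: [count][1-byte numbers][count][2-byte numbers]...
    // For simplicity the counts here are 4-byte ints; the bit-level
    // variant described in the text would be tighter still.
    public static byte[] PackFlat(List<byte>[] buckets)
    {
        var output = new List<byte>();
        for (int size = 1; size <= 4; size++)
        {
            int count = buckets[size - 1].Count / size;
            output.AddRange(BitConverter.GetBytes(count));
            output.AddRange(buckets[size - 1]);
        }
        return output.ToArray();
    }
}
```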
An inquisitive reader (I have always liked that phrase) may notice a problem: to store the one-bit numbers, we first spend 2 bits on the length (0, 1, or 2 elements) and then up to 2 more bits on the numbers themselves. But we could spend just 2 bits total: the first bit says whether 0 is present, the second whether 1 is present.
We've just invented a bitmap. You can take this further without much worry and store all numbers from 0 to 255 this way: the number is present — 1, absent — 0. That costs 32 bytes (8 bits per byte × 32 = 256). Naturally, with each new value covered, the map's efficiency declines: to cover all ints we would need 536870912 bytes, which is a bit much. So where to stop — at 16, at 256, at 65536 — depends on the data. Let it be 256. I like that number; it's beautiful.
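A sketch of such a bitmap for the 0–255 range (names are mine):

```csharp
using System.Collections.Generic;

static class Bitmap256
{
    // 32 bytes cover the values 0..255: bit k of the map says whether
    // the number k is present in the set.
    public static byte[] Build(IEnumerable<uint> smallValues)
    {
        var map = new byte[32];
        foreach (uint v in smallValues)
            map[v >> 3] |= (byte)(1 << (int)(v & 7));
        return map;
    }

    public static bool Contains(byte[] map, uint v) =>
        (map[v >> 3] & (1 << (int)(v & 7))) != 0;
}
```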
That is, we store the first 256 numbers with a bitmap, and after that, for each bit-length, the count of numbers of that length followed by the numbers themselves.
But look what happens: numbers from 256 to 511 need 9 bits. Yet we've already covered 0 to 255 with the bitmap, so the 9-bit section can never contain, say, the number 12 — only 256 and up. So why store 9 bits when you can store a value from 0 to 255 and mentally add back the missing 256? One more bit saved! Naturally, each subsequent range also gets 1 bit cheaper. Aren't we great?
What else can we do? We can look at the data. If it's very dense (1, 2, 3, 5, 6), then we can store not the numbers that exist but the ones that don't (4): instead of storing a notional 5 numbers, we store one. A simple rule: if more than half are present, store the absent ones; otherwise, the reverse. Where to record the choice? In the length! Look: to store the count of 10-bit numbers we need 11 bits (the count runs from 0 to 1024 inclusive). But 11 bits can hold 2048 values, and we use only 1025. So: a positive length means we store the numbers present; a negative one means we store the absent ones. I leave the detailed accounting to the reader as an independent exercise (because I'm not sure everything adds up, so I'll pretend that was intentional). A rough sketch of the decision itself follows.
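A rough sketch of that decision, assuming a hypothetical bucket whose set holds only this range's values (the all-present edge case, which would need a "negative zero", is glossed over here):

```csharp
using System.Collections.Generic;

static class BucketRepresentation
{
    // For a bucket that can hold `capacity` distinct values (e.g. 1024
    // for the 10-bit range starting at `rangeStart`), decide whether to
    // store the present values or the absent ones, signalling the choice
    // via the sign of the stored length.
    public static (int signedLength, List<uint> toStore) Choose(
        HashSet<uint> present, uint rangeStart, int capacity)
    {
        bool invert = present.Count > capacity / 2;
        var toStore = new List<uint>();
        for (uint v = rangeStart; v < rangeStart + (uint)capacity; v++)
            if (present.Contains(v) != invert)  // keep present XOR absent
                toStore.Add(v);
        return (invert ? -toStore.Count : toStore.Count, toStore);
    }
}
```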
So we end up with: an array whose first 32 bytes are a bitmask for the presence of numbers from 0 to 255, followed by, for each bit-length, a signed length (store the present numbers or the absent ones), the numbers themselves, then the length for the next bit-length, and so on.
After you implement all this, even without bugs, I think you'll be carted straight off to the asylum, with the programmers who later try to understand the code following right behind you. So let's try a few more options.
Thinking about order
Look: we have an array. What does it have that a set doesn't? An order of its elements. That's extra information, and we haven't used it yet. What can we do with it?
We can store not the elements themselves but the differences between them:
1,2,3,4,8 => 1,1,1,1,4
That is, we store the first element as-is, and each subsequent one as the difference from its predecessor (to decode, keep adding each difference to the running total). What does this give us? If we sort the array in advance, the stored values become smaller overall and fit in fewer bits.
Moreover, by the problem statement all elements are distinct, so every difference is at least 1, and we can subtract another 1 from each difference to save more bits:
1,2,3,4,8 => 1,1,1,1,4 => 1,0,0,0,3
It costs nothing, so why not.
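Putting both transformations together, a possible sketch:

```csharp
using System;

static class DeltaCoder
{
    // Sort, then store the first value as-is and each subsequent value
    // as (difference from its predecessor) - 1. The -1 is legal because
    // all elements are unique, so every difference is at least 1.
    public static uint[] Encode(uint[] values)
    {
        var sorted = (uint[])values.Clone();
        Array.Sort(sorted);
        var deltas = new uint[sorted.Length];
        for (int i = 0; i < sorted.Length; i++)
            deltas[i] = i == 0 ? sorted[0] : sorted[i] - sorted[i - 1] - 1;
        return deltas;
    }

    public static uint[] Decode(uint[] deltas)
    {
        var values = new uint[deltas.Length];
        for (int i = 0; i < deltas.Length; i++)
            values[i] = i == 0 ? deltas[0] : values[i - 1] + deltas[i] + 1;
        return values;
    }
}
```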
But now a problem emerges. Since the numbers can no longer be stored independently — only in this exact order — the scheme with per-length sections no longer works. We need something else, because all numbers must be stored sequentially.
Store the number's length, in bits, before the number itself.
Not a bad option. A number takes from 1 to 32 bits, so the length needs 5 bits, followed by the number itself. For convenience you can cut off the extreme cases (why bother saving there? Pennies!) or, on the contrary, special-case them: for example, length 0 means the number 0; length 1 means the number 1; length 2 means the next 2 bits encode the numbers 2, 3, 4, 5 (we already know how to shift away values that cannot occur), and so on.
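A simplified sketch of the plain variant (5-bit length plus raw bits, without the shifted-ranges refinement; assumes .NET Core 3.0+ for BitOperations):

```csharp
using System.Collections.Generic;
using System.Numerics;

sealed class BitWriter
{
    private readonly List<byte> _bytes = new List<byte>();
    private int _bitsUsed; // bits already occupied in the last byte

    // Append the low `bitCount` bits of `value`, least significant first.
    public void Write(uint value, int bitCount)
    {
        for (int i = 0; i < bitCount; i++)
        {
            if (_bitsUsed == 0) _bytes.Add(0);
            if (((value >> i) & 1) != 0)
                _bytes[_bytes.Count - 1] |= (byte)(1 << _bitsUsed);
            _bitsUsed = (_bitsUsed + 1) % 8;
        }
    }

    public byte[] ToArray() => _bytes.ToArray();
}

static class LengthPrefixed
{
    // 5 bits of bit-length, then the number itself. Length 0 encodes
    // the number 0 with no payload bits at all.
    public static void WriteNumber(BitWriter w, uint value)
    {
        int len = value == 0 ? 0 : 32 - BitOperations.LeadingZeroCount(value);
        w.Write((uint)len, 5);
        w.Write(value, len);
    }
}
```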
Or can we store the length of the number inside the number itself?
Variable-length quantity
We're hardly the first to ask this question, so there's a standard solution. It's used for encoding characters in UTF-8 and in many other places. The idea is simple.
If a number is from 0 to 127 inclusive, we store it in 1 byte (using only 7 of its bits). If it's bigger, we set the 8th bit to 1 and use the next byte the same way (7 bits of payload; if that's still not enough, another flag and another byte). Small numbers take one byte, slightly larger ones two, and so on up to 5.
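A typical implementation looks something like this (the standard varint idea, not code from any particular library):

```csharp
using System.Collections.Generic;

static class Vlq
{
    // 7 payload bits per byte; the high bit says "another byte follows".
    public static void Encode(uint value, List<byte> output)
    {
        while (value >= 0x80)
        {
            output.Add((byte)(value | 0x80)); // low 7 bits + continuation flag
            value >>= 7;
        }
        output.Add((byte)value);
    }

    public static uint Decode(byte[] data, ref int offset)
    {
        uint result = 0;
        int shift = 0;
        byte b;
        do
        {
            b = data[offset++];
            result |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
```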
You might say: ugh, we were just juggling bits, and now we're back to whole bytes — not cool! True, but on the other hand, bytes are easier to work with than bits: a little less savings, but faster and with clearer code. Still, spending a whole bit per byte is not very elegant. Maybe there's something better?
Using values as flags
Let's skip the reasoning and jump straight to the scheme. We'll store values as follows (a sketch follows the list):
- numbers from 0 to 251 are stored in one byte. If the number is bigger, then:
- if the number is from 252 up to 252 + 256 = 508, we write the flag 252, and in the next byte the number minus 252 (yes, we already know how to shift values);
- if from 252 + 256 up to 252 + 256 + 65536, we write 253 and use the next 2 bytes to store the remaining difference;
- if from 252 + 256 + 65536 up to 252 + 256 + 65536 + 16777216, we write 254 and 3 bytes;
- otherwise, 255 and 4 bytes.
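A possible implementation of this scheme, using the tier boundaries from the list above (names are mine):

```csharp
using System.Collections.Generic;

static class FlagByteCoder
{
    // Values 252..255 in the first byte are flags that say how many
    // extra bytes follow. Each tier is shifted by everything the
    // previous tiers could already represent (the trick from above).
    private const uint T1 = 252;            // first two-byte value
    private const uint T2 = T1 + 256;       // first three-byte value
    private const uint T3 = T2 + 65536;     // first four-byte value
    private const uint T4 = T3 + 16777216;  // first five-byte value

    public static void Encode(uint v, List<byte> output)
    {
        if (v < T1) { output.Add((byte)v); return; }

        (byte flag, uint rest, int extraBytes) =
            v < T2 ? ((byte)252, v - T1, 1) :
            v < T3 ? ((byte)253, v - T2, 2) :
            v < T4 ? ((byte)254, v - T3, 3) :
                     ((byte)255, v - T4, 4);

        output.Add(flag);
        for (int i = 0; i < extraBytes; i++)
            output.Add((byte)(rest >> (8 * i)));
    }

    public static uint Decode(byte[] data, ref int offset)
    {
        byte first = data[offset++];
        if (first < T1) return first;

        int extraBytes = first - 251; // 252 -> 1, ..., 255 -> 4
        uint baseValue = first switch { 252 => T1, 253 => T2, 254 => T3, _ => T4 };
        uint rest = 0;
        for (int i = 0; i < extraBytes; i++)
            rest |= (uint)data[offset++] << (8 * i);
        return baseValue + rest;
    }
}
```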
Is this a good scheme? Everything is relative: in one byte we can fit values up to 251, whereas VLQ goes only up to 127; but two bytes take us only to 507, while VLQ reaches 16383. The scheme wins when your numbers are dense enough at the low end. Its real strength is that it can be tuned to different ranges. For example, if we know that most numbers fall between 10,000 and 50,000, we can always store them in two bytes, and when some larger number shows up, write 65535 and use 4 more bytes. In effect, we optimize storage of the range we need at the cost of storing the range we don't inefficiently.
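That tuned variant might look like this (my own reading of the idea; the escape value 65535 and the 4-byte tail are assumptions):

```csharp
using System.Collections.Generic;

static class TunedCoder
{
    // Data is expected to fit in [0, 65535): everything gets two bytes,
    // and the reserved value 65535 is an escape meaning "a full 4-byte
    // value follows".
    public static void Encode(uint v, List<byte> output)
    {
        if (v < 0xFFFF)
        {
            output.Add((byte)v);
            output.Add((byte)(v >> 8));
        }
        else
        {
            output.Add(0xFF); output.Add(0xFF); // escape marker
            for (int i = 0; i < 4; i++)
                output.Add((byte)(v >> (8 * i)));
        }
    }
}
```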
Conclusion
We've covered the main ways to save memory (in fact, my imagination ran out, but I won't admit it). These techniques can be combined, applied to other problems, and adapted to the situation. Which is best in the end? It all depends on your data. Take it and experiment. Fortunately, you don't have to implement everything fully right away: it's easy enough to write code that merely computes the resulting length, and only implement the option you liked after the assessment.
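For example, here's a hypothetical estimator for the sorted-deltas-plus-VLQ combination that computes the byte count without building anything:

```csharp
using System;

static class SizeEstimator
{
    // How many bytes would a set take under "sort, delta-minus-one,
    // then VLQ", computed without producing the encoded form.
    public static long EstimateDeltaVlqBytes(uint[] values)
    {
        var sorted = (uint[])values.Clone();
        Array.Sort(sorted);
        long total = 0;
        for (int i = 0; i < sorted.Length; i++)
        {
            uint delta = i == 0 ? sorted[0] : sorted[i] - sorted[i - 1] - 1;
            // VLQ spends one byte per started group of 7 bits.
            total += delta < 0x80 ? 1 : delta < 0x4000 ? 2 :
                     delta < 0x200000 ? 3 : delta < 0x10000000 ? 4 : 5;
        }
        return total;
    }
}
```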
And don't forget about the speed of all this: are you prepared to spend serious time packing the data and reading it back? Is it worth fighting over bits, or should you not go below bytes? Is it enough to optimize the common cases and leave the rare ones with an inefficient encoding? Can you pick different storage methods depending on the data (for example, below 8 bytes of payload it's pointless to use an array at all, since the overhead devours the entire gain, and a set that fits in one element can be stored as a pseudo-array of one element, i.e., as the number itself)?
A few words about compression: it won't help much here. Compression algorithms love repetition, and there isn't much of it in this data. Take a conventional Zip, which is LZ77 + Huffman: LZ77 is unlikely to find anything useful, while Huffman may manage to shave off some bytes. So Zip would be half useless, and the speed would drop very, very noticeably.
One situation I haven't touched at all: we know we have many sets, and we could store them all together using shared slices. Here I confess I'm not sure it would work out; no options came to mind right away, but I did realize it would be hard. You may well see it differently.
So share your ideas in the comments. Maybe I've missed some obvious elephant that will save even more bytes and produce a result that even the housewives from the detergent commercials (one drop is enough!) will envy.