You've probably already used UUIDs in your projects and thought they were unique. Let's look at the main aspects of the implementation and see why UUIDs are practically unique, since there is a miniscule possibility of the same values occurring.
The modern implementation of UUIDs can be traced back to RFC 4122, which describes five different approaches to generating these identifiers. We'll go over each of them and walk through the implementation of version 1 and version 4.
Theory
UUID (universally unique IDentifier) is a 128-bit number that is used in software development as a unique identifier for elements. Its classic textual representation is a series of 32 hexadecimal characters, separated by hyphens into five groups in the 8-4-4-4-12 pattern.
For instance:
3422b448-2460-4fd2-9183-8000de6f8343
UUID implementation information is embedded in this seemingly random sequence of characters:
xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
The values at positions M and N define the version and variant of the UUID, respectively.
Version
The version number is identified by the four most significant bits at position M. Today there are the following versions:
Option
This field defines the template of the information embedded in the UUID. The interpretation of all other bits in the UUID depends on the value of the variant.
We determine it by the first 1-3 most significant bits at position N.
Today, option 1 is most often used, in which
MSB0
equals 1
and MSB1
equals 0
. This means that given wildcards - bits selected
- only possible values are 8
, 9
, A
or B
.
Memo:
1 0 0 0 = 8
1 0 0 1 = 9
1 0 1 0 = A
1 0 1 1 = B
So if you see a UUID with such values at position N, then this is the identifier in option 1.
Version 1 (time + unique or random host id)
In this case, the UUID is generated like this: some identifying property of the device that generates the UUID is added to the current time, most often it is the MAC address (also known as the node ID).
The identifier is obtained by concatenating a 48-bit MAC address, a 60-bit timestamp, a 14-bit "unique" clock sequence, and 6 bits reserved for version and variant UUIDs.
The clock sequence is simply a value that is incremented each time the clock is changed.
The timestamp used in this version is the number of 100-nanosecond intervals since October 15, 1582, the date the Gregorian calendar originated.
You may be familiar with the Unix system of time since the beginning of an epoch. It's just a different kind of Day Zero. There are services on the web that can help you transform one temporal representation into another, so let's not dwell on this.
Although this implementation looks quite simple and reliable, the use of the MAC address of the machine on which the identifier is generated does not allow this method to be considered universal. Especially when safety is the main criterion. Therefore, in some implementations, instead of the node identifier, 6 random bytes are used, taken from a cryptographically protected random number generator.
Building UUID version 1 goes like this:
- The least significant 32 bits of the current UTC timestamp are taken. This will be the first 4 bytes (8 hex characters) of the UUID [
TimeLow
]. - The middle 16 bits of the current UTC timestamp are taken. This will be the next 2 bytes (4 hex characters) [
TimeMid
]. - The next 2 bytes (4 hex characters) concatenate the 4 bits of the UUID version with the remaining 12 MSBs of the current UTC timestamp (which has a total of 60 bits) [
TimeHighAndVersion
]. - The next 1-3 bits define the UUID version variant. The remaining bits contain a clock sequence that adds a little bit of randomness to this implementation. This avoids collisions when multiple UUID generators are running on the same system: either the system clock is set back for the generator, or the time change is slowed down [
ClockSequenceHiAndRes && ClockSequenceLow
]. - The last 6 bytes (12 hexadecimal characters, 48 bits) are the "node ID", which is usually the generator MAC address [
NodeID
].
The version 1 UUID is generated using concatenation:
TimeLow + TimeMid + TimeHighAndVersion + (ClockSequenceHiAndRes && ClockSequenceLow) + NodeID
Since this implementation is clock dependent, we need to handle edge situations. First, to minimize correlation between systems, by default the clock sequence is taken as a random number - this is done only once during the entire life cycle of the system. This gives us the added benefit of supporting node IDs that can be carried across systems, since the initial clock sequence is completely independent of the node ID.
Remember that the main purpose of using a clock sequence is to add some randomness to our equation. The clock sequence bits help extend the timestamp and accommodate situations where multiple UUIDs are generated even before the processor clock changes. This way we avoid creating the same identifiers when the clock is set back (device is off) or the node identifier changes. If the clock is set back, or could have been set back (for example, while the system was turned off), and the UUID generator cannot verify that the identifiers were generated with later time stamps than the specified clock value, then the clock sequence must be changed. If we know its previous value, we can simply increase it;otherwise, it must be set randomly or with a high quality PRNG.
Version 2 (security of a distributed computing environment)
The main difference of this version from the previous one is that instead of "randomness" in the form of the least significant bits of the clock sequence, an identifier characteristic of the system is used here. Often, this is just the ID of the current user. Version 2 is used less often, it differs very little from version 1, so let's move on.
Version 3 (name + MD5 hash)
If unique identifiers are needed for naming or naming information, then usually UUID version 3 or version 5 is used.
They encode any "named" entities (sites, DNS, plain text, etc.) into a UUID value. Most importantly, the same UUID will be generated for the same namespace or text.
Note that namespace itself is a UUID.
let namespace = “digitalbunker.dev”
let namespaceUUID = UUID3(.DNS, namespace)
// Ex:
UUID3(namespaceUUID, “/category/things-you-should-know-1/”)
4896c91b-9e61-3129-87b6-8aa299028058
UUID3(namespaceUUID, “/category/things-you-should-know-2/”)
29be0ee3-fe77-331e-a1bf-9494ec18c0ba
UUID3(namespaceUUID, “/category/things-you-should-know-3/”)
33b06619-1ee7-3db5-827d-0dc85df1f759
In this implementation, the UUID namespace is converted to a string of bytes concatenated with the input name, then hashed with MD5, resulting in 128 bits for the UUID. We then rewrite some of the bits to accurately reproduce the version and version information, and leave the rest untouched.
It is important to understand that neither namespace nor input name can be computed based on UUID. This is an irreversible operation. The only exception is brute force when one of the values (namespace or text) is already known to the attacker.
With the same input, the generated version 3 and 5 UUIDs will be deterministic.
Version 4 (PRNG)
Simplest implementation.
6 bits are reserved for version and variant, there are still 122 bits left. This version simply generates 128 random bits and then replaces 6 of them with version and version data.
Such UUIDs are completely dependent on the quality of the PRNG (pseudo-random number generator). If its algorithm is too simple, or it lacks initial values, then the probability of repeating identifiers increases.
In modern languages, UUID version 4 is most often used.
Its implementation is quite simple:
- We generate 128 random bits.
- Rewrite some bits with correct version and version information:
- Take the seventh bit and
0x0F
AND to clear the high nibble. And then0x40
OR is used to assign version 4. - Then we take the ninth byte, perform an
0x3F
AND operation on c and an0x80
OR operation on it.
- Take the seventh bit and
- Convert 128 bits to hex and insert hyphens.
Version 5 (name + SHA-1-hash)
The only difference from version 3 is that we are using the SHA-1 hashing algorithm instead of MD5. This version is preferred over the third (SHA-1> MD5).
Practice
One of the important advantages of UUIDs is that their uniqueness does not depend on a central authorizing authority or on coordination between different systems. Anyone can create a UUID with some confidence that no one else will generate this value for the foreseeable future.
This makes it possible to combine identifiers created by different participants in one database, or move identifiers between databases with a negligible collision probability.
UUIDs can be used as primary keys in databases, as unique names for uploaded files, as unique names for any web sources. You don't need a central authorizing authority to generate them. But this is a double-edged solution. Due to the lack of a controller, it is impossible to track the generated UUIDs.
There are a few more drawbacks that need to be addressed. Inherent randomness increases security, but it makes debugging more difficult. Also, the UUID can be redundant in some situations. Let's say it doesn't make sense to use 128 bits to uniquely identify data whose total size is less than 128 bits.
Uniqueness
It may seem that if you have enough time, you can repeat a value. Especially in the case of version 4. But in reality this is not the case. If you were to generate one billion UUIDs per second for 100 years, then the chance of one of the values repeating would be about 50%. This is in view of the fact that the PRNG provides a sufficient amount of entropy (true randomness), otherwise the probability of a double will be higher. A more illustrative example: if you generated 10 trillion UUIDs, the probability of two identical values appearing is 0.00000006%.
And in the case of version 1, the clock will be reset to zero only in 3603. So if you don't plan to keep your service running until 1583, then you're safe.
However, the probability of the appearance of a double remains, and in some systems they try to take this into account. But in the vast majority of cases, UUIDs can be considered completely unique. If you need more proof, here is a simple visualization of the collision probability in practice.