Recently I noticed that several of my projects require me to write a lot of bitwise operations in PHP. This is an interesting and useful skill that comes in handy for everything from reading binary files to emulating processors.
PHP has many tools to help you manipulate binary data, but I want to warn you right away: if you want super low-level efficiency, then this language is not for you.
And now to business! In this article I will cover bitwise operations and binary and hexadecimal processing, all of which will be useful in ANY language.
- Why PHP may not be the best candidate
- A quick introduction to binary and hexadecimal representation of data
- Carry operations
- Data representation in computer memory
- Arithmetic overflows
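- Binary numbers and strings in PHP
- Binary: what should be used in PHP, numbers or strings?
- Debugging binary values in PHP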
Why PHP may not be the best candidate
I love PHP, don't get me wrong. And I'm sure this language will work just fine in most cases. But if you need maximum efficiency in processing binary data, then PHP will not do it.
Let me explain: I'm not talking about the fact that the application can consume five or ten megabytes more, but about allocating a specific amount of memory to store data of a certain type.
According to the official documentation on integers, PHP represents decimal, hexadecimal, octal, and binary values using a single integer type. So it doesn't matter which notation you use for the data, it will always be stored as an integer.
You probably already know about ZVAL: it is a C structure that represents each PHP variable. Its value is kept in a union with a field named lval, of type zend_long, that holds every integer. The size of zend_long depends on the platform: on 64-bit platforms it is a 64-bit number, and on 32-bit platforms it is a 32-bit number.
// zval stores every integer as an lval
typedef union _zend_value {
    zend_long lval;
    // ...
} zend_value;

// zend_long (the type of lval) is a 32-bit or 64-bit integer
#ifdef ZEND_ENABLE_ZVAL_LONG64
typedef int64_t zend_long;
// ...
#else
typedef int32_t zend_long;
// ...
#endif
The bottom line is this: it doesn't matter if you need to store 0xff, 0xffff, 0xffffff, or something else. In PHP, all these values will be stored in lval as a 32-bit or 64-bit integer.
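A quick way to see this from userland, as a minimal sketch: the variable names are mine, and the reported width depends on the PHP build you run it on.

```php
<?php
// Every integer literal ends up as the same native integer type,
// no matter how small the value is.
$values = [0xff, 0xffff, 0xffffff];

foreach ($values as $value) {
    // gettype() reports "integer" for each of them
    printf("%8d is an %s\n", $value, gettype($value));
}

// PHP_INT_SIZE is the size of an integer in bytes on this build:
// 8 on 64-bit builds, 4 on 32-bit ones.
$bits = PHP_INT_SIZE * 8;
echo "Integers are {$bits} bits wide here\n";
```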
For example, I recently experimented with emulating microcontrollers. While it was necessary to handle memory contents and operations correctly, I didn't need much memory efficiency because my host machine compensates for the overhead by orders of magnitude.
Of course, everything changes when we talk about C-extensions or FFI, but this is not my goal either. I am talking about pure PHP.
So remember: it works and can behave the way you want it to, but in most cases the integer type will use more memory than the data actually needs.
A quick introduction to binary and hexadecimal representation of data
Before talking about how PHP handles binary data, we should first talk about what binary is. If you think you already know all about this, skip to the "Binary numbers and strings in PHP" section.
In mathematics, there is the concept of a "base". It defines how we represent quantities in different formats. People usually use the decimal base (base 10), which allows us to represent any number with the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
To clarify the next example, I will refer to the number 20 as "Decimal 20".
Binary numbers (base 2) can represent any number, but only using two digits: 0 and 1.
Decimal 20 in binary looks like this: 0b00010100. You don't need to do the conversion yourself, let computers do it. ;)
Hexadecimal numbers (base 16) can represent any number using the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9, plus six letters from the Latin alphabet: a, b, c, d, e and f.
Decimal 20 in hexadecimal form looks like this: 0x14. Leave the transformation to computers, they are experts in this!
It is important to understand that numbers can be represented in different bases: binary (base 2), octal (base 8), decimal (base 10, our usual), and hexadecimal (base 16).
In PHP and many other languages, binary numbers are written like any other number but prefixed with 0b: decimal 20 looks like 0b00010100. Hexadecimal numbers are prefixed with 0x: decimal 20 looks like 0x14.
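Here is a small sketch of those notations in PHP itself. The variable names are only illustrative; decbin(), dechex(), bindec() and hexdec() are standard PHP functions.

```php
<?php
// The same quantity written in four notations
$binary  = 0b00010100;
$octal   = 024;
$decimal = 20;
$hex     = 0x14;

var_dump($binary === $decimal); // bool(true)
var_dump($octal === $decimal);  // bool(true)
var_dump($hex === $decimal);    // bool(true)

// Converting between textual representations
$asBinary = decbin(20);      // "10100"
$asHex    = dechex(20);      // "14"
$fromBin  = bindec('10100'); // 20
$fromHex  = hexdec('14');    // 20
```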
As you might already know, computers do not store data literally. Everything is represented as binary numbers, zeros and ones. Symbols, numbers, letters, instructions: everything is stored in base 2. Letters are just numbers mapped to characters by convention. For example, the letter "a" is number 97 in the ASCII table.
But while everything is stored in binary, programmers find it most comfortable to read the data in hexadecimal format. It simply looks better. Just look:
# string "abc"
'abc'
# binary form (bleh)
0b01100001 0b01100010 0b01100011
# hexadecimal form (such wow)
0x61 0x62 0x63
Binary notation takes up a lot of space visually, while hexadecimal is compact and still maps directly onto the binary representation: each hexadecimal digit stands for exactly four bits. That is why we usually use hexadecimal in low-level programming.
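The same comparison can be reproduced in PHP. This is a minimal sketch with illustrative variable names, using the built-in bin2hex(), str_split(), ord() and sprintf() functions.

```php
<?php
$str = 'abc';

// bin2hex() returns two hexadecimal digits per byte
$hex = bin2hex($str); // "616263"

// Build the binary form byte by byte
$bits = [];
foreach (str_split($str) as $char) {
    $bits[] = sprintf('%08b', ord($char));
}
$binary = implode(' ', $bits); // "01100001 01100010 01100011"

// One hexadecimal digit always covers exactly four bits
$nibbles = sprintf('%04b %04b', 0x6, 0x1); // "0110 0001", i.e. 0x61

echo $hex, "\n", $binary, "\n", $nibbles, "\n";
```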
Carry operations
You are already familiar with the concept of a carry, but I want to draw attention to it because we will rely on it later.
In the decimal system, we have ten distinct digits to represent numbers, from 0 to 9. But when we try to represent a number greater than nine, we run out of digits! This is where the carry operation comes in: we prefix the number with the digit 1 and reset the rightmost digit to 0.
# decimal (base 10)
1 + 1 = 2
2 + 2 = 4
9 + 1 = 10 // <- Carry
The binary base behaves the same way, only it is limited to the digits 0 and 1.
# binary (base 2)
0 + 0 = 0
0 + 1 = 1
1 + 1 = 10 // <- Carry
1 + 10 = 11
It's the same with hexadecimal base, only it has a much wider range.
# hexadecimal (base 16)
1 + 9 = a // no carry, a is in range
1 + a = b
1 + f = 10 // <- Carry
1 + 10 = 11
As you can see, a carry requires more digits to represent certain numbers. This helps us understand how limited certain data types are and, since they are stored in computers, how limited their binary representation is.
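If you want to check these sums yourself, sprintf() can print a result in base 2 or base 16. This is only a sketch and the variable names are mine.

```php
<?php
// 1 + 1 in binary needs a second digit
$binarySum = sprintf('%b', 0b1 + 0b1); // "10"

// 1 + 9 in hexadecimal still fits in one digit
$hexNoCarry = sprintf('%x', 0x1 + 0x9); // "a"

// 1 + f in hexadecimal needs a second digit
$hexCarry = sprintf('%x', 0x1 + 0xf); // "10"

echo $binarySum, ' ', $hexNoCarry, ' ', $hexCarry, "\n";
```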
Data representation in computer memory
As I mentioned above, computers store everything in binary format. That is, they only contain zeros and ones in memory.
The easiest way to visualize this is as a large table with one row and many columns (as many as the memory capacity allows). Each column is a binary digit (bit). The representation of our decimal 20 in such a table using 8 bits looks like this:
Position (address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Bit | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
An unsigned 8-bit integer is a number that can be represented using at most 8 binary digits. That is, 0b11111111 (255 decimal) is the largest unsigned 8-bit number. Adding 1 to it requires a carry, and the result can no longer be represented using the same number of digits.
Knowing this, we can easily see why there are so many in-memory representations for numbers and what they are: uint8 is an unsigned 8-bit integer (decimal 0-255), uint16 is an unsigned 16-bit integer (decimal 0-65535). There are also uint32, uint64 and, in theory, wider ones.
Signed integers, which can represent negative values, typically use the first (most significant) bit to indicate whether they are positive (first bit = 0) or negative (first bit = 1). As you can imagine, this leaves fewer bits for the magnitude, so the largest value they can store in the same amount of memory is smaller. A signed 8-bit integer ranges from decimal -128 to decimal 127.
Here is decimal -20, represented as a signed 8-bit integer in two's complement, the encoding used by virtually all modern hardware. Note that the first bit is set (address 0, value 1), which means the number is negative.
Position (address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Bit | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
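To make the signed interpretation concrete, here is a minimal sketch that reads 8 bits as a signed value. It assumes two's complement encoding (what PHP and mainstream CPUs use), and toSigned8() is a hypothetical helper of mine, not a PHP built-in.

```php
<?php
// Hypothetical helper: reinterpret the low 8 bits of $byte
// as a two's complement signed integer.
function toSigned8(int $byte): int
{
    $byte &= 0xFF; // keep only 8 bits
    // If the most significant bit is set, the value is negative
    return $byte >= 0x80 ? $byte - 0x100 : $byte;
}

$positive = toSigned8(0b00010100); // 20
$negative = toSigned8(0b11101100); // -20
$minimum  = toSigned8(0b10000000); // -128
$maximum  = toSigned8(0b01111111); // 127

echo "$positive $negative $minimum $maximum\n";
```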
I hope everything is clear so far. This introduction is essential for understanding the inner workings of computers. Keep this in mind, and then you will always understand how PHP works under the hood.
Arithmetic overflows
The chosen number representation (8-bit, 16-bit) determines the minimum and maximum values of the range. It all comes down to how numbers are stored in memory: adding 1 to the binary digit 1 produces a carry, that is, you need another bit in front of the current number. Since the integer format has a strictly defined width, we cannot rely on a carry beyond its bounds (it is actually possible, but pretty crazy).
Position (address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Bit | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
Here we are very close to the 8-bit limit (255 decimal). If we add one, we get decimal 255 in binary:
Position (address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Bit | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
All bits are set! Adding 1 would require a carry, which is not possible because we have run out of bits: all 8 are already set! This situation is called overflow: we go beyond the limit of the representation. The binary operation 255 + 2 should give an 8-bit result of 1.
Position (address) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Bit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
This behavior is not accidental, the new value is calculated using certain rules, which we will not consider here.
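PHP integers are much wider than 8 bits, so this wrap-around does not happen on its own. One way to imitate it, as a sketch only, is to mask the result down to 8 bits. addU8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: add two numbers the way an unsigned
// 8-bit register would, discarding any carry out of the 8th bit.
function addU8(int $a, int $b): int
{
    return ($a + $b) & 0xFF;
}

$max     = addU8(254, 1); // 255
$wrapped = addU8(255, 1); // 0
$result  = addU8(255, 2); // 1

printf("%08b %08b %08b\n", $max, $wrapped, $result);
```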
Binary numbers and strings in PHP
Back to PHP! Sorry for this big digression, but I think it's important.
I hope the pieces of the puzzle are coming together in your head: binary numbers, how they are stored, what overflow is, how PHP represents numbers...
Decimal 20, represented in PHP as an integer, can have two different representations depending on the platform. On x86 it is a 32-bit value and on x64 it is a 64-bit value, but in both cases it is signed (that is, the value can be negative). We know that decimal 20 fits into 8 bits, but PHP treats any integer as 32 or 64 bits.
PHP also has binary strings, which can be converted to and from numbers using the pack() and unpack() functions.
In PHP, the main difference between binary strings and numbers is that strings simply hold data, like a buffer. Integers (binary or otherwise) let you perform arithmetic on them, as well as bitwise operations such as AND, OR, XOR and NOT.
Binary: what should be used in PHP, numbers or strings?
We usually use binary strings to transport data. Therefore, reading a binary file or networking requires packing and unpacking binary strings.
However, actual operations such as OR and XOR cannot be performed reliably with strings, so you need to use numbers.
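A minimal sketch of that division of labour: strings for transport, integers for the actual bit work. The variable names are mine, and the 'n' format code (unsigned 16-bit, big-endian) is just one of the codes pack() supports.

```php
<?php
// Pack the integer 0x1234 as an unsigned 16-bit big-endian value
$packed = pack('n', 0x1234);

$length = strlen($packed);  // 2 bytes
$hex    = bin2hex($packed); // "1234"

// Unpack the string back into an integer to do bitwise work on it
$unpacked = unpack('n', $packed)[1]; // 0x1234
$lowByte  = $unpacked & 0x00FF;      // 0x34

// Bitwise operators applied to two strings work on the bytes of
// the strings, not on their numeric value
$stringOr = '12' | '3'; // a string, not the number 15
$intOr    = 12 | 3;     // 15

var_dump($stringOr, $intOr);
```

The last two lines are the reason to convert to integers first: the string version silently gives you another string.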
Debugging binary values in PHP
Now let's have some fun and play around with some PHP code!
First, I'll show you how to visualize data. We must understand what we are dealing with.
Debugging integers is very, very easy: we can use the sprintf() function. It has very powerful formatting and helps us quickly understand what values we are working with.
Let's represent decimal 20 in 8-bit binary and 1-byte hexadecimal:
<?php
// Decimal 20
$n = 20;
echo sprintf('%08b', $n) . "\n";
echo sprintf('%02X', $n) . "\n";
// Output:
00010100
14
The format %08b outputs the variable $n in binary representation (b) with eight digits (08).
The format %02X outputs the variable $n in hexadecimal notation (X) with two digits (02).
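The same format codes scale to wider values by changing the padding width. A short sketch with illustrative variable names:

```php
<?php
$n = 0x1234;

// 16 binary digits, zero padded
$bin16 = sprintf('%016b', $n); // "0001001000110100"

// 4 hexadecimal digits, upper case (X) or lower case (x)
$hex4  = sprintf('%04X', $n);      // "1234"
$hexLc = sprintf('%04x', 0xBEEF);  // "beef"

// Literal prefixes can be part of the format string
$prefixed = sprintf('0x%02X', 20); // "0x14"

echo $bin16, ' ', $hex4, ' ', $hexLc, ' ', $prefixed, "\n";
```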
Visualizing Binary Strings
Although integers in PHP are always 32 or 64 bits long, the length of a string is equal to the length of its contents. To decode the binary values in a string and render them, we need to examine and transform each byte.
Fortunately, PHP strings can be indexed like arrays, and each position points to a 1-byte character. Here's an example of accessing characters:
<?php
$str = 'thephp.website';
echo $str[3];
echo $str[4];
echo $str[5];
// Outputs:
php
Assuming one character is 1 byte, we can call ord() to convert it to a 1-byte integer:
<?php
$str = 'thephp.website';
$f = ord($str[3]);
$s = ord($str[4]);
$t = ord($str[5]);
echo sprintf(
'%02X %02X %02X',
$f,
$s,
$t,
);
// Outputs:
70 68 70
Now you can double check with the hexdump command line application:
$ echo 'php' | hexdump
// Outputs
0000000 70 68 70 ...
The first column contains only the address, and in the second column we see the hexadecimal values representing the characters p, h and p.
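Doing this one character at a time gets tedious, so a small loop helps. The sketch below wraps it in hexBytes(), a hypothetical helper name of my own choosing.

```php
<?php
// Hypothetical helper: render every byte of a string as two hex digits
function hexBytes(string $data): string
{
    $bytes = [];
    for ($i = 0, $len = strlen($data); $i < $len; $i++) {
        $bytes[] = sprintf('%02X', ord($data[$i]));
    }
    return implode(' ', $bytes);
}

$dump        = hexBytes('php');   // "70 68 70"
$withNewline = hexBytes("php\n"); // "70 68 70 0A"

echo $dump, "\n", $withNewline, "\n";
```

The second call shows the trailing newline byte that echo adds on the command line.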
When handling binary strings, we can also use the pack() and unpack() functions, and I have a great example for you! Let's say you need to read a JPEG file to extract some data (like EXIF). Using the binary read mode, you can open a file handle and immediately read the first two bytes:
<?php
$h = fopen('file.jpeg', 'rb');
// Read 2 bytes
$soi = fread($h, 2);
To extract the values into an integer array, you can simply unpack them:
$ints = unpack('C*', $soi);
var_dump($ints);
// Outputs
array(2) {
[1] => int(255)
[2] => int(216)
}
echo sprintf('%02X', $ints[1]);
echo sprintf('%02X', $ints[2]);
// Outputs
FFD8
Note that the C format in the unpack() function converts the characters of the string $soi into unsigned 8-bit numbers. The * modifier unpacks the entire string.
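If you would rather compare the two bytes as a single number, the 'n' format reads them as one unsigned 16-bit big-endian integer. This is a sketch: $soi is hard-coded here as a stand-in for the bytes that fread() returns in the example above.

```php
<?php
// Stand-in for the first two bytes of a JPEG file
$soi = "\xFF\xD8";

// 'n' reads an unsigned 16-bit big-endian integer
$marker = unpack('n', $soi)[1];

$isJpeg = ($marker === 0xFFD8);
printf("%04X %s\n", $marker, $isJpeg ? 'looks like a JPEG' : 'not a JPEG');

// 'C*' gives one unsigned integer per byte instead
$bytes = unpack('C*', $soi); // [1 => 255, 2 => 216]
```

Note that unpack() returns an array indexed from 1, not 0.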
Bitwise operations
PHP implements all the bitwise operations you might need. They are built in as expressions, and the result of their work is described below:
PHP code | Name | Description |
$x | $y | Inclusive OR | Bits that are set in either $x or $y are set in the result. |
$x ^ $y | Exclusive OR | Bits that are set in $x or in $y, but not in both, are set in the result. |
$x & $y | AND | Bits that are set in both $x and $y are set in the result. |
~$x | NOT | Every bit of $x is inverted. |
$x << $y | Left SHIFT | Shifts the bits of $x left by $y positions. |
$x >> $y | Right SHIFT | Shifts the bits of $x right by $y positions. |
I'll explain how each one works!
Let $x = 0x20 and $y = 0x30. Below I will show examples using binary notation.
How Inclusive OR ($x | $y) works
The inclusive OR operation keeps every bit that is set in either input. That is, $x | $y should return 0x30. Take a look:
// 1 | 1 = 1
// 1 | 0 = 1
// 0 | 0 = 0
0b00100000 // $x = 0x20
0b00110000 // $y = 0x30
OR ------- // $x | $y
0b00110000 // 0x30
Note: counting from right to left, the sixth bit of $x is set, as are the fifth and sixth bits of $y. The two inputs are merged, producing a value with the fifth and sixth bits set: 0x30.
How Exclusive OR ($x ^ $y) works
The exclusive OR operation (also known as XOR) keeps only the bits that are set on exactly one side. That is, the result of $x ^ $y will be 0x10:
// 1 ^ 1 = 0
// 1 ^ 0 = 1
// 0 ^ 0 = 0
0b00100000 // $x = 0x20
0b00110000 // $y = 0x30
XOR ------ // $x ^ $y
0b00010000 // 0x10
How AND ($x & $y) works
The AND operator is much easier to understand. It applies an AND operation to each pair of bits, so only the bits that are set on both sides are kept. The result of $x & $y will be 0x20:
// 1 & 1 = 1
// 1 & 0 = 0
// 0 & 0 = 0
0b00100000 // $x = 0x20
0b00110000 // $y = 0x30
AND ------ // $x & $y
0b00100000 // 0x20
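The three diagrams so far can be checked directly in PHP. A minimal sketch using the same $x and $y; the other variable names are mine.

```php
<?php
$x = 0x20;
$y = 0x30;

$or  = $x | $y; // 0x30
$xor = $x ^ $y; // 0x10
$and = $x & $y; // 0x20

// Print each result in binary and hexadecimal
foreach (['OR' => $or, 'XOR' => $xor, 'AND' => $and] as $name => $value) {
    printf("%-3s 0b%08b (0x%02X)\n", $name, $value, $value);
}
```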
How NOT (~$x) works
The NOT operation takes a single operand and simply inverts all of its bits. It turns every 0 into 1 and every 1 into 0:
// ~1 = 0
// ~0 = 1
0b00100000 // $x = 0x20
NOT ------ // ~$x
0b11011111 // 0xDF
If you performed this operation in PHP and debugged it with sprintf(), you probably noticed a much wider number. In the Normalizing numbers section, I'll explain what's going on here and how to fix it.
How Left SHIFT and Right SHIFT work ($x << $n and $x >> $n)
Bit shifting is similar to multiplying or dividing a number by a power of two. All bits move $n positions to the left or to the right.
Let's take a small binary number to make it easier to show, for example $x = 0b0010. If we shift $x left once, its single set bit must move one position to the left:
$x = 0b0010;
$x = $x << 1;
// 0b0100
The same applies to shifting to the right:
$x = 0b0100;
$x = $x >> 2;
// 0b0001
That is, shifting a number $n times to the left is equivalent to multiplying it by two $n times, and shifting it $n times to the right is equivalent to dividing it by two $n times.
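A few concrete numbers make the equivalence easier to trust. This is a sketch with illustrative variable names; intdiv() is PHP's built-in integer division.

```php
<?php
$x = 5;

$left  = $x << 3; // 5 * 2 * 2 * 2 = 40
$right = 40 >> 3; // 40 / 2 / 2 / 2 = 5

// Right shifts discard the bits that fall off the end,
// so for positive numbers the result matches an integer division
$truncated = 7 >> 1;       // 3
$sameAsDiv = intdiv(7, 2); // 3

// PHP's right shift is arithmetic: the sign is preserved
$negative = -8 >> 1; // -4

echo "$left $right $truncated $sameAsDiv $negative\n";
```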
What is a bit mask
A lot of interesting things can be done with these operations and other techniques. For example, you can apply a bit mask. This is an arbitrary binary number of your choice, created to extract very specific information.
For example, take the idea that an 8-bit signed number is positive if its eighth bit, counting from the right, is not set (0) and negative if that bit is set. Is 0x20 positive or negative? What about 0x81?
To answer this, we can create a very convenient byte with only the sign bit set (0b10000000, equivalent to 0x80) and AND it with 0x20. If the result is 0x80 (0b10000000, our mask), then the number is negative, otherwise it is positive:
// 0x80 === 0b10000000 (bitmask)
// 0x20 === 0b00100000
// 0x81 === 0b10000001
// Parentheses are required: === binds tighter than & in PHP
(0x20 & 0x80) === 0x80 // false
(0x81 & 0x80) === 0x80 // true
This technique is common when working with flags. You can even find examples of it in PHP itself, such as the error reporting flags.
You can choose what kind of errors will be generated:
error_reporting(E_WARNING | E_NOTICE);
What's going on here? Just look at the values:
0b00000010 (0x02) E_WARNING
0b00001000 (0x08) E_NOTICE
OR -------
0b00001010 (0x0A)
When PHP encounters a notice that could be reported, it checks something like this:
// error reporting we set before
$e_level = 0x0A;
// Needs to throw a notice; the parentheses are required
// because === binds tighter than & in PHP
if (($e_level & E_NOTICE) === E_NOTICE) {
    // Flag is set: throws notice
}
And you will see it everywhere! Binaries, processors, all sorts of low-level stuff!
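You can use the same pattern for your own flags. The sketch below uses made-up permission constants (FLAG_READ, FLAG_WRITE and FLAG_EXECUTE are hypothetical, not part of PHP) to show setting, testing, clearing and toggling bits.

```php
<?php
// Hypothetical permission flags, one bit each
const FLAG_READ    = 0b001;
const FLAG_WRITE   = 0b010;
const FLAG_EXECUTE = 0b100;

// Combine flags with OR
$permissions = FLAG_READ | FLAG_WRITE; // 0b011

// Test a flag with AND; the parentheses matter because
// === binds tighter than & in PHP
$canWrite   = ($permissions & FLAG_WRITE) === FLAG_WRITE;     // true
$canExecute = ($permissions & FLAG_EXECUTE) === FLAG_EXECUTE; // false

// Clear a flag by ANDing with its inverse
$readOnly = $permissions & ~FLAG_WRITE; // 0b001

// Toggle a flag with XOR
$toggled = $permissions ^ FLAG_EXECUTE; // 0b111

printf("%03b %03b %03b\n", $permissions, $readOnly, $toggled);
```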
Normalizing numbers
PHP has one peculiarity related to the handling of binary numbers: integers are 32 or 64 bits in size. This means that we often need to normalize them in order to trust our calculations.
For example, executing this operation on a 64-bit machine will give a strange (but expected) result:
echo sprintf(
'0b%08b',
~0x20
);
// Expected
0b11011111
// Actual
0b1111111111111111111111111111111111111111111111111111111111011111
What happened here? The NOT operation on what we thought of as an 8-bit integer (0x20) turned all zero bits into ones. Guess where else we had zeros? That's right, in all the other 56 bits on the left, which we had ignored until now!
Again, the reason is that in PHP the length of integers is 32 or 64 bits, regardless of their value!
However, the code works as expected. For example, the expression (~0x20 & 0b11011111) === 0b11011111 evaluates to bool(true). But do not forget that those bits on the left are still there, otherwise you will get strange behavior in your code.
To solve this problem, you can normalize the numbers by applying a bit mask that clears all the unwanted bits. For example, to normalize ~0x20 to an 8-bit integer, AND it with 0xFF (0b11111111) so that all 56 upper bits become zeros.
~0x20 & 0xFF
-> 0b11011111
Attention! Do not forget about what is in your variables, otherwise you will get unexpected behavior. For example, let's take a look at what happens when we shift the above value to the right without an 8-bit mask:
~0x20 & 0xFF
-> 0b11011111
0b11011111 >> 2
-> 0b00110111 // expected
(~0x20 & 0xFF) >> 2
-> 0b00110111 // expected
(~0x20 >> 2) & 0xFF
-> 0b11110111 // expected?
Let me explain: from a PHP point of view, this is expected, because you are explicitly processing a 64-bit number. You need to understand what YOUR program is expecting.
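One way to make the intent explicit is to wrap the mask in small helpers and normalize after every step. This is only a sketch; u8() and not8() are hypothetical names, not PHP built-ins.

```php
<?php
// Hypothetical helper: keep only the low 8 bits of a value
function u8(int $value): int
{
    return $value & 0xFF;
}

// Hypothetical helper: 8-bit NOT
function not8(int $value): int
{
    return u8(~$value);
}

$raw        = ~0x20;      // negative: all upper bits are set too
$normalized = not8(0x20); // 0b11011111

// Normalizing before the shift gives the 8-bit result
$shifted = u8($normalized >> 2); // 0b00110111

// Masking only after the shift lets the upper ones leak in
$tooLate = u8(~0x20 >> 2); // 0b11110111

printf("0b%08b 0b%08b 0b%08b\n", $normalized, $shifted, $tooLate);
```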
Tip: Avoid these silly mistakes by programming in the TDD paradigm.
Conclusion: Binary and PHP are cool
Once you are armed with these tools, everything else comes down to finding the right documentation on how a binary format or protocol behaves. After all, everything is a binary sequence.
I highly recommend reading the PDF or EXIF specs. You might even want to experiment with your own implementation of the MessagePack serialization format, or Avro, Protobuf... The possibilities are endless!