Protobuf vs Avro: how to choose?

This article lists the features of two popular serialization formats that an architect should consider when choosing between them.

Size and speed

Comparative benchmarks of serialization formats are easy to find online. Do not attach much importance to the specific numbers: serialization and deserialization speed, as well as the size of the resulting binary data, depend on the particular data schema and on the serializer implementation. We only note that Avro and Protobuf hold leading positions in such benchmarks.

An advantage of Avro is that record fields are written one after another, without separators. The price is that, when working with Avro, you need to store the schema of the written data somewhere. It can be attached to the serialized data, or it can be kept in external storage, in which case a schema identifier is added to the data.
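To make the "no separators" point concrete, here is a minimal hand-rolled sketch of Avro-style binary encoding for a record with one long field and one string field. It does not use the Avro library. The helper names (`encode_long`, `encode_string`, `encode_record`) and the two-field record are assumptions made for illustration.

```python
def encode_long(n: int) -> bytes:
    """Avro-style long: zigzag-map the sign, then emit 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag mapping for values in the 64-bit range
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # continuation bit: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro-style string: byte length written as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(a: int, b: str) -> bytes:
    """Fields are concatenated in schema order: no names, tags, or separators."""
    return encode_long(a) + encode_string(b)


payload = encode_record(27, "foo")
print(payload.hex(" "))  # 36 06 66 6f 6f
```

Five bytes carry both values. A reader can split them back into fields only if it knows the schema, which is why the schema has to travel with the data or be retrievable by identifier.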

Protobuf's trick is that integers are serialized by default in a variable-length format (varint), which takes less space for small positive numbers. On the other hand, Protobuf writes the field number and type into the binary stream, which increases the total size. Also, if a message includes fields of a record type (a nested message, in Protobuf terminology), the final size of the nested record has to be computed first, which complicates the serialization algorithm and takes additional time.
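The three points above (varint, the field tag, and the length prefix of a nested message) can be sketched by hand in a few lines. This is an illustrative sketch, not the Protobuf library. The function names are hypothetical, and only non-negative values are handled.

```python
WIRE_TYPE_VARINT = 0  # wire type used for int32, int64, bool, enum, ...
WIRE_TYPE_LEN = 2     # wire type used for strings, bytes, nested messages


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit means 'more follows'."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Tag (field number and wire type packed together), then the value."""
    tag = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(tag) + encode_varint(value)


def encode_nested_field(field_number: int, payload: bytes) -> bytes:
    """A nested message is length-prefixed, so its bytes must be ready first."""
    tag = (field_number << 3) | WIRE_TYPE_LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


inner = encode_int_field(1, 150)
print(inner.hex(" "))                          # 08 96 01
print(encode_nested_field(3, inner).hex(" "))  # 1a 03 08 96 01
```

The value 150 costs two bytes plus one byte of tag. The outer message cannot be written until the size of `inner` is known, which is the extra pass the paragraph refers to.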

UPD: Avro also uses a variable-length format for integers, with positive and negative values alternating (zigzag encoding). Avro's int corresponds to Protobuf's sint32, and long corresponds to sint64.
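A short sketch of the zigzag mapping shows why it matters for negative numbers. The function names are assumptions for illustration. `varint_size` only counts how many 7-bit groups a value needs, and the "plain" case models a negative value sent as its 64-bit two's complement pattern, which is how Protobuf's plain int64 treats negatives.

```python
def zigzag_encode(n: int, bits: int = 64) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> (bits - 1))


def zigzag_decode(z: int) -> int:
    """Inverse mapping: even values are non-negative, odd values are negative."""
    return (z >> 1) ^ -(z & 1)


def varint_size(u: int) -> int:
    """Bytes needed to write a non-negative integer as a varint (7 bits per byte)."""
    return max(1, (u.bit_length() + 6) // 7)


for value in (0, -1, 1, -2, 2):
    print(value, "->", zigzag_encode(value))

# Without zigzag, -1 goes out as its 64-bit two's complement bit pattern.
plain = -1 & 0xFFFFFFFFFFFFFFFF
print(varint_size(plain))              # 10
print(varint_size(zigzag_encode(-1)))  # 1
```

With zigzag, a small negative number costs as little as a small positive one, so the "small positive numbers" caveat about varint applies to the plain integer types, not to Avro's int and long or Protobuf's sint32 and sint64.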

Overall, you will most likely be satisfied with the size and speed of both formats. In most cases this is not the factor that determines the choice.

UPD: Highly loaded systems or real-time data processing may be cases where it is worth looking at more specialized codecs (discussion thread).

Data types

Both formats support the following data types: bool, string, int32 (int), int64 (long), float, double, byte[]. Protobuf additionally supports uint32 and uint64.

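In Protobuf, as mentioned above, integers are encoded as varint by default. Additional integer types with other encodings are also available: sint32, sint64, fixed32, fixed64, sfixed32, sfixed64.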

(map). ( ).

(enumerations).

(records , message ) (union , oneof ).

, (nullable) , , union , null, - oneof .

UPD: nullable message . optional, , oneof. stackoverflow.

(logical types well known types ). (timestamp) (duration).

, decimal UUID. fixed - .

, decimal - , , .

(backward compatibility) -. , , , (0, , false). (aliases) (record, enum, fixed). , .

, ( int long, float double, ). , C++. bool , enum .

, , , , . (forward compatibility). 

.

enum, -, , - .

(case) (union) unknown , , .

. (ADT), , , , .

Json

, , Json. , , (, MongoDB). 

, , ( , , json_name ). (aliases) .

, ( bytes, fixed) UTF16 . (, .), Json , UTF16. base64.

Json , , , , , UTF16.

, , . , (, ), (, Schema Registry). , (statefullness), “” .

(, python), , , . , , , “ ”, . , Any, , , .

RPC

.

(one-way). (handshake), .

, (streaming) .

RPC - gRPC. , gRPC, -, , , . , , , , , , gRPC , , , , .

, , RPC, .

Kafka

. .

Hadoop

gRPC. , Hadoop - , elephant-bird .

.

https://github.com/apache/avro (1.7K , 1.1 )

https://github.com/protocolbuffers/protobuf (45K , 12.1 )



