A Beginner's Guide to Columnar File Formats in Spark and Hadoop

This article is a follow-up to the Spark and Hadoop File Format Guide and is a good starting point if you don't already know anything about big data file formats.

What is a "columnar file format"?

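By definition, columnar file formats store data by column rather than by row. CSV, TSV, JSON and Avro are row-based file formats. Parquet and ORC are columnar file formats.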

Take a dataset called "people". Its schema is simple:

{
  id: Integer,
  first_name: String,
  last_name: String,
  age: Integer,
  cool: Boolean,
  favorite_fruit: Array[String]
}

This dataset can be stored in several row-based formats, for example as a JSON file:

{"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19, "cool": true, "favorite_fruit": ["bananas", "apples"]}
{"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102, "cool": true, "favorite_fruit": null}

Here is the same data as CSV:

1, Matthew, Rathbone, 19, True, ['bananas', 'apples']
2, Joe, Bloggs, 102, True,

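In both formats, each line of the file holds one complete record. This is how people like to think about data: neat rows that are easy to scan, easy to edit in Excel, and easy to read in a terminal.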

To show the difference, here is a (made-up) columnar file format. Let's call it CCSV (Columnar CSV).

Each line of a CCSV file has the following layout:

Field Name/Field Type/Number of Characters:[data in csv format]

Here is the same data again, this time as CCSV:

ID/INT/3:1,2
FIRST_NAME/STRING/11:Matthew,Joe
LAST_NAME/STRING/15:Rathbone,Bloggs
AGE/INT/6:19,102
COOL/BOOL/3:1,1
FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]

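Notice that all the names, ages, and favorite fruits are stored next to each other rather than alongside the other fields of the same record. To keep each column section from growing too large, CCSV repeats this layout every 1000 records. A file of 10 000 records therefore contains 10 sections of grouped column data.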
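The CCSV layout is easy to reproduce in a few lines of Python. This is only a sketch of the made-up format from this article: the names `SCHEMA`, `encode_value` and `to_ccsv` are hypothetical, and writing a null array as `[]` and a boolean as `1`/`0` are assumptions taken from the sample above.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema description: (field name, CCSV type label).
SCHEMA: List[Tuple[str, str]] = [
    ("id", "INT"),
    ("first_name", "STRING"),
    ("last_name", "STRING"),
    ("age", "INT"),
    ("cool", "BOOL"),
    ("favorite_fruit", "ARRAY[STRING]"),
]

PEOPLE: List[Dict[str, Any]] = [
    {"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19,
     "cool": True, "favorite_fruit": ["bananas", "apples"]},
    {"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102,
     "cool": True, "favorite_fruit": None},
]


def encode_value(value: Any) -> str:
    """Render one field value the way the CCSV sample does."""
    if value is None:
        return "[]"  # assumption: a null array is written as an empty list
    if isinstance(value, bool):  # checked before anything else: bool is an int subclass
        return "1" if value else "0"
    if isinstance(value, list):
        return "[" + ",".join(str(v) for v in value) + "]"
    return str(value)


def to_ccsv(records: List[Dict[str, Any]],
            schema: List[Tuple[str, str]],
            section_size: int = 1000) -> List[str]:
    """Emit one line per column, repeating the pattern every `section_size` records."""
    lines = []
    for start in range(0, len(records), section_size):
        section = records[start:start + section_size]
        for name, type_name in schema:
            data = ",".join(encode_value(r[name]) for r in section)
            # Header: field name / field type / number of characters in the data.
            lines.append(f"{name.upper()}/{type_name}/{len(data)}:{data}")
    return lines


if __name__ == "__main__":
    print("\n".join(to_ccsv(PEOPLE, SCHEMA)))
```

Run on the two sample records, this produces the six CCSV lines shown above. With 2500 records and the default section size it would produce three sections of six lines each.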

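Can a person easily read a CCSV file? Somewhat, but it would be hard to get a coherent view of the data by opening it in Excel. CCSV does, however, have a couple of properties that make it a better fit for machines.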

Suppose we want to run a SQL query against this data, for example:

SELECT COUNT(1) FROM people WHERE last_name = 'Rathbone'

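With a regular CSV file, the SQL engine has to scan every row, parse each column, extract the last_name value, and then count every Rathbone it finds.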

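With CCSV, the SQL engine can skip the first two fields and scan only the third line, which holds every last_name value.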

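Why is that good? The SQL engine now processes only about 1/6 of the data, so CCSV delivers a (purely theoretical) 600% performance improvement over a regular CSV file.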
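The 1/6 figure can be checked with a small Python sketch that counts how many field values each layout has to touch to answer the query. The function names are hypothetical, and the CSV sample joins the fruit list with ";" (an assumption, made only so that a plain CSV parser can read it).

```python
import csv
import io

# Row layout: the sample data, with the fruit list joined by ";" (assumption).
ROW_CSV = (
    "1,Matthew,Rathbone,19,True,bananas;apples\n"
    "2,Joe,Bloggs,102,True,\n"
)

# Column layout: the CCSV sample from this article.
CCSV = (
    "ID/INT/3:1,2\n"
    "FIRST_NAME/STRING/11:Matthew,Joe\n"
    "LAST_NAME/STRING/15:Rathbone,Bloggs\n"
    "AGE/INT/6:19,102\n"
    "COOL/BOOL/3:1,1\n"
    "FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]\n"
)


def count_in_csv(text, column_index, wanted):
    """Row layout: every row is parsed in full just to reach one field."""
    matches = 0
    fields_parsed = 0
    for row in csv.reader(io.StringIO(text)):
        fields_parsed += len(row)
        if row[column_index] == wanted:
            matches += 1
    return matches, fields_parsed


def count_in_ccsv(text, column_name, wanted):
    """Column layout: only the values of the requested column are parsed."""
    matches = 0
    fields_parsed = 0
    for line in text.splitlines():
        header, _, data = line.partition(":")
        name = header.split("/", 1)[0]
        if name != column_name:
            # On disk, a reader could use the character count in the header
            # to seek past this column without reading its data at all.
            continue
        values = data.split(",")
        fields_parsed += len(values)
        matches += values.count(wanted)
    return matches, fields_parsed


if __name__ == "__main__":
    print(count_in_csv(ROW_CSV, 2, "Rathbone"))
    print(count_in_ccsv(CCSV, "LAST_NAME", "Rathbone"))
```

Both functions find one Rathbone, but the row reader parses 12 field values while the column reader parses 2, which is the 1/6 from the text. This counts parsed values only; it says nothing about real I/O or wall-clock time.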

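It is easy to see how a columnar format can save a great deal of processing power (and money) compared with plain JSON. This is the core value of columnar file formats.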

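Of course, CCSV would need a lot more work to become a viable file format, but those details are beyond the scope of this article.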

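Storing similar data together also helps compression. Many compression codecs (including GZIP and Snappy) achieve a higher compression ratio on sequences of similar data. When records are stored column by column, each column section often holds similar values, which makes it a good candidate for compression. In fact, each column can be compressed independently of the others.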
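One way to see the compression effect is to serialize the same records row by row and column by column, then compress both with Python's `zlib` (DEFLATE, the algorithm GZIP also uses). This is a rough experiment on synthetic data, not a benchmark: the record generator, the name lists and the function names are all made up for illustration.

```python
import random
import zlib


def make_records(n, seed=0):
    """Generate synthetic (id, first_name, last_name, age, cool) tuples."""
    rng = random.Random(seed)
    first_names = ["Matthew", "Joe", "Ana", "Li", "Omar"]
    last_names = ["Rathbone", "Bloggs", "Silva", "Chen", "Haddad"]
    return [
        (i, rng.choice(first_names), rng.choice(last_names),
         rng.randint(18, 90), rng.choice([0, 1]))
        for i in range(1, n + 1)
    ]


def row_layout(records):
    """One record per line, as in CSV."""
    return "\n".join(",".join(str(v) for v in rec) for rec in records).encode()


def column_layout(records):
    """One column per line, so values of the same field sit next to each other."""
    return "\n".join(",".join(str(v) for v in col) for col in zip(*records)).encode()


def compressed_size(data):
    return len(zlib.compress(data))


if __name__ == "__main__":
    records = make_records(10_000)
    rows, cols = row_layout(records), column_layout(records)
    print("raw bytes:       ", len(rows), len(cols))
    print("compressed bytes:", compressed_size(rows), compressed_size(cols))
```

Both layouts hold exactly the same characters in a different order, so their raw sizes are identical and any difference after compression comes from the ordering alone. The column layout usually comes out smaller on data like this, but the numbers depend on the data and the codec, so the sketch prints them rather than promising a ratio.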

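The main drawback of columnar formats is that reconstructing a complete record is slower, because it means reading a segment from every column, one by one. This is why columnar formats first took hold in analytics workloads rather than in Map/Reduce-style workloads, which operate on whole rows at a time.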
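The cost of rebuilding a record is visible even in the toy CCSV format: to return one person, a reader has to pick one value out of every column line. A minimal sketch, using only the scalar columns of the sample (the array column is left out to keep the comma splitting simple) and hypothetical helper names:

```python
# The scalar columns of the CCSV sample from this article.
CCSV_LINES = [
    "ID/INT/3:1,2",
    "FIRST_NAME/STRING/11:Matthew,Joe",
    "LAST_NAME/STRING/15:Rathbone,Bloggs",
    "AGE/INT/6:19,102",
    "COOL/BOOL/3:1,1",
]

# How each CCSV type label is turned back into a Python value.
CONVERTERS = {
    "INT": int,
    "STRING": str,
    "BOOL": lambda text: text == "1",
}


def parse_section(lines):
    """Map each field name to its type label and its list of raw values."""
    columns = {}
    for line in lines:
        header, _, data = line.partition(":")
        name, type_name, _length = header.split("/")
        columns[name.lower()] = (type_name, data.split(","))
    return columns


def read_record(columns, index):
    """Rebuild one full record: this touches every column, not just one."""
    return {
        name: CONVERTERS[type_name](values[index])
        for name, (type_name, values) in columns.items()
    }


if __name__ == "__main__":
    print(read_record(parse_section(CCSV_LINES), 1))
```

Reading record 1 visits all five columns to assemble Joe Bloggs. In a row format the same record is a single contiguous line.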

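Real columnar formats (for example, Parquet) reduce this cost with techniques such as splitting the file into "row groups" and keeping extensive metadata, although for very wide datasets (for example, 200+ columns) the slowdown can still be significant.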
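To give a feel for why row groups plus metadata help, here is a toy Python sketch in which each group of rows records the minimum and maximum of one column, so a query can skip whole groups without reading them. Parquet does keep per-column statistics in its metadata, but everything else here is an assumption for illustration: the `RowGroup` class and both function names are hypothetical, and this is not Parquet's actual on-disk structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    ages: List[int]   # the column data for this group of rows
    min_age: int      # metadata kept alongside the data
    max_age: int


def build_row_groups(ages: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(ages), group_size):
        chunk = ages[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def count_older_than(groups: List[RowGroup], threshold: int):
    """Count ages above `threshold`, skipping groups the metadata rules out."""
    matches = 0
    groups_scanned = 0
    for group in groups:
        if group.max_age <= threshold:
            continue  # nothing in this group can match, so its data is never read
        groups_scanned += 1
        matches += sum(1 for age in group.ages if age > threshold)
    return matches, groups_scanned


if __name__ == "__main__":
    groups = build_row_groups([19, 22, 25, 31, 35, 38, 64, 70, 102], group_size=3)
    print(count_older_than(groups, 60))
```

With these nine ages in three groups, counting people older than 60 scans one group and skips the other two. How much this helps in practice depends on how the data is ordered when the file is written.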

Now that you know roughly what to expect, let's take a quick look at a real columnar file format: Parquet.

Parquet, Parquet-. , , . , , - , ( , , !).

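For analytical queries in Spark, Spark SQL, Hive, and Impala, columnar formats can be 100 or even 1000 times faster than row-based formats such as JSON and Avro, depending on the query and the data.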

Questions or feedback about this article, or about big data in general? Email me at matthew (at) rathbonelabs (dot com).
