This article is a follow-up to the Spark and Hadoop File Format Guide, which is a good starting point if you don't already know anything about big data file formats.
What is a "columnar file format"?
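By definition, columnar file formats store data by column rather than by row. CSV, TSV, JSON, and Avro are traditional row-based file formats. Parquet and ORC are columnar file formats.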
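To illustrate the difference between the two, let's use some sample data.
Here is the schema for a "people" dataset. It is fairly simple: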
{ id: Integer, first_name: String, last_name: String, age: Integer, cool: Boolean, favorite_fruit: Array[String] }
This dataset can be represented in several formats, for example as a flat JSON file:
{"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19, "cool": true, "favorite_fruit": ["bananas", "apples"]}
{"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102, "cool": true, "favorite_fruit": null}
Here is the same data in CSV format:
1, Matthew, Rathbone, 19, True, ['bananas', 'apples']
2, Joe, Bloggs, 102, True,
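In both examples, each line of the file holds one complete record. This is how people like to think about data: neat rows that are easy to scan, easy to edit in Excel, and easy to read in a text editor.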
For demonstration purposes, I will introduce a new columnar (pseudo) file format. Let's call it CCSV (Columnar CSV).
Each line of a CCSV file has the following form:
Field Name/Field Type/Number of Characters:[data in csv format]
Here is the same data in CCSV:
ID/INT/3:1,2
FIRST_NAME/STRING/11:Matthew,Joe
LAST_NAME/STRING/15:Rathbone,Bloggs
AGE/INT/6:19,102
COOL/BOOL/3:1,1
FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]
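Notice that all the names, ages, and favorite fruits are stored next to one another rather than alongside the other fields of the same record. To keep each column section from growing too large, CCSV repeats this layout every 1000 records. A file with 10 000 records therefore contains 10 sections of grouped column data.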
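As a rough illustration of the layout described above, here is a minimal Python sketch that turns row-oriented records into CCSV lines. The names `SCHEMA`, `render`, and `to_ccsv` are hypothetical, and writing a null array as `[]` is an assumption taken from the sample output rather than a stated rule of the format.

```python
# Hypothetical schema: (field name, CCSV type label) pairs, in file order.
SCHEMA = [
    ("id", "INT"),
    ("first_name", "STRING"),
    ("last_name", "STRING"),
    ("age", "INT"),
    ("cool", "BOOL"),
    ("favorite_fruit", "ARRAY[STRING]"),
]


def render(value, field_type):
    """Render one value the way the CCSV sample writes it."""
    if field_type == "BOOL":
        return "1" if value else "0"
    if field_type.startswith("ARRAY"):
        # Assumption: a missing (null) array is written as an empty one.
        return "[" + ",".join(value or []) + "]"
    return str(value)


def to_ccsv(records, schema=SCHEMA, section_size=1000):
    """Turn row-oriented records into CCSV lines, one section per batch."""
    lines = []
    for start in range(0, len(records), section_size):
        section = records[start:start + section_size]
        for name, field_type in schema:
            data = ",".join(render(rec[name], field_type) for rec in section)
            # Header: field name / field type / number of characters of data.
            lines.append(f"{name.upper()}/{field_type}/{len(data)}:{data}")
    return lines


people = [
    {"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19,
     "cool": True, "favorite_fruit": ["bananas", "apples"]},
    {"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102,
     "cool": True, "favorite_fruit": None},
]

ccsv_lines = to_ccsv(people)
print("\n".join(ccsv_lines))
```

With the default `section_size` of 1000, the two sample records fit into a single section of six lines. Passing a smaller `section_size` shows how the column pattern repeats once per batch of records.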
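Can people read CCSV files easily? To a degree, but you would struggle to get a coherent view of the data if you opened one in Excel. CCSV does, however, have a couple of useful properties that make it better suited to computers, and those are covered next.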
Suppose I want to run a SQL query against this data, for example:
SELECT COUNT(1) FROM people WHERE last_name = 'Rathbone'
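With a regular CSV file, the SQL engine has to scan every row, parse every column, extract the last_name value, and then count all the Rathbone values it sees.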
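With CCSV, the SQL engine can skip the first two fields and simply scan the third line, which holds all the last name values.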
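Why is that good? The SQL engine now processes only about 1/6 of the data, i.e. CCSV delivers a (theoretical and entirely unsubstantiated) 600% performance improvement over regular CSV files.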
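The difference between the two scans can be sketched in a few lines of Python. This is only an illustration: `count_in_csv` and `count_in_ccsv` are hypothetical names, the array column is left out so that a plain `split(",")` is enough to parse the CSV rows, and only a single CCSV section is handled.

```python
CSV_ROWS = [
    "1,Matthew,Rathbone,19,True",
    "2,Joe,Bloggs,102,True",
]

CCSV_TEXT = "\n".join([
    "ID/INT/3:1,2",
    "FIRST_NAME/STRING/11:Matthew,Joe",
    "LAST_NAME/STRING/15:Rathbone,Bloggs",
    "AGE/INT/6:19,102",
    "COOL/BOOL/3:1,1",
])


def count_in_csv(rows, column_index, wanted):
    """Row format: every field of every row is parsed to test one column."""
    count = 0
    values_parsed = 0
    for row in rows:
        fields = row.split(",")
        values_parsed += len(fields)
        if fields[column_index] == wanted:
            count += 1
    return count, values_parsed


def count_in_ccsv(text, column_name, wanted):
    """Columnar format: use the length in each header to jump over columns."""
    pos = 0
    while pos < len(text):
        colon = text.index(":", pos)
        name, _field_type, length = text[pos:colon].split("/")
        start = colon + 1
        end = start + int(length)
        if name == column_name:
            values = text[start:end].split(",")
            return values.count(wanted), len(values)
        pos = end + 1  # skip the unread data and the newline after it
    raise KeyError(column_name)


print(count_in_csv(CSV_ROWS, 2, "Rathbone"))
print(count_in_ccsv(CCSV_TEXT, "LAST_NAME", "Rathbone"))
```

Both functions return the match count together with the number of values they had to parse. The "number of characters" in each CCSV header is what lets the reader skip a column without looking at its contents.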
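Now imagine the same gain on a very large dataset. It is not hard to see how columnar file formats could save a great deal of processing power (and money) compared with plain JSON. This is the core value of columnar file formats.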
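Of course, in practice CCSV would need more work to become a viable file format, but that is getting into the weeds, so I won't cover it here.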
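Storing similar data together also helps compression codecs. Many codecs (including GZIP and Snappy) achieve a higher compression ratio on sequences of similar data. When records are stored column by column, each column section often contains similar values, which makes it a good candidate for compression. Each column can even be compressed independently of the others to optimize this further.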
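The effect can be tried out with Python's `zlib` module, which implements DEFLATE, the algorithm GZIP uses. The sketch below is an assumption-heavy illustration, not a benchmark: the data is synthetic and deliberately favorable (each column cycles through a small set of values, with different cycle lengths so that whole rows never repeat), and the helper names are hypothetical. Real datasets will show different, usually smaller, gains.

```python
import zlib


def make_records(n):
    """Build n synthetic records; each column cycles through a few values."""
    return [
        (f"city-{i % 7:02d}", f"fruit-{i % 11:02d}",
         f"status-{i % 13:02d}", f"plan-{i % 17:02d}")
        for i in range(n)
    ]


def row_layout(records):
    """One record per line, fields separated by commas (CSV-like)."""
    return "\n".join(",".join(rec) for rec in records).encode("utf-8")


def column_layout(records):
    """One column per line, values separated by commas (CCSV-like, no headers)."""
    columns = zip(*records)
    return "\n".join(",".join(col) for col in columns).encode("utf-8")


records = make_records(5000)
row_bytes = row_layout(records)
col_bytes = column_layout(records)

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))

print("raw bytes (both layouts):", len(row_bytes))
print("row layout, compressed:", row_size)
print("column layout, compressed:", col_size)
```

Both layouts contain exactly the same values and the same number of separators, so any difference in compressed size comes purely from how the values are ordered.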
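The biggest drawback of columnar formats is that reconstructing a complete record is slower, because it requires reading a segment from each column, one by one. This is why columnar file formats first caught on for analytics-style workloads rather than Map/Reduce-style workloads, which by default operate on whole rows at a time.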
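A small Python sketch makes the cost visible. It assumes the column sections have already been parsed into lists; `COLUMNS`, `read_record`, and `read_field` are hypothetical names used only for this illustration.

```python
# Column sections for the sample data, already parsed into Python lists.
COLUMNS = {
    "id": [1, 2],
    "first_name": ["Matthew", "Joe"],
    "last_name": ["Rathbone", "Bloggs"],
    "age": [19, 102],
    "cool": [True, True],
    "favorite_fruit": [["bananas", "apples"], []],
}


def read_field(columns, name, index):
    """Reading one field touches exactly one column section."""
    return columns[name][index]


def read_record(columns, index):
    """Rebuilding a whole record touches every column section."""
    record = {}
    sections_read = 0
    for name, values in columns.items():
        record[name] = values[index]
        sections_read += 1
    return record, sections_read


print(read_field(COLUMNS, "last_name", 1))
print(read_record(COLUMNS, 1))
```

Reading a single field needs one lookup, while rebuilding a record needs one lookup per column, which is why wide tables make whole-row access more expensive.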
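In real columnar file formats (for example, Parquet), this drawback is reduced by techniques such as splitting the file into "row groups" and building extensive metadata, although for particularly wide datasets (for example, 200+ columns) the speed impact can still be significant.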
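To give a feel for what row groups plus metadata can buy, here is a toy Python sketch. It is not Parquet's actual implementation: the `RowGroup` class and both functions are hypothetical, and per-group minimum and maximum values are just one assumed example of useful metadata (Parquet does keep statistics of this kind, but its layout is far richer).

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    columns: dict  # column name -> list of values in this group
    stats: dict    # column name -> (min value, max value)


def build_row_groups(records, group_size):
    """Split records into groups and store each group column by column."""
    groups = []
    for start in range(0, len(records), group_size):
        chunk = records[start:start + group_size]
        columns = {name: [rec[name] for rec in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def count_equal(groups, column, wanted):
    """Count matching values, skipping groups whose min/max rule them out."""
    matches = 0
    groups_scanned = 0
    for group in groups:
        low, high = group.stats[column]
        if wanted < low or wanted > high:
            continue  # the metadata alone proves there is no match here
        groups_scanned += 1
        matches += group.columns[column].count(wanted)
    return matches, groups_scanned


# Synthetic data: 100 records whose age increases every 10 records.
records = [{"id": i, "age": 18 + i // 10} for i in range(100)]
groups = build_row_groups(records, group_size=10)
print(count_equal(groups, "age", 20))
```

Because the synthetic ages are clustered, only one of the ten groups has to be scanned. On unsorted data the min/max ranges overlap more and fewer groups can be skipped.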
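Now that you know what to expect, let's take a quick look at a real file format: Parquet. There are far better Parquet overviews online than I could write, so if you want to understand in depth how Parquet works, read the official documentation.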
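The Parquet documentation includes a diagram of the Parquet file layout. It looks a little overwhelming, but the key takeaway is the importance of data organization and metadata. This is only an overview; the documentation is comprehensive yet accessible enough if you roughly know what is going on (and by this point you probably do!).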
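For Spark, Spark SQL, Hive, Impala, and similar technologies, columnar storage can deliver a 100x, and sometimes a 1000x, performance improvement, particularly for queries that touch only a few columns of a wide dataset. Although it looks complex, for its users Parquet is just another file format, like JSON or Avro.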
So, is there more to learn?
Of course! If you have any questions about big data, feel free to email me: matthew (at) rathbonelabs (dot com).