A Beginner's Guide to Columnar File Formats in Spark and Hadoop










This article is a follow-up to the Spark and Hadoop File Format Guide and is a good starting point if you don't already know anything about big data file formats.





What is a "columnar file format"?






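Columnar file formats store data by column rather than by row. CSV, TSV, JSON, and Avro are traditional row-based file formats. Parquet and ORC are columnar file formats.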










Consider a simple schema for a dataset of people:





{
  id: Integer,
  first_name: String,
  last_name: String,
  age: Integer,
  cool: Boolean,
  favorite_fruit: Array[String]
}
      
      



Here is that data as a row-based JSON file, one record per line:





{"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19, "cool": true, "favorite_fruit": ["bananas", "apples"]}
{"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102, "cool": true, "favorite_fruit": null}
      
      



And here is the same data as CSV:





1, Matthew, Rathbone, 19, True, ['bananas', 'apples']
2, Joe, Bloggs, 102, True,
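
Both listings are row-based: every line of the file is one complete record. A minimal sketch of how such files could be produced, using Python's standard `json` and `csv` modules (the choice of Python and the names `Person` and `PEOPLE` are assumptions made for illustration, not part of the original guide):

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Person:
    # Mirrors the example schema; the class name is hypothetical.
    id: int
    first_name: str
    last_name: str
    age: int
    cool: bool
    favorite_fruit: Optional[List[str]]


PEOPLE = [
    Person(1, "Matthew", "Rathbone", 19, True, ["bananas", "apples"]),
    Person(2, "Joe", "Bloggs", 102, True, None),
]

# JSON lines: one self-contained object per line.
json_lines = [json.dumps(asdict(p)) for p in PEOPLE]

# CSV: one record per line, fields in schema order.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for p in PEOPLE:
    writer.writerow(list(asdict(p).values()))
csv_lines = buffer.getvalue().splitlines()

for line in json_lines + csv_lines:
    print(line)
```

Note that the `csv` module wraps the array field in double quotes because it contains a comma, which the hand-written CSV listing glosses over.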
      
      



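In both of these formats, each line of the file contains a full row of data. This is how people like to think about data: neat rows lined up, easy to scan, easy to edit in Excel, easy to read in a terminal or text editor.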





To illustrate the columnar approach, here is a (made-up) columnar file format. Let's call it CCSV (Columnar CSV).





Each line of a CCSV file has the following format:





Field Name/Field Type/Number of Characters:[data in csv format]
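
To make the line format concrete, here is a minimal Python sketch of a parser for a single CCSV line. CCSV is an invented format, so the names `CcsvColumn` and `parse_ccsv_line` are hypothetical, and the sketch assumes the header itself never contains a `:` character.

```python
from typing import NamedTuple


class CcsvColumn(NamedTuple):
    name: str
    type_name: str
    length: int  # number of characters in the data part
    data: str    # raw comma-separated values, left unparsed


def parse_ccsv_line(line: str) -> CcsvColumn:
    # Everything before the first ':' is the header, the rest is the data.
    header, sep, data = line.rstrip("\n").partition(":")
    if not sep:
        raise ValueError("expected ':' between header and data")
    name, type_name, length_text = header.split("/")
    length = int(length_text)
    # The declared length lets a reader verify (or skip) the data part.
    if length != len(data):
        raise ValueError(f"header says {length} characters, found {len(data)}")
    return CcsvColumn(name, type_name, length, data)


column = parse_ccsv_line("LAST_NAME/STRING/15:Rathbone,Bloggs")
print(column.name, column.type_name, column.length)
print(column.data.split(","))
```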
      
      



Here is the same data in CCSV:





ID/INT/3:1,2
FIRST_NAME/STRING/11:Matthew,Joe
LAST_NAME/STRING/15:Rathbone,Bloggs
AGE/INT/6:19,102
COOL/BOOL/3:1,1
FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]
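
A short Python sketch of how row-oriented records could be pivoted into this layout. The helper names (`encode_value`, `to_ccsv`) are hypothetical, and the encoding rules (booleans as 1/0, a missing array as `[]`) are inferred from the listing above rather than taken from any specification.

```python
PEOPLE = [
    {"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19,
     "cool": True, "favorite_fruit": ["bananas", "apples"]},
    {"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102,
     "cool": True, "favorite_fruit": None},
]

# (field name, CCSV type) pairs in schema order.
SCHEMA = [
    ("id", "INT"),
    ("first_name", "STRING"),
    ("last_name", "STRING"),
    ("age", "INT"),
    ("cool", "BOOL"),
    ("favorite_fruit", "ARRAY[STRING]"),
]


def encode_value(value, type_name):
    if type_name == "BOOL":
        return "1" if value else "0"
    if type_name.startswith("ARRAY"):
        return "[" + ",".join(value or []) + "]"
    return str(value)


def to_ccsv(records, schema):
    lines = []
    for field, type_name in schema:
        # Gather one field from every record: this is the row-to-column pivot.
        data = ",".join(encode_value(r[field], type_name) for r in records)
        lines.append(f"{field.upper()}/{type_name}/{len(data)}:{data}")
    return lines


for line in to_ccsv(PEOPLE, SCHEMA):
    print(line)
```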
      
      



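Each line of a CCSV file holds one column of data rather than one record. In practice, such a file would be written in chunks: with 1000 records per chunk, for example, a dataset of 10 000 records would be stored as 10 chunks.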





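Is a CCSV file easy to read? Not for a person: it is hard to scan by eye, and it cannot simply be opened in Excel. CCSV is a format for computers, not for people.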





Suppose we want to run a SQL query against this data, for example:





SELECT COUNT(1) FROM people WHERE last_name = 'Rathbone'
      
      



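With a regular CSV file, a SQL engine has to scan every row, parse every field, extract the last_name value, and then count the rows where that value is Rathbone.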





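With CCSV, the SQL engine can skip the first two lines and scan only the third, which holds every last_name value.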
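
The character count in each CCSV header is what makes this skipping cheap: a reader can jump over a column's data without looking at it. A minimal Python sketch under stated assumptions: the data is ASCII, so one character equals one byte, and `read_column` is a hypothetical helper, not part of any real library.

```python
import io

CCSV_BYTES = (
    b"ID/INT/3:1,2\n"
    b"FIRST_NAME/STRING/11:Matthew,Joe\n"
    b"LAST_NAME/STRING/15:Rathbone,Bloggs\n"
    b"AGE/INT/6:19,102\n"
    b"COOL/BOOL/3:1,1\n"
    b"FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]\n"
)


def read_column(stream, wanted):
    """Return the values of one column, seeking past all other columns."""
    while True:
        # Read the header one byte at a time, up to the ':' separator.
        header = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise KeyError(wanted)  # reached end of file
            if ch == b":":
                break
            header += ch
        name, _type_name, length_text = header.decode("ascii").split("/")
        length = int(length_text)
        if name == wanted:
            return stream.read(length).decode("ascii").split(",")
        # Not the column we want: jump over its data and the newline.
        stream.seek(length + 1, io.SEEK_CUR)


last_names = read_column(io.BytesIO(CCSV_BYTES), "LAST_NAME")
print(last_names.count("Rathbone"))
```

Only the headers of the ID and FIRST_NAME lines are read here; their data is never parsed.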





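Why is that good? The SQL engine now processes only about 1/6 of the data, so CCSV delivers a (purely theoretical) 600% performance improvement over regular CSV files.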
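
The 1/6 figure can be checked by counting how many values each approach has to parse to answer the query. This is a rough Python sketch, not a benchmark. The function names are hypothetical, and the CSV rows use `|` inside the array field (an assumption) so that a plain split on commas yields six fields per row.

```python
from fractions import Fraction

CSV_ROWS = [
    "1,Matthew,Rathbone,19,True,bananas|apples",
    "2,Joe,Bloggs,102,True,",
]

CCSV_LINES = [
    "ID/INT/3:1,2",
    "FIRST_NAME/STRING/11:Matthew,Joe",
    "LAST_NAME/STRING/15:Rathbone,Bloggs",
    "AGE/INT/6:19,102",
    "COOL/BOOL/3:1,1",
    "FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]",
]


def count_in_csv(rows, column_index, target):
    """Row-based: every field of every row is parsed."""
    matches = 0
    values_parsed = 0
    for row in rows:
        fields = row.split(",")
        values_parsed += len(fields)
        if fields[column_index] == target:
            matches += 1
    return matches, values_parsed


def count_in_ccsv(lines, column_name, target):
    """Columnar: only the values of the requested column are parsed."""
    for line in lines:
        header, _, data = line.partition(":")
        if header.split("/")[0] != column_name:
            continue  # the data of other columns is never split
        values = data.split(",")
        return values.count(target), len(values)
    raise KeyError(column_name)


csv_matches, csv_parsed = count_in_csv(CSV_ROWS, 2, "Rathbone")
ccsv_matches, ccsv_parsed = count_in_ccsv(CCSV_LINES, "LAST_NAME", "Rathbone")
print(csv_matches, csv_parsed)            # 1 12
print(ccsv_matches, ccsv_parsed)          # 1 2
print(Fraction(ccsv_parsed, csv_parsed))  # 1/6
```

Both approaches return the same count, but the columnar one touches 2 values instead of 12.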










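Of course, CCSV would need a lot more work to be a viable file format, but those details are beyond the scope of this guide.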





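Columnar formats also compress better. Compression algorithms (such as GZIP and Snappy) work best on repetitive data, and values from the same column tend to resemble one another, so storing them side by side gives the compressor more to work with.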
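
The effect of layout on compression can be tried out with Python's standard `zlib` module, which implements DEFLATE, the algorithm behind GZIP. This is a sketch on synthetic data chosen for illustration (two high-entropy columns and two constant ones), not a benchmark. The `token` helper and the record layout are assumptions, and real datasets will show different ratios.

```python
import hashlib
import zlib


def token(i, salt):
    # Deterministic pseudo-random 8-character hex string,
    # standing in for a high-entropy value such as an ID.
    return hashlib.sha256(f"{salt}-{i}".encode()).hexdigest()[:8]


# 2000 synthetic records: (id, last_name, session, cool).
records = [
    (token(i, "id"), "Rathbone", token(i, "session"), "true")
    for i in range(2000)
]

# Row layout: one record per line.
row_layout = "\n".join(",".join(r) for r in records).encode()

# Column layout: one column per line, holding exactly the same values.
col_layout = "\n".join(",".join(col) for col in zip(*records)).encode()

row_size = len(zlib.compress(row_layout))
col_size = len(zlib.compress(col_layout))

print("uncompressed:", len(row_layout), len(col_layout))
print("compressed row layout:", row_size)
print("compressed column layout:", col_size)
```

Both layouts contain the same bytes in a different order. In the column layout, the repetitive columns form long runs that the compressor can encode very cheaply.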

























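Now that you know what to expect, let's take a quick look at a real columnar file format: Parquet. More detailed guides to Parquet exist elsewhere; the official Parquet documentation in particular is worth reading to understand how it all works.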





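The Parquet documentation includes a diagram of the Parquet file layout. It is a little overwhelming to look at, but the key takeaway is the importance of data organization and metadata. The docs are comprehensive while still being accessible if you roughly know what is going on (and by this point, you probably do!).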





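In Spark, Spark SQL, Hive, Impala, and similar technologies, columnar storage can yield a 100x, and sometimes a 1000x, performance improvement, especially for sparse queries on very wide datasets. Although they are complex and not human readable, columnar file formats are a straightforward improvement over traditional row-based formats such as JSON and Avro.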





Still have questions? Get in touch by email: matthew (at) rathbonelabs (dot com).


















