For future students of the "Data Engineer" and "Ecosystem Hadoop, Spark, Hive" courses, we have prepared another translation of a useful article.
Criteo β , . , . Spark β . , , , .
:
Spark . , Spark SQL (Datasets) Spark Core API (RDD), , 2–10 , .
Spark 2.4.6, MacBook Pro 2017, Intel Core i7 at 3.5 GHz
Java- ( 100 , 90 ). Scala, Python.
,
, , :
, ;
-, , .
β 2006 , Hadoop, , MapReduce. , Spark. .
2015 (Kay Ousterhout) .¹ Spark, , , , - . , , TPC-DS², , :
, 2 % ( );
- , 19 % ( ).
! , - , . :
Spark - , , .
, , , , , .
, Databricks 2016 ³ , Spark . SQL, API DataFrames Datasets.
Spark?
– 0 10⁹. Spark, , , Scala:
var res: Long = 0L
var i: Long = 0L
while (i < 1000L * 1000 * 1000) {
  if (i % 2 == 0) res += 1
  i += 1L
}
1.
Spark RDD Spark Datasets. , Spark [1] :
val res = spark.sparkContext
  .range(0L, 1000L * 1000 * 1000)
  .filter(_ % 2 == 0)
  .count()
2. RDD
import org.apache.spark.sql.functions.{col, count}

val res = spark.range(1000L * 1000 * 1000)
  .filter(col("id") % 2 === 0)
  .select(count(col("id")))
  .first().getAs[Long](0)
3. Datasets
. , . , RDD , Datasets , .
Datasets
: API- Datasets RDD, , , , . ? .
β Volcano
, RDD, Volcano. , RDD :
RDD;
compute
Iterator[T], RDD ( private Spark).
abstract class RDD[T: ClassTag] {
  def compute(…): Iterator[T]
}
4. RDD.scala
RDD, , :
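// Pseudocode: the arguments of compute(…) are elided here.
def pseudo_rdd_count[T](rdd: RDD[T]): Long = {
  val iter = rdd.compute(…)
  var result = 0L
  while (iter.hasNext) {
    iter.next() // consume the element; without this call the loop never terminates
    result += 1
  }
  result
}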
5. RDD
, , 1? :
: Iterator.next() , , JIT (inline).
: Java- JIT -, 5, , -, 1. , Java- JIT , .
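The gap between the iterator-based count and the hand-written loop can be reproduced without Spark. The following is a minimal sketch in plain Scala, not Spark's actual implementation: the object and method names (`VolcanoVsLoop`, `countEvensWithIterators`, `countEvensWithLoop`) are hypothetical, and the chained `Iterator` values only stand in for a chain of RDD operators that pull rows from each other through `hasNext`/`next()`.

```scala
object VolcanoVsLoop {
  // Volcano-style pipeline (illustrative stand-in for chained RDD operators):
  // every stage pulls one element at a time from its parent via hasNext/next().
  def countEvensWithIterators(n: Long): Long = {
    val source: Iterator[Long] = Iterator.iterate(0L)(_ + 1).takeWhile(_ < n)
    val filtered: Iterator[Long] = source.filter(_ % 2 == 0)
    var result = 0L
    while (filtered.hasNext) {
      filtered.next() // consume the element so the iterator advances
      result += 1
    }
    result
  }

  // The same computation fused by hand into one loop, as in the first listing.
  def countEvensWithLoop(n: Long): Long = {
    var res = 0L
    var i = 0L
    while (i < n) {
      if (i % 2 == 0) res += 1
      i += 1L
    }
    res
  }

  def main(args: Array[String]): Unit = {
    val n = 1000L * 1000
    println(countEvensWithIterators(n))
    println(countEvensWithLoop(n))
  }
}
```

Both methods return the same count; timing them on a larger `n` is one way to observe the cost of the per-element iterator calls on a given JVM.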
β
, Spark SQL⁵, , , RDD. , Spark , . (Whole-Stage Code Generation)⁶. Spark , . JVM/JIT . Spark , ., , Spark 3.
Spark , - Janino⁴. Spark SQL RDD.
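One way to look at the code that whole-stage code generation produces is Spark's own debug helper. The sketch below assumes a local Spark 2.x or 3.x installation; the object name `InspectCodegen` and the helper `evenCount` are hypothetical, while `explain()` and `debugCodegen()` (the latter brought in by importing `org.apache.spark.sql.execution.debug._`) are existing Spark calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Datasets

object InspectCodegen {
  // Same even-number count as the Datasets listing, on a caller-chosen range.
  def evenCount(spark: SparkSession, n: Long): DataFrame =
    spark.range(n).filter(col("id") % 2 === 0).select(count(col("id")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-codegen")
      .getOrCreate()
    val query = evenCount(spark, 1000L)
    query.explain()      // prints the physical plan
    query.debugCodegen() // prints the Java source generated for each codegen stage
    println(query.first().getLong(0))
    spark.stop()
  }
}
```

The generated source is verbose, but it shows the range, the filter and the partial aggregation as ordinary Java code rather than as separate operators exchanging rows.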
Spark
Spark 3 API- Scala/Java: RDD, Datasets DataFrames ( Datasets). RDD Spark β , - , API , Β« Β» . , , API- Datasets .
β
, Spark SQL, API RDD. , Java, Spark SQL:
val res = spark.range(1000L * 1000 * 1000)
  .rdd
  .filter(_ % 2 == 0)
  .count()
6. Dataset RDD
43 2,1 , . RDD Java, . 3 6 (. ), , .
β
Spark SQL . ( 6 ):
import org.apache.spark.sql.functions.{col, count}

val res = spark
  .range(1000L * 1000 * 1000)
  .filter(x => x % 2 == 0) // note that the condition changed
  .select(count(col("id")))
  .first()
  .getAs[Long](0)
7. Spark SQL Scala
Spark . Scala, Spark SQL, Spark , . β (. 1a), , (DAG) Spark.
Spark SQL – - ! , , : filter(condition: Column) filter(T => Boolean) select(…) map(…). Spark , (Dataset). , , RDD.
, . , Spark SQL . , , -.
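The difference between `filter(condition: Column)` and `filter(T => Boolean)` can be checked by comparing the plans Spark builds for each. This is a hedged sketch assuming a local Spark installation; `LambdaVsColumn`, `withColumnExpr` and `withLambda` are hypothetical names, and the exact plan text depends on the Spark version, so no output is shown.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

object LambdaVsColumn {
  // Filter written as a Column expression, which Spark SQL can analyze.
  def withColumnExpr(spark: SparkSession, n: Long): Dataset[java.lang.Long] =
    spark.range(n).filter(col("id") % 2 === 0)

  // Same filter written as a Scala lambda, which Spark treats as a black box.
  def withLambda(spark: SparkSession, n: Long): Dataset[java.lang.Long] =
    spark.range(n).filter((x: java.lang.Long) => x % 2 == 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lambda-vs-column")
      .getOrCreate()
    withColumnExpr(spark, 10L).explain() // the filter should appear as an expression on id
    withLambda(spark, 10L).explain()     // the filter should appear as a function object
    println(withColumnExpr(spark, 10L).count() == withLambda(spark, 10L).count())
    spark.stop()
  }
}
```

Both variants return the same rows; only the plan, and therefore what Spark is able to optimize, differs.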
2–10 , , !
Ousterhout, Kay, et al. "Making Sense of Performance in Data Analytics Frameworks." 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). 2015.
databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html