Spark Performance Secrets, or Why Query Compilation Matters

For future students of the courses "Data Engineer" and "Ecosystem Hadoop, Spark, Hive" we have prepared another translation of a useful article.


Criteo β€” , . , . Spark β€” . , , , .

:

  • , ;

  • , ;

  • .

Spark . , Spark SQL (Datasets) Spark Core API (RDD), , 2–10 , .

Test environment: Spark 2.4.6 on a 2017 MacBook Pro with an Intel Core i7 at 3.5 GHz.

Java- ( 100 , 90 ). Scala, Python.

,

, , :

  • , ;

  • -, , .

 β€” 2006 , Hadoop, , MapReduce. , Spark. .

2015 (Kay Ousterhout) .¹ Spark, , , , - . , , TPC-DS², , :

  • , 2 % ( );

  • - , 19 % ( ).

! , - , . :

  • Spark - , , .

  • , , , , , .

, Databricks 2016 ³ , Spark . SQL, API DataFrames Datasets.

Spark?

 β€” 0 10⁹. Spark, , , Scala:

var res: Long = 0L
var i: Long  = 0L
while (i < 1000L * 1000 * 1000) {
  if (i % 2 == 0) res += 1
  i += 1L
}

Listing 1.

Spark RDD Spark Datasets. , Spark [1] :

val res = spark.sparkContext
  .range(0L, 1000L * 1000 * 1000)
  .filter(_ % 2 == 0)
  .count()

Listing 2. RDD

import org.apache.spark.sql.functions.{col, count}

val res = spark.range(1000L * 1000 * 1000)
  .filter(col("id") % 2 === 0)   // Column expression, visible to Spark SQL
  .select(count(col("id")))
  .first().getAs[Long](0)

Listing 3. Datasets

. , . , RDD , Datasets , .

 

Datasets

: API- Datasets RDD, , , , . ? .

 β€” Volcano

, RDD, Volcano. , RDD :

  • RDD;

  •  compute  Iterator[T], RDD ( private Spark).

abstract class RDD[T: ClassTag]
def compute(…): Iterator[T]

Listing 4. RDD.scala

RDD, , :

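def pseudo_rdd_count[T](rdd: RDD[T]): Long = {
  // Pseudocode: the real compute takes a partition and a task context.
  val iter = rdd.compute
  var result = 0L
  while (iter.hasNext) {
    iter.next()   // consume the element; without this the loop never ends
    result += 1
  }
  result
}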

Listing 5. RDD

, , 1? :

  • : Iterator.next() , ,  JIT (inline).

  • : Java- JIT -, 5, , -, 1. , Java- JIT , .
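To make the contrast between Listing 5 and Listing 1 concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names `RangeOp`, `FilterOp`, `count` and `fusedCount` are hypothetical and only mimic the shape of an iterator-based pipeline; they are not Spark classes.

```scala
// Hypothetical, simplified operators; not Spark's actual classes.
object VolcanoSketch {
  // Source operator: yields the numbers in [start, end), one per next() call.
  final class RangeOp(start: Long, end: Long) extends Iterator[Long] {
    private var current = start
    def hasNext: Boolean = current < end
    def next(): Long = { val v = current; current += 1; v }
  }

  // Filter operator: pulls rows from its child until one satisfies the predicate.
  final class FilterOp(child: Iterator[Long], pred: Long => Boolean) extends Iterator[Long] {
    private var buffered = 0L
    private var ready = false
    def hasNext: Boolean = {
      while (!ready && child.hasNext) {
        val v = child.next()
        if (pred(v)) { buffered = v; ready = true }
      }
      ready
    }
    def next(): Long = {
      if (!hasNext) throw new NoSuchElementException("no more rows")
      ready = false
      buffered
    }
  }

  // Count operator: drains its child through hasNext/next() calls, row by row.
  def count(child: Iterator[Long]): Long = {
    var n = 0L
    while (child.hasNext) { child.next(); n += 1 }
    n
  }

  // The same query fused by hand into a single loop, as in Listing 1.
  def fusedCount(end: Long): Long = {
    var n = 0L
    var i = 0L
    while (i < end) { if (i % 2 == 0) n += 1; i += 1 }
    n
  }
}
```

Both `VolcanoSketch.count(new VolcanoSketch.FilterOp(new VolcanoSketch.RangeOp(0L, n), _ % 2 == 0))` and `VolcanoSketch.fusedCount(n)` return the same value. The first goes through a chain of `hasNext`/`next()` calls for every row, while the second is one loop over primitive values.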

 β€”

, Spark SQL⁡, , , RDD. , Spark , . (Whole-Stage Code Generation)⁢. Spark , . JVM/JIT . Spark , ., , Spark 3.

Spark , - Janino⁴. Spark SQL RDD.
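One way to see whole-stage code generation at work is to ask Spark for the physical plan and the generated Java source. The sketch below assumes a local `SparkSession` and a Spark SQL dependency on the classpath; the object name `CodegenInspect` and the helper `evenCountQuery` are hypothetical, while `explain()` and `debugCodegen()` (from the `org.apache.spark.sql.execution.debug` package) are existing Spark calls. The exact output format depends on the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Datasets

object CodegenInspect {
  // Hypothetical helper: the query from Listing 3, parameterised by size.
  def evenCountQuery(spark: SparkSession, n: Long): DataFrame =
    spark.range(n)
      .filter(col("id") % 2 === 0)
      .select(count(col("id")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("codegen-inspect")
      .getOrCreate()

    val df = evenCountQuery(spark, 1000L * 1000)

    // Physical plan; operators compiled together are prefixed with an asterisk.
    df.explain()

    // Prints the Java source generated for each whole-stage codegen subtree.
    df.debugCodegen()

    spark.stop()
  }
}
```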

Spark

Spark 3 API- Scala/Java: RDD, Datasets DataFrames ( Datasets). RDD Spark β€” , - , API , Β« Β» . , , API- Datasets .

 β€”

, Spark SQL, API RDD. , Java, Spark SQL:

val res = spark.range(1000L * 1000 * 1000)
  .rdd
  .filter(_ % 2 == 0)
  .count()

Listing 6. Dataset RDD

43 2,1 , . RDD Java, . 3 6 (. ), , .

 

Figure 1. Visual representations of the steps for Listing 3 (diagram a) and Listing 6 (diagram b)

 β€”

Spark SQL . ( 6 ):

val res = spark
  .range(1000L * 1000 * 1000)
  .filter(x => x % 2 == 0) // note that the condition changed
  .select(count(col("id"))) 
  .first()
  .getAs[Long](0)

Listing 7. Spark SQL Scala

Spark . Scala, Spark SQL, Spark , .  β€” (. 1a), , (DAG) Spark.

Spark SQL  β€” - ! , , :  filter(condition: Column)  filter(T => Boolean)  select(…)  map(…). Spark , (Dataset). , , RDD.
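The difference between `filter(condition: Column)` and `filter(T => Boolean)` can be checked by comparing the physical plans Spark builds for each. This is a minimal sketch, assuming a local `SparkSession`; the object `FilterPlans` and its two helpers are hypothetical names, and the plan text printed by `explain()` varies between Spark versions.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

object FilterPlans {
  // Column expression: the predicate is part of the query plan itself.
  def columnFilter(spark: SparkSession, n: Long): Dataset[java.lang.Long] =
    spark.range(n).filter(col("id") % 2 === 0)

  // Scala lambda: Spark only sees an opaque function it has to call per row.
  def lambdaFilter(spark: SparkSession, n: Long): Dataset[java.lang.Long] =
    spark.range(n).filter((x: java.lang.Long) => x % 2 == 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-plans")
      .getOrCreate()

    // Compare how the filter step is rendered in each plan.
    columnFilter(spark, 1000L).explain()
    lambdaFilter(spark, 1000L).explain()

    spark.stop()
  }
}
```

Both variants return the same rows; only the way the predicate is represented in the plan differs.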

, . , Spark SQL . , , -.

2–10 , , !

. Spark.

  1. Ousterhout, Kay, et al. Making sense of performance in data analytics frameworks. 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). 2015.

  2. www.tpc.org/tpcds/

  3. databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

  4. janino-compiler.github.io/janino/

  5. people.csail.mit.edu/matei/papers/2015/sigmodsparksql.pdf

  6. databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

