Why your Spark apps are slow or not working at all. Part 1: memory management

We invite future students of the "Hadoop, Spark, Hive Ecosystem" course to an open webinar on "Spark Streaming". At the webinar, participants will work with an expert to get acquainted with Spark Streaming and Structured Streaming, study their features, and write a simple stream processing application.



As usual, we are also sharing a translation of a useful article.






Spark apps are easy to write and easy to understand when everything goes according to plan. It becomes much harder when they start to run slowly or fail. Sometimes a well-tuned application fails because the data, or its layout, has changed. Sometimes an application that has run well so far starts behaving badly because of a lack of resources. The list goes on and on.





Spark, , , , .., , .





, Spark . — .





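If all Spark developers were asked to vote, out-of-memory (OOM) conditions would surely come out as the number one problem everyone has faced. That is no surprise, since Spark's architecture is memory-centric. Some of the most common causes of OOM are: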





  • Incorrect usage of Spark





  • High concurrency













, Spark . , , OOM, , - OOM.  Spark . OOM, . 





, . .





Spark — JVM (Java Virtual Machine) , . OutOfMemory



— OOM ( - Spark. Spark — . . , . , .





, OutOfMemory



OOM ( ) , :





  • rdd.collect()







  • sparkContext.broadcast



     





  • ,





  • spark.sql.autoBroadcastJoinThreshold



    .





Spark . , . .





, . . , , , , .





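If you are using Spark SQL (Structured Query Language) and the driver hits OOM because it is broadcasting relations, you can either increase the driver memory, if possible, or lower the "spark.sql.autoBroadcastJoinThreshold" value so that join operations fall back to the more memory-friendly sort merge join.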





Spark, . — ,   . .









, OOM, , Spark .





Spark , . , , , .. map-stage ( SQL), , , .





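For example, if an ORC (Optimized Row Columnar) table has 2000 partitions, then 2000 tasks are created for the map-stage that reads the table, assuming partition pruning does not come into play. For a reduce-stage (the Shuffle stage), Spark determines the number of tasks from "spark.default.parallelism" for an RDD (Resilient Distributed Dataset) or from "spark.sql.shuffle.partitions" for a DataSet. How many tasks run in parallel on each executor depends on "spark.executor.cores". If this value is raised without considering memory, executors may fail with OOM. Below we look at what happens under the hood while a task runs, and at some probable causes of OOM.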

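To make the effect of "spark.executor.cores" concrete, here is a minimal Python sketch of the arithmetic. The function names, the cluster size of 10 executors, and the 4000 MB execution-memory figure are illustrative assumptions, not values from the article. The sketch also assumes one core per task and an even split of execution memory between running tasks, which is only a rough approximation of how Spark shares memory.

```python
import math


def task_slots(num_executors: int, executor_cores: int) -> int:
    """Tasks that can run at the same time, assuming one core per task."""
    return num_executors * executor_cores


def task_waves(num_tasks: int, num_executors: int, executor_cores: int) -> int:
    """Number of scheduling rounds needed to run all tasks of a stage."""
    return math.ceil(num_tasks / task_slots(num_executors, executor_cores))


def execution_memory_per_task_mb(execution_memory_mb: float, executor_cores: int) -> float:
    """Rough even-split estimate of execution memory available to each running task."""
    return execution_memory_mb / executor_cores


# Hypothetical cluster: 10 executors with 4 cores each, 2000 map tasks.
print(task_slots(10, 4))                        # 40
print(task_waves(2000, 10, 4))                  # 50
# Doubling the cores halves the estimated per-task share on the same executor.
print(execution_memory_per_task_mb(4000, 4))    # 1000.0
print(execution_memory_per_task_mb(4000, 8))    # 500.0
```

The point of the sketch is the trade-off: more cores per executor means fewer waves, but each concurrent task gets a smaller share of the same memory.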




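Say we are running a map task, or the scan phase of a SQL query, over an HDFS (Hadoop Distributed File System) file or a Parquet/ORC table. For HDFS files, each Spark task reads a 128 MB block of data. So if 10 tasks run in parallel, the memory requirement is at least 128*10 MB just to hold the partitioned data. This also ignores compression, which can make the data grow significantly once it is decompressed, depending on the compression algorithm.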
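The 128*10 estimate above can be written as a small helper. This is a sketch only: the function name and the `expansion_factor` parameter (a stand-in for how much the data grows after decompression) are assumptions introduced for illustration, and the real footprint depends on the file format and codec.

```python
def min_scan_memory_mb(block_size_mb: float,
                       concurrent_tasks: int,
                       expansion_factor: float = 1.0) -> float:
    """Lower bound on memory needed to hold the blocks being read in parallel.

    expansion_factor is a hypothetical multiplier for decompression growth;
    1.0 means the data is treated as uncompressed.
    """
    return block_size_mb * concurrent_tasks * expansion_factor


# The case from the text: 128 MB blocks, 10 concurrent tasks.
print(min_scan_memory_mb(128, 10))        # 1280.0
# Same scan if the data were to triple in size once decompressed (assumed ratio).
print(min_scan_memory_mb(128, 10, 3.0))   # 3840.0
```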





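Spark reads Parquet in a vectorized format. Put simply, each Spark task reads data from the Parquet file batch by batch. Because Parquet is columnar, these batches are built per column. A certain amount of column data is accumulated in memory before any operation is executed on that column. This means Spark needs data structures and bookkeeping to hold that much data. In addition, encoding techniques such as dictionary encoding keep some state in memory. All of this requires memory.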





Spark tasks and memory components during table scan

, , . , (broadcast join), (broadcast variables) . , .









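Spark's Catalyst engine tries to optimize a query as much as possible, but it cannot help if the query itself is badly written. One example is selecting all the columns of a Parquet/ORC table. As described above, each column needs some in-memory column batch state, so the more columns are selected, the greater the overhead. Try to read as few columns as possible.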





. , , . . () , .









Spark. .









. , -. spark.executor.memory



spark.driver.memory



.  , . . Unravel (Unravel Data Operations Platform) .









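Sometimes the cause is not executor memory itself. When running on YARN (Yet Another Resource Negotiator), container memory overhead can cause OOM, or the container can be killed by YARN. In that case YARN reports a "YARN kill" message for the container.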





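YARN runs each Spark component, such as executors and drivers, inside containers. Overhead memory is the off-heap memory used for JVM overheads, interned strings, and other JVM metadata. In this case, set spark.yarn.executor.memoryOverhead to a suitable value. Typically, about 10% of total executor memory should be allocated for overhead.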
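As a rough illustration of the 10% rule, the sketch below estimates the container size YARN has to grant for one executor. The 384 MB floor is Spark's documented minimum overhead and is not mentioned in the text above. The function names are hypothetical, and the sketch does not model YARN rounding the request up to its allocation increment.

```python
MIN_OVERHEAD_MB = 384     # Spark's documented minimum overhead (not from the article)
OVERHEAD_FACTOR = 0.10    # the "10%" rule of thumb from the text


def default_overhead_mb(executor_memory_mb: int) -> int:
    """Overhead Spark would request when no explicit overhead is configured."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))


def yarn_container_mb(executor_memory_mb: int) -> int:
    """Approximate container request: JVM heap plus off-heap overhead."""
    return executor_memory_mb + default_overhead_mb(executor_memory_mb)


print(yarn_container_mb(8192))   # 9011  (8192 + 819)
print(yarn_container_mb(2048))   # 2432  (2048 + 384, the floor applies)
```

If a container is killed even though the heap is sized correctly, comparing the actual off-heap usage with `default_overhead_mb` is a reasonable first check before raising the overhead setting.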









Spark , Spark. Spark , . , , .





Spark : . , , - , . .





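Of the executor memory, a small part is reserved first (300 MB by default). The share of the remainder that Spark can use for execution and storage is controlled by "spark.memory.fraction"; the default is 60%. Of that, 50% by default (configurable via "spark.memory.storageFraction") is assigned to storage, and the rest to execution.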

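The split described above can be sketched as plain arithmetic. The function and key names below are illustrative assumptions; the 300 MB reserve and the 0.6 and 0.5 defaults are the values quoted in the text. The storage and execution regions can borrow from each other at run time, so these numbers are starting boundaries rather than hard limits.

```python
RESERVED_MB = 300  # reserved memory, as quoted in the text


def unified_memory_regions(heap_mb: float,
                           memory_fraction: float = 0.6,
                           storage_fraction: float = 0.5) -> dict:
    """Estimate the initial sizes of the executor memory regions, in MB."""
    available = heap_mb - RESERVED_MB
    unified = available * memory_fraction   # shared by execution and storage
    storage = unified * storage_fraction    # initial storage region
    execution = unified - storage           # initial execution region
    user = available - unified              # left for user data structures
    return {
        "reserved": RESERVED_MB,
        "unified": unified,
        "storage": storage,
        "execution": execution,
        "user": user,
    }


# Hypothetical executor heap of 4396 MB (4096 MB after the reserve).
regions = unified_memory_regions(4396)
for name, size_mb in regions.items():
    print(f"{name}: {size_mb:.0f} MB")
```

With the defaults, storage and execution each start at roughly 1229 MB for this heap size, which is far less than the configured executor memory might suggest.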




, , , , . , , . , , , , , .





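If we do not need all of our cached data to stay in memory, we can set "spark.memory.storageFraction" to a lower value, so that extra cached data is evicted and execution does not come under memory pressure.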





Spark , , . . () () GC (Garbage Collector), . .





. , , . , . , , .





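When the Spark shuffle service runs on YARN, NodeManager starts an auxiliary service that acts as the external shuffle service provider. By default, NodeManager memory is around 1 GB. Applications that shuffle a lot of data can therefore fail because NodeManager runs out of memory. If your applications fall into this category, it is essential to configure NodeManager memory properly.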





№1,

Spark — . , Spark . Spark . , . , , .





, Spark. , Unravel , , , . -, Unravel . Spark.





, Spark : , , , , Spark.

















