Spark 3.0: new features and examples of their use - part 1

For our new program "Apache Spark for Data Engineers" and the webinar about the course, which will take place on December 2, we have prepared a translation of an overview article about Spark 3.0.

Spark 3.0 came out with a whole set of important improvements, including: better performance with AQE, reading binary files, improved SQL and Python support, Python 3, Hadoop 3 integration, and ACID support.

In this article, the author tries to give examples of how these new features are used. This is the first article on the functionality of Spark 3.0, and the series is planned to continue.

This article highlights the following features in Spark 3.0:

  • Adaptive Query Execution (AQE) Framework

  • Support for new languages

  • New interface for structured streaming

  • Reading binary files

  • Recursive folder browsing

  • Multi-character delimiter support (||)

  • New built-in Spark features

  • Switch to Proleptic Gregorian Calendar

  • Data Frame Tail

  • Repartition function in SQL queries

  • Improved ANSI SQL compatibility

Adaptive Query Execution (AQE) Framework

Adaptive Query Execution (AQE) is one of the most notable features of Spark 3.0: it re-optimizes the query plan at runtime, based on metrics collected while the query is executing.

Before 3.0, Spark built the execution plan once, before the query started, and then stuck with that plan for the whole run, without any further optimization based on what actually happened during execution. AQE instead collects metrics at each stage and uses them to re-optimize the rest of the plan.

By default, adaptive query execution (AQE) is disabled. To enable it, set spark.sql.adaptive.enabled to true (a short configuration sketch follows the list below). With AQE enabled, Spark 3.0 shows a significant speed-up on TPC-DS queries compared to Spark 2.4.

The AQE framework in Spark 3.0 ships with three features:

  • dynamically coalescing shuffle partitions,

  • dynamically switching join strategies from sort-merge join to broadcast join,

  • dynamically optimizing skew joins.
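
As a quick illustration, here is a minimal configuration sketch, assuming Spark 3.0+ and an existing SparkSession named spark (the second setting is optional and shown only as an example):

// Enable Adaptive Query Execution (it is off by default in Spark 3.0)
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Optionally let AQE merge small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")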

Support for new languages

Spark 3.0 moves to newer language and runtime versions:

  • Python 3 (Python 2.x is deprecated)

  • Scala 2.12

  • JDK 11

Hadoop 3 is also supported, as well as Kafka 2.4.1.

New interface for Spark Structured Streaming

The Spark web UI now has a separate tab for Structured Streaming. Previously, monitoring and debugging streaming applications was inconvenient, because the UI had no dedicated page for them; in Spark 3.0 this has been fixed.

The new tab consists of two sections:

  • Active Streaming Queries

  • Completed Streaming Queries

Source: Databricks

«Active Streaming Queries» lists the queries that are currently running, while «Completed Streaming Queries» lists the ones that have finished.

Clicking the Run ID of a query opens detailed statistics of the streaming query: input rate, process rate, number of input rows, batch duration, operation duration and so on. The screenshots are taken from Databricks.

 

Reading binary files

Spark 3.0 adds a new data source, "binaryFile", for reading binary files.

With the binaryFile data source, the DataFrameReader can read image, pdf, zip, gzip, tar and other binary files. Each file is turned into a single DataFrame record containing the raw content and metadata of the file.

val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")

df.printSchema()

df.show()

root

 |-- path: string (nullable = true)

 |-- modificationTime: timestamp (nullable = true)

 |-- length: long (nullable = true)

 |-- content: binary (nullable = true) 

+--------------------+--------------------+------+--------------------+

|                path|    modificationTime|length|             content|

+--------------------+--------------------+------+--------------------+

|file:/C:/tmp/bina…|2020-07-25 10:11:…| 74675|[89 50 4E 47 0D 0...|

+--------------------+--------------------+------+--------------------+

Recursive folder browsing

Spark 3.0 adds a recursiveFileLookup option. When it is set to true, the DataFrameReader recursively loads files from the given path and all of its subfolders.

spark.read.option("recursiveFileLookup", "true").csv("/path/to/folder")

 

Multi-character delimiter support

Spark 3.0 adds support for multi-character delimiters (for example, ||) when reading CSV files. Suppose you have a CSV file with the following data:

 col1||col2||col3||col4

val1||val2||val3||val4

val1||val2||val3||val4

It can be read like this:

 val df  = spark.read

      .option("delimiter","||")

      .option("header","true")

      .csv("/tmp/data/douplepipedata.csv")

In Spark 2.x, multi-character delimiters were not supported, and an attempt to read such a file ended with an error:

 throws java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||

New built-in Spark functions

Spark 3.0 adds a number of new built-in functions to the Spark SQL standard library, similar to the functions found in most databases.

sinh, cosh, tanh, asinh, acosh, atanh, any, bit_and, bit_or, bit_count, bit_xor,
bool_and, bool_or, count_if, date_part, extract, forall, from_csv,
make_date, make_interval, make_timestamp, map_entries,
map_filter, map_zip_with, max_by, min_by, schema_of_csv, to_csv,
transform_keys, transform_values, typeof, version,
xxhash64
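
As a short illustration, here is a sketch of two of these functions used from SQL, assuming a SparkSession named spark; the table and column names are made up for the example:

import spark.implicits._

// A tiny illustrative dataset registered as a temporary view
val sales = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "grp")
sales.createOrReplaceTempView("sales")

// count_if counts rows matching a predicate;
// max_by returns the value of one column for the row where another column is maximal
spark.sql("SELECT count_if(id > 1) AS cnt, max_by(grp, id) AS top_grp FROM sales").show()

//Returns
//+---+-------+
//|cnt|top_grp|
//+---+-------+
//|  2|      a|
//+---+-------+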

 

Switch to Proleptic Gregorian Calendar

Previous Spark versions used a hybrid calendar: the Julian calendar for dates before 1582 and the Gregorian calendar for later dates.

This behavior was inherited from the legacy java.sql.Date API of JDK 7. JDK 8 introduced the new java.time.LocalDate API, which uses the Proleptic Gregorian calendar.

Spark 3.0 switches to the Proleptic Gregorian calendar, the same calendar already used by Pandas, R and Apache Arrow. For dates on or after October 15, 1582, the results of Date & Timestamp functions in Spark 3.0 do not change; differences appear only for dates before October 15, 1582.

Spark 3.0 also adds new Date & Timestamp functions: make_date(), make_timestamp() and make_interval().

make_date(year, month, day) – returns a date built from the given <year>, <month> and <day>.

make_date(2014, 8, 13)

//returns 2014-08-13.

make_timestamp(year, month, day, hour, min, sec[, timezone]) – returns a timestamp built from the given <year>, <month>, <day>, <hour>, <min>, <sec> and optional <timezone>.

make_timestamp(2014, 8, 13, 1, 10, 40.147)

//returns Timestamp 2014-08-13 1:10:40.147

make_timestamp(2014, 8, 13, 1, 10, 40.147, CET)

make_interval(years, months, weeks, days, hours, mins, secs) – returns an interval built from the given fields.

Note that make_date() and make_timestamp() return null when one of the fields is given an invalid value such as 0.
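
Since make_interval() has no example above, here is a small sketch, assuming a SparkSession named spark (the argument values are arbitrary):

// 0 years, 0 months, 1 week, 2 days, 3 hours, 4 minutes, 5.5 seconds
spark.sql("SELECT make_interval(0, 0, 1, 2, 3, 4, 5.5) AS iv").show(false)

//Returns an interval of 1 week, 2 days, 3 hours, 4 minutes and 5.5 seconds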

DataFrame.tail() 

Spark has head() for getting the first rows of a dataset, but it had no tail(), unlike Pandas in Python. Spark 3.0 introduces the tail() action to return the last rows of a DataFrame. In Scala, tail() returns scala.Array[T].

 

val data=spark.range(1,100).toDF("num").tail(5)

data.foreach(print)

//Returns

//[95][96][97][98][99]

Repartition in SQL queries

Most Dataset/DataFrame actions and transformations can be expressed through Spark SQL, but repartition() had no SQL equivalent. Spark 3.0 lets you repartition the result of a SQL query with a REPARTITION hint, as shown in the example below.

 

val df=spark.range(1,10000).toDF("num")

println("Before re-partition :"+df.rdd.getNumPartitions)

df.createOrReplaceTempView("RANGE_TABLE")

val df2 = spark.sql("SELECT /*+ REPARTITION(20) */ * FROM RANGE_TABLE")

println("After re-partition :"+df2.rdd.getNumPartitions)

//Returns 

//Before re-partition :1

//After re-partition :20

Improved ANSI SQL compatibility

Because Spark is widely used by data engineers coming from a database and SQL background, ANSI SQL compatibility has long been requested, and Spark 3.0 improves it. To enable full ANSI SQL support, set spark.sql.parser.ansi.enabled to true in the Spark configuration.
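
A one-line sketch of turning it on from code, using the configuration key mentioned above and assuming a SparkSession named spark:

// Enable ANSI SQL mode (key as named in the text above)
spark.conf.set("spark.sql.parser.ansi.enabled", "true")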


Newprolab runs two programs on Apache Spark:

Apache Spark for Data Engineers (Scala). 11 lessons, 5 weeks.

Apache Spark (Python). 6 lessons, 5 weeks.



