For our new program "Apache Spark for Data Engineers" and the webinar about the course, which will take place on December 2, we have prepared a translation of an overview article about Spark 3.0.
Spark 3.0 came out with a whole bunch of important improvements, including: improved performance with AQE, reading binary files, improved SQL and Python support, Python 3, Hadoop 3 integration, and ACID support.
In this article, the author gives examples of how these new features are used. This is the first article on the functionality of Spark 3.0, and the series is planned to continue.
This article highlights the following features in Spark 3.0:
Adaptive Query Execution (AQE) Framework
Support for new languages
New interface for structured streaming
Reading binary files
Recursive folder browsing
Multiple data delimiter support (||)
New built-in Spark functions
Switch to Proleptic Gregorian Calendar
DataFrame tail() function
REPARTITION hint in SQL queries
Improved ANSI SQL compatibility
Adaptive Query Execution (AQE) is, perhaps, the most important new feature of Spark 3.0. It re-optimizes and adjusts query plans based on runtime statistics collected during the execution of the query.
Before 3.0, Spark built a query execution plan once, before execution started, and could not revise it afterwards, even if the data turned out to be distributed differently than the optimizer assumed. AQE removes this limitation: it gathers statistics at runtime and re-optimizes the remaining stages of the plan accordingly.
By default, Adaptive Query Execution (AQE) is disabled. To enable it, set spark.sql.adaptive.enabled to true (see the configuration sketch after the list below). With AQE enabled, Spark shows a significant performance improvement on the TPC-DS benchmark compared to Spark 2.4.
AQE in Spark 3.0 includes 3 features:
dynamically coalescing shuffle partitions
dynamically switching join strategies from sort-merge join to broadcast join
dynamically optimizing skew joins
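Below is a minimal sketch of enabling AQE and its sub-features when building a session. The three config keys are the standard Spark 3.0 settings for the features listed above; the app name and master are arbitrary choices for this example.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")   // arbitrary name for this sketch
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")                    // master switch for AQE
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce shuffle partitions at runtime
  .config("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed join partitions
  .getOrCreate()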
Spark 3.0 supports the following language versions:
Python3 (Python 2.x is deprecated)
Scala 2.12
JDK 11
It also adds support for Hadoop 3, Kafka 2.4.1 and other updated dependencies.
New interface for Spark Structured Streaming
Until now, the Spark web UI had no dedicated page for streaming queries: to monitor them you had to pull metrics out of the logs or rely on third-party tools, which was inconvenient. Spark 3.0 adds a Structured Streaming tab to the web UI.
The tab consists of 2 sections:
Image: Databricks
«Active Streaming Queries» shows the queries that are currently running, while «Completed Streaming Queries» shows the finished ones.
Clicking the Run ID of a query opens its detailed statistics: Input Rate, Process Rate, Input Rows, Batch Duration, Operation Duration and so on. The screenshot above was taken in Databricks.
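To see the new tab in action, any running streaming query will do. Here is a minimal sketch, assuming an existing SparkSession named spark; the built-in rate source and console sink are used only to generate some activity.
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("rate")                 // built-in source emitting (timestamp, value) rows
  .option("rowsPerSecond", "10")
  .load()

val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
// While the query runs, the Structured Streaming tab of the web UI
// (http://localhost:4040 for a local session) shows its statistics;
// query.stop() terminates it.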
Spark 3.0 supports a new data source format, “binaryFile”, for reading binary files.
With the binaryFile format, the DataFrameReader can read files such as image, pdf, zip, gzip, tar and others. Each file becomes a single record in the resulting DataFrame, containing the raw content along with the file's metadata.
val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- content: binary (nullable = true)
+--------------------+--------------------+------+--------------------+
| path| modificationTime|length| content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina…|2020-07-25 10:11:…| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+
Spark 3.0 introduces the recursiveFileLookup option. When it is set to true, the DataFrameReader recursively loads files from all nested subfolders of the input path.
spark.read.option("recursiveFileLookup", "true").csv("/path/to/folder")
Spark 3.0 supports multi-character delimiters (for example, ||) when reading CSV files. Suppose you have a CSV file with the following contents:
col1||col2||col3||col4
val1||val2||val3||val4
val1||val2||val3||val4
It can be read by specifying the delimiter option:
val df = spark.read
.option("delimiter","||")
.option("header","true")
.csv("/tmp/data/douplepipedata.csv")
Spark 2.x did not support multi-character delimiters, and you had to resort to workarounds. Attempting to read such a file produced an error:
throws java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||
New built-in Spark functions
Spark SQL already covers most of the functions available in relational databases, and Spark 3.0 adds a number of new built-in functions to the list:
sinh, cosh, tanh, asinh, acosh, atanh, any, bit_and, bit_or, bit_count, bit_xor,
bool_and, bool_or, count_if, date_part, extract, forall, from_csv,
make_date, make_interval, make_timestamp, map_entries,
map_filter, map_zip_with, max_by, min_by, schema_of_csv, to_csv,
transform_keys, transform_values, typeof, version,
xxhash64
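A quick sketch of a few of these functions through Spark SQL; the sample data and names here are made up for illustration, assuming an existing SparkSession named spark.
import spark.implicits._

val sales = Seq(("a", 10), ("b", 25), ("c", 5)).toDF("item", "amount")
sales.createOrReplaceTempView("sales")

spark.sql("""
  SELECT count_if(amount > 8) AS big_sales,  -- rows matching a predicate
         max_by(item, amount) AS top_item,   -- item with the largest amount
         typeof(amount)       AS amount_type -- SQL type name of the column
  FROM sales
""").show()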
Historically, Spark used a hybrid calendar: the Julian calendar for dates before 1582 and the Gregorian calendar for later ones.
The old behaviour was inherited from the JDK 7 java.sql.Date API, which relies on the hybrid calendar. In JDK 8 it was superseded by the java.time.LocalDate API, which uses the Proleptic Gregorian calendar.
Spark 3.0 switches to the Proleptic Gregorian calendar, the calendar already used by Pandas, R and Apache Arrow. For dates from October 15, 1582 onwards, the Date & Timestamp values produced by Spark 3.0 match those of earlier versions; behaviour differs only for dates before October 15, 1582.
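When old files written with the hybrid calendar contain such early dates, Spark 3.0 lets you choose how to rebase them on read and write. A sketch using the Parquet rebase settings available in Spark 3.0 (accepted values are EXCEPTION, CORRECTED and LEGACY):
// How to interpret pre-1582 dates found in existing Parquet files:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
// How to write them out for consumers still on the hybrid calendar:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")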
Spark 3.0 also introduces new Date & Timestamp constructor functions:
make_date(), make_timestamp() and make_interval().
make_date(year, month, day)
– creates a date from the <year>, <month> and <day> fields.
make_date(2014, 8, 13)
//returns 2014-08-13.
make_timestamp(year, month, day, hour, min, sec[, timezone])
– creates a Timestamp from the <year>, <month>, <day>, <hour>, <min>, <sec> and optional <timezone> fields.
make_timestamp(2014, 8, 13, 1, 10, 40.147)
//returns Timestamp 2014-08-13 1:10:40.147
make_timestamp(2014, 8, 13, 1, 10, 40.147, 'CET')
//returns the same Timestamp in the CET time zone
make_interval(years, months, weeks, days, hours, mins, secs)
– creates an interval from the given fields.
If invalid values are passed to make_date() or make_timestamp(), they return null instead of raising an error; any field not specified for make_interval() defaults to 0.
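As a runnable sketch (assuming an existing SparkSession named spark), the same constructors can be called through Spark SQL:
spark.sql("""
  SELECT make_date(2014, 8, 13)                     AS d,
         make_timestamp(2014, 8, 13, 1, 10, 40.147) AS ts,
         make_interval(0, 0, 2, 3, 0, 0, 0)         AS iv -- 2 weeks and 3 days
""").show(false)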
DataFrame.tail()
Spark has long had the head() action to retrieve the first records of a dataset, but, unlike Pandas in Python, it offered no tail() action to retrieve the last ones. Spark 3.0 adds tail(), which returns the last records of a DataFrame. In Scala, tail() returns scala.Array[T].
val data=spark.range(1,100).toDF("num").tail(5)
data.foreach(print)
//Returns
//[95][96][97][98][99]
REPARTITION hint in SQL queries
In Spark you change the number of partitions of a Dataset/DataFrame with the repartition() transformation, but until now there was no way to do this from Spark SQL. Spark 3.0 adds support for the REPARTITION hint in SQL queries. Let's look at an example.
val df = spark.range(1,10000).toDF("num")
println("Before re-partition :" + df.rdd.getNumPartitions)
df.createOrReplaceTempView("RANGE_TABLE")
val df2 = spark.sql("SELECT /*+ REPARTITION(20) */ * FROM RANGE_TABLE")
println("After re-partition :" + df2.rdd.getNumPartitions)
//Returns
//Before re-partition :1
//After re-partition :20
Improved ANSI SQL compatibility
Since Spark is largely used by data engineers who come from a database background, ANSI SQL compatibility has been in high demand, and Spark 3.0 partially implements it. To enable it, set spark.sql.parser.ansi.enabled to true.
Full ANSI SQL compliance in Spark is still a work in progress.
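A small sketch of what the mode changes, assuming an existing SparkSession named spark; note that in the released Spark 3.0 builds this switch is exposed as spark.sql.ansi.enabled.
// Legacy (non-ANSI) behaviour: an invalid cast silently yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show() // prints null
// ANSI behaviour: the same cast raises a runtime error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST('abc' AS INT)").show() // would throw an exception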
Learn more about Newprolab's Apache Spark programs:
Apache Spark for Data Engineers (Scala). 11 lessons, 5 weeks.
Apache Spark (Python). 6 lessons, 5 weeks.