For our new program "Apache Spark for Data Engineers" and the webinar about the course, which will take place on December 2, we have prepared a translation of an overview article about Spark 3.0.
Spark 3.0 came out with a whole bunch of important improvements, including: improved performance with AQE, reading binary files, improved SQL and Python support, Python 3, Hadoop 3 integration, and ACID support.
In this article, the author gives examples of how these new features are used. This is the first article on the functionality of Spark 3.0, and the series is planned to continue.
This article highlights the following features in Spark 3.0:
Adaptive Query Execution (AQE) Framework
Support for new languages
New interface for structured streaming
Reading binary files
Recursive folder browsing
Multiple data delimiter support (||)
New built-in Spark functions
Switch to Proleptic Gregorian Calendar
DataFrame tail() function
REPARTITION hint in SQL queries
Improved ANSI SQL compatibility
Adaptive Query Execution (AQE) is, perhaps, the most important new feature of Spark 3.0. It re-optimizes and adjusts query plans based on runtime statistics collected during the execution of the query.
Before 3.0, Spark built a query execution plan once, before execution started, and could not revise it afterwards, even if the data turned out to be distributed differently than the optimizer assumed. AQE removes this limitation: it gathers statistics at runtime and re-optimizes the remaining stages of the plan accordingly.
By default, Adaptive Query Execution (AQE) is disabled. To enable it, set spark.sql.adaptive.enabled to true (see the configuration sketch after the list below). With AQE enabled, Spark shows a significant performance improvement on the TPC-DS benchmark compared to Spark 2.4.
AQE in Spark 3.0 includes 3 features:
dynamically coalescing shuffle partitions
dynamically switching join strategies from sort-merge join to broadcast join
dynamically optimizing skew joins
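Below is a minimal sketch of enabling AQE and its sub-features when building a session. The three config keys are the standard Spark 3.0 settings for the features listed above; the app name and master are arbitrary choices for this example.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")   // arbitrary name for this sketch
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")                    // master switch for AQE
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce shuffle partitions at runtime
  .config("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed join partitions
  .getOrCreate()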
Spark 3.0 supports the following language versions:
Python3 (Python 2.x is deprecated)
Scala 2.12
JDK 11
It also adds support for Hadoop 3, Kafka 2.4.1 and other updated dependencies.
New interface for Spark Structured Streaming
Until now, the Spark web UI had no dedicated page for streaming queries: to monitor them you had to pull metrics out of the logs or rely on third-party tools, which was inconvenient. Spark 3.0 adds a Structured Streaming tab to the web UI.
The tab consists of 2 sections:
Image: Databricks
«Active Streaming Queries» shows the queries that are currently running, while «Completed Streaming Queries» shows the finished ones.
Clicking the Run ID of a query opens its detailed statistics: Input Rate, Process Rate, Input Rows, Batch Duration, Operation Duration and so on. The screenshot above was taken in Databricks.
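To see the new tab in action, any running streaming query will do. Here is a minimal sketch, assuming an existing SparkSession named spark; the built-in rate source and console sink are used only to generate some activity.
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("rate")                 // built-in source emitting (timestamp, value) rows
  .option("rowsPerSecond", "10")
  .load()

val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
// While the query runs, the Structured Streaming tab of the web UI
// (http://localhost:4040 for a local session) shows its statistics;
// query.stop() terminates it.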
Spark 3.0 supports a new data source format, “binaryFile”, for reading binary files.
With the binaryFile format, the DataFrameReader can read files such as image, pdf, zip, gzip, tar and others. Each file becomes a single record in the resulting DataFrame, containing the raw content along with the file's metadata.
val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- content: binary (nullable = true)
+--------------------+--------------------+------+--------------------+
| path| modificationTime|length| content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina…|2020-07-25 10:11:…| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+
Spark 3.0 introduces the recursiveFileLookup option. When it is set to true, the DataFrameReader recursively loads files from all nested subfolders of the input path.
spark.read.option("recursiveFileLookup", "true").csv("/path/to/folder")
Spark 3.0 supports multi-character delimiters (for example, ||) when reading CSV files. Suppose you have a CSV file with the following contents:
col1||col2||col3||col4
val1||val2||val3||val4
val1||val2||val3||val4
It can be read by specifying the delimiter option:
val df = spark.read
.option("delimiter","||")
.option("header","true")
.csv("/tmp/data/douplepipedata.csv")
Spark 2.x did not support multi-character delimiters, and you had to resort to workarounds. Attempting to read such a file produced an error:
throws java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||
New built-in Spark functions
Spark SQL already covers most of the functions available in relational databases, and Spark 3.0 adds a number of new built-in functions to the list:
sinh, cosh, tanh, asinh, acosh, atanh, any, bit_and, bit_or, bit_count, bit_xor,
bool_and, bool_or, count_if, date_part, extract, forall, from_csv,
make_date, make_interval, make_timestamp, map_entries,
map_filter, map_zip_with, max_by, min_by, schema_of_csv, to_csv,
transform_keys, transform_values, typeof, version,
xxhash64
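A quick sketch of a few of these functions through Spark SQL; the sample data and names here are made up for illustration, assuming an existing SparkSession named spark.
import spark.implicits._

val sales = Seq(("a", 10), ("b", 25), ("c", 5)).toDF("item", "amount")
sales.createOrReplaceTempView("sales")

spark.sql("""
  SELECT count_if(amount > 8) AS big_sales,  -- rows matching a predicate
         max_by(item, amount) AS top_item,   -- item with the largest amount
         typeof(amount)       AS amount_type -- SQL type name of the column
  FROM sales
""").show()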
Historically, Spark used a hybrid calendar: the Julian calendar for dates before 1582 and the Gregorian calendar for later ones.
The old behaviour was inherited from the JDK 7 java.sql.Date API, which relies on the hybrid calendar. In JDK 8 it was superseded by the java.time.LocalDate API, which uses the Proleptic Gregorian calendar.
Spark 3.0 switches to the Proleptic Gregorian calendar, the calendar already used by Pandas, R and Apache Arrow. For dates from October 15, 1582 onwards, the Date & Timestamp values produced by Spark 3.0 match those of earlier versions; behaviour differs only for dates before October 15, 1582.
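When old files written with the hybrid calendar contain such early dates, Spark 3.0 lets you choose how to rebase them on read and write. A sketch using the Parquet rebase settings available in Spark 3.0 (accepted values are EXCEPTION, CORRECTED and LEGACY):
// How to interpret pre-1582 dates found in existing Parquet files:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
// How to write them out for consumers still on the hybrid calendar:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")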
Spark 3.0 also introduces new Date & Timestamp constructor functions:
make_date(), make_timestamp() and make_interval().
make_date(year, month, day)
– creates a date from the <year>, <month> and <day> fields.
make_date(2014, 8, 13)
//returns 2014-08-13.
make_timestamp(year, month, day, hour, min, sec[, timezone])
– creates a Timestamp from the <year>, <month>, <day>, <hour>, <min>, <sec> and optional <timezone> fields.
make_timestamp(2014, 8, 13, 1, 10, 40.147)
//returns Timestamp 2014-08-13 1:10:40.147
make_timestamp(2014, 8, 13, 1, 10, 40.147, 'CET')
//returns the same Timestamp in the CET time zone
make_interval(years, months, weeks, days, hours, mins, secs)
– creates an interval from the given fields.
If invalid values are passed to make_date() or make_timestamp(), they return null instead of raising an error; any field not specified for make_interval() defaults to 0.
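As a runnable sketch (assuming an existing SparkSession named spark), the same constructors can be called through Spark SQL:
spark.sql("""
  SELECT make_date(2014, 8, 13)                     AS d,
         make_timestamp(2014, 8, 13, 1, 10, 40.147) AS ts,
         make_interval(0, 0, 2, 3, 0, 0, 0)         AS iv -- 2 weeks and 3 days
""").show(false)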
DataFrame.tail()
Spark has long had the head() action to retrieve the first records of a dataset, but, unlike Pandas in Python, it offered no tail() action to retrieve the last ones. Spark 3.0 adds tail(), which returns the last records of a DataFrame. In Scala, tail() returns scala.Array[T].
val data=spark.range(1,100).toDF("num").tail(5)
data.foreach(print)
//Returns
//[95][96][97][98][99]
REPARTITION hint in SQL queries
In Spark you change the number of partitions of a Dataset/DataFrame with the repartition() transformation, but until now there was no way to do this from Spark SQL. Spark 3.0 adds support for the REPARTITION hint in SQL queries. Let's look at an example.
val df = spark.range(1,10000).toDF("num")
println("Before re-partition :" + df.rdd.getNumPartitions)
df.createOrReplaceTempView("RANGE_TABLE")
val df2 = spark.sql("SELECT /*+ REPARTITION(20) */ * FROM RANGE_TABLE")
println("After re-partition :" + df2.rdd.getNumPartitions)
//Returns
//Before re-partition :1
//After re-partition :20
Improved ANSI SQL compatibility
Since Spark is largely used by data engineers who come from a database background, ANSI SQL compatibility has been in high demand, and Spark 3.0 partially implements it. To enable it, set spark.sql.parser.ansi.enabled to true.
Full ANSI SQL compliance in Spark is still a work in progress.
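A small sketch of what the mode changes, assuming an existing SparkSession named spark; note that in the released Spark 3.0 builds this switch is exposed as spark.sql.ansi.enabled.
// Legacy (non-ANSI) behaviour: an invalid cast silently yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show() // prints null
// ANSI behaviour: the same cast raises a runtime error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST('abc' AS INT)").show() // would throw an exception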
Learn more about Newprolab's Apache Spark programs:
Apache Spark for Data Engineers (Scala). 11 lessons, 5 weeks.
Apache Spark (Python). 6 lessons, 5 weeks.