Spark Core

Target Audience
Programmers and developers familiar with Apache Spark who wish to expand their skill sets


Expected Duration
125 minutes

Spark Core provides basic I/O functionality, distributed task dispatching, and scheduling. Resilient Distributed Datasets (RDDs) are logical collections of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying transformations to existing RDDs. In this course, you will learn how to improve Spark's performance and work with DataFrames and Spark SQL.


Spark RDDs

  • start the course
  • recall what is included in the Spark Stack
  • define lazy evaluation as it relates to Spark
  • recall that an RDD is an interface composed of a set of partitions, a list of dependencies, and functions to compute
  • pre-partition an RDD for performance
  • store RDDs in serialized form
  • perform numeric operations on RDDs
  • create custom accumulators
  • use broadcast functionality for optimization
  • pipe to external applications
  • adjust garbage collection settings
  • perform batch import on a Spark cluster
  • determine memory consumption
  • tune data structures to reduce memory consumption
  • use Spark’s different shuffle operations to minimize memory usage of reduce tasks
  • set the levels of parallelism for each operation

DataFrames and Spark SQL

  • create DataFrames
  • interoperate with RDDs
  • describe the generic load and save functions
  • read and write Parquet files
  • use JSON Dataset as a DataFrame
  • read and write data in Hive tables
  • read and write data using JDBC
  • run the Thrift JDBC/ODBC server

Practice: Tuning Spark

  • show the different ways to tune Spark for better performance
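Many of the tuning levers covered in this course are exposed as Spark configuration properties. The `spark-submit` invocation below is a config sketch only; the property names are real Spark settings, but the values and the application file name are illustrative, not recommendations:

```shell
# Illustrative spark-submit flags touching serialization, parallelism,
# shuffle partitions, and executor GC settings (values are examples only).
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.default.parallelism=200 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails" \
  my_app.py
```

Appropriate values depend on cluster size and workload; the GC logging flag is typically used to measure memory behavior before adjusting anything else.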
