Course Overview

Working with Apache Spark is a three-day, hands-on course geared for technical business professionals who wish to solve real-world data-related problems using Apache Spark. This course explores using Apache Spark for common data-related activities, such as real-time processing of data (for example, financial and server data), data transformations, and ETL (Extract, Transform, Load). Students will learn to build complete, unified big data applications combining batch, streaming, and interactive analytics on all their data.

Key Learning Areas

  • Using the Spark shell for interactive data analysis
  • The features of Spark’s Resilient Distributed Datasets
  • How Spark runs on a cluster
  • Parallel programming with Spark
  • Writing Spark applications
  • Processing streaming data with Spark

Course Outline

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Working with RDDs in Spark

  • A Closer Look at RDDs
  • Key-Value Pair RDDs
  • MapReduce
  • Other Pair RDD Operations

Writing and Deploying Basic Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Who Benefits

This is an introductory-level course for basic-level Apache Spark users. Typical attendees include systems administrators, testers, and people in technical data-related roles who need to learn to use Spark for data analysis or data processing.

Prerequisites

This course has an IT/data/business user focus rather than a developer orientation. Although this is not a developer course, some familiarity with basic programming concepts (such as .jar files) would be helpful. These concepts can also be covered in the course as needed.