Course Overview

The Spark framework supports streaming data processing and complex, iterative algorithms, enabling some applications to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated parallel applications that support faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

Apache Spark for Data Scientists is a three-day, hands-on course geared toward technical business professionals who wish to solve real-world, data-related problems using Apache Spark. The course explores using Apache Spark for common data-related activities. Students will learn to build complete, unified big data applications that combine batch, streaming, and interactive analytics on all their data.

Course Outline

  • Spark Overview
  • Introduction to Spark
  • DataFrames
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Spark GraphX
  • Performance and Tuning
  • Cluster Mode

Who Benefits

This is an intermediate-level course for users with basic Apache Spark experience. Typical attendees include systems administrators, testers, and people in technical, data-related roles who need to learn to use Spark for data analysis or data processing.

Prerequisites

Attending students should have the following background:

  • Introduction to Java Programming (at least exposure to basic Java syntax)
  • Introduction to SQL (or equivalent)
  • Statistics and Probability
  • Data Science background