Course Overview

Apache Spark, a significant component of the Hadoop ecosystem, is a cluster computing engine for Big Data. Building on Hadoop YARN and HDFS, it offers order-of-magnitude faster processing than MapReduce for many in-memory computing tasks. It can be programmed in Java, Scala, Python, and R, the favorite languages of Data Scientists, and it also supports SQL-based front-ends.

With advanced libraries like Mahout and MLlib for Machine Learning, GraphX or Neo4j for rich graph data processing, and access to NoSQL data stores, rule engines, and other enterprise components, Spark is a lynchpin in modern Big Data and Data Science computing.

Developing for Apache Spark is a comprehensive, lab-intensive, hands-on course at the intermediate level and beyond. The majority of this course is offered in the Java programming language, with alternatives available in R, Python, and Scala. Our team will work with you to coordinate the languages, tools, and environment that will work best for your organization and needs.

Key Learning Areas

Working in a hands-on learning environment, students will learn where Spark fits into the Big Data ecosystem and how to use Spark for critical data analysis. The course also explores key Spark features and technologies such as the Spark shell for interactive data analysis, Spark internals, RDDs, DataFrames, and Spark SQL. Students will also learn the Spark ecosystem and its tools. At the end of the course, students will be proficient in Spark technology for advanced use in the Big Data and Hadoop ecosystem.

Course Outline

  • Spark Overview
  • Spark Component Overview
  • RDDs: Resilient Distributed Datasets
  • DataFrames
  • Spark Applications
  • DataFrame Persistence
  • Spark Streaming
  • Accessing NoSQL Data
  • Enterprise Integration
  • Algorithms and Patterns
  • Spark SQL
  • GraphX
  • Alternate Languages
  • Clustering Spark for Developers
  • Performance and Tuning


Skills-Focused, Hands-On Learning: This course is approximately 50% hands-on lab and 50% lecture, combining engaging instructor presentation, demos, and practical group discussions with lab-intensive, machine-based student exercises. Our development and testing courses include a wide range of complementary materials and labs to ensure all students are appropriately challenged, no matter their incoming skill level.

Who Benefits

This course is geared toward experienced Developers and Architects (with development experience) who want to become proficient in advanced, modern development with Apache Spark in an enterprise data environment.


Attendees should be experienced developers who are comfortable programming in Java, Scala, or Python (depending on which language flavor of the course they are attending) and have basic exposure to working with Spark. Students should also be able to navigate the Linux command line and have basic knowledge of Linux editors (such as vi or nano) for editing code.