Course Overview
Introduction to Working with Apache Spark is a two-day, fast-paced course that gives students a thorough introduction to the Spark environment, its benefits, features, common uses and tools. Working in a hands-on learning environment, students will learn where Spark fits into the Big Data ecosystem and how to use Spark for critical data analysis. The course also explores key Spark features and technologies such as the Spark shell for interactive data analysis, Spark internals, RDDs, DataFrames and Spark SQL.
Key Learning Areas
Students will learn the Spark ecosystem and its tools. By the end of the course, attendees will be proficient in the Spark technologies covered, including the Spark shell, RDDs, DataFrames and Spark SQL.
Course Outline
Spark Basics
- Background and history
- Spark and Hadoop
- Spark concepts and architecture
- Spark ecosystem (Spark Core, Spark SQL, MLlib, Spark Streaming)
First Look at Spark
- Spark in local mode
- Spark web UI
- Spark shell
- Analyzing a dataset – Part 1
- Inspecting RDDs
RDDs in Depth
- Partitions
- RDD operations/transformations
- RDD types
- MapReduce on RDD
- Caching and persistence
- Sharing cached RDDs
Spark SQL and DataFrames
- DataFrames
- DataFrame DDL
- Spark SQL
- Defining tables and importing datasets
- Queries
Who Benefits
This course is geared toward developers and architects seeking proficiency in Spark tools and technologies.
Prerequisites
Attendees should be experienced developers who are comfortable programming in Java, Scala or Python. Students should also be able to navigate the Linux command line and have basic knowledge of a Linux text editor (such as vi or nano) for editing code.