Course Overview

This three-day course gives developers and data analysts a gentle, immersive, hands-on introduction to the Python programming language and Apache PySpark.

NOTE: To offer the broadest range of courses and class dates, this course may be taught by either Wintellect or one of our training partners.

Key Learning Areas

  • Introduction to Python
  • Python Scripts
  • Data Types and Variables
  • Python Collections
  • Control Statements and Looping
  • Functions in Python
  • Working with Data in Python
  • Reading and Writing Text Files
  • Functional Programming Primer
  • Introduction to Apache Spark
  • The Spark Shell
  • Spark RDDs
  • Parallel Data Processing with Spark
  • Shared Variables in Spark
  • Introduction to Spark SQL
  • Repairing and Normalizing Data
  • Data Grouping and Aggregation in Python

Course Outline

Introduction to Python

  • What is Python?
  • Uses of Python
  • Installing Python
  • Python Package Manager (pip)
  • Using the Python Shell
  • Python Code Conventions
  • Importing Modules
  • The help(object) Command
  • The Help Prompt
  • Summary
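
To give a flavor of the module, here is a minimal sketch of a first interactive session; the math module is just an example:

    # Import a standard-library module and try it out.
    import math

    print(math.sqrt(16))   # 4.0

    # help() works on modules, functions, and objects alike.
    help(math.sqrt)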

Python Scripts

  • Executing Python Code
  • Python Scripts
  • Writing Scripts
  • Running Python Scripts
  • Self-Executing Scripts
  • Accepting Command-Line Parameters
  • Accepting Interactive Input
  • Retrieving Environment Settings
  • Summary
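
A minimal sketch of the kind of script this module builds toward; the file name and environment variable are illustrative:

    #!/usr/bin/env python3
    # greet.py -- a self-executing script that reads command-line
    # parameters, interactive input, and an environment setting.
    import os
    import sys

    if __name__ == "__main__":
        print("Arguments:", sys.argv[1:])        # parameters after the script name
        name = input("Your name: ")              # interactive input
        print("Hello,", name)
        print("HOME =", os.environ.get("HOME", "<not set>"))   # environment setting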

Data Types and Variables

  • Creating Variables
  • Displaying Variables
  • Basic Concatenation
  • Data Types
  • Strings
  • Strings as Arrays
  • String Methods
  • Combining Strings and Numbers
  • Numeric Types
  • Integer Types
  • Floating Point Types
  • Boolean Types
  • Checking Data Type
  • Summary
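
The topics above boil down to code like the following minimal sketch (values are illustrative):

    # Variables, strings, and numeric types.
    city = "Boston"
    population = 675_647          # int
    area_sq_mi = 48.4             # float
    is_capital = True             # bool

    # Strings behave like arrays of characters and have rich methods.
    print(city[0], city.upper(), len(city))

    # Combining strings and numbers requires an explicit conversion.
    print(city + ": " + str(population))

    # type() reports the data type of any value.
    print(type(population), type(area_sq_mi), type(is_capital))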

Python Collections

  • Python Collections
  • List Type
  • Modifying Lists
  • Sorting a List
  • Tuple Type
  • Python Sets
  • Modifying Sets
  • Dictionary (Map) Type
  • Dictionary Methods
  • Sequences
  • Summary
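
A minimal sketch showing the four core collection types covered above (values are illustrative):

    colors = ["red", "green", "blue"]        # list: ordered, mutable
    point = (3, 4)                           # tuple: ordered, immutable
    tags = {"python", "spark", "python"}     # set: unique elements, unordered
    ages = {"ann": 34, "bob": 29}            # dictionary: key/value pairs

    colors.append("white")                   # modifying a list
    colors.sort()                            # sorting a list in place
    tags.add("sql")                          # modifying a set
    print(ages.keys(), ages.get("ann"))      # dictionary methods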

Control Statements and Looping

  • If Statement
  • elif Keyword
  • Boolean Conditions
  • Single Line If Statements
  • For-in Loops
  • Looping over an Index
  • Range Function
  • Nested Loops
  • While Loops
  • Exception Handling
  • Built-in Exceptions
  • Exceptions Thrown by Built-In Functions
  • Summary
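
The following short sketch ties these constructs together (values are illustrative):

    # Branching, looping, and exception handling.
    for n in range(1, 6):              # range() drives index-based loops
        if n % 2 == 0:
            print(n, "is even")
        elif n == 5:
            print(n, "is the last value")
        else:
            print(n, "is odd")

    total, values = 0, [10, 20, "oops"]
    for v in values:
        try:
            total += int(v)            # int("oops") raises ValueError
        except ValueError as e:
            print("Skipping bad value:", e)
    print("Total:", total)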

Functions in Python

  • Defining Functions
  • Using Functions
  • Function Parameters
  • Named Parameters
  • Variable Length Parameter List
  • How Parameters are Passed
  • Variable Scope
  • Returning Values
  • Summary
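
A minimal sketch of the definition and calling conventions covered in this module (names and values are illustrative):

    # A function with a variable-length parameter list and a named parameter.
    def describe(name, *scores, units="points"):
        """Return a summary line; *scores collects extra positional arguments."""
        avg = sum(scores) / len(scores) if scores else 0
        return f"{name}: {avg:.1f} {units}"

    print(describe("ann", 90, 85, 99))
    print(describe("bob", 70, units="percent"))   # named parameter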

Working with Data in Python

  • Data Type Conversions
  • Conversions from Other Types to Integer
  • Conversions from Other Types to Float
  • Conversions from Other Types to String
  • Conversions from Other Types to Boolean
  • Converting Between Set, List and Tuple Data Structures
  • Modifying Tuples
  • Combining Set, List and Tuple Data Structures
  • Creating Dictionaries from other Data Structures
  • Summary
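
A minimal sketch of the conversions listed above (values are illustrative):

    # Converting between scalar types.
    print(int("42"), float("3.5"), str(99), bool(""))   # "" converts to False

    # Converting between set, list, and tuple structures.
    nums = [3, 1, 3, 2]
    unique = set(nums)            # list -> set drops duplicates
    frozen = tuple(unique)        # set -> tuple
    back = list(frozen)           # tuple -> list

    # Tuples are immutable: convert to a list to "modify" one.
    t = (1, 2, 3)
    t = tuple(list(t) + [4])

    # Building a dictionary from two parallel sequences.
    d = dict(zip(["a", "b"], [1, 2]))
    print(unique, frozen, back, t, d)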

Reading and Writing Text Files

  • Opening a File
  • Writing a File
  • Reading a File
  • Appending to a File
  • File Operations Using the with Statement
  • File and Directory Operations
  • Reading JSON
  • Writing JSON
  • Summary
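
A minimal sketch of the file-handling patterns above; file names and contents are illustrative:

    import json

    # The with statement closes the file even if an error occurs.
    with open("notes.txt", "w") as f:       # mode "a" would append instead
        f.write("first line\n")

    with open("notes.txt") as f:            # the default mode is read
        print(f.read())

    # Reading and writing JSON.
    with open("config.json", "w") as f:
        json.dump({"retries": 3}, f)
    with open("config.json") as f:
        print(json.load(f)["retries"])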

Functional Programming Primer

  • What is Functional Programming?
  • Benefits of Functional Programming
  • Functions as Data
  • Using the map() Function
  • Using the filter() Function
  • Lambda Expressions
  • list.sort() with a Lambda Expression
  • Differences Between Simple Loops and map()/filter()-Style Functions
  • Additional Functions
  • Summary
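
A minimal sketch of map/filter and lambda expressions as covered above (data is illustrative):

    nums = [4, 1, 9, 2]

    squares = list(map(lambda n: n * n, nums))        # transform each element
    evens = list(filter(lambda n: n % 2 == 0, nums))  # keep matching elements

    # list.sort() with a lambda expression as the key function.
    words = ["banana", "fig", "apple"]
    words.sort(key=lambda w: len(w))

    print(squares, evens, words)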

Introduction to Apache Spark

  • What is Apache Spark?
  • A Short History of Spark
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark Streaming (Micro-batching)
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
  • GraphX
  • Spark vs R
  • Summary
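
To make the pieces concrete, here is a minimal sketch of a PySpark application that could be launched with the spark-submit tool; the file name and input path are illustrative, and a local Spark installation is assumed:

    # word_count.py -- submit with:  spark-submit word_count.py input.txt
    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext                    # the underlying SparkContext

    counts = (sc.textFile(sys.argv[1])         # RDD of lines
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))   # a pair RDD
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()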

The Spark Shell

  • The Spark Shell
  • The Spark v2+ Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and Spark Session (spark)
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary
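
The following lines only make sense inside the PySpark shell, where the sc and spark objects are pre-created; the paths are illustrative:

    # Loading files:
    rdd = sc.textFile("data/events.log")           # via the SparkContext (sc)
    df = spark.read.json("data/events.json")       # via the SparkSession (spark)

    # Saving files:
    rdd.saveAsTextFile("out/events_copy")
    df.write.mode("overwrite").parquet("out/events.parquet")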

Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Custom RDDs
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Checkpointing RDDs
  • Local Checkpointing
  • Parallelized Collections
  • More on the parallelize() Method
  • The Pair RDD
  • Where do I use Pair RDDs?
  • Example of Creating a Pair RDD with map()
  • Example of Creating a Pair RDD with keyBy()
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • RDD Persistence
  • Summary
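
A minimal sketch of RDD creation, transformations, actions, pair RDDs, and caching (data is illustrative; a local PySpark installation is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
    sc = spark.sparkContext

    # A parallelized collection; transformations are lazy, actions execute.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    doubled = rdd.map(lambda n: n * 2)        # transformation (lazy)
    print(doubled.collect())                  # action

    # A pair RDD built with keyBy, then a per-key operation.
    pairs = sc.parallelize(["apple", "avocado", "beet"]).keyBy(lambda w: w[0])
    print(pairs.groupByKey().mapValues(list).collect())

    doubled.cache()                           # caching for reuse
    spark.stop()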

Parallel Data Processing with Spark

  • Running Spark on a Cluster
  • Spark Stand-alone Option
  • The High-Level Execution Flow in a Stand-alone Spark Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The "Big Picture"
  • Summary
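
A minimal sketch of explicit partitioning and a shuffle-producing operation (a local PySpark installation is assumed; values are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Partitions").getOrCreate()
    sc = spark.sparkContext

    # Explicitly request 4 partitions; each is processed in parallel.
    rdd = sc.parallelize(range(12), 4)
    print(rdd.getNumPartitions())             # 4
    print(rdd.glom().collect())               # inspect the per-partition layout

    # reduceByKey shuffles data between partitions, creating a stage boundary.
    print(rdd.map(lambda n: (n % 2, n)).reduceByKey(lambda a, b: a + b).collect())
    spark.stop()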

Shared Variables in Spark

  • Shared Variables in Spark
  • Broadcast Variables
  • Creating and Using Broadcast Variables
  • Example of Using Broadcast Variables
  • Problems with Global Variables
  • Example of the Closure Problem
  • Accumulators
  • Creating and Using Accumulators
  • Example of Using Accumulators (Scala Example)
  • Example of Using Accumulators (Python Example)
  • Custom Accumulators
  • Summary
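
A minimal sketch of a broadcast variable and an accumulator in PySpark (data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SharedVars").getOrCreate()
    sc = spark.sparkContext

    # A broadcast variable ships a read-only lookup table to every executor.
    codes = sc.broadcast({"US": "United States", "FR": "France"})
    rdd = sc.parallelize(["US", "FR", "US"])
    print(rdd.map(lambda c: codes.value[c]).collect())

    # An accumulator aggregates safely across tasks (a plain global would not).
    bad = sc.accumulator(0)

    def check(c):
        if c not in codes.value:
            bad.add(1)

    sc.parallelize(["US", "XX"]).foreach(check)
    print("Unknown codes:", bad.value)
    spark.stop()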

Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • What is a DataFrame?
  • The SQLContext Object
  • Example of Spark SQL (PySpark Example)
  • Example of Reading / Writing a JSON File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance & Scalability of Spark SQL
  • Summary
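
A minimal PySpark sketch of the DataFrame and SQL workflow above (data and paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

    # Build a DataFrame, register it as a view, and query it with SQL.
    df = spark.createDataFrame([("ann", 34), ("bob", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # Uniform data access: the same API reads and writes JSON.
    df.write.mode("overwrite").json("out/people.json")
    spark.read.json("out/people.json").show()
    spark.stop()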

Repairing and Normalizing Data

  • Repairing and Normalizing Data
  • Dealing with Missing Data
  • Sample Data Set
  • Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
  • Summary
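
A minimal pandas/scikit-learn sketch of the repair and normalization steps above (data is illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, scale

    df = pd.DataFrame({"temp": [61.0, np.nan, 64.0, 66.0]})
    print(df["temp"].isnull().sum())                      # info on null data

    df["filled"] = df["temp"].fillna(df["temp"].mean())   # mean imputation
    df["interp"] = df["temp"].interpolate()               # interpolation
    print(df.drop(columns=["temp"]))                      # dropping a column

    # Normalizing: zero mean / unit variance, then scaling to the 0..1 range.
    print(scale(df[["filled"]]))
    print(MinMaxScaler().fit_transform(df[["filled"]]))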

Data Grouping and Aggregation in Python

  • Data Aggregation and Grouping
  • Sample Data Set
  • The pandas.core.groupby.SeriesGroupBy Object
  • Grouping by Two or More Columns
  • Emulating SQL's WHERE Clause
  • Pivot Tables
  • Cross-Tabulation
  • Summary
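
A minimal pandas sketch of grouping, pivoting, and cross-tabulation (data is illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "dept":   ["sales", "sales", "hr", "hr"],
        "city":   ["NYC", "LA", "NYC", "LA"],
        "salary": [90, 85, 70, 75],
    })

    # Grouping by one column and by two columns.
    print(df.groupby("dept")["salary"].mean())
    print(df.groupby(["dept", "city"])["salary"].sum())

    # Emulating SQL's WHERE with boolean indexing, then a pivot table.
    print(df[df["salary"] > 80])
    print(df.pivot_table(values="salary", index="dept", columns="city"))

    # Cross-tabulation of two categorical columns.
    print(pd.crosstab(df["dept"], df["city"]))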

Lab Exercises

Lab 1. Introduction to Python
Lab 2. Creating Scripts
Lab 3. Variables in Python
Lab 4. Collections
Lab 5. Control Statements and Loops
Lab 6. Functions in Python
Lab 7. Reading and Writing Text Files
Lab 8. Functional Programming
Lab 9. The PySpark Shell
Lab 10. Data Transformation with PySpark
Lab 11. RDD Performance Improvement Techniques with PySpark
Lab 12. Spark SQL with PySpark
Lab 13. Repairing and Normalizing Data
Lab 14. Data Grouping and Aggregation

Who Benefits

Developers and data analysts

Prerequisites

Programming or scripting experience in a language other than Python