Course Overview

This three-day course gives developers and data analysts a gentle, immersive, hands-on introduction to the Python programming language and Apache PySpark.

NOTE: To offer the broadest range of courses and class dates, this course may be taught by either Wintellect or one of our training partners.

Key Learning Areas

  • Introduction to Python
  • Python Scripts
  • Data Types and Variables
  • Python Collections
  • Control Statements and Looping
  • Functions in Python
  • Working with Data in Python
  • Reading and Writing Text Files
  • Functional Programming Primer
  • Introduction to Apache Spark
  • The Spark Shell
  • Spark RDDs
  • Parallel Data Processing with Spark
  • Shared Variables in Spark
  • Introduction to Spark SQL
  • Repairing and Normalizing Data
  • Data Grouping and Aggregation in Python

Course Outline

Introduction to Python

  • What is Python?
  • Uses of Python
  • Installing Python
  • Python Package Manager (pip)
  • Using the Python Shell
  • Python Code Conventions
  • Importing Modules
  • The help(object) Command
  • The Help Prompt
  • Summary
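
To give a flavor of the module, here is a minimal sketch of a first interactive session; the math module is just an example:

    # Import a standard-library module and try it out.
    import math

    print(math.sqrt(16))   # 4.0

    # help() works on modules, functions, and objects alike.
    help(math.sqrt)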

Python Scripts

  • Executing Python Code
  • Python Scripts
  • Writing Scripts
  • Running Python Scripts
  • Self-Executing Scripts
  • Accepting Command-Line Parameters
  • Accepting Interactive Input
  • Retrieving Environment Settings
  • Summary
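
A minimal sketch of the kind of script this module builds toward; the file name and environment variable are illustrative:

    #!/usr/bin/env python3
    # greet.py -- a self-executing script that reads command-line
    # parameters, interactive input, and an environment setting.
    import os
    import sys

    if __name__ == "__main__":
        print("Arguments:", sys.argv[1:])        # parameters after the script name
        name = input("Your name: ")              # interactive input
        print("Hello,", name)
        print("HOME =", os.environ.get("HOME", "<not set>"))   # environment setting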

Data Types and Variables

  • Creating Variables
  • Displaying Variables
  • Basic Concatenation
  • Data Types
  • Strings
  • Strings as Arrays
  • String Methods
  • Combining Strings and Numbers
  • Numeric Types
  • Integer Types
  • Floating Point Types
  • Boolean Types
  • Checking Data Type
  • Summary
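
The topics above boil down to code like the following minimal sketch (values are illustrative):

    # Variables, strings, and numeric types.
    city = "Boston"
    population = 675_647          # int
    area_sq_mi = 48.4             # float
    is_capital = True             # bool

    # Strings behave like arrays of characters and have rich methods.
    print(city[0], city.upper(), len(city))

    # Combining strings and numbers requires an explicit conversion.
    print(city + ": " + str(population))

    # type() reports the data type of any value.
    print(type(population), type(area_sq_mi), type(is_capital))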

Python Collections

  • Python Collections
  • List Type
  • Modifying Lists
  • Sorting a List
  • Tuple Type
  • Python Sets
  • Modifying Sets
  • Dictionary (Map) Type
  • Dictionary Methods
  • Sequences
  • Summary
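
A minimal sketch showing the four core collection types covered above (values are illustrative):

    colors = ["red", "green", "blue"]        # list: ordered, mutable
    point = (3, 4)                           # tuple: ordered, immutable
    tags = {"python", "spark", "python"}     # set: unique elements, unordered
    ages = {"ann": 34, "bob": 29}            # dictionary: key/value pairs

    colors.append("white")                   # modifying a list
    colors.sort()                            # sorting a list in place
    tags.add("sql")                          # modifying a set
    print(ages.keys(), ages.get("ann"))      # dictionary methods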

Control Statements and Looping

  • If Statement
  • elif Keyword
  • Boolean Conditions
  • Single Line If Statements
  • For-in Loops
  • Looping over an Index
  • Range Function
  • Nested Loops
  • While Loops
  • Exception Handling
  • Built-in Exceptions
  • Exceptions Thrown by Built-In Functions
  • Summary
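
The following short sketch ties these constructs together (values are illustrative):

    # Branching, looping, and exception handling.
    for n in range(1, 6):              # range() drives index-based loops
        if n % 2 == 0:
            print(n, "is even")
        elif n == 5:
            print(n, "is the last value")
        else:
            print(n, "is odd")

    total, values = 0, [10, 20, "oops"]
    for v in values:
        try:
            total += int(v)            # int("oops") raises ValueError
        except ValueError as e:
            print("Skipping bad value:", e)
    print("Total:", total)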

Functions in Python

  • Defining Functions
  • Using Functions
  • Function Parameters
  • Named Parameters
  • Variable Length Parameter List
  • How Parameters are Passed
  • Variable Scope
  • Returning Values
  • Summary
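
A minimal sketch of the definition and calling conventions covered in this module (names and values are illustrative):

    # A function with a variable-length parameter list and a named parameter.
    def describe(name, *scores, units="points"):
        """Return a summary line; *scores collects extra positional arguments."""
        avg = sum(scores) / len(scores) if scores else 0
        return f"{name}: {avg:.1f} {units}"

    print(describe("ann", 90, 85, 99))
    print(describe("bob", 70, units="percent"))   # named parameter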

Working with Data in Python

  • Data Type Conversions
  • Conversions from Other Types to Integer
  • Conversions from Other Types to Float
  • Conversions from Other Types to String
  • Conversions from Other Types to Boolean
  • Converting Between Set, List and Tuple Data Structures
  • Modifying Tuples
  • Combining Set, List and Tuple Data Structures
  • Creating Dictionaries from other Data Structures
  • Summary
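
A minimal sketch of the conversions listed above (values are illustrative):

    # Converting between scalar types.
    print(int("42"), float("3.5"), str(99), bool(""))   # "" converts to False

    # Converting between set, list, and tuple structures.
    nums = [3, 1, 3, 2]
    unique = set(nums)            # list -> set drops duplicates
    frozen = tuple(unique)        # set -> tuple
    back = list(frozen)           # tuple -> list

    # Tuples are immutable: convert to a list to "modify" one.
    t = (1, 2, 3)
    t = tuple(list(t) + [4])

    # Building a dictionary from two parallel sequences.
    d = dict(zip(["a", "b"], [1, 2]))
    print(unique, frozen, back, t, d)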

Reading and Writing Text Files

  • Opening a File
  • Writing a File
  • Reading a File
  • Appending to a File
  • File Operations Using the with Statement
  • File and Directory Operations
  • Reading JSON
  • Writing JSON
  • Summary
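
A minimal sketch of the file-handling patterns above; file names and contents are illustrative:

    import json

    # The with statement closes the file even if an error occurs.
    with open("notes.txt", "w") as f:       # mode "a" would append instead
        f.write("first line\n")

    with open("notes.txt") as f:            # the default mode is read
        print(f.read())

    # Reading and writing JSON.
    with open("config.json", "w") as f:
        json.dump({"retries": 3}, f)
    with open("config.json") as f:
        print(json.load(f)["retries"])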

Functional Programming Primer

  • What is Functional Programming?
  • Benefits of Functional Programming
  • Functions as Data
  • Using the map() Function
  • Using the filter() Function
  • Lambda Expressions
  • list.sort() with a Lambda Expression
  • Differences Between Simple Loops and map()/filter()-Style Functions
  • Additional Functions
  • Summary
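
A minimal sketch of map/filter and lambda expressions as covered above (data is illustrative):

    nums = [4, 1, 9, 2]

    squares = list(map(lambda n: n * n, nums))        # transform each element
    evens = list(filter(lambda n: n % 2 == 0, nums))  # keep matching elements

    # list.sort() with a lambda expression as the key function.
    words = ["banana", "fig", "apple"]
    words.sort(key=lambda w: len(w))

    print(squares, evens, words)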

Introduction to Apache Spark

  • What is Apache Spark?
  • A Short History of Spark
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark Streaming (Micro-batching)
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
  • GraphX
  • Spark vs R
  • Summary
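
To make the pieces concrete, here is a minimal sketch of a PySpark application that could be launched with the spark-submit tool; the file name and input path are illustrative, and a local Spark installation is assumed:

    # word_count.py -- submit with:  spark-submit word_count.py input.txt
    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext                    # the underlying SparkContext

    counts = (sc.textFile(sys.argv[1])         # RDD of lines
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))   # a pair RDD
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()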

The Spark Shell

  • The Spark Shell
  • The Spark v2+ Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and Spark Session (spark)
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files
  • Summary
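
The following lines only make sense inside the PySpark shell, where the sc and spark objects are pre-created; the paths are illustrative:

    # Loading files:
    rdd = sc.textFile("data/events.log")           # via the SparkContext (sc)
    df = spark.read.json("data/events.json")       # via the SparkSession (spark)

    # Saving files:
    rdd.saveAsTextFile("out/events_copy")
    df.write.mode("overwrite").parquet("out/events.parquet")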

Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Custom RDDs
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Checkpointing RDDs
  • Local Checkpointing
  • Parallelized Collections
  • More on the parallelize() Method
  • The Pair RDD
  • Where do I use Pair RDDs?
  • Example of Creating a Pair RDD with map()
  • Example of Creating a Pair RDD with keyBy()
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • RDD Persistence
  • Summary
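
A minimal sketch of RDD creation, transformations, actions, pair RDDs, and caching (data is illustrative; a local PySpark installation is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
    sc = spark.sparkContext

    # A parallelized collection; transformations are lazy, actions execute.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    doubled = rdd.map(lambda n: n * 2)        # transformation (lazy)
    print(doubled.collect())                  # action

    # A pair RDD built with keyBy, then a per-key operation.
    pairs = sc.parallelize(["apple", "avocado", "beet"]).keyBy(lambda w: w[0])
    print(pairs.groupByKey().mapValues(list).collect())

    doubled.cache()                           # caching for reuse
    spark.stop()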

Parallel Data Processing with Spark

  • Running Spark on a Cluster
  • Spark Stand-alone Option
  • The High-Level Execution Flow in a Stand-alone Spark Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The "Big Picture"
  • Summary
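
A minimal sketch of explicit partitioning and a shuffle-producing operation (a local PySpark installation is assumed; values are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Partitions").getOrCreate()
    sc = spark.sparkContext

    # Explicitly request 4 partitions; each is processed in parallel.
    rdd = sc.parallelize(range(12), 4)
    print(rdd.getNumPartitions())             # 4
    print(rdd.glom().collect())               # inspect the per-partition layout

    # reduceByKey shuffles data between partitions, creating a stage boundary.
    print(rdd.map(lambda n: (n % 2, n)).reduceByKey(lambda a, b: a + b).collect())
    spark.stop()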

Shared Variables in Spark

  • Shared Variables in Spark
  • Broadcast Variables
  • Creating and Using Broadcast Variables
  • Example of Using Broadcast Variables
  • Problems with Global Variables
  • Example of the Closure Problem
  • Accumulators
  • Creating and Using Accumulators
  • Example of Using Accumulators (Scala Example)
  • Example of Using Accumulators (Python Example)
  • Custom Accumulators
  • Summary
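
A minimal sketch of a broadcast variable and an accumulator in PySpark (data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SharedVars").getOrCreate()
    sc = spark.sparkContext

    # A broadcast variable ships a read-only lookup table to every executor.
    codes = sc.broadcast({"US": "United States", "FR": "France"})
    rdd = sc.parallelize(["US", "FR", "US"])
    print(rdd.map(lambda c: codes.value[c]).collect())

    # An accumulator aggregates safely across tasks (a plain global would not).
    bad = sc.accumulator(0)

    def check(c):
        if c not in codes.value:
            bad.add(1)

    sc.parallelize(["US", "XX"]).foreach(check)
    print("Unknown codes:", bad.value)
    spark.stop()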

Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • What is a DataFrame?
  • The SQLContext Object
  • Example of Spark SQL (PySpark Example)
  • Example of Reading / Writing a JSON File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance & Scalability of Spark SQL
  • Summary
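
A minimal PySpark sketch of the DataFrame and SQL workflow above (data and paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

    # Build a DataFrame, register it as a view, and query it with SQL.
    df = spark.createDataFrame([("ann", 34), ("bob", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # Uniform data access: the same API reads and writes JSON.
    df.write.mode("overwrite").json("out/people.json")
    spark.read.json("out/people.json").show()
    spark.stop()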

Repairing and Normalizing Data

  • Repairing and Normalizing Data
  • Dealing with Missing Data
  • Sample Data Set
  • Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
  • Summary
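
A minimal pandas/scikit-learn sketch of the repair and normalization steps above (data is illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, scale

    df = pd.DataFrame({"temp": [61.0, np.nan, 64.0, 66.0]})
    print(df["temp"].isnull().sum())                      # info on null data

    df["filled"] = df["temp"].fillna(df["temp"].mean())   # mean imputation
    df["interp"] = df["temp"].interpolate()               # interpolation
    print(df.drop(columns=["temp"]))                      # dropping a column

    # Normalizing: zero mean / unit variance, then scaling to the 0..1 range.
    print(scale(df[["filled"]]))
    print(MinMaxScaler().fit_transform(df[["filled"]]))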

Data Grouping and Aggregation in Python

  • Data Aggregation and Grouping
  • Sample Data Set
  • The pandas.core.groupby.SeriesGroupBy Object
  • Grouping by Two or More Columns
  • Emulating SQL's WHERE Clause
  • Pivot Tables
  • Cross-Tabulation
  • Summary
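
A minimal pandas sketch of grouping, pivoting, and cross-tabulation (data is illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "dept":   ["sales", "sales", "hr", "hr"],
        "city":   ["NYC", "LA", "NYC", "LA"],
        "salary": [90, 85, 70, 75],
    })

    # Grouping by one column and by two columns.
    print(df.groupby("dept")["salary"].mean())
    print(df.groupby(["dept", "city"])["salary"].sum())

    # Emulating SQL's WHERE with boolean indexing, then a pivot table.
    print(df[df["salary"] > 80])
    print(df.pivot_table(values="salary", index="dept", columns="city"))

    # Cross-tabulation of two categorical columns.
    print(pd.crosstab(df["dept"], df["city"]))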

Lab Exercises

Lab 1. Introduction to Python
Lab 2. Creating Scripts
Lab 3. Variables in Python
Lab 4. Collections
Lab 5. Control Statements and Loops
Lab 6. Functions in Python
Lab 7. Reading and Writing Text Files
Lab 8. Functional Programming
Lab 9. The PySpark Shell
Lab 10. Data Transformation with PySpark
Lab 11. RDD Performance Improvement Techniques with PySpark
Lab 12. Spark SQL with PySpark
Lab 13. Repairing and Normalizing Data
Lab 14. Data Grouping and Aggregation

Who Benefits

Developers and data analysts

Prerequisites

Programming or scripting experience in a language other than Python