Spark

Over the last decade, there has been a giant leap in the technology available to the market for processing big data workloads. Driven by the demands of the largest tech companies, Spark was developed to address limitations in the MapReduce paradigm. Spark is open source and maintained by the Apache Software Foundation. The success of the technology has led to it being widely adopted and used by multiple services within Azure: HDInsight, Synapse, & Databricks.

Spark is a parallel processing framework that processes data in memory. It supports multiple sources and sinks for all kinds of data (structured, semi-structured, unstructured). Applications/pipelines/experiments can be written in Java, Scala, Python, R, & SQL! Your data is defined as DataFrames, which are partitioned and processed in parallel on your cluster’s nodes. Spark supports both batch and streaming modes with only small code changes between them. It is built on composable APIs that are well documented and continue to improve.
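To make the "partitioned and processed in parallel" idea concrete, here is a toy sketch in plain Python. It is not Spark and does not use Spark's API. It only mimics the pattern with the standard library: split the rows into partitions, run the same work on each partition concurrently, then merge the partial results. The function names (partition, count_by_key, parallel_count) and the sample rows are made up for illustration, and threads on one machine stand in for the nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_partitions):
    """Split rows into num_partitions chunks, dealing them out round-robin."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_by_key(rows):
    """Work done on a single partition: count the rows seen for each key."""
    return Counter(key for key, _value in rows)


def parallel_count(rows, num_partitions=4):
    """Count rows per key by processing partitions concurrently, then merging."""
    parts = partition(rows, num_partitions)
    # Threads stand in for cluster nodes in this toy example.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_by_key, parts))
    # Merge the per-partition counts into one final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical sample data: (key, value) pairs.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
result = parallel_count(rows, num_partitions=2)
print(result)
```

In Spark itself, the partitioning, scheduling, and merging are handled by the framework across the cluster, so your code describes only the work to be done on the DataFrame.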

The easiest way to get familiar with the power of Spark is to use Databricks.

Databricks

Databricks is not a native Azure service, but it is included as a first-class citizen within the Azure cloud, complete with Azure-specific documentation.

Within Azure, you can have Azure Databricks up and running and data flowing in a few clicks. This is the fastest and easiest way to experience the power of Spark.

Are Azure Databricks & Spark a Good Fit?

Spark is very beneficial when you need to process a large volume of data. If you have a traditional ETL pipeline that is crunching away at your data for hours at a time and maxing out your server’s CPU and memory, Spark will probably be a good fit. Azure Databricks is a good option to consider, especially if you are planning a multi-cloud implementation, as Databricks is available on other cloud providers.

The code you develop in Azure Databricks can be utilized on other implementations of Spark. So, if you are coming from another cloud like AWS or run your own Spark cluster, Azure Databricks is a good place to gain experience and confidence in the Azure platform.

Getting Started

Spark can be intimidating for data engineers who have never used it before. Looking at the Apache documentation, you are immediately bombarded with terms and technologies you may or may not be familiar with. There are also many avenues and options, which can lead to analysis paralysis when getting started.

Azure Databricks provides an easy-to-use Spark environment with a few clicks in Azure. It also avoids locking you into the kind of configuration you might end up with if you roll your own Spark cluster.

Follow the Quickstart Guide to get up and running and check here for more posts on Databricks.

As you are planning your data pipeline modernization, reach out to Wintellect / Atmosera and we can help identify, architect, and implement where Azure Databricks fits into your solution. Training is also available.