One of the best features of Databricks is how easily it lets you transition batch pipelines to streaming pipelines. Depending on your transformation logic, it may be as easy as changing two lines of code. In this post we will review some batch and streaming concepts and show how to write a notebook that can be run in both batch and streaming modes.

I’ve also recorded a webinar that accompanies this post: Databricks Batch and Streaming.

Batch

To compose your notebooks/jobs as batch, use spark.read and spark.write to read and write your DataFrames.

The pipeline will run once, processing all the data that exists at the source and writing it into the sink.
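Here is a minimal sketch of the batch pattern. The dataset path, columns, and sink table name are illustrative placeholders, not taken from the accompanying notebook:

```python
# Batch read: runs once over all the data currently at the source.
df = (spark.read
      .format("json")
      .load("/databricks-datasets/iot/iot_devices.json"))  # sample public dataset

# Transformation logic (columns here are illustrative).
transformed = df.select("device_name", "temp", "cca3")

# Batch write: the full result is written to the sink in one run.
(transformed.write
 .format("delta")
 .mode("overwrite")
 .saveAsTable("iot_devices_batch"))  # hypothetical sink table
```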

For a complete example, see the provided notebook smokerPipelineBatch.py. It uses publicly available datasets, so anyone can run it using the free trial of Databricks.

If a batch pipeline uses a data source that is continually appended to, you may have to write your own logic to keep track of the pipeline's progress using your own checkpoint. One way to do this is to process partitions of the source data once they are ready for processing, as sketched below.
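One hypothetical way to implement such a manual checkpoint is to keep a high-water mark and filter the source on it. The paths, the partition_date column, and where the watermark is stored are all assumptions for illustration:

```python
from pyspark.sql import functions as F

# Load the last processed partition from your own checkpoint store
# (hard-coded here for illustration).
last_processed = "2021-01-01"

# Read only the partitions that arrived after the last run.
new_data = (spark.read
            .format("delta")
            .load("/mnt/source")  # hypothetical source path
            .filter(F.col("partition_date") > last_processed))

(new_data.write
 .format("delta")
 .mode("append")
 .save("/mnt/sink"))  # hypothetical sink path

# Finally, persist the new high-water mark so the next run starts from here.
```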

Streaming

If at some point you realize you need your data closer to real time, you can convert your notebook to a streaming notebook using spark.readStream and spark.writeStream.

For an example of how to do this, see the provided notebook smokerPipelineStreaming.py.

Databricks will take care of keeping track of what data has been processed using its own checkpointing logic; all the user has to do is specify a checkpoint location on the sink. The first time the notebook runs in streaming mode, all the data that exists at the source will be processed in one micro-batch.
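Converting the batch sketch above to streaming changes only the read and write calls. The paths remain illustrative placeholders:

```python
# Streaming read: the same source, now treated as an unbounded stream.
stream_df = (spark.readStream
             .format("delta")
             .load("/mnt/source"))  # hypothetical source path

# Streaming write: Databricks tracks progress via the checkpoint location.
(stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/smoker")  # hypothetical path
 .start("/mnt/sink"))  # hypothetical sink path
```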

Streaming Output Modes

  • Append (default) – only new rows are appended to the sink
  • Complete – the entire result table is written out on every trigger
  • Update – only rows that were updated are output (equivalent to Append if there are no aggregations)
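As a quick illustration, an aggregating query can use Complete mode. This continues from the hypothetical stream_df above, and the grouping column is illustrative:

```python
# Aggregations maintain a running result table, so Complete mode
# writes the whole table out on every trigger.
counts = stream_df.groupBy("cca3").count()

(counts.writeStream
 .outputMode("complete")
 .format("memory")             # in-memory sink, handy for exploration
 .queryName("country_counts")  # results readable via spark.sql
 .start())
```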

Trigger

The trigger specifies when the streaming query will process data:

  • Unspecified (default) – the streaming pipeline runs in micro-batch mode, starting each micro-batch as soon as the previous one finishes
  • Fixed interval – manually specify the micro-batch interval
  • Once – only one micro-batch will run, then the query stops
  • Continuous (Experimental) – lower latency, but with at-least-once fault-tolerance guarantees
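The trigger is set on the writer. Each option below is shown with the alternatives commented out, since a query takes exactly one; stream_df and the paths are the hypothetical ones from above:

```python
(stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/smoker")
 .trigger(processingTime="1 minute")  # fixed-interval micro-batches
 # .trigger(once=True)                # a single micro-batch, then stop
 # .trigger(continuous="1 second")    # experimental continuous processing
 .start("/mnt/sink"))
```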

Sink Types

foreach & foreachBatch

These sinks are very powerful: foreachBatch allows the application of custom logic to each micro-batch, while foreach applies it to each individual row.
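A common foreachBatch pattern is to upsert each micro-batch into a Delta table. The table and column names here are hypothetical:

```python
def upsert_to_delta(batch_df, batch_id):
    # Inside foreachBatch, batch_df is an ordinary DataFrame,
    # so the full batch API (including MERGE via SQL) is available.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql("""
        MERGE INTO smoker_readings t
        USING updates s
        ON t.device_id = s.device_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

(stream_df.writeStream
 .foreachBatch(upsert_to_delta)
 .option("checkpointLocation", "/mnt/checkpoints/upsert")
 .start())
```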

Streaming Limitations

There are a number of operations that are unsupported when streaming. They are outside the scope of this post, but you should understand them before developing complex streaming pipelines.

Batch or Streaming? Batch AND Streaming

Because the majority, if not all, of the logic can remain the same, we can write notebooks that run in either batch or streaming mode based on a parameter passed to the notebook when the job is called.

Here is an example of how this can be done using notebook parameterization: smokerPipelineBatchOrStreaming.py.
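The pattern boils down to branching only on the read and the write. This sketch uses a hypothetical widget named mode and the same illustrative paths as above:

```python
# Read the mode parameter passed to the notebook (defaults to batch).
dbutils.widgets.text("mode", "batch")
mode = dbutils.widgets.get("mode")

# Only the read differs between the two modes.
if mode == "streaming":
    df = spark.readStream.format("delta").load("/mnt/source")
else:
    df = spark.read.format("delta").load("/mnt/source")

# The shared transformation logic works identically on both DataFrames.
transformed = df  # ...your transformations here...

# Only the write differs between the two modes.
if mode == "streaming":
    (transformed.writeStream
     .format("delta")
     .option("checkpointLocation", "/mnt/checkpoints/smoker")
     .start("/mnt/sink"))
else:
    (transformed.write
     .format("delta")
     .mode("append")
     .save("/mnt/sink"))
```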