How Azure Synapse integrates your data pipelines

Cloud computing is an exciting platform to deliver solutions in a fraction of the time it is used to take. It bypasses the heavy lifting of setting up everything from the power outlet in a data center or data room. Dealing with the multiple cloud provider’s approaches to their services can be challenging, though. Even more when the cloud providers decide to change their own approach, always with good intentions, for sure.

Data processing in the Microsoft Azure cloud is one of these cases. Early on Azure offered a particular set of services for BI implementation. The main one being Azure SQL Data Warehouse. And big data solutions can be found in Azure HDInsights with Hadoop and Apache Spark, as well as in Databricks.

In this blog post, you will better understand how Azure Synapse Analytics changes Microsoft’s approach to implementing data pipelines at scale. You will also learn to make sense of the service swaps and the comprehensive, integrated environment.

The Basics

Companies are always interested in extract insights from their data. This is usually challenged by the disconnection and disparity of their data sources. Data comes from every corner of the business: from the IoT sensors, from the POS devices, from legacy systems, even from other cloud providers and so many other data sources.

A typical cloud data pipeline would tap each of these data sources and land this data in a central repository of either unstructured or semi-structured data commonly named data lake. The illustration below depicts this design:

As the animation shows at the end, the core of this process is the data transformations necessary to connect and integrate the data together. This is usually implemented either through ETL or ELT strategies. In Azure, the two very popular services to do this are Databricks and Azure SQL Data Warehouse (DW).

Databricks is a compute service. It does not implement a particular storage technology but relies on external storage like the data lake and file standards (Parquet for instance). On the other hand, SQL Data Warehouse is both a compute service and storage at scale blended in one service. It can tap into the data lake and write back to it, but it has its own proprietary storage system optimized to relational designs based on tables. It is very friendly at first to DBA professionals but has its own particular way to deal with data at scale.

Back in 2019 Microsoft rebranded and redesigned Azure SQL DW to Azure Synapse Analytics Dedicated SQL Pool. And they redesigned it to be part of a larger service, instead of being a service in itself. The actual service name is Azure Synapse Analytics. Don’t get confused. The names are very close. Azure Synapse Analytics is the cloud service and Synapse Analytics dedicated SQL pool is one of the data engines in the service. For clarity, I’ll refer to it just as a dedicated SQL pool.

In the animation above you can see at the end both Databricks and the dedicated SQL pool working on the data in the data lake originated from multiple data sources.

After all the data transformations have taken place, data analysts and data scientists would be consuming data from either the relational engine or the transformed dataset in the lake. The next animation depicts that:

For illustration purposes, I depicted Power BI, both Pro, and Premium subscriptions, as consuming either data from the relational engine or the data lake, as well as Azure Analysis Service. Of course, any application that can connect to SQL Server protocols and Azure Data Lake and supports one of the standard file storage formats.

The trick portion here is data privacy and network isolation. Because Databricks, dedicated SQL pools, and the data lake are three independent services, each one of them needs to be “wired up” together through vNets when the business needs to isolate the traffic from the public Azure network rings. Of course, it can be done – and has been done – but it just introduces a lot of moving parts. Especially when you have multiple clients/data consumers (data scientists/data analysts) that need to be connected to the vNet as well.

This is one of the problems that Azure Synapse Analytics addresses. By integrating these services that actually work together as parts of the same engine is one of the benefits of the platform.

So what are the fundamental parts of Azure Synapse Analytics? The animation below highlights these parts by removing from the traditional design the external connections and keep only what is core for data ingestion and transformation:

As you can see, a traditional data transformational pipeline includes an ingestion engine, in the animation it is Azure Data Factory; a common data repository for unstructured or semi-structured data, in the animation it is the Data Lake; a schema-less data engine for ETL/ELT and data cleansing, in the animation it is Databricks; and a schema full, relational data engine target for data warehousing and BI tasks, in the animation it is dedicated SQL pools.

The specifics

This is how it has been for a very long time in cloud computing at scale. Where is the innovation in Synapse?

The first innovation came in aspects:

the already mentioned integration of all parts in one workspace;
the replacement of some parts
the addition of new parts

Databricks is not a Microsoft product. It is hosted in Azure but it is a third-party solution. Well integrated with Azure as well as Amazon Web Services (AWS). For Azure Synapse Analytics Microsoft has decided not to leverage on Databricks but to have its own Spark implementation in Synapse. Not as a proprietary solution. It is based on the Apache Spark open-source. But with some specifics to better integrate with the other parts of Synapse. Synapse Spark pool engine implements Databricks functionality in Synapse Analytics.

Synapse Integrate Pipelines replaces Azure Data Factory. While ADF is backed up by Databricks engine under the hood for some of its functionality, Azure Integrate Pipeline runs the same Apache Spark engine supporting Synapse Spark pools under the hood. Conceptually they do the same thing but Integrate Pipelines has some specific approaches to some tasks that you will either not find in ADF or is implemented in a different way.

Finally, Microsoft added a new data engine in Synapse named Azure Synapse Serverless SQL pool. It is designed to support data analysts to deal with semi-structured data without having to learn a new platform (Spark) and a new computer language (Scala, Python, etc.). Rather they can stick with a large subset of T-SQL, the language in SQL Server, and work directly with the data in the data lake.

Serverless SQL pool doesn’t have its own data repository. It only implements compute, like the Spark pool. I always find it funny when I think about this: the serverless SQL pool does have a server, a compute capability; what it lacks is dedicated relational storage. Shouldn’t it be storage less instead? Well … perhaps serverless is more friendly and ambiguous so people can wonder about it. Go figure. In any case, the serverless SQL pool brings great functionality for data exploration from the lake in a well-known language. It is super easy to set up and interact with.

The last addition is deep integration with Power BI. Here it is important to understand that the Power BI integration does not grant user licenses. In other words, it is not for report consumption. Rather it is for report authoring and testing resulting datasets coming from either the lake or the relational engine.

The final design is illustrated in this last animation:

There are other aspects of Azure Synapse Analytics that are not at the core of data pipeline transformation implementation. We will go over those pieces of the puzzle in other blog posts. Tools related to management and monitoring.

Keep in mind, though, that you don’t have to abandon other services and investments. Current running pipelines implemented in Databricks and Azure Data Factory can write to the same Data Lake shared with Synapse. The Data Lake is your bridge from any data activity happening outside Synapse Analytics and within the service.

And you only pay for what you use. If you decide to keep implementing your ETL/ELT in Databricks and never touch the Spark pool in Synapse, no problem. You won’t be charged for the data engine you are not using. The same goes for Integration Pipelines, dedicated SQL pools, and serverless SQL pools.

In Summary

Working with Azure Synapse Analytics increases your time to market when implementing data transformations at scale in the cloud. To take full advantage of what it offers there are some tradeoffs between well-established data solutions. They are not mandatory, though. It is fully integrated and allows for data privacy in a public cloud through network isolation.

Even though it is based on common, well-known technologies like Apache Spark and T-SQL, it has its particulars. A DBA or data engineer should not assume that all the same rules from the original technologies apply to Synapse Analytics.

I hope this has helped you better understand the eco-system inside Synapse.

Join us this Friday, February 5th for our 1-day Azure Synapse Workshop:

Azure Synapse Analytics | 1-day Workshop