When deploying a data platform that utilises Data Factory or Synapse, one of the most important considerations is how data will be ingested into the platform. How can I get my data from my SQL Server sat in my datacenter into Azure? Doing this manually wouldn't be very efficient, and depending on your requirements, this data may need to be ingested every hour, every day, or every week. It's important that this task is automated.
Thankfully, we have Integration Runtimes to help us with this. There are a few types of Integration Runtime, or "IR":
| IR type | Public Network Support | Private Link Support |
| --- | --- | --- |
| Azure | Data Flow, Data movement, Activity dispatch | Data Flow, Data movement, Activity dispatch |
| Self-hosted | Data movement, Activity dispatch | Data movement, Activity dispatch |
| Azure-SSIS | SSIS package execution | SSIS package execution |
But hang on, what is an Integration Runtime? The Microsoft Documentation will tell you this:
"An integration runtime (IR) is the compute infrastructure that Azure Data Factory and Synapse pipelines use to provide data-integration capabilities across different network environments.
A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network. It also can dispatch transform activities against compute resources in an on-premises network or an Azure virtual network. The installation of a self-hosted integration runtime needs an on-premises machine or a virtual machine inside a private network."
The (very) simple way of looking at it is that an IR is a piece of fully managed, serverless compute that lets Synapse or Data Factory interact with and move data between cloud data sources in a secure way. A SHIR, on the other hand, is compute that requires managing but allows you to bridge the gap between your Azure and on-premises environments. There's more nuance to it than that (you may need a SHIR to interact with IaaS cloud sources as well), but let's look past that for the sake of simplicity at this time.
So when do I need a Self-Hosted Integration Runtime? The simple answer is: if you have data on-premises, you're probably going to need a SHIR deployed either in Azure or on-premises (blog post coming on that soon).
Hopefully this is all pretty simple to follow so far: IRs interact with (most) cloud resources, and SHIRs allow you to ingest on-premises data sources. To add some complexity into the mix, there are differences between what an ADF SHIR supports and what a Synapse SHIR supports.
Management of a SHIR
A SHIR, whether it's hosted on-premises or in Azure, is still a virtual machine and as such requires management by organisations: patching, backups and DR, to name a few. Fortunately, these machines don't actually store any data, so spinning up a new SHIR and connecting it to your workspace would be pretty easy in a DR scenario. That's still not ideal, though, so try and keep these backed up at the very least. It can be a challenge, while building a fancy new (mostly PaaS) modern data platform, to turn around to a customer and tell them they need a virtual machine deployed; quite simply, it's an additional VM to manage, or maybe even multiple VMs depending on your configuration.
Similarities
4 Node Limit
In both Synapse & ADF, a self-hosted integration runtime supports up to 4 nodes. You can have multiple self-hosted integration runtimes on different machines that connect to the same on-premises data source. For example, if you have two self-hosted integration runtimes that serve two data factories, the same on-premises data source can be registered with both data factories.
Why would you ever need 4 nodes? High availability and/or scaling. Depending on the amount of data you're ingesting and how frequently you're doing it, one SHIR node might not cut it. There are general guidelines on how & when to scale; alternatively, you might be performing a large-scale data migration from an EDW (enterprise data warehouse) where this might be an immediate requirement.
To save costs, you also have the option to scale up by upgrading the SKU of your SHIR VM, just remember to increase the number of concurrent jobs that run on a node!
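To make the scale-out vs scale-up trade-off concrete, here's a rough sketch of that decision as code. This is my own illustrative heuristic, not an official Microsoft formula: the thresholds, the "scale out first" preference, and the function name are all assumptions for the example.

```python
# Illustrative heuristic (not an official formula): choose between scaling
# out (more nodes, up to the 4-node limit) and scaling up (bigger SKU plus
# a higher concurrent-jobs-per-node setting) for a SHIR.

def plan_shir_capacity(avg_queued_jobs: int,
                       concurrent_jobs_per_node: int,
                       nodes: int,
                       max_nodes: int = 4) -> str:
    """Return a rough scaling recommendation for a self-hosted IR."""
    capacity = concurrent_jobs_per_node * nodes
    if avg_queued_jobs <= capacity:
        return "no change"
    # Prefer scaling out until the 4-node limit is reached...
    if nodes < max_nodes:
        return "scale out: add a node"
    # ...then scale up: upgrade the SKU and raise concurrent jobs per node.
    return "scale up: upgrade SKU and raise concurrent jobs per node"
```

For example, 30 queued jobs against 2 nodes each running 12 concurrent jobs (capacity 24) would suggest adding a node; the same backlog on 4 nodes would suggest scaling up instead.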
The scaling guidance can be found here: Copy activity performance and scalability guide - Azure Data Factory & Azure Synapse | Microsoft Learn
Supported data sources & formats
Synapse & ADF support the same set of data sources as a source / sink. You can find the list documented here: Copy activity - Azure Data Factory & Azure Synapse | Microsoft Learn
Networking Requirements
The networking & firewall requirements are the same: some outbound connections are required for the SHIR to talk to certain Azure services, and of course the SHIR needs to be able to communicate with your data source. Depending on where your SHIR is deployed, these rules may need configuring on your on-premises NVA, or on Azure Firewall should you opt to deploy your SHIR into Azure and connect to your data source over S2S VPN / ExpressRoute. Firewall requirements can be found here: Security considerations - Azure Data Factory | Microsoft Learn
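Many of the documented outbound rules are wildcard domains (for example `*.servicebus.windows.net`), which is worth keeping in mind when writing firewall rules. Here's a small sketch of checking a hostname against that style of allow list; the sample rules below are examples of the kind of entries you'll see, but always take the authoritative list from the linked documentation.

```python
from fnmatch import fnmatch

# Sample wildcard allow-list entries in the style of the SHIR firewall
# docs. These are examples only - use the linked documentation for the
# real, current list of required endpoints.
ALLOW_RULES = [
    "*.servicebus.windows.net",     # relay traffic (interactive authoring)
    "*.frontend.clouddatahub.net",  # IR registration / control plane
    "download.microsoft.com",       # SHIR auto-update downloads
]

def is_allowed(hostname: str, rules=ALLOW_RULES) -> bool:
    """True if the hostname matches any allow-list pattern."""
    return any(fnmatch(hostname.lower(), rule) for rule in rules)
```

For instance, `is_allowed("mynamespace.servicebus.windows.net")` matches the first wildcard rule, while an arbitrary domain does not.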
Differences
Shared SHIR
The biggest difference between the way Synapse and ADF interact with SHIRs is the "Shared Self-Hosted Integration Runtime". To put it simply, multiple data factories can share a single SHIR; Synapse, however, does not support this. Why might this be useful? Let's say your IT team uses one Data Factory and your BI team uses a different one. They could be in different subscriptions for security reasons, or they could just be separated to keep things tidy. Deploying a SHIR for every team that requires a data factory could be costly, so ADF allows you to share a SHIR with some pretty basic Managed Identity configuration. The "primary" Data Factory displays the SHIR as "Shared" and the "secondary" data factory shows the SHIR as "Linked".
Shared SHIR
Linked SHIR
As well as the security benefits of sharing a SHIR between multiple ADF instances, there's also a cost-saving element, as multiple teams share the same underlying infrastructure. Bottlenecks can also be avoided by the teams agreeing trigger times for copy activities.
There are some limitations that can be found here: Create a shared self-hosted integration runtime with PowerShell - Azure Data Factory | Microsoft Learn.
Along with those limitations, it's worth highlighting that there are some scenarios where I don't think this option is viable. If you have a centralised data platform that all teams use, and you use an environment tiering approach (Dev, Test & Prod, for example), it's generally worth ringfencing each SHIR to its respective environment from a firewall perspective (SHIR to data source). Also, imagine if a dev job got stuck, or someone manually triggered a copy activity before a production activity was due to start: your dev job is going to take precedence, and you'll be scratching your head wondering why your most recent Power BI report hasn't come through on time.
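One lightweight way to police those agreed trigger times on a shared SHIR is simply to check the teams' copy-activity windows for overlaps before committing to a schedule. Here's an illustrative sketch; the team names and time windows are made up for the example.

```python
from datetime import time

# Illustrative sketch: detect overlapping copy-activity windows on a
# shared SHIR so a dev job can't collide with a production run.
# Team names and windows are hypothetical.
windows = [
    ("BI-prod", time(2, 0),  time(4, 0)),
    ("IT-dev",  time(3, 30), time(5, 0)),
    ("IT-prod", time(6, 0),  time(7, 0)),
]

def find_overlaps(windows):
    """Return pairs of teams whose trigger windows overlap."""
    clashes = []
    for i, (a, a_start, a_end) in enumerate(windows):
        for b, b_start, b_end in windows[i + 1:]:
            # Two windows overlap if each starts before the other ends.
            if a_start < b_end and b_start < a_end:
                clashes.append((a, b))
    return clashes
```

With the sample data above, `find_overlaps(windows)` flags the clash between BI-prod (02:00-04:00) and IT-dev (03:30-05:00), which is exactly the dev-blocks-prod situation described earlier.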
Purview
Similar to ADF & Synapse, Purview has an IR and can be linked to a SHIR. However, these are not to be confused with ETL integration runtimes; they're purely for scanning data sources. Here's a couple of handy diagrams to show the differences:
The difference in function also brings a difference in requirements, especially when it comes to spec. Microsoft recommend a base spec of "2-GHz processor with 8 cores, 28 GB of RAM, and 80 GB of available hard drive space", or in Azure SKU terms, a D4 or DS4. It's worth noting that the spec of a Purview SHIR should be determined like a Synapse or ADF SHIR: analyse what you're scanning and how often, and work from there. While this is the recommended spec, I always prefer to start off low and build up depending on VM metrics / connector-specific spec requirements. Let's take the Oracle data source as an example. You can find a list of supported sources here: https://learn.microsoft.com/en-us/purview/microsoft-purview-connector-overview - I'm going to find Oracle in this list and look at the details for this specific connector. After going through and following the documentation, I can find a spec recommendation here: Connect to and manage Oracle | Microsoft Learn
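If you do want to sanity-check a candidate VM size against that recommended base spec, it's a simple comparison. A quick sketch, with the caveat that the candidate SKU figures below are simplified examples rather than an authoritative Azure size list:

```python
# Illustrative sketch: compare candidate VM sizes against the recommended
# Purview SHIR base spec (8 cores, 28 GB RAM, 80 GB available disk).
# The candidate SKU figures are simplified examples.
RECOMMENDED = {"cores": 8, "ram_gb": 28, "disk_gb": 80}

candidates = {
    "Standard_D2s_v3": {"cores": 2, "ram_gb": 8,  "disk_gb": 100},
    "Standard_D4s_v3": {"cores": 4, "ram_gb": 16, "disk_gb": 100},
    "Standard_D8s_v3": {"cores": 8, "ram_gb": 32, "disk_gb": 100},
}

def meets_recommended(spec: dict) -> bool:
    """True if a VM spec meets or exceeds every recommended minimum."""
    return all(spec[key] >= RECOMMENDED[key] for key in RECOMMENDED)
```

This also illustrates the "start low and build up" approach: a smaller size fails the check, which is fine as a starting point as long as you're watching the VM metrics.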
The 4-node rule still applies to Purview, and it can give a performance boost when it comes to scanning too. Here are the benefits:
- Higher availability of the self-hosted integration runtime, so that it's no longer a single point of failure for scans. This availability helps ensure continuity when you use up to four nodes.
- Running more concurrent scans. Each self-hosted integration runtime can empower many scan runs at the same time, auto-determined based on the machine's CPU/memory. You can install more nodes if you need more concurrency.
- When scanning sources like Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2 and Azure Files, each scan run can leverage all those nodes to boost the scan performance. For other sources, the scan will be executed on one of the nodes.
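That last point (storage sources fan out across all nodes, everything else runs on a single node) can be sketched as a tiny routing function. This is purely illustrative of the behaviour described above; the source-type names are simplified labels, not Purview's internal identifiers.

```python
# Illustrative sketch of the scan-distribution behaviour described above:
# storage-type sources can fan out across all SHIR nodes, while other
# sources run on a single node. Source-type names are simplified labels.
MULTI_NODE_SOURCES = {"AzureBlob", "AdlsGen1", "AdlsGen2", "AzureFiles"}

def nodes_for_scan(source_type: str, nodes: list) -> list:
    """Return the SHIR nodes a scan of this source type can use."""
    if source_type in MULTI_NODE_SOURCES:
        return nodes       # fan out across every registered node
    return nodes[:1]       # other sources scan on a single node
```

So a blob scan on a 3-node SHIR can use all 3 nodes, whereas an Oracle scan on the same SHIR runs on just one.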
Purview SHIRs also have a firewall requirement which can be found here: Create and manage Integration Runtimes | Microsoft Learn
Conclusion
In this post, we've covered what an Integration Runtime is, what a Self-Hosted Integration Runtime is, how they work, and the differences between Synapse, ADF and Purview SHIRs. In future posts, I'll be looking at where you should deploy your SHIR and various networking scenarios (cloud-only, hybrid, multi-cloud).