Tips and tricks on securing your Data Factory

Why?

As we all know, identity is a hot topic, with MFA fatigue techniques and adversaries discovering and attacking our resources as external or internal entities. An often overlooked aspect of security, however, is the data layer itself.

You could leak data out and not even know it. Some things you should consider: how do you transfer data, how do you encrypt it at rest, and who has access to it?

We will look a bit at how Data Factory works and what you should consider security-wise.

What can you do with it?

Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and orchestrate data pipelines to move and transform data between various data stores and data sources. With Data Factory, you can:

  • Extract data from various sources: Data Factory supports a wide range of data sources, including on-premises and cloud-based data stores, such as Azure SQL Database, Azure Blob Storage, and Azure Cosmos DB. You can use Data Factory to extract data from these sources using connectors and data movement activities.
  • Transform and enrich data: Data Factory provides a variety of data transformation activities that you can use to clean, normalize, and enrich your data. You can use Data Factory to apply transformations such as aggregations, pivots, and joins to your data.
  • Load data into destination stores: Data Factory allows you to load transformed data into a variety of destination stores, including Azure SQL Database, Azure Data Lake Storage, and Azure Cosmos DB.
  • Schedule and automate data pipelines: Data Factory allows you to schedule and automate your data pipelines using triggers and orchestration activities. You can use Data Factory to run pipelines on a schedule, in response to an event, or on demand.
  • Monitor and manage data pipelines: Data Factory provides a range of tools for monitoring and managing your data pipelines, including Azure Monitor, Azure Log Analytics, and Azure Resource Manager templates.

Overall, Azure Data Factory provides a powerful platform for integrating and transforming data from various sources and destinations, and automating data pipelines for efficient data management.
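To make this concrete, here is a minimal sketch of creating and running a pipeline with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are placeholders, and the sketch assumes the factory and both blob datasets already exist:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

# Placeholder names -- replace with your own resources.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data"
FACTORY_NAME = "adf-demo"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A copy activity that moves data between two pre-existing blob datasets.
copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(reference_name="SourceBlobDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="SinkBlobDataset", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Create (or update) the pipeline, then trigger an on-demand run.
adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", PipelineResource(activities=[copy])
)
run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline")
print(f"Started pipeline run {run.run_id}")
```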

To understand better what needs to be secured, here is an excellent example of an enterprise Power BI solution from the Azure Architecture Center.

How to secure it?

Here is a list I compiled of the steps. This list is not complete, but it gives a good starting point for many of the use cases organizations could have. Just make sure these fit your own deployments and add whatever else is needed on top of these.

Azure Policy

First of all, apply Azure Policy to your environment. Here are some examples that I exported for easier reading.

  • Azure Data Factory should use private link (policy ID 8b0323be-cc25-4b61-935d-002c3798c6ea): Azure Private Link lets you connect your virtual network to Azure services without a public IP address at the source or destination. The Private Link platform handles the connectivity between the consumer and services over the Azure backbone network. By mapping private endpoints to Azure Data Factory, data leakage risks are reduced. Learn more about private links at: https://docs.microsoft.com/azure/data-factory/data-factory-private-link.
  • Public network access on Azure Data Factory should be disabled (policy ID 1cf164be-6819-4a50-b8fa-4bcaa4f98fb6): Disabling the public network access property improves security by ensuring your Azure Data Factory can only be accessed from a private endpoint.
  • Configure Data Factories to disable public network access (policy ID 08b1442b-7789-4130-8506-4f99a97226a7): Disable public network access for your Data Factory so that it is not accessible over the public internet. This can reduce data leakage risks. Learn more at: https://docs.microsoft.com/azure/data-factory/data-factory-private-link.
  • Azure data factories should be encrypted with a customer-managed key (policy ID 4ec52d6d-beb7-40c4-9a9e-fe753254690e): Use customer-managed keys to manage the encryption at rest of your Azure Data Factory. By default, customer data is encrypted with service-managed keys, but customer-managed keys are commonly required to meet regulatory compliance standards. Customer-managed keys enable the data to be encrypted with an Azure Key Vault key created and owned by you. You have full control and responsibility for the key lifecycle, including rotation and management. Learn more at https://aka.ms/adf-cmk.
  • [Preview]: Azure Data Factory integration runtime should have a limit for number of cores (policy ID 85bb39b5-2f66-49f8-9306-77da3ac5130f): To manage your resources and costs, limit the number of cores for an integration runtime.
  • [Preview]: Azure Data Factory linked services should use system-assigned managed identity authentication when it is supported (policy ID f78ccdb4-7bf4-4106-8647-270491d2978a): Using system-assigned managed identity when communicating with data stores via linked services avoids the use of less secured credentials such as passwords or connection strings.
  • Configure private endpoints for Data factories (policy ID 496ca26b-f669-4322-a1ad-06b7b5e41882): Private endpoints connect your virtual network to Azure services without a public IP address at the source or destination. By mapping private endpoints to your Azure Data Factory, you can reduce data leakage risks. Learn more at: https://docs.microsoft.com/azure/data-factory/data-factory-private-link.
  • [Preview]: Azure Data Factory should use a Git repository for source control (policy ID 77d40665-3120-4348-b539-3192ec808307): Enable source control on data factories to gain capabilities such as change tracking, collaboration, continuous integration, and deployment.
  • [Preview]: Azure Data Factory linked services should use Key Vault for storing secrets (policy ID 127ef6d7-242f-43b3-9eef-947faf1725d0): To ensure secrets (such as connection strings) are managed securely, require users to provide secrets using an Azure Key Vault instead of specifying them inline in linked services.
  • SQL Server Integration Services integration runtimes on Azure Data Factory should be joined to a virtual network (policy ID 0088bc63-6dee-4a9c-9d29-91cfdc848952): Azure Virtual Network deployment provides enhanced security and isolation for your SQL Server Integration Services integration runtimes on Azure Data Factory, as well as subnets, access control policies, and other features to further restrict access.
  • Configure private DNS zones for private endpoints that connect to Azure Data Factory (policy ID 86cd96e1-1745-420d-94d4-d3f2fe415aa4): Private DNS records allow private connections to private endpoints. Private endpoint connections allow secure communication by enabling private connectivity to your Azure Data Factory without a need for public IP addresses at the source or destination. For more information on private endpoints and DNS zones in Azure Data Factory, see https://docs.microsoft.com/azure/data-factory/data-factory-private-link.
  • [Preview]: Azure Data Factory linked service resource type should be in allow list (policy ID 6809a3d0-d354-42fb-b955-783d207c62a8): Define the allow list of Azure Data Factory linked service types. Restricting allowed resource types enables control over the boundary of data movement. For example, restrict a scope to only allow blob storage with Data Lake Storage Gen1 and Gen2 for analytics or a scope to only allow SQL and Kusto access for real-time queries.

To see more try AzPolicyAdvertizer. This is an excellent site for finding which policies you should use based on your services.
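If you prefer to assign these from code rather than the portal, here is a minimal sketch using the azure-mgmt-resource Python SDK. The scope and assignment name are placeholders; the definition ID is taken from the private-link policy in the table above:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import PolicyClient

SUBSCRIPTION_ID = "<subscription-id>"
scope = f"/subscriptions/{SUBSCRIPTION_ID}"  # could also be a resource group scope

policy = PolicyClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Assign the built-in "Azure Data Factory should use private link" definition
# (ID taken from the policy table above).
policy.policy_assignments.create(
    scope,
    "adf-should-use-private-link",
    {
        "policy_definition_id": (
            "/providers/Microsoft.Authorization/policyDefinitions/"
            "8b0323be-cc25-4b61-935d-002c3798c6ea"
        ),
        "display_name": "Azure Data Factory should use private link",
    },
)
```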

Network

Use a dedicated ExpressRoute connection with an IPsec tunnel to encrypt the traffic in transit.

Identity

Keep unwanted users from logging in by disabling their accounts.

Use RBAC for access management

Azure Active Directory (AAD) allows you to control access to your Data Factory resources using Azure role-based access control (RBAC). You can also use AAD to enable multi-factor authentication (MFA) for added security.

For example, through Conditional Access policies you could require a trusted location and a trusted device for managing Data Factory.

See more information here.
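As a sketch of the RBAC side, assigning the built-in Data Factory Contributor role to a user at the scope of a single factory could look like this with the azure-mgmt-authorization Python SDK. All IDs are placeholders, and the exact parameter shape can vary slightly between SDK versions:

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
# Scope the assignment to a single factory rather than the whole subscription.
FACTORY_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-data"
    "/providers/Microsoft.DataFactory/factories/adf-demo"
)
# Built-in "Data Factory Contributor" role definition.
ROLE_DEFINITION_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/providers/Microsoft.Authorization"
    "/roleDefinitions/673868aa-7521-48a0-acc6-0f60742d39f5"
)

auth = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Grant a user or group (object ID placeholder) contributor rights on this factory only.
auth.role_assignments.create(
    FACTORY_ID,
    str(uuid.uuid4()),  # role assignment names must be GUIDs
    {
        "role_definition_id": ROLE_DEFINITION_ID,
        "principal_id": "<aad-object-id>",
    },
)
```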

Store credentials in Key Vault

Keep your credentials inside Key Vault for safekeeping. For example, for an on-premises SQL Server you can specify the name of the Azure Key Vault secret that stores the target SQL Server linked service’s connection string (e.g. Data Source=<servername>\<instance name if using named instance>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;).
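A sketch of wiring that up with the azure-mgmt-datafactory Python SDK, assuming a Key Vault linked service named KeyVaultLS already exists in the factory and the connection string above is stored in a secret called SqlConnString (both names are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureKeyVaultSecretReference, LinkedServiceReference,
    LinkedServiceResource, SqlServerLinkedService,
)

SUBSCRIPTION_ID = "<subscription-id>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Reference the secret in Key Vault instead of embedding the connection
# string inline in the linked service definition.
secret_ref = AzureKeyVaultSecretReference(
    store=LinkedServiceReference(reference_name="KeyVaultLS", type="LinkedServiceReference"),
    secret_name="SqlConnString",
)

adf.linked_services.create_or_update(
    "rg-data",
    "adf-demo",
    "OnPremSqlServerLS",
    LinkedServiceResource(
        properties=SqlServerLinkedService(connection_string=secret_ref)
    ),
)
```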

Infrastructure

Use encryption

Data Factory supports encryption at rest for all data stored in Azure Data Lake Storage Gen2 and Azure Blob Storage. You can also use Azure Key Vault to manage and rotate encryption keys for your data.

Enable network isolation

By default, Data Factory resources are isolated from other resources in your Azure subscription. You can further isolate your Data Factory resources by using virtual networks and subnets.

Use customer-managed keys to protect your Data Factory

Quick tip! You have to enable encryption for the whole Data Factory before adding any linked services or pipelines.

Then copy the whole Key Identifier path from under the key in Key Vault.

See more on Key Vault and Managed HSM in my previous posts.
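Here is a minimal sketch of enabling a customer-managed key with the azure-mgmt-datafactory Python SDK, using the key name and version from the Key Identifier you copied. All names are placeholders, and per the tip above the factory must still be empty; in practice you create the factory with a managed identity first, grant that identity wrap/unwrap access to the key, and then apply the encryption block:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    EncryptionConfiguration, Factory, FactoryIdentity,
)

SUBSCRIPTION_ID = "<subscription-id>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Assumes the factory's system-assigned identity has already been granted
# wrap/unwrap permissions on the key in Key Vault. The values below come
# from the Key Identifier copied out of Key Vault.
factory = Factory(
    location="westeurope",
    identity=FactoryIdentity(type="SystemAssigned"),
    encryption=EncryptionConfiguration(
        key_name="adf-cmk",
        vault_base_url="https://<vault-name>.vault.azure.net",
        key_version="<key-version>",  # optional; omit to track the latest version
    ),
)
adf.factories.create_or_update("rg-data", "adf-demo", factory)
```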

Use resource locks

Use resource locks to prevent accidental deletion or modification of critical Data Factory resources.
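A sketch of adding a CanNotDelete lock to a factory with the azure-mgmt-resource Python SDK (names are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ManagementLockClient

SUBSCRIPTION_ID = "<subscription-id>"
locks = ManagementLockClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A CanNotDelete lock: the factory can still be used and modified,
# but nobody can delete it until the lock is removed.
locks.management_locks.create_or_update_at_resource_level(
    resource_group_name="rg-data",
    resource_provider_namespace="Microsoft.DataFactory",
    parent_resource_path="",
    resource_type="factories",
    resource_name="adf-demo",
    lock_name="do-not-delete-adf",
    parameters={"level": "CanNotDelete", "notes": "Protect production Data Factory"},
)
```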

Use Azure Private Link

If you don’t use a completely isolated mode, you can use Azure Private Link to securely access your Data Factory resources over a private network connection rather than over the public internet.
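A sketch of creating such a private endpoint for a factory with the azure-mgmt-network Python SDK. The network names are placeholders, and you would still pair this with the private DNS zone policy listed earlier:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
FACTORY_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-data"
    "/providers/Microsoft.DataFactory/factories/adf-demo"
)
SUBNET_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-network"
    "/providers/Microsoft.Network/virtualNetworks/vnet-hub/subnets/snet-private-endpoints"
)

network = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# "dataFactory" is the private-link sub-resource for the factory itself;
# "portal" covers private access to the authoring UI.
network.private_endpoints.begin_create_or_update(
    "rg-network",
    "pe-adf-demo",
    {
        "location": "westeurope",
        "subnet": {"id": SUBNET_ID},
        "private_link_service_connections": [
            {
                "name": "adf-demo-connection",
                "private_link_service_id": FACTORY_ID,
                "group_ids": ["dataFactory"],
            }
        ],
    },
).result()
```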

Inside Data factory

Use system-assigned managed identities to give access to linked resources.

Just make sure that your system-assigned managed identities don’t have too many rights and that you use dedicated keys for the different components inside your infrastructure.
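For example, with the Azure Blob Storage connector, leaving out keys and connection strings and specifying only the service endpoint makes Data Factory authenticate with its system-assigned managed identity. A sketch with the azure-mgmt-datafactory Python SDK (names are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService, LinkedServiceResource,
)

SUBSCRIPTION_ID = "<subscription-id>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# With only the service endpoint set (no key, SAS, or service principal),
# Data Factory uses its system-assigned managed identity. The identity
# still needs an RBAC role on the storage account, e.g. Storage Blob Data
# Reader -- and nothing broader than the pipeline actually needs.
adf.linked_services.create_or_update(
    "rg-data",
    "adf-demo",
    "BlobViaManagedIdentityLS",
    LinkedServiceResource(
        properties=AzureBlobStorageLinkedService(
            service_endpoint="https://<storage-account>.blob.core.windows.net"
        )
    ),
)
```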

Using Purview

You can attach labels and policies directly to your content.

You can also access an isolated Purview instance from Data Factory.

Create Custom roles

Consider creating the following custom roles for accessing the Data Factory instance. You can find the JSONs for these custom roles here.
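Those JSONs are the real starting point; purely as a hypothetical illustration, a minimal read-only custom role could be created like this with the azure-mgmt-authorization Python SDK (the role name, actions, and scope below are examples of mine, not the linked definitions):

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
SCOPE = f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-data"

auth = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A hypothetical read-only role: can inspect factories and pipelines,
# but cannot change or run anything.
auth.role_definitions.create_or_update(
    SCOPE,
    str(uuid.uuid4()),  # role definition names are GUIDs
    {
        "role_name": "Data Factory Reader (custom)",
        "description": "Read-only access to Data Factory resources.",
        "permissions": [
            {
                "actions": [
                    "Microsoft.DataFactory/factories/read",
                    "Microsoft.DataFactory/factories/pipelines/read",
                ],
                "not_actions": [],
            }
        ],
        "assignable_scopes": [SCOPE],
    },
)
```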

Use data masking

Data masking allows you to obscure sensitive data in your data pipelines by replacing it with fictitious data. This can help protect sensitive information from unauthorized access.

Hiding activity from logs

An activity’s output could contain private data that would otherwise be displayed in plain text in the logs.

For example, it’s possible that a Lookup activity is accessing a source and returning personally identifiable content.

When the Secure input and Secure output checkboxes are checked on an activity, its input and output will not be captured in logging.
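In the pipeline definition those checkboxes map to the activity policy’s secureInput and secureOutput flags. A sketch with the azure-mgmt-datafactory Python SDK, assuming a dataset named CustomerDataset already exists:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityPolicy, AzureSqlSource, DatasetReference, LookupActivity, PipelineResource,
)

SUBSCRIPTION_ID = "<subscription-id>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# secure_input/secure_output keep the lookup's parameters and its returned
# rows (which may contain PII) out of the monitoring logs.
lookup = LookupActivity(
    name="LookupCustomer",
    dataset=DatasetReference(reference_name="CustomerDataset", type="DatasetReference"),
    source=AzureSqlSource(),
    policy=ActivityPolicy(secure_input=True, secure_output=True),
)

adf.pipelines.create_or_update(
    "rg-data", "adf-demo", "SecureLookupPipeline", PipelineResource(activities=[lookup])
)
```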

Monitor and audit activity

To keep an eye on and audit activity in your Data Factory environment, use Azure Monitor and Azure Log Analytics. You can set up alerts that warn you of suspicious behavior or potential security risks.
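A sketch of wiring a factory’s run logs to a Log Analytics workspace with the azure-mgmt-monitor Python SDK (resource IDs are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
FACTORY_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-data"
    "/providers/Microsoft.DataFactory/factories/adf-demo"
)
WORKSPACE_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-monitoring"
    "/providers/Microsoft.OperationalInsights/workspaces/law-security"
)

monitor = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Ship pipeline, trigger, and activity run logs to Log Analytics so you can
# query them and build alert rules for suspicious behavior.
monitor.diagnostic_settings.create_or_update(
    resource_uri=FACTORY_ID,
    name="adf-to-log-analytics",
    parameters={
        "workspace_id": WORKSPACE_ID,
        "logs": [
            {"category": "PipelineRuns", "enabled": True},
            {"category": "TriggerRuns", "enabled": True},
            {"category": "ActivityRuns", "enabled": True},
        ],
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    },
)
```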

You can also isolate Log Analytics by using an Azure Monitor Private Link Scope (AMPLS).

Additional ways to make it more secure

See the Azure Security Benchmark for Data Factory here.

More on Data movement considerations from Microsoft Learn.

What kind of attack paths are there?

There have been attacks and there will be more; SynLapse is one example. Microsoft has already addressed that one, so no worries there, but as data becomes more important for organizations, it will also become more important for attackers.

Closure

Organizations concentrate on securing normal users, admins, and their workstations. There is a lot more to security than just that.

As we see in this write-up, data solutions need to be secured as well. There are always risks, and automated processes don’t always reveal the gaps until it’s too late. Depending on the sector you are building the solution for, there could be monetary fines if you don’t comply with the regulations for your industry or for data protection.

Now it’s time to concentrate on family time. Thanks to everyone in the community for this year; see you again next year!

That in mind, have a very Merry Christmas and keep your loved ones close!

Author: Harri Jaakkonen
