The Azure Well-Architected Framework (WAF) helps ensure that Azure workloads are reliable, stable, and secure while meeting SLAs for performance and cost. The WAF tenets are:
- Cost Optimization – Managing costs to maximize the value delivered.
- Reliability – The ability of a system to recover from failures and continue to function.
- Operational Excellence – Operational processes that keep a system running in production.
- Performance Efficiency – The ability of a system to adapt to changes in load.
- Security – Protecting applications and data from threats.
Applying the Azure WAF to your Azure Data Factory (ADF) workloads is critical and should be considered during initial architecture design and resource deployment. If you haven’t already, check out this companion blog on Azure Data Factory Patterns and Features for the Azure Well-Architected Framework. But how do you ensure that your ADF environment still meets WAF as workloads grow and evolve?
In this blog post, we’ll focus on monitoring Azure Data Factory to help align to the Azure Well-Architected Framework for data workloads.
Alerts and monitoring over Azure Data Factory
All Azure resources offer the capability to build dashboards over costs, but don’t necessarily give you the detail needed or have the alerting capabilities when an issue arises. You can view pipeline activity within the Data Factory itself, but this does not allow you to create aggregated reports over activities and pipelines over time.
You have three options: create alerts over ADF metrics, use Azure Monitor and Log Analytics for detailed or summarized information about your Data Factory activities, or build your own notification framework within Data Factory. Each helps keep your Data Factories optimized for cost, performance, and reliability.
Using metrics and alerts in Data Factory
Metrics are essentially performance counters, always returning a number, and are leveraged when you configure alerts.
Configure alerts for failures
Configure ADF metrics and alerts to send notifications when triggers, pipelines, activities, or SSIS packages fail. In the example below, an alert will be issued whenever the activity named “cdCopyTextToSQL” fails:
Configure Pipeline Elapsed Time metric
In the ADF pipeline settings, the Elapsed time metric lets you set an expected duration for the pipeline:
Then create an Alert Rule for Elapsed Time Pipeline Run metrics:
If the pipeline runtime exceeds the duration defined in the Elapsed time metric Pipeline Settings, an alert will be issued.
Set Alerts on Self-Hosted Integration Runtimes
Self-Hosted Integration Runtimes (SHIRs) are used to move and transform data that resides in an on-premises network or VNet. Set alerts to ensure resources are not overutilized or queuing data movement requests:
The following metrics are available:
- Integration runtime available memory (IntegrationRuntimeAvailableMemory) – be notified when there are any dips in available memory
- Integration runtime available node count (IntegrationRuntimeAvailableNodeNumber) – be notified when nodes in a SHIR cluster are not available or not being fully utilized
- Integration runtime CPU Utilization (IntegrationRuntimeCpuPercentage) – be notified when there are spikes in CPU or when CPU is being maxed out
- Integration runtime queue duration (IntegrationRuntimeAverageTaskPickupDelay) – be notified when the average activity queue duration exceeds a limit
- Integration runtime queue length (IntegrationRuntimeQueueLength) – be notified when there are long waits between activities
You can also configure event log capture on the VM(s) that host your SHIR.
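The threshold logic behind these SHIR alerts can be sketched in a few lines of Python. This is only an illustration of how the rules are evaluated, not how Azure Monitor implements them. The metric names come from the list above. The threshold values, the assumed units (bytes, percent, count, seconds), and the `AlertRule` and `evaluate` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str                         # ADF metric name from the list above
    breached: Callable[[float], bool]   # returns True when the value should alert
    message: str


# Thresholds and units are illustrative assumptions, not recommendations.
RULES: List[AlertRule] = [
    AlertRule("IntegrationRuntimeCpuPercentage",
              lambda v: v > 80.0, "CPU above 80 percent"),
    AlertRule("IntegrationRuntimeAvailableMemory",
              lambda v: v < 1 * 1024**3, "available memory below 1 GiB"),
    AlertRule("IntegrationRuntimeQueueLength",
              lambda v: v > 10, "more than 10 queued activities"),
    AlertRule("IntegrationRuntimeAverageTaskPickupDelay",
              lambda v: v > 60.0, "average queue duration above 60 seconds"),
]


def evaluate(sample: Dict[str, float]) -> List[str]:
    """Return one alert message for every rule the metric sample breaches."""
    alerts = []
    for rule in RULES:
        value = sample.get(rule.metric)
        if value is not None and rule.breached(value):
            alerts.append(f"{rule.metric}: {rule.message} (value={value})")
    return alerts


if __name__ == "__main__":
    # A made-up sample: CPU is high and the queue is backing up.
    sample = {
        "IntegrationRuntimeCpuPercentage": 93.0,
        "IntegrationRuntimeAvailableMemory": 4 * 1024**3,
        "IntegrationRuntimeQueueLength": 25,
    }
    for line in evaluate(sample):
        print(line)
```

In practice you would tune each threshold to the size of your SHIR nodes and your normal workload before turning it into an alert rule.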
Set alerts on Azure Subscription Limits
ADF has resource limits per Azure subscription. If you expect a Data Factory to have a large number of pipelines, datasets, triggers, linked services, private endpoints, and other entities, set alerts on the Total entities count to be notified when a Data Factory starts approaching the limit (the default limit is 5,000). For example:
You can also set an alert or query on Total factory size (GB unit) to ensure the Data Factory will not exceed the data factory size limit (2 GB default).
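As a rough sketch of the check such an alert performs, the snippet below compares current usage against the two default limits mentioned above. The 80 percent warning ratio and the function names are assumptions chosen for illustration. Your subscription's actual limits may differ from the defaults.

```python
ENTITY_LIMIT = 5000     # default total-entities limit cited above
SIZE_LIMIT_GB = 2.0     # default factory size limit cited above


def limit_usage(current: float, limit: float) -> float:
    """Fraction of the limit currently consumed (1.0 means the limit is reached)."""
    return current / limit


def near_limit(current: float, limit: float, warn_ratio: float = 0.8) -> bool:
    """True when usage has reached the (assumed) warning ratio of the limit."""
    return limit_usage(current, limit) >= warn_ratio


if __name__ == "__main__":
    # Made-up readings for the Total entities and Total factory size metrics.
    print(near_limit(4500, ENTITY_LIMIT))    # 4500 of 5000 entities
    print(near_limit(0.6, SIZE_LIMIT_GB))    # 0.6 GB of 2 GB
```

Alerting at a ratio below 100 percent leaves time to split the factory or clean up unused entities before deployments start failing.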
Leveraging alerts in ADF allows you to be immediately notified when pipelines are failing or when resources are reaching their limits, supporting the WAF tenets of Cost Optimization, Reliability, Operational Excellence, and Performance Efficiency.
Use Azure Monitor with Log Analytics over Data Factory
Azure Monitor provides verbose information about your ADF triggers, pipelines, and activities for further analysis.
Add diagnostic settings
(I do not use SSIS in my Data Factories, so I do not have the SSIS logs configured.)
Explore logs with KQL
In the Azure Portal, open the Data Factory where you configured the diagnostic settings and go to Monitoring -> Logs to query the Log Analytics tables that contain the run information for your Data Factory:
Detailed Failure Information
Run queries to get detailed information or aggregated information around failures, as in the example below:
ADFActivityRun
| where Status == 'Failed'
| project ActivityName, TimeGenerated, Error, Input, Output
Extrapolate costs for orchestration
Costs in Azure Data Factory are based on usage: the number of activities run or triggered, the type of Integration Runtime (IR) used, the number of cores used in an IR, and the type of activity. See the Azure Data Factory pricing page for the latest details.
Calculations for Orchestration activities are simple: sum up the number of failed or successful activities (ADFActivityRun) plus the number of triggers executed (ADFTriggerRun) plus the number of debug runs (ADFSandboxPipelineRun). The table below summarizes the cost per 1000 runs (as of 11/14/2022):
VNet Managed IR
Here’s a sample query that counts the number of activity runs by IR, to which you can apply the cost per IR:
ADFActivityRun
| where Status != "Queued" and Status != "InProgress"
| where EffectiveIntegrationRuntime != ""
| summarize count() by EffectiveIntegrationRuntime
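The orchestration arithmetic described above (activity runs plus trigger runs plus debug runs, priced per 1,000 runs) can be sketched as follows. The function name is hypothetical, and the price used in the example is a placeholder, not a real rate. Substitute the current per-IR price from the pricing page.

```python
def orchestration_cost(activity_runs: int,
                       trigger_runs: int,
                       debug_runs: int,
                       price_per_1000_runs: float) -> float:
    """Estimate orchestration cost for one Integration Runtime type.

    The three counts correspond to rows in ADFActivityRun, ADFTriggerRun,
    and ADFSandboxPipelineRun. The price is supplied by the caller.
    """
    total_runs = activity_runs + trigger_runs + debug_runs
    return total_runs / 1000 * price_per_1000_runs


if __name__ == "__main__":
    # Placeholder price of 1.00 per 1,000 runs, for illustration only.
    estimate = orchestration_cost(activity_runs=12_000,
                                  trigger_runs=300,
                                  debug_runs=200,
                                  price_per_1000_runs=1.00)
    print(f"Estimated orchestration cost: ${estimate:.2f}")
```

Because the rate differs by IR type, you would call this once per IR using the per-IR counts and then add up the results.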
Costs are also accrued based upon the type of activity, the activity run duration, and the Integration Runtime used. This data is available in the ADFActivityRun table. Below are the cost details for pipeline activities by IR (for West US 2, as of 11/14/2022):
VNet Managed IR
Data movement activities
External pipeline activities
The example query below derives the elements highlighted above that contribute to the Activity cost:
ADFActivityRun
| where Status != "Queued" and Status != "InProgress"
| project ActivityJson = parse_json(Output)
| project billing = parse_json(ActivityJson.billingReference.billableDuration), ActivityType = parse_json(ActivityJson.billingReference.activityType)
| where ActivityType == "PipelineActivity"
| evaluate bag_unpack(billing)
| project duration, meterType, unit
Dataflow activity costs are based upon whether the cluster is General Purpose or Memory Optimized, as well as the data flow run duration (costs as of 11/14/2022 for West US 2):
- General Purpose – $0.274 per vCore-hour
- Memory Optimized – $0.343 per vCore-hour
Here’s an example query to get elements for Dataflow costs:
ADFActivityRun
| where Status != "Queued" and Status != "InProgress" and ActivityType == "ExecuteDataFlow"
| project ActivityJson = parse_json(Output), InputJSon = parse_json(Input)
| project billing = parse_json(ActivityJson.billingReference.billableDuration), compute = parse_json(InputJSon.compute)
| evaluate bag_unpack(billing)
| evaluate bag_unpack(compute)
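Once you have the compute type and the vCore-hours for each data flow run, the cost estimate is a single multiplication. The sketch below uses the two per-vCore-hour rates quoted above (West US 2, as of 11/14/2022). Pairing the lower rate with General Purpose and the higher with Memory Optimized follows the order in which they are listed and is an assumption. The dictionary keys and function name are also hypothetical.

```python
# Rates in USD per vCore-hour, as quoted above for West US 2 (11/14/2022).
# The mapping of rate to compute type is an assumption; verify it against
# the current pricing page before relying on it.
RATE_PER_VCORE_HOUR = {
    "General": 0.274,
    "MemoryOptimized": 0.343,
}


def dataflow_cost(core_count: int, run_hours: float, compute_type: str) -> float:
    """Estimate the cost of one data flow run from cores, duration, and cluster type."""
    if compute_type not in RATE_PER_VCORE_HOUR:
        raise ValueError(f"Unknown compute type: {compute_type!r}")
    vcore_hours = core_count * run_hours
    return vcore_hours * RATE_PER_VCORE_HOUR[compute_type]


if __name__ == "__main__":
    # Made-up run: 8 cores for 30 minutes on a General Purpose cluster.
    print(f"${dataflow_cost(8, 0.5, 'General'):.3f}")
```

If the billable duration returned by the query is already expressed in vCore-hours, pass it as `run_hours` with `core_count=1` rather than multiplying by the core count again.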
Costs on Data Factory operations are also incurred, but these are generally insignificant (costs as of 11/14/2022 for West US 2):
- Read/Write – $0.50 per 50,000 modified/referenced entities
- Monitoring – $0.25 per 50,000 run records retrieved
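To see why these operation charges are usually small, here is a minimal sketch that applies the two rates quoted above. The function and parameter names are hypothetical. The calculation is linear and ignores any rounding or minimums that actual billing may apply.

```python
ENTITY_RATE = 0.50          # USD per 50,000 modified/referenced entities (quoted above)
RUN_RECORD_RATE = 0.25      # USD per 50,000 run records retrieved (quoted above)
BLOCK_SIZE = 50_000


def operations_cost(entities_touched: int, run_records_retrieved: int) -> float:
    """Linear estimate of Data Factory operations cost from the two quoted rates."""
    entity_cost = entities_touched / BLOCK_SIZE * ENTITY_RATE
    monitoring_cost = run_records_retrieved / BLOCK_SIZE * RUN_RECORD_RATE
    return entity_cost + monitoring_cost


if __name__ == "__main__":
    # Made-up monthly volumes.
    print(f"${operations_cost(100_000, 200_000):.2f}")
```

Even at hundreds of thousands of operations, the estimate stays at a few dollars, which is why orchestration and activity execution dominate most ADF bills.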
For more examples on Data Factory pricing, see Understanding Azure Data Factory pricing through examples.
You can also export all the table data from Log Analytics to Power BI and build your own reports:
Build your own monitoring framework
Some organizations prefer to build their own monitoring platform, extracting pipeline input, output, or error information to SQL or their data platform of choice. You can also send email notifications when an activity fails.
Monitoring your data factories, whether it is with the built-in features of Azure Metrics, Azure Monitor and Log Analytics or through your own auditing framework, helps ensure your workloads continue to be optimized for cost, performance and reliability to meet the tenets of the WAF. New features are continually added to Azure Data Factory and new ideas evolve as well. Please post your comments and feedback with other features or patterns that have helped you monitor your data factories!