Distributed key considerations for data movement on SQL DW performance

I was working on a few performance cases last week, so I thought it would be a good idea to pick up the plan I started in 2017 (yes, I know, a long time ago) and discuss SQL DW performance in more detail.

From the time tunnel, here is my post from 2017: https://docs.microsoft.com/en-gb/archive/blogs/dataplatform/sql-azure-dw-what-is-it-how-it-works

I worked on those cases with my colleague Frederico Guimaraes, who has a lot of experience in this area.

I will copy and paste a few concepts from my previous post to give us some common ground:

MPP means…

MPP stands for massively parallel processing. It is “divide and conquer”. Azure DW relies on nodes and CPUs, instead of only CPUs, to process a task. Our classical SQL Server divides a task across different CPUs, which is parallel processing. Azure DW processes a task on CPUs running in different nodes (computers).

In order to achieve this distributed architecture DW has:

Control node: The Control node manages and optimizes queries. It is the front end that interacts with all applications and connections.

Compute nodes: The Compute nodes serve as the power behind SQL Data Warehouse. They are SQL Databases that store your data and process your query.

Storage: Your data is stored in Azure Blob storage. When Compute nodes interact with your data, they write and read directly to and from blob storage.

(https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-what-is)

Distribution columns:

Behind the scenes, SQL Data Warehouse divides your data into 60 databases. Each individual database is referred to as a distribution. When data is loaded into each table, SQL Data Warehouse has to know how to divide your data across these 60 distributions.

So the column chosen as the distribution key will be used to distribute the data across the distributions, and therefore across the nodes.

We have two types of distribution:

Round robin, which distributes data evenly but randomly. As the name suggests, rows are handed out to the distributions in round-robin fashion.
Hash distributed, which distributes data based on hashing values from a single column. Hash-distributed tables are divided between the distributed databases using a hashing algorithm on a single column that you select.
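
To make the two options concrete, here is a minimal sketch of a round-robin table, followed by a check of how many rows landed on each distribution. The table name Medallion_rr is hypothetical and is not part of the example used later in this post; the hash-distributed version appears further down.

```sql
-- Hypothetical table for illustration: round-robin distributed,
-- so there is no distribution column at all.
CREATE TABLE [dbo].[Medallion_rr]
(
    [MedallionID] [int] NOT NULL,
    [MedallionBKey] [varchar](50) NOT NULL,
    [MedallionCode] [varchar](50) NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- Returns rows and space used per distribution for the table.
-- Large differences in row counts between distributions indicate skew.
DBCC PDW_SHOWSPACEUSED('dbo.Medallion_rr');
```

The same DBCC command can be run against a hash-distributed table to see whether the chosen key spreads the rows evenly.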

Ok that is enough…

Once you have created your distributed table and defined the distribution key, keep in mind that the key holds the secret to avoiding data movement on large tables. If you are going to join large tables, it is a good idea to join the distributed tables on their distribution keys.

For example:

/****** Object:  Table [dbo].[Medallion]    Script Date: 14/07/2020 10:13:05 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Medallion]
(
	[MedallionID] [int] NOT NULL,
	[MedallionBKey] [varchar](50) NOT NULL,
	[MedallionCode] [varchar](50) NULL
)
WITH
(
	DISTRIBUTION = HASH ( [MedallionID] ),
	CLUSTERED COLUMNSTORE INDEX
)
GO
/****** Object:  Table [dbo].[Medallion_hash]    Script Date: 14/07/2020 10:13:05 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Medallion_hash]
(
	[MedallionID] [int] NOT NULL,
	[MedallionBKey] [varchar](50) NOT NULL,
	[MedallionCode] [varchar](50) NULL
)
WITH
(
	DISTRIBUTION = HASH ( [MedallionID] ),
	CLUSTERED COLUMNSTORE INDEX
)
GO

It is pretty much the same table with the same data and the same distribution key.
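
If you did not create the tables yourself, one way to confirm which column each table is distributed on is to query the catalog views. This is a sketch that assumes the two table names used in the queries in this post (Medallion and Medallion_hash) and should be adapted to your own schema.

```sql
-- Lists the distribution policy and, for hash-distributed tables,
-- the distribution column. Round-robin tables show NULL for the column.
SELECT s.name  AS schema_name,
       t.name  AS table_name,
       dp.distribution_policy_desc,
       c.name  AS distribution_column
FROM sys.tables t
JOIN sys.schemas s
    ON s.schema_id = t.schema_id
JOIN sys.pdw_table_distribution_properties dp
    ON dp.object_id = t.object_id
LEFT JOIN sys.pdw_column_distribution_properties cdp
    ON cdp.object_id = t.object_id
   AND cdp.distribution_ordinal = 1   -- 1 marks the distribution column
LEFT JOIN sys.columns c
    ON c.object_id = cdp.object_id
   AND c.column_id = cdp.column_id
WHERE t.name IN ('Medallion', 'Medallion_hash');
```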

If my query does not include the distribution key in the join, the MPP engine will build the plan in a different way, even if the results would be the same as when the key is included. Let me explain by showing:

Note that I included the EXPLAIN command to get the query plan:

EXPLAIN
SELECT m.[MedallionID]
      ,H.[MedallionBKey]
      ,m.[MedallionCode]
  FROM [dbo].[Medallion] m 
  INNER JOIN Medallion_hash h
   ON  m.MedallionID = h.MedallionID 
   AND  m.MedallionBKey=h.MedallionBKey

Result:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="60" number_distributions_per_node="60">
<sql>SELECT m.[MedallionID]
,H.[MedallionBKey]
,m.[MedallionCode]
FROM [dbo].[Medallion] m 
INNER JOIN Medallion_hash h on m.MedallionID = h.MedallionID and m.MedallionBKey=h.MedallionBKey</sql>
<dsql_operations total_cost="0" total_number_operations="1">
<dsql_operation operation_type="RETURN">
<location distribution="AllDistributions" />
<select>SELECT [T1_1].[MedallionID] AS [MedallionID], [T1_1].[MedallionBKey] AS [MedallionBKey], [T1_1].[MedallionCode] AS [MedallionCode] FROM (SELECT [T2_2].[MedallionID] AS [MedallionID], [T2_1].[MedallionBKey] AS [MedallionBKey], [T2_2].[MedallionCode] AS [MedallionCode] FROM [SQLDW].[dbo].[Medallion_hash] AS T2_1 INNER JOIN
[SQLDW].[dbo].[Medallion] AS T2_2
ON (([T2_2].[MedallionID] = [T2_1].[MedallionID]) AND ([T2_2].[MedallionBKey] = [T2_1].[MedallionBKey]))) AS T1_1
OPTION (MAXDOP 2)</select>
</dsql_operation>
</dsql_operations>
</dsql_query>

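Both tables are distributed on MedallionID and the join includes that column, so the plan has a single RETURN operation and no data movement. Then I changed the query so that it no longer includes my distribution key, MedallionID, and joins on MedallionBKey only. Now I have data movement: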

EXPLAIN
SELECT m.[MedallionID]
      ,H.[MedallionBKey]
      ,m.[MedallionCode]
  FROM [dbo].[Medallion] m 
  INNER JOIN Medallion_hash h
   ON   m.MedallionBKey=h.MedallionBKey

Result is:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="60" number_distributions_per_node="60">
  <sql>SELECT m.[MedallionID]
      ,H.[MedallionBKey]
      ,m.[MedallionCode]
  FROM [dbo].[Medallion] m 
  INNER JOIN Medallion_hash h on   m.MedallionBKey=h.MedallionBKey</sql>
  <dsql_operations total_cost="6.451296" total_number_operations="9">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_17</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [qtabledb].[dbo].[TEMP_ID_17] ([MedallionBKey] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ) WITH(DISTRIBUTED_MOVE_FILE='');</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="SHUFFLE_MOVE">
      <operation_cost cost="1.749504" accumulative_cost="1.749504" average_rowsize="32" output_rows="13668" GroupNumber="4" />
      <source_statement>SELECT [T1_1].[MedallionBKey] AS [MedallionBKey] FROM [SQLDW].[dbo].[Medallion_hash] AS T1_1
OPTION (MAXDOP 2, MIN_GRANT_PERCENT = [MIN_GRANT], DISTRIBUTED_MOVE(N''))</source_statement>
      <destination_table>[TEMP_ID_17]</destination_table>
      <shuffle_columns>MedallionBKey;</shuffle_columns>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_18</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [qtabledb].[dbo].[TEMP_ID_18] ([MedallionID] INT NOT NULL, [MedallionBKey] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [MedallionCode] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DISTRIBUTED_MOVE_FILE='');</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="SHUFFLE_MOVE">
      <operation_cost cost="4.701792" accumulative_cost="6.451296" average_rowsize="86" output_rows="13668" GroupNumber="3" />
      <source_statement>SELECT [T1_1].[MedallionID] AS [MedallionID], [T1_1].[MedallionBKey] AS [MedallionBKey], [T1_1].[MedallionCode] AS [MedallionCode] FROM [SQLDW].[dbo].[Medallion] AS T1_1
OPTION (MAXDOP 2, MIN_GRANT_PERCENT = [MIN_GRANT], DISTRIBUTED_MOVE(N''))</source_statement>
      <destination_table>[TEMP_ID_18]</destination_table>
      <shuffle_columns>MedallionBKey;</shuffle_columns>
    </dsql_operation>
    <dsql_operation operation_type="RETURN">
      <location distribution="AllDistributions" />
      <select>SELECT [T1_1].[MedallionID] AS [MedallionID], [T1_1].[MedallionBKey] AS [MedallionBKey], [T1_1].[MedallionCode] AS [MedallionCode] FROM (SELECT [T2_2].[MedallionID] AS [MedallionID], [T2_1].[MedallionBKey] AS [MedallionBKey], [T2_2].[MedallionCode] AS [MedallionCode] FROM [qtabledb].[dbo].[TEMP_ID_17] AS T2_1 INNER JOIN
[qtabledb].[dbo].[TEMP_ID_18] AS T2_2
ON ([T2_1].[MedallionBKey] = [T2_2].[MedallionBKey])) AS T1_1
OPTION (MAXDOP 2, MIN_GRANT_PERCENT = [MIN_GRANT])</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [qtabledb].[dbo].[TEMP_ID_18]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [qtabledb].[dbo].[TEMP_ID_17]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

Note that data movement is happening in the plan: <dsql_operation operation_type="SHUFFLE_MOVE">. This means (copy and paste again from my previous post):

SHUFFLE_MOVE – Redistributes a distributed table. The redistributed table has a different distribution column than the original distributed table. This might be used when running incompatible joins or incompatible aggregations.

To perform this operation, SQL DW will move each row to the correct Compute node according to the distribution column of the destination table. Rows that are already stored on the correct Compute node are not copied during this operation.
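
If joining on MedallionBKey is the pattern you really need, one possible way to remove the shuffle is to rebuild both tables with CTAS so that they are hash distributed on that column. This is only a sketch: the new table names (Medallion_bkey and Medallion_hash_bkey) are hypothetical, and it assumes MedallionBKey has enough distinct values to spread the rows evenly across the 60 distributions.

```sql
-- Hypothetical copies of both tables, redistributed on the join column.
CREATE TABLE [dbo].[Medallion_bkey]
WITH
(
    DISTRIBUTION = HASH ( [MedallionBKey] ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [MedallionID], [MedallionBKey], [MedallionCode]
FROM [dbo].[Medallion];

CREATE TABLE [dbo].[Medallion_hash_bkey]
WITH
(
    DISTRIBUTION = HASH ( [MedallionBKey] ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [MedallionID], [MedallionBKey], [MedallionCode]
FROM [dbo].[Medallion_hash];

-- Check the plan again: the join is now on the distribution key of both tables.
EXPLAIN
SELECT m.[MedallionID]
      ,h.[MedallionBKey]
      ,m.[MedallionCode]
  FROM [dbo].[Medallion_bkey] m
  INNER JOIN [dbo].[Medallion_hash_bkey] h
   ON m.MedallionBKey = h.MedallionBKey;
```

Whether this is a good trade depends on the other queries that use these tables, because changing the distribution key can introduce data movement for joins that rely on MedallionID.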

So the case I mentioned was about exactly that. There was data movement in the plan that was not desired, so we took some actions:

1) Review the distribution keys on the table

2) Review the stats. Wrong or outdated stats can lead the MPP engine to misestimate the plan.

Here is a simple query to check your stats:

SELECT stats_id, name AS stats_name, 
    STATS_DATE(object_id, stats_id) AS statistics_date
FROM sys.stats s
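
If the dates returned by the query above are old, or a join column has no statistics at all, the statistics can be created or refreshed. The following is a minimal sketch for the example tables in this post; the statistics names are hypothetical.

```sql
-- Create single-column statistics on the join column of each table
-- (statistics names are made up for this example).
CREATE STATISTICS [stats_Medallion_MedallionBKey]
    ON [dbo].[Medallion] ([MedallionBKey]);

CREATE STATISTICS [stats_Medallion_hash_MedallionBKey]
    ON [dbo].[Medallion_hash] ([MedallionBKey]);

-- Refresh all existing statistics on a table, for example after a large load.
UPDATE STATISTICS [dbo].[Medallion];
UPDATE STATISTICS [dbo].[Medallion_hash];
```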

That is it! I hope the examples help you get an idea of how important it is to choose distribution keys carefully.

Liliam C Leme

UK Engineer
