This article is contributed. See the original author and article here.
In Azure Synapse Analytics, data will be distributed across several distributions based on the distribution type (Hash, Round Robin, and Replicated). So, on an operation like Join condition we may have Compatible Joins or Incompatible Joins which depends on the type of the joined table distribution type and location on the join (LEFT or RIGHT).
As part of query performance troubleshooting/ tuning, one of the main factors to enhance the query performance is to minimize the data movement between distributions. Customers can modify the join types / table distribution type (if applicable) to achieve that.
Below are two examples; the first shows how to get compatible join, and the other one shows how to minimize the data movement by modifying the table (incompatible join).
Examples:
compatible join example, for this example we have two tables [dbo].[DimCustomer] and [dbo].[FactSurveyResponse]
- [dbo].[DimCustomer] distributed as a hash on CustomerKey with integer as a data type
- [dbo].[FactSurveyResponse] distributed as a hash on CustomerKey with integer as a data type as well.
Based on the below reference (Table 1: Compatible Joins) , if we are joining two tables on the same key and same date type no data movements will be required.
CREATE TABLE [dbo].[DimCustomer]
(
[CustomerKey] [int] NOT NULL,
[GeographyKey] [int] NULL,
[CommuteDistance] [nvarchar](15) NULL
)
WITH
(
DISTRIBUTION = HASH ( [CustomerKey] ),
CLUSTERED COLUMNSTORE INDEX
)
CREATE TABLE [dbo].[FactSurveyResponse]
(
[SurveyResponseKey] [int] NOT NULL,
[DateKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[ProductCategoryKey] [int] NOT NULL
)
WITH
(
DISTRIBUTION = HASH ( [CustomerKey] ),
CLUSTERED COLUMNSTORE INDEX
)
SELECT FSR.* , DC.LastName , DC.FirstName
FROM [dbo].[FactSurveyResponse] FSR
INNER JOIN [dbo].[DimCustomer] DC
ON DC.[CustomerKey] = FSR.[CustomerKey]
By examining the produced query steps for the above query, no data movement happened as we have joined the two tables based on the same hash key (CustomerKey)
incompatible join example, for this example we have two tables [dbo].[ DimProductCategory] and [dbo].[FactSurveyResponse]
- [dbo].[ DimProductCategory] distributed as a hash on ProductCategoryKey
- [dbo].[FactSurveyResponse] distributed as a hash on CustomerKey as well.
Based on the below reference (Table 2: Incompatible Joins) , if we are joining two tables on different keys additional data movements (e.g. BroadcastMoveOperation, ShuffleMoveOperation) will be required.
CREATE TABLE [dbo].[FactSurveyResponse]
(
[SurveyResponseKey] [int] NOT NULL,
[DateKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[ProductCategoryKey] [int] NOT NULL
)
WITH
(
DISTRIBUTION = HASH ( [CustomerKey] ),
CLUSTERED COLUMNSTORE INDEX
)
CREATE TABLE [dbo].[DimProductCategory]
(
[ProductCategoryKey] [int] NOT NULL,
[ProductCategoryAlternateKey] [int] NULL,
[EnglishProductCategoryName] [nvarchar](50) NOT NULL,
[SpanishProductCategoryName] [nvarchar](50) NOT NULL,
[FrenchProductCategoryName] [nvarchar](50) NOT NULL
)
WITH
(
DISTRIBUTION = HASH ( [ProductCategoryKey] ),
CLUSTERED COLUMNSTORE INDEX
)
SELECT FSR.* , DPC.[EnglishProductCategoryName]
FROM [dbo].[FactSurveyResponse] FSR
INNER JOIN [dbo].[DimProductCategory] DPC
ON DPC.[ProductCategoryKey] = FSR.[ProductCategoryKey]
Based on the above there was a BroadcastMoveOperation which moved 4 rows, since we have joined the two tables on different keys (incompatible Join). One of the ways to remove the BroadcastMoveOperation is to use REPLICATE distribution type.
Note: The customer needs to consider the changes not CSS, since this might affect other queries and requires a structure redesign for related objects. In the next step we will create a new table by using CTAS with REPLICATE distribution data type.
Steps to minimize the data movements (Just an example).
- Create a new table with REPLICATE distribution by using CTAS, and verify that both left and right table has the predicate joins data type. (e.g. int = int)
Build the replicate cash.
CREATE TABLE [dbo].[DimProductCategory_Replicate]
WITH (DISTRIBUTION=REPLICATE)
AS SELECT * FROM [dbo].[DimProductCategory]
SELECT FSR.* , DPC.[EnglishProductCategoryName]
FROM [dbo].[FactSurveyResponse] FSR
INNER JOIN [dbo].[DimProductCategory_Replicate] DPC
ON DPC.[ProductCategoryKey] = FSR.[ProductCategoryKey]
- Based on the Compatible Joins table we can get a compatible join either by having the replicated table on left or right.
By examining the produced query steps for the above query, no data movement happened. Based on the below reference (Table 1: Compatible Joins) if the left table is distributed and the right table is replicated along with inner join no data movement is required.
Compatible and Incompatible Joins table reference:
Compatible Joins: is a join that doesn’t require data movement before each compute node.
Table 1: Compatible Joins
Join type | Left Table | Right Table | Compatibility |
All join types | Replicated | Replicated | Compatible – no data movement required. |
Inner Join | Replicated | Distributed | Compatible – no data movement required. |
Inner Join | Distributed | Replicated | Compatible – no data movement required. |
All join types, except cross joins, can be compatible. | Distributed | Distributed | Compatible – no data movement required if the join predicate is an equality join and if the predicate joins two distributed columns that have matching data types. |
Example:
Incompatible Joins is a join that requires data movement before each compute node.
Note: Data movement operations take extra time and storage and can negatively impact query performance.
Table 2: Incompatible Joins
Join type | Left Table | Right Table | Compatibility |
Left Outer Join | Replicated | Distributed | Incompatible – requires data movement before the join. |
Right Outer Join | Distributed | Replicated | Incompatible – requires data movement before the join. |
See Compatibility column for details. | Distributed | Distributed | Incompatible – requires data movement if joins have different keys and the predicate joins on the distributed columns have different data types. |
Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.
Recent Comments