This article is contributed. See the original author and article here.

In Azure Synapse Analytics, data will be distributed across several distributions based on the distribution type (Hash, Round Robin, and Replicated). So, on an operation like Join condition we may have Compatible Joins or Incompatible Joins which depends on the type of the joined table distribution type and location on the join (LEFT or RIGHT).


 


As part of query performance troubleshooting/ tuning, one of the main factors to enhance the query performance is to minimize the data movement between distributions. Customers can modify the join types / table distribution type (if applicable) to achieve that.


 


Below are two examples; the first shows how to get compatible join, and the other one shows how to minimize the data movement by modifying the table (incompatible join).


 


Examples:


compatible join example, for this example we have two tables [dbo].[DimCustomer] and [dbo].[FactSurveyResponse]



  • [dbo].[DimCustomer] distributed as a hash on CustomerKey with integer as a data type

  • [dbo].[FactSurveyResponse] distributed as a hash on CustomerKey with integer as a data type as well.


Based on the below reference (Table 1: Compatible Joins) , if we are joining two tables on the same key and same date type no data movements will be required.


 


 

CREATE TABLE [dbo].[DimCustomer]
(
	[CustomerKey] [int] NOT NULL,
	[GeographyKey] [int] NULL,
	[CommuteDistance] [nvarchar](15) NULL
)
WITH
(
	DISTRIBUTION = HASH ( [CustomerKey] ),
	CLUSTERED COLUMNSTORE INDEX
)

CREATE TABLE [dbo].[FactSurveyResponse]
(
	[SurveyResponseKey] [int] NOT NULL,
	[DateKey] [int] NOT NULL,
	[CustomerKey] [int] NOT NULL,
	[ProductCategoryKey] [int] NOT NULL
)
WITH
(
	DISTRIBUTION = HASH ( [CustomerKey] ),
	CLUSTERED COLUMNSTORE INDEX
)

SELECT FSR.* , DC.LastName , DC.FirstName
FROM [dbo].[FactSurveyResponse] FSR
INNER JOIN [dbo].[DimCustomer] DC
ON DC.[CustomerKey] = FSR.[CustomerKey]

 


 


By examining the produced query steps for the above query, no data movement happened as we have joined the two tables based on the same hash key (CustomerKey)


 


incompatible join example, for this example we have two tables [dbo].[ DimProductCategory] and [dbo].[FactSurveyResponse]



  • [dbo].[ DimProductCategory] distributed as a hash on ProductCategoryKey

  • [dbo].[FactSurveyResponse] distributed as a hash on CustomerKey as well.


Based on the below reference (Table 2: Incompatible Joins)  , if we are joining two tables on different keys additional data movements (e.g. BroadcastMoveOperation, ShuffleMoveOperation) will be required.


 


 

CREATE TABLE [dbo].[FactSurveyResponse]
(
	[SurveyResponseKey] [int] NOT NULL,
	[DateKey] [int] NOT NULL,
	[CustomerKey] [int] NOT NULL,
	[ProductCategoryKey] [int] NOT NULL
)
WITH
(
	DISTRIBUTION = HASH ( [CustomerKey] ),
	CLUSTERED COLUMNSTORE INDEX
)

CREATE TABLE [dbo].[DimProductCategory]
(
	[ProductCategoryKey] [int] NOT NULL,
	[ProductCategoryAlternateKey] [int] NULL,
	[EnglishProductCategoryName] [nvarchar](50) NOT NULL,
	[SpanishProductCategoryName] [nvarchar](50) NOT NULL,
	[FrenchProductCategoryName] [nvarchar](50) NOT NULL
)
WITH
(
	DISTRIBUTION = HASH ( [ProductCategoryKey] ),
	CLUSTERED COLUMNSTORE INDEX
)

SELECT FSR.* , DPC.[EnglishProductCategoryName]
FROM [dbo].[FactSurveyResponse] FSR
INNER JOIN [dbo].[DimProductCategory] DPC
ON DPC.[ProductCategoryKey] = FSR.[ProductCategoryKey]

 


 


Steps-2.png 


Based on the above there was a BroadcastMoveOperation which moved 4 rows, since we have joined the two tables on different keys (incompatible Join). One of the ways to remove the BroadcastMoveOperation is to use REPLICATE distribution type.


 


Note: The customer needs to consider the changes not CSS, since this might affect other queries and requires a structure redesign for related objects. In the next step we will create a new table by using CTAS with REPLICATE distribution data type.


 


Steps to minimize the data movements (Just an example).



  • Create a new table with REPLICATE distribution by using CTAS, and verify that both left and right table has the predicate joins data type. (e.g. int = int)


Build the replicate cash.


 


 

CREATE TABLE [dbo].[DimProductCategory_Replicate] 
WITH (DISTRIBUTION=REPLICATE) 
AS SELECT * FROM [dbo].[DimProductCategory]

SELECT FSR.* , DPC.[EnglishProductCategoryName]
FROM [dbo].[FactSurveyResponse] FSR
INNER JOIN [dbo].[DimProductCategory_Replicate] DPC
ON DPC.[ProductCategoryKey] = FSR.[ProductCategoryKey]

 



  • Based on the Compatible Joins table we can get a compatible join either by having the replicated table on left or right.


By examining the produced query steps for the above query, no data movement happened. Based on the below reference (Table 1: Compatible Joins)  if the left table is distributed and the right table is replicated along with inner join no data movement is required.


 


Steps-3.png


Compatible and Incompatible Joins table reference:


 


Compatible Joins: is a join that doesn’t require data movement before each compute node.


 


Table 1: Compatible Joins




































Join type



Left Table



Right Table



Compatibility



All join types



Replicated



Replicated



Compatible – no data movement required.



Inner Join

Right Outer Join

Cross Join



Replicated



Distributed



Compatible – no data movement required.



Inner Join

Left Outer Join

Cross Join



Distributed



Replicated



Compatible – no data movement required.



All join types, except cross joins, can be compatible.



Distributed



Distributed



Compatible – no data movement required if the join predicate is an equality join and if the predicate joins two distributed columns that have matching data types.

Cross Joins are always incompatible.



 


Example:


 


Incompatible Joins is a join that requires data movement before each compute node.


Note: Data movement operations take extra time and storage and can negatively impact query performance.


 


Table 2: Incompatible Joins






























Join type



Left Table



Right Table



Compatibility



Left Outer Join

Full Outer Join



Replicated



Distributed



Incompatible – requires data movement before the join.



Right Outer Join

Full Outer Join



Distributed



Replicated



Incompatible – requires data movement before the join.



See Compatibility column for details.



Distributed



Distributed



Incompatible – requires data movement if joins have different keys and the predicate joins on the distributed columns have different data types.

Cross joins are always incompatible.



 

Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.

%d bloggers like this: