This article is contributed. See the original author and article here.
I was working on this case where a customer load a certain amount of rows and when he compared the rows that he already had on the DW to the rows that he wanted to insert the process just get stuck for 2 hours and fails with TEMPDB full.
There was no transaction open or running before the query or even with the query. The query was running alone and failing alone.
This problem could be due to different reasons. I am just trying to show some possibilities to troubleshooting by writing this post.
Following some tips:
1) Check for transactions open that could fill your tempdb
2) Check if there is skew data causing a large amount of the data moving between the nodes (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#:~:text=A%20quick%20way%20to%20check%20for%20data%20skew,should%20be%20spread%20evenly%20across%20all%20the%20distributions.)
3) Check the stats. Maybe the plan is been misestimate. (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-statistics)
4)If like me, there was no skew data, no stats and no transaction open causing this.
So the next step was to check the select performance and execution plan.
1) Find the active queries:
SELECT * FROM sys.dm_pdw_exec_requests WHERE status not in ('Completed','Failed','Cancelled') AND session_id <> session_id() ORDER BY submit_time DESC;
2) By finding the query that you know is failing to get the query id:
SELECT * FROM sys.dm_pdw_request_steps WHERE request_id = 'QID####' ORDER BY step_index;
3) Filter the step that is taking most of the time based on the last query results:
SELECT * FROM sys.dm_pdw_sql_requests WHERE request_id = 'QID####' AND step_index = XX;
4) From the last query results filter the distribution id and spid replace those values on the next query:
DBCC PDW_SHOWEXECUTIONPLAN ( distribution_id, spid )
We checked the XML plan that we got a sort to follow by estimated rows pretty high in each of the distributions. So it seems while the plan was been sorted in tempdb it just run out of space.
Hence, this was not a case of stats not updated and also the distribution keys were hashed as the same in all the tables involved which were temp tables.
If you want to know more:
So basically we had something like:
SELECT TMP1.* FROM [#TMP1] AS TMP1 LEFT OUTER JOIN [#TMP2] TMP2 ON TMP1.[key] = TMP2.[key] LEFT OUTER JOIN [#TMP3] TMP3 ON TMP1.[key] = TMP3.[key] AND TMP1.[Another_column] = TMP3.[Another_column] WHERE TMP3.[key] IS NULL;
After we discussed we reach the conclusion what it was needed was everything from TMP1 that does not exist on TMP2 and TMP3. So as the plan with the LEFT Join was not a very good plan with a potential cartesian product we replace with the following:
SELECT TMP1.* FROM [#TMP1] AS TMP1 WHERE NOT EXISTS ( SELECT 1 FROM [#TMP2] TMP2 WHERE TMP1.[key] = TMP2.[key]) AND WHERE NOT EXISTS ( SELECT 1 FROM [#TMP3] TMP3 WHERE TMP1.[key] = TMP3.[key] AND TMP1.[Another_column] = TMP3.[Another_column])
This second one runs in seconds. So the estimation was much better and the Tempdb was able to handle just fine.
That is it!
Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.