This article is contributed. See the original author and article here.

Today, I worked in a very interesting case where our customer wants to insert millions of rows using Python. We reviewed two alternatives to import the data as soon as possible: Using BCP command line and using executemany command. Following I would like to share my lessons learned about it.


 


Background:


 



  • We have 5 CSV files with  111.100.000 and around 22 columns (20 varchar(6) and 2 int data type columns). 

  • The Azure SQL Database is Standard 9 

  • Virtual Machine:

    • All files are placed on temporary drive (D:).

    • The accelerated networking is disabled.

    • vCore 4 and 16 Gb of RAM



  • The goal is to import the data as soon as possible.  


 


Lessons Learned:


 



  • I developed two different python script, both will read every CSVs files but using two different methods to import the data:

    • 1) Using executemany method:

      • As I saw that a single thread will take too much time, I configure to run 100 threads at the same time. 

      • The main thread reads the CSV file and when it reached 10000 rows, it sends this array of data to a single thread that execute the executemany in background. I managed the status of 100 threads.

        • This thread creates a temporal table in TEMPDB

        • Once it finished runs an INSERT INTO to the main table. 

        • This main table is a head table without any non-clustered indexes.



      • The time spent to insert all the data was at database level:

        • 40 mins of execution.

        • CPU DB usage 70%-80%

        • LOG IO usage 60%-70%

        • Every minute 3,4 M rows are inserted.





    • 2) Calling BCP.exe 

      • Using the main thread to read every CSV. Batch size is 10000. 

      • The time spent to insert all the data was at database level:


        • 20 mins.

        • CPU DB usage 5%-10%

        • LOG IO usage 80%-90%

        • Every minute 7M rows are inserted








 


It is clear that BCP is better to even when executemany is using a prepared TSQL command to improve the performance. I will check if we have the save option as we have in .NET – SqlBulkCopy that the performance is much better.


 


Enjoy!

Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.