In an ongoing ELT scenario, a very common requirement is to load only the new files after an initial full data load. A typical example: files are continually dropped into a landing folder in your source store, and you want an easy way to copy only the new files to your data lake store instead of repeatedly copying files that were already copied last time. In this blog, we will talk about several best practices for using the ADF copy activity to achieve that.
The best approach to copying only new files depends on your data pattern and on your environment, so you cannot pick one without understanding both. Given that, we illustrate four different scenarios below, with the best practice for using ADF to copy only new files in each.
If your files are no longer needed in the source store after being moved to the destination store, we suggest simply deleting them from the source store once they have been moved successfully, by setting “deleteFilesAfterCompletion” to true in the copy activity. That way, every file that shows up in the source store is a new file by nature.
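As a rough illustration, the source section of such a copy activity could look like the dictionary built below, which serializes to the JSON that a pipeline definition uses. Only “deleteFilesAfterCompletion” comes from this article. The type names `BinarySource` and `AzureBlobStorageReadSettings`, the `storeSettings` nesting, and the `recursive` flag are assumptions for a binary copy from Azure Blob Storage, and the helper name is hypothetical, so check them against the documentation for your connector.

```python
import json


def build_delete_after_copy_source() -> dict:
    """Sketch of a copy activity source that deletes files after a successful copy."""
    return {
        "type": "BinarySource",  # assumption: binary file copy
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",  # assumption: Blob Storage source
            "recursive": True,  # assumption: include subfolders
            "deleteFilesAfterCompletion": True,  # the setting named in this article
        },
    }


source = build_delete_after_copy_source()
# Print the JSON form, which is what would appear in the pipeline definition.
print(json.dumps(source, indent=2))
```

Because the source files are removed only after a successful move, a failed run leaves them in place to be picked up by the next run.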
If the files cannot be deleted from the source store after being moved to the destination, check whether your folders or files are partitioned by time. For example, your folder structure may follow a pattern like “yyyy/mm/dd/”. If so, you can use an ADF system variable together with a parameter to pick up only the new files via the time-partitioned folder name or file name. You can do this by following the instructions below:
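To make the idea concrete, here is a minimal Python sketch of how a time window maps to “yyyy/mm/dd/” folder names. It is not ADF code: in a pipeline, the same string would typically be produced by an expression (for example with the `formatDateTime` function) over a window start time supplied by the trigger. The function names and the daily partition granularity are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def partition_folder(window_start: datetime) -> str:
    """Return the daily partition folder, e.g. 2020/03/09/, for a window start."""
    return window_start.strftime("%Y/%m/%d/")


def folders_for_window(window_start: datetime, window_end: datetime) -> list:
    """List the daily folders that overlap the half-open window [start, end)."""
    # Align to midnight so a window starting mid-day still includes that day's folder.
    day = window_start.replace(hour=0, minute=0, second=0, microsecond=0)
    folders = []
    while day < window_end:
        folders.append(partition_folder(day))
        day += timedelta(days=1)
    return folders


print(partition_folder(datetime(2020, 3, 9)))
print(folders_for_window(datetime(2020, 3, 9), datetime(2020, 3, 11)))
```

Each run then reads only the folder (or folders) for its own window, so files in older partitions are never scanned again.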
If your data pattern does not fit scenario #1 or #2, check whether the file property “LastModifiedDate” can be used to distinguish the new files from the old ones. If so, you can copy only the new and changed files by setting “modifiedDatetimeStart” and “modifiedDatetimeEnd” in the ADF dataset. ADF will scan all the files in the source store, filter them by their LastModifiedDate, and copy only the files that are new or updated since the last run to the destination store. Be aware that if ADF has to scan a huge number of files but copies only a few of them, the run can still take a long time, because the file scan itself is time consuming.
You can follow the instructions below:
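The filter can be pictured with the short Python sketch below. It is a local simulation and not the ADF implementation. It assumes the window is half-open, with the start inclusive and the end exclusive, which is my understanding of how ADF applies these two settings and is worth confirming in the connector documentation. `SourceFile`, `select_modified`, and `build_window_settings` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class SourceFile(NamedTuple):
    path: str
    last_modified: datetime


def select_modified(files, start: datetime, end: datetime) -> list:
    """Keep files whose last-modified time falls in [start, end)."""
    return [f.path for f in files if start <= f.last_modified < end]


def build_window_settings(start: datetime, end: datetime) -> dict:
    """Format the window as UTC timestamps for the two settings named in this article."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "modifiedDatetimeStart": start.astimezone(timezone.utc).strftime(fmt),
        "modifiedDatetimeEnd": end.astimezone(timezone.utc).strftime(fmt),
    }


last_run = datetime(2020, 3, 9, tzinfo=timezone.utc)
this_run = datetime(2020, 3, 10, tzinfo=timezone.utc)
files = [
    SourceFile("a.csv", datetime(2020, 3, 8, 23, 0, tzinfo=timezone.utc)),
    SourceFile("b.csv", datetime(2020, 3, 9, 6, 30, tzinfo=timezone.utc)),
    SourceFile("c.csv", datetime(2020, 3, 10, 0, 0, tzinfo=timezone.utc)),
]
print(select_modified(files, last_run, this_run))
print(build_window_settings(last_run, this_run))
```

Using the previous run's end as the next run's start gives back-to-back windows, so under the half-open assumption a file is neither skipped nor copied twice. Note that every file is still inspected, which is the scanning cost described above.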
If none of the approaches above fits your scenario, you need to build a custom way to produce the list of new files and pass that list to ADF to copy them. The ADF copy activity can consume a text file that lists the files you want to copy.
More information is available below:
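One possible shape for that custom step is sketched below in Python: compare the current source listing with the set of files already copied, then write the difference to a text file, one path per line. How you obtain the two listings is up to you, and the helper names and sample paths are hypothetical. To my knowledge the copy activity source setting that points at such a file is `fileListPath`, with paths relative to the folder configured in the dataset, but treat that as an assumption to verify for your connector.

```python
import tempfile
from pathlib import Path


def new_files(current, already_copied) -> list:
    """Return the paths present in the source now but not copied before, sorted."""
    copied = set(already_copied)
    return sorted(p for p in current if p not in copied)


def write_file_list(paths, list_file: Path) -> None:
    """Write one relative path per line, the layout a file-list text file uses."""
    list_file.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")


current_listing = ["in/a.csv", "in/b.csv", "in/c.csv"]  # hypothetical source listing
copied_before = ["in/a.csv"]  # hypothetical record of earlier copies

with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "new_files.txt"
    write_file_list(new_files(current_listing, copied_before), list_path)
    written = list_path.read_text(encoding="utf-8")

print(written, end="")
```

After a successful copy, the paths in the list would be added to the record of copied files so that the next run starts from an up-to-date baseline.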