My pipeline is slow, what do I do?

It was a Sunday, and I was working on my tile as usual. Pulling a double shift on weekends has become the norm, but enough about that; this isn't my construction blog.

Then an email came in about Azure Data Factory. My friend had watched my video and reached out for help.

Watch it on YouTube

His issue was pretty much as mentioned in the title: "Our bank's Azure Data Factory pipeline is running slow moving data from on-prem. We're copying all the tables in a SQL Server database, plus files from FTP sites and network share drives, to Azure SQL Database Managed Instance and to Blob storage (our data lake). Do you have some recommendations on how to make it go faster? It's around 300 GB and takes over 8 hrs."

So I replied with the following, and I figured I'd post it here as it may help others.

REPLY:

I can share with you that I have production pipelines copying 300-600 GB of data from on-premises to Azure Synapse Analytics (formerly Azure SQL DW) in less than 8 hrs, and that includes the end-to-end pipeline activities: source read, uploading to blob/data lake, importing, transformations, and automations, all with a single IR. So I have a feeling that the performance of your pipeline can be improved.

Here's the list of items I suggested reviewing to help improve the performance of your pipeline.

  1. Proximity (region) of ADF resource to your Source & Destination.
  2. Proximity of your self-hosted IR to your source.
  3. CPU & Network bandwidth availability of the VM hosting your IR.
  4. Network reach, i.e., ExpressRoute.
  5. Consider deploying additional IRs (up to 4 max) to increase the number of parallel threads for your pipeline's data movement.
  6. Use the staging settings for your Copy Data activity; the proximity (region) of the selected storage account can also impact performance.
  7. Use a ForEach loop to dynamically pass a list of tables and copy them in parallel rather than sequentially. Don't move one table at a time by chaining multiple Copy Data activities; instead, use the loop to launch the Copy Data activity for many tables at the same time. See the video link below; this article is a good example as well.
  8. See the Copy Data activity performance tuning guides to increase the number of parallel threads:
    1. Copy activity performance optimization features – Azure Data Factory | Microsoft Docs
    2. Copy data to and from SQL Server – Azure Data Factory | Microsoft Docs
    3. Copy activity performance and scalability guide – Azure Data Factory | Microsoft Docs
    4. Troubleshoot copy activity performance – Azure Data Factory | Microsoft Docs
  9. Analyze your source SQL Server, shared file system, or FTP drive for performance metrics or contention. For SQL Server, I suggested using diskspeed.ps1 to get I/O metrics; the script uses diskspd.exe, which you can copy to the same directory with nothing to install, though there are many performance monitoring tools out there. You can run diskspeed without impacting your server since it only benchmarks for 30-60 seconds (I don't recall the default duration, but it's in the UI; you can't miss it). Still, I don't recommend running it during peak hours.
  10. Index your tables so your pipeline's read operations are properly covered, and consider columnstore depending on your table size. Since you're reading whole tables, a clustered index will work; just make sure there's no fragmentation.
  11. Try to keep your resources current with the latest OS, SQL Server, and IR updates and CUs/hotfixes.
  12. Review other workloads running on your source or destination that may be adding contention during your pipeline's schedule.
  13. Consider incremental loads. For example, have you tried the Data Compare tool in Visual Studio? Since your sync is between SQL Server and Azure SQL MI, you're essentially syncing two SQL Servers. You can implement similar checksum logic to sync your tables incrementally and more frequently, transferring only the changes.
  14. Lastly, you may also consider scaling up your Azure SQL Database Managed Instance during your ETL window to handle the heavy write operations.
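To illustrate why item 7 matters, here's a minimal Python sketch (outside of ADF; the `copy_table` function and table names are made up for illustration) of launching copies in parallel instead of chaining them one after another. In ADF this is what a ForEach activity with sequential execution disabled does with the Copy Data activity inside it.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def copy_table(table: str) -> str:
    # Stand-in for a Copy Data activity; in ADF the ForEach loop
    # would invoke the real copy for each table name passed in.
    time.sleep(0.1)  # simulate data movement
    return f"{table}: copied"

tables = ["dbo.Accounts", "dbo.Transactions", "dbo.Customers", "dbo.Branches"]

# Sequential: total time is roughly the SUM of all copies.
# Parallel: total time is roughly the time of the SLOWEST copy.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(copy_table, tables))

print(results)
```

The same trade-off applies in ADF: chained Copy Data activities pay the full cost of each table in turn, while a parallel ForEach overlaps them, subject to the IR's CPU and bandwidth (items 3 and 5).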
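The checksum idea in item 13 can also be sketched in a few lines. This is a hypothetical, simplified illustration in Python (in practice you'd compute hashes on the SQL side, e.g., with HASHBYTES, rather than pulling rows to compare): hash each row on both sides, and transfer only the rows whose hashes differ or that are missing at the destination.

```python
import hashlib

def row_checksum(row: dict) -> str:
    # Hash the row's values in a stable column order, similar in
    # spirit to a per-row checksum/hash computed in T-SQL.
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def changed_keys(source: dict, dest: dict) -> list:
    # Return only the keys that need to be transferred:
    # rows that are new at the destination, or whose checksum differs.
    return [k for k, row in source.items()
            if k not in dest or row_checksum(row) != row_checksum(dest[k])]

source = {1: {"name": "Ann", "balance": 100},
          2: {"name": "Bob", "balance": 250},
          3: {"name": "Eve", "balance": 75}}
dest   = {1: {"name": "Ann", "balance": 100},
          2: {"name": "Bob", "balance": 200}}  # stale row

print(changed_keys(source, dest))  # rows 2 (modified) and 3 (new)
```

With logic like this, each run moves only the delta instead of all 300 GB, which is usually the biggest single win for a recurring sync between two SQL Servers.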

I recently attended the following webinar, which covers examples of the items mentioned above: https://youtu.be/P_fEa32Ldtk

If you have any other questions, please feel free to reach out.

Sincerely, Hiram
