Replicated (optional) vs Round Robin (default) vs Hash (optional)
The statement about replicated vs round robing is simply a storage distribution setting.
In the data warehouse world due to the massive parallel processing capabilities these distribution settings help us do performance tuning so queries run optimal at a lower cost.
Azure SQL Data Warehouse has many compute nodes in a cluster.
By default any new table (not external tables) gets created as Round Robin distribution. This means peaces of the table are distributed across all nodes in a round robin fashion. So you end up having peaces of your table stored across many servers, in general this is not bad.
However, the option to create a table with Replicated distribution, means the entire table contents gets replicated across ALL of the servers in the dw. This is excellent for small tables. Note, if you resume (start) or scale your dw you may want to refresh your replicated tables to warm-up the cache. A simple select top 1 1 from replicatedtable would do or schedule this PowerShell Runbook in an Azure Automation Account. Each compute node can simply retrieve data for that table locally, it’s very performant and decreases data movement.
The caveat becomes…okay, what is small and what is big? We look at a couple of things to make a quick decision.
- Total number of rows in the table ie. <1 million (small) vs 100+ million (large)
- Size of the table. ie. ~500 MBs (small) vs 2+ GBs (large)
- How wide is the table. ie. ~20 columns (small) vs ~200 columns (large)
- How often is the table appended to or read. ie. monthly (small) vs per second (large)
- How is the table used. ie. few columns read (small) vs all/many columns read (large)
Dimension tables are typically candidates for Replicate distribution, not Fact tables.
Typically items 1 & 2 are sufficient to decide if it should be a table created as replicated distribution. Additionally, those numbers are relative to the performance allocation of the Azure SQL Data Warehouse SLO (Service Level Objective).
The syntax is quite simple. Here’s a reference to MS docs with samples.
create table DemoReplicated ( ID int not null, Name varchar(128), ZipCode varchar(10) ) with ( distribution = replicate, clustered columnstore index );
Hash is where we hit the tip of the spear and need massive storage consideration for a giant table ie. SuperTables. The hash distribution is the most optimal in such cases for extremely big tables, ie. Billions of rows, many GBs in size, many columns.
Any questions, feel free to msg me or post in the comments below. Hope it helps!
All the best,