Junk dimensions are collections of low cardinality values. The table represents all possible combinations of those values or, at least, all possible combinations of those values that have been imported so far.
The junk dimension template is almost identical to the dimension table template with one key difference. There is an additional column called RowHash which is a computed column that consist of a hash of all the the non-key, non-audit columns.
USE YourEDWDROPTABLEIFEXISTS [dw].[DimYourDimensionName]GOSETANSI_NULLSONGOSETQUOTED_IDENTIFIERONGOCREATETABLE [dw].[DimYourDimensionName]([YourDimensionNameCK] [bigint] IDENTITY(1,1) NOT NULL,--your columns.[CreatedBy] [nvarchar](50) NULL,[CreatedOn] [datetime] NULL,[UpdatedBy] [nvarchar](50) NULL,[UpdatedOn] [datetime] NULL,[SourceSystem] [nvarchar](100) NULL,[SourceSystemKey] [nvarchar](100) NULL,[EffectiveFrom] [datetime] NULL,[EffectiveTo] [datetime] NULL,[IsMostRecentRecord] [bit] NULL,[RowHash] AS (CONVERT([binary](16),hashbytes('MD5',concat(CONVERT([nvarchar](35),Column1,0),CONVERT([nvarchar](35),Column2,0),CONVERT([nvarchar](35),Column3,0),CONVERT([nvarchar](35),Column4,0),CONVERT([nvarchar](35),Column5,0),CONVERT([nvarchar](35),Column6,0))),0)) PERSISTED,CONSTRAINT [PK_YourDimensionName] PRIMARY KEYCLUSTERED([YourDimensionNameCK] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]END--index is optionalDROPINDEXIFEXISTS [NCIDX_FactYourFactTableName_RowHash] ON [dw].[FactYourFactTableName]GOCREATENONCLUSTEREDINDEX [NCIDX_FactYourFactTableName_RowHash] ON [dw].[FactYourFactTableName]( [RowHash] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO
Who created the record.
Any kind of record creation process should include the code:
CreatedBy = SYSTEM_USER
CreatedOn
datetime
When the record was created.
Any kind of record creation process should include the code:
CreatedOn= CURRENT_TIMESTAMP
UpdatedBy
nvarchar(50)
Who updated the record.
If there are any update statements ran against the table, this column should be populated as part of that process. This is especially important when engineers have to make manual adjustments to data. Your SQL UPDATE statement should include:
UpdatedBy = SYSTEM_USER
UpdatedOn
datetime
When the record was updated.
If there are any update statements ran against the table, this column should be populated as part of that process. This is especially important when engineers have to make manual adjustments to data. Your SQL UPDATE statement should include:
UpdatedOn = CURRENT_TIMESTAMP
SourceSystem
nvarchar(100)
The source system where the data came from. This value is populated from a column of the same name in the staging table.
SourceSystemKey
nvarchar(100)
The primary key of the record from the source system.
Strictly speaking, this does not have to be the primary key of a source table. It does not even have to be an atomic value. It can be concatenated value. The important thing is that the value uniquely identify the row, and that row can be tracked back to a record in the source system.
EffectiveFrom
datetime
If the processing of the dimension requires that history be kept, this column indicates when the row began to represent the most recent version of the data.
This column usually shares a value with CreatedOn and does not suffer from the challenges found when trying to update the value of EffectiveTo.
EffectiveTo
datetime
If the processing of the dimension requires that history be kept, this column indicates when the row ceased to represent the most recent version of the data.
If IsMostRecentRecord = 1, then the value of EffectiveTo is the highest date in the date dimension. If IsMostRecentRecord = 0, then the value of this column is set to the day it stopped being true. This has some challenges in the case where you have an intraday change. It happens.
You will need to use your creativity to solve your particular use case. My solution was to begin retiring records by decrementing a minute amount of time. That is the reason the data type is datatime and not just date. I will illustrate this in the section on loading a Type II SCD.
IsMostRecentRecord
bit
If the processing of the dimension requires that history be kept, this column indicates whether or not that row represents the most up to date version of that data.
RowHash
binary(16)
A computed value of those columns used to compute the uniqueness of a row. This value is used to determine INSERT or UPDATE operations on the table.