Understanding Slowly Changing Dimensions (SCD)

Table of Contents

Introduction

Slowly Changing Dimensions (SCD) are a crucial concept in data warehousing and business intelligence. They refer to the methods used to manage and track changes in dimension data over time. This is essential for maintaining historical accuracy and ensuring that reports and analyses reflect the correct data context. There are several types of SCDs, each with its own approach to handling changes.

Type 1: Overwrite

Type 1 SCDs simply overwrite the existing data with new data. This method does not maintain any history of changes.A Type 1 SCD is a method used in data warehousing to manage changes in data. When you update a record using Type 1 SCD, you simply overwrite the old data with the new data. This means you do not keep any history of the changes. It’s like updating a contact’s phone number in your phone; the old number is replaced and you can’t see what it was before.

SQL Example:

SQL

— Assuming source_table and dimension_table have the same structure
UPDATE dimension_table d
SET d.company_name = s.company_name
FROM source_table s
WHERE d.company_id = s.company_id
AND d.company_name <> s.company_name;

In this example:

dimension_table is the table you want to update.
source_table is the table with the new data.
The UPDATE statement changes the company_name in dimension_table to match the company_name in source_table where the company_id matches and the names are different.

PySpark Example:

Python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName(“SCD Type 1”).getOrCreate()

Load source and dimension tables

source_df = spark.read.format(“csv”).option(“header”, “true”).load(“source_table.csv”)
dimension_df = spark.read.format(“csv”).option(“header”, “true”).load(“dimension_table.csv”)

Perform Type 1 SCD

updated_df = dimension_df.join(source_df, “company_id”) \
.withColumn(“company_name”, col(“source_table.company_name”)) \
.select(“dimension_table.*”)

updated_df.show()

In this PySpark example:

source_df and dimension_df are dataframes representing the source and dimension tables.
The join operation matches records based on company_id.
The withColumn function updates the company_name in the dimension table with the new name from the source table.

Sample Data:

Before Update:

surrogate_key	company_id	company_name
1	1	Company A
2	2	Company B
3	3	Company C

After Update:

surrogate_key	company_id	company_name
1	1	Company A
2	2	Company B Updated
3	3	Company C

In this example, “Company B” was updated to “Company B Updated” because the new data had a different name for the same company_id.

Type 2: Add New Row

Type 2 SCDs add a new row with a new surrogate key whenever a change occurs. This method maintains a complete history of changes. The end_date is set to a far future date (e.g., ‘9999-12-31’) to indicate that the record is currently active.A Type 2 SCD is a method used in data warehousing to keep track of changes over time. Instead of overwriting old data, it adds a new row each time a change occurs. This way, you maintain a complete history of all changes. Think of it like keeping a diary where you write a new entry every day, so you can always look back and see what happened on any given day.

SQL Example:

SQL

— Step 1: Detect changes: This part finds all the records in the source table that have a different company_name than the dimension table and are currently active.
WITH changes AS (
SELECT s.company_id, s.company_name
FROM source_table s
LEFT JOIN dimension_table d
ON s.company_id = d.company_id
WHERE s.company_name <> d.company_name
AND d.current_flag = ‘Y’
)

— Step 2: Mark the old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
UPDATE dimension_table
SET end_date = CURRENT_DATE,
current_flag = ‘N’
WHERE company_id IN (SELECT company_id FROM changes)
AND current_flag = ‘Y’;

— Step 3: Insert new records with the updated company name:This part inserts new records with the updated company name and sets them as active.
INSERT INTO dimension_table (company_id, company_name, start_date, end_date, current_flag)
SELECT company_id, company_name, CURRENT_DATE, ‘9999-12-31’, ‘Y’
FROM changes;

PySpark Example:

Python

from pyspark.sql.functions import lit, current_date

Detect changes:This part finds all the records in the source dataframe that have a different company_name than the dimension dataframe and are currently active.

changes_df = source_df.join(dimension_df, “company_id”, “left”) \
.filter((col(“source_table.company_name”) != col(“dimension_table.company_name”)) & (col(“dimension_table.current_flag”) == ‘Y’)) \
.select(“source_table.company_id”, “source_table.company_name”)

Mark old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.

inactive_df = dimension_df.withColumn(“end_date”, current_date()) \
.withColumn(“current_flag”, lit(‘N’)) \
.filter(col(“company_id”).isin([row.company_id for row in changes_df.collect()]))

Insert new records:This part inserts new records with the updated company name and sets them as active.

new_records_df = changes_df.withColumn(“start_date”, current_date()) \
.withColumn(“end_date”, lit(‘9999-12-31’)) \
.withColumn(“current_flag”, lit(‘Y’))

Combine old and new records:This part combines the old and new records into a final dataframe.

final_df = inactive_df.union(new_records_df)

final_df.show()

Sample Data:

Before Update:

surrogate_key	company_id	company_name	start_date	end_date	current_flag
1	1	Company A	2024-10-01	9999-12-31	Y
2	2	Company B	2024-10-01	9999-12-31	Y
3	3	Company C	2024-10-01	9999-12-31	Y

After Update:

surrogate_key	company_id	company_name	start_date	end_date	current_flag
1	1	Company A	2024-10-01	9999-12-31	Y
2	2	Company B	2024-10-01	2024-10-31	N
3	3	Company C	2024-10-01	9999-12-31	Y
4	2	Company B Updated	2024-10-31	9999-12-31	Y

Type 3: Add New Column

Type 3 SCDs add a new column to store the previous value of the changed attribute. This method maintains limited history.

SQL Example:

SQL

— Assuming dimension_table has an additional column previous_company_name
UPDATE dimension_table d
SET d.previous_company_name = d.company_name,
d.company_name = s.company_name
FROM source_table s
WHERE d.company_id = s.company_id
AND d.company_name <> s.company_name;

PySpark Example:

Python

Assuming dimension_table has an additional column previous_company_name

updated_df = dimension_df.join(source_df, “company_id”) \
.withColumn(“previous_company_name”, col(“dimension_table.company_name”)) \
.withColumn(“company_name”, col(“source_table.company_name”)) \
.select(“dimension_table.*”)

updated_df.show()

Sample Data:

Before Update:
SQL

surrogate_key	company_id	company_name	previous_company_name
1	1	Company A	NULL
2	2	Company B	NULL
3	3	Company C	NULL

After Update:
SQL

surrogate_key	company_id	company_name	previous_company_name
1	1	Company A	NULL
2	2	Company B Updated	Company B
3	3	Company C	NULL

Hybrid Type: Overwrite and Add New Row

This approach combines Type 1 and Type 2, where some attributes are overwritten, and others are tracked historically. The end_date is set to a far future date (e.g., ‘9999-12-31’) to indicate that the record is currently active.A Hybrid Type 1/2 SCD is a method used in data warehousing that combines elements of both Type 1 and Type 2 SCDs. This means that some data is simply overwritten (like in Type 1), while other data is tracked historically by adding new rows (like in Type 2). This approach allows you to keep a history of certain changes while updating other information directly.

SQL Example:

SQL

— Step 1: Detect changes:This part finds all the records in the source table that have a different company_name than the dimension table and are currently active.
WITH changes AS (
SELECT s.company_id, s.company_name
FROM source_table s
LEFT JOIN dimension_table d
ON s.company_id = d.company_id
WHERE s.company_name <> d.company_name
AND d.current_flag = ‘Y’
)

— Step 2: Mark the old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
UPDATE dimension_table
SET end_date = CURRENT_DATE,
current_flag = ‘N’
WHERE company_id IN (SELECT company_id FROM changes)
AND current_flag = ‘Y’;

— Step 3: Insert new records with some attributes overwritten:This part inserts new records with the updated company name and sets them as active.
INSERT INTO dimension_table (company_id, company_name, start_date, end_date, current_flag)
SELECT company_id, ‘new_value’, CURRENT_DATE, ‘9999-12-31’, ‘Y’
FROM changes;

PySpark Example:

Python

Detect changes:This part finds all the records in the source dataframe that have a different company_name than the dimension dataframe and are currently active.

Mark old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.

Insert new records:This part inserts new records with the updated company name and sets them as active.

new_records_df = changes_df.withColumn(“start_date”, current_date()) \
.withColumn(“end_date”, lit(‘9999-12-31’)) \
.withColumn(“current_flag”, lit(‘Y’))

Combine old and new records:This part combines the old and new records into a final dataframe.

final_df = inactive_df.union(new_records_df)

final_df.show()

Sample Data:

Before Update:

surrogate_key	company_id	company_name	start_date	end_date	current_flag
1	1	Company A	2024-10-01	9999-12-31	Y
2	2	Company B	2024-10-01	9999-12-31	Y
3	3	Company C	2024-10-01	9999-12-31	Y

After Update:

surrogate_key	company_id	company_name	start_date	end_date	current_flag
1	1	Company A	2024-10-01	9999-12-31	Y
2	2	Company B	2024-10-01	2024-10-31	N
3	3	Company C	2024-10-01	9999-12-31	Y
4	2	Company B Updated	2024-10-31	9999-12-31	Y

Conclusion

Slowly Changing Dimensions are essential for maintaining historical accuracy in data warehouses. Each type of SCD has its own use case and method for handling changes. By understanding and implementing these methods, you can ensure that your data warehouse accurately reflects the historical context of your data.

Understanding Slowly Changing Dimensions (SCD)

Introduction

Type 1: Overwrite

Type 2: Add New Row

Type 3: Add New Column

Hybrid Type: Overwrite and Add New Row

Conclusion

How to crack PL-300 Microsoft Power BI Data Analyst certification Exam in Easy Steps 2024(Free Resources)!!

Best 9 Business Intelligence Tools for 2024

Excel Mastery for Data Analysis: A Step-by-Step Guide to Unlocking Your Analytical Potential using excel 2024!

Single-Click Data Download Functionality Unleashed! Discover 5 Steps to Seamless Single-Click Data Download functionality for Power BI reports 2024

Data Analyst Interview Questions for Entry Level Jobs:

A Layman’s Guide to the Enigmatic World of Data Analyst Jobs and Responsibilities

Leave a Reply Cancel reply

Introduction

Type 1: Overwrite

Type 2: Add New Row

Type 3: Add New Column

Hybrid Type: Overwrite and Add New Row

Conclusion

Similar Posts

Leave a Reply Cancel reply