Understanding Slowly Changing Dimensions (SCD)

Introduction

Slowly Changing Dimensions (SCD) are a crucial concept in data warehousing and business intelligence. They refer to the methods used to manage and track changes in dimension data over time. This is essential for maintaining historical accuracy and ensuring that reports and analyses reflect the correct data context. There are several types of SCDs, each with its own approach to handling changes.

Type 1: Overwrite


Type 1 SCDs simply overwrite the existing data with new data. This method does not maintain any history of changes.A Type 1 SCD is a method used in data warehousing to manage changes in data. When you update a record using Type 1 SCD, you simply overwrite the old data with the new data. This means you do not keep any history of the changes. It’s like updating a contact’s phone number in your phone; the old number is replaced and you can’t see what it was before.

SQL Example:

SQL

— Assuming source_table and dimension_table have the same structure
UPDATE dimension_table d
SET d.company_name = s.company_name
FROM source_table s
WHERE d.company_id = s.company_id
AND d.company_name <> s.company_name;

In this example:

  • dimension_table is the table you want to update.
  • source_table is the table with the new data.
  • The UPDATE statement changes the company_name in dimension_table to match the company_name in source_table where the company_id matches and the names are different.

PySpark Example:

Python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName(“SCD Type 1”).getOrCreate()

Load source and dimension tables

source_df = spark.read.format(“csv”).option(“header”, “true”).load(“source_table.csv”)
dimension_df = spark.read.format(“csv”).option(“header”, “true”).load(“dimension_table.csv”)

Perform Type 1 SCD

updated_df = dimension_df.join(source_df, “company_id”) \
.withColumn(“company_name”, col(“source_table.company_name”)) \
.select(“dimension_table.*”)

updated_df.show()

In this PySpark example:

  • source_df and dimension_df are dataframes representing the source and dimension tables.
  • The join operation matches records based on company_id.
  • The withColumn function updates the company_name in the dimension table with the new name from the source table.

Sample Data:

Before Update:

surrogate_keycompany_idcompany_name
11Company A
22Company B
33Company C

After Update:

surrogate_keycompany_idcompany_name
11Company A
22Company B Updated
33Company C

In this example, “Company B” was updated to “Company B Updated” because the new data had a different name for the same company_id.

Type 2: Add New Row


Type 2 SCDs add a new row with a new surrogate key whenever a change occurs. This method maintains a complete history of changes. The end_date is set to a far future date (e.g., ‘9999-12-31’) to indicate that the record is currently active.A Type 2 SCD is a method used in data warehousing to keep track of changes over time. Instead of overwriting old data, it adds a new row each time a change occurs. This way, you maintain a complete history of all changes. Think of it like keeping a diary where you write a new entry every day, so you can always look back and see what happened on any given day.

SQL Example:

SQL

— Step 1: Detect changes: This part finds all the records in the source table that have a different company_name than the dimension table and are currently active.
WITH changes AS (
SELECT s.company_id, s.company_name
FROM source_table s
LEFT JOIN dimension_table d
ON s.company_id = d.company_id
WHERE s.company_name <> d.company_name
AND d.current_flag = ‘Y’
)

— Step 2: Mark the old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
UPDATE dimension_table
SET end_date = CURRENT_DATE,
current_flag = ‘N’
WHERE company_id IN (SELECT company_id FROM changes)
AND current_flag = ‘Y’;

— Step 3: Insert new records with the updated company name:This part inserts new records with the updated company name and sets them as active.
INSERT INTO dimension_table (company_id, company_name, start_date, end_date, current_flag)
SELECT company_id, company_name, CURRENT_DATE, ‘9999-12-31’, ‘Y’
FROM changes;

PySpark Example:

Python

from pyspark.sql.functions import lit, current_date

Detect changes:This part finds all the records in the source dataframe that have a different company_name than the dimension dataframe and are currently active.

changes_df = source_df.join(dimension_df, “company_id”, “left”) \
.filter((col(“source_table.company_name”) != col(“dimension_table.company_name”)) & (col(“dimension_table.current_flag”) == ‘Y’)) \
.select(“source_table.company_id”, “source_table.company_name”)

Mark old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.

inactive_df = dimension_df.withColumn(“end_date”, current_date()) \
.withColumn(“current_flag”, lit(‘N’)) \
.filter(col(“company_id”).isin([row.company_id for row in changes_df.collect()]))

Insert new records:This part inserts new records with the updated company name and sets them as active.

new_records_df = changes_df.withColumn(“start_date”, current_date()) \
.withColumn(“end_date”, lit(‘9999-12-31’)) \
.withColumn(“current_flag”, lit(‘Y’))

Combine old and new records:This part combines the old and new records into a final dataframe.

final_df = inactive_df.union(new_records_df)

final_df.show()

Sample Data:

Before Update:

surrogate_keycompany_idcompany_namestart_dateend_datecurrent_flag
11Company A2024-10-019999-12-31Y
22Company B2024-10-019999-12-31Y
33Company C2024-10-019999-12-31Y

After Update:

surrogate_keycompany_idcompany_namestart_dateend_datecurrent_flag
11Company A2024-10-019999-12-31Y
22Company B2024-10-012024-10-31N
33Company C2024-10-019999-12-31Y
42Company B Updated2024-10-319999-12-31Y

Type 3: Add New Column


Type 3 SCDs add a new column to store the previous value of the changed attribute. This method maintains limited history.

SQL Example:

SQL

— Assuming dimension_table has an additional column previous_company_name
UPDATE dimension_table d
SET d.previous_company_name = d.company_name,
d.company_name = s.company_name
FROM source_table s
WHERE d.company_id = s.company_id
AND d.company_name <> s.company_name;

PySpark Example:

Python

Assuming dimension_table has an additional column previous_company_name

updated_df = dimension_df.join(source_df, “company_id”) \
.withColumn(“previous_company_name”, col(“dimension_table.company_name”)) \
.withColumn(“company_name”, col(“source_table.company_name”)) \
.select(“dimension_table.*”)

updated_df.show()

Sample Data:

Before Update:
SQL

surrogate_keycompany_idcompany_nameprevious_company_name
11Company ANULL
22Company BNULL
33Company CNULL

After Update:
SQL

surrogate_keycompany_idcompany_nameprevious_company_name
11Company ANULL
22Company B UpdatedCompany B
33Company CNULL

Hybrid Type: Overwrite and Add New Row


This approach combines Type 1 and Type 2, where some attributes are overwritten, and others are tracked historically. The end_date is set to a far future date (e.g., ‘9999-12-31’) to indicate that the record is currently active.A Hybrid Type 1/2 SCD is a method used in data warehousing that combines elements of both Type 1 and Type 2 SCDs. This means that some data is simply overwritten (like in Type 1), while other data is tracked historically by adding new rows (like in Type 2). This approach allows you to keep a history of certain changes while updating other information directly.

SQL Example:

SQL

Step 1: Detect changes:This part finds all the records in the source table that have a different company_name than the dimension table and are currently active.
WITH changes AS (
SELECT s.company_id, s.company_name
FROM source_table s
LEFT JOIN dimension_table d
ON s.company_id = d.company_id
WHERE s.company_name <> d.company_name
AND d.current_flag = ‘Y’
)

Step 2: Mark the old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
UPDATE dimension_table
SET end_date = CURRENT_DATE,
current_flag = ‘N’
WHERE company_id IN (SELECT company_id FROM changes)
AND current_flag = ‘Y’;

— Step 3: Insert new records with some attributes overwritten:This part inserts new records with the updated company name and sets them as active.
INSERT INTO dimension_table (company_id, company_name, start_date, end_date, current_flag)
SELECT company_id, ‘new_value’, CURRENT_DATE, ‘9999-12-31’, ‘Y’
FROM changes;

PySpark Example:

Python

Detect changes:This part finds all the records in the source dataframe that have a different company_name than the dimension dataframe and are currently active.

changes_df = source_df.join(dimension_df, “company_id”, “left”) \
.filter((col(“source_table.company_name”) != col(“dimension_table.company_name”)) & (col(“dimension_table.current_flag”) == ‘Y’)) \
.select(“source_table.company_id”, “source_table.company_name”)

Mark old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.

inactive_df = dimension_df.withColumn(“end_date”, current_date()) \
.withColumn(“current_flag”, lit(‘N’)) \
.filter(col(“company_id”).isin([row.company_id for row in changes_df.collect()]))

Insert new records:This part inserts new records with the updated company name and sets them as active.

new_records_df = changes_df.withColumn(“start_date”, current_date()) \
.withColumn(“end_date”, lit(‘9999-12-31’)) \
.withColumn(“current_flag”, lit(‘Y’))

Combine old and new records:This part combines the old and new records into a final dataframe.

final_df = inactive_df.union(new_records_df)

final_df.show()

Sample Data:

Before Update:

surrogate_keycompany_idcompany_namestart_dateend_datecurrent_flag
11Company A2024-10-019999-12-31Y
22Company B2024-10-019999-12-31Y
33Company C2024-10-019999-12-31Y

After Update:

surrogate_keycompany_idcompany_namestart_dateend_datecurrent_flag
11Company A2024-10-019999-12-31Y
22Company B2024-10-012024-10-31N
33Company C2024-10-019999-12-31Y
42Company B Updated2024-10-319999-12-31Y

Conclusion


Slowly Changing Dimensions are essential for maintaining historical accuracy in data warehouses. Each type of SCD has its own use case and method for handling changes. By understanding and implementing these methods, you can ensure that your data warehouse accurately reflects the historical context of your data.

Leave a Comment