Introduction
Slowly Changing Dimensions (SCD) are a crucial concept in data warehousing and business intelligence. They refer to the methods used to manage and track changes in dimension data over time. This is essential for maintaining historical accuracy and ensuring that reports and analyses reflect the correct data context. There are several types of SCDs, each with its own approach to handling changes.
Type 1: Overwrite
Type 1 SCDs simply overwrite the existing data with new data. This method does not maintain any history of changes.A Type 1 SCD is a method used in data warehousing to manage changes in data. When you update a record using Type 1 SCD, you simply overwrite the old data with the new data. This means you do not keep any history of the changes. It’s like updating a contact’s phone number in your phone; the old number is replaced and you can’t see what it was before.
SQL Example:
SQL
— Assuming source_table and dimension_table have the same structure
UPDATE dimension_table d
SET d.company_name = s.company_name
FROM source_table s
WHERE d.company_id = s.company_id
AND d.company_name <> s.company_name;
In this example:
dimension_table
is the table you want to update.source_table
is the table with the new data.- The
UPDATE
statement changes thecompany_name
indimension_table
to match thecompany_name
insource_table
where thecompany_id
matches and the names are different.
PySpark Example:
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName(“SCD Type 1”).getOrCreate()
Load source and dimension tables
source_df = spark.read.format(“csv”).option(“header”, “true”).load(“source_table.csv”)
dimension_df = spark.read.format(“csv”).option(“header”, “true”).load(“dimension_table.csv”)
Perform Type 1 SCD
updated_df = dimension_df.join(source_df, “company_id”) \
.withColumn(“company_name”, col(“source_table.company_name”)) \
.select(“dimension_table.*”)
updated_df.show()
In this PySpark example:
source_df
anddimension_df
are dataframes representing the source and dimension tables.- The
join
operation matches records based oncompany_id
. - The
withColumn
function updates thecompany_name
in the dimension table with the new name from the source table.
Sample Data:
Before Update:
surrogate_key | company_id | company_name |
---|---|---|
1 | 1 | Company A |
2 | 2 | Company B |
3 | 3 | Company C |
After Update:
surrogate_key | company_id | company_name |
---|---|---|
1 | 1 | Company A |
2 | 2 | Company B Updated |
3 | 3 | Company C |
In this example, “Company B” was updated to “Company B Updated” because the new data had a different name for the same company_id
.
Type 2: Add New Row
Type 2 SCDs add a new row with a new surrogate key whenever a change occurs. This method maintains a complete history of changes. The end_date is set to a far future date (e.g., ‘9999-12-31’) to indicate that the record is currently active.A Type 2 SCD is a method used in data warehousing to keep track of changes over time. Instead of overwriting old data, it adds a new row each time a change occurs. This way, you maintain a complete history of all changes. Think of it like keeping a diary where you write a new entry every day, so you can always look back and see what happened on any given day.
SQL Example:
SQL
— Step 1: Detect changes: This part finds all the records in the source table that have a different company_name
than the dimension table and are currently active.
WITH changes AS (
SELECT s.company_id, s.company_name
FROM source_table s
LEFT JOIN dimension_table d
ON s.company_id = d.company_id
WHERE s.company_name <> d.company_name
AND d.current_flag = ‘Y’
)
— Step 2: Mark the old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
UPDATE dimension_table
SET end_date = CURRENT_DATE,
current_flag = ‘N’
WHERE company_id IN (SELECT company_id FROM changes)
AND current_flag = ‘Y’;
— Step 3: Insert new records with the updated company name:This part inserts new records with the updated company name and sets them as active.
INSERT INTO dimension_table (company_id, company_name, start_date, end_date, current_flag)
SELECT company_id, company_name, CURRENT_DATE, ‘9999-12-31’, ‘Y’
FROM changes;
PySpark Example:
Python
from pyspark.sql.functions import lit, current_date
Detect changes:This part finds all the records in the source dataframe that have a different company_name
than the dimension dataframe and are currently active.
changes_df = source_df.join(dimension_df, “company_id”, “left”) \
.filter((col(“source_table.company_name”) != col(“dimension_table.company_name”)) & (col(“dimension_table.current_flag”) == ‘Y’)) \
.select(“source_table.company_id”, “source_table.company_name”)
Mark old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
inactive_df = dimension_df.withColumn(“end_date”, current_date()) \
.withColumn(“current_flag”, lit(‘N’)) \
.filter(col(“company_id”).isin([row.company_id for row in changes_df.collect()]))
Insert new records:This part inserts new records with the updated company name and sets them as active.
new_records_df = changes_df.withColumn(“start_date”, current_date()) \
.withColumn(“end_date”, lit(‘9999-12-31’)) \
.withColumn(“current_flag”, lit(‘Y’))
Combine old and new records:This part combines the old and new records into a final dataframe.
final_df = inactive_df.union(new_records_df)
final_df.show()
Sample Data:
Before Update:
surrogate_key | company_id | company_name | start_date | end_date | current_flag |
---|---|---|---|---|---|
1 | 1 | Company A | 2024-10-01 | 9999-12-31 | Y |
2 | 2 | Company B | 2024-10-01 | 9999-12-31 | Y |
3 | 3 | Company C | 2024-10-01 | 9999-12-31 | Y |
After Update:
surrogate_key | company_id | company_name | start_date | end_date | current_flag |
---|---|---|---|---|---|
1 | 1 | Company A | 2024-10-01 | 9999-12-31 | Y |
2 | 2 | Company B | 2024-10-01 | 2024-10-31 | N |
3 | 3 | Company C | 2024-10-01 | 9999-12-31 | Y |
4 | 2 | Company B Updated | 2024-10-31 | 9999-12-31 | Y |
Type 3: Add New Column
Type 3 SCDs add a new column to store the previous value of the changed attribute. This method maintains limited history.
SQL Example:
SQL
— Assuming dimension_table has an additional column previous_company_name
UPDATE dimension_table d
SET d.previous_company_name = d.company_name,
d.company_name = s.company_name
FROM source_table s
WHERE d.company_id = s.company_id
AND d.company_name <> s.company_name;
PySpark Example:
Python
Assuming dimension_table has an additional column previous_company_name
updated_df = dimension_df.join(source_df, “company_id”) \
.withColumn(“previous_company_name”, col(“dimension_table.company_name”)) \
.withColumn(“company_name”, col(“source_table.company_name”)) \
.select(“dimension_table.*”)
updated_df.show()
Sample Data:
Before Update:
SQL
surrogate_key | company_id | company_name | previous_company_name |
---|---|---|---|
1 | 1 | Company A | NULL |
2 | 2 | Company B | NULL |
3 | 3 | Company C | NULL |
After Update:
SQL
surrogate_key | company_id | company_name | previous_company_name |
---|---|---|---|
1 | 1 | Company A | NULL |
2 | 2 | Company B Updated | Company B |
3 | 3 | Company C | NULL |
Hybrid Type: Overwrite and Add New Row
This approach combines Type 1 and Type 2, where some attributes are overwritten, and others are tracked historically. The end_date is set to a far future date (e.g., ‘9999-12-31’) to indicate that the record is currently active.A Hybrid Type 1/2 SCD is a method used in data warehousing that combines elements of both Type 1 and Type 2 SCDs. This means that some data is simply overwritten (like in Type 1), while other data is tracked historically by adding new rows (like in Type 2). This approach allows you to keep a history of certain changes while updating other information directly.
SQL Example:
SQL
— Step 1: Detect changes:This part finds all the records in the source table that have a different company_name
than the dimension table and are currently active.
WITH changes AS (
SELECT s.company_id, s.company_name
FROM source_table s
LEFT JOIN dimension_table d
ON s.company_id = d.company_id
WHERE s.company_name <> d.company_name
AND d.current_flag = ‘Y’
)
— Step 2: Mark the old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
UPDATE dimension_table
SET end_date = CURRENT_DATE,
current_flag = ‘N’
WHERE company_id IN (SELECT company_id FROM changes)
AND current_flag = ‘Y’;
— Step 3: Insert new records with some attributes overwritten:This part inserts new records with the updated company name and sets them as active.
INSERT INTO dimension_table (company_id, company_name, start_date, end_date, current_flag)
SELECT company_id, ‘new_value’, CURRENT_DATE, ‘9999-12-31’, ‘Y’
FROM changes;
PySpark Example:
Python
Detect changes:This part finds all the records in the source dataframe that have a different company_name
than the dimension dataframe and are currently active.
changes_df = source_df.join(dimension_df, “company_id”, “left”) \
.filter((col(“source_table.company_name”) != col(“dimension_table.company_name”)) & (col(“dimension_table.current_flag”) == ‘Y’)) \
.select(“source_table.company_id”, “source_table.company_name”)
Mark old records as inactive:This part updates the old records to mark them as inactive by setting an end date and changing the flag.
inactive_df = dimension_df.withColumn(“end_date”, current_date()) \
.withColumn(“current_flag”, lit(‘N’)) \
.filter(col(“company_id”).isin([row.company_id for row in changes_df.collect()]))
Insert new records:This part inserts new records with the updated company name and sets them as active.
new_records_df = changes_df.withColumn(“start_date”, current_date()) \
.withColumn(“end_date”, lit(‘9999-12-31’)) \
.withColumn(“current_flag”, lit(‘Y’))
Combine old and new records:This part combines the old and new records into a final dataframe.
final_df = inactive_df.union(new_records_df)
final_df.show()
Sample Data:
Before Update:
surrogate_key | company_id | company_name | start_date | end_date | current_flag |
---|---|---|---|---|---|
1 | 1 | Company A | 2024-10-01 | 9999-12-31 | Y |
2 | 2 | Company B | 2024-10-01 | 9999-12-31 | Y |
3 | 3 | Company C | 2024-10-01 | 9999-12-31 | Y |
After Update:
surrogate_key | company_id | company_name | start_date | end_date | current_flag |
---|---|---|---|---|---|
1 | 1 | Company A | 2024-10-01 | 9999-12-31 | Y |
2 | 2 | Company B | 2024-10-01 | 2024-10-31 | N |
3 | 3 | Company C | 2024-10-01 | 9999-12-31 | Y |
4 | 2 | Company B Updated | 2024-10-31 | 9999-12-31 | Y |
Conclusion
Slowly Changing Dimensions are essential for maintaining historical accuracy in data warehouses. Each type of SCD has its own use case and method for handling changes. By understanding and implementing these methods, you can ensure that your data warehouse accurately reflects the historical context of your data.