What is SCD
SCD (Slowly Changing Dimensions) are methods used in data warehousing to handle dimensional data that changes slowly over time, like a customer`s address or employee status.
How It Works
SCDs let you preserve historical details (e.g., old addresses) and track data changes over time. Different types of SCDs define how changes are stored. For example, Type 1 overwrites old data, while Type 2 adds new rows to keep historical records. Type 3 can store changes in additional columns, but is less common. These techniques power historical analysis, like seeing how many orders came from a customer`s old vs. new address.
Technical Details
Using SCD usually involves a dimension table that includes fields for start and end dates, a current-row flag, or a special surrogate key that denotes "versions" of an entity like a customer. ETL (Extract, Transform, Load) processes detect changes in source data and update the dimension table accordingly. Many data warehouse frameworks offer built-in support for SCD logic.
How to Write It
Basic Syntax
-- SCD Type 2 Implementation Example
-- 1. Create dimension table with history tracking
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY,
customer_id INT, -- Natural key
name VARCHAR(100),
address VARCHAR(200),
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
version INT
);
-- 2. Insert a new version of a customer
WITH current_version AS (
UPDATE dim_customer
SET
end_date = CURRENT_DATE - 1,
is_current = false
WHERE customer_id = 123
AND is_current = true
RETURNING version
)
INSERT INTO dim_customer (
customer_id,
name,
address,
effective_date,
end_date,
is_current,
version
)
SELECT
123,
'John Doe',
'New Address',
CURRENT_DATE,
'9999-12-31',
true,
COALESCE((SELECT version + 1 FROM current_version), 1);
-- 3. Query current version of the customer
SELECT *
FROM dim_customer
WHERE is_current = true
AND customer_id = 123;
-- 4. Query for a point in time (past snapshot)
SELECT *
FROM dim_customer
WHERE '2023-06-15' BETWEEN effective_date AND end_date
AND customer_id = 123;
-- 5. Query the entire history for a customer
SELECT
customer_id,
name,
address,
effective_date,
end_date,
version
FROM dim_customer
WHERE customer_id = 123
ORDER BY version;
Learn More
Best Practices
- Use surrogate keys (e.g., a separate customer_key) instead of natural keys to preserve historical records.
- Keep clear naming conventions for date ranges and status fields (e.g., effective_date, end_date, is_current).
- Regularly verify your ETL logic to ensure proper detection of changes in source data.
- Plan your data retention strategy: decide how long to keep old versions.
Common Pitfalls
- Forgetting to update is_current and end_date correctly, causing inconsistent data.
- Mixing SCD methods (e.g., Type 1 overwriting old rows, Type 2 adding new rows) without careful thought.
- Performance bottlenecks if updates are large and not well indexed.
- Incorrect or missing surrogate keys leading to confusion in historical tracking.
Advanced Tips
- Use triggers or stored procedures to automate versioning logic, especially for frequently changing dimensions.
- Consider partitioning by end_date or version to speed up queries on historical data.
- Leverage Type 1 for non-critical attributes that do not require a historical audit trail, combined with Type 2 for sensitive fields.
- Use specialized data warehouse tools (like Apache Hive, AWS Glue, or custom ETL jobs) that support SCD out of the box.