Back to Data Glossary

What is SCD

SCD (Slowly Changing Dimensions) are methods used in data warehousing to handle dimensional data that changes slowly over time, like a customer`s address or employee status.

How It Works

SCDs let you preserve historical details (e.g., old addresses) and track data changes over time. Different types of SCDs define how changes are stored. For example, Type 1 overwrites old data, while Type 2 adds new rows to keep historical records. Type 3 can store changes in additional columns, but is less common. These techniques power historical analysis, like seeing how many orders came from a customer`s old vs. new address.

Technical Details

Using SCD usually involves a dimension table that includes fields for start and end dates, a current-row flag, or a special surrogate key that denotes "versions" of an entity like a customer. ETL (Extract, Transform, Load) processes detect changes in source data and update the dimension table accordingly. Many data warehouse frameworks offer built-in support for SCD logic.

How to Write It

Basic Syntax

-- SCD Type 2 Implementation Example

-- 1. Create dimension table with history tracking
CREATE TABLE dim_customer (
  customer_key SERIAL PRIMARY KEY,
  customer_id INT,    -- Natural key
  name VARCHAR(100),
  address VARCHAR(200),
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN,
  version INT
);

-- 2. Insert a new version of a customer
WITH current_version AS (
  UPDATE dim_customer
  SET 
    end_date = CURRENT_DATE - 1,
    is_current = false
  WHERE customer_id = 123
    AND is_current = true
  RETURNING version
)
INSERT INTO dim_customer (
  customer_id, 
  name, 
  address,
  effective_date, 
  end_date,
  is_current, 
  version
)
SELECT 
  123,
  'John Doe',
  'New Address',
  CURRENT_DATE,
  '9999-12-31',
  true,
  COALESCE((SELECT version + 1 FROM current_version), 1);

-- 3. Query current version of the customer
SELECT *
FROM dim_customer
WHERE is_current = true
  AND customer_id = 123;

-- 4. Query for a point in time (past snapshot)
SELECT *
FROM dim_customer
WHERE '2023-06-15' BETWEEN effective_date AND end_date
  AND customer_id = 123;

-- 5. Query the entire history for a customer
SELECT 
  customer_id,
  name,
  address,
  effective_date,
  end_date,
  version
FROM dim_customer
WHERE customer_id = 123
ORDER BY version;

Learn More

Best Practices

  • Use surrogate keys (e.g., a separate customer_key) instead of natural keys to preserve historical records.
  • Keep clear naming conventions for date ranges and status fields (e.g., effective_date, end_date, is_current).
  • Regularly verify your ETL logic to ensure proper detection of changes in source data.
  • Plan your data retention strategy: decide how long to keep old versions.

Common Pitfalls

  • Forgetting to update is_current and end_date correctly, causing inconsistent data.
  • Mixing SCD methods (e.g., Type 1 overwriting old rows, Type 2 adding new rows) without careful thought.
  • Performance bottlenecks if updates are large and not well indexed.
  • Incorrect or missing surrogate keys leading to confusion in historical tracking.

Advanced Tips

  • Use triggers or stored procedures to automate versioning logic, especially for frequently changing dimensions.
  • Consider partitioning by end_date or version to speed up queries on historical data.
  • Leverage Type 1 for non-critical attributes that do not require a historical audit trail, combined with Type 2 for sensitive fields.
  • Use specialized data warehouse tools (like Apache Hive, AWS Glue, or custom ETL jobs) that support SCD out of the box.
Space, at your fingertips
astronaut

What will you discover?