What is ETL (Extract, Transform, Load) and how does it work?

Quick Answer: ETL stands for Extract, Transform, Load — a three-stage data integration process. Extract pulls data from source systems (databases, APIs, files). Transform cleans, validates, and restructures the data (formatting, deduplication, aggregation). Load writes the processed data into a target system (data warehouse, analytics database). ETL has been a foundational data engineering pattern since the 1970s and remains central to modern data pipelines, though many teams now use ELT (Extract, Load, Transform) where transformation happens after loading.

Definition

ETL stands for Extract, Transform, Load — a three-stage data integration process that moves data from source systems into a target data store. Extract pulls raw data from source systems (databases, APIs, flat files, SaaS applications). Transform cleans, validates, restructures, and enriches the data (formatting dates, deduplicating records, aggregating values, applying business rules). Load writes the processed data into a destination system (data warehouse, analytics database, data lake).

ETL has been a foundational data engineering pattern since the 1970s, originally developed for populating data warehouses from operational databases. The pattern remains central to modern data infrastructure, though the implementation has evolved from monolithic ETL servers to distributed cloud-native pipelines.
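The three stages compose naturally as functions. A minimal end-to-end sketch (the CSV source and in-memory "warehouse" list stand in for real systems):

```python
import csv
import io

def extract(csv_text):
    """Extract: read raw rows from a CSV source (stands in for any source system)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: cast types and drop rows that fail validation."""
    return [
        {"sku": r["sku"], "qty": int(r["qty"])}
        for r in rows
        if r["qty"].isdigit()  # keep only rows with a numeric quantity
    ]

def load(rows, target):
    """Load: append processed rows into the target store (a plain list here)."""
    target.extend(rows)
    return len(rows)

warehouse = []
raw = "sku,qty\nA-1,3\nA-2,oops\nA-3,5\n"
loaded = load(transform(extract(raw)), warehouse)
```

In a real pipeline each stage would talk to external systems, but the shape is the same: extract produces raw records, transform returns a cleaned set, load writes them and reports what it wrote.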

The Three Stages in Detail

Extract

The extraction stage connects to source systems and retrieves raw data. Sources can include:

  • Relational databases — PostgreSQL, MySQL, SQL Server, Oracle via SQL queries or change data capture (CDC)
  • APIs — REST and GraphQL endpoints from SaaS applications (Salesforce, Shopify, HubSpot)
  • Flat files — CSV, JSON, XML, Parquet files from SFTP servers, cloud storage (S3, GCS), or local filesystems
  • Event streams — Kafka, RabbitMQ, AWS Kinesis for real-time data
  • NoSQL databases — MongoDB, DynamoDB, Cassandra

Extraction can be full (pulling the entire dataset each time) or incremental (pulling only records that changed since the last extraction, identified by timestamps or change data capture). Incremental extraction is preferred for large datasets because it reduces load on source systems and transfer time.
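Timestamp-based incremental extraction can be sketched with a high-water mark: keep the timestamp of the last run and pull only rows changed after it (the `updated_at` column is an assumed change-tracking field):

```python
from datetime import datetime, timezone

def extract_incremental(rows, last_run):
    """Return only rows updated since the previous extraction.

    `rows` stands in for a source-system query result; in SQL this would be
    WHERE updated_at > :last_run.
    """
    return [r for r in rows if r["updated_at"] > last_run]

rows = [
    {"id": 1, "updated_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]
last_run = datetime(2025, 1, 2, tzinfo=timezone.utc)
changed = extract_incremental(rows, last_run)  # only row 2 changed after the last run
```

After a successful run, the pipeline persists the new high-water mark so the next run starts where this one left off.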

Transform

The transformation stage applies business logic to the extracted data:

  • Cleaning — Removing duplicates, fixing formatting inconsistencies, handling null values
  • Validation — Checking data against rules (valid email format, price within expected range, required fields present)
  • Restructuring — Flattening nested JSON, pivoting rows to columns, joining data from multiple sources
  • Enrichment — Adding calculated fields (customer lifetime value, year-over-year growth), geocoding addresses, categorizing free-text fields
  • Aggregation — Summarizing transactional data into daily/weekly/monthly totals, computing averages, medians, and percentiles
  • Conforming — Standardizing formats across sources (date formats, currency codes, product identifiers) so data from different systems can be combined

Transformations can be simple (renaming columns, casting data types) or complex (applying machine learning models, running statistical calculations, executing multi-step business logic). The complexity of the transformation layer is often the primary factor in choosing between ETL tools.
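Several of the operations above (cleaning, validation, enrichment) can be combined in one pass. A minimal sketch with illustrative field names:

```python
def transform(records):
    """Deduplicate by id, normalize email case, validate, and add a derived field."""
    seen, out = set(), []
    for r in records:
        if r["id"] in seen:  # cleaning: drop duplicate records
            continue
        seen.add(r["id"])
        email = (r.get("email") or "").strip().lower()  # cleaning: normalize format
        if "@" not in email:  # validation: quarantine-worthy rows are skipped here
            continue
        out.append({
            "id": r["id"],
            "email": email,
            "total_cents": r["price_cents"] * r["qty"],  # enrichment: derived value
        })
    return out

raw = [
    {"id": 1, "email": " Ada@Example.COM ", "price_cents": 500, "qty": 2},
    {"id": 1, "email": "ada@example.com", "price_cents": 500, "qty": 2},
    {"id": 2, "email": "not-an-email", "price_cents": 100, "qty": 1},
]
clean = transform(raw)
```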

Load

The loading stage writes transformed data into the target system:

  • Data warehouses — Snowflake, BigQuery, Redshift, Azure Synapse
  • Analytics databases — ClickHouse, DuckDB, Apache Druid
  • Data lakes — S3 + Athena, GCS + BigQuery, Azure Data Lake
  • Operational databases — PostgreSQL, MySQL (for reverse ETL or data sync use cases)

Loading strategies include full load (replacing the entire target table), incremental append (adding new rows), and upsert (inserting new rows and updating existing ones based on a key). The choice of loading strategy affects both data freshness and warehouse compute costs.
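The upsert strategy can be sketched with SQLite's `ON CONFLICT` clause (warehouse dialects use `MERGE` or an equivalent; the table and key are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO dim_customer VALUES (1, 'Ada')")  # pre-existing row

# New extract: row 1 was updated at the source, row 2 is new.
batch = [(1, "Ada Lovelace"), (2, "Grace Hopper")]
con.executemany(
    """INSERT INTO dim_customer (id, name) VALUES (?, ?)
       ON CONFLICT(id) DO UPDATE SET name = excluded.name""",
    batch,
)
rows = sorted(con.execute("SELECT id, name FROM dim_customer"))
```

Upsert keeps the target current without rewriting the whole table, which is why it is the default for incremental pipelines keyed on a stable identifier.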

ETL vs ELT

The key distinction between ETL and ELT lies in where transformation happens:

  • ETL (Extract, Transform, Load): Data is transformed before loading into the target. Transformation happens in a separate processing layer (an ETL server, a workflow engine, or custom code). This was the standard approach when data warehouses had limited compute power and storage was expensive.

  • ELT (Extract, Load, Transform): Raw data is loaded first into the target warehouse, then transformed using the warehouse's own compute power. This approach became dominant with cloud warehouses like Snowflake and BigQuery, which offer virtually unlimited compute that can be scaled independently of storage.

  Aspect             | ETL                                                  | ELT
  -------------------|------------------------------------------------------|----------------------------------------------
  Transform location | External processing layer                            | Inside the data warehouse
  Best for           | Limited warehouse compute, sensitive data filtering  | Cloud warehouses with elastic compute
  Data freshness     | Slower (transform before load)                       | Faster (load raw, transform later)
  Flexibility        | Transformations fixed at design time                 | Transformations can be changed retroactively
  Raw data access    | Raw data not stored in warehouse                     | Raw data available for re-transformation
  Typical tools      | Informatica, Talend, custom scripts                  | Fivetran + dbt, Airbyte + dbt
  Cost model         | Compute on ETL servers                               | Compute on warehouse (pay per query)

The shift from ETL to ELT was driven by three factors: cloud warehouses made compute cheap and elastic, storage costs dropped to near-zero, and the dbt (data build tool) project created a standard framework for in-warehouse transformations using SQL. As of 2025, ELT is the dominant pattern for cloud-native data teams, while ETL remains common in enterprises with on-premise data warehouses, regulatory requirements for pre-load data filtering, or legacy infrastructure.

Batch vs Real-Time Processing

Batch ETL

Traditional ETL runs on a schedule — hourly, daily, or weekly. A nightly batch job might extract the day's transactions from an operational database, transform them into a star schema, and load them into a data warehouse before business hours. Batch processing is simpler to implement, easier to debug, and more cost-effective for use cases where data freshness of hours or days is acceptable.

Real-Time / Streaming ETL

Streaming ETL processes data continuously as events occur, with latency measured in seconds or minutes. Technologies like Apache Kafka, AWS Kinesis, and Apache Flink enable real-time data pipelines. Streaming ETL is required for use cases like fraud detection, real-time dashboards, and operational alerting where stale data is unacceptable.

Micro-Batch

A hybrid approach that processes data in small batches (every 1-15 minutes), balancing freshness with the simplicity and cost-effectiveness of batch processing. Many modern tools (Apache Spark Structured Streaming, Fivetran) use micro-batch as a pragmatic middle ground.

Modern ETL/ELT Tool Landscape (as of 2025)

  Category             | Tools                               | Description
  ---------------------|-------------------------------------|---------------------------------------------------------------------
  Managed ELT          | Fivetran, Airbyte, Stitch           | Pre-built connectors that extract and load data; paired with dbt for transformation
  Orchestration        | Apache Airflow, Prefect, Dagster    | Coordinate and schedule ETL/ELT pipeline execution
  Transformation       | dbt, SQLMesh                        | In-warehouse transformation using SQL and version control
  Code-first pipelines | Windmill, Pipedream, n8n            | Build custom ETL logic with TypeScript, Python, or visual workflows
  Enterprise ETL       | Informatica, Talend, Microsoft SSIS | Traditional ETL platforms for large-scale enterprise data integration
  Reverse ETL          | Census, Hightouch                   | Sync data from warehouses back to operational tools (CRM, marketing)

The modern data stack typically combines a managed ELT tool (Fivetran or Airbyte for extraction and loading) with dbt (for in-warehouse transformation) and an orchestrator (Airflow or Dagster for scheduling and dependency management). This modular approach replaced the monolithic ETL servers of the 2000s and 2010s.

Common ETL Challenges

Schema Changes

Source systems change their data schemas (adding columns, renaming fields, changing data types) without warning. Well-designed ETL pipelines detect schema changes and either adapt automatically or alert engineers before data is corrupted.
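A basic form of schema-change detection is comparing an incoming record's fields against the schema the pipeline was built for. A sketch (the expected columns are illustrative):

```python
EXPECTED_COLUMNS = {"id", "email", "created_at"}  # schema the pipeline was built for

def detect_schema_drift(record):
    """Compare an incoming record's fields against the expected schema.

    Returns (added, missing) column sets so the pipeline can adapt the target
    table automatically or alert an engineer before loading bad data.
    """
    actual = set(record)
    return actual - EXPECTED_COLUMNS, EXPECTED_COLUMNS - actual

added, missing = detect_schema_drift(
    {"id": 1, "email": "a@b.co", "signup_channel": "ads"}
)
```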

Data Quality

Source data is often messier than expected: null values in required fields, inconsistent formatting, duplicates, and encoding issues. ETL pipelines need validation rules, error handling, and quarantine mechanisms for bad records.
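A quarantine mechanism can be as simple as splitting each batch into good rows and rejected rows, with the failure reasons attached for later inspection (rules and field names are illustrative):

```python
def validate(rows):
    """Split rows into accepted and quarantined, recording why each row failed."""
    good, quarantine = [], []
    for r in rows:
        errors = []
        if not r.get("email"):  # required field present
            errors.append("missing email")
        if not isinstance(r.get("price"), (int, float)) or r["price"] < 0:
            errors.append("bad price")  # value within expected range
        if errors:
            quarantine.append({**r, "_errors": errors})
        else:
            good.append(r)
    return good, quarantine

good, bad = validate([
    {"email": "a@b.co", "price": 10},
    {"email": None, "price": -1},
])
```

Quarantined rows are typically written to a dead-letter table rather than discarded, so they can be fixed and replayed.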

Scalability

Pipelines that work for 10,000 rows may fail at 10 million rows due to memory constraints, timeout limits, or API rate limiting. Designing for scale from the beginning (incremental extraction, chunked processing, backpressure handling) prevents costly re-architecture later.
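Chunked processing is one of those design-for-scale patterns: stream the extract through fixed-size batches so the full dataset never sits in memory at once. A minimal sketch:

```python
def chunked(iterable, size):
    """Yield fixed-size chunks so a large extract never sits fully in memory."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk

# Process a (potentially huge) row stream 1,000 rows at a time.
batches = list(chunked(range(2500), 1000))
```

The same pattern caps API request sizes and keeps each load transaction small enough to retry cheaply on failure.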

Monitoring and Alerting

Silent failures are the most dangerous ETL problem. A pipeline that stops running or produces incorrect data without alerting anyone can cause downstream decisions based on stale or wrong information. Production ETL systems require monitoring for execution failures, data volume anomalies, and freshness SLAs.
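A freshness-SLA check can be sketched as comparing the newest loaded record's timestamp against an allowed age; in production the failing case would page someone rather than return a boolean (the two-hour SLA is an illustrative assumption):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at, sla=timedelta(hours=2)):
    """Return True if the newest loaded record is within the freshness SLA."""
    age = datetime.now(timezone.utc) - last_loaded_at
    return age <= sla

fresh = datetime.now(timezone.utc) - timedelta(minutes=10)
stale = datetime.now(timezone.utc) - timedelta(hours=6)
```

Pairing this with row-count anomaly checks catches the other silent-failure mode: a pipeline that runs on time but loads far fewer rows than usual.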

When to Use ETL

ETL is appropriate when:

  • Data needs to be cleaned or filtered before entering the target system (regulatory requirements, PII removal)
  • Source and target systems have different schemas that require mapping
  • Data from multiple sources needs to be combined and reconciled
  • Business logic needs to be applied consistently across datasets
  • Historical data needs to be aggregated or summarized for analytics

For simple data synchronization between two systems (mirroring a CRM to a warehouse), managed ELT tools like Fivetran or Airbyte are typically faster to implement and maintain than custom ETL pipelines.

By Rafal Fila