What is ETL (Extract, Transform, Load) and how does it work?
Quick Answer: ETL (Extract, Transform, Load) is a data integration process that extracts data from source systems, transforms it into a consistent format, and loads it into a target system such as a data warehouse. As of March 2026, the data integration landscape has expanded beyond classic ETL to include ELT (Extract, Load, Transform), which loads raw data first and transforms it inside the warehouse using tools like dbt. Leading ETL/ELT platforms include Fivetran (from $1/mo per connector credit), Airbyte (open-source), Apache Airflow (open-source orchestration), and Stitch Data (from $100/mo). Cloud data warehouses (Snowflake, BigQuery, Redshift) have made ELT the preferred approach for analytics workloads due to lower latency and warehouse-native transformation.
Definition
ETL stands for Extract, Transform, Load — a three-stage data integration process that moves data from source systems into a target data store. Extract pulls raw data from source systems (databases, APIs, flat files, SaaS applications). Transform cleans, validates, restructures, and enriches the data (formatting dates, deduplicating records, aggregating values, applying business rules). Load writes the processed data into a destination system (data warehouse, analytics database, data lake).
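The three stages can be sketched as a minimal pipeline in plain Python. The source rows, table name, and field names here are hypothetical stand-ins for a real API or operational database; sqlite3 stands in for the destination warehouse.

```python
import sqlite3

# Hypothetical source rows, standing in for an API or database extract.
def extract():
    return [
        {"id": 1, "email": "A@Example.com", "amount": "19.99"},
        {"id": 2, "email": "b@example.com", "amount": "5.00"},
        {"id": 1, "email": "A@Example.com", "amount": "19.99"},  # duplicate record
    ]

def transform(rows):
    seen, out = set(), []
    for r in rows:
        if r["id"] in seen:          # deduplicate on the primary key
            continue
        seen.add(r["id"])
        out.append({"id": r["id"],
                    "email": r["email"].lower(),    # conform formatting
                    "amount": float(r["amount"])})  # cast data types
    return out

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(id INTEGER PRIMARY KEY, email TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (:id, :email, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Real pipelines add error handling, logging, and incremental state, but the extract/transform/load separation stays the same.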
ETL has been a foundational data engineering pattern since the 1970s, originally developed for populating data warehouses from operational databases. The pattern remains central to modern data infrastructure, though the implementation has evolved from monolithic ETL servers to distributed cloud-native pipelines.
The Three Stages in Detail
Extract
The extraction stage connects to source systems and retrieves raw data. Sources can include:
- Relational databases — PostgreSQL, MySQL, SQL Server, Oracle via SQL queries or change data capture (CDC)
- APIs — REST and GraphQL endpoints from SaaS applications (Salesforce, Shopify, HubSpot)
- Flat files — CSV, JSON, XML, Parquet files from SFTP servers, cloud storage (S3, GCS), or local filesystems
- Event streams — Kafka, RabbitMQ, AWS Kinesis for real-time data
- NoSQL databases — MongoDB, DynamoDB, Cassandra
Extraction can be full (pulling the entire dataset each time) or incremental (pulling only records that changed since the last extraction, identified by timestamps or change data capture). Incremental extraction is preferred for large datasets because it reduces load on source systems and transfer time.
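Timestamp-based incremental extraction can be sketched with a watermark: persist the newest `updated_at` seen on each run, and pull only rows changed since then. The rows and field names below are illustrative.

```python
from datetime import datetime, timezone

# Hypothetical source rows exposing an updated_at column, as many tables do.
SOURCE = [
    {"id": 1, "updated_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2025, 1, 3, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2025, 1, 5, tzinfo=timezone.utc)},
]

def extract_incremental(rows, watermark):
    """Return only rows changed after the last successful extraction,
    plus the new watermark to persist for the next run."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

last_run = datetime(2025, 1, 2, tzinfo=timezone.utc)  # watermark from previous run
changed, wm = extract_incremental(SOURCE, last_run)
```

In production the watermark would be stored durably (a state table or the orchestrator's metadata) so a failed run can resume without missing rows.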
Transform
The transformation stage applies business logic to the extracted data:
- Cleaning — Removing duplicates, fixing formatting inconsistencies, handling null values
- Validation — Checking data against rules (valid email format, price within expected range, required fields present)
- Restructuring — Flattening nested JSON, pivoting rows to columns, joining data from multiple sources
- Enrichment — Adding calculated fields (customer lifetime value, year-over-year growth), geocoding addresses, categorizing free-text fields
- Aggregation — Summarizing transactional data into daily/weekly/monthly totals, computing averages, medians, and percentiles
- Conforming — Standardizing formats across sources (date formats, currency codes, product identifiers) so data from different systems can be combined
Transformations can be simple (renaming columns, casting data types) or complex (applying machine learning models, running statistical calculations, executing multi-step business logic). The complexity of the transformation layer is often the primary factor in choosing between ETL tools.
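A conforming transformation from the simple end of that spectrum: sources emit dates in different formats, and a sketch like the following (format list and function name are illustrative) standardizes them to ISO 8601 before loading.

```python
from datetime import datetime

# Hypothetical set of formats observed across source systems.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]

def conform_date(raw):
    """Try each known source format; return the date as ISO 8601."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

A record that matches none of the known formats raises an error rather than passing through silently, so bad data is caught at transform time.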
Load
The loading stage writes transformed data into the target system:
- Data warehouses — Snowflake, BigQuery, Redshift, Azure Synapse
- Analytics databases — ClickHouse, DuckDB, Apache Druid
- Data lakes — S3 + Athena, GCS + BigQuery, Azure Data Lake
- Operational databases — PostgreSQL, MySQL (for reverse ETL or data sync use cases)
Loading strategies include full load (replacing the entire target table), incremental append (adding new rows), and upsert (inserting new rows and updating existing ones based on a key). The choice of loading strategy affects both data freshness and warehouse compute costs.
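The upsert strategy maps directly onto SQL's `INSERT ... ON CONFLICT` (or `MERGE` in most warehouses). A sketch using sqlite3 as a stand-in target, with an illustrative `customers` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'old@example.com', '2025-01-01')")

# Upsert batch: insert new rows, update existing ones keyed on the primary key.
batch = [
    (1, "new@example.com", "2025-02-01"),  # existing row -> update
    (2, "b@example.com", "2025-02-01"),    # new row -> insert
]
conn.executemany(
    """INSERT INTO customers (id, email, updated_at) VALUES (?, ?, ?)
       ON CONFLICT(id) DO UPDATE SET email = excluded.email,
                                     updated_at = excluded.updated_at""",
    batch,
)
conn.commit()
```

Warehouse dialects differ (Snowflake and BigQuery use `MERGE`), but the shape is the same: a match key, an update branch, and an insert branch.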
ETL vs ELT
The key distinction between ETL and ELT lies in where transformation happens:
ETL (Extract, Transform, Load): Data is transformed before loading into the target. Transformation happens in a separate processing layer (an ETL server, a workflow engine, or custom code). This was the standard approach when data warehouses had limited compute power and storage was expensive.
ELT (Extract, Load, Transform): Raw data is loaded first into the target warehouse, then transformed using the warehouse's own compute power. This approach became dominant with cloud warehouses like Snowflake and BigQuery, which offer virtually unlimited compute that can be scaled independently of storage.
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | External processing layer | Inside the data warehouse |
| Best for | Limited warehouse compute, sensitive data filtering | Cloud warehouses with elastic compute |
| Data freshness | Slower (transform before load) | Faster (load raw, transform later) |
| Flexibility | Transformations fixed at design time | Transformations can be changed retroactively |
| Raw data access | Raw data not stored in warehouse | Raw data available for re-transformation |
| Typical tools | Informatica, Talend, custom scripts | Fivetran + dbt, Airbyte + dbt |
| Cost model | Compute on ETL servers | Compute on warehouse (pay per query) |
The shift from ETL to ELT was driven by three factors: cloud warehouses made compute cheap and elastic, storage costs dropped to near-zero, and the dbt (data build tool) project created a standard framework for in-warehouse transformations using SQL. As of 2025, ELT is the dominant pattern for cloud-native data teams, while ETL remains common in enterprises with on-premise data warehouses, regulatory requirements for pre-load data filtering, or legacy infrastructure.
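The ELT pattern can be sketched end to end: load raw data untouched, then transform it with SQL inside the warehouse, dbt-style. sqlite3 stands in for Snowflake/BigQuery here, and the table names are illustrative.

```python
import sqlite3

# Load step: raw data lands in the warehouse unmodified.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "19.99", "paid"), (2, "5.00", "refunded"), (3, "7.50", "paid")])

# Transform step: a SQL model over the raw table, re-runnable at any time
# because the raw data is still there.
wh.execute("""
    CREATE TABLE stg_orders AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'paid'
""")
```

Because `raw_orders` is preserved, changing the business rule (say, including refunded orders) is just rebuilding `stg_orders` — the retroactive flexibility the table above describes.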
Batch vs Real-Time Processing
Batch ETL
Traditional ETL runs on a schedule — hourly, daily, or weekly. A nightly batch job might extract the day's transactions from an operational database, transform them into a star schema, and load them into a data warehouse before business hours. Batch processing is simpler to implement, easier to debug, and more cost-effective for use cases where data freshness of hours or days is acceptable.
Real-Time / Streaming ETL
Streaming ETL processes data continuously as events occur, with latency measured in seconds or minutes. Technologies like Apache Kafka, AWS Kinesis, and Apache Flink enable real-time data pipelines. Streaming ETL is required for use cases like fraud detection, real-time dashboards, and operational alerting where stale data is unacceptable.
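The core difference from batch can be shown with a toy sketch: each event is processed on arrival and updated state is emitted immediately, rather than waiting for a scheduled job. A real pipeline would consume from Kafka or Kinesis instead of a Python iterator; the event shape here is invented.

```python
# Toy streaming transform: maintain a running per-user total and emit the
# updated aggregate downstream as each event arrives.
def stream_totals(events):
    totals = {}
    for e in events:                        # each event handled on arrival
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
        yield e["user"], totals[e["user"]]  # emit immediately, not at batch end

events = [{"user": "a", "amount": 10},
          {"user": "b", "amount": 5},
          {"user": "a", "amount": 7}]
out = list(stream_totals(events))
```

Production streaming engines add what this sketch omits: windowing, late-event handling, checkpointed state, and exactly-once delivery semantics.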
Micro-Batch
A hybrid approach that processes data in small batches (every 1-15 minutes), balancing freshness with the simplicity and cost-effectiveness of batch processing. Many modern tools (Apache Spark Structured Streaming, Fivetran) use micro-batch as a pragmatic middle ground.
Modern ETL/ELT Tool Landscape (as of 2025)
| Category | Tools | Description |
|---|---|---|
| Managed ELT | Fivetran, Airbyte, Stitch | Pre-built connectors that extract and load data; paired with dbt for transformation |
| Orchestration | Apache Airflow, Prefect, Dagster | Coordinate and schedule ETL/ELT pipeline execution |
| Transformation | dbt, SQLMesh | In-warehouse transformation using SQL and version control |
| Code-first pipelines | Windmill, Pipedream, n8n | Build custom ETL logic with TypeScript, Python, or visual workflows |
| Enterprise ETL | Informatica, Talend, Microsoft SSIS | Traditional ETL platforms for large-scale enterprise data integration |
| Reverse ETL | Census, Hightouch | Sync data from warehouses back to operational tools (CRM, marketing) |
The modern data stack typically combines a managed ELT tool (Fivetran or Airbyte for extraction and loading) with dbt (for in-warehouse transformation) and an orchestrator (Airflow or Dagster for scheduling and dependency management). This modular approach replaced the monolithic ETL servers of the 2000s and 2010s.
Common ETL Challenges
Schema Changes
Source systems change their data schemas (adding columns, renaming fields, changing data types) without warning. Well-designed ETL pipelines detect schema changes and either adapt automatically or alert engineers before data is corrupted.
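A minimal drift check compares the columns a source currently returns against the schema the pipeline expects, before loading. The expected schema below is hypothetical.

```python
# Hypothetical expected schema for one source table.
EXPECTED_COLUMNS = {"id", "email", "amount"}

def detect_schema_drift(row):
    """Compare one extracted record's keys against the expected columns."""
    observed = set(row)
    return {"added": sorted(observed - EXPECTED_COLUMNS),
            "removed": sorted(EXPECTED_COLUMNS - observed)}

drift = detect_schema_drift({"id": 1, "email": "a@b.c", "amount": 5.0, "country": "DE"})
```

A pipeline would run a check like this per batch and either adapt (add the new column to the target) or page an engineer, depending on policy. Type changes need a similar check against column types, not just names.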
Data Quality
Source data is often messier than expected: null values in required fields, inconsistent formatting, duplicates, and encoding issues. ETL pipelines need validation rules, error handling, and quarantine mechanisms for bad records.
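A validate-and-quarantine step can be sketched like this: rows failing a rule are routed to a quarantine list with a reason, instead of silently dropped or crashing the run. The rules and field names are illustrative.

```python
import re

# Simplistic email pattern for illustration only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(rows):
    good, quarantine = [], []
    for r in rows:
        if r.get("email") is None:
            quarantine.append((r, "missing email"))        # required field absent
        elif not EMAIL.match(r["email"]):
            quarantine.append((r, "invalid email"))        # format check
        elif not (0 <= r["price"] <= 10_000):
            quarantine.append((r, "price out of range"))   # range check
        else:
            good.append(r)
    return good, quarantine

good, bad = validate([
    {"email": "a@example.com", "price": 20},
    {"email": "not-an-email", "price": 20},
    {"email": "b@example.com", "price": -5},
])
```

Quarantined rows are typically written to a dead-letter table with the failure reason so they can be inspected and replayed after a fix.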
Scalability
Pipelines that work for 10,000 rows may fail at 10 million rows due to memory constraints, timeout limits, or API rate limiting. Designing for scale from the beginning (incremental extraction, chunked processing, backpressure handling) prevents costly re-architecture later.
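Chunked processing is the simplest of those techniques: iterate the source in fixed-size batches so memory stays bounded regardless of total row count. A generic sketch:

```python
def chunked(iterable, size):
    """Yield lists of at most `size` items, holding only one chunk in memory."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:          # flush the final partial chunk
        yield chunk

batches = list(chunked(range(10), size=4))
```

Each chunk can then be transformed and loaded independently, which also gives a natural retry unit when one batch fails.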
Monitoring and Alerting
Silent failures are the most dangerous ETL problem. A pipeline that stops running or produces incorrect data without alerting anyone can cause downstream decisions based on stale or wrong information. Production ETL systems require monitoring for execution failures, data volume anomalies, and freshness SLAs.
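A freshness-SLA check is the simplest guard against silent failure: compare the newest loaded timestamp against an allowed staleness window. The SLA value and function name here are illustrative.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at, sla=timedelta(hours=6), now=None):
    """True if the most recent load is older than the SLA allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > sla

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
ok = is_stale(datetime(2025, 6, 1, 9, 0, tzinfo=timezone.utc), now=now)      # 3h old
breached = is_stale(datetime(2025, 5, 31, 12, 0, tzinfo=timezone.utc), now=now)  # 24h old
```

Run on a schedule independent of the pipeline itself, a check like this fires even when the pipeline stops executing entirely — the failure mode a pipeline-internal alert can never catch.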
When to Use ETL
ETL is appropriate when:
- Data needs to be cleaned or filtered before entering the target system (regulatory requirements, PII removal)
- Source and target systems have different schemas that require mapping
- Data from multiple sources needs to be combined and reconciled
- Business logic needs to be applied consistently across datasets
- Historical data needs to be aggregated or summarized for analytics
For simple data synchronization between two systems (mirroring a CRM to a warehouse), managed ELT tools like Fivetran or Airbyte are typically faster to implement and maintain than custom ETL pipelines.