The great analytic and machine learning projects that businesses need to stay competitive rely on data. The ETL process is the backbone that produces clean, accessible and reliable data. In this article, we describe why ETL is important, what it is and then compare it with its younger sibling ELT which is seeing increased adoption.
Why is ETL important?
- Centralizes analytical data
- Facilitates the answering of difficult business questions efficiently compared to operational databases
- Allows teams to reuse software and techniques to analyze data; without ETL, all our solutions might depend on the source data systems.
What is ETL?
ETL stands for Extract, Transform and Load. In the Extract step, data is exported in its raw state to a staging server. For instance, extraction could include SQL queries, API requests and Excel file importation, and all of this could be handled using a programming language such as Python.
Then, when data is in the designed staging area, we move on to the Transform step. We would use a tool like dbt to rename columns, filter by relevant values, aggregate, standardize units, summarize, remove or correct inaccurate rows and many other possible transformations.
Finally, in the last step, this transformed, cleaned data is loaded into the data warehouse. At this point, our end users should be able to access data.
What is ELT and how does it compare with ETL?
In the ELT process we shift the order of the final two operations. First, the raw data is loaded into the data warehouse and then transformation occurs. The key thing to note is that the mentioned transformations occur inside the data warehouse, not inside the staging area.
This inclusion of operational (raw) data into the analytical data warehouse is not ideal if we want only clean-transformed data there but has the advantage of requiring only the data warehouse server for all our transformation operations.
ELT has several advantages over traditional ETL. First is the idea of not having the setup of the staging servers. In the ELT process the transformations happen inside the data warehouse, which usually performs extremely well and efficiently handles processes. AWS Redshift and Snowflake are two examples of modern data warehouses used for this purpose.
In the ELT this process relies on the server’s processing power. Another advantage is that since all raw data is loaded into the data warehouse more flexibility is provided, and new transformations and tables can be added directly to the data warehouse.
However, ETL has some properties that could make it desirable over the ELT approach. ELT overloads the data warehouse with all our necessary transformations. This makes sense if we have a data warehouse with enough processing power, but it doesn’t work well if we don’t have a performant data warehouse. Another benefit of doing this is that ETL is a well-established process that has existed for more than 20 years; it is already widely studied and there are multiple tools and support for doing it.
The following table summarizes the comparison of both approaches:
|Historically it has been the standard approach.|
|Uses the staging server for the data transformation.|
Uses the data warehouse to handle data transformation.
|Raw and processed data are loaded into different areas.|
Raw and processed data are loaded into the data warehouse.
|It can be complex since it needs to manage additional servers.|
It is simpler since it does not need configuring or managing more servers.
Two different data transfers are done. The first is to load data from source systems into the staging server and the second one is to load transformed data into the data warehouse.
|A single data transfer is done. Data is loaded from source systems into the data warehouse directly.|
Practically speaking, we must remember that the industry uses both terms interchangeably since most of the process is similar regardless of the approach. The important thing is to know the differences to help pick the correct one based on the necessities.
At Factored we can help you design and implement your projects with our highly skilled and vetted data engineers. Moreover, the partnerships that Factored has with the technologies commonly used for ETL such as AWS and dbt guarantee that our team follows the best practices and continual learning of the state-of-the-art architectures.
Transforming Data in the ELT paradigm. Holistics book.