Why We Use Data Transformation Pipelines & Why You Should Too

When data teams don’t have an established data transformation pipeline, chaos emerges in the form of siloed data, tribal knowledge, unstandardized logic, multiple metric definitions, and inconsistent KPIs.

So how can businesses go from unclear datasets and an inability to make data-informed decisions at scale to clean, concise datasets that empower decision-makers to make informed, strategic business moves?

Let’s take a closer look.

From ETL to ELT

ETL—or Extract, Transform, Load—is the process of extracting data from various data sources, transforming it into a desired format, and then loading it into a target database or data warehouse for analysis, reporting, and other purposes.

On the other hand, ELT—or Extract, Load, Transform—is a data integration process in which data is first extracted from various sources, loaded into a target database or data lake, and then transformed within the target system itself.

The main difference between ETL and ELT is that, with ETL, data is transformed before loading it into the target system, whereas with ELT, the transformation step is performed after the loading step. This means teams no longer have to capture their transformations during the initial load into the data warehouse, and are instead able to build transformations on top of the loaded data in its raw format.
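To make the flow concrete, here is a minimal sketch of the ELT pattern using Python's built-in sqlite3 module as a stand-in for the warehouse (the table and column names are illustrative, not from any particular system):

```python
import sqlite3

# In-memory database standing in for the data warehouse (assumption:
# any SQL warehouse behaves the same way for this sketch).
conn = sqlite3.connect(":memory:")

# Extract: rows pulled from a hypothetical source system.
extracted = [(1, "alice", "2024-01-05"), (2, "bob", "2024-01-06")]

# Load: land the data in its raw form, with no transformations applied.
conn.execute("CREATE TABLE raw_users (id INTEGER, name TEXT, signup_date TEXT)")
conn.executemany("INSERT INTO raw_users VALUES (?, ?, ?)", extracted)

# Transform: build on top of the raw data inside the warehouse itself.
conn.execute("""
    CREATE VIEW stg_users AS
    SELECT id AS user_id,
           UPPER(name) AS user_name,  -- standardize casing
           signup_date
    FROM raw_users
""")

rows = conn.execute("SELECT user_name FROM stg_users ORDER BY user_id").fetchall()
print(rows)  # → [('ALICE',), ('BOB',)]
```

Note that the raw table is loaded untouched; all shaping happens afterward, inside the target system.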

The ELT approach has numerous benefits, including:

  • Harnesses the computational power and scalability of modern databases and data processing platforms
  • Allows organizations to leverage the processing power of these systems for more flexible, scalable, and efficient transformation capabilities—especially when dealing with large volumes of data
  • Offers flexibility in data exploration and analysis by allowing users to work directly with the raw data stored in the target system
  • Enables on-demand transformations
  • Facilitates iterative and agile data processing

Transformation Layers & Best Practices

The layering concept helps in structuring and managing the T portion of ELT and enables a systematic progression from raw data to a refined and usable output. As mentioned previously, all the transformations in the ELT integration process are done inside the target system, which allows them to be more closely related to what a business unit might need from the data. While the number of transformation layers may vary depending on the data architecture and the specific implementation, the following are the most commonly used.

Raw Layer

The raw layer represents the initial state of the loaded data; it contains the data in its original form as it was loaded into the target database, without any significant transformations or filtering. The raw layer acts as a data source for further processing, backs up the original data, and is often stored in a dedicated storage area or a data lake.

Best Practices

It’s important to:

  • Preserve the original extracted data in its raw form, without modifications, to maintain data lineage and enable auditing
  • Implement proper security measures to protect the raw data from unauthorized access
  • Store it in a scalable and cost-effective storage system, such as a data lake or distributed file system
  • Establish data governance practices to ensure data quality and consistency

Additionally, if some data is updated regularly (such as purchase status or user information), it’s a good idea to keep historical records alongside the current values. This enables end users to do a “point-in-time” analysis of historical records. For example, suppose an orders table only contains the most recent information on each order; a business team might want to know how each order transitioned between states and the reason for each update. Tracking how a table changes as records are inserted or updated is known in data management and warehousing as handling Slowly Changing Dimensions (SCD).
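One common way to keep that history is an SCD Type 2 table, where each change closes the current row and opens a new one. Here is a minimal sketch, assuming a hypothetical orders_history table and using sqlite3 as a stand-in warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical SCD Type 2 history table for order status: instead of
# overwriting the status, each change closes the previous row and
# opens a new one, so point-in-time queries remain possible.
conn.execute("""
    CREATE TABLE orders_history (
        order_id   INTEGER,
        status     TEXT,
        valid_from TEXT,
        valid_to   TEXT    -- NULL means "current record"
    )
""")

def update_status(order_id, new_status, changed_at):
    # Close the current record for this order (if any)...
    conn.execute(
        "UPDATE orders_history SET valid_to = ? "
        "WHERE order_id = ? AND valid_to IS NULL",
        (changed_at, order_id),
    )
    # ...and insert the new current one.
    conn.execute(
        "INSERT INTO orders_history VALUES (?, ?, ?, NULL)",
        (order_id, new_status, changed_at),
    )

update_status(42, "created", "2024-01-01")
update_status(42, "delivered", "2024-01-03")

# Point-in-time: what was order 42's status on 2024-01-02?
row = conn.execute(
    "SELECT status FROM orders_history "
    "WHERE order_id = 42 AND valid_from <= ? "
    "AND (valid_to IS NULL OR valid_to > ?)",
    ("2024-01-02", "2024-01-02"),
).fetchone()
print(row[0])  # → created
```

The current state is simply the row whose valid_to is NULL, so both "latest" and historical views come from the same table.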

Base Layer

The base layer, also known as the persistent or core layer, is where the transformed and cleansed data is stored. This layer represents the central repository of the data warehouse or the target database. In the base layer, the data is organized and structured according to the defined schema (often equal to the source system schemas), which supports efficient querying. This layer serves as the foundation for downstream layers that will use its data to make transformations that are closer to data products a business unit will use. 

Best Practices

The DE team should:

  • Design a well-defined data model and schema for the base layer to ensure data consistency and ease of querying
  • Apply comprehensive data transformation and cleansing rules to convert the data into a standardized format
  • Establish data quality monitoring processes to identify and rectify any data anomalies or inconsistencies
  • Implement data indexing, partitioning, and compression techniques to optimize query performance

It’s important to note that the cleansing process mentioned above should aim to correct source system errors, not operational ones. For example, say a table of customer orders is loaded daily into the data warehouse. If a duplicate record is inserted during one of those loads, it’s the responsibility of the base layer to resolve the issue and keep the most recent information. On the other hand, orders canceled because of an operational failure (such as lack of stock, payment failures, or application failures) should be kept; downstream data products can decide later whether to include those records.
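The deduplication described above can be sketched with a window function that keeps only the most recently loaded row per order (table and column names are hypothetical, and window functions require SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, status TEXT, loaded_at TEXT)")
# The same order arrives twice across daily loads (hypothetical data).
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    (1, "created", "2024-01-01"),
    (1, "shipped", "2024-01-02"),   # duplicate of order 1, more recent
    (2, "created", "2024-01-01"),
])

# Base-layer dedup: rank rows within each order_id by load time and
# keep only the most recent one.
deduped = conn.execute("""
    SELECT order_id, status
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY loaded_at DESC
               ) AS rn
        FROM raw_orders
    )
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()

print(deduped)  # → [(1, 'shipped'), (2, 'created')]
```

In a real warehouse this query would materialize the base-layer table; here it just demonstrates the rule "latest load wins."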

Staging Layer

The staging layer is an intermediate step where data from the base layer is temporarily held (in views or temporary tables) before further processing. In this layer, data may undergo basic validations, such as checks for completeness, integrity, and conformity to predefined rules. Staging allows data from multiple source systems to be consolidated (though without any combining or joining of tables) and facilitates efficient loading into the target system, acting as a buffer between the raw data and the subsequent transformation layers. No aggregations, and no transformations that would prevent downstream layers from reconstructing the information in the base layer, should be performed here.

Best Practices

One crucial aspect of the staging layer is the step in which information from the same source system is processed and sometimes combined. A source system refers to any application, program, or server creating and sending data to the database in which the transformation layer is being developed. Examples of such systems are event-capturing applications like Google Analytics, the company’s backend server, or an ads platform that supplies information on potential customer acquisition for a product.

In the staging layer, information from the source systems should not be combined (which will be done in downstream layers) and instead, the focus should be on making the data as easy to use as possible. This means that tables processed in this layer should be error-free; this definition has two important considerations:

1. Business Logic

Business logic should be present in this layer and be used to filter the data. This implies that records that don’t make sense to the business can be removed, such as payments with negative values, orders that were returned in full, or fictional users that were created to test out features by software teams. What information will be filtered depends on what analysis/reports will be created with the data.

2. Basic Tests

Basic tests on the data should pass without errors. Some examples of these tests include:

  • Primary keys of tables should be unique and not null.
  • Relations between tables of the source system should be respected: JOIN (as in SQL table joins) conditions shouldn’t create duplicates and IDs from a parent/fact table should only contain IDs from its dimensions (e.g. an Orders table shouldn’t have user IDs not present in the Users table).
  • Categorical data should contain the expected values. For example, an order status can only be created, started, delivered, or canceled.
  • Numerical values should fall within appropriate ranges or distributions. For example, payments should have values above 0 and should not contain extreme outliers.
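The tests above can be sketched as queries that must return no offending rows; here is a minimal example with hypothetical users and orders tables, again using sqlite3 as a stand-in warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER,
                         status TEXT, amount REAL);
    INSERT INTO users  VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1, 'created',   25.0),
                              (11, 2, 'delivered', 40.0);
""")

def assert_empty(query, message):
    # A test passes when the query finds no offending rows.
    assert conn.execute(query).fetchall() == [], message

# Uniqueness / not-null on the primary key.
assert_empty("""SELECT order_id FROM orders
                GROUP BY order_id HAVING COUNT(*) > 1""",
             "duplicate order_id")
assert_empty("SELECT * FROM orders WHERE order_id IS NULL",
             "null order_id")

# Referential integrity: every order's user must exist in users.
assert_empty("""SELECT o.order_id FROM orders o
                LEFT JOIN users u USING (user_id)
                WHERE u.user_id IS NULL""",
             "orphan user_id in orders")

# Categorical values and numeric ranges.
assert_empty("""SELECT * FROM orders WHERE status NOT IN
                ('created', 'started', 'delivered', 'canceled')""",
             "unexpected order status")
assert_empty("SELECT * FROM orders WHERE amount <= 0",
             "non-positive payment amount")

print("all staging tests passed")
```

Tools like dbt express these same checks declaratively, but the underlying idea is the same: each test is a query that should come back empty.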

Additionally, a naming convention and data type casting should be applied to columns and tables, and this layer should be the only place where renaming happens. This prevents downstream layers from having to do renaming and casting that would be redundant if already done here.

Finally, as mentioned above, the transformations from this layer are temporary in nature, since the data produced in this layer only feeds the intermediate layer. Because of this, transformations in this layer should be viewed as functions: input comes in, an output is handed to the next layer, and no information is persisted in the process. For example, if the transformations are stored in a database, they can be stored as SQL views that produce an output when called, but don’t actually store any of the data they produce.
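A quick sketch of this function-like behavior: a SQL view stores only the query, not the data, so it always reflects the current state of the base layer (table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base_orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO base_orders VALUES (1, 25.0)")

# Staging as a view: a stored query, not stored data. It behaves like
# a function over the base layer.
conn.execute("""
    CREATE VIEW stg_orders AS
    SELECT order_id, amount, amount > 30 AS is_large
    FROM base_orders
""")

print(conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0])  # → 1

# New base-layer data shows up in the view without any reload step,
# because nothing was persisted the first time the view was queried.
conn.execute("INSERT INTO base_orders VALUES (2, 50.0)")
print(conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0])  # → 2
```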

Intermediate Layer

The intermediate layer, sometimes called the integration or semantic layer, sits between the base layer and the output layer. It involves additional transformations and consolidations specific to the business requirements. The intermediate layer may include data aggregation, summarization, denormalization, or the creation of derived data elements. This layer provides a more refined and optimized representation of the data, catering to the analytical needs of the organization.

Best Practices

In this layer, you’ll want to:

  • Identify and define the necessary business logic and transformations required
  • Add complex business logic, additional joins, and calculations to your data when needed
  • Perform data aggregation, summarization, and denormalization to support analytical processing
  • Use appropriate data modeling techniques to create derived data elements and calculated measures
  • Implement efficient indexing and caching strategies to enhance query performance
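As a sketch of an intermediate-layer transformation, the example below joins two hypothetical staged tables and aggregates them into a denormalized revenue rollup (names and numbers are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (order_id INTEGER, user_id INTEGER, amount REAL);
    CREATE TABLE stg_users  (user_id INTEGER, country TEXT);
    INSERT INTO stg_orders VALUES (1, 1, 25.0), (2, 1, 40.0), (3, 2, 10.0);
    INSERT INTO stg_users  VALUES (1, 'PE'), (2, 'CO');
""")

# Intermediate layer: join staged sources and aggregate into a derived,
# denormalized table closer to what a business unit consumes.
conn.execute("""
    CREATE TABLE int_revenue_by_country AS
    SELECT u.country,
           COUNT(o.order_id) AS n_orders,
           SUM(o.amount)     AS revenue
    FROM stg_orders o
    JOIN stg_users  u USING (user_id)
    GROUP BY u.country
""")

print(conn.execute(
    "SELECT country, revenue FROM int_revenue_by_country ORDER BY country"
).fetchall())  # → [('CO', 10.0), ('PE', 65.0)]
```

Unlike the staging views, this result is typically materialized, since it is the refined representation the output layer will build on.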

Output Layer

The output layer is the final stage of the transformation process where data is prepared and presented for consumption by end-users or downstream systems. This layer can include various outputs such as reports, dashboards, visualizations, data feeds, or data exports. The output layer focuses on delivering the transformed data in a format that is user-friendly and meets the specific needs of the intended audience.

Best Practices

It’s important to:

  • Understand the specific requirements of the end users or downstream systems and tailor the output accordingly
  • Combine tables to create products that serve a business need
  • Design intuitive, user-friendly reports, dashboards, or visualizations for effective data consumption
  • Implement appropriate data access controls and security measures to protect sensitive information
  • Regularly validate and verify the accuracy and relevance of the output data

In addition to these specific, layer-level best practices, it’s crucial to follow overarching best practices for the entire ELT process, such as maintaining proper documentation, version control, error handling, logging, and monitoring mechanisms. Additionally, conducting performance tuning, data profiling, and ongoing data quality assessments are essential practices to ensure the effectiveness and efficiency of all the transformation layers.

The Takeaway

In short, transformation pipelines leverage the power of a data warehouse to transform raw data into data products ready to be used by end users. By using such data pipelines, you’ll be able to take control of data transformation, increase ownership, replace siloed SQL scripts with centralized ones, and keep all transformations in the warehouse, not in external tools like Python scripts or visualization tools.

Thankfully, Factored has used this methodology with multiple clients, using different data architectures and adapting it to each unique case. This has helped our clients improve business decision processes with trustworthy and accurate data.

If your team is facing any of these challenges—siloed data, tribal knowledge, unstandardized logic, multiple metric definitions, and inconsistent KPIs—contact us and we can provide the best Analytics Engineers who can help you improve your data transformation pipeline.
