Definition

Data transformation

What is data transformation?

Data transformation is part of the overall data preparation process. However, it is also a process in itself, made up of several operations: converting, cleansing and structuring data. It can entail spotting and deleting duplicates, converting data types, and enriching the dataset as a whole.

The aim? Transforming raw data into clean, secure and standardized data, which thereby becomes easily accessible and actionable in many ways. Data transformation gets the data ready before it is used to guide and support decision-making from a Business Intelligence perspective.

It might now appear obvious that data transformation is a crucial process, especially in our Big Data era. It is the role of data engineers to ensure that the data used further down the pipeline is consistently functional and actionable, thus truly enabling the company to be data-driven. This means converting data to match its destination system, which will depend on the BI tool used internally, or even on the department using it.
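As an illustration of converting data to match a destination system, here is a minimal Python sketch. The schema, the column names and the `conform` helper are hypothetical and only stand in for whatever the target BI tool actually expects.

```python
from datetime import date

# Hypothetical destination schema: column name -> converter to the expected type.
TARGET_SCHEMA = {
    "order_id": int,
    "amount": float,
    "ordered_on": date.fromisoformat,
}


def conform(record, schema=TARGET_SCHEMA):
    """Keep only the columns the destination expects, cast to its types."""
    return {column: convert(record[column]) for column, convert in schema.items()}


# Illustrative source record: everything arrives as text, plus one extra column.
source_record = {
    "order_id": "42",
    "amount": "19.90",
    "ordered_on": "2023-03-01",
    "internal_note": "n/a",
}
conformed = conform(source_record)
```

Driving the conversion from a schema keeps the destination's expectations in one place, so a different BI tool or department would only need a different mapping.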

How does it work?

The data transformation process encompasses many types of operations, among which:

  • Data cleaning - identifying and deleting data that turns out to be incorrect, inaccurate, irrelevant or incomplete;
  • Data deduplication - spotting and erasing duplicates;
  • Data aggregation - compiling data and organizing it into a more concise format;
  • Data integration - gathering data from various sources and providing a unified view of them;
  • Data summarization - creating an understandable and informative summary of the generated data;
  • Data splitting - separating data into several portions for cross-validation purposes.
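Some of the operations listed above (cleaning, deduplication, aggregation and splitting) can be sketched in a few lines of plain Python. The records, field names and helper functions below are made up for illustration; a real pipeline would more likely rely on SQL or a dataframe library.

```python
from collections import defaultdict

# Hypothetical raw records; field names are illustrative only.
RAW_ORDERS = [
    {"order_id": "1", "country": "FR", "amount": "10.50"},
    {"order_id": "2", "country": "fr ", "amount": "4.50"},
    {"order_id": "2", "country": "fr ", "amount": "4.50"},  # duplicate
    {"order_id": "3", "country": "DE", "amount": None},     # incomplete
    {"order_id": "4", "country": "DE", "amount": "20.00"},
]


def clean(records):
    """Cleaning: drop incomplete records, then normalize types and casing."""
    cleaned = []
    for rec in records:
        if rec["amount"] is None:
            continue
        cleaned.append({
            "order_id": int(rec["order_id"]),
            "country": rec["country"].strip().upper(),
            "amount": float(rec["amount"]),
        })
    return cleaned


def deduplicate(records, key="order_id"):
    """Deduplication: keep the first record seen for each key value."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def aggregate(records):
    """Aggregation: total amount per country."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["country"]] += rec["amount"]
    return dict(totals)


def split(records, ratio=0.5):
    """Splitting: cut the list into two consecutive portions."""
    cut = int(len(records) * ratio)
    return records[:cut], records[cut:]


orders = deduplicate(clean(RAW_ORDERS))
totals = aggregate(orders)
first_part, second_part = split(orders)
```

Each step is a small function that takes records and returns new ones, so the steps can be reordered or skipped depending on what the dataset needs.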

Prior to this transformation process, it is crucial to carry out a data discovery process. This enables analysts to understand the dataset and determine which data transformation operations must be performed.
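A data discovery step can be as simple as profiling each column before deciding what to transform. The sketch below is one possible approach, not a standard tool: the `profile` function and the sample records are assumptions made for the example.

```python
def profile(records):
    """Summarize each column: missing count, distinct count, value types seen."""
    columns = {}
    for rec in records:
        for name, value in rec.items():
            stats = columns.setdefault(
                name, {"missing": 0, "distinct": set(), "types": set()}
            )
            # Treat None and empty strings as missing values.
            if value is None or value == "":
                stats["missing"] += 1
            else:
                stats["distinct"].add(value)
                stats["types"].add(type(value).__name__)
    return {
        name: {
            "missing": s["missing"],
            "distinct": len(s["distinct"]),
            "types": sorted(s["types"]),
        }
        for name, s in columns.items()
    }


# Illustrative sample with a repeated id, an empty date and mixed date formats.
SAMPLE = [
    {"customer_id": 1, "signup_date": "2023-01-05", "plan": "pro"},
    {"customer_id": 2, "signup_date": "", "plan": "free"},
    {"customer_id": 2, "signup_date": "05/01/2023", "plan": None},
]
report = profile(SAMPLE)
```

Here the report would hint at which operations are needed: a repeated `customer_id` suggests deduplication, and missing values suggest cleaning.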

What are the perks of data transformation?

Data transformation is beneficial in many ways:

  • it enhances data quality by removing errors, improving structure and reducing the risk of computational errors;
  • it makes data more usable and actionable for advanced BI and Analytics purposes;
  • it results in organized data, which makes data management and data use easier;
  • it enables faster query writing, as the data is well sorted and stored.

When does it happen?

Two scenarios: 

  • in the ETL pipeline, the data transformation process occurs before the data is loaded into the new database;
  • in the ELT pipeline, the data transformation process occurs after the data has been loaded into the new database.

Organizations should prioritize ELT and cloud-based data warehouses for their scalability and flexibility: with ELT, the raw data remains available in the database, so it can be transformed again in the future.
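The difference between the two scenarios can be sketched with Python's built-in `sqlite3` module standing in for the destination database. The table names, columns and sample rows are hypothetical; a cloud data warehouse would follow the same pattern at a larger scale.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system (all text).
RAW_ROWS = [("1", " fr", "10.50"), ("2", "DE ", "20.00")]


def transform_row(row):
    """Transformation done in application code (used by the ETL path)."""
    order_id, country, amount = row
    return int(order_id), country.strip().upper(), float(amount)


def run_etl(conn):
    """ETL: transform first, then load only the clean rows."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders_etl VALUES (?, ?, ?)",
        [transform_row(r) for r in RAW_ROWS],
    )


def run_elt(conn):
    """ELT: load raw rows as-is, then transform inside the database with SQL."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, country TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               UPPER(TRIM(country))      AS country,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths end with the same clean table, but only the ELT path keeps `raw_orders` in the database, which is what allows the data to be transformed again later with different rules.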

Learn more about Husprey

Husprey is a powerful, yet simple, platform that provides tools for Data Analysts to create SQL notebooks effortlessly, collaborate with their team and share their analyses with anyone.