Microsoft Fabric is an end-to-end data analytics solution in the cloud, and one of its workloads is called Data Factory. In this article, you will learn what Data Factory is, how it works with the rest of Microsoft Fabric, and what are elements and functions of Data Factory.
Video
Microsoft Fabric
To understand Data Factory, it is best to understand Microsoft Fabric first. Microsoft Fabric is an end-to-end Data Analytics software-as-a-service offering from Microsoft. Microsoft Fabric combined some products and services to cover an end-to-end and easy-to-use platform for data analytics. Here are the components (also called workloads) of Microsoft Fabric.
To learn more about Microsoft Fabric and enable it in your organization, I recommend reading the articles below;
Data Factory Origin
Microsoft Fabric has a workload for Data Integration. Any end-to-end data analytics system should have a data integration component. Microsoft has been a strong data integration tool and service leader for decades. This started with SQL Server tools such as DTS (Data Transformation Service) and SSIS (SQL Server Integration Services) and then stepped into cloud-based technologies such as ADF (Azure Data Factory). Microsoft also used a data transformation engine that first targeted citizen data analysts called Power Query.
Data Factory is the data integration component of Microsoft Fabric which brings the power of Azure Data Factory and Power Query Dataflows into one place. For many years, we had these two technologies doing data transformations separately. But now, these two are combined under Fabric, called Data Factory.
Power Query
Power Query Dataflows was first announced a few years ago as an additional component to Power BI for data transformation as a cloud technology that is simple to use for data analysts. But soon, it became more than just for Power BI; it became Power Platform Dataflows. These days, Power Query Dataflows are used for data transformations in Power BI projects and data migration in Power Apps projects.
Although Power Query Dataflows is also on the dataflow side, it needed some enhancements on scalability and the control of execution with some control flow elements (such as loop structures, conditional execution, etc.).
Azure Data Factory
Azure Data Factory came into the market many years ago as the next generation of SSIS for in-the-cloud ETL. However, the data transformation engine of Azure Data Factory was not built on a strong basis, so most of the time, ADF was used for data ingestion, and then with the help of SQL stored procedures, etc., for doing the transformation afterward. ADF was not a tool for citizen data analysts. It was instead for data engineers and developers. ADF used data pipelines to execute a group of activities as a flow, and among those activities, there were tasks such as copy data, running a stored procedure, etc.
For the past few years, we have always had this split; If you wanted a simple-to-use data transformation engine but not much data, use Power Query Dataflows. If you want scalable data ingestion, then use Azure Data Factory.
Best of Both Worlds
Now in Microsoft Fabric, We combine the best things from Power Query Dataflows and Azure Data Factory Data Pipelines into one stream: Data Factory. Data Factory ensures that you still have a simple-to-use and powerful transformation engine of Power Query for data transformation, but on the other hand, you also have the scalability of Data Pipelines and can build a control flow for execution of the ETL using the Data Pipelines. In other words, Data Factory is a state-of-the-art ETL software-as-a-service offering for Microsoft Fabric.
Elements of Data Factory
Combining these two services brings great features that make the Data Factory an ultimate ETL service. Here are some of those below;
Data Connectors
For an ETL (Extract, Transform, Load) system, one of the most important aspects is what sources the data can be fetched from. Data Factory offers hundreds of data connectors, enabling you to get data from sources such as databases, files, folders, software-as-a-service systems, etc.
It is also possible to create your connector if you are keen.
Dataflows
Dataflows are the heart of Data Factory. This is where you get the data from the sources, define the data transformation and prepare it in any shape needed, and finally load it into destinations. Dataflows use the Power Query data transformation engine and the user interface for creating it using the simple-to-use Power Query Editor online.
Power Query Editor online is not only powerful in the graphical interface, it also enables the developer to write code in M language, which is the data transformation language for Power Query.
To learn more about Dataflows, I suggest reading my article below.
Dataflows support a few destinations at the time of writing this article which are;
- Azure Data Explorer (Kusto)
- Azure SQL Database
- Data Warehouse
- Lakehouse
Data Pipelines
Although Dataflows are the main ETL component of the Data Factory, they can be enhanced when wrapped by a control flow execution component. This control flow execution component is called Data Pipeline. A Data Pipeline is a group of activities (or tasks) defined by a particular flow of execution. The activities in a Pipeline can involve copying data, running a Dataflow, executing a stored procedure, looping until a certain condition is met, or executing a particular set of activities if a condition is met, etc.
Data Pipelines can then be scheduled, and there is a monitoring tool to check the execution stage of the pipeline in addition to the activity-state-outputs where you can define what happens if a certain task fails or succeeds.
As mentioned, one of the most important activities that can be done in a Pipeline is the execution of a Dataflow. This is where Dataflows and Data Pipelines work together in their best way.
To learn more about Data Pipelines, read my article below;
Summary
Data Factory is an ETL-in-cloud solution that is the data integration workload of Microsoft Fabric. Data Factory is not a new product or service; it comes from many years of Microsoft data transformation tools and services. It is built on top of Power Query and Azure Data Factory. Data Factory uses two main components to deliver the best ETL scenarios possible; Dataflows and Data Pipelines. Dataflows are for the main get data, transform, and load process, and the Data Pipeline can control the rest of the execution with control flow activities.
I highly recommend reading the articles below to study more about Data Factory;
- What is Microsoft Fabric
- How to enable Fabric in your organization
- What is Lakehouse, and a sample Dataflow Gen2 created to load data into it.
- Getting started with Data Pipelines in Microsoft Fabric Data Factory
- Getting started with Dataflows in Microsoft Fabric Data Factory
The post What is Data Factory in Microsoft Fabric appeared first on RADACAD.