Skip to main content This browser is no longer supported. Show
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support. Azure Data Factory FAQ
In this articleAPPLIES TO: This article provides answers to frequently asked questions about Azure Data Factory. What is Azure Data Factory?Data Factory is a fully managed, cloud-based, data-integration ETL service that automates the movement and transformation of data. Like a factory that runs equipment to transform raw materials into finished goods, Azure Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information. By using Azure Data Factory, you can create data-driven workflows to move data between on-premises and cloud data stores. And you can process and transform data with Data Flows. ADF also supports external compute engines for hand-coded transformations by using compute services such as Azure HDInsight, Azure Databricks, and the SQL Server Integration Services (SSIS) integration runtime. With Data Factory, you can execute your data processing either on an Azure-based cloud service or in your own self-hosted compute environment, such as SSIS, SQL Server, or Oracle. After you create a pipeline that performs the action you need, you can schedule it to run periodically (hourly, daily, or weekly, for example), time window scheduling, or trigger the pipeline from an event occurrence. For more information, see Introduction to Azure Data Factory. Compliance and Security ConsiderationsAzure Data Factory is certified for a range of compliance certifications, including SOC 1, 2, 3, HIPAA BAA, and HITRUST. Full and growing list of certifications can be found here. Digital copies for audit reports and compliance certifications can be found in Service Trust Center Control flows and scaleTo support the diverse integration flows and patterns in the modern data warehouse, Data Factory enables flexible data pipeline modeling. This entails full control flow programming paradigms, which include conditional execution, branching in data pipelines, and the ability to explicitly pass parameters within and across these flows. Control flow also encompasses transforming data through activity dispatch to external execution engines and data flow capabilities, including data movement at scale, via the Copy activity. Data Factory provides freedom to model any flow style that's required for data integration and that can be dispatched on demand or repeatedly on a schedule. A few common flows that this model enables are:
For more information, see Tutorial: Control flows. Data transformed at scale with code-free pipelinesThe new browser-based tooling experience provides code-free pipeline authoring and deployment with a modern, interactive web-based experience. For visual data developers and data engineers, the Data Factory web UI is the code-free design environment that you will use to build pipelines. It's fully integrated with Visual Studio Online Git and provides integration for CI/CD and iterative development with debugging options. Rich cross-platform SDKs for advanced usersData Factory V2 provides a rich set of SDKs that can be used to author, manage, and monitor pipelines by using your favorite IDE, including:
Users can also use the documented REST APIs to interface with Data Factory V2. Iterative development and debugging by using visual toolsAzure Data Factory visual tools enable iterative development and debugging. You can create your pipelines and do test runs by using the Debug capability in the pipeline canvas without writing a single line of code. You can view the results of your test runs in the Output window of your pipeline canvas. After your test run succeeds, you can add more activities to your pipeline and continue debugging in an iterative manner. You can also cancel your test runs after they are in progress. You are not required to publish your changes to the data factory service before selecting Debug. This is helpful in scenarios where you want to make sure that the new additions or changes will work as expected before you update your data factory workflows in development, test, or production environments. Ability to deploy SSIS packages to AzureIf you want to move your SSIS workloads, you can create a Data Factory and provision an Azure-SSIS integration runtime. An Azure-SSIS integration runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to run your SSIS packages in the cloud. For step-by-step instructions, see the Deploy SSIS packages to Azure tutorial. SDKsIf you are an advanced user and looking for a programmatic interface, Data Factory provides a rich set of SDKs that you can use to author, manage, or monitor pipelines by using your favorite IDE. Language support includes .NET, PowerShell, Python, and REST. MonitoringYou can monitor your Data Factories via PowerShell, SDK, or the Visual Monitoring Tools in the browser user interface. You can monitor and manage on-demand, trigger-based, and clock-driven custom flows in an efficient and effective manner. Cancel existing tasks, see failures at a glance, drill down to get detailed error messages, and debug the issues, all from a single pane of glass without context switching or navigating back and forth between screens. New features for SSIS in Data FactorySince the initial public preview release in 2017, Data Factory has added the following features for SSIS:
What is the integration runtime?The integration runtime is the compute infrastructure that Azure Data Factory uses to provide the following data integration capabilities across various network environments:
You can deploy one or many instances of the integration runtime as required to move and transform data. The integration runtime can run on an Azure public network or on a private network (on-premises, Azure Virtual Network, or Amazon Web Services virtual private cloud [VPC]). In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a compute service. An integration runtime provides the bridge between the activity and linked Services. It's referenced by the linked service or activity, and provides the compute environment where the activity either runs on or gets dispatched from. This way, the activity can be performed in the region closest possible to the target data store or compute service in the most performant way while meeting security and compliance needs. Integration runtimes can be created in the Azure Data Factory UX via the management hub and any activities, datasets, or data flows that reference them. For more information, see Integration runtime in Azure Data Factory. What is the limit on the number of integration runtimes?There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution. For more information, see Data Factory limits. What are the top-level concepts of Azure Data Factory?An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory contains four key components that work together as a platform on which you can compose data-driven workflows with steps to move and transform data. PipelinesA data factory can have one or more pipelines. A pipeline is a logical grouping of activities to perform a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingest data from an Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The benefit is that you can use a pipeline to manage the activities as a set instead of having to manage each activity individually. You can chain together the activities in a pipeline to operate them sequentially, or you can operate them independently, in parallel. Data flowsData flows are objects that you build visually in Data Factory which transform data at scale on backend Spark services. You do not need to understand programming or Spark internals. Just design your data transformation intent using graphs (Mapping) or spreadsheets (Power query activity). ActivitiesActivities represent a processing step in a pipeline. For example, you can use a Copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity, which runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities. DatasetsDatasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs. Linked servicesLinked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: A linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies the connection string to connect to the Azure Storage account. And an Azure blob dataset specifies the blob container and the folder that contains the data. Linked services have two purposes in Data Factory:
TriggersTriggers represent units of processing that determine when a pipeline execution is kicked off. There are different types of triggers for different types of events. Pipeline runsA pipeline run is an instance of a pipeline execution. You usually instantiate a pipeline run by passing arguments to the parameters that are defined in the pipeline. You can pass the arguments manually or within the trigger definition. ParametersParameters are key-value pairs in a read-only configuration. You define parameters in a pipeline, and you pass the arguments for the defined parameters during execution from a run context. The run context is created by a trigger or from a pipeline that you execute manually. Activities within the pipeline consume the parameter values. A dataset is a strongly typed parameter and an entity that you can reuse or reference. An activity can reference datasets, and it can consume the properties that are defined in the dataset definition. A linked service is also a strongly typed parameter that contains connection information to either a data store or a compute environment. It's also an entity that you can reuse or reference. Control flowsControl flows orchestrate pipeline activities that include chaining activities in a sequence, branching, parameters that you define at the pipeline level, and arguments that you pass as you invoke the pipeline on demand or from a trigger. Control flows also include custom state passing and looping containers (that is, foreach iterators). For more information about Data Factory concepts, see the following articles:
What is the pricing model for Data Factory?For Azure Data Factory pricing details, see Data Factory pricing details. How can I stay up-to-date with information about Data Factory?For the most up-to-date information about Azure Data Factory, go to the following sites:
Technical deep diveHow can I schedule a pipeline?You can use the scheduler trigger or time window trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM). For more information, see Pipeline execution and triggers. Can I pass parameters to a pipeline run?Yes, parameters are a first-class, top-level concept in Data Factory. You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger. Can I define default values for the pipeline parameters?Yes. You can define default values for the parameters in the pipelines. Can an activity in a pipeline consume arguments that are passed to a pipeline run?Yes. Each activity within
the pipeline can consume the parameter value that's passed to the pipeline and run with the Can an activity output property be consumed in another activity?Yes. An activity output can be consumed in a subsequent activity with the How do I gracefully handle null values in an activity output?You
can use the How many pipeline activities can be executed simultaneously?A maximum of 50 concurrent pipeline activities is allowed. The 51st pipeline activity will be queued until a free slot is opened up. A maximum of 800 concurrent external activities will be allowed, after which they will be queued in the same way. Mapping data flowsI need help troubleshooting my data flow logic. What info do I need to provide to get help?When Microsoft provides help or troubleshooting with data flows, please provide the ADF pipeline support files. This Zip file contains the code-behind script from your data flow graph. From the ADF UI, select ... next to pipeline, and then select Download support files. How do I access data by using the other 90 dataset types in Data Factory?The mapping data flow feature currently allows Azure SQL Database, Azure Synapse Analytics, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively for source and sink. Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform data after it's been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in source to transform that data. Is the self-hosted integration runtime available for data flows?Self-hosted IR is an ADF pipeline construct that you can use with the Copy Activity to acquire or move data to and from on-premises or VM-based data sources and sinks. The virtual machines that you use for a self-hosted IR can also be placed inside of the same VNET as your protected data stores for access to those data stores from ADF. With data flows, you'll achieve these same end-results using the Azure IR with managed VNET instead. Does the data flow compute engine serve multiple tenants?Clusters are never shared. We guarantee isolation for each job run in production runs. In case of debug scenario one person gets one cluster, and all debugs will go to that cluster which are initiated by that user. Is there a way to write attributes in Cosmos DB in the same order as specified in the sink in ADF data flow?For Cosmos DB, the underlying format of each document is a JSON object which is an unordered set of name/value pairs, so the order cannot be reserved. Why a user is unable to use data preview in the data flows?You should check permissions for custom role. There are multiple actions involved in the dataflow data preview. You start by checking network traffic while debugging on your browser. Please follow all of the actions, for details, please refer to Resource provider. In ADF, can I calculate value for a new column from existing column from mapping?You can use derive transformation in mapping data flow to create a new column on the logic you want. When creating a derived column, you can either generate a new column or update an existing one. In the Column textbox, enter in the column you are creating. To override an existing column in your schema, you can use the column dropdown. To build the derived column's expression, select on the Enter expression textbox. You can either start typing your expression or open up the expression builder to construct your logic. Why mapping data flow preview failing with Gateway timeout?Please try to use larger cluster and leverage the row limits in debug settings to a smaller value to reduce the size of debug output. How to parameterize column name in dataflow?Column name can be parameterized similar to other properties. Like in derived column customer can use $ColumnNameParam = toString(byName($myColumnNameParamInData)). These parameters can be passed from pipeline execution down to Data flows. The data flow advisory about TTL and costsThis troubleshoot document may help to resolve your issues: Mapping data flows performance and tuning guide-Time to live. Power Query Data WranglingWhat are the supported regions for data wrangling?Data factory is available in following regions. The Power Query feature is available in all data flow regions. If the feature is not available in your region, please check with support. What is the difference between mapping data flow and Power query activity (data wrangling)?Mapping data flows provide a way to transform data at scale without any coding required. You can design a data transformation job in the data flow canvas by constructing a series of transformations. Start with any number of source transformations followed by data transformation steps. Complete your data flow with a sink to land your results in a destination. Mapping data flow is great at mapping and transforming data with both known and unknown schemas in the sinks and sources. Power Query Data Wrangling allow you to do agile data preparation and exploration using the Power Query Online mashup editor at scale via spark execution. With the rise of data lakes sometimes you just need to explore a data set or create a dataset in the lake. You aren't mapping to a known target. Supported SQL TypesPower Query Data Wrangling supports the following data types in SQL. You will get a validation error for using a data type that isn't supported.
Next stepsFor step-by-step instructions to create a data factory, see the following tutorials:
FeedbackSubmit and view feedback for |