Temporary data in Workflows

To improve performance and to let you inspect the results of intermediate steps, Workflows makes use of temporary data objects that are stored, by default, in a so-called workflows_temp schema/dataset in your data warehouse.

Note that at the time of creating the connection to your data warehouse, you can also provide a custom location for storing the temporary data generated by Workflows.

This system of intermediate tables functions as a cache: when you re-run a workflow, steps that have not been modified and whose inputs have not changed are not re-executed; only the steps that have changed, or whose inputs have changed, are executed again.

Specific permissions required by Workflows in your data warehouse

For CARTO to execute workflows in your data warehouse and temporarily store the data generated in intermediate steps of your analysis, a specific set of permissions is required on the connection selected when creating your workflow.

Since CARTO Workflows can be used with different providers (Google BigQuery, Snowflake, Amazon Redshift, and PostgreSQL), the exact terminology varies per data warehouse. For BigQuery, the recommended setup is:

The service account or user account used to create the connection must have the following roles:

  • BigQuery Data Editor

  • BigQuery User

If the workflows_temp dataset does not exist, it will be created automatically in the region of the tenant (US, EU, AS, AU). To use a custom region for the workflows_temp dataset, manually create the dataset in the desired region and set its name (if different from workflows_temp) under BigQuery connection > Advanced options.
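For example, the dataset can be created in a specific region with BigQuery DDL. This is a minimal sketch; my-project is a hypothetical project ID and europe-west1 an example location, so replace both with your own values:

```sql
-- Create the dataset used by Workflows for temporary data in a custom region.
-- Replace `my-project` with your Google Cloud project ID and pick your location.
CREATE SCHEMA IF NOT EXISTS `my-project.workflows_temp`
OPTIONS (
  location = 'europe-west1'
);
```

Once the dataset exists in the desired region, Workflows will use it instead of creating one in the tenant's default region.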

For more details regarding data warehouse connection permissions, please check here.

How to manage the temporary data generated by Workflows

When your workflow runs, it stores the output of each step in an intermediate table. Depending on the data warehouse platform, a different strategy is applied to clean up the intermediate tables generated by CARTO Workflows. By default, intermediate tables are removed after 30 days.

Below, you can find code and instructions for automating the process of cleaning up the intermediate tables in each data warehouse:

When your workflow runs in BigQuery or CARTO Data Warehouse, these tables are temporary and will be automatically removed after 30 days.
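This 30-day retention corresponds to a default table expiration on the dataset. If you need to adjust the retention period, a sketch of how it could be done with BigQuery DDL follows; my-project is a hypothetical project ID:

```sql
-- Set a default expiration so that new tables created in workflows_temp
-- are automatically deleted 30 days after creation.
ALTER SCHEMA `my-project.workflows_temp`
SET OPTIONS (
  default_table_expiration_days = 30
);
```

Note that this option only affects tables created after it is set; existing tables keep their current expiration.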

You can find more information about temporary tables in the BigQuery documentation.
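PostgreSQL and Redshift do not expire tables automatically, so intermediate tables must be cleaned up explicitly. A minimal PostgreSQL sketch is shown below; it drops every table in the workflows_temp schema, so run it (for example on a cron or pg_cron schedule) only when no workflows are being edited or executed:

```sql
-- PostgreSQL: drop all intermediate tables in the workflows_temp schema.
DO $$
DECLARE
  t record;
BEGIN
  FOR t IN
    SELECT tablename FROM pg_tables WHERE schemaname = 'workflows_temp'
  LOOP
    EXECUTE format('DROP TABLE IF EXISTS workflows_temp.%I CASCADE', t.tablename);
  END LOOP;
END $$;
```

Because this removes the cache, the next run of each workflow will recompute all of its steps from scratch.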

Cache options

When executing workflows from the UI, intermediate tables are created for each node in the workflow. Depending on the connection used for your workflow, these intermediate tables will be reused, avoiding recreation when the workflow structure or settings have not changed.

In order to control this behavior, you will find the Cache Settings menu, next to the 'Run' button:

When enabled (the default option), Workflows will try to reuse previous successful executions of the workflow and its intermediate steps, with some differences depending on the data warehouse:

  • CARTO Data Warehouse, BigQuery and Snowflake: Intermediate and result tables will be reused as long as the workflow structure has not changed.

    • This setting applies to executions from the UI, triggered by clicking on the 'Run' button.

    • This setting also applies to scheduled workflows (beta).

    • For workflows executed via API call, the Output table will be reused between API calls that have the exact same parameter values. If parameters are changed, the output table will be recomputed.

  • Redshift and PostgreSQL: Intermediate tables are never reused. This means that all nodes in a workflow are always computed completely.

    • This setting doesn't have an effect on executions from the UI, triggered by clicking on the 'Run' button.

    • This setting also doesn't have an effect on Scheduled executions (only available for PostgreSQL connections).

    • For workflows executed via API call, the Output table will be reused between API calls that have the exact same parameter values. If parameters are changed, the output table will be recomputed.

When disabled, all intermediate and result tables will always be recomputed in all execution modes (UI, Schedule, and API call), regardless of updates to source tables or parameter values.
