Temporary data in Workflows
To improve performance and to let you inspect the results of intermediate steps, Workflows makes use of temporary data objects that are stored, by default, under a so-called workflows_temp schema/dataset in your data warehouse. Note that at the time of creating the connection to your data warehouse, you can also provide a custom location for storing the temporary data generated by Workflows.
This system of intermediate tables functions as a cache, avoiding the re-execution of all the steps in a workflow when they have not been modified: when you re-run a workflow, only the steps that have changed, or whose inputs have changed, are executed again.
In order for CARTO to be able to execute workflows in your data warehouse and temporarily store the data generated in intermediate steps of your analysis, there is a specific set of permissions required on the connection selected when creating your workflow.
Since CARTO Workflows can be used with different providers (Google BigQuery, Snowflake, Amazon Redshift, and PostgreSQL), each with its own terminology, the recommended setup for each data warehouse is described below.

Google BigQuery
The service account or user account used for creating the connection must have the following required roles:
- BigQuery Data Editor
- BigQuery Job User
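If you prefer to manage permissions with SQL once the workflows_temp dataset exists, the Data Editor role can also be granted at the dataset level using BigQuery's DCL statements; BigQuery Job User is a project-level role and still needs to be granted through IAM. A minimal sketch with hypothetical project, dataset, and service account names:

```sql
-- Hypothetical names; adjust the project, dataset and service account to your setup.
-- BigQuery Job User cannot be granted at dataset level and is assumed to be
-- granted on the project through IAM.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.workflows_temp`
TO "serviceAccount:carto-connection@my-project.iam.gserviceaccount.com";
```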
Snowflake

```sql
GRANT USAGE ON DATABASE DATABASE_NAME TO ROLE ROLE_NAME;
GRANT CREATE SCHEMA ON DATABASE DATABASE_NAME TO ROLE ROLE_NAME;

-- If workflows_temp schema was created by another user
GRANT ALL PRIVILEGES ON SCHEMA DATABASE_NAME.WORKFLOWS_TEMP TO ROLE ROLE_NAME;
```
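As a quick check after running the grants, you can list what the connection role has access to; a sketch with hypothetical CARTO_DB and CARTO_ROLE names (the second statement only works once the WORKFLOWS_TEMP schema exists):

```sql
-- Hypothetical names: database CARTO_DB, role CARTO_ROLE.
SHOW GRANTS TO ROLE CARTO_ROLE;                -- privileges held by the role
SHOW GRANTS ON SCHEMA CARTO_DB.WORKFLOWS_TEMP; -- privileges on the temp schema
```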
Amazon Redshift

```sql
GRANT CREATE ON DATABASE database_name TO user_name;

-- If the workflows_temp schema was created by another user
GRANT ALL ON SCHEMA workflows_temp TO user_name;
```
PostgreSQL

```sql
GRANT CREATE ON DATABASE database_name TO user_name;

-- If the schema was created by another user
GRANT ALL ON SCHEMA workflows_temp TO user_name;
```
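To verify the setup on PostgreSQL (the same functions also exist in Redshift), you can check the privileges of the connection user; a sketch assuming a hypothetical carto_user:

```sql
-- Hypothetical user name; the schema check is only meaningful once
-- the workflows_temp schema exists.
SELECT has_database_privilege('carto_user', current_database(), 'CREATE');
SELECT has_schema_privilege('carto_user', 'workflows_temp', 'CREATE');
```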
When your workflow runs in BigQuery (or the CARTO Data Warehouse), these tables are temporary and will be automatically removed after 30 days. However, on the rest of the data warehouse platforms, a specific configuration is currently needed on the user's side in order to clean up the intermediate tables generated by CARTO Workflows.
Below, you can find code and instructions for automating the process of cleaning up the intermediate tables in each data warehouse:
Snowflake
The code shown below will add a task that regularly removes all tables in the WORKFLOWS_TEMP schema of $database$ that are older than 30 days (please replace this placeholder with the name of the database where you are running CARTO Workflows, everywhere it appears). You will also have to replace the $warehouse$ placeholder with the name of your warehouse.

Note that EXECUTE IMMEDIATE $$ ... $$ is only needed if you use Snowflake's Classic Console; you can omit it in the newer Snowsight interface.

The code uses Snowflake tasks to execute the clean-up periodically (every day at 0h UTC). The Snowflake documentation on tasks includes some useful examples.
```sql
EXECUTE IMMEDIATE
$$
CREATE TASK IF NOT EXISTS $database$.WORKFLOWS_TEMP.CLEAR_CACHE
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
  WAREHOUSE = $warehouse$
  AS
    DECLARE
      statements STRING;
    BEGIN
      LET res resultset := (EXECUTE IMMEDIATE '
        SELECT
          LISTAGG(''DROP TABLE '' || TABLE_CATALOG || ''.'' || TABLE_SCHEMA || ''.'' || TABLE_NAME || '';'')
        FROM
          $database$.INFORMATION_SCHEMA.TABLES
        WHERE TABLE_SCHEMA = ''WORKFLOWS_TEMP''
          AND CREATED < DATEADD(''day'', -30, CURRENT_TIMESTAMP)
      ');
      LET cur_res CURSOR FOR res;
      OPEN cur_res;
      FETCH cur_res INTO statements;
      EXECUTE IMMEDIATE 'BEGIN\n' || :statements || '\nEND';
    END;
$$
```
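Note that Snowflake creates tasks in a suspended state, so after creating CLEAR_CACHE it typically has to be resumed before the daily schedule takes effect (this requires the account-level EXECUTE TASK privilege, or an appropriate role):

```sql
-- Newly created tasks are suspended by default; resume it so the schedule runs.
ALTER TASK $database$.WORKFLOWS_TEMP.CLEAR_CACHE RESUME;

-- Optional: check the task state and schedule.
SHOW TASKS IN SCHEMA $database$.WORKFLOWS_TEMP;
```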
Redshift

In order to remove the intermediate tables, you should create a scheduled query in Redshift that removes tables older than 30 days.

Scheduled queries are only available for provisioned Redshift clusters. If you are using Redshift Serverless, you will have to execute the clean-up manually (e.g. by calling the procedure defined below) or find other external means of executing it periodically.

The following procedure should be created in your Redshift cluster to delete tables older than 30 days (you will have to replace all occurrences of the $database$ placeholder with the name of your database):

```sql
CREATE OR REPLACE PROCEDURE $database$.workflows_temp._clear_cache_fn() as
$$
DECLARE
  statement RECORD;
  query VARCHAR(MAX);
BEGIN
  -- Build one DROP TABLE statement per intermediate table older than 30 days.
  query := 'SELECT
    \'DROP TABLE $database$.workflows_temp.\' || relname || \';\' as statement
  FROM
    pg_class_info
  LEFT JOIN
    pg_namespace
  ON
    pg_class_info.relnamespace = pg_namespace.oid
  WHERE
    reltype != 0
    AND TRIM(nspname) = \'workflows_temp\'
    AND datediff(day, relcreationtime, current_date) > 30
  ';
  -- Execute each generated DROP statement.
  FOR statement IN EXECUTE query
  LOOP
    EXECUTE statement.statement;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
```
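Before scheduling it, you may want to preview which tables the procedure would drop; the query below reuses the same filter that the procedure applies internally (run it while connected to the database that replaces $database$):

```sql
-- Preview the intermediate tables that _clear_cache_fn() would drop.
SELECT relname, relcreationtime
FROM pg_class_info
LEFT JOIN pg_namespace ON pg_class_info.relnamespace = pg_namespace.oid
WHERE reltype != 0
  AND TRIM(nspname) = 'workflows_temp'
  AND datediff(day, relcreationtime, current_date) > 30;
```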
After that, you should define a Redshift scheduled query with the following CALL to execute the previous procedure once per day:

```sql
CALL $database$.workflows_temp._clear_cache_fn();
```
PostgreSQL

To remove the intermediate tables in PostgreSQL, you need to schedule a daily job using the pg_cron extension to execute the following code periodically.

Here are some resources to learn how to use pg_cron on different managed services based on PostgreSQL:

- Azure Database for PostgreSQL supports pg_cron since version 11 and provides some usage examples.
The code shown below must be scheduled in a job, replacing all occurrences of the $database$ placeholder with the name of your database:

```sql
CREATE OR REPLACE PROCEDURE $database$.workflows_temp._clear_cache_fn()
LANGUAGE plpgsql
AS
$$
DECLARE
  drops TEXT;
BEGIN
  -- For every expired entry, build a DROP TABLE statement plus a DELETE that
  -- removes its row from the table_creation_time registry (the two adjacent
  -- string literals on consecutive lines are concatenated by PostgreSQL).
  SELECT
    STRING_AGG(
      'DROP TABLE IF EXISTS ' || tablename::text || ';'
      'DELETE FROM $database$.workflows_temp.table_creation_time WHERE tablename = ''' || tablename::text || '''',
      ';'
    ) drops
  INTO
    drops
  FROM
    $database$.workflows_temp.table_creation_time
  WHERE
    tablename LIKE '%workflows_temp.workflow\_%' AND
    EXTRACT(epoch FROM current_timestamp - "timestamp")/86400 > 30;
  IF drops IS NOT NULL THEN
    EXECUTE drops;
  END IF;
END;
$$;
```
Before scheduling a task to execute the procedure, you'll need to have pg_cron installed and enabled on your cluster. After following the instructions for your database provider, and if you have not done so as part of those instructions, you'll need to execute the following in the postgres database of your cluster:

```sql
CREATE EXTENSION pg_cron;
```

This requires superuser privileges, so you may need to request it from your database administrator.

For regular users to be able to schedule jobs, a superuser will also need to run the following in the postgres database (replacing the $user$ placeholder first):

```sql
GRANT USAGE ON SCHEMA cron TO $user$;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA cron TO $user$;
```
After that, you should set up a scheduled task that runs the procedure once per day, using the pg_cron extension and the following statement in the postgres database of your cluster:

```sql
SELECT cron.schedule_in_database('workflows-temp-cleanup', '0 0 * * *', $$CALL $database$.workflows_temp._clear_cache_fn()$$, '$database$');
```
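You can confirm that the job has been registered, and later inspect its executions, by querying pg_cron's metadata tables in the postgres database; a minimal check (cron.job_run_details is available in recent pg_cron versions):

```sql
-- Jobs currently registered with pg_cron.
SELECT jobid, jobname, schedule, database, command, active
FROM cron.job;

-- Recent executions and their status.
SELECT jobid, status, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;
```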
To remove the schedule, this can be executed in the postgres database:

```sql
SELECT cron.unschedule('workflows-temp-cleanup');
```
If a temporary table has been generated as the output of one of a workflow's components, it will not be created again when the workflow is re-run later, unless the structure of the workflow or the parameters used to configure it have changed; that is, unless the workflow itself causes the table to be different and updating it is needed.
However, if one of the source tables used in the workflow (tables not generated by the workflows, but already existing in the database) changes and you re-run the workflow, intermediate tables might not be correctly updated.
If your workflow uses a BigQuery or Snowflake connection, the workflow will detect those changes and update all intermediate tables accordingly. For Redshift and PostgreSQL, however, such a mechanism is not implemented. If you change your source tables, you will have to manually drop the intermediate tables before re-running your workflow. Otherwise, those intermediate tables will not be updated.
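As a reference, this is a minimal sketch of how you could generate the DROP statements for those intermediate tables manually in PostgreSQL, assuming they follow the workflow_ naming pattern used by the clean-up procedure above; review the generated statements before running them:

```sql
-- Generate DROP statements for Workflows' intermediate tables.
-- Assumes the workflow_ table name prefix used by the clean-up procedure above.
SELECT 'DROP TABLE IF EXISTS workflows_temp.' || tablename || ';'
FROM pg_tables
WHERE schemaname = 'workflows_temp'
  AND tablename LIKE 'workflow\_%';
```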