Temporary data in Workflows
To improve performance and to let you inspect the results of intermediate steps, Workflows makes use of temporary data objects that are stored, by default, under a so-called workflows_temp schema/dataset in your data warehouse.
Note that at the time of creating the connection to your data warehouse, you can also provide a custom location for storing the temporary data generated by Workflows.
This system of intermediate tables works as a cache: when you re-run a workflow, steps that have not been modified (and whose inputs have not changed) are not executed again; only the steps that have changed, or whose inputs have changed, are re-executed.
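Because these intermediate tables are regular tables in your data warehouse, you can inspect them directly. As a minimal sketch, assuming a BigQuery connection and the default workflows_temp dataset in a hypothetical project called my-project:

-- List the intermediate tables currently cached by Workflows, newest first.
SELECT table_name, creation_time
FROM `my-project.workflows_temp.INFORMATION_SCHEMA.TABLES`
ORDER BY creation_time DESC;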
Specific permissions required by Workflows in your data warehouse
In order for CARTO to execute workflows in your data warehouse and temporarily store the data generated by the intermediate steps of your analysis, a specific set of permissions is required on the connection selected when creating your workflow.
CARTO Workflows can be used with different providers (Google BigQuery, Snowflake, Redshift, and PostgreSQL), and the exact terminology differs for each data warehouse. For BigQuery, the recommended setup is:
The service account or user account used for creating the connection must have the following required roles:
BigQuery Data Editor
BigQuery Job User
To learn more details about data warehouse connection permissions, please check here.
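As an illustration only (the linked documentation remains the reference), on BigQuery the Data Editor role can also be granted on just the dataset used for temporary data, while BigQuery Job User has to be granted at the project level through IAM. In the hypothetical sketch below, my-project and the service account are placeholders:

-- Grant BigQuery Data Editor on the workflows_temp dataset to the connection's service account.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.workflows_temp`
TO "serviceAccount:carto-connection@my-project.iam.gserviceaccount.com";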
How to manage the temporary data generated by Workflows
When your workflow runs in BigQuery (or the CARTO Data Warehouse), these tables are temporary and will be automatically removed after 30 days. However, on the rest of the data warehouse platforms, a specific configuration is currently needed on the user's side in order to clean up the intermediate tables generated by CARTO Workflows.
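For reference, this kind of automatic removal is what a default table expiration on a BigQuery dataset provides. A minimal sketch, assuming you manage a workflows_temp dataset yourself in a hypothetical project called my-project (the expiration only applies to tables created after the statement runs):

-- Make new tables in workflows_temp expire automatically after 30 days.
ALTER SCHEMA `my-project.workflows_temp`
SET OPTIONS (default_table_expiration_days = 30);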
Below, you can find code and instructions for automating the process of cleaning up the intermediate tables in each data warehouse:
The code shown below adds a task that regularly removes all tables in the WORKFLOWS_TEMP schema of $database$ that are older than 30 days. Please replace the $database$ placeholder, everywhere it appears, with the name of the database where you are running CARTO Workflows. You will also have to replace the $warehouse$ placeholder with the name of your warehouse.
Note that EXECUTE IMMEDIATE $$ ... $$ is only needed if you use Snowflake's Classic Console; you can omit it in the newer Snowsight interface.
The code uses Snowflake tasks to execute the clean-up periodically (every day at 00:00 UTC). There are some useful examples here.
EXECUTE IMMEDIATE
$$
CREATE TASK IF NOT EXISTS $database$.WORKFLOWS_TEMP.CLEAR_CACHE
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
  WAREHOUSE = $warehouse$
AS
DECLARE
  statements STRING;
BEGIN
  -- Build one DROP TABLE statement per table in WORKFLOWS_TEMP created more than 30 days ago.
  LET res RESULTSET := (EXECUTE IMMEDIATE '
    SELECT
      LISTAGG(''DROP TABLE '' || TABLE_CATALOG || ''.'' || TABLE_SCHEMA || ''.'' || TABLE_NAME || '';'')
    FROM
      $database$.INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = ''WORKFLOWS_TEMP''
      AND CREATED < DATEADD(''day'', -30, CURRENT_TIMESTAMP)
  ');
  -- Fetch the concatenated statements and run them as a single anonymous block.
  LET cur_res CURSOR FOR res;
  OPEN cur_res;
  FETCH cur_res INTO statements;
  EXECUTE IMMEDIATE 'BEGIN\n' || :statements || '\nEND';
END;
$$
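Keep in mind that Snowflake creates tasks in a suspended state, so the task will not run on its schedule until it is resumed; the role doing so needs the global EXECUTE TASK privilege in addition to ownership of, or the OPERATE privilege on, the task. Assuming the task name used in the snippet above:

-- Start the scheduled clean-up task (newly created tasks are suspended by default).
ALTER TASK $database$.WORKFLOWS_TEMP.CLEAR_CACHE RESUME;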
Temporary tables and cache limitation
If a temporary table has been generated as the output of one of the workflow's components, it will not be created again when the workflow is re-run later, unless the structure of the workflow has changed or the parameters used to configure it have changed; that is, unless the workflow itself causes the table to be different and it needs to be updated.
However, if one of the source tables used in the workflow (tables not generated by the workflow, but already existing in the database) changes and you re-run the workflow, intermediate tables might not be correctly updated.
If your workflow uses a BigQuery or Snowflake connection, Workflows will detect those changes and update all intermediate tables accordingly. For Redshift and PostgreSQL, however, such a mechanism is not implemented: if you change your source tables, you will have to manually drop the intermediate tables before re-running your workflow (see the sketch below); otherwise, those intermediate tables will not be updated.
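As a minimal sketch of that manual clean-up for PostgreSQL, assuming the default workflows_temp schema (a similar approach works on Redshift using its own procedural syntax):

-- Drop every cached intermediate table in workflows_temp so that it is
-- regenerated the next time the workflow runs.
DO $$
DECLARE
  tbl RECORD;
BEGIN
  FOR tbl IN
    SELECT tablename FROM pg_tables WHERE schemaname = 'workflows_temp'
  LOOP
    EXECUTE format('DROP TABLE IF EXISTS workflows_temp.%I', tbl.tablename);
  END LOOP;
END;
$$;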