Workflows best practices

As you will soon see, Workflows can range from simple, with only a few steps, to large and complex, with many steps, sections, and outputs. So what are the best ways to manage your Workflows and ensure that they are useful, readable, and manageable? Below are our tips for making Workflows work for you.

Explore your data

Before you get started, and at each step of the process, make sure you explore your data to see what is in the dataset, including its columns and data types. This is important because:

  • You may have some data that is in the wrong data type, such as a string that should be a number

  • You may be missing a column

  • You might have too many columns that you don’t need

  • You might have data nested in arrays or JSON

  • You might have null or missing data that could cause issues later

There are certainly other issues you might encounter in your data, but understanding the structure and limits of your data early on will help prevent problems later in your workflow.
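For example, a quick profiling query in the source data warehouse can surface type and completeness issues before you start building. The sketch below assumes a hypothetical table called my_schema.stores with a revenue column; the information_schema path also varies slightly between warehouses, so adapt the names to your own data.

    -- Inspect column names and data types
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'stores';

    -- Count rows and missing values in a column you plan to rely on
    SELECT
      COUNT(*) AS total_rows,
      COUNT(*) - COUNT(revenue) AS null_revenue_rows
    FROM my_schema.stores;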

Prune columns

Similar to the point above, we highly suggest removing any columns you do not need in your final analysis, and doing the same throughout your workflow. Most components add new columns rather than modifying the data from the previous step, because we cannot presume which columns you do or do not want to keep. As the Workflow progresses, more and more columns will therefore accumulate. We recommend using the Select component at various stages to pick the specific columns you need and keep your data clean and easy to read.
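As an illustration, the expression you give the Select component is simply a comma-separated list of the columns to keep (the column names below are hypothetical):

    store_id, store_name, geom, revenue

Any column not listed is dropped from that point onwards, which keeps downstream components and previews easier to read.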

Casting to other data types

Many times you may need to cast your data to other data types. For example, a postal code could be stored as a numeric value rather than a text value. Make sure these are appropriately cast into:

  • The data format you want to keep

  • The data format required by the components you need to use

You can use the Cast component in Workflows to do this. Alternatively, you can cast within the Select component by selecting all columns using ‘*’ and then applying the SQL CAST syntax, as follows:

    *, CAST(zipcode AS TEXT)
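Note that the exact type names depend on the destination data warehouse. As a rough sketch, the equivalent standalone query might look like the following (table and column names are hypothetical; the EXCEPT clause and STRING type are BigQuery syntax, so on other warehouses you would list the remaining columns explicitly and use TEXT or VARCHAR instead):

    -- Replace the numeric zipcode column with a text version, keeping every other column
    SELECT
      * EXCEPT (zipcode),
      CAST(zipcode AS STRING) AS zipcode
    FROM my_schema.stores;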

Preparing dates and timestamps

Similar to the point above, it is important to prepare dates or timestamps in the source data warehouse prior to running Workflows. Some functions only accept dates and others only accept timestamps, so ensuring you have the right data type will help make things smoother.
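For example, a timestamp column can usually be reduced to a plain date, or a text column parsed into a proper date, with a simple cast in the source warehouse before the Workflow runs. The table and column names below are hypothetical and the exact functions vary slightly between warehouses:

    -- Turn a timestamp into a plain date
    SELECT CAST(event_ts AS DATE) AS event_date
    FROM my_schema.events;

    -- Parse an ISO-formatted text column such as '2024-03-15' into a date
    SELECT CAST(event_text AS DATE) AS event_date
    FROM my_schema.events;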

Proper labeling

Using the labeling tools in Workflows can help you keep your Workflow organized and help others understand what is happening in each step. There are two different approaches to labeling your Workflows:

  • Label small sections individually across the Workflow canvas

  • Label larger parts of the Workflow and then add smaller sections as necessary

Doing this while you are constructing your Workflow will help when you revisit your work or share it with another user.

Multiple Workflows for large operations

As the size of your Workflow increases, there may come a point when you need to break it into smaller sections. While it is not impossible to build large Workflows, depending on the complexity it may in fact be easier to split the work into several smaller Workflows. This helps in cases such as:

  • If the source data changes and is inconsistent, it may break a part of your Workflow. Having smaller Workflows can help you identify the issue faster.

  • Smaller Workflows are likely to run faster

Optimize result tables

Once you have designed a Workflow, it is common to want to optimize the result table by creating a cluster over one or more columns.

Cluster and index optimization physically and/or logically co-locates records with similar values (the cluster column) to reduce the number of fetches or the access time through prefetching techniques.

These advanced options depend on the destination data warehouse and are available for any component that generates one or more output tables.

Users can optimize their data by following these simple steps:

  1. Create a Workflow with one or more components that generate an output table, and select this component.

  2. Fill in all of the component’s parameters. This is necessary to generate the final output schema, which populates the dropdown menu for the optimization column.

  3. Select the column to be used for the optimization. The mechanism will depend on the data warehouse (see the sketch after this list):

    • For BigQuery, Cluster by

    • For Snowflake, two settings: Search optimization column and Cluster by

    • For Redshift, Cluster by

    • For PostgreSQL, Index column and Cluster by index
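Under the hood these settings map to the warehouse’s native clustering and indexing features. As a rough illustration, on PostgreSQL the equivalent statements for a result table would look something like the sketch below (the schema, table, index, and column names are hypothetical, and the actual SQL generated by the component may differ):

    -- Create an index on the chosen column, then physically reorder the table along it
    CREATE INDEX idx_result_zipcode ON workflows_temp.result_table (zipcode);
    CLUSTER workflows_temp.result_table USING idx_result_zipcode;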

BigQuery allows selecting a Cluster by column with any of the following types:

  • Geography

  • Geometry

  • Boolean

  • Number

  • String

  • Date

  • Datetime

  • Time

  • Timestamp

More information about BigQuery clustering is available in the BigQuery documentation.
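For reference, a clustered result table in BigQuery is created with a CLUSTER BY clause. A minimal sketch, assuming a hypothetical project, dataset, and source table, looks like this:

    -- BigQuery: create the result table clustered by zipcode
    CREATE TABLE my_project.my_dataset.result_table
    CLUSTER BY zipcode
    AS
    SELECT *
    FROM my_project.my_dataset.source_table;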
