Workflows best practices
As you will soon see, Workflows can range from simple, with only a few steps, to large and complex, with many steps, sections, and outputs. So what are the best ways to manage your workflows and ensure that they are useful, readable, and manageable? Below are our tips for making Workflows work for you.
Explore your data
Before you get started, and at each step of the process, make sure you explore your data to see what is in the dataset, which columns it contains, and their data types. This is important because:
You may have some data that is in the wrong data type, such as a string that should be a number
You may be missing a column
You might have extra columns that you don’t need
You might have data nested in arrays or JSON
You might have null or missing data that could cause issues later
There are certainly other issues you might encounter in your data, but understanding the structure and limits of your data early on will help prevent problems later in your workflow.
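If you want to check some of these things directly in the warehouse before building your workflow, a couple of quick queries go a long way. The following is a BigQuery-flavored sketch; the project, dataset, table, and column names are hypothetical:

-- List the column names and data types of a source table.
SELECT column_name, data_type
FROM my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'stores';

-- Spot-check a column for null values that could cause issues later.
SELECT COUNT(*) AS total_rows, COUNTIF(revenue IS NULL) AS null_revenue_rows
FROM my_project.my_dataset.stores;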
Prune columns
Similar to the above, we highly suggest removing any columns you do not need in your final analysis, and doing the same throughout your workflow. Most components add new columns rather than modifying the data from the previous step, since we cannot presume which columns you do or do not want to keep. As a result, more and more columns will accumulate as the Workflow progresses. We therefore recommend using the Select component at various stages to pick only the columns you need, keeping your data clean and easy to read.
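In SQL terms, this kind of column pruning is a simple projection. A minimal sketch, with hypothetical table and column names:

-- Keep only the columns needed for the rest of the workflow.
SELECT store_id, geom, revenue
FROM input_table;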
Casting to other data types
Many times you may need to cast your data to other data types. For example, a postal code could be stored as a numeric value rather than a text value. Make sure these are appropriately cast into:
The data format you want to keep
The data format required by the components you need to use
You can use the Cast component in Workflows to do this. Otherwise, you can cast using the Select component by selecting all columns using ‘*’ and then using the SQL CAST syntax, as follows: *, CAST(zipcode AS TEXT)
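Expanded into a full statement, that expression looks like the following. This sketch uses PostgreSQL/Redshift-style syntax (BigQuery uses STRING rather than TEXT), and the table name is hypothetical:

-- Keep every column and add a text-typed copy of zipcode.
SELECT *, CAST(zipcode AS TEXT) AS zipcode_text
FROM input_table;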
Preparing dates and timestamps
Similar to the point above, it is important to prepare dates and timestamps in the source data warehouse prior to running Workflows. Some functions only accept dates and others only accept timestamps, so ensuring each column is already in the expected type will help make things smoother.
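For example, if a source table stores its dates and timestamps as strings, the preparation step could look like this sketch (generic SQL syntax; the table and column names are hypothetical):

-- Cast string columns to proper DATE and TIMESTAMP types up front.
SELECT
  CAST(event_day AS DATE) AS event_day,
  CAST(event_time AS TIMESTAMP) AS event_time
FROM raw_events;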
Proper labeling
Using the labeling tools in Workflows can help you keep your Workflow organized and help others understand what is happening in each step. There are two different approaches to labeling your Workflows:
Label small sections individually across the Workflow canvas
Label larger parts of the Workflow and then add smaller sections as necessary
Doing this while you are constructing your Workflow will help when you revisit your work or share work with another user.
Multiple Workflows for large operations
As the size of your workflow increases, there may come a point when you need to break it into smaller sections. While it is not impossible to build large Workflows, depending on the complexity it may in fact be easier to split the process across several Workflows. This helps in cases such as:
If the source data changes or is inconsistent, it may break a part of your Workflow. Having smaller Workflows can help you identify the issue faster.
Smaller Workflows are likely to run faster
Optimize result tables
Once you have designed a workflow, it is common to optimize the result table by creating a cluster over one or more columns.
Cluster and index optimization physically and/or logically groups records with similar values in the cluster column, reducing the number of fetches or the access time through prefetching techniques.
These advanced options depend on the destination data warehouse and are available for any component that generates one or more output tables.
Users can optimize their data by following these simple steps:
Create a workflow with one or more components that generate an output table, and select one of those components.
Fill in all of the component’s parameters. This is necessary to generate the final output schema, which populates the dropdown menu for the optimization column.
Select the column to be used for the optimization. The mechanism will depend on the data warehouse:
For BigQuery: Cluster by
For Snowflake, two settings: Search optimization column and Cluster by
For Redshift: Cluster by
For PostgreSQL: Index column and Cluster by index
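To make these settings more concrete: on PostgreSQL, the Index column and Cluster by index options correspond roughly to the following plain SQL, which Workflows manages for you. The table, column, and index names here are hypothetical:

-- Create an index on the chosen column, then physically reorder the table by it.
CREATE INDEX stores_zipcode_idx ON stores (zipcode);
CLUSTER stores USING stores_zipcode_idx;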
BigQuery allows selecting a Cluster by column of any of the following types:
Geography
Geometry
Boolean
Number
String
Date
Datetime
Time
Timestamp
More info about BigQuery clustering can be found in the BigQuery documentation on clustered tables (https://cloud.google.com/bigquery/docs/clustered-tables).
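As a rough illustration of what the Cluster by option produces on BigQuery, the equivalent DDL looks like the following. This is a sketch only; the dataset, table, and column names are hypothetical:

-- Create a result table clustered by the selected column.
CREATE TABLE my_dataset.stores_clustered
CLUSTER BY zipcode
AS SELECT * FROM my_dataset.stores;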