Links

How to optimize tables

It is usual that once you have designed a workflow, you would like to optimize the result table creating a cluster over one or more columns.
Cluster and index optimization is the way to physically and/or logically locate records with similar relations (cluster column) to reduce the number of fetches or access time with prefetching technics.
These advanced options depend on destination data warehouse and are available for any component that generates one o more output tables.
Users can optimize their data by following these simple steps:
  1. 1.
    Create a model with one or more component generating an output table and select this component e.g. as in the model below
  1. 2.
    Fill all component parameters. This is necessary to allow final output schema generation that allow to fill up the dropdown menu of the optimization column.
  2. 3.
    Select the column among that listed in the pull down named depending on datawareouse used. e.g:
    • if BigQuery then Cluster by
    • if SnowFlake then two pull down named Search optimization column and Cluster by
    • if Redshift then Cluster by
    • if Postgres then two pull down named Index column and Cluster by index
  1. 4.
    If a column is selected, the user can check that the clustering or indexing action is added into the generated SQL. Note that generated string would vary depending on data warehouse.
BigQuery
Snowflake
Redshift
PostgreSQL
BigQuery allow select Cluster by column that have the following types:
  • Geography
  • Geometry
  • Boolean
  • Number
  • String
  • Date
  • Datetime
  • Time
  • Timestamp
More info about BigQuery clustering here.
Snowflake allow select Search optimization column that have the following types:
  • Geography
  • Geometry
This restriction is because ST_GEOHASH based ordering is used. More info about Snowflake ST_GEOHASH here and the applied general search optimization here.
Whilst Cluster by column can be selected among that with the following types:
  • Boolean
  • Number
  • String
More info about Snowflake Cluster optimization here.
Redshift allow select Cluster by column as SORTKEY that have the following types:
  • Boolean
  • Number
  • String
More info about Redshift SORTKEY optimization here.
Postgres allow select Index column that have the following types:
  • Geography
  • Geometry
  • Boolean
  • Number
  • String
on which create GIST index. More about Indexes here.
Whilst Cluster by index column can be selected among any types. More info about Posgtgres Cluster optimization here.