data

This module contains functions and procedures that make use of data (either Data Observatory or user-provided data) for their computations.

For the DATAOBS_ methods, users can find the details on where to locate the tables (and their qualified names) within the corresponding data warehouse by going to the "Access in" option in the subscription page in the Data Explorer, and picking the data warehouse where that data has been made available.

DATAOBS_ENRICH_GRID

DATAOBS_ENRICH_GRID(grid_type, input_query, input_index_column, variables, filters, output, source)

Description

This procedure enriches a query containing grid cell indexes of one of the supported types (H3, Quadbin) with data from the Data Observatory. The user must be subscribed to all the Data Observatory datasets involved in the enrichment. The cells must all have the same resolution.

If the enrich data is indexed by an H3 or Quadbin grid compatible with the input (same grid type and equal or greater resolution), then the enrichment will be performed much more efficiently by matching the index values rather than intersecting associated GEOGRAPHY elements.

As a result of this process, each input grid cell will be enriched with the data of the Data Observatory datasets that spatially intersect it. When the input cell intersects with more than one polygon, point, or line of the Data Observatory datasets, the data is aggregated using the aggregation methods specified.

Valid aggregation methods are:

  • SUM: It assumes the aggregated variable is an extensive property (e.g. population). Accordingly, the value corresponding to the Data Observatory feature intersected is weighted by the fraction of area or length intersected. If the Data Observatory features are points or grid index matching is used, then a simple sum is performed.

  • MIN: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM.

  • MAX: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM.

  • AVG: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM. However, a weighted average is computed, using the intersection areas or lengths as the weight. When the Data Observatory features are points or grid index matching is used, a simple average is computed.

  • COUNT It computes the number of Data Observatory features that contain the enrichment variable and are intersected by the input geography.

For other types of aggregation, the DATAOBS_ENRICH_GRID_RAW procedure can be used to obtain non-aggregated data that can be later applied to any desired custom aggregation.

If the enrichment of an input table needs to be repeated, please notice that dropping the added columns will generate problems in consecutive enrichments as Bigquery saves those columns during 7 days for time travel purposes. We recommend storing the original table columns in a temporal table, dropping the input table and then recreating the input table from the temporal table.

Input parameters

  • grid_type: STRING Type of grid: "h3" or "quadbin".

  • input_query: STRING query to be enriched (Standard SQL); this query must produce valid grid indexes for the selected grid type in a column of the proper type (STRING for H3, and INT64 for Quadbin). It can include additional columns with data associated with the grid cells that will be preserved. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • input_index_column: STRING name of a column in the query that contains the grid indexes.

  • variables: ARRAY<STRUCT<variable STRING, aggregation STRING>>. Variables of the Data Observatory that will be used to enrich the input polygons. For each variable, its slug and the aggregation method must be provided. Use 'default' to use the variable's default aggregation method. Valid aggregation methods are: SUM, AVG, MAX, MIN, COUNT. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation.

  • filters ARRAY<STRUCT<dataset STRING, expression STRING>>. Filters to be applied to the Data Observatory datasets used in the enrichment can be passed here. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • output: ARRAY<STRING>|NULL containing the name of an output table to store the results and optionally an SQL clause that can be used to partition it. The name of the output table should include project and dataset, e.g. ['project-id.dataset-id.table-name'] or ['project-id.dataset-id.table-name', 'PARTITION BY number']. If NULL the enrichment result is returned. When the output table is the same than the input, the input table will be enriched in place.

  • source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in 'project-id.dataset-id' format. If only the 'dataset-id' is included, it uses the project 'carto-data' by default.

Output

The output table will contain all the input columns provided in the input_query and one extra column for each variable in variables, named after its corresponding slug and including a suffix indicating the aggregation method used.

If a new output table is created, it will be clustered by the spatial index to optimize its performance when filtering data by it or using it to join to other grid tables. This is important to visualize the results in a map efficiently. If an SQL clause is included in the output parameter this optimization will not be performed.

Examples

CALL `carto-un`.carto.DATAOBS_ENRICH_GRID(
  'h3',
  R'''
  SELECT * FROM UNNEST(['8718496d8ffffff','873974865ffffff','87397486cffffff','8718496daffffff','873974861ffffff','8718496dbffffff','87397494bffffff','8718496ddffffff','873974864ffffff']) AS index
  ''',
  'index',
  [('population_14d9cf55', 'sum')],
  NULL,
  ['my-project.my-dataset.my-enriched-table'],
  'my-dataobs-project.my-dataobs-dataset'
)
-- The table 'my-project.my-dataset.my-enriched-table' will be created
-- with columns: index, population_14d9cf55_sum
CALL `carto-un`.carto.DATAOBS_ENRICH_GRID(
  'h3',
  'my-project.my-dataset.my-table',
  'index',
  [('population_14d9cf55', 'sum')],
  NULL,
  ['my-project.my-dataset.my-table'],
  'my-dataobs-project.my-dataobs-dataset'
)
-- The column population_14d9cf55_sum will be added to the table
-- 'my-project.my-dataset.my-table'.

DATAOBS_ENRICH_GRID_RAW

DATAOBS_ENRICH_GRID_RAW(grid_type, input_query, input_index_column, variables, filters, output, source)

Description

This procedure enriches a query containing grid cell indexes of one of the supported types (H3, Quadbin) with data from the Data Observatory. The user must be subscribed to all the Data Observatory datasets involved in the enrichment. The cells must all have the same resolution.

As a result of this process, each input grid cell will be enriched with the data from the Data Observatory datasets that spatially intersect it. The variable values corresponding to all intersecting Data Observatory features for a given input cell will be returned in an ARRAY column. When variables come from multiple Data Observatory geographies, one ARRAY column will be included for each source geography table. Data Observatory geography slugs are used for the names of these columns. Each array contains STRUCTs with one field for each variable (named after the variable slug) and additional measure fields __carto_intersection, __carto_total, __carto_dimension. See the output information for more details.

If the enrich data is indexed by an H3 or Quadbin grid compatible with the input (same grid type and equal or greater resolution), then the enrichment will be performed much more efficiently by matching the index values rather than intersecting associated GEOGRAPHY elements. In this case the additional measure fields are omitted.

If the enrichment of an input table needs to be repeated, please notice that dropping the added columns will generate problems in consecutive enrichments as Bigquery saves those columns during 7 days for time travel purposes. We recommend storing the original table columns in a temporal table, dropping the input table and then recreating the input table from the temporal table.

Input parameters

  • grid_type: STRING Type of grid: "h3" or "quadbind".

  • input_query: STRING query to be enriched (Standard SQL); this query must produce valid grid indexes for the selected grid type in a column of the proper type (STRING for H3, and INT64 for Quadbin). It can include additional columns with data associated with the grid cells that will be preserved. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • input_index_column: STRING name of a column in the query that contains the grid indexes.

  • variables: ARRAY<STRING>. Variables of the Data Observatory that will be used to enrich the input polygons. For each variable, its slug must be provided. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation.

  • filters ARRAY<STRUCT<dataset STRING, expression STRING>>. Filters to be applied to the Data Observatory datasets used in the enrichment can be passed here. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • output: ARRAY<STRING>|NULL containing the name of an output table to store the results and optionally an SQL clause that can be used to partition it. The name of the output table should include project and dataset, e.g. ['project-id.dataset-id.table-name'] or ['project-id.dataset-id.table-name', 'PARTITION BY number']. If NULL the enrichment result is returned. When the output table is the same than the input, the input table will be enriched in place.

  • source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in 'project-id.dataset-id' format. If only the 'dataset-id' is included, it uses the project 'carto-data' by default.

Output

The output table will contain all the input columns provided in the input_query and one extra ARRAY column for each Data Observatory geography containing enrichment variables, named after their corresponding geography slug. The array contains STRUCTs with one field for each variable, using the variable slug as the field name. Additional fields will be included with information about the intersection of the grid cell and the Data Observatory geographies unless the grid matching described above is performed.

  • __carto_dimension dimension of the Data Observatory geography: 2 for areas (polygons), 1 for lines, and 0 for points.

  • __carto_intersection area in square meters (for dimension = 2) or length in meters (for dimension = 1) of the intersection.

  • __carto_total area in square meters (for dimension = 2) or length in meters (for dimension = 1) of the Data Observatory feature.

If a new output table is created, it will be clustered by the spatial index to optimize its performance when filtering data by it or using it to join to other grid tables. This is important to visualize the results in a map efficiently. If an SQL clause is included in the output parameter this optimization will not be performed.

Examples

CALL `carto-un`.carto.DATAOBS_ENRICH_GRID_RAW(
  'h3',
  R'''
  SELECT * FROM UNNEST(['8718496d8ffffff','873974865ffffff','87397486cffffff','8718496daffffff','873974861ffffff','8718496dbffffff','87397494bffffff','8718496ddffffff','873974864ffffff']) AS index
  ''',
  'index',
   ['population_93405ad7'],
   NULL,
   ['my-project.my-dataset.my-enriched-table'],
   'my-dataobs-project.my-dataobs-dataset'
);
-- The table 'my-project.my-dataset.my-enriched-table' will be created
-- with columns: index, and wp_grid100m_10955184.
-- Column wp_grid100m_10955184 will have the fields population_93405ad7,
-- __carto_dimension, __carto_intersection and __carto_total.
CALL `carto-un`.carto.DATAOBS_ENRICH_GRID_RAW(
  'h3',
  'my-project.my-dataset.my-table',
  'index',
   ['population_93405ad7'],
   NULL,
   ['my-project.my-dataset.my-table'],
   'my-dataobs-project.my-dataset'
);
-- The column wp_grid100m_10955184 will be added to the table
-- 'my-project.my-dataset.my-table'.
-- Column wp_grid100m_10955184 will have the fields population_93405ad7,
-- __carto_dimension, __carto_intersection and __carto_total.

DATAOBS_ENRICH_GRID_WEIGHTED

DATAOBS_ENRICH_GRID_WEIGHTED(grid_type, input_query, input_index_column, variables, filters, weight_variable, custom_weight_query, do_weight_filters, output, source)

Description

This procedure enriches a query containing grid cell indexes of one of the supported types (H3, Quadbin) with data from the Data Observatory using a speficied feature as weight for the enrichment, which weights appropiately the intersection segments with regard to the total original segment. For example the attribution of the feature to each intersected segment results from the value of the weighted feature in the intersection segment over the total original segment. The user must be subscribed to all the Data Observatory datasets involved in the enrichment. The cells must all have the same resolution. The feature used for the "weights" can either come from Data Observatory or provided by the user.

If the enrich data is indexed by an H3 or Quadbin grid compatible with the input (same grid type and equal or greater resolution), then the enrichment will be performed much more efficiently by matching the index values rather than intersecting associated GEOGRAPHY elements.

As a result of this process, each input grid cell will be enriched with the data of the Data Observatory datasets that spatially intersect it, weighted accordingly by the specified feature . When the input cell intersects with more than one polygon, point, or line of the Data Observatory datasets, the data is aggregated using the aggregation methods specified in a weighted manner where the weights result from the specified feature.

Valid aggregation methods are:

  • SUM: It assumes the aggregated variable is an extensive property (e.g. population). Accordingly, the value corresponding to the Data Observatory feature intersected is weighted by the fraction of area or length intersected. If the Data Observatory features are points or grid index matching is used, then a simple sum is performed.

  • MIN: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM.

  • MAX: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM.

  • AVG: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM. However, a weighted average is computed, using the intersection areas or lengths as the weight. When the Data Observatory features are points or grid index matching is used, a simple average is computed.

  • COUNT It computes the number of Data Observatory features that contain the enrichment variable and are intersected by the input geography.

For other types of aggregation, the DATAOBS_ENRICH_GRID_RAW procedure can be used to obtain non-aggregated data that can be later applied to any desired custom aggregation.

If the enrichment of an input table needs to be repeated, please notice that dropping the added columns will generate problems in consecutive enrichments as Bigquery saves those columns during 7 days for time travel purposes. We recommend storing the original table columns in a temporal table, dropping the input table and then recreating the input table from the temporal table.

Input parameters

  • grid_type: STRING Type of grid: "h3" or "quadbin".

  • input_query: STRING query to be enriched (Standard SQL); this query must produce valid grid indexes for the selected grid type in a column of the proper type (STRING for H3, and INT64 for Quadbin). It can include additional columns with data associated with the grid cells that will be preserved. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • input_index_column: STRING name of a column in the query that contains the grid indexes.

  • variables: ARRAY<STRUCT<variable STRING, aggregation STRING>>. Variables of the Data Observatory that will be used to enrich the input polygons. For each variable, its slug and the aggregation method must be provided. Use 'default' to use the variable's default aggregation method. Valid aggregation methods are: SUM, AVG, MAX, MIN, COUNT. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation.

  • filters ARRAY<STRUCT<dataset STRING, expression STRING>>. Filters to be applied to the Data Observatory datasets used in the enrichment can be passed here. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • weight_variable: STRUCT<variable STRING, aggregation STRING> Variable that will be used to weight the intersections of the input polygons with the Data Observatory datasets polygons, lines, points. Its slug and the aggregation method must be provided. Valid aggregation methods are: SUM, AVG, MAX, MIN, COUNT. This variable can either originate from the Data Observatory or be provided by the user, and it is mandatory. If NULL then an error is raised.

  • custom_weight_query: STRING query that contains the custom variable to be used as weight together with a geography column geom. This field is compulsory when a custom variable is passed in weight_variable, otherwise this must be set to NULL.

  • do_weight_filters: STRUCT<dataset STRING, expression STRING> In case the weight_variable is from the DO, filters can be applied. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • output: ARRAY<STRING>|NULL containing the name of an output table to store the results and optionally an SQL clause that can be used to partition it. The name of the output table should include project and dataset, e.g. ['project-id.dataset-id.table-name'] or ['project-id.dataset-id.table-name', 'PARTITION BY number']. If NULL the enrichment result is returned. When the output table is the same than the input, the input table will be enriched in place.

  • source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in 'project-id.dataset-id' format. If only the 'dataset-id' is included, it uses the project 'carto-data' by default.

Output

The output table will contain all the input columns provided in the input_query and one extra column for each variable in variables, named after its corresponding slug and including a suffix indicating the aggregation method used.

If a new output table is created, it will be clustered by the spatial index to optimize its performance when filtering data by it or using it to join to other grid tables. This is important to visualize the results in a map efficiently. If an SQL clause is included in the output parameter this optimization will not be performed.

Examples

CALL `carto-un`.carto.DATAOBS_ENRICH_GRID_WEIGHTED(
  'h3',
  R'''
  SELECT * FROM UNNEST(['8718496d8ffffff','873974865ffffff','87397486cffffff','8718496daffffff','873974861ffffff','8718496dbffffff','87397494bffffff','8718496ddffffff','873974864ffffff']) AS index
  ''',
  'index',
  [('DI_PC_f7d891ba','avg'),('c02_05_gasto_m__23eb3ae8','avg')],
  NULL,
   ('population_1ff8e0d','sum'),
  NULL,
  NULL,
  ['my-project.my-dataset.my-enriched-table'],
  'my-dataobs-project.my-dataobs-dataset'
)
-- The table 'my-project.my-dataset.my-enriched-table' will be created
-- with columns: index, population_14d9cf55_sum
CALL `carto-un`.carto.DATAOBS_ENRICH_GRID_WEIGHTED(
  'h3',
  'my-project.my-dataset.my-table',
  'index',
  [('DI_PC_f7d891ba','avg'),('c02_05_gasto_m__23eb3ae8','avg')],
  NULL,
  ('population_1ff8e0d','sum'),
  NULL,
  NULL,
  ['my-project.my-dataset.my-table'],
  'my-dataobs-project.my-dataobs-dataset'
)
-- The column population_14d9cf55_sum will be added to the table
-- 'my-project.my-dataset.my-table'.
CALL `carto-un`.carto.DATAOBS_ENRICH_GRID_WEIGHTED(
  'h3',
  'my-project.my-dataset.my-table',
  'index',
  [('DI_PC_f7d891ba','avg'),('c02_05_gasto_m__23eb3ae8','avg')],
  NULL,
  ('population','sum'),
  '''
  SELECT geoid, geom, population FROM `my-project.my-dataset.custom_weight_query`
  ''',
  NULL,
  ['my-project.my-dataset.my-table'],
  'my-dataobs-project.my-dataobs-dataset'
)

DATAOBS_ENRICH_POINTS

DATAOBS_ENRICH_POINTS(input_query, input_geography_column, variables, filters, output, source)

Description

This procedure enriches a query containing geographic points with data from the Data Observatory. The user must be subscribed to all the Data Observatory datasets involved in the enrichment.

As a result of this process, each input point will be enriched with the data of the Data Observatory datasets that spatially intersect it. When the input point intersects with more than one polygon, point or line of the Data Observatory datasets, the data is aggregated using the aggregation methods specified.

Valid aggregation methods are: SUM, MIN, MAX, AVG, and COUNT.

For special types of aggregation, the DATAOBS_ENRICH_POINTS_RAW procedure can be used to obtain non-aggregated data that can be later applied to any desired custom aggregation.

If the enrichment of an input table needs to be repeated, please notice that dropping the added columns will generate problems in consecutive enrichments as Bigquery saves those columns during 7 days for time travel purposes. We recommend storing the original table columns in a temporal table, dropping the input table and then recreating the input table from the temporal table.

Input parameters

  • input_query: STRING query to be enriched (Standard SQL). A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • input_geography_column: STRING name of the GEOGRAPHY column in the query containing the points to be enriched.

  • variables: ARRAY<STRUCT<variable STRING, aggregation STRING>>. Variables of the Data Observatory that will be used to enrich the input polygons. For each variable, its slug and the aggregation method to be used must be provided. Use 'default' to use the variable's default aggregation method. Valid aggregation methods are: SUM, AVG, MAX, MIN, and COUNT. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation.

  • filters ARRAY<STRUCT<dataset STRING, expression STRING>>. Filters to be applied to the Data Observatory datasets used in the enrichment can be passed here. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • output: ARRAY<STRING>|NULL containing the name of an output table to store the results and optionally an SQL clause that can be used to partition it. The name of the output table should include project and dataset, e.g. ['project-id.dataset-id.table-name'] or ['project-id.dataset-id.table-name', 'PARTITION BY number']. If NULL the enrichment result is returned. When the output table is the same than then input, the input table will be enriched in place.

  • source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in 'project-id.dataset-id' format. If only the 'dataset-id' is included, it uses the project 'carto-data' by default.

Output

The output table will contain all the input columns provided in the input_query and one extra column for each variable in variables, named after its corresponding slug and including a suffix indicating the aggregation method used.

If a new output table is created, it will be clustered by the geography column to optimize the performance of spatial filters and joins. This is important to visualize the results in a map efficiently. If an SQL clause is included in the output parameter this optimization will not be performed.

Examples

CALL `carto-un`.carto.DATAOBS_ENRICH_POINTS(
   R'''
   SELECT id, geom FROM `my-project.my-dataset.my-table`
   ''',
   'geom',
   [('population_93405ad7', 'sum')],
   NULL,
   ['my-project.my-dataset.my-enriched-table'],
   'my-dataobs-project.my-dataobs-dataset'
);
-- The table 'my-project.my-dataset.my-enriched-table' will be created
-- with columns: id, geom, population_93405ad7_sum
CALL `carto-un`.carto.DATAOBS_ENRICH_POINTS(
   'my-project.my-dataset.my-table',
   'geom',
   [('population_93405ad7', 'sum')],
   NULL,
   ['my-project.my-dataset.my-table'],
   'my-dataobs-project.my-dataobs-dataset'
);
-- The column population_93405ad7_sum will be added to the table
-- 'my-project.my-dataset.my-table'.

DATAOBS_ENRICH_POINTS_RAW

DATAOBS_ENRICH_POINTS_RAW(input_query, input_geography_column, variables, filters, output, source)

Description

This procedure enriches a query containing geographic points with data from the Data Observatory. The user must be subscribed to all the Data Observatory datasets involved in the enrichment.

As a result of this process, each input point will be enriched with the data of the Data Observatory datasets that spatially intersect it. The variable values corresponding to all intersecting Data Observatory features for a given input point will be returned in an ARRAY column. When variables come from multiple Data Observatory geographies, one ARRAY column will be included for each source geography table. Data Observatory geography slugs are used for the names of these columns. Each array contains STRUCTs with one field for each variable (named after the variable slug) and additional measure fields __carto_total, __carto_dimension. See the output information for more details.

If the enrichment of an input table needs to be repeated, please notice that dropping the added columns will generate problems in consecutive enrichments as Bigquery saves those columns during 7 days for time travel purposes. We recommend storing the original table columns in a temporal table, dropping the input table and then recreating the input table from the temporal table.

Input parameters

  • input_query: STRING query to be enriched (Standard SQL). A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • input_geography_column: STRING name of the GEOGRAPHY column in the query containing the points to be enriched.

  • variables: ARRAY<STRING> of slugs (unique identifiers) of the Data Observatory variables to add to the input points. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation.

  • filters ARRAY<STRUCT<dataset STRING, expression STRING>>. Filters to be applied to the Data Observatory datasets used in the enrichment can be passed here. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • output: ARRAY<STRING>|NULL containing the name of an output table to store the results and optionally an SQL clause that can be used to partition it. The name of the output table should include project and dataset, e.g. ['project-id.dataset-id.table-name'] or ['project-id.dataset-id.table-name', 'PARTITION BY number']. If NULL the enrichment result is returned. When the output table is the same than then input, the input table will be enriched in place.

  • source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in 'project-id.dataset-id' format. If only the 'dataset-id' is included, it uses the project 'carto-data' by default.

Output

The output table will contain all the input columns provided in the input_query and one extra ARRAY column for each Data Observatory geography containing enrichment variables, named after their corresponding geography slug. The array contains STRUCTs with one field for each variable, using the variable slug as the field name. Additional fields will be included with information about the intersected enrichment geographies:

  • __carto_dimension dimension of the Data Observatory geography: 2 for areas (polygons), 1 for lines, and 0 for points.

  • __carto_total area in square meters (for dimension = 2) or length in meters (for dimension = 1) of the Data Observatory feature.

Moreover, another column named __carto_input_area will be added containing the area of the input polygon in square meters.

If a new output table is created, it will be clustered by the geography column to optimize the performance of spatial filters and joins. This is important to visualize the results in a map efficiently. If an SQL clause is included in the output parameter this optimization will not be performed.

Examples

CALL `carto-un`.carto.DATAOBS_ENRICH_POINTS_RAW(
   R'''
   SELECT id, geom FROM `my-project.my-dataset.my-table`
   ''',
   'geom',
   ['population_93405ad7'],
   NULL,
   ['`my-project.my-dataset.my-enriched-table`'],
   'my-dataobs-project.my-dataobs-dataset'
);
-- The table `my-project.my-dataset.my-enriched-table` will be created
-- with columns: id, geom, __carto_input_area and wp_grid100m_10955184.
-- Column wp_grid100m_10955184 will have the fields population_93405ad7, __carto_dimension and __carto_total.

Imagine that you need some information about your points of interest. We'll get population information from the Data Observatory at those points:

CALL `carto-un`.carto.DATAOBS_ENRICH_POLYGONS_RAW(
  R'''
    SELECT
      'Point1' AS name, ST_GEOGPOINT(1,42) AS geom
    UNION ALL
    SELECT
      'Point2', ST_GEOGPOINT(-2,40) AS geom
  ''',
  'geom',
  ['population_f9004c56'],
  NULL,
  ['my-project.my-dataset.enriched_points'],
  'my-dataobs-project.my-dataobs-dataset'
)

Now let's compute the average density of population at each location:

SELECT
  name,
  AVG(enrichment.population_f9004c56 / NULLIF(enrichment.__carto_total, 0) AS popdens_avg
FROM
  `my-project.my-dataset.enriched_points`, UNNEST(wp_grid1km_b16138c1) enrichment
GROUP BY name

Instead of creating a new table we could have modified the source table like this:

CALL `carto-un`.carto.DATAOBS_ENRICH_POINTS_RAW(
   'my-project.my-dataset.my-table',
   'geom',
   ['population_93405ad7'],
   NULL,
   ['my-project.my-dataset.my-table'],
   'my-dataobs-project.my-dataobs-dataset'
);
-- The columns __carto_input_area and wp_grid100m_10955184
-- will be added to the table 'my-project.my-dataset.my-table'.

DATAOBS_ENRICH_POLYGONS

DATAOBS_ENRICH_POLYGONS(input_query, input_geography_column, variables, filters, output, source)

Description

This procedure enriches a query containing geographic polygons with data from the Data Observatory. The user must be subscribed to all the Data Observatory datasets involved in the enrichment.

As a result of this process, each input polygon will be enriched with the data of the Data Observatory datasets that spatially intersect it. When the input polygon intersects with more than one polygon, point, or line of the Data Observatory datasets, the data is aggregated using the aggregation methods specified.

Valid aggregation methods are:

  • SUM: It assumes the aggregated variable is an extensive property (e.g. population). Accordingly, the value corresponding to the Data Observatory feature intersected is weighted by the fraction of area or length intersected. If the Data Observatory features are points, then a simple sum is performed.

  • MIN: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM.

  • MAX: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM.

  • AVG: It assumes the aggregated variable is an intensive property (e.g. temperature, population density). Thus, the value is not altered by the intersected area/length as it's the case for SUM. However, a weighted average is computed, using the intersection areas or lengths as the weight. When the Data Observatory features are points, a simple average is computed.

  • COUNT It computes the number of Data Observatory features that contain the enrichment variable and are intersected by the input geography.

For other types of aggregation, the DATAOBS_ENRICH_POLYGONS_RAW procedure can be used to obtain non-aggregated data that can be later applied any desired custom aggregation.

If the enrichment of an input table needs to be repeated, please notice that dropping the added columns will generate problems in consecutive enrichments as Bigquery saves those columns during 7 days for time travel purposes. We recommend storing the original table columns in a temporal table, dropping the input table and then recreating the input table from the temporal table.

Input parameters

  • input_query: STRING query to be enriched (Standard SQL). A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • input_geography_column: STRING name of the GEOGRAPHY column in the query containing the polygons to be enriched.

  • variables: ARRAY<STRUCT<variable STRING, aggregation STRING>>. Variables of the Data Observatory that will be used to enrich the input polygons. For each variable, its slug and the aggregation method to be used must be provided. Use 'default' to use the variable's default aggregation method. Valid aggregation methods are: SUM, AVG, MAX, MIN, and COUNT. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation.

  • filters ARRAY<STRUCT<dataset STRING, expression STRING>>. Filters to be applied to the Data Observatory datasets used in the enrichment can be passed here. Each filter is applied to the Data Observatory dataset or geography, identified by its corresponding slug, passed in the dataset field of the structure. The second field of the structure, expression, is an SQL expression that will be inserted in a WHERE clause and that can reference any column of the dataset or geography table. Please note that column names (not slugs) should be applied here. The catalog procedures DATAOBS_SUBSCRIPTIONS and DATAOBS_SUBSCRIPTION_VARIABLES can be used to find both the column names and the corresponding table slugs.

  • output: ARRAY<STRING>|NULL containing the name of an output table to store the results and optionally an SQL clause that can be used to partition it. The name of the output table should include project and dataset, e.g. ['project-id.dataset-id.table-name'] or ['project-id.dataset-id.table-name', 'PARTITION BY number']. If NULL the enrichment result is returned. When the output table is the same than then input, the input table will be enriched in place.

  • source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in 'project-id.dataset-id' format. If only the 'dataset-id' is included, it uses the project 'carto-data' by default.

Output

The output table will contain all the input columns provided in the input_query and one extra column for each variable in variables, named after its corresponding slug and including a suffix indicating the aggregation method used.

If a new output table is created, it will be clustered by the geography column to optimize the performance of spatial filters and joins. This is important to visualize the results in a map efficiently. If an SQL clause is included in the output parameter this optimization will not be performed.

Examples

CALL `carto-un`.carto.DATAOBS_ENRICH_POLYGONS(
   R'''
   SELECT id, geom FROM `my-project.my-dataset.my-table`
   ''',
   'geom',
   [('population_93405ad7', 'SUM')],
   NULL,
   ['`my-project.my-dataset.my-enriched-table`'],
   'my-dataobs-project.my-dataobs-dataset'
);
-- The table `my-project.my-dataset.my-enriched-table` will be created
-- with columns: id, geom, population_93405ad7_sum