Links

retail

ADVANCED BETA
This module contains procedures to perform analysis to solve specific retail analytics use cases, such as revenue prediction.

BUILD_REVENUE_MODEL_DATA

BUILD_REVENUE_MODEL_DATA(stores_query, stores_variables, competitors_query, aoi_query, grid_type, grid_level, kring, decay, do_variables, do_source, custom_variables, custom_query, output_prefix)
Description
This procedure is the first step of the Revenue Prediction analysis workflow. It prepares the model data to be used in the training and prediction phases by performing the following steps:
  1. 1.
    Polyfill the geometry from the area of interest using the grid type and resolution level.
  2. 2.
    Enrich the grid cells with the revenue, stores, Data Observatory (DO) variables and custom variables.
  3. 3.
    Apply a k-ring decay function to the enriched DO variables and custom variables. This operation smooths the features for a given cell by taking into account the values of these features in the neighboring cells (defined as those within the specified k-ring size), applying a scaling factor determined by the decay function of choice.
  4. 4.
    Create the revenue model_data table (see the output description for more details).
  5. 5.
    Create the revenue model_data_stats table (see the output description for more details).
Input parameters
  • stores_query: STRING query with variables related to the stores to be used in the model, including their revenue per store (required) and other variables (optional). It must contain the columns revenue (revenue of the store), store (store unique id) and geom (the geographical point of the store). The values of these columns cannot be NULL.
  • stores_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> list with the columns of the stores_query and their corresponding aggregation method (sum, avg, max, min, count) that will be used to enrich the grid cells. It can be set to NULL.
  • competitors_query: STRING query with the competitors information to be used in the model. It must contain the columns competitor (competitor store unique id) and geom (the geographical point of the store).
  • aoi_query: STRING query with the geography of the area of interest. It must contain a column geom with a single area (Polygon or MultiPolygon).
  • grid_type: STRING type of the cell grid. Supported values are h3, and quadbin.
  • grid_level: INT64 level or resolution of the cell grid. Check the available H3 levels, and Quadbin levels.
  • kring: INT64 size of the kring where the decay function will be applied. This value can be 0, in which case no kring will be computed and the decay function won't be applied.
  • decay: STRING decay function. Supported values are uniform, inverse, inverse_square and exponential. If set to NULL or '', uniform is used by default.
  • do_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> variables of the Data Observatory that will be used to enrich the grid cells and therefore train the revenue prediction model in the subsequent step of the Revenue Prediction workflow. For each variable, its slug and the aggregation method must be provided. Use default to use the variable's default aggregation method. Valid aggregation methods are: sum, avg, max, min, count. The catalog procedure DATAOBS_SUBSCRIPTION_VARIABLES can be used to find available variables and their slugs and default aggregation. It can be set to NULL.
  • do_source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in <my-dataobs-project>.<my-dataobs-dataset> format. If only the <my-dataobs-dataset> is included, it uses the project carto-data by default. It can be set to NULL or ''.
  • custom_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> list with the columns of the custom_query and their corresponding aggregation method (sum, avg, max, min, count) that will be used to enrich the grid cells. It can be set to NULL.
  • custom_query: STRING query that contains a geography column geom and the columns with the custom data that will be used to enrich the grid cells. It can be set to NULL or ''.
  • output_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and a prefix which will be prepended to each output tabla name. For example <my-project>.<my-dataset>.<output-prefix>.
Output
The procedure will output two tables:
  1. 1.
    Model data table: contains an index column with the cell ids and all the enriched columns: revenue_avg, store_count, competitor_count, stores_variables suffixed by aggregation method, DO variables and custom variables. The name of the table includes the suffix _model_data, for example <my-project>.<my-dataset>.<output-prefix>_model_data.
  2. 2.
    Model data stats table: contains the morans_i value computed for the revenue_avg column, computed with kring 1 and decay uniform. The name of the table includes the suffix _model_data_stats, for example <my-project>.<my-dataset>.<output-prefix>_model_data_stats.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.BUILD_REVENUE_MODEL_DATA(
-- Stores: revenue, store, geom and optional store information
'''SELECT revenue, store, geom, store_area FROM `<project>.<dataset>.input_stores_data`''',
-- Stores information variables
[('store_area','sum')],
-- Competitors: competitor, geom
'''SELECT competitor, geom FROM `<project>.<dataset>.input_competitors_data`''',
-- Area of interest: geom
'''SELECT geom FROM `<project>.<dataset>.area_of_interest`''',
-- Grid params: grid type and level
'h3', 6,
-- Decay params: kring size and decay function
3, 'exponential',
-- Data Observatory enrichment
[('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
'<my-dataobs-project>.<my-dataobs-dataset>',
-- Custom data enrichment
[('var1', 'sum'), ('var2', 'avg')],
'''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
-- Output destination prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_data` will be created
-- with columns: index, revenue_avg, store_count, competitor_count, POPCY_4534fac4_sum, INCCYPCAP_7c8377cf_avg, var1_sum, var2_avg
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_data_stats` will be created
-- with the column: morans_i
CALL `carto-un-eu`.carto.BUILD_REVENUE_MODEL_DATA(
-- Stores: revenue, store, geom and optional store information
'''SELECT revenue, store, geom, store_area FROM `<project>.<dataset>.input_stores_data`''',
-- Stores information variables
[('store_area','sum')],
-- Competitors: competitor, geom
'''SELECT competitor, geom FROM `<project>.<dataset>.input_competitors_data`''',
-- Area of interest: geom
'''SELECT geom FROM `<project>.<dataset>.area_of_interest`''',
-- Grid params: grid type and level
'h3', 6,
-- Decay params: kring size and decay function
3, 'exponential',
-- Data Observatory enrichment
[('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
'<my-dataobs-project>.<my-dataobs-dataset>',
-- Custom data enrichment
[('var1', 'sum'), ('var2', 'avg')],
'''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
-- Output destination prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_data` will be created
-- with columns: index, revenue_avg, store_count, competitor_count, POPCY_4534fac4_sum, INCCYPCAP_7c8377cf_avg, var1_sum, var2_avg
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_data_stats` will be created
-- with the column: morans_i
CALL carto.BUILD_REVENUE_MODEL_DATA(
-- Stores: revenue, store, geom and optional store information
'''SELECT revenue, store, geom, store_area FROM `<project>.<dataset>.input_stores_data`''',
-- Stores information variables
[('store_area','sum')],
-- Competitors: competitor, geom
'''SELECT competitor, geom FROM `<project>.<dataset>.input_competitors_data`''',
-- Area of interest: geom
'''SELECT geom FROM `<project>.<dataset>.area_of_interest`''',
-- Grid params: grid type and level
'h3', 6,
-- Decay params: kring size and decay function
3, 'exponential',
-- Data Observatory enrichment
[('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
'<my-dataobs-project>.<my-dataobs-dataset>',
-- Custom data enrichment
[('var1', 'sum'), ('var2', 'avg')],
'''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
-- Output destination prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_data` will be created
-- with columns: index, revenue_avg, store_count, competitor_count, POPCY_4534fac4_sum, INCCYPCAP_7c8377cf_avg, var1_sum, var2_avg
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_data_stats` will be created
-- with the column: morans_i

BUILD_REVENUE_MODEL

BUILD_REVENUE_MODEL(revenue_model_data, options, output_prefix)
Description
This procedure is the second step of the Revenue Prediction analysis workflow. It creates the model and its description tables from the input model data (output of the BUILD_REVENUE_MODEL_DATA procedure). It performs the following steps:
  1. 1.
    Compute the model from the input query and options.
  2. 2.
    Compute the revenue model_shap, model_stats tables (see the output description for more details).
Input parameters
  • revenue_model_data: STRING table with the revenue model data generated with the BUILD_REVENUE_MODEL_DATA procedure.
  • options: STRING JSON string to overwrite the model default options. If set to NULL or empty, it will use the default options. The following fixed options cannot be modified:
    • ENABLE_GLOBAL_EXPLAIN: TRUE
    • INPUT_LABEL_COLS: ['revenue_avg']
      The following default options can be modified:
    • MODEL_TYPE: 'BOOSTED_TREE_REGRESSOR'
    • BOOSTER_TYPE: 'GBTREE'
    • NUM_PARALLEL_TREE: 1
    • TREE_METHOD: 'AUTO'
    • COLSAMPLE_BYTREE: 1
    • MAX_TREE_DEPTH: 6
    • SUBSAMPLE: 0.85
    • L1_REG: 0
    • L2_REG: 1
    • EARLY_STOP: FALSE
    • MAX_ITERATIONS: 50
    • DATA_SPLIT_METHOD: 'NO_SPLIT'
      This parameter allows using other options compatible with the model used. Models currently supported are LINEAR_REG and BOOSTED_TREE_REGRESSOR. Check the model documentation for more information.
  • output_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and prefix. For example <my-project>.<my-dataset>.<output-prefix>.
Output
The procedure will output the following:
  1. 1.
    Model: contains the trained model to be used for the revenue prediction. The name of the model includes the suffix _model, for example <my-project>.<my-dataset>.<output-prefix>_model.
  2. 2.
    Shap table: contains a list of the features and their attribution to the model, computed with ML.GLOBAL_EXPLAIN. The name of the table includes the suffix _model_shap, for example <my-project>.<my-dataset>.<output-prefix>_model_shap.
  3. 3.
    Stats table: contains the model stats (mean_error, variance, etc.), computed with ML.EVALUATE. The name of the table includes the suffix _model_stats, for example <my-project>.<my-dataset>.<output-prefix>_model_stats.
To learn more about how to evaluate the results of your model through the concept of explainability, refer to this article (https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-xai-overview).
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.BUILD_REVENUE_MODEL(
-- Model data
'<my-project>.<my-dataset>.<output-prefix>_model_data',
-- Options
'{"MAX_ITERATIONS": 100}',
-- Output destination prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
-- Model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_shap` will be created
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_stats` will be created
CALL `carto-un-eu`.carto.BUILD_REVENUE_MODEL(
-- Model data
'<my-project>.<my-dataset>.<output-prefix>_model_data',
-- Options
'{"MAX_ITERATIONS": 100}',
-- Output destination prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
-- Model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_shap` will be created
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_stats` will be created
CALL carto.BUILD_REVENUE_MODEL(
-- Model data
'<my-project>.<my-dataset>.<output-prefix>_model_data',
-- Options
'{"MAX_ITERATIONS": 100}',
-- Output destination prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
-- Model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_shap` will be created
-- Table `<my-project>.<my-dataset>.<output-prefix>_model_stats` will be created

PREDICT_REVENUE_AVERAGE

PREDICT_REVENUE_AVERAGE(index, revenue_model, revenue_model_data)
Description
This procedure is the third and final step of the Revenue Prediction analysis workflow. It predicts the average revenue of an additional store located in the specified grid cell. It requires as input the model data (output of the BUILD_REVENUE_MODEL_DATA procedure) and the trained model (output of the BUILD_REVENUE_MODEL procedure).
Input parameters
  • index: ANY TYPE cell index where the new store will be located. It can be an H3 or a Quadbin index. For Quadbin, the value should be INT64 whereas for H3 the value should be STRING. It can also be 'ALL', in which case the prediction for all the grid cells of the model data are returned.
  • revenue_model: STRING the fully qualified model name.
  • revenue_model_data: STRING the fully qualified model_data table name.
  • candidate_data: STRING the fully qualified candidate_data table name. It can be set to NULL.
  • stores_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> list with the columns of the stores_query and their corresponding aggregation method (sum, avg, max, min, count) that will be used to enrich the grid cells. It can be set to NULL.
Output
The procedure will output the index, predicted_revenue_avg value in the cell (in the same units of the revenue column), and shap_values, an array of key value pairs with the shap values of the features for each prediction. It also includes a baseline_prediction, which is the expected revenue without considering the impact of any other features.
Example
carto-un
carto-un-eu
manual
CREATE TABLE '<my-project>.<my-dataset>.<output-prefix>_candidate_data' AS (SELECT 25 store_area);
​
CALL `carto-un`.carto.PREDICT_REVENUE_AVERAGE(
'862676d1fffffff',
'<my-project>.<my-dataset>.<output-prefix>_model',
'<my-project>.<my-dataset>.<output-prefix>_model_data',
'<my-project>.<my-dataset>.<output-prefix>_candidate_data',
[('store_area','sum')]
);
-- index, predicted_revenue_avg
CREATE TABLE '<my-project>.<my-dataset>.<output-prefix>_candidate_data' AS (SELECT 25 store_area);
​
CALL `carto-un-eu`.carto.PREDICT_REVENUE_AVERAGE(
'862676d1fffffff',
'<my-project>.<my-dataset>.<output-prefix>_model',
'<my-project>.<my-dataset>.<output-prefix>_model_data',
'<my-project>.<my-dataset>.<output-prefix>_candidate_data',
[('store_area','sum')]
);
-- index, predicted_revenue_avg
CREATE TABLE '<my-project>.<my-dataset>.<output-prefix>_candidate_data' AS (SELECT 25 store_area);
​
CALL carto.PREDICT_REVENUE_AVERAGE(
'862676d1fffffff',
'<my-project>.<my-dataset>.<output-prefix>_model',
'<my-project>.<my-dataset>.<output-prefix>_model_data',
'<my-project>.<my-dataset>.<output-prefix>_candidate_data',
[('store_area','sum')]
);
-- index, predicted_revenue_avg

FIND_WHITESPACE_AREAS

FIND_WHITESPACE_AREAS(revenue_model, revenue_model_data, generator_query, aoi_query, minimum_revenue, max_results, with_own_stores, with_competitors)
Description
This is a postprocessing step that may be used after completing a Revenue Prediction analysis workflow. It allows you to identify cells with the highest potential revenue (whitespaces), while satisfying a series of criteria (e.g. presence of competitors).
It requires as input the model data (output of the BUILD_REVENUE_MODEL_DATA procedure) and the trained model (output of the BUILD_REVENUE_MODEL procedure), as well as a query with points to use as generators for the area of applicability of the model, plus a series of optional filters.
A cell is eligible to be considered a whitespace if it complies with the filtering criteria (minimum revenue, presence of competitors, etc.) and is within the area of applicability of the revenue model provided.
Input parameters
  • revenue_model: STRING with the fully qualified model name.
  • revenue_model_data: STRING with the fully qualified model_data table name.
  • generator_query: STRING query with the location of a set of generator points as a geography column named geom. The algorithm will look for whitespaces in the surroundings of these locations, therefore avoiding offering results in locations that are not of the interest of the user. Good options to use as generator locations are, for instance, the location of the stores and competitors, or a collection of POIs that are known to drive commercial activity to an area.
  • aoi_query: STRING query with the geography of the area of interest in which to perform the search. May be NULL, in which case no spatial filter will be applied.
  • minimum_revenue: FLOAT64 the minimum revenue to filter results by. May be NULL, in which case no revenue threshold will be applied.
  • max_results: INT64 of the maximum number of results, ordered by decreasing predicted revenue. May be NULL, in which case all eligible cells are returned.
  • with_own_stores: BOOL specifying whether to consider cells that already have own stores in them. If NULL, defaults to TRUE.
  • with_competitors: BOOL specifying whether to consider cells that already have competitors in them. If NULL, defaults to TRUE.
Output
The procedure will output a table of cells with the following columns:
  • index: identifying the H3, or Quadbin cell.
  • predicted_revenue_avg: average revenue of an additional store located in the grid cell.
  • store_count: number of own stores present in the grid cell.
  • competitor_count: number of competitors present in the grid cell.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.FIND_WHITESPACE_AREAS(
'<my-project>.<my-dataset>.<output-prefix>_model',
'<my-project>.<my-dataset>.<output-prefix>_model_data',
'SELECT geom FROM `<my-project>.<my-dataset>.<generator-table>`',
'SELECT geom FROM `<my-project>.<my-dataset>.<area_of_interest_table>`', -- Area of Interest filter
10000, -- Minimum predicted revenue filter
5, -- Maximum number of results
TRUE, -- Whether to include cells with own stores
FALSE -- Whether to include cells with competitors
);
CALL `carto-un-eu`.carto.FIND_WHITESPACE_AREAS(
'<my-project>.<my-dataset>.<output-prefix>_model',
'<my-project>.<my-dataset>.<output-prefix>_model_data',
'SELECT geom FROM `<my-project>.<my-dataset>.<generator-table>`',
'SELECT geom FROM `<my-project>.<my-dataset>.<area_of_interest_table>`', -- Area of Interest filter
10000, -- Minimum predicted revenue filter
5, -- Maximum number of results
TRUE, -- Whether to include cells with own stores
FALSE -- Whether to include cells with competitors
);
CALL carto.FIND_WHITESPACE_AREAS(
'<my-project>.<my-dataset>.<output-prefix>_model',
'<my-project>.<my-dataset>.<output-prefix>_model_data',
'SELECT geom FROM `<my-project>.<my-dataset>.<generator-table>`',
'SELECT geom FROM `<my-project>.<my-dataset>.<area_of_interest_table>`', -- Area of Interest filter
10000, -- Minimum predicted revenue filter
5, -- Maximum number of results
TRUE, -- Whether to include cells with own stores
FALSE -- Whether to include cells with competitors
);

FIND_TWIN_AREAS

FIND_TWIN_AREAS(origin_query, target_query, index_column, pca_explained_variance_ratio, max_results, output_prefix)
Description
Procedure to obtain the twin areas for a given origin location in a target area. The full description of the method, based on Principal Component Analysis (PCA), can be found here.
The output twin areas are those of the target area considered to be the most similar to the origin location, based on the values of a set of variables. Only variables with numerical values are supported. Both origin and target areas should be provided in grid format (H3, or Quadbin) of the same resolution. We recommend using the GRIDIFY_ENRICH procedure to prepare the data in the format expected by this procedure.
Input
  • origin_query: STRING query to provide the origin cell (index column) and its associated data columns. No NULL values should be contained in any of the data columns provided. The cell can be an H3, or a Quadbin index. For Quadbin, the value should be cast to STRING (CAST(index AS STRING)). Example origin queries are:
carto-un
carto-un-eu
manual
-- When selecting the origin cell from a dataset of gridified data
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE index_column = <cell_id>
-- When selecting the origin cell from a dataset of gridified data
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE index_column = <cell_id>
-- When selecting the origin cell from a dataset of gridified data
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE index_column = <cell_id>
carto-un
carto-un-eu
manual
-- When the input H3 cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un`.carto.H3_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input H3 cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un-eu`.carto.H3_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input H3 cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(carto.H3_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
carto-un
carto-un-eu
manual
-- When the input Quadbin cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un`.carto.QUADBIN_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input Quadbin cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un-eu`.carto.QUADBIN_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input Quadbin cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(carto.QUADBIN_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
carto-un
carto-un-eu
manual
-- When the cell ID is a Quadbin and requires to be cast
SELECT * EXCEPT(index_column), CAST(index_column AS STRING)
FROM `<project>.<dataset>.<origin_table>`
-- When the cell ID is a Quadbin and requires to be cast
SELECT * EXCEPT(index_column), CAST(index_column AS STRING)
FROM `<project>.<dataset>.<origin_table>`
-- When the cell ID is a Quadbin and requires to be cast
SELECT * EXCEPT(index_column), CAST(index_column AS STRING)
FROM `<project>.<dataset>.<origin_table>`
  • target_query: STRING query to provide the target area grid cells (index column) and their associated data columns, e.g. SELECT * FROM <project>.<dataset>.<target_table>. The data columns should be similar to those provided in the origin_query, otherwise the procedure will fail. Grid cells with any NULL values will be excluded from the analysis.
  • index_column: STRING name of the index column for both the origin_query and the target_query.
  • pca_explained_variance_ratio: FLOAT64 of the explained variance retained in the PCA analysis. It defaults to 0.9 if set toNULL.
  • max_results: INT64 of the maximum number of twin areas returned. If set to NULL, all target cells are returned.
  • output_prefix: STRING destination and prefix for the output tables. It must contain the project, dataset and prefix: <project>.<dataset>.<prefix>.
Output
The procedure outputs the following:
  • Twin area model, named <project>.<dataset>.<prefix>_model. Please note that the model computation only depends on the target_query and therefore the same model can be used if the procedure is re-run for a different origin_query. To allow for this scenario in which the model is reused, if the output model already exists, it won't be recomputed. To avoid this behavior, simply choose a different <prefix> in the output_prefix parameter.
  • Results table, named <project>.<dataset>.<prefix>_<origin_index>_results, containing in each row the index of the target cells (index_column) and its associated similarity_score and similarity_skill_score. The similarity_score corresponds to the distance between the origin and target cell in the Principal Component (PC) Scores space; the similarity_skill_score for a given target cell *t* is computed as 1 - similarity_score(*t*) / similarity_score(<*t*>), where <*t*> is the average target cell, computed by averaging each retained PC score for all the target cells. This similarity_skill_score represents a relative measure: the score will be positive if and only if the target cell is more similar to the origin than the mean vector data, with a score of 1 meaning perfect matching or zero distance. Therefore, a target cell with a larger score will be more similar to the origin under this scoring rule.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.FIND_TWIN_AREAS(
-- Input queries
'''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched_quadbin` LIMIT 1''',
'''SELECT * FROM `cartobq.docs.twin_areas_target_enriched_quadbin`''',
-- Twin areas model inputs
'quadbin',
0.90,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: quadbin, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_TWIN_AREAS(
-- Input queries
'''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched_quadbin` LIMIT 1''',
'''SELECT * FROM `cartobq.docs.twin_areas_target_enriched_quadbin`''',
-- Twin areas model inputs
'quadbin',
0.90,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: quadbin, similarity_score, similarity_skill_score
CALL carto.FIND_TWIN_AREAS(
-- Input queries
'''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched_quadbin` LIMIT 1''',
'''SELECT * FROM `cartobq.docs.twin_areas_target_enriched_quadbin`''',
-- Twin areas model inputs
'quadbin',
0.90,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: quadbin, similarity_score, similarity_skill_score
Additional examples

FIND_TWIN_AREAS_WEIGHTED

FIND_TWIN_AREAS_WEIGHTED(origin_query, target_query, index_column, weights, max_results, output_prefix)
Description
Procedure to obtain the twin areas for a given origin location in a target area. The function is similar to the FIND_TWIN_AREAS where the full description of the method, based on Principal Component Analysis (PCA), can be found here. Herein, no PCA is performed, but the user has the capability to specify weights for the features and check the similarities between origin and target area. The sum of weights must be less than or equal to 1. Not all them need to be defined. The undefined features are set to the remaining value divided by their number to reach 1. In the case where weights are provided, then no PCA takes place, and the features are standardized.
The output twin areas are those of the target area considered to be the most similar to the origin location, based on the values of a set of variables. Only variables with numerical values are supported. Both origin and target areas should be provided in grid format (h3, or quadbin) of the same resolution. We recommend using the GRIDIFY_ENRICH procedure to prepare the data in the format expected by this procedure.
Input
  • origin_query: STRING query to provide the origin cell (index column) and its associated data columns. No NULL values should be contained in any of the data columns provided. The cell can be an h3, or a quadbin index. For quadbin, the value should be cast to STRING (CAST(index AS STRING)). Example origin queries are:
carto-un
carto-un-eu
manual
-- When selecting the origin cell from a dataset of gridified data
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE index_column = <cell_id>
-- When selecting the origin cell from a dataset of gridified data
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE index_column = <cell_id>
-- When selecting the origin cell from a dataset of gridified data
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE index_column = <cell_id>
carto-un
carto-un-eu
manual
-- When the input H3 cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un`.carto.H3_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input H3 cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un-eu`.carto.H3_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input H3 cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(carto.H3_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
carto-un
carto-un-eu
manual
-- When the input quadbin cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un`.carto.QUADBIN_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input quadbin cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(`carto-un-eu`.carto.QUADBIN_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
-- When the input quadbin cell ID is inferred from a (longitude, latitude) pair
SELECT * FROM `<project>.<dataset>.<origin_table>`
WHERE ST_INTERSECTS(carto.QUADBIN_BOUNDARY(index_column), ST_GEOGPOINT(<longitude>, <latitude>))
carto-un
carto-un-eu
manual
-- When the cell ID is a quadbin and requires to be cast
SELECT * EXCEPT(index_column), CAST(index_column AS STRING)
FROM `<project>.<dataset>.<origin_table>`
-- When the cell ID is a quadbin and requires to be cast
SELECT * EXCEPT(index_column), CAST(index_column AS STRING)
FROM `<project>.<dataset>.<origin_table>`
-- When the cell ID is a quadbin and requires to be cast
SELECT * EXCEPT(index_column), CAST(index_column AS STRING)
FROM `<project>.<dataset>.<origin_table>`
  • target_query: STRING query to provide the target area grid cells (index column) and their associated data columns, e.g. SELECT * FROM <project>.<dataset>.<target_table>. The data columns should be similar to those provided in the origin_query, otherwise the procedure will fail. Grid cells with any NULL values will be excluded from the analysis.
  • index_column: STRING name of the index column for both the origin_query and the target_query.
  • weights: ARRAY<STRUCT<name STRING, value FLOAT64>> the weights on the features. If set to NULL, then all features are treated equally. This parameter is considered only if the length of weights is greater or equal than one. The sum of weights must be less than or equal to 1. If less weights than the number of features are provided, then for the undefined features, the remaining 1 - sum(weights) is distributed evenly.
  • max_results: INT64 of the maximum number of twin areas returned. If set to NULL, all target cells are returned.
  • output_prefix: STRING destination and prefix for the output tables. It must contain the project, dataset and prefix: <project>.<dataset>.<prefix>.
Output
The procedure outputs the following:
  • Twin area model, named <project>.<dataset>.<prefix>_model. Please note that the model computation only depends on the target_query and therefore the same model can be used if the procedure is re-run for a different origin_query. To allow for this scenario in which the model is reused, if the output model already exists, it won't be recomputed. To avoid this behavior, simply choose a different <prefix> in the output_prefix parameter.
  • Results table, named <project>.<dataset>.<prefix>_<origin_index>_results, containing in each row the index of the target cells (index_column) and its associated similarity_score and similarity_skill_score. The similarity_score corresponds to the distance between the origin and target cell taking into account the user defined weights; the similarity_skill_score for a given target cell *t* is computed as 1 - similarity_score(*t*) / similarity_score(<*t*>), where <*t*> is the average target cell, computed by averaging each feature for all the target cells. This similarity_skill_score represents a relative measure: the score will be positive if and only if the target cell is more similar to the origin than the mean vector data, with a score of 1 meaning perfect matching or zero distance. Therefore, a target cell with a larger score will be more similar to the origin under this scoring rule.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.FIND_TWIN_AREAS_WEIGHTED(
-- Input queries
'''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched_quadbin` LIMIT 1''',
'''SELECT * FROM `cartobq.docs.twin_areas_target_enriched_quadbin`''',
-- Twin areas model inputs
'quadbin',
NULL,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: quadbin, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_TWIN_AREAS_WEIGHTED(
-- Input queries
'''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched_quadbin` LIMIT 1''',
'''SELECT * FROM `cartobq.docs.twin_areas_target_enriched_quadbin`''',
-- Twin areas model inputs
'quadbin',
NULL,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: quadbin, similarity_score, similarity_skill_score
CALL carto.FIND_TWIN_AREAS_WEIGHTED(
-- Input queries
'''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched_quadbin` LIMIT 1''',
'''SELECT * FROM `cartobq.docs.twin_areas_target_enriched_quadbin`''',
-- Twin areas model inputs
'quadbin',
NULL,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: quadbin, similarity_score, similarity_skill_score

COMMERCIAL_HOTSPOTS

COMMERCIAL_HOTSPOTS(input, output, index_column, index_type, variable_columns, variable_weights, kring, pvalue_thresh)
Description
This procedure is used to locate hotspot areas by calculating a combined Getis-Ord Gi* statistic using a uniform kernel over several variables. The input data should be in either an H3 or Quadbin grid. Variables can be optionally weighted using the variable_weights parameter, uniform weights will be considered otherwise. The combined Gi* statistic for each cell will be computed by taking into account the neighboring cells within the k-ring of size kring.
Only those cells where the Gi* statistics is significant are returned, i.e., those above the p-value threshold (pvalue_thresh) set by the user. Hotspots can be identified as those cells with the highest Gi* values.
Input parameters
  • input: STRING name of the table containing the input data. It should include project and dataset, i.e., follow the format <project-id>.<dataset-id>.<table-name>.
  • output: STRING name of the table where the output data will be stored. It should include project and dataset, i.e., follow the format <project-id>.<dataset-id>.<table-name>. If NULL, the procedure will return the output but it will not be persisted.
  • index_column: STRING name of the column containing the H3 or Quadbin indexes.
  • index_type: STRING type of the input cell indexes. Supported values are 'h3', or 'quadbin'.
  • variable_columns: ARRAY<STRING> names of the columns containing the variables to take into account when computing the combined Gi* statistic.
  • variable_weights: ARRAY<FLOAT64> containing the weights associated with each of the variables. These weights can take any value but will be normalized to sum up to 1. If NULL, uniform weights will be considered
  • kring: INT64 size of the k-ring (distance from the origin). This defines the area around each cell that will be taken into account to compute its Gi* statistic. If NULL, uniform weights will be considered.
  • pvalue_thresh: Threshold for the Gi* value significance, ranging from 0 (most significant) to 1 (least significant). It defaults to 0.05. Cells with a p-value above this threshold won't be returned.
Output The output will contain the following columns:
  • index: STRING containing the cell index.
  • combined_gi: FLOAT64 with the resulting combined Gi*.
  • p_value: FLOAT64 with the p-value associated with the combined Gi* statistic.
If the output table is not specified when calling the procedure, the result will be returned but it won't be persisted.
Examples
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.COMMERCIAL_HOTSPOTS(
'project_id.dataset_id.my_input_table',
'project_id.dataset_id.my_output_table',
'index',
'h3',
['feature_0', 'feature_1'],
[0.7, 0.3],
3,
0.01
)
-- Table project_id.dataset_id.my_output_table will be created.
-- with columns: index, combined_gi, p_value
CALL `carto-un-eu`.carto.COMMERCIAL_HOTSPOTS(
'project_id.dataset_id.my_input_table',
'project_id.dataset_id.my_output_table',
'index',
'h3',
['feature_0', 'feature_1'],
[0.7, 0.3],
3,
0.01
)
-- Table project_id.dataset_id.my_output_table will be created.
-- with columns: index, combined_gi, p_value
CALL carto.COMMERCIAL_HOTSPOTS(
'project_id.dataset_id.my_input_table',
'project_id.dataset_id.my_output_table',
'index',
'h3',
['feature_0', 'feature_1'],
[0.7, 0.3],
3,
0.01
)
-- Table project_id.dataset_id.my_output_table will be created.
-- with columns: index, combined_gi, p_value
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.COMMERCIAL_HOTSPOTS(
'project_id.dataset_id.my_input_table',
'project_id.dataset_id.my_output_table',
'index',
'quadbin',
['feature_0', 'feature_1'],
[0.5, 0.5],
1,
0.05
);
-- Table project_id.dataset_id.my_output_table will be created.
-- with columns: index, combined_gi, p_value
CALL `carto-un-eu`.carto.COMMERCIAL_HOTSPOTS(
'project_id.dataset_id.my_input_table',
'project_id.dataset_id.my_output_table',