cpg
ADVANCED BETA
This module contains procedures to perform spatial analysis to solve specific use-cases for the Consumer Packaged Goods (CPG) industry, such as customer segmentation.
GENERATE_TRADE_AREAS(customers_query, method, options, output_prefix)
Description
This procedure generates the trade areas for each location specified based on the method and the options provided. Four methods are available:
buffer
, kring-h3
, kring-quadbin
and isoline
.For the
isoline
method, the use of this procedure requires providing authorization credentials. Two parameters are needed: api_base_url and lds_token. Both the API base url and your LDS token can be accessed through the developers section of the CARTO user interface. Please check our documentation for Developers for more details.Input parameters
customers_query
:STRING
query with store id and location. It must contain the columnsstore_id
(store unique id) andgeom
(the geographical point of the store). The values of these columns cannot beNULL
.method
:STRING
indicates the method of trade area generation. Four options available:buffer
,kring-h3
,kring-quadbin
andisoline
. This method applies to all locations provided.options
:JSON
A JSON string containing the required parameters for the specified method. Forbuffer
: {buffer
: radius -FLOT64
},kring-h3
:{resolution
: resolution-INT64
,kring
:number of layers -INT64
},kring-quadbin
: {resolution
: zoom level -INT64
,kring
:number of layers -INT64
},isoline
: {mode
: type of transport. Supported: 'walk', 'car' -STRING
,time
: range of the isoline in seconds -INT64
,api_base_url
: url of the API where the customer account is stored -STRING
,lds_token
: customer's token for accessing the different API services -STRING
}.output_prefix
:STRING
the prefix for each table in the output destination.
Output
This procedure will output one table:
- Table containing the
store_id
,geom
,method
,options
. The output table can be found at the output destination with name<output-prefix>_output
. Overall path<my-project>.<my-dataset>.<output-prefix>_trade_areas
.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.GENERATE_TRADE_AREAS(
--customers_query
'''SELECT store_id, geom, FROM `<project>.<dataset>.<table_name_with_stores>`''',
--method
'buffer',
--options
"{'buffer':500.0}",
--output_prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
CALL `carto-un-eu`.carto.GENERATE_TRADE_AREAS(
--customers_query
'''SELECT store_id, geom, FROM `<project>.<dataset>.<table_name_with_stores>`''',
--method
'buffer',
--options
"{'buffer':500.0}",
--output_prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
CALL carto.GENERATE_TRADE_AREAS(
--customers_query
'''SELECT store_id, geom, FROM `<project>.<dataset>.<table_name_with_stores>`''',
--method
'buffer',
--options
"{'buffer':500.0}",
--output_prefix
'<my-project>.<my-dataset>.<output-prefix>'
);
CUSTOMER_SEGMENTATION_ANALYSIS_DATA(customers_query, do_variables, do_source, custom_variables, custom_query, output_prefix)
Description
This procedure prepares the "data table" to be fed into
RUN_CUSTOMER_SEGMENTATION
. It takes as input a set of locations along with optional preprocessed features, and either it further enriches these locations with new features from "Data Observatory subscriptions" or custom features "from the user's own tables", and then performs descriptives analysis, correlation analyis on the features and produces three "output" tables:- A table with the data to fed into the next procedure, with store_id, geom and all the features that will be considered for the segmentation.
- Correlation table, where the Pearson correlation between each pair of features is included.
- Descriptives table, where the
count
,mean
,std
,min
, 10th, 25th, 50th, 75th, 90th percentiles, andmax
values per features are included.
Input parameters
customers_query
:STRING
query with store id and location and any optional preprocessed feature. It must contain the columnsstore_id
(store unique id) andgeom
(the geographical point of the store). The values of these columns cannot beNULL
.- do_variables:
ARRAY<STRUCT<variable STRING, aggregation STRING>>
variables of the Data Observatory that will be used to enrich the provided trade areas of each location. For each variable, its slug and the aggregation method must be provided. Use 'default' to use the variable's default aggregation method. Valid aggregation methods are: sum, avg, max, min, count. The catalog procedureDATAOBS_SUBSCRIPTION_VARIABLES
can be used to find available variables and their slug ids and default aggregation methods. It can be set toNULL
. - do_source:
STRING
name of the location where the Data Observatory subscriptions of the user are stored, in<my-dataobs-project>.<my-dataobs-dataset>
format. If only the<my-dataobs-dataset>
is included, it uses the project carto-data by default. It can be set toNULL
or ''. custom_variables
:ARRAY<STRUCT<variable STRING, aggregation STRING>>
list with the columns of thecustom_query
and their corresponding aggregation method (sum
,avg
,max
,min
,count
) that will be used to enrich the provided trade areas of each location. It can be set toNULL
.custom_query
:STRING
query that contains a geography columngeom
and the columns with the custom data that will be used to enrich the provided trade areas of each location. It can be set toNULL
or''
.output_prefix
:STRING
the prefix for each table in the output destination.
Output
This procedure will output three tables:
- Table containing the
store_id
,geom
and all features to be considered for the segmentation (i.e. pre-processed, from DO subscription, or features from other tables included by the user). The output table can be found at the output destination with name<output-prefix>_enrich
. Overall path<my-project>.<my-dataset>.<output-prefix>_enrich
. - Table containing the correlation amongst features. Three columns,
feature 1
,feature 2
andcorrelation
between feature 1 and feature 2. The output table can be found at the output destination with name<output-prefix>_correlation
. Overall path<my-project>.<my-dataset>.<output-prefix>_correlation
. - Table containing the descriptive statistics of the features. For each feature the
count
,mean
,std
,min
, 10th, 25th, 50th, 75th, 90th percentiles andmax
values are calculated. The output table can be found at the output destination with name<output-prefix>_descriptives
. Overall path<my-project>.<my-dataset>.<output-prefix>_descriptives
.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.CUSTOMER_SEGMENTATION_ANALYSIS_DATA(
R'''
SELECT store_id, geom, feature_1, feature_2 ... FROM `<project>.<dataset>.<table_name_with_stores>`
''',
-- Data Observatory enrichment
[('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
'<my-dataobs-project>.<my-dataobs-dataset>',
-- Custom data enrichment
[('var1', 'sum'), ('var2', 'avg')],
'''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
'<my-project>.<my-dataset>.<output-prefix>'
);
CALL `carto-un-eu`.carto.CUSTOMER_SEGMENTATION_ANALYSIS_DATA(
R'''
SELECT store_id, geom, feature_1, feature_2 ... FROM `<project>.<dataset>.<table_name_with_stores>`
''',
-- Data Observatory enrichment
[('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
'<my-dataobs-project>.<my-dataobs-dataset>',
-- Custom data enrichment
[('var1', 'sum'), ('var2', 'avg')],
'''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
'<my-project>.<my-dataset>.<output-prefix>'
);
CALL carto.CUSTOMER_SEGMENTATION_ANALYSIS_DATA(
R'''
SELECT store_id, geom, feature_1, feature_2 ... FROM `<project>.<dataset>.<table_name_with_stores>`
''',
-- Data Observatory enrichment
[('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
'<my-dataobs-project>.<my-dataobs-dataset>',
-- Custom data enrichment
[('var1', 'sum'), ('var2', 'avg')],
'''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
'<my-project>.<my-dataset>.<output-prefix>'
);
RUN_CUSTOMER_SEGMENTATION(customers_data_table, number_of_clusters, pca_explain_ratio, output_prefix)
Description
This procedure is the final step of the Customer Segmentation workflow, and it uses the output from the
CUSTOMER_SEGMENTATION_ANALYSIS_DATA
. The function performs clustering (KMeans algorithm) and clusterizes the stores into a number of clusters. Multiple number of clusters - cases can be defined. The clustering is either performed directly on the input data or the capability of performing Principal Component Analysis (PCA) on the input data first exists. If PCA is enabled then the input data are firstly passed into PCA having been standardized, in order to remove multicolinearity or correlation amongst features, and the output of the PCA is passed into the KMeans. If PCA is not enabled, then the input data are standardized and then passed to the KMeans algorithm. The input to this function consists of the output table from the aformentioned function, the enrich table, a list with the options for the number of clusters, the variance explainability ratio for PCA and the output_perfix. As output, the function produces the following tables:- A table with the
store_id
,geom
, the number of clusters on that scenario/case, and the cluster the store belongs to. - A table with descriptives statistics of each feature for each scenario of different number of clusters and the specific cluster of the store.
- A table with statistics of the KMeans performance for each number of clusters.
Input parameters
customers_data_table
:STRING
table, output fromCUSTOMER_SEGMENTATION_ANALYSIS_DATA
which containsstore_id
,geom
, and features values.number_of_clusters
:ARRAY<INT64>
list with number of clusters to perform the segmentation. It can be regarded as the different number of cases for segmentation.pca_explain_ratio
:FLOAT64
of the explained variance retained in the PCA analysis. It defaults to 0.9. If 0 orNULL
is passed then PCA is not enabled and the raw data are passed into the clustering algorithm.output_prefix
:STRING
the prefix for each table in the output destination.
Output
This procedure will output three tables:
- Table containing the
store_id
,geom
, one column per scenario (number of clusters), in which the cluster the store belongs to is included for this scenario. The output table can be found at the output destination with name<output-prefix>_clusters
. Overall path<my-project>.<my-dataset>.<output-prefix>_clusters
. - Table containing statistics of the clusterings. It contains the davies bouldin index and mean squared distance for each case, scenario - number of clusters. The output table can be found at the output destination with name
<output-prefix>_clusters_stats
. Overall path<my-project>.<my-dataset>.<output-prefix>_clusters_stats
. - Table containing the descriptive statistics for each scenario (number of clsuters). Eleven columns, scenario (number of clusters), cluster value, count, mean, std, min, 10,25, 50, 75, 90 percentiles and max values. The output table can be found at the output destination with name
<output-prefix>_clusters_descr
. Overall path<my-project>.<my-dataset>.<output-prefix>_clusters_descr
.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.RUN_CUSTOMER_SEGMENTATION(
'<project>.<dataset>.<table_name_with_stores>_enrich',
[4,5],
0.9,
'<my-project>.<my-dataset>.<output-prefix>'
);
CALL `carto-un-eu`.carto.RUN_CUSTOMER_SEGMENTATION(
'<project>.<dataset>.<table_name_with_stores>_enrich',
[4,5],
0.9,
'<my-project>.<my-dataset>.<output-prefix>'
);
CALL carto.RUN_CUSTOMER_SEGMENTATION(
'<project>.<dataset>.<table_name_with_stores>_enrich',
[4,5],
0.9,
'<my-project>.<my-dataset>.<output-prefix>'
);
UNIVERSE_MATCHING(
current_universe,
current_universe_id_column,
current_universe_name_column,
current_universe_location_column,
total_universe,
total_universe_id_column,
total_universe_name_column,
total_universe_location_column,
table_name,
options
)
Description
Procedure to perform a text-and-distance hybrid, fuzzy match between two universes. Even though this function can match any given dataset with another as long as they have the required columns to be compared, the common metaphor that is used along the documentation requires these two datasets in the CPG-related scenario:
- A current universe comprised of active and known distributors that are already reselling a given product;
- And a total universe containing all possible distributors that could distribute such product.
The procedure will then provide the matched universe: a table containing a mapping from each location in the current universe to the one found most similar within the possible candidates of the total universe.
To perform this match, a two-step approach is followed, taking into account both locations and names in the two different datasets:
- 1.Distance is taken into account, using two different user-provided thresholds: the maximum number of closest neighbors to check and the maximum distance between two points to be considered a candidate pair.
- 2.String similarity is computed on the location names. It is based on Dice Coefficient and accounts for words in a different order in both strings.
Out of all the possible candidates, the best one is chosen based on the best final similarity value. The similarity is a weighted average between the proximity (a value in the [0, 1] interval based on the potential candidates' distance distributions) and the text similarity (which also lies within the [0, 1] interval). The final table is also ordered in descending similarity, which should help in case a similarity cutoff is to be made to discard mismatched pairs.
When adhering to the CPG use case, it is also interesting to perform the
UNIVERSE_MATCHING_REPORT
, which will return several valuable tables and metrics, such as the potential expansion universe and the current market penetration.Input
current_universe
:STRING
SQL query or table name to provide the current universe, that is, the dataset to be taken as a reference for the matching. The current universe is a representation of the current active providers to be matched and is usually the smallest dataset of the two. It should have (at least) a column that serves as a unique ID, a name column and a location column. If any of the rows have missing values in any of these columns, the rows will be ignored during the analysis.current_universe_id_column
:STRING
name of the column containing unique ID values for each of the current universe locations in the results of thecurrent_universe
. Any type is valid as long as it can be used in grouping functions. Rows with missing values in thecurrent_universe_id_column
will be ignored during the procedure.current_universe_name_column
:STRING
name of the column containing the names of each of the current universe locations in the results of thecurrent_universe
. Such column is expected to be inSTRING
format and will be matched against its total universe counterpart to compute the text similarity value. Rows with missing values in thecurrent_universe_name_column
will be ignored during the procedure.current_universe_location_column
:STRING
name of the column containing the point geography of each of the current universe locations in the results of thecurrent_universe
. Such column is expected to be inGEOGRAPHY
,POINT
format, and will be matched against its total universe counterpart to compute the proximity score. Rows with missing values in thecurrent_universe_location_column
will be ignored during the procedure.total_universe
:STRING
SQL query or table name to provide the total universe, that is, the dataset to be taken as a reference for the matching. The total universe is a representation of the total set of vendors or distributors that can be related to the current CPG use case being treated and, therefore, all possible vendors to which the distribution could potentially be expanded. It should have (at least) a column that serves as a unique ID, a name column and a location column. If any of the rows have missing values in any of these columns, the rows will be ignored during the analysis.total_universe_id_column
:STRING
name of the column containing unique ID values for each of the total universe locations in the results of thetotal_universe
. Any type is valid as long as it can be used in grouping functions. Rows with missing values in thetotal_universe_id_column
will be ignored during the procedure.total_universe_name_column
:STRING
name of the column containing the names of each of the total universe locations in the results of thetotal_universe
. Such column is expected to be inSTRING
format and will be matched against its current universe counterpart to compute the text similarity value. Rows with missing values in thetotal_universe_name_column
will be ignored during the procedure.total_universe_location_column
:STRING
name of the column containing the point geography of each of the total universe locations in the results of thetotal_universe
. Such column is expected to be inGEOGRAPHY
,POINT
format, and will be matched against its current universe counterpart to compute the proximity score. Rows with missing values in thetotal_universe_location_column
will be ignored during the procedure.table_name
:STRING
qualified name of the table to store the final results of the match. A detailed explanation of such an output table is detailed below.options
(optional): JSON-formattedSTRING
containing a set of optional parameters. All the parameters inside this JSON can be overridden if the user needs to, but it also provides a set of sensible default values. Current accepted parameters are:
Option | Description |
---|---|
max_neighbors | INT64 - Default: 60. Maximum numbers of neighbors to consider candidate matches to a given location. Only this many closest neighbors will be taken into account when looking for a location's match. Increasing this value will increase the computation time and requirements of the procedure. |
max_distance | INT64 - Default: 500. Maximum distance in meters to consider candidate matches to a given location. Only locations within this many meters of each other will be considered candidates for matching. Increasing this value will increase the computation time and requirements of the procedure. |
weights | JSON-formatted STRING containing a mapping with two values, used to weight the average of the two different similarities. If provided, both values should be provided, since the normalization could distort the weights in case a single one was provided. |
weights.text_similarity | FLOAT64 - Default: 0.7. Weight of the text similarity computed, defaults to 0.7 . The user could like to increase this weight if there is a higher confidence in the text matching than the locations being accurate. |
weights.proximity | FLOAT64 - Default: 0.3. Weight of the proximity similarity computed. The user could like to increase this weight if there is a higher confidence in the accuracy of locations than the text being in the same representation. |
Output
The procedure outputs the following:
- The results table, named after
table_name
, contains a row per location in the current universe, as well as their best match found in the total universe, ordered by descending similarity. It follows this schema:current_universe_id
,STRING
is the ID found in the columncurrent_universe_id_column
within the results ofcurrent_universe
.total_universe_id
,STRING
is the ID found in the columntotal_universe_id_column
within the results oftotal_universe
. It could beNULL
if no candidate was found within the maximum radius defined byoptions.max_distance
.proximity
,FLOAT64
represents the proximity score, computed using the distance distribution of the candidate neighbors taken into account for matching. It lies within the [0, 1] interval.text_similarity
,FLOAT64
represents the Dice Coefficient similarity, computed using the names provided for each location. It lies within the [0, 1] interval.similarity
,FLOAT64
represents the aggregated similarity; a weighted average of theproximity
andtext_similarity
values as per the weights inoptions.weights
.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.UNIVERSE_MATCHING(
-- Current universe arguments
'<my-project>.<my-dataset>.current_universe',
'poi_id',
'poi_name',
'poi_location',
-- Total universe arguments
'<my-project>.<my-dataset>.total_universe',
'poi_id',
'poi_name',
'poi_location',
-- Universe matching arguments
'<my-project>.<my-dataset>.universe_matching_results',
'''
{
"max_neighbors": 60,
"max_distance": 500,
"weights": {"text_similarity": 0.7, "proximity": 0.3}
}
'''
);
-- Table `<my-project>.<my-dataset>.universe_matching_results` will be created
CALL `carto-un-eu`.carto.UNIVERSE_MATCHING(
-- Current universe arguments
'<my-project>.<my-dataset>.current_universe',
'poi_id',
'poi_name',
'poi_location',
-- Total universe arguments
'<my-project>.<my-dataset>.total_universe',
'poi_id',
'poi_name',
'poi_location',
-- Universe matching arguments
'<my-project>.<my-dataset>.universe_matching_results',
'''
{
"max_neighbors": 60,
"max_distance": 500,
"weights": {"text_similarity": 0.7, "proximity": 0.3}
}
'''
);
-- Table `<my-project>.<my-dataset>.universe_matching_results` will be created
CALL carto.UNIVERSE_MATCHING(
-- Current universe arguments
'<my-project>.<my-dataset>.current_universe',
'poi_id',
'poi_name',
'poi_location',
-- Total universe arguments
'<my-project>.<my-dataset>.total_universe',
'poi_id',
'poi_name',
'poi_location',
-- Universe matching arguments
'<my-project>.<my-dataset>.universe_matching_results',
'''
{
"max_neighbors": 60,
"max_distance": 500,
"weights": {"text_similarity": 0.7, "proximity": 0.3}
}
'''
);
-- Table `<my-project>.<my-dataset>.universe_matching_results` will be created
Additional examples
UNIVERSE_MATCHING_REPORT(
total_universe,
total_universe_id_column,
matched_universe,
table_prefix,
options
)
Description
Procedure to generate a report on the output of an
UNIVERSE_MATCHING
execution, as well as some insightful results out of it.This procedure requires a previous
UNIVERSE_MATCHING
execution, so please do refer to the documentation to understand their inputs and the different universe metaphors.Input
total_universe
:STRING
SQL query or table name to provide the total universe, that is, the dataset to be taken as a reference for the matching. Ideally, it should be the same one that was fed to theUNIVERSE_MATCHING
procedure.total_universe_id_column
:STRING
name of the column containing unique ID values for each of the total universe locations in the results of thetotal_universe
. Any type is valid as long as it can be used in grouping functions.table_prefix
:STRING
qualified prefix desired for all the output tables.options
(optional): JSON-formattedSTRING
containing a set of optional parameters. All the parameters inside this JSON can be overridden if the user needs to but it also provides a set of sensible default values. The current accepted parameters are:
Option | Description |
---|---|
min_similarity | FLOAT64 - Default: 0.7. Similarity threshold to consider a match of current-total locations reliable. All pairs whose similarity lies below this minimum value (or is NULL ) will not be taken into account during the report. |
Output
The procedure outputs the following tables:
- The matched universe filtered table, named
<table_prefix>_filtered
, contains a row per pair in thematched_universe
that surpasses the minimum similarity threshold (options.min_similarity
). Its schema is the same one as thematched_universe
. - The expansion universe table, named
<table_prefix>_expansion_universe
, contains all locations in thetotal_universe
that were not successfully matched in the<table_prefix>_filtered
table. This table represents all potential locations to be considered potential expansion points that are not yet covered. Its schema is the same astotal_universe
. - The report table, named
<table_prefix>_report
, and contains a single row with some useful metrics regarding the results of the analysis:current_universe
: the number of locations in the current universe.total_universe
: the number of locations in the total universe.matched_universe
: the number of locations successfully paired between the two universes over the similarity threshold.expansion_universe
: the number of locations in the expansion universe.market_penetration
: the percentage of the total universe currently covered by the filtered matched universe (in the [0, 1] interval)
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.UNIVERSE_MATCHING_REPORT(
-- Total universe arguments
'<my-project>.<my-dataset>.total_universe',
'poi_id',
-- Universe matching results
'<my-project>.<my-dataset>.universe_matching_results',
-- Report arguments
'<my-project>.<my-dataset>.universe_matching',
'''
{
"min_similarity": 0.7,
}
'''
);
-- These tables will be created:
-- - <my-project>.<my-dataset>.universe_matching_filtered
-- - <my-project>.<my-dataset>.universe_matching_expansion_universe
-- - <my-project>.<my-dataset>.universe_matching_report
CALL `carto-un-eu`.carto.UNIVERSE_MATCHING_REPORT(
-- Total universe arguments
'<my-project>.<my-dataset>.total_universe',
'poi_id',
-- Universe matching results
'<my-project>.<my-dataset>.universe_matching_results',
-- Report arguments
'<my-project>.<my-dataset>.universe_matching',
'''
{
"min_similarity": 0.7,
}
'''
);
-- These tables will be created:
-- - <my-project>.<my-dataset>.universe_matching_filtered
-- - <my-project>.<my-dataset>.universe_matching_expansion_universe
-- - <my-project>.<my-dataset>.universe_matching_report
CALL carto.UNIVERSE_MATCHING_REPORT(
-- Total universe arguments
'<my-project>.<my-dataset>.total_universe',
'poi_id',
-- Universe matching results
'<my-project>.<my-dataset>.universe_matching_results',
-- Report arguments
'<my-project>.<my-dataset>.universe_matching',
'''
{
"min_similarity": 0.7,
}
'''
);
-- These tables will be created:
-- - <my-project>.<my-dataset>.universe_matching_filtered
-- - <my-project>.<my-dataset>.universe_matching_expansion_universe
-- - <my-project>.<my-dataset>.universe_matching_report
Additional examples
FIND_SIMILAR_LOCATIONS(origin_query, target_query, index_column, pca_explained_variance_ratio, max_results, output_prefix)
Description
Procedure to obtain the similar locations for a given location (origin) on a set of external variables. These similar locations are selected from a set of candidates (target locations), that can be stores, restaurants, etc. Internally it works just like the
FIND_TWIN_AREAS
function, so please refer to its documentation for further examples and details of the implementation.The output similar locations are those of the target candidates considered to be the most similar to the origin locations, based on the values of a set of variables. Only variables with numerical values are supported. Both origin and target locations should be uniquely identified by an index column.
This procedure expects to receive features that characterize each of the locations, for which we recommend taking into account:
-
GENERATE_TRADE_AREAS
procedure to generate different types of service areas for both origin and target locations, -
ENRICH_POLYGONS
orDATAOBS_ENRICH_POLYGONS
procedure to enrich the defined trade areas resulting in the format expected for this procedure.
Input
origin_query
:STRING
query to provide the origin location (identified with itsindex_column
value) and its associated data columns. NoNULL
values should be contained in any of the data columns provided.target_query
:STRING
query to provide the target locations (uniquely identified byindex_column
) and their associated data columns, e.g.SELECT * FROM <project>.<dataset>.<target_table>
. The data columns should be equal to those provided in theorigin_query
, otherwise the procedure will fail. Locations with anyNULL
values will be excluded from the analysis.index_column
:STRING
name of the index column for both theorigin_query
and thetarget_query
. This index has to uniquely identify each of the individual locations in the queries.pca_explained_variance_ratio
:FLOAT64
of the explained variance retained in the PCA analysis. It defaults to 0.9 if set toNULL
.max_results
:INT64
of the maximum number of similar locations returned. If set toNULL
, all target locations are returned.output_prefix
:STRING
destination and prefix for the output tables. It must contain the project, dataset and prefix:<project>.<dataset>.<prefix>
.
Output
The procedure outputs the following:
- Similar locations model, named
<project>.<dataset>.<prefix>_model
. Please note that the model computation only depends on thetarget_query
and therefore the same model can be used if the procedure is re-run for a differentorigin_query
. To allow for this scenario in which the model is reused, if the output model already exists, it won't be recomputed. To avoid this behavior, simply choose a different<prefix>
in theoutput_prefix
parameter. - Results table, named
<project>.<dataset>.<prefix>_<origin_index>_results
, containing in each row the index of the target locations (index_column
) and its associatedsimilarity_score
andsimilarity_skill_score
. Thesimilarity_score
corresponds to the distance between the origin and target location: thesimilarity_skill_score
for a given target locations*t*
is computed as1 - similarity_score(*t*) / similarity_score(<*t*>)
, where<*t*>
is the average target location, computed by averaging each feature for all the target locations. Thissimilarity_skill_score
represents a relative measure: the score will be positive if and only if the target location is more similar to the origin than the mean vector data, with a score of 1 meaning perfect matching or zero distance. Therefore, a target location with a larger score will be more similar to the origin under this scoring rule.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.FIND_SIMILAR_LOCATIONS(
-- Origin location
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id = '001'
""",
-- Target locations (candidates)
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id <> '001'
""",
-- Similar locations model inputs
'location_id',
0.90,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: location_id, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_SIMILAR_LOCATIONS(
-- Origin location
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id = '001'
""",
-- Target locations (candidates)
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id <> '001'
""",
-- Similar locations model inputs
'location_id',
0.90,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: location_id, similarity_score, similarity_skill_score
CALL carto.FIND_SIMILAR_LOCATIONS(
-- Origin location
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id = '001'
""",
-- Target locations (candidates)
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id <> '001'
""",
-- Similar locations model inputs
'location_id',
0.90,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: location_id, similarity_score, similarity_skill_score
FIND_SIMILAR_LOCATIONS_WEIGHTED(origin_query, target_query, index_column, weights, max_results, output_prefix)
Description
Procedure to obtain the similar locations for a given location (origin) on a set of weighted external variables. These similar locations are selected from a set of candidates (target locations), that can be stores, restaurants, etc. The function is similar to the
FIND_SIMILAR_LOCATIONS
, but it includes user-defined weights for the variables. Internally it works just like the FIND_TWIN_AREAS_WEIGHTED
function, so please refer to its documentation for further examples and details of the implementation.The output similar locations are those of the target query considered to be the most similar to the origin location, based on the values of a set of variables. Only variables with numerical values are supported. Both origin and target locations should be uniquely identified by an index column.
This procedure expects to receive features that characterize each of the locations, for which we recommend taking into account:
-
GENERATE_TRADE_AREAS
procedure to generate different types of service areas for both origin and target locations, -
ENRICH_POLYGONS
orDATAOBS_ENRICH_POLYGONS
procedure to enrich the defined trade areas resulting in the format expected for this procedure.
Input
origin_query
:STRING
query to provide the origin location (identified with itsindex_column
value) and its associated data columns. NoNULL
values should be contained in any of the data columns provided.target_query
:STRING
query to provide the target locations (uniquely identified byindex_column
) and their associated data columns, e.g.SELECT * FROM <project>.<dataset>.<target_table>
. The data columns should be equal to those provided in theorigin_query
, otherwise the procedure will fail. Locations with anyNULL
values will be excluded from the analysis.index_column
:STRING
name of the index column for both theorigin_query
and thetarget_query
. This index has to uniquely identify each of the individual locations in the queries.weights
:ARRAY<STRUCT<name STRING, value FLOAT64>>
the weights on the features. If set toNULL
, then all features are treated equally. This parameter is considered only if the length of weights is greater or equal than one. The sum of weights must be less than or equal to 1. If less weights than the number of features are provided, then for the undefined features, the remaining 1 * sum(weights) is distributed evenly.max_results
:INT64
of the maximum number of similar locations returned. If set toNULL
, all target locations are returned.output_prefix
:STRING
destination and prefix for the output tables. It must contain the project, dataset and prefix:<project>.<dataset>.<prefix>
.
Output
The procedure outputs the following:
- Results table, named
<project>.<dataset>.<prefix>_<origin_index>_results
, containing in each row the index of the target locations (index_column
) and its associatedsimilarity_score
andsimilarity_skill_score
. Thesimilarity_score
corresponds to the distance between the origin and target location taking into account the user defined weights; thesimilarity_skill_score
for a given target location*t*
is computed as1 - similarity_score(*t*) / similarity_score(<*t*>)
, where<*t*>
is the average target location, computed by averaging each feature for all the target locations. Thissimilarity_skill_score
represents a relative measure: the score will be positive if and only if the target location is more similar to the origin than the mean vector data, with a score of 1 meaning perfect matching or zero distance. Therefore, a target location with a larger score will be more similar to the origin under this scoring rule.
Example
carto-un
carto-un-eu
manual
CALL `carto-un`.carto.FIND_SIMILAR_LOCATIONS_WEIGHTED(
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id = '001'
""",
-- Target locations (candidates)
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id <> '001'
""",
-- Similar locations model inputs
'location_id',
NULL,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: location_id, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_SIMILAR_LOCATIONS_WEIGHTED(
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id = '001'
""",
-- Target locations (candidates)
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id <> '001'
""",
-- Similar locations model inputs
'location_id',
NULL,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: location_id, similarity_score, similarity_skill_score
CALL carto.FIND_SIMILAR_LOCATIONS_WEIGHTED(
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id = '001'
""",
-- Target locations (candidates)
"""
SELECT * FROM `cartobq.docs.similar_locations_enriched`
WHERE location_id <> '001'
""",
-- Similar locations model inputs
'location_id',
NULL,
NULL,
'my-project.my-dataset.my-prefix'
);
-- Table `<my-project>.<my-dataset>.<output-prefix>_{ID}_results` will be created
-- with the column: location_id, similarity_score, similarity_skill_score
CREATE_SPATIAL_SCORE(input_query, index_column, output, options)
Description
This procedure allows to derive spatial scores using both point geometries and polygon geometries. It ranks stores, merchants (point locations) or areas (polygons) based on a spatial composite score that measures attractiveness, risk, sales potential, etc. based on some business criteria (both internal and external data). The procedure combines (spatial) variables into a meaningful composite score by first scaling each individual variable between 0 and 1 and then aggregating them linearly through a weighted average. The score is provided on a 0-to-100 scale. Rows with a
NULL
value in any of the individual variables are dropped. For a more advanced, customizable version of this procedure, please refer to CREATE_SPATIAL_COMPOSITE_UNSUPERVISED
.This procedure expects to receive features that characterize each of the geometries. If geometries are point locations, we recommend taking into account:
-
GENERATE_TRADE_AREAS
procedure to generate different types of service areas for both origin and target locations, -
ENRICH_POLYGONS
orDATAOBS_ENRICH_POLYGONS
procedure to enrich the defined trade areas resulting in the format expected for this procedure.
Input parameters
input_query
:STRING
the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g.<my-project>.<my-dataset>.<my-table>
.index_column
:STRING
the name of the column with the unique geographic identifier of each point of sales or area.output_table
:STRING
table name of the form<my-project>.<my-dataset>.<my-table>
to store the results.options
:STRING
containing a valid JSON with the different options, which can be set toNULL
. Valid options are described below.
Option | Description |
---|---|
weights | STRUCT - The (optional) weights for each variable used to compute the spatial composite passed as {"name":value, …} . If specified, the sum of the weights must be lower than 1. If no weights are specified, equal weights are assumed. If weights are specified only for some variables and the sum of weights is less than 1, the remainder is distributed equally between the remaining variables. If weights are specified for all the variables and the sum of weights is less than 1, the remainder is distributed equally between all the variables. |
nbuckets | INT64 - Default: 5. The selected number of groups to bucketize the resulting score into buckets of equal widths. |
Output
The results are stored in the table named
<output_table>
, which contains the following columns:ranking
:INT64
the ranking position or the merchant based on the derived composite score, from 1 (maximum score) to n (minimum score), where n is the total number of merchants considered.index_column
: the unique geographic identifier. The type of this column depends on the type ofindex_column
ininput_query
.spatial_score
:FLOAT64
the value of the composite score. The type of this column isFLOAT64
if the score is not discretized andINT64
otherwise.spatial_score_buckets
:INT64
the value of the discretized composite score.
The procedure also returns a lookup table named
<output_table>_lookup_table
with the following columns:lower_bound
:FLOAT64
the lowest value of the composite score within each bucket.upper_bound
:FLOAT64
the largest value of the composite score within each bucket.spatial_score_buckets
:INT64
the unique values of the bucketized composite score, from 1 tonbuckets
.
Example
CALL carto.CREATE_SPATIAL_SCORE(
'SELECT pos_id, footfall, sentiment, positive_opinions_count, footfall_early_morning, footfall_early_afternoon, horeca_count_100m FROM `cartobq.docs.cpg_spatial_scoring_input`',
'pos_id',
'<my-project>.<my-dataset>.<table-name>',
'''{
"weights":{"sentiment":0.5}
}
'''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
Additional examples
CREATE_SPATIAL_PERFORMANCE_SCORE(input_query, index_column, business_kpi_col, output, options)
Description
Procedure to rank stores or merchants based on a spatial composite score that detects if locations are performing as expected (or if they are under/over-performing) in terms of some business criteria (both internal and external data) and a performance KPI (sales, volume, or margin). The score is derived as the residuals of an XGBoost regression model, defined as the observed performance KPI minus the predicted one. The performance KPI should be measurable and correlated with the set of variables defining the score. Rows with a
NULL
value in any of the individual variables are dropped. For a more advanced, customizable version of this procedure, please refer to CREATE_SPATIAL_COMPOSITE_SUPERVISED
.This procedure expects to receive features that characterize each of the locations, for which we recommend taking into account:
-
GENERATE_TRADE_AREAS
procedure to generate different types of service areas for both origin and target locations, -
ENRICH_POLYGONS
orDATAOBS_ENRICH_POLYGONS
procedure to enrich the defined trade areas resulting in the format expected for this procedure.
Input parameters
input_query
:STRING
the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g.<my-project>.<my-dataset>.<my-table>
.index_column
:STRING
the name of the column with the unique geographic identifier of each point of sales.business_kpi_col
:STRING
column with the performance business KPI to be modeled (i.e sales volume).output_table
:STRING
table name of the form<my-project>.<my-dataset>.<my-table>
to store the results.options
:STRING
containing a valid JSON with the different options, which can be set toNULL
. Valid options are described below.
Option | Description |
---|---|
r2_thr | FLOAT64 - Default: 0.5. The minimum acceptable R2 model score. If the R2 of the regression model is lower than this threshold this implies poor fitting and a warning is raised. |
nbuckets | INT64 - Default: 5. The selected number of groups to bucketize the resulting score using a zero-centered equal widths approach. The lower and upper limits are derived from the outliers-removed maximum of the absolute values of the score. |
Return type
The results are stored in the table named
<output_table>
, which contains the following columns:ranking
:INT64
the ranking position or the merchant based on the derived composite score.index_column
: the unique geographic identifier. The type of this column depends on the type ofindex_column
ininput_query
.spatial_score
:FLOAT64
the value of the composite score. The type of this column isFLOAT64
if the score is not discretized andINT64
otherwise.spatial_score_buckets
:INT64
the value of the discretized composite score.
The procedure also returns a lookup table named
<output_table>_lookup_table
with the following columns:lower_bound
:FLOAT64
the lowest value of the composite score within each bucket.upper_bound
:FLOAT64
the largest value of the composite score within each bucket.spatial_score_buckets
:INT64
the unique values of the bucketized composite score, from 1 tonbuckets
.
Example
CALL carto.CREATE_SPATIAL_PERFORMANCE_SCORE(
'SELECT pos_id, footfall, sentiment, positive_opinions_count, footfall_early_morning, footfall_early_afternoon, horeca_count_100m, sales_volume FROM `cartobq.docs.cpg_spatial_scoring_input`',
'pos_id',
'sales_volume',
'<my-project>.<my-dataset>.<table-name>',
'''{
"r2_thr":0.4
}
'''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
Additional examples
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 960401.

Last modified 1mo ago