cpg


This module contains procedures to perform spatial analysis to solve specific use-cases for the Consumer Packaged Goods (CPG) industry, such as customer segmentation.

GENERATE_TRADE_AREAS

GENERATE_TRADE_AREAS(customers_query, method, options, output_prefix)

Description

This procedure generates the trade areas for each location specified based on the method and the options provided. Four methods are available: buffer, kring-h3, kring-quadbin and isoline.

For the isoline method, this procedure requires authorization credentials. Two parameters are needed: api_base_url and api_access_token. Both the API base URL and your API access token can be found in the Developers section of the CARTO user interface. Please check our documentation for more details.

Input parameters

  • customers_query: STRING query with store id and location. It must contain the columns store_id (store unique id) and geom (the geographical point of the store). The values of these columns cannot be NULL.

  • method: STRING indicates the method of trade area generation. Four options available: buffer, kring-h3, kring-quadbin and isoline. This method applies to all locations provided.

  • options: JSON string containing the required parameters for the specified method. For buffer: {buffer: radius - FLOAT64}; for kring-h3: {resolution: resolution - INT64, kring: number of layers - INT64}; for kring-quadbin: {resolution: zoom level - INT64, kring: number of layers - INT64}; for isoline: {mode: type of transport ('walk' or 'car') - STRING, time: range of the isoline in seconds - INT64, api_base_url: URL of the API where the customer account is stored - STRING, api_access_token: customer's token for accessing the different API services - STRING}.

  • output_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and prefix. For example <my-project>.<my-dataset>.<output-prefix>.

Output

This procedure will output one table:

  • Table containing the columns store_id, geom, method and options. The output table can be found at the output destination with name <output-prefix>_trade_areas. Overall path <my-project>.<my-dataset>.<output-prefix>_trade_areas.

Example

CALL `carto-un`.carto.GENERATE_TRADE_AREAS(
    --customers_query
    '''SELECT store_id, geom FROM `<project>.<dataset>.<table_name_with_stores>`''',
    --method
    'buffer',
    --options
    "{'buffer':500.0}",
    --output_prefix
    '<my-project>.<my-dataset>.<output-prefix>'
);
CALL `carto-un-eu`.carto.GENERATE_TRADE_AREAS(
    --customers_query
    '''SELECT store_id, geom FROM `<project>.<dataset>.<table_name_with_stores>`''',
    --method
    'buffer',
    --options
    "{'buffer':500.0}",
    --output_prefix
    '<my-project>.<my-dataset>.<output-prefix>'
);
CALL carto.GENERATE_TRADE_AREAS(
    --customers_query
    '''SELECT store_id, geom FROM `<project>.<dataset>.<table_name_with_stores>`''',
    --method
    'buffer',
    --options
    "{'buffer':500.0}",
    --output_prefix
    '<my-project>.<my-dataset>.<output-prefix>'
);
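
For the isoline method, the options must also include the authorization credentials described above. Below is a minimal sketch, assuming 15-minute (900-second) car isolines and shown for the `carto-un` project (adapt the prefix to your region or deployment); the api_base_url and api_access_token values are placeholders for your own credentials:

CALL `carto-un`.carto.GENERATE_TRADE_AREAS(
    --customers_query
    '''SELECT store_id, geom FROM `<project>.<dataset>.<table_name_with_stores>`''',
    --method
    'isoline',
    --options: 15-minute car isolines; replace the credential placeholders
    "{'mode':'car', 'time':900, 'api_base_url':'<api-base-url>', 'api_access_token':'<api-access-token>'}",
    --output_prefix
    '<my-project>.<my-dataset>.<output-prefix>'
);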

CUSTOMER_SEGMENTATION_ANALYSIS_DATA

CUSTOMER_SEGMENTATION_ANALYSIS_DATA(customers_query, do_variables, do_source, custom_variables, custom_query, output_prefix)

Description

This procedure prepares the data needed to run the customer segmentation analysis, enriching the locations' trade areas with Data Observatory and/or custom data. It outputs:

  • A table with the data to be fed into the next procedure, with store_id, geom and all the features that will be considered for the segmentation.

  • Correlation table, where the Pearson correlation between each pair of features is included.

  • Descriptives table, where the count, mean, std, min, 10th, 25th, 50th, 75th, 90th percentiles, and max values per feature are included.

Input parameters

  • customers_query: STRING query with store id and location and any optional preprocessed feature. It must contain the columns store_id (store unique id) and geom (the geographical point of the store). The values of these columns cannot be NULL.

  • do_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> list of the Data Observatory variables (referenced by their slug) and their corresponding aggregation method (sum, avg, max, min, count) that will be used to enrich the provided trade areas of each location. It can be set to NULL.

  • do_source: STRING name of the location where the Data Observatory subscriptions of the user are stored, in <my-dataobs-project>.<my-dataobs-dataset> format. If only the <my-dataobs-dataset> is included, it uses the project carto-data by default. It can be set to NULL or ''.

  • custom_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> list with the columns of the custom_query and their corresponding aggregation method (sum, avg, max, min, count) that will be used to enrich the provided trade areas of each location. It can be set to NULL.

  • custom_query: STRING query that contains a geography column geom and the columns with the custom data that will be used to enrich the provided trade areas of each location. It can be set to NULL or ''.

  • output_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and prefix. For example <my-project>.<my-dataset>.<output-prefix>.

Output

This procedure will output three tables:

  • Table containing the store_id, geom and all features to be considered for the segmentation (i.e. pre-processed, from DO subscription, or features from other tables included by the user). The output table can be found at the output destination with name <output-prefix>_enrich. Overall path <my-project>.<my-dataset>.<output-prefix>_enrich.

  • Table containing the correlation amongst features. Three columns: feature 1, feature 2 and the correlation between feature 1 and feature 2. The output table can be found at the output destination with name <output-prefix>_correlation. Overall path <my-project>.<my-dataset>.<output-prefix>_correlation.

  • Table containing the descriptive statistics of the features. For each feature the count, mean, std, min, 10th, 25th, 50th, 75th, 90th percentiles and max values are calculated. The output table can be found at the output destination with name <output-prefix>_descriptives. Overall path <my-project>.<my-dataset>.<output-prefix>_descriptives.
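
For instance, a quick way to spot redundant features before running the segmentation is to query the correlation table for highly correlated pairs. The sketch below assumes the three columns are named feature_1, feature_2 and correlation; check the actual schema of your <output-prefix>_correlation table:

SELECT *
FROM `<my-project>.<my-dataset>.<output-prefix>_correlation`
-- keep only strongly correlated pairs (0.9 is an illustrative cutoff)
WHERE ABS(correlation) > 0.9
ORDER BY ABS(correlation) DESC;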

Example

CALL `carto-un`.carto.CUSTOMER_SEGMENTATION_ANALYSIS_DATA(
  R'''
  SELECT store_id, geom, feature_1, feature_2 ... FROM `<project>.<dataset>.<table_name_with_stores>`
  ''',
  -- Data Observatory enrichment
    [('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
    '<my-dataobs-project>.<my-dataobs-dataset>',
    -- Custom data enrichment
    [('var1', 'sum'), ('var2', 'avg')],
    '''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
  '<my-project>.<my-dataset>.<output-prefix>'
);
CALL `carto-un-eu`.carto.CUSTOMER_SEGMENTATION_ANALYSIS_DATA(
  R'''
  SELECT store_id, geom, feature_1, feature_2 ... FROM `<project>.<dataset>.<table_name_with_stores>`
  ''',
  -- Data Observatory enrichment
    [('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
    '<my-dataobs-project>.<my-dataobs-dataset>',
    -- Custom data enrichment
    [('var1', 'sum'), ('var2', 'avg')],
    '''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
  '<my-project>.<my-dataset>.<output-prefix>'
);
CALL carto.CUSTOMER_SEGMENTATION_ANALYSIS_DATA(
  R'''
  SELECT store_id, geom, feature_1, feature_2 ... FROM `<project>.<dataset>.<table_name_with_stores>`
  ''',
  -- Data Observatory enrichment
    [('POPCY_4534fac4', 'sum'), ('INCCYPCAP_7c8377cf', 'avg')],
    '<my-dataobs-project>.<my-dataobs-dataset>',
    -- Custom data enrichment
    [('var1', 'sum'), ('var2', 'avg')],
    '''SELECT var1, var2, geom FROM `<project>.<dataset>.custom_data`''',
  '<my-project>.<my-dataset>.<output-prefix>'
);

RUN_CUSTOMER_SEGMENTATION

RUN_CUSTOMER_SEGMENTATION(customers_data_table, number_of_clusters, pca_explain_ratio, output_prefix)

Description

This procedure runs the customer segmentation (a KMeans clustering) for one or more candidate numbers of clusters, using the data prepared by CUSTOMER_SEGMENTATION_ANALYSIS_DATA. It outputs:

  • A table with the store_id, geom, the number of clusters on that scenario/case, and the cluster the store belongs to.

  • A table with descriptive statistics of each feature for each scenario of different number of clusters and the specific cluster of the store.

  • A table with statistics of the KMeans performance for each number of clusters.

Input parameters

  • customers_data_table: STRING name of the table with the enriched customer data generated by the CUSTOMER_SEGMENTATION_ANALYSIS_DATA procedure, i.e. the <output-prefix>_enrich table.

  • number_of_clusters: ARRAY<INT64> list with the numbers of clusters to perform the segmentation with. It can be regarded as the different cases or scenarios for segmentation.

  • pca_explain_ratio: FLOAT64 ratio of the explained variance retained in the PCA analysis. It defaults to 0.9. If 0 or NULL is passed, PCA is not enabled and the raw data is passed to the clustering algorithm.

  • output_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and prefix. For example <my-project>.<my-dataset>.<output-prefix>.

Output

This procedure will output three tables:

  • Table containing the store_id, geom, one column per scenario (number of clusters), in which the cluster the store belongs to is included for this scenario. The output table can be found at the output destination with name <output-prefix>_clusters. Overall path <my-project>.<my-dataset>.<output-prefix>_clusters.

  • Table containing statistics of the clusterings. It contains the Davies-Bouldin index and the mean squared distance for each scenario (number of clusters). The output table can be found at the output destination with name <output-prefix>_clusters_stats. Overall path <my-project>.<my-dataset>.<output-prefix>_clusters_stats.

  • Table containing the descriptive statistics for each scenario (number of clusters). Eleven columns: scenario (number of clusters), cluster value, count, mean, std, min, 10th, 25th, 50th, 75th, 90th percentiles and max values. The output table can be found at the output destination with name <output-prefix>_clusters_descr. Overall path <my-project>.<my-dataset>.<output-prefix>_clusters_descr.
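
Since lower Davies-Bouldin values indicate better-separated clusters, the statistics table can help choose between scenarios. A minimal sketch, assuming the <output-prefix>_clusters_stats table exposes a column named davies_bouldin_index (check the actual schema of your table):

SELECT *
FROM `<my-project>.<my-dataset>.<output-prefix>_clusters_stats`
-- lower Davies-Bouldin index suggests better-separated clusters
ORDER BY davies_bouldin_index ASC;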

Example

CALL `carto-un`.carto.RUN_CUSTOMER_SEGMENTATION(
    '<project>.<dataset>.<table_name_with_stores>_enrich',
    [4,5],
    0.9,
    '<my-project>.<my-dataset>.<output-prefix>'
);
CALL `carto-un-eu`.carto.RUN_CUSTOMER_SEGMENTATION(
    '<project>.<dataset>.<table_name_with_stores>_enrich',
    [4,5],
    0.9,
    '<my-project>.<my-dataset>.<output-prefix>'
);
CALL carto.RUN_CUSTOMER_SEGMENTATION(
    '<project>.<dataset>.<table_name_with_stores>_enrich',
    [4,5],
    0.9,
    '<my-project>.<my-dataset>.<output-prefix>'
);

UNIVERSE_MATCHING

UNIVERSE_MATCHING(
  current_universe,
  current_universe_id_column,
  current_universe_name_column,
  current_universe_location_column,
  total_universe,
  total_universe_id_column,
  total_universe_name_column,
  total_universe_location_column,
  table_name,
  options
)

Description

Procedure to perform a hybrid text-and-distance fuzzy match between two universes. Although this procedure can match any two datasets that contain the required columns, throughout the documentation we use the common CPG metaphor of two universes:

  • A current universe comprised of active and known distributors that are already reselling a given product;

  • And a total universe containing all possible distributors that could distribute such product.

The procedure will then provide the matched universe: a table containing a mapping from each location in the current universe to the one found most similar within the possible candidates of the total universe.

To perform this match, a two-step approach is followed, taking into account both locations and names in the two different datasets:

  1. First, distance is taken into account, using two different user-provided thresholds: the maximum number of closest neighbors to check and the maximum distance between two points to be considered a candidate pair.

  2. Then, out of all the possible candidates, the best one is chosen based on the best final similarity value.

The similarity is a weighted average between the proximity (a value in the [0, 1] interval based on the potential candidates' distance distributions) and the text similarity (which also lies within the [0, 1] interval). The final table is ordered in descending similarity, which should help in case a similarity cutoff is to be made to discard mismatched pairs.
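
For example, with the default weights (0.7 for text similarity and 0.3 for proximity), a candidate pair with text_similarity = 0.9 and proximity = 0.5 yields similarity = 0.7 × 0.9 + 0.3 × 0.5 = 0.78.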

Input parameters

  • current_universe: STRING SQL query or table name to provide the current universe, that is, the dataset to be taken as a reference for the matching. The current universe is a representation of the current active providers to be matched and is usually the smallest dataset of the two. It should have (at least) a column that serves as a unique ID, a name column and a location column. If any of the rows have missing values in any of these columns, the rows will be ignored during the analysis.

  • current_universe_id_column: STRING name of the column containing unique ID values for each of the current universe locations in the results of the current_universe. Any type is valid as long as it can be used in grouping functions. Rows with missing values in the current_universe_id_column will be ignored during the procedure.

  • current_universe_name_column: STRING name of the column containing the names of each of the current universe locations in the results of the current_universe. Such column is expected to be in STRING format and will be matched against its total universe counterpart to compute the text similarity value. Rows with missing values in the current_universe_name_column will be ignored during the procedure.

  • current_universe_location_column: STRING name of the column containing the point geography of each of the current universe locations in the results of the current_universe. Such column is expected to be in GEOGRAPHY, POINT format, and will be matched against its total universe counterpart to compute the proximity score. Rows with missing values in the current_universe_location_column will be ignored during the procedure.

  • total_universe: STRING SQL query or table name to provide the total universe, that is, the dataset in which candidate matches will be searched. The total universe is a representation of the total set of vendors or distributors that can be related to the current CPG use case being treated and, therefore, all possible vendors to which the distribution could potentially be expanded. It should have (at least) a column that serves as a unique ID, a name column and a location column. If any of the rows have missing values in any of these columns, the rows will be ignored during the analysis.

  • total_universe_id_column: STRING name of the column containing unique ID values for each of the total universe locations in the results of the total_universe. Any type is valid as long as it can be used in grouping functions. Rows with missing values in the total_universe_id_column will be ignored during the procedure.

  • total_universe_name_column: STRING name of the column containing the names of each of the total universe locations in the results of the total_universe. Such column is expected to be in STRING format and will be matched against its current universe counterpart to compute the text similarity value. Rows with missing values in the total_universe_name_column will be ignored during the procedure.

  • total_universe_location_column: STRING name of the column containing the point geography of each of the total universe locations in the results of the total_universe. Such column is expected to be in GEOGRAPHY, POINT format, and will be matched against its current universe counterpart to compute the proximity score. Rows with missing values in the total_universe_location_column will be ignored during the procedure.

  • table_name: STRING qualified name of the table to store the final results of the match. A detailed explanation of the output table is given below.

  • options (optional): JSON-formatted STRING containing a set of optional parameters. All the parameters inside this JSON can be overridden if the user needs to, but it also provides a set of sensible default values. Current accepted parameters are:

    Option
    Description

    max_neighbors

    INT64 - Default: 60. Maximum number of neighbors to consider as candidate matches for a given location. Only this many closest neighbors will be taken into account when looking for a location's match. Increasing this value will increase the computation time and requirements of the procedure.

    max_distance

    INT64 - Default: 500. Maximum distance in meters to consider candidate matches to a given location. Only locations within this many meters of each other will be considered candidates for matching. Increasing this value will increase the computation time and requirements of the procedure.

    weights

    JSON-formatted STRING containing a mapping with two values, used to weight the average of the two different similarities. If provided, both values should be specified, since the normalization could distort the weights if only one of them were given.

    weights.text_similarity

    FLOAT64 - Default: 0.7. Weight of the text similarity. The user may want to increase this weight if there is higher confidence in the text matching than in the accuracy of the locations.

    weights.proximity

    FLOAT64 - Default: 0.3. Weight of the proximity similarity. The user may want to increase this weight if there is higher confidence in the accuracy of the locations than in the text being in the same representation.

Output

The procedure outputs the following:

  • The results table, named after table_name, contains a row per location in the current universe, as well as their best match found in the total universe, ordered by descending similarity. It follows this schema:

    • current_universe_id, STRING is the ID found in the column current_universe_id_column within the results of current_universe.

    • total_universe_id, STRING is the ID found in the column total_universe_id_column within the results of total_universe. It could be NULL if no candidate was found within the maximum radius defined by options.max_distance.

    • proximity, FLOAT64 represents the proximity score, computed using the distance distribution of the candidate neighbors taken into account for matching. It lies within the [0, 1] interval.

    • text_similarity, FLOAT64 represents the Dice Coefficient similarity, computed using the names provided for each location. It lies within the [0, 1] interval.

    • similarity, FLOAT64 represents the aggregated similarity; a weighted average of the proximity and text_similarity values as per the weights in options.weights.
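
For example, a similarity cutoff can be applied directly on the results table to discard unreliable matches (0.7 is an illustrative threshold, and the table name is the one passed as table_name):

SELECT *
FROM `<my-project>.<my-dataset>.universe_matching_results`
-- discard pairs below the chosen similarity cutoff
WHERE similarity >= 0.7
ORDER BY similarity DESC;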

Example

CALL `carto-un`.carto.UNIVERSE_MATCHING(
  -- Current universe arguments
  '<my-project>.<my-dataset>.current_universe',
  'poi_id',
  'poi_name',
  'poi_location',
  -- Total universe arguments
  '<my-project>.<my-dataset>.total_universe',
  'poi_id',
  'poi_name',
  'poi_location',
  -- Universe matching arguments
  '<my-project>.<my-dataset>.universe_matching_results',
  '''
  {
    "max_neighbors": 60,
    "max_distance": 500,
    "weights": {"text_similarity": 0.7, "proximity": 0.3}
  }
  '''
);

-- Table `<my-project>.<my-dataset>.universe_matching_results` will be created
CALL `carto-un-eu`.carto.UNIVERSE_MATCHING(
  -- Current universe arguments
  '<my-project>.<my-dataset>.current_universe',
  'poi_id',
  'poi_name',
  'poi_location',
  -- Total universe arguments
  '<my-project>.<my-dataset>.total_universe',
  'poi_id',
  'poi_name',
  'poi_location',
  -- Universe matching arguments
  '<my-project>.<my-dataset>.universe_matching_results',
  '''
  {
    "max_neighbors": 60,
    "max_distance": 500,
    "weights": {"text_similarity": 0.7, "proximity": 0.3}
  }
  '''
);

-- Table `<my-project>.<my-dataset>.universe_matching_results` will be created
CALL carto.UNIVERSE_MATCHING(
  -- Current universe arguments
  '<my-project>.<my-dataset>.current_universe',
  'poi_id',
  'poi_name',
  'poi_location',
  -- Total universe arguments
  '<my-project>.<my-dataset>.total_universe',
  'poi_id',
  'poi_name',
  'poi_location',
  -- Universe matching arguments
  '<my-project>.<my-dataset>.universe_matching_results',
  '''
  {
    "max_neighbors": 60,
    "max_distance": 500,
    "weights": {"text_similarity": 0.7, "proximity": 0.3}
  }
  '''
);

-- Table `<my-project>.<my-dataset>.universe_matching_results` will be created

Additional examples

UNIVERSE_MATCHING_REPORT

UNIVERSE_MATCHING_REPORT(
  total_universe,
  total_universe_id_column,
  matched_universe,
  table_prefix,
  options
)

Description

This procedure builds a report from the results of the UNIVERSE_MATCHING procedure: it filters the matched pairs by a minimum similarity threshold, derives the expansion universe of locations not yet covered, and computes summary metrics such as the market penetration.

Input parameters

  • total_universe: STRING SQL query or table name to provide the total universe, that is, the dataset to be taken as a reference for the matching. Ideally, it should be the same one that was fed to the UNIVERSE_MATCHING procedure.

  • total_universe_id_column: STRING name of the column containing unique ID values for each of the total universe locations in the results of the total_universe. Any type is valid as long as it can be used in grouping functions.

  • matched_universe: STRING SQL query or table name with the results of the UNIVERSE_MATCHING procedure, that is, the mapping between the current and total universes.

  • table_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and prefix. For example <my-project>.<my-dataset>.<output-prefix>.

  • options (optional): JSON-formatted STRING containing a set of optional parameters. All the parameters inside this JSON can be overridden if the user needs to, but it also provides a set of sensible default values. The current accepted parameters are:

    Option
    Description

    min_similarity

    FLOAT64 - Default: 0.7. Similarity threshold to consider a match of current-total locations reliable. All pairs whose similarity lies below this minimum value (or is NULL) will not be taken into account during the report.

Output

The procedure outputs the following tables:

  • The matched universe filtered table, named <my-project>.<my-dataset>.<output-prefix>_filtered, contains a row per pair in the matched_universe that surpasses the minimum similarity threshold (options.min_similarity). Its schema is the same one as the matched_universe.

  • The expansion universe table, named <my-project>.<my-dataset>.<output-prefix>_expansion_universe, contains all locations in the total_universe that were not successfully matched in the <my-project>.<my-dataset>.<output-prefix>_filtered table. This table represents all locations that could be considered potential expansion points and are not yet covered. Its schema is the same as total_universe.

  • The report table, named <my-project>.<my-dataset>.<output-prefix>_report, contains a single row with some useful metrics regarding the results of the analysis:

    • current_universe: the number of locations in the current universe.

    • total_universe: the number of locations in the total universe.

    • matched_universe: the number of locations successfully paired between the two universes over the similarity threshold.

    • expansion_universe: the number of locations in the expansion universe.

    • market_penetration: the percentage of the total universe currently covered by the filtered matched universe (in the [0, 1] interval).
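
For example, if 120 locations of a 400-location total universe are matched above the similarity threshold, then market_penetration = 120 / 400 = 0.3.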

Example

CALL `carto-un`.carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe arguments
  '<my-project>.<my-dataset>.total_universe',
  'poi_id',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Report arguments
  '<my-project>.<my-dataset>.universe_matching',
  '''
  {
    "min_similarity": 0.7,
  }
  '''
);

-- These tables will be created:
--   - <my-project>.<my-dataset>.universe_matching_filtered
--   - <my-project>.<my-dataset>.universe_matching_expansion_universe
--   - <my-project>.<my-dataset>.universe_matching_report
CALL `carto-un-eu`.carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe arguments
  '<my-project>.<my-dataset>.total_universe',
  'poi_id',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Report arguments
  '<my-project>.<my-dataset>.universe_matching',
  '''
  {
    "min_similarity": 0.7,
  }
  '''
);

-- These tables will be created:
--   - <my-project>.<my-dataset>.universe_matching_filtered
--   - <my-project>.<my-dataset>.universe_matching_expansion_universe
--   - <my-project>.<my-dataset>.universe_matching_report
CALL carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe arguments
  '<my-project>.<my-dataset>.total_universe',
  'poi_id',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Report arguments
  '<my-project>.<my-dataset>.universe_matching',
  '''
  {
    "min_similarity": 0.7,
  }
  '''
);

-- These tables will be created:
--   - <my-project>.<my-dataset>.universe_matching_filtered
--   - <my-project>.<my-dataset>.universe_matching_expansion_universe
--   - <my-project>.<my-dataset>.universe_matching_report

Additional examples

CREATE_SPATIAL_SCORE

CREATE_SPATIAL_SCORE(input_query, index_column, output_table, options)

Description

This procedure computes a spatial composite score for each point of sales or area, combining the input variables into a single score, optionally weighted, and bucketizing the result into groups of equal widths. It expects to receive features that characterize each of the geometries; if the geometries are point locations, we recommend taking into account variables such as those used in the example below (e.g. footfall, sentiment or nearby points-of-interest counts).

Input parameters

  • input_query: STRING the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. <my-project>.<my-dataset>.<my-table>.

  • index_column: STRING the name of the column with the unique geographic identifier of each point of sales or area.

  • output_table: STRING table name of the form <my-project>.<my-dataset>.<my-table> to store the results.

  • options: STRING containing a valid JSON with the different options, which can be set to NULL. Valid options are described below.

    Option
    Description

    weights

    STRUCT - The (optional) weights for each variable used to compute the spatial composite passed as {"name":value, …}. If specified, the sum of the weights must be lower than 1. If no weights are specified, equal weights are assumed. If weights are specified only for some variables and the sum of weights is less than 1, the remainder is distributed equally between the remaining variables. If weights are specified for all the variables and the sum of weights is less than 1, the remainder is distributed equally between all the variables.

    nbuckets

    INT64 - Default: 5. The selected number of groups to bucketize the resulting score into buckets of equal widths.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • ranking: INT64 the ranking position of the merchant based on the derived composite score, from 1 (maximum score) to n (minimum score), where n is the total number of merchants considered.

  • index_column: the unique geographic identifier. The type of this column depends on the type of index_column in input_query.

  • spatial_score: FLOAT64 the value of the composite score.

  • spatial_score_buckets: INT64 the value of the discretized composite score.

The procedure also returns a lookup table named <output_table>_lookup_table with the following columns:

  • lower_bound: FLOAT64 the lowest value of the composite score within each bucket.

  • upper_bound: FLOAT64 the largest value of the composite score within each bucket.

  • spatial_score_buckets: INT64 the unique values of the bucketized composite score, from 1 to nbuckets.
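
For example, the lookup table can be joined back to the results to attach the score range of each bucket (a minimal sketch using the output tables described above):

SELECT s.ranking, s.spatial_score, s.spatial_score_buckets, l.lower_bound, l.upper_bound
FROM `<my-project>.<my-dataset>.<table-name>` s
JOIN `<my-project>.<my-dataset>.<table-name>_lookup_table` l
  USING (spatial_score_buckets)
ORDER BY s.ranking;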

Example

CALL carto.CREATE_SPATIAL_SCORE(
    'SELECT pos_id, footfall, sentiment, positive_opinions_count, footfall_early_morning, footfall_early_afternoon, horeca_count_100m FROM `cartobq.docs.cpg_spatial_scoring_input`',
    'pos_id',
    '<my-project>.<my-dataset>.<table-name>',
    '''{
        "weights":{"sentiment":0.5}
    }
    '''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created

Additional examples

CREATE_SPATIAL_PERFORMANCE_SCORE

CREATE_SPATIAL_PERFORMANCE_SCORE(input_query, index_column, business_kpi_col, output_table, options)

Description

This procedure computes a spatial performance score for each point of sales by modeling a business KPI (e.g. sales volume) as a function of the provided spatial features with a regression model. It expects to receive features that characterize each of the locations, for which we recommend taking into account variables such as those used in the example below.

Input parameters

  • input_query: STRING the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. <my-project>.<my-dataset>.<my-table>.

  • index_column: STRING the name of the column with the unique geographic identifier of each point of sales.

  • business_kpi_col: STRING name of the column with the performance business KPI to be modeled (e.g. sales volume).

  • output_table: STRING table name of the form <my-project>.<my-dataset>.<my-table> to store the results.

  • options: STRING containing a valid JSON with the different options, which can be set to NULL. Valid options are described below.

    Option
    Description

    r2_thr

    FLOAT64 - Default: 0.5. The minimum acceptable R2 model score. If the R2 of the regression model is lower than this threshold, this implies a poor fit and a warning is raised.

    nbuckets

    INT64 - Default: 5. The selected number of groups to bucketize the resulting score using a zero-centered equal widths approach. The lower and upper limits are derived from the outliers-removed maximum of the absolute values of the score.

Return type

The results are stored in the table named <output_table>, which contains the following columns:

  • ranking: INT64 the ranking position of the merchant based on the derived composite score.

  • index_column: the unique geographic identifier. The type of this column depends on the type of index_column in input_query.

  • spatial_score: FLOAT64 the value of the composite score.

  • spatial_score_buckets: INT64 the value of the discretized composite score.

The procedure also returns a lookup table named <output_table>_lookup_table with the following columns:

  • lower_bound: FLOAT64 the lowest value of the composite score within each bucket.

  • upper_bound: FLOAT64 the largest value of the composite score within each bucket.

  • spatial_score_buckets: INT64 the unique values of the bucketized composite score, from 1 to nbuckets.

Example

CALL carto.CREATE_SPATIAL_PERFORMANCE_SCORE(
    'SELECT pos_id, footfall, sentiment, positive_opinions_count, footfall_early_morning, footfall_early_afternoon, horeca_count_100m, sales_volume FROM `cartobq.docs.cpg_spatial_scoring_input`',
    'pos_id',
    'sales_volume',
    '<my-project>.<my-dataset>.<table-name>',
    '''{
        "r2_thr":0.4
    }
    '''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created

Additional examples

BUILD_SIMILAR_LOCATIONS_MODEL

BUILD_SIMILAR_LOCATIONS_MODEL(origin_query, target_query, index_column, output_prefix, options)

Description

This procedure builds a Principal Component Analysis (PCA) model from the features of the target cells, together with the transformed origin and target data tables, to be used afterwards by the FIND_SIMILAR_LOCATIONS procedure.

Input parameters

  • origin_query: STRING the query or the fully qualified name of the table containing the origin cells data.

  • target_query: STRING the query or the fully qualified name of the table containing the target cells data.

  • index_column: STRING the name of the index column.

  • output_prefix: STRING destination prefix for the output tables. It must contain the project, dataset and prefix. For example <my-project>.<my-dataset>.<output-prefix>.

  • options: STRING the JSON string containing the available options as described in the table below.

    Option
    Description

    categorical_variables

    ARRAY<STRING> The array containing the names of the categorical (a.k.a. qualitative) columns.

    model_options

    JSON object with the options passed to the underlying BigQuery ML PCA model, such as NUM_PRINCIPAL_COMPONENTS (INT64), PCA_EXPLAINED_VARIANCE_RATIO (FLOAT64) and PCA_SOLVER (STRING).

Return type

The procedure will output the following:

  • Target data table: contains the transformed data for the target cells, which will be used to create the PCA model. The name of the table includes the suffix _target_data, for example <my-project>.<my-dataset>.<output-prefix>_target_data.

  • Origin data table: contains the transformed data for the origin cells. The name of the table includes the suffix _origin_data, for example <my-project>.<my-dataset>.<output-prefix>_origin_data.

  • The PCA model. The model name includes the suffix _model, for example <my-project>.<my-dataset>.<output-prefix>_model.
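
Once created, the PCA model can be inspected with standard BigQuery ML functions; for instance, a sketch using ML.PRINCIPAL_COMPONENT_INFO to review the variance explained by each retained component:

SELECT *
FROM ML.PRINCIPAL_COMPONENT_INFO(MODEL `<my-project>.<my-dataset>.<output-prefix>_model`);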

Examples

CALL `carto-un`.carto.BUILD_SIMILAR_LOCATIONS_MODEL
(
    '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_w_mixed` WHERE urbanity_66f7575b_max = "rural"''',
    '''SELECT * FROM `cartobq.docs.similar_locations_target_enriched_w_mixed`''',
    'quadbin',
    '<my-project>.<my-dataset>.<output-prefix>',
    '''{
        "categorical_variables":["urbanity_66f7575b_max"],
        "model_options":{
            "NUM_PRINCIPAL_COMPONENTS":3
        }
    }'''
)
-- Tables `<my-project>.<my-dataset>.<output-prefix>_target_data` and `<my-project>.<my-dataset>.<output-prefix>_origin_data` and model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
CALL `carto-un-eu`.carto.BUILD_SIMILAR_LOCATIONS_MODEL
(
    '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_w_mixed` WHERE urbanity_66f7575b_max = "rural"''',
    '''SELECT * FROM `cartobq.docs.similar_locations_target_enriched_w_mixed`''',
    'quadbin',
    '<my-project>.<my-dataset>.<output-prefix>',
    '''{
        "categorical_variables":["urbanity_66f7575b_max"],
        "model_options":{
            "NUM_PRINCIPAL_COMPONENTS":3
        }
    }'''
)
-- Tables `<my-project>.<my-dataset>.<output-prefix>_target_data` and `<my-project>.<my-dataset>.<output-prefix>_origin_data` and model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
CALL carto.BUILD_SIMILAR_LOCATIONS_MODEL
(
    '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_w_mixed` WHERE urbanity_66f7575b_max = "rural"''',
    '''SELECT * FROM `cartobq.docs.similar_locations_target_enriched_w_mixed`''',
    'quadbin',
    '<my-project>.<my-dataset>.<output-prefix>',
    '''{
        "categorical_variables":["urbanity_66f7575b_max"],
        "model_options":{
            "NUM_PRINCIPAL_COMPONENTS":3
        }
    }'''
)
-- Tables `<my-project>.<my-dataset>.<output-prefix>_target_data` and `<my-project>.<my-dataset>.<output-prefix>_origin_data` and model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
CALL `carto-un`.carto.BUILD_SIMILAR_LOCATIONS_MODEL
(
    '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_w_mixed_h3` WHERE urbanity_66f7575b_max = "rural"''',
    '''SELECT * FROM `cartobq.docs.similar_locations_target_enriched_w_mixed_h3`''',
    'h3',
    '<my-project>.<my-dataset>.<output-prefix>',
    '''{
        "categorical_variables":["urbanity_66f7575b_max"],
        "model_options":{
            "NUM_PRINCIPAL_COMPONENTS":3
        }
    }'''
)
-- Tables `<my-project>.<my-dataset>.<output-prefix>_target_data` and `<my-project>.<my-dataset>.<output-prefix>_origin_data` and model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
CALL `carto-un-eu`.carto.BUILD_SIMILAR_LOCATIONS_MODEL
(
    '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_w_mixed_h3` WHERE urbanity_66f7575b_max = "rural"''',
    '''SELECT * FROM `cartobq.docs.similar_locations_target_enriched_w_mixed_h3`''',
    'h3',
    '<my-project>.<my-dataset>.<output-prefix>',
    '''{
        "categorical_variables":["urbanity_66f7575b_max"],
        "model_options":{
            "NUM_PRINCIPAL_COMPONENTS":3
        }
    }'''
)
-- Tables `<my-project>.<my-dataset>.<output-prefix>_target_data` and `<my-project>.<my-dataset>.<output-prefix>_origin_data` and model `<my-project>.<my-dataset>.<output-prefix>_model` will be created
CALL carto.BUILD_SIMILAR_LOCATIONS_MODEL
(
    '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_w_mixed_h3` WHERE urbanity_66f7575b_max = "rural"''',
    '''SELECT * FROM `cartobq.docs.similar_locations_target_enriched_w_mixed_h3`''',
    'h3',
    '<my-project>.<my-dataset>.<output-prefix>',
    '''{
        "categorical_variables":["urbanity_66f7575b_max"],
        "model_options":{
            "NUM_PRINCIPAL_COMPONENTS":3
        }
    }'''
)
-- Tables `<my-project>.<my-dataset>.<output-prefix>_target_data` and `<my-project>.<my-dataset>.<output-prefix>_origin_data` and model `<my-project>.<my-dataset>.<output-prefix>_model` will be created

FIND_SIMILAR_LOCATIONS

FIND_SIMILAR_LOCATIONS(similar_locations_model, index_column, output_table, options)

Description

This procedure finds, among a set of target cells, the locations most similar to a given origin location, based on the PCA model previously created with the BUILD_SIMILAR_LOCATIONS_MODEL procedure.

Input parameters

  • similar_locations_model: STRING the fully qualified name of the Principal Component Analysis model based on the target data.

  • index_column: STRING the name of the index column.

  • output_table: STRING table name of the form <my-project>.<my-dataset>.<my-table> to store the results.

  • options: STRING the JSON string containing the available options as described in the table below.

    Option
    Description

    origin_index

    STRING the index of the origin cell. Must be specified if origin_coords is not specified.

    origin_coords

    ARRAY<FLOAT64> the longitude and the latitude of the origin location. Must be specified if origin_index is not specified.

    max_results

    INT64 the maximum number of results to be returned. If this is larger than the number of target cells with a positive similarity skill score, only those cells are returned.

Return type

A table containing in each row the index of the target cell (index_column) and its associated similarity_score and similarity_skill_score. The similarity_score corresponds to the distance between the origin and target cell in the Principal Component (PC) Scores space; the similarity_skill_score for a given target cell *t* is computed as 1 - similarity_score(*t*) / similarity_score(<*t*>), where <*t*> is the average target cell, computed by averaging each retained PC score over all the target cells. This similarity_skill_score is a relative measure: it is positive if and only if the target cell is more similar to the origin than the average target cell is, with a score of 1 meaning a perfect match (zero distance). Therefore, a target cell with a larger score is more similar to the origin under this scoring rule.
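
For example, if a target cell lies at distance similarity_score(*t*) = 2 from the origin in the PC Scores space, while the average target cell lies at distance similarity_score(<*t*>) = 8, then similarity_skill_score = 1 - 2 / 8 = 0.75, i.e. that cell is considerably more similar to the origin than the average target cell.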

Examples

In the following query, the index of the cell containing the origin location is specified:

CALL `carto-un`.carto.FIND_SIMILAR_LOCATIONS
(
    'cartobq.docs.similar_locations_test_model',
    'quadbin',
    '<my-project>.<my-dataset>.<my-table>',
    '''{
        "origin_index":"5256404338758123519",
        "max_results":10
    }'''
)
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_SIMILAR_LOCATIONS
(
    'cartobq.docs.similar_locations_test_model',
    'quadbin',
    '<my-project>.<my-dataset>.<my-table>',
    '''{
        "origin_index":"5256404338758123519",
        "max_results":10
    }'''
)
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score
CALL carto.FIND_SIMILAR_LOCATIONS
(
    'cartobq.docs.similar_locations_test_model',
    'quadbin',
    '<my-project>.<my-dataset>.<my-table>',
    '''{
        "origin_index":"5256404338758123519",
        "max_results":10
    }'''
)
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score

Here, instead, the longitude and the latitude of the origin location are specified:

CALL `carto-un`.carto.FIND_SIMILAR_LOCATIONS
(
    'cartobq.docs.similar_locations_test_h3_model',
    'h3',
    '<my-project>.<my-dataset>.<my-table>',
    '''{
        "origin_coords":[-91.416, 41.915]
    }'''
)
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: h3, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_SIMILAR_LOCATIONS
(
    'cartobq.docs.similar_locations_test_h3_model',
    'h3',
    '<my-project>.<my-dataset>.<my-table>',
    '''{
        "origin_coords":[-91.416, 41.915]
    }'''
)
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: h3, similarity_score, similarity_skill_score
CALL carto.FIND_SIMILAR_LOCATIONS
(
    'cartobq.docs.similar_locations_test_h3_model',
    'h3',
    '<my-project>.<my-dataset>.<my-table>',
    '''{
        "origin_coords":[-91.416, 41.915]
    }'''
)
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: h3, similarity_score, similarity_skill_score

FIND_TWIN_AREAS_WEIGHTED

FIND_TWIN_AREAS_WEIGHTED(origin_query, target_query, index_column, output_table, options)

Description

This procedure finds, among a set of target cells, the twin areas of a given origin location, computing the similarity directly on the (optionally weighted) input features rather than on a PCA model.

Input parameters

  • origin_query: STRING the query or the fully qualified name of the table containing the origin cells data.

  • target_query: STRING the query or the fully qualified name of the table containing the target cells data.

  • index_column: STRING the name of the index column.

  • output_table: STRING table name of the form <my-project>.<my-dataset>.<my-table> to store the results.

  • options: STRING the JSON string containing the available options as described in the table below.

    Option
    Description

    max_results

    INT64 the maximum number of twin areas to be returned. If set to NULL, all target cells are returned.

    weights

    ARRAY<STRUCT<name STRING, value FLOAT64>> the weights on the features. If set to NULL, all features are treated equally. This parameter is considered only if the length of weights is greater than or equal to one. The sum of weights must be less than or equal to 1. If fewer weights than the number of features are provided, then for the undefined features the remaining 1 - sum(weights) is distributed evenly.
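
    For example, with five features and weights provided for only two of them summing to 0.6, the remaining 0.4 is distributed evenly among the other three features, so each receives a weight of approximately 0.133.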

Output

The procedure outputs a table containing, in each row, the index of the target cell (index_column) and its associated similarity_score and similarity_skill_score. The similarity_score corresponds to the distance between the origin and the target cell, taking into account the user-defined weights. The similarity_skill_score for a given target cell *t* is computed as 1 - similarity_score(*t*) / similarity_score(<*t*>), where <*t*> is the average target cell, obtained by averaging each feature over all the target cells. This similarity_skill_score is a relative measure: it is positive if and only if the target cell is more similar to the origin than the average target cell, with a score of 1 meaning a perfect match (zero distance). Therefore, a target cell with a larger score is more similar to the origin under this scoring rule.
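
As a quick sanity check on this rule (purely illustrative numbers, not taken from any dataset): a target cell at distance 0.2 from the origin, when the average target cell sits at distance 0.8, scores 1 - 0.2 / 0.8 = 0.75; a cell at distance 1.2 scores 1 - 1.2 / 0.8 = -0.5, i.e. it is less similar to the origin than the average cell.

SELECT 1 - 0.2 / 0.8 AS skill_score_closer_than_average,  -- 0.75
       1 - 1.2 / 0.8 AS skill_score_farther_than_average; -- -0.5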

Examples

In this example, we are using default equal weights:

CALL `carto-un`.carto.FIND_TWIN_AREAS_WEIGHTED(
  -- Input queries
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin` LIMIT 1''',
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin`''',
  -- Twin areas model inputs
  'quadbin',
  '<my-project>.<my-dataset>.<my-table>',
  NULL
);
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_TWIN_AREAS_WEIGHTED(
  -- Input queries
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin` LIMIT 1''',
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin`''',
  -- Twin areas model inputs
  'quadbin',
  '<my-project>.<my-dataset>.<my-table>',
  NULL
);
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score
CALL carto.FIND_TWIN_AREAS_WEIGHTED(
  -- Input queries
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin` LIMIT 1''',
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin`''',
  -- Twin areas model inputs
  'quadbin',
  '<my-project>.<my-dataset>.<my-table>',
  NULL
);
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score

In this example, we are using user-specified weights:

CALL `carto-un`.carto.FIND_TWIN_AREAS_WEIGHTED(
  -- Input queries
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin` LIMIT 1''',
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin`''',
  'quadbin',
  '<my-project>.<my-dataset>.<my-table>',
  '''{
    "weights":
      {
        "total_pop_3cf008b3_sum":0.35,
        "households_58bf2ab1_sum":0.35,
        "count_qualified_count":0.3
      }
  }'''
);
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score
CALL `carto-un-eu`.carto.FIND_TWIN_AREAS_WEIGHTED(
  -- Input queries
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin` LIMIT 1''',
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin`''',
  'quadbin',
  '<my-project>.<my-dataset>.<my-table>',
  '''{
    "weights":
      {
        "total_pop_3cf008b3_sum":0.35,
        "households_58bf2ab1_sum":0.35,
        "count_qualified_count":0.3
      }
  }'''
);
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score
CALL carto.FIND_TWIN_AREAS_WEIGHTED(
  -- Input queries
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin` LIMIT 1''',
  '''SELECT * FROM `cartobq.docs.similar_locations_origin_enriched_quadbin`''',
  'quadbin',
  '<my-project>.<my-dataset>.<my-table>',
  '''{
    "weights":
      {
        "total_pop_3cf008b3_sum":0.35,
        "households_58bf2ab1_sum":0.35,
        "count_qualified_count":0.3
      }
  }'''
);
-- Table `<my-project>.<my-dataset>.<my-table>` will be created
-- with the columns: quadbin, similarity_score, similarity_skill_score

This procedure prepares the data table to be fed into RUN_CUSTOMER_SEGMENTATION. It takes as input a set of locations along with optional preprocessed features, optionally enriches these locations with new features from Data Observatory subscriptions or with custom features from the user's own tables, and then performs descriptive and correlation analysis on the features, producing three output tables.

do_variables: ARRAY<STRUCT<variable STRING, aggregation STRING>> variables of the Data Observatory that will be used to enrich the provided trade areas of each location. For each variable, its slug and the aggregation method must be provided. Use 'default' to use the variable's default aggregation method. Valid aggregation methods are: sum, avg, max, min, count. The DATAOBS_SUBSCRIPTION_VARIABLES procedure can be used to find the available variables, together with their slugs and default aggregation methods. It can be set to NULL.
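
For illustration, a do_variables argument could look as follows (the variable slugs below are hypothetical placeholders, not real Data Observatory slugs):

SELECT [
  STRUCT('total_pop_3cf008b3' AS variable, 'sum' AS aggregation),
  STRUCT('median_income_6eb619a2' AS variable, 'default' AS aggregation)
] AS do_variables;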

This procedure is the final step of the Customer Segmentation workflow and uses the output from CUSTOMER_SEGMENTATION_ANALYSIS_DATA. It clusters the stores into a number of clusters using the KMeans algorithm. Multiple numbers of clusters (cases) can be defined. The clustering is either performed directly on the input data or, optionally, after applying Principal Component Analysis (PCA) to it first. If PCA is enabled, the input data are first standardized and passed into the PCA, in order to remove multicollinearity or correlation amongst features, and the output of the PCA is then fed to KMeans. If PCA is not enabled, the input data are standardized and passed directly to the KMeans algorithm. The input to this procedure consists of the output table from the aforementioned procedure, the enrichment table, a list with the options for the number of clusters, the explained variance ratio for the PCA, and the output_prefix. As output, the function produces the following tables:
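
As a rough sketch of the standardize-then-cluster step just described (not the procedure's actual implementation; the model and table names are hypothetical), a BigQuery ML k-means model along these lines would be created for each number-of-clusters case:

-- Cluster stores on their features with k-means; BigQuery ML standardizes
-- the inputs when standardize_features is TRUE.
CREATE OR REPLACE MODEL `my-project.my-dataset.customer_kmeans_5`
OPTIONS (model_type = 'KMEANS', num_clusters = 5, standardize_features = TRUE) AS
SELECT * EXCEPT (store_id, geom)
FROM `my-project.my-dataset.customers_data_table`;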

customers_data_table: STRING table, output from CUSTOMER_SEGMENTATION_ANALYSIS_DATA, which contains store_id, geom, and the feature values.

String similarity is computed on the location names. It is based on the Dice Coefficient and accounts for words appearing in a different order in the two strings.
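
For reference, the token-based Dice coefficient can be sketched as follows (an illustrative query, not the procedure's exact implementation):

-- Dice coefficient on word tokens: 2 * |A ∩ B| / (|A| + |B|).
-- Word order does not matter, which is the property mentioned above.
WITH pair AS (
  SELECT SPLIT(LOWER('Anna Coffee Shop'), ' ') AS a,
         SPLIT(LOWER('Coffee Shop Anna'), ' ') AS b
)
SELECT 2 * (SELECT COUNT(*) FROM UNNEST(a) AS t WHERE t IN UNNEST(b))
         / (ARRAY_LENGTH(a) + ARRAY_LENGTH(b)) AS dice_coefficient
FROM pair;  -- returns 1.0: same words, different order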

When adhering to the CPG use case, it is also interesting to run the UNIVERSE_MATCHING_REPORT procedure, which returns several valuable tables and metrics, such as the potential expansion universe and the current market penetration.

Procedure to generate a report on the output of a UNIVERSE_MATCHING execution, as well as some insightful results derived from it.

This procedure requires a previous UNIVERSE_MATCHING execution, so please refer to its documentation to understand its inputs and the different universe metaphors.

matched_universe: STRING SQL query or table name providing the UNIVERSE_MATCHING results.

This procedure allows deriving spatial scores using both point and polygon geometries. It ranks stores or merchants (point locations) or areas (polygons) based on a spatial composite score that measures attractiveness, risk, sales potential, etc., according to some business criteria (using both internal and external data). The procedure combines (spatial) variables into a meaningful composite score by first scaling each individual variable to the 0-1 range and then aggregating them linearly through a weighted average (see the sketch after the list below). The score is provided on a 0-to-100 scale. Rows with a NULL value in any of the individual variables are dropped. For a more advanced, customizable version of this procedure, please refer to CREATE_SPATIAL_COMPOSITE_UNSUPERVISED.

To prepare the input data, we recommend using:

  • the GENERATE_TRADE_AREAS procedure to generate different types of service areas for both origin and target locations,

  • the ENRICH_POLYGONS or DATAOBS_ENRICH_POLYGONS procedure to enrich the defined trade areas, resulting in the format expected by this procedure.
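
A minimal sketch of the scaling-and-averaging logic described above, with hypothetical table and column names and weights:

-- Min-max scale each variable over the full table, then combine them with a
-- weighted average and rescale to 0-100. Rows with NULLs are dropped.
WITH scaled AS (
  SELECT
    store_id,
    SAFE_DIVIDE(foot_traffic - MIN(foot_traffic) OVER (),
                MAX(foot_traffic) OVER () - MIN(foot_traffic) OVER ()) AS s1,
    SAFE_DIVIDE(median_income - MIN(median_income) OVER (),
                MAX(median_income) OVER () - MIN(median_income) OVER ()) AS s2
  FROM `my-project.my-dataset.stores`
  WHERE foot_traffic IS NOT NULL AND median_income IS NOT NULL
)
SELECT store_id, 100 * (0.6 * s1 + 0.4 * s2) AS spatial_score
FROM scaled
ORDER BY spatial_score DESC;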

Procedure to rank stores or merchants based on a spatial composite score that detects whether locations are performing as expected (or under/over-performing) in terms of some business criteria (using both internal and external data) and a performance KPI (sales, volume, or margin). The score is derived from the residuals of a Random Forest regression model, defined as the observed performance KPI minus the predicted one (see the sketch after the list below). The performance KPI should be measurable and correlated with the set of variables defining the score. Rows with a NULL value in any of the individual variables are dropped. For a more advanced, customizable version of this procedure, please refer to CREATE_SPATIAL_COMPOSITE_SUPERVISED. For large inputs this procedure might return a "Query exceeds CPU resources" error, which can be solved by using the CREATE_SPATIAL_COMPOSITE_SUPERVISED procedure instead, with the remove_outliers option set to FALSE.

To prepare the input data, we recommend using:

  • the GENERATE_TRADE_AREAS procedure to generate different types of service areas for both origin and target locations,

  • the ENRICH_POLYGONS or DATAOBS_ENRICH_POLYGONS procedure to enrich the defined trade areas, resulting in the format expected by this procedure.
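
The residual logic can be sketched as follows, assuming a regression model has already been trained with the performance KPI (here, sales) as the label; all names below are hypothetical:

-- Observed KPI minus predicted KPI: positive residuals flag over-performers,
-- negative residuals flag under-performers.
SELECT
  store_id,
  sales - predicted_sales AS performance_residual
FROM ML.PREDICT(MODEL `my-project.my-dataset.sales_model`,
                TABLE `my-project.my-dataset.stores`)
ORDER BY performance_residual DESC;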

This procedure runs the first step of the Similar Locations analysis, which can be used to find, amongst some target cells in a Quadbin or H3 grid, the cells (a.k.a. locations) that are most similar to a cell of interest based on a set of input variables. For both the origin and the target cells, this procedure transforms the input data by standardizing the numerical variables and creating a standardized indicator matrix for the categorical variables, and then it creates a Principal Component Analysis (PCA) model using the processed target data as input. Internally it works just like the BUILD_TWIN_AREAS_MODEL procedure. More details on the data processing can be found in the BUILD_PCAMIX_DATA procedure, which is used to return the transformed data.

Since the input data must be on a regular Quadbin or H3 grid, we recommend using the GRIDIFY_ENRICH procedure to prepare the data in the format expected by this procedure.

num_principal_components: INT64 number of principal components to keep, as defined in the BigQuery ML CREATE MODEL statement for PCA models.

pca_explained_variance_ratio: FLOAT64 as defined in the BigQuery ML CREATE MODEL statement for PCA models.

pca_solver: STRING as defined in the BigQuery ML CREATE MODEL statement for PCA models.
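
For context, the PCA model created internally corresponds to a BigQuery ML statement along these lines (a sketch with hypothetical names; the procedure creates and manages this model for you):

-- PCA model over the transformed target cells, keeping enough principal
-- components to explain 90% of the variance.
CREATE OR REPLACE MODEL `my-project.my-dataset.similar_locations_pca`
OPTIONS (model_type = 'PCA', pca_explained_variance_ratio = 0.9) AS
SELECT * EXCEPT (quadbin)
FROM `my-project.my-dataset.target_cells_transformed`;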

Procedure to obtain similar locations for a given origin location in a target area. The procedure first computes a similarity score as the Euclidean distance between the principal component scores of the origin cell, derived with the BUILD_SIMILAR_LOCATIONS_MODEL procedure, and those of each of the cells in the target area, and then derives a similarity skill score that is used to rank the results. Internally it works just like the FIND_TWIN_AREAS procedure.
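
For illustration, the Euclidean distance between two principal-component score vectors (made-up values for a three-component model) reduces to:

-- sqrt((0.1-0.4)^2 + (-0.2-0.3)^2 + (0.5-0.0)^2) ≈ 0.768
SELECT SQRT(POW(0.1 - 0.4, 2) + POW(-0.2 - 0.3, 2) + POW(0.5 - 0.0, 2))
  AS similarity_score;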

We recommend using the GRIDIFY_ENRICH procedure to prepare the data in the format expected by the BUILD_SIMILAR_LOCATIONS_MODEL procedure, which must precede the use of this procedure.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.
