statistics


This module contains functions to perform spatial statistics calculations.

P_VALUE

P_VALUE(z_score)

Description

This function computes the p-value (two-tailed test) of a given z-score, assuming the population follows a normal distribution with mean 0 and standard deviation 1. The z-score is a measure of how many standard deviations below or above the population mean a value is; it gives you an idea of how far from the mean a data point is. The p-value is the probability that a randomly sampled point has a value at least as extreme as the point whose z-score is being tested.
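For reference, under these assumptions the two-tailed p-value of a z-score z is given by the standard identity

p = 2 · (1 − Φ(|z|))

where Φ is the cumulative distribution function of the standard normal distribution. For example, z = ±2 gives p ≈ 0.0455, consistent with the example output below.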

  • z_score: FLOAT

Return type

FLOAT

Example

SELECT CARTO.CARTO.P_VALUE(value) as p_value
FROM LATERAL FLATTEN(input => ARRAY_CONSTRUCT(-2,-1,0,1,2));
-- 0.04550012577451279,
-- 0.31731052766472745,
-- 0.999999999,
-- 0.31731052766472745,
-- 0.04550012577451279

CREATE_SPATIAL_COMPOSITE_SUPERVISED

CREATE_SPATIAL_COMPOSITE_SUPERVISED(
  input,
  index_column,
  output_table,
  options
)

Description

This procedure derives a spatial composite score as the residuals of a regression model, which can be used to detect areas of under- and over-prediction. The response variable should be measurable and correlated with the set of variables defining the score. For each data point, the residual is defined as the observed value minus the predicted value. Rows with a NULL value in any of the individual variables are dropped.
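Restating the definition above, with y_i the observed value of the response variable and ŷ_i the prediction of the regression model for data point i, the score stored for that point is

spatial_score_i = y_i − ŷ_i

so positive values flag areas where the model under-predicts and negative values flag areas where it over-predicts.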

Input parameters

  • input: STRING the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite, as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. <database>.<schema>.<table>.

  • index_column: STRING the name of the column with the unique geographic identifier.

  • output_table: STRING the prefix for the output table. It should include database and schema, e.g. <database>.<schema>.<output_table>.

  • options: STRING containing a valid JSON with the different options. Valid options are described below.

    • model_options: JSON string containing all the settings to be passed to the ML model function. These settings are:

      • input_label: STRING name of the column to be used as a target to train the model and evaluate the predictions.

      • encoder: JSON containing the name and parameters of the class from Snowpark ML to be used as an encoder, which will be applied to all the categorical features in your input query. It can be NULL or omitted, in which case the function will return an error if categorical columns are present. It can contain two different values:

        • class: STRING containing the fully qualified name of the class to be used in this step.

        • options: an optional JSON dictionary containing the keyword arguments to be passed to the class during initialization. Please check the documentation of the chosen class to see which parameters can be passed.

      • scaler: JSON containing the name and parameters of the class from Snowpark ML to be used as a scaler, which will be applied to all the input features (numerical or encoded categories) in your query. It can be NULL or omitted to skip this step altogether. It can contain two different values:

        • class: STRING containing the fully qualified name of the class to be used in this step.

        • options: an optional JSON dictionary containing the keyword arguments to be passed to the class during initialization. Please check the documentation of the chosen class to see which parameters can be passed.

      • regressor: JSON containing the name and parameters of the class from Snowflake ML to be used as a regressor, and whose predictions will be used to generate the score index. It can contain two different values:

        • class: STRING containing the fully qualified name of the class to be used in this step.

        • options: an optional JSON dictionary containing the keyword arguments to be passed to the class during initialization. Please check the documentation of the chosen class to see which parameters can be passed.

    • bucketize_method: STRING the method used to discretize the spatial composite score. The default value is NULL, which will return a continuous variable. Possible options are:

      • EQUAL_INTERVALS_ZERO_CENTERED: the values of the spatial composite score are discretized into buckets of equal widths centered in zero. The lower and upper limits are derived from the outliers-removed maximum of the absolute values of the score.

    • nbuckets: INTEGER the number of buckets used when a bucketization method is specified. If NULL, a default number of buckets is selected. Ignored if bucketize_method is not specified.

    • remove_outliers: BOOLEAN when bucketize_method is specified and remove_outliers is set to true, the buckets are derived from the outlier-removed data. The default value is true. Ignored if bucketize_method is not specified.

    • r2_thr: FLOAT the minimum allowed value for the R2 model score. If the R2 of the regression model is lower than this threshold, this implies a poor fit and an error is raised. If it is NULL, a default value of 0.5 is used instead.

Return type

The results are stored in the table named output_table, which contains the following columns:

  • index_column: the unique geographic identifier. The type of this column depends on the type of index_column in the input query.

  • spatial_score: the value of the composite score. The type of this column is FLOAT if the score is not discretized and INTEGER otherwise.

When the score is discretized by specifying the bucketize_method parameter, the procedure also returns a lookup table named <output_table>_lookup_table with the following columns:

  • lower_bound: FLOAT the lower bound of the bin.

  • upper_bound: FLOAT the upper bound of the bin.

  • spatial_score: INTEGER the value of the (discretized) composite score.

Example

CALL CARTO.CARTO.CREATE_SPATIAL_COMPOSITE_SUPERVISED(
  'SELECT * FROM <database>.<schema>.<table>',
  'geoid',
  '<database>.<schema>.<output_table>_SCORE',
  $$
  {
    "model_options":{
      "input_label": "label_column_name",
      "encoder": {
        "class": "snowflake.ml.modeling.preprocessing.OneHotEncoder",
        "options": {
          "drop": "if_binary"
        }
      },
      "scaler": {
        "class": "snowflake.ml.modeling.preprocessing.MinMaxScaler",
        "options": {
          "feature_range": [0, 1]
        }
      },
      "regressor": {
        "class": "snowflake.ml.modeling.ensemble.RandomForestRegressor",
        "options": {
          "n_estimators": 100,
          "max_depth": 10,
          "min_samples_split": 2,
          "min_samples_leaf": 1
        }
      }
    },
    "r2_thr": 0.6,
    "bucketize_method": "EQUAL_INTERVALS_ZERO_CENTERED",
    "nbuckets": 9,
    "remove_outliers": true
  }
  $$
);
-- Two tables will be created:
--   * <database>.<schema>.<output_table>_SCORE
--   * <database>.<schema>.<output_table>_SCORE_LOOKUP_TABLE
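When the score is discretized, each location's bucket bounds can be recovered by joining the two output tables on spatial_score. A minimal sketch, reusing the tables created above and assuming the index column is geoid as in the example:

SELECT
    score.geoid,
    score.spatial_score,
    lookup.lower_bound,
    lookup.upper_bound
FROM <database>.<schema>.<output_table>_SCORE score
LEFT JOIN <database>.<schema>.<output_table>_SCORE_LOOKUP_TABLE lookup
    ON score.spatial_score = lookup.spatial_score;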

CREATE_SPATIAL_COMPOSITE_UNSUPERVISED

CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(input, index_column, output_table, options)

Description

This procedure combines (spatial) variables into a meaningful composite score. The composite score can be derived using different methods, scaling and aggregation functions and weights. Rows with a NULL value in any of the model predictors are dropped.

Input parameters

  • input: STRING the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. <database>.<schema>.<table>.

  • index_column: STRING the name of the column with the unique geographic identifier.

  • output_table: STRING the name for the output table. It should include database and schema, e.g. <database>.<schema>.<output_table>_SCORE.

  • options: STRING containing a valid JSON with the different options. Valid options are described below. If options is set to NULL then all options are set to default values, as specified in the table below.

    • scoring_method: STRING Possible options are ENTROPY, CUSTOM_WEIGHTS, FIRST_PC. With the ENTROPY method the spatial composite is derived as the weighted sum of the proportion of the min-max scaled individual variables, where the weights are based on the entropy of the proportion of each variable. Only numerical variables are allowed. With the CUSTOM_WEIGHTS method, the spatial composite is computed by first scaling each individual variable and then aggregating them according to user-defined scaling and aggregation methods and individual weights. Depending on the scaling parameter, both numerical and ordinal variables are allowed (categorical and boolean variables need to be transformed to ordinal). With the FIRST_PC method, the spatial composite is derived from a Principal Component Analysis as the first principal component score. Only numerical variables are allowed.

    • weights: ARRAY the (optional) weights for each variable used to compute the spatial composite when scoring_method is set to CUSTOM_WEIGHTS. If a different scoring method is selected, this input parameter is ignored. If specified, the sum of the weights must be lower than 1. If no weights are specified, equal weights are assumed. If weights are specified only for some variables and the sum of the weights is less than 1, the remainder is distributed equally between the remaining variables (for example, with four variables and weights of 0.2 given to two of them, the remaining 0.6 is split as 0.3 each between the other two). If weights are specified for all the variables and the sum of the weights is less than 1, the remainder is distributed equally between all the variables.

    • scaling: STRING the user-defined scaling when the scoring_method is set to CUSTOM_WEIGHTS. Possible options are:

      • MIN_MAX_SCALER: data is rescaled into the range [0,1] based on minimum and maximum values. Only numerical variables are allowed.

      • STANDARD_SCALER: data is rescaled by subtracting the mean value and dividing the result by the standard deviation. Only numerical variables are allowed.

      • RANKING: data is replaced by its percent rank, that is, by values ranging from 0 (lowest) to 1 (highest). Both numerical and ordinal variables are allowed (categorical and boolean variables need to be transformed to ordinal).

      • DISTANCE_TO_TARGET_MIN(_MAX,_AVG): data is rescaled by dividing by the minimum, maximum, or mean of all the values. Only numerical variables are allowed.

      • PROPORTION: data is rescaled by dividing by the sum total of all the values. Only numerical variables are allowed.

    • aggregation: STRING the aggregation function used when the scoring_method is set to CUSTOM_WEIGHTS. Possible options are:

      • LINEAR: the spatial composite is derived as the weighted sum of the scaled individual variables.

      • GEOMETRIC: the spatial composite is given by the product of the scaled individual variables, each to the power of its weight.

    • correlation_var: STRING when scoring_method is set to FIRST_PC, the spatial score will be positively correlated with the selected variable (i.e. the sign of the spatial score is set such that the correlation between the selected variable and the first principal component score is positive).

    • correlation_thr: FLOAT the minimum absolute value of the correlation between each individual variable and the first principal component score when scoring_method is set to FIRST_PC.

    • return_range: ARRAY the user-defined normalization range of the spatial composite score, e.g. [0.0,1.0]. Ignored if bucketize_method is specified.

    • bucketize_method: STRING the method used to discretize the spatial composite score. Possible options are:

      • EQUAL_INTERVALS: the values of the spatial composite score are discretized into buckets of equal widths.

      • QUANTILES: the values of the spatial composite score are discretized into buckets based on quantiles.

      • JENKS: the values of the spatial composite score are discretized into buckets obtained using k-means clustering.

    • nbuckets: INTEGER the number of buckets used when a bucketization method is specified. When bucketize_method is set to EQUAL_INTERVALS and nbuckets is NULL, a default number of buckets is selected. When bucketize_method is set to JENKS or QUANTILES, nbuckets cannot be NULL. When bucketize_method is set to JENKS, the maximum value is 100.

    • bucketize_random_state: INTEGER the random state used to run the discretization when bucketize_method is set to JENKS. If a different bucketization method is selected, this input parameter is ignored. A non-negative value must be specified. It defaults to 42.

The following summarizes, for each option, how it applies to each scoring method, its valid values and its default value:

  • scoring_method: Optional for all methods. Valid options: ENTROPY, CUSTOM_WEIGHTS, FIRST_PC. Default: ENTROPY.

  • weights: Ignored for ENTROPY and FIRST_PC; Optional for CUSTOM_WEIGHTS. Format: {"name":value…}. Default: NULL.

  • scaling: Ignored for ENTROPY and FIRST_PC; Optional for CUSTOM_WEIGHTS. Valid options: MIN_MAX_SCALER, STANDARD_SCALER, RANKING, DISTANCE_TO_TARGET_MIN, DISTANCE_TO_TARGET_MAX, DISTANCE_TO_TARGET_AVG, PROPORTION. Default: MIN_MAX_SCALER.

  • aggregation: Ignored for ENTROPY and FIRST_PC; Optional for CUSTOM_WEIGHTS. Valid options: LINEAR, GEOMETRIC. Default: LINEAR.

  • correlation_var: Ignored for ENTROPY and CUSTOM_WEIGHTS; Mandatory for FIRST_PC. Default: NULL.

  • correlation_thr: Ignored for ENTROPY and CUSTOM_WEIGHTS; Optional for FIRST_PC. Default: NULL.

  • return_range: Optional for all methods. Default: NULL.

  • bucketize_method: Optional for all methods. Valid options: EQUAL_INTERVALS, QUANTILES, JENKS. Default: NULL.

  • nbuckets: Optional for all methods. Default: selected automatically when bucketize_method is EQUAL_INTERVALS (must be set for QUANTILES and JENKS).

  • bucketize_random_state: Optional for all methods. Default: 42 when bucketize_method is set to JENKS.

Return type

The results are stored in the table named <output_table>, which contains the following columns:

  • index_column: the unique geographic identifier. The type of this column depends on the type of index_column in input.

  • spatial_score: the value of the composite score. The type of this column is FLOAT if the score is not discretized and INTEGER otherwise.

When the score is discretized by specifying the bucketize_method parameter, the procedure also returns a lookup table named <output_table>_LOOKUP_TABLE with the following columns:

  • lower_bound: FLOAT the lower bound of the bin.

  • upper_bound: FLOAT the upper bound of the bin.

  • spatial_score: INTEGER the value of the (discretized) composite score.

Examples

With the ENTROPY method:

CALL carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCLUDE(geom, revenue_change, urbanity, urbanity_ordinal) FROM <database>.<schema>.spatial_scoring_input',
    'geoid',
    '<database>.<schema>.<output_table>_SCORE',
    $$
    {
        "scoring_method":"ENTROPY",
        "return_range":[0.0,1.0]
    }
    $$
)
-- Table `<database>.<schema>.<output_table>_SCORE` will be created

With the CUSTOM_WEIGHTS method:

CALL carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCLUDE(geom, revenue_change, urbanity) FROM <database>.<schema>.spatial_scoring_input',
    'geoid',
    '<database>.<schema>.<output_table>_SCORE',
    $$
    {
        "scoring_method":"CUSTOM_WEIGHTS",
        "weights":{"fempop_15_44":0.2,"education":0.1,"urbanity_ordinal":0.1,"pois":0.1},
        "scaling":"RANKING",
        "aggregation":"LINEAR",
        "bucketize_method":"JENKS"
    }
    $$
 )
-- Table `<database>.<schema>.<output_table>_SCORE` will be created

With the FIRST_PC method:

CALL carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCLUDE(geom, revenue_change, urbanity, urbanity_ordinal) FROM <database>.<schema>.spatial_scoring_input',
    'geoid',
    '<database>.<schema>.<output_table>_SCORE',
    $$
    {
        "scoring_method":"FIRST_PC",
        "correlation_var":"fempop_15_44",
        "correlation_thr":0.6,
        "bucketize_method":"QUANTILES"
    }
    $$
)
-- Table `<database>.<schema>.<output_table>_SCORE` will be created

CRONBACH_ALPHA_COEFFICIENT

CRONBACH_ALPHA_COEFFICIENT(input, output_table)

Description

This procedure computes the Cronbach's alpha coefficient for a set of (spatial) variables. This coefficient can be used as a measure of the internal consistency or reliability of the data, based on the strength of the correlations between the individual variables. The Cronbach's alpha reliability coefficient normally ranges between 0 and 1, but there is actually no lower limit to the coefficient. Higher alpha (closer to 1) means higher consistency and lower alpha (closer to 0) means lower consistency, with 0.65 usually considered the minimum acceptable value of internal consistency. Rows with a NULL value in any of the individual variables are dropped.

Input parameters

  • input: STRING the query to the data used to compute the coefficient. It must contain all the individual variables that should be included in the computation of the coefficient. A qualified table name can be given as well, e.g. <database>.<schema>.<table>.

  • output_table: STRING the name for the output table. It should include database and schema, e.g. <database>.<schema>.<output_table>.

Return type

The output table with the following columns:

  • cronbach_alpha_coef: FLOAT the computed Cronbach Alpha coefficient.

  • k: INTEGER the number of the individual variables used to compute the composite.

  • mean_var: FLOAT the mean variance of all individual variables.

  • mean_cov: FLOAT the mean inter-item covariance among all variable pairs.

Example

CALL CARTO.CARTO.CRONBACH_ALPHA_COEFFICIENT(
    'SELECT * EXCEPT(geoid, geom, revenue_change, urbanity, urbanity_ordinal) FROM <database>.<schema>.<table>',
    '<database>.<schema>.<output_table>'
)
-- Table `<database>.<schema>.<output_table>` will be created

GWR_GRID

GWR_GRID(input_table, features_columns, label_column, cell_column, kring_distance, kernel_function, fit_intercept, output_table)

Description

Geographically weighted regression (GWR) models local relationships between spatially varying predictors and an outcome of interest using a local least squares regression.

In each regression, the data of the locations in each cell and those of the neighboring cells, defined by the kring_distance parameter, will be taken into account. The data of the neighboring cells will be assigned a lower weight the further they are from the origin cell, following the function specified in kernel_function. For example, considering cell i and a kring_distance of 1: if there are n locations inside cell i and [n_1, n_2, ..., n_k] locations in the neighboring cells, then the regression for cell i will use n + n_1 + n_2 + ... + n_k points in total.
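For reference, this corresponds to solving, for each cell i, a weighted least squares problem of the standard GWR form (shown here as background; the procedure's exact implementation details are not described on this page):

beta_hat(i) = (Xᵀ W(i) X)⁻¹ Xᵀ W(i) y

where X contains the feature columns (plus a constant column if fit_intercept is TRUE), y is the label column, and W(i) is a diagonal matrix whose weights are computed from kernel_function applied to the kring distance between each point's cell and cell i.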

  • input_table: STRING the query to the input data. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • features_columns: ARRAY array of column names from input_table to be used as features in the GWR.

  • label_column: STRING name of the target variable column.

  • cell_column: STRING name of the column containing the cell ids.

  • kring_distance: INT distance of the neighboring cells whose data will be included in the local regression of each cell.

  • kernel_function: STRING name of the kernel function used to weight the data of the neighboring cells by their distance from the origin cell (the examples below use 'gaussian').

  • fit_intercept: BOOLEAN whether to calculate the intercept of the model or force it to zero, for example if the input data is already centered. If NULL, fit_intercept is considered TRUE.

  • output_table: STRING name of the output table. It should be a quoted qualified table name with database and schema: <database-id>.<schema-id>.<table-name>.

Output

The output table will contain a column named either H3 (STRING) or QUADBIN (BIGINT), depending on the grid type, storing the unique geographic identifier of each grid cell; a column for each feature column, containing its corresponding coefficient estimate; and one extra column for the intercept if fit_intercept is set to TRUE.

Examples

CALL CARTO.CARTO.GWR_GRID(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'h3_z6', 3, 'gaussian', TRUE,
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created
CALL CARTO.CARTO.GWR_GRID(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'qb_z12', 3, 'gaussian', TRUE,
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created

GETIS_ORD_H3

GETIS_ORD_H3(input, output_table, index_column, value_column, size, kernel)

Description

This procedure computes the Getis-Ord Gi* statistic for each row in the input table.

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the H3 indexes.

  • value_column: STRING name of the column with the values for each H3 cell.

  • size: INT size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.

  • kernel: STRING name of the kernel function used to compute the spatial weights across the kring.

The index_column cannot contain NULL values, otherwise an Invalid input origin error will be returned.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • h3: STRING the H3 index.

  • gi: FLOAT computed Gi* value.

  • p_value: FLOAT computed P value.

Example

CALL CARTO.CARTO.GETIS_ORD_H3(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'value',
    3,
    'gaussian'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created
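Significant hot and cold spots can then be extracted from the output by filtering on the p-value. A hypothetical follow-up query based on the output schema above (the 0.05 threshold is only an example):

SELECT h3, gi, p_value
FROM <mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT
WHERE p_value < 0.05
ORDER BY gi DESC;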

GETIS_ORD_QUADBIN

GETIS_ORD_QUADBIN(input, output_table, index_column, value_column, size, kernel)

Description

This procedure computes the Getis-Ord Gi* statistic for each row in the input table.

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the Quadbin indexes.

  • value_column: STRING name of the column with the values for each Quadbin cell.

  • size: INT size of the Quadbin kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.

  • kernel: STRING name of the kernel function used to compute the spatial weights across the kring.

The index_column cannot contain NULL values, otherwise the output will contain invalid results.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • quadbin: BIGINT the QUADBIN index.

  • gi: FLOAT computed Gi* value.

  • p_value: FLOAT computed P value.

Example

CALL CARTO.CARTO.GETIS_ORD_QUADBIN(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'quadbin',
    'value',
    3,
    'gaussian'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created

GETIS_ORD_SPACETIME_H3

GETIS_ORD_SPACETIME_H3(input, output_table, index_column, date_column, value_column, size, time_freq, time_bw, kernel, kernel_time)

Description

This procedure computes the space-time Getis-Ord Gi* statistic for each location and time step in the input table.

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the H3 indexes.

  • date_column: STRING name of the column with the date.

  • value_column: STRING name of the column with the values for each H3 cell.

  • size: INTEGER size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.

  • time_freq: STRING the time interval step used for the time series. Available values are: year, quarter, month, week, day, hour, minute, second. It is the equivalent of the spatial index in the time domain.

  • time_bw: INTEGER the bandwidth used for the time series. This defines the number of adjacent observations in the time domain to be considered. It is the equivalent of the H3 kring in the time domain.

  • kernel: STRING name of the kernel function used to compute the spatial weights across the kring.

  • kernel_time: STRING name of the kernel function used to compute the weights across the time domain.

The index_column cannot contain NULL values, otherwise an Invalid input origin error will be returned.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • h3: STRING the H3 index.

  • date: DATETIME

  • gi: FLOAT computed Gi* value.

  • p_value: FLOAT computed P value.

Example

CALL carto.GETIS_ORD_SPACETIME_H3(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created

GETIS_ORD_SPACETIME_QUADBIN

GETIS_ORD_SPACETIME_QUADBIN(input, output_table, index_column, date_column, value_column, size, time_freq, time_bw, kernel, kernel_time)

Description

This procedure computes the space-time Getis-Ord Gi* statistic for each location and time step in the input table.

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the Quadbin indexes.

  • date_column: STRING name of the column with the date.

  • value_column: STRING name of the column with the values for each Quadbin cell.

  • size: INTEGER size of the Quadbin kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.

  • time_freq: STRING the time interval step used for the time series. Available values are: year, quarter, month, week, day, hour, minute, second. It is the equivalent of the spatial index in the time domain.

  • time_bw: INTEGER the bandwidth used for the time series. This defines the number of adjacent observations in the time domain to be considered. It is the equivalent of the Quadbin kring in the time domain.

  • kernel: STRING name of the kernel function used to compute the spatial weights across the kring.

  • kernel_time: STRING name of the kernel function used to compute the weights across the time domain.

The index_column cannot contain NULL values, otherwise the output will contain invalid results.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • quadbin: BIGINT the QUADBIN index.

  • date: DATETIME

  • gi: FLOAT computed Gi* value.

  • p_value: FLOAT computed P value.

Example

CALL carto.GETIS_ORD_SPACETIME_QUADBIN(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'quadbin',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created
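The columns of this output table (quadbin, date, gi, p_value) match the inputs expected by SPACETIME_HOTSPOTS_CLASSIFICATION, so the result can be classified directly. A minimal sketch reusing the table created above (the _CLASSIFIED output name is just a placeholder; NULL options use the defaults documented below):

CALL CARTO.CARTO.SPACETIME_HOTSPOTS_CLASSIFICATION(
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_CLASSIFIED',
    'quadbin',
    'date',
    'gi',
    'p_value',
    NULL
);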

MORANS_I_H3

MORANS_I_H3(input, output_table, index_column, value_column, size, decay)

Description

This procedure computes the Moran's I spatial autocorrelation from the input table with H3 indexes.

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the H3 indexes.

  • value_column: STRING name of the column with the values for each H3 cell.

  • size: INT size of the H3 k-ring (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.

  • decay: STRING name of the decay function used to compute the distance decay across the k-ring.

The index_column cannot contain NULL values, otherwise an Invalid input origin error will be returned.

Output

The results are stored in the table named <output_table>, which contains the following column:

  • morans_i: FLOAT Moran's I spatial autocorrelation.

If all cells have no neighbours, then the procedure will fail.

Example

CALL CARTO.CARTO.MORANS_I_H3(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'value',
    5,
    'uniform'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created

MORANS_I_QUADBIN

MORANS_I_QUADBIN(input, output_table, index_column, value_column, size, decay)

Description

This procedure computes the Moran's I spatial autocorrelation from the input table with Quadbin indexes.

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the Quadbin indexes.

  • value_column: STRING name of the column with the values for each Quadbin cell.

  • size: INT size of the Quadbin k-ring (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.

  • decay: STRING name of the decay function used to compute the distance decay across the k-ring.

The index_column cannot contain NULL values, otherwise the output will contain invalid results.

Output

The results are stored in the table named <output_table>, which contains the following column:

  • morans_i: FLOAT Moran's I spatial autocorrelation.

If all cells have no neighbours, then the procedure will fail.

Example

CALL CARTO.CARTO.MORANS_I_QUADBIN(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'quadbin',
    'value',
    5,
    'uniform'
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created

LOCAL_MORANS_I_H3

LOCAL_MORANS_I_H3(input, output_table, index_column, value_column, size, decay, permutations)

Description

This procedure computes the local Moran's I spatial autocorrelation from the input table with H3 indexes. For each cell it outputs the H3 index, the local Moran's I spatial autocorrelation value, the simulated p-value (psim), the conditional randomization null expectation (EIc) and variance (VIc), the total randomization null expectation (EI) and variance (VI), and the quadrant classification quad (HH=1, LL=2, LH=3, HL=4).

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the H3 indexes.

  • value_column: STRING name of the column with the values for each H3 cell.

  • size: INTEGER size of the H3 k-ring (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.

  • decay: STRING name of the decay function used to compute the distance decay across the k-ring.

  • permutations: INTEGER number of permutations for the estimation of the p-value.

The index_column cannot contain NULL values, otherwise an Invalid input origin error will be returned.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • h3: STRING the H3 index.

  • value: FLOAT local Moran's I spatial autocorrelation.

  • psim: FLOAT simulated p value.

  • EIc: FLOAT conditional randomization null - expectation.

  • VIc: FLOAT conditional randomization null - variance.

  • EI: FLOAT total randomization null - expectation.

  • VI: FLOAT total randomization null - variance.

  • quad: INTEGER HH=1, LL=2, LH=3, HL=4.

Example

CALL CARTO.CARTO.LOCAL_MORANS_I_H3(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'value',
    3,
    'exponential',
    100
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created
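The quad codes can be decoded into the usual local Moran's cluster labels with a follow-up query. A hypothetical sketch based on the output schema above (the 0.05 threshold on psim is only an example):

SELECT
    h3,
    value,
    CASE quad
        WHEN 1 THEN 'High-High'
        WHEN 2 THEN 'Low-Low'
        WHEN 3 THEN 'Low-High'
        WHEN 4 THEN 'High-Low'
    END AS cluster_type
FROM <mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT
WHERE psim < 0.05;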

LOCAL_MORANS_I_QUADBIN

LOCAL_MORANS_I_QUADBIN(input, output_table, index_column, value_column, size, decay, permutations)

Description

This procedure computes the local Moran's I spatial autocorrelation from the input table with Quadbin indexes. For each cell it outputs the Quadbin index, the local Moran's I spatial autocorrelation value, the simulated p-value (psim), the conditional randomization null expectation (EIc) and variance (VIc), the total randomization null expectation (EI) and variance (VI), and the quadrant classification quad (HH=1, LL=2, LH=3, HL=4).

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • index_column: STRING name of the column with the Quadbin indexes.

  • value_column: STRING name of the column with the values for each Quadbin cell.

  • size: INTEGER size of the Quadbin k-ring (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.

  • decay: STRING name of the decay function used to compute the distance decay across the k-ring.

  • permutations: INTEGER number of permutations for the estimation of the p-value.

The index_column cannot contain NULL values, otherwise the output will contain invalid results.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • quadbin: BIGINT the QUADBIN index.

  • value: FLOAT local Moran's I spatial autocorrelation.

  • psim: FLOAT simulated p value.

  • EIc: FLOAT conditional randomization null - expectation.

  • VIc: FLOAT conditional randomization null - variance.

  • EI: FLOAT total randomization null - expectation.

  • VI: FLOAT total randomization null - variance.

  • quad: INTEGER HH=1, LL=2, LH=3, HL=4.

Example

CALL CARTO.CARTO.LOCAL_MORANS_I_QUADBIN(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'quadbin',
    'value',
    3,
    'exponential',
    100
);
-- Table `<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT` will be created

SPACETIME_HOTSPOTS_CLASSIFICATION

SPACETIME_HOTSPOTS_CLASSIFICATION (input, output_table, index_column, date_column, gi_column, p_value_column, options)

Description

This procedure classifies each location into one of the following categories, based on the evolution over time of its hotspot/coldspot statistics (e.g. the Gi* values and associated p-values computed with the GETIS_ORD_SPACETIME procedures):

  • Undetected Pattern: locations that do not exhibit any discernible patterns of hot or cold activity as defined in the subsequent categories.

  • Incipient Hotspot: a location that has become a significant hotspot only in the latest observed time step, without any prior history of significant hotspot activity.

  • Sequential Hotspot: a location experiencing an unbroken series of significant hotspot activity leading up to the most recent time step, provided it had no such activity beforehand and less than 90% of all observed intervals were hotspots.

  • Strengthening Hotspot: a location consistently identified as a hotspot in at least 90% of time steps, including the last, where there is a statistically significant upward trend in activity intensity.

  • Stable Hotspot: a location maintaining significant hotspot status in at least 90% of time steps without showing a clear trend in activity intensity changes over time.

  • Declining Hotspot: a location that has consistently been a hotspot in at least 90% of time steps, including the most recent one, but shows a statistically significant decrease in the intensity of its activity.

  • Occasional Hotspot: locations that sporadically become hotspots, with less than 90% of time steps marked as significant hotspots and no instances of being a significant coldspot.

  • Fluctuating Hotspot: a location marked as a significant hotspot in the latest time step that has also experienced significant coldspot phases in the past, with less than 90% of intervals as significant hotspots.

  • Legacy Hotspot: a location that is not currently a hotspot but was significantly so in at least 90% of past intervals.

  • Incipient Coldspot: a location that is marked as a significant coldspot for the first time in the latest observed interval, without any previous history of significant coldspot status.

  • Sequential Coldspot: a location with a continuous stretch of significant coldspot activity leading up to the latest interval, provided it was not identified as a coldspot before this streak and less than 90% of intervals were marked as coldspots.

  • Strengthening Coldspot: a location identified as a coldspot in at least 90% of observed intervals, including the most recent, where there is a statistically significant increase in the intensity of low activity.

  • Stable Coldspot: a location that has been a significant coldspot in at least 90% of intervals without any discernible trend in the intensity of low activity over time.

  • Declining Coldspot: locations that have been significant coldspots in at least 90% of time steps, including the latest, but show a significant decrease in low activity intensity.

  • Occasional Coldspot: locations that sporadically become significant coldspots, with less than 90% of time steps marked as significant coldspots and no instances of being a significant hotspot.

  • Fluctuating Coldspot: a location marked as a significant coldspot in the latest interval that has also been a significant hotspot in past intervals, with less than 90% of intervals marked as significant coldspots.

  • Legacy Coldspot: locations that are not currently coldspots but were significantly so in at least 90% of past intervals.

Input parameters

The input parameters to this procedure are:

  • input: STRING the query to the data used to compute the coefficient. A qualified table name can be given as well: <mydatabase-id>.<myschema-id>.<mytable-name>.

  • output_table: STRING qualified name of the output table: <mydatabase-id>.<myschema-id>.<mytable-name>.

  • index_column: STRING name of the column with the spatial indexes.

  • date_column: STRING name of the column with the date.

  • gi_column: STRING name of the column with the getis ord values.

  • p_value_column: STRING name of the column with the p_value associated with the getis ord values.

  • options: STRING containing a valid JSON with the different options. Valid options are described in the table below. If options is set to NULL then all options are set to default.

  • threshold: FLOAT the threshold of the p-value for a location to be considered a hotspot/coldspot. Default is 0.05.

  • algorithm: STRING the algorithm to be used for the monotonic trend test. It can be either mk for the original Mann-Kendall test or mmk for the modified one. Default is mk.

Output

The output table contains the following columns:

  • index_column: the spatial index, with the same type as the index_column specified in the input.

  • classification: STRING one of the categories described in the table above.

  • tau: FLOAT the tau value of the trend test.

  • tau_p: FLOAT the p-value of the trend tau value. If it equals 2, the trend test has failed.

Example

CALL CARTO.CARTO.SPACETIME_HOTSPOTS_CLASSIFICATION(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'quadbin',
    'date',
    'gi',
    'p_value',
    $$
    {
        "algorithm":"mk",
        "threshold":0.05
    }
    $$
)
-- It will return a table with columns quadbin, classification STRING, tau FLOAT, tau_p FLOAT
CALL CARTO.CARTO.SPACETIME_HOTSPOTS_CLASSIFICATION(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'geoid',
    'date',
    'gi',
    'p_value',
    $$
    {
        "algorithm":"mk",
        "threshold":0.1
    }
    $$
)
-- It will return a table with columns geoid, classification STRING, tau FLOAT, tau_p FLOAT
CALL CARTO.CARTO.SPACETIME_HOTSPOTS_CLASSIFICATION(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'date',
    'gi',
    'p_value',
    $$
    {
        "algorithm":"mmk",
        "threshold":0.01
    }
    $$
)
-- It will return a table with columns h3, classification STRING, tau FLOAT, tau_p FLOAT
CALL CARTO.CARTO.SPACETIME_HOTSPOTS_CLASSIFICATION(
    ('SELECT h3, date, gi, p_value FROM <mydatabase-id>.<myschema-id>.<mytable-name>'),
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'date',
    'gi',
    'p_value',
    $$
    {
        "algorithm":"mmk",
        "threshold":0.01
    }
    $$
)
-- It will return a table with columns h3, classification STRING, tau FLOAT, tau_p FLOAT

TIME_SERIES_CLUSTERING

TIME_SERIES_CLUSTERING(
    input,
    output_table,
    partitioning_column,
    ts_column,
    value_column,
    options
)

Description

This procedure provides a way to group together different time series based on several methods that are use-case dependent. The user is able to choose the number of clusters. Please read the method definitions below carefully to understand their usage and the possible caveats of each method.

The data passed to the procedure needs to be structured using two different columns that will serve as indices:

  • A unique ID per time series (partitioning_column), which can be a spatial index, a location unique ID (for instance a POI, store, point of sale, antenna, asset, etc.) or any other ID that uniquely identifies each time series.

  • A timestamp (ts_column), which identifies each of the time steps within every series.

All these methods require the series to be aligned and grouped, so that there is one and only one observation per combination of ID and timestamp; the input data cannot have missing values (a series missing a value in the nth time step) or multiple values (a series with multiple values in the nth time step). Since each series may require different treatment, the user is in charge of performing this step.

The example below re-samples a daily time series to a weekly sampling rate and imputes any missing values.

WITH
    aligned_idx AS (
        SELECT
            weeks.ticket_week,
            shops.shop_id
        FROM
            (
                SELECT DISTINCT
                    -- Snowflake syntax; weeks start on Monday by default (see the WEEK_START parameter)
                    DATE_TRUNC('WEEK', ticket_date) AS ticket_week
                FROM
                    DATABASE.SCHEMA.SALES_INFO
            ) weeks,
            (
                SELECT DISTINCT
                    shop_id
                FROM
                    DATABASE.SCHEMA.SALES_INFO
            ) shops
    )
SELECT
    idx.shop_id,
    idx.ticket_week,
    COALESCE(SUM(info.ticket_amount), 0) AS total_sales
    -- Please note that the sales are an extensive variable; an intensive
    -- variable would require different aggregation and imputation methods
FROM
    aligned_idx idx
LEFT JOIN
    DATABASE.SCHEMA.SALES_INFO info
ON
    idx.ticket_week = DATE_TRUNC('WEEK', info.ticket_date)
    AND idx.shop_id = info.shop_id
GROUP BY
    idx.shop_id,
    idx.ticket_week

The procedure will use any of the methods explained below to cluster the series. There is no single correct method, but different ways to approach different use cases.

  • Value characteristic: groups the series based on how similar they are point-wise, that is, sample by sample. This should return intuitive results: differences in scale will tend to split the series into larger and smaller ones, and changes that keep points in similar ranges will contribute to the series being grouped together. Another way of thinking about this method is that it groups series together the closer their points are when plotted on a graph.

  • Profile characteristic: groups the series together based on how similar their dynamics are, that is, how similar their step-to-step changes are. This method does not take their scale into account but rather the correlation between series, grouping them together if their changes are similar. For example, two stores will be grouped together if their sales are stronger on weekends, even if their scales differ by orders of magnitude.

Arguments

  • input: VARCHAR the query to the data used to compute the clustering. A qualified table name can be given as well: <database-id>.<schema-id>.<table-name>.

  • output_table: VARCHAR qualified name of the output table: <database-id>.<schema-id>.<table-name>.

  • partitioning_column: VARCHAR name of the column with the time series IDs.

  • ts_column: VARCHAR name of the column with the date.

  • value_column: VARCHAR name of the column with the value per ID and timestep.

  • options: VARCHAR a JSON-formatted VARCHAR containing the advanced options for the procedure:

    • method: VARCHAR, one of:

      • value for value characteristic,

      • profile for profile characteristic.

    • n_clusters: INT number of clusters to generate in the K-Means.

    • reduce_dimensionality: BOOLEAN whether to reduce the number of time dimensions. If true, it will run Principal Component Analysis to reduce the number of time dimensions to the same rank as the unique time series IDs.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • The given partitioning_column, one entry per unique value in the input;

  • cluster, a VARCHAR column with the associated cluster label.

Example

CALL CARTO.CARTO.TIME_SERIES_CLUSTERING(
    '<mydatabase-id>.<myschema-id>.<mytable-name>',
    '<mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT',
    'h3',
    'date',
    'temperature',
    '{ "method": "profile", "n_clusters": 6, "reduce_dimensionality": true }'
);
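
Once the procedure finishes, a quick way to inspect the result (a minimal sketch reusing the output table from the example above) is to count how many series were assigned to each cluster:

SELECT
    cluster,
    COUNT(*) AS n_series
FROM
    <mydatabase-id>.<myschema-id>.<mytable-name>_OUTPUT
GROUP BY
    cluster
ORDER BY
    n_series DESC;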

class, a STRING containing the fully qualified name of the Snowpark ML modeling class to be used in the step.

options, an optional JSON dictionary containing the keyword arguments to be passed to the class during initialization. Please check the Snowpark ML API reference to see which parameters can be passed to the class.

nbuckets: INTEGER the number of buckets used when a bucketization method is specified. The default number of buckets is selected using Freedman and Diaconis's (1981) rule. Ignored if bucketize_method is not specified.

remove_outliers: BOOL. When bucketize_method is specified, if remove_outliers is set to true the buckets are derived from the outlier-removed data. The outliers are computed using the Tukey's fences k parameter for outlier detection. The default value is true. Ignored if bucketize_method is not specified.

nbuckets: INTEGER the number of buckets used when a bucketization method is specified. When bucketize_method is set to EQUAL_INTERVALS, if nbuckets is NULL, the default number of buckets is selected using Freedman and Diaconis's (1981) rule. When bucketize_method is set to JENKS or QUANTILES, nbuckets cannot be NULL. When bucketize_method is set to JENKS, the maximum value is 100, i.e. the maximum number of clusters allowed by the underlying k-means clustering.

When bucketize_method is set to EQUAL_INTERVALS, the default number of buckets is selected using Freedman and Diaconis's (1981) rule.
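
For reference, these are the standard definitions (well-known formulas, not specific to these procedures). Freedman and Diaconis's rule derives the bucket width from the interquartile range of the n input values, and Tukey's fences flag as outliers the values lying outside a multiple k of that range:

h = 2 \cdot \frac{\mathrm{IQR}(x)}{n^{1/3}}, \qquad n_{\mathrm{buckets}} \approx \left\lceil \frac{\max(x) - \min(x)}{h} \right\rceil

\left[\, Q_1 - k \cdot \mathrm{IQR}(x),\; Q_3 + k \cdot \mathrm{IQR}(x) \,\right], \quad \text{typically } k = 1.5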

This procedure computes the Cronbach's alpha coefficient for a set of (spatial) variables. This coefficient can be used as a measure of internal consistency or reliability of the data, based on the strength of correlations between individual variables. Cronbach's alpha normally ranges between 0 and 1, although there is actually no lower limit to the coefficient. A higher alpha (closer to 1) indicates higher internal consistency, with 0.65 usually considered the minimum acceptable value. Rows with a NULL value in any of the individual variables are dropped.
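
For reference, the standard definition of the coefficient (a well-known formula, not specific to this procedure) for k variables Y_1, ..., Y_k whose sum is X = Y_1 + ... + Y_k is:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sigma^2_{X}}\right)

where \sigma^2_{Y_i} is the variance of each individual variable and \sigma^2_{X} the variance of their sum.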

This procedure performs a local least squares regression for every input cell. This approach was selected to improve computation time and efficiency. The number of models is controlled by the selected cell resolution, so the user can increase or decrease the resolution of the cell index to perform more or fewer regressions. Note that you need to provide the cell ID (spatial index) for every location as input (see the cell_column parameter), i.e., the cell type and resolution are not passed explicitly; the index has to be computed beforehand. Hence, if you want to increase or decrease the resolution, you need to precompute the corresponding cell ID of every location (see the H3 or Quadbin module).

kernel_function: STRING the kernel function used to compute the spatial weights across the kring. Available functions are: 'uniform', 'triangular', 'quadratic', 'quartic' and 'gaussian'.
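
Regarding the cell ID requirement mentioned above: if the input is a table of point locations, the index can be precomputed with the toolbox's spatial index modules. A minimal sketch, assuming an H3_FROMGEOGPOINT function is available under CARTO.CARTO and using hypothetical table and column names:

-- Precompute the H3 cell (here at resolution 8) for every location; changing the
-- resolution changes how many local regressions will be fitted
SELECT
    CARTO.CARTO.H3_FROMGEOGPOINT(geom, 8) AS h3,
    target_variable,
    feature_1,
    feature_2
FROM
    DATABASE.SCHEMA.MY_LOCATIONS;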

kernel: STRING the kernel function used to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

This procedure computes the space-time Getis-Ord Gi* statistic for each H3 index and each datetime timestamp, according to the method described in the referenced paper. It extends the Getis-Ord Gi* function by including the time domain. The Getis-Ord Gi* statistic is a measure of spatial autocorrelation, i.e. the degree to which data values are clustered together in space and time. The statistic is computed as the sum of the values of the cells in the kring (distance from the origin, in space and time) weighted by the kernel functions, minus the value of the origin cell, divided by the standard deviation of the values of the cells in the kring. The statistic is calculated from the minimum to the maximum datetime in the input, with the step defined by the user. The datetime timestamp is truncated to the provided level, for example day, hour or week. For each spatial index, missing datetime timestamps, from minimum to maximum, are filled with the default value of 0; any other imputation of the values should take place before passing the input to the procedure. The p-value is computed as the probability of observing a value as extreme as the observed one, assuming the null hypothesis that the values are randomly distributed in space and time; it is computed using a normal distribution approximation.

kernel: STRING the kernel function used to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

kernel_time: STRING the kernel function used to compute the temporal weights within the time bandwidth. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

This procedure computes the space-time Getis-Ord Gi* statistic for each Quadbin index and each datetime timestamp, according to the method described in the referenced paper. It extends the Getis-Ord Gi* function by including the time domain. The statistic, the handling of missing timestamps and the p-value are computed exactly as described above for the H3 variant.

kernel: STRING the kernel function used to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

kernel_time: STRING the kernel function used to compute the temporal weights within the time bandwidth. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

This procedure computes the Moran's I spatial autocorrelation from the input table with H3 indexes.

decay: STRING the decay function used to compute the distance decay. Available functions are: uniform, inverse, inverse_square and exponential.

This procedure computes the Moran's I spatial autocorrelation from the input table with Quadbin indexes.

decay: STRING the decay function used to compute the distance decay. Available functions are: uniform, inverse, inverse_square and exponential.

This procedure is designed to analyze spatio-temporal data in order to identify and categorize locations based on their hotspot or coldspot status over time. Using the z-score values generated by the space-time Getis-Ord functions described above, and applying either the original Mann-Kendall or the modified Mann-Kendall trend test on these values, it categorizes each location into a specific type of hotspot or coldspot. This categorization is based on patterns of spatial clustering and intensity trends over the observed time intervals (for example, whether a location behaves as a new, persistent, intensifying or diminishing hotspot or coldspot).

Note: this function internally uses the Euclidean Distance, where the dimensions are the timesteps; therefore, it will suffer from the Curse of Dimensionality for a very large number of timesteps. To identify this issue, start the analysis with very coarse time aggregations (e.g. monthly) and progressively increase the temporal resolution to inspect any changes in the classification. If your data is affected, and it has a much larger number of timesteps than unique series IDs, consider using the options.reduce_dimensionality parameter.
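
For instance (a sketch reusing the threshold option shown in the example at the beginning of this section; adjust to your own options), the dimensionality reduction can be enabled through the options argument:

'{
    "threshold": 0.01,
    "reduce_dimensionality": true
}'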
