statistics
ADVANCED
This module contains functions to perform spatial statistics calculations.
P_VALUE
P_VALUE(z_score)
Description
This function computes the one-tailed p-value (upper-tail test) of a given z-score, assuming the population follows a normal distribution with mean 0 and standard deviation 1. The z-score measures how many standard deviations above or below the population mean a value is, giving an idea of how far from the mean a data point lies. The p-value is the probability that a randomly sampled point has a value at least as extreme as the point whose z-score is being tested.
z_score
:FLOAT64
Return type
FLOAT64
Example
SELECT `carto-un`.carto.P_VALUE(u) as p_value
FROM UNNEST([-2,-1,0,1,2]) u;
-- [ 0.9772499371127437, 0.8413447361676363, 0.49999999949999996, 0.15865526383236372, 0.02275006288725634]
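The upper-tail p-value is 1 − Φ(z), where Φ is the standard normal CDF. A minimal Python sketch (not CARTO's implementation) using the complementary error function:

```python
import math

def p_value(z_score: float) -> float:
    """Upper-tail p-value of a z-score under N(0, 1): P(Z >= z) = 1 - Phi(z)."""
    # 1 - Phi(z) = erfc(z / sqrt(2)) / 2
    return 0.5 * math.erfc(z_score / math.sqrt(2.0))
```

For z = 0 this gives 0.5 and for z = 2 roughly 0.0228, matching the example output above up to the precision of the SQL approximation.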
KNN
KNN(points, k)
Description
This function returns, for each point in a given set of points, its k-nearest neighbors.
points
:ARRAY<STRUCT<geoid STRING, geo GEOGRAPHY>>
input data with unique id and geography.
k
:INT64
number of nearest neighbors (positive, typically small).
Return type
ARRAY<STRUCT<geo GEOGRAPHY, geo_knn GEOGRAPHY, geoid STRING, geoid_knn STRING, distance FLOAT64, knn INT64>>
where:
geo
: the geometry of the considered point.
geo_knn
: the k-nearest neighbor point.
geoid
: the unique identifier of the considered point.
geoid_knn
: the unique identifier of the k-nearest neighbor.
distance
: the k-nearest neighbor distance to the considered point.
knn
: the k-order (knn).
Example
SELECT *
FROM UNNEST((
  SELECT `carto-un`.carto.KNN(myarray, 10)
  FROM (
    SELECT ARRAY_AGG(STRUCT(format('%08x', uid), position_geom)) myarray
    FROM (
      SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
      FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
      WHERE date = '2017-12-31'
    )
  )
))
ORDER BY geoid
--{
-- "geo": "POINT(2.82263 41.97118)",
-- "geo_knn": "POINT(2.8225 41.97117)",
-- "geoid": "00000001",
-- "geoid_knn": "00000624",
-- "distance": "10.804663098937658",
-- "knn": "1"
--},
--{
-- "geo": "POINT(2.82263 41.97118)",
-- "geo_knn": "POINT(2.823 41.9712)",
-- "geoid": "00000001",
-- "geoid_knn": "00000666",
-- "distance": "30.66917920746894",
-- "knn": "2"
--},
--{
-- "geo": "POINT(2.82263 41.97118)",
-- "geo_knn": "POINT(2.82298 41.9713)",
-- "geoid": "00000001",
-- "geoid_knn": "00000618",
-- "distance": "31.863463704968353",
-- "knn": "3"
--},
-- ...
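Conceptually, the function performs a nearest-neighbor search per point. A brute-force sketch on planar coordinates (the SQL function operates on GEOGRAPHY values and returns geodesic distances in meters):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def knn(points, k):
    """Brute-force k-nearest neighbors.

    points: list of (geoid, (x, y)) tuples. Returns, for each point, its k
    nearest neighbors as (geoid, geoid_knn, distance, knn_order) tuples,
    mirroring the fields of the SQL return struct.
    """
    result = []
    for geoid, xy in points:
        others = [(other_id, dist(xy, other_xy))
                  for other_id, other_xy in points if other_id != geoid]
        others.sort(key=lambda t: t[1])  # closest first
        for order, (other_id, d) in enumerate(others[:k], start=1):
            result.append((geoid, other_id, d, order))
    return result
```

This O(n²) scan is only for illustration; the SQL implementation is expected to use spatial indexing.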
LOF_TABLE
LOF_TABLE(src_fullname STRING, target_fullname STRING, geoid_column_name STRING, geo_column_name STRING, lof_target_column_name STRING, k INT64)
Description
This function computes the Local Outlier Factor for each point of a specified column and stores the result in an output table along with the other input columns.
src_fullname
:STRING
The input table. A STRING of the form project-id.dataset-id.table-name is expected. The project-id can be omitted (in which case the default one will be used).
target_fullname
:STRING
The resulting table where the LOF will be stored. A STRING of the form project-id.dataset-id.table-name is expected. The project-id can be omitted (in which case the default one will be used). The dataset must exist and the caller needs to have permissions to create a new table in it. The process will fail if the target table already exists.
geoid_column_name
:STRING
The column name with a unique identifier for each point.
geo_column_name
:STRING
The column name containing the points.
lof_target_column_name
:STRING
The column name where the resulting Local Outlier Factor will be stored in the output table.
k
:INT64
Number of nearest neighbors (positive, typically small).
Example
CALL `carto-un`.carto.LOF_TABLE(
'bigquery-public-data.new_york_subway.stations',
'myproject.mydataset.my_output_table',
'station_id',
'station_geom',
'lof',
10
);
-- The table `myproject.mydataset.my_output_table` will be created
-- with an extra column containing the `lof` value.
LOF
LOF(points, k)
Description
This function computes the Local Outlier Factor of each point of a given set of points.
points
:ARRAY<STRUCT<geoid STRING, geo GEOGRAPHY>>
input data points with unique id and geography.
k
:INT64
number of nearest neighbors (positive, typically small).
Return type
ARRAY<STRUCT<geo GEOGRAPHY, geoid STRING, lof FLOAT64>>
where:
geo
: the geometry of the considered point.
geoid
: the unique identifier of the considered point.
lof
: the Local Outlier Factor score.
Example
SELECT *
FROM UNNEST((
  SELECT `carto-un`.carto.LOF(myarray, 10)
  FROM (
    SELECT ARRAY_AGG(STRUCT(format('%08x', uid), position_geom)) myarray
    FROM (
      SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
      FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
      WHERE date = '2017-12-31'
    )
  )
))
ORDER BY geoid
-- {"geo": POINT(2.82263 41.97118), "geoid": "00000001", "lof": 1.3217599116891428}
-- {"geo": POINT(2.35705 41.49786), "geoid": "00000002", "lof": 1.235551000737416}
-- {"geo": POINT(2.13967 41.3838), "geoid": "00000003", "lof": 1.1305674032876687}
-- ...
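The Local Outlier Factor compares each point's local density with that of its k-nearest neighbors; scores well above 1 flag outliers. A brute-force sketch on planar coordinates (the SQL function operates on GEOGRAPHY with geodesic distances; this follows the standard LOF definition, not CARTO's exact implementation):

```python
from math import dist

def lof(points, k):
    """Local Outlier Factor (brute force, planar distances).

    points: list of (geoid, (x, y)). Returns {geoid: lof_score}.
    """
    ids = [p[0] for p in points]
    xy = dict(points)

    def neighbors(i):
        ds = sorted((dist(xy[i], xy[j]), j) for j in ids if j != i)
        return ds[:k]

    knn_ = {i: neighbors(i) for i in ids}
    k_dist = {i: knn_[i][-1][0] for i in ids}  # distance to the k-th neighbor

    def reach_dist(i, j):
        # reachability distance from i to its neighbor j
        return max(k_dist[j], dist(xy[i], xy[j]))

    # local reachability density: inverse of the mean reachability distance
    lrd = {i: len(knn_[i]) / sum(reach_dist(i, j) for _, j in knn_[i])
           for i in ids}
    # LOF: mean lrd of the neighbors divided by the point's own lrd
    return {i: sum(lrd[j] for _, j in knn_[i]) / (len(knn_[i]) * lrd[i])
            for i in ids}
```

Points inside a uniform cluster score close to 1, while an isolated point scores much higher.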
GFUN
GFUN(points)
Description
This function computes the G-function of a given set of points.
points
:ARRAY<GEOGRAPHY>
input data points.
Return type
ARRAY<STRUCT<distance FLOAT64, gfun_G FLOAT64, gfun_ev FLOAT64>>
where:
distance
: the nearest neighbors distances.
gfun_G
: the empirical G evaluated for each distance in the support.
gfun_ev
: the theoretical Poisson G evaluated for each distance in the support.
Example
SELECT *
FROM UNNEST((
  SELECT `carto-un`.carto.GFUN(myarray)
  FROM (
    SELECT ARRAY_AGG(position_geom) myarray
    FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
    WHERE date = '2017-12-31'
  )
))
ORDER BY distance
--{
-- "distance": "38.599968853183",
-- "gfun_G": "0.4319167389418907",
-- "gfun_ev": "4.037383876246414E-4"
--},
--{
-- "distance": "77.199937706366",
-- "gfun_G": "0.5771899392888118",
-- "gfun_ev": "0.0016139757856029613"
--},
--{
-- "distance": "115.799906559549",
-- "gfun_G": "0.6522116218560278",
-- "gfun_ev": "0.003627782844736638"
--},
-- ...
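The G-function is the empirical cumulative distribution of nearest-neighbor distances, compared against the expectation under complete spatial randomness, G(d) = 1 − exp(−λπd²) for a Poisson process of intensity λ. A planar sketch (the SQL function works on GEOGRAPHY; the intensity handling here is an assumption for illustration):

```python
import math

def gfun(points, distances, intensity):
    """G-function sketch: empirical vs. theoretical nearest-neighbor CDF.

    points: list of (x, y); distances: evaluation support;
    intensity: expected number of points per unit area (lambda).
    Returns [(distance, gfun_G, gfun_ev), ...] as in the SQL return struct.
    """
    # nearest-neighbor distance of each point
    nn = [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
          for i, p in enumerate(points)]
    out = []
    for d in distances:
        g_emp = sum(1 for x in nn if x <= d) / len(nn)  # empirical CDF
        g_poi = 1.0 - math.exp(-intensity * math.pi * d * d)  # CSR expectation
        out.append((d, g_emp, g_poi))
    return out
```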
CREATE_SPATIAL_COMPOSITE_SUPERVISED
CREATE_SPATIAL_COMPOSITE_SUPERVISED(input_query,index_column, output_prefix, options)
Description
This procedure derives a spatial composite score as the standardized residuals of a regression model, used to detect areas of under- and over-prediction. The response variable should be measurable and correlated with the set of variables defining the score. For each data point, the residual is defined as the observed value minus the predicted value. Rows with a NULL value in any of the individual variables are dropped.
Input parameters
input_query
:STRING
the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.
index_column
:STRING
the name of the column with the unique geographic identifier.
output_prefix
:STRING
the prefix for the output table. It should include project and dataset, e.g. 'project-id.dataset-id.table-name'.
options
:STRING
containing a valid JSON with the different options. Valid options are described below.
model_transform
:STRING
containing the TRANSFORM clause in a BigQuery ML CREATE MODEL statement. If NULL, no TRANSFORM clause is applied.
model_options
:JSON
with the different options allowed by the BigQuery ML CREATE MODEL statement for regression models. Any model is allowed as long as it can deal with numerical inputs for the response variable. At least the INPUT_LABEL_COLS and MODEL_TYPE parameters must be specified.
r2_thr
:FLOAT64
the minimum allowed value for the R2 model score. If the R2 of the regression model is lower than this threshold, this implies poor fitting and a warning is raised. The default value is 0.5.
return_range
:ARRAY<FLOAT64>
the user-defined normalization range of the spatial composite score, e.g. [0.0,1.0]. The default value is NULL. Ignored if bucketize_method is specified.
bucketize_method
:STRING
the method used to discretize the spatial composite score. The default value is NULL. Possible options are:
EQUAL_INTERVALS: the values of the spatial composite score are discretized into buckets of equal widths.
QUANTILES: the values of the spatial composite score are discretized into buckets based on quantiles.
JENKS: the values of the spatial composite score are discretized into buckets obtained using k-means clustering.
nbuckets
:INT64
the number of buckets used when a bucketization method is specified. The default number of buckets is selected using Freedman and Diaconis's (1981) rule. For bucketize_method = JENKS the maximum value is 100, i.e. the maximum number of clusters allowed by BigQuery k-means clustering.
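The Freedman and Diaconis rule used for the default bucket count sets the bin width to twice the interquartile range divided by the cube root of the sample size. A minimal sketch (the quantile interpolation and rounding details are assumptions, not necessarily what BigQuery computes):

```python
import math

def fd_nbuckets(values):
    """Freedman-Diaconis (1981) rule for the number of buckets.

    Bin width h = 2 * IQR * n^(-1/3); the bucket count is the data range
    divided by h, rounded up.
    """
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # linear-interpolation quantile
        pos = q * (n - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    iqr = quantile(0.75) - quantile(0.25)
    h = 2.0 * iqr * n ** (-1.0 / 3.0)
    return math.ceil((xs[-1] - xs[0]) / h)
```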
Return type
The output table with the following columns:
index_column
:STRING
the unique geographic identifier.
spatial_score
:FLOAT64
the value of the composite score.
Example
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_SUPERVISED(
'SELECT * FROM `cartobq.docs.spatial_scoring_input`',
'geoid',
'<project-id>.<dataset-id>.<table-name>',
'''{
"model_transform":[
"revenue_change",
"fempop_15_44, public_transport, education, pois, urbanity"
],
"model_options":{
"MODEL_TYPE":"LINEAR_REG",
"INPUT_LABEL_COLS":['revenue_change'],
"DATA_SPLIT_METHOD":"no_split",
"OPTIMIZE_STRATEGY":"NORMAL_EQUATION",
"CATEGORY_ENCODING_METHOD":"ONE_HOT_ENCODING",
"ENABLE_GLOBAL_EXPLAIN":true
},
"r2_thr":0.4,
"bucketize_method":"JENKS"
}
'''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
CREATE_SPATIAL_COMPOSITE_UNSUPERVISED
CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(input_query, index_column, output_prefix, options)
Description
This procedure combines (spatial) variables into a meaningful composite score. The composite score can be derived using different methods, scaling and aggregation functions and weights. Rows with a NULL value in any of the model predictors are dropped.
Input parameters
input_query
:STRING
the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.
index_column
:STRING
the name of the column with the unique geographic identifier.
output_prefix
:STRING
the prefix for the output table. It should include project and dataset, e.g. 'project-id.dataset-id.table-name'.
options
:STRING
containing a valid JSON with the different options. Valid options are described below. If options is set to NULL then all options are set to default values, as specified in the table below.
scoring_method
:STRING
Possible options are ENTROPY, CUSTOM_WEIGHTS, FIRST_PC. With the ENTROPY method the spatial composite is derived as the weighted sum of the proportion of the min-max scaled individual variables, where the weights are based on the entropy of the proportion of each variable. Only numerical variables are allowed. With the CUSTOM_WEIGHTS method, the spatial composite is computed by first scaling each individual variable and then aggregating them according to user-defined scaling and aggregation methods and individual weights. Depending on the scaling parameter, both numerical and ordinal variables are allowed (categorical and boolean variables need to be transformed to ordinal). With the FIRST_PC method, the spatial composite is derived from a Principal Component Analysis as the first principal component score. Only numerical variables are allowed.
weights
:STRUCT
the (optional) weights for each variable used to compute the spatial composite when scoring_method is set to CUSTOM_WEIGHTS, passed as {"name":value, …}. If a different scoring method is selected, then this input parameter is ignored. If specified, the sum of the weights must be lower than 1. If no weights are specified, equal weights are assumed. If weights are specified only for some variables and the sum of weights is less than 1, the remainder is distributed equally between the remaining variables. If weights are specified for all the variables and the sum of weights is less than 1, the remainder is distributed equally between all the variables.
scaling
:STRING
the user-defined scaling when the scoring_method is set to CUSTOM_WEIGHTS. Possible options are:
MIN_MAX_SCALER: data is rescaled into the range [0,1] based on minimum and maximum values. Only numerical variables are allowed.
STANDARD_SCALER: data is rescaled by subtracting the mean value and dividing the result by the standard deviation. Only numerical variables are allowed.
RANKING: data is replaced by its percent rank, that is by values ranging from 0 (lowest) to 1 (highest). Both numerical and ordinal variables are allowed (categorical and boolean variables need to be transformed to ordinal).
DISTANCE_TO_TARGET_MIN(_MAX, _AVG): data is rescaled by dividing by the minimum, maximum, or mean of all the values. Only numerical variables are allowed.
PROPORTION: data is rescaled by dividing by the sum total of all the values. Only numerical variables are allowed.
aggregation
:STRING
the aggregation function used when the scoring_method is set to CUSTOM_WEIGHTS. Possible options are:
LINEAR: the spatial composite is derived as the weighted sum of the scaled individual variables.
GEOMETRIC: the spatial composite is given by the product of the scaled individual variables, each to the power of its weight.
correlation_var
:STRING
when scoring_method is set to FIRST_PC, the spatial score will be positively correlated with the selected variable (i.e. the sign of the spatial score is set such that the correlation between the selected variable and the first principal component score is positive).
correlation_thr
:FLOAT64
the minimum absolute value of the correlation between each individual variable and the first principal component score when scoring_method is set to FIRST_PC.
return_range
:ARRAY<FLOAT64>
the user-defined normalization range of the spatial composite score, e.g. [0.0,1.0]. Ignored if bucketize_method is specified.
bucketize_method
:STRING
the method used to discretize the spatial composite score. Possible options are:
EQUAL_INTERVALS: the values of the spatial composite score are discretized into buckets of equal widths.
QUANTILES: the values of the spatial composite score are discretized into buckets based on quantiles.
JENKS: the values of the spatial composite score are discretized into buckets obtained using k-means clustering.
nbuckets
:INT64
the number of buckets used when a bucketization method is specified. The default number of buckets is selected using Freedman and Diaconis's (1981) rule. For bucketize_method = JENKS the maximum value is 100, i.e. the maximum number of clusters allowed by BigQuery k-means clustering.
| Option | ENTROPY | CUSTOM_WEIGHTS | FIRST_PC | Valid options | Default value |
|---|---|---|---|---|---|
| scoring_method | Optional | Optional | Optional | ENTROPY, CUSTOM_WEIGHTS, FIRST_PC | ENTROPY |
| weights | Ignored | Optional | Ignored | {"name":value…} | NULL |
| scaling | Ignored | Optional | Ignored | MIN_MAX_SCALER, STANDARD_SCALER, RANKING, DISTANCE_TO_TARGET_MIN, DISTANCE_TO_TARGET_MAX, DISTANCE_TO_TARGET_AVG, PROPORTION | MIN_MAX_SCALER |
| aggregation | Ignored | Optional | Ignored | LINEAR, GEOMETRIC | LINEAR |
| correlation_var | Ignored | Optional | Mandatory | - | NULL |
| correlation_thr | Ignored | Optional | Optional | - | NULL |
| return_range | Ignored | Optional | Optional | - | NULL |
| bucketize_method | Ignored | Optional | Optional | EQUAL_INTERVALS, QUANTILES, JENKS | NULL |
Return type
The output table with the following columns:
index_column
:STRING
the unique geographic identifier.
spatial_score
:FLOAT64
the value of the composite score.
Examples
With the ENTROPY method:
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `cartobq.docs.spatial_scoring_input`',
'geoid',
'<project-id>.<dataset-id>.<table-name>',
'''{
"scoring_method":"ENTROPY",
"return_range":[0.0,1.0]
}
'''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
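The ENTROPY scoring described above (min-max scaling, proportions, entropy-based weights) can be sketched as follows. The handling of zeros and the exact weight normalization are assumptions based on the standard entropy-weight method, not CARTO's implementation:

```python
import math

def entropy_score(data):
    """ENTROPY composite sketch.

    data: dict {variable_name: list of values}, one entry per row.
    Min-max scales each variable, converts it to proportions, computes its
    entropy, and weights variables by (1 - entropy), normalized to sum to 1.
    Returns the composite score per row.
    """
    n = len(next(iter(data.values())))
    weights, scaled = {}, {}
    for name, xs in data.items():
        lo, hi = min(xs), max(xs)
        s = [(x - lo) / (hi - lo) for x in xs]      # min-max scaling
        total = sum(s)
        p = [v / total for v in s]                  # proportions
        # normalized entropy of the proportions (0 * log 0 taken as 0)
        e = -sum(v * math.log(v) for v in p if v > 0) / math.log(n)
        weights[name] = 1.0 - e                     # more informative -> heavier
        scaled[name] = p
    wsum = sum(weights.values())
    weights = {k: w / wsum for k, w in weights.items()}
    return [sum(weights[k] * scaled[k][i] for k in data) for i in range(n)]
```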
With the CUSTOM_WEIGHTS method:
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
'SELECT * EXCEPT(geom, revenue_change, urbanity) FROM `cartobq.docs.spatial_scoring_input`',
'geoid',
'<project-id>.<dataset-id>.<table-name>',
'''{
"scoring_method":"CUSTOM_WEIGHTS",
"weights":{"fempop_15_44":0.2,"education":0.1,"urbanity_ordinal":0.1,"pois":0.1},
"scaling":"RANKING",
"aggregation":"LINEAR",
"bucketize_method":"JENKS"
}
'''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
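The weight-redistribution rule for CUSTOM_WEIGHTS (equal weights if none given; remainder shared by unspecified variables, or by all variables if every one is specified) can be sketched as:

```python
def resolve_weights(variables, weights=None):
    """Distribute the weight remainder as described for CUSTOM_WEIGHTS.

    variables: list of variable names; weights: optional {name: weight}
    with sum < 1. A sketch of the documented rule, not CARTO's code.
    """
    if not weights:
        # no weights specified: equal weights
        return {v: 1.0 / len(variables) for v in variables}
    remainder = 1.0 - sum(weights.values())
    missing = [v for v in variables if v not in weights]
    # remainder goes to unspecified variables, or to all if none are missing
    targets = missing if missing else variables
    out = {v: weights.get(v, 0.0) for v in variables}
    for v in targets:
        out[v] += remainder / len(targets)
    return out
```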
With the FIRST_PC method:
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `cartobq.docs.spatial_scoring_input`',
'geoid',
'<project-id>.<dataset-id>.<table-name>',
'''{
"scoring_method":"FIRST_PC",
"correlation_var":"fempop_15_44",
"correlation_thr":0.6,
"bucketize_method":"QUANTILES"
}
'''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
CRONBACH_ALPHA_COEFFICIENT
CRONBACH_ALPHA_COEFFICIENT(input_query, output_prefix)
Description
This procedure computes Cronbach's alpha coefficient for a set of (spatial) variables. This coefficient can be used as a measure of internal consistency or reliability of the data, based on the strength of correlations between individual variables. Cronbach's alpha normally ranges between 0 and 1, although there is actually no lower limit to the coefficient. A higher alpha (closer to 1) indicates higher internal consistency, with 0.65 usually considered the minimum acceptable value. Rows with a NULL value in any of the individual variables are dropped.
Input parameters
input_query
:STRING
the query to the data used to compute the coefficient. It must contain all the individual variables that should be included in the computation of the coefficient. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.
output_prefix
:STRING
the prefix for the output table. It should include project and dataset, e.g. 'project-id.dataset-id.table-name'.
Return type
The output table with the following columns:
cronbach_alpha_coef
:FLOAT64
the computed Cronbach's alpha coefficient.
k
:INT64
the number of individual variables used to compute the composite.
mean_var
:FLOAT64
the mean variance of all individual variables.
mean_cov
:FLOAT64
the mean covariance of all individual variables.
Example
CALL `carto-un`.carto.CRONBACH_ALPHA_COEFFICIENT(
'SELECT * EXCEPT(geoid, geom, revenue_change, urbanity, urbanity_ordinal) FROM `cartobq.docs.spatial_scoring_input`',
'<project-id>.<dataset-id>.<table-name>'
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created
GWR_GRID
GWR_GRID(input_table, features_columns, label_column, cell_column, cell_type, kring_distance, kernel_function, fit_intercept, output_table)
Description
Geographically weighted regression (GWR) models local relationships between spatially varying predictors and an outcome of interest using a local least squares regression.
This procedure performs a local least squares regression for every input cell. Each regression takes into account the data of the cell and of its neighboring cells, as defined by the kring_distance parameter. The data of the neighboring cells are assigned a lower weight the further they are from the origin cell, following the function specified in the kernel_function parameter.
input_table
:STRING
name of the source dataset. It should be a quoted qualified table with project and dataset: <project-id>.<dataset-id>.<table-name>.
features_columns
:ARRAY<STRING>
array of column names from input_table to be used as features in the GWR.
label_column
:STRING
name of the target variable column.
cell_column
:STRING
name of the column containing the cell ids.
cell_type
:STRING
spatial index type, either 'h3' or 'quadbin'.
kring_distance
:INT64
distance of the neighboring cells whose data will be included in the local regression of each cell.
kernel_function
:STRING
kernel function to compute the spatial weights across the kring. Available functions are: 'uniform', 'triangular', 'quadratic', 'quartic' and 'gaussian'.
fit_intercept
:BOOL
whether to calculate the intercept of the model or to force it to zero if, for example, the input data is already supposed to be centered. If NULL, fit_intercept will be considered as TRUE.
output_table
:STRING
name of the output table. It should be a quoted qualified table with project and dataset: <project-id>.<dataset-id>.<table-name>. The process will fail if the target table already exists. If NULL, the result will be returned directly by the query and not persisted.
Output
The output table will contain a column with the cell id, a column for each feature column containing its corresponding coefficient estimate, and one extra column for the intercept if fit_intercept is TRUE.
Examples
CALL `carto-un`.carto.GWR_GRID(
'cartobq.docs.airbnb_berlin_h3_qk_qb',
['bedrooms', 'bathrooms'], -- [ bedrooms feature, bathrooms feature ]
'price', -- price (target variable)
'h3_z6', 'h3', 3, 'gaussian', TRUE,
'<project-id>.<dataset-id>.<table-name>'
);
CALL `carto-un`.carto.GWR_GRID(
'cartobq.docs.airbnb_berlin_h3_qk_qb',
['bedrooms', 'bathrooms'], -- [ bedrooms feature, bathrooms feature ]
'price', -- price (target variable)
'qb_z12', 'quadbin', 3, 'gaussian', TRUE,
'<project-id>.<dataset-id>.<table-name>'
);
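The kernel functions accepted by kernel_function weight a neighbor by its distance from the origin cell relative to the kring bandwidth. A sketch using the standard kernel definitions (the exact scaling used by CARTO is not stated in this reference):

```python
import math

def kernel_weight(kernel, d, h):
    """Spatial weight of a neighbor at distance d for bandwidth h.

    Standard kernel shapes; all but the gaussian drop to 0 beyond h.
    """
    u = d / h
    if u > 1 and kernel != 'gaussian':
        return 0.0
    return {
        'uniform': lambda: 1.0,
        'triangular': lambda: 1.0 - u,
        'quadratic': lambda: 0.75 * (1.0 - u * u),          # Epanechnikov
        'quartic': lambda: (15.0 / 16.0) * (1.0 - u * u) ** 2,
        'gaussian': lambda: math.exp(-0.5 * u * u),
    }[kernel]()
```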
GETIS_ORD_H3
GETIS_ORD_H3(input, size, kernel)
Description
This function computes the Getis-Ord Gi* statistic for each H3 index in the input array.
input
:ARRAY<STRUCT<index STRING, value FLOAT64>>
input data with the indexes and values of the cells.
size
:INT64
size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.
kernel
:STRING
kernel function to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
Return type
ARRAY<STRUCT<index STRING, gi FLOAT64, p_value FLOAT64>>
Example
SELECT `carto-un`.carto.GETIS_ORD_H3(
[
STRUCT('89394460323ffff', 51.0),
STRUCT('89394460c37ffff', 28.0),
STRUCT('89394460077ffff', 19.0)
],
3, 'gaussian'
);
-- {"index": "89394460323ffff", "gi": 1.3606194139870573, "p_value": 0.13329689888387608}
-- {"index": "89394460c37ffff", "gi": -0.34633948719670526, "p_value": 0.6113291103317855}
-- {"index": "89394460077ffff", "gi": -1.0142799267903515, "p_value": 0.7962089998559484 }
SELECT `carto-un`.carto.GETIS_ORD_H3(input_data, 3, 'gaussian')
FROM (
SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
FROM mytable
)
GETIS_ORD_QUADBIN
GETIS_ORD_QUADBIN(input, size, kernel)
Description
This function computes the Getis-Ord Gi* statistic for each Quadbin index in the input array.
input
:ARRAY<STRUCT<index INT64, value FLOAT64>>
input data with the indexes and values of the cells.
size
:INT64
size of the Quadbin k-ring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.
kernel
:STRING
kernel function to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
Return type
ARRAY<STRUCT<index INT64, gi FLOAT64, p_value FLOAT64>>
Example
SELECT `carto-un`.carto.GETIS_ORD_QUADBIN(
[
STRUCT(5266443791933898751, 51.0),
STRUCT(5266443803500740607, 28.0),
STRUCT(5266443790415822847, 19.0)
],
3, 'gaussian'
);
-- {"index": 5266443791933898751, "gi": 1.360619413987058, "p_value": 0.086817058065399522}
-- {"index": 5266443803500740607, "gi": -0.3463394871967051, "p_value": 0.63545613599515272}
-- {"index": 5266443790415822847, "gi": -1.0142799267903515, "p_value": 0.84477538488255133}
SELECT `carto-un`.carto.GETIS_ORD_QUADBIN(input_data, 3, 'gaussian')
FROM (
SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
FROM mytable
)
MORANS_I_H3
MORANS_I_H3(input, size, decay)
Description
This function computes the Moran's I spatial autocorrelation from the input array of H3 indexes.
input
:ARRAY<STRUCT<index STRING, value FLOAT64>>
input data with the indexes and values of the cells.
size
:INT64
size of the H3 kring (distance from the origin). This defines the area around each index cell where the distance decay will be applied.
decay
:STRING
decay function to compute the distance decay. Available functions are: uniform, inverse, inverse_square and exponential.
Return type
FLOAT64
Example
SELECT `carto-un`.carto.MORANS_I_H3(
[
STRUCT('89394460323ffff', 51.0),
STRUCT('89394460c37ffff', 28.0),
STRUCT('89394460077ffff', 19.0)
],
3, 'exponential'
);
-- 0.07219909881624618
SELECT `carto-un`.carto.MORANS_I_H3(input_data, 3, 'exponential')
FROM (
SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
FROM mytable
)
MORANS_I_QUADBIN
MORANS_I_QUADBIN(input, size, decay)
Description
This function computes the Moran's I spatial autocorrelation from the input array of Quadbin indexes.
input
:ARRAY<STRUCT<index INT64, value FLOAT64>>
input data with the indexes and values of the cells.
size
:INT64
size of the Quadbin k-ring (distance from the origin). This defines the area around each index cell where the distance decay will be applied.
decay
:STRING
decay function to compute the distance decay. Available functions are: uniform, inverse, inverse_square and exponential.
Return type
FLOAT64
Example
SELECT `carto-un`.carto.MORANS_I_QUADBIN(
[
STRUCT(5266443791927869439, 51.0),
STRUCT(5266443791928131583, 28.0),
STRUCT(5266443791928918015, 19.0)
],
3, 'exponential'
);
-- -0.29665713826808621
SELECT `carto-un`.carto.MORANS_I_QUADBIN(input_data, 3, 'exponential')
FROM (
SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
FROM mytable
)
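Moran's I is computed as I = (n / S₀) · Σᵢⱼ wᵢⱼ zᵢ zⱼ / Σᵢ zᵢ², where z are deviations from the mean and S₀ is the sum of the spatial weights. A minimal sketch with a user-supplied weights dictionary standing in for the kring-and-decay weighting of the SQL functions:

```python
def morans_i(values, weights):
    """Global Moran's I.

    values: list of cell values; weights: dict {(i, j): w_ij} of spatial
    weights between cell indices (i != j).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]            # deviations from the mean
    s0 = sum(weights.values())                # total weight
    num = sum(w * z[i] * z[j] for (i, j), w in weights.items())
    den = sum(zi * zi for zi in z)
    return (n / s0) * num / den
```

Clustered values give a positive I; alternating (dispersed) values give a negative I.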
LOCAL_MORANS_I_H3
LOCAL_MORANS_I_H3(input, size, decay, permutations)
Description
This function computes the local Moran's I spatial autocorrelation from the input array of H3 indexes. It outputs the H3 index, the local Moran's I spatial autocorrelation value, the simulated p-value (psim), the conditional randomization null expectation (EIc) and variance (VIc), the total randomization null expectation (EI) and variance (VI), and the quad (HH=1, LL=2, LH=3, HL=4).
input
:ARRAY<STRUCT<index STRING, value FLOAT64>>
input data with the indexes and values of the cells.
size
:INT64
size of the H3 kring (distance from the origin). This defines the area around each index cell where the distance decay will be applied.
decay
:STRING
decay function to compute the distance decay. Available functions are: uniform, inverse, inverse_square and exponential.
permutations
:INT64
number of permutations for the estimation of p-value.
Return type
ARRAY<STRUCT<index STRING, value FLOAT64, psim FLOAT64, EIc FLOAT64, VIc FLOAT64, EI FLOAT64, VI FLOAT64, quad INT64>>
Example
SELECT `carto-un`.carto.LOCAL_MORANS_I_H3(
[
STRUCT('89394460323ffff', 51.0),
STRUCT('8939446033bffff', 28.0),
STRUCT('8939446032bffff', 19.0)
],
3, 'exponential', 100
);
-- {
-- "index": "8939446032bffff",
-- "value": "-0.342921256629947",
-- "psim": "0.0099009900990099011",
-- "EIc": "-1.0287637698898404",
-- "VIc": "-1.0",
-- "EI": "0.0",
-- "VI": "-0.64721503525401447",
-- "quad": "3"
-- },
-- ...
SELECT `carto-un`.carto.LOCAL_MORANS_I_H3(
ARRAY(SELECT AS STRUCT index, value FROM mytable),
3, 'exponential', 100
)
LOCAL_MORANS_I_QUADBIN
LOCAL_MORANS_I_QUADBIN(input, size, decay, permutations)
Description
This function computes the local Moran's I spatial autocorrelation from the input array of Quadbin indexes. It outputs the Quadbin index, the local Moran's I spatial autocorrelation value, the simulated p-value (psim), the conditional randomization null expectation (EIc) and variance (VIc), the total randomization null expectation (EI) and variance (VI), and the quad (HH=1, LL=2, LH=3, HL=4).
input
:ARRAY<STRUCT<index INT64, value FLOAT64>>
input data with the indexes and values of the cells.
size
:INT64
size of the Quadbin k-ring (distance from the origin). This defines the area around each index cell where the distance decay will be applied.
decay
:STRING
decay function to compute the distance decay. Available functions are: uniform, inverse, inverse_square and exponential.
permutations
:INT64
number of permutations for the estimation of p-value.
Return type
ARRAY<STRUCT<index INT64, value FLOAT64, psim FLOAT64, EIc FLOAT64, VIc FLOAT64, EI FLOAT64, VI FLOAT64, quad INT64>>
Example
SELECT `carto-un`.carto.LOCAL_MORANS_I_QUADBIN(
[
STRUCT(5266443791927869439, 51.0),
STRUCT(5266443791928131583, 28.0),
STRUCT(5266443791928918015, 19.0)
],
3, 'exponential', 100
);
-- {
-- "index": "5266443791928918015",
-- "value": "-0.076228184845253524",
-- "psim": "0.0099009900990099011",
-- "EIc": "-0.70361240532717062",
-- "VIc": "-0.68393972058572117",
-- "EI": "0.29943435718277039",
-- "VI": "0.19089112237884748",
-- "quad": "3"
-- },
-- ...
SELECT `carto-un`.carto.LOCAL_MORANS_I_QUADBIN(
ARRAY(SELECT AS STRUCT index, value FROM mytable),
3, 'exponential', 100
);
VARIOGRAM
VARIOGRAM(input, n_bins, max_distance, model)
Description
This function computes the Variogram from the input array of points and their associated values.
It returns a STRUCT with the parameters of the variogram, the x values, the y values, the predicted y values and the number of values aggregated per bin.
input
:ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>
input array with the points and their associated values.
n_bins
:INT64
number of bins to compute the semivariance.
max_distance
:FLOAT64
maximum distance to compute the semivariance.
model
:STRING
type of model for fitting the semivariance. It can be either:
exponential: P0 * (1. - exp(-xi / (P1 / 3.0))) + P2
spherical: P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2
Return type
STRUCT<variogram_params ARRAY<FLOAT64>, x ARRAY<FLOAT64>, y ARRAY<FLOAT64>, yp ARRAY<FLOAT64>, count ARRAY<INT64>>
where:
variogram_params
: array containing the parameters [P0, P1, P2] fitted to the model.
x
: array with the x values used to fit the model.
y
: array with the y values used to fit the model.
yp
: array with the y values as predicted by the model.
count
: array with the number of elements aggregated in each bin.
Examples
DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;
-- generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT st_geogpoint(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
FROM
UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
ORDER BY RAND()
LIMIT 1000);
-- compute parameters of the variogram
SELECT `carto-un`.carto.VARIOGRAM(sample_points, 20, 1.0E5, 'exponential');
-- {
-- variogram_params: [1.8656766501394384, 9890263.713521793, -0.007675798653736552],
-- x: [13433.902872564133, 20772.802451664986, 56973.516169567, 67627.90034684369, 70363.43483710312, 78689.64706974, ...],
-- y: [0.005, 0.125, 3.125, 3.380, 2.0, 2.205, ...],
-- yp: [-0.14889750150153813, 0.49581158712413576, 2.351461086006329, 2.635658071286461, 2.696612846710653, 2.857216896041544, ...],
-- count: [162, 308, 328, 326, 312, 305, ...]
-- }
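The two semivariance models above can be evaluated directly from the fitted parameters. A sketch implementing exactly those formulas (the interpretation of P0, P1, P2 as sill/range/nugget is inferred from the formulas, not stated by this reference):

```python
import math

def variogram_model(model, x, p):
    """Evaluate the semivariance model at distance x, with p = [P0, P1, P2]."""
    p0, p1, p2 = p
    if model == 'exponential':
        return p0 * (1.0 - math.exp(-x / (p1 / 3.0))) + p2
    if model == 'spherical':
        return p1 * (1.5 * (x / p0) - 0.5 * (x / p0) ** 3) + p2
    raise ValueError(f'unknown model: {model}')
```

At x = 0 the exponential model reduces to P2 (the nugget); at x = P0 the spherical model reaches P1 + P2.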
ORDINARY_KRIGING_TABLE
ORDINARY_KRIGING_TABLE(input_table, interp_table, target_table, n_bins, max_distance, n_neighbors, model)
Description
This procedure uses Ordinary kriging to compute the interpolated values of a set of points stored in a table, given another set of points with known associated values.
input_table
: STRING
name of the table with the sample point locations and their values, stored in columns named point (type GEOGRAPHY) and value (type FLOAT64), respectively. It should be a qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>.
interp_table
: STRING
name of the table with the point locations whose values will be interpolated, stored in a column named point of type GEOGRAPHY. It should be a qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>.
target_table
: STRING
name of the output table where the result of the kriging will be stored. It should be a qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>. The process will fail if the target table already exists. If NULL, the result will be returned by the procedure and won't be persisted.
n_bins
: INT64
number of bins used to compute the semivariance.
max_distance
: FLOAT64
maximum distance at which to compute the semivariance.
n_neighbors
: INT64
maximum number of neighbors of a point to be taken into account for interpolation.
model
: STRING
type of model for fitting the semivariance. It can be either:
exponential: P0 * (1. - exp(-xi / (P1 / 3.0))) + P2
spherical: P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2
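The two model formulas translate directly into code. A small sketch (function names are illustrative) that also makes their limiting behavior easy to check: at xi = 0 both reduce to the nugget P2, and for large xi the exponential model approaches the sill P0 + P2.

```python
import math

def exponential_model(xi, p0, p1, p2):
    # P0 * (1. - exp(-xi / (P1 / 3.0))) + P2
    return p0 * (1.0 - math.exp(-xi / (p1 / 3.0))) + p2

def spherical_model(xi, p0, p1, p2):
    # P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2
    return p1 * (1.5 * (xi / p0) - 0.5 * (xi / p0) ** 3) + p2
```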
Example
CALL `carto-un`.carto.ORDINARY_KRIGING_TABLE(
'cartobq.docs.nasadem_jp_extract',
'cartobq.docs.interp_points',
NULL,
50,
1000,
20,
'exponential');
-- {"point": "POINT(142.4277 43.51606)", "value": "288.531297133198"},
-- {"point": "POINT(142.4181 43.50518)", "value": "306.62910397500843"},
-- {"point": "POINT(142.4175 43.5045)", "value": "306.9708080004128"},
-- {"point": "POINT(142.4121 43.49838)", "value": "328.37518451985943"},
-- {"point": "POINT(142.4172 43.50416)", "value": "307.1771955935104"},
-- ...
ORDINARY_KRIGING
ORDINARY_KRIGING(sample_points, interp_points, max_distance, variogram_params, n_neighbors, model)
Description
This function uses Ordinary kriging to compute the interpolated values of an array of points, given another array of points with known associated values and a variogram. This variogram may be computed with the VARIOGRAM function.
sample_points
: ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>
input array with the sample points and their values.
interp_points
: ARRAY<GEOGRAPHY>
input array with the points whose values will be interpolated.
max_distance
: FLOAT64
maximum distance at which to compute the semivariance.
variogram_params
: ARRAY<FLOAT64>
parameters [P0, P1, P2] of the variogram model.
n_neighbors
: INT64
maximum number of neighbors of a point to be taken into account for interpolation.
model
: STRING
type of model for fitting the semivariance. It can be either exponential or spherical, and it should be the same type of model as the one used to compute the variogram:
exponential: P0 * (1. - exp(-xi / (P1 / 3.0))) + P2
spherical: P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2
Return type
ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>
Examples
Here is a standalone example:
SELECT
`carto-un`.carto.ORDINARY_KRIGING(
[STRUCT(st_geogpoint(0.26,1.02) as point, 1.0 as value),
STRUCT(st_geogpoint(0.91,0.74) as point, 3.1 as value),
STRUCT(st_geogpoint(-0.59,0.51) as point, 1.5 as value),
STRUCT(st_geogpoint(0.86,0.92) as point, 3.6 as value),
STRUCT(st_geogpoint(0.37,1.07) as point, 1.1 as value),
STRUCT(st_geogpoint(0.69,-0.52) as point, 1.2 as value)],
[st_geogpoint(0.,0.),
st_geogpoint(0.,1.)],
1.0E5,
[0.1,1E8,0.1],
20,
'exponential')
-- {"point": "POINT(0 0)", "value": "1.357680916212768"},
-- {"point": "POINT(0 1)", "value": "1.07161192146499"}
Here is an example using the ORDINARY_KRIGING
function along with a VARIOGRAM
estimation:
DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;
DECLARE variogram_output STRUCT<params ARRAY<FLOAT64>, x ARRAY<FLOAT64>, y ARRAY<FLOAT64>, yp ARRAY<FLOAT64>, count ARRAY<INT64>>;
DECLARE interp_points ARRAY<GEOGRAPHY>;
-- Generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT st_geogpoint(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
FROM
UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
ORDER BY RAND()
LIMIT 1000);
-- Compute parameters of the variogram
SET variogram_output = `carto-un`.carto.VARIOGRAM(sample_points, 20, 1.0E5, 'spherical');
-- Generate the points to be interpolated
SET interp_points = ARRAY(SELECT st_geogpoint(lon_sqrt,lat_sqrt) point
FROM
UNNEST(GENERATE_ARRAY(-5,5,0.25)) lon_sqrt,
UNNEST(GENERATE_ARRAY(-5,5,0.25)) lat_sqrt
);
-- Calculate interpolated values
SELECT
point, value
FROM
UNNEST(`carto-un`.carto.ORDINARY_KRIGING(
sample_points,
interp_points,
1.0E5,
variogram_output.params,
20,
'spherical')) WITH OFFSET pos
ORDER BY pos
-- {"point": POINT(-5 -5), "value": 0.568294714734378},
-- {"point": POINT(-5 -4.75), "value": 0.8303238799265198},
-- {"point": POINT(-5 -4.5), "value": 0.8876712348264676},
-- {"point": POINT(-5 -4.25), "value": 0.7437099678173889},
-- {"point": POINT(-5 -4), "value": 0.5543380644791405},
-- {"point": POINT(-5 -3.75), "value": 0.45182050244159944}
-- ...
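Conceptually, ordinary kriging solves one small linear system per interpolation point: pairwise semivariances between the sample points on the left-hand side, bordered with a row of ones and a Lagrange multiplier so the weights sum to 1. The sketch below illustrates that system in pure Python with a toy variogram; it is not the procedure's actual implementation, and it works in planar coordinates rather than on GEOGRAPHY:

```python
import math

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging_point(samples, target, gamma):
    """samples: [((x, y), value)]; gamma: variogram function of distance."""
    n = len(samples)
    # (n+1) x (n+1) system: gamma matrix bordered by ones for the
    # sum-to-one constraint, plus a Lagrange multiplier
    a = [[gamma(math.dist(samples[i][0], samples[j][0])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(math.dist(samples[i][0], target)) for i in range(n)] + [1.0]
    w = solve(a, b)[:n]    # drop the Lagrange multiplier
    return sum(wi * v for wi, (_, v) in zip(w, samples))
```

Because the weights sum to one, a constant field is reproduced exactly, and with a zero-nugget variogram the estimate honors the sample values at their own locations.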
IDW
IDW(inputsample, origin, maxdistance, n_neighbors, p)
Description
This function performs Inverse Distance Weighted (IDW) interpolation. The method uses the values of the input samples to interpolate the values at the derived locations. The user can select the number of neighbors used for the interpolation, the maximum distance between points and neighbors, and the power factor p for the weights.
inputsample
: ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>
input array with the sample points and their values.
origin
: ARRAY<GEOGRAPHY>
input array with the points whose values will be interpolated.
maxdistance
: FLOAT64
maximum distance between the point to interpolate and the sampling points.
n_neighbors
: INT64
maximum number of sampling points to be considered for the interpolation.
p
: FLOAT64
power of the distance.
Return type
ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>
Example
SELECT
point, value
FROM
UNNEST(`carto-un`.carto.IDW(
ARRAY(SELECT AS STRUCT point, value FROM `cartobq.docs.kriging_sample_points`),
ARRAY(SELECT point FROM `cartobq.docs.kriging_interp_points`),
1.0E5,
20,
2)) WITH OFFSET pos
ORDER BY pos
-- TODO
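The weighting scheme is simple enough to sketch in a few lines of Python (planar coordinates and a made-up helper name; the BigQuery function operates on GEOGRAPHY and great-circle distances):

```python
import math

def idw(samples, target, max_distance, n_neighbors, p):
    """Inverse Distance Weighted estimate at `target`.
    samples: list of ((x, y), value) pairs in planar coordinates."""
    # keep only samples within max_distance, then the n_neighbors closest
    near = sorted(
        ((math.dist(pt, target), v) for pt, v in samples),
        key=lambda dv: dv[0],
    )
    near = [dv for dv in near if dv[0] <= max_distance][:n_neighbors]
    if not near:
        return None
    # a sample at zero distance determines the value exactly
    for d, v in near:
        if d == 0.0:
            return v
    weights = [1.0 / d ** p for d, _ in near]
    return sum(w * v for w, (_, v) in zip(weights, near)) / sum(weights)
```

With p = 2, a point equidistant from two samples gets their simple average; larger p makes the estimate more local.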
SMOOTHING_MRF_H3
SMOOTHING_MRF_H3(input, output, index_column, variable_column, options)
Description
This procedure computes a Markov Random Field (MRF) smoothing for a table containing H3 cell indexes and their associated values.
This implementation is based on the work of Christopher J. Paciorek: "Spatial models for point and areal data using Markov random fields on a fine grid." Electron. J. Statist. 7 946 - 972, 2013. https://doi.org/10.1214/13-EJS791
input
: STRING
name of the source table. It should be a fully qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>.
output
: STRING
name of the output table. It should be a fully qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>. The process will fail if the table already exists. If NULL, the result will be returned directly by the procedure and not persisted.
index_column
: STRING
name of the column containing the cell ids.
variable_column
: STRING
name of the target variable column.
options
: STRING
JSON string to overwrite the model's default options. If set to NULL or empty, the default values are used:
closing_distance
: INT64
distance of closing. It defaults to 0. If strictly positive, the algorithm performs a morphological closing on the cells by the closing_distance, defined in number of cells, before performing the smoothing. No closing is performed otherwise.
output_closing_cell
: BOOL
controls whether the cells generated by the closing are added to the output. It defaults to FALSE.
lambda
: FLOAT64
iteration update factor. It defaults to 1.6. For more details, see https://doi.org/10.1214/13-EJS791, page 963.
iter
: INT64
number of iterative queries used to perform the smoothing. It defaults to 10. Increasing this parameter might help if the convergence_limit is not reached by the end of the procedure's execution. Tip: if this limit has been reached, the status of the second-to-last step of the procedure will throw an error.
intra_iter
: INT64
number of iterations per query. It defaults to 50. Reducing this parameter might help if a resource error is reached during the procedure's execution.
convergence_limit
: FLOAT64
threshold condition to stop the iterations. If this threshold is not reached, the procedure will finish its execution after the maximum number of iterations (iter) is reached. It defaults to 10e-5. For more details, see https://doi.org/10.1214/13-EJS791, page 963.
Return type
FLOAT64
Example
CALL `carto-un`.carto.SMOOTHING_MRF_H3( "cartobq.docs.airbnb_berlin_h3_qk",
NULL,
'h3_z7',
'price',
'{"closing_distance":0, "output_closing_cell":"true", "lambda":1.6, "iter":10, "intra_iter":5, "convergence_limit":10e-5}');
-- {"id": 871f18840ffffff, "beta": 64.56696796809489}
-- {"id": 871f18841ffffff, "beta": 62.61498241759014}
-- {"id": 871f18844ffffff, "beta": 65.47069449331353}
SMOOTHING_MRF_QUADBIN
SMOOTHING_MRF_QUADBIN(input, output, index_column, variable_column, options)
Description
This procedure computes a Markov Random Field (MRF) smoothing for a table containing QUADBIN cell indexes and their associated values.
This implementation is based on the work of Christopher J. Paciorek: "Spatial models for point and areal data using Markov random fields on a fine grid." Electron. J. Statist. 7 946 - 972, 2013. https://doi.org/10.1214/13-EJS791
input
: STRING
name of the source table. It should be a fully qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>.
output
: STRING
name of the output table. It should be a fully qualified table name including project and dataset: <project-id>.<dataset-id>.<table-name>. The process will fail if the table already exists.
index_column
: STRING
name of the column containing the cell ids.
variable_column
: STRING
name of the target variable column.
options
: STRING
JSON string to overwrite the model's default options. If set to NULL or empty, the default values are used:
closing_distance
: INT64
distance of closing. It defaults to 0. If strictly positive, the algorithm performs a morphological closing on the cells by the closing_distance, defined in number of cells, before performing the smoothing. No closing is performed otherwise.
output_closing_cell
: BOOL
controls whether the cells generated by the closing are added to the output. It defaults to FALSE.
lambda
: FLOAT64
iteration update factor. It defaults to 1.6. For more details, see https://doi.org/10.1214/13-EJS791, page 963.
iter
: INT64
number of iterative queries used to perform the smoothing. It defaults to 10. Increasing this parameter might help if the convergence_limit is not reached by the end of the procedure's execution. Tip: if this limit has been reached, the status of the second-to-last step of the procedure will throw an error.
intra_iter
: INT64
number of iterations per query. It defaults to 50. Reducing this parameter might help if a resource error is reached during the procedure's execution.
convergence_limit
: FLOAT64
threshold condition to stop the iterations. If this threshold is not reached, the procedure will finish its execution after the maximum number of iterations (iter) is reached. It defaults to 10e-5. For more details, see https://doi.org/10.1214/13-EJS791, page 963.
Return type
FLOAT64
Example
CALL `carto-un`.carto.SMOOTHING_MRF_QUADBIN(
'cartobq.docs.airbnb_berlin_h3_qk_qb',
'my-project.my-dataset.my-smoothing-table',
'qb_z11',
'price',
'''{
"closing_distance": 0,
"output_closing_cell": "true",
"lambda": 1.6,
"iter": 10,
"intra_iter": 5,
"convergence_limit": 10e-5
}'''
);
-- The table `my-project.my-dataset.my-smoothing-table` will be created
-- with columns: id, price_smoothed
AREA_OF_APPLICABILITY
AREA_OF_APPLICABILITY(source_input_query, candidate_input_query, shap_query, index_column, output_prefix, options)
Description
This procedure computes the Area of Applicability (AOA) of a BigQuery ML model. It generates a metric that tells the user where the results of a Machine Learning (ML) model can be trusted when the predictions are extrapolated outside the training space (i.e. where the estimated cross-validation performance holds). Adding a method to compute this metric is particularly useful for non-linear models.
This implementation is based on Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12, 1620–1633.
Given the SHAP values of a trained model, the procedure computes a Dissimilarity Index (DI) for each new data point used for prediction, defined as the multivariate distance between the model covariates for that point and the nearest training data point. To identify the new points that lie in the model's AOA, the DI is compared against a threshold obtained as the (outlier-removed) maximum DI of the training data derived via cross-validation: for each training data point, the DI is computed as the distance to the nearest training data point that is not in the same (spatial) cross-validation fold, relative to the average of all pairwise distances between training data points. Alternatively, the user can input a user-defined threshold. To compute the DI, two distance metrics are available:
Euclidean distance: the distance between two data points is computed as the sum over all predictors of the weighted square differences between the standardized value of each predictor variable, where the weight is derived from the model SHAP table. This distance should be used only when all predictor variables are numerical, for which squared differences are well defined.
Gower distance: the distance between two data points is computed as the sum over all predictors of the weighted and normalized absolute differences for numerical (continuous and discrete) predictors and the indicator function (0 if equal, 1 otherwise) for categorical/ordinal predictors. This distance can be used for numerical only, categorical only or mixed-type data and is normalized between 0 and 1, with 0 indicating that two points are the same.
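A sketch of the two per-pair distances described above, assuming the numerical predictors have already been standardized (Euclidean) or that their ranges are known (Gower); the function names and exact weighting are illustrative, not the procedure's internals:

```python
def weighted_euclidean(a, b, weights):
    """a, b: lists of standardized numerical predictor values."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

def gower(a, b, weights, ranges, categorical):
    """Weighted Gower distance, normalized to [0, 1].
    ranges[i]: value range of numerical predictor i (unused for categorical).
    categorical[i]: True if predictor i is categorical/ordinal."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if categorical[i]:
            d = 0.0 if x == y else 1.0    # indicator for categorical predictors
        else:
            d = abs(x - y) / ranges[i]    # normalized absolute difference
        total += weights[i] * d
    return total / sum(weights)
```

Identical points get a Gower distance of 0, and maximally different points (full range on every numerical predictor, different categories everywhere) get 1.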
The cross-validation folds for the training data can be obtained using a custom index ("CUSTOM_KFOLD"), a random cross-validation strategy ("RANDOM_KFOLD"), or environmental blocking ("ENV_BLOCKING_KFOLD").
Finally, rows with a NULL value in any of the model predictors are dropped.
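The "outlier-removed maximum" threshold can be sketched with Tukey's fences, which the outliers_scale_factor option is modeled after. Assuming the training-fold DI values have already been computed, a hypothetical helper might look like:

```python
def aoa_threshold(di_values, scale_factor=1.5):
    """Largest training DI that is not an upper outlier by Tukey's fences."""
    s = sorted(di_values)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])   # linear interpolation

    q1, q3 = quantile(0.25), quantile(0.75)
    fence = q3 + scale_factor * (q3 - q1)             # upper Tukey fence
    return max(v for v in s if v <= fence)
```

Candidate points whose DI falls below this threshold are flagged as inside the AOA.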
Input parameters
source_input_query
: STRING
the query to the data used to train the model. It must contain all the model predictor columns.
candidate_input_query
: STRING
query to provide the data over which the domain of applicability is estimated. It must contain all the model predictor columns.
shap_query
: STRING
the query to the model SHAP table with the feature importance of the model. For example, for the BUILD_REVENUE_MODEL these values are stored in a table with suffix _model_shap. When the model is trained with BigQuery ML, the feature importance of the model predictors can also be found on the model Interpretability tab.
index_column
: STRING
the name of the unique geographic identifier.
output_prefix
: STRING
destination and prefix for the output table. It must contain the project, dataset and prefix: <project>.<dataset>.<prefix>.
options
: STRING
containing a valid JSON with the different options. Valid options are described in the table below. If options is set to NULL then all options are set to default.

| Option | Description |
|---|---|
| threshold_method | STRING, default "RANDOM_KFOLD". Method used for calculating the threshold applied to the dissimilarity index of the candidate set in order to identify the area of applicability. Possible options are: "USER_DEFINED_THRESHOLD" uses a user-defined threshold to derive the AOA, provided via the threshold value; "CUSTOM_KFOLD" uses a customized k-fold index, with the threshold based on the cross-validation folds stored in the kfold_index_column in the source_input_query data; "RANDOM_KFOLD" uses a random k-fold index, with the threshold based on the cross-validation folds derived from a random k-fold strategy with the number of folds specified by the user in the nfolds parameter; "ENV_BLOCKING_KFOLD" uses an environmental blocking k-fold index, with the threshold based on the cross-validation folds derived from an environmental blocking strategy. This method can only be used when all predictors are numerical, otherwise an error is raised. |
| threshold | FLOAT64. The user-defined threshold when the "USER_DEFINED_THRESHOLD" threshold method is used. The threshold should be defined in the [0,1] interval. |
| kfold_index_column | STRING. Name of the cross-validation fold column. If threshold_method is set to "CUSTOM_KFOLD", the user needs to pass this parameter, otherwise an error is raised. If threshold_method is set to "RANDOM_KFOLD" or "ENV_BLOCKING_KFOLD", this parameter is optional. |
| distance_type | STRING, default "GOWER". The distance used to compute the dissimilarity index. Possible options are GOWER for the Gower distance and EUCLIDEAN for the Euclidean distance. When working with mixed data types the user can only use the Gower distance, otherwise an error is raised. |
| outliers_scale_factor | FLOAT64. The scale factor used to define the threshold when threshold_method is set to "CUSTOM_KFOLD", "RANDOM_KFOLD", or "ENV_BLOCKING_KFOLD". Analogous to Tukey's fences k parameter for outlier detection. |
| pca_explained_variance_ratio | FLOAT64. The proportion of explained variance retained in the PCA analysis. Only values in the (0,1] range are allowed. |
| nfolds | INT64. The number of k-folds when threshold_method is set to "RANDOM_KFOLD" or "ENV_BLOCKING_KFOLD". Cannot be NULL if threshold_method = "RANDOM_KFOLD". If threshold_method = "ENV_BLOCKING_KFOLD" and nfolds is NULL, nfolds_min and nfolds_max must be specified and the optimal number of folds is computed by deriving the clusters for a number of folds between nfolds_min and nfolds_max and choosing the number of folds (clusters) that minimizes the Calinski-Harabasz Index. If not NULL, nfolds should be at least 1. |
| nfolds_min | INT64. The minimum number of environmental folds (clusters) if nfolds is set to NULL, otherwise it is ignored. |
| nfolds_max | INT64. The maximum number of environmental folds (clusters) if nfolds is set to NULL, otherwise it is ignored. |
| normalize_dissimilarity_index | BOOLEAN. If TRUE the dissimilarity index is normalized between 0 and 1. If threshold_method is set to "USER_DEFINED_THRESHOLD" this parameter must be set to TRUE. |
| return_source_dataset | BOOLEAN. If TRUE the dissimilarity index for the source model data is also returned. If threshold_method is set to "USER_DEFINED_THRESHOLD" this parameter must be set to FALSE. |

The different options for each threshold_method are explained in the table below.
| threshold_method | USER_DEFINED_THRESHOLD | CUSTOM_KFOLD | RANDOM_KFOLD | ENV_BLOCKING_KFOLD | Default value |
|---|---|---|---|---|---|
| threshold | Mandatory | Ignored | Ignored | Ignored | "RANDOM_KFOLD" |
| kfold_index_column | Ignored | Mandatory | Optional | Optional | "kfold_index" |
| distance_type | Optional | Optional | Optional | Optional | "GOWER" |
| outliers_scale_factor | Ignored | Optional | Optional | Optional | 1.5 |
| pca_explained_variance_ratio | Ignored | Ignored | Ignored | Optional | 0.9 |
| nfolds | Ignored | Ignored | Mandatory | Optional if nfolds_min and nfolds_max are defined | NULL if OPTIONS is not NULL, 4 otherwise |
| nfolds_min | Ignored | Ignored | Ignored | Optional if nfolds is defined; mandatory if nfolds_max is defined | NULL |
| nfolds_max | Ignored | Ignored | Ignored | Optional if nfolds is defined; mandatory if nfolds_min is defined | NULL |
| normalize_dissimilarity_index | Optional | Optional | Optional | Optional | TRUE |
| return_source_dataset | Optional | Optional | Optional | Optional | FALSE |
Output
The output table with the following columns:
index_column
: STRING
the unique geographic identifier. The data type of the index_column in the source_input_query and in the candidate_input_query is cast to STRING to take into account potential differences between the data types in the two queries.
is_source
: BOOLEAN
TRUE if a data point is in the source model data and FALSE otherwise (only returned if return_source_dataset is TRUE).
kfold_index_column
: STRING
the cross-validation fold index for each data point in the source model dataset (only returned if return_source_dataset is TRUE). The name of the column is given by the kfold_index_column parameter in the OPTIONS section.
dissimilarity_index
: FLOAT64
the dissimilarity index.
dissimilarity_index_threshold
: FLOAT64
the dissimilarity index threshold used to define the Area of Applicability (AOA): data points in the candidate set whose dissimilarity index is smaller than this threshold belong to the area of applicability.
is_in_area_of_applicability
: BOOLEAN
TRUE if a data point in the candidate set (is_source = FALSE) is in the AOA and FALSE otherwise. For data points in the source set (is_source = TRUE) this is set to NULL.
Examples
Let's start by setting the OPTIONS to NULL. In this case the threshold is computed from a random k-fold strategy (threshold_method = "RANDOM_KFOLD") with nfolds = 4:
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data` WHERE revenue_avg IS NOT NULL',
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data`',
'SELECT * FROM `cartobq.docs.aoa_revenue_model_shap`',
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
NULL
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
With USER_DEFINED_THRESHOLD, the threshold is provided by the user:
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data` WHERE revenue_avg IS NOT NULL',
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data`',
'SELECT * FROM `cartobq.docs.aoa_revenue_model_shap`',
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
'''
{
"threshold_method":"USER_DEFINED_THRESHOLD",
"threshold":0.50
}
'''
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
With CUSTOM_KFOLD, the threshold is based on the cross-validation folds stored in the kfold_index_column in the source_input_query data:
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data` WHERE revenue_avg IS NOT NULL',
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data`',
'SELECT * FROM `cartobq.docs.aoa_revenue_model_shap`',
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
'''
{
"threshold_method":"CUSTOM_KFOLD",
"kfold_index_column":"kfold_index",
"return_source_dataset":true,
"normalize_dissimilarity_index":true
}
'''
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
With RANDOM_KFOLD, the threshold is based on the cross-validation folds derived from a random k-fold strategy with the number of folds specified by the user in the nfolds parameter:
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data` WHERE revenue_avg IS NOT NULL',
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data`',
'SELECT * FROM `cartobq.docs.aoa_revenue_model_shap`',
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
'''
{
"threshold_method":"RANDOM_KFOLD",
"nfolds":6,
"return_source_dataset":true,
"normalize_dissimilarity_index":true
}
'''
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
With ENV_BLOCKING_KFOLD, the threshold is based on the cross-validation folds derived from an environmental blocking strategy:
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data` WHERE revenue_avg IS NOT NULL',
'SELECT * EXCEPT(revenue_avg) FROM `cartobq.docs.aoa_revenue_model_data`',
'SELECT * FROM `cartobq.docs.aoa_revenue_model_shap`',
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
'''
{
"threshold_method":"ENV_BLOCKING_KFOLD",
"pca_explained_variance_ratio":0.9,
"nfolds_min":3,
"nfolds_max":6,
"return_source_dataset":true,
"normalize_dissimilarity_index":true
}
'''
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
ENV_BLOCKING
ENV_BLOCKING(input_query, predictors, index_column, output_prefix, options)
Description
This procedure derives cross validation (CV) folds based on environmental blocking.
This procedure uses multivariate methods (Principal Component Analysis + K-means clustering) to specify sets of similar conditions based on the input covariates. It should be used to overcome the issue of overfitting due to non-causal predictors: in this case, the spatial structure in the data may be explained by the model through some other non-causal covariate that correlates with the spatial structure. The resulting model predictions may perform well while the correlation structure between the non-causal and the "true" predictors (i.e. the underlying structures) remains unchanged, but they could completely fail when predicting in novel situations (extrapolation).
The method performs a Principal Component Analysis (PCA) on the standardized data and then applies K-means clustering to cluster the data. The cluster number is then used to assign to each data point a corresponding CV fold. If the optimal number of folds is not provided, this is obtained by choosing the number of clusters that minimizes the Calinski-Harabasz Index.
Since this method transforms the data using PCA and then applies K-means clustering, it should only be applied when all predictors are continuous, for which squared differences are well defined. While PCA/K-means are designed for continuous data, they might be applied to discrete variables (e.g. count data), although in this case the results might exhibit some artifacts (PCA) and will not map back to the data (K-means).
When running PCA/K-means in BigQuery, categorical/ordinal variables are one-hot encoded, so while in principle this method would run even with all categorical/ordinal predictors, the results won't necessarily be meaningful, for two reasons:
Sparsity and the curse of dimensionality: by one-hot encoding categorical/ordinal variables (especially with high cardinality) the data dimension grows quickly and the data becomes very sparse, which affects the computation of the distance (everything is as similar as everything else), especially when the Euclidean distance is used.
Even with one-hot encoded categorical/ordinal predictors, both PCA and K-means minimize a squared loss function, which may be inappropriate for data that is not real-valued.
Finally, rows with a NULL value in any of the model predictors are dropped.
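The fold-count selection relies on the Calinski-Harabasz Index. A compact, self-contained implementation of the index itself (the PCA and K-means steps are omitted; this is illustrative, not the procedure's code):

```python
def calinski_harabasz(points, labels):
    """CH index = (between-cluster dispersion / (k-1)) / (within / (n-k)).
    points: list of equal-length feature vectors; labels: cluster id per point."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((c - g) ** 2 for c, g in zip(centroid, grand))
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    return (between / (k - 1)) / (within / (n - k))
```

Tight, well-separated clusterings score much higher than clusterings that mix the groups, which is what makes the index usable for picking the number of folds.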
Input parameters
input_query
: STRING
the input query. It must contain all the model predictor columns.
predictors
: ARRAY<STRING>
the names of the (numeric) predictors.
index_column
: STRING
the name of the column with the unique geographic identifier.
output_prefix
: STRING
destination and prefix for the output table. It must contain the project, dataset and prefix: <project>.<dataset>.<prefix>.
options
: STRING
containing a valid JSON with the different options. Valid options are described in the table below. If options is set to NULL then all options are set to default.

| Option | Description |
|---|---|
| pca_explained_variance_ratio | FLOAT64. The proportion of explained variance retained in the PCA analysis (default: 0.9). Only values in the (0,1] range are allowed. |
| nfolds | INT64. The number of folds (clusters). If NULL, the optimal number of folds is computed by deriving the clusters for a number of folds between nfolds_min and nfolds_max and choosing the number of folds (clusters) that minimizes the Calinski-Harabasz Index. If not NULL, nfolds should be at least 1. |
| nfolds_min | INT64. The minimum number of environmental folds (clusters) if nfolds is set to NULL, otherwise it is ignored. If not NULL, nfolds_min should be at least 1 and nfolds_max should also be specified. |
| nfolds_max | INT64. The maximum number of environmental folds (clusters) if nfolds is set to NULL, otherwise it is ignored. If not NULL, nfolds_max should always be larger than nfolds_min. |
| kfold_index_column | STRING. The name of the cross-validation fold column. If NULL this parameter is set to 'k_fold_index'. |
Output
The output table with the following columns:
index_column
: STRING
the unique geographic identifier. Its type will depend on the type of this column in the input_query.
kfold_index_column
: INT64
the cross-validation fold column.
Examples
With nfolds
specified:
CALL `carto-un`.carto.ENV_BLOCKING(
'SELECT * FROM `cartobq.docs.env_blocking_input`',
['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
'''{
"pca_explained_variance_ratio":0.9,
"nfolds":4,
"kfold_index_column":"k_fold_index"
}
'''
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
With nfolds_min
and nfolds_max
specified:
CALL `carto-un`.carto.ENV_BLOCKING(
'SELECT * FROM `cartobq.docs.env_blocking_input`',
['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
'geoid',
'<my-project>.<my-dataset>.<output-prefix>',
'''{
"pca_explained_variance_ratio":0.9,
"nfolds_min":3,
"nfolds_max":6,
"kfold_index_column":"k_fold_index"
}
'''
);
-- Table `<my-project>.<my-dataset>.<output-prefix>` will be created
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 960401.