# statistics

This module contains functions to perform spatial statistics calculations.

## P\_VALUE <a href="#p_value" id="p_value"></a>

```sql
P_VALUE(z_score)
```

**Description**

This function computes the p-value (two-tailed test) of a given z-score, assuming the population follows a normal distribution with mean 0 and standard deviation 1. The [z-score](https://en.wikipedia.org/wiki/Standard_score) measures how many standard deviations above or below the population mean a value lies. The [p-value](https://en.wikipedia.org/wiki/P-value) is the probability that a randomly sampled point has a value at least as extreme as the point whose z-score is being tested.

**Input parameters**

* `z_score`: `FLOAT64`

**Return type**

`FLOAT64`

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.P_VALUE(u) AS p_value
    FROM UNNEST([-2,-1,0,1,2]) u;

-- [ 0.04550012577451279, 0.31731052766472745, 0.999999999, 0.31731052766472745, 0.04550012577451279 ]
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.P_VALUE(u) AS p_value
    FROM UNNEST([-2,-1,0,1,2]) u;

-- [ 0.04550012577451279, 0.31731052766472745, 0.999999999, 0.31731052766472745, 0.04550012577451279 ]
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.P_VALUE(u) AS p_value
    FROM UNNEST([-2,-1,0,1,2]) u;

-- [ 0.04550012577451279, 0.31731052766472745, 0.999999999, 0.31731052766472745, 0.04550012577451279 ]
```

{% endtab %}
{% endtabs %}
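For intuition, the same two-tailed p-value can be reproduced outside BigQuery with the complementary error function. This is an illustrative sketch, not the module's implementation, so the last digits may differ slightly from the example output above:

```python
import math

def p_value(z_score: float) -> float:
    """Two-tailed p-value of a z-score under the standard normal N(0, 1)."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z_score) / math.sqrt(2.0))

# Same inputs as the example query: z in [-2, -1, 0, 1, 2]
p_values = [p_value(z) for z in (-2, -1, 0, 1, 2)]
```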

## KNN\_TABLE <a href="#knn_table" id="knn_table"></a>

```sql
KNN_TABLE(input, output_table, geoid_col, geo_col, k)
```

**Description**

Given a set of points, this procedure computes, for each one, its [*k-nearest neighbors*](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

**Input parameters**

* `input`: `STRING` the query to the data used to compute the KNN. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `geoid_col`: `STRING` name of the column with unique ids.
* `geo_col`: `STRING` name of the column with the geometries.
* `k`: `INT64` number of nearest neighbors (positive, typically small).

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `geo`: `GEOGRAPHY` the geometry of the considered point.
* `geo_knn`: `GEOGRAPHY` the k-nearest neighbor point.
* `geoid`: `STRING` the unique identifier of the considered point.
* `geoid_knn`: `STRING` the unique identifier of the k-nearest neighbor.
* `distance`: `FLOAT64` the k-nearest neighbor distance to the considered point.
* `knn`: `INT64` the neighbor order, from 1 (nearest) up to `k`.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KNN_TABLE(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'id',
    'geo',
    10
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KNN_TABLE(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'id',
    'geo',
    10
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KNN_TABLE(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'id',
    'geo',
    10
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## KNN <a href="#knn" id="knn"></a>

```sql
KNN(points, k)
```

**Description**

Given a set of points, this function computes, for each one, its [*k-nearest neighbors*](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

**Input parameters**

* `points`: `ARRAY<STRUCT<geoid STRING, geo GEOGRAPHY>>` input data with unique id and geography.
* `k`: `INT64` number of nearest neighbors (positive, typically small).

**Return type**

`ARRAY<STRUCT<geo GEOGRAPHY, geo_knn GEOGRAPHY, geoid STRING, geoid_knn STRING, distance FLOAT64, knn INT64>>`

where:

* `geo`: the geometry of the considered point.
* `geo_knn`: the k-nearest neighbor point.
* `geoid`: the unique identifier of the considered point.
* `geoid_knn`: the unique identifier of the k-nearest neighbor.
* `distance`: the k-nearest neighbor distance to the considered point.
* `knn`: the neighbor order, from 1 (nearest) up to `k`.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT *
FROM UNNEST((
    SELECT `carto-un`.carto.KNN(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.8225 41.97117)",
--  "geoid": "00000001",
--  "geoid_knn": "00000624",
--  "distance": "10.804663098937658",
--  "knn": "1"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.823 41.9712)",
--  "geoid": "00000001",
--  "geoid_knn": "00000666",
--  "distance": "30.66917920746894",
--  "knn": "2"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.82298 41.9713)",
--  "geoid": "00000001",
--  "geoid_knn": "00000618",
--  "distance": "31.863463704968353",
--  "knn": "3"
--},
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT *
FROM UNNEST((
    SELECT `carto-un-eu`.carto.KNN(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.8225 41.97117)",
--  "geoid": "00000001",
--  "geoid_knn": "00000624",
--  "distance": "10.804663098937658",
--  "knn": "1"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.823 41.9712)",
--  "geoid": "00000001",
--  "geoid_knn": "00000666",
--  "distance": "30.66917920746894",
--  "knn": "2"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.82298 41.9713)",
--  "geoid": "00000001",
--  "geoid_knn": "00000618",
--  "distance": "31.863463704968353",
--  "knn": "3"
--},
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT *
FROM UNNEST((
    SELECT carto.KNN(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.8225 41.97117)",
--  "geoid": "00000001",
--  "geoid_knn": "00000624",
--  "distance": "10.804663098937658",
--  "knn": "1"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.823 41.9712)",
--  "geoid": "00000001",
--  "geoid_knn": "00000666",
--  "distance": "30.66917920746894",
--  "knn": "2"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.82298 41.9713)",
--  "geoid": "00000001",
--  "geoid_knn": "00000618",
--  "distance": "31.863463704968353",
--  "knn": "3"
--},
-- ...
```

{% endtab %}
{% endtabs %}
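Under the hood this is a brute-force nearest-neighbor search over all point pairs. A minimal planar sketch (using Euclidean distance, whereas the BigQuery function measures geodesic distance between `GEOGRAPHY` points):

```python
from math import hypot

def knn(points, k):
    """Brute-force KNN over a list of (geoid, (x, y)) tuples.

    Distances here are planar Euclidean; the BigQuery function measures
    geodesic distance between GEOGRAPHY points instead.
    """
    result = []
    for geoid, (x, y) in points:
        # Distance from this point to every other point.
        dists = sorted(
            (hypot(x - ox, y - oy), other_id)
            for other_id, (ox, oy) in points
            if other_id != geoid
        )
        # Keep the k closest, tagged with their neighbor order (1 = nearest).
        for order, (d, other_id) in enumerate(dists[:k], start=1):
            result.append({"geoid": geoid, "geoid_knn": other_id,
                           "distance": d, "knn": order})
    return result

pts = [("a", (0, 0)), ("b", (1, 0)), ("c", (3, 0))]
rows = knn(pts, 2)
```

Each output row mirrors one struct of the return type: the considered point, one neighbor, their distance, and the neighbor order.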

## LOF\_TABLE <a href="#lof_table" id="lof_table"></a>

```sql
LOF_TABLE(src_fullname STRING, target_fullname STRING, geoid_column_name STRING, geo_column_name STRING, lof_target_column_name STRING, k INT64)
```

**Description**

This procedure computes the [Local Outlier Factor](https://en.wikipedia.org/wiki/Local_outlier_factor) for each point in a specified column and stores the result in an output table, along with the other input columns.

**Input parameters**

* `src_fullname`: `STRING` name of the input table, e.g. `<my-project>.<my-dataset>.<my-table>`. The `project-id` can be omitted (in which case the default one will be used).
* `target_fullname`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The `project-id` can be omitted. The dataset must exist and the caller needs to have permissions to create a new table in it. The process will fail if the target table already exists.
* `geoid_column_name`: `STRING` The column name with a unique identifier for each point.
* `geo_column_name`: `STRING` The column name containing the points.
* `lof_target_column_name`: `STRING` The column name where the resulting Local Outlier Factor will be stored in the output table.
* `k`: `INT64` Number of nearest neighbors (positive, typically small).

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `geo`: `GEOGRAPHY` the geometry of the considered point.
* `geoid`: `STRING` the unique identifier of the considered point.
* `lof`: `FLOAT64` the Local Outlier Factor score.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.LOF_TABLE(
  'bigquery-public-data.new_york_subway.stations',
  '<my-project>.<my-dataset>.<my-output-table>',
  'station_id',
  'station_geom',
  'lof',
  10
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
-- with an extra column containing the `lof` value.
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.LOF_TABLE(
  'bigquery-public-data.new_york_subway.stations',
  '<my-project>.<my-dataset>.<my-output-table>',
  'station_id',
  'station_geom',
  'lof',
  10
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
-- with an extra column containing the `lof` value.
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.LOF_TABLE(
  'bigquery-public-data.new_york_subway.stations',
  '<my-project>.<my-dataset>.<my-output-table>',
  'station_id',
  'station_geom',
  'lof',
  10
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
-- with an extra column containing the `lof` value.
```

{% endtab %}
{% endtabs %}

## LOF <a href="#lof" id="lof"></a>

```sql
LOF(points, k)
```

**Description**

This function computes the [Local Outlier Factor](https://en.wikipedia.org/wiki/Local_outlier_factor) of each point of a given set of points.

**Input parameters**

* `points`: `ARRAY<STRUCT<geoid STRING, geo GEOGRAPHY>>` input data points with unique id and geography.
* `k`: `INT64` number of nearest neighbors (positive, typically small).

**Return type**

`ARRAY<STRUCT<geo GEOGRAPHY, geoid STRING, lof FLOAT64>>`

where:

* `geo`: the geometry of the considered point.
* `geoid`: the unique identifier of the considered point.
* `lof`: the Local Outlier Factor score.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT *
FROM UNNEST((
    SELECT `carto-un`.carto.LOF(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
-- {"geo": POINT(2.82263 41.97118), "geoid": "00000001", "lof": 1.3217599116891428}
-- {"geo": POINT(2.35705 41.49786), "geoid": "00000002", "lof": 1.235551000737416}
-- {"geo": POINT(2.13967 41.3838), "geoid": "00000003", "lof": 1.1305674032876687}
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT *
FROM UNNEST((
    SELECT `carto-un-eu`.carto.LOF(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
-- {"geo": POINT(2.82263 41.97118), "geoid": "00000001", "lof": 1.3217599116891428}
-- {"geo": POINT(2.35705 41.49786), "geoid": "00000002", "lof": 1.235551000737416}
-- {"geo": POINT(2.13967 41.3838), "geoid": "00000003", "lof": 1.1305674032876687}
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT *
FROM UNNEST((
    SELECT carto.LOF(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
-- {"geo": POINT(2.82263 41.97118), "geoid": "00000001", "lof": 1.3217599116891428}
-- {"geo": POINT(2.35705 41.49786), "geoid": "00000002", "lof": 1.235551000737416}
-- {"geo": POINT(2.13967 41.3838), "geoid": "00000003", "lof": 1.1305674032876687}
-- ...
```

{% endtab %}
{% endtabs %}
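The score follows the standard Local Outlier Factor definition (Breunig et al., 2000): the ratio of each point's local reachability density to that of its k nearest neighbors, with values well above 1 flagging outliers. A planar sketch of the computation (Euclidean distances instead of the geodesic distances the BigQuery function uses):

```python
from math import hypot

def lof(points, k):
    """Local Outlier Factor with planar distances.

    points: list of (geoid, (x, y)) tuples with distinct coordinates;
    the BigQuery function uses geodesic distances on GEOGRAPHY values.
    """
    ids = [g for g, _ in points]
    xy = dict(points)

    def dist(a, b):
        (x1, y1), (x2, y2) = xy[a], xy[b]
        return hypot(x1 - x2, y1 - y2)

    # k nearest neighbors and k-distance of every point
    knn, kdist = {}, {}
    for p in ids:
        nearest = sorted((dist(p, o), o) for o in ids if o != p)[:k]
        knn[p] = [o for _, o in nearest]
        kdist[p] = nearest[-1][0]

    # local reachability density: inverse mean reachability distance
    def lrd(p):
        reach = [max(kdist[o], dist(p, o)) for o in knn[p]]
        return len(reach) / sum(reach)

    # LOF: mean neighbor density relative to the point's own density
    return {p: sum(lrd(o) for o in knn[p]) / (k * lrd(p)) for p in ids}
```

Points inside a uniform cluster score close to 1; an isolated point scores well above 1.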

## GFUN\_TABLE <a href="#gfun_table" id="gfun_table"></a>

```sql
GFUN_TABLE(input, output_table, geo_col)
```

**Description**

This procedure computes the [G-function](http://www.css.cornell.edu/faculty/dgr2/_static/files/ov/ov_PPA_Handout.pdf) of a given set of points.

**Input parameters**

* `input`: `STRING` the query to the data used to compute the G-Function. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `geo_col`: `STRING` name of the column with the geometries.

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `distance`: `FLOAT64` the nearest-neighbor distances forming the support at which G is evaluated.
* `gfun_G`: `FLOAT64` the empirical G evaluated for each distance in the support.
* `gfun_ev`: `FLOAT64` the theoretical Poisson G evaluated for each distance in the support.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GFUN_TABLE(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'geo'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GFUN_TABLE(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'geo'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GFUN_TABLE(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'geo'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## GFUN <a href="#gfun" id="gfun"></a>

```sql
GFUN(points)
```

**Description**

This function computes the [G-function](http://www.css.cornell.edu/faculty/dgr2/_static/files/ov/ov_PPA_Handout.pdf) of a given set of points.

**Input parameters**

* `points`: `ARRAY<GEOGRAPHY>` input data points.

**Return type**

`ARRAY<STRUCT<distance FLOAT64, gfun_G FLOAT64, gfun_ev FLOAT64>>`

where:

* `distance`: the nearest-neighbor distances forming the support at which G is evaluated.
* `gfun_G`: the empirical G evaluated for each distance in the support.
* `gfun_ev`: the theoretical Poisson G evaluated for each distance in the support.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT *
FROM UNNEST((
    SELECT `carto-un`.carto.GFUN(myarray)
    FROM (
        SELECT ARRAY_AGG(position_geom) myarray
        FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
        WHERE date = '2017-12-31'
    )
))
ORDER BY distance;
--{
--  "distance": "38.599968853183",
--  "gfun_G": "0.4319167389418907",
--  "gfun_ev": "4.037383876246414E-4"
--},
--{
--  "distance": "77.199937706366",
--  "gfun_G": "0.5771899392888118",
--  "gfun_ev": "0.0016139757856029613"
--},
--{
--  "distance": "115.799906559549",
--  "gfun_G": "0.6522116218560278",
--  "gfun_ev": "0.003627782844736638"
--},
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT *
FROM UNNEST((
    SELECT `carto-un-eu`.carto.GFUN(myarray)
    FROM (
        SELECT ARRAY_AGG(position_geom) myarray
        FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
        WHERE date = '2017-12-31'
    )
))
ORDER BY distance;
--{
--  "distance": "38.599968853183",
--  "gfun_G": "0.4319167389418907",
--  "gfun_ev": "4.037383876246414E-4"
--},
--{
--  "distance": "77.199937706366",
--  "gfun_G": "0.5771899392888118",
--  "gfun_ev": "0.0016139757856029613"
--},
--{
--  "distance": "115.799906559549",
--  "gfun_G": "0.6522116218560278",
--  "gfun_ev": "0.003627782844736638"
--},
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT *
FROM UNNEST((
    SELECT carto.GFUN(myarray)
    FROM (
        SELECT ARRAY_AGG(position_geom) myarray
        FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
        WHERE date = '2017-12-31'
    )
))
ORDER BY distance;
--{
--  "distance": "38.599968853183",
--  "gfun_G": "0.4319167389418907",
--  "gfun_ev": "4.037383876246414E-4"
--},
--{
--  "distance": "77.199937706366",
--  "gfun_G": "0.5771899392888118",
--  "gfun_ev": "0.0016139757856029613"
--},
--{
--  "distance": "115.799906559549",
--  "gfun_G": "0.6522116218560278",
--  "gfun_ev": "0.003627782844736638"
--},
-- ...
```

{% endtab %}
{% endtabs %}
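The empirical G-function is the cumulative distribution of nearest-neighbor distances, compared against its expectation under complete spatial randomness, 1 − exp(−λπd²), for a homogeneous Poisson process of intensity λ. A planar sketch (the BigQuery function works on `GEOGRAPHY` points with geodesic distances and derives the support itself):

```python
from math import exp, hypot, pi

def gfun(points, area, support):
    """Empirical G-function vs. the theoretical Poisson expectation.

    points: list of distinct (x, y) planar coordinates; area: area of the
    study region; support: distances at which to evaluate G. The BigQuery
    function instead uses geodesic distances on GEOGRAPHY points.
    """
    n = len(points)
    # Nearest-neighbor distance of each point
    nn = [min(hypot(x - ox, y - oy)
              for ox, oy in points if (ox, oy) != (x, y))
          for x, y in points]
    lam = n / area  # intensity of the homogeneous Poisson reference
    return [
        {"distance": d,
         "gfun_G": sum(v <= d for v in nn) / n,   # empirical CDF of NN distances
         "gfun_ev": 1 - exp(-lam * pi * d * d)}   # CSR expectation
        for d in support
    ]
```

Empirical G rising faster than `gfun_ev` (as in the example output above) indicates clustering; rising slower indicates dispersion.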

## CREATE\_SPATIAL\_COMPOSITE\_SUPERVISED <a href="#create_spatial_composite_supervised" id="create_spatial_composite_supervised"></a>

```sql
CREATE_SPATIAL_COMPOSITE_SUPERVISED(input_query, index_column, output_prefix, options)
```

**Description**

This procedure derives a spatial composite score as the residuals of a regression model, which can be used to detect areas of under- and over-prediction. The response variable should be measurable and correlated with the set of variables defining the score. For each data point, the residual is defined as the observed value minus the predicted value. Rows with a NULL value in any of the individual variables are dropped.

**Input parameters**

* `input_query`: `STRING` the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. `'<my-project>.<my-dataset>.<my-table>'`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier.
* `output_prefix`: `STRING` the prefix for the output table. It should include project and dataset, e.g. `'<my-project>.<my-dataset>.<my-output-prefix>'`.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described below.
  * `model_transform`: `STRING` containing the [TRANSFORM](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform) clause in a [BigQuery ML CREATE MODEL statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create). If NULL no TRANSFORM clause is applied.
  * `model_options`: `JSON` with the different options allowed by [BigQuery ML CREATE MODEL statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create) for regression models. Any model is allowed as long as it can deal with numerical inputs for the response variable. At least the `INPUT_LABEL_COLS` and `MODEL_TYPE` parameters must be specified. By default, data will not be split into train and test (`DATA_SPLIT_METHOD = 'NO_SPLIT'`). [Hyperparameter tuning](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-train) is not currently supported.
  * `r2_thr`: `FLOAT64` the minimum allowed value for the R2 model score. If the R2 of the regression model is lower than this threshold this implies poor fitting and a warning is raised. The default value is 0.5.
  * `bucketize_method`: `STRING` the method used to discretize the spatial composite score. The default value is NULL. Possible options are:
    * EQUAL\_INTERVALS\_ZERO\_CENTERED: the values of the spatial composite score are discretized into buckets of equal width centered in zero. The lower and upper limits are derived from the maximum of the absolute values of the score after outlier removal.
  * `nbuckets`: `INT64` the number of buckets used when a bucketization method is specified. The default number of buckets is selected using [Freedman and Diaconis’s (1981) rule](https://robjhyndman.com/papers/sturges.pdf). Ignored if `bucketize_method` is not specified.
  * `remove_outliers`: `BOOL`. When `bucketize_method` is specified, if `remove_outliers` is set to TRUE the buckets are derived from the outlier-removed data. The outliers are computed using [Tukey’s fences k parameter](https://en.wikipedia.org/wiki/Outlier#Tukey's_fences) for outlier detection. The default value is TRUE. For large inputs, setting this option to TRUE might cause a [`Query exceeds CPU resources` error](https://cloud.google.com/bigquery/docs/troubleshoot-queries#exceeded_resource_limits). Ignored if `bucketize_method` is not specified.

**Return type**

The results are stored in the table named `<output_prefix>`, which contains the following columns:

* `index_column`: the unique geographic identifier. The type of this column depends on the type of `index_column` in `input_query`.
* `spatial_score`: the value of the composite score. The type of this column is `FLOAT64` if the score is not discretized and `INT64` otherwise.

When the score is discretized by specifying the `bucketize_method` parameter, the procedure also returns a lookup table named `<output_prefix>_lookup_table` with the following columns:

* `lower_bound`: `FLOAT64` the lower bound of the bin.
* `upper_bound`: `FLOAT64` the upper bound of the bin.
* `spatial_score`: `INT64` the value of the (discretized) composite score.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_SUPERVISED(
    'SELECT * FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "model_transform":[
            "revenue_change",
            "fempop_15_44, public_transport, education, pois, urbanity"
        ],
        "model_options":{
            "MODEL_TYPE":"LINEAR_REG",
            "INPUT_LABEL_COLS":["revenue_change"],
            "DATA_SPLIT_METHOD":"no_split",
            "OPTIMIZE_STRATEGY":"NORMAL_EQUATION",
            "CATEGORY_ENCODING_METHOD":"ONE_HOT_ENCODING"
        },
        "r2_thr":0.4,
        "bucketize_method":"EQUAL_INTERVALS_ZERO_CENTERED",
        "nbuckets":5
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.CREATE_SPATIAL_COMPOSITE_SUPERVISED(
    'SELECT * FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "model_transform":[
            "revenue_change",
            "fempop_15_44, public_transport, education, pois, urbanity"
        ],
        "model_options":{
            "MODEL_TYPE":"LINEAR_REG",
            "INPUT_LABEL_COLS":["revenue_change"],
            "DATA_SPLIT_METHOD":"no_split",
            "OPTIMIZE_STRATEGY":"NORMAL_EQUATION",
            "CATEGORY_ENCODING_METHOD":"ONE_HOT_ENCODING"
        },
        "r2_thr":0.4,
        "bucketize_method":"EQUAL_INTERVALS_ZERO_CENTERED",
        "nbuckets":5
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.CREATE_SPATIAL_COMPOSITE_SUPERVISED(
    'SELECT * FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "model_transform":[
            "revenue_change",
            "fempop_15_44, public_transport, education, pois, urbanity"
        ],
        "model_options":{
            "MODEL_TYPE":"LINEAR_REG",
            "INPUT_LABEL_COLS":["revenue_change"],
            "DATA_SPLIT_METHOD":"no_split",
            "OPTIMIZE_STRATEGY":"NORMAL_EQUATION",
            "CATEGORY_ENCODING_METHOD":"ONE_HOT_ENCODING"
        },
        "r2_thr":0.4,
        "bucketize_method":"EQUAL_INTERVALS_ZERO_CENTERED",
        "nbuckets":5
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}
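Stripped of the BigQuery ML machinery, the score is simply the signed residual of a regression: positive where the observed response exceeds the model's prediction, negative where it falls short. A toy sketch using ordinary least squares on a single predictor (the procedure itself fits whatever model `model_options` requests):

```python
def residual_scores(xs, ys):
    """Spatial score as regression residuals: observed minus predicted.

    Fits y = a + b*x by ordinary least squares; the procedure instead
    delegates the regression to BigQuery ML.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    # Positive score: observed above the prediction (over-performance);
    # negative score: observed below the prediction (under-performance).
    return [y - (a + b * x) for x, y in zip(xs, ys)]
```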

## CREATE\_SPATIAL\_COMPOSITE\_UNSUPERVISED <a href="#create_spatial_composite_unsupervised" id="create_spatial_composite_unsupervised"></a>

```sql
CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(input_query, index_column, output_prefix, options)
```

**Description**

This procedure combines (spatial) variables into a meaningful composite score. The composite score can be derived using different methods, scaling and aggregation functions and weights. Rows with a NULL value in any of the model predictors are dropped.

**Input parameters**

* `input_query`: `STRING` the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. `'<my-project>.<my-dataset>.<my-table>'`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier.
* `output_prefix`: `STRING` the prefix for the output table. It should include project and dataset, e.g. `'<my-project>.<my-dataset>.<my-output-prefix>'`.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described below. If options is set to NULL then all options are set to default values, as specified in the table below.
  * `scoring_method`: `STRING` the method used to derive the composite score. Possible options are:
    * ENTROPY: the spatial composite is derived as the weighted sum of the proportions of the min-max scaled individual variables, where the weights are based on the entropy of the proportion of each variable. Only numerical variables are allowed.
    * CUSTOM\_WEIGHTS: the spatial composite is computed by first scaling each individual variable and then aggregating the scaled variables according to user-defined scaling and aggregation methods and individual weights. Depending on the scaling parameter, both numerical and ordinal variables are allowed (categorical and boolean variables need to be transformed to ordinal).
    * FIRST\_PC: the spatial composite is derived from a Principal Component Analysis as the first principal component score. Only numerical variables are allowed.
  * `weights`: `STRUCT` the (optional) weights for each variable used to compute the spatial composite when scoring\_method is set to CUSTOM\_WEIGHTS, passed as `{"name":value, …}`. If a different scoring method is selected, then this input parameter is ignored. If specified, the sum of the weights must be lower than 1. If no weights are specified, equal weights are assumed. If weights are specified only for some variables and the sum of weights is less than 1, the remainder is distributed equally between the remaining variables. If weights are specified for all the variables and the sum of weights is less than 1, the remainder is distributed equally between all the variables.
  * `scaling`: `STRING` the user-defined scaling when the scoring\_method is set to CUSTOM\_WEIGHTS. Possible options are:
    * MIN\_MAX\_SCALER: data is rescaled into the range \[0,1] based on minimum and maximum values. Only numerical variables are allowed.
    * STANDARD\_SCALER: data is rescaled by subtracting the mean value and dividing the result by the standard deviation. Only numerical variables are allowed.
    * RANKING: data is replaced by its percent rank, that is, by values ranging from 0 (lowest) to 1 (highest). Both numerical and ordinal variables are allowed (categorical and boolean variables need to be transformed to ordinal).
    * DISTANCE\_TO\_TARGET\_MIN (\_MAX, \_AVG): data is rescaled by dividing by the minimum, maximum, or mean of all the values. Only numerical variables are allowed.
    * PROPORTION: data is rescaled by dividing by the sum total of all the values. Only numerical variables are allowed.
  * `aggregation`: `STRING` the aggregation function used when the scoring\_method is set to CUSTOM\_WEIGHTS. Possible options are:
    * LINEAR: the spatial composite is derived as the weighted sum of the scaled individual variables.
    * GEOMETRIC: the spatial composite is given by the product of the scaled individual variables, each to the power of its weight.
  * `correlation_var`: `STRING` when scoring\_method is set to FIRST\_PC, the spatial score will be positively correlated with the selected variable (i.e. the sign of the spatial score is set such that the correlation between the selected variable and the first principal component score is positive).
  * `correlation_thr`: `FLOAT64` the minimum absolute value of the correlation between each individual variable and the first principal component score when scoring\_method is set to FIRST\_PC.
  * `return_range`: `ARRAY<FLOAT64>` the user-defined normalization range of the spatial composite score, e.g. \[0.0, 1.0]. Ignored if `bucketize_method` is specified.
  * `bucketize_method`: `STRING` the method used to discretize the spatial composite score. Possible options are:
    * EQUAL\_INTERVALS: the values of the spatial composite score are discretized into buckets of equal widths.
    * QUANTILES: the values of the spatial composite score are discretized into buckets based on quantiles.
    * JENKS: the values of the spatial composite score are discretized into buckets obtained using k-means clustering.
  * `nbuckets`: `INT64` the number of buckets used when a bucketization method is specified. When `bucketize_method` is set to EQUAL\_INTERVALS, if `nbuckets` is NULL, the default number of buckets is selected using [Freedman and Diaconis’s (1981) rule](https://robjhyndman.com/papers/sturges.pdf). When `bucketize_method` is set to JENKS or QUANTILES, `nbuckets` cannot be NULL. When `bucketize_method` is set to JENKS, the maximum value is 100, which is the maximum number of clusters allowed by BigQuery k-means clustering.
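As a concrete illustration of the CUSTOM\_WEIGHTS flow with its defaults (MIN\_MAX\_SCALER scaling, LINEAR aggregation), here is a hypothetical sketch; the variable names and weights are made up, and the procedure additionally supports the other scalings, geometric aggregation and entropy-based weights described above:

```python
def custom_weights_score(rows, weights):
    """CUSTOM_WEIGHTS with MIN_MAX_SCALER scaling and LINEAR aggregation.

    rows: list of dicts mapping variable name -> value; weights: dict
    mapping variable name -> weight, assumed here to sum to 1.
    """
    names = list(weights)
    lo = {v: min(r[v] for r in rows) for v in names}
    hi = {v: max(r[v] for r in rows) for v in names}
    # Min-max scale each variable into [0, 1]
    scaled = [
        {v: (r[v] - lo[v]) / (hi[v] - lo[v]) for v in names} for r in rows
    ]
    # LINEAR aggregation: weighted sum of the scaled variables
    return [sum(weights[v] * s[v] for v in names) for s in scaled]
```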

| Option             | `ENTROPY` | `CUSTOM_WEIGHTS` | `FIRST_PC` | Valid options                                                                                                                            | Default value                                                                                                                                          |
| ------------------ | --------- | ---------------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `scoring_method`   | Optional  | Optional         | Optional   | ENTROPY, CUSTOM\_WEIGHTS, FIRST\_PC                                                                                                      | ENTROPY                                                                                                                                                |
| `weights`          | Ignored   | Optional         | Ignored    | `{"name":value…}`                                                                                                                        | NULL                                                                                                                                                   |
| `scaling`          | Ignored   | Optional         | Ignored    | MIN\_MAX\_SCALER, STANDARD\_SCALER, RANKING, DISTANCE\_TO\_TARGET\_MIN, DISTANCE\_TO\_TARGET\_MAX, DISTANCE\_TO\_TARGET\_AVG, PROPORTION | MIN\_MAX\_SCALER                                                                                                                                       |
| `aggregation`      | Ignored   | Optional         | Ignored    | LINEAR, GEOMETRIC                                                                                                                        | LINEAR                                                                                                                                                 |
| `correlation_var`  | Ignored   | Optional         | Mandatory  | -                                                                                                                                        | NULL                                                                                                                                                   |
| `correlation_thr`  | Ignored   | Optional         | Optional   | -                                                                                                                                        | NULL                                                                                                                                                   |
| `return_range`     | Optional  | Optional         | Optional   | -                                                                                                                                        | NULL                                                                                                                                                   |
| `bucketize_method` | Optional  | Optional         | Optional   | EQUAL\_INTERVALS, QUANTILES, JENKS                                                                                                       | NULL                                                                                                                                                   |
| `nbuckets`         | Optional  | Optional         | Optional   | -                                                                                                                                        | Selected using [Freedman and Diaconis’s (1981) rule](https://robjhyndman.com/papers/sturges.pdf) when `bucketize_method` is set to EQUAL\_INTERVALS |
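For reference, the Freedman-Diaconis rule used to pick the default number of equal-interval buckets can be sketched in Python. This is an illustration of the rule itself, not CARTO's implementation, which may differ in details such as rounding:

```python
import numpy as np

def freedman_diaconis_nbuckets(values):
    """Number of equal-width buckets suggested by the Freedman-Diaconis
    (1981) rule: bin width = 2 * IQR * n^(-1/3)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    q75, q25 = np.percentile(values, [75, 25])
    width = 2 * (q75 - q25) * n ** (-1 / 3)
    if width == 0:  # degenerate case: constant or tightly clustered data
        return 1
    return int(np.ceil((values.max() - values.min()) / width))
```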

**Return type**

The results are stored in the table named `<output_prefix>`, which contains the following columns:

* `index_column`: the unique geographic identifier. The type of this column depends on the type of `index_column` in `input_query`.
* `spatial_score`: the value of the composite score. The type of this column is `FLOAT64` if the score is not discretized and `INT64` otherwise.

When the score is discretized by specifying the `bucketize_method` parameter, the procedure also returns a lookup table named `<output_prefix>_lookup_table` with the following columns:

* `lower_bound`: `FLOAT64` the lower bound of the bin.
* `upper_bound`: `FLOAT64` the upper bound of the bin.
* `spatial_score`: `INT64` the value of the (discretized) composite score.

**Examples**

With the `ENTROPY` method:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"ENTROPY",
        "return_range":[0.0,1.0]
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"ENTROPY",
        "return_range":[0.0,1.0]
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"ENTROPY",
        "return_range":[0.0,1.0]
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

With the `CUSTOM_WEIGHTS` method:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"CUSTOM_WEIGHTS",
        "weights":{"fempop_15_44":0.2,"education":0.1,"urbanity_ordinal":0.1,"pois":0.1},
        "scaling":"RANKING",
        "aggregation":"LINEAR",
        "bucketize_method":"JENKS"
    }
    '''
 )
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"CUSTOM_WEIGHTS",
        "weights":{"fempop_15_44":0.2,"education":0.1,"urbanity_ordinal":0.1,"pois":0.1},
        "scaling":"RANKING",
        "aggregation":"LINEAR",
        "bucketize_method":"JENKS"
    }
    '''
 )
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"CUSTOM_WEIGHTS",
        "weights":{"fempop_15_44":0.2,"education":0.1,"urbanity_ordinal":0.1,"pois":0.1},
        "scaling":"RANKING",
        "aggregation":"LINEAR",
        "bucketize_method":"JENKS"
    }
    '''
 )
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

With the `FIRST_PC` method:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"FIRST_PC",
        "correlation_var":"fempop_15_44",
        "correlation_thr":0.6,
        "bucketize_method":"QUANTILES"
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"FIRST_PC",
        "correlation_var":"fempop_15_44",
        "correlation_thr":0.6,
        "bucketize_method":"QUANTILES"
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.CREATE_SPATIAL_COMPOSITE_UNSUPERVISED(
    'SELECT * EXCEPT(geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    'geoid',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "scoring_method":"FIRST_PC",
        "correlation_var":"fempop_15_44",
        "correlation_thr":0.6,
        "bucketize_method":"QUANTILES"
    }
    '''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## CRONBACH\_ALPHA\_COEFFICIENT <a href="#cronbach_alpha_coefficient" id="cronbach_alpha_coefficient"></a>

```sql
CRONBACH_ALPHA_COEFFICIENT(input_query, output_prefix)
```

**Description**

This procedure computes the [Cronbach’s alpha](https://en.wikipedia.org/wiki/Cronbach's_alpha) coefficient for a set of (spatial) variables. This coefficient can be used as a measure of the internal consistency or reliability of the data, based on the strength of the correlations between the individual variables. Cronbach’s alpha normally ranges between 0 and 1, although there is actually no lower limit to the coefficient. A higher alpha (closer to 1) indicates higher internal consistency, a lower alpha (closer to 0) lower consistency, with 0.65 usually taken as the minimum acceptable value. Rows with a NULL value in any of the individual variables are dropped.
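The coefficient can be expressed in terms of the same quantities the procedure returns (`k`, `mean_var`, `mean_cov`). The following Python sketch illustrates that formulation; it is not the procedure's code:

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (n_rows, k_variables) array, using the
    mean inter-item covariance form: alpha = k*c / (v + (k - 1)*c)."""
    X = np.asarray(X, dtype=float)
    X = X[~np.isnan(X).any(axis=1)]      # drop rows with missing values
    k = X.shape[1]
    cov = np.cov(X, rowvar=False)        # k x k covariance matrix
    mean_var = cov.diagonal().mean()     # v: mean variance of the variables
    mean_cov = cov[~np.eye(k, dtype=bool)].mean()  # c: mean inter-item covariance
    return (k * mean_cov) / (mean_var + (k - 1) * mean_cov)
```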

**Input parameters**

* `input_query`: `STRING` the query to the data used to compute the coefficient. It must contain all the individual variables that should be included in the computation of the coefficient. A qualified table name can be given as well, e.g. `'<my-project>.<my-dataset>.<my-table>'`.
* `output_prefix`: `STRING` the name for the output table. It should include project and dataset, e.g. `'<my-project>.<my-dataset>.<my-output-table>'`.

**Return type**

The output table with the following columns:

* `cronbach_alpha_coef`: `FLOAT64` the computed Cronbach Alpha coefficient.
* `k`: `INT64` the number of the individual variables used to compute the composite.
* `mean_var`: `FLOAT64` the mean variance of all individual variables.
* `mean_cov`: `FLOAT64` the mean inter-item covariance among all variable pairs.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.CRONBACH_ALPHA_COEFFICIENT(
    'SELECT * EXCEPT(geoid, geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    '<my-project>.<my-dataset>.<my-output-table>'
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.CRONBACH_ALPHA_COEFFICIENT(
    'SELECT * EXCEPT(geoid, geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    '<my-project>.<my-dataset>.<my-output-table>'
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.CRONBACH_ALPHA_COEFFICIENT(
    'SELECT * EXCEPT(geoid, geom, revenue_change, urbanity, urbanity_ordinal) FROM `<my-project>.<my-dataset>.<my-spatial-scoring-input>`',
    '<my-project>.<my-dataset>.<my-output-table>'
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## GWR\_GRID <a href="#gwr_grid" id="gwr_grid"></a>

```sql
GWR_GRID(input_table, features_columns, label_column, cell_column, cell_type, kring_distance, kernel_function, fit_intercept, output_table)
```

**Description**

Geographically weighted regression (GWR) models local relationships between spatially varying predictors and an outcome of interest using a local least squares regression.

This procedure performs a local least squares regression for every input cell. This approach was selected to improve computation time and efficiency. The number of models is controlled by the selected cell resolution: the user can increase or decrease the resolution of the cell index to perform more or fewer regressions. Note that you need to provide the cell ID (spatial index) for every location as input (see the `cell_column` parameter), i.e., the cell type and resolution are not passed explicitly; the index has to be computed beforehand. Hence, if you want to increase or decrease the resolution, you need to precompute the corresponding cell ID of every location (see the [H3](#h3) or [Quadbin](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/.md/README.md#quadbin) module).

In each regression, the data of the locations in each cell and those of the neighboring cells, defined by the `kring_distance` parameter, are taken into account. The data of the neighboring cells are assigned a lower weight the further they are from the origin cell, following the function specified by `kernel_function`. For example, consider cell `i` and a `kring_distance` of 1: if there are `n` locations inside cell `i` and \[`n_1`, `n_2`, ..., `n_k`] locations in the neighboring cells, then the regression for cell `i` will use `n` + `n_1` + `n_2` + ... + `n_k` points in total.
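The distance-decay weighting described above can be sketched in Python. The kernel shapes are the standard ones; the normalization (here, bandwidth `kring_distance + 1` so that `u` is in `[0, 1)`) is an assumption for illustration and the procedure's exact scaling may differ:

```python
import math

# Standard kernel shapes as functions of the normalized distance u.
KERNELS = {
    "uniform":    lambda u: 1.0,
    "triangular": lambda u: 1.0 - u,
    "quadratic":  lambda u: 1.0 - u ** 2,
    "quartic":    lambda u: (1.0 - u ** 2) ** 2,
    "gaussian":   lambda u: math.exp(-0.5 * u ** 2),
}

def spatial_weight(d, kring_distance, kernel="gaussian"):
    # d: kring distance of a neighboring cell from the origin cell.
    u = d / (kring_distance + 1)  # hypothetical normalization to [0, 1)
    return KERNELS[kernel](u)

# Weights decay with distance from the origin cell (except 'uniform').
weights = [spatial_weight(d, 3) for d in range(4)]
```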

**Input parameters**

* `input_table`: `STRING` name of the input table, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `features_columns`: `ARRAY<STRING>` array of column names from `input_table` to be used as features in the GWR.
* `label_column`: `STRING` name of the target variable column.
* `cell_column`: `STRING` name of the column containing the cell ids.
* `cell_type`: `STRING` spatial index type, either 'h3' or 'quadbin'.
* `kring_distance`: `INT64` distance of the neighboring cells whose data will be included in the local regression of each cell.
* `kernel_function`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: 'uniform', 'triangular', 'quadratic', 'quartic' and 'gaussian'.
* `fit_intercept`: `BOOL` whether to calculate the intercept of the model or to force it to zero if, for example, the input data is already expected to be centered. If NULL, `fit_intercept` is considered `TRUE`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists. If NULL, the result will be returned directly by the query and not persisted.

**Output**

The output table will contain a column with the cell id, a column for each feature column containing its corresponding coefficient estimate and one extra column for intercept if `fit_intercept` is `TRUE`.

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GWR_GRID(
    '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'h3_z6', 'h3', 3, 'gaussian', TRUE,
    '<my-project>.<my-dataset>.<my-output-table>'
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GWR_GRID(
    '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'h3_z6', 'h3', 3, 'gaussian', TRUE,
    '<my-project>.<my-dataset>.<my-output-table>'
);
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GWR_GRID(
    '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'h3_z6', 'h3', 3, 'gaussian', TRUE,
    '<my-project>.<my-dataset>.<my-output-table>'
);
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GWR_GRID(
    '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'qb_z12', 'quadbin', 3, 'gaussian', TRUE,
    '<my-project>.<my-dataset>.<my-output-table>'
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GWR_GRID(
    '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'qb_z12', 'quadbin', 3, 'gaussian', TRUE,
    '<my-project>.<my-dataset>.<my-output-table>'
);
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GWR_GRID(
    '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
    ['bedrooms', 'bathrooms'], -- [ beds feature, bathrooms feature ]
    'price', -- price (target variable)
    'qb_z12', 'quadbin', 3, 'gaussian', TRUE,
    '<my-project>.<my-dataset>.<my-output-table>'
);
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
**Additional examples**

* [Applying GWR to understand Airbnb listings prices](https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/applying-gwr-to-understand-airbnb-listings-prices)
{% endhint %}

## GETIS\_ORD\_H3\_TABLE <a href="#getis_ord_h3_table" id="getis_ord_h3_table"></a>

```sql
GETIS_ORD_H3_TABLE(input, output_table, index_col, value_col, size, kernel)
```

**Description**

This procedure computes the Getis-Ord Gi\* statistic for each row in the input table.

**Input parameters**

* `input`: `STRING` the query to the data used to compute the statistic. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the H3 indexes.
* `value_col`: `STRING` name of the column with the values for each H3 cell.
* `size`: `INT64` size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `index_col` cannot contain NULL values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index`: `STRING`
* `gi`: `FLOAT64` computed Gi\* value.
* `p_value`: `FLOAT64` computed p-value.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GETIS_ORD_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    3,
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GETIS_ORD_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    3,
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GETIS_ORD_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    3,
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## GETIS\_ORD\_H3 <a href="#getis_ord_h3" id="getis_ord_h3"></a>

```sql
GETIS_ORD_H3(input, size, kernel)
```

**Description**

This function computes the Getis-Ord Gi\* statistic for each H3 index in the input array.

**Input parameters**

* `input`: `ARRAY<STRUCT<index STRING, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `input` cannot contain NULL index values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`ARRAY<STRUCT<index STRING, gi FLOAT64, p_value FLOAT64>>`
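As a cross-check, the returned `p_value` is consistent with a two-tailed test under a standard normal approximation (see [P\_VALUE](#p_value)). A minimal Python sketch of that relationship, as an illustration rather than the function's implementation:

```python
import math

def two_tailed_p_value(z_score):
    """Two-tailed p-value of a z-score under the standard normal
    distribution: p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z_score) / math.sqrt(2))

# Applied to a gi value, this reproduces the corresponding p_value
# shown in the examples below.
p = two_tailed_p_value(1.3606194139870573)
```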

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.GETIS_ORD_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('89394460c37ffff', 28.0),
        STRUCT('89394460077ffff', 19.0)
    ],
    3, 'gaussian'
);
-- {"index": "89394460323ffff", "gi": 1.3606194139870573, "p_value": 0.17363411613079893}
-- {"index": "89394460c37ffff", "gi": -0.34633948719670526, "p_value": 0.7290877280096945}
-- {"index": "89394460077ffff", "gi": -1.0142799267903515, "p_value": 0.31044923023489734}
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.GETIS_ORD_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('89394460c37ffff', 28.0),
        STRUCT('89394460077ffff', 19.0)
    ],
    3, 'gaussian'
);
-- {"index": "89394460323ffff", "gi": 1.3606194139870573, "p_value": 0.17363411613079893}
-- {"index": "89394460c37ffff", "gi": -0.34633948719670526, "p_value": 0.7290877280096945}
-- {"index": "89394460077ffff", "gi": -1.0142799267903515, "p_value": 0.31044923023489734}
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.GETIS_ORD_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('89394460c37ffff', 28.0),
        STRUCT('89394460077ffff', 19.0)
    ],
    3, 'gaussian'
);
-- {"index": "89394460323ffff", "gi": 1.3606194139870573, "p_value": 0.17363411613079893}
-- {"index": "89394460c37ffff", "gi": -0.34633948719670526, "p_value": 0.7290877280096945}
-- {"index": "89394460077ffff", "gi": -1.0142799267903515, "p_value": 0.31044923023489734}
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.GETIS_ORD_H3(input_data, 3, 'gaussian')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.GETIS_ORD_H3(input_data, 3, 'gaussian')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.GETIS_ORD_H3(input_data, 3, 'gaussian')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
**Additional examples**

* [Identifying amenity hotspots in Stockholm](https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/identifying-amenity-hotspots-in-stockholm)
{% endhint %}

## GETIS\_ORD\_QUADBIN\_TABLE <a href="#getis_ord_quadbin_table" id="getis_ord_quadbin_table"></a>

```sql
GETIS_ORD_QUADBIN_TABLE(input, output_table, index_col, value_col, size, kernel)
```

**Description**

This procedure computes the Getis-Ord Gi\* statistic for each row in the input table.

**Input parameters**

* `input`: `STRING` the query to the data used to compute the statistic. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the Quadbin indexes.
* `value_col`: `STRING` name of the column with the values for each Quadbin cell.
* `size`: `INT64` size of the Quadbin kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `index_col` cannot contain NULL values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index`: `INT64`
* `gi`: `FLOAT64` computed Gi\* value.
* `p_value`: `FLOAT64` computed p-value.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GETIS_ORD_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    3,
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GETIS_ORD_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    3,
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GETIS_ORD_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    3,
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## GETIS\_ORD\_QUADBIN <a href="#getis_ord_quadbin" id="getis_ord_quadbin"></a>

```sql
GETIS_ORD_QUADBIN(input, size, kernel)
```

**Description**

This function computes the Getis-Ord Gi\* statistic for each Quadbin index in the input array.

**Input parameters**

* `input`: `ARRAY<STRUCT<index INT64, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the Quadbin kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `input` cannot contain NULL index values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`ARRAY<STRUCT<index INT64, gi FLOAT64, p_value FLOAT64>>`

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.GETIS_ORD_QUADBIN(
    [
        STRUCT(5266443791933898751, 51.0),
        STRUCT(5266443803500740607, 28.0),
        STRUCT(5266443790415822847, 19.0)
    ],
    3, 'gaussian'
);
-- {"index": 5266443791933898751, "gi": 1.360619413987058, "p_value": 0.17363411613079893}
-- {"index": 5266443803500740607, "gi": -0.3463394871967051, "p_value": 0.7290877280096945}
-- {"index": 5266443790415822847, "gi": -1.0142799267903515, "p_value": 0.31044923023489734}
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.GETIS_ORD_QUADBIN(
    [
        STRUCT(5266443791933898751, 51.0),
        STRUCT(5266443803500740607, 28.0),
        STRUCT(5266443790415822847, 19.0)
    ],
    3, 'gaussian'
);
-- {"index": 5266443791933898751, "gi": 1.360619413987058, "p_value": 0.17363411613079893}
-- {"index": 5266443803500740607, "gi": -0.3463394871967051, "p_value": 0.7290877280096945}
-- {"index": 5266443790415822847, "gi": -1.0142799267903515, "p_value": 0.31044923023489734}
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.GETIS_ORD_QUADBIN(
    [
        STRUCT(5266443791933898751, 51.0),
        STRUCT(5266443803500740607, 28.0),
        STRUCT(5266443790415822847, 19.0)
    ],
    3, 'gaussian'
);
-- {"index": 5266443791933898751, "gi": 1.360619413987058, "p_value": 0.17363411613079893}
-- {"index": 5266443803500740607, "gi": -0.3463394871967051, "p_value": 0.7290877280096945}
-- {"index": 5266443790415822847, "gi": -1.0142799267903515, "p_value": 0.31044923023489734}
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.GETIS_ORD_QUADBIN(input_data, 3, 'gaussian')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.GETIS_ORD_QUADBIN(input_data, 3, 'gaussian')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.GETIS_ORD_QUADBIN(input_data, 3, 'gaussian')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}
{% endtabs %}

## GETIS\_ORD\_SPACETIME\_H3\_TABLE <a href="#getis_ord_spacetime_h3_table" id="getis_ord_spacetime_h3_table"></a>

```sql
GETIS_ORD_SPACETIME_H3_TABLE(input, output_table, index_col, date_col, value_col, size, time_freq, time_bw, kernel, kernel_time)
```

**Description**

This procedure computes the space-time Getis-Ord Gi\* statistic for each H3 index and each datetime timestamp, according to the method described in this [paper](https://doi.org/10.1080/00330124.2019.1709215). It extends the [Getis-Ord Gi\*](#getis_ord_h3) function to the time domain. The Getis-Ord Gi\* statistic is a measure of autocorrelation, i.e. the degree to which data values are clustered together in space and time. The statistic is computed as the sum of the values of the cells in the kring (spatial and temporal distance from the origin), weighted by the kernel functions, minus the value of the origin cell, divided by the standard deviation of the values of the cells in the kring. The statistic is calculated from the minimum to the maximum datetime in the input, with the step defined by the user. The datetime timestamp is truncated to the provided level, for example day, hour or week. For each spatial index, missing datetime timestamps between the minimum and maximum are filled with a default value of 0; any other imputation of the values should take place before passing the input to the function. The p-value is computed as the probability of observing a value at least as extreme as the observed one, under the null hypothesis that the values are randomly distributed in space and time, using a normal distribution approximation.
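The gap-filling step described above can be sketched in Python for `time_freq` set to `day`. This is an illustration of the behavior, not the procedure's implementation:

```python
from datetime import date, timedelta

def fill_time_gaps(rows):
    """rows: list of (index, day, value) tuples. For each spatial index,
    fills missing days between its min and max date with value 0.0
    (the 'day' frequency); other frequencies would step differently."""
    by_index = {}
    for idx, day, value in rows:
        by_index.setdefault(idx, {})[day] = value
    filled = []
    for idx, series in by_index.items():
        day, last = min(series), max(series)
        while day <= last:
            filled.append((idx, day, series.get(day, 0.0)))
            day += timedelta(days=1)
    return filled

rows = [("89394460323ffff", date(2023, 1, 1), 5.0),
        ("89394460323ffff", date(2023, 1, 3), 2.0)]
filled = fill_time_gaps(rows)  # the missing 2023-01-02 is filled with 0.0
```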

**Input parameters**

* `input`: `STRING` the query to the data used to compute the statistic. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the H3 indexes.
* `date_col`: `STRING` name of the column with the date.
* `value_col`: `STRING` name of the column with the values for each H3 cell.
* `size`: `INT64` size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `time_freq`: `STRING` the time interval (step) to use for the time series. Available values are: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second`. It is the equivalent of the spatial index in the time domain.
* `time_bw`: `INT64` the bandwidth to use for the time series. This defines the number of adjacent observations in the time domain to be considered. It is the equivalent of the H3 kring in the time domain.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
* `kernel_time`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the temporal weights within the time bandwidth. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `index_col` cannot contain NULL values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index`: `STRING`
* `date`: `DATETIME`
* `gi`: `FLOAT64` computed Gi\* value.
* `p_value`: `FLOAT64` computed p-value.
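Once the procedure has run, significant space-time hot and cold spots can be extracted from the output table with an ordinary query. A minimal sketch (the 0.05 significance threshold is an arbitrary choice, not part of the procedure):

```sql
-- Hypothetical follow-up query: cells with significant clustering
SELECT
    index,
    date,
    gi,
    IF(gi > 0, 'hot spot', 'cold spot') AS spot_type
FROM `<my-project>.<my-dataset>.<my-output-table>`
WHERE p_value < 0.05
ORDER BY p_value;
```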

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GETIS_ORD_SPACETIME_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GETIS_ORD_SPACETIME_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GETIS_ORD_SPACETIME_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## GETIS\_ORD\_SPACETIME\_H3 <a href="#getis_ord_spacetime_h3" id="getis_ord_spacetime_h3"></a>

```sql
GETIS_ORD_SPACETIME_H3(input, size, time_freq, time_bw, kernel, kernel_time)
```

**Description**

This table function computes the space-time Getis-Ord Gi\* statistic for each H3 index and each datetime timestamp, following the method described in this [paper](https://doi.org/10.1080/00330124.2019.1709215). It extends the [Getis-Ord Gi\*](#getis_ord_h3) function to the time domain. The Getis-Ord Gi\* statistic is a measure of autocorrelation, i.e. the degree to which data values are clustered together in space and time. The statistic is computed as the sum of the values of the cells in the kring (within the given spatial and temporal distance from the origin), weighted by the kernel functions, minus the value of the origin cell, divided by the standard deviation of the values of the cells in the kring. It is calculated from the minimum to the maximum datetime in the input array, with the step defined by the user; each datetime timestamp is truncated to the provided level, for example day, hour or week. For each spatial index, missing datetime timestamps within that range are filled with the default value of 0. Any other imputation of the values should take place outside of the function, prior to passing the input. The p-value is computed as the probability of observing a value at least as extreme as the observed one, under the null hypothesis that the values are randomly distributed in space and time, using a normal distribution approximation.

**Input parameters**

* `input`: `ARRAY<STRUCT<index STRING, date DATETIME, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the H3 kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `time_freq`: `STRING` time step to use for the time series. Available values are: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second`. It is the equivalent, in the time domain, of the spatial index.
* `time_bw`: `INT64` bandwidth to use for the time series. This defines the number of adjacent observations in the time domain to be considered. It is the equivalent, in the time domain, of the H3 kring.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
* `kernel_time`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the temporal weights within the time bandwidth. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `input` cannot contain NULL index values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`TABLE(index STRING, date DATETIME, gi FLOAT64, p_value FLOAT64)`

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT * FROM `carto-un`.carto.GETIS_ORD_SPACETIME_H3(
    [
        STRUCT('89394460323ffff', DATETIME(2023, 5, 1, 0, 0, 0), 51.0),
        STRUCT('89394460c37ffff', DATETIME(2023, 5, 2, 0, 0, 0), 28.0),
        STRUCT('89394460077ffff', DATETIME(2023, 5, 3, 0, 0, 0), 19.0)
    ],
    3, 'DAY', 1, 'gaussian', 'gaussian'
);
-- index, date, gi, p_value
-- 89394460077ffff 2023-05-01 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 89394460077ffff 2023-05-02 00:00:00.000 -0.6294621529087477 0.5290464242343208
-- 89394460077ffff 2023-05-03 00:00:00.000 0.07918630608303281 0.9368843022965981
-- 89394460323ffff 2023-05-01 00:00:00.000 1.7610199325971272 0.07823494063450298
-- 89394460323ffff 2023-05-02 00:00:00.000 0.3421868170238157 0.732210438784662
-- 89394460323ffff 2023-05-03 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 89394460c37ffff 2023-05-01 00:00:00.000 -0.026829208020286936 0.9785958577333086
-- 89394460c37ffff 2023-05-02 00:00:00.000 0.1953523050277685 0.8451171948001063
-- 89394460c37ffff 2023-05-03 00:00:00.000 -0.026829208020286936 0.9785958577333086
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT * FROM `carto-un-eu`.carto.GETIS_ORD_SPACETIME_H3(
    [
        STRUCT('89394460323ffff', DATETIME(2023, 5, 1, 0, 0, 0), 51.0),
        STRUCT('89394460c37ffff', DATETIME(2023, 5, 2, 0, 0, 0), 28.0),
        STRUCT('89394460077ffff', DATETIME(2023, 5, 3, 0, 0, 0), 19.0)
    ],
    3, 'DAY', 1, 'gaussian', 'gaussian'
);
-- index, date, gi, p_value
-- 89394460077ffff 2023-05-01 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 89394460077ffff 2023-05-02 00:00:00.000 -0.6294621529087477 0.5290464242343208
-- 89394460077ffff 2023-05-03 00:00:00.000 0.07918630608303281 0.9368843022965981
-- 89394460323ffff 2023-05-01 00:00:00.000 1.7610199325971272 0.07823494063450298
-- 89394460323ffff 2023-05-02 00:00:00.000 0.3421868170238157 0.732210438784662
-- 89394460323ffff 2023-05-03 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 89394460c37ffff 2023-05-01 00:00:00.000 -0.026829208020286936 0.9785958577333086
-- 89394460c37ffff 2023-05-02 00:00:00.000 0.1953523050277685 0.8451171948001063
-- 89394460c37ffff 2023-05-03 00:00:00.000 -0.026829208020286936 0.9785958577333086
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT * FROM carto.GETIS_ORD_SPACETIME_H3(
    [
        STRUCT('89394460323ffff', DATETIME(2023, 5, 1, 0, 0, 0), 51.0),
        STRUCT('89394460c37ffff', DATETIME(2023, 5, 2, 0, 0, 0), 28.0),
        STRUCT('89394460077ffff', DATETIME(2023, 5, 3, 0, 0, 0), 19.0)
    ],
    3, 'DAY', 1, 'gaussian', 'gaussian'
);
-- index, date, gi, p_value
-- 89394460077ffff 2023-05-01 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 89394460077ffff 2023-05-02 00:00:00.000 -0.6294621529087477 0.5290464242343208
-- 89394460077ffff 2023-05-03 00:00:00.000 0.07918630608303281 0.9368843022965981
-- 89394460323ffff 2023-05-01 00:00:00.000 1.7610199325971272 0.07823494063450298
-- 89394460323ffff 2023-05-02 00:00:00.000 0.3421868170238157 0.732210438784662
-- 89394460323ffff 2023-05-03 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 89394460c37ffff 2023-05-01 00:00:00.000 -0.026829208020286936 0.9785958577333086
-- 89394460c37ffff 2023-05-02 00:00:00.000 0.1953523050277685 0.8451171948001063
-- 89394460c37ffff 2023-05-03 00:00:00.000 -0.026829208020286936 0.9785958577333086
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT * FROM `carto-un`.carto.GETIS_ORD_SPACETIME_H3((
    SELECT ARRAY_AGG(STRUCT(index, timestamp, value)) AS input_data
    FROM mytable
), 3, 'DAY', 1, 'gaussian', 'gaussian');
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT * FROM `carto-un-eu`.carto.GETIS_ORD_SPACETIME_H3((
    SELECT ARRAY_AGG(STRUCT(index, timestamp, value)) AS input_data
    FROM mytable
), 3, 'DAY', 1, 'gaussian', 'gaussian');
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT * FROM carto.GETIS_ORD_SPACETIME_H3((
    SELECT ARRAY_AGG(STRUCT(index, timestamp, value)) AS input_data
    FROM mytable
), 3, 'DAY', 1, 'gaussian', 'gaussian');
```

{% endtab %}
{% endtabs %}

## GETIS\_ORD\_SPACETIME\_QUADBIN\_TABLE <a href="#getis_ord_spacetime_quadbin_table" id="getis_ord_spacetime_quadbin_table"></a>

```sql
GETIS_ORD_SPACETIME_QUADBIN_TABLE(input, output_table, index_col, date_col, value_col, size, time_freq, time_bw, kernel, kernel_time)
```

**Description**

This procedure computes the space-time Getis-Ord Gi\* statistic for each Quadbin index and each datetime timestamp, following the method described in this [paper](https://doi.org/10.1080/00330124.2019.1709215). It extends the [Getis-Ord Gi\*](#getis_ord_quadbin) function to the time domain. The Getis-Ord Gi\* statistic is a measure of autocorrelation, i.e. the degree to which data values are clustered together in space and time. The statistic is computed as the sum of the values of the cells in the kring (within the given spatial and temporal distance from the origin), weighted by the kernel functions, minus the value of the origin cell, divided by the standard deviation of the values of the cells in the kring. It is calculated from the minimum to the maximum datetime in the input, with the step defined by the user; each datetime timestamp is truncated to the provided level, for example day, hour or week. For each spatial index, missing datetime timestamps within that range are filled with the default value of 0. Any other imputation of the values should take place outside of the function, prior to passing the input. The p-value is computed as the probability of observing a value at least as extreme as the observed one, under the null hypothesis that the values are randomly distributed in space and time, using a normal distribution approximation.

**Input parameters**

* `input`: `STRING` the query to the data used to compute the coefficient. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the Quadbin indexes.
* `date_col`: `STRING` name of the column with the date.
* `value_col`: `STRING` name of the column with the values for each Quadbin cell.
* `size`: `INT64` size of the Quadbin kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `time_freq`: `STRING` time step to use for the time series. Available values are: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second`. It is the equivalent, in the time domain, of the spatial index.
* `time_bw`: `INT64` bandwidth to use for the time series. This defines the number of adjacent observations in the time domain to be considered. It is the equivalent, in the time domain, of the Quadbin kring.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
* `kernel_time`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the temporal weights within the time bandwidth. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `index_col` cannot contain NULL values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index`: `INT64`
* `date`: `DATETIME`
* `gi`: `FLOAT64` computed Gi\* value.
* `p_value`: `FLOAT64` computed p-value.
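To map the results, the output table can be joined back to cell geometries. A sketch, assuming the `QUADBIN_BOUNDARY` function from the quadbin module of the toolbox is available:

```sql
-- Hypothetical visualization query: attach cell polygons to the Gi* results
SELECT
    `carto-un`.carto.QUADBIN_BOUNDARY(index) AS geom,
    date,
    gi,
    p_value
FROM `<my-project>.<my-dataset>.<my-output-table>`;
```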

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.GETIS_ORD_SPACETIME_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.GETIS_ORD_SPACETIME_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.GETIS_ORD_SPACETIME_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'date',
    'value',
    3,
    'DAY',
    1,
    'gaussian',
    'gaussian'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## GETIS\_ORD\_SPACETIME\_QUADBIN <a href="#getis_ord_spacetime_quadbin" id="getis_ord_spacetime_quadbin"></a>

```sql
GETIS_ORD_SPACETIME_QUADBIN(input, size, time_freq, time_bw, kernel, kernel_time)
```

**Description**

This table function computes the space-time Getis-Ord Gi\* statistic for each Quadbin index and each datetime timestamp, following the method described in this [paper](https://doi.org/10.1080/00330124.2019.1709215). It extends the [Getis-Ord Gi\*](#getis_ord_quadbin) function to the time domain. The Getis-Ord Gi\* statistic is a measure of autocorrelation, i.e. the degree to which data values are clustered together in space and time. The statistic is computed as the sum of the values of the cells in the kring (within the given spatial and temporal distance from the origin), weighted by the kernel functions, minus the value of the origin cell, divided by the standard deviation of the values of the cells in the kring. It is calculated from the minimum to the maximum datetime in the input array, with the step defined by the user; each datetime timestamp is truncated to the provided level, for example day, hour or week. For each spatial index, missing datetime timestamps within that range are filled with the default value of 0. Any other imputation of the values should take place outside of the function, prior to passing the input. The p-value is computed as the probability of observing a value at least as extreme as the observed one, under the null hypothesis that the values are randomly distributed in space and time, using a normal distribution approximation.

**Input parameters**

* `input`: `ARRAY<STRUCT<index INT64, date DATETIME, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the Quadbin kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi\* statistic.
* `time_freq`: `STRING` time step to use for the time series. Available values are: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second`. It is the equivalent, in the time domain, of the spatial index.
* `time_bw`: `INT64` bandwidth to use for the time series. This defines the number of adjacent observations in the time domain to be considered. It is the equivalent, in the time domain, of the Quadbin kring.
* `kernel`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
* `kernel_time`: `STRING` [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the temporal weights within the time bandwidth. Available functions are: uniform, triangular, quadratic, quartic and gaussian.

{% hint style="warning" %}
The `input` cannot contain NULL index values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`TABLE(index INT64, date DATETIME, gi FLOAT64, p_value FLOAT64)`

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT * FROM `carto-un`.carto.GETIS_ORD_SPACETIME_QUADBIN(
    [
        STRUCT(5266443791933898751, DATETIME(2023, 5, 1, 0, 0, 0), 51.0),
        STRUCT(5266443803500740607, DATETIME(2023, 5, 2, 0, 0, 0), 28.0),
        STRUCT(5266443790415822847, DATETIME(2023, 5, 3, 0, 0, 0), 19.0)
    ],
    3, 'DAY', 1, 'gaussian', 'gaussian'
);
-- index, date, gi, p_value
-- 5266443790415822847 2023-05-01 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 5266443790415822847 2023-05-02 00:00:00.000 -0.6294621529087477 0.5290464242343208
-- 5266443790415822847 2023-05-03 00:00:00.000 0.07918630608303281 0.9368843022965981
-- 5266443791933898751 2023-05-01 00:00:00.000 1.7610199325971272 0.07823494063450298
-- 5266443791933898751 2023-05-02 00:00:00.000 0.3421868170238157 0.732210438784662
-- 5266443791933898751 2023-05-03 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 5266443803500740607 2023-05-01 00:00:00.000 -0.026829208020286936 0.9785958577333086
-- 5266443803500740607 2023-05-02 00:00:00.000 0.1953523050277685 0.8451171948001063
-- 5266443803500740607 2023-05-03 00:00:00.000 -0.026829208020286936 0.9785958577333086
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT * FROM `carto-un-eu`.carto.GETIS_ORD_SPACETIME_QUADBIN(
    [
        STRUCT(5266443791933898751, DATETIME(2023, 5, 1, 0, 0, 0), 51.0),
        STRUCT(5266443803500740607, DATETIME(2023, 5, 2, 0, 0, 0), 28.0),
        STRUCT(5266443790415822847, DATETIME(2023, 5, 3, 0, 0, 0), 19.0)
    ],
    3, 'DAY', 1, 'gaussian', 'gaussian'
);
-- index, date, gi, p_value
-- 5266443790415822847 2023-05-01 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 5266443790415822847 2023-05-02 00:00:00.000 -0.6294621529087477 0.5290464242343208
-- 5266443790415822847 2023-05-03 00:00:00.000 0.07918630608303281 0.9368843022965981
-- 5266443791933898751 2023-05-01 00:00:00.000 1.7610199325971272 0.07823494063450298
-- 5266443791933898751 2023-05-02 00:00:00.000 0.3421868170238157 0.732210438784662
-- 5266443791933898751 2023-05-03 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 5266443803500740607 2023-05-01 00:00:00.000 -0.026829208020286936 0.9785958577333086
-- 5266443803500740607 2023-05-02 00:00:00.000 0.1953523050277685 0.8451171948001063
-- 5266443803500740607 2023-05-03 00:00:00.000 -0.026829208020286936 0.9785958577333086
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT * FROM carto.GETIS_ORD_SPACETIME_QUADBIN(
    [
        STRUCT(5266443791933898751, DATETIME(2023, 5, 1, 0, 0, 0), 51.0),
        STRUCT(5266443803500740607, DATETIME(2023, 5, 2, 0, 0, 0), 28.0),
        STRUCT(5266443790415822847, DATETIME(2023, 5, 3, 0, 0, 0), 19.0)
    ],
    3, 'DAY', 1, 'gaussian', 'gaussian'
);
-- index, date, gi, p_value
-- 5266443790415822847 2023-05-01 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 5266443790415822847 2023-05-02 00:00:00.000 -0.6294621529087477 0.5290464242343208
-- 5266443790415822847 2023-05-03 00:00:00.000 0.07918630608303281 0.9368843022965981
-- 5266443791933898751 2023-05-01 00:00:00.000 1.7610199325971272 0.07823494063450298
-- 5266443791933898751 2023-05-02 00:00:00.000 0.3421868170238157 0.732210438784662
-- 5266443791933898751 2023-05-03 00:00:00.000 -0.9194024096597111 0.3578850857359205
-- 5266443803500740607 2023-05-01 00:00:00.000 -0.026829208020286936 0.9785958577333086
-- 5266443803500740607 2023-05-02 00:00:00.000 0.1953523050277685 0.8451171948001063
-- 5266443803500740607 2023-05-03 00:00:00.000 -0.026829208020286936 0.9785958577333086
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT * FROM `carto-un`.carto.GETIS_ORD_SPACETIME_QUADBIN((
    SELECT ARRAY_AGG(STRUCT(index, timestamp, value)) AS input_data
    FROM mytable
), 3, 'DAY', 1, 'gaussian', 'gaussian');
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT * FROM `carto-un-eu`.carto.GETIS_ORD_SPACETIME_QUADBIN((
    SELECT ARRAY_AGG(STRUCT(index, timestamp, value)) AS input_data
    FROM mytable
), 3, 'DAY', 1, 'gaussian', 'gaussian');
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT * FROM carto.GETIS_ORD_SPACETIME_QUADBIN((
    SELECT ARRAY_AGG(STRUCT(index, timestamp, value)) AS input_data
    FROM mytable
), 3, 'DAY', 1, 'gaussian', 'gaussian');
```

{% endtab %}
{% endtabs %}

## MORANS\_I\_H3\_TABLE <a href="#morans_i_h3_table" id="morans_i_h3_table"></a>

```sql
MORANS_I_H3_TABLE(input, output_table, index_col, value_col, size, decay)
```

**Description**

This procedure computes the [Moran's I spatial autocorrelation](https://en.wikipedia.org/wiki/Moran's_I) from the input table with H3 indexes.

**Input parameters**

* `input`: `STRING` the query to the data used to compute the coefficient. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the H3 indexes.
* `value_col`: `STRING` name of the column with the values for each H3 cell.
* `size`: `INT64` size of the H3 *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.

{% hint style="warning" %}
The `index_col` cannot contain NULL values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following column:

* `morans_i`: `FLOAT64` Moran's I spatial autocorrelation.

If all cells have no neighbours, then the procedure will fail.
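Because `input` accepts a query, the data can be aggregated to H3 on the fly. A sketch, assuming a point table and the `H3_FROMGEOGPOINT` function from the h3 module of the toolbox:

```sql
-- Hypothetical call passing a query that aggregates points to H3 resolution 9
CALL `carto-un`.carto.MORANS_I_H3_TABLE(
    '''
    SELECT
        `carto-un`.carto.H3_FROMGEOGPOINT(geom, 9) AS h3,
        COUNT(*) AS value
    FROM `<my-project>.<my-dataset>.<my-points-table>`
    GROUP BY h3
    ''',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    5,
    'uniform'
);
```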

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.MORANS_I_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    5,
    'uniform'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.MORANS_I_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    5,
    'uniform'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.MORANS_I_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    5,
    'uniform'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## MORANS\_I\_H3 <a href="#morans_i_h3" id="morans_i_h3"></a>

```sql
MORANS_I_H3(input, size, decay)
```

**Description**

This function computes the [Moran's I spatial autocorrelation](https://en.wikipedia.org/wiki/Moran's_I) from the input array of H3 indexes.

**Input parameters**

* `input`: `ARRAY<STRUCT<index STRING, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the H3 *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If none of the cells have neighbours within the given kring size, `NULL` is returned. If no neighboring cells are found for a given index cell, its weight is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.

{% hint style="warning" %}
The `input` cannot contain NULL index values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`FLOAT64`. If all cells have no neighbours, then the function will fail.

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.MORANS_I_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('89394460c37ffff', 28.0),
        STRUCT('89394460077ffff', 19.0)
    ],
    8, 'exponential'
);
-- -0.92003263973888194
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.MORANS_I_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('89394460c37ffff', 28.0),
        STRUCT('89394460077ffff', 19.0)
    ],
    8, 'exponential'
);
-- -0.92003263973888194
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.MORANS_I_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('89394460c37ffff', 28.0),
        STRUCT('89394460077ffff', 19.0)
    ],
    8, 'exponential'
);
-- -0.92003263973888194
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.MORANS_I_H3(input_data, 3, 'exponential')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.MORANS_I_H3(input_data, 3, 'exponential')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.MORANS_I_H3(input_data, 3, 'exponential')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
);
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
**Additional examples**

* [Computing the spatial autocorrelation of POIs locations in Berlin](https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/computing-the-spatial-autocorrelation-of-pois-locations-in-berlin)
{% endhint %}

## MORANS\_I\_QUADBIN\_TABLE <a href="#morans_i_quadbin_table" id="morans_i_quadbin_table"></a>

```sql
MORANS_I_QUADBIN_TABLE(input, output_table, index_col, value_col, size, decay)
```

**Description**

This procedure computes the [Moran's I spatial autocorrelation](https://en.wikipedia.org/wiki/Moran's_I) from the input table with Quadbin indexes.

**Input parameters**

* `input`: `STRING` the query to the data used to compute the coefficient. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the Quadbin indexes.
* `value_col`: `STRING` name of the column with the values for each Quadbin cell.
* `size`: `INT64` size of the Quadbin *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.

{% hint style="warning" %}
The `index_col` cannot contain NULL values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following column:

* `morans_i`: `FLOAT64` Moran's I spatial autocorrelation.

If all cells have no neighbours, then the procedure will fail.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.MORANS_I_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    5,
    'uniform'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.MORANS_I_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    5,
    'uniform'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.MORANS_I_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    5,
    'uniform'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## MORANS\_I\_QUADBIN <a href="#morans_i_quadbin" id="morans_i_quadbin"></a>

```sql
MORANS_I_QUADBIN(input, size, decay)
```

**Description**

This function computes the [Moran's I spatial autocorrelation](https://en.wikipedia.org/wiki/Moran's_I) from the input array of Quadbin indexes.

**Input parameters**

* `input`: `ARRAY<STRUCT<index INT64, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the Quadbin *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.

{% hint style="warning" %}
The `input` cannot contain NULL index values, otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`FLOAT64`. If all cells have no neighbours, then the function will fail.

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.MORANS_I_QUADBIN(
    [
        STRUCT(5266443791927869439, 51.0),
        STRUCT(5266443791928131583, 28.0),
        STRUCT(5266443791928918015, 19.0)
    ],
    3, 'exponential'
);
-- -0.29665713826808621
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.MORANS_I_QUADBIN(
    [
        STRUCT(5266443791927869439, 51.0),
        STRUCT(5266443791928131583, 28.0),
        STRUCT(5266443791928918015, 19.0)
    ],
    3, 'exponential'
);
-- -0.29665713826808621
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.MORANS_I_QUADBIN(
    [
        STRUCT(5266443791927869439, 51.0),
        STRUCT(5266443791928131583, 28.0),
        STRUCT(5266443791928918015, 19.0)
    ],
    3, 'exponential'
);
-- -0.29665713826808621
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.MORANS_I_QUADBIN(input_data, 3, 'exponential')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
)
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.MORANS_I_QUADBIN(input_data, 3, 'exponential')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
)
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.MORANS_I_QUADBIN(input_data, 3, 'exponential')
FROM (
    SELECT ARRAY_AGG(STRUCT(index, value)) AS input_data
    FROM mytable
)
```

{% endtab %}
{% endtabs %}

## LOCAL\_MORANS\_I\_H3\_TABLE <a href="#local_morans_i_h3_table" id="local_morans_i_h3_table"></a>

```sql
LOCAL_MORANS_I_H3_TABLE(input, output_table, index_col, value_col, size, decay, permutations)
```

**Description**

This procedure computes the local Moran's I spatial autocorrelation from the input table with H3 indexes. For each cell it outputs the H3 `index`, the local Moran's I value (`value`), the simulated p-value (`psim`), the conditional randomization null expectation and variance (`EIc`, `VIc`), the total randomization null expectation and variance (`EI`, `VI`), and the quadrant classification `quad` (HH=1, LL=2, LH=3, HL=4).

**Input parameters**

* `input`: `STRING` the query to the data used to compute the coefficient. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the H3 indexes.
* `value_col`: `STRING` name of the column with the values for each H3 cell.
* `size`: `INT64` size of the H3 *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.
* `permutations`: `INT64` number of permutations for the estimation of p-value.

{% hint style="warning" %}
The `index_col` cannot contain NULL values; otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index`: `STRING` H3 index.
* `value`: `FLOAT64` local Moran's I spatial autocorrelation.
* `psim`: `FLOAT64` simulated p value.
* `EIc`: `FLOAT64` conditional randomization null - expectation.
* `VIc`: `FLOAT64` conditional randomization null - variance.
* `EI`: `FLOAT64` total randomization null - expectation.
* `VI`: `FLOAT64` total randomization null - variance.
* `quad`: `INT64` HH=1, LL=2, LH=3, HL=4.
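
The `quad` column classifies each cell by whether its value and its neighborhood average (spatial lag) fall above or below their respective means. The toolbox does not document how it handles boundary cases, so this sketch assigns ties to the "high" side:

```python
def classify_quads(values, lag):
    """Quadrant per cell: HH=1, LL=2, LH=3, HL=4 (ties treated as 'high')."""
    vm = sum(values) / len(values)
    lm = sum(lag) / len(lag)
    quads = []
    for v, l in zip(values, lag):
        if v >= vm and l >= lm:
            quads.append(1)  # high value, high neighborhood (hot spot)
        elif v < vm and l < lm:
            quads.append(2)  # low value, low neighborhood (cold spot)
        elif v < vm:
            quads.append(3)  # low value among high neighbors (outlier)
        else:
            quads.append(4)  # high value among low neighbors (outlier)
    return quads
```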

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.LOCAL_MORANS_I_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    3,
    'exponential',
    100
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.LOCAL_MORANS_I_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    3,
    'exponential',
    100
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.LOCAL_MORANS_I_H3_TABLE(
    '<my-project>.<my-dataset>.<my-h3-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'value',
    3,
    'exponential',
    100
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## LOCAL\_MORANS\_I\_H3 <a href="#local_morans_i_h3" id="local_morans_i_h3"></a>

```sql
LOCAL_MORANS_I_H3(input, size, decay, permutations)
```

**Description**

This function computes the local Moran's I spatial autocorrelation from the input array of H3 indexes. For each cell it outputs the H3 `index`, the local Moran's I value (`value`), the simulated p-value (`psim`), the conditional randomization null expectation and variance (`EIc`, `VIc`), the total randomization null expectation and variance (`EI`, `VI`), and the quadrant classification `quad` (HH=1, LL=2, LH=3, HL=4).

**Input parameters**

* `input`: `ARRAY<STRUCT<index STRING, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the H3 *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.
* `permutations`: `INT64` number of permutations for the estimation of p-value.

{% hint style="warning" %}
The `input` cannot contain NULL index values; otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`ARRAY<STRUCT<index STRING, value FLOAT64, psim FLOAT64, EIc FLOAT64, VIc FLOAT64, EI FLOAT64, VI FLOAT64, quad INT64>>`

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.LOCAL_MORANS_I_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('8939446033bffff', 28.0),
        STRUCT('8939446032bffff', 19.0)
    ],
    3, 'exponential', 100
);
    "index": "8939446032bffff",
    "value": "-0.342921256629947",
    "psim": "0.0099009900990099011",
    "EIc": "-1.0287637698898404",
    "VIc": "0.0",
    "EI": "-1.0",
    "VI": "-0.64721503525401447",
    "quad": "3"
-- {
--   "index": "8939446032bffff",
--   "value": "-0.342921256629947",
--   "psim": "0.0099009900990099011",
--   "EIc": "-1.0287637698898404",
--   "VIc": "0.0",
--   "EI": "-1.0",
--   "VI": "-0.64721503525401447",
--   "quad": "3"
-- },
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.LOCAL_MORANS_I_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('8939446033bffff', 28.0),
        STRUCT('8939446032bffff', 19.0)
    ],
    3, 'exponential', 100
);
    "index": "8939446032bffff",
    "value": "-0.342921256629947",
    "psim": "0.0099009900990099011",
    "EIc": "-1.0287637698898404",
    "VIc": "0.0",
    "EI": "-1.0",
    "VI": "-0.64721503525401447",
    "quad": "3"
-- {
--   "index": "8939446032bffff",
--   "value": "-0.342921256629947",
--   "psim": "0.0099009900990099011",
--   "EIc": "-1.0287637698898404",
--   "VIc": "0.0",
--   "EI": "-1.0",
--   "VI": "-0.64721503525401447",
--   "quad": "3"
-- },
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.LOCAL_MORANS_I_H3(
    [
        STRUCT('89394460323ffff', 51.0),
        STRUCT('8939446033bffff', 28.0),
        STRUCT('8939446032bffff', 19.0)
    ],
    3, 'exponential', 100
);
    "index": "8939446032bffff",
    "value": "-0.342921256629947",
    "psim": "0.0099009900990099011",
    "EIc": "-1.0287637698898404",
    "VIc": "0.0",
    "EI": "-1.0",
    "VI": "-0.64721503525401447",
    "quad": "3"
-- {
--   "index": "8939446032bffff",
--   "value": "-0.342921256629947",
--   "psim": "0.0099009900990099011",
--   "EIc": "-1.0287637698898404",
--   "VIc": "0.0",
--   "EI": "-1.0",
--   "VI": "-0.64721503525401447",
--   "quad": "3"
-- },
-- ...
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.LOCAL_MORANS_I_H3(
    ARRAY(SELECT AS STRUCT index, value FROM mytable),
    3, 'exponential', 100
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.LOCAL_MORANS_I_H3(
    ARRAY(SELECT AS STRUCT index, value FROM mytable),
    3, 'exponential', 100
);
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.LOCAL_MORANS_I_H3(
    ARRAY(SELECT AS STRUCT index, value FROM mytable),
    3, 'exponential', 100
);
```

{% endtab %}
{% endtabs %}

## LOCAL\_MORANS\_I\_QUADBIN\_TABLE <a href="#local_morans_i_quadbin_table" id="local_morans_i_quadbin_table"></a>

```sql
LOCAL_MORANS_I_QUADBIN_TABLE(input, output_table, index_col, value_col, size, decay, permutations)
```

**Description**

This procedure computes the local Moran's I spatial autocorrelation from the input table with Quadbin indexes. For each cell it outputs the Quadbin `index`, the local Moran's I value (`value`), the simulated p-value (`psim`), the conditional randomization null expectation and variance (`EIc`, `VIc`), the total randomization null expectation and variance (`EI`, `VI`), and the quadrant classification `quad` (HH=1, LL=2, LH=3, HL=4).

**Input parameters**

* `input`: `STRING` the query to the data used to compute the coefficient. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_col`: `STRING` name of the column with the Quadbin indexes.
* `value_col`: `STRING` name of the column with the values for each Quadbin cell.
* `size`: `INT64` size of the Quadbin *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.
* `permutations`: `INT64` number of permutations for the estimation of p-value.

{% hint style="warning" %}
The `index_col` cannot contain NULL values; otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index`: `INT64` Quadbin index.
* `value`: `FLOAT64` local Moran's I spatial autocorrelation.
* `psim`: `FLOAT64` simulated p value.
* `EIc`: `FLOAT64` conditional randomization null - expectation.
* `VIc`: `FLOAT64` conditional randomization null - variance.
* `EI`: `FLOAT64` total randomization null - expectation.
* `VI`: `FLOAT64` total randomization null - variance.
* `quad`: `INT64` HH=1, LL=2, LH=3, HL=4.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.LOCAL_MORANS_I_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    3,
    'exponential',
    100
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.LOCAL_MORANS_I_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    3,
    'exponential',
    100
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.LOCAL_MORANS_I_QUADBIN_TABLE(
    '<my-project>.<my-dataset>.<my-quadbin-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'quadbin',
    'value',
    3,
    'exponential',
    100
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## LOCAL\_MORANS\_I\_QUADBIN <a href="#local_morans_i_quadbin" id="local_morans_i_quadbin"></a>

```sql
LOCAL_MORANS_I_QUADBIN(input, size, decay, permutations)
```

**Description**

This function computes the local Moran's I spatial autocorrelation from the input array of Quadbin indexes. For each cell it outputs the Quadbin `index`, the local Moran's I value (`value`), the simulated p-value (`psim`), the conditional randomization null expectation and variance (`EIc`, `VIc`), the total randomization null expectation and variance (`EI`, `VI`), and the quadrant classification `quad` (HH=1, LL=2, LH=3, HL=4).

**Input parameters**

* `input`: `ARRAY<STRUCT<index INT64, value FLOAT64>>` input data with the indexes and values of the cells.
* `size`: `INT64` size of the Quadbin *k-ring* (distance from the origin). This defines the area around each index cell where the distance decay will be applied. If no neighboring cells are found, the weight of the corresponding index cell is set to zero.
* `decay`: `STRING` decay function to compute the [distance decay](https://en.wikipedia.org/wiki/Distance_decay). Available functions are: uniform, inverse, inverse\_square and exponential.
* `permutations`: `INT64` number of permutations for the estimation of p-value.

{% hint style="warning" %}
The `input` cannot contain NULL index values; otherwise an `Invalid input origin` error will be returned.
{% endhint %}

**Return type**

`ARRAY<STRUCT<index INT64, value FLOAT64, psim FLOAT64, EIc FLOAT64, VIc FLOAT64, EI FLOAT64, VI FLOAT64, quad INT64>>`

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.LOCAL_MORANS_I_QUADBIN(
    [
        STRUCT(5266443791927869439, 51.0),
        STRUCT(5266443791928131583, 28.0),
        STRUCT(5266443791928918015, 19.0)
    ],
    3, 'exponential', 100
);
-- {
--   "index": "5266443791928918015",
--   "value": "-0.076228184845253524",
--   "psim": "0.0099009900990099011",
--   "EIc": "-0.70361240532717062",
--   "VIc": "0.29943435718277039",
--   "EI": "-0.68393972058572117",
--   "VI": "0.19089112237884748",
--   "quad": "3"
-- },
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.LOCAL_MORANS_I_QUADBIN(
    [
        STRUCT(5266443791927869439, 51.0),
        STRUCT(5266443791928131583, 28.0),
        STRUCT(5266443791928918015, 19.0)
    ],
    3, 'exponential', 100
);
-- {
--   "index": "5266443791928918015",
--   "value": "-0.076228184845253524",
--   "psim": "0.0099009900990099011",
--   "EIc": "-0.70361240532717062",
--   "VIc": "0.29943435718277039",
--   "EI": "-0.68393972058572117",
--   "VI": "0.19089112237884748",
--   "quad": "3"
-- },
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.LOCAL_MORANS_I_QUADBIN(
    [
        STRUCT(5266443791927869439, 51.0),
        STRUCT(5266443791928131583, 28.0),
        STRUCT(5266443791928918015, 19.0)
    ],
    3, 'exponential', 100
);
-- {
--   "index": "5266443791928918015",
--   "value": "-0.076228184845253524",
--   "psim": "0.0099009900990099011",
--   "EIc": "-0.70361240532717062",
--   "VIc": "0.29943435718277039",
--   "EI": "-0.68393972058572117",
--   "VI": "0.19089112237884748",
--   "quad": "3"
-- },
-- ...
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT `carto-un`.carto.LOCAL_MORANS_I_QUADBIN(
    ARRAY(SELECT AS STRUCT index, value FROM mytable),
    3, 'exponential', 100
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT `carto-un-eu`.carto.LOCAL_MORANS_I_QUADBIN(
    ARRAY(SELECT AS STRUCT index, value FROM mytable),
    3, 'exponential', 100
);
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT carto.LOCAL_MORANS_I_QUADBIN(
    ARRAY(SELECT AS STRUCT index, value FROM mytable),
    3, 'exponential', 100
);
```

{% endtab %}
{% endtabs %}

## VARIOGRAM <a href="#variogram" id="variogram"></a>

```sql
VARIOGRAM(inputsample, n_bins, max_distance, model)
```

**Description**

This function computes the [Variogram](https://en.wikipedia.org/wiki/Variogram) from the input array of points and their associated values.

It returns a STRUCT with the parameters of the variogram, the *x* values, the *y* values, the predicted *y* values and the number of values aggregated per bin.

**Input parameters**

* `inputsample`: `ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>` input array with the points and their associated values.
* `n_bins`: `INT64` number of bins to compute the semivariance.
* `max_distance`: `FLOAT64` maximum distance to compute the semivariance.
* `model`: `STRING` type of model for fitting the semivariance. It can be either:
  * `exponential`: `P0 * (1. - exp(-xi / (P1 / 3.0))) + P2`
  * `spherical`: `P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2`.
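
The two model formulas above can be evaluated directly, where `xi` is a lag distance and `[P0, P1, P2]` are the fitted parameters returned in `variogram_params`. A direct transcription, useful for sanity-checking a fit against the returned `yp` values:

```python
import math

def variogram_model(xs, params, model):
    """Evaluate the fitted semivariance model at the lag distances xs."""
    p0, p1, p2 = params  # [P0, P1, P2] as in variogram_params
    if model == 'exponential':
        return [p0 * (1.0 - math.exp(-xi / (p1 / 3.0))) + p2 for xi in xs]
    if model == 'spherical':
        return [p1 * (1.5 * (xi / p0) - 0.5 * (xi / p0) ** 3) + p2 for xi in xs]
    raise ValueError(f'unknown model: {model}')
```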

**Return type**

`STRUCT<variogram_params ARRAY<FLOAT64>, x ARRAY<FLOAT64>, y ARRAY<FLOAT64>, yp ARRAY<FLOAT64>, count ARRAY<INT64>>`

where:

* `variogram_params`: array containing the parameters \[P0, P1, P2] fitted to the `model`.
* `x`: array with the *x* values used to fit the `model`.
* `y`: array with the *y* values used to fit the `model`.
* `yp`: array with the *y* values as predicted by the `model`.
* `count`: array with the number of elements aggregated in the bin.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;

-- generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT ST_GEOGPOINT(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
            pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
        FROM
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
        ORDER BY RAND()
        LIMIT 1000);

-- compute parameters of the variogram
SELECT `carto-un`.carto.VARIOGRAM(sample_points, 20, 1.0E5, 'exponential');
-- {
--   variogram_params: [1.8656766501394384, 9890263.713521793, -0.007675798653736552],
--   x: [13433.902872564133, 20772.802451664986, 56973.516169567, 67627.90034684369, 70363.43483710312, 78689.64706974, ...],
--   y: [0.005, 0.125, 3.125, 3.380, 2.0, 2.205, ...],
--   yp: [-0.14889750150153813, 0.49581158712413576, 2.351461086006329, 2.635658071286461, 2.696612846710653, 2.857216896041544, ...],
--   count: [162, 308, 328, 326, 312, 305, ...]
-- }
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;

-- generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT ST_GEOGPOINT(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
            pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
        FROM
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
        ORDER BY RAND()
        LIMIT 1000);

-- compute parameters of the variogram
SELECT `carto-un-eu`.carto.VARIOGRAM(sample_points, 20, 1.0E5, 'exponential');
-- {
--   variogram_params: [1.8656766501394384, 9890263.713521793, -0.007675798653736552],
--   x: [13433.902872564133, 20772.802451664986, 56973.516169567, 67627.90034684369, 70363.43483710312, 78689.64706974, ...],
--   y: [0.005, 0.125, 3.125, 3.380, 2.0, 2.205, ...],
--   yp: [-0.14889750150153813, 0.49581158712413576, 2.351461086006329, 2.635658071286461, 2.696612846710653, 2.857216896041544, ...],
--   count: [162, 308, 328, 326, 312, 305, ...]
-- }
```

{% endtab %}

{% tab title="manual" %}

```sql
DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;

-- generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT ST_GEOGPOINT(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
            pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
        FROM
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
        ORDER BY RAND()
        LIMIT 1000);

-- compute parameters of the variogram
SELECT carto.VARIOGRAM(sample_points, 20, 1.0E5, 'exponential');
-- {
--   variogram_params: [1.8656766501394384, 9890263.713521793, -0.007675798653736552],
--   x: [13433.902872564133, 20772.802451664986, 56973.516169567, 67627.90034684369, 70363.43483710312, 78689.64706974, ...],
--   y: [0.005, 0.125, 3.125, 3.380, 2.0, 2.205, ...],
--   yp: [-0.14889750150153813, 0.49581158712413576, 2.351461086006329, 2.635658071286461, 2.696612846710653, 2.857216896041544, ...],
--   count: [162, 308, 328, 326, 312, 305, ...]
-- }
```

{% endtab %}
{% endtabs %}

## ORDINARY\_KRIGING\_TABLE <a href="#ordinary_kriging_table" id="ordinary_kriging_table"></a>

```sql
ORDINARY_KRIGING_TABLE(input_table, interp_table, target_table, n_bins, max_distance, n_neighbors, model)
```

**Description**

This procedure uses [Ordinary kriging](https://en.wikipedia.org/wiki/Kriging) to compute the interpolated values of a set of points stored in a table, given another set of points with known associated values.

**Input parameters**

* `input_table`: `STRING` name of the input table with the sample points, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `interp_table`: `STRING` name of the interpolation points table, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `target_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists. If NULL, the result will be returned by the procedure and won't be persisted.
* `n_bins`: `INT64` number of bins to compute the semivariance.
* `max_distance`: `FLOAT64` maximum distance to compute the semivariance.
* `n_neighbors`: `INT64` maximum number of neighbors of a point to be taken into account for interpolation.
* `model`: `STRING` type of model for fitting the semivariance. It can be either:
  * `exponential`: `P0 * (1. - exp(-xi / (P1 / 3.0))) + P2`
  * `spherical`: `P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2`.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.ORDINARY_KRIGING_TABLE(
         '<my-project>.<my-dataset>.<my-nasadem-jp-extract>',
         '<my-project>.<my-dataset>.<my-interp-points>',
         NULL,
         50,
         1000,
         20,
         'exponential');
-- {"point": "POINT(142.4277 43.51606)", "value": "288.531297133198"},
-- {"point": "POINT(142.4181 43.50518)", "value": "306.62910397500843"},
-- {"point": "POINT(142.4175 43.5045)", "value": "306.9708080004128"},
-- {"point": "POINT(142.4121 43.49838)", "value": "328.37518451985943"},
-- {"point": "POINT(142.4172 43.50416)", "value": "307.1771955935104"},
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.ORDINARY_KRIGING_TABLE(
         '<my-project>.<my-dataset>.<my-nasadem-jp-extract>',
         '<my-project>.<my-dataset>.<my-interp-points>',
         NULL,
         50,
         1000,
         20,
         'exponential');
-- {"point": "POINT(142.4277 43.51606)", "value": "288.531297133198"},
-- {"point": "POINT(142.4181 43.50518)", "value": "306.62910397500843"},
-- {"point": "POINT(142.4175 43.5045)", "value": "306.9708080004128"},
-- {"point": "POINT(142.4121 43.49838)", "value": "328.37518451985943"},
-- {"point": "POINT(142.4172 43.50416)", "value": "307.1771955935104"},
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.ORDINARY_KRIGING_TABLE(
         '<my-project>.<my-dataset>.<my-nasadem-jp-extract>',
         '<my-project>.<my-dataset>.<my-interp-points>',
         NULL,
         50,
         1000,
         20,
         'exponential');
-- {"point": "POINT(142.4277 43.51606)", "value": "288.531297133198"},
-- {"point": "POINT(142.4181 43.50518)", "value": "306.62910397500843"},
-- {"point": "POINT(142.4175 43.5045)", "value": "306.9708080004128"},
-- {"point": "POINT(142.4121 43.49838)", "value": "328.37518451985943"},
-- {"point": "POINT(142.4172 43.50416)", "value": "307.1771955935104"},
-- ...
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
**Additional examples**

* [Interpolating elevation along a road using kriging](https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/interpolating-elevation-along-a-road-using-kriging)
{% endhint %}

## ORDINARY\_KRIGING <a href="#ordinary_kriging" id="ordinary_kriging"></a>

```sql
ORDINARY_KRIGING(inputsample, origin, max_distance, variogram_params, n_neighbors, model)
```

**Description**

This function uses [Ordinary kriging](https://en.wikipedia.org/wiki/Kriging) to compute the interpolated values of an array of points, given another array of points with known associated values and a variogram. This variogram can be computed with the [VARIOGRAM](#variogram) function.

**Input parameters**

* `inputsample`: `ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>` input array with the sample points and their values.
* `origin`: `ARRAY<GEOGRAPHY>` input array with the points whose values will be interpolated.
* `max_distance`: `FLOAT64` maximum distance to compute the semivariance.
* `variogram_params`: `ARRAY<FLOAT64>` parameters \[P0, P1, P2] of the variogram model.
* `n_neighbors`: `INT64` maximum number of neighbors of a point to be taken into account for interpolation.
* `model`: `STRING` type of model for fitting the semivariance. It can be either `exponential` or `spherical` and it should be the same type of model as the one used to compute the variogram:
  * `exponential`: `P0 * (1. - exp(-xi / (P1 / 3.0))) + P2`
  * `spherical`: `P1 * (1.5 * (xi / P0) - 0.5 * (xi / P0)**3) + P2`.

**Return type**

`ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>`
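
For a single target point, ordinary kriging solves a small linear system built from the variogram: semivariances between the sample points, augmented with a Lagrange multiplier so the weights sum to one. A planar, pure-Python sketch of that system (the BigQuery function works on geographies and great-circle distances; `solve` is a plain Gaussian elimination included only for self-containment):

```python
import math

def solve(a, b):
    """Dense linear solve via Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def kriging_point(sample, target, params, model='exponential'):
    """Interpolate one target from [((x, y), value), ...] samples."""
    p0, p1, p2 = params  # variogram parameters, see VARIOGRAM

    def gamma(d):  # semivariance at distance d, same models as VARIOGRAM
        if model == 'exponential':
            return p0 * (1.0 - math.exp(-d / (p1 / 3.0))) + p2
        return p1 * (1.5 * (d / p0) - 0.5 * (d / p0) ** 3) + p2

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    pts = [p for p, _ in sample]
    n = len(pts)
    # kriging system: gamma between samples, plus the unbiasedness constraint
    a = [[gamma(dist(pts[i], pts[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(dist(p, target)) for p in pts] + [1.0]
    w = solve(a, b)  # last entry is the Lagrange multiplier
    return sum(wi * v for wi, (_, v) in zip(w, sample))
```

A target midway between two equal-distance samples receives weight 0.5 from each, i.e. the plain average of their values.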

**Examples**

Here is a standalone example:

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT
  `carto-un`.carto.ORDINARY_KRIGING(
             [STRUCT(ST_GEOGPOINT(0.26,1.02) AS point, 1.0 AS value),
              STRUCT(ST_GEOGPOINT(0.91,0.74) AS point, 3.1 AS value),
              STRUCT(ST_GEOGPOINT(-0.59,0.51) AS point, 1.5 AS value),
              STRUCT(ST_GEOGPOINT(0.86,0.92) AS point, 3.6 AS value),
              STRUCT(ST_GEOGPOINT(0.37,1.07) AS point, 1.1 AS value),
              STRUCT(ST_GEOGPOINT(0.69,-0.52) AS point, 1.2 AS value)],
             [ST_GEOGPOINT(0.,0.),
              ST_GEOGPOINT(0.,1.)],
             1.0E5,
             [0.1,1E8,0.1],
             20,
             'exponential')
-- {"point": "POINT(0 0)", "value": "1.357680916212768"},
-- {"point": "POINT(0 1)", "value": "1.07161192146499"}
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT
  `carto-un-eu`.carto.ORDINARY_KRIGING(
             [STRUCT(ST_GEOGPOINT(0.26,1.02) AS point, 1.0 AS value),
              STRUCT(ST_GEOGPOINT(0.91,0.74) AS point, 3.1 AS value),
              STRUCT(ST_GEOGPOINT(-0.59,0.51) AS point, 1.5 AS value),
              STRUCT(ST_GEOGPOINT(0.86,0.92) AS point, 3.6 AS value),
              STRUCT(ST_GEOGPOINT(0.37,1.07) AS point, 1.1 AS value),
              STRUCT(ST_GEOGPOINT(0.69,-0.52) AS point, 1.2 AS value)],
             [ST_GEOGPOINT(0.,0.),
              ST_GEOGPOINT(0.,1.)],
             1.0E5,
             [0.1,1E8,0.1],
             20,
             'exponential')
-- {"point": "POINT(0 0)", "value": "1.357680916212768"},
-- {"point": "POINT(0 1)", "value": "1.07161192146499"}
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT
  carto.ORDINARY_KRIGING(
             [STRUCT(ST_GEOGPOINT(0.26,1.02) AS point, 1.0 AS value),
              STRUCT(ST_GEOGPOINT(0.91,0.74) AS point, 3.1 AS value),
              STRUCT(ST_GEOGPOINT(-0.59,0.51) AS point, 1.5 AS value),
              STRUCT(ST_GEOGPOINT(0.86,0.92) AS point, 3.6 AS value),
              STRUCT(ST_GEOGPOINT(0.37,1.07) AS point, 1.1 AS value),
              STRUCT(ST_GEOGPOINT(0.69,-0.52) AS point, 1.2 AS value)],
             [ST_GEOGPOINT(0.,0.),
              ST_GEOGPOINT(0.,1.)],
             1.0E5,
             [0.1,1E8,0.1],
             20,
             'exponential')
-- {"point": "POINT(0 0)", "value": "1.357680916212768"},
-- {"point": "POINT(0 1)", "value": "1.07161192146499"}
```

{% endtab %}
{% endtabs %}

Here is an example using the `ORDINARY_KRIGING` function along with a `VARIOGRAM` estimation:

{% tabs %}
{% tab title="carto-un" %}

```sql

DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;
DECLARE variogram_output STRUCT<params ARRAY<FLOAT64>, x ARRAY<FLOAT64>, y ARRAY<FLOAT64>, yp ARRAY<FLOAT64>, count ARRAY<INT64>>;
DECLARE interp_points ARRAY<GEOGRAPHY>;

-- Generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT ST_GEOGPOINT(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
            pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
        FROM
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
        ORDER BY RAND()
        LIMIT 1000);

-- Compute parameters of the variogram
SET variogram_output = `carto-un`.carto.VARIOGRAM(sample_points, 20, 1.0E5, 'spherical');

-- Generate the points to be interpolated
SET interp_points = ARRAY(SELECT ST_GEOGPOINT(lon_sqrt,lat_sqrt) point
        FROM
            UNNEST(GENERATE_ARRAY(-5,5,0.25)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-5,5,0.25)) lat_sqrt
            );

-- Calculate interpolated values
SELECT
  point, value
FROM
  UNNEST(`carto-un`.carto.ORDINARY_KRIGING(
         sample_points,
         interp_points,
         1.0E5,
         variogram_output.params,
         20,
         'spherical')) WITH OFFSET pos
ORDER BY pos

-- {"point": POINT(-5 -5), "value": 0.568294714734378},
-- {"point": POINT(-5 -4.75), "value": 0.8303238799265198},
-- {"point": POINT(-5 -4.5), "value": 0.8876712348264676},
-- {"point": POINT(-5 -4.25), "value": 0.7437099678173889},
-- {"point": POINT(-5 -4), "value": 0.5543380644791405},
-- {"point": POINT(-5 -3.75), "value": 0.45182050244159944}
-- ...
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql

DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;
DECLARE variogram_output STRUCT<params ARRAY<FLOAT64>, x ARRAY<FLOAT64>, y ARRAY<FLOAT64>, yp ARRAY<FLOAT64>, count ARRAY<INT64>>;
DECLARE interp_points ARRAY<GEOGRAPHY>;

-- Generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT ST_GEOGPOINT(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
            pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
        FROM
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
        ORDER BY RAND()
        LIMIT 1000);

-- Compute parameters of the variogram
SET variogram_output = `carto-un-eu`.carto.VARIOGRAM(sample_points, 20, 1.0E5, 'spherical');

-- Generate the points to be interpolated
SET interp_points = ARRAY(SELECT ST_GEOGPOINT(lon_sqrt,lat_sqrt) point
        FROM
            UNNEST(GENERATE_ARRAY(-5,5,0.25)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-5,5,0.25)) lat_sqrt
            );

-- Calculate interpolated values
SELECT
  point, value
FROM
  UNNEST(`carto-un-eu`.carto.ORDINARY_KRIGING(
         sample_points,
         interp_points,
         1.0E5,
         variogram_output.params,
         20,
         'spherical')) WITH OFFSET pos
ORDER BY pos

-- {"point": POINT(-5 -5), "value": 0.568294714734378},
-- {"point": POINT(-5 -4.75), "value": 0.8303238799265198},
-- {"point": POINT(-5 -4.5), "value": 0.8876712348264676},
-- {"point": POINT(-5 -4.25), "value": 0.7437099678173889},
-- {"point": POINT(-5 -4), "value": 0.5543380644791405},
-- {"point": POINT(-5 -3.75), "value": 0.45182050244159944}
-- ...
```

{% endtab %}

{% tab title="manual" %}

```sql

DECLARE sample_points ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>;
DECLARE variogram_output STRUCT<params ARRAY<FLOAT64>, x ARRAY<FLOAT64>, y ARRAY<FLOAT64>, yp ARRAY<FLOAT64>, count ARRAY<INT64>>;
DECLARE interp_points ARRAY<GEOGRAPHY>;

-- Generate the spatially correlated values
SET sample_points = ARRAY(SELECT AS STRUCT ST_GEOGPOINT(lon_sqrt+0.1*RAND(),lat_sqrt+0.1*RAND()) point,
            pow(sin(lon_sqrt)*sin(lat_sqrt),2)+0.1*RAND() value
        FROM
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-10,10,0.1)) lat_sqrt
        ORDER BY RAND()
        LIMIT 1000);

-- Compute parameters of the variogram
SET variogram_output = carto.VARIOGRAM(sample_points, 20, 1.0E5, 'spherical');

-- Generate the points to be interpolated
SET interp_points = ARRAY(SELECT ST_GEOGPOINT(lon_sqrt,lat_sqrt) point
        FROM
            UNNEST(GENERATE_ARRAY(-5,5,0.25)) lon_sqrt,
            UNNEST(GENERATE_ARRAY(-5,5,0.25)) lat_sqrt
            );

-- Calculate interpolated values
SELECT
  point, value
FROM
  UNNEST(carto.ORDINARY_KRIGING(
         sample_points,
         interp_points,
         1.0E5,
         variogram_output.params,
         20,
         'spherical')) WITH OFFSET pos
ORDER BY pos;

-- {"point": POINT(-5 -5), "value": 0.568294714734378},
-- {"point": POINT(-5 -4.75), "value": 0.8303238799265198},
-- {"point": POINT(-5 -4.5), "value": 0.8876712348264676},
-- {"point": POINT(-5 -4.25), "value": 0.7437099678173889},
-- {"point": POINT(-5 -4), "value": 0.5543380644791405},
-- {"point": POINT(-5 -3.75), "value": 0.45182050244159944}
-- ...
```

{% endtab %}
{% endtabs %}

## IDW <a href="#idw" id="idw"></a>

```sql
IDW(inputsample, origin, maxdistance, n_neighbors, p)
```

**Description**

This function performs [Inverse Distance Weighted (IDW)](https://en.wikipedia.org/wiki/Inverse_distance_weighting) interpolation: the value at each target location is estimated as a weighted average of the values of the nearest sample points, with weights proportional to the inverse of the distance raised to the power `p`. The user can select the maximum number of neighbors used for the interpolation, the maximum distance between target points and neighbors, and the power `p` of the weights.
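
A minimal sketch of this weighting scheme, on planar coordinates for simplicity (the function itself operates on `GEOGRAPHY` points with geodesic distances; the helper name and data layout below are hypothetical):

```python
import math

def idw(sample, origin, maxdistance, n_neighbors, p):
    """Inverse Distance Weighted interpolation on planar points.

    sample: list of ((x, y), value); origin: list of (x, y) targets.
    Illustrative only -- the CARTO function uses geodesic distances.
    """
    out = []
    for ox, oy in origin:
        # collect sampling points within maxdistance of the target
        cand = []
        for (sx, sy), v in sample:
            d = math.hypot(sx - ox, sy - oy)
            if d <= maxdistance:
                cand.append((d, v))
        cand.sort()                       # nearest first
        cand = cand[:n_neighbors]         # keep at most n_neighbors
        if cand and cand[0][0] == 0:      # exact hit: reuse the sample value
            out.append(((ox, oy), cand[0][1]))
            continue
        # weights are 1 / d^p; the estimate is the weighted mean
        wsum = sum(1 / d**p for d, _ in cand)
        vsum = sum(v / d**p for d, v in cand)
        out.append(((ox, oy), vsum / wsum if cand else None))
    return out
```

For a target equidistant from two samples, the estimate is simply their average, regardless of `p`.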

**Input parameters**

* `inputsample`: `ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>` Input array with the sample points and their values.
* `origin`: `ARRAY<GEOGRAPHY>` Input array with the points whose values will be interpolated.
* `maxdistance`: `FLOAT64` Maximum distance allowed between an interpolation point and the sampling points used to estimate its value.
* `n_neighbors`: `INT64` Maximum number of sampling points to be considered for the interpolation.
* `p`: `FLOAT64` Power of distance.

**Return type**

`ARRAY<STRUCT<point GEOGRAPHY, value FLOAT64>>`

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
SELECT
      point, value
    FROM
      UNNEST(`carto-un`.carto.IDW(
             ARRAY(SELECT AS STRUCT point, value FROM `<my-project>.<my-dataset>.<my-kriging-sample-points>`),
             ARRAY(SELECT point FROM `<my-project>.<my-dataset>.<my-kriging-interp-points>`),
             1.0E5,
             20,
             2)) WITH OFFSET pos
    ORDER BY pos
-- TODO
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
SELECT
      point, value
    FROM
      UNNEST(`carto-un-eu`.carto.IDW(
             ARRAY(SELECT AS STRUCT point, value FROM `<my-project>.<my-dataset>.<my-kriging-sample-points>`),
             ARRAY(SELECT point FROM `<my-project>.<my-dataset>.<my-kriging-interp-points>`),
             1.0E5,
             20,
             2)) WITH OFFSET pos
    ORDER BY pos
-- TODO
```

{% endtab %}

{% tab title="manual" %}

```sql
SELECT
      point, value
    FROM
      UNNEST(carto.IDW(
             ARRAY(SELECT AS STRUCT point, value FROM `<my-project>.<my-dataset>.<my-kriging-sample-points>`),
             ARRAY(SELECT point FROM `<my-project>.<my-dataset>.<my-kriging-interp-points>`),
             1.0E5,
             20,
             2)) WITH OFFSET pos
    ORDER BY pos
-- TODO
```

{% endtab %}
{% endtabs %}

## SMOOTHING\_MRF\_H3 <a href="#smoothing_mrf_h3" id="smoothing_mrf_h3"></a>

```sql
SMOOTHING_MRF_H3(input, output, index_column, variable_column, options)
```

**Description**

This procedure computes a Markov Random Field (MRF) smoothing for a table containing H3 cell indexes and their associated values.

This implementation is based on the work of Christopher J. Paciorek: "Spatial models for point and areal data using Markov random fields on a fine grid." Electron. J. Statist. 7 946 - 972, 2013. <https://doi.org/10.1214/13-EJS791>

{% hint style="info" %}
**tip**

If your data is in lat/long format, you can still use this procedure by first converting your points to H3 cell indexes with the [H3\_FROMLONGLAT](https://docs.carto.com/data-and-analysis/analytics-toolbox-for-bigquery/h3#h3_fromlonglat) function.
{% endhint %}

**Input parameters**

* `input`: `STRING` name of the input table, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists. If NULL, the result will be returned directly by the procedure and not persisted.
* `index_column`: `STRING` name of the column containing the cell ids.
* `variable_column`: `STRING` name of the target variable column.
* `options`: `STRING` JSON string to overwrite the model's default options. If set to NULL or empty, it will use the default values.
  * `closing_distance`: `INT64` distance of closing. It defaults to 0. If strictly positive, the algorithm performs a [morphological closing](https://en.wikipedia.org/wiki/Closing_\(morphology\)) on the cells by the `closing_distance`, defined in number of cells, before performing the smoothing. No closing is performed otherwise.
  * `output_closing_cell`: `BOOL` controls whether the cells generated by the closing are added to the output. It defaults to `FALSE`.
  * `lambda`: `FLOAT64` iteration update factor. It defaults to 1.6. For more details, see <https://doi.org/10.1214/13-EJS791>, page 963.
  * `iter`: `INT64` number of iterative queries used to perform the smoothing. It defaults to 10. Increasing this parameter might help if the `convergence_limit` is not reached by the end of the procedure's execution. Tip: if this limit has been reached, the status of the second-to-last step of the procedure will throw an error.
  * `intra_iter`: `INT64` number of iterations per query. It defaults to 50. Reducing this parameter might help if a resource error is reached during the procedure's execution.
  * `convergence_limit`: `FLOAT64` threshold condition to stop iterations. If this threshold is not reached, then the procedure will finish its execution after the maximum number of iterations (`iter`) is reached. It defaults to 10e-5. For more details, see <https://doi.org/10.1214/13-EJS791>, page 963.

**Return type**

`FLOAT64`
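
As a rough illustration of the iterative smoothing that the `lambda`, `iter`, `intra_iter` and `convergence_limit` options control, the toy sketch below moves every cell toward the mean of its adjacent cells by a factor `lam` each sweep, stopping once the largest per-cell change falls below the convergence threshold. This is a schematic only: the actual procedure implements the full MRF estimator of Paciorek (2013) in SQL over the H3 adjacency graph, and the helper name and graph encoding here are hypothetical:

```python
def mrf_smooth(values, neighbors, lam=1.0, iters=50, convergence_limit=1e-5):
    """Toy iterative smoothing on a cell adjacency graph.

    values: dict cell_id -> value; neighbors: dict cell_id -> adjacent ids.
    """
    beta = dict(values)
    for _ in range(iters):
        delta = 0.0
        new = {}
        for cell, b in beta.items():
            nb = neighbors.get(cell, [])
            m = sum(beta[n] for n in nb) / len(nb) if nb else b
            new[cell] = b + lam * (m - b)     # relax toward the neighbor mean
            delta = max(delta, abs(new[cell] - b))
        beta = new
        if delta < convergence_limit:         # stop once updates are tiny
            break
    return beta
```

On a connected graph with `lam < 1`, the values converge to a degree-weighted average of the inputs, which is the intuition behind the smoothed `beta` output.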

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.SMOOTHING_MRF_H3( "<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk>",
  NULL,
  'h3_z7',
  'price',
  '{"closing_distance":0, "output_closing_cell":"true", "lambda":1.6, "iter":10, "intra_iter":5, "convergence_limit":10e-5}');
-- {"id": 871f18840ffffff, "beta": 64.56696796809489}
-- {"id": 871f18841ffffff, "beta": 62.61498241759014}
-- {"id": 871f18844ffffff, "beta": 65.47069449331353}
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.SMOOTHING_MRF_H3( "<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk>",
  NULL,
  'h3_z7',
  'price',
  '{"closing_distance":0, "output_closing_cell":"true", "lambda":1.6, "iter":10, "intra_iter":5, "convergence_limit":10e-5}');
-- {"id": 871f18840ffffff, "beta": 64.56696796809489}
-- {"id": 871f18841ffffff, "beta": 62.61498241759014}
-- {"id": 871f18844ffffff, "beta": 65.47069449331353}
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.SMOOTHING_MRF_H3( "<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk>",
  NULL,
  'h3_z7',
  'price',
  '{"closing_distance":0, "output_closing_cell":"true", "lambda":1.6, "iter":10, "intra_iter":5, "convergence_limit":10e-5}');
-- {"id": 871f18840ffffff, "beta": 64.56696796809489}
-- {"id": 871f18841ffffff, "beta": 62.61498241759014}
-- {"id": 871f18844ffffff, "beta": 65.47069449331353}
```

{% endtab %}
{% endtabs %}

## SMOOTHING\_MRF\_QUADBIN <a href="#smoothing_mrf_quadbin" id="smoothing_mrf_quadbin"></a>

```sql
SMOOTHING_MRF_QUADBIN(input, output, index_column, variable_column, options)
```

**Description**

This procedure computes a Markov Random Field (MRF) smoothing for a table containing QUADBIN cell indexes and their associated values.

This implementation is based on the work of Christopher J. Paciorek: "Spatial models for point and areal data using Markov random fields on a fine grid." Electron. J. Statist. 7 946 - 972, 2013. <https://doi.org/10.1214/13-EJS791>

{% hint style="info" %}
**tip**

If your data is in lat/long format, you can still use this procedure by first converting your points to QUADBIN cell indexes with the [QUADBIN\_FROMLONGLAT](https://docs.carto.com/data-and-analysis/analytics-toolbox-for-bigquery/quadbin#quadbin_fromlonglat) function.
{% endhint %}

**Input parameters**

* `input`: `STRING` name of the input table, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists. If NULL, the result will be returned directly by the procedure and not persisted.
* `index_column`: `STRING` name of the column containing the cell ids.
* `variable_column`: `STRING` name of the target variable column.
* `options`: `STRING` JSON string to overwrite the model's default options. If set to NULL or empty, it will use the default values.
  * `closing_distance`: `INT64` distance of closing. It defaults to 0. If strictly positive, the algorithm performs a [morphological closing](https://en.wikipedia.org/wiki/Closing_\(morphology\)) on the cells by the `closing_distance`, defined in number of cells, before performing the smoothing. No closing is performed otherwise.
  * `output_closing_cell`: `BOOL` controls whether the cells generated by the closing are added to the output. It defaults to `FALSE`.
  * `lambda`: `FLOAT64` iteration update factor. It defaults to 1.6. For more details, see <https://doi.org/10.1214/13-EJS791>, page 963.
  * `iter`: `INT64` number of iterative queries used to perform the smoothing. It defaults to 10. Increasing this parameter might help if the `convergence_limit` is not reached by the end of the procedure's execution. Tip: if this limit has been reached, the status of the second-to-last step of the procedure will throw an error.
  * `intra_iter`: `INT64` number of iterations per query. It defaults to 50. Reducing this parameter might help if a resource error is reached during the procedure's execution.
  * `convergence_limit`: `FLOAT64` threshold condition to stop iterations. If this threshold is not reached, then the procedure will finish its execution after the maximum number of iterations (`iter`) is reached. It defaults to 10e-5. For more details, see <https://doi.org/10.1214/13-EJS791>, page 963.

**Return type**

`FLOAT64`

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.SMOOTHING_MRF_QUADBIN(
  '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
  'my-project.my-dataset.my-smoothing-table',
  'qb_z11',
  'price',
  '''{
    "closing_distance": 0,
    "output_closing_cell": "true",
    "lambda": 1.6,
    "iter": 10,
    "intra_iter": 5,
    "convergence_limit": 10e-5
  }'''
);
-- The table `my-project.my-dataset.my-smoothing-table` will be created
-- with columns: id, price_smoothed
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.SMOOTHING_MRF_QUADBIN(
  '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
  'my-project.my-dataset.my-smoothing-table',
  'qb_z11',
  'price',
  '''{
    "closing_distance": 0,
    "output_closing_cell": "true",
    "lambda": 1.6,
    "iter": 10,
    "intra_iter": 5,
    "convergence_limit": 10e-5
  }'''
);
-- The table `my-project.my-dataset.my-smoothing-table` will be created
-- with columns: id, price_smoothed
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.SMOOTHING_MRF_QUADBIN(
  '<my-project>.<my-dataset>.<my-airbnb-berlin-h3-qk-qb>',
  'my-project.my-dataset.my-smoothing-table',
  'qb_z11',
  'price',
  '''{
    "closing_distance": 0,
    "output_closing_cell": "true",
    "lambda": 1.6,
    "iter": 10,
    "intra_iter": 5,
    "convergence_limit": 10e-5
  }'''
);
-- The table `my-project.my-dataset.my-smoothing-table` will be created
-- with columns: id, price_smoothed
```

{% endtab %}
{% endtabs %}

## BUILD\_PCAMIX\_DATA <a href="#build_pcamix_data" id="build_pcamix_data"></a>

```sql
BUILD_PCAMIX_DATA(input_query, index_column, cols_num_arr, cols_cat_arr, cols_ord_arr, output_prefix, options)
```

**Description**

Prepares the input data for the [BUILD\_PCAMIX\_MODEL](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_MODEL.md) procedure.

This procedure is tested against the R package [FactoMineR](https://www.jstatsoft.org/article/view/v025i01), which adopts the Factorial Analysis of Mixed Data (FAMD) method developed by [Pagés (2004)](http://www.numdam.org/article/RSA_2004__52_4_93_0.pdf). The same method is applied here; it generalizes PCA to account for the number of modalities available to each categorical/ordinal variable and for the probabilities of these modalities.

Depending on the variable type, the procedure applies the following transformations to the input data:

* For the numerical variables: standard scale the columns to get their z-scores
* For the categorical variables:
  * One-hot-encode the categorical columns to get their indicator matrix
  * Weight each column by the inverse of the square root of its probability, given by the number of ones in each column (N<sub>s</sub>) divided by the number of observations (N)
  * Center the columns

In this procedure, this method is also extended to ordinal variables by choosing from different encoding methods and applying the corresponding weights. The available options include:

* Categorical encoding: ordinal variables are one-hot encoded and the columns of the resulting indicator matrix are then weighted and centered as in the FAMD method
* Numerical encoding: ordinal variables are treated as numerical variables
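
The transformations above can be sketched as follows (illustrative Python with a hypothetical helper name and row layout; the procedure performs the equivalent operations in SQL). Numerical columns are z-scored; each one-hot indicator column of a categorical variable is weighted by $$1/\sqrt{N_s/N}$$ and then centered:

```python
import math

def famd_transform(rows, num_cols, cat_cols):
    """FAMD-style pre-processing. rows: list of dicts.

    Assumes numerical columns are non-constant (non-zero variance).
    """
    n = len(rows)
    out = [dict() for _ in rows]
    # numerical variables: standard scale to z-scores
    for c in num_cols:
        vals = [r[c] for r in rows]
        mean = sum(vals) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
        for o, v in zip(out, vals):
            o[c] = (v - mean) / std
    # categorical variables: one-hot encode, weight, center
    for c in cat_cols:
        for lev in sorted({r[c] for r in rows}):
            ind = [1.0 if r[c] == lev else 0.0 for r in rows]
            p = sum(ind) / n                  # modality probability N_s / N
            w = 1.0 / math.sqrt(p)            # FAMD weight
            col = [w * x for x in ind]
            m = sum(col) / n
            for o, x in zip(out, col):
                o[f"{c}_{lev}"] = x - m       # center the weighted column
    return out
```

Ordinal variables would follow either branch, depending on the `ordinal_encoding` option.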

**Input parameters**

* `input_query`: `STRING` the query or the fully qualified name of the table containing the input data which will be used to create the PCA model. It must contain all the individual variables specified in `cols_num_arr`, `cols_cat_arr`, and `cols_ord_arr`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier.
* `cols_num_arr`: `ARRAY<STRING>` the array containing the names of the numerical (a.k.a. quantitative) columns. Should be set to NULL if no numerical variables are used.
* `cols_cat_arr`: `ARRAY<STRING>` the array containing the names of the categorical (a.k.a. qualitative) columns. Should be set to NULL if no categorical variables are used.
* `cols_ord_arr`: `ARRAY<STRING>` the array containing the names of the ordinal columns. Should be set to NULL if no ordinal variables are used.
* `output_prefix`: `STRING` destination prefix for the output tables. It must contain the project, dataset and prefix. For example `<my-project>.<my-dataset>.<my-output-prefix>`.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described in the table below.

  | Option                 | Description                                                                                                                                                                                                                                                                                                                                                                            |
  | ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `ordinal_encoding`     | `STRING` the method used to encode ordinal variables. Possible options are CATEGORICAL (DEFAULT): the ordinal variables are treated as categorical variables and are transformed using a [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) scheme; NUMERICAL: the ordinal variables are treated as numerical variables and used as such, without any additional transformation |
  | `new_data_input_query` | `STRING` the query to the data which will be projected in the PCA space to obtain the PC scores. It must contain all the individual variables specified in `cols_num_arr`, `cols_cat_arr`, and `cols_ord_arr`. A qualified table name can be given as well, e.g. `'<my-project>.<my-dataset>.<my-table>'`.                                                                             |

**Return type**

The procedure will output two tables:

* Model data table: contains the transformed data that will be used to create the PCA model. Its name includes the suffix `_model_data`, e.g. `<my-project>.<my-dataset>.<my-output-prefix>_model_data`.
* New data table: contains the transformed data that will be used to derive the PC scores. Its name includes the suffix `_new_data`, e.g. `<my-project>.<my-dataset>.<my-output-prefix>_new_data`.

**Examples**

This example shows the call for input data containing a mix of numerical, categorical, and ordinal variables:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.BUILD_PCAMIX_DATA(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  ['height', 'distance'], 
  ['winters','shadow','tubers','color'],
  ['soil','preference'],
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
      "ordinal_encoding":"NUMERICAL"
  }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-prefix>_model_data` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.BUILD_PCAMIX_DATA(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  ['height', 'distance'], 
  ['winters','shadow','tubers','color'],
  ['soil','preference'],
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
      "ordinal_encoding":"NUMERICAL"
  }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-prefix>_model_data` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.BUILD_PCAMIX_DATA(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  ['height', 'distance'], 
  ['winters','shadow','tubers','color'],
  ['soil','preference'],
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
      "ordinal_encoding":"NUMERICAL"
  }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-prefix>_model_data` will be created
```

{% endtab %}
{% endtabs %}

In this example, new data for which the principal component scores will be predicted is additionally specified through the `new_data_input_query` option:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.BUILD_PCAMIX_DATA(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  ['height', 'distance'], 
  ['winters','shadow','tubers','color'],
  ['soil','preference'],
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
      "ordinal_encoding":"NUMERICAL",
      "new_data_input_query":"SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-new-data-input>`"
  }'''
)
-- The tables `<my-project>.<my-dataset>.<my-output-prefix>_model_data` and `<my-project>.<my-dataset>.<my-output-prefix>_new_data` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.BUILD_PCAMIX_DATA(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  ['height', 'distance'], 
  ['winters','shadow','tubers','color'],
  ['soil','preference'],
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
      "ordinal_encoding":"NUMERICAL",
      "new_data_input_query":"SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-new-data-input>`"
  }'''
)
-- The tables `<my-project>.<my-dataset>.<my-output-prefix>_model_data` and `<my-project>.<my-dataset>.<my-output-prefix>_new_data` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.BUILD_PCAMIX_DATA(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  ['height', 'distance'], 
  ['winters','shadow','tubers','color'],
  ['soil','preference'],
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
      "ordinal_encoding":"NUMERICAL",
      "new_data_input_query":"SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-new-data-input>`"
  }'''
)
-- The tables `<my-project>.<my-dataset>.<my-output-prefix>_model_data` and `<my-project>.<my-dataset>.<my-output-prefix>_new_data` will be created
```

{% endtab %}
{% endtabs %}

## BUILD\_PCAMIX\_MODEL <a href="#build_pcamix_model" id="build_pcamix_model"></a>

```sql
BUILD_PCAMIX_MODEL(input_query, index_column, output_model, options)
```

**Description**

Performs principal component analysis (PCA) of a set of `N` observations described by a mixture of `P` categorical, ordinal and numerical variables, also known as Factorial Analysis of Mixed Data. This procedure includes ordinary principal component analysis (PCA), when all the input variables are numerical, and multiple correspondence analysis (MCA), when all the input variables are categorical, as special cases.

Note that when all the `P` variables are qualitative, the principal component scores are equal to the scores of standard MCA times $$\sqrt{P}$$, and the eigenvalues are equal to the usual MCA eigenvalues times `P`. When all the variables are quantitative, the procedure gives exactly the same results as standard PCA.

<details>

<summary>Technical Note: Factorial Analysis of Mixed Data</summary>

Principal Component Analysis (PCA) is primarily suited for continuous data, for which squared differences are well defined, but it might also be applied to discrete variables (although in this case the results might exhibit some [artifacts](https://www.frontiersin.org/articles/10.3389/fonc.2020.00973/full)). When dealing with categorical or ordinal data, direct application of PCA is not recommended, even if the data has been [one-hot encoded](https://en.wikipedia.org/wiki/One-hot), as done for example in the [PCA](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-pca) method in Google BigQuery. The issue with applying PCA to a table of one-hot encoded data is that the component of the variance associated with a categorical/ordinal variable inherently depends on the number of modalities available to the variable and on the probabilities of these modalities, making it impossible to weight all the input variables equally when maximizing the variance.

The Factorial Analysis of Mixed Data (FAMD) method can be used to deal with a mix of numerical and categorical data and consists in the following steps:

* Transform the input data to standardize the numerical columns and build an indicator matrix for the categorical data scaled to account for the number of modalities available to each category (c.f. [BUILD\_PCAMIX\_DATA](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_DATA.md))
* Apply PCA on the resulting table

</details>
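
A minimal sketch of the final PCA step, assuming the input matrix has already been transformed as described above. The actual procedure delegates this step to a BigQuery ML PCA model; power iteration here merely recovers the leading component of $$X^{T}X$$ for illustration:

```python
def first_principal_component(X, iters=200):
    """Leading principal direction of a centered data matrix X
    (list of rows), found by power iteration on C = X^T X."""
    d = len(X[0])
    # cross-product matrix C = X^T X
    C = [[sum(row[i] * row[j] for row in X) for j in range(d)]
         for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]             # renormalize each iteration
    return v
```

The returned unit vector is defined up to sign, as with any PCA component.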

**Input parameters**

* `input_query`: `STRING` the query to the input data created with the [BUILD\_PCAMIX\_DATA](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_DATA.md) procedure. A qualified table name can be given as well, e.g. `'<my-project>.<my-dataset>.<my-table>'`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier.
* `output_model`: `STRING` the name for the output model. It should include project and dataset, e.g. `'<my-project>.<my-dataset>.<my-model-name>'`.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described in the table below.

  | Option                         | Description                                                                                                                                                                                                  |
  | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
  | `NUM_PRINCIPAL_COMPONENTS`     | `INT64` Number of principal components to keep as defined in [BigQuery ML CREATE MODEL statement for PCA models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-pca) |
  | `PCA_EXPLAINED_VARIANCE_RATIO` | `FLOAT64` as defined in [BigQuery ML CREATE MODEL statement for PCA models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-pca)                                      |
  | `PCA_SOLVER`                   | `STRING` as defined in [BigQuery ML CREATE MODEL statement for PCA models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-pca)                                       |

**Return type**

The procedure creates a PCA model named `<output_model>`.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.BUILD_PCAMIX_MODEL(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  '<my-project>.<my-dataset>.<my-model-name>',
  '''{
      "PCA_EXPLAINED_VARIANCE_RATIO":0.9
  }'''
)
-- Model `<my-project>.<my-dataset>.<model-name>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.BUILD_PCAMIX_MODEL(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  '<my-project>.<my-dataset>.<my-model-name>',
  '''{
      "PCA_EXPLAINED_VARIANCE_RATIO":0.9
  }'''
)
-- Model `<my-project>.<my-dataset>.<model-name>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.BUILD_PCAMIX_MODEL(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-model-data-input>`''',
  'id',
  '<my-project>.<my-dataset>.<my-model-name>',
  '''{
      "PCA_EXPLAINED_VARIANCE_RATIO":0.9
  }'''
)
-- Model `<my-project>.<my-dataset>.<model-name>` will be created
```

{% endtab %}
{% endtabs %}

## PREDICT\_PCAMIX\_SCORES <a href="#predict_pcamix_scores" id="predict_pcamix_scores"></a>

```sql
PREDICT_PCAMIX_SCORES(input_query, index_column, input_model, output_table)
```

**Description**

Given a principal component analysis (PCA) model trained with the [BUILD\_PCAMIX\_MODEL](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_MODEL.md) procedure, this procedure returns the principal component scores for input data prepared with the [BUILD\_PCAMIX\_DATA](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_DATA.md) procedure.

**Input parameters**

* `input_query`: `STRING` the query to the input data created with the [BUILD\_PCAMIX\_DATA](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_DATA.md) procedure which will be used to derive the principal component scores. A qualified table name can be given as well, e.g. `'<my-project>.<my-dataset>.<my-table>'`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier.
* `input_model`: `STRING` the name for the PCA model trained with the [BUILD\_PCAMIX\_MODEL](https://github.com/CartoDB/gitbook-documentation/blob/master/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/clouds/bigquery/modules/doc/statistics/BUILD_PCAMIX_MODEL.md) procedure. It should include project and dataset, e.g. `'<my-project>.<my-dataset>.<my-model-name>'`.
* `output_table`: `STRING` the name for the output table. It should include project and dataset, e.g. `'<my-project>.<my-dataset>.<my-output-table>'`.

**Return type**

The results are stored in the table named `<output_table>`, which contains

* the retained principal component scores, named `principal_component_1`, `principal_component_2`, etc., ordered from the component explaining the most variance to the one explaining the least
* `index_column`: the unique geographic identifier. The type of this column depends on the type of `index_column` in `input_query`.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.PREDICT_PCAMIX_SCORES(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-new-data-input>`''',
  'id',
  '<my-project>.<my-dataset>.<my-pcamix-model>',
  '<my-project>.<my-dataset>.<my-output-table>'
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.PREDICT_PCAMIX_SCORES(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-new-data-input>`''',
  'id',
  '<my-project>.<my-dataset>.<my-pcamix-model>',
  '<my-project>.<my-dataset>.<my-output-table>'
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.PREDICT_PCAMIX_SCORES(
  '''SELECT * FROM `<my-project>.<my-dataset>.<my-pcamix-new-data-input>`''',
  'id',
  '<my-project>.<my-dataset>.<my-pcamix-model>',
  '<my-project>.<my-dataset>.<my-output-table>'
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## DETECT\_SPATIAL\_ANOMALIES <a href="#detect_spatial_anomalies" id="detect_spatial_anomalies"></a>

```sql
DETECT_SPATIAL_ANOMALIES(input_query, index_column, input_variable_column, output_table, options)
```

**Description**

This procedure can be used to detect anomalous spatial regions. It implements the [generalized spatial scan statistics framework](https://www.cs.cmu.edu/~neill/papers/ijf.pdf), also available in [this R package](https://cran.r-project.org/web/packages/scanstatistics/index.html), to detect spatial regions where the variable of interest is higher (or lower) than its baseline value.

<details>

<summary>Technical Note: The generalized spatial scan statistics framework</summary>

The generalized spatial scan statistics framework consists of the following steps:

* Choose a set of spatial regions to search over, where each spatial region *S* consists of a set of spatial locations (e.g. defined using spatial indexes)
* Choose models of the data under *H0* (the null hypothesis of no cluster of anomalies) and *H1(S)* (the alternative hypothesis assuming an anomalous cluster in region *S*). Here we assume that each location's value is drawn independently from some distribution $$Dist(b\_{i}, q\_{i})$$, where $$b\_{i}$$ represents the set of baseline values of that location and $$q\_{i}$$ represents some underlying relative risk parameter. We further assume that the relative risk $$q\_{i}$$ is uniform under the null hypothesis: any space-time variation in the values under the null is accounted for by the baseline parameters, and the method is designed to detect any additional variation not reflected in these baselines.
* Choose a baseline. Two typical approaches to obtaining baselines are the following:
  * Population-based. In the population-based approach, the observed values are expected to be proportional to the baselines, which typically represent the population corresponding to each spatial location. This population can be either given (e.g. from census data) or inferred (e.g. from sales data), and can be adjusted for any known covariates. Under the simplifying assumption of uniform rates, we wish to test the null hypothesis that the rate is uniform everywhere (all $$q\_{i,t}$$ are equal to some constant $$q\_{all}$$) against the set of alternative hypotheses with $$q\_{i,t} = q\_{in}$$ inside some region *S* and $$q\_{i,t} = q\_{out}$$ outside *S*, for some constants $$q\_{in} > q\_{out}$$.
  * Expectation-based. In the expectation-based method, the observed values should be equal (and not just proportional as in the population-based approach) to the baseline under the null hypothesis (i.e. each baseline represents the expected value of the input variable for a given location, under the null hypothesis). These expected values are often derived from a [regression model](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create) used to estimate the expected value given a set of covariates.
* Derive a score function *F(S)* based on *H1(S)* and *H0*:

$$F(S)=\frac{Pr(Data|H1(S))}{Pr(Data|H0)}$$

* Find the most interesting regions, i.e. those regions *S* with the highest values of F(S). Given the above test statistic *F(S)*, this method can be easily applied by choosing a set of regions *S*, calculating the score function *F(S)* for each of these regions, and obtaining the highest scoring region(s).
* Calculate the statistical significance of each discovered region using a Monte Carlo randomization: each replica is a copy of the original search area where each value is randomly drawn from the null distribution (e.g. for a Poisson and a Gaussian model, we would generate a sample respectively from a Poisson distribution with mean equal to the baseline value and a Gaussian distribution with mean equal to the baseline value and standard deviation equal to the baseline standard deviation). Once we have randomly generated a number of replicas under the null hypothesis, empirical results suggest that the null distribution of the scan statistic is fit well by a Gumbel extreme value distribution *G*. This implies that we can compute the p-value for each spatial zone following these steps:
  * Generate *R* random permutations of the data
  * For each permutation, select the space-time zone associated with the maximum score
  * Fit a Gumbel distribution to the maximum scores. Using the method of moments, the Cumulative Distribution Function (CDF) for the Gumbel distribution of maxima can be calculated as $$G(x) = e^{-e^{-\frac{x-\mu}{\beta}}}$$, where $$\bar{x} = \mu + 0.5772\beta$$ and $$\sigma = \dfrac{\beta\pi}{\sqrt{6}}$$
  * Compute the p-value as $$1-G(x)$$ (right-tailed test)
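<br>
The Gumbel fit in the last steps can be sketched in Python. This is a minimal illustration of the method-of-moments fit and the right-tailed p-value, not the toolbox's internal \_\_GUMBEL\_PVALUE implementation; the replica scores below are made up:

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant (the 0.5772 above)

def gumbel_pvalue(score, replica_max_scores):
    """Right-tailed p-value of an observed scan score, from a Gumbel
    distribution fitted by the method of moments to the replica maxima."""
    xbar = statistics.mean(replica_max_scores)
    sigma = statistics.stdev(replica_max_scores)
    beta = sigma * math.sqrt(6) / math.pi   # scale: sigma = beta * pi / sqrt(6)
    mu = xbar - EULER_GAMMA * beta          # location: xbar = mu + 0.5772 * beta
    cdf = math.exp(-math.exp(-(score - mu) / beta))  # G(x)
    return 1.0 - cdf

# Hypothetical maximum scores of 10 random replicas generated under H0
replicas = [2.1, 1.8, 2.5, 1.9, 2.2, 2.0, 2.4, 1.7, 2.3, 2.6]
p = gumbel_pvalue(5.0, replicas)  # observed score far above the replica maxima
```

An observed score far in the right tail of the fitted Gumbel distribution yields a small p-value, flagging the corresponding zone as significantly anomalous.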

</details>

**Input parameters**

* `input_query`: `STRING` the query or the fully qualified name of the table containing the data used to detect the spatial anomalies. It must contain the `index_column`, the `input_variable_column`, a column named `<input_variable_column>_baseline` with the values that should be used as a baseline to detect the anomalies and, when the `distributional_model` option is set to 'GAUSSIAN', also its variance, `<input_variable_column>_baseline_sigma2`. No NULL values in any of the input columns are allowed, and no duplicated locations are allowed. Baselines can be broadly defined, depending on the application domain under consideration. For example, the variable of interest might be some counts and the baselines might be the *at-risk* population of that area. Alternatively, rather than being given the baselines in advance, we might infer these baselines from the data using a [statistical model](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create) that accounts for any known covariates and can be used to [estimate](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-explain-predict) the expected values and their variance.
* `index_column`: `STRING` the name of the column with the unique geographic identifier of the spatial index, either 'H3' or 'QUADBIN'.
* `input_variable_column`: `STRING` the name of the column with the variable for which the spatial anomalies should be detected. When the `permutations` option is set to a number larger than zero and the values of the `input_variable_column` are larger than $$10^2$$, scaling of the values of the `input_variable_column` (and of the `<input_variable_column>_baseline` and `<input_variable_column>_baseline_sigma2` columns) is recommended to avoid a floating point error when computing the p-value using the \_\_GUMBEL\_PVALUE function.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists.
* `options`: `STRING` a JSON with the different options. If options is set to NULL then all options are set to default.

  | Option                   | Description                                                                                                                                                                                                                                        | Default value |
  | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
  | `kring_size`             | `ARRAY<INT64>` either the minimum and the maximum size of the k-ring used to define the spatial zones (e.g. \[0,5]) or an array of specific k-rings (e.g. \[0,2,5]). Use the latter to reduce the number of spatial zones and speed up computation | \[0, 3]       |
  | `is_high_mean_anomalies` | `BOOL` a boolean to specify if the analysis is for detecting spatial zones higher (`true`) or lower (`false`) than the baseline                                                                                                                    | `true`        |
  | `estimation_method`      | `STRING` the estimation method used to detect the spatial anomalies, either 'EXPECTATION' or 'POPULATION' for the expectation- and population-based methods respectively                                                                           | 'EXPECTATION' |
  | `distributional_model`   | `STRING` the distributional model of the data, either 'POISSON' or 'GAUSSIAN'. The 'POISSON' model should be used only for non-negative data                                                                                                       | 'GAUSSIAN'    |
  | `permutations`           | `INT64` the number of permutations used to derive the random replicas to test the anomaly statistical significance                                                                                                                                 | 10            |
  | `max_results`            | `INT64` the maximum number of spatial zones returned. The spatial zones are first ordered in descending order according to the score. When two (or more) space-time zones have the same score, the order is arbitrary                              | 10            |

**Return type**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index_scan`: `STRING` the unique identifier of the spatial zone
* `score`: `FLOAT64` the score representing the scan statistics
* `relrisk`: `FLOAT64` the value of the relative risk of the spatial zone of being anomalous, computed as the ratio of the aggregated observed values to the aggregated baseline values
* `pgumbel`: `FLOAT64` the p-value obtained by fitting a Gumbel distribution to the replicate scan statistics
* `locations`: `ARRAY<INT64>` or `ARRAY<STRING>` (if the `index_column` is of type QUADBIN or H3, respectively) with the distinct locations belonging to the corresponding spatial zone

**Examples**

This procedure call performs an anomaly detection analysis using the expectation-based method (with a Gaussian distributional model) for spatial zones defined by k-rings of sizes between 1 and 3:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.DETECT_SPATIAL_ANOMALIES(
    '''
    SELECT quadbin, var, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'quadbin',
    'var',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[1,3],
        "distributional_model":"GAUSSIAN",
        "permutations":10,
        "estimation_method":"EXPECTATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.DETECT_SPATIAL_ANOMALIES(
    '''
    SELECT quadbin, var, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'quadbin',
    'var',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[1,3],
        "distributional_model":"GAUSSIAN",
        "permutations":10,
        "estimation_method":"EXPECTATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.DETECT_SPATIAL_ANOMALIES(
    '''
    SELECT quadbin, var, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'quadbin',
    'var',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[1,3],
        "distributional_model":"GAUSSIAN",
        "permutations":10,
        "estimation_method":"EXPECTATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call performs an anomaly detection analysis using the population-based method (with a Poisson distributional model) for specific spatial k-rings:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.DETECT_SPATIAL_ANOMALIES(
    '''
    SELECT h3, counts, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'h3',
    'counts',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[0,1,2],
        "distributional_model":"POISSON",
        "permutations": 100,
        "estimation_method":"POPULATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.DETECT_SPATIAL_ANOMALIES(
    '''
    SELECT h3, counts, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'h3',
    'counts',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[0,1,2],
        "distributional_model":"POISSON",
        "permutations": 100,
        "estimation_method":"POPULATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.DETECT_SPATIAL_ANOMALIES(
    '''
    SELECT h3, counts, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'h3',
    'counts',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[0,1,2],
        "distributional_model":"POISSON",
        "permutations": 100,
        "estimation_method":"POPULATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## DETECT\_SPACETIME\_ANOMALIES <a href="#detect_spacetime_anomalies" id="detect_spacetime_anomalies"></a>

```sql
DETECT_SPACETIME_ANOMALIES(input_query, index_column, date_column, input_variable_column, time_freq, output_table, options)
```

**Description**

This procedure can be used to detect anomalous space-time regions. It implements the [scan statistics framework](https://www.cs.cmu.edu/~neill/papers/ijf.pdf) developed in [this R package](https://cran.r-project.org/web/packages/scanstatistics/index.html) to detect space-time regions where the variable of interest is higher (or lower) than its baseline value.

<details>

<summary>Technical Note: The generalized space-time scan statistics framework</summary>

The generalized space-time scan statistics framework consists of the following steps:

* Choose a set of spatial regions to search over, where each space-time region *S* consists of a set of space-time locations (e.g. defined using spatial indexes).
* Choose models of the data under *H0* (the null hypothesis of no cluster of anomalies) and *H1(S)* (the alternative hypothesis assuming an anomalous cluster in region *S*). Here we assume that each location's value is drawn independently from some distribution $$Dist(b\_{i,t}, q\_{i,t})$$ where $$b\_{i,t}$$ represents the set of baseline values of that location, and $$q\_{i,t}$$ represents some underlying relative risk parameter. Second, we assume that the relative risk $$q\_{i,t}$$ is uniform under the null hypothesis: thus we assume that any space-time variation in the values under the null is accounted for by our baseline parameters, and our methods are designed to detect any additional variation not reflected in these baselines.
* Choose a baseline. Two typical approaches to obtaining baselines are the following:
  * Population-based. In the population-based approach, the observed values are expected to be proportional to the baselines, which typically represent the population corresponding to each space-time location. This population can be either given (e.g. from census data) or inferred (e.g. from sales data), and can be adjusted for any known covariates using for example a [regression model](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create). We wish to test the null hypothesis that the relative risk is uniform everywhere (all $$q\_{i,t}$$ are equal to some constant $$q\_{all}$$) against the set of alternative hypotheses with $$q\_{i,t} = q\_{in}$$ inside some region *S* and $$q\_{i,t} = q\_{out}$$ outside *S*, for some constants $$q\_{in} \gtrless q\_{out}$$.
  * Expectation-based. In the expectation-based method, the observed values should be equal (and not just proportional as in the population-based approach) to the baseline under the null hypothesis (i.e. each baseline represents the expected value of the input variable for a given location and at a given time stamp, under the null hypothesis). These expected values are often derived from a regression model (e.g. a [time series model](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-time-series) of the historical data used to forecast the expected value of the current data) or by computing the moving average over a window larger than the expected temporal extent of the anomaly.
* Derive a score function *F(S)* based on *H1(S)* and *H0*:

$$F(S)=\frac{Pr(Data|H1(S))}{Pr(Data|H0)}$$

* Find the most interesting regions, i.e. those regions *S* with the highest values of F(S). Given the above test statistic *F(S)*, this method can be easily applied by choosing a set of regions *S*, calculating the score function *F(S)* for each of these regions, and obtaining the highest scoring region(s).
* Calculate the statistical significance of each discovered region using a Monte Carlo randomization: each replica is a copy of the original search area where each value is randomly drawn from the null distribution (e.g. for a Poisson and a Gaussian model, we would generate a sample respectively from a Poisson distribution with mean equal to the baseline value and a Gaussian distribution with mean equal to the baseline value and standard deviation equal to the baseline standard deviation). Once we have randomly generated a number of replicas under the null hypothesis, empirical results suggest that the null distribution of the scan statistic is fit well by a Gumbel extreme value distribution *G*. This implies that we can compute the p-value for each space-time zone following these steps:
  * Generate *R* random permutations of the data
  * For each permutation, select the space-time zone associated with the maximum score
  * Fit a Gumbel distribution to the maximum scores. Using the method of moments, the Cumulative Distribution Function (CDF) for the Gumbel distribution of maxima can be calculated as $$G(x) = e^{-e^{-\frac{x-\mu}{\beta}}}$$, where $$\bar{x} = \mu + 0.5772\beta$$ and $$\sigma = \dfrac{\beta\pi}{\sqrt{6}}$$
  * Compute the p-value as $$1-G(x)$$ (right-tailed test)
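<br>
The Monte Carlo replication above can be sketched as follows. This is a minimal Python illustration under a Gaussian null model, using a simplified stand-in zone score (the actual score function *F(S)* is the likelihood ratio defined above), with made-up baselines and zones:

```python
import random

def replica_max_scores(baselines, sigmas, zones, n_replicas=100, seed=42):
    """Draw Monte Carlo replicas under H0 (every location's value redrawn from
    a Gaussian with mean = baseline and sd = baseline sigma) and keep each
    replica's maximum zone score."""
    rng = random.Random(seed)

    # Simplified stand-in score: standardized excess of observed over baseline
    def score(zone, values):
        excess = sum(values[i] - baselines[i] for i in zone)
        return excess / sum(sigmas[i] ** 2 for i in zone) ** 0.5

    maxima = []
    for _ in range(n_replicas):
        sample = [rng.gauss(b, s) for b, s in zip(baselines, sigmas)]
        maxima.append(max(score(zone, sample) for zone in zones))
    return maxima

# Nine locations and three hypothetical candidate zones
baselines = [10.0] * 9
sigmas = [1.0] * 9
zones = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
maxima = replica_max_scores(baselines, sigmas, zones)
```

These replica maxima are what the Gumbel distribution is then fitted to in order to obtain the p-value of the observed scan statistic.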

</details>

**Input parameters**

* `input_query`: `STRING` the query or the fully qualified name of the table containing the data used to detect the space-time anomalies. It must contain the `index_column`, the `date_column`, the `input_variable_column`, a column named `<input_variable_column>_baseline` with the values that should be used as a baseline to detect the anomalies and, when the `distributional_model` parameter is set to 'GAUSSIAN', also its variance, `<input_variable_column>_baseline_sigma2`. The locations should all have the same timestamps (i.e. no missing location/time combinations are allowed). No NULL values in any of the input columns are allowed, and no duplicated location/time pairs are allowed. Baselines can be broadly defined, depending on the application domain under consideration. For example, the variable of interest might be some counts and the baselines might be the *at-risk* population of that area. Alternatively, rather than being given the baselines in advance, we might infer these baselines from historical data using a [time series model](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-time-series) that accounts for any known covariates and can be used to [estimate](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-explain-forecast) the expected values and their variance.
* `index_column`: `STRING` the name of the column with the unique geographic identifier of the spatial index, either 'H3' or 'QUADBIN'.
* `date_column`: `STRING` the name of the column with the timestamp identifier. This column can be of any type that can be cast to `DATETIME`.
* `input_variable_column`: `STRING` the name of the column with the variable for which the space-time anomalies should be detected. When the `permutations` option is set to a number larger than zero and the values of the `input_variable_column` are larger than $$10^2$$, scaling of the values of the `input_variable_column` (and of the `<input_variable_column>_baseline` and `<input_variable_column>_baseline_sigma2` columns) is recommended to avoid a floating point error when computing the p-value.
* `time_freq`: `STRING` the temporal frequency of the data selected from one of the following: `second`, `minute`, `hour`, `day`, `week`, `month`, `quarter`, `year`
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists.
* `options`: `STRING` a JSON with the different options. If options is set to NULL then all options are set to default.

  | Option                   | Description                                                                                                                                                                                                                                                                                                                                                                                                                       | Default value                                            |
  | ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
  | `kring_size`             | `ARRAY<INT64>` either the minimum and the maximum size of the k-ring used to define the spatial zones (e.g. \[0,5]) or an array of specific k-rings (e.g. \[0,2,5]). Use the latter to reduce the number of spatial zones and speed up computation                                                                                                                                                                                | \[0, 3]                                                  |
  | `time_bw`                | `ARRAY<INT64>` either the minimum and the maximum temporal bandwidth used to define the temporal zones (e.g. \[0,10]) or an array of specific temporal bandwidths (e.g. \[0,5,10]). Use the latter to reduce the number of temporal zones and speed up computation                                                                                                                                                                | \[0, L] where L is the maximum length of the time series |
  | `is_prospective`         | `BOOL` a boolean to specify if the analysis is retrospective or prospective. In a retrospective analysis, the space-time anomalies can happen at any point in time over all the past data (a temporal zone can end at any timestamp); in the prospective analysis instead, only temporal zones that end with the last timestamp are considered and the interest lies in detecting new emerging anomalies                          | `true`                                                   |
  | `is_high_mean_anomalies` | `BOOL` a boolean to specify if the analysis is for detecting space-time zones higher (`true`) or lower (`false`) than the baseline                                                                                                                                                                                                                                                                                                | `true`                                                   |
  | `estimation_method`      | `STRING` the estimation method used to detect the spacetime anomalies, either 'EXPECTATION' or 'POPULATION' for the expectation- and population-based methods respectively. In the population-based method we expect each observed value to be proportional to its baseline under the null hypothesis, while in the expectation-based method, we expect each observed value to be equal to its baseline under the null hypothesis | 'EXPECTATION'                                            |
  | `distributional_model`   | `STRING` the distributional model of the data, either 'POISSON' or 'GAUSSIAN'. The 'POISSON' model should be used only for non-negative data                                                                                                                                                                                                                                                                                      | 'GAUSSIAN'                                               |
  | `permutations`           | `INT64` the number of permutations used to derive the random replicas to test the anomaly statistical significance                                                                                                                                                                                                                                                                                                                | 10                                                       |
  | `max_results`            | `INT64` the maximum number of space-time zones returned. The space-time zones are first ordered in descending order according to the score. When two (or more) space-time zones have the same score, the order is arbitrary                                                                                                                                                                                                       | 10                                                       |

**Return type**

The results are stored in the table named `<output_table>`, which contains the following columns:

* `index_scan`: `STRING` the unique identifier of the space-time zone
* `score`: `FLOAT64` the score representing the scan statistics
* `relrisk`: `FLOAT64` the value of the relative risk of the space-time zone of being anomalous, computed as the ratio of the aggregated observed values to the aggregated baseline values
* `pgumbel`: `FLOAT64` the p-value obtained by fitting a Gumbel distribution to the replicate scan statistics
* `locations`: `ARRAY<INT64>` or `ARRAY<STRING>` (if the `index_column` is of type QUADBIN or H3, respectively) with the distinct locations belonging to the corresponding spatial zone
* `times`: `ARRAY<DATETIME>` with the distinct timestamps belonging to the corresponding space-time zone

**Examples**

This procedure call performs a retrospective anomaly detection analysis using the expectation-based method (with a Gaussian distributional model) for every 12-month temporal window and spatial zones defined by k-rings of sizes between 1 and 3:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.DETECT_SPACETIME_ANOMALIES(
    '''
    SELECT quadbin, date, var, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'quadbin',
    'date',
    'var',
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[1,3],
        "time_bw":[0,12],
        "is_prospective": false,
        "distributional_model":"GAUSSIAN",
        "permutations":10,
        "estimation_method":"EXPECTATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.DETECT_SPACETIME_ANOMALIES(
    '''
    SELECT quadbin, date, var, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'quadbin',
    'date',
    'var',
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[1,3],
        "time_bw":[0,12],
        "is_prospective": false,
        "distributional_model":"GAUSSIAN",
        "permutations":10,
        "estimation_method":"EXPECTATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.DETECT_SPACETIME_ANOMALIES(
    '''
    SELECT quadbin, date, var, var_baseline, var_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'quadbin',
    'date',
    'var',
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[1,3],
        "time_bw":[0,12],
        "is_prospective": false,
        "distributional_model":"GAUSSIAN",
        "permutations":10,
        "estimation_method":"EXPECTATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call performs a prospective anomaly detection analysis using the population-based method (with a Poisson distributional model) for specific spatial k-rings and temporal windows:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.DETECT_SPACETIME_ANOMALIES(
    '''
    SELECT h3, date, counts, counts_baseline, counts_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'h3',
    'date',
    'counts',
    'DAY',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[0,1,2],
        "time_bw":[0,10,30],
        "is_prospective": true,
        "distributional_model":"POISSON",
        "permutations": 100,
        "estimation_method":"POPULATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.DETECT_SPACETIME_ANOMALIES(
    '''
    SELECT h3, date, counts, counts_baseline, counts_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'h3',
    'date',
    'counts',
    'DAY',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[0,1,2],
        "time_bw":[0,10,30],
        "is_prospective": true,
        "distributional_model":"POISSON",
        "permutations": 100,
        "estimation_method":"POPULATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.DETECT_SPACETIME_ANOMALIES(
    '''
    SELECT h3, date, counts, counts_baseline, counts_baseline_sigma2
    FROM `<my-project>.<my-dataset>.<my-input-table>`
    ''',
    'h3',
    'date',
    'counts',
    'DAY',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":[0,1,2],
        "time_bw":[0,10,30],
        "is_prospective": true,
        "distributional_model":"POISSON",
        "permutations": 100,
        "estimation_method":"POPULATION"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## AREA\_OF\_APPLICABILITY <a href="#area_of_applicability" id="area_of_applicability"></a>

```sql
AREA_OF_APPLICABILITY(source_input_query, candidate_input_query, shap_query, index_column, output_prefix, options)
```

**Description**

This procedure computes the Area of Applicability (AOA) of a BigQuery ML model. It generates a metric that tells the user where the results of a Machine Learning (ML) model can be trusted when the predictions are extrapolated outside the training space (i.e. where the estimated cross-validation performance holds). This metric is particularly useful for non-linear models.

This implementation is based on Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? [Estimating the area of applicability of spatial prediction models](https://doi.org/10.1111/2041-210X.13650). Methods in Ecology and Evolution, 12, 1620–1633.

Given the [SHAP values](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-xai-overview) of a trained model, the procedure computes a Dissimilarity Index (DI) for each new data point used for prediction as the multivariate distance between the model covariates for that point and the nearest training data point. To identify those new points that lie in the model AOA, the DI is compared against a threshold obtained as the (outlier-removed) maximum DI of the training data derived via cross-validation: for each training data point, the DI is computed as the distance to the nearest training data point that is not in the same (spatial) cross-validation fold, divided by the average of all pairwise distances between all training data points. Alternatively, the user can also input a user-defined threshold. To compute the DI, two distance metrics are available:

* Euclidean distance: the distance between two data points is computed as the sum over all predictors of the weighted square differences between the standardized value of each predictor variable, where the weight is derived from the model SHAP table. This distance should be used only when all predictor variables are numerical, for which squared differences are well defined.
* [Gower distance](https://www.rdocumentation.org/packages/StatMatch/versions/1.4.1/topics/gower.dist): the distance between two data points is computed as the sum over all predictors of the weighted and normalized absolute differences for numerical (continuous and discrete) predictors and the indicator function (0 if equal, 1 otherwise) for categorical/ordinal predictors. This distance can be used for numerical only, categorical only or mixed-type data and is normalized between 0 and 1, with 0 indicating that two points are the same.
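<br>
As a concrete illustration of the Gower distance, here is a minimal Python sketch (not the procedure's internal implementation; the predictors, weights, and ranges below are hypothetical, with the weights standing in for the SHAP-derived feature importances):

```python
def gower_distance(x, y, weights, ranges, is_categorical):
    """Weighted Gower distance between two records:
    - numerical predictors: |x - y| / range (range taken over the training data)
    - categorical predictors: 0 if equal, 1 otherwise
    The weights are normalized so the result stays in [0, 1]."""
    d = 0.0
    for xi, yi, w, r, cat in zip(x, y, weights, ranges, is_categorical):
        if cat:
            d += w * (0.0 if xi == yi else 1.0)
        else:
            d += w * abs(xi - yi) / r
    return d / sum(weights)

# One numerical and one categorical predictor
d = gower_distance([2.0, "urban"], [4.0, "rural"],
                   weights=[0.7, 0.3], ranges=[10.0, None],
                   is_categorical=[False, True])
# numerical term: 0.7 * |2 - 4| / 10 = 0.14; categorical term: 0.3 * 1 = 0.3
# -> d = (0.14 + 0.3) / (0.7 + 0.3) = 0.44
```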

The cross-validation folds for the training data can be obtained using a custom index ('CUSTOM\_KFOLD'), a random cross-validation strategy ('RANDOM\_KFOLD'), or [environmental blocking](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107) ('ENV\_BLOCKING\_KFOLD').

Finally, rows with a NULL value in any of the model predictors are dropped.
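For intuition, the default (outlier-removed) maximum threshold can be sketched as below. The quartile estimator and the exact outlier-removal rule are assumptions; `k` plays the role of the `outliers_scale_factor` option, analogous to Tukey's fences:

```python
def di_threshold(train_di, k=1.5):
    """Outlier-removed maximum of the training dissimilarity indices.

    Points above Tukey's upper fence Q3 + k * IQR are treated as
    outliers and discarded before taking the maximum.
    """
    s = sorted(train_di)
    n = len(s)
    # simple nearest-rank quartile estimates (an assumption for illustration)
    q1 = s[int(0.25 * (n - 1))]
    q3 = s[int(0.75 * (n - 1))]
    upper_fence = q3 + k * (q3 - q1)
    kept = [d for d in s if d <= upper_fence]
    return max(kept)
```

For example, with training DIs `[0.1, 0.2, 0.3, 0.4, 10.0]` the point at 10.0 falls above the upper fence and is discarded, so the threshold is 0.4.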

**Input parameters**

* `source_input_query`: `STRING` the query to the data used to train the model. It must contain all the model predictors columns as well as `index_column`.
* `candidate_input_query`: `STRING` query to provide the data over which the domain of applicability is estimated. It must contain all the model predictors columns as well as `index_column`.
* `shap_query`: `STRING` the query to the model SHAP table with the feature importance of the model. For example, for the [BUILD\_REVENUE\_MODEL](https://docs.carto.com/data-and-analysis/analytics-toolbox-for-bigquery/retail#build_revenue_model) procedure these values are stored in a table with the suffix `_model_shap`. When the model is [trained with BigQuery ML](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-xai-overview) the feature importance of the model predictors can also be found in the model's *Interpretability* tab.
* `index_column`: `STRING` the name of the column with the unique geographic identifier. A column with this name needs to be selected (or created) both in `source_input_query` and in `candidate_input_query`.
* `output_prefix`: `STRING` destination and prefix for the output table. It must contain the project, dataset and prefix. For example `<my-project>.<my-dataset>.<my-output-prefix>`.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described in the table below. If `options` is set to NULL, all options take their default values.

  | Option                          | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
  | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
  | `threshold_method`              | Default: `"RANDOM_KFOLD"`. `STRING` method used to calculate the threshold applied to the dissimilarity index of the candidate set in order to identify the area of applicability. Possible options are: `"USER_DEFINED_THRESHOLD"` uses a user-defined threshold to derive the AOA, provided via the `threshold` option; `"CUSTOM_KFOLD"` uses a custom k-fold index, with the threshold based on the cross-validation folds stored in the `kfold_index_column` of the `source_input_query` data; `"RANDOM_KFOLD"` uses a random k-fold index, with the threshold based on the cross-validation folds derived from a random k-fold strategy with the number of folds specified by the user in the `nfolds` option; `"ENV_BLOCKING_KFOLD"` uses an environmental blocking k-fold index, with the threshold based on the cross-validation folds derived from an environmental blocking strategy. This last method can only be used when all predictors are numerical, otherwise an error is raised. |
  | `threshold`                     | `FLOAT64` the user-defined threshold used when the `"USER_DEFINED_THRESHOLD"` method is selected. The threshold must be defined in the \[0,1] interval.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
  | `kfold_index_column`            | `STRING` name of the cross-validation fold column. If `threshold_method` is set to `"CUSTOM_KFOLD"`, this parameter is required, otherwise an error is raised. If `threshold_method` is set to `"RANDOM_KFOLD"` or `"ENV_BLOCKING_KFOLD"`, this parameter is optional.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
  | `distance_type`                 | Default: `"GOWER"`. `STRING` the distance used to compute the dissimilarity index. Possible options are `"GOWER"` for the Gower distance and `"EUCLIDEAN"` for the Euclidean distance. When working with mixed data types only the Gower distance can be used, otherwise an error is raised.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
  | `outliers_scale_factor`         | `FLOAT64` the scale factor used to define the threshold when `threshold_method` is set to `"CUSTOM_KFOLD"`, `"RANDOM_KFOLD"`, or `"ENV_BLOCKING_KFOLD"`. Analogous to [Tukey’s fences k parameter](https://en.wikipedia.org/wiki/Outlier#Tukey's_fences) for outlier detection.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
  | `pca_explained_variance_ratio`  | `FLOAT64` the proportion of explained variance retained in the PCA analysis. Only values in the (0,1] range are allowed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
  | `nfolds`                        | `INT64` the number of k-folds when `threshold_method` is set to `"RANDOM_KFOLD"` or `"ENV_BLOCKING_KFOLD"`. Cannot be NULL if `threshold_method="RANDOM_KFOLD"`. If `threshold_method="ENV_BLOCKING_KFOLD"` and `nfolds` is NULL, then `nfolds_min` and `nfolds_max` must be specified, and the optimal number of folds is computed by deriving the clusters for each number of folds between `nfolds_min` and `nfolds_max` and choosing the number of folds (clusters) that minimizes the [Calinski-Harabasz Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html). If not NULL, `nfolds` should be at least 1.                                                                                                                                                                                                                                                                                                                                                                |
  | `nfolds_min`                    | `INT64` the minimum number of environmental folds (clusters) if `nfolds` is set to NULL, otherwise it is ignored.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
  | `nfolds_max`                    | `INT64` the maximum number of environmental folds (clusters) if `nfolds` is set to NULL, otherwise it is ignored.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
  | `normalize_dissimilarity_index` | `BOOL` if TRUE the dissimilarity index is normalized between 0 and 1. If `threshold_method` is set to `"USER_DEFINED_THRESHOLD"` this parameter must be set to TRUE.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
  | `return_source_dataset`         | `BOOL` if TRUE the dissimilarity index for the source model data is also returned. If `threshold_method` is set to `"USER_DEFINED_THRESHOLD"` this parameter must be set to FALSE.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |

  The applicability of each option under each `threshold_method` is summarized in the table below.

  | threshold\_method               | `USER_DEFINED_THRESHOLD` | `CUSTOM_KFOLD` | `RANDOM_KFOLD` | `ENV_BLOCKING_KFOLD`                                                  | Default value                               |
  | ------------------------------- | ------------------------ | -------------- | -------------- | --------------------------------------------------------------------- | ------------------------------------------- |
  | `threshold`                     | Mandatory                | Ignored        | Ignored        | Ignored                                                               | NULL                                        |
  | `kfold_index_column`            | Ignored                  | Mandatory      | Optional       | Optional                                                              | "kfold\_index"                              |
  | `distance_type`                 | Optional                 | Optional       | Optional       | Optional                                                              | "GOWER"                                     |
  | `outliers_scale_factor`         | Ignored                  | Optional       | Optional       | Optional                                                              | 1.5                                         |
  | `pca_explained_variance_ratio`  | Ignored                  | Ignored        | Ignored        | Optional                                                              | 0.9                                         |
  | `nfolds`                        | Ignored                  | Ignored        | Mandatory      | Optional if `nfolds_min` and `nfolds_max` are defined                 | 4 if `options` is NULL; NULL otherwise      |
  | `nfolds_min`                    | Ignored                  | Ignored        | Ignored        | Optional if `nfolds` is defined; mandatory if `nfolds_max` is defined | NULL                                        |
  | `nfolds_max`                    | Ignored                  | Ignored        | Ignored        | Optional if `nfolds` is defined; mandatory if `nfolds_min` is defined | NULL                                        |
  | `normalize_dissimilarity_index` | Optional                 | Optional       | Optional       | Optional                                                              | TRUE                                        |
  | `return_source_dataset`         | Optional                 | Optional       | Optional       | Optional                                                              | FALSE                                       |

**Output**

The output table with the following columns:

* `index_column`: `STRING` the unique geographic identifier. The data type of the `index_column` in `source_input_query` and in `candidate_input_query` is cast to STRING to account for potential differences between the data types of the two queries.
* `is_source`: `BOOL` TRUE if a data point is in the source model data and FALSE otherwise (only returned if `return_source_dataset` is TRUE).
* `kfold_index_column`: `STRING` the cross-validation fold index for each data point in the source model dataset (only returned if `return_source_dataset` is TRUE). The name of the column is given by the `kfold_index_column` option.
* `dissimilarity_index`: `FLOAT64` the dissimilarity index.
* `dissimilarity_index_threshold`: `FLOAT64` the dissimilarity index threshold used to define the Area of Applicability (AOA, data points in the candidate set for which the dissimilarity index is smaller than this threshold belong to the area of applicability).
* `is_in_area_of_applicability`: `BOOL` TRUE if a data point in the candidate set (`is_source` = FALSE) is in the AOA and FALSE otherwise. For data points in the source set (`is_source` = TRUE) this is set to NULL.

**Examples**

Let's start by setting `options` to `NULL`. In this case the threshold is computed from a random k-fold strategy (`threshold_method="RANDOM_KFOLD"`) with `nfolds` = 4:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  NULL
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  NULL
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  NULL
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

With `USER_DEFINED_THRESHOLD`, the threshold is provided by the user:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"USER_DEFINED_THRESHOLD",
    "threshold":0.50
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"USER_DEFINED_THRESHOLD",
    "threshold":0.50
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"USER_DEFINED_THRESHOLD",
    "threshold":0.50
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

With `CUSTOM_KFOLD`, the threshold is based on the cross-validation folds stored in the `kfold_index_column` of the `source_input_query` data:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"CUSTOM_KFOLD",
    "kfold_index_column":"kfold_index",
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"CUSTOM_KFOLD",
    "kfold_index_column":"kfold_index",
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"CUSTOM_KFOLD",
    "kfold_index_column":"kfold_index",
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

With `RANDOM_KFOLD`, the threshold is based on the cross-validation folds derived from a random k-fold strategy with the number of folds specified by the user in the `nfolds` parameter:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"RANDOM_KFOLD",
    "nfolds":6,
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"RANDOM_KFOLD",
    "nfolds":6,
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"RANDOM_KFOLD",
    "nfolds":6,
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

With `ENV_BLOCKING_KFOLD`, the threshold is based on the cross-validation folds derived from an environmental blocking strategy:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"ENV_BLOCKING_KFOLD",
    "pca_explained_variance_ratio":0.9,
    "nfolds_min":3,
    "nfolds_max":6,
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"ENV_BLOCKING_KFOLD",
    "pca_explained_variance_ratio":0.9,
    "nfolds_min":3,
    "nfolds_max":6,
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.AREA_OF_APPLICABILITY(
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>` WHERE revenue_avg IS NOT NULL',
  'SELECT * EXCEPT(revenue_avg) FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-data>`',
  'SELECT * FROM `<my-project>.<my-dataset>.<my-aoa-revenue-model-shap>`',
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''
  {
    "threshold_method":"ENV_BLOCKING_KFOLD",
    "pca_explained_variance_ratio":0.9,
    "nfolds_min":3,
    "nfolds_max":6,
    "return_source_dataset":true,
    "normalize_dissimilarity_index":true
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

## ENV\_BLOCKING <a href="#env_blocking" id="env_blocking"></a>

```sql
ENV_BLOCKING(input_query, predictors, index_column, output_prefix, options)
```

**Description**

This procedure derives cross validation (CV) folds based on [environmental blocking](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107).

This procedure uses multivariate methods ([Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) + [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)) to specify sets of similar conditions based on the input covariates. It should be used to overcome the issue of overfitting due to non-causal predictors: in this case, the spatial structure in the data may be explained by the model through some other non-causal covariate which correlates with the spatial structure. The resulting model predictions may perform fine in a situation where the correlation structure between non-causal and the “true” predictors (i.e. the underlying structures) remains unchanged but they could completely fail when predicting to novel situations (extrapolation).

The method performs a Principal Component Analysis (PCA) on the standardized data and then applies K-means clustering to cluster the data. The cluster number is then used to assign to each data point a corresponding CV fold. If the optimal number of folds is not provided, this is obtained by choosing the number of clusters that minimizes the [Calinski-Harabasz Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html).
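For intuition, the Calinski-Harabasz Index used to select the number of folds can be sketched for one-dimensional data as follows (the actual procedure operates on the PCA-transformed covariates; this simplification is an assumption for illustration):

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz Index for a 1-D clustering.

    points: list of floats; labels: cluster id per point.
    Ratio of between-cluster to within-cluster dispersion,
    scaled by the degrees of freedom (n - k) / (k - 1).
    """
    n = len(points)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    overall_mean = sum(points) / n
    # between-cluster dispersion: cluster sizes times squared
    # deviations of cluster means from the overall mean
    between = sum(len(c) * (sum(c) / len(c) - overall_mean) ** 2
                  for c in clusters.values())
    # within-cluster dispersion: squared deviations from cluster means
    within = sum((p - sum(c) / len(c)) ** 2
                 for c in clusters.values() for p in c)
    return (between / (k - 1)) / (within / (n - k))
```

Two well-separated groups such as `[0, 1]` and `[10, 11]` yield a large index, while poorly separated clusterings yield a small one.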

We suggest limiting the use of this procedure to numerical data only. Principal Component Analysis (PCA) and K-means are primarily suited for continuous data, for which squared differences are well defined, but they might also be applied to discrete variables (although in this case the PCA results might exhibit some [artifacts](https://www.frontiersin.org/articles/10.3389/fonc.2020.00973/full) and the K-means results will not map back to the data). When dealing with categorical or ordinal data, direct application of PCA/K-means is not recommended, even if the data has been [one-hot encoded](https://en.wikipedia.org/wiki/One-hot), as is done, for example, in the [PCA](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-pca) and [K-means](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-kmeans) methods in Google BigQuery. The issue with applying PCA/K-means over a table of one-hot encoded data is that the results would inherently depend on the number of modalities available to each variable, as well as on the probabilities of these modalities, making it impossible to equally weight all the input variables when maximizing the variance (PCA) or the within-cluster sums of squares (K-means).

Finally, rows with a NULL value in any of the model predictors are dropped.
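
The underlying approach can be sketched with plain BigQuery ML statements (a simplified illustration under assumed table, model and column names; the actual procedure additionally standardizes the data, clusters the PCA scores rather than the raw predictors, and can search for the optimal number of folds):

```sql
-- Reduce the predictors to the components explaining 90% of the variance.
CREATE OR REPLACE MODEL `my-project.my_dataset.env_pca`
OPTIONS (model_type = 'PCA', pca_explained_variance_ratio = 0.9) AS
SELECT tavg, tmin, tmax, prec FROM `my-project.my_dataset.env_input`;

-- Cluster the data into 4 groups of similar conditions.
CREATE OR REPLACE MODEL `my-project.my_dataset.env_kmeans`
OPTIONS (model_type = 'KMEANS', num_clusters = 4) AS
SELECT tavg, tmin, tmax, prec FROM `my-project.my_dataset.env_input`;

-- The assigned cluster plays the role of the cross-validation fold.
SELECT geoid, CENTROID_ID AS k_fold_index
FROM ML.PREDICT(
  MODEL `my-project.my_dataset.env_kmeans`,
  (SELECT geoid, tavg, tmin, tmax, prec
   FROM `my-project.my_dataset.env_input`));
```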

**Input parameters**

* `input_query`: `STRING` the input query. It must contain all the model predictor columns.
* `predictors`: `ARRAY<STRING>` the names of the (numeric) predictors.
* `index_column`: `STRING` the name of the column with the unique geographic identifier.
* `output_prefix`: `STRING` destination and prefix for the output table. It must contain the project, dataset and prefix. For example `<my-project>.<my-dataset>.<my-output-prefix>`.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described in the table below. If options is set to NULL then all options are set to default.

  | Option                         | Description                                                                                                                                                                                                                                                                                                                                                                                                                |
  | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `pca_explained_variance_ratio` | `FLOAT64` as defined in [BigQuery ML CREATE MODEL statement for PCA models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-pca) (DEFAULT: 0.9).                                                                                                                                                                                                                                    |
  | `nfolds`                       | `INT64` the number of folds (clusters). If NULL, the optimal number of folds is obtained by deriving the clusters for each number of folds between nfolds\_min and nfolds\_max and choosing the number of folds (clusters) that minimizes the [Calinski-Harabasz Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html). If not NULL, nfolds should be at least 1. |
  | `nfolds_min`                   | `INT64` the minimum number of environmental folds (clusters) if nfolds is set to NULL, otherwise it is ignored. If NOT NULL nfolds\_min should be at least 1 and nfolds\_max should also be specified.                                                                                                                                                                                                                     |
  | `nfolds_max`                   | `INT64` the maximum number of environmental folds (clusters) if nfolds is set to NULL, otherwise it is ignored. If NOT NULL, nfolds\_max should always be larger than nfolds\_min. |
  | `kfold_index_column`           | `STRING` the name of the cross-validation fold column. If NULL this parameter is set to 'k\_fold\_index'.                                                                                                                                                                                                                                                                                                                  |

**Output**

The output table with the following columns:

* `index_column`: the unique geographic identifier. Its type matches that of this column in the `input_query`.
* `kfold_index_column`: `INT64` the cross-validation fold column.
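
The fold assignments can then be joined back to the training data, for instance to implement a leave-one-environment-out scheme (a sketch following the placeholder naming used in the examples below; the held-out fold index is arbitrary):

```sql
-- Train on every environmental block except fold 1,
-- holding fold 1 out for validation.
SELECT t.*
FROM `<my-project>.<my-dataset>.<my-env-blocking-input>` AS t
JOIN `<my-project>.<my-dataset>.<my-output-prefix>` AS f
  USING (geoid)
WHERE f.k_fold_index != 1;
```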

**Examples**

With `nfolds` specified:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.ENV_BLOCKING(
  'SELECT * FROM `<my-project>.<my-dataset>.<my-env-blocking-input>`',
  ['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
    "pca_explained_variance_ratio":0.9,
    "nfolds":4,
    "kfold_index_column":"k_fold_index"
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.ENV_BLOCKING(
  'SELECT * FROM `<my-project>.<my-dataset>.<my-env-blocking-input>`',
  ['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
    "pca_explained_variance_ratio":0.9,
    "nfolds":4,
    "kfold_index_column":"k_fold_index"
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.ENV_BLOCKING(
  'SELECT * FROM `<my-project>.<my-dataset>.<my-env-blocking-input>`',
  ['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
    "pca_explained_variance_ratio":0.9,
    "nfolds":4,
    "kfold_index_column":"k_fold_index"
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

With `nfolds_min` and `nfolds_max` specified:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.ENV_BLOCKING(
  'SELECT * FROM `<my-project>.<my-dataset>.<my-env-blocking-input>`',
  ['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
    "pca_explained_variance_ratio":0.9,
    "nfolds_min":3,
    "nfolds_max":6,
    "kfold_index_column":"k_fold_index"
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.ENV_BLOCKING(
  'SELECT * FROM `<my-project>.<my-dataset>.<my-env-blocking-input>`',
  ['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
    "pca_explained_variance_ratio":0.9,
    "nfolds_min":3,
    "nfolds_max":6,
    "kfold_index_column":"k_fold_index"
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.ENV_BLOCKING(
  'SELECT * FROM `<my-project>.<my-dataset>.<my-env-blocking-input>`',
  ['tavg', 'tmin', 'tmax', 'prec', 'srad', 'vapr', 'wind'],
  'geoid',
  '<my-project>.<my-dataset>.<my-output-prefix>',
  '''{
    "pca_explained_variance_ratio":0.9,
    "nfolds_min":3,
    "nfolds_max":6,
    "kfold_index_column":"k_fold_index"
  }
  '''
);
-- The table `<my-project>.<my-dataset>.<my-output-prefix>` will be created
```

{% endtab %}
{% endtabs %}

## KRING\_SPACETIME\_AGGREGATE <a href="#kring_spacetime_aggregate" id="kring_spacetime_aggregate"></a>

```sql
KRING_SPACETIME_AGGREGATE(input, index_column, date_column, variables, time_freq, output_table, options)
```

**Description**

This procedure can be used to aggregate space-time features within a spatio-temporal neighbourhood to extract new features. It does not resample the input data but allows you to compute statistics for each space-time location based on its neighboring data points. Data must have a temporal dimension and contain grid cell indexes of one of the supported types (H3, Quadbin). For data points to be aggregated with their neighbours, the user must select a k-ring size to define the neighbouring region.

**Input parameters**

* `input`: `STRING` the query or the fully qualified name of the table containing the data to be aggregated. It must contain the `index_column`, the `date_column`, and all the columns specified in `variables`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier of the spatial index, either 'H3' or 'QUADBIN'.
* `date_column`: `STRING` the name of the column with the timestamp identifier. This column can be of any type that can be cast to `DATETIME`.
* `variables`: `ARRAY<STRUCT<variable STRING, aggregation STRING>>` the list with the columns and their corresponding aggregation method (`sum`, `avg`, `max`, `min`, `count`, `mode`, `perc25`,`perc50`,`perc75`,`perc95`). The choice of aggregation function depends on the variable's nature: numerical variables typically use functions like sum, mean, or median (`perc50`), while categorical variables use modes or counts.
* `time_freq`: `STRING` the temporal frequency of the data selected from one of the following: `second`, `minute`, `hour`, `day`, `week`, `month`, `quarter`, `year`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists.
* `options`: `STRING` a JSON with the different options. If options is set to `NULL` then all options are set to default.

  | Option               | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Default value |
  | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------- |
  | `kring_size`         | `INT64` the size of the k-ring used to define the neighbours in the spatial domain; that is, the spatial bandwidth. It defines the spatial area (spatial window) around each cell that will be taken into account to aggregate each variable. If set to `0`, only the time domain is considered for aggregating the data. | 1             |
  | `time_bw`            | `INT64` the size of the k-ring used to define the neighbours in the time domain; that is, the temporal bandwidth. It defines the time range (time window) around each cell that will be taken into account to aggregate each variable. If set to `0`, only the spatial domain is considered for aggregating the data. | 1             |
  | `window_alignment`   | <p><code>STRING</code> the method used to create the time-based window around each data point that includes its neighboring values. Specifically, for a given point in time, the k-ring encompasses all values within a defined range that can be:</p><ul><li><code>centered</code>: takes all previous and subsequent values within the temporal k-ring, i.e.: <code>\[- time\_bw, time\_bw]</code>.</li><li><code>backward</code>: takes all previous values within the temporal k-ring, i.e.: <code>\[- time\_bw, 0]</code>.</li><li><code>forward</code>: takes all subsequent values within the temporal k-ring, i.e.: <code>\[0, time\_bw]</code>.</li></ul> | `centered`    |
  | `keep_input_columns` | `BOOL` a boolean to specify whether to preserve the original values in `input` or to output only the newly aggregated variables. | True          |

**Return type**

The results are stored in the table named `<output_table>`, which contains the columns in `input` plus the aggregated variables, named `<variable>_<aggregation>`.
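
For example, with the aggregation list `[('var1','sum'), ('var2','count')]`, the output columns follow the `<variable>_<aggregation>` convention and can be queried as (table name is a placeholder):

```sql
SELECT h3, date, var1_sum, var2_count
FROM `<my-project>.<my-dataset>.<my-output-table>`
ORDER BY h3, date;
```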

**Examples**

This procedure call aggregates variables `var1` and `var2` within a space-time neighbouring region using a backward time window (i.e. only considering past information): it sums all neighbouring `var1` values, counts all neighbouring `var2` cells, and counts all neighbouring `var2` cells with values larger than 10 (encoded in `var3`).

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPACETIME_AGGREGATE(
    "SELECT h3, date, var1, var2, CASE WHEN var2 > 10 THEN 1 ELSE 0 END AS var3 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    [('var1','sum'), ('var2','count'), ('var3','count')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "time_bw":2,
        "window_alignment":"backward"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPACETIME_AGGREGATE(
    "SELECT h3, date, var1, var2, CASE WHEN var2 > 10 THEN 1 ELSE 0 END AS var3 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    [('var1','sum'), ('var2','count'), ('var3','count')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "time_bw":2,
        "window_alignment":"backward"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPACETIME_AGGREGATE(
    "SELECT h3, date, var1, var2, CASE WHEN var2 > 10 THEN 1 ELSE 0 END AS var3 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    [('var1','sum'), ('var2','count'), ('var3','count')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "time_bw":2,
        "window_alignment":"backward"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call aggregates variables `var1` and `var2` within a space-only neighbouring region.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPACETIME_AGGREGATE(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2,
        "time_bw":0
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPACETIME_AGGREGATE(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2,
        "time_bw":0
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPACETIME_AGGREGATE(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2,
        "time_bw":0
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call aggregates variables `var1` and `var2` within a time-only neighbouring region.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPACETIME_AGGREGATE(
    "SELECT quadbin, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    'date',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":0,
        "time_bw":2,
        "window_alignment":"centered"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPACETIME_AGGREGATE(
    "SELECT quadbin, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    'date',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":0,
        "time_bw":2,
        "window_alignment":"centered"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPACETIME_AGGREGATE(
    "SELECT quadbin, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    'date',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":0,
        "time_bw":2,
        "window_alignment":"centered"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## KRING\_SPACETIME\_SMOOTHING <a href="#kring_spacetime_smoothing" id="kring_spacetime_smoothing"></a>

```sql
KRING_SPACETIME_SMOOTHING(input, index_column, date_column, value_columns, time_freq, output_table, options)
```

**Description**

This procedure can be used to smooth space-time features over space and/or time, aiming to reduce noise or variability in the data while preserving important underlying trends. Data must have a temporal dimension and contain grid cell indexes of one of the supported types (H3, Quadbin). For data points to be averaged with their neighbours, the user must select a k-ring size and a decay function to weight the data of the neighbouring cells according to their distance in both time and space (a.k.a. neighboring order).

**Input parameters**

* `input`: `STRING` the query or the fully qualified name of the table containing the data to be smoothed. It must contain the `index_column`, the `date_column`, and all the columns specified in the `value_columns`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier of the spatial index, either 'H3' or 'QUADBIN'.
* `date_column`: `STRING` the name of the column with the timestamp identifier. This column can be of any type that can be cast to `DATETIME`.
* `value_columns`: `ARRAY<STRING>` the name of the column(s) with the variable(s) for which the space-time smoothing should be applied. If more than one column is selected, the same smoothing parameters specified in `options` will be applied to all.
* `time_freq`: `STRING` the temporal frequency of the data selected from one of the following: `second`, `minute`, `hour`, `day`, `week`, `month`, `quarter`, `year`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists.
* `options`: `STRING` a JSON with the different options. If options is set to `NULL` then all options are set to default.

  | Option               | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Default value |
  | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
  | `kring_size`         | `INT64` the size of the k-ring used to define the neighbours in the spatial domain; that is, the spatial bandwidth. It defines the spatial area (spatial window) around each cell that will be taken into account to average each variable. If set to `0`, only the time domain is considered for smoothing the data. | 1             |
  | `time_bw`            | `INT64` the size of the k-ring used to define the neighbours in the time domain; that is, the temporal bandwidth. It defines the time range (time window) around each cell that will be taken into account to average each variable. If set to `0`, only the spatial domain is considered for smoothing the data. | 1             |
  | `kernel`             | `STRING` the [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) used to compute the weights across the spatial k-ring, which determine how much influence each point within the window has on the smoothed value, based on its distance from the center of the window. All kernel values beyond the bandwidth (`kring_size`) are set to 0. Available functions are: `uniform` (all points in each neighborhood are weighted equally), `bounded_triangular`, `bounded_quadratic`, `bounded_quartic`, `bounded_gaussian`, `gaussian`, `inverse`, `inverse_square` and `exponential`. Those including the prefix `bounded_` are functions whose weights are (close to) zero when the distance equals the bandwidth (i.e. the outermost neighbours). | `uniform`     |
  | `kernel_time`        | `STRING` the [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) used to compute the weights across the temporal k-ring, which determine how much influence each point within the window has on the smoothed value, based on its distance from the center of the window. All kernel values beyond the bandwidth (`time_bw`) are set to 0. Available functions are: `uniform` (all points in each neighborhood are weighted equally), `bounded_triangular`, `bounded_quadratic`, `bounded_quartic`, `bounded_gaussian`, `gaussian`, `inverse`, `inverse_square` and `exponential`. Those including the prefix `bounded_` are functions whose weights are (close to) zero when the distance equals the bandwidth (i.e. the outermost neighbours). | `uniform`     |
  | `window_alignment`   | <p><code>STRING</code> the method used to create the time-based window around each data point that includes its neighboring values. Specifically, for a given point in time, the k-ring encompasses all values within a defined range that can be:</p><ul><li><code>centered</code>: takes all previous and subsequent values within the temporal k-ring, i.e.: <code>\[- time\_bw, time\_bw]</code>.</li><li><code>backward</code>: takes all previous values within the temporal k-ring, i.e.: <code>\[- time\_bw, 0]</code>.</li><li><code>forward</code>: takes all subsequent values within the temporal k-ring, i.e.: <code>\[0, time\_bw]</code>.</li></ul>                                                                                          | `centered`    |
  | `keep_input_columns` | `BOOL` a boolean to specify whether to preserve the original values in `input` or to output only the newly smoothed variables. | True          |

**Return type**

The results are stored in the table named `<output_table>`, which contains the columns in `input` plus the smoothed variables, named `<value_column>_smooth`.
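
For example, assuming `keep_input_columns` is left at its default (`True`), the raw and smoothed values can be compared directly (a sketch with a placeholder table name):

```sql
-- Inspect how much the smoothing changed each value.
SELECT h3, date, var1, var1_smooth, var1_smooth - var1 AS smoothing_delta
FROM `<my-project>.<my-dataset>.<my-output-table>`
ORDER BY ABS(var1_smooth - var1) DESC;
```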

**Examples**

This procedure call smooths variables `var1` and `var2` by averaging the values within a space-time neighbouring region, using a backward time window (i.e. only considering past information).

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPACETIME_SMOOTHING(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "time_bw":2,
        "kernel":"gaussian",
        "kernel_time":"uniform",
        "window_alignment":"backward"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPACETIME_SMOOTHING(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "time_bw":2,
        "kernel":"gaussian",
        "kernel_time":"uniform",
        "window_alignment":"backward"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPACETIME_SMOOTHING(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "time_bw":2,
        "kernel":"gaussian",
        "kernel_time":"uniform",
        "window_alignment":"backward"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call smooths variables `var1` and `var2` by averaging the values within a space-only neighbouring region.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPACETIME_SMOOTHING(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2,
        "time_bw":0,
        "kernel":"gaussian"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPACETIME_SMOOTHING(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2,
        "time_bw":0,
        "kernel":"gaussian"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPACETIME_SMOOTHING(
    "SELECT h3, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2,
        "time_bw":0,
        "kernel":"gaussian"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call smooths variables `var1` and `var2` by averaging the values within a time-only neighbouring region.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPACETIME_SMOOTHING(
    "SELECT quadbin, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":0,
        "time_bw":2,
        "kernel_time":"exponential",
        "window_alignment":"centered"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPACETIME_SMOOTHING(
    "SELECT quadbin, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":0,
        "time_bw":2,
        "kernel_time":"exponential",
        "window_alignment":"centered"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPACETIME_SMOOTHING(
    "SELECT quadbin, date, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    'date',
    ['var1', 'var2'],
    'MONTH',
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":0,
        "time_bw":2,
        "kernel_time":"exponential",
        "window_alignment":"centered"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## KRING\_SPATIAL\_AGGREGATE <a href="#kring_spatial_aggregate" id="kring_spatial_aggregate"></a>

```sql
KRING_SPATIAL_AGGREGATE(input, index_column, variables, output_table, options)
```

**Description**

This procedure can be used to aggregate spatial features within a neighbourhood to extract new features. It does not resample the input data but allows you to compute statistics for each location based on its neighbouring data points. Data must contain grid cell indexes of one of the supported types (H3, Quadbin). For data points to be aggregated with their neighbours, the user must select a k-ring size to define the neighbouring region.

**Input parameters**

* `input`: `STRING` the query or the fully qualified name of the table containing the data to be aggregated. It must contain the `index_column` and all the columns specified in `variables`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier of the spatial index, either 'H3' or 'QUADBIN'.
* `variables`: `ARRAY<STRUCT<variable STRING, aggregation STRING>>` the list with the columns and their corresponding aggregation method (`sum`, `avg`, `max`, `min`, `count`, `mode`, `perc25`, `perc50`, `perc75`, `perc95`). The choice of aggregation method depends on the variable's nature: numerical variables typically use functions like `sum`, `avg` or `perc50` (median), while categorical variables use `mode` or `count`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists.
* `options`: `STRING` a JSON with the different options. If options is set to `NULL` then all options are set to default.

  | Option               | Description                                                                                                                                                                                                                                  | Default value |
  | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
  | `kring_size`         | `INT64` the size of the k-ring used to define the neighbours in the spatial domain; that is, the spatial bandwidth. It defines the spatial area (spatial window) around each cell that will be taken into account to aggregate each variable. | 1             |
  | `keep_input_columns` | `BOOL` a boolean to specify whether to preserve the original values in `input` or to output only the newly aggregated variables.                                                                                                             | True          |

**Return type**

The results are stored in the table named `<output_table>`, which contains the columns in `input` plus the aggregated variables as `<variable>_<aggregation>`.
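
Conceptually, each output value is obtained by collecting all values found within a cell's k-ring and applying the requested aggregation. The sketch below illustrates the aggregation methods in plain Python; it is illustrative only, and the percentile definition is an assumption (the toolbox may use a different interpolation):

```python
from collections import Counter
import statistics

def aggregate(values, method):
    """Apply one of the supported aggregation methods to the values
    collected from a cell's k-ring neighbourhood (illustrative only)."""
    if method == "sum":
        return sum(values)
    if method == "avg":
        return sum(values) / len(values)
    if method == "max":
        return max(values)
    if method == "min":
        return min(values)
    if method == "count":
        return len(values)
    if method == "mode":
        # most frequent value; ties resolved by first occurrence
        return Counter(values).most_common(1)[0][0]
    if method.startswith("perc"):
        p = int(method[4:])  # perc25, perc50, perc75, perc95
        return statistics.quantiles(values, n=100, method="inclusive")[p - 1]
    raise ValueError(f"unsupported aggregation: {method}")

# For instance, ('var1', 'sum') over a k-ring holding values [2, 3, 5]
# produces the column var1_sum with value 10 for that cell.
```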

**Examples**

This procedure call aggregates `var1` and `var2` within a neighbouring region: it adds up all neighbouring `var1` values, counts all neighbouring `var2` cells and counts all neighbouring cells whose `var2` value is larger than 10 (encoded as `var3`).

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPATIAL_AGGREGATE(
    "SELECT h3, var1, var2, CASE WHEN var2 > 10 THEN 1 ELSE 0 END AS var3 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    [('var1','sum'), ('var2','count'), ('var3','count')],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPATIAL_AGGREGATE(
    "SELECT h3, var1, var2, CASE WHEN var2 > 10 THEN 1 ELSE 0 END AS var3 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    [('var1','sum'), ('var2','count'), ('var3','count')],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPATIAL_AGGREGATE(
    "SELECT h3, var1, var2, CASE WHEN var2 > 10 THEN 1 ELSE 0 END AS var3 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    [('var1','sum'), ('var2','count'), ('var3','count')],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call aggregates variables `var1` and `var2` within a neighbouring region of k-ring size 2, computing the sum of `var1` and the count and mode of `var2`.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPATIAL_AGGREGATE(
    "SELECT h3, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPATIAL_AGGREGATE(
    "SELECT h3, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPATIAL_AGGREGATE(
    "SELECT h3, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":2
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call aggregates variables `var1` and `var2` within a neighbouring region, using the default options.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPATIAL_AGGREGATE(
    "SELECT quadbin, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    '<my-project>.<my-dataset>.<my-output-table>',
    NULL
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPATIAL_AGGREGATE(
    "SELECT quadbin, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    '<my-project>.<my-dataset>.<my-output-table>',
    NULL
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPATIAL_AGGREGATE(
    "SELECT quadbin, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    [('var1','sum'), ('var2','count'), ('var2','mode')],
    '<my-project>.<my-dataset>.<my-output-table>',
    NULL
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## KRING\_SPATIAL\_SMOOTHING <a href="#kring_spatial_smoothing" id="kring_spatial_smoothing"></a>

```sql
KRING_SPATIAL_SMOOTHING(input, index_column, value_columns, output_table, options)
```

**Description**

This procedure can be used to smooth features over space, aiming to reduce noise or variability in the data while maintaining important underlying trends. Data must contain grid cell indexes of one of the supported types (H3, Quadbin). For data points to be averaged with their neighbours, the user must select a k-ring size and a decay function to weight the data of the neighbouring cells according to their distance (i.e. their neighbouring order).

**Input parameters**

* `input`: `STRING` the query or the fully qualified name of the table containing the data to be smoothed. It must contain the `index_column`, and all the columns specified in the `value_columns`.
* `index_column`: `STRING` the name of the column with the unique geographic identifier of the spatial index, either 'H3' or 'QUADBIN'.
* `value_columns`: `ARRAY<STRING>` the name of the column(s) with the variable(s) to which the spatial smoothing should be applied. If more than one column is selected, the same smoothing parameters specified in `options` will be applied to all.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`. The process will fail if the table already exists.
* `options`: `STRING` a JSON with the different options. If options is set to `NULL` then all options are set to default.

  | Option               | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Default value |
  | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
  | `kring_size`         | `INT64` the size of the k-ring used to define the neighbours in the spatial domain; that is, the spatial bandwidth. It defines the spatial area (spatial window) around each cell that will be taken into account to average each variable.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 1             |
  | `kernel`             | `STRING` the [kernel function](https://en.wikipedia.org/wiki/Kernel_\(statistics\)) to compute the weights across the spatial kring, which determine how much influence each point within the window has on the smoothed value, based on its distance from the center of the window. All kernel values beyond the bandwidth (`kring_size`) are set to 0. Available functions are: `uniform` (all points in each neighbourhood are weighted equally), `bounded_triangular`, `bounded_quadratic`, `bounded_quartic`, `bounded_gaussian`, `gaussian`, `inverse`, `inverse_square` & `exponential`. Those including the prefix `bounded_` are functions whose weights are (close to) zero when the distance equals the bandwidth (i.e. the outermost neighbours). | `uniform`     |
  | `keep_input_columns` | `BOOL` a boolean to specify whether to preserve the original values in `input` or to output only the newly smoothed variables.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | True          |
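
The kernels listed above differ only in how fast the weight decays with k-ring distance. A hedged sketch of a few of them follows; these are standard textbook forms, not necessarily the exact expressions used by the toolbox:

```python
import math

def kernel_weight(distance, bandwidth, kernel="uniform"):
    """Weight of a neighbour at a given k-ring distance (0 = center cell).
    All kernels are truncated at the bandwidth (kring_size)."""
    if distance > bandwidth:
        return 0.0
    if kernel == "uniform":
        return 1.0
    if kernel == "bounded_triangular":
        return 1.0 - distance / bandwidth  # reaches 0 at the bandwidth
    if kernel == "gaussian":
        return math.exp(-0.5 * distance ** 2)  # still positive at the edge
    if kernel == "inverse":
        return 1.0 / (1.0 + distance)
    raise ValueError(f"kernel not sketched here: {kernel}")

def smooth(neighbours, bandwidth, kernel="uniform"):
    """Weighted average over (distance, value) pairs within the k-ring."""
    weights = [kernel_weight(d, bandwidth, kernel) for d, _ in neighbours]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)
```

With the default `uniform` kernel this reduces to a plain mean over the neighbourhood.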

**Return type**

The results are stored in the table named `<output_table>`, which contains the columns in `input` plus the smoothed variables as `<value_column>_smooth`.

**Examples**

This procedure call smooths variables `var1` and `var2` by averaging the values within a spatial neighbouring region.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPATIAL_SMOOTHING(
    "SELECT h3, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    ['var1', 'var2'],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "kernel":"gaussian"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPATIAL_SMOOTHING(
    "SELECT h3, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    ['var1', 'var2'],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "kernel":"gaussian"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPATIAL_SMOOTHING(
    "SELECT h3, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'h3',
    ['var1', 'var2'],
    '<my-project>.<my-dataset>.<my-output-table>',
    '''{
        "kring_size":3,
        "kernel":"gaussian"
    }'''
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

This procedure call smooths variables `var1` and `var2` by averaging the values within a spatial neighbouring region, using the default options.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.KRING_SPATIAL_SMOOTHING(
    "SELECT quadbin, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    ['var1', 'var2'],
    '<my-project>.<my-dataset>.<my-output-table>',
    NULL
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.KRING_SPATIAL_SMOOTHING(
    "SELECT quadbin, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    ['var1', 'var2'],
    '<my-project>.<my-dataset>.<my-output-table>',
    NULL
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.KRING_SPATIAL_SMOOTHING(
    "SELECT quadbin, var1, var2 FROM `<my-project>.<my-dataset>.<my-input-table>`",
    'quadbin',
    ['var1', 'var2'],
    '<my-project>.<my-dataset>.<my-output-table>',
    NULL
)
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

## SPACETIME\_HOTSPOTS\_CLASSIFICATION <a href="#spacetime_hotspots_classification" id="spacetime_hotspots_classification"></a>

```sql
SPACETIME_HOTSPOTS_CLASSIFICATION(input, output, index_column, date_column, gi_column, p_value_column, options)
```

**Description**

This procedure is designed to analyze spatio-temporal data in order to identify and categorize locations based on their hotspot or coldspot status over time. Utilizing z-score values generated by the Space-Time Getis-Ord function (see [`GETIS_ORD_SPACETIME_H3`](#getis_ord_spacetime_h3)), and applying either the [original Mann-Kendall](https://vsp.pnnl.gov/help/vsample/design_trend_mann_kendall.htm) or [modified Mann-Kendall](https://doi.org/10.1016/S0022-1694\(97\)00125-X) trend test on these values, it categorizes each location into specific types of hotspots or coldspots. This categorization is based on patterns of spatial clustering and intensity trends over the observed time intervals. The categories are listed in the following table along with their descriptions.

| Category                 | Description                                                                                                                                                                                                                           |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Undetected Pattern`     | This category applies to locations that do not exhibit any discernible patterns of hot or cold activity as defined in subsequent categories.                                                                                          |
| `Incipient Hotspot`      | This denotes a location that has become a significant hotspot only in the latest observed time step, without any prior history of significant hotspot activity.                                                                       |
| `Sequential Hotspot`     | Identifies a location experiencing an unbroken series of significant hotspot activity leading up to the most recent time step, provided it had no such activity beforehand and less than 90% of all observed intervals were hotspots. |
| `Strengthening Hotspot`  | A location consistently identified as a hotspot in at least 90% of time steps, including the last, where there's a statistically significant upward trend in activity intensity.                                                      |
| `Stable Hotspot`         | Represents a location maintaining significant hotspot status in at least 90% of time steps without showing a clear trend in activity intensity changes over time.                                                                     |
| `Declining Hotspot`      | A location that has consistently been a hotspot in at least 90% of time steps, including the most recent one, but shows a statistically significant decrease in the intensity of its activity.                                        |
| `Occasional Hotspot`     | Locations that sporadically become hotspots, with less than 90% of time steps marked as significant hotspots and no instances of being a significant coldspot.                                                                         |
| `Fluctuating Hotspot`    | Marks a location as a significant hotspot in the latest time step that has also experienced significant coldspot phases in the past, with less than 90% of intervals as significant hotspots.                                         |
| `Legacy Hotspot`         | A location that isn't currently a hotspot but was significantly so in at least 90% of past intervals.                                                                                                                                 |
| `Incipient Coldspot`     | Identifies a location that is marked as a significant coldspot for the first time in the latest observed interval, without any previous history of significant coldspot status.                                                       |
| `Sequential Coldspot`    | A location with a continuous stretch of significant coldspot activity leading up to the latest interval, provided it wasn't identified as a coldspot before this streak and less than 90% of intervals were marked as coldspots.      |
| `Strengthening Coldspot` | A location identified as a coldspot in at least 90% of observed intervals, including the most recent, where there's a statistically significant increase in the intensity of low activity.                                            |
| `Stable Coldspot`        | A location that has been a significant coldspot in at least 90% of intervals without any discernible trend in the intensity of low activity over time.                                                                                |
| `Declining Coldspot`     | Locations that have been significant coldspots in at least 90% of time steps, including the latest, but show a significant decrease in low activity intensity.                                                                        |
| `Occasional Coldspot`    | Represents locations that sporadically become significant coldspots, with less than 90% of time steps marked as significant coldspots and no instances of being a significant hotspot.                                                  |
| `Fluctuating Coldspot`   | A location marked as a significant coldspot in the latest interval that has also been a significant hotspot in past intervals, with less than 90% of intervals marked as significant coldspots.                                         |
| `Legacy Coldspot`        | Locations that are not currently coldspots but were significantly so in at least 90% of past intervals.                                                                                                                               |
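
To make the decision rules concrete, here is a toy classifier for the hotspot side of the table, driven by one significance flag per time step. It is a simplification for illustration only: it assumes `+1` marks a significant hotspot and `-1` a significant coldspot, sketches just a subset of the categories (e.g. `Sequential Hotspot` and all coldspot categories are omitted), and is not the toolbox's actual decision tree.

```python
def classify_hotspot(flags, tau=0.0, tau_p=1.0, alpha=0.05):
    """flags holds one entry per time step: +1 significant hotspot,
    -1 significant coldspot, 0 neither (toy model)."""
    hot_share = sum(f == 1 for f in flags) / len(flags)
    last_hot = flags[-1] == 1
    if hot_share >= 0.9:
        if not last_hot:
            return "Legacy Hotspot"
        if tau_p < alpha and tau > 0:
            return "Strengthening Hotspot"
        if tau_p < alpha and tau < 0:
            return "Declining Hotspot"
        return "Stable Hotspot"
    if last_hot:
        if any(f == -1 for f in flags[:-1]):
            return "Fluctuating Hotspot"
        if not any(f == 1 for f in flags[:-1]):
            return "Incipient Hotspot"
        return "Occasional Hotspot"
    if any(f == 1 for f in flags):
        return "Occasional Hotspot"
    return "Undetected Pattern"
```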

**Input parameters**

The input parameters to this procedure are:

* `input`: `STRING` the query to the data used to compute the coefficient. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `index_column`: `STRING` name of the column with the spatial indexes.
* `date_column`: `STRING` name of the column with the date.
* `gi_column`: `STRING` name of the column with the getis ord values.
* `p_value_column`: `STRING` name of the column with the p\_value associated with the getis ord values.
* `options`: `STRING` containing a valid JSON with the different options. Valid options are described in the table below. If options is set to NULL then all options are set to default.

  | Option      | Description                                                                                                                                                              |
  | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
  | `threshold` | `FLOAT64` the threshold of the p-value for a location to be considered a hotspot/coldspot. Default is 0.05.                                                              |
  | `algorithm` | `STRING` the algorithm to be used for the monotonic trend test. It can be either `mk` for the original Mann-Kendall test or `mmk` for the modified one. Default is `mk`. |

**Output**

The output table contains the following fields:

* `index_column`: the spatial index column, with the same name and type as the `index_column` specified in the input.
* `classification`: `STRING` one of the categories described in the table above.
* `tau`: `FLOAT64` the tau statistic of the trend test.
* `tau_p`: `FLOAT64` the p-value associated with the tau statistic. A value of 2 indicates that the trend test failed.
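
For reference, the original Mann-Kendall statistic can be sketched in a few lines. This is a plain implementation without tie or autocorrelation corrections, so it only approximates the `mk`/`mmk` tests used internally:

```python
import math

def mann_kendall(series):
    """Original Mann-Kendall trend test (no tie correction), returning
    (tau, two-sided p-value) via the normal approximation."""
    n = len(series)
    # S statistic: signs of all pairwise differences, later vs earlier samples
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    z = (s - (s > 0) + (s < 0)) / math.sqrt(var_s)  # continuity correction
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return tau, p
```

A strictly increasing series yields tau = 1 with a small p-value, while a flat series yields tau = 0 with p = 1.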

**Examples**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "quadbin",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.05, "algorithm": "mk"}'
)
-- It will return a table with columns quadbin, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "quadbin",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.05, "algorithm": "mk"}'
)
-- It will return a table with columns quadbin, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "quadbin",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.05, "algorithm": "mk"}'
)
-- It will return a table with columns quadbin, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "geoid",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.1, "algorithm": "mk"}'
)
-- It will return a table with columns geoid, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "geoid",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.1, "algorithm": "mk"}'
)
-- It will return a table with columns geoid, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "geoid",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.1, "algorithm": "mk"}'
)
-- It will return a table with columns geoid, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "h3",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.01, "algorithm": "mmk"}'
)
-- It will return a table with columns h3, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "h3",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.01, "algorithm": "mmk"}'
)
-- It will return a table with columns h3, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    "<my-project>.<my-dataset>.<my-input-table>",
    "<my-project>.<my-dataset>.<my-output-table>",
    "h3",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.01, "algorithm": "mmk"}'
)
-- It will return a table with columns h3, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    ('SELECT h3, date, gi, p_value FROM `<my-project>.<my-dataset>.<my-table>`'),
    "<my-project>.<my-dataset>.<my-output-table>",
    "h3",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.01, "algorithm": "mmk"}'
)
-- It will return a table with columns h3, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    ('SELECT h3, date, gi, p_value FROM `<my-project>.<my-dataset>.<my-table>`'),
    "<my-project>.<my-dataset>.<my-output-table>",
    "h3",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.01, "algorithm": "mmk"}'
)
-- It will return a table with columns h3, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.SPACETIME_HOTSPOTS_CLASSIFICATION(
    ('SELECT h3, date, gi, p_value FROM `<my-project>.<my-dataset>.<my-table>`'),
    "<my-project>.<my-dataset>.<my-output-table>",
    "h3",
    "date",
    "gi",
    "p_value",
    '{"threshold" :0.01, "algorithm": "mmk"}'
)
-- It will return a table with columns h3, classification STRING, tau FLOAT64, tau_p FLOAT64
```

{% endtab %}
{% endtabs %}

## TIME\_SERIES\_CLUSTERING <a href="#time_series_clustering" id="time_series_clustering"></a>

```sql
TIME_SERIES_CLUSTERING(
    input,
    output_table,
    partitioning_column,
    ts_column,
    value_column,
    options
)
```

**Description**

This procedure provides a way to group different time series together based on several methods that are use-case dependent. The user can choose the number of clusters. Please read the method definitions below carefully to understand their usage and the caveats each method may have.

The data passed to the procedure must be structured using two different columns that will serve as indices:

* A **unique ID per time series** (`partitioning_column`), which can be a spatial index, a location unique ID (for instance a POI, store, point of sale, antenna, asset, etc.) or any other ID that uniquely identifies each time series.
* A **timestamp** (`ts_column`), that will identify each of the time steps within each and all series.

All these methods require the series to be aligned and grouped, so that **there is one and only one observation per combination of ID and timestamp**; the input data cannot have missing values (a series missing a value at the nth timestep) or multiple values (a series with several values at the nth timestep). Since each series may require different treatment, this preprocessing step is left to the user.
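
A quick way to validate this requirement before calling the procedure is to count observations per (ID, timestamp) pair. The helper below is hypothetical, not part of the toolbox:

```python
from collections import Counter

def check_alignment(rows):
    """Verify the one-observation-per-(ID, timestamp) requirement.

    `rows` is an iterable of (series_id, timestamp) pairs. Returns the
    duplicated keys and the keys missing from some series, so the caller
    can decide how to resample or impute (illustrative only)."""
    counts = Counter(rows)
    duplicates = [k for k, c in counts.items() if c > 1]
    ids = {i for i, _ in counts}
    timesteps = {t for _, t in counts}
    missing = [(i, t) for i in ids for t in timesteps if (i, t) not in counts]
    return duplicates, missing
```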

The example below re-samples a daily time series into a weekly sampling rate and imputes all missing values (if any).

{% tabs %}
{% tab title="carto-un" %}

```sql
WITH
    aligned_idx AS (
        SELECT
            ticket_week,
            shop_id
        FROM
            (
                SELECT DISTINCT
                    DATE_TRUNC(ticket_date, WEEK(MONDAY)) AS ticket_week
                FROM
                    `<my-project>.<my-dataset>.<my-table>`
            ),
            (
                SELECT DISTINCT
                    shop_id
                FROM
                    `<my-project>.<my-dataset>.<my-table>`
            )
    )
SELECT
    idx.shop_id,
    idx.ticket_week,
    COALESCE(SUM(info.ticket_amount), 0) AS total_sales
    -- Please note that the sales are an extensive variable; an intensive
    -- variable would require different aggregation and imputation methods
FROM
    aligned_idx idx
LEFT JOIN
    `<my-project>.<my-dataset>.<my-table>` info
ON
    idx.ticket_week = DATE_TRUNC(info.ticket_date, WEEK(MONDAY))
    AND idx.shop_id = info.shop_id
GROUP BY
    idx.shop_id,
    idx.ticket_week
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
WITH
    aligned_idx AS (
        SELECT
            ticket_week,
            shop_id
        FROM
            (
                SELECT DISTINCT
                    DATE_TRUNC(ticket_date, WEEK(MONDAY)) AS ticket_week
                FROM
                    `<my-project>.<my-dataset>.<my-table>`
            ),
            (
                SELECT DISTINCT
                    shop_id
                FROM
                    `<my-project>.<my-dataset>.<my-table>`
            )
    )
SELECT
    idx.shop_id,
    idx.ticket_week,
    COALESCE(SUM(info.ticket_amount), 0) AS total_sales
    -- Please note that the sales are an extensive variable; an intensive
    -- variable would require different aggregation and imputation methods
FROM
    aligned_idx idx
LEFT JOIN
    `<my-project>.<my-dataset>.<my-table>` info
ON
    idx.ticket_week = DATE_TRUNC(info.ticket_date, WEEK(MONDAY))
    AND idx.shop_id = info.shop_id
GROUP BY
    idx.shop_id,
    idx.ticket_week
```

{% endtab %}

{% tab title="manual" %}

```sql
WITH
    aligned_idx AS (
        SELECT
            ticket_week,
            shop_id
        FROM
            (
                SELECT DISTINCT
                    DATE_TRUNC(ticket_date, WEEK(MONDAY)) AS ticket_week
                FROM
                    `<my-project>.<my-dataset>.<my-table>`
            ),
            (
                SELECT DISTINCT
                    shop_id
                FROM
                    `<my-project>.<my-dataset>.<my-table>`
            )
    )
SELECT
    idx.shop_id,
    idx.ticket_week,
    COALESCE(SUM(info.ticket_amount), 0) AS total_sales
    -- Please note that sales are an extensive variable; an intensive
    -- variable would require different aggregation and imputation methods
FROM
    aligned_idx idx
LEFT JOIN
    `<my-project>.<my-dataset>.<my-table>` info
ON
    idx.ticket_week = DATE_TRUNC(info.ticket_date, WEEK(MONDAY))
    AND idx.shop_id = info.shop_id
GROUP BY
    idx.shop_id,
    idx.ticket_week
```

{% endtab %}
{% endtabs %}

The procedure can use any of the methods explained below to cluster the series. There is no single correct method, only different approaches suited to different use cases. Please take a look at [the official BigQuery guide for working with time series](https://cloud.google.com/bigquery/docs/working-with-time-series), since some of the functions and examples provided there may be useful for this kind of preprocessing.

* **Value characteristic**: groups the series based on how similar they are point-wise, that is, sample by sample. This should return intuitive results: differences in scale will tend to split the series into larger and smaller ones, while series whose points fall in similar ranges will be grouped together. Another way to think of this method: the closer the points of two series lie when plotted on a graph, the more likely the series are to end up in the same group. **Note**: this function uses the [Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance) internally, where the dimensions are the timesteps; it will therefore suffer from the [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) for a very large number of timesteps. To detect this issue, start the analysis with a coarse time aggregation (e.g. monthly) and increase the temporal resolution while inspecting any changes in the classification.
* **Profile characteristic**: groups the series based on how similar their dynamics are, that is, how similar their step-to-step changes are. This method does not take scale into account but rather the correlation between series, grouping them together if their changes are similar. For example, two stores will be grouped together if their sales peak on weekends, even if their scales differ by orders of magnitude.
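The contrast between the two characteristics can be illustrated outside SQL. The sketch below (in Python, using hypothetical data; it is not part of the toolbox) compares the Euclidean distance on raw values with the distance on standardized step-to-step changes for two series that share the same dynamics at very different scales:

```python
import numpy as np

# Two weekly sales series with identical dynamics but very different scales
small_shop = np.array([10.0, 12.0, 30.0, 28.0, 11.0, 13.0, 31.0, 29.0])
big_shop = small_shop * 1000

def profile(series):
    # Step-to-step changes, standardized: removes scale, keeps dynamics
    d = np.diff(series)
    return (d - d.mean()) / d.std()

# Value characteristic: Euclidean distance on the raw samples -> far apart
value_dist = np.linalg.norm(small_shop - big_shop)

# Profile characteristic: distance on standardized changes -> near zero,
# so the two shops would fall into the same cluster
profile_dist = np.linalg.norm(profile(small_shop) - profile(big_shop))

print(value_dist, profile_dist)
```

Under the value characteristic the two shops land in different clusters; under the profile characteristic they are nearly indistinguishable.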

**Input parameters**

* `input`: `STRING` the query to the data used to compute the clustering. A qualified table name can be given as well, e.g. `<my-project>.<my-dataset>.<my-table>`.
* `output_table`: `STRING` qualified name of the output table, e.g. `<my-project>.<my-dataset>.<my-output-table>`.
* `partitioning_column`: `STRING` name of the column with the time series IDs.
* `ts_column`: `STRING` name of the column with the date.
* `value_column`: `STRING` name of the column with the value per ID and timestep.
* `options`: `JSON` containing the advanced options for the procedure:
  * `method`: `STRING`, one of:
    * `value` for value characteristic,
    * `profile` for profile characteristic.
  * `n_clusters`: `INT` number of clusters to generate in the K-Means.
  * `model_name`: `STRING` name to use for the K-Means model object in BigQuery. Defaults to `<output_table>_model`.

**Output**

The results are stored in the table named `<output_table>`, which contains the following columns:

* The given `partitioning_column`, with one entry per unique value in the input;
* `cluster`, a `STRING` column with the associated cluster label.

**Example**

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.TIME_SERIES_CLUSTERING(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'date',
    'temperature',
    JSON '{ "method": "profile", "n_clusters": 6}'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.TIME_SERIES_CLUSTERING(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'date',
    'temperature',
    JSON '{ "method": "profile", "n_clusters": 6}'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.TIME_SERIES_CLUSTERING(
    '<my-project>.<my-dataset>.<my-input-table>',
    '<my-project>.<my-dataset>.<my-output-table>',
    'h3',
    'date',
    'temperature',
    JSON '{ "method": "profile", "n_clusters": 6}'
);
-- The table `<my-project>.<my-dataset>.<my-output-table>` will be created
```

{% endtab %}
{% endtabs %}

<img src="https://3029946802-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FybPdpmLltPkzGFvz7m8A%2Fuploads%2Fgit-blob-1c82685e4e434438152a8e3d867df996413489fe%2Feu-flag-website.png?alt=media&#x26;token=4343f6e5-973a-4e9a-8e14-50366a086f72" alt="EU flag" data-size="line">This project has received funding from the [European Union’s Horizon 2020](https://ec.europa.eu/programmes/horizon2020/en) research and innovation programme under grant agreement No 960401.
