statistics

This module contains functions to perform spatial statistics calculations.

P_VALUE

P_VALUE(z_score)

Description

This function computes the p-value (two-tails test) of a given z-score assuming the population follows a normal distribution where the mean is 0 and the standard deviation is 1. The z-score is a measure of how many standard deviations below or above the population mean a value is. It gives you an idea of how far from the mean a data point is. The p-value is the probability that a randomly sampled point has a value at least as extreme as the point whose z-score is being tested.

  • z_score: FLOAT64

Return type

FLOAT64

Example

SELECT `carto-un`.carto.P_VALUE(u) as p_value
    FROM UNNEST([-2,-1,0,1,2]) u;

-- [ 0.04550012577451279, 0.31731052766472745, 0.999999999, 0.31731052766472745, 0.04550012577451279 ]

KNN_TABLE

KNN_TABLE(input, output_table, geoid_col, geo_col, k)

Description

This procedure returns for each point the k-nearest neighbors of a given set of points.

  • input: STRING the query to the data used to compute the KNN. A qualified table name can be given as well: <project-id>.<dataset-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <project-id>.<dataset-id>.<table-name>.

  • geoid_col: STRING name of the column with unique ids.

  • geo_col: STRING name of the column with the geometries.

  • k: INT64 number of nearest neighbors (positive, typically small).

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • geo: GEOGRAPHY the geometry of the considered point.

  • geo_knn: GEOGRAPHY the k-nearest neighbor point.

  • geoid: STRING the unique identifier of the considered point.

  • geoid_knn: STRING the unique identifier of the k-nearest neighbor.

  • distance: FLOAT64 the k-nearest neighbor distance to the considered point.

  • knn: INT64 the k-order (knn).

Example

CALL `carto-un`.carto.KNN_TABLE(
    'myproject.mydataset.points',
    'myproject.mydataset.knn',
    'id',
    'geo',
    10
);

KNN

KNN(points, k)

Description

This function returns for each point the k-nearest neighbors of a given set of points.

  • points: ARRAY<STRUCT<geoid STRING, geo GEOGRAPHY>> input data with unique id and geography.

  • k: INT64 number of nearest neighbors (positive, typically small).

Return type

ARRAY<STRUCT<geo GEOGRAPHY, geo_knn GEOGRAPHY, geoid STRING, geoid_knn STRING, distance FLOAT64, knn INT64>>

where:

  • geo: the geometry of the considered point.

  • geo_knn: the k-nearest neighbor point.

  • geoid: the unique identifier of the considered point.

  • geoid_knn: the unique identifier of the k-nearest neighbor.

  • distance: the k-nearest neighbor distance to the considered point.

  • knn: the k-order (knn).

Example

SELECT *
FROM UNNEST((
    SELECT `carto-un`.carto.KNN(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.8225 41.97117)",
--  "geoid": "00000001",
--  "geoid_knn": "00000624",
--  "distance": "10.804663098937658",
--  "knn": "1"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.823 41.9712)",
--  "geoid": "00000001",
--  "geoid_knn": "00000666",
--  "distance": "30.66917920746894",
--  "knn": "2"
--},
--{
--  "geo": "POINT(2.82263 41.97118)",
--  "geo_knn": "POINT(2.82298 41.9713)",
--  "geoid": "00000001",
--  "geoid_knn": "00000618",
--  "distance": "31.863463704968353",
--  "knn": "3"
--},
-- ...

LOF_TABLE

LOF_TABLE(src_fullname STRING, target_fullname STRING, geoid_column_name STRING, geo_column_name STRING, k INT64)

Description

This procedure computes the Local Outlier Factor for each point of a specified column and stores the result in an output table along with the other input columns.

  • src_fullname: STRING The input table. A STRING of the form project-id.dataset-id.table-name is expected. The project-id can be omitted (in which case the default one will be used).

  • target_fullname: STRING The resulting table where the LOF will be stored. A STRING of the form project-id.dataset-id.table-name is expected. The project-id can be omitted (in which case the default one will be used). The dataset must exist and the caller needs to have permissions to create a new table in it. The process will fail if the target table already exists.

  • geoid_column_name: STRING The column name with a unique identifier for each point.

  • geo_column_name: STRING The column name containing the points.

  • lof_target_column_name: STRING The column name where the resulting Local Outlier Factor will be stored in the output table.

  • k: INT64 Number of nearest neighbors (positive, typically small).

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • geo: GEOGRAPHY the geometry of the considered point.

  • geoid: GEOGRAPHY the unique identifier of the considered point.

  • lof: FLOAT64 the Local Outlier Factor score.

Example

CALL `carto-un`.carto.LOF_TABLE(
  'bigquery-public-data.new_york_subway.stations',
  'myproject.mydataset.my_output_table',
  'station_id',
  'station_geom',
  'lof',
  10
);
-- The table `'myproject.mydataset.my_output_table` will be created
-- with an extra column containing the `lof` value.

LOF

LOF(points, k)

Description

This function computes the Local Outlier Factor of each point of a given set of points.

  • points: ARRAY<STRUCT<geoid STRING, geo GEOGRAPHY>> input data points with unique id and geography.

  • k: INT64 number of nearest neighbors (positive, typically small).

Return type

ARRAY<STRUCT<geo GEOGRAPHY, geoid GEOGRAPHY, lof FLOAT64>>

where:

  • geo: the geometry of the considered point.

  • geoid: the unique identifier of the considered point.

  • lof: the Local Outlier Factor score.

Example

SELECT *
FROM UNNEST((
    SELECT `carto-un`.carto.LOF(myarray, 10)
    FROM (
        SELECT ARRAY_AGG(STRUCT(format('%08x', uid),position_geom)) myarray
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY hour) AS uid, position_geom
            FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
            WHERE date = '2017-12-31'
        )
    )
))
ORDER BY geoid;
-- {"geo": POINT(2.82263 41.97118), "geoid": "00000001", "lof": 1.3217599116891428}
-- {"geo": POINT(2.35705 41.49786), "geoid": "00000002", "lof": 1.235551000737416}
-- {"geo": POINT(2.13967 41.3838), "geoid": "00000003", "lof": 1.1305674032876687}
-- ...

GFUN_TABLE

GFUN_TABLE(input, output_table, geo_col)

Description

This function computes the G-function of a given set of points.

  • input: STRING the query to the data used to compute the G-Function. A qualified table name can be given as well: <project-id>.<dataset-id>.<table-name>.

  • output_table: STRING qualified name of the output table: <project-id>.<dataset-id>.<table-name>.

  • geo_col: STRING name of the column with the geometries.

Output

The results are stored in the table named <output_table>, which contains the following columns:

  • distance: FLOAT64 the nearest neighbors distances.

  • gfun_G: FLOAT64 the empirical G evaluated for each distance in the support.

  • gfun_ev: FLOAT64 the theoretical Poisson G evaluated for each distance in the support.

Example

CALL `carto-un`.carto.GFUN_TABLE(
    'myproject.mydataset.points',
    'myproject.mydataset.knn',
    'geo'
);

GFUN

GFUN(points)

Description

This function computes the G-function of a given set of points.

  • points: ARRAY<GEOGRAPHY> input data points.

Return type

ARRAY<STRUCT<distance FLOAT64, gfun_G FLOAT64, gfun_ev FLOAT64>>

where:

  • distance: the nearest neighbors distances.

  • gfun_G: the empirical G evaluated for each distance in the support.

  • gfun_ev: the theoretical Poisson G evaluated for each distance in the support.

Example

SELECT *
FROM UNNEST((
    SELECT `carto-un`.carto.GFUN(myarray)
    FROM (
        SELECT ARRAY_AGG(position_geom) myarray
        FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
        WHERE date = '2017-12-31'
    )
))
ORDER BY distance;
--{
--  "distance": "38.599968853183",
--  "gfun_G": "0.4319167389418907",
--  "gfun_ev": "4.037383876246414E-4"
--},
--{
--  "distance": "77.199937706366",
--  "gfun_G": "0.5771899392888118",
--  "gfun_ev": "0.0016139757856029613"
--},
--{
--  "distance": "115.799906559549",
--  "gfun_G": "0.6522116218560278",
--  "gfun_ev": "0.003627782844736638"
--},
-- ...

CREATE_SPATIAL_COMPOSITE_SUPERVISED

CREATE_SPATIAL_COMPOSITE_SUPERVISED(input_query, index_column, output_prefix, options)

Description

This procedure derives a spatial composite score as the residuals of a regression model which is used to detect areas of under- and over-prediction. The response variable should be measurable and correlated with the set of variables defining the score. For each data point. the residual is defined as the observed value minus the predicted value. Rows with a NULL value in any of the individual variables are dropped.

Input parameters

  • input_query: STRING the query to the data used to compute the spatial composite. It must contain all the individual variables that should be included in the computation of the composite as well as a unique geographic id for each row. A qualified table name can be given as well, e.g. 'project-id.dataset-id.table-name'.

  • index_column: STRING the name of the column with the unique geographic identifier.

  • output_prefix: STRING the prefix for the output table. It should include project and dataset, e.g. 'project-id.dataset-id.table-name'.

  • options: STRING containing a valid JSON with the different options. Valid options are described below.

    • model_transform: STRING containing the TRANSFORM clause in a BigQuery ML CREATE MODEL statement. If NULL no TRANSFORM clause is applied.

    • model_options: JSON with the different options allowed by BigQuery ML CREATE MODEL statement for regression models. Any model is allowed as long as it can deal with numerical inputs for the response variable. At least the INPUT_LABEL_COLS and MODEL_TYPE parameters must be specified. By default, data will not be split into train and test (DATA_SPLIT_METHOD = 'NO_SPLIT'). Hyperparameter tuning is not currently supported.

    • r2_thr: FLOAT64 the minimum allowed value for the R2 model score. If the R2 of the regression model is lower than this threshold this implies poor fitting and a warning is raised. The default value is 0.5.

    • bucketize_method: STRING the method used to discretize the spatial composite score. The default value is NULL. Possible options are:

      • EQUAL_INTERVALS_ZERO_CENTERED: the values of the spatial composite score are discretized into buckets of equal widths centered in zero. The lower and upper limits are derived from the outliers-removed maximum of the absolute values of the score.

    • nbuckets: INT64 the number of buckets used when a bucketization method is specified. The default number of buckets is selected using Freedman and Diaconis’s (1981) rule. Ignored if bucketize_method is not specified.

    • remove_outliers: BOOL. When bucketize_method is specified, if remove_outliers is set to TRUE the buckets are derived from the oulier-removed data. The outliers are computed using Tukey’s fences k parameter for outlier detection. The default value is TRUE. For large inputs, setting this option to TRUE might cause a Query exceeds CPU resources error. Ignored if bucketize_method is not specified.

Return type

The results are stored in the table named <output_prefix>, which contains the following columns:

  • index_column: the unique geographic identifier. The type of this column depends on the type of index_column in input_query.

  • spatial_score: the value of the composite score. The type of this column is FLOAT64 if the score is not discretized and INT64 otherwise.

When the score is discretized by specifying the bucketize_method parameter, the procedure also returns a lookup table named <output_prefix>_lookup_table with the following columns:

  • lower_bound: FLOAT64 the lower bound of the bin.

  • upper_bound: FLOAT64 the upper bound of the bin.

  • spatial_score: INT64 the value of the (discretized) composite score.

Example

CALL `carto-un`.carto.CREATE_SPATIAL_COMPOSITE_SUPERVISED(
    'SELECT * FROM `cartobq.docs.spatial_scoring_input`',
    'geoid',
    '<project-id>.<dataset-id>.<table-name>',
    '''{
        "model_transform":[
            "revenue_change",
            "fempop_15_44, public_transport, education, pois, urbanity"
        ],
        "model_options":{
            "MODEL_TYPE":"LINEAR_REG",
            "INPUT_LABEL_COLS":["revenue_change"],
            "DATA_SPLIT_METHOD":"no_split",
            "OPTIMIZE_STRATEGY":"NORMAL_EQUATION",
            "CATEGORY_ENCODING_METHOD":"ONE_HOT_ENCODING"
        },
        "r2_thr":0.4,
        "bucketize_method":"EQUAL_INTERVALS_ZERO_CENTERED",
        "nbuckets":5
    }
    '''
)
-- Table `<my-project>.<my-dataset>.<table-name>` will be created