BigQuery ML

Extension Package provided by CARTO

The BigQuery ML extension package for CARTO Workflows includes a variety of components that enable users to integrate machine learning workflows with geospatial data. These components allow for creating, evaluating, explaining, forecasting, and managing ML models directly within CARTO Workflows, utilizing BigQuery ML’s capabilities.

Create Classification Model

Description

This component trains a classification model using a table of input data.

For more details, refer to the official ML.CREATE_MODEL documentation.

Inputs

  • Input table: A data table that is used as input for the model creation.

Settings

  • Model's FQN: Fully qualified name for the model created by this component.

  • Unique identifier column: A column from the input table to be used as unique identifier for the model.

  • Input label column: A column from the input table to be used as source of labels for the model.

  • Model type: Select the type of model to be created. Options are:

    • LOGISTIC_REG

    • BOOSTED_TREE_CLASSIFIER

    • RANDOM_FOREST_CLASSIFIER

  • Fit intercept: Determines whether to fit an intercept term in the model. Only applies if Model type is "LOGISTIC_REG".

  • Max tree depth: Determines the maximum depth of the individual trees. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".

  • Number of parallel trees: Determines the number of parallel trees constructed on each iteration. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".

  • Minimum tree child weight: Determines the minimum sum of instance weight needed in a child. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".

  • Subsample: Subsample ratio of the training instances used to grow each tree. A fraction between 0 and 1. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".

  • Column sample by tree: Subsample ratio of columns when constructing each tree. A fraction between 0 and 1 that controls the number of columns used by each tree. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".

  • Column sample by node: Subsample ratio of columns for each node (split). A fraction between 0 and 1 that controls the number of columns considered at each split. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".

  • Data split method: The method used to split the input data into training, evaluation and test data. Options are:

    • AUTO_SPLIT automatically splits the data

    • RANDOM splits randomly based on specified fractions

    • CUSTOM uses a specified column

    • SEQ splits sequentially

    • NO_SPLIT uses all data for training.

  • Data split evaluation fraction: Fraction of data to use for evaluation. Only applies if Data split method is RANDOM or SEQ.

  • Data split test fraction: Fraction of data to use for testing. Only applies if Data split method is RANDOM or SEQ.

  • Data split column: Column to use for splitting the data. Only applies if Data split method is CUSTOM.

Outputs

  • Output table: This component generates a single-row table with the FQN of the created model.
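
As a rough illustration of how these settings could map onto BigQuery ML, the Python sketch below assembles a CREATE MODEL statement as a string. The function name, the example table and column names, the use of CREATE OR REPLACE, and the choice to drop the identifier column with SELECT * EXCEPT are all assumptions made for illustration; the SQL that the component actually runs is not documented here.

```python
def build_classifier_sql(model_fqn, input_table, label_col, id_col,
                         model_type="LOGISTIC_REG", options=None):
    """Return a CREATE MODEL statement for a BigQuery ML classifier.

    `options` holds extra BigQuery ML options, e.g. {"MAX_TREE_DEPTH": 6}.
    """
    allowed = {"LOGISTIC_REG", "BOOSTED_TREE_CLASSIFIER",
               "RANDOM_FOREST_CLASSIFIER"}
    if model_type not in allowed:
        raise ValueError(f"unsupported model type: {model_type}")

    opts = {"MODEL_TYPE": model_type, "INPUT_LABEL_COLS": [label_col]}
    opts.update(options or {})

    def render(value):
        # bool is checked before numbers because bool is a subclass of int
        if isinstance(value, bool):
            return "TRUE" if value else "FALSE"
        if isinstance(value, (int, float)):
            return str(value)
        if isinstance(value, list):
            return "[" + ", ".join(render(v) for v in value) + "]"
        return "'" + str(value) + "'"

    rendered = ",\n  ".join(f"{k} = {render(v)}" for k, v in opts.items())
    return (
        f"CREATE OR REPLACE MODEL `{model_fqn}`\n"
        f"OPTIONS(\n  {rendered}\n) AS\n"
        # Assumption: the identifier column is excluded from the features.
        f"SELECT * EXCEPT ({id_col}) FROM `{input_table}`"
    )


# Hypothetical names, for illustration only.
sql = build_classifier_sql(
    "my-project.my_dataset.churn_model",
    "my-project.my_dataset.customers",
    label_col="churned",
    id_col="customer_id",
    model_type="BOOSTED_TREE_CLASSIFIER",
    options={"MAX_TREE_DEPTH": 6, "SUBSAMPLE": 0.8},
)
print(sql)
```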

Create Forecast Model

Description

This component trains a forecast model using a table of input data.

For more details, refer to the official ML.CREATE_MODEL documentation.

Inputs

  • Input table: A data table that is used as input for the model creation.

  • Holidays table: A table containing custom holidays to use during model training.

Settings

  • Model's FQN: Fully qualified name for the model created by this component.

  • Model type: Select a type of model to be created. Options are:

    • ARIMA_PLUS

    • ARIMA_PLUS_XREG

  • Time-series ID column: Column from Input table that uniquely identifies each individual time series in the input data. Only applies if Model type is ARIMA_PLUS and Auto ARIMA is set to true.

  • Time-series timestamp column: Column from Input table containing timestamps for each data point in the time series.

  • Time-series data column: Column from Input table containing the target values to forecast for each data point in the time series.

  • Auto ARIMA: Automatically determine ARIMA hyperparameters.

  • P value: Number of autoregressive terms. Only applies if Auto ARIMA is set to false.

  • D value: Number of non-seasonal differences. Only applies if Auto ARIMA is set to false.

  • Q value: Number of lagged forecast errors. Only applies if Auto ARIMA is set to false.

  • Data frequency: Frequency of data points in the time series. Used by BigQuery ML to properly interpret the time intervals between data points for forecasting. AUTO_FREQUENCY will attempt to detect the frequency from the data. Options are:

    • AUTO_FREQUENCY (default)

    • PER_MINUTE

    • HOURLY

    • DAILY

    • WEEKLY

    • MONTHLY

    • QUARTERLY

    • YEARLY

  • Holiday region: Region for which the holidays will be applied. Check the reference for available values.

  • Clean spikes and dips: Determines whether to remove spikes and dips from the time series data.

Outputs

  • Output table: This component generates a single-row table with the FQN of the created model.
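
The interaction between Auto ARIMA and the P, D and Q values can be sketched as follows. This Python helper renders an OPTIONS clause for an ARIMA_PLUS model using BigQuery ML option names; the helper itself, its argument names, and the example column names are hypothetical, and the way the component combines these settings internally is an assumption.

```python
def arima_options(timestamp_col, data_col, id_col=None, auto_arima=True,
                  pdq=None, data_frequency="AUTO_FREQUENCY",
                  holiday_region=None, clean_spikes_and_dips=True):
    """Return the OPTIONS(...) clause for an ARIMA_PLUS model as a string."""
    if not auto_arima and pdq is None:
        raise ValueError("p, d and q are required when Auto ARIMA is off")

    parts = [
        "MODEL_TYPE = 'ARIMA_PLUS'",
        f"TIME_SERIES_TIMESTAMP_COL = '{timestamp_col}'",
        f"TIME_SERIES_DATA_COL = '{data_col}'",
    ]
    # Per the setting description, the ID column only applies with Auto ARIMA.
    if id_col and auto_arima:
        parts.append(f"TIME_SERIES_ID_COL = '{id_col}'")
    parts.append(f"AUTO_ARIMA = {'TRUE' if auto_arima else 'FALSE'}")
    if not auto_arima:
        p, d, q = pdq
        # The manual P, D and Q values are passed as a single tuple option.
        parts.append(f"NON_SEASONAL_ORDER = ({p}, {d}, {q})")
    parts.append(f"DATA_FREQUENCY = '{data_frequency}'")
    if holiday_region:
        parts.append(f"HOLIDAY_REGION = '{holiday_region}'")
    flag = "TRUE" if clean_spikes_and_dips else "FALSE"
    parts.append(f"CLEAN_SPIKES_AND_DIPS = {flag}")
    return "OPTIONS(\n  " + ",\n  ".join(parts) + "\n)"


# Hypothetical column names, for illustration only.
manual = arima_options("date", "sales", auto_arima=False, pdq=(2, 1, 1),
                       holiday_region="US")
print(manual)
```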

Create Regression Model

Description

This component trains a regression model using a table of input data.

For more details, refer to the official ML.CREATE_MODEL documentation.

Inputs

  • Input table: A data table that is used as input for the model creation.

Settings

  • Model's FQN: Fully qualified name for the model created by this component.

  • Unique identifier column: A column from the input table to be used as unique identifier for the model.

  • Input label column: A column from the input table to be used as source of labels for the model.

  • Model type: Select the type of model to be created. Options are:

    • LINEAR_REG

    • BOOSTED_TREE_REGRESSOR

    • RANDOM_FOREST_REGRESSOR

  • Fit intercept: Determines whether to fit an intercept term in the model. Only applies if Model type is "LINEAR_REG".

  • Max tree depth: Determines the maximum depth of the individual trees. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".

  • Number of parallel trees: Determines the number of parallel trees constructed on each iteration. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".

  • Minimum tree child weight: Determines the minimum sum of instance weight needed in a child. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".

  • Subsample: Subsample ratio of the training instances used to grow each tree. A fraction between 0 and 1. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".

  • Column sample by tree: Subsample ratio of columns when constructing each tree. A fraction between 0 and 1 that controls the number of columns used by each tree. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".

  • Column sample by node: Subsample ratio of columns for each node (split). A fraction between 0 and 1 that controls the number of columns considered at each split. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".

  • Data split method: The method used to split the input data into training, evaluation and test data. Options are:

    • AUTO_SPLIT automatically splits the data

    • RANDOM splits randomly based on specified fractions

    • CUSTOM uses a specified column

    • SEQ splits sequentially

    • NO_SPLIT uses all data for training.

  • Data split evaluation fraction: Fraction of data to use for evaluation. Only applies if Data split method is RANDOM or SEQ.

  • Data split test fraction: Fraction of data to use for testing. Only applies if Data split method is RANDOM or SEQ.

  • Data split column: Column to use for splitting the data. Only applies if Data split method is CUSTOM.

Outputs

  • Output table: This component generates a single-row table with the FQN of the created model.
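
The conditions attached to the data split settings can be made explicit with a small validation helper. The Python sketch below mirrors the rules stated in the settings list (fractions apply to RANDOM and SEQ, the split column applies to CUSTOM) and returns BigQuery ML option names; the helper and the example column name are hypothetical, and the range check on the fractions is an assumption.

```python
def data_split_options(method="AUTO_SPLIT", eval_fraction=None,
                       test_fraction=None, split_col=None):
    """Translate the data split settings into BigQuery ML option pairs."""
    methods = {"AUTO_SPLIT", "RANDOM", "CUSTOM", "SEQ", "NO_SPLIT"}
    if method not in methods:
        raise ValueError(f"unknown data split method: {method}")

    opts = {"DATA_SPLIT_METHOD": method}
    if method in ("RANDOM", "SEQ"):
        fractions = {
            "DATA_SPLIT_EVAL_FRACTION": eval_fraction,
            "DATA_SPLIT_TEST_FRACTION": test_fraction,
        }
        for name, value in fractions.items():
            if value is None:
                continue  # leave the option unset
            if not 0 <= value <= 1:
                raise ValueError(f"{name} must be between 0 and 1")
            opts[name] = value
    elif method == "CUSTOM":
        if split_col is None:
            raise ValueError("CUSTOM requires a data split column")
        opts["DATA_SPLIT_COL"] = split_col
    return opts


random_split = data_split_options("RANDOM", eval_fraction=0.2, test_fraction=0.1)
custom_split = data_split_options("CUSTOM", split_col="is_eval")  # hypothetical column
```

Settings that do not apply to the chosen method are silently ignored here; a stricter variant could raise an error instead.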

Evaluate

Description

Given a pre-trained ML model and an input table, this component evaluates the model's performance against the input data. The result contains metrics describing the model's performance. Additional options are available when the model is a forecasting model.

For more details, refer to the official ML.EVALUATE function documentation.

Inputs

  • Model: This component receives a trained model table as input.

  • Input table: This component receives a data table to be used as data input.

Settings

  • Forecast: Determines whether to evaluate the model as a forecasting model.

  • Perform aggregation: Determines whether to evaluate on the time series level or the timestep level. Only applies if Forecast is set to true.

  • Horizon: Number of forecasted time points to evaluate against. Only applies if Forecast is set to true.

  • Confidence level: Percentage of the future values that fall in the prediction interval. Only applies if Forecast is set to true.

Outputs

  • Output table: This component produces a table with the evaluation metrics.
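
To show how the forecast-only settings could be passed along, the Python sketch below builds an ML.EVALUATE query and adds the optional STRUCT argument only when Forecast is enabled. The function name and the placeholder model and table names are hypothetical; this is a sketch of the BigQuery ML call, not necessarily the query the component issues.

```python
def build_evaluate_sql(model_fqn, input_table, forecast=False,
                       perform_aggregation=True, horizon=None,
                       confidence_level=None):
    """Return an ML.EVALUATE query; forecast settings are added on demand."""
    args = [f"MODEL `{model_fqn}`", f"TABLE `{input_table}`"]
    if forecast:
        flag = "TRUE" if perform_aggregation else "FALSE"
        fields = [f"{flag} AS perform_aggregation"]
        if horizon is not None:
            fields.append(f"{int(horizon)} AS horizon")
        if confidence_level is not None:
            fields.append(f"{float(confidence_level)} AS confidence_level")
        args.append("STRUCT(" + ", ".join(fields) + ")")
    return "SELECT * FROM ML.EVALUATE(" + ", ".join(args) + ")"


# Placeholder names, for illustration only.
plain = build_evaluate_sql("p.d.model", "p.d.holdout")
as_forecast = build_evaluate_sql("p.d.model", "p.d.holdout", forecast=True,
                                 horizon=30, confidence_level=0.9)
```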

Evaluate Forecast

Description

Given a pre-trained ML forecast model, this component evaluates its performance using the ARIMA_EVALUATE function in BigQuery.

Inputs

  • Model: This component receives a trained model table as input.

Settings

  • Show all candidate models: Determines whether to show evaluation metrics for all candidate models or only for the best model.

Outputs

  • Output table: This component produces a table with the evaluation metrics.
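
A minimal sketch of the underlying call, assuming the component maps its single setting onto the show_all_candidate_models argument of BigQuery ML's ML.ARIMA_EVALUATE function. The Python function name and the placeholder model name are hypothetical.

```python
def build_arima_evaluate_sql(model_fqn, show_all_candidate_models=False):
    """Return an ML.ARIMA_EVALUATE query for a trained ARIMA model."""
    flag = "TRUE" if show_all_candidate_models else "FALSE"
    return (
        f"SELECT * FROM ML.ARIMA_EVALUATE(MODEL `{model_fqn}`, "
        f"STRUCT({flag} AS show_all_candidate_models))"
    )


best_only = build_arima_evaluate_sql("p.d.sales_forecast")
all_candidates = build_arima_evaluate_sql("p.d.sales_forecast", True)
```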

Explain Forecast

Description

Given a pre-trained ML model and an input table, this component runs an explainability analysis invoking the EXPLAIN_FORECAST function in BigQuery.

Inputs

  • Model: This component receives a trained model table as input.

  • Input table: This component receives a data table to be used as data input. It can only be used with ARIMA models. The execution will fail if a different model is selected.

Settings

  • Model type: Select the type of model to be used with the input. Options are:

    • ARIMA_PLUS

    • ARIMA_PLUS_XREG

  • Horizon: The number of time units to forecast into the future.

  • Confidence level: The confidence level to use for the prediction intervals.

Outputs

  • Output table: This component produces a table with the explainability metrics.

Explain Predict

Description

Given a pre-trained ML model, this component generates a predicted value and a set of feature attributions for each instance of the input data. Feature attributions indicate how much each feature in your model contributed to the final prediction for each given instance.

For more details, refer to the official ML.EXPLAIN_PREDICT function documentation.

Inputs

  • Model: This component receives a trained model table as input.

  • Input table: This component receives a data table to be used as data input.

Settings

  • Number of top features: The number of the top features to be returned.

Outputs

  • Output table: This component produces a table with the attribution per feature for the input data.
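
As an illustration, the Number of top features setting plausibly corresponds to the top_k_features argument of ML.EXPLAIN_PREDICT. The Python sketch below builds such a query; the function name and the placeholder model and table names are hypothetical, and the mapping to the component's setting is an assumption.

```python
def build_explain_predict_sql(model_fqn, input_table, top_k_features=None):
    """Return an ML.EXPLAIN_PREDICT query, optionally limiting the features."""
    args = [f"MODEL `{model_fqn}`", f"TABLE `{input_table}`"]
    if top_k_features is not None:
        if top_k_features < 1:
            raise ValueError("top_k_features must be at least 1")
        args.append(f"STRUCT({int(top_k_features)} AS top_k_features)")
    return "SELECT * FROM ML.EXPLAIN_PREDICT(" + ", ".join(args) + ")"


explain_sql = build_explain_predict_sql("p.d.model", "p.d.new_rows", 3)
```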

Forecast

Description

Given a pre-trained ML model and an optional input table, this component returns the forecast for each of the input samples. Note that the actual forecasting happens when the model is created; this component only retrieves the desired results.

For more details, refer to the ML.FORECAST function documentation.

Inputs

  • Model: This component receives a trained model table as input.

  • Input table: This component receives a data table to be used as data input. It can only be used with ARIMA models. The execution will fail if a different model is selected.

Settings

  • Model type: Select the type of model to be used with the input. Options are:

    • ARIMA_PLUS

    • ARIMA_PLUS_XREG

  • Horizon: The number of time units to forecast into the future.

  • Confidence level: The confidence level to use for the prediction intervals.

Outputs

  • Output table: This component produces a table with the predictions column.
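
The settings above resemble the arguments of BigQuery ML's ML.FORECAST function, which is assumed here to be what retrieves the results. The Python sketch below builds such a query and treats the input table as required only for ARIMA_PLUS_XREG, where BigQuery ML expects a table of future feature values. The function name, the default values, the validation rules, and the placeholder names are illustrative assumptions.

```python
def build_forecast_sql(model_fqn, model_type="ARIMA_PLUS", horizon=3,
                       confidence_level=0.95, input_table=None):
    """Return an ML.FORECAST query for an ARIMA_PLUS(_XREG) model."""
    if model_type not in ("ARIMA_PLUS", "ARIMA_PLUS_XREG"):
        raise ValueError(f"unsupported model type: {model_type}")
    if model_type == "ARIMA_PLUS_XREG" and input_table is None:
        raise ValueError("ARIMA_PLUS_XREG needs an input table")
    if not isinstance(horizon, int) or horizon < 1:
        raise ValueError("horizon must be a positive integer")
    if not 0 <= confidence_level < 1:
        raise ValueError("confidence_level must be in [0, 1)")

    args = [
        f"MODEL `{model_fqn}`",
        f"STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level)",
    ]
    if model_type == "ARIMA_PLUS_XREG":
        # Future values of the external regressors come from this table.
        args.append(f"TABLE `{input_table}`")
    return "SELECT * FROM ML.FORECAST(" + ", ".join(args) + ")"


univariate = build_forecast_sql("p.d.sales_forecast", horizon=30,
                                confidence_level=0.9)
with_features = build_forecast_sql("p.d.sales_xreg", "ARIMA_PLUS_XREG",
                                   input_table="p.d.future_features")
```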

Get Model by Name

Description

This component loads a pre-existing BigQuery model into the format that Workflows expects, so that it can be used with the rest of the BigQuery ML components.

Inputs

  • Model FQN: Fully-qualified name to get the model from.

Outputs

  • Output: This component returns a model that can be connected to other BigQuery ML components that expect a model as input.
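
A fully qualified model name in BigQuery usually takes the form project.dataset.model. The Python sketch below splits such a name into its three parts and rejects anything else. The helper name and the strictness of the check are assumptions; the component may accept other forms.

```python
def parse_model_fqn(fqn):
    """Split 'project.dataset.model' into its three components."""
    parts = fqn.strip("`").split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.model, got: {fqn!r}")
    project, dataset, model = parts
    return {"project": project, "dataset": dataset, "model": model}


# Hypothetical name, for illustration only.
ref = parse_model_fqn("my-project.my_dataset.my_model")
```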

Global Explain

Description

Given a pre-trained ML model, this component provides explanations for the entire model by aggregating the local explanations of the evaluation data. It returns the attribution of each feature. For classification models, an option can be set to provide an explanation for each class of the model.

For more details, refer to the official ML.GLOBAL_EXPLAIN function documentation.

Inputs

  • Model: This component receives a trained model table as input.

Settings

  • Class level explain: Determines whether global feature importances are returned for each class in the case of classification.

Outputs

  • Output table: This component produces a table with the attribution of each feature.
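
To make the idea of aggregating local explanations concrete, the Python sketch below averages the absolute per-row attributions of each feature, which is one common way to derive a global importance score. The feature names and values are invented for illustration, and this is a conceptual sketch rather than the computation BigQuery ML performs internally.

```python
from collections import defaultdict


def global_attributions(local_attributions):
    """Average the absolute per-row attributions of each feature.

    `local_attributions` is a list with one {feature: attribution} dict
    per evaluated row.
    """
    totals = defaultdict(float)
    for row in local_attributions:
        for feature, value in row.items():
            totals[feature] += abs(value)
    n = len(local_attributions)
    return {feature: total / n for feature, total in totals.items()}


# Invented per-row attributions for two rows and two features.
rows = [
    {"distance": 0.5, "population": -1.0},
    {"distance": -1.5, "population": 3.0},
]
importance = global_attributions(rows)
print(importance)  # {'distance': 1.0, 'population': 2.0}
```

Taking absolute values keeps positive and negative contributions from cancelling each other out.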

Import Model

Description

This component imports an ONNX model from Google Cloud Storage. The model will be loaded into BigQuery ML using the provided FQN and it will be ready to use in Workflows with the rest of ML components.

Settings

  • Model path: Google Cloud Storage URI (gs://) of the pre-trained ONNX file.

  • Model FQN: A fully-qualified name to save the model to.

  • Overwrite model: Determines whether to overwrite the model if it already exists.

Outputs

  • Output table: This component returns a model that can be connected to other BigQuery ML components that expect a model as input.
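
In BigQuery ML, an ONNX model is imported with a CREATE MODEL statement that sets MODEL_TYPE to 'ONNX' and MODEL_PATH to the Cloud Storage URI. The Python sketch below builds that statement. Mapping the Overwrite model setting to CREATE OR REPLACE is an assumption, and the function name, bucket, and model names are hypothetical.

```python
def build_import_onnx_sql(model_path, model_fqn, overwrite=False):
    """Return a CREATE MODEL statement that imports an ONNX file."""
    if not model_path.startswith("gs://"):
        raise ValueError("model path must be a gs:// URI")
    # Assumption: overwriting is expressed with CREATE OR REPLACE.
    create = "CREATE OR REPLACE MODEL" if overwrite else "CREATE MODEL"
    return (
        f"{create} `{model_fqn}`\n"
        f"OPTIONS(MODEL_TYPE = 'ONNX', MODEL_PATH = '{model_path}')"
    )


import_sql = build_import_onnx_sql(
    "gs://my-bucket/models/classifier.onnx",  # hypothetical URI
    "my-project.my_dataset.imported_model",
    overwrite=True,
)
print(import_sql)
```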

Predict

Description

Given a pre-trained ML model and an input table, this component infers the predictions for each of the input samples. A new prediction column is returned. All columns in the input table are returned by default; this option can be unchecked to return only a single ID column alongside the prediction.

For more details, check out the ML.PREDICT function documentation.

Inputs

  • Model: This component receives a trained model table as input.

  • Input table: This component receives a data table to be used as data input.

Settings

  • Keep input columns: Determines whether to keep all input columns in the output or not.

  • ID column: Select a column from the input table to be used as the unique identifier for the model. Only applies if Keep input columns is set to false.

Outputs

  • Output table: This component produces a table with the predictions column.
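
The effect of Keep input columns can be sketched as a choice between selecting everything from ML.PREDICT and selecting only the ID column plus the prediction. For supervised models, BigQuery ML names the prediction column predicted_ followed by the label column name, so the sketch takes the label as an extra argument. That argument, the function name, and the placeholder names are assumptions; the component's actual query is not shown in this page.

```python
def build_predict_sql(model_fqn, input_table, label_col,
                      keep_input_columns=True, id_col=None):
    """Return an ML.PREDICT query with all columns or only ID + prediction."""
    source = f"ML.PREDICT(MODEL `{model_fqn}`, TABLE `{input_table}`)"
    if keep_input_columns:
        return f"SELECT * FROM {source}"
    if id_col is None:
        raise ValueError("an ID column is required when input columns are dropped")
    return f"SELECT {id_col}, predicted_{label_col} FROM {source}"


# Placeholder names, for illustration only.
everything = build_predict_sql("p.d.model", "p.d.new_rows", "price")
id_only = build_predict_sql("p.d.model", "p.d.new_rows", "price",
                            keep_input_columns=False, id_col="parcel_id")
```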
