BigQuery ML

Extension Package provided by CARTO

The BigQuery ML extension package for CARTO Workflows includes a variety of components that enable users to integrate machine learning workflows with geospatial data. These components allow for creating, evaluating, explaining, forecasting, and managing ML models directly within CARTO Workflows, utilizing BigQuery ML’s capabilities.

The following table summarises available components, and explains how different components can be connected to each other:

Create Classification Model

Description

This component trains a classification model using a table of input data.

For more details, refer to the official BigQuery ML CREATE MODEL statement documentation.

Inputs

Input table: A data table that is used as input for the model creation.

Settings

Model's FQN: Fully qualified name for the model created by this component.
Unique identifier column: A column from the input table to be used as unique identifier for the model.
Input label column: A column from the input table to be used as source of labels for the model.
Model type: Select the type of model to be created. Options are:
- LOGISTIC_REG
- BOOSTED_TREE_CLASSIFIER
- RANDOM_FOREST_CLASSIFIER
Fit intercept: Determines whether to fit an intercept term in the model. Only applies if Model type is "LOGISTIC_REG".
Max tree depth: Determines the maximum depth of the individual trees. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".
Number of parallel trees: Determines the number of parallel trees constructed on each iteration. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".
Minimum tree child weight: Determines the minimum sum of instance weight needed in a child. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".
Subsample: Determines whether to subsample. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".
Column sample by tree: Subsample ratio of columns when constructing each tree. A fraction between 0 and 1 that controls the number of columns used by each tree. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".
Column sample by node: Subsample ratio of columns for each node (split). A fraction between 0 and 1 that controls the number of columns considered at each split. Only applies if Model type is "BOOSTED_TREE_CLASSIFIER".
Data split method: The method used to split the input data into training, evaluation and test data. Options are:
- AUTO_SPLIT automatically splits the data
- RANDOM splits randomly based on specified fractions
- CUSTOM uses a specified column
- SEQ splits sequentially
- NO_SPLIT uses all data for training.
Data split evaluation fraction: Fraction of data to use for evaluation. Only applies if Data split method is RANDOM or SEQ.
Data split test fraction: Fraction of data to use for testing. Only applies if Data split method is RANDOM or SEQ.
Data split column: Column to use for splitting the data. Only applies if Data split method is CUSTOM.

Outputs

Output table: This component generates a single-row table with the FQN of the created model.
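
As a rough illustration of how the settings above could translate into a BigQuery ML statement, the sketch below assembles a CREATE MODEL query in Python. The helper names, the placeholder table and column names, and the mapping from component settings to OPTIONS entries are assumptions made for this example; the statement the component actually runs may differ.

```python
def _literal(value):
    """Render a Python value as a BigQuery SQL literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(_literal(v) for v in value) + "]"
    return "'" + str(value) + "'"


def build_create_classifier_sql(model_fqn, input_table, label_col,
                                model_type="LOGISTIC_REG", id_col=None, **options):
    """Assemble a CREATE MODEL statement (hypothetical helper, not CARTO's code)."""
    opts = {"MODEL_TYPE": model_type, "INPUT_LABEL_COLS": [label_col]}
    # Extra keyword arguments become OPTIONS entries, e.g. max_tree_depth=6.
    opts.update({name.upper(): value for name, value in options.items()})
    rendered = ",\n  ".join(f"{name} = {_literal(value)}" for name, value in opts.items())
    # Assumption: the unique identifier column is excluded from the features.
    select = f"SELECT * EXCEPT ({id_col})" if id_col else "SELECT *"
    return (
        f"CREATE MODEL `{model_fqn}`\n"
        f"OPTIONS (\n  {rendered}\n) AS\n"
        f"{select} FROM `{input_table}`"
    )


print(build_create_classifier_sql(
    "my-project.my_dataset.churn_model",   # placeholder names
    "my-project.my_dataset.customers",
    label_col="churned",
    model_type="BOOSTED_TREE_CLASSIFIER",
    id_col="customer_id",
    max_tree_depth=6,
    subsample=0.8,
    data_split_method="RANDOM",
    data_split_eval_fraction=0.2,
))
```

Passing the tree options as keyword arguments keeps the sketch independent of the model type; it does not check whether an option is valid for the chosen model.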

Create Forecast Model

Description

This component trains a forecast model using a table of input data.

For more details, refer to the official BigQuery ML CREATE MODEL statement documentation.

Inputs

Input table: A data table that is used as input for the model creation.
Holidays table: A table containing custom holidays to use during model training.

Settings

Model's FQN: Fully qualified name for the model created by this component.
Model type: Select a type of model to be created. Options are:
- ARIMA_PLUS
- ARIMA_PLUS_XREG
Time-series ID column: Column from Input table that uniquely identifies each individual time series in the input data. Only applies if Model type is ARIMA_PLUS and Auto ARIMA is set to true.
Time-series timestamp column: Column from Input table containing timestamps for each data point in the time series.
Time-series data column: Column from Input table containing the target values to forecast for each data point in the time series.
Auto ARIMA: Automatically determine ARIMA hyperparameters.
P value: Number of autoregressive terms. Only applies if Auto ARIMA is set to false.
D value: Number of non-seasonal differences. Only applies if Auto ARIMA is set to false.
Q value: Number of lagged forecast errors. Only applies if Auto ARIMA is set to false.
Data frequency: Frequency of data points in the time series. Used by BigQuery ML to properly interpret the time intervals between data points for forecasting. AUTO_FREQUENCY will attempt to detect the frequency from the data. Options are:
- AUTO_FREQUENCY (default)
- PER_MINUTE
- HOURLY
- DAILY
- WEEKLY
- MONTHLY
- QUARTERLY
- YEARLY
Holiday region: Region for which the holidays will be applied. Check the reference for available values.
Clean spikes and dips: Determines whether to remove spikes and dips from the time series data.

Outputs

Output table: This component generates a single-row table with the FQN of the created model.
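
The interplay between Auto ARIMA and the P, D and Q values can be sketched as follows. This Python helper builds a CREATE MODEL statement for an ARIMA_PLUS model; the function name, its parameters and the placeholder identifiers are assumptions made for illustration, and the custom holidays table is left out for brevity.

```python
def build_create_forecast_sql(model_fqn, input_table, timestamp_col, data_col,
                              id_col=None, auto_arima=True, pdq=None,
                              data_frequency="AUTO_FREQUENCY",
                              holiday_region=None, clean_spikes_and_dips=True):
    """Assemble a CREATE MODEL statement for ARIMA_PLUS (hypothetical helper)."""
    def flag(value):
        return "TRUE" if value else "FALSE"

    options = [
        "MODEL_TYPE = 'ARIMA_PLUS'",
        f"TIME_SERIES_TIMESTAMP_COL = '{timestamp_col}'",
        f"TIME_SERIES_DATA_COL = '{data_col}'",
        f"AUTO_ARIMA = {flag(auto_arima)}",
        f"DATA_FREQUENCY = '{data_frequency}'",
        f"CLEAN_SPIKES_AND_DIPS = {flag(clean_spikes_and_dips)}",
    ]
    if auto_arima:
        # Per the settings above, the ID column only applies with Auto ARIMA.
        if id_col:
            options.append(f"TIME_SERIES_ID_COL = '{id_col}'")
    else:
        # Without Auto ARIMA, the (p, d, q) order has to be given explicitly.
        if pdq is None:
            raise ValueError("pdq=(p, d, q) is required when auto_arima is False")
        p, d, q = pdq
        options.append(f"NON_SEASONAL_ORDER = ({p}, {d}, {q})")
    if holiday_region:
        options.append(f"HOLIDAY_REGION = '{holiday_region}'")

    body = ",\n  ".join(options)
    return (
        f"CREATE MODEL `{model_fqn}`\n"
        f"OPTIONS (\n  {body}\n) AS\n"
        f"SELECT * FROM `{input_table}`"
    )


print(build_create_forecast_sql(
    "my-project.my_dataset.sales_forecast",   # placeholder names
    "my-project.my_dataset.daily_sales",
    timestamp_col="day",
    data_col="sales",
    auto_arima=False,
    pdq=(1, 1, 2),
    data_frequency="DAILY",
))
```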

Create Regression Model

Description

This component trains a regression model using a table of input data.

For more details, refer to the official BigQuery ML CREATE MODEL statement documentation.

Inputs

Input table: A data table that is used as input for the model creation.

Settings

Model's FQN: Fully qualified name for the model created by this component.
Unique identifier column: A column from the input table to be used as unique identifier for the model.
Input label column: A column from the input table to be used as source of labels for the model.
Model type: Select the type of model to be created. Options are:
- LINEAR_REG
- BOOSTED_TREE_REGRESSOR
- RANDOM_FOREST_REGRESSOR
Fit intercept: Determines whether to fit an intercept term in the model. Only applies if Model type is "LINEAR_REG".
Max tree depth: Determines the maximum depth of the individual trees. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".
Number of parallel trees: Determines the number of parallel trees constructed on each iteration. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".
Minimum tree child weight: Determines the minimum sum of instance weight needed in a child. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".
Subsample: Determines whether to subsample. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".
Column sample by tree: Subsample ratio of columns when constructing each tree. A fraction between 0 and 1 that controls the number of columns used by each tree. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".
Column sample by node: Subsample ratio of columns for each node (split). A fraction between 0 and 1 that controls the number of columns considered at each split. Only applies if Model type is "BOOSTED_TREE_REGRESSOR".
Data split method: The method used to split the input data into training, evaluation and test data. Options are:
- AUTO_SPLIT automatically splits the data
- RANDOM splits randomly based on specified fractions
- CUSTOM uses a specified column
- SEQ splits sequentially
- NO_SPLIT uses all data for training.
Data split evaluation fraction: Fraction of data to use for evaluation. Only applies if Data split method is RANDOM or SEQ.
Data split test fraction: Fraction of data to use for testing. Only applies if Data split method is RANDOM or SEQ.
Data split column: Column to use for splitting the data. Only applies if Data split method is CUSTOM.

Outputs

Output table: This component generates a single-row table with the FQN of the created model.
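
Many of the settings above only apply to a particular model type or data split method. The sketch below transcribes those "Only applies if" rules into a small Python lookup and drops any setting that does not apply to the current configuration. The snake_case setting keys and the function name are hypothetical and only serve this example.

```python
# Settings restricted to certain model types, as listed above.
MODEL_RULES = {
    "fit_intercept": {"LINEAR_REG"},
    "max_tree_depth": {"BOOSTED_TREE_REGRESSOR"},
    "num_parallel_tree": {"BOOSTED_TREE_REGRESSOR"},
    "min_tree_child_weight": {"BOOSTED_TREE_REGRESSOR"},
    "subsample": {"BOOSTED_TREE_REGRESSOR"},
    "colsample_bytree": {"BOOSTED_TREE_REGRESSOR"},
    "colsample_bynode": {"BOOSTED_TREE_REGRESSOR"},
}

# Settings restricted to certain data split methods, as listed above.
SPLIT_RULES = {
    "data_split_eval_fraction": {"RANDOM", "SEQ"},
    "data_split_test_fraction": {"RANDOM", "SEQ"},
    "data_split_col": {"CUSTOM"},
}


def applicable_settings(model_type, data_split_method, settings):
    """Return only the settings that apply to this model type and split method."""
    kept = {}
    for name, value in settings.items():
        if name in MODEL_RULES and model_type not in MODEL_RULES[name]:
            continue  # e.g. max_tree_depth with LINEAR_REG
        if name in SPLIT_RULES and data_split_method not in SPLIT_RULES[name]:
            continue  # e.g. data_split_col with RANDOM
        kept[name] = value
    return kept


print(applicable_settings(
    "LINEAR_REG",
    "RANDOM",
    {
        "fit_intercept": True,
        "max_tree_depth": 6,
        "data_split_eval_fraction": 0.2,
        "data_split_col": "split",
    },
))
```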

Evaluate

Description

Given a pre-trained ML model and an input table, this component evaluates the model's performance against the input data. The result contains metrics describing the model's performance. Additional options are available when the model is a forecasting model.

For more details, refer to the official ML.EVALUATE function documentation.

Inputs

Model: This component receives a trained model table as input.
Input table: This component receives a data table to be used as data input.

Settings

Forecast: Determines whether to evaluate the model as a forecast model.
Perform aggregation: Determines whether to evaluate on the time series level or the timestep level. Only applies if Forecast is set to true.
Horizon: Number of forecasted time points to evaluate against. Only applies if Forecast is set to true.
Confidence level: Percentage of the future values that fall in the prediction interval. Only applies if Forecast is set to true.

Outputs

Output table: This component produces a table with the evaluation metrics.
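
A minimal sketch of the kind of ML.EVALUATE query these settings correspond to is shown below. It assumes the forecast-only settings are passed as a STRUCT, which is how ML.EVALUATE accepts them for time series models; the Python function name and the placeholder identifiers are hypothetical.

```python
def build_evaluate_sql(model_fqn, input_table, forecast=False,
                       perform_aggregation=True, horizon=None,
                       confidence_level=None):
    """Assemble an ML.EVALUATE query (hypothetical helper, not CARTO's code)."""
    args = [f"MODEL `{model_fqn}`", f"TABLE `{input_table}`"]
    if forecast:
        # These fields are only meaningful for forecast models.
        fields = [f"{'TRUE' if perform_aggregation else 'FALSE'} AS perform_aggregation"]
        if horizon is not None:
            fields.append(f"{horizon} AS horizon")
        if confidence_level is not None:
            fields.append(f"{confidence_level} AS confidence_level")
        args.append("STRUCT(" + ", ".join(fields) + ")")
    return "SELECT * FROM ML.EVALUATE(" + ", ".join(args) + ")"


print(build_evaluate_sql(
    "my-project.my_dataset.sales_forecast",   # placeholder names
    "my-project.my_dataset.recent_sales",
    forecast=True,
    perform_aggregation=False,
    horizon=30,
    confidence_level=0.9,
))
```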

Evaluate Forecast

Description

Given a pre-trained ML forecast model, this component evaluates its performance using the ML.ARIMA_EVALUATE function in BigQuery.

Inputs

Model: This component receives a trained model table as input.

Settings

Show all candidate models: Determines whether to show evaluation metrics for all candidate models or only for the best model.

Outputs

Output table: This component produces a table with the evaluation metrics.
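
For reference, a query of this kind could look like the one built below. The Python wrapper and the model name are placeholders chosen for this example; only ML.ARIMA_EVALUATE and its show_all_candidate_models argument come from BigQuery ML.

```python
def build_arima_evaluate_sql(model_fqn, show_all_candidate_models=False):
    """Assemble an ML.ARIMA_EVALUATE query (hypothetical helper)."""
    flag = "TRUE" if show_all_candidate_models else "FALSE"
    return (
        f"SELECT * FROM ML.ARIMA_EVALUATE(MODEL `{model_fqn}`, "
        f"STRUCT({flag} AS show_all_candidate_models))"
    )


# Placeholder model name; lists metrics for every candidate model.
print(build_arima_evaluate_sql("my-project.my_dataset.sales_forecast", True))
```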

Explain Forecast

Description

Given a pre-trained ML model and an input table, this component runs an explainability analysis by invoking the ML.EXPLAIN_FORECAST function in BigQuery.

Inputs

Model: This component receives a trained model table as input.
Input table: This component receives a data table to be used as data input. It can only be used with ARIMA models. The execution will fail if a different model is selected.

Settings

Model type: Select the type of model to be used with the input. Options are:
- ARIMA_PLUS
- ARIMA_PLUS_XREG
Horizon: The number of time units to forecast into the future.
Confidence level: The confidence level to use for the prediction intervals.

Outputs

Output table: This component produces a table with the explainability metrics.

Explain Predict

Description

Given a pre-trained ML model, this component generates a predicted value and a set of feature attributions for each instance of the input data. Feature attributions indicate how much each feature in your model contributed to the final prediction for each given instance.

For more details, refer to the official ML.EXPLAIN_PREDICT function documentation.

Inputs

Model: This component receives a trained model table as input.
Input table: This component receives a data table to be used as data input.

Settings

Number of top features: The number of the top features to be returned.

Outputs

Output table: This component produces a table with the attribution per feature for the input data.
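
The Number of top features setting plausibly corresponds to the top_k_features argument of ML.EXPLAIN_PREDICT. The sketch below builds such a query under that assumption; the function name, the default value and the identifiers are illustrative only.

```python
def build_explain_predict_sql(model_fqn, input_table, top_k_features=5):
    """Assemble an ML.EXPLAIN_PREDICT query (hypothetical helper)."""
    if top_k_features < 1:
        raise ValueError("top_k_features must be at least 1")
    return (
        f"SELECT * FROM ML.EXPLAIN_PREDICT(MODEL `{model_fqn}`, "
        f"TABLE `{input_table}`, STRUCT({top_k_features} AS top_k_features))"
    )


print(build_explain_predict_sql(
    "my-project.my_dataset.churn_model",   # placeholder names
    "my-project.my_dataset.customers",
    top_k_features=3,
))
```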

Forecast

Description

Given a pre-trained ML model and an optional input table, this component infers the predictions for each of the input samples. Note that the actual forecasting happens when the model is created; this component only retrieves the desired results.

For more details, refer to the ML.FORECAST function documentation.

Inputs

Model: This component receives a trained model table as input.
Input table: This component receives a data table to be used as data input. It can only be used with ARIMA models. The execution will fail if a different model is selected.

Settings

Model type: Select the type of model to be used with the input. Options are:
- ARIMA_PLUS
- ARIMA_PLUS_XREG
Horizon: The number of time units to forecast into the future.
Confidence level: The confidence level to use for the prediction intervals.

Outputs

Output table: This component produces a table with the predictions column.
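
The difference between the two model types can be sketched in Python as follows. In BigQuery ML, ML.FORECAST takes only the model and a settings STRUCT for ARIMA_PLUS, while ARIMA_PLUS_XREG additionally needs a table with the future values of the external regressors. The function name, the default values and the identifiers are assumptions for this example, and treating the input table as required for ARIMA_PLUS_XREG is this sketch's reading of the settings above.

```python
def build_forecast_sql(model_fqn, model_type, horizon=3,
                       confidence_level=0.95, input_table=None):
    """Assemble an ML.FORECAST query (hypothetical helper, not CARTO's code)."""
    settings = f"STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level)"
    args = [f"MODEL `{model_fqn}`", settings]
    if model_type == "ARIMA_PLUS_XREG":
        # The regressors' future values must be supplied as a table.
        if input_table is None:
            raise ValueError("ARIMA_PLUS_XREG needs an input table")
        args.append(f"TABLE `{input_table}`")
    elif model_type != "ARIMA_PLUS":
        raise ValueError(f"unsupported model type: {model_type}")
    # For ARIMA_PLUS any input table is simply left out of the call.
    return "SELECT * FROM ML.FORECAST(" + ", ".join(args) + ")"


print(build_forecast_sql("my-project.my_dataset.sales_forecast", "ARIMA_PLUS",
                         horizon=14, confidence_level=0.9))
print(build_forecast_sql("my-project.my_dataset.sales_xreg", "ARIMA_PLUS_XREG",
                         horizon=14, input_table="my-project.my_dataset.future_regressors"))
```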

Get Model by Name

Description

This component loads a pre-existing BigQuery model into the format expected by Workflows, so that it can be used with the rest of the BigQuery ML components.

Inputs

Model FQN: Fully-qualified name to get the model from.

Outputs

Output: This component returns a model that can be connected to other BigQuery ML components that expect a model as input.
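
A fully qualified model name in BigQuery usually takes the form project.dataset.model. The helper below is a small sketch that checks a name against that three-part form before it is used; the ModelRef type and the function name are hypothetical, and the check does not cover legacy project IDs that themselves contain dots.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    """The three parts of a fully qualified model name."""
    project: str
    dataset: str
    model: str


def parse_model_fqn(fqn):
    """Split 'project.dataset.model' into its parts, ignoring backticks."""
    parts = fqn.strip().strip("`").split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.model, got: {fqn!r}")
    return ModelRef(*parts)


ref = parse_model_fqn("`my-project.my_dataset.churn_model`")  # placeholder name
print(ref.project, ref.dataset, ref.model)
```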

Global Explain

Description

Given a pre-trained ML model, this component provides explanations for the entire model by aggregating the local explanations of the evaluation data. It returns the attribution of each feature. For classification models, an option can be set to provide an explanation for each class of the model.

For more details, refer to the official ML.GLOBAL_EXPLAIN function documentation.

Inputs

Model: This component receives a trained model table as input.

Settings

Class level explain: Determines whether global feature importances are returned for each class in the case of classification.

Outputs

Output table: This component produces a table with the attribution of each feature.
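
As an illustration, the Class level explain setting could map to the class_level_explain argument of ML.GLOBAL_EXPLAIN as sketched below. The Python function and the model name are placeholders for this example.

```python
def build_global_explain_sql(model_fqn, class_level_explain=False):
    """Assemble an ML.GLOBAL_EXPLAIN query (hypothetical helper)."""
    flag = "TRUE" if class_level_explain else "FALSE"
    return (
        f"SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL `{model_fqn}`, "
        f"STRUCT({flag} AS class_level_explain))"
    )


# Placeholder model name; requests one set of attributions per class.
print(build_global_explain_sql("my-project.my_dataset.churn_model", True))
```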

Import Model

Description

This component imports an ONNX model from Google Cloud Storage. The model will be loaded into BigQuery ML using the provided FQN and it will be ready to use in Workflows with the rest of ML components.

Settings

Model path: Google Cloud Storage URI (gs://) of the pre-trained ONNX file.
Model FQN: A fully-qualified name to save the model to.
Overwrite model: Determines whether to overwrite the model if it already exists.

Outputs

Output table: This component returns a model that can be connected to other BigQuery ML components that expect a model as input.
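
Importing an ONNX model into BigQuery ML is done with a CREATE MODEL statement whose MODEL_TYPE is 'ONNX'. The sketch below shows one way the three settings could be combined; mapping Overwrite model to CREATE OR REPLACE is an assumption, as are the function name and the placeholder paths.

```python
def build_import_onnx_sql(model_fqn, model_path, overwrite=False):
    """Assemble a CREATE MODEL statement for an ONNX import (hypothetical helper)."""
    if not model_path.startswith("gs://"):
        raise ValueError("model_path must be a Google Cloud Storage URI (gs://...)")
    # Assumption: overwriting maps to CREATE OR REPLACE MODEL.
    create = "CREATE OR REPLACE MODEL" if overwrite else "CREATE MODEL"
    return (
        f"{create} `{model_fqn}`\n"
        f"OPTIONS (MODEL_TYPE = 'ONNX', MODEL_PATH = '{model_path}')"
    )


print(build_import_onnx_sql(
    "my-project.my_dataset.imported_model",   # placeholder names
    "gs://my-bucket/models/model.onnx",
    overwrite=True,
))
```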

Predict

Description

Given a pre-trained ML model and an input table, this component infers a prediction for each of the input samples and returns it in a new prediction column. By default, all columns of the input table are returned; disable the Keep input columns option to return only a single ID column alongside the prediction.

For more details, check out the ML.PREDICT function documentation.

Inputs

Model: This component receives a trained model table as input.
Input table: This component receives a data table to be used as data input.

Settings

Keep input columns: Determines whether to keep all input columns in the output or not.
ID column: Select a column from the input table to be used as the unique identifier for the model. Only applies if Keep input columns is set to false.

Outputs

Output table: This component produces a table with the predictions column.
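
The effect of Keep input columns can be sketched with an ML.PREDICT query built in Python. BigQuery ML names the prediction column predicted_ followed by the label column name; the prediction_col parameter, the function name and the identifiers below are hypothetical, and the query the component actually issues may be shaped differently.

```python
def build_predict_sql(model_fqn, input_table, keep_input_columns=True,
                      id_col=None, prediction_col="predicted_label"):
    """Assemble an ML.PREDICT query (hypothetical helper, not CARTO's code)."""
    source = f"ML.PREDICT(MODEL `{model_fqn}`, TABLE `{input_table}`)"
    if keep_input_columns:
        # Every input column is returned next to the prediction.
        return f"SELECT * FROM {source}"
    if not id_col:
        raise ValueError("id_col is required when keep_input_columns is False")
    # Only the identifier and the prediction are kept.
    return f"SELECT {id_col}, {prediction_col} FROM {source}"


print(build_predict_sql(
    "my-project.my_dataset.churn_model",   # placeholder names
    "my-project.my_dataset.customers",
    keep_input_columns=False,
    id_col="customer_id",
    prediction_col="predicted_churned",
))
```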

