Analytics on Embeddings

An extension package for CARTO Workflows that provides comprehensive embedding analytics capabilities. This package enables users to analyze, cluster, compare, and visualize high-dimensional vector embeddings derived from spatial data, satellite imagery, or any other geospatial data source.

All columns in the input table (except the ID column and date column, if specified) will be automatically used as embedding vector dimensions.

Change Detection

Description

Analyzes temporal changes in embeddings by comparing vector representations across different time periods. It calculates distance metrics between consecutive time points for each location, enabling the identification of significant changes in spatial patterns, environmental conditions, or other time-varying characteristics encoded in the embeddings.

This component is particularly useful for monitoring temporal evolution, detecting anomalies, tracking environmental changes, and identifying locations that have undergone significant transformations over time.

Inputs

  • Embedding table: Table containing the embeddings with associated timestamps and location identifiers.

Settings

  • ID column: Column containing unique identifiers for each embedding. This column is used to track changes for the same entity across different time periods.

  • Date column: Column containing the timestamps or dates associated with each embedding. This column is used to order the data chronologically and identify consecutive time periods for comparison.

  • Distance metric: Distance function used to compare vectors across time (default: Cosine).

    • Cosine: Measures the alignment between two vectors based on the cosine of the angle between them. Values closer to 0 indicate high similarity in direction, while values closer to 2 indicate larger change. This metric is ideal when you want to focus on the direction of change rather than magnitude.

    • Euclidean: Measures the straight-line distance between vectors in the embedding space. Values closer to 0 indicate high similarity, while larger values indicate greater changes. This metric considers both direction and magnitude of change, making it suitable when the absolute values of the embeddings are meaningful.

  • Comparison method: Determines how embeddings are compared across time (default: Rolling pairs).

    • Rolling pairs: Compares the embeddings of each timestamp with those of the previous time period, allowing you to track incremental changes over time. This mode is ideal for detecting gradual changes and trends.

    • Cross pairs: Compares the embeddings of each timestamp with all remaining time periods, providing a broader view of how a location evolves relative to any other time point. This mode is useful for identifying significant transformations and outliers.

  • Reduce embedding dimensionality: Whether to use Principal Component Analysis (PCA) to reduce the dimensionality of the vector embeddings before change detection (default: false). Enabling this can improve performance and reduce noise by focusing on the most significant patterns in your data.

  • PCA explained variance: Variance threshold for PCA dimensionality reduction (0.01-0.99, default: 0.9). This determines how much of the original variance is preserved. Higher values (closer to 1.0) retain more information but may also retain more noise, while lower values keep only the most important patterns at the cost of some information loss.
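
As a rough illustration of the Distance metric and Comparison method settings, the NumPy sketch below computes distances for a single location's time series. The function names (`cosine_distance`, `euclidean_distance`, `change_pairs`) are hypothetical, and treating Cross pairs as "every pair of timestamps, compared once" is an assumption; the component's own implementation may differ.

```python
from itertools import combinations

import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 2 = opposite direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to both direction and magnitude."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def change_pairs(series, metric=cosine_distance, method="rolling"):
    """Compare the embeddings of ONE location across time.

    series: list of (date, vector) tuples for a single ID.
    Returns a list of (date_from, date_to, distance) tuples.
    """
    ordered = sorted(series, key=lambda item: item[0])  # chronological order
    if method == "rolling":
        # Each timestamp against the previous one.
        pairs = zip(ordered, ordered[1:])
    elif method == "cross":
        # Assumption: every pair of timestamps, compared once.
        pairs = combinations(ordered, 2)
    else:
        raise ValueError(f"unknown comparison method: {method}")
    return [(d0, d1, metric(v0, v1)) for (d0, v0), (d1, v1) in pairs]
```

For a location with n timestamps, the rolling mode yields n - 1 rows while this reading of the cross mode yields n(n - 1)/2, which is worth keeping in mind for long time series.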

Outputs

  • Output table: The change detection table with the following columns:

    • <geoid>: Original identifier for each location/entity.

    • <date>_from: Starting timestamp for the change period.

    • <date>_to: Ending timestamp for the change period.

    • distance: Calculated distance metric between the embeddings at the two time points.

    • change_score: Measure that quantifies the degree of change between two embedding vectors for the same location across different time periods. 0 means no change.

  • Metrics table: The metrics table with the following columns:

    • total_explained_variance_ratio: The proportion of variance in the original data that is explained by the PCA dimensionality reduction. This metric is particularly useful when PCA dimensionality reduction is applied, showing how much of the original information is preserved in the reduced dimensions.

    • number_of_principal_components: The number of principal components extracted from the PCA.
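
To make the link between the PCA explained variance setting and these two metrics more concrete, here is a minimal NumPy sketch. It keeps the smallest number of principal components whose cumulative explained variance reaches the threshold. The function name `pca_reduce` and this selection rule are assumptions for illustration, not the component's actual implementation.

```python
import numpy as np


def pca_reduce(X, explained_variance=0.9):
    """Project X onto the fewest principal components reaching the threshold.

    Returns (reduced, number_of_principal_components,
             total_explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values give the variance captured along each component.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # First position where the cumulative ratio reaches the threshold.
    n_components = int(np.searchsorted(cumulative, explained_variance)) + 1
    n_components = min(n_components, len(ratios))
    reduced = centered @ components[:n_components].T
    return reduced, n_components, float(cumulative[n_components - 1])
```

In this sketch the reported ratio is always at least the requested threshold (up to floating-point rounding), because whole components are kept.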

Clustering

Description

Uses BigQuery ML's K-means algorithm to group embeddings into k clusters, identifying locations with similar spatial patterns or features. It optionally runs dimensionality reduction via Principal Component Analysis to improve clustering performance and reduce computational cost by focusing on the most significant patterns.

This component is particularly useful for discovering natural groupings in spatial data, segmenting areas based on their characteristics, and identifying outliers or distinct regions.

Inputs

  • Embedding table: Table containing the embeddings to cluster.

Settings

  • ID column: Column containing unique identifiers for each embedding. This column is used to maintain the relationship between input embeddings and their assigned clusters in the output.

  • Analysis type: Determines whether the analysis is spatial or spatio-temporal (default: Spatial).

    • Spatial: Analyzes embeddings at a single point in time or without temporal considerations. Use this for static spatial pattern analysis.

    • Spatio-temporal: Analyzes embeddings across multiple time periods, enabling detection of temporal changes in spatial patterns. Requires a date column to be specified.

  • Date column: Column containing temporal information for spatio-temporal analysis (required when Analysis type is Spatio-temporal). This column should contain date, datetime, timestamp, or numeric time values that identify when each embedding was captured.

  • Number of clusters: Number of clusters to create (2-100, default: 5). This parameter determines how many groups the algorithm will divide your embeddings into.

  • Distance type: Distance metric used for clustering calculations (default: EUCLIDEAN).

    • EUCLIDEAN: Measures straight-line distance between points in the embedding space. Best for embeddings where all dimensions have similar scales and importance.

    • COSINE: Measures the angle between vectors, ignoring magnitude. Ideal for embeddings where the direction matters more than the absolute values, such as normalized feature vectors.

  • Reduce embedding dimensionality: Whether to apply Principal Component Analysis (PCA) before clustering (default: false). Enabling this can improve performance and reduce noise by focusing on the most significant patterns in your data.

  • PCA explained variance: Variance threshold for PCA dimensionality reduction (0.01-0.99, default: 0.9). This determines how much of the original variance is preserved. Higher values (closer to 1.0) retain more information but may also retain more noise, while lower values keep only the most important patterns at the cost of some information loss.

  • Maximum iterations: Maximum number of iterations for the K-means algorithm (1-1000, default: 20). The algorithm will stop after this many iterations even if it hasn't converged. Increase this value for complex datasets that may need more iterations to find optimal clusters.

  • Early stopping: Whether to stop the algorithm early if it converges before reaching the maximum iterations (default: true). Enabling this improves performance by stopping as soon as cluster assignments stabilize, but may miss optimal solutions in some cases.

  • Minimum relative progress: Threshold for early stopping convergence detection (0.001-0.1, default: 0.01). The algorithm stops when the improvement between iterations falls below this threshold. Lower values allow more precise convergence but may require more iterations.
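
The interplay between Maximum iterations, Early stopping, and Minimum relative progress can be sketched with a plain NumPy K-means. This is not BigQuery ML's implementation; the function name `kmeans`, the random initialization, and the definition of progress as the relative drop in mean squared distance are all assumptions made for illustration.

```python
import numpy as np


def _assign(X, centroids):
    """Nearest-centroid labels and the mean squared distance (the loss)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, float(d2[np.arange(len(X)), labels].mean())


def kmeans(X, k=5, max_iterations=20, early_stop=True,
           min_rel_progress=0.01, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct input points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_loss = np.inf
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        labels, loss = _assign(X, centroids)
        if early_stop and np.isfinite(prev_loss):
            # Stop once the relative improvement drops below the threshold.
            if prev_loss == 0 or (prev_loss - loss) / prev_loss < min_rel_progress:
                break
        prev_loss = loss
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    labels, loss = _assign(X, centroids)  # final assignment
    # Cluster IDs are reported as 1..k, matching the cluster_id output column.
    return labels + 1, centroids, loss, iterations
```

With early stopping disabled, the loop always runs for the full number of iterations; with it enabled, a lower threshold lets the loop continue through smaller improvements.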

Outputs

  • Output table: The clustering table with the following columns:

    • <geoid>: Original identifier for each embedding.

    • cluster_id: Assigned cluster ID (1 to k).

    • distance_to_centroid: Distance from the embedding to its cluster centroid.

  • Metrics table: The metrics table with the following columns:

    • davies_bouldin_index: A clustering quality metric that measures the ratio of within-cluster scatter to between-cluster separation. Lower values indicate better clustering quality, with values closer to 0 representing well-separated, compact clusters.

    • mean_squared_distance: The average squared distance from each point to its assigned cluster centroid. This metric helps assess how tightly packed the clusters are: lower values indicate more compact clusters.

    • total_explained_variance_ratio: The proportion of variance in the original data that is explained by the PCA dimensionality reduction. This metric is particularly useful when PCA dimensionality reduction is applied, showing how much of the original information is preserved in the reduced dimensions.

    • number_of_principal_components: The number of principal components extracted from the PCA.
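
For readers who want to see how the two clustering quality metrics are typically defined, the sketch below computes them with NumPy using the textbook formulas and Euclidean distances. The function names are hypothetical, and the values reported by the component may be computed differently (for example, under the COSINE distance type).

```python
import numpy as np


def _centroids(X, labels):
    ids = np.unique(labels)  # sorted cluster IDs
    return ids, np.array([X[labels == c].mean(axis=0) for c in ids])


def mean_squared_distance(X, labels):
    """Average squared distance from each point to its cluster centroid."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids, centroids = _centroids(X, labels)
    index = np.searchsorted(ids, labels)  # row of each point's centroid
    return float(((X - centroids[index]) ** 2).sum(axis=1).mean())


def davies_bouldin_index(X, labels):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / separation_ij."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids, centroids = _centroids(X, labels)
    # Scatter: mean distance of a cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))
```

Both metrics reward compact clusters, but only the Davies-Bouldin index also accounts for how far apart the clusters are, so it is the more useful of the two when comparing runs with different numbers of clusters.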

Similarity Search

Description

Identifies regions with comparable spatial or contextual characteristics by leveraging spatially aware embeddings. It compares the embeddings of one or more reference locations against a set of search location embeddings to identify areas that share similar spatial patterns or features. This component supports any embedding representation that encodes spatial relationships and can be applied across a wide range of domains, including remote sensing, environmental monitoring, and spatial data analysis.

This component is particularly useful for finding locations with similar characteristics, identifying patterns across different regions, discovering areas that match specific reference conditions, and conducting spatial similarity analysis at scale.

Inputs

  • Reference location(s): Table containing vector(s) representing the baseline data or items to compare against.

  • Search locations: Table containing vectors for which similarity to the reference set will be computed.

Settings

  • Reference location(s) ID column: Column containing the IDs of the reference vector(s). This column is used to maintain the relationship between input embeddings and their similarity scores in the output.

  • Search location(s) ID column: Column containing the IDs of the search vector(s). This column is used to maintain the relationship between input embeddings and their similarity scores in the output.

  • Distance metric: Distance function used to compare vectors (default: Cosine).

    • Cosine: Measures the alignment between two vectors based on the cosine of the angle between them. Values closer to 0 indicate high similarity in direction, while values closer to 2 indicate high dissimilarity. This metric is ideal when you want to focus on the direction of the vectors rather than their magnitude.

    • Euclidean: Measures the straight-line distance between vectors in the embedding space. Values closer to 0 indicate high similarity, while larger values indicate greater dissimilarity. This metric considers both direction and magnitude, making it suitable when the absolute values of the embeddings are meaningful.

  • Aggregate across reference locations: Whether the similarity results from multiple reference locations should be aggregated into a single output (default: true). When enabled, the component computes similarity maps for each reference location and returns their mean or maximum, producing one consolidated similarity result.

  • Aggregate function: The function to use when aggregating similarity results from multiple reference locations (default: AVG).

    • AVG: Returns the average similarity score across all reference locations.

    • MAX: Returns the maximum similarity score across all reference locations.

  • Return top-k similar locations: Whether to return only the top-k most similar locations (default: true). When enabled, only the most similar locations are returned, improving performance and focusing on the most relevant results.

  • Top-k metric: Number of most similar locations to return for each reference vector (minimum 1, no upper limit; default: 10). This parameter determines how many of the most similar locations will be included in the output.

  • Reduce embedding dimensionality: Whether to use Principal Component Analysis (PCA) to reduce the dimensionality of the vector embeddings (default: false). Enabling this can improve performance and reduce noise by focusing on the most significant patterns in your data.

  • PCA explained variance: Variance threshold for PCA dimensionality reduction (0.01-0.99, default: 0.9). This determines how much of the original variance is preserved. Higher values (closer to 1.0) retain more information but may also retain more noise, while lower values keep only the most important patterns at the cost of some information loss.

  • Create vector index: Whether to create a vector index to optimize the similarity search at scale (default: false, advanced option). When enabled, the component uses Approximate Nearest Neighbor search, which improves performance at the cost of more approximate results. Without a vector index, it uses brute-force search, measuring the distance for every record.
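
The aggregation and top-k settings can be illustrated with a small brute-force sketch in NumPy (no vector index). The function name `similar_locations` is hypothetical, and the score used here is plain cosine similarity, which is an assumption; the component's similarity_score may be scaled differently.

```python
import numpy as np


def similar_locations(reference, search, aggregate="AVG", top_k=10):
    """Rank search vectors by aggregated cosine similarity to the references.

    Returns a list of (search_row_index, score), most similar first.
    """
    reference = np.asarray(reference, dtype=float)
    search = np.asarray(search, dtype=float)
    # Normalize rows so that a dot product equals the cosine similarity.
    ref_unit = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    search_unit = search / np.linalg.norm(search, axis=1, keepdims=True)
    similarity = ref_unit @ search_unit.T  # shape: (n_reference, n_search)
    if aggregate == "AVG":
        scores = similarity.mean(axis=0)
    elif aggregate == "MAX":
        scores = similarity.max(axis=0)
    else:
        raise ValueError(f"unknown aggregate function: {aggregate}")
    order = np.argsort(-scores, kind="stable")  # descending by score
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

AVG favors search locations that resemble all reference locations at once, whereas MAX favors locations that closely match at least one of them, so the two can rank the same candidates differently.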

Outputs

  • Output table: The similarity table with the following columns:

    • reference_<id>: Identifier of the reference location.

    • search_<id>: Identifier of the search location.

    • distance: Distance between the reference and search embeddings.

    • similarity_score: Calculated similarity metric between the reference and search embeddings. When Cosine distance is computed, the similarity score is based on the cosine similarity between the two vectors, ranging from 0 (opposite direction) to 1 (identical direction), where higher values indicate greater similarity. When Euclidean distance is computed, the similarity score compares the distance for each search location with the distance obtained for the mean vector of the data. The score is positive if and only if the search location is more similar to the reference than the mean vector is. The larger the score, the greater the similarity.

  • Metrics table: The metrics table with the following columns:

    • total_explained_variance_ratio: The proportion of variance in the original data that is explained by the PCA dimensionality reduction. This metric is particularly useful when PCA dimensionality reduction is applied, showing how much of the original information is preserved in the reduced dimensions.

    • number_of_principal_components: The number of principal components extracted from the PCA.

The component requires that both reference and search tables contain compatible embedding vectors with the same dimensionality. The similarity calculation will be performed for all combinations of reference and search locations, or for the top-k most similar pairs when the top-k option is enabled.

Visualization

Description

Transforms high-dimensional embeddings into RGB color representations for intuitive visualization. It supports two dimensionality reduction approaches: Principal Component Analysis (PCA) for automatic feature extraction, or manual selection of specific embedding dimensions to map directly to red, green, and blue color channels.

This component is particularly useful for creating visual representations of spatial patterns, identifying clusters through color similarity, and generating color-coded maps that reveal underlying data structures in your embeddings.

Inputs

  • Embedding table: Table containing the embeddings to visualize.

Settings

  • ID column: Column containing unique identifiers for each embedding. This column is used to maintain the relationship between input embeddings and their color assignments in the output.

  • Analysis type: Determines whether the analysis is spatial or spatio-temporal (default: Spatial).

    • Spatial: Analyzes embeddings at a single point in time or without temporal considerations. Use this for static spatial pattern visualization.

    • Spatio-temporal: Analyzes embeddings across multiple time periods, enabling visualization of temporal changes in spatial patterns. Requires a date column to be specified.

  • Date column: Column containing temporal information for spatio-temporal analysis (required when Analysis type is Spatio-temporal). This column should contain date, datetime, timestamp, or numeric time values that identify when each embedding was captured.

  • Dimensionality reduction technique: Method for reducing embeddings to 3 dimensions for RGB visualization (default: PCA).

    • PCA: Automatically extracts the three most significant principal components from your embeddings and maps them to RGB channels. This approach preserves the maximum variance in your data and is ideal when you want to discover the most important patterns automatically.

    • Manual: Allows you to manually select three specific embedding dimensions to map directly to red, green, and blue channels. Use this when you have domain knowledge about which features are most important for your visualization.

  • Red channel: Column containing values for the red color channel (required when using Manual reduction mode). Select the embedding dimension that should control the red intensity in your visualization.

  • Green channel: Column containing values for the green color channel (required when using Manual reduction mode). Select the embedding dimension that should control the green intensity in your visualization.

  • Blue channel: Column containing values for the blue color channel (required when using Manual reduction mode). Select the embedding dimension that should control the blue intensity in your visualization.
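
As an illustration of how three selected dimensions (or three principal components) could be turned into colors, the sketch below rescales each channel to 0-255 and builds a hex string. Min-max scaling per channel and the function names `to_rgb` and `to_hex` are assumptions; the component may normalize differently.

```python
import numpy as np


def to_rgb(values):
    """Min-max scale an (n, 3) array so each column spans 0-255."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(axis=0), values.max(axis=0)
    # Avoid dividing by zero when a channel is constant.
    span = np.where(high > low, high - low, 1.0)
    return np.rint((values - low) / span * 255).astype(int)


def to_hex(rgb):
    """Format each (r, g, b) row as an uppercase #RRGGBB string."""
    return ["#{:02X}{:02X}{:02X}".format(r, g, b) for r, g, b in rgb.tolist()]
```

Because the scaling is relative to the minimum and maximum of each channel, colors from two separate runs are only comparable if both runs cover the same value ranges.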

Outputs

  • Output table: The image table with the following columns:

    • <geoid>: Original identifier for each embedding.

    • r_channel: Normalized red channel value (0-255).

    • g_channel: Normalized green channel value (0-255).

    • b_channel: Normalized blue channel value (0-255).

    • hex_color: Hexadecimal color representation (e.g., #FF5733) for easy use in mapping applications.

  • Metrics table: The metrics table with the following columns:

    • total_explained_variance_ratio: The proportion of variance in the original data that is explained by the dimensionality reduction. When using PCA, this indicates how much information is preserved in the visualization. Values closer to 1.0 indicate that the visualization captures most of the original data's structure.

    • luminance_variance: The variance of the luminance values across all visualized embeddings. Luminance is calculated using the standard RGB-to-luminance conversion formula (0.2989 × R + 0.5870 × G + 0.1140 × B) and represents the perceived brightness of each color. Higher variance indicates greater diversity in brightness levels across your visualization, while lower variance suggests more uniform brightness distribution.
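
The luminance formula above can be applied directly to the output channels. The sketch below is a minimal NumPy version; the function name `luminance_variance` is hypothetical, and using the population variance (rather than the sample variance) is an assumption.

```python
import numpy as np

# Weights from the RGB-to-luminance formula quoted above.
LUMINANCE_WEIGHTS = np.array([0.2989, 0.5870, 0.1140])


def luminance_variance(rgb):
    """Variance of perceived brightness over an (n, 3) array of RGB values."""
    rgb = np.asarray(rgb, dtype=float)
    luminance = rgb @ LUMINANCE_WEIGHTS  # one brightness value per row
    return float(luminance.var())  # assumption: population variance (ddof=0)
```

A result near zero means every location has roughly the same brightness, so differences in the map come from hue alone.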
