Access data in Databricks
To make your Data Observatory subscriptions accessible in your Databricks account, CARTO first needs to create and register the subscriptions in your CARTO account. Once this step is done, Admin users of your CARTO organization will be able to see the list of active subscriptions in the Data Observatory section in Settings.
Once your subscriptions have been created, the CARTO team will proceed with the data transfers to make the data accessible directly in your Databricks account. Please contact your CARTO representative if you need to start this process on demand (e.g. for public data subscriptions).
To make your subscriptions available in your Databricks account, CARTO leverages Databricks' native private exchange mechanism, powered by Delta Sharing.
To create and share a private exchange with Delta Sharing to your Databricks account, you will need to have Databricks' Unity Catalog enabled.
CARTO will create and maintain a private exchange shared with your Databricks account containing the data from all your Data Observatory subscriptions. To create such a private share, we will need the following information about your Databricks account:
Databricks Metastore ID or Databricks sharing identifier (more info)
Cloud
Region
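If you have access to a SQL warehouse or cluster attached to your metastore, the sharing identifier, which encodes all three values in the form <cloud>:<region>:<metastore-uuid>, can be retrieved with a query like the following:

```sql
-- Returns the Databricks sharing identifier of the current metastore.
SELECT CURRENT_METASTORE();
```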
Once CARTO has completed the data transfer, you will be able to navigate to the Delta Sharing section in the Catalog area and identify CARTO's private share under "Shared with me".
To access the data from the private share, you then need to click on "Create Catalog". You will be prompted to provide a Catalog name.
Once the new catalog has been created, you will be able to access the data tables within the carto schema.
Alternatively, CARTO can provide you with the command to create the catalog associated with the private share directly in the SQL Editor. It will look like this (with the details of your specific private share):
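A minimal sketch of that command, using placeholder provider and share names (CARTO will send you the actual identifiers for your private share):

```sql
-- Create a read-only catalog from CARTO's Delta Share.
-- `carto_provider.carto_do_share` and `carto_do` are placeholders;
-- use the provider, share, and catalog names from your own setup.
CREATE CATALOG IF NOT EXISTS carto_do
USING SHARE carto_provider.carto_do_share;
```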
It is important to note that the catalog created from the Delta Share is a "Delta Sharing Catalog", which is read-only. This will prevent you from preparing the data for faster geospatial queries, as well as from carrying out any sort of processing of the data contained in that catalog.
We recommend copying the tables into a different Catalog within your Databricks organization; to do that you can execute the following query:
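For example, a minimal sketch using CREATE TABLE AS SELECT, with hypothetical catalog, schema, and table names:

```sql
-- Copy a table out of the read-only Delta Sharing catalog into a
-- writable catalog of your own (all names below are placeholders).
CREATE SCHEMA IF NOT EXISTS my_catalog.carto;

CREATE TABLE my_catalog.carto.my_subscription_table AS
SELECT * FROM carto_do.carto.my_subscription_table;
```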
To use the data from your Data Observatory subscriptions that has been transferred directly to your Databricks account, do not use the Data Observatory tabs in Data Explorer, Workflows, or Builder; use your own Databricks connections directly instead.
You need to ensure that the catalog where you have copied the tables from your Data Observatory subscriptions is accessible from your Databricks connections.
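For instance, if your connection authenticates as a specific principal, you can grant it read access to the catalog; the principal name below is hypothetical:

```sql
-- Grant the principal used by your CARTO connection read access
-- to the catalog holding the copied tables (names are placeholders).
GRANT USE CATALOG ON CATALOG my_catalog TO `carto-connection@example.com`;
GRANT USE SCHEMA ON CATALOG my_catalog TO `carto-connection@example.com`;
GRANT SELECT ON CATALOG my_catalog TO `carto-connection@example.com`;
```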
Remember that to benefit from the full power of CARTO and Databricks for your geospatial data, you need to prepare your tables for fast geospatial queries.
This requires that your Databricks Workspace has Spatial SQL functions enabled, currently in Private Preview. The Databricks team has made this form available to request access to the functions. Please get in touch with your Databricks representative or through the form to gain access to all Spatial SQL functions.
To learn more about Databricks connections in CARTO, please visit this section. For map performance considerations, please read these recommendations.
There are two options to prepare your spatial data for faster queries and map visualizations in Databricks:
CARTO's Data Explorer: If your tables are available via one of your Databricks connections set up in CARTO, our Data Explorer will detect whether your tables have been prepared for efficient geospatial processing. Click on "Prepare this table" to get your data ready.
Then, you will need to define a location within your catalog and a name for the resulting table. Once ready, click on "Create".
Once the process is completed, the resulting table will be ready for geospatial visualizations and faster queries.
Databricks SQL Editor: To prepare your data for spatial analysis in Databricks, please follow these recommendations:
The geo column must be of binary type and contain a WKB representation of the geography.
Each row must contain four additional columns __carto_xmin, __carto_ymin, __carto_xmax, __carto_ymax that describe the Bounding Box of each feature. These columns help store the table in a way that allows fast queries and avoids full scans on each query.
You can achieve this in your tables with a SQL query like the following:
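A minimal sketch, assuming the Spatial SQL Private Preview functions (st_geomfromwkb, st_xmin, st_ymin, st_xmax, st_ymax) are enabled in your workspace and that the source table has a binary geom column holding WKB geographies; all table names are placeholders, and the CLUSTER BY clause (liquid clustering on the bounding-box columns) is one way to realize the fast-query layout described above:

```sql
-- Build a prepared copy of the table with bounding-box columns,
-- clustered on those columns so spatial filters avoid full scans.
CREATE TABLE my_catalog.carto.my_table_prepared
CLUSTER BY (__carto_xmin, __carto_xmax, __carto_ymin, __carto_ymax)
AS
SELECT
  t.*,
  st_xmin(st_geomfromwkb(t.geom)) AS __carto_xmin,
  st_ymin(st_geomfromwkb(t.geom)) AS __carto_ymin,
  st_xmax(st_geomfromwkb(t.geom)) AS __carto_xmax,
  st_ymax(st_geomfromwkb(t.geom)) AS __carto_ymax
FROM my_catalog.carto.my_table AS t;
```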