AWS Databricks Setup

This section covers how to install and set up Big Data Toolkit (BDT) in Databricks on AWS.

Creating an S3 Bucket

  • An S3 bucket is required to store the data that BDT will process. Start by following this guide to create an S3 bucket within AWS.

  • An S3 bucket is also required to use BDT’s Geoenrichment and Network Analysis functionality.
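For readers who prefer scripting over the AWS console, the bucket could also be created with boto3. This is a minimal sketch and not part of the BDT instructions: it assumes boto3 is installed and AWS credentials are already configured, and the bucket name and region in the usage note are hypothetical. The helper that builds the request arguments is kept separate so it can be checked without calling AWS.

```python
def create_bucket_kwargs(bucket, region):
    """Build the arguments for boto3's S3 create_bucket call.

    us-east-1 is the default S3 region and must not be passed as a
    LocationConstraint; every other region must be.
    """
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket, region):
    """Create the bucket. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket, region))
```

For example, `create_bucket("my-bdt-data", "us-west-2")` would create a bucket with the placeholder name `my-bdt-data` in `us-west-2`.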

Subscribing to Databricks on AWS

  • To begin, search for Databricks in AWS Marketplace. Subscribe to the Databricks Data Intelligence Platform.

Subscribe to Databricks

Deploying Databricks on AWS

  • Follow the instructions here to deploy a Databricks workspace on AWS.

  • Launch the workspace when the setup is complete.

Add Notebook to Databricks Workspace

Create Notebook

Create a Databricks Cluster

The Databricks Runtime (DBR) to select for the cluster depends on the version of BDT3:

For BDT3 Version 3.0: DBR 10.4 LTS

For BDT3 Version 3.1: DBR 11.x

For BDT3 Version 3.2: DBR 12.2 LTS, DBR 13.3 LTS

For BDT3 Version 3.3: DBR 14.3 LTS
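The version pairing above can be captured as a small lookup, for example to check a cluster's runtime before installing BDT. This is only a sketch: the table mirrors the list above, while the function name `runtimes_for` and the handling of patch versions (treating `3.3.1` like `3.3`) are assumptions made for illustration.

```python
# BDT3 version -> supported Databricks Runtimes, copied from the list above.
SUPPORTED_RUNTIMES = {
    "3.0": ["DBR 10.4 LTS"],
    "3.1": ["DBR 11.x"],
    "3.2": ["DBR 12.2 LTS", "DBR 13.3 LTS"],
    "3.3": ["DBR 14.3 LTS"],
}


def runtimes_for(bdt_version):
    """Return the supported runtimes for a BDT3 version string.

    Only the major.minor part is used, so "3.3.1" is looked up as "3.3"
    (an assumption, not something stated in the list above).
    """
    key = ".".join(bdt_version.split(".")[:2])
    if key not in SUPPORTED_RUNTIMES:
        raise ValueError(f"No runtime listed for BDT3 version {bdt_version!r}")
    return SUPPORTED_RUNTIMES[key]
```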

Shared access mode is currently not supported, so select a policy that does not force shared access mode. See the table of Databricks policies and access modes supported by BDT on the System Requirements page for more details.

Install Big Data Toolkit

  • Go to Cluster Libraries to upload and install:

    • The BDT jar file

    • The BDT whl file

  • It is also recommended to install GeoPandas and its supporting libraries to visualize results in notebooks. Use PyPI in the cluster library installation window to install the following libraries:

    • geopandas

    • folium

    • matplotlib

    • mapclassify
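Once the cluster has restarted, a notebook cell can confirm that the recommended libraries are importable. The following is a minimal sketch using only the Python standard library; the helper name `missing_libraries` is hypothetical, and the check only covers the visualization packages listed above, not the BDT jar or whl.

```python
import importlib.util

# The visualization libraries recommended above.
RECOMMENDED = ("geopandas", "folium", "matplotlib", "mapclassify")


def missing_libraries(names=RECOMMENDED):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_libraries()
if missing:
    print("Not installed on this cluster:", ", ".join(missing))
else:
    print("All recommended libraries are available.")
```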

Connect Cluster to Amazon S3 Storage

  • This guide provides instructions on connecting to AWS S3 from Databricks using an instance profile. Connecting to S3 using an instance profile is required to use BDT’s Geoenrichment and Network Analysis functionalities.

  • Once the instance profile has been configured, return to the cluster settings. Above the Advanced Options section, set the Instance Profile to the profile that has just been created.

Set the Instance Profile
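After the instance profile is attached and the cluster has restarted, one way to confirm that the cluster can reach the bucket is to list its contents from a notebook. This sketch assumes it runs in a Databricks notebook, where `dbutils` is available as a built-in; the bucket name and prefix in the usage note are hypothetical placeholders.

```python
def s3_uri(bucket, prefix=""):
    """Build an s3:// URI from a bucket name and an optional key prefix."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def list_paths(dbutils, bucket, prefix=""):
    """List object paths under the prefix using Databricks' dbutils.fs.ls.

    If the instance profile lacks access to the bucket, the call is
    expected to fail with an access error instead of returning paths.
    """
    return [info.path for info in dbutils.fs.ls(s3_uri(bucket, prefix))]
```

In a notebook cell this could be called as `list_paths(dbutils, "my-bdt-data", "input/")`. Passing `dbutils` in explicitly keeps the helper usable outside a notebook, for instance with a stand-in object when testing.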