Azure Databricks Setup#

This section covers how to install and set up Big Data Toolkit in Azure Databricks.

Create Resource Group#

Create a resource group

  • Go to your resource group. Add a storage account in the same region as your resource group.

image1

Create Storage Account#

  • Search for Storage Account and follow these steps:

Create a storage account

image2

  • Use the same region, Standard performance, and select LRS (locally-redundant storage).

image3

  • Make sure to enable hierarchical namespace (this is what makes the account an Azure Data Lake Storage Gen2 account).

Add Databricks Service to Resource Group#

  • In your resource group, add Azure Databricks Service.

Create Azure Databricks Service

image4

Install & configure Databricks CLI#

  • Launch Databricks workspace.

Launch Workspace

  • In your user settings, click Generate New Token.

image5

  • Save the token in a safe place

image6

  • Execute pip3 install databricks-cli.

  • Set the following environment variables in your .bash_profile, replacing the host with your workspace URL and the token with the one you generated above.

export DATABRICKS_HOST=https://westus2.azuredatabricks.net
export DATABRICKS_TOKEN=<your-token>
  • In your terminal, execute source ~/.bash_profile.

Add Notebook to Databricks Workspace#

Create Notebook

image7

Use Principal Key to set up DBFS#

  • A service principal set up in your Azure subscription is a prerequisite.

  • After the service principal is set up, enter the following commands to store its credential as a Databricks secret:

databricks secrets create-scope --scope adls --initial-manage-principal users
databricks secrets put --scope adls --key credential
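
To confirm the secret is available from a cluster, it can be read back in a notebook with the Databricks secrets utility. This is a minimal check, not part of the original setup steps; the scope and key names match the commands above, and secret values are redacted in notebook output.

# Python notebook cell: confirm the "adls" scope and "credential" key exist
dbutils.secrets.list("adls")                                      # lists secret metadata in the scope
credential = dbutils.secrets.get(scope="adls", key="credential")  # value is redacted if displayed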

Create a Databricks Cluster#

The Databricks Runtime of the cluster depends on the version of BDT3.

  • For BDT3 Version 3.0: DBR 10.4 LTS

  • For BDT3 Version 3.1: DBR 11.x

  • For BDT3 Version 3.2: DBR 12.2 LTS, DBR 13.3 LTS

  • For BDT3 Version 3.3: DBR 14.3 LTS

Currently, Shared access mode is not supported; select a cluster policy that does not force Shared access mode. See the table of Databricks policies and access modes supported by BDT on the System Requirements page for more details.

Install Big Data Toolkit#

  • Go to Cluster Libraries to upload and install:

    • The BDT jar file

    • The BDT whl file

  • It is also recommended to install GeoPandas and its supporting libraries to visualize results in notebooks (see the sketch after this list). Use PyPI in the cluster library installation window to install the following libraries:

    • geopandas

    • folium

    • matplotlib

    • mapclassify

* pyarrow is also required for GeoPandas visualization but is pre-installed in Databricks.
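
Once these libraries are installed, BDT results can be rendered on a map from a notebook. The following is a hedged sketch only: it assumes a Spark DataFrame df with a WKT geometry column named SHAPE_WKT in WGS84; the variable name, column name, and CRS are illustrative and not part of BDT's documented API.

import geopandas as gpd

# Assumes `df` is a Spark DataFrame whose geometries are WKT strings in a column
# named "SHAPE_WKT" (illustrative name) with coordinates in WGS84 (EPSG:4326).
pdf = df.toPandas()
gdf = gpd.GeoDataFrame(
    pdf,
    geometry=gpd.GeoSeries.from_wkt(pdf["SHAPE_WKT"]),
    crs="EPSG:4326",
)
gdf.explore()  # interactive folium map; uses folium, matplotlib, and mapclassify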

How to access Azure Blob Storage#

There are three supported methods to connect to Azure Data Lake Storage in Databricks: an Azure service principal, a SAS token, or a storage account key.

See Connect to Azure Data Lake Storage Gen2 and Blob Storage for details on how to connect using the various methods. Currently, in order to use Big Data Toolkit in Databricks, these instructions must be modified slightly. See the appropriate sections below for more detail on each connection method.

Access Azure Storage using Azure Service Principal#

To connect to Azure Storage using the Service Principal, follow the Instructions to set the Azure connection properties in the cluster configuration. Note that each property is prefixed with spark.hadoop. so that it can be set in the cluster configuration, as shown below.

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
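
Alternatively, the same properties can be set per notebook session instead of in the cluster configuration. The sketch below follows the pattern in the Databricks documentation; the spark.hadoop. prefix is dropped when using spark.conf.set, and the placeholders must be replaced with your own values.

# Python notebook cell: session-scoped service principal configuration
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")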

Access Azure Storage using SAS Token#

To connect to Azure Storage using a SAS Token, first create a SAS token for your storage account. When creating the token, enable the permissions required to authorize Databricks to access the account.

Once the SAS Token has been created, store it as a Databricks Secret using the following commands.

databricks secrets create-scope --scope adls --initial-manage-principal users
databricks secrets put --scope adls --key SAStoken

Then, follow the Instructions for a SAS Token to set the Azure connection properties in the cluster configuration. The linked documentation does not include an example of setting this configuration at the cluster level, but it is still possible. Prefix each of the Azure connection properties with "spark.hadoop.", as shown below, to set them in the cluster configuration.

spark.hadoop.fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net {{secrets/adls/SAStoken}}
spark.hadoop.fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net SAS
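
The SAS token configuration can also be applied per notebook session (again without the spark.hadoop. prefix), reading the token from the secret created above. This is a sketch assuming the adls scope and SAStoken key from the earlier commands.

# Python notebook cell: session-scoped SAS token configuration
sas_token = dbutils.secrets.get(scope="adls", key="SAStoken")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", sas_token)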

Access Azure Storage using Account Key#

To connect to Azure Storage using an account key, first view your account access keys. Copy the value of key1; this value is used as the account key by both Azure Storage and Big Data Toolkit. Keep it in a safe place for later use.

Store the account key as a Databricks Secret using the following commands.

databricks secrets create-scope --scope adls --initial-manage-principal users
databricks secrets put --scope adls --key accountkey

Then, follow the Instructions for an Account Key to set the Azure connection properties in the cluster configuration. The linked documentation does not include an example of setting this configuration at the cluster level, but it is still possible. Prefix the Azure connection property with "spark.hadoop.", as shown below, to set it in the cluster configuration.

spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net {{secrets/adls/accountkey}}
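
As with the other methods, the account key can instead be set per notebook session (without the spark.hadoop. prefix), and access can then be verified by listing a container. This is a sketch; <container> and <storage-account> are placeholders for your own values.

# Python notebook cell: session-scoped account key configuration and a quick access check
account_key = dbutils.secrets.get(scope="adls", key="accountkey")

spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net", account_key)
display(dbutils.fs.ls("abfss://<container>@<storage-account>.dfs.core.windows.net/"))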