GCP Dataproc Platform Setup#

This section covers how to install and set up the Big Data Toolkit on the GCP Dataproc platform.

Create a Storage Bucket#

Buckets

  • If there is not already a bucket available to store the Big Data Toolkit artifacts, create one. (A scripted alternative is sketched below.)

Create Bucket
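If you prefer to script this step, a minimal sketch using the google-cloud-storage client library is shown below; the bucket name and location are placeholders, not part of the original instructions.

from google.cloud import storage

# Create a bucket for the BDT artifacts (bucket name and location are placeholders).
client = storage.Client()
bucket = client.create_bucket("my-bdt-artifacts", location="us-central1")
print(f"Created bucket gs://{bucket.name}")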

Place BDT Artifacts in Storage#

  • Upload the Big Data Toolkit artifacts to the bucket created in the previous step: the Python package (.zip), the Java package (.jar), and the license file (.lic). You will need all three. (A scripted alternative is sketched below.)

Upload Artifacts
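The upload can also be scripted with the google-cloud-storage client library; the sketch below reuses the placeholder bucket from the previous step, and the artifact file names are placeholders for your own release's files.

from google.cloud import storage

# Upload the three BDT artifacts (file names are placeholders).
client = storage.Client()
bucket = client.bucket("my-bdt-artifacts")
for artifact in ["bdt.zip", "bdt.jar", "bdt.lic"]:
    bucket.blob(f"bdt-artifacts/{artifact}").upload_from_filename(artifact)
    print(f"Uploaded gs://{bucket.name}/bdt-artifacts/{artifact}")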

Create a Cluster#

  • In the top bar, search for “dataproc”.

DataProc

  • Select Clusters in the left sidebar. Then click “Create Cluster” to begin the cluster creation process.

Create Cluster

  • Choose to create the cluster on Compute Engine.

Compute Engine

  • In the pane that opens, set your cluster name and region, then scroll to the Components section and toggle the Jupyter Notebook component on.

Components

  • Navigate to the Customize Cluster pane. Set the following Cluster properties (an equivalent API sketch follows after this list):

    • Prefix 1: spark, Key 1: spark.jars, Value 1: the Cloud Storage location of your BDT .jar file

    • Prefix 2: spark, Key 2: spark.submit.pyFiles, Value 2: the Cloud Storage location of your BDT .zip file

Cluster Properties
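The console steps above (the Jupyter Notebook component and the two Spark properties) can also be expressed through the Dataproc Python client. The sketch below is only illustrative: the project, region, cluster name, and artifact paths are assumptions to adjust for your environment.

from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder project id
region = "us-central1"           # placeholder region

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "bdt-cluster",   # placeholder cluster name
    "config": {
        "software_config": {
            # Jupyter Notebook component, as toggled in the Components section.
            "optional_components": ["JUPYTER"],
            # The Prefix/Key/Value rows above, written as "prefix:key" entries.
            "properties": {
                "spark:spark.jars": "gs://my-bdt-artifacts/bdt-artifacts/bdt.jar",
                "spark:spark.submit.pyFiles": "gs://my-bdt-artifacts/bdt-artifacts/bdt.zip",
            },
        },
        # Component Gateway exposes the JupyterLab web interface used later.
        "endpoint_config": {"enable_http_port_access": True},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Created cluster {operation.result().cluster_name}")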

Enable BigQuery (Optional)

  • Due to a known issue with the Dataproc 2.2.x images, the following cluster configuration is required to use BigQuery with BDT in GCP Dataproc.

  • In the Customize Cluster pane during cluster creation, set the following Custom cluster metadata (also reflected in the API sketch below):

    • Key 1: SPARK_BQ_CONNECTOR_URL, Value 1: gs://spark-lib/bigquery/spark-3.5-bigquery-0.42.0.jar

Cluster Metadata
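For reference, the same metadata entry can be carried into the Dataproc API sketch above by adding it to the cluster's gce_cluster_config; only the key and connector URL come from the step above, the surrounding structure is a sketch.

# Custom cluster metadata, expressed for the Dataproc API.
gce_cluster_config = {
    "metadata": {
        "SPARK_BQ_CONNECTOR_URL": "gs://spark-lib/bigquery/spark-3.5-bigquery-0.42.0.jar"
    }
}

# Merge into the cluster config from the earlier sketch, e.g.:
# cluster["config"]["gce_cluster_config"] = gce_cluster_config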

  • Finish configuring the cluster to your preferences. Click “Create” and the cluster will be created shortly.

Open JupyterLab#

  • When the cluster is created, its status will show as “running”. Click on the name of the cluster to open the Cluster Details.

BDT Cluster

Cluster Details

  • Navigate to the Web Interfaces tab and select the JupyterLab link.

Open JupyterLab

  • Once inside JupyterLab, create a new Jupyter notebook. Select PySpark as the kernel when prompted.

New Notebook

Import BDT#

  • To import and authorize BDT, run the following cell after replacing the file path to point to the location of the BDT license uploaded to Cloud Storage earlier.

import bdt

# Authorize BDT with the license file uploaded to Cloud Storage earlier.
bdt.auth("gs://[your-bucket]/[path-to-artifacts]/bdt.lic")

# Import the BDT APIs and common PySpark helpers.
from bdt import functions as F
from bdt import processors as P
from bdt import sinks as S
from pyspark.sql.functions import rand, lit

  • Try out the API by listing the registered spatial SQL functions.

spark.sql("show user functions like 'ST_*'").show()

Next steps#

Start using BDT with Jupyter Notebook