GCP Dataproc Platform Setup#
This section covers how to install and set up the Big Data Toolkit on the GCP Dataproc platform.
Create a Storage Bucket#
Go to https://console.cloud.google.com/. Using the top bar, search for “Buckets”.
If there is no bucket available to store the Big Data Toolkit artifacts, create one.
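If you prefer to script this step, a bucket can also be created with the google-cloud-storage Python client. This is a minimal sketch; the bucket name and location below are placeholders.

from google.cloud import storage

# Placeholder bucket name and location; choose your own.
client = storage.Client()
bucket = client.create_bucket("your-bdt-bucket", location="us-central1")
print(f"Created bucket {bucket.name}")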
Place BDT Artifacts in Storage#
Upload the Big Data Toolkit artifacts to the bucket created in the previous step: the Python package (.zip), the Java package (.jar), and the license file (.lic). You will need all three.
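The upload can also be scripted with the same client. A minimal sketch, assuming the artifacts sit in the current working directory; the file, folder, and bucket names are placeholders.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bdt-bucket")

# Placeholder local file names; use the artifact names you downloaded.
for local_path in ["bdt.jar", "bdt.zip", "bdt.lic"]:
    bucket.blob(f"bdt-artifacts/{local_path}").upload_from_filename(local_path)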
Create a Cluster#
In the top bar, search for “Dataproc”.
Select Clusters in the left sidebar. Then click “Create Cluster” to begin the cluster creation process.
Choose to create the cluster on Compute Engine.
In the pane that opens, set your cluster name and region, then scroll to the Components section and toggle Jupyter Notebook on.
Navigate to the Customize Cluster pane. Set the following Cluster properties:
Prefix 1: spark, Key 1: spark.jars, Value 1: the location of your BDT .jar file
Prefix 2: spark, Key 2: spark.submit.pyFiles, Value 2: the location of your BDT .zip file
Enable BigQuery (Optional)
Due to a known issue with the Dataproc 2.2.x images, the following cluster configuration is required to use BigQuery with BDT in GCP Dataproc.
In the Customize Cluster pane during cluster creation, set the following custom cluster metadata:
Key 1: SPARK_BQ_CONNECTOR_URL, Value 1: gs://spark-lib/bigquery/spark-3.5-bigquery-0.42.0.jar
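Once the cluster is running and you have a PySpark notebook open (see the following sections), a quick way to confirm the connector is available is to read a small public BigQuery table. A minimal sketch; the public table below is only an example:

# `spark` is predefined in the Dataproc PySpark notebook kernel.
df = spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df.show(5)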
Finish configuring the cluster to your preferences. Click “Create” and the cluster will be created shortly.
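For reference, the same configuration can also be created programmatically with the google-cloud-dataproc Python client. This is a sketch of that approach rather than the documented BDT procedure; the project ID, cluster name, region, image version, and bucket paths are placeholders.

from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "us-central1"          # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "bdt-cluster",  # placeholder
    "config": {
        "software_config": {
            "image_version": "2.2-debian12",  # example image version
            "optional_components": ["JUPYTER"],
            "properties": {
                "spark:spark.jars": "gs://your-bucket/bdt-artifacts/bdt.jar",
                "spark:spark.submit.pyFiles": "gs://your-bucket/bdt-artifacts/bdt.zip",
            },
        },
        # Component Gateway exposes the JupyterLab link on the Web Interfaces tab.
        "endpoint_config": {"enable_http_port_access": True},
        # Only needed for the optional BigQuery workaround described above.
        "gce_cluster_config": {
            "metadata": {
                "SPARK_BQ_CONNECTOR_URL": "gs://spark-lib/bigquery/spark-3.5-bigquery-0.42.0.jar"
            }
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready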
Open JupyterLab#
When the cluster is created, its status will show as “running”. Click on the name of the cluster to open the Cluster Details.
Navigate to the Web Interfaces tab and select the link to JupyterLab.
Once inside JupyterLab, create a new Jupyter notebook. Select PySpark as the kernel when prompted.
Import BDT#
To import and authorize BDT, run the following cell after replacing the file path to point to the location of the BDT license uploaded to Cloud Storage earlier.
import bdt
bdt.auth("gs://[your-bucket]/[path-to-artifacts]/bdt.lic")
from bdt import functions as F
from bdt import processors as P
from bdt import sinks as S
from pyspark.sql.functions import rand, lit
Try out the API by listing the registered ST_ SQL functions:
spark.sql("show user functions like 'ST_*'").show()