Setup in macOS or Linux#
This section covers how to set up BDT3 with Spark Standalone and Jupyter Notebook on a Unix system (Linux or macOS).
Requirements#
Java 8 (OpenJDK 1.8)
Apache Spark and Hadoop
For BDT3 Version 3.0: Spark 3.2.x and Hadoop 3.2
For BDT3 Version 3.1: Spark 3.3.x and Hadoop 3.3
For BDT3 Version 3.2: Spark 3.4.1 and Hadoop 3.3.x
For BDT3 Version 3.3: Spark 3.5.0 and Hadoop 3.3.x
Anaconda or Miniconda
The BDT3 jar and zip files
A license for BDT3
Install and Setup Java#
Download Java 1.8 from a trusted source, e.g. Oracle or Azul.
Open the Terminal and open either the bash profile (.bashrc) or the zsh profile (.zshrc), depending on which shell is being used. In the profile, create a new environment variable called JAVA_HOME and point it to the location of the downloaded JDK with the following command: export JAVA_HOME=<path/to/jdk>
Append $JAVA_HOME/bin to the PATH: export PATH=$PATH:$JAVA_HOME/bin
Install and Setup Spark and Hadoop#
Download a Spark release prebuilt for the matching Hadoop version from the Apache Spark downloads page. Be sure to verify the downloaded release per the documentation.
Add the following to the shell profile:
export SPARK_HOME=<path/to/spark>
export SPARK_LOCAL_IP=localhost
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab --ip=0.0.0.0 --port 8888 --allow-root --no-browser --NotebookApp.token=""'
export PATH=$PATH:$SPARK_HOME/bin
Source the new shell profile with either source ~/.bashrc or source ~/.zshrc, depending upon which shell is being used. This will apply the new environment variables to the shell.
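If it is unclear which profile applies, the current login shell can be checked via the $SHELL variable. A small sketch (the case logic here is an illustration, not part of BDT3):

```shell
# Pick the profile file that matches the current login shell.
case "${SHELL:-/bin/bash}" in
  */zsh) profile="$HOME/.zshrc" ;;
  *)     profile="$HOME/.bashrc" ;;
esac
echo "run: source $profile"
```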
Install and Setup Anaconda#
Download and install Anaconda (or Miniconda) for the correct OS.
Open the Terminal. Create a new conda environment called bdt3-py and pull the default package set with the following command: conda create -n bdt3-py anaconda
Activate the bdt3-py environment: conda activate bdt3-py
Install jupyterlab: conda install jupyterlab
Geopandas is also recommended for visualizing results in the notebook. Install geopandas and its supporting libraries with the commands below:
conda install pip
pip install geopandas
conda install -c conda-forge folium matplotlib mapclassify
conda install pyarrow
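Before running the installs above, it can help to confirm that the conda command is actually on the PATH (a generic sketch, not specific to BDT3):

```shell
# Check that the conda command is available before installing packages.
if command -v conda >/dev/null 2>&1; then
  conda_found=yes
else
  conda_found=no   # install Anaconda/Miniconda first, then retry
fi
echo "conda on PATH: $conda_found"
```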
Ensure the spark.sql.execution.arrow.pyspark.enabled Spark config is set to true. This is done in the init script below.
Launch Jupyter Notebook with BDT#
Create an init script by making a .sh shell script file. Call the init script init-bdt3.sh and put the below contents inside the script:
pyspark \
--master local[*] \
--driver-java-options "-XX:+UseCompressedOops -Djava.awt.headless=true" \
--conf spark.executor.extraJavaOptions="-XX:+UseCompressedOops -Djava.awt.headless=true" \
--conf spark.sql.execution.arrow.pyspark.enabled=true \
--conf spark.submit.pyFiles="/<path>/<to>/<bdt_zip>" \
--jars /<path>/<to>/<bdt_jar>
Update /<path>/<to>/<bdt_zip> with the path to the bdt zip file. This file can be stored anywhere on the system.
Update /<path>/<to>/<bdt_jar> with the path to the bdt3 jar file. This file can be stored anywhere on the system.
To learn more about what --conf does, see the Spark configuration documentation.
Run init-bdt3.sh from the terminal. This will launch a JupyterLab session in the default browser with BDT installed on the environment.
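As a convenience, the init script can be wrapped with a quick existence check on the two BDT artifacts, so a typo in either path fails fast instead of surfacing later as a Spark classpath error. A sketch, with hypothetical default paths (adjust both to the actual file locations):

```shell
#!/bin/sh
# Fail fast if the BDT artifacts are missing before launching pyspark.
# The BDT_JAR/BDT_ZIP defaults below are hypothetical examples.
BDT_JAR="${BDT_JAR:-$HOME/bdt3/bdt3.jar}"
BDT_ZIP="${BDT_ZIP:-$HOME/bdt3/bdt3.zip}"

missing=0
for f in "$BDT_JAR" "$BDT_ZIP"; do
  [ -f "$f" ] || { echo "missing: $f" >&2; missing=1; }
done

if [ "$missing" -eq 0 ]; then
  sh ./init-bdt3.sh          # launches JupyterLab via pyspark
else
  echo "fix the path(s) above and re-run" >&2
fi
```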