Setup in macOS or Linux#

This section covers how to set up BDT3 with Spark Standalone and Jupyter Notebook on a Unix system (Linux or macOS).

Requirements#

  • Java 8 (OpenJDK 1.8)

  • Apache Spark and Hadoop

    • For BDT3 Version 3.0: Spark 3.2.x and Hadoop 3.2

    • For BDT3 Version 3.1: Spark 3.3.x and Hadoop 3.3

    • For BDT3 Version 3.2: Spark 3.4.1 and Hadoop 3.3.x

    • For BDT3 Version 3.3: Spark 3.5.0 and Hadoop 3.3.x

  • Anaconda or Miniconda

  • The BDT3 jar and zip files

  • A license for BDT3

Install and Setup Java#

  1. Download Java 1.8 from a trusted source, e.g. Oracle or Azul.

  2. Open the Terminal and open the shell profile (.bashrc for bash, .zshrc for zsh, depending on which shell is being used). In the profile, create a new environment variable called JAVA_HOME that points to the location of the downloaded JDK: export JAVA_HOME=<path/to/jdk>.

  3. Append $JAVA_HOME/bin to the PATH: export PATH=$PATH:$JAVA_HOME/bin
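For example, assuming the JDK was unpacked to a hypothetical $HOME/jdk8u-zulu directory (substitute the actual location), the profile additions would look like:

```shell
# Example ~/.zshrc (or ~/.bashrc) additions.
# The JDK directory below is a placeholder; replace it with the
# actual location where the JDK was unpacked.
export JAVA_HOME="$HOME/jdk8u-zulu"
export PATH="$PATH:$JAVA_HOME/bin"
```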

Install and Setup Spark and Hadoop#

  1. Download a Spark release pre-built for the matching Hadoop version from the Apache Spark downloads page. Be sure to verify the downloaded release per the documentation.

  2. Add the following to the shell profile:

export SPARK_HOME=<path/to/spark>
export SPARK_LOCAL_IP=localhost
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab --ip=0.0.0.0 --port 8888 --allow-root --no-browser --NotebookApp.token=""'
export PATH=$PATH:$SPARK_HOME/bin
  3. Source the new shell profile with either source ~/.bashrc or source ~/.zshrc, depending on which shell is being used. This applies the new environment variables to the current shell.
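After sourcing the profile, a quick sanity check confirms that the variables the later init script depends on are set. The check helper below is a local convenience function written for this sketch, not part of Spark or BDT:

```shell
# Report whether each required environment variable is set and non-empty.
check() {
  if [ -n "$2" ]; then echo "ok: $1"; else echo "missing: $1"; fi
}
check JAVA_HOME "$JAVA_HOME"
check SPARK_HOME "$SPARK_HOME"
check SPARK_LOCAL_IP "$SPARK_LOCAL_IP"
check PYSPARK_DRIVER_PYTHON "$PYSPARK_DRIVER_PYTHON"
```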

Install and Setup Anaconda#

  1. Download and install Anaconda (or Miniconda) for the correct OS.

  2. Open the Terminal. Create a new conda environment called bdt3-py and pull the default package set with the following command: conda create -n bdt3-py anaconda.

  3. Activate the bdt3-py environment: conda activate bdt3-py.

  4. Install jupyterlab: conda install jupyterlab.

  5. GeoPandas is also recommended for visualizing results in the notebook. Install geopandas and its supporting libraries with the following commands:

conda install pip
pip install geopandas
conda install -c conda-forge folium matplotlib mapclassify
conda install pyarrow
  6. Ensure the spark.sql.execution.arrow.pyspark.enabled Spark configuration is set to true. This is done in the init script below.
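The environment steps above can also be run as one batch. This is a sketch assuming conda is on the PATH; note that conda activate may require the conda shell hook (set up by conda init) when run from a non-interactive script:

```shell
# Create the BDT3 Python environment and install the visualization stack.
conda create -y -n bdt3-py anaconda
conda activate bdt3-py
conda install -y jupyterlab pip
pip install geopandas
conda install -y -c conda-forge folium matplotlib mapclassify
conda install -y pyarrow
```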

Launch Jupyter Notebook with BDT#

  1. Create an init script: make a shell script file named init-bdt3.sh with the following contents:

pyspark \
  --master local[*] \
  --driver-java-options "-XX:+UseCompressedOops -Djava.awt.headless=true" \
  --conf spark.executor.extraJavaOptions="-XX:+UseCompressedOops -Djava.awt.headless=true" \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.submit.pyFiles="/<path>/<to>/<bdt_zip>" \
  --jars /<path>/<to>/<bdt_jar>
  2. Update /<path>/<to>/<bdt_zip> with the path to the BDT3 zip file. This file can be stored anywhere on the system.

  3. Update /<path>/<to>/<bdt_jar> with the path to the BDT3 jar file. This file can be stored anywhere on the system.

  4. To learn more about what --conf does, see the Spark configuration documentation.

  5. Run init-bdt3.sh from the terminal. This launches a JupyterLab session in the default browser with BDT installed in the environment.
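The script above can also be generated directly from the shell. This sketch writes init-bdt3.sh verbatim (the <path> placeholders still need replacing) and marks it executable; the quoted 'EOF' delimiter keeps the heredoc contents from being expanded:

```shell
# Write the init script verbatim and make it executable.
cat > init-bdt3.sh <<'EOF'
pyspark \
  --master local[*] \
  --driver-java-options "-XX:+UseCompressedOops -Djava.awt.headless=true" \
  --conf spark.executor.extraJavaOptions="-XX:+UseCompressedOops -Djava.awt.headless=true" \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.submit.pyFiles="/<path>/<to>/<bdt_zip>" \
  --jars /<path>/<to>/<bdt_jar>
EOF
chmod +x init-bdt3.sh
```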

Next steps#

Start using BDT with Jupyter Notebook