Setup in Windows

This section covers how to set up BDT3 with Spark Standalone and Jupyter Notebook on a Windows system. To run BDT3 in Windows using ArcGIS Pro and ArcGIS Notebooks, please see the “Setup in Pro” section.

Requirements

  • Java 17 (OpenJDK 17)

  • Apache Spark and Hadoop

    BDT Version   Apache Spark Supported Version   Hadoop Supported Version
    3.0           3.2.2                            3.2
    3.1           3.3.x                            3.3
    3.2           3.4.1                            3.3.x
    3.3           3.5.1                            3.3.x
    3.4           3.5.4                            3.3.x
    3.5           3.5.8                            3.3.x

  • Anaconda or Miniconda

  • The BDT3 jar and zip files

  • A license for BDT3
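
The version pairings listed under "Apache Spark and Hadoop" can also be kept as a small lookup, which may be convenient when scripting an install. This is only a sketch: the names BDT_COMPATIBILITY and supported_versions are illustrative and are not part of BDT.

```python
# Versions copied from the compatibility table in the requirements list.
# BDT_COMPATIBILITY and supported_versions are hypothetical helper names.
BDT_COMPATIBILITY = {
    "3.0": ("3.2.2", "3.2"),
    "3.1": ("3.3.x", "3.3"),
    "3.2": ("3.4.1", "3.3.x"),
    "3.3": ("3.5.1", "3.3.x"),
    "3.4": ("3.5.4", "3.3.x"),
    "3.5": ("3.5.8", "3.3.x"),
}


def supported_versions(bdt_version):
    """Return (spark_version, hadoop_version) for a BDT version string."""
    try:
        return BDT_COMPATIBILITY[bdt_version]
    except KeyError:
        raise ValueError(f"No table entry for BDT version {bdt_version!r}") from None


if __name__ == "__main__":
    spark, hadoop = supported_versions("3.3")
    print(f"BDT 3.3 -> Spark {spark}, Hadoop {hadoop}")
```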

Install and Setup Java

  1. Download and install Java 17 from a trusted source, e.g. Oracle or Azul.

  2. Open the Windows environment variables settings. To find the settings, type “environment variables” in the Windows search bar.

  3. Create a new system variable called JAVA_HOME and point it to the location of the installed JDK.

  4. Append %JAVA_HOME%\bin to the PATH system variable, for example: <existingpath>;%JAVA_HOME%\bin
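
After opening a new command prompt, the Java variables can be sanity-checked with a short Python script. The following is a minimal sketch, assuming the variables were set as described above; check_java_env is a hypothetical helper name, and it only inspects the environment without running Java.

```python
import os
from pathlib import PureWindowsPath


def check_java_env(env):
    """Return a list of problems found in the Java-related variables."""
    java_home = env.get("JAVA_HOME", "").strip()
    if not java_home:
        return ["JAVA_HOME is not set"]

    problems = []
    expected_bin = PureWindowsPath(java_home) / "bin"
    # PureWindowsPath compares case-insensitively and ignores a trailing
    # backslash, which matches how Windows treats PATH entries.
    path_entries = [
        PureWindowsPath(entry)
        for entry in env.get("PATH", "").split(";")
        if entry.strip()
    ]
    if expected_bin not in path_entries:
        problems.append(f"{expected_bin} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_java_env(os.environ) or ["Java variables look fine"]:
        print(problem)
```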

Install and Setup Spark + Hadoop

IMPORTANT (For BDT3 Version 3.0 Only): Spark 3.2.2 is highly recommended. Spark 3.2.1 has a known bug when calling spark-shell in Windows.

  1. Download and extract Spark and Hadoop from here. Download 7-Zip from here to extract the .tgz package. Extracting the .tgz will create a .tar, which will need to be extracted again. Be sure to verify the downloaded release per the documentation.

  2. Create a new system variable called SPARK_HOME and point it to the location of the Spark and Hadoop download. Append %SPARK_HOME%\bin to PATH. DO NOT place the Spark and Hadoop download in the Windows Program Files directory, as this will cause issues with Spark and Hadoop.

  3. Create a new empty folder with the path C:\Hadoop\bin. Visit this website and download winutils.exe and hadoop.dll for the matching version of Hadoop. Put winutils.exe and hadoop.dll in C:\Hadoop\bin.

  4. Create a new system variable called HADOOP_HOME and point it to C:\Hadoop. Append %HADOOP_HOME%\bin to PATH.

  5. Create a new system variable called PYSPARK_LOCAL_IP and set the value to localhost.

  6. Create a new system variable called PYSPARK_DRIVER_PYTHON and set the value to jupyter.

  7. Create a new system variable called PYSPARK_DRIVER_PYTHON_OPTS and set the value to lab.

  8. Create a new system variable called PYSPARK_PYTHON and set the value to python.
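
The variables from the steps above are easy to mistype, so it may help to check them from a fresh command prompt. The sketch below assumes the exact values given in this section; check_spark_env and EXPECTED_VALUES are hypothetical names, and the script only reads the environment.

```python
import os
from pathlib import PureWindowsPath

# Exact values given in steps 5 to 8 above.
EXPECTED_VALUES = {
    "PYSPARK_LOCAL_IP": "localhost",
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "lab",
    "PYSPARK_PYTHON": "python",
}


def check_spark_env(env):
    """Return a list of problems found in the Spark and Hadoop variables."""
    problems = []

    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif "program files" in spark_home.lower():
        # Step 2 warns against placing the download under Program Files.
        problems.append("SPARK_HOME points inside Program Files")

    hadoop_home = env.get("HADOOP_HOME", "")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif PureWindowsPath(hadoop_home) != PureWindowsPath("C:\\Hadoop"):
        problems.append(f"HADOOP_HOME is {hadoop_home!r}, expected C:\\Hadoop")

    for name, expected in EXPECTED_VALUES.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set")
        elif actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env(os.environ) or ["Spark variables look fine"]:
        print(problem)
```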

Install and Setup Anaconda

  1. Download and install Anaconda (or Miniconda) for Windows.

  2. Open the Anaconda Command Prompt. Create a new conda environment called bdt3-py and pull the default package set with the following command: conda create -n bdt3-py.

  3. Activate the bdt3-py environment: conda activate bdt3-py.

  4. Due to a known issue using Python 3.12 with Spark 3.5, Python must be downgraded to 3.11 with the following command: conda install python==3.11.8. This may take a few minutes.

  5. Install jupyterlab: conda install jupyterlab

  6. Geopandas is also recommended for visualizing results in the notebook. Install geopandas and its supporting libraries with the following commands:

conda install pip
pip install geopandas
conda install -c conda-forge folium matplotlib mapclassify
conda install pyarrow
  7. Ensure the spark.sql.execution.arrow.pyspark.enabled Spark config is set to true. This is done in the init script below.
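
With the bdt3-py environment activated, a quick check like the following can confirm the interpreter version and that the packages installed above are importable. It is a sketch only: missing_packages and python_is_supported are hypothetical helpers, and the 3.11 check reflects the Spark 3.5 workaround in step 4, so it may not apply to other Spark versions.

```python
import importlib.util
import sys

# Import names of the packages installed in the steps above.
RECOMMENDED = [
    "jupyterlab",
    "geopandas",
    "folium",
    "matplotlib",
    "mapclassify",
    "pyarrow",
]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter is Python 3.11 (see step 4)."""
    return tuple(version_info[:2]) == (3, 11)


if __name__ == "__main__":
    print("Python 3.11:", python_is_supported())
    print("Missing packages:", missing_packages(RECOMMENDED) or "none")
```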

Launch Jupyter Notebook with BDT

  1. Create an init script as a .cmd command file named init-bdt3.cmd and put the following contents inside it:

pyspark ^
--master local[*] ^
--driver-java-options "-XX:+UseCompressedOops -Djava.awt.headless=true" ^
--conf spark.executor.extraJavaOptions="-XX:+UseCompressedOops -Djava.awt.headless=true" ^
--conf spark.sql.execution.arrow.pyspark.enabled=true ^
--conf spark.submit.pyFiles="C:\<path>\<to>\<bdt_zip>\" ^
--conf spark.jars="C:\<path>\<to>\<bdt_jar>\"
  2. Update C:\<path>\<to>\<bdt_zip>\ with the path to the bdt3 zip file. This file can be stored anywhere on the system.

  3. Update C:\<path>\<to>\<bdt_jar>\ with the path to the bdt3 jar file. This file can be stored anywhere on the system.

  4. To learn more about what --conf does, see the Spark configuration documentation.

  5. Run init-bdt3.cmd from the Anaconda Command Prompt. This will launch a jupyterlab session in the default browser with BDT installed in the environment.
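
To make the structure of the init script easier to see, the same options can be written out as an argument list in Python. This is an illustrative sketch, not an alternative launcher: build_pyspark_args is a hypothetical helper, the example paths are placeholders, and the script only prints the arguments.

```python
from pathlib import PureWindowsPath

# JVM options used for both the driver and the executors in init-bdt3.cmd.
JAVA_OPTS = "-XX:+UseCompressedOops -Djava.awt.headless=true"


def build_pyspark_args(bdt_zip, bdt_jar):
    """Mirror the options of init-bdt3.cmd as a list of arguments."""
    if PureWindowsPath(bdt_zip).suffix.lower() != ".zip":
        raise ValueError("bdt_zip should point to the BDT3 .zip file")
    if PureWindowsPath(bdt_jar).suffix.lower() != ".jar":
        raise ValueError("bdt_jar should point to the BDT3 .jar file")
    return [
        "pyspark",
        "--master", "local[*]",
        "--driver-java-options", JAVA_OPTS,
        "--conf", f"spark.executor.extraJavaOptions={JAVA_OPTS}",
        "--conf", "spark.sql.execution.arrow.pyspark.enabled=true",
        "--conf", f"spark.submit.pyFiles={bdt_zip}",
        "--conf", f"spark.jars={bdt_jar}",
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute the real BDT3 zip and jar paths.
    for arg in build_pyspark_args("C:\\bdt\\bdt3.zip", "C:\\bdt\\bdt3.jar"):
        print(arg)
```

Each --conf entry is a single key=value pair, which is why the zip and jar paths are attached to spark.submit.pyFiles and spark.jars rather than passed as separate arguments.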

Note: If you are using raster functions per the Setup for raster processing guide, there is a known issue with the Spark packages option on Windows: packages may not work. This may produce null pointer exceptions in the command prompt when a notebook cell is first run. However, the notebook should still run with no issues. This issue was not seen when running with a local cluster.

Next steps

Start using BDT with Jupyter Notebook