Setup in Windows

This section covers how to set up BDT3 with Spark Standalone and Jupyter Notebook on a Windows system. To run BDT3 in Windows using ArcGIS Pro and ArcGIS Notebooks, please see the “Setup in Pro” section.

Requirements

  • Java 8 (OpenJDK 1.8)

  • Apache Spark and Hadoop

    BDT Version   Apache Spark Supported Version   Hadoop Supported Version
    3.0           3.2.2                            3.2
    3.1           3.3.x                            3.3
    3.2           3.4.1                            3.3.x
    3.3           3.5.1                            3.3.x
    3.4           3.5.4                            3.3.x

  • Anaconda or Miniconda

  • The BDT3 jar and zip files

  • A license for BDT3
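The version pairings listed above can be captured in a small lookup for checking an installation. This is a minimal sketch, not part of BDT: the names `SUPPORTED_VERSIONS`, `matches`, and `is_supported` are hypothetical, and it assumes that a trailing `x` is a wildcard and that a shorter entry such as `3.2` matches any release starting with those components.

```python
# Supported Spark/Hadoop versions per BDT release, transcribed from the
# requirements above. All names in this block are illustrative.
SUPPORTED_VERSIONS = {
    "3.0": {"spark": "3.2.2", "hadoop": "3.2"},
    "3.1": {"spark": "3.3.x", "hadoop": "3.3"},
    "3.2": {"spark": "3.4.1", "hadoop": "3.3.x"},
    "3.3": {"spark": "3.5.1", "hadoop": "3.3.x"},
    "3.4": {"spark": "3.5.4", "hadoop": "3.3.x"},
}


def matches(pattern, version):
    """Return True if `version` starts with the fixed components of `pattern`.

    Assumption: an "x" component is a wildcard, and a pattern with fewer
    components than the version is treated as a prefix.
    """
    want = [part for part in pattern.split(".") if part != "x"]
    have = version.split(".")
    return have[:len(want)] == want


def is_supported(bdt, spark, hadoop):
    """Check a BDT/Spark/Hadoop combination against the lookup."""
    row = SUPPORTED_VERSIONS.get(bdt)
    if row is None:
        return False
    return matches(row["spark"], spark) and matches(row["hadoop"], hadoop)
```

For example, `is_supported("3.4", "3.5.4", "3.3.6")` passes under these assumptions, while pairing BDT 3.0 with Spark 3.2.1 does not.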

Install and Setup Java

  1. Download and install Java 1.8 from a trusted source, e.g. Oracle or Azul.

  2. Open the Windows environment variables settings. To find the settings, type “environment variables” in the Windows search bar.

  3. Create a new system variable called JAVA_HOME and point it to the location of the downloaded jdk.

  4. Append %JAVA_HOME%\bin to the PATH system variable, for example: <existingpath>;%JAVA_HOME%\bin
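After completing the Java steps, the result can be sanity-checked from Python. The following is a minimal sketch under stated assumptions: `java_env_problems` is a hypothetical helper (not part of BDT or Spark), and it only inspects the variables passed to it rather than confirming that a working JDK exists at that location.

```python
import ntpath  # Windows path rules, usable on any platform


def java_env_problems(env):
    """Return a list of problems found with JAVA_HOME and PATH in `env`.

    `env` is any mapping of variable names to values, e.g. os.environ.
    """
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        return ["JAVA_HOME is not set"]

    # Normalize case and separators so equivalent paths compare equal.
    java_bin = ntpath.normcase(ntpath.normpath(ntpath.join(java_home, "bin")))
    path_entries = {
        ntpath.normcase(ntpath.normpath(entry))
        for entry in env.get("PATH", "").split(";")
        if entry.strip()
    }

    problems = []
    if java_bin not in path_entries:
        problems.append("PATH does not contain the JAVA_HOME bin directory")
    return problems
```

On the configured machine, calling `java_env_problems(os.environ)` in a new command prompt session (after `import os`) should return an empty list if the variables were saved correctly.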

Install and Setup Spark + Hadoop

IMPORTANT (For BDT3 Version 3.0 Only): Spark 3.2.2 is highly recommended. Spark 3.2.1 has a known bug when calling spark-shell in Windows.

  1. Download Spark and Hadoop as a .tgz package and extract it. 7-Zip can be used to extract the package on Windows. Extracting the .tgz creates a .tar file, which must be extracted again. Be sure to verify the downloaded release per the documentation.

  2. Create a new system variable called SPARK_HOME and point it to the location of the extracted Spark and Hadoop download. Append %SPARK_HOME%\bin to PATH. Do not place the Spark and Hadoop download in the Windows Program Files directory; doing so causes issues with Spark and Hadoop.

  3. Create a new empty folder with the path C:\Hadoop\bin. Visit this website and download winutils.exe and hadoop.dll for the matching version of Hadoop. Put winutils.exe and hadoop.dll in C:\Hadoop\bin.

  4. Create a new system variable called HADOOP_HOME and point it to C:\Hadoop. Append %HADOOP_HOME%\bin to PATH.

  5. Create a new system variable called PYSPARK_LOCAL_IP and set the value to localhost.

  6. Create a new system variable called PYSPARK_DRIVER_PYTHON and set the value to jupyter.

  7. Create a new system variable called PYSPARK_DRIVER_PYTHON_OPTS and set the value to lab.

  8. Create a new system variable called PYSPARK_PYTHON and set the value to python.
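The variables from the steps above can be checked in one pass. This is a minimal sketch: `spark_env_problems` and the two constants are hypothetical names, the expected values are copied from the steps in this section, and the Program Files test is a simple substring match that is assumed to be sufficient for typical install locations.

```python
# Values the steps above ask for; names in this block are illustrative.
EXPECTED_VALUES = {
    "PYSPARK_LOCAL_IP": "localhost",
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "lab",
    "PYSPARK_PYTHON": "python",
}
REQUIRED_PATH_VARS = ("SPARK_HOME", "HADOOP_HOME")


def spark_env_problems(env):
    """Return a list of problems found in the Spark-related variables."""
    problems = []

    # SPARK_HOME and HADOOP_HOME only need to be set to some location.
    for name in REQUIRED_PATH_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # The PYSPARK_* variables must hold the exact values from the steps.
    for name, expected in EXPECTED_VALUES.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")

    # The steps warn against installing under Program Files.
    if "program files" in env.get("SPARK_HOME", "").lower():
        problems.append("SPARK_HOME is inside a Program Files directory")

    return problems
```

Passing `os.environ` from a freshly opened command prompt gives a quick list of anything that was missed. The sketch does not check that winutils.exe and hadoop.dll are present in the Hadoop bin folder.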

Install and Setup Anaconda

  1. Download and install Anaconda (or Miniconda) for Windows.

  2. Open the Anaconda Command Prompt. Create a new conda environment called bdt3-py with the default Anaconda package set using the following command: conda create -n bdt3-py anaconda

  3. Activate the bdt3-py environment: conda activate bdt3-py

  4. Due to a known issue using Python 3.12 with Spark 3.5, Python must be downgraded to 3.11. Run the following command, which may take a few minutes: conda install python==3.11.8

  5. Install jupyterlab: conda install jupyterlab

  6. Geopandas is also recommended for visualizing results in the notebook. Install geopandas and its supporting libraries with the following commands:

conda install pip
pip install geopandas
conda install -c conda-forge folium matplotlib mapclassify
conda install pyarrow
  7. Ensure the spark.sql.execution.arrow.pyspark.enabled Spark config is set to true. This is done in the init script below.
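Once the environment is built, a short check run inside the activated bdt3-py environment can confirm the Python version and the installed packages. This is a minimal sketch: the function names and the package tuple are hypothetical, and it assumes that any Python release below 3.12 avoids the known issue, whereas the steps above name only 3.11.8.

```python
import importlib.util
import sys

# Packages installed in the steps above, by import name (illustrative list).
RECOMMENDED_PACKAGES = (
    "jupyterlab",
    "geopandas",
    "folium",
    "matplotlib",
    "mapclassify",
    "pyarrow",
)


def python_version_ok(version_info=sys.version_info):
    """Return True when the interpreter is older than Python 3.12."""
    return tuple(version_info[:2]) < (3, 12)


def missing_packages(names=RECOMMENDED_PACKAGES):
    """Return the names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `python_version_ok()` and `missing_packages()` with no arguments checks the running interpreter. `find_spec` only locates a package, so a broken install may still fail at import time.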

Launch Jupyter Notebook with BDT

  1. Create an init script as a .cmd command file. Name the script init-bdt3.cmd and give it the following contents:

pyspark ^
--master local[*] ^
--driver-java-options "-XX:+UseCompressedOops -Djava.awt.headless=true" ^
--conf spark.executor.extraJavaOptions="-XX:+UseCompressedOops -Djava.awt.headless=true" ^
--conf spark.sql.execution.arrow.pyspark.enabled=true ^
--conf spark.submit.pyFiles="C:\<path>\<to>\<bdt_zip>\" ^
--conf spark.jars="C:\<path>\<to>\<bdt_jar>\"
  2. Replace C:\<path>\<to>\<bdt_zip>\ with the full path to the BDT3 zip file. This file can be stored anywhere on the system.

  3. Replace C:\<path>\<to>\<bdt_jar>\ with the full path to the BDT3 jar file. This file can be stored anywhere on the system.

  4. To learn more about what --conf does, see the Spark configuration docs.

  5. Run init-bdt3.cmd from the Anaconda Command Prompt. This launches a JupyterLab session in the default browser with BDT installed in the environment.
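After the session starts, the settings passed by the init script can be checked from a notebook cell. This is a minimal sketch: `session_problems` is a hypothetical helper that accepts any `get(key, default)` callable, so it does not depend on a particular Spark object, and it assumes that a value still containing `<` is an unedited placeholder.

```python
ARROW_KEY = "spark.sql.execution.arrow.pyspark.enabled"
PATH_KEYS = ("spark.submit.pyFiles", "spark.jars")


def session_problems(get_conf):
    """Return a list of problems with the init-script settings.

    `get_conf` is any callable taking (key, default) and returning the
    configured value, for example the `get` method of a dict.
    """
    problems = []

    # Arrow must be enabled, as set in the init script.
    if str(get_conf(ARROW_KEY, "false")).lower() != "true":
        problems.append(f"{ARROW_KEY} is not set to true")

    # The zip and jar paths must be filled in with real locations.
    for key in PATH_KEYS:
        value = get_conf(key, "")
        if not value or "<" in value:
            problems.append(f"{key} is empty or still contains a placeholder")

    return problems
```

Assuming a SparkSession named `spark` is available in the notebook, one option is to pass `spark.sparkContext.getConf().get` as `get_conf`. Whether every setting appears there unchanged depends on how Spark was launched, so treat a reported problem as a prompt to inspect the init script rather than as a definitive failure.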

Note: If you are using raster functions per the “Setup for raster processing” guide, there is a known issue with the Spark packages option on Windows: packages may not work. This may produce null pointer exceptions in the command prompt when first running a notebook cell. However, the notebook should still run without issues. This issue was not seen when running on a local cluster.

Next steps

Start using BDT with Jupyter Notebook