Setup for Raster Processing in AWS EMR#
The BDT raster processors require additional libararies not included with the BDT jar. This guide describes how to pull these libraries into your environment.
For spark to properly install the required libraries, an ivySettings.xml file must be provided. See the snippet below for what this file should look like. Upload this xml file to an easily accessible place in your s3 bucket.
<ivysettings>
<settings defaultResolver="chain"/>
<resolvers>
<chain name="chain">
<ibiblio name="osgeo" m2compatible="true" root="https://repo.osgeo.org/repository/release" />
<ibiblio name="central" m2compatible="true" root="https://repo1.maven.org/maven2/" />
</chain>
</resolvers>
</ivysettings>
This ivySettings.xml file must be copied from s3 to each cluster node using an init script. Create an
emr_raster_bootstrap.sh
init script with the following line and upload it to s3:
sudo aws s3 cp s3://<path-to-ivySettings.xml> /home/hadoop/
Make sure to replace the ivySettings.xml path with the correct location on your s3 bucket.
Create a new clutser in EMR. In cluster creation under “Bootstrap actions”, add the
emr_raster_bootstrap.sh
init script. This is located in the s3 bucket that you saved it to in the previous step.
Once the cluster with the init script has started, insert this cell into your notebook and run it. Replace the path to the BDT jar and BDT zip with their locations on your s3 bucket.
%%configure -f
{
"jars": ["s3://my-bucket/<path-to-bdt-jar>"],
"pyFiles": ["s3://my-bucket/<path-to-bdt-zip>"],
"conf": {
"spark.jars.packages": "org.geotools:gt-main:24.6,org.geotools:gt-geotiff:24.6,org.geotools:gt-epsg-hsql:24.6",
"spark.jars.excludes": "org.scala-lang:scala-reflect,com.fasterxml.jackson.core:jackson-core",
"spark.jars.repositories": "https://mvnrepository.com/artifact,https://repo.osgeo.org/repository/release,https://repo1.maven.org/maven2",
"spark.jars.ivySettings": "file:///home/hadoop/ivySettings.xml"
}
}
Running this cell for the first time may produce an error. Because these additional libraries needed for raster processing take time to download, the startup may time out before they are done downloading. If this happens, detach and re-atach the cluster and run this cell again. You may need to do this once or twice more. The download progress should be cached from the previous run and subsequent runs will have enough time to complete.