Setup in AWS EMR#

This section describes getting AWS EMR running with Spark and the Big Data Toolkit (BDT). First, you will create an S3 bucket and upload the BDT resources to it. Next, you will spin up an EMR cluster and test loading the resources in a Jupyter Notebook. Finally, you will authorize BDT from the notebook.

To install BDT on EMR you will need:

  • An active AWS subscription

  • The BDT jar, whl, and license files

Prepare the workspace#

Sign in to the AWS Management Console.

Create a bucket in S3 or choose an existing one.

Upload the Python package (.whl), the Java package (.jar), and the license file (.lic). You will need all three.

For example:

s3://<private-bucket>/bdt-3.3.0-3.5.1-2.12.jar
s3://<private-bucket>/bdt-3.3.0-py3-none-any.whl
s3://<private-bucket>/bdt.lic
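
If you prefer to script the upload, the following is a minimal sketch using boto3; the bucket and local file names are placeholders, and the bucket is assumed to already exist:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and file names; substitute your own.
bucket = "<private-bucket>"
for local_file in (
    "bdt-3.3.0-3.5.1-2.12.jar",
    "bdt-3.3.0-py3-none-any.whl",
    "bdt.lic",
):
    # Upload each BDT resource to the root of the bucket.
    s3.upload_file(local_file, bucket, local_file)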

Optionally, create a bootstrap script. The BDT3 install does not require any bootstrap actions, but if, for example, you need to connect to a database via JDBC, the appropriate driver must be on the classpath. The following bootstrap script installs the PostgreSQL JDBC driver on the cluster.

#!/bin/bash
# The following statement installs a library on all nodes of the EMR cluster

# required for writing to RDS
sudo aws s3 cp s3://<private-bucket>/org.postgresql_postgresql-42.2.23.jar /usr/lib/spark/jars/
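
Once a cluster bootstrapped with this script is running, you can confirm from a notebook that the driver is on the Spark classpath. This is a minimal check through PySpark's py4j gateway, assuming the PostgreSQL driver installed above:

# "spark" is the SparkSession provided by the notebook.
# Raises a py4j error if the driver jar is not on the classpath.
spark._jvm.java.lang.Class.forName("org.postgresql.Driver")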

Create an EC2 keypair and save it for SSH access to the cluster.
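
This step can also be scripted; a minimal boto3 sketch, where the key name is a placeholder:

import boto3

ec2 = boto3.client("ec2")

# "bdt-emr-key" is a placeholder key name.
response = ec2.create_key_pair(KeyName="bdt-emr-key")

# Save the private key material; AWS will not show it again.
with open("bdt-emr-key.pem", "w") as pem:
    pem.write(response["KeyMaterial"])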

Create or identify the VPC and subnets where EMR will live. This may be in the data tier alongside any databases EMR will need to access.

Review the EMR service role requirements: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-access-iam.html

At least one user will need the Administrator role for initial setup. The simplest procedure for initial setup is to let EMR create the default service roles and security groups. An administrator will follow the steps to create a cluster and attach a notebook for the first time. Subsequent users with lower access, for example developer roles, will then be able to repeat these steps to create their own clusters and notebooks.

Everything is implicitly denied unless explicitly granted.

We recommend a Windows VM in the same VPC and security group where EMR lives. This VM can host an instance of ArcGIS Pro, a web browser for notebooks, SQL Server Management Studio, and an SSH client for reaching the cluster. The jump machine pattern allows for tighter security group settings.

Create a cluster#

Select the advanced options and review the settings in each of the following sections.

Supported EMR Releases#

  • BDT3 Version 3.0: emr-6.6.0 or emr-6.7.0

  • BDT3 Version 3.1: emr-6.8.0 or emr-6.9.0

  • BDT3 Version 3.2: emr-6.12.0 or emr-6.15.0

  • BDT3 Version 3.3: emr-7.1.0

Required Packages#

  • ☒ Hadoop

  • ☒ Hive

  • ☒ JupyterEnterpriseGateway

  • ☒ Spark 3.5.0

Networking#

Create or choose an existing VPC and EC2 Subnet. It is important that the selected VPC can connect to the data tier and allow incoming connections via HTTPS and SSH.

Cluster Nodes and Instances#

Size the cluster by selecting EC2 instance types and the number of instances. This is highly dependent on the data and analysis. Start with m5.2xlarge with one Master and one Core. If the job is likely to spill to disk you may need to increase the EBS volume size from the default. Choosing Spot instances for the Core nodes can save on costs because Spark jobs can tolerate the loss of worker nodes, but we recommend only using the more reliable On-Demand instances for the Master.

Auto-termination#

Enable the Auto-termination option. Auto-termination helps avoid clusters incurring costs after they are no longer in use.
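
Auto-termination can also be applied to a running cluster; a minimal boto3 sketch, with a placeholder cluster id and a one-hour idle timeout:

import boto3

emr = boto3.client("emr")

# "j-XXXXXXXXXXXXX" is a placeholder cluster id; IdleTimeout is in seconds.
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    AutoTerminationPolicy={"IdleTimeout": 3600},
)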

EBS Root Volume#

Set the Root device EBS volume size to 100 GiB.

General Options#

Give the cluster a descriptive name. This will appear in the cluster history and is helpful for tracking costs and cloning in the future.

Set a central logging location in S3 where you will collect log files. The default log file locations EMR creates can be hard to find later.

If you have a bootstrap script, reference its location in S3 and add it as a custom bootstrap action.

Security#

Select the EC2 keypair created earlier. This is required to SSH into the cluster.

For the simplest setup experience use the default EMR role, EC2 Instance profile, and Auto scaling role.

Use the default security groups. If you create custom security groups, select the EC2 security groups carefully after reviewing EMR’s network traffic requirements: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security-groups.html
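
The settings above can also be expressed in code. Below is a minimal sketch using boto3's run_job_flow, assuming the default EMR service roles already exist; the cluster name, bucket, subnet id, and key name are placeholders to adjust for your environment:

import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="bdt-cluster",                        # descriptive cluster name
    ReleaseLabel="emr-7.1.0",                  # for BDT3 Version 3.3
    LogUri="s3://<private-bucket>/emr-logs/",  # central logging location
    Applications=[
        {"Name": "Hadoop"},
        {"Name": "Hive"},
        {"Name": "JupyterEnterpriseGateway"},
        {"Name": "Spark"},
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.2xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.2xlarge", "InstanceCount": 1},
        ],
        "Ec2KeyName": "bdt-emr-key",       # keypair created earlier
        "Ec2SubnetId": "subnet-XXXXXXXX",  # placeholder subnet id
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {"Name": "Install JDBC driver",    # optional, see bootstrap script above
         "ScriptBootstrapAction": {"Path": "s3://<private-bucket>/bootstrap.sh"}},
    ],
    AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds idle before termination
    EbsRootVolumeSize=100,                        # GiB
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)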

Authorize BDT#

Create a new PySpark notebook and attach the notebook to your cluster.

Execute a cell with Spark magics to add the jar and whl to the Livy Spark context. The -f flag forces the session to restart with the new configuration, so run this cell before any other code:

%%configure -f
{
    "jars": ["s3://<private-bucket>/bdt-3.3.0-3.5.1-2.12.jar"],
    "pyFiles": ["s3://<private-bucket>/bdt-3.3.0-py3-none-any.whl"]
}

Import the bdt library and authorize it using the license file in S3. For example:

import bdt
bdt.auth("s3://<private-bucket>/bdt.lic")

Try out the API by listing the registered spatial functions:

spark.sql("show user functions like 'ST_*'").show()

Next steps#

Start using BDT with Jupyter Notebook