Geoenrichment#

  • BDT can enrich geometries with apportioned Esri Business Analyst (BA) variables. Currently, BDT only supports enriching polygons located in the United States.

  • Geoenrichment requires an additional, separate license for the BA data. Please contact the BDT team at bdt_support@esri.com if you are interested.

Table of Contents

  1. Setup BDT

  2. Generate Sample Data

  3. Geoenrichment

    1. ST_Enrich

    2. Processor Enrich

Part 0: Setup BDT#

[ ]:
import bdt
bdt.auth("bdt.lic")
from bdt.processors import *
from bdt.functions import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
BDT has been successfully authorized!

            Welcome to
             ___    _                ___         __             ______             __   __     _   __
            / _ )  (_)  ___ _       / _ \ ___ _ / /_ ___ _     /_  __/ ___  ___   / /  / /__  (_) / /_
           / _  | / /  / _ `/      / // // _ `// __// _ `/      / /   / _ \/ _ \ / /  /  '_/ / / / __/
          /____/ /_/   \_, /      /____/ \_,_/ \__/ \_,_/      /_/    \___/\___//_/  /_/\_\ /_/  \__/
                      /___/

BDT python version: v3.3.0-v3.3.0
BDT jar version: v3.3.0-v3.3.0

Part 1: Generate Sample Data#

  • Geometries for enrichment must be polygons, and must be in spatial reference Web Mercator 3857.

  • In the cell below, a sample polygon in spatial reference 3857 is constructed from WKT.

[ ]:
polygon_wkt = """POLYGON ((-8589916.801660 4722261.253987,
                       -8559808.739548 4722117.817925,
                       -8557660.375723 4694677.577920,
                       -8590986.920056 4694254.930233)) """

poly_schema = StructType([StructField("POLY_ID", IntegerType()),
                          StructField("POLY_WKT", StringType())])

poly_data = [(1, polygon_wkt),]

poly_df = (
    spark
        .createDataFrame(data = poly_data, schema = poly_schema)
        .select(col("POLY_ID"), st_fromText("POLY_WKT").alias("SHAPE"))
)
poly_df.show(truncate = True)
+-------+--------------------+
|POLY_ID|               SHAPE|
+-------+--------------------+
|      1|{[01 06 00 00 00 ...|
+-------+--------------------+
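
Source coordinates are often available as longitude/latitude (WGS84) rather than Web Mercator. The sketch below shows one way such coordinates could be projected to 3857 and written out as WKT before being passed to st_fromText. It uses only the standard spherical Web Mercator formulas in plain Python. The helper names lonlat_to_web_mercator and polygon_wkt_3857 and the sample corner coordinates are hypothetical and are not part of BDT.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project a WGS84 longitude/latitude pair (degrees) to EPSG:3857 meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def polygon_wkt_3857(lonlat_ring):
    """Build a POLYGON WKT string in 3857 from (lon, lat) vertices."""
    ring = [lonlat_to_web_mercator(lon, lat) for lon, lat in lonlat_ring]
    # Repeat the first vertex at the end so the ring is explicitly closed.
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join(f"{x:.6f} {y:.6f}" for x, y in ring)
    return f"POLYGON (({coords}))"


# Hypothetical corner coordinates (lon, lat), chosen only for illustration.
sample_ring = [(-77.2, 39.0), (-76.9, 39.0), (-76.9, 38.8), (-77.2, 38.8)]
sample_wkt = polygon_wkt_3857(sample_ring)
print(sample_wkt)
```

The resulting string could then take the place of polygon_wkt in the cell above. For production data, a dedicated projection library is likely a better choice than hand-written formulas.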

Part 2: Geoenrichment#

  1. BDT offers two ways to enrich: ST_Enrich and Processor Enrich.

  2. ST_Enrich is a Spark SQL function, and Processor Enrich is a Python function that expects DataFrames as arguments.

  3. Processor Enrich supports variable selection, and ST_Enrich does not. More on this below.

Part 2.1: Geoenrichment with ST_Enrich#

  • In the cell below, st_enrich is used to enrich the polygon created above.

  • The function st_enrich expects an array of polygons and returns all of the apportioned BA variables as a StructType. It is recommended to use inline() to unpack the struct.

  • st_enrich always enriches all available variables, plus 4 additional variables that are derived from the enrichment call.

  • Any error messages printed when the cell below runs are expected and can be ignored. Enrichment will still complete successfully.

[ ]:
enriched_df = (
    poly_df
        .select(st_enrich(array(col("SHAPE"))).alias("ENRICH"))
        .selectExpr("inline(ENRICH)")

)

(
enriched_df
    .selectExpr("TOTPOP_CY", "HHPOP_CY", "FAMPOP_CY", "GQPOP_CY")
    .show(1, truncate=True)
)

print(len(enriched_df.columns))
+---------+---------+---------+--------+
|TOTPOP_CY| HHPOP_CY|FAMPOP_CY|GQPOP_CY|
+---------+---------+---------+--------+
|1677664.0|1618652.0|1057575.0| 59012.0|
+---------+---------+---------+--------+

19326
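
The values above are apportioned: broadly, each source statistic is allocated to the input polygon according to how much of the source area the polygon covers. The sketch below illustrates that general idea with plain area weighting over axis-aligned rectangles. It is an illustration under stated assumptions, not BDT's implementation. The weighting that BDT actually applies is not described in this section, and the rectangles, population values, and function names are all hypothetical.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax); 0 if degenerate."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def rect_intersection(a, b):
    """Overlap of two rectangles; the result has zero area if they do not overlap."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def apportion(target, sources):
    """Sum each source value weighted by the fraction of its area inside target."""
    total = 0.0
    for rect, value in sources:
        area = rect_area(rect)
        if area > 0:
            overlap = rect_area(rect_intersection(target, rect))
            total += value * overlap / area
    return total


# Hypothetical source areas with a population value each (not real BA data).
source_areas = [
    ((0, 0, 10, 10), 1000.0),
    ((10, 0, 20, 10), 400.0),
]
target = (5, 0, 15, 10)  # covers half of each source area

estimate = apportion(target, source_areas)
print(estimate)  # 700.0
```

Here the target covers half of each source rectangle, so it receives half of each value: 500 + 200 = 700.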

Part 2.2: Geoenrichment with Processor Enrich#

  • Processor Enrich does the same thing as st_enrich, but also supports variable selection.

  • To get the complete list of BA variables supported by BDT, use the BDT processor ba_variables. The ba_variables processor takes no parameters and returns a DataFrame with 3 columns: Variable, Description, and Data_Type. This list of variables is updated annually.

  • As seen above, the st_enrich call enriched 19,322 variables plus 4 additional derived ones, for 19,326 columns in total. We will select only a subset of those for Processor Enrich.

[ ]:
var_df = ba_variables()
[ ]:
var_df.printSchema()
root
 |-- Variable: string (nullable = false)
 |-- Description: string (nullable = false)
 |-- Data_Type: string (nullable = false)

[ ]:
var_df.show(5, truncate=False)
+----------+------------------------------------------------+---------+
|Variable  |Description                                     |Data_Type|
+----------+------------------------------------------------+---------+
|ACSAVGGRNT|2021 Average Gross Rent (ACS 5-Yr)              |DOUBLE   |
|ACSNKDFLF |2021 Females 20-64: No Kids <18/in LF (ACS 5-Yr)|DOUBLE   |
|ACSRMV1990|2021 RHHs/Moved In: 1990-1999 (ACS 5-Yr)        |DOUBLE   |
|AGGDIA15CY|2023 Aggr Disposable Inc: HHr 15-24             |DOUBLE   |
|AI50C20   |2020 American Indian Pop 50-54                  |DOUBLE   |
+----------+------------------------------------------------+---------+
only showing top 5 rows

[ ]:
var_df.count()
19322

  • There are many BA variables, and not all of them may be relevant to our use case.

  • Let's say that for our sample case, we only want to enrich variables relating to supermarket use.

[ ]:
var_df = var_df.where("contains(Description, 'Supermarket')")
[ ]:
var_df.show(5, truncate=False)
+----------+----------------------------------------------------+---------+
|Variable  |Description                                         |Data_Type|
+----------+----------------------------------------------------+---------+
|MP14132a_I|2023 Index: Filled Prescription at Supermarket/12 Mo|DOUBLE   |
|MP14132a_B|2023 Filled Prescription at Supermarket/12 Mo       |DOUBLE   |
+----------+----------------------------------------------------+---------+

[ ]:
var_df.count()
2

  • Now, we call Processor Enrich with the filtered variable DataFrame. Enrichment will take much less time, since we are now enriching only 2 variables instead of 19,322.

  • There are two ways of calling a processor in BDT: as a standalone function that takes the DataFrame as its first argument, or as a method on the DataFrame. Both are shown below, and both produce the same output.

[ ]:
poly_df = poly_df.withMeta("Polygon", 3857)
[ ]:
enriched_df1 = enrich(
    poly_df,
    var_df,
    variable_field="Variable",
    sliding=20
)

(
enriched_df1
    .selectExpr("MP14132a_I", "MP14132a_B")
    .show(1, truncate=True)
)

print(len(enriched_df1.columns))
+----------+----------+
|MP14132a_I|MP14132a_B|
+----------+----------+
|      64.0|   92724.0|
+----------+----------+

8
[ ]:
enriched_df2 = poly_df.enrich(
    var_df,
    variable_field="Variable",
    sliding=20
)

(
enriched_df2
    .selectExpr("MP14132a_I", "MP14132a_B")
    .show(1, truncate=True)
)

print(len(enriched_df2.columns))
+----------+----------+
|MP14132a_I|MP14132a_B|
+----------+----------+
|      64.0|   92724.0|
+----------+----------+

8