Geoenrichment

BDT can enrich geometries with apportioned Esri Business Analyst (BA) variables. Currently, BDT only supports enriching polygons located in the United States.
Geoenrichment requires an additional, separate license for the BA data. Please contact the BDT team at bdt_support@esri.com if you are interested.

Table of Contents

Setup BDT
Generate Sample Data
Geoenrichment
1. ST_Enrich
2. ProcessorEnrich

Part 0: Setup BDT

[ ]:

import bdt
bdt.auth("bdt.lic")
from bdt.processors import *
from bdt.functions import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

BDT has been successfully authorized!

            Welcome to
             ___    _                ___         __             ______             __   __     _   __
            / _ )  (_)  ___ _       / _ \ ___ _ / /_ ___ _     /_  __/ ___  ___   / /  / /__  (_) / /_
           / _  | / /  / _ `/      / // // _ `// __// _ `/      / /   / _ \/ _ \ / /  /  '_/ / / / __/
          /____/ /_/   \_, /      /____/ \_,_/ \__/ \_,_/      /_/    \___/\___//_/  /_/\_\ /_/  \__/
                      /___/

BDT python version: v3.4.0-v3.4.0
BDT jar version: v3.4.0-v3.4.0

Part 1: Generate Sample Data

Geometries for enrichment must be polygons, and they must be in the Web Mercator spatial reference (WKID 3857).
The cell below constructs a sample polygon in spatial reference 3857.

[ ]:

polygon_wkt = """POLYGON ((-8589916.801660 4722261.253987,
                       -8559808.739548 4722117.817925,
                       -8557660.375723 4694677.577920,
                       -8590986.920056 4694254.930233)) """

poly_schema = StructType([StructField("POLY_ID", IntegerType()),
                          StructField("POLY_WKT", StringType())])

poly_data = [(1, polygon_wkt),]

poly_df = (
    spark
        .createDataFrame(data = poly_data, schema = poly_schema)
        .select(col("POLY_ID"), st_fromText("POLY_WKT").alias("SHAPE"))
)
poly_df.show(truncate = True)

+-------+--------------------+
|POLY_ID|               SHAPE|
+-------+--------------------+
|      1|{[01 06 00 00 00 ...|
+-------+--------------------+
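
The sample polygon above is already expressed in Web Mercator meters. When source coordinates are in WGS84 longitude/latitude instead, they need to be projected first. The following is a minimal sketch of that projection in plain Python, using the standard spherical Web Mercator formulas. The function names and the example corner coordinates are hypothetical and are not part of BDT. The helper also repeats the first vertex at the end of the ring, which the WKT format expects for polygons.

```python
import math

# Radius of the sphere used by Web Mercator (EPSG:3857), in meters.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project a WGS84 longitude/latitude pair (degrees) to EPSG:3857 meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def ring_to_polygon_wkt(lonlat_ring):
    """Build a single-ring POLYGON WKT string in EPSG:3857 from lon/lat vertices.

    The ring is closed explicitly by repeating the first vertex at the end.
    """
    points = [lonlat_to_web_mercator(lon, lat) for lon, lat in lonlat_ring]
    if points[0] != points[-1]:
        points.append(points[0])
    coords = ", ".join(f"{x:.6f} {y:.6f}" for x, y in points)
    return f"POLYGON (({coords}))"


# Hypothetical lon/lat corners, chosen only for illustration.
example_ring = [(-77.16, 39.00), (-76.89, 39.00), (-76.89, 38.80), (-77.16, 38.80)]
example_wkt = ring_to_polygon_wkt(example_ring)
print(example_wkt)
```

The resulting string could then be used in place of polygon_wkt when building poly_data.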

Part 2: Geoenrichment

BDT has two ways to enrich: ST_Enrich and Processor Enrich.
ST_Enrich is a Spark SQL function, and Processor Enrich is a Python function that expects DataFrames as arguments.
Processor Enrich supports variable selection, and ST_Enrich does not. More on this below.

Part 2.1: Geoenrichment with ST_Enrich

In the cell below, st_enrich is used to enrich the polygon created above.
The function st_enrich expects an array of polygons and returns all of the apportioned BA variables as a StructType. It is recommended to use inline() to unpack the struct.
st_enrich always enriches all available variables, plus 4 additional variables that are derived from the enrichment call.
Any error messages printed when the cell runs are expected and can be ignored. Enrichment will still complete successfully.

[ ]:

enriched_df = (
    poly_df
        .select(st_enrich(array(col("SHAPE"))).alias("ENRICH"))
        .selectExpr("inline(ENRICH)")
)

(
enriched_df
    .selectExpr("TOTPOP_CY", "HHPOP_CY", "FAMPOP_CY", "GQPOP_CY")
    .show(1, truncate=True)
)

print(len(enriched_df.columns))

+---------+---------+---------+--------+
|TOTPOP_CY| HHPOP_CY|FAMPOP_CY|GQPOP_CY|
+---------+---------+---------+--------+
|1664690.0|1608412.0|1074643.0| 56278.0|
+---------+---------+---------+--------+

19463

Part 2.2: Geoenrichment with Processor Enrich

Processor Enrich does the same thing as st_enrich, but also supports variable selection.
To get the complete list of BA variables that BDT supports, use the BDT processor ba_variables. Processor ba_variables has no parameters and returns a DataFrame with 3 columns: Variable, Description, and Data_Type. This list of variables is updated annually.
As seen above, the st_enrich call enriched 19,322 variables plus 4 additional derived ones. For Processor Enrich, we will select only a subset of those.

[ ]:

var_df = ba_variables()

[ ]:

var_df.printSchema()

root
 |-- Variable: string (nullable = false)
 |-- Description: string (nullable = false)
 |-- Data_Type: string (nullable = false)

[ ]:

var_df.show(5, truncate=False)

+----------+------------------------------------------------+---------+
|Variable  |Description                                     |Data_Type|
+----------+------------------------------------------------+---------+
|ACSAVGGRNT|2022 Average Gross Rent (ACS 5-Yr)              |DOUBLE   |
|ACSNKDFLF |2022 Females 20-64: No Kids <18/in LF (ACS 5-Yr)|DOUBLE   |
|ACSRMV1990|2022 RHHs/Moved In: 1990-1999 (ACS 5-Yr)        |DOUBLE   |
|AGGDIA15CY|2024 Aggr Disposable Inc: HHr 15-24             |DOUBLE   |
|AI50C20   |2020 American Indian Pop 50-54                  |DOUBLE   |
+----------+------------------------------------------------+---------+
only showing top 5 rows

[ ]:

var_df.count()

There are many BA variables, and not all of them may be relevant to our use case.
Let’s say that for our sample case, we only want to enrich variables relating to supermarket use.

[ ]:

var_df = var_df.where("contains(Description, 'Supermarket')")

[ ]:

var_df.show(5, truncate=False)

+----------+----------------------------------------------------+---------+
|Variable  |Description                                         |Data_Type|
+----------+----------------------------------------------------+---------+
|MP14132a_I|2024 Index: Filled Prescription at Supermarket/12 Mo|DOUBLE   |
|MP14132a_B|2024 Filled Prescription at Supermarket/12 Mo       |DOUBLE   |
+----------+----------------------------------------------------+---------+

[ ]:

var_df.count()
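
The filter above is a plain substring test on the Description column, so a keyword written with different capitalization (for example 'supermarket') would presumably match nothing. The sketch below illustrates that behavior in plain Python, outside of Spark, on a few rows copied from the outputs shown above. The helper select_variables and its case_sensitive flag are hypothetical and are not part of BDT.

```python
# Rows copied from the ba_variables() outputs shown above: (Variable, Description).
sample_variables = [
    ("ACSAVGGRNT", "2022 Average Gross Rent (ACS 5-Yr)"),
    ("AGGDIA15CY", "2024 Aggr Disposable Inc: HHr 15-24"),
    ("MP14132a_I", "2024 Index: Filled Prescription at Supermarket/12 Mo"),
    ("MP14132a_B", "2024 Filled Prescription at Supermarket/12 Mo"),
]


def select_variables(rows, keyword, case_sensitive=True):
    """Return the variable names whose description contains the keyword.

    Hypothetical helper for illustration only; it mimics a substring filter
    on the Description column and is not a BDT function.
    """
    if not case_sensitive:
        keyword = keyword.lower()
    selected = []
    for name, description in rows:
        haystack = description if case_sensitive else description.lower()
        if keyword in haystack:
            selected.append(name)
    return selected


exact = select_variables(sample_variables, "Supermarket")
wrong_case = select_variables(sample_variables, "supermarket")
any_case = select_variables(sample_variables, "supermarket", case_sensitive=False)

print(exact)       # the two supermarket variables
print(wrong_case)  # empty: the capitalization differs
print(any_case)    # the two supermarket variables again
```

Checking the filtered row count before enriching, as the count() call above does, is a quick way to catch a keyword that matched nothing.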

Now, we call Processor Enrich with the filtered variable DataFrame. Enrichment will take much less time, since we are now enriching only 2 variables after filtering instead of 19,322.
There are two ways of calling a processor in BDT. Both are shown below, and both produce exactly the same output.

[ ]:

# Tag the DataFrame with its geometry type ("Polygon") and spatial reference (3857).
poly_df = poly_df.withMeta("Polygon", 3857)

[ ]:

enriched_df1 = enrich(
    poly_df,
    var_df,
    variable_field="Variable",
    sliding=20
)

(
enriched_df1
    .selectExpr("MP14132a_I", "MP14132a_B")
    .show(1, truncate=True)
)

print(len(enriched_df1.columns))

+----------+----------+
|MP14132a_I|MP14132a_B|
+----------+----------+
|      71.0|  102285.0|
+----------+----------+

8

[ ]:

enriched_df2 = poly_df.enrich(
    var_df,
    variable_field="Variable",
    sliding=20
)

(
enriched_df2
    .selectExpr("MP14132a_I", "MP14132a_B")
    .show(1, truncate=True)
)

print(len(enriched_df2.columns))

+----------+----------+
|MP14132a_I|MP14132a_B|
+----------+----------+
|      71.0|  102285.0|
+----------+----------+

8