Geoenrichment#

  • BDT is capable of enriching geometries with apportioned Esri Buisiness Analyst (BA) variables. Currently, BDT only supports enriching polygons located in the United States.

  • Geoenrichment requires an additional, seperate license for the BA data. Please contact the BDT team at bdt_support@esri.com if you are interested.

Table of Contents

  1. Setup BDT

  2. Generate Sample Data

  3. Geoenrichment

    1. ST_Enrich

    2. ProcessorEnrich

Part 0: Setup BDT#

[4]:
import bdt
bdt.auth("bdt.lic")
from bdt.processors import *
from bdt.functions import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
BDT has been successfully authorized!

            Welcome to
             ___    _                ___         __             ______             __   __     _   __
            / _ )  (_)  ___ _       / _ \ ___ _ / /_ ___ _     /_  __/ ___  ___   / /  / /__  (_) / /_
           / _  | / /  / _ `/      / // // _ `// __// _ `/      / /   / _ \/ _ \ / /  /  '_/ / / / __/
          /____/ /_/   \_, /      /____/ \_,_/ \__/ \_,_/      /_/    \___/\___//_/  /_/\_\ /_/  \__/
                      /___/

BDT python version: v3.5.0-v3.4.0-develop-72-g6a2dbe86
BDT jar version: v3.5.0-v3.4.0-develop-72-g6a2dbe86

Part 1: Generate Sample Data#

  • Geometries for enrichment must be polygons, and must be in spatial reference Web Mercator 3857.

  • In the below cell, a sample polygon in spatial reference 3857 is constructed.

[5]:
polygon_wkt = """POLYGON ((-8589916.801660 4722261.253987,
                       -8559808.739548 4722117.817925,
                       -8557660.375723 4694677.577920,
                       -8590986.920056 4694254.930233)) """

poly_schema = StructType([StructField("POLY_ID", IntegerType()),
                          StructField("POLY_WKT", StringType())])

poly_data = [(1, polygon_wkt),]

poly_df = (
    spark
        .createDataFrame(data = poly_data, schema = poly_schema)
        .select(col("POLY_ID"), st_fromText("POLY_WKT").alias("SHAPE"))
)
poly_df.show(truncate = True)
+-------+--------------------+
|POLY_ID|               SHAPE|
+-------+--------------------+
|      1|{[01 06 00 00 00 ...|
+-------+--------------------+

Part 2: Geoenrichment#

  1. BDT has two different ways to enrich: ST_Enrich and Processor Enrich.

  2. ST_Enrich is a Spark SQL function, and Processor Enrich is a python function that expects a DataFrames as arguments.

  3. Processor Enrich supports variable selection, and ST_Enrich does not. More on this below.

Part 2.1: Geoenrchment with ST_Enrich#

  • In the below cell, st_enrich is used to enrich the polygons created above.

  • The function st_enrich expects an array of polygons, and returns all of the apportioned BA variables as a StructType. It is reccomended to use inline() to unpack the struct.

  • st_enrich will always enrich all available variables + 4 additional variables that are derived from the enrichment call.

  • The all of error messages seen below are expected and can be ignored. Enrichment still will complete sucessfully.

[6]:
enriched_df = (
    poly_df
        .select(st_enrich(array(col("SHAPE"))).alias("ENRICH"))
        .selectExpr("inline(ENRICH)")

)

(
enriched_df
    .selectExpr("TOTPOP_CY", "HHPOP_CY", "FAMPOP_CY", "GQPOP_CY")
    .show(1, truncate=True)
)

print(len(enriched_df.columns))
25/08/04 15:20:49 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
25/08/04 15:20:54 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
25/08/04 15:20:58 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
25/08/04 15:21:00 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
25/08/04 15:21:05 ERROR DatabaseSchema: Unable to find table 'worldcountries' in schema 'geworld'
25/08/04 15:21:23 ERROR DatabaseSchema: Unable to find table 'worldcountries' in schema 'geworld'
25/08/04 15:21:23 ERROR DatabaseSchema: Unable to find table 'worldcountries' in schema 'geworld'
Enrichment initialization took 22 seconds.
+---------+---------+---------+--------+
|TOTPOP_CY| HHPOP_CY|FAMPOP_CY|GQPOP_CY|
+---------+---------+---------+--------+
|1691303.0|1635463.0|1093358.0| 55840.0|
+---------+---------+---------+--------+

19490

Part 2.2: Geoenrichment with Processor Enrich#

  • Processor Enrich does the same thing as st_enrich, but also supports variable selection.

  • To get the complete list of BDT supported BA Variables, use BDT processor ba_variables. Processor ba_varaibles has no parameters, and returns a DataFrame with 3 columns: Variable, Data Type, and Description. This list of varaibles is updated annualy.

  • As seen above, the st_enrich call ran above enriched 19,486 variables + 4 additional derived ones. We will select only a subset of those for processor enrich.

[21]:
var_df = ba_variables()
[22]:
var_df.printSchema()
root
 |-- Variable: string (nullable = false)
 |-- Description: string (nullable = false)
 |-- Data_Type: string (nullable = false)

[23]:
var_df.show(5, truncate=False)
+----------+------------------------------------------------+---------+
|Variable  |Description                                     |Data_Type|
+----------+------------------------------------------------+---------+
|ACSAVGGRNT|2023 Average Gross Rent (ACS 5-Yr)              |DOUBLE   |
|ACSNKDFLF |2023 Females 20-64: No Kids <18/in LF (ACS 5-Yr)|DOUBLE   |
|ACSRMV1990|2023 RHHs/Moved In: 1990-1999 (ACS 5-Yr)        |DOUBLE   |
|AGGDIA15CY|2025 Aggr Disposable Inc: HHr 15-24             |DOUBLE   |
|AI50C20   |2020 American Indian Pop 50-54                  |DOUBLE   |
+----------+------------------------------------------------+---------+
only showing top 5 rows

[24]:
var_df.count()
[24]:
19486
  • There are many BA variables and not all may be relevant to our use case.

  • Let’s say for our sample case, we only want enrich variables relating to grocery store use.

[25]:
var_df = var_df.where("contains(Description, 'Grocery Store')")
[26]:
var_df.show(5, truncate=False)
+----------+------------------------------------------------------+---------+
|Variable  |Description                                           |Data_Type|
+----------+------------------------------------------------------+---------+
|MP13019a_I|2025 Index: Shopped at Trader Joe`s Grocery Store/6 Mo|DOUBLE   |
|N13A_SALES|2025 Grocery Stores Sales (NAICS)                     |DOUBLE   |
|MP13006a_B|2025 Shopped at Aldi Grocery Store/6 Mo               |DOUBLE   |
|MP13012a_B|2025 Shopped at Meijer Grocery Store/6 Mo             |DOUBLE   |
|N13A_BUS  |2025 Grocery Stores Bus (NAICS)                       |DOUBLE   |
+----------+------------------------------------------------------+---------+
only showing top 5 rows

[27]:
var_df.count()
[27]:
43
  • Let’s take two of these variables relating to grocery store use, shopping at Aldi in the last 6 mo (MP13006a_B) and shopping at Meijer in the last 6 mo (MP13012a_B).

  • Now, we call processor enrich on the filtered variable DataFrame. Enrichment will take much less time, since we are now enriching only 2 variables after filtering instead of 19,486.

  • There are two ways of calling a proccessor in BDT. Both are shown below, and both will produce the exact same output.

[28]:
poly_df = poly_df.withMeta("Polygon", 3857)
[31]:
enriched_df1 = enrich(
    poly_df,
    var_df,
    variable_field="Variable",
    sliding=20
)

(
enriched_df1
    .selectExpr("MP13006a_B", "MP13012a_B")
    .show(1, truncate=True)
)

print(len(enriched_df1.columns))
+----------+----------+
|MP13006a_B|MP13012a_B|
+----------+----------+
|  281941.0|   45365.0|
+----------+----------+

49
[32]:
enriched_df2 = poly_df.enrich(
    var_df,
    variable_field="Variable",
    sliding=20
)

(
enriched_df2
    .selectExpr("MP13006a_B", "MP13012a_B")
    .show(1, truncate=True)
)

print(len(enriched_df2.columns))
+----------+----------+
|MP13006a_B|MP13012a_B|
+----------+----------+
|  281941.0|   45365.0|
+----------+----------+

49