Geoenrichment#
BDT can enrich geometries with apportioned Esri Business Analyst (BA) variables. Currently, BDT only supports enriching polygons located in the United States.
Geoenrichment requires an additional, separate license for the BA data. Please contact the BDT team at bdt_support@esri.com if you are interested.
Part 0: Setup BDT#
[ ]:
import bdt
bdt.auth("bdt.lic")
from bdt.processors import *
from bdt.functions import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
BDT has been successfully authorized!
Welcome to
___ _ ___ __ ______ __ __ _ __
/ _ ) (_) ___ _ / _ \ ___ _ / /_ ___ _ /_ __/ ___ ___ / / / /__ (_) / /_
/ _ | / / / _ `/ / // // _ `// __// _ `/ / / / _ \/ _ \ / / / '_/ / / / __/
/____/ /_/ \_, / /____/ \_,_/ \__/ \_,_/ /_/ \___/\___//_/ /_/\_\ /_/ \__/
/___/
BDT python version: v3.3.0-v3.3.0
BDT jar version: v3.3.0-v3.3.0
Part 1: Generate Sample Data#
Geometries for enrichment must be polygons, and they must be in the Web Mercator spatial reference (3857).
In the cell below, a sample polygon in spatial reference 3857 is constructed.
[ ]:
polygon_wkt = """POLYGON ((-8589916.801660 4722261.253987,
-8559808.739548 4722117.817925,
-8557660.375723 4694677.577920,
-8590986.920056 4694254.930233)) """
poly_schema = StructType([StructField("POLY_ID", IntegerType()),
StructField("POLY_WKT", StringType())])
poly_data = [(1, polygon_wkt),]
poly_df = (
spark
.createDataFrame(data = poly_data, schema = poly_schema)
.select(col("POLY_ID"), st_fromText("POLY_WKT").alias("SHAPE"))
)
poly_df.show(truncate = True)
+-------+--------------------+
|POLY_ID| SHAPE|
+-------+--------------------+
| 1|{[01 06 00 00 00 ...|
+-------+--------------------+
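If the source polygons are stored as longitude/latitude degrees (WGS84) rather than Web Mercator meters, they need to be projected before enrichment. The following is a minimal plain-Python sketch of the standard spherical Web Mercator formulas; the helper names (`lonlat_to_web_mercator`, `web_mercator_to_lonlat`, `ring_to_wkt`) are hypothetical and are not part of BDT. It also converts the first vertex of the sample polygon back to degrees as a sanity check that the coordinates really are in 3857.

```python
import math

# WGS84 semi-major axis in meters, the sphere radius used by Web Mercator (3857).
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 degrees to Web Mercator meters (spherical formula)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def web_mercator_to_lonlat(x, y):
    """Inverse projection: Web Mercator meters back to WGS84 degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


def ring_to_wkt(lonlat_ring):
    """Build a POLYGON WKT string in 3857 from a ring of (lon, lat) pairs."""
    ring = list(lonlat_ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # repeat the first vertex so the ring is closed
    coords = ", ".join(
        "{:.6f} {:.6f}".format(*lonlat_to_web_mercator(lon, lat))
        for lon, lat in ring
    )
    return "POLYGON (({}))".format(coords)


# First vertex of the sample polygon from Part 1, converted back to degrees.
lon, lat = web_mercator_to_lonlat(-8589916.801660, 4722261.253987)
print(round(lon, 1), round(lat, 1))
```

The vertex comes out at roughly 77.2°W, 39.0°N, which is inside the United States, as enrichment requires. A string produced by `ring_to_wkt` could then be passed to st_fromText in the same way as `polygon_wkt` above.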
Part 2: Geoenrichment#
BDT has two different ways to enrich: ST_Enrich and Processor Enrich.
ST_Enrich is a Spark SQL function, and Processor Enrich is a Python function that expects DataFrames as arguments.
Processor Enrich supports variable selection, and ST_Enrich does not. More on this below.
Part 2.1: Geoenrichment with ST_Enrich#
In the cell below, st_enrich is used to enrich the polygon created above.
The function st_enrich expects an array of polygons and returns all of the apportioned BA variables as a StructType. It is recommended to use inline() to unpack the struct.
st_enrich always enriches all available variables, plus 4 additional variables that are derived from the enrichment call.
Any error messages that appear when the cell runs are expected and can be ignored. Enrichment will still complete successfully.
[ ]:
enriched_df = (
poly_df
.select(st_enrich(array(col("SHAPE"))).alias("ENRICH"))
.selectExpr("inline(ENRICH)")
)
(
enriched_df
.selectExpr("TOTPOP_CY", "HHPOP_CY", "FAMPOP_CY", "GQPOP_CY")
.show(1, truncate=True)
)
print(len(enriched_df.columns))
+---------+---------+---------+--------+
|TOTPOP_CY| HHPOP_CY|FAMPOP_CY|GQPOP_CY|
+---------+---------+---------+--------+
|1677664.0|1618652.0|1057575.0| 59012.0|
+---------+---------+---------+--------+
19326
Part 2.2: Geoenrichment with Processor Enrich#
Processor Enrich does the same thing as st_enrich, but also supports variable selection.
To get the complete list of BA variables supported by BDT, use the BDT processor ba_variables. The ba_variables processor has no parameters and returns a DataFrame with 3 columns: Variable, Description, and Data_Type. This list of variables is updated annually.
As seen above, the st_enrich call enriched 19,322 variables plus 4 additional derived ones. We will select only a subset of those for Processor Enrich.
[ ]:
var_df = ba_variables()
[ ]:
var_df.printSchema()
root
|-- Variable: string (nullable = false)
|-- Description: string (nullable = false)
|-- Data_Type: string (nullable = false)
[ ]:
var_df.show(5, truncate=False)
+----------+------------------------------------------------+---------+
|Variable |Description |Data_Type|
+----------+------------------------------------------------+---------+
|ACSAVGGRNT|2021 Average Gross Rent (ACS 5-Yr) |DOUBLE |
|ACSNKDFLF |2021 Females 20-64: No Kids <18/in LF (ACS 5-Yr)|DOUBLE |
|ACSRMV1990|2021 RHHs/Moved In: 1990-1999 (ACS 5-Yr) |DOUBLE |
|AGGDIA15CY|2023 Aggr Disposable Inc: HHr 15-24 |DOUBLE |
|AI50C20 |2020 American Indian Pop 50-54 |DOUBLE |
+----------+------------------------------------------------+---------+
only showing top 5 rows
[ ]:
var_df.count()
19322
There are many BA variables, and not all of them may be relevant to a given use case.
Let’s say that for our sample case, we only want to enrich variables relating to supermarket use.
[ ]:
var_df = var_df.where("contains(Description, 'Supermarket')")
[ ]:
var_df.show(5, truncate=False)
+----------+----------------------------------------------------+---------+
|Variable |Description |Data_Type|
+----------+----------------------------------------------------+---------+
|MP14132a_I|2023 Index: Filled Prescription at Supermarket/12 Mo|DOUBLE |
|MP14132a_B|2023 Filled Prescription at Supermarket/12 Mo |DOUBLE |
+----------+----------------------------------------------------+---------+
[ ]:
var_df.count()
2
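The filter above is a case-sensitive substring match, so the capitalization of the keyword matters. The plain-Python sketch below mimics the same predicate on a few rows copied from the tables in this notebook; `select_variables` is a hypothetical helper written for illustration and is not a BDT or Spark function.

```python
# (Variable, Description) pairs copied from the ba_variables output shown above.
rows = [
    ("ACSAVGGRNT", "2021 Average Gross Rent (ACS 5-Yr)"),
    ("AGGDIA15CY", "2023 Aggr Disposable Inc: HHr 15-24"),
    ("MP14132a_I", "2023 Index: Filled Prescription at Supermarket/12 Mo"),
    ("MP14132a_B", "2023 Filled Prescription at Supermarket/12 Mo"),
]


def select_variables(rows, keyword, ignore_case=False):
    """Return the variable names whose description contains the keyword."""
    if ignore_case:
        keyword = keyword.lower()
        return [name for name, desc in rows if keyword in desc.lower()]
    return [name for name, desc in rows if keyword in desc]


print(select_variables(rows, "Supermarket"))                    # ['MP14132a_I', 'MP14132a_B']
print(select_variables(rows, "supermarket"))                    # []
print(select_variables(rows, "supermarket", ignore_case=True))  # ['MP14132a_I', 'MP14132a_B']
```

In Spark SQL, a case-insensitive version of the notebook's filter could be written by lowercasing the column first, for example `contains(lower(Description), 'supermarket')`.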
Now, we call Processor Enrich with the filtered variable DataFrame. Enrichment will take much less time, since we are now enriching only 2 variables instead of 19,322.
There are two ways of calling a processor in BDT. Both are shown below, and both produce the exact same output.
[ ]:
poly_df = poly_df.withMeta("Polygon", 3857)
[ ]:
enriched_df1 = enrich(
poly_df,
var_df,
variable_field="Variable",
sliding=20
)
(
enriched_df1
.selectExpr("MP14132a_I", "MP14132a_B")
.show(1, truncate=True)
)
print(len(enriched_df1.columns))
+----------+----------+
|MP14132a_I|MP14132a_B|
+----------+----------+
| 64.0| 92724.0|
+----------+----------+
8
[ ]:
enriched_df2 = poly_df.enrich(
var_df,
variable_field="Variable",
sliding=20
)
(
enriched_df2
.selectExpr("MP14132a_I", "MP14132a_B")
.show(1, truncate=True)
)
print(len(enriched_df2.columns))
+----------+----------+
|MP14132a_I|MP14132a_B|
+----------+----------+
| 64.0| 92724.0|
+----------+----------+
8