JoinQR#

Table of Contents#

  1. What is JoinQR

  2. JoinQR Input Data

  3. Using JoinQR - ST_Distance Use Case

  4. Using JoinQR - ST_Distance with Manual QR

[1]:
import bdt

bdt.auth("bdt.lic")
import folium
import geopandas
import mapclassify
import matplotlib
from bdt.functions import *
from bdt.processors import *
from pyspark.sql.functions import *
BDT has been successfully authorized!

            Welcome to
             ___    _                ___         __             ______             __   __     _   __
            / _ )  (_)  ___ _       / _ \ ___ _ / /_ ___ _     /_  __/ ___  ___   / /  / /__  (_) / /_
           / _  | / /  / _ `/      / // // _ `// __// _ `/      / /   / _ \/ _ \ / /  /  '_/ / / / __/
          /____/ /_/   \_, /      /____/ \_,_/ \__/ \_,_/      /_/    \___/\___//_/  /_/\_\ /_/  \__/
                      /___/

BDT python version: v3.5.0-v3.3.0-Alpha-JohnDeere-Raster-73-g7ae9536e
BDT jar version: v3.5.0-v3.3.0-Alpha-JohnDeere-Raster-73-g7ae9536e

Part 1: What is JoinQR#

QR:#

  • QR is the spatial partitioning system for BDT. Think of QR as a large grid of cells overlaid on a map.

  • For a more in depth explanation of QR partitioning, see the ST_Distance sample notebook and concepts.

JoinQR:#

  • JoinQR is a processor that does a spatial join between two dataframes with SHAPE columns. It accelerates this join using spatial indexing.

  • This processor significantly speeds up joins on large dataframes (20+ million rows), but introduces overhead for smaller dataframes.

  • This notebook will be using a small dataset for demo purposes.

The JoinQR processor performs the following:

  • Joins the input dataframes on QR.

  • Creates spatial index to accelerate the join.

  • Runs lower-left QR Check to trim off QR values that are not relevant to subsequent computation.

Part 2: JoinQR Input Data#

  • Use Spark to create 2 dataframes with polygons around the Esri Redlands campus.

  • Use BDT to project to 3857 so QR cell size can be expressed in meters for simplicity.

[2]:
poly_df1 = (
    spark.createDataFrame(
        [
            (
                1,
                "POLYGON((-117.20614 34.056748, -117.201848 34.055184, -117.198973 34.057779, -117.201161 34.06009, -117.20511 34.06073, -117.209616 34.058704, -117.20614 34.056748))",
            ),
            (
                2,
                "POLYGON((-117.197084 34.060126, -117.198887 34.059486, -117.197342 34.057637, -117.193737 34.056642, -117.192965 34.060268, -117.197084 34.060126))",
            ),
        ],
        schema="polyid1 int, WKT string",
    )
    .selectExpr("polyid1", "ST_Project(ST_FromText(WKT), 4326, 3857) AS SHAPE")
    .withMeta("POLYGON", 3857, "SHAPE")
)
poly_df1.show()

+-------+--------------------+
|polyid1|               SHAPE|
+-------+--------------------+
|      1|{[01 06 00 00 00 ...|
|      2|{[01 06 00 00 00 ...|
+-------+--------------------+


[3]:
poly_df2 = (
    spark.createDataFrame(
        [
            (
                1,
                "POLYGON((-117.197685 34.066383, -117.199144 34.064819, -117.19717 34.062081, -117.193651 34.061868, -117.192192 34.063539, -117.194338 34.066063, -117.197685 34.066383))",
            ),
            (
                2,
                "POLYGON((-117.18657 34.060481, -117.18627 34.057957, -117.183437 34.056606, -117.180691 34.056855, -117.179704 34.06009, -117.18657 34.060481))",
            ),
        ],
        schema="polyid2 int, WKT string",
    )
    .selectExpr("polyid2", "ST_Project(ST_FromText(WKT), 4326, 3857) AS SHAPE")
    .withMeta("POLYGON", 3857, "SHAPE")
)
poly_df2.show()
+-------+--------------------+
|polyid2|               SHAPE|
+-------+--------------------+
|      1|{[01 06 00 00 00 ...|
|      2|{[01 06 00 00 00 ...|
+-------+--------------------+

Explore The Polygons#

  • The to_geopandas_layers function allows quick visualization of dataframes on a map.

[4]:
to_geo_pandas_layers([poly_df1, poly_df2], 3857, ["poly_df1", "poly_df2"], colors=["blue", "red"])
[4]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Part 3: Using JoinQR - ST_Distance Use Case#

This example will use JoinQR to optimize an ST_Distance workflow.

The JoinQR processor requires the following arguments:

  • ldf (DataFrame) – The left-hand side (LHS) input dataframe.

  • rdf (DataFrame) – The right-hand side (RHS) input dataframe.

  • join_type (str) – The join type. Supported join types are “inner” and “left”.

  • cell_size (float) – The spatial partitioning grid size in units of the spatial reference of the input data. It must be exactly the same as the grid size used to create both the input lhs and rhs QRs.

  • join_distance (float) – Features within this distance will be joined together. Default 0.0.

For the full list of optional arguments reference the API documentation.

For this example, use the JoinQR processor to join the two sample polygon dataframes together with a cell size of 350m and join_distance of 2000m.

[5]:
join_df = bdt.processors.join_qr(
    poly_df1, poly_df2, join_type="inner", cell_size=350, join_distance=2000
)
join_df.show(5)
[Stage 13:>                                                       (0 + 12) / 12]
+-------+--------------------+-------+--------------------+
|polyid1|           SHAPE_LHS|polyid2|           SHAPE_RHS|
+-------+--------------------+-------+--------------------+
|      1|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...|
|      2|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...|
|      1|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|
|      2|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|
+-------+--------------------+-------+--------------------+


  • Now that the dataframes are joined together, use ST_Distance.

[6]:
dist_df = join_df.select(
    "polyid1",
    "SHAPE_LHS",
    "polyid2",
    "SHAPE_RHS",
    st_distance("SHAPE_LHS", "SHAPE_RHS").alias("DISTANCE"),
)
dist_df.show()
+-------+--------------------+-------+--------------------+------------------+
|polyid1|           SHAPE_LHS|polyid2|           SHAPE_RHS|          DISTANCE|
+-------+--------------------+-------+--------------------+------------------+
|      1|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...| 514.6846679995746|
|      2|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...| 217.9847334695432|
|      1|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|1409.6290225490116|
|      2|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...| 712.4632578447662|
+-------+--------------------+-------+--------------------+------------------+

  • Sort the distance dataframe to get the largest distance between polygons and visualize this using geopandas.

[9]:
sorted_df = dist_df.orderBy(desc("DISTANCE"))
max_dist = sorted_df.limit(1)
maxdf1 = max_dist.select(col("SHAPE_LHS").alias("SHAPE"))
maxdf2 = max_dist.select(col("SHAPE_RHS").alias("SHAPE"))
max_dist.show()
[Stage 28:>                                                       (0 + 12) / 12]
+-------+--------------------+-------+--------------------+------------------+
|polyid1|           SHAPE_LHS|polyid2|           SHAPE_RHS|          DISTANCE|
+-------+--------------------+-------+--------------------+------------------+
|      1|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|1409.6290225490116|
+-------+--------------------+-------+--------------------+------------------+


[11]:
to_geo_pandas_layers([maxdf1, maxdf2], 3857, ["maxdf1", "maxdf2"], colors=["blue", "red"])

[11]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Part 4: Using JoinQR - ST_Distance with Manual QR:#

  • Part 3 is the recommended way to use JoinQR with its automatic QR creation as it takes advantage of internal optimizations avoiding unnecessary spilling.

  • Another way to use JoinQR is manually creating QR partitions and providing these dataframes to JoinQR to perform the spatial indexing, join, and lowerleft filtering.

  • When manually created QR partitions are provided to JoinQR, it wil use the provided QRs and avoid recalculating QRs within the processor

  • Use the ST_ASQRSpill to manually create QR partitions with a cell size of 350m and a spill distance of 2000m.

  • Spilling is necessary when joining features within a distance of each other. The spill distance parameter controls this.

  • Pair the spill function with JoinQR to join the dataframes together and filter out duplicates using lowerleft.

[12]:
spilldf1 = poly_df1.select(
    "polyid1", "SHAPE", explode(st_asQRSpill("SHAPE", 350, 2000)).alias("QR")
)
spilldf2 = poly_df2.select(
    "polyid2", "SHAPE", explode(st_asQRSpill("SHAPE", 350, 2000)).alias("QR")
)
  • Use JoinQR processor with 350m cell size and 2000m join_distance.

  • The join_distance and spill distance above must match when using manual QR creation.

[13]:
join_df = bdt.processors.join_qr(
    spilldf1, spilldf2, join_type="inner", cell_size=350, join_distance=2000
)
join_df.show(5)
+-------+--------------------+-------+--------------------+
|polyid1|           SHAPE_LHS|polyid2|           SHAPE_RHS|
+-------+--------------------+-------+--------------------+
|      1|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...|
|      2|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...|
|      1|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|
|      2|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|
+-------+--------------------+-------+--------------------+

  • Now that the dataframes are joined together, use ST_Distance.

[14]:
dist_df = join_df.select(
    "polyid1",
    "SHAPE_LHS",
    "polyid2",
    "SHAPE_RHS",
    st_distance("SHAPE_LHS", "SHAPE_RHS").alias("DISTANCE"),
)
dist_df.show()
+-------+--------------------+-------+--------------------+------------------+
|polyid1|           SHAPE_LHS|polyid2|           SHAPE_RHS|          DISTANCE|
+-------+--------------------+-------+--------------------+------------------+
|      1|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...| 514.6846679995746|
|      2|{[01 06 00 00 00 ...|      1|{[01 06 00 00 00 ...| 217.9847334695432|
|      1|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|1409.6290225490116|
|      2|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...| 712.4632578447662|
+-------+--------------------+-------+--------------------+------------------+

  • Sort the distance dataframe to get the largest distance between polygons and visualize this using geopandas.

[15]:
sorted_df = dist_df.orderBy(desc("DISTANCE"))
max_dist = sorted_df.limit(1)
maxdf1 = max_dist.select(col("SHAPE_LHS").alias("SHAPE"))
maxdf2 = max_dist.select(col("SHAPE_RHS").alias("SHAPE"))
max_dist.show()
+-------+--------------------+-------+--------------------+------------------+
|polyid1|           SHAPE_LHS|polyid2|           SHAPE_RHS|          DISTANCE|
+-------+--------------------+-------+--------------------+------------------+
|      1|{[01 06 00 00 00 ...|      2|{[01 06 00 00 00 ...|1409.6290225490116|
+-------+--------------------+-------+--------------------+------------------+

[16]:
to_geo_pandas_layers([maxdf1, maxdf2], 3857, ["maxdf1", "maxdf2"], colors=["blue", "red"])

[16]:
Make this Notebook Trusted to load map: File -> Trust Notebook