Getting Started#

Table of Contents

  1. Importing Big Data Toolkit

  2. ST Functions Explained

  3. ST Function Examples

    1. Do it All in Spark SQL

    2. Hybrid of Spark SQL and Python

    3. Do it All in Python

  4. Processors

Part 1: Importing Big Data Toolkit#

  1. ‘bdt.auth(…)’ activates BDT with a valid license file

  2. ‘from bdt import functions as F’ imports ST Functions

  3. ‘from bdt import processors’ imports processors, such as pip (point-in-polygon) or nearest coordinate

  4. ‘from bdt import sinks as S’ imports sinks, such as EGDB, which writes to an Esri geodatabase

[ ]:
import bdt
bdt.auth("bdt.lic")
from bdt import functions as F
from bdt import processors as P
from bdt import sinks as S
from pyspark.sql.functions import rand, lit
BDT has been successfully authorized!

            Welcome to
             ___    _                ___         __             ______             __   __     _   __
            / _ )  (_)  ___ _       / _ \ ___ _ / /_ ___ _     /_  __/ ___  ___   / /  / /__  (_) / /_
           / _  | / /  / _ `/      / // // _ `// __// _ `/      / /   / _ \/ _ \ / /  /  '_/ / / / __/
          /____/ /_/   \_, /      /____/ \_,_/ \__/ \_,_/      /_/    \___/\___//_/  /_/\_\ /_/  \__/
                      /___/

BDT python version: v3.3.0-v3.3.0
BDT jar version: v3.3.0-v3.3.0

Part 2: ST Functions Explained#

  • Big Data Toolkit ST functions get their “ST” prefix from the PostGIS naming convention

  • PostGIS Docs can be found here: postgis docs

  • The functions are highly optimized by the Apache Spark Catalyst engine

Let’s see the first 5 ST Functions returned by the call below.

[ ]:
spark.sql("""
SHOW USER FUNCTIONS LIKE 'ST_*'
""").show(5)
+---------------+
|       function|
+---------------+
|       st_ashex|
|      st_asjson|
|        st_asqr|
|    st_asqrclip|
|st_asqrenvpclip|
+---------------+
only showing top 5 rows

  • Big Data Toolkit also includes functions that don’t start with ST. Generally speaking, a function that does not have a shape as an input or output will not start with the “ST” prefix

  • Examples of non-ST functions in Big Data Toolkit include the Uber H3 functions, among others

Let’s see the H3 functions returned by the call below.

[ ]:
spark.sql("""
SHOW USER FUNCTIONS LIKE 'H3*'
""").show()
+----------------+
|        function|
+----------------+
|      h3distance|
|         h3kring|
|h3kringdistances|
|          h3line|
|      h3polyfill|
|    h3tochildren|
|         h3togeo|
| h3togeoboundary|
|      h3toparent|
|      h3tostring|
+----------------+
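For context on the H3 functions above: H3 indexes hexagonal cells, and (away from the twelve pentagons) the number of cells within k rings of a cell, as returned by a k-ring query such as h3kring, follows a closed form. A plain-Python check of that count, independent of BDT:

```python
def h3_disk_size(k):
    # Number of H3 cells within k rings of a hexagonal cell
    # (pentagon-free case): the center cell plus 6*i cells in ring i,
    # which simplifies to 1 + 3k(k+1).
    return 1 + sum(6 * i for i in range(1, k + 1))

print([h3_disk_size(k) for k in range(4)])  # [1, 7, 19, 37]
```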

Part 3: ST Function Examples#

  • Big Data Toolkit can be used either in SQL or in Python, depending on which you are more comfortable with.

  • Three different ways of producing the same output DataFrame are shown below.

(1) Do it all in Spark SQL#

  1. Create a temporary view of the Spark DataFrame so that it can be referenced in a Spark SQL statement

  2. Write the entire query in SQL

[ ]:
df1 = spark.range(10)
df1.createOrReplaceTempView("df1")
df1_o = spark.sql("""

SELECT *, ST_Buffer(ST_MakePoint(x, y), 100.0, 3857) AS SHAPE
FROM
    (SELECT *, LonToX(lon) as x, LatToY(lat) as y
     FROM
         (SELECT id, 360-180*rand() AS lon, 360-180*rand() AS lat
          FROM df1))
""")
df1_o.show()
+---+------------------+------------------+--------------------+--------------------+--------------------+
| id|               lon|               lat|                   x|                   y|               SHAPE|
+---+------------------+------------------+--------------------+--------------------+--------------------+
|  0|183.63172113047113|301.66993646460253| 2.044178968973646E7|  -8036975.549454975|{[01 06 00 00 00 ...|
|  1|215.35257299970627|   303.40049186252|2.3972938767348576E7|  -7678698.353713043|{[01 06 00 00 00 ...|
|  2|231.32602161805937|207.33261175596763| 2.575109493375616E7|  -3165089.006420536|{[01 06 00 00 00 ...|
|  3| 325.1027848050297| 342.9440762031778| 3.619027645997111E7| -1927335.8713460434|{[01 06 00 00 00 ...|
|  4|228.47316360048467| 235.9342794217521| 2.543351623193424E7|  -7545343.656632494|{[01 06 00 00 00 ...|
|  5|200.23403780180223|281.77634732970097|2.2289951127577715E7|-1.44895452713084...|{[01 06 00 00 00 ...|
|  6|249.17306589369514| 189.5857984010991|2.7737818814684942E7|  -1072099.370820312|{[01 06 00 00 00 ...|
|  7|190.70945034517993| 295.3249292888303| 2.122967890189052E7|  -9523299.532422358|{[01 06 00 00 00 ...|
|  8|223.65314652379334| 269.1234731689733|2.4896954385342076E7|-3.10815711600632...|{[01 06 00 00 00 ...|
|  9|195.61505261490532| 313.7295116320525| 2.177576804859068E7|  -5823801.630947499|{[01 06 00 00 00 ...|
+---+------------------+------------------+--------------------+--------------------+--------------------+
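LonToX and LatToY convert degrees to Web Mercator (wkid 3857) meters, which is why ST_Buffer can then take a radius in meters. As an illustration only, assuming the standard spherical Web Mercator forward projection (this plain-Python sketch is not BDT’s implementation), the first row of the output above can be reproduced; note the SQL generates lon/lat outside the usual ±180/±90 ranges, so this just reproduces the arithmetic:

```python
import math

R = 6378137.0  # Web Mercator sphere radius in meters

def lon_to_x(lon_deg):
    # Spherical Web Mercator forward projection, x axis.
    return math.radians(lon_deg) * R

def lat_to_y(lat_deg):
    # Spherical Web Mercator forward projection, y axis.
    return math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2)) * R

# First row of the output above: lon=183.63172..., lat=301.66993...
print(lon_to_x(183.63172113047113))  # ≈ 2.0441790E7
print(lat_to_y(301.66993646460253))  # ≈ -8036975.5
```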

(2) Hybrid of Spark SQL and Python DataFrame API#

  • Using ‘selectExpr’ from the Python DataFrame API allows for a hybrid approach

    • This syntax still uses SQL expressions, but makes the query less complicated to write

[ ]:
df2 = spark.range(10)\
    .selectExpr("id","360-180*rand() lon","180-90*rand() lat")\
    .selectExpr("*","LonToX(lon) x","LatToY(lat) y")\
    .selectExpr("*","ST_Buffer(ST_MakePoint(x,y), 100.0, 4326) SHAPE")
df2.show()
+---+------------------+------------------+--------------------+--------------------+--------------------+
| id|               lon|               lat|                   x|                   y|               SHAPE|
+---+------------------+------------------+--------------------+--------------------+--------------------+
|  0|194.22505883934758| 90.52237101327172| 2.162103464928977E7|   3.4382906372298E7|{[01 06 00 00 00 ...|
|  1| 329.2364997579045|110.27272405284565|3.6650439503609665E7|1.0980525498089023E7|{[01 06 00 00 00 ...|
|  2| 263.5701775218122|138.20705687009763| 2.934049795002085E7|   5130013.614343843|{[01 06 00 00 00 ...|
|  3|252.13705688166766|151.86527381416892|2.8067768782181896E7|   3265970.322337363|{[01 06 00 00 00 ...|
|  4|321.63012133885405| 132.1909279274017| 3.580370133122002E7|   6075149.804910965|{[01 06 00 00 00 ...|
|  5| 280.1316289382097|170.72696688832002|3.1184110288491763E7|   1036805.568944134|{[01 06 00 00 00 ...|
|  6| 350.5178432462774|152.53955733448015|3.9019467824132085E7|  3181116.6958396584|{[01 06 00 00 00 ...|
|  7| 273.2160012818873|146.90066901770308|3.0414266139274076E7|   3908495.924013681|{[01 06 00 00 00 ...|
|  8| 296.9052595388559|131.15177286709482|3.3051342305710167E7|   6249147.841863792|{[01 06 00 00 00 ...|
|  9| 266.3916938036678|149.29652600118698|2.9654587705781948E7|   3594299.178392326|{[01 06 00 00 00 ...|
+---+------------------+------------------+--------------------+--------------------+--------------------+

(3) Do it all in Python using the DataFrame API#

  • Do the same as above, but entirely in the Python DataFrame API

  • ST Functions can be called in both SQL and Python

[ ]:
df3 = spark.range(10)\
    .withColumn("lon",rand()*360.0-180.0)\
    .withColumn("lat",rand()*180.0-90.0)\
    .select(["*", F.lonToX("lon").alias("x"), F.latToY("lat").alias("y")])\
    .select(["*", F.st_buffer(F.st_makePoint("x", "y"), lit(100.0), lit(4326)).alias("SHAPE")])
df3.show()
+---+-------------------+-------------------+--------------------+--------------------+--------------------+
| id|                lon|                lat|                   x|                   y|               SHAPE|
+---+-------------------+-------------------+--------------------+--------------------+--------------------+
|  0| -80.27306476207896|-52.526663883820405|  -8935956.693730101| -6895919.0212813215|{[01 06 00 00 00 ...|
|  1| 136.13052630889314| -72.70333131128358|1.5153980870126314E7|-1.20114677120744...|{[01 06 00 00 00 ...|
|  2|-48.735382911219034| -40.53230258567762|  -5425198.009292109|  -4943599.245997998|{[01 06 00 00 00 ...|
|  3| 100.24677469916406|  58.49787456295826|1.1159419913178964E7|   8072640.796522262|{[01 06 00 00 00 ...|
|  4|  85.10247593540095| -2.896281207768496|   9473564.286375651| -322549.94534973905|{[01 06 00 00 00 ...|
|  5| 120.62952954169941| 48.170558352241414|1.3428417803214129E7|  6135276.7178578405|{[01 06 00 00 00 ...|
|  6| -90.23592442613986| -57.46642537178351|-1.00450171583782...| -7856055.6065385835|{[01 06 00 00 00 ...|
|  7| -35.66765002331698| -80.50862715270581| -3970504.6383883385|-1.58732720738688...|{[01 06 00 00 00 ...|
|  8| 105.63398860986888|-18.336844532236483|1.1759121822513064E7| -2077013.5370249853|{[01 06 00 00 00 ...|
|  9| -20.52573956414119| 43.545686059120385|  -2284914.876435546|   5395403.133094095|{[01 06 00 00 00 ...|
+---+-------------------+-------------------+--------------------+--------------------+--------------------+

Part 4: Processors#

  • Processors are highly optimized functions that are typically designed to handle more complex tasks than ST Functions.

  • However, some simpler ST Functions still have a processor equivalent (like STArea and ProcessorArea)

Processors require that input dataframes have metadata for their shape columns.#

  • Adding metadata gives the processor information about the geometry type and spatial reference (wkid)

  • An error will be thrown if a processor receives the wrong geometry type

    • Some processors require wkid 3857, like distanceAllCandidates

[ ]:
df1_m = df1_o.withMeta("POLYGON", 3857)

Call ProcessorArea on df1_m

  • Calling a processor is relatively straightforward: pass the DataFrame(s) as the first parameter(s), followed by any optional parameters

  • This will calculate the area of each buffered point; the areas should be identical up to floating-point precision

[ ]:
area = P.area(df1_m)
area.show(5)
+---+------------------+------------------+--------------------+-------------------+--------------------+------------------+
| id|               lon|               lat|                   x|                  y|               SHAPE|        shape_area|
+---+------------------+------------------+--------------------+-------------------+--------------------+------------------+
|  0|183.63172113047113|301.66993646460253| 2.044178968973646E7| -8036975.549454975|{[01 06 00 00 00 ...| 31393.50203033758|
|  1|215.35257299970627|   303.40049186252|2.3972938767348576E7| -7678698.353713043|{[01 06 00 00 00 ...| 31393.50203033758|
|  2|231.32602161805937|207.33261175596763| 2.575109493375616E7| -3165089.006420536|{[01 06 00 00 00 ...|31393.502030408486|
|  3| 325.1027848050297| 342.9440762031778| 3.619027645997111E7|-1927335.8713460434|{[01 06 00 00 00 ...| 31393.50203050963|
|  4|228.47316360048467| 235.9342794217521| 2.543351623193424E7| -7545343.656632494|{[01 06 00 00 00 ...| 31393.50203033758|
+---+------------------+------------------+--------------------+-------------------+--------------------+------------------+
only showing top 5 rows
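As a sanity check on the shape_area values above: each SHAPE is a 100 m buffer in a planar (wkid 3857) coordinate system, so its area should sit just under the true circle area πr². The small shortfall reflects the polygonal approximation of the circle that a buffer produces (a plain-Python check, independent of BDT):

```python
import math

r = 100.0  # buffer radius in meters (planar wkid 3857 coordinates)
true_circle_area = math.pi * r ** 2  # ≈ 31415.93 m²

reported = 31393.50203033758  # shape_area from the first row above

# The polygonal buffer's area falls slightly below the true circle
# area — here by roughly 0.07%.
print(true_circle_area, reported,
      (true_circle_area - reported) / true_circle_area)
```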