API Reference#

Context#

bdt.auth(lic)#

Authorize the usage of BDT3 with a valid license file

bdt.bdt_version()#

Return the version string for BDT. This string contains the version of the BDT python bindings and the version of the BDT jar. The version string format is as follows:

BDT python version: v<bdt-version>-<git-describe-output> BDT jar version: v<bdt-version>-<git-describe-output>

Return type:

str

Metadata#

These methods are used to add, remove, or get information about the metadata for a geometry column. When BDT initializes these methods are added to the DataFrame class and can be invoked by any DataFrame object. The first parameter is implicitly the DataFrame object and does not need to be passed as a parameter to the function.

geometry_type(df[, shapeField])

Return the geometry type of the shapeField column

has_column(df, colName)

Return True if the DataFrame has a column named colName.

has_meta(df[, shapeField])

Return True if the dataframe has metadata for the shapeField column

is_multipath(df[, shapeField])

Return True if the metadata geometry type is either 'Polygon' or 'Polyline' and False if it is not.

is_multipoint(df[, shapeField])

Return True if the metadata geometry type is 'MultiPoint' and False if it is not.

is_point(df[, shapeField])

Return True if the metadata geometry type is 'Point' and False if it is not.

is_polygon(df[, shapeField])

Return True if the metadata geometry type is 'Polygon' and False if it is not.

is_polyline(df[, shapeField])

Return True if the metadata geometry type is 'Polyline' and False if it is not.

withMeta(df, geometryType, wkid[, shapeField])

Add metadata to an existing Shape Struct column in a DataFrame.

wkid(df[, shapeField])

Return the coordinate system WKID (Well-Known Identifier) of the shapeField column

bdt.metadata.geometry_type(df, shapeField='SHAPE')#

Return the geometry type of the shapeField column

Parameters:

shapeField (str) – the name of the Shape column

Return type:

str

Returns:

String

Example:

df = spark.sql(
    "select ST_FromText('POINT(2.0 2.0)') as SHAPE").withMeta("Point", 4326)

df.geometry_type()
# 'Point'
bdt.metadata.has_column(df, colName)#

Return True if the DataFrame has a column named colName.

Parameters:

colName (str) – The name of the column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('POINT(2.0 2.0)') as SHAPE")

df.has_column("SHAPE")
# True
bdt.metadata.has_meta(df, shapeField='SHAPE')#

Return True if the dataframe has metadata for the shapeField column

Parameters:

shapeField (str) – the name of the Shape column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('POINT(2.0 2.0)') as SHAPE").withMeta("Point", 4326)

df.has_meta()
# True
bdt.metadata.is_multipath(df, shapeField='SHAPE')#

Return True if the metadata geometry type is either ‘Polygon’ or ‘Polyline’ and False if it is not.

Parameters:

shapeField (str) – the name of the Shape column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))') as SHAPE").withMeta("Polygon", 4326)

df.is_multipath()
# True
bdt.metadata.is_multipoint(df, shapeField='SHAPE')#

Return True if the metadata geometry type is ‘MultiPoint’ and False if it is not.

Parameters:

shapeField (str) – the name of the Shape column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('MULTIPOINT (10 40, 40 30, 20 20, 30 10)') as SHAPE").withMeta("Multipoint", 4326)

df.is_multipoint()
# True
bdt.metadata.is_point(df, shapeField='SHAPE')#

Return True if the metadata geometry type is ‘Point’ and False if it is not.

Parameters:

shapeField (str) – the name of the Shape column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('POINT(2.0 2.0)') as SHAPE").withMeta("Point", 4326)

df.is_point()
# True
bdt.metadata.is_polygon(df, shapeField='SHAPE')#

Return True if the metadata geometry type is ‘Polygon’ and False if it is not.

Parameters:

shapeField (str) – the name of the Shape column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))') as SHAPE").withMeta("Polygon", 4326)

df.is_polygon()
# True
bdt.metadata.is_polyline(df, shapeField='SHAPE')#

Return True if the metadata geometry type is ‘Polyline’ and False if it is not.

Parameters:

shapeField (str) – the name of the Shape column

Return type:

bool

Returns:

Boolean

Example:

df = spark.sql(
    "select ST_FromText('LINESTRING (30 10, 10 30, 40 40)') as SHAPE").withMeta("Polyline", 4326)

df.is_polyline()
# True
bdt.metadata.withMeta(df, geometryType, wkid, shapeField='SHAPE')#

Add metadata to an existing Shape Struct column in a DataFrame. Metadata is required for most Processors.

See https://spatialreference.org/ for a helpful list of spatial references available to use.

Parameters:
  • geometryType (str) – the geometry type as a case-insensitive string. It can be one of “Unknown”, “Point”, “Polyline”, “Polygon”, “Line”, “Envelope”, “Multipoint”.

  • wkid (int) – the spatial reference id

  • shapeField (str) – the name of the Shape column

Return type:

DataFrame

Returns:

DataFrame

Example:

df = spark.sql(
    "select ST_FromText('POINT(2.0 2.0)') as SHAPE")

df.withMeta("POINT", 3857, "SHAPE")
# DataFrame
bdt.metadata.wkid(df, shapeField='SHAPE')#

Return the coordinate system WKID (Well-Known Identifier) of the shapeField column

Parameters:

shapeField (str) – the name of the Shape column

Return type:

int

Returns:

Integer

Example:

df = spark.sql(
    "select ST_FromText('POINT(2.0 2.0)') as SHAPE").withMeta("Point", 4326)

df.wkid()
# 4326

Geometry Constructors#

These processors and functions produce a SHAPE struct output from a set of coordinate inputs.

Processors#

assembler(df, targetIDFields, timeField, ...)

Processor to assemble a set of targets (points) into a track (polyline).

bdt.processors.assembler(df, targetIDFields, timeField, timeFormat, separator=':', trackIDName='trackID', origMillisName='origMillis', destMillisName='destMillis', durationMillisName='durationMillis', numTargetsName='numTargets', shapeField='SHAPE', cache=False)#

Processor to assemble a set of targets (points) into a track (polyline).

The user can specify a set of target fields that make up a track identifier and a timestamp field in the target attributes, such that all the targets with that same track identifier will be assembled chronologically as polyline vertices. The polyline consists of 3D points where x/y are the target location and z is the timestamp.

The output DataFrame will have the following columns:

shape: The shape struct that defines the track (polyline)

trackID: The track identifier. Typically, this is a concatenation of the string representation of the user specified target fields separated by a colon (:)

origMillis: The track starting epoch in milliseconds.

destMillis: The track ending epoch in milliseconds.

durationMillis: The track duration in milliseconds.

numTargets: The number of targets in the track.

It is possible for the output shape to be an invalid polyline geometry. A polyline with two consecutive identical points is considered invalid. This can happen when two consecutive points in a track have identical positions. This can also happen when a track only has one point, in which case a polyline will be created with the single point duplicated, thus creating a polygon with two consecutive identical points. ProcessorSimplify or STSimplify can be used to make these geometries valid and remove consecutive identical points. However, simplification may produce an empty geometry in the case of a polyline made up of only identical points.

Parameters:
  • df (DataFrame) – The input DataFrame

  • targetIDFields (list) – The combination of target feature attribute fields to use as a track identifier.

  • timeField (str) – The name of the time field.

  • timeFormat (str) – The format of the time field. Has to be one of: milliseconds, date, or timestamp

  • separator (str) – The separator used to join the target id fields. The joined target id fields become the trackID.

  • trackIDName (str) – The name of the output field to hold the track identifier as a polyline attribute.

  • origMillisName (str) – The name of the output field to hold the track starting epoch in milliseconds as a polyline

  • destMillisName (str) – The name of the output field to hold the track ending epoch in milliseconds as a polyline

  • durationMillisName (str) – The name of the output field to hold the track duration in milliseconds as a polyline attribute.

  • numTargetsName (str) – The name of the output field to hold the track number of targets in milliseconds as a polyline

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

A set of targets into a track.

Example:

bdt.processors.assembler(
    df,
    ["ID1", "ID2", "ID3"],
    "current_time",
    "timestamp",
    separator = ":",
    trackIDName = "trackID",
    origMillisName = "origMillis",
    destMillisName = "destMillis",
    durationMillisName = "durationMillis",
    numTargetsName = "numTargets",
    shapeField = "SHAPE",
    cache = False)

Functions#

st_envelope(xmin, ymin, xmax, ymax)

Create an Envelope using xmin, ymin, xmax, ymax.

st_line(array)

Accepts an array of point shape structs and returns a polyline shape struct.

st_makeLine(array)

Return a line shape struct from an array of coordinates like [x1, y1, x2, y2, x3, y3...]

st_makePoint(lon, lat)

Return a point shape struct from x and y values.

st_makePolygon(array)

Return a polygon shape struct from an array of coordinates like [x1, y1, x2, y2, x3, y3...]

st_multiPoint(arr)

Return a shape struct representing a multipoint from an array of arrays of float.

st_multipoint_shape(arr)

Return a shape struct representing a multipoint from an array of point shape struct.

bdt.functions.st_envelope(xmin, ymin, xmax, ymax)#

Create an Envelope using xmin, ymin, xmax, ymax.

Parameters:
  • xmin (column or str or float) – The xmin

  • ymin (column or str or float) – The ymin

  • xmax (column or str or float) – The xmax

  • ymax (column or str or float) – The ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_Envelope(-180.0, -90.0, 180.0, 90.0) AS SHAPE''')

Python Example:

df.select(st_envelope(-180.0, -90.0, 180.0, 90.0).alias("SHAPE"))
bdt.functions.st_line(array)#

Accepts an array of point shape structs and returns a polyline shape struct.

Parameters:

array (column or str or array) – An array of point shape struct.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT array(ST_FromText('POINT (1 1)'), ST_FromText('POINT (2 2)')) AS
POINT_ARRAY''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Line(POINT_ARRAY) AS SHAPE''')

Python Example:

df = spark \
        .createDataFrame([("POINT (1 1)", "POINT (2 2)")], ["WKT1", "WKT2"]) \
        .select(array(st_fromText(col("WKT1")), st_fromText(col("WKT2"))).alias("SHAPE_ARRAY")) \
        .select(st_line("SHAPE_ARRAY").alias("SHAPE"))
bdt.functions.st_makeLine(array)#

Return a line shape struct from an array of coordinates like [x1, y1, x2, y2, x3, y3…]

Parameters:

array (column or str or array) – An array of coordinates in the order [x1, y1, x2, y2, x3, y3, …]

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT array(0, 0, 1, 0) AS coords''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_MakeLine(coords) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([([0, 0, 1, 0],)], ["coords"])

df.select(st_makeLine("coords").alias("SHAPE"))
bdt.functions.st_makePoint(lon, lat)#

Return a point shape struct from x and y values.

Parameters:
  • lon (column or str or float) – The longitude value.

  • lat (column or str or float) – The latitude value.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_MakePoint(lon, lat) AS SHAPE FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(st_makePoint("lon", "lat").alias("SHAPE"))
bdt.functions.st_makePolygon(array)#

Return a polygon shape struct from an array of coordinates like [x1, y1, x2, y2, x3, y3…]

Parameters:

array (column or str or array) – An array of coordinates in the order [x1, y1, x2, y2, x3, y3, …]

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT array(0, 0, 1, 0, 1, 1, 0, 1, 0, 0) AS COORDS''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_MakePolygon(COORDS) AS SHAPE FROM df''')

Python Example:

df = spark \
        .createDataFrame([([0, 0, 1, 0, 1, 1, 0, 1, 0, 0],)], ["coords"])

df.select(st_makePolygon("coords").alias("SHAPE"))
bdt.functions.st_multiPoint(arr)#

Return a shape struct representing a multipoint from an array of arrays of float.

Parameters:

arr (column or str or array) – an expression. An array of arrays of float.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT array(array(1.1, 2.2), array(3.3, 4.4)) AS array_of_array''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_MultiPoint(array_of_array) AS SHAPE FROM df''')

Python Example:

df = spark.createDataFrame(
    [("1", [[1.1, 2.2],[3.3, 4.4]])],
    ["id", "array_of_array"]
)

df.select(st_multiPoint("array_of_array").alias("SHAPE"))
bdt.functions.st_multipoint_shape(arr)#

Return a shape struct representing a multipoint from an array of point shape struct.

Parameters:

arr (column or str or array) – an expression. An array of shape struct with wkb, xmin, ymin, xmax, ymax.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT array(ST_MakePoint(1.1, 2.2), ST_MakePoint(3.3, 4.4)) AS array_of_point''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_MultiPointShape(array_of_point) AS SHAPE FROM df''')

Python Example:

df = spark.createDataFrame([
    (1, "POINT(1.1 2.2)"),
    (1, "POINT(3.3 4.4)")
], ["ID", "WKT"]).selectExpr("ID", "ST_FromText(WKT) SHAPE")

(
    df
        .groupby("ID")
        .agg(collect_list("SHAPE").alias("array_of_point"))
        .select(F.st_multipoint_shape("array_of_point").alias("SHAPE"))
)

Geometry Accessors#

These functions access components of an input SHAPE struct.

Functions#

st_boundary(struct, dist)

Returns the geometry extent of a given shape struct as a polygon.

st_dump(struct)

Returns an array of single part geometries from a multi part geometry.

st_dumpXY(struct)

Return array of (x,y) vertex values making up the given shape struct as an array of structs.

st_dumpXYWithIndex(struct)

Return array of (x,y,i) vertex values with indices making up the given shape struct as an array of structs.

st_dump_end_nodes_xy(struct)

Return array of (x,y,i) vertex values for only the first and last vertex of the input geometry.

st_dump_end_nodes_xy_explode(struct)

Convenience function that calls explode() on st_dump_end_nodes_xy.

st_dump_explode(struct)

Convenience function that calls explode() on st_dump.

st_is_empty(struct)

Check whether the given shape struct is empty or not.

st_num_geometries(struct)

Returns the number of Geometries.

st_pointCount(struct)

Return the count of points.

st_segment_ends(struct, away)

Returns an array of the first and last segment of a multi part geometry.

st_segment_ends_explode(struct, away)

Convenience function that calls explode() on st_segment_ends.

st_segments(struct)

Returns an array of segments from a multi path geometry (polyline or polygon).

st_segments_explode(struct)

Convenience function that calls explode() on st_segments.

st_x(struct)

Return the first longitude value from a Shape struct.

st_y(struct)

Return the first latitude value from a Shape struct.

bdt.functions.st_boundary(struct, dist)#

Returns the geometry extent of a given shape struct as a polygon.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • dist (column or str or float) – The distance. Can be set to 0.0 for no distance offset.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))') as SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Boundary(SHAPE, 0.0) as boundary FROM df''')

Python Example:

df = spark.createDataFrame([
    ("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",)
], schema="wkt string")

df.select(
    st_boundary(
        st_fromText("wkt"), 0.0
    ).alias("boundary")
)
bdt.functions.st_dump(struct)#

Returns an array of single part geometries from a multi part geometry. Use explode() to flatten the array.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct

SQL Example:

spark.sql(
    '''SELECT ST_FromText("MULTIPOINT ((10 40), (40 30), (20 20), (30 10))") AS shape'''
).createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_Dump(shape)) AS POINTS FROM df''')

Python Example:

df = spark.createDataFrame([
    ("MULTIPOINT ((10 40), (40 30), (20 20), (30 10))", 1)
], schema="wkt string, id int")

result_df = df.select(
    explode(st_dump(st_fromText("wkt"))).alias("points")
)
bdt.functions.st_dumpXY(struct)#

Return array of (x,y) vertex values making up the given shape struct as an array of structs. Use explode() to flatten the array. The x and y values of the struct can be accessed in SQL by calling StructColName.x and StructColName.y, where StructColName is the name of the column with values returned by explode(ST_DumpXY(…)).

Unpack with inline().

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of ArrayType with InternalRow of two DoubleTypes

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_DumpXY(SHAPE))''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df             .select(st_dumpXY("SHAPE").alias("DUMP"))             .selectExpr("inline(DUMP)")
bdt.functions.st_dumpXYWithIndex(struct)#

Return array of (x,y,i) vertex values with indices making up the given shape struct as an array of structs. Use explode() to flatten the array. The x, y, and i values of the struct can be accessed in SQL by calling StructColName.x, StructColName.y, StructColName.i, where StructColName is the name of the column with values returned by explode(ST_DumpXYWithIndex(…)).

Use inline() to unpack.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of ArrayType with InternalRow of 2 DoubleTypes and 1 IntegerType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_DumpXYWithIndex(SHAPE))''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_dumpXYWithIndex("SHAPE").alias("DUMP_XY")) \
    .selectExpr("inline(DUMP_XY)")
bdt.functions.st_dump_end_nodes_xy(struct)#

Return array of (x,y,i) vertex values for only the first and last vertex of the input geometry. If the input geometry is a single vertex point, then the output will be an array of one element. If the input geometry is a multi vertex geometry, then the output will be an array of two elements.

The x, y, and i values of the struct can be accessed in SQL by calling StructColName.x, StructColName.y, StructColName.i, where StructColName is the name of the column with values returned by explode(ST_DumpNodesXY(…)).

Use inline() to unpack.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of ArrayType with InternalRow of 2 DoubleTypes and 1 IntegerType

SQL Example:

spark.sql('''SELECT ST_FromText('LINESTRING (0 0, 1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_DumpEndNodesXY(SHAPE) AS XY FROM df''')

Python Example:

df = (spark
        .createDataFrame([("LINESTRING (0 0, 1 1)",)], ["WKT"])
        .select(st_fromText(col("WKT")).alias("SHAPE")))

df.select(st_dump_end_nodes_xy("SHAPE").alias("XY"))
bdt.functions.st_dump_end_nodes_xy_explode(struct)#

Convenience function that calls explode() on st_dump_end_nodes_xy.

See the docs for st_dump_end_nodes_xy for more information.

This function can only be used in Python. It cannot be used in a spark sql statement.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of ArrayType with InternalRow of 2 DoubleTypes and 1 IntegerType

Python Example:

df = (spark
        .createDataFrame([("LINESTRING (0 0, 1 1)",)], ["WKT"])
        .select(st_fromText(col("WKT")).alias("SHAPE")))

df.select(st_dump_end_nodes_xy_explode("SHAPE").alias("XY"))
bdt.functions.st_dump_explode(struct)#

Convenience function that calls explode() on st_dump.

See the docs for st_dump for more information.

This function can only be used in Python. It cannot be used in a spark sql statement.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

Python Example:

df = spark.createDataFrame([
    ("MULTIPOINT ((10 40), (40 30), (20 20), (30 10))", 1)
], schema="wkt string, id int")

result_df = df.select(
    st_dump_explode(st_fromText("wkt")).alias("points")
)
bdt.functions.st_is_empty(struct)#

Check whether the given shape struct is empty or not.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_IsEmpty(SHAPE) AS IS_EMPTY FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_is_empty("SHAPE").alias("IS_EMPTY"))
bdt.functions.st_num_geometries(struct)#

Returns the number of Geometries. Single geometries will return 1 and empty geometries will return 0.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of IntegerType

SQL Example:

spark.sql(
    '''SELECT ST_FromText("MULTIPOINT ((10 40), (40 30),
      (20 20), (30 10))") AS shape'''
).createOrReplaceTempView("df")

df = spark.sql('''SELECT ST_NumGeometries(shape) AS NUM_GEOM FROM df''')

Python Example:

df = spark.createDataFrame([
    ("MULTIPOINT ((10 40), (40 30), (20 20), (30 10))",)
], schema="wkt string")

result_df = df.select(
    st_num_geometries(st_fromText("wkt")).alias("num_geom")
)
bdt.functions.st_pointCount(struct)#

Return the count of points. The input must be of the com.esri.core.geometry.MultiVertexGeometry.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of IntegerType

SQL Example:

spark.sql('''SELECT 'MULTIPOLYGON (((1 2, 3 2, 3 4, 1 4, 1 2)))' AS WKT''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_PointCount(ST_FromText(WKT)) AS POINT_COUNT FROM df''')

Python Example:

df = spark.createDataFrame([
    ("MULTIPOLYGON (((1 2, 3 2, 3 4, 1 4, 1 2)))", 1)
], schema='WKT string, ID int') \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_pointCount("SHAPE").alias("POINT_COUNT"))
bdt.functions.st_segment_ends(struct, away)#

Returns an array of the first and last segment of a multi part geometry. Use explode to unpack this array. When away is true, the returned end segments point away from each other. When away is false, the returned end segments point towards each other.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • away (column or str or bool) – Direction of the returned segments. True for away, False for towards.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_SegmentEnds(SHAPE, false) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))",)], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_segment_ends("SHAPE", False).alias("SHAPE"))
bdt.functions.st_segment_ends_explode(struct, away)#

Convenience function that calls explode() on st_segment_ends.

See the docs for st_segment_ends for more information.

This function can only be used in Python. It cannot be used in a spark sql statement.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • away (column or str or bool) – Direction of the returned segments. True for away, False for towards.

Return type:

Column

Returns:

Column of SHAPE Struct

Python Example:

df = spark \
    .createDataFrame([("LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))",)], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_segment_ends_explode("SHAPE", False).alias("SHAPE"))
bdt.functions.st_segments(struct)#

Returns an array of segments from a multi path geometry (polyline or polygon). Use explode() to unpack.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax. Must be a polyline or polygon.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_Segments(SHAPE)) segment FROM df''')

Python Example:

df = spark \
    .createDataFrame([("LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))",)], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(explode(st_segments("SHAPE")).alias("SHAPE"))
bdt.functions.st_segments_explode(struct)#

Convenience function that calls explode() on st_segments.

See the docs for st_segments for more information.

This function can only be used in Python. It cannot be used in a spark sql statement.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

Python Example:

df = spark \
    .createDataFrame([("LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))",)], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_segments_explode("SHAPE").alias("SHAPE"))
bdt.functions.st_x(struct)#

Return the first longitude value from a Shape struct. If the given shape is neither a point nor MultiVertexGeometry, the default value, 0.0, is returned.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_X(SHAPE) AS x FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_x("SHAPE").alias("x))
bdt.functions.st_y(struct)#

Return the first latitude value from a Shape struct. If the given shape is neither a point nor MultiVertexGeometry, the default value, 0.0, is returned.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Y(SHAPE) AS y FROM df''')

Python Example:

df = spark \
     .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
     .select(st_fromText(col("WKT")).alias("SHAPE"))

 df.select(st_y("SHAPE").alias("y"))

Geometry Editors#

These functions make edits to the geometry of an input SHAPE struct, and return the newly edited struct.

Processors#

eliminatePolygonPartArea(df, areaThreshold)

Remove the holes of a polygon that have area below the given threshold.

eliminatePolygonPartPercent(df, percentThreshold)

Remove the holes of a polygon whose area as a percentage of the whole polygon is less than this value.

bdt.processors.eliminatePolygonPartArea(df, areaThreshold, shapeField='SHAPE', cache=False)#

Remove the holes of a polygon that have area below the given threshold.

Parameters:
  • df (DataFrame) – The input DataFrame

  • areaThreshold (float) – If a hole has area less than this value, then it will be removed.

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Remove holes of polygon.

Example:

bdt.processors.eliminatePolygonPartArea(
    df,
    1.0,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.eliminatePolygonPartPercent(df, percentThreshold, shapeField='SHAPE', cache=False)#

Remove the holes of a polygon whose area as a percentage of the whole polygon is less than this value.

Parameters:
  • df (DataFrame) – The input DataFrame

  • percentThreshold (float) – If the area of a hole as a percentage of the whole polygon is less than this value, then the

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Remove holes of polygon.

Example:

bdt.processors.eliminatePolygonPartPercent(
    df,
    1.0,
    shapeField = "SHAPE",
    cache = False)

Functions#

st_chop(struct, wkid)

Accept a geometry and a wkid.

st_chop2(struct, wkid, level, limit)

Accepts a MultiVertex geometry.

st_chop3(struct, wkid, cellSize, level, limit)

Accepts a MultiVertex geometry.

st_clip_at_length(struct, length)

Clips a polyline at a specified length.

st_closest_point(point_struct, shape_struct)

Similar to st_snap.

st_dice(struct, wkid, vertexLimit)

Subdivides a feature into smaller features based on a specified vertex limit. This tool is intended as a way to

st_eliminatePolygonPartArea(struct, ...)

Remove the holes of a polygon that have area below the given threshold.

st_eliminatePolygonPartPercent(struct, ...)

Remove the holes of a polygon whose area as a percentage of the whole polygon is less than this value.

st_eliminate_hole_by_buffer(struct, ...)

This function packages five geometry operations into one, in the following order:

st_extend(struct, orig_dist, dest_dist)

Extend a line from the origin point by orig_dist and from the destination point by dest_dist.

st_insert_point(left_shape_struct, ...)

Insert the point or multipoint 'left_shape_struct' into 'right_shape_struct'.

st_insert_points(shape_struct_array, ...)

This function is very similar to ST_InsertPoint, except the first argument of this function is an array of point or multipoint.

st_integrate(shape_struct_array, tolerance)

Integrate the shape structs in 'struct_array' together, given the tolerance.

st_reverseMultipath(struct)

Reverse the order of points in a multipath.

st_snap(point, geom)

Given a point and another geometry, finds the closest point on the geometry to the point.

st_snapToLine(point, line[, heading])

Given a point and polyline geometry (geometries must be in the web mercator aux 3857.), finds the closest point on the line to the point and returns the following information about the snap:

st_snap_to_street(x, y)

Snap the input X and Y coordinate to the nearest street.

st_split_at_points(polyline_shape, ...)

Split a polyline by the multipoint.

st_update(struct, structList)

Mutate a given [[MultiVertexGeometry]] based on a given array of named struct with i, x, y fields.

bdt.functions.st_chop(struct, wkid)#

Accept a geometry and a wkid. The extent of the given geometry will be divided into 4 sub-extents. Then the geometry will be clipped with each sub-extent. Returns the maximum of 4 clipped geometries. If a geometry does not span 1 or more sub-extents, then less than 4 geometries will be returned.

Use explode() to unpack.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct.

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Chop(SHAPE, 4326) AS SHAPE_ARRAY FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_chop("SHAPE", 4326).alias("SHAPE_ARRAY"))

SQL Example 2:

df = spark.createDataFrame([
    ("MULTIPOLYGON (((40 40, 20 45, 45 30, 40 40)),((20 35, 10 30, 10 10, 30 5, 45 20, 20 35),(30 20, 20 15, "
    "20 25, 30 20)))", 1)
], schema='wkt string, id int')

df_chopped = (df
    .withColumn("SHAPE", st_fromText('wkt'))
    .select("id", explode(st_chop(col('SHAPE'), 0)).alias("chopped")))
bdt.functions.st_chop2(struct, wkid, level, limit)#

Accepts a MultiVertex geometry.

Chop the geometry until:

  • level is met (level > 0 && limit = 0)

  • vertex limit is met (level = 0 && limit > 0)

  • or go by level but stop if a quad met vertex limit (level > 0 && limit > 0)

Too small vertex limit can cause an infinite loop. Returns an array of chopped shapes.

Use explode() to unpack.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

  • level (column or str or int) – The # of times to quad down to.

  • limit (column or str or int) – The vertex limit to quad down to.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct.

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Chop2(SHAPE, 4326, 2, 3) AS SHAPE_ARRAY FROM df''')

Python Example:

df = spark \
     .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
     .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_chop2("SHAPE", 4326, 2, 3).alias("SHAPE_ARRAY"))
bdt.functions.st_chop3(struct, wkid, cellSize, level, limit)#

Accepts a MultiVertex geometry.

Chop, by the cell, the geometry until:

  • level is met (level > 0 && limit = 0). The cell size gets smaller (divide by 2) at each interation.

  • vertex limit is met (level = 0 && limit > 0), The cell size gets smaller (divide by 2) at each interation.

  • or go by level but stop if a quad met vertex limit (level > 0 && limit > 0), The cell size gets smaller (divide by 2) at each interation.

Too small vertex limit can cause an infinite loop.

Returns an array of chopped shapes.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

  • cellSize (column or str or float) – The spatial partitioning size.

  • level (column or str or int) – The # of times to quad down to.

  • limit (column or str or int) – The vertex limit to quad down to.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct.

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Chop3(SHAPE, 4326, 2.0, 2, 3) AS SHAPE_ARRAY FROM df''')

Python Example:

df = spark \
        .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
        .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_chop3("SHAPE", 4326, 2.0, 2, 3).alias("SHAPE_ARRAY"))
bdt.functions.st_clip_at_length(struct, length)#

Clips a polyline at a specified length. This function only supports polyline geometries.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax. Must be a polyline.

  • length (column or str or float) – The length along the polyline to clip to.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_ClipAtLength(SHAPE, 15.0) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("LINESTRING (0 0, 10 0, 20 0, 30 0, 40 0))",)], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_clip_at_length("SHAPE", 15.0).alias("SHAPE"))
bdt.functions.st_closest_point(point_struct, shape_struct)#

Similar to st_snap. Given a point and another geometry, finds the closest point on the geometry.

One of the input geometries must be a point, but the order of the input geometries does not matter. The point can be passed as either be the first or second argument.

st_closest_point will always return the same snap point that st_snap will for point to polyline cases.

st_closest_point and st_snap differ on the point to polygon case when the point is contained in the polygon:

- st_closest_point will return the snap point as the same input point if the point is contained in the polygon.
- st_snap will return the snap point as the closest point on the polygon border if the point is contained in the polygon.

In addition, st_closest_point will return additional attributes along with the snap location that st_snap will not. The return type is a struct with the following values:

- SHAPE: The snap location as a point geometry.
- distance: The distance between the input point and snap location on the geometry.
- index: The index of the vertex in the geometry that the point was snapped to.
- rightside: A boolean indicating if the point was snapped to the right side of the geometry.

If the output of the snap is empty OR the input geometry types are not supported, the values of the struct will be:

- SHAPE: A point geometry with empty coordinates.
- distance: -1.0
- index: -1
- rightside: false

The behavior of the index attribute is as follows:

- When the input is a polygon and the point is inside the polygon, the value is zero.
- When the input is a polygon and the point is outside the polygon, the value is the start vertex index of a segment with the closest coordinate.
- When the input is a polyline, the value is the start vertex index of a segment with the closest coordinate.
- When the input is a point, the value is 0. | - When the input is a multipoint, the value is the closest vertex.
- When the input is empty or not supported, the value is -1.

See the walkthrough notebook for more details and visualizations on the index attribute.

The rightside attribute indicates the side of the segment the point was snapped to. The possible values are true meaning right side, and false meaning left side. If the point is directly on the line segment, then it is considered to be on the right of the line. See the walkthrough notebook for more details and visualizations on the rightside attribute.

Parameters:
  • point_struct (column or str) – Point shape struct with wkb, xmin, ymin, xmax, ymax

  • shape_struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of Struct with values (SHAPE, distance, index, rightside)

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (0.5 0.5)') AS SHAPE1,
     ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df2")

spark.sql('''SELECT inline(array(ST_ClosestPoint(SHAPE1, SHAPE2))) FROM df2''')

Python Example:

df = (spark
    .createDataFrame([("POINT (1 1)",
    "LINESTRING (0 0, 2 2)")], ["WKT1", "WKT2"])
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2")))

df.select(inline(array(st_closest_point("SHAPE1", "SHAPE2"))))
bdt.functions.st_dice(struct, wkid, vertexLimit)#

Subdivides a feature into smaller features based on a specified vertex limit. This tool is intended as a way to subdivide extremely large features that cause issues with drawing, analysis, editing, and/or performance but are difficult to split up with standard editing and geoprocessing tools. This tool should not be used in any cases other than those where tools are failing to complete successfully due to the size of features.

Use explode() to unpack.

Parameters:
  • struct (column or str) – an expression. The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – an expression. The spatial reference id.

  • vertexLimit (column or str or int) – an expression. The maximum vertex limit.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql(```SELECT explode(ST_Dice(SHAPE, 4326, 1)) AS SHAPE FROM df```)

Python Example:

df = spark \
     .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
     .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(explode(st_dice("SHAPE", 4326, 1)).alias("SHAPE"))
bdt.functions.st_eliminatePolygonPartArea(struct, area_threshold)#

Remove the holes of a polygon that have area below the given threshold.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • area_threshold (column or str or float) – If a hole has area less than this value, then it will be removed.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0),
(0.5 0.5, 1.5 0.5, 1.5 1.5, 0.5 1.5, 0.5 0.5))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_EliminatePolygonPartArea(SHAPE, 1.0) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([('''POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0),
                                   (0.5 0.5, 1.5 0.5, 1.5 1.5, 0.5 1.5,0.5 0.5))''',)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_eliminatePolygonPartArea("SHAPE", 1.0).alias("SHAPE"))
bdt.functions.st_eliminatePolygonPartPercent(struct, percent_threshold)#

Remove the holes of a polygon whose area as a percentage of the whole polygon is less than this value.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • percent_threshold (column or str or float) – an expression. If a hole has area less than this value, then it will be removed.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0),
(0.5 0.5, 1.5 0.5, 1.5 1.5, 0.5 1.5, 0.5 0.5))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_EliminatePolygonPartPercent(SHAPE, 50.0) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([('''POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0),
                                   (0.5 0.5, 1.5 0.5, 1.5 1.5, 0.5 1.5,0.5 0.5))''',)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_eliminatePolygonPartPercent("SHAPE", 50.0).alias("SHAPE"))
bdt.functions.st_eliminate_hole_by_buffer(struct, buffer_dist, percent_threshold, max_vert_in_circle)#

This function packages five geometry operations into one, in the following order:

  1. Buffer the polygon by the provided buffer distance.

  2. Simplify the polygon.

  3. Eliminate holes in the polygon: A hole with an area that is less than the percentage threshold of the total area of the polygon will be eliminated.

  4. Simplify the polygon with relevant holes eliminated.

  5. Negative buffer the polygon, effectively undoing the first buffer.

This function only accepts geometries of type Polygon (or MultiPolygon)

The recommended default value for max_vert_in_circle is 96. If polygon detail is valued more over performance, increase this value. If performance is valued more over polygon detail, decrease this value. The value must be an integer.

Parameters:
  • struct (column or str) – The input polygon or multipolygon shape struct with wkb, xmin, ymin, xmax, ymax

  • buffer_dist (column or str or float) – The distance to buffer (and un-buffer) the input polygon.

  • percent_threshold (column or str or float) – The percentage threshold. A hole with an area that is less than this percentage of the total area of the polygon will be eliminated.

  • max_vert_in_circle (column or str or int) – The maximum number of vertices in a circle (or curve). Buffering can result in polygons with curved corners. Setting this to a higher value will result in a more accurate representation of curves on buffered polygons, but could also negatively impact performance.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

(
    spark
        .sql('''SELECT ST_FromText('POLYGON((1 1, 4 1, 4 4, 1 4, 1 1), (2 2, 3 2, 3 3, 2 3, 2 2))') AS SHAPE''')
        .createOrReplaceTempView("df")
)

spark.sql('''SELECT ST_EliminateHoleByBuffer(SHAPE, 1.0, 8.0, 96) AS SHAPE FROM df''')

Python Example:

df = spark.sql('''
        SELECT ST_FromText('POLYGON((1 1, 4 1, 4 4, 1 4, 1 1), (2 2, 3 2, 3 3, 2 3, 2 2))') AS SHAPE
     ''')

df.select(st_eliminate_hole_by_buffer("SHAPE", 1.0, 8.0, 96).alias("SHAPE"))
bdt.functions.st_extend(struct, orig_dist, dest_dist)#

Extend a line from the origin point by orig_dist and from the destination point by dest_dist. The angles of extension are the same as the angles of the segments of the origin and destination points.

Extend a line from the origin point by origDist and from the destination point by destDist. The angles of extension are the same as the angles of the segments of the origin and destination points (the start and end segments of the line).

The input line can have more than one segment, but cannot be a multipath. If it is a multipath, first use st_dump to extract each of the component lines into their own row in the DataFrame.

Parameters:
  • struct (column or str) – The line shape struct with wkb, xmin, ymin, xmax, ymax

  • orig_dist (column or str or float) – The distance to extend the line from the origin point.

  • dest_dist (column or str or float) – The distance to extend the line from the destination point.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

(spark
    .sql('''SELECT ST_FromText('LINESTRING (0.0 0.0, 4.0 2.0)') AS SHAPE1''')
    .createOrReplaceTempView("df")
)

spark.sql('''SELECT ST_Extend(SHAPE, 2.0, 4.0) AS SHAPE FROM df''')

Python Example:

df = spark.sql('''SELECT ST_FromText('LINESTRING (0.0 0.0, 4.0 2.0)') AS SHAPE1''')

df.select(st_extend("SHAPE", 2.0, 4.0).alias("SHAPE"))
bdt.functions.st_insert_point(left_shape_struct, right_shape_struct, tolerance)#

Insert the point or multipoint ‘left_shape_struct’ into ‘right_shape_struct’.

If ‘left_shape_struct’ is a point, then it will be inserted into ‘right_shape_struct’ if that point is not within tolerance of a point in ‘right_shape_struct’.

If ‘left_shape_struct’ is a multipoint, then each point in the multipoint will be inserted into ‘right_shape_struct’ if that point is not within tolerance of a point in ‘right_shape_struct’.

If ‘right_shape_struct’ is a multipath, the point(s) are inserted at the closest location on the multipath if that closest location is not within tolerance of an existing point on the multipath. The closest location is measured by euclidian distance.

If it is the case that there are point(s) in ‘left_shape_struct’ that are within tolerance of each other, they will still all be inserted into ‘right_shape_struct’ provided they are not within tolerance of any point in ‘right_shape_struct’.

Set tolerance = 0 to ignore tolerance and always require insertion.

Parameters:
  • left_shape_struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

  • right_shape_struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

  • tolerance (column or str or float) – If an existing point in right_shape_struct is within ‘tolerance’ of the closest snap location of ‘left_shape_struct’ on ‘right_shape_struct’, then left_shape_struct will not be inserted.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (0.5 0)') AS SHAPE1,
     ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_InsertPoint(SHAPE1, SHAPE2, 0.1) AS SHAPE FROM df''')

Example:

df = spark \
    .createDataFrame([("POINT (0.5 0)",
    "MULTILINESTRING ((0 0, 1 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_insert_point("SHAPE1", "SHAPE2", 0.1).alias("SHAPE"))
bdt.functions.st_insert_points(shape_struct_array, shape_struct, tolerance)#

This function is very similar to ST_InsertPoint, except the first argument of this function is an array of point or multipoint.

Inserts each point or multipoint of the input ‘shape_struct_array’ into ‘shape_struct’.

For each geometry in ‘shape_struct_array’:

If it is a point, then it will be inserted into ‘shape_struct’ if that point is not within tolerance of a point in ‘shape_struct’. If it is a multipoint, then each point in the multipoint will be inserted into ‘shape_struct’ if that point is not within tolerance of a point in ‘shape_struct’.

If ‘shape_struct’ is a multipath, the points or multipoint in ‘shape_struct_array’ are inserted at the closest location on the multipath if that closest location is not within tolerance of an existing point on the multipath. The closest location is measured by euclidian distance.

If it is the case that there are point(s) in ‘struct_array’ that are within tolerance of each other, they will still all be inserted into ‘shape_struct’ provided they are not within tolerance of any point in ‘shape_struct’.

Set tolerance = 0 to ignore tolerance and always require insertion.

Parameters:
  • shape_struct_array (column or str or array) – Array of shape struct with wkb, xmin, ymin, xmax, ymax

  • shape_struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

  • tolerance (column or str or float) – If the snap location of a point in ‘shape_struct_array’ onto ‘shape_struct’ is within tolerance of a point in ‘shape_struct’, the point in ‘struct_array’ will not be inserted.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT array(ST_FromText('POINT (0.5 0)')) AS SHAPE_ARRAY,
     ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_InsertPoints(SHAPE_ARRAY, SHAPE, 0.1) AS SHAPE FROM df''')

Example:

df = spark \
    .createDataFrame([("POINT (0.5 0)",
    "MULTILINESTRING ((0 0, 1 0))")], ["WKT1", "WKT2"]) \
    .select(array(st_fromText(col("WKT1"))).alias("SHAPE_ARRAY"), st_fromText(col("WKT2")).alias("SHAPE"))

df.select(st_insert_points("SHAPE_ARRAY", "SHAPE", 0.1).alias("SHAPE"))
bdt.functions.st_integrate(shape_struct_array, tolerance)#

Integrate the shape structs in ‘struct_array’ together, given the tolerance. Use explode() to unpack.

Tolerance can be conceptualized as a buffer circle with radius equal to tolerance around each point of each struct in ‘shape_struct_array’. If the tolerances of two points from two different geometries overlap, they are considered co-located.

Be careful about setting the tolerance to too high of a value - it may result in unexpected behavior. For more reading on this and integration in general, please see the ArcGIS Pro 3.0 Documentation: https://pro.arcgis.com/en/pro-app/latest/tool-reference/data-management/integrate.htm.

Parameters:
  • shape_struct_array (column or str or array) – Array of Shape struct with wkb, xmin, ymin, xmax, ymax

  • tolerance (column or str or float) – If points in the shape struct(s) in ‘shape_struct_array’ are co-located within the tolerance buffer they are assumed to represent the same location. In units of the spatial reference.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct.

SQL Example:

spark.sql('''SELECT ST_FromText('LINESTRING (0 0, 1 0)') AS SHAPE1,
    ST_FromText('LINESTRING (1.1 0, 2.1 0)') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_Integrate(array(SHAPE1, SHAPE2), 0.2)) AS SHAPE FROM df''')

Python Example:

from bdt.functions import st_integrate, st_fromText
df = spark \
    .createDataFrame([("LINESTRING (0 0, 1 0)",
    "LINESTRING (0 0, 1 0)")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(explode(st_integrate(array("SHAPE1", "SHAPE2"), 0.2)).alias("SHAPE"))
bdt.functions.st_reverseMultipath(struct)#

Reverse the order of points in a multipath.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax. Must be a multipath geometry such as a polyline.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql(
    '''SELECT ST_FromText("LINESTRING (30 10, 10 30, 40 40)") AS shape'''
).createOrReplaceTempView("df")

spark.sql('''SELECT ST_ReverseMultipath(shape) AS shape_reverse FROM df''')

Python Example:

df = spark.createDataFrame([
    ("LINESTRING (30 10, 10 30, 40 40)", 1)
], schema="wkt string, id int")

result_df = df.select(
    st_reverseMultipath(st_fromText("wkt")).alias("points")
)
bdt.functions.st_snap(point, geom)#

Given a point and another geometry, finds the closest point on the geometry to the point. The input point and geom must be in the web mercator aux 3857.

If the point is inside a polygon, it will snap to the closest side of the polygon.

If a point is equidistant from more than one position on the geometry, the point will be snapped to whichever position on the geometry is defined first. For example, given a Multipath with two lines: MULTILINESTRING((0 0, 2 0), (0 2, 2 2)) and a point in between them: POINT(1 1) the point will be snapped to POINT(1 0) since the (0 0, 2 0) segment comes first in the multipath. Subsequent equidistant closest points (like at POINT(1 2)) are ignored.

Parameters:
  • point (column or str) – The point shape struct being snapped.

  • geom (column or str) – The geometry shape struct being snapped to.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''
    SELECT
        ST_FromText('POINT(3.0 2.0)') as POINT,
        ST_FromText('MULTILINESTRING((0.0 0.0, 5.0 0.0, 10.0 0.0))') as LINE
''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Snap(POINT, LINE) AS SNAP_POINT FROM df''')

Python Example:

df = spark.createDataFrame([('POINT(3.0 2.0)', 'MULTILINESTRING((0.0 0.0, 5.0 0.0, 10.0 0.0))')], ["POINT", "LINE"])

df.select(st_snap(st_fromText("POINT"), st_fromText("LINE")).alias("SNAP_POINT"))
bdt.functions.st_snapToLine(point, line, heading=None)#

Given a point and polyline geometry (geometries must be in the web mercator aux 3857.), finds the closest point on the line to the point and returns the following information about the snap:

  • DIST_ON_LINE: the distance along the line from the start of the line to the snapped point.

  • DIST_TO_LINE: the distance to the line from the snapped point.

  • SNAP_SIDE: the side of the line the point was snapped to (‘L’ for left, ‘R’ for right, and ‘O’ for on the line).

  • SNAP_LINE_ALIGN: the alignment factor between the input line and the ‘snap line’ (the line from the snapped point to the original point). Between -1.0 and 1.0 with 1.0 meaning the two lines are pointing in the same direction and -1.0 meaning they are pointing in opposite directions.

  • HEADING_ALIGN: the alignment factor between the line and the point heading. Between -1.0 and 1.0 with 1.0 meaning the line and heading are pointing in the same direction and -1.0 meaning they are pointing in opposite directions. This is only included in the output if a heading is provided.

IMPORTANT: STSnapToLine is more expensive than STSnap and should ONLY be used when more than just the snap location is needed.

If a point is equidistant from more than one position on the line geometry, the point will be snapped to whichever position on the line geometry is defined first.

For example, given a Multipath with two lines: MULTILINESTRING((0 0, 2 0), (0 2, 2 2)) and a point in between them: POINT(1 1) the point will be snapped to POINT(1 0) since the (0 0, 2 0) segment comes first in the multipath. Subsequent equidistant closest points (like at POINT(1 2)) are ignored.

Parameters:
  • point (column or str) – The point shape struct being snapped. Must be in spatial reference 3857.

  • line (column or str) – The polyline geometry shape struct being snapped to. Must be in spatial reference 3857.

  • heading (column or str or float) – Optional. The heading in radians of the point. 0 is horizontal and positive is counter-clockwise.

Return type:

Column

Returns:

Column struct with SHAPE struct, 4 DoubleTypes, and 1 StringType

SQL Example:

spark.sql('''
    SELECT
        ST_FromText('POINT(3.0 2.0)') as POINT,
        ST_FromText('MULTILINESTRING((0.0 0.0, 5.0 0.0, 10.0 0.0))') as LINE,
        0.0 AS HEADING
''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_SnapToLine(POINT, LINE, HEADING) AS SNAP_RESULT FROM df''')

Python Example:

df = spark.createDataFrame([('POINT(3.0 2.0)', 'MULTILINESTRING((0.0 0.0, 5.0 0.0, 10.0 0.0))')], ["POINT", "LINE"])

df.select(st_snapToLine(st_fromText("POINT"), st_fromText("LINE"), 0.0).alias("SNAP_RESULT"))
bdt.functions.st_snap_to_street(x, y)#

Snap the input X and Y coordinate to the nearest street. The input coordinate must be in spatial reference Web Mercator 3857.

The output is a struct containing the snapped X and Y coordinate, as well as additional attributes. The additional attributes are:

KPH: The speed limit in kilometers per hour.
PUBLIC_ACCESS: Whether the path restricts public access.
LIMITED_ACCESS: Whether the road is limited access or not (gated community, etc.)
RESTRICT_CARS: Whether the path restricts cars.
PAVED: Whether the road is paved or not.
FUNC_CL: A hierarchical value used to determine a logical and efficient route for a traveler.
HIERARCHY: Similar to FUNC_CL, but with extra considerations like vehicle type restrictions.
ROAD_CL: A combination of a variety of conditions: Ramp, Ferry Type, Controlled Access, Intersection Category, and Functional Class.
For more information on the attributes:
2. Navigate to the release notes for File geodatabase (.gdb) and download the zipfile.
3. Unzip the release notes. Open help.htm. In the dropdown, navigate to Data Dictionary -> Line layers -> Streets.

The snap radius is 500 meters. If no street is found within 500 meters of the input coordinate, the output is null.

The output attributes of this function are subject to change in future releases.

Requires a LMDB network dataset.

Parameters:
  • x (column or str or float) – The input x ordinate in 3857 Web Mercator spatial reference.

  • y (column or str or float) – The input y ordinate in 3857 Web Mercator spatial reference.

Return type:

Column

Returns:

Struct with 2 DoubleTypes, 4 IntegerTypes, and 4 BooleanTypes

SQL Example:

spark.sql('''SELECT -13161875 AS X, 4035019.53758 AS Y''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(array(ST_SnapToStreet(X, Y))) FROM df''')

Python Example:

df = spark.createDataFrame([(-13161875, 4035019.53758)], ["X", "Y"])

df.select(inline(array(st_snap_to_street("X", "Y", true)))
bdt.functions.st_split_at_points(polyline_shape, multipoint_shape, tol, wkid)#

Split a polyline by the multipoint. Each point in the multipoint will split the polyline at its closest point on the polyline. The polyline and multipoint must be in the same spatial reference.

Note: st_split_at_points will emit an empty line if a split results in a line of length 0.

Parameters:
  • polyline_shape (column or str) – The polyline shape struct with wkb, xmin, ymin, xmax, ymax

  • multipoint_shape (column or str) – The multipoint shape struct with wkb, xmin, ymin, xmax, ymax

  • tol (column or str or float) – The tolerance for the split. If a point is outside the tolerance, the polyline will not be split by that point. In units of the spatial reference.

  • wkid (column or str or int) – The spatial reference id of the multipath and polyline.

Return type:

Column

Returns:

Column of Array of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 1), (2 2, 3 3))') AS MULTIPATH_SHAPE,
    ST_FromText('MULTIPOINT (0.2 0, 2.1 2.1)') AS MULTIPOINT_SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_SplitAtPoints(MULTIPATH_SHAPE, MULTIPOINT_SHAPE, 0.5, 4326) AS SHAPE FROM df''')

Python Example:

df = (spark.createDataFrame([
    ("MULTILINESTRING ((0 0, 1 1), (2 2, 3 3))", "MULTIPOINT (0.2 0, 2.1 2.1)"),
], schema='MULTIPATH_SHAPE string, MULTIPOINT_SHAPE string')
    .select(st_fromText("MULTIPATH_SHAPE").alias("MULTIPATH_SHAPE"), st_fromText("MULTIPOINT_SHAPE").alias("MULTIPOINT_SHAPE")))

df.select(st_split_at_points("MULTIPATH_SHAPE", "MULTIPOINT_SHAPE", 0.5, 4326).alias("SHAPE"))
bdt.functions.st_update(struct, structList)#

Mutate a given [[MultiVertexGeometry]] based on a given array of named struct with i, x, y fields. The i is the index. The x and y are the coordinates. This replaces x, y values at a given index. When the index is negative, it is interpreted as relative to the end of the array.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax. Must be a MultiVertexGeometry.

  • structList (column or str or array) – An array of named struct with i, x, y fields.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

 spark.sql('''SELECT ST_AsText(ST_Update(ST_FromText(LINESTRING(0 0, 1 1, 2 2)), array(named_struct('i', 1, 'x', 7d, 'y', 7d))))
   MULTILINESTRING ((0 0, 7 7, 2 2))''')

spark.sql('''SELECT ST_AsText(ST_Update(ST_FromText(LINESTRING(0 0, 1 1, 2 2)), array(named_struct('i', -3, 'x', 7d, 'y', 7d))))
   MULTILINESTRING ((7 7, 1 1, 2 2))''')

Python Example:

df = spark.createDataFrame([
    ("LINESTRING(0 0, 1 1, 2 2)", 1, 7.0, 7.0),
    ("LINESTRING(0 0, 1 1, 2 2)", -3, 7.0, 7.0)
], schema="wkt string, i int, x float, y float")

result_df = (
    df.select(st_fromText("wkt").alias("shape"), array(struct("i", "x", "y")).alias("structList"))
    .select(st_update("shape", "structList").alias("updated"))
)

Geometry Validation#

These functions evaluate the validity of an input SHAPE struct.

Functions#

st_is_valid(struct, wkid)

Return true if a given geometry is valid for the OGC standard.

bdt.functions.st_is_valid(struct, wkid)#

Return true if a given geometry is valid for the OGC standard. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_IsValid(SHAPE, 4326) AS IS_VALID FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_is_valid("SHAPE", 4326).alias("IS_VALID"))

Spatial Reference System Functions#

These processors and functions facilitate the conversion of the geometry in a SHAPE struct between Spatial Reference Systems.

See https://spatialreference.org/ for a helpful list of spatial references available to use.

Processors#

project(df, fromWk, toWk[, shapeField, cache])

Re-Project a geometry coordinate from a spatial reference to another spatial reference.

webMercator(df, latField, lonField[, ...])

Add x and y WebMercator coordinates to the output DataFrame.

bdt.processors.project(df, fromWk, toWk, shapeField='SHAPE', cache=False)#

Re-Project a geometry coordinate from a spatial reference to another spatial reference.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • fromWk (str) – wkid or wktext

  • toWk (str) – wkid or wktext

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Projection of a geometry coordinate from one wkid to another.

Example:

bdt.processors.project(
        df,
        fromWk = "4326",
        toWk = "3857",
        shapeField = "SHAPE",
        cache = False)
bdt.processors.webMercator(df, latField, lonField, xField='X', yField='Y', cache=False)#

Add x and y WebMercator coordinates to the output DataFrame. The input dataframe must have Latitude and Longitude columns.

Parameters:
  • df (DataFrame) – The input DataFrame

  • latField (str) – The name of the latitude column in the DataFrame.

  • lonField (str) – The name of the longitude column in the DataFrame.

  • xField (str) – The name of the output WebMercator x value column for the DataFrame.

  • yField (str) – The name of the output WebMercator y value column for the DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with x and y WebMercator coordinates.

Example:

bdt.processors.webMercator(
         df,
         latField = "lat",
         lonField = "lon",
         xField = "X",
         yField = "Y",
         cache = False)

Functions#

latToY(lat)

Return the conversion of a Latitude value to a WebMercator Y value.

lonToX(lon)

Return the conversion of a Longitude value to a WebMercator X value.

st_project(struct, from_wkid, to_wkid)

Project the shape struct from spatial reference 'from_wkid' to spatial reference 'to_wkid'

st_project2(struct, from_wktext, to_wkid)

Project the shape struct from spatial reference 'from_wktext' to spatial reference 'to_wkid'

st_project3(struct, from_wkid, to_wktext)

Project the shape struct from spatial reference 'from_wkid' to spatial reference 'to_wktext'

st_project4(struct, from_wktext, to_wktext)

Project the shape struct from spatial reference 'from_wktext' to spatial reference 'to_wktext'.

st_tolerance(wkid)

Return a tolerance value for the given spatial reference id.

st_webMercator(lon, lat)

Return the conversion of Latitude and Longitude values as a WebMercator (X, Y) Coordinate Pair.

xToLon(x)

Return the conversion of a Web Mercator X value to a Longitude value.

yToLat(y)

Return the conversion of a Web Mercator Y value to a Latitude value.

bdt.functions.latToY(lat)#

Return the conversion of a Latitude value to a WebMercator Y value.

Parameters:

lat (column or str or float) – The latitude value.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT LatToY(lat) AS Y FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(latToY("lat").alias("Y"))
bdt.functions.lonToX(lon)#

Return the conversion of a Longitude value to a WebMercator X value.

Parameters:

lon (column or str or float) – The longitude value.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT LonToX(lon) AS X FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(lonToX("lon").alias("X"))
bdt.functions.st_project(struct, from_wkid, to_wkid)#

Project the shape struct from spatial reference ‘from_wkid’ to spatial reference ‘to_wkid’

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax.

  • from_wkid (column or str or int) – The current well-known spatial reference id.

  • to_wkid (column or str or int) – The desired well-known spatial reference id.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Project(SHAPE, 4326, 3857) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_project("SHAPE", 4326, 3857).alias("SHAPE"))
bdt.functions.st_project2(struct, from_wktext, to_wkid)#

Project the shape struct from spatial reference ‘from_wktext’ to spatial reference ‘to_wkid’

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax.

  • from_wktext (column or str) – The current spatial reference well-known text.

  • to_wkid (column or str or int) – The desired well-known spatial reference id.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Project2(SHAPE, 'GEOGCS[..]', 3857) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_project2("SHAPE", "GEOGCS[..]", 3857).alias("SHAPE"))
bdt.functions.st_project3(struct, from_wkid, to_wktext)#

Project the shape struct from spatial reference ‘from_wkid’ to spatial reference ‘to_wktext’

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax.

  • from_wkid (column or str or int) – The current well-known spatial reference id.

  • to_wktext (column or str) – The desired spatial reference well-known text.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

 spark.sql('''SELECT ST_Project3(SHAPE, 4326, 'PROJCS[..]') AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(F.st_project3("SHAPE", 4326, "PROJCS[..]").alias("SHAPE"))
bdt.functions.st_project4(struct, from_wktext, to_wktext)#

Project the shape struct from spatial reference ‘from_wktext’ to spatial reference ‘to_wktext’.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax.

  • from_wktext (column or str) – The current spatial reference well-known text.

  • to_wktext (column or str) – The desired spatial reference well-known text.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Project4(SHAPE, 'GEOGCS[..]', 'PROJCS[..]') AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_project4("SHAPE", "GEOGCS[..]", "PROJCS[..]").alias("SHAPE"))
bdt.functions.st_tolerance(wkid)#

Return a tolerance value for the given spatial reference id.

Parameters:

wkid (column or str or int) – The well-known spatial reference id.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('SELECT ST_Tolerance(4326)')

Python Example:

df = spark.createDataFrame([(4326,)], ["WKID"])

df \
    .select(st_tolerance("WKID").alias("Tolerance"))
bdt.functions.st_webMercator(lon, lat)#

Return the conversion of Latitude and Longitude values as a WebMercator (X, Y) Coordinate Pair.

Parameters:
  • lon (column or str or float) – The longitude value.

  • lat (column or str or float) – The latitude value.

Return type:

Column

Returns:

Column of ArrayType with 2 DoubleTypes

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql(''' SELECT element_at(COORDS, 0) AS X, element_at(COORDS, 1) AS Y FROM (
    SELECT ST_WebMercator(lon, lat) AS COORDS FROM df
    )''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df \
    .select(st_webMercator("lon", "lat").alias("COORDS")) \
    .select(element_at(col("COORDS"), 0).alias("X"), element_at(col("COORDS"), 1).alias("Y"))
bdt.functions.xToLon(x)#

Return the conversion of a Web Mercator X value to a Longitude value.

Parameters:

x (column or str or float int) – The x value.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT 220624160 AS x, 20037508 AS y''').createOrReplaceTempView("df")

spark.sql('''SELECT XToLon(x) AS lon FROM df''')

Python Example:

df = spark.createDataFrame([(220624160, 20037508)], ["x", "y"])

df.select(xToLon("x").alias("lon"))
bdt.functions.yToLat(y)#

Return the conversion of a Web Mercator Y value to a Latitude value.

Parameters:

y (column or str or float) – The y value.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT 220624160 AS x, 20037508 AS y''').createOrReplaceTempView("df")

spark.sql('''SELECT yToLat(y) AS lat FROM df''')

Python Example:

df = spark.createDataFrame([(220624160, 20037508)], ["x", "y"])

df.select(yToLat("y").alias("lat"))

Well-Known Text (WKT)#

These processors and functions read Well-Known Text input and return a SHAPE struct.

Processors#

addShapeFromWKT(df, wktField, geometryType)

Add the SHAPE column based on the configured WKT field.

addWKT(df[, wktField, keep, shapeField, cache])

Add the SHAPE column based on the configured WKT field.

bdt.processors.addShapeFromWKT(df, wktField, geometryType, wkid=4326, keep=False, shapeField='SHAPE', cache=False)#

Add the SHAPE column based on the configured WKT field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

Parameters:
  • df (DataFrame) – The input DataFrame

  • wktField (str) – The name of the WKT field.

  • geometryType (str) – The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

  • wkid (int) – The spatial reference id.

  • keep (bool) – Whether to retain the WKT field or not.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column based on the WKT field.

Example:

bdt.processors.addShapeFromWKT(
    df,
    "WKT",
    "Point",
    wkid = 4326,
    keep = False,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.addWKT(df, wktField='wkt', keep=False, shapeField='SHAPE', cache=False)#

Add the SHAPE column based on the configured WKT field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

Parameters:
  • df (DataFrame) – The input DataFrame

  • wktField (str) – The name of the WKT field.

  • keep (bool) – Whether to retain the original GeoJson field or not.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column.

Example:

bdt.processors.addWKT(
    df,
    wktField = "wkt",
    keep = False,
    shapeField = "SHAPE",
    cache = False)

Functions#

st_fromText(wkt)

Convert the OGC Well-Known Text representation of a geometry to a BDT shape struct.

st_is_wkt_valid(wkt)

Return a boolean indicating whether the input WKT (Well-Known Text) representation of a geometry is valid.

bdt.functions.st_fromText(wkt)#

Convert the OGC Well-Known Text representation of a geometry to a BDT shape struct.

If an IllegalArgumentException: None error is thrown, it is likely that the WKT string is invalid. Use ST_IsWktValid to check if the WKT string is valid.

Parameters:

wkt (column or str) – The Well-Known Text representation of a geometry.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT 'POINT (1 1)' AS WKT''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_FromText(WKT) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"])

df.select(st_fromText(col("WKT")).alias("SHAPE"))
bdt.functions.st_is_wkt_valid(wkt)#

Return a boolean indicating whether the input WKT (Well-Known Text) representation of a geometry is valid.

Parameters:

wkt (column or str) – The WKT representation of a geometry.

Return type:

Column

Returns:

Column of BooleanType.

SQL Example:

spark.sql('''SELECT 'POINT (1, 1)' AS WKT''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_IsWKTValid(WKT) AS IS_VALID FROM df''')

Python Example 1:

df = spark \
    .createDataFrame([("POINT (1, 1)",)], ["WKT"])

df.select(st_is_wkt_valid(col("WKT")).alias("IS_VALID"))

Geometry Input - Other formats#

These processors and functions read data from various formats and return a SHAPE struct.

Processors#

addShapeFromGeoJson(df, geoJsonField, ...[, ...])

Add the SHAPE column based on the configured GeoJSON field.

addShapeFromJson(df, jsonField, geometryType)

Add the SHAPE column based on the configured JSON field.

addShapeFromWKB(df, wkbField, geometryType)

Add the SHAPE column based on the configured WKB field.

addShapeFromXY(df, xField, yField[, wkid, ...])

Add the SHAPE column based on the configured X field and Y field.

bdt.processors.addShapeFromGeoJson(df, geoJsonField, geometryType, wkid=4326, keep=False, shapeField='SHAPE', cache=False)#

Add the SHAPE column based on the configured GeoJSON field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

Parameters:
  • df (DataFrame) – The input DataFrame

  • geoJsonField (str) – The name of the GeoJson field.

  • geometryType (str) – The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

  • wkid (int) – The spatial reference id.

  • keep (bool) – Whether to retain the original GeoJson field or not.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column.

Example:

bdt.processors.addShapeFromGeoJson(
    df,
    "GEOJSON",
    "Point",
    wkid = 4326,
    keep = False,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.addShapeFromJson(df, jsonField, geometryType, wkid=4326, keep=False, shapeField='SHAPE', cache=False)#

Add the SHAPE column based on the configured JSON field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

Parameters:
  • df (DataFrame) – The input DataFrame

  • JsonField (str) – The name of the JSON field.

  • geometryType (str) – The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

  • wkid (int) – The spatial reference id.

  • keep (bool) – Whether to retain the JSON field or not.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column.

Example:

bdt.processors.addShapeFromJson(
    df,
    ["(A/B) as C"],
    cache = False)
bdt.processors.addShapeFromWKB(df, wkbField, geometryType, wkid=4326, keep=False, shapeField='SHAPE', cache=False)#

Add the SHAPE column based on the configured WKB field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

Parameters:
  • df (DataFrame) – The input DataFrame

  • wkbField (str) – The name of the WKB field.

  • geometryType (str) – The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

  • wkid (int) – The spatial reference id.

  • keep (bool) – Whether to retain the WKB field or not.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column.

Example:

bdt.processors.addShapeFromWKB(
    df,
    "WKB",
    "Point",
    wkid = 4326,
    keep = False,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.addShapeFromXY(df, xField, yField, wkid=4326, keep=False, shapeField='SHAPE', cache=False)#

Add the SHAPE column based on the configured X field and Y field.

Parameters:
  • df (DataFrame) – The input DataFrame

  • xField (str) – The name of the longitude field.

  • yField (str) – the name of the latitude field

  • wkid (int) – The spatial reference id.

  • keep (bool) – Whether to retain the original xField, yField or not.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column.

Example:

bdt.processors.addShapeFromXY(
    df,
    "x",
    "y",
    wkid = 4326,
    keep = False,
    shapeField = "SHAPE",
    cache = False)

Functions#

st_fromFGDBMultipoint(struct, zm)

Convert a FileGeodatabase Multipoint Shape to a BDT shape struct.

st_fromFGDBPoint(struct)

Convert a FileGeodatabase Point Shape to a BDT shape struct.

st_fromFGDBPolygon(struct)

Convert a FileGeodatabase Polygon Shape to a BDT shape struct.

st_fromFGDBPolyline(struct)

Convert a FileGeodatabase Polyline Shape to a BDT shape struct.

st_fromGeoJSON(geoJSON)

Convert the GeoJSON representation of a geometry to a BDT shape struct.

st_fromJSON(json)

Convert the JSON representation of a geometry to a BDT shape struct.

st_fromWKB(wkb)

Create and return shape struct from the Well-Known Binary representation of a geometry.

st_fromXY(lon, lat)

Create and return a point shape struct based on the given longitude and latitude.

st_from_esri_shape(bArr)

Create and return shape struct from the Esri shape binary representation.

st_load_polygon(structList)

Construct a Polygon from an array of named struct with i, x, y fields.

st_load_polyline(structList)

Construct a Polyline from an array of named struct with i, x, y fields.

bdt.functions.st_fromFGDBMultipoint(struct, zm)#

Convert a FileGeodatabase Multipoint Shape to a BDT shape struct. Designed to be used when reading a FGDB data source.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • zm (column or str) – String to indicate if Z or M values are present in the Multipoint. - None indicates no Z values - “Z” indicates Z values - “M” indicates M values - “ZM” indicates both Z and M values

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

df = spark                .read                .format("com.esri.gdb")                .options(path="...", name="...")                .load()                .withColumnRenamed("Shape", "FGDB_Shape")                .selectExpr("ST_FromFGDBMultipoint(FGDB_Shape, NULL) AS SHAPE", "*")

Python Example:

df = spark                .read                .format("com.esri.gdb")                .options(path="...", name="...")                .load()                .withColumnRenamed("Shape", "FGDB_Shape")                 .select(st_fromFGDBMultipoint("FGDB_Shape", lit(None)).alias("SHAPE"), "*")
bdt.functions.st_fromFGDBPoint(struct)#

Convert a FileGeodatabase Point Shape to a BDT shape struct. Designed to be used when reading a FGDB data source.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

df = spark                .read                .format("com.esri.gdb")                .options(path="...", name="...")                .load()                .withColumnRenamed("Shape", "FGDB_Shape")
        .selectExpr("ST_FromFGDBPoint(FGDB_Shape) AS SHAPE", "*")

Python Example:

df = spark                .read                .format("com.esri.gdb")                .options(path="...", name="...")                .load()                .withColumnRenamed("Shape", "FGDB_Shape")                 .select(st_fromFGDBPoint("FGDB_Shape").alias("SHAPE"), "*")
bdt.functions.st_fromFGDBPolygon(struct)#

Convert a FileGeodatabase Polygon Shape to a BDT shape struct. Designed to be used when reading a FGDB data source.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

df = spark\
        .read\
        .format("com.esri.gdb")\
        .options(path="...", name="...")\
        .load()\
        .withColumnRenamed("Shape", "FGDB_Shape")
        .selectExpr("ST_FromFGDBPolygon(FGDB_Shape) AS SHAPE", "*")

Python Example:

df = spark                .read\
        .format("com.esri.gdb")\
        .options(path="...", name="...")\
        .load()\
        .withColumnRenamed("Shape", "FGDB_Shape") \
        .select(st_fromFGDBPolygon("FGDB_Shape").alias("SHAPE"), "*")
bdt.functions.st_fromFGDBPolyline(struct)#

Convert a FileGeodatabase Polyline Shape to a BDT shape struct. Designed to be used when reading a FGDB data source.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

df = spark                .read\
        .format("com.esri.gdb")\
        .options(path="...", name="...")\
        .load()\
        .withColumnRenamed("Shape", "FGDB_Shape")
        .selectExpr("ST_FromFGDBPolyline(FGDB_Shape) AS SHAPE", "*")

Python Example:

df = spark                .read\
        .format("com.esri.gdb")\
        .options(path="...", name="...")\
        .load()\
        .withColumnRenamed("Shape", "FGDB_Shape") \
        .select(st_fromFGDBPolyline("FGDB_Shape").alias("SHAPE"), "*")
bdt.functions.st_fromGeoJSON(geoJSON)#

Convert the GeoJSON representation of a geometry to a BDT shape struct.

Parameters:

geoJSON (column or str) – The GeoJSON string representation of a geometry.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromGeoJSON('{
                "type": "Point",
                "coordinates": [1.0, 1.0]}') AS SHAPE FROM df''')

Python Example:

df = spark.createDataFrame([('''{
                "type": "Point",
                "coordinates": [1.0, 1.0]}''',)], ["GeoJSON"])

df.select(st_fromGeoJSON("GeoJSON").alias("SHAPE"))
bdt.functions.st_fromJSON(json)#

Convert the JSON representation of a geometry to a BDT shape struct.

Parameters:

json (column or str) – The JSON representation of a geometry.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromJSON('{"rings":[[[0,0],[0,7],[3,7],[3,0],[0,0]]],"spatialReference":{
"wkid":4326}}') AS SHAPE FROM df''')

Python Example:

df.select(st_fromJSON(lit('''{"rings":[[[0,0],[0,7],[3,7],[3,0],[0,0]]],"spatialReference":{
"wkid":4326}}''')).alias("SHAPE"))
bdt.functions.st_fromWKB(wkb)#

Create and return shape struct from the Well-Known Binary representation of a geometry.

Parameters:

wkb (column or str) – The Well-Known Binary representation of a geometry.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_FromWKB(SHAPE.WKB) AS WKB FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_fromWKB("SHAPE.WKB").alias("WKB"))
bdt.functions.st_fromXY(lon, lat)#

Create and return a point shape struct based on the given longitude and latitude.

Parameters:
  • lon (column or str or float) – The longitude value.

  • lat (column or str or float) – The latitude value.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''ST_FromXY(lon, lat) AS SHAPE FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(st_fromXY("lon", "lat").alias("SHAPE"))
bdt.functions.st_from_esri_shape(bArr)#

Create and return shape struct from the Esri shape binary representation.

Parameters:

bArr (column or str) – Esri binary shape representation

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark \
    .read \
    .parquet(f"/.../esriShape.parquet") \
    .selectExpr('''SELECT *, ST_FromEsriShape(EsriGeometry) SHAPE''')

Python Example:

spark \
    .read \
    .parquet(f"/.../esriShape.parquet") \
    .select(["*", st_from_esri_shape("EsriGeometry").alias("SHAPE")])
bdt.functions.st_load_polygon(structList)#

Construct a Polygon from an array of named struct with i, x, y fields. The i is the index. The x and y are the coordinates.

Designed to be used with simple polygon shapes. Multipart geometries and polygons with holes are not supported.

Parameters:

structList (column or str or array) – An array of named struct with i, x, y fields.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_AsText(ST_LoadPolygon(array(
      named_struct("i", 0, "x", 0.0d, "y", 0.0d),
      named_struct("i", 1, "x", 1.0d, "y", 1.0d),
      named_struct("i", 2, "x", 2.0d, "y", 2.0d))))''')

Python Example:

df = spark.createDataFrame([
    (0, 0.0, 0.0),
    (1, 1.0, 1.0),
    (2, 2.0, 2.0)
], schema="i int, x float, y float")

result_df = (df
    .select(array(struct("i", "x", "y")).alias("structList"))
    .select(st_load_polygon("structList").alias("polygon")))
bdt.functions.st_load_polyline(structList)#

Construct a Polyline from an array of named struct with i, x, y fields. The i is the index. The x and y are the coordinates.

Parameters:

structList (column or str or array) – An array of named struct with i, x, y fields.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_LoadPolyline(array(
      named_struct("i", 0, "x", 0.0d, "y", 0.0d),
      named_struct("i", 1, "x", 1.0d, "y", 1.0d)))''')

Python Example:

df = spark.createDataFrame([
    (0, 0.0, 0.0),
    (1, 1.0, 1.0)
], schema="i int, x float, y float")

result_df = (df
    .select(collect_list(struct("i", "x", "y")).alias("structList"))
    .select(st_load_polyline("structList").alias("polyline")))

Geometry Output - Other formats#

These functions convert a SHAPE struct into various formats of output.

Functions#

st_asGeoJSON(struct)

Return GeoJSON representation of the given shape struct.

st_asJSON(struct, wkid)

Return GeoJSON representation of the given shape struct.

st_asText(struct)

Return the Well-Known Text (WKT) representation of the shape struct.

bdt.functions.st_asGeoJSON(struct)#

Return GeoJSON representation of the given shape struct.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of StringType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_AsGeoJSON(SHAPE) AS GEOJSON FROM df''')

Python Example:

df = spark             .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_asGeoJSON("SHAPE").alias("GEOJSON"))
bdt.functions.st_asJSON(struct, wkid)#

Return GeoJSON representation of the given shape struct.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of StringType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_AsJSON(SHAPE, 4326) AS JSON FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_asJSON("SHAPE", 4326).alias("JSON"))
bdt.functions.st_asText(struct)#

Return the Well-Known Text (WKT) representation of the shape struct.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of StringType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_AsText("SHAPE") AS WKT FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_asText("SHAPE").alias("WKT"))

Topological Relationships#

These processors and functions evaluate topological relationships between two SHAPE structs, or a SHAPE struct and an extent.

Processors#

countWithin(ldf, rdf, cellSize[, ...])

Processor to produce a count of all left-hand side (lhs) features that are contained in a right-hand side (rhs) feature.

extentFilter(df, oper, extentList[, ...])

Processor to filter geometries that meet the operation relation with the given envelope.

intersection(ldf, rdf, cellSize[, mask, ...])

Produce the intersection geometries of two spatial datasets.

pip(ldf, rdf, cellSize[, emitPippedOnly, ...])

Enrich a source point feature with the attributes from a target polygon feature which contains the source point feature.

pointExclude(df, xmin, ymin, xmax, ymax[, ...])

Filter point features that are not in the specified extent, excluding points on the extent boundary.

pointInclude(df, xmin, ymin, xmax, ymax[, ...])

Filter point features that are in the specified extent, including points on the extent boundary.

relation(ldf, rdf, cellSize, relation[, ...])

Enrich left-hand side row with right-hand side row if the specified spatial relation is met.

summarizeWithin(ldf, rdf, cellSize, statsFields)

Processor to produce a count of all lhs features that are contained in a rhs feature.

bdt.processors.countWithin(ldf, rdf, cellSize, keepGeometry=False, emitContainsOnly=False, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none')#

Processor to produce a count of all left-hand side (lhs) features that are contained in a right-hand side (rhs) feature.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • keepGeometry (bool) – To keep the rhs ShapeStruct in the output or not.

  • emitContainsOnly (bool) – To only emit rhs geometries that contain lhs geometries.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcasted. Set spark.sql.autoBroadcastJoinThreshold

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

Return type:

DataFrame

Returns:

Count of features within a feature.

Example:

bdt.processors.countWithin(
    ldf,
    rdf,
    1.0,
    keepGeometry = False,
    emitContainsOnly = False,
    doBroadcast = False,
    lhsShapeField = "SHAPE",
    rhsShapeField = "SHAPE",
    cache = False,
    extentList = [-180.0, -90.0, 180.0, 90.0],
    depth = 16)
bdt.processors.extentFilter(df, oper, extentList, shapeField='SHAPE', cache=False)#

Processor to filter geometries that meet the operation relation with the given envelope.

The operation is executed on the geometry in relation to the envelope: read as ‘envelope intersects/overlaps/within/contains geometry’

Parameters:
  • df (DataFrame) – The input DataFrame

  • oper (str) – The type of operation

  • extentList (str) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Filtered geometries.

Example:

bdt.processors.extentFilter(
    df,
    "CONTAINS",
    extentList = [-180.0, -90.0, 180.0, 90.0],
    shapeField = "SHAPE",
    cache = False)
bdt.processors.intersection(ldf, rdf, cellSize, mask=-1, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none', prune=False)#

Produce the intersection geometries of two spatial datasets.

Each LHS row is an intersector. If it intersects a RHS row, a lhs row’s shape is replaced by the intersection geometry and LHS attributes are enriched with the RHS attributes.

If the intersection is empty, nothing will be emitted.

This implementation uses dimension mask as explained in http://esri.github.io/geometry-api-java/doc/Intersection.html It is NOT recommended to use mask values 3, 5, 6, and 7. These mask values could output different geometry types depending on the input geometries, resulting in a DataFrame with mixed geometry types in a single column. This could result in unexpected behavior when using the output DataFrame in subsequent operations. Full support for outputting different geometry types will be available in the 3.4 release.

Recommended mask values:
-1 (default): The output intersection geometry has the lowest dimension of the two input geometries.
1: The output intersection geometry is always point.
2: The output intersection geometry is always a polyline.
4: The output intersection geometry is always a polygon.

For optimal performance, the mask value should be manually set to the desired output geometry type. For example, if the desired output geometry type is a polygon, set the mask value to 4. A mask value of -1 will still work but may have slower performance.

The outgoing DataFrame will have the same SHAPE metadata as the RHS SHAPE metadata, and its SHAPE field name will be same as the LHS SHAPE field name.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • mask (int) – The dimension mask of the intersection.

  • cellSize (float) – The spatial partitioning cell size.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcasted. Set spark.sql.autoBroadcastJoinThreshold

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

  • prune (bool) – If set to true, QRs that don’t overlap with the geometry will be pruned during internal spatial partitioning. May provide a performance boost for complex/oddly shaped geometries, but is less efficient for regularly shaped geometries. Defaults to false.

Return type:

DataFrame

Returns:

Intersection geometries.

Example:

bdt.processors.intersection(
        ldf,
        rdf,
        cellSize = 0.1,
        mask = -1,
        doBroadcast = False,
        lhsShapeField = "SHAPE",
        rhsShapeField = "SHAPE",
        extentList = [-180.0, -90.0, 180.0, 90.0],
        cache = False)
bdt.processors.pip(ldf, rdf, cellSize, emitPippedOnly=False, take=1, clip=False, doBroadcast=False, pointShapeField='SHAPE', polygonShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none', prune=False)#

Enrich a source point feature with the attributes from a target polygon feature which contains the source point feature.

A point is enriched if it is inside a polygon. A point is not enriched if it is on an edge or outside of polygon. The first DataFrame must be of Point type and the second polygon must be of Polygon type.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • emitPippedOnly (bool) – When set to true, only enriched points are returned.

  • take (int) – The number of overlapped polygons to process. Use -1 to include all overlapping polygons in the result.

  • clip (bool) – When set to true, polygons are clipped using the cell while processing.

  • doBroadcast (bool) – Polygons are broadcast to all worker nodes. Set spark.sql.autoBroadcastJoinThreshold accordingly.

  • pointShapeField (str) – The name of the SHAPE Struct for point DataFrame.

  • polygonShapeField (str) – The name of the SHAPE Struct for polygon DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

  • prune (bool) – If set to true, QRs that don’t overlap with the geometry will be pruned during internal spatial partitioning. May provide a performance boost for complex/oddly shaped geometries, but is less efficient for regularly shaped geometries. If clip is True, prune is ignored. Defaults to false.

Return type:

DataFrame

Returns:

Enriched DataFrame for points that lie within a polygon.

Example:

bdt.processors.pip(
        ldf,
        rdf,
        1.0,
        emitPippedOnly = True,
        take = 1,
        clip = True,
        doBroadcast = False,
        pointShapeField = "SHAPE",
        polygonShapeField = "SHAPE",
        cache = False,
        extentList = [-180.0, -90.0, 180.0, 90.0],
        depth = 16)
bdt.processors.pointExclude(df, xmin, ymin, xmax, ymax, shapeField='SHAPE', cache=False)#

Filter point features that are not in the specified extent, excluding points on the extent boundary.

Parameters:
  • df (DataFrame) – The input DataFrame

  • xmin (float) – xmin value of the extent

  • ymin (float) – ymin value of the extent

  • xmax (float) – xmax value of the extent

  • ymax (float) – ymax value of the extent

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Filtered points.

Example:

bdt.processors.pointExclude(
        df,
        xmin = -10.0,
        ymin = -10.0,
        xmax = 10.0,
        ymax = 10.0,
        shapeField = "SHAPE",
        cache = False)
bdt.processors.pointInclude(df, xmin, ymin, xmax, ymax, shapeField='SHAPE', cache=False)#

Filter point features that are in the specified extent, including points on the extent boundary.

Parameters:
  • df (DataFrame) – The input DataFrame

  • xmin (float) – xmin value of the extent

  • ymin (float) – ymin value of the extent

  • xmax (float) – xmax value of the extent

  • ymax (float) – ymax value of the extent

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Filtered points.

Example:

bdt.processors.pointInclude(
        df,
        xmin = -10.0,
        ymin = -10.0,
        xmax = 10.0,
        ymax = 10.0,
        shapeField = "SHAPE",
        cache = False)
bdt.processors.relation(ldf, rdf, cellSize, relation, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none', prune=False)#

Enrich left-hand side row with right-hand side row if the specified spatial relation is met.

It’s possible that the left-hand side row will be emitted multiple time when there are multiple right-hand side rows satisfying the specified spatial relationship. When a left-hand side row finds no right-hand side rows satisfying the specified spatial relationship, it won’t be emitted.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • relation (str) – The spatial relationship. CONTAINS, EQUALS, TOUCHES, DISJOINT, INTERSECTS, CROSSES, OVERLAPS, WITHIN.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

  • prune (bool) – If set to true, QRs that don’t overlap with the geometry will be pruned during internal spatial partitioning. May provide a performance boost for complex/oddly shaped geometries, but is less efficient for regularly shaped geometries. Defaults to false.

Return type:

DataFrame

Returns:

Enriched left-hand side row with right-hand side row if the specified spatial relation is met.

Example:

bdt.processors.relation(
         ldf,
         rdf,
         cellSize = 0.1,
         relation = "TOUCHES",
         doBroadcast = False,
         lhsShapeField = "SHAPE",
         rhsShapeField = "SHAPE",
         cache = False,
         extentList = [-180.0, -90.0, 180.0, 90.0],
         depth = 16)
bdt.processors.summarizeWithin(ldf, rdf, cellSize, statsFields, keepGeometry=False, emitContainsOnly=False, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none')#

Processor to produce a count of all lhs features that are contained in a rhs feature.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • statsFields (List[str]) – The field names to compute statistics for.

  • keepGeometry (bool) – To keep the rhs ShapeStruct in the output or not.

  • emitContainsOnly (bool) – To only emit rhs geometries that contain lhs geometries.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

Return type:

DataFrame

Returns:

Count of all lhs features contained in a rhs feature.

Example:

bdt.processors.summarizeWithin(
         ldf,
         rdf,
         0.1,
         ["field1", "field2"],
         keepGeometry = False,
         emitContainsOnly = False,
         doBroadcast = False,
         lhsShapeField = "SHAPE",
         rhsShapeField = "SHAPE",
         cache = False,
         extentList = [-180.0, -90.0, 180.0, 90.0],
         depth = 16)

Functions#

st_contains(struct, another_struct, wkid)

Return true if one shape struct contains another shape struct.

st_coverage(struct, another_struct, ...)

Return the coverage fraction, coverage distance, and cosine similarity of segment1 on segment2 as an array.

st_crosses(struct, another_struct, wkid)

Return true if one shape struct crosses another shape struct.

st_disjoint(struct, another_struct, wkid)

Return true when the two shape structs are disjoint.

st_doBoxesOverlap(struct, another_struct)

Return true when the extents from the two given shape structs overlap.

st_equals(struct, another_struct, wkid)

Returns true if two geometries are equal.

st_intersects(struct, another_struct, wkid)

Return true if the two shape structs intersect.

st_isPointInExtent(struct, extent)

Return a boolean indicating if a point is in a given extent, including points on the extent boundary.

st_overlaps(struct, another_struct, wkid)

Return true if one geometry overlaps another geometry.

st_relate(struct, another_struct, wkid, de9im)

Return true if the given relation (DE-9IM value) holds for the two geometries.

st_touches(struct, another_struct, wkid)

Return true if one geometry touches another geometry.

st_within(struct, another_struct, wkid)

Return true if one geometry is within another geometry.

bdt.functions.st_contains(struct, another_struct, wkid)#

Return true if one shape struct contains another shape struct.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Contains(SHAPE1, SHAPE2, 4326) AS CONTAINS FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_contains("SHAPE1", "SHAPE2", 4326).alias("CONTAINS"))
bdt.functions.st_coverage(struct, another_struct, dist_threshold)#

Return the coverage fraction, coverage distance, and cosine similarity of segment1 on segment2 as an array.

The coverage fraction represents the fraction of segment1 that is covered by segment2. Its range is [0, 1]. The coverage distance measures how close segment1 is to segment2. It takes into account the input distance threshold. The coverage distance range is [1, 0). Coverage distance 1 means the actual distance between the segments is 0. As the segments become further and further apart, the coverage distance approaches 0.

A segment is a line with only 2 points. Thus, the input geometries must be lines with only 2 points.

Line geometries with multiple segments are not supported. Lines with multiple segments must be converted to single-segment lines using st_dump or st_dump_explode before they are passed to this function.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • dist_threshold (column or str or float) – A distance threshold above which the coverage distance between two lines will be very small.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of three DoubleTypes

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE1,
     ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Coverage(SHAPE1, SHAPE2, 10) AS COVERAGE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",
    "MULTILINESTRING ((0 0, 1 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_coverage("SHAPE", "SHAPE2", 10).alias("COVERAGE"))
bdt.functions.st_crosses(struct, another_struct, wkid)#

Return true if one shape struct crosses another shape struct. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE1,
     ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Crosses(SHAPE1, SHAPE2, 4326) AS CROSSES FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",
    "MULTILINESTRING ((0 0, 1 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_crosses("SHAPE1", "SHAPE2", 4326).alias("CROSSES"))
bdt.functions.st_disjoint(struct, another_struct, wkid)#

Return true when the two shape structs are disjoint. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE1,
     ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Disjoint(SHAPE1, SHAPE2, 4326) AS DISJOINT FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",
    "MULTILINESTRING ((0 0, 1 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_disjoint("SHAPE1", "SHAPE2", 4326).alias("DISJOINT"))
bdt.functions.st_doBoxesOverlap(struct, another_struct)#

Return true when the extents from the two given shape structs overlap. Return false otherwise.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_DoBoxesOverlap(SHAPE1, SHAPE2) AS OVERLAPS FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_doBoxesOverlap("SHAPE1", "SHAPE2").alias("OVERLAPS"))
bdt.functions.st_equals(struct, another_struct, wkid)#

Returns true if two geometries are equal. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Equals(A.SHAPE, B.SHAPE, 4326) AS EQUALS FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_equals("SHAPE1", "SHAPE2", 4326).alias("EQUALS"))
bdt.functions.st_intersects(struct, another_struct, wkid)#

Return true if the two shape structs intersect. Return false otherwise.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Intersects(SHAPE1, SHAPE2, 4326) AS INTERSECTS FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(explode(st_intersects("SHAPE1", "SHAPE2", 4326)).alias("INTERSECTS"))
bdt.functions.st_isPointInExtent(struct, extent)#

Return a boolean indicating if a point is in a given extent, including points on the extent boundary.

Parameters:
  • struct (column or str) – an expression. The point shape struct with wkb, xmin, ymin, xmax, ymax

  • extent (column or str or array) – an expression. The extent array with xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_IsPointInExtent(SHAPE, [-10, -10, 10, 10]) AS IS_IN_EXTENT FROM df''')

Python Example:

df = spark \
     .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
     .select(st_fromText(col("WKT")).alias("SHAPE"))

 df.select(st_isPointInExtent("SHAPE", array(-10, -10, 10, 10)).alias("IS_IN_EXTENT"))
bdt.functions.st_overlaps(struct, another_struct, wkid)#

Return true if one geometry overlaps another geometry. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Overlaps(SHAPE1, SHAPE2, 4326) AS OVERLAPS FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_overlaps("SHAPE1", "SHAPE2", 4326).alias("OVERLAPS"))
bdt.functions.st_relate(struct, another_struct, wkid, de9im)#

Return true if the given relation (DE-9IM value) holds for the two geometries. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

  • de9im (column or str) – The DE-9IM value.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Relate(SHAPE1, SHAPE2, 4326, '1*1***1**') AS RELATION FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_relate("SHAPE1", "SHAPE2", 4326, "1*1***1**").alias("RELATION"))
bdt.functions.st_touches(struct, another_struct, wkid)#

Return true if one geometry touches another geometry. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Touches(SHAPE1, SHAPE2, 4326) AS TOUCHES FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_touches("SHAPE1", "SHAPE2", 4326).alias("TOUCHES"))
bdt.functions.st_within(struct, another_struct, wkid)#

Return true if one geometry is within another geometry. Otherwise, return false.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Within(SHAPE1, SHAPE2, 4326) AS WITHIN FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_within("SHAPE1", SHAPE2", 4326).alias("WITHIN"))

Measurement Functions#

These processors and functions measure properties of an input SHAPE struct.

Processors#

area(df[, areaField, shapeField, cache])

Calculate the area of a geometry.

distance(ldf, rdf, cellSize, radius[, ...])

Per each left-hand side (lhs) feature, find the closest right-hand side (rhs) feature within the specified radius.

distanceAllCandidates(ldf, rdf, cellSize, radius)

Processor to find all features from the right-hand side dataset within a given search distance.

geodeticArea(df[, curveType, fieldName, ...])

Processor to calculate the geodetic areas of the input geometries.

length(df[, lengthField, shapeField, cache])

Calculate the length of a geometry.

nearestCoordinate(ldf, rdf, cellSize, snapRadius)

Processor to find nearest coordinate between two sources in a distributed parallel way.

bdt.processors.area(df, areaField='shape_area', shapeField='SHAPE', cache=False)#

Calculate the area of a geometry.

Parameters:
  • df (DataFrame) – The input DataFrame

  • areaField (str) – The name of the area column in the DataFrame

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Area of a geometry.

Example:

bdt.processors.area(
    df,
    areaField = "shape_area",
    shapeField = "SHAPE",
    cache = False)
bdt.processors.distance(ldf, rdf, cellSize, radius, emitEnrichedOnly=True, doBroadcast=False, distanceField='distance', lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none', distanceMethod='planar', prune=False)#

Per each left-hand side (lhs) feature, find the closest right-hand side (rhs) feature within the specified radius.

For the LHS feature type, it can be point, polyline, polygon types. For the RHS feature type, it can be point, polyline, polygon types.

The distance method defaults to planar and can be changed to geodesic by the distanceMethod parameter. Planar distance is always in the unit of the spatial reference and geodesic distance is always in meters.

Note: If there are duplicated rows in the lhs DataFrame, then only one copy of the row will be emitted. Therefore, it is possible that the amount of rows in the output DataFrame could be less than the amount of rows in the LHS DataFrame even if emitEnrichedOnly is set to false. To check for duplicates, DataFrame.dropDuplicates can be used. Then, compare the counts of the DataFrames before and after dropping duplicates. If all rows including duplicates must be retained, then a unique id can be added with the mid() funtion.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size in units of the spatial reference system.

  • radius (float) – The search radius in units of the spatial reference system. If distanceMethod is set to ‘geodesic’, the radius is in meters.

  • emitEnrichedOnly (bool) – If set to false, when no rhs features are found, a lhs feature is enriched with null values. If set to true, when no rhs features are found, the lhs feature is not emitted in the output.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold

  • distanceField (str) – The name of the distance field to be added.

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax]. Defaults to the extent of spatial reference 4326.

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

  • distanceMethod (str) – The distance method (planar, geodesic). Default is ‘planar’.

  • prune (bool) – If set to true, QRs that don’t overlap with the geometry will be pruned during internal spatial partitioning. May provide a performance boost for complex/oddly shaped geometries, but is less efficient for regularly shaped geometries. Defaults to false.

Return type:

DataFrame

Returns:

Calculation of the distance to the nearest feature with a new column.

Example:

bdt.processors.distance(
    ldf,
    rdf,
    0.1,
    0.003,
    emitEnrichedOnly = True,
    doBroadcast = False,
    distanceField = "distance",
    lhsShapeField = "SHAPE",
    rhsShapeField = "SHAPE",
    cache = False,
    extentList = [-180.0, -90.0, 180.0, 90.0],
    depth = 16)

bdt.processors.distanceAllCandidates(ldf, rdf, cellSize, radius, emitEnrichedOnly=True, doBroadcast=False, distanceField='distance', lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none', distanceMethod='planar', prune=False)#

Processor to find all features from the right-hand side dataset within a given search distance.

Enriches lhs features with the attributes of rhs features found within the specified distance.

The distance method defaults to planar and can be changed to geodesic by the distanceMethod parameter. Planar distance is always in the unit of the spatial reference and geodesic distance is always in meters.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size in units of the spatial reference system.

  • radius (float) – The search radius in units of the spatial reference system. If distanceMethod is set to ‘geodesic’, the radius is in meters.

  • emitEnrichedOnly (bool) – If set to false, when no rhs features are found, a lhs feature is enriched with null values. If set to true, when no rhs features are found, the lhs feature is not emitted in the output.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold

  • distanceField (str) – The name of the distance field to be added.

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax]. Defaults to the extent of spatial reference 4326.

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

  • distanceMethod (str) – The distance method (planar, geodesic). Default is ‘planar’.

  • prune (bool) – If set to true, QRs that don’t overlap with the geometry will be pruned during internal spatial partitioning. May provide a performance boost for complex/oddly shaped geometries, but is less efficient for regularly shaped geometries. Defaults to false.

Return type:

DataFrame

Returns:

Calculation of the distance to the nearest feature with a new column.

Example:

bdt.processors.distanceAllCandidates(
    ldf,
    rdf,
    1.0,
    1.0,
    emitEnrichedOnly = True,
    doBroadcast = False,
    distanceField = "distance",
    lhsShapeField = "SHAPE",
    rhsShapeField = "SHAPE",
    cache = False,
    extentList = [-180.0, -90.0, 180.0, 90.0],
    depth = 16)
bdt.processors.geodeticArea(df, curveType=0, fieldName='geodetic_area', shapeField='SHAPE', cache=False)#

Processor to calculate the geodetic areas of the input geometries.

This is more reliable measurement of large area than ProcessorArea because it takes into account the Earth’s curvature but may increase processing time.

Parameters:
  • df (DataFrame) – The input DataFrame

  • curveType (int) – The curve type, represented as an Integer.

  • fieldName (str) – The name of the field to hold geodetic area value.

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Geodetic areas.

Example:

bdt.processors.geodeticArea(
        df,
        curveType=0,
        fieldName="geodetic_area",
        shapeField="SHAPE",
        cache = False)
bdt.processors.length(df, lengthField='shape_length', shapeField='SHAPE', cache=False)#

Calculate the length of a geometry.

Parameters:
  • df (DataFrame) – The input DataFrame

  • lengthField (str) – The name of the length column in the DataFrame

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Length.

Example:

bdt.processors.length(
        df,
        lengthField = "shape_length",
        shapeField = "SHAPE",
        cache = False)
bdt.processors.nearestCoordinate(ldf, rdf, cellSize, snapRadius, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', distanceField='distance', xField='X', yField='Y', isOnRightField='isOnRight', distanceForNotSnapped=-999.0, cache=False, extentList=None, depth=16, acceleration='none')#

Processor to find nearest coordinate between two sources in a distributed parallel way.

The left-hand-side dataframe must be points. The right-hand-side dataframe can be a point, polyline, or polygon type.

The points on the left-hand-side input are augmented with closest feature attributes on the right-hand-side input. The input features must be in a projected coordinate system that is appropriate for distance calculations.

Prior to BDT 3.2, if the RHS feature type was a point, the output distance was calculated using geodesic distance and the unit of measurement was in meters (geodesic distance always outputs distance in meters). Now, the nearest coordinate is always calculated using planar distance regardless of input geometry type.

Planar distance is always in the unit of the spatial reference and geodesic distance is always in meters.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input points DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • snapRadius (float) – The snapping radius.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcasted. Set spark.sql.autoBroadcastJoinThreshold

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • distanceField (str) – The name of the distance field to be added.

  • xField (str) – The name of the field for the x coordinate of the nearest snapped point.

  • yField (str) – The name of the field for the y coordinate of the nearest snapped point.

  • isOnRightField (str) – The name of the isOnRight field. True if the closest coordinate is to the right of the RHS feature.

  • distanceForNotSnapped (float) – The distance to use when a given point is not snapped.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

Return type:

DataFrame

Returns:

Dataframe that augments LHS with nearest RHS features with snapped points

Example:

bdt.processors.nearestCoordinate(
        ldf,
        rdf,
        cellSize = 1.0,
        snapRadius = 1.0,
        doBroadcast = False,
        lhsShapeField = "SHAPE",
        rhsShapeField = "SHAPE",
        distanceField = "distance",
        xField = "X",
        yField = "Y",
        isOnRightField = "isOnRight",
        distanceForNotSnapped = -999.0,
        cache = False,
        extentList = [-180.0, -90.0, 180.0, 90.0],
        depth = 16)

Functions#

st_area(struct)

Return the area of a given shape struct.

st_distance(struct, another_struct)

Calculates the 2D planar distance between two shape structs.

st_distanceWGS84Points(struct, another_struct)

Calculate and return the distance in meters between the two shape structs representing points in WGS84.

st_euclid(struct, another_struct)

Calculates the euclidian distance between two point shape structs.

st_frechet_distance(struct, another_struct)

Calculate and return the discrete Frechet distance between two multi-vertex geometries.

st_geodesicDistance(struct, another_struct, wkid)

Calculate and return the geodesic distance between two geometries.

st_geodesicLength(struct, wkid)

Calculate and return the geodesic length of the geometry in meters.

st_geodeticArea(struct, wkid, curve_type)

Return the geodetic area of a given shape struct.

st_haversineDistance(x1, y1, x2, y2)

Return the haversine distance in kilometers between two WGS84 points.

st_length(struct)

Return the length of a given shape struct.

bdt.functions.st_area(struct)#

Return the area of a given shape struct. The area will be calculated in units of the spatial reference.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Area(SHAPE) AS Area FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_area("SHAPE").alias("Area"))
bdt.functions.st_distance(struct, another_struct)#

Calculates the 2D planar distance between two shape structs.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (0 0)') AS SHAPE1,
         ST_FromText('POINT (1 1)') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Distance(SHAPE1, SHAPE2) AS DISTANCE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (0 0)", "POINT (1 1)")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_distance("SHAPE1", "SHAPE2").alias("DISTANCE"))
bdt.functions.st_distanceWGS84Points(struct, another_struct)#

Calculate and return the distance in meters between the two shape structs representing points in WGS84.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (0 0)') AS SHAPE1,
         ST_FromText('POINT (1 1)') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_DistanceWGS84Points(SHAPE1, SHAPE2) AS DISTANCE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (0 0)", "POINT (1 1)")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_distanceWGS84Points("SHAPE1", "SHAPE2").alias("DISTANCE"))
bdt.functions.st_euclid(struct, another_struct)#

Calculates the euclidian distance between two point shape structs. The distance will be in the units of the input points spatial reference.

Parameters:
  • struct (column or str) – The point shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The point shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (0 0)') AS SHAPE1,
         ST_FromText('POINT (1 1)') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Euclid(SHAPE1, SHAPE2) AS EUCLIDIAN_DIST FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (0 0)", "POINT (1 1)")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_euclid("SHAPE1", "SHAPE2").alias("EUCLIDIAN_DIST"))
bdt.functions.st_frechet_distance(struct, another_struct)#

Calculate and return the discrete Frechet distance between two multi-vertex geometries. Typically used to measure the similarity between two lines. Returns NaN if either geometry is empty.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax. Must be multi-vertex.

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax. Must be multi-vertex.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('LINESTRING (0.0 0.0, 1.0 0.0)') as SHAPE1,
    ST_FromText('LINESTRING (0.0 1.0, 1.0 1.0, 1.0 2.0)') as SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_FrechetDistance(SHAPE1, SHAPE2) AS DISTANCE FROM df''')

Python Example:

df = spark\
    .createDataFrame([("LINESTRING (0.0 0.0, 1.0 0.0)",
                       "LINESTRING (0.0 1.0, 1.0 1.0, 1.0 2.0)")], ["WKT1", "WKT2"])\
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_frechet_distance("SHAPE1", "SHAPE2").alias("DISTANCE"))
bdt.functions.st_geodesicDistance(struct, another_struct, wkid)#

Calculate and return the geodesic distance between two geometries.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON((0 0, 0 1, 1 1, 1 0))') as SHAPE1,
    ST_FromText('POLYGON((2 2, 2 3, 3 3, 3 2))') as SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_GeodesicDistance(SHAPE1, SHAPE2, 4326) AS DISTANCE FROM df''')

Python Example:

df = spark\
    .createDataFrame([("POLYGON((0 0, 0 1, 1 1, 1 0))",
                       "POLYGON((2 2, 2 3, 3 3, 3 2))")], ["WKT1", "WKT2"])\
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_geodesicDistance("SHAPE1", "SHAPE2", 4326).alias("DISTANCE"))
bdt.functions.st_geodesicLength(struct, wkid)#

Calculate and return the geodesic length of the geometry in meters.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((10 40, 40 30, 30 10))') as SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_GeodesicLength(SHAPE, 4326) AS LENGTH FROM df''')

Python Example:

df = spark\
    .createDataFrame([("MULTILINESTRING ((10 40, 40 30, 30 10))",)], ["WKT"])\
    .select(st_fromText(col("WKT")).alias("SHAPE"), lit(4326).alias("WKID"))

df.select(st_geodesicLength("SHAPE", "WKID").alias("LENGTH"))
bdt.functions.st_geodeticArea(struct, wkid, curve_type)#

Return the geodetic area of a given shape struct.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

  • curve_type (column or str or int) – The geodetic curve type.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

 spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_GeodeticArea(SHAPE, 4326, 1) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_geodeticArea("SHAPE", 4326, 1).alias("AREA"))
bdt.functions.st_haversineDistance(x1, y1, x2, y2)#

Return the haversine distance in kilometers between two WGS84 points.

Parameters:
Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT 180 AS lon1, 90 AS lat1, -180 AS lon2, -90 AS lat2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_HaversineDistance(lon1, lat1, lon2, lat2) AS HAVERSINE FROM df''')

Python Example:

df = spark.createDataFrame([180, 90, -180, -90], ["lon1", "lat1", "lon2", "lat2"])

df.select(st_haversineDistance("lon1", "lat1", "lon2", "lat2").alias("HAVERSINE"))
bdt.functions.st_length(struct)#

Return the length of a given shape struct.

Parameters:

struct (column or str) – an expression. The point shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Length(SHAPE) AS LENGTH FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_length("SHAPE").alias("LENGTH"))

Overlay Functions#

These processors and functions overlay SHAPE structs and provide output based on set-theoretical operations.

Processors#

difference(ldf, rdf, cellSize[, ...])

Per each left-hand side (lhs) feature, find all relevant right-hand side (rhs) features and take the set-theoretic difference of each rhs feature with the lhs feature.

dissolve(df[, fields, singlePart, ...])

Dissolve features either by both attributes and spatially or spatially only.

dissolve2(df[, fields, singlePart, sort, ...])

Dissolve features by both attributes and spatially or spatially only.

erase(ldf, rdf, cell_size[, mode, ...])

Erase the overlapping area (take the difference) of each lhs polygon with all of its overlapping rhs polygons.

symmetricDifference(ldf, rdf, cellSize[, ...])

Per each left-hand side (lhs) feature, find all relevant right-hand side (rhs) features and take the set-theoretic symmetric difference of each rhs feature with the lhs feature.

bdt.processors.difference(ldf, rdf, cellSize, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none')#

Per each left-hand side (lhs) feature, find all relevant right-hand side (rhs) features and take the set-theoretic difference of each rhs feature with the lhs feature.

For the LHS feature type, it can be point, polyline, polygon types. For the RHS feature type, it can be point, polyline, polygon types.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcasted. Set spark.sql.autoBroadcastJoinThreshold

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

Return type:

DataFrame

Returns:

Set-theoretic difference of each rhs feature with respect to lhs feature.

Example:

bdt.processors.difference(
    ldf,
    rdf,
    0.1,
    doBroadcast = False,
    lhsShapeField = "SHAPE",
    rhsShapeField = "SHAPE",
    cache = False,
    extentList = [-180.0, -90.0, 180.0, 90.0],
    depth = 16)
bdt.processors.dissolve(df, fields=None, singlePart=True, shapeField='SHAPE', cache=False)#

Dissolve features either by both attributes and spatially or spatially only.

Performs the topological union operation on two geometries. The supported geometry types are multipoint, polyline and polygon. SinglePart indicates whether the returned geometry is multipart or not.

If singlepart is set to False, it is possible for this processor to produce a dataframe with mixed geometries. In the case where a group has one geometry a single part geometry is produced. If a group has more than one geometry, the geometries are unioned together producing a multipart geometry. To make the output dataframe geometries homogeneous, convert either the input geometries or the output geometries into multipart geometries. This can be done with ST_MultiPoint for multipoints, ST_LoadPolygon for polygons, and ST_LoadPolyline for polylines. For processor Dissolve, Point features must be converted to multipoint features using ST_Multipoint before running the processor.

Example of converting to multipart before processor dissolve:

df = (spark.createDataFrame([
        ("A", 1.0, 1.0),
        ("B", 2.0, 2.0),
        ("B", 3.0, 3.0)
    ], schema="id string, X float, Y float")
    .selectExpr("id", "ST_MultiPoint(array(array(X, Y))) AS SHAPE")
    .withMeta("Multipoint", 4326)
    )

bdt.processors.dissolve(
    df,
    fields = ["id"],
    singlePart = False,
    shapeField = "SHAPE",
    cache = False)

After ST_MultiPoint, processor Dissolve will return all multipoint geometries.

Parameters:
  • df (DataFrame) – The input DataFrame

  • fields (list) – The collection of field names to dissolve by.

  • singlePart (bool) – Indicates whether the returned geometry is Multipart or not.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • shapeField (str) – The name of the SHAPE struct field.

Return type:

DataFrame

Returns:

Dissolved features.

Example:

bdt.processors.dissolve(
    df,
    fields = ["TYPE"],
    singlePart = False,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.dissolve2(df, fields=None, singlePart=True, sort=False, shapeField='SHAPE', cache=False)#

Dissolve features by both attributes and spatially or spatially only. Performs the topological union operation on two geometries. The supported geometry types are multipoint, polyline and polygon. singlePart indicates whether the returned geometry is multipart or not.

If singlepart is set to False, it is possible for this processor to produce a dataframe with mixed geometries. In the case where a group has one geometry a single part geometry is produced. If a group has more than one geometry, the geometries are unioned together producing a multipart geometry. To make the output dataframe geometries homogeneous, convert either the input geometries or the output geometries into multipart geometries. This can be done with ST_MultiPoint for points, ST_LoadPolygon for polygons, and ST_LoadPolyline for polylines.

Example of converting to multipart before processor dissolve2:

df = (spark.createDataFrame([
        ("A", 1.0, 1.0),
        ("B", 2.0, 2.0),
        ("B", 3.0, 3.0)
    ], schema="id string, X float, Y float")
    .selectExpr("id", "ST_MultiPoint(array(array(X, Y))) AS SHAPE")
    .withMeta("Multipoint", 4326)
    )

bdt.processors.dissolve2(
    df,
    fields = ["id"],
    singlePart = False,
    sort = False,
    shapeField = "SHAPE",
    cache = False)

Without ST_MultiPoint, processor Dissolve2 would return a POINT for group “A” and a MULTIPOINT for group “B”. However, with ST_MultiPoint, all output geometries are multipoints.

Parameters:
  • df (DataFrame) – The input DataFrame

  • fields (list) – The collection of field names to dissolve by.

  • singlePart (bool) – Indicates whether the returned geometry is Multipart or not.

  • sort (bool) – Sort by the fields or not

  • cache (bool) – To persist the outgoing DataFrame or not.

  • shapeField (str) – The name of the SHAPE struct field.

Return type:

DataFrame

Returns:

Dissolved features.

Example:

bdt.processors.dissolve2(
    df,
    fields = ["TYPE"],
    singlePart = True,
    sort = False,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.erase(ldf, rdf, cell_size, mode='union', do_broadcast=False, lhs_shape_field='SHAPE', rhs_shape_field='SHAPE', acceleration='none', extent_list=None, depth=16)#

Erase the overlapping area (take the difference) of each lhs polygon with all of its overlapping rhs polygons.

This processor is similar to ProcessorDifference, but different in a key way: This processor will emit one row for each lhs polygon, which is the difference of the lhs polygon with ALL of its overlapping rhs polygons, whereas ProcessorDifference will emit a separate row for the difference between each lhs polygon and each of its overlapping rhs polygon(s).

The “mode” parameter controls how the overlapping areas are erased, and is an optimization parameter. Changing this parameter will not change the output of the processor, but may change the performance. The two options are “union” and “disjoint”. The default is “union”, but it is still recommended to try both.

The description of the two modes is as follows. Each description is run in the context of a single lhs polygon:

The “union” mode will first union all of the relevant rhs polygons together, and then run a single erase operation on the lhs polygon with the unioned rhs polygon. Union mode is probably the better choice if:

1. The polygons in the rhs dataset have a small area on average.
2. The polygons in the rhs dataset overlap or a very dense.
3. The chosen cell size is small.

The “disjoint” mode will iterate through each relevant rhs polygon, and at each iteration it will run an erase operation on the result of the last erase operation with the current rhs polygon. The result of the last erase operation on the first iteration is just the lhs polygon itself. Disjoint mode is probably the better choice if:

1. The polygons in the rhs dataset are complex, have a large area on average, or span many cells.
2. The polygons in the rhs dataset are sparse or spread out.
3. The chosen cell size is large.

The output DataFrame will have the same schema as the lhs DataFrame. All rhs DataFrame columns will be dropped.

This processor will append an additional column to the input LHS DataFrame called “__INTERNAL_BDT_ID__”. This column will not be part of the output DataFrame. If the input LHS DataFrame already has a column with this name, it must be removed or an exception will be thrown.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame.

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame.

  • cell_size (float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data.

  • mode (str) – The mode of the erase operation. The default is “union”.

  • do_broadcast (bool) – Whether to broadcast the RHS DataFrame. The default is false.

  • lhs_shape_field (str) – The name of the LHS DataFrame SHAPE Struct column. The default is “SHAPE”.

  • rhs_shape_field (str) – The name of the RHS DataFrame SHAPE Struct column. The default is “SHAPE”.

  • acceleration (str) – The acceleration type. The default is “none”.

  • extent_list (Optional[List[float]]) – The spatial index extent. The default is “world” in 4326. This must be changed if working in a different spatial reference.

  • depth (int) – The spatial index depth. The default is 16.

Return type:

DataFrame

Returns:

DataFrame. The erased overlapping area (difference) of each lhs polygon with all of its overlapping rhs polygons.

Example:

ldf = spark.createDataFrame([ \
    ("POLYGON ((2 1, 6 1, 6 3, 2 3, 2 1))",),], schema="WKT string") \
    .selectExpr("ST_FromText(WKT) AS SHAPE") \
    .withMeta("POLYGON", 4326)

rdf = spark.createDataFrame([ \
    ("POLYGON ((1 1, 3 1, 3 3, 1 3, 1 1))",),\
    ("POLYGON ((5 1, 7 1, 7 3, 5 3, 5 1))",)], schema="WKT string") \
    .selectExpr("ST_FromText(WKT) AS SHAPE") \
    .withMeta("POLYGON", 4326)

bdt.processors.erase(
    ldf,
    rdf,
    cell_size = 4.0,
    mode = "union",
    do_broadcast = False,
    lhs_shape_field = "SHAPE",
    rhs_shape_field = "SHAPE",
    acceleration = "none",
    extent_list = [-180.0, -90.0, 180.0, 90.0],
    depth = 16)
bdt.processors.symmetricDifference(ldf, rdf, cellSize, doBroadcast=False, lhsShapeField='SHAPE', rhsShapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none')#

Per each left-hand side (lhs) feature, find all relevant right-hand side (rhs) features and take the set-theoretic symmetric difference of each rhs feature with the lhs feature.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • doBroadcast (bool) – If set to true, rhs DataFrame is broadcasted. Set spark.sql.autoBroadcastJoinThreshold

  • lhsShapeField (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhsShapeField (str) – The name of the SHAPE Struct field in the RHS DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

Return type:

DataFrame

Returns:

The set-theoretic symmetric difference of ll lhs features contained in a rhs feature.

Example:

bdt.processors.symmetricDifference(
         ldf,
         rdf,
         0.1,
         doBroadcast = False,
         lhsShapeField = "SHAPE",
         rhsShapeField = "SHAPE",
         cache = False,
         extentList = [-180.0, -90.0, 180.0, 90.0],
         depth = 16)

Functions#

st_difference(struct1, struct2)

Take the set-theoretic difference of a shape struct with another shape struct.

st_differenceArray(struct, structArray)

Take the set-theoretic difference of a shape struct with an array of shape structs.

st_intersection(struct, another_struct, ...)

Performs the Topological Intersection operation on the two given shape structs.

st_iou(left_shape_struct, right_shape_struct)

Calculate the ratio of the area of intersection of the two shape structs over the area of the union of the two shape structs.

st_union(lhs, rhs)

Returns the union of the left hand side geometry with right hand side geometry.

st_union_agg(geom)

Return the aggregated union of all geometries in a column.

bdt.functions.st_difference(struct1, struct2)#

Take the set-theoretic difference of a shape struct with another shape struct.

Parameters:
  • struct1 (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

  • struct2 (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Difference(SHAPE1, SHAPE2) AS SHAPE FROM df''')

Python Example:

df = spark \
     .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
     "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
     .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

 df.select(st_difference("SHAPE1", "SHAPE2").alias("SHAPE"))
bdt.functions.st_differenceArray(struct, structArray)#

Take the set-theoretic difference of a shape struct with an array of shape structs.

Parameters:
  • struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

  • structArray (column or str or array) – An array of 0 or more shape structs.

Return type:

Column

Returns:

Column of SHAPE Struct.

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE, array(ST_FromText('POLYGON (0
0, 1 0, 1 1, 0 1, 0 0)')) AS SHAPE_ARRAY''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_DifferenceArray(SHAPE, SHAPE_ARRAY) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE"), array(st_fromText(col("WKT2"))).alias("SHAPE_ARRAY"))
df.select(st_differenceArray("SHAPE", "SHAPE_ARRAY").alias("SHAPE"))
bdt.functions.st_intersection(struct, another_struct, wkid, mask)#

Performs the Topological Intersection operation on the two given shape structs. Array of struct with wkb, xmin, ymin, xmax, ymax Use explode() function to flatten out.

This implementation uses dimension mask as explained in http://esri.github.io/geometry-api-java/doc/Intersection.html It is NOT recommended to use mask values 3, 5, 6, and 7. These mask values could output different geometry types depending on the input geometries, resulting in a DataFrame with mixed geometry types in a single column. This could result in unexpected behavior when using the output DataFrame in subsequent operations. Full support for outputting different geometry types will be available in the 3.4 release.

Recommended mask values:
-1 (default): The output intersection geometry has the lowest dimension of the two input geometries.
1: The output intersection geometry is always point.
2: The output intersection geometry is always a polyline.
4: The output intersection geometry is always a polygon.

For optimal performance, the mask value should be manually set to the desired output geometry type. For example, if the desired output geometry type is a polygon, set the mask value to 4. A mask value of -1 will still work but may have slower performance.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference id.

  • mask (column or str or int) – The dimension mask of the intersection.

Return type:

Column

Returns:

Column of ArrayType with SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1,
         ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_Intersection(SHAPE1, SHAPE2, 4326, -1)) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
    "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(explode(st_intersection("SHAPE1", "SHAPE2", 4326, -1)).alias("SHAPE"))
bdt.functions.st_iou(left_shape_struct, right_shape_struct)#

Calculate the ratio of the area of intersection of the two shape structs over the area of the union of the two shape structs. If the two shapes are the same, the ratio will be 1. If the two shapes barely intersect, then the ratio will be close to 0.

If either one of left_shape_struct and right_shape_struct are not a polygon geometry, then the return value will always be 0.0.

Parameters:
  • left_shape_struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

  • right_shape_struct (column or str) – Shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE1,
     ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE2''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_IoU(SHAPE1, SHAPE2) AS IOU FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",
    "MULTILINESTRING ((0 0, 1 0))")], ["WKT1", "WKT2"]) \
    .select(st_fromText(col("WKT1")).alias("SHAPE1"), st_fromText(col("WKT2")).alias("SHAPE2"))

df.select(st_iou("SHAPE1", "SHAPE2").alias("IOU"))
bdt.functions.st_union(lhs, rhs)#

Returns the union of the left hand side geometry with right hand side geometry.

Parameters:
  • lhs (column or str) – A shape struct with wkb, xmin, ymin, xmax, ymax

  • rhs (column or str) – A shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (30 10)') as LEFT,
    ST_FromText('POINT (20 10)') as RIGHT''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Union(LEFT, RIGHT) as union_res FROM df''')

Python Example:

df = spark.createDataFrame([
    ("POINT (30 10)", "POINT (20 10)")
], schema="wkt1 string, wkt2 string")

result_df = df.select(
    st_union(st_fromText("wkt1"), st_fromText("wkt2")).alias("union")
)
bdt.functions.st_union_agg(geom)#

Return the aggregated union of all geometries in a column. Must be used with group by expression.

It is possible for this function to produce a dataframe with mixed geometries. In the case where a group has one geometry a single part geometry is produced. If a group has more than one geometry, the geometries are unioned together producing a multipart geometry. To make the output dataframe geometries homogeneous, convert either the input geometries or the output geometries into multipart geometries. This can be done with ST_MultiPoint for points, ST_LoadPolygon for polygons, and ST_LoadPolyline for polylines.

Example of converting to multipart before union agg:

df = self.spark.createDataFrame([
        ("A", 1.0, 1.0),
        ("B", 2.0, 2.0),
        ("B", 3.0, 3.0)
    ], schema="id string, X float, Y float")

 df.selectExpr("ID", "ST_MultiPoint(array(array(X, Y))) AS SHAPE")\
    .groupBy("ID")\
    .agg(st_union_agg("SHAPE").alias("SHAPE"))

Without ST_MultiPoint, UnionAgg would return a POINT for group “A” and a MULTIPOINT for group “B”. However, with ST_MultiPoint, all output geometries are multipoints.

Parameters:

geom (column or str) – the column of geometries to aggregate with union on.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

df = spark.sql(
    '''
    SELECT ID, ST_AsText(ST_UnionAgg(GEOM)) as UNION_AGG
    FROM
        (SELECT ST_FromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))') as GEOM, 1 as ID UNION
         SELECT ST_FromText('POLYGON ((40 10, 40 40, 20 40, 10 20, 40 10))') as GEOM, 1 as ID)
    GROUP BY ID
    '''
)

Python Example:

df = spark.createDataFrame([
    (1, "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))"),
    (1, "POLYGON ((40 10, 40 40, 20 40, 10 20, 40 10))"),
], schema="id int, wkt string")

result_df = df.select(
    "id", st_fromText("wkt").alias("geom")
).groupby("id").agg(st_asText(st_union_agg("geom")).alias("UNION_AGG"))

Geometry Processing#

Processors#

These processors and functions process a SHAPE struct in order to create an output SHAPE struct that can be used for analysis.

buffer(df, distance[, shapeField, cache])

Buffer geometries by a radius.

centroid(df[, shapeField, cache])

Produces centroids of features.

generalize(df[, maxDeviation, ...])

Generalize geometries using Douglas-Peucker algorithm.

simplify(df[, forceSimplify, ogc, ...])

Simplify a geometry to make it valid for a Geodatabase to store without additional processing.

stay(df, track_id_field, time_field, ...[, ...])

Detects stays in a track.

bdt.processors.buffer(df, distance, shapeField='SHAPE', cache=False)#

Buffer geometries by a radius.

Parameters:
  • df (DataFrame) – The input DataFrame

  • distance (float) – The buffer distance.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Buffer geometries.

Example:

bdt.processors.buffer(
    df,
    2.0,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.centroid(df, shapeField='SHAPE', cache=False)#

Produces centroids of features.

Parameters:
  • df (DataFrame) – The input DataFrame

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Centroids.

Example:

bdt.processors.centroid(
    df,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.generalize(df, maxDeviation=0.0, removeDegenerateParts=False, shapeField='SHAPE', cache=False)#

Generalize geometries using Douglas-Peucker algorithm.

Parameters:
  • df (DataFrame) – The input DataFrame

  • maxDeviation (float) – The maximum allowed deviation from the generalized geometry to the original geometry. If maxDeviation <= 0 the processor returns the input geometries.

  • removeDegenerateParts (bool) – When true, the degenerate parts of the geometry will be removed from the output.

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Generalized geometries.

Example:

bdt.processors.generalize(
    df,
    maxDeviation = 0.001,
    removeDegenerateParts = False,
    shapeField = "SHAPE",
    cache = False)
bdt.processors.simplify(df, forceSimplify=False, ogc=False, shapeField='SHAPE', cache=False)#

Simplify a geometry to make it valid for a Geodatabase to store without additional processing.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • forceSimplify (bool) – When true, the Geometry will be simplified regardless of the internal IsKnownSimple flag.

  • ogc (bool) – When true, simplification follows the OGC specification for the Simple Feature Access v. 1.2.1 (06-103r4).

  • shapeField (str) – The name of the SHAPE struct field.

  • cache (bool) – Persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

A simplified geometry.

Example:

bdt.processors.simplify(
         df,
         forceSimplify = False,
         ogc = False,
         shapeField = "SHAPE",
         cache = False)
bdt.processors.stay(df, track_id_field, time_field, center_type, dist_threshold, time_threshold, x_field, y_field, wkid, temporal_salt_field=None, qr_salt_field=None)#

Detects stays in a track. Each track must have a unique ID in Long format.

The algorithm uses a “look-forward” approach. It starts at the first point in the track and labels that point as the start point. Then, it looks forward to the following points to see if they can be put in a stay together with the start point.

When a stay is created, the algorithm moves on to the next point immediately following the last point in the stay and labels that point as the new start point. If a stay is not found for the start point, the algorithm moves on to the next point in the track and labels that point as the new start point.

A point can be put in a stay with the start point if the distance between that point and start point is less than the specified distance threshold, Distance is calculated as geodesic if the points are in a GCS spatial reference, and planar if the points are in a PCS spatial reference.

As soon as a point is found that exceeds the distance threshold, the algorithm checks if the time between the last point in the stay and the start point is greater than the time threshold. If it is, the stay is emitted. If it is not, the algorithm moves on.

The algorithm continues until it has iterated through the whole track. If a point is not part of a stay, it is not emitted.

The location of a stay is the “center point” of the stay. The center can be calculated in two ways: “standard” and “time-weighted’. The standard center is the spatial average of all the points in the stay. The time-weighted center is the weighted spatial average of all the points in the stay. Each point is weighted by the time between it and the next point.

This implementation closely follows the pseudocode at https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Mining20user20similarity20based20on20location20history.pdf, with a few additional improvements.

Use the “temporal_field” and “qr_field” parameters to salt the data and divide it into smaller pieces. This is useful when the input data has tracks with a lot of records, and therefore will use up a lot of data. Keep in mind that if an input track has a stay that spans a temporal or QR boundary, it will be broken up into different stays.

Output Schema:

ID: integer, long, or string (nullable = false) The ID of the track the stay belongs to.
STAY_CENTER_X: double (nullable = false) The center point x ordinate of the stay.
STAY_CENTER_Y: double (nullable = false) The center point y ordinate of the stay.
N: integer (nullable = false) - The number of points in the stay.
STAY_START_TIME: long (nullable = false) The start time of the stay in Unixtime.
STAY_END_TIME: long (nullable = false) The end time of the stay in Unixtime.
STAY_DURATION: long (nullable = false) The duration of the stay in seconds.
Parameters:
  • df (DataFrame) – The input DataFrame.

  • track_id_field (str) – The name of the unique track ID column in the input DataFrame. The ID type can be integer, long, or string.

  • time_field (str) – The name of the time column in the input DataFrame. Must be Long type in Unix Timestamp format.

  • center_type (str) – The calculation method of the center point. Must be either “standard” or “time-weighted”.

  • dist_threshold (float) – The distance threshold in meters.

  • time_threshold (float) – The time threshold in seconds.

  • x_field (str) – The name of the x ordinate column in the input DataFrame.

  • y_field (str) – The name of the y ordinate column in the input DataFrame.

  • wkid (int) – The well-known ID of the spatial reference.

  • temporal_salt_field (str) – The name of the temporal column in the input DataFrame. Used for salting. Must be an integer type. An example could be day of the year. Defaults to None.

  • qr_salt_field (str) – The name of the QR column in the input DataFrame. Used for salting. Defaults to None.

Return type:

DataFrame

Returns:

DataFrame. Each output row represents a stay.

Example:

df = (
  spark.createDataFrame([
      (1, 1.0, 0.0, 1, 1),
      (1, 2.0, 0.0, 2, 1),
      (1, 3.0, 0.0, 3, 2),
      (1, 4.0, 0.0, 4, 2),
      (1, 6.0, 0.0, 5, 1),
      (1, 7.0, 0.0, 6, 1),
      (1, 8.0, 0.0, 7, 2),
      (1, 9.0, 0.0, 8, 2)], schema="trackid long, x double, y double, timestamp long, DD int")
        .selectExpr("*", "xyToQR(X, Y, 6.0) AS QR")
    )

  bdt.processors.stay(df,
                      track_id_field="trackid",
                      time_field="t",
                      center_type="standard",
                      dist_threshold=200,
                      time_threshold=10,
                      x_field="x",
                      y_field="y",
                      wkid=4326,
                      temporal_salt_field="DD",
                      qr_salt_field="QR")

Functions#

st_buffer(struct, distance, wkid[, max_vertices])

Create and return a buffer polygon of the shape struct as specified by the distance and wkid.

st_bufferGeodesic(struct, curveType, ...)

Create and return a geodesic buffer polygon of the shape struct as specified by the curveType, distance, wkid, and maxDeviation.

st_centroid(struct)

Return centroid of shape struct.

st_concave(arr)

Accept an array of arrays of lon, lat representing an array of points.

st_convex(arr)

Accept an array of arrays of lon, lat representing an array of points.

st_createNeighbors(struct, interval, max_dist)

Per a web mercator point, return additional points north, south, west, east directions in the given interval up to the given max distance.

st_generalize(struct, max_deviation, ...)

Return the generalization of a geometry in its shape struct representation.

st_generalize2(struct, max_deviation, ...)

Return the generalization of a geometry in its shape struct representation.

st_simplify(struct, wkid, force_simplify)

Simplify the given shape struct.

st_simplifyOGC(struct, wkid, force_simplify)

Simplify a geometry to make it valid for a Geodatabase to store without additional processing.

bdt.functions.st_buffer(struct, distance, wkid, max_vertices=96)#

Create and return a buffer polygon of the shape struct as specified by the distance and wkid.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • distance (column or str or float) – The distance.

  • wkid (column or str or int) – The spatial reference id.

  • max_vertices (column or str or int) – Optional. The maximum number of vertices in a circle (or curve). Buffering can result in polygons with curved corners. Setting this to a higher value will result in a more accurate representation of curves on buffered polygons, but could also negatively impact performance. Defaults to 96. Minimum is 12. If the value is less than 12 it is set to 12 internally.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Buffer(SHAPE, 1.0, 4326, 96) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_buffer("SHAPE", 1.0, 4326, 96).alias("SHAPE"))
bdt.functions.st_bufferGeodesic(struct, curveType, distance, wkid, maxDeviation)#

Create and return a geodesic buffer polygon of the shape struct as specified by the curveType, distance, wkid, and maxDeviation.

The options for curveType are:

  • Geodesic

  • Loxodrome

  • GreatElliptic

  • NormalSection

  • ShapePreserving

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • curveType (column or str) – The geodetic curve type of the segments. If the curveType is ShapePreserving, then the segments are densified in the projection where they are defined before buffering.

  • distance (column or str or float) – The distance in Meters.

  • wkid (column or str or int) – The spatial reference id

  • maxDeviation (column or str or float) – The deviation offset to use for convergence. The geodesic arcs of the resulting buffer will be closer than the max deviation of the true buffer. Pass in ‘NaN’ to use the default deviation.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_BufferGeodesic(SHAPE, 'ShapePreserving', 1.0, 4326, 'NaN') AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_bufferGeodesic("SHAPE", "ShapePreserving", 1.0, 4326, "NaN").alias("SHAPE"))
bdt.functions.st_centroid(struct)#

Return centroid of shape struct.

Parameters:

struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Centroid(SHAPE) AS CENTROID FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_centroid("SHAPE").alias("CENTROID"))
bdt.functions.st_concave(arr)#

Accept an array of arrays of lon, lat representing an array of points. Return a concave hull geometry.

Parameters:

arr (column or str or array) – array(array(float,float)) - an expression. An array of arrays of lon, lat. Each outer array must contain at least 3 arrays of lon, lat.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

self.spark.createDataFrame([
    ('A', 1.0, 1.0),
    ('A', 1.0, 5.0),
    ('A', 2.0, 2.0),
    ('A', 2.0, 4.0),
    ('A', 4.0, 2.0),
    ('A', 4.0, 4.0),
    ('A', 5.0, 1.0),
    ('A', 5.0, 5.0)
], ["id", "x", "y"]).createOrReplaceTempView("df")

spark.sql('''SELECT ID, ST_Concave(collect_list(array(X,Y)))) AS SHAPE FROM df GROUP BY ID''')

Python Example:

df = self.spark.createDataFrame([
    ('A', 1.0, 1.0),
    ('A', 1.0, 5.0),
    ('A', 2.0, 2.0),
    ('A', 2.0, 4.0),
    ('A', 4.0, 2.0),
    ('A', 4.0, 4.0),
    ('A', 5.0, 1.0),
    ('A', 5.0, 5.0)
], ["id", "x", "y"])

df \
    .groupBy('id') \
    .agg(collect_list(array('x','y')).alias('pts')) \
    .select(["id", st_concave(col('pts')).alias('SHAPE')])
bdt.functions.st_convex(arr)#

Accept an array of arrays of lon, lat representing an array of points. Return a convex hull geometry.

Parameters:

arr (column or str or array) – array(array(float,float)) - an expression. An array of arrays of lon, lat. If less than 3 arrays of lon, lat are supplied, an invalid geometry will be returned.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

self.spark.createDataFrame([
    ('A', 1.0, 1.0),
    ('A', 1.0, 5.0),
    ('A', 2.0, 2.0),
    ('A', 2.0, 4.0),
    ('A', 4.0, 2.0),
    ('A', 4.0, 4.0),
    ('A', 5.0, 1.0),
    ('A', 5.0, 5.0)
], ["id", "x", "y"]).createOrReplaceTempView("df")

 spark.sql(```SELECT ID, ST_Convex(collect_list(array(X,Y)))) AS SHAPE FROM df GROUP BY ID```)

Python Example:

df = self.spark.createDataFrame([
    ('A', 1.0, 1.0),
    ('A', 1.0, 5.0),
    ('A', 2.0, 2.0),
    ('A', 2.0, 4.0),
    ('A', 4.0, 2.0),
    ('A', 4.0, 4.0),
    ('A', 5.0, 1.0),
    ('A', 5.0, 5.0)
], ["id", "x", "y"])

df \
    .groupBy('id') \
    .agg(collect_list(array('x','y')).alias('pts')) \
    .select(["id", st_convex(col('pts')).alias('SHAPE')])
bdt.functions.st_createNeighbors(struct, interval, max_dist)#

Per a web mercator point, return additional points north, south, west, east directions in the given interval up to the given max distance. The input shape must be in 3857 web mercator spatial reference. The output is an array of (Seq(r,c), Shape struct). The output is in 3857 web mercator spatial reference.

Use inline() to unpack.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • interval (column or str or float) – The interval distance in meters.

  • max_dist (column or str or float) – The maximum distance in meters.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of SHAPE Struct and ArrayType with 2 LongTypes.

SQL Example:

spark.sql('''SELECT 220624160 AS x, 20037508 AS y''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_CreateNeighbors(SHAPE, 400, 800))''')

Python Example:

df = spark \
    .createDataFrame([("POINT (220624160 20037508)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_createNeighbors("SHAPE", 400, 800).alias("NBRS")) \
    .selectExpr("inline(NBRS)")
bdt.functions.st_generalize(struct, max_deviation, remove_degenerateParts)#

Return the generalization of a geometry in its shape struct representation. Returns an empty geometry if the max deviation is too large.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • max_deviation (column or str or float) – The maximum allowed deviation from the generalized geometry to the original geometry.

  • remove_degenerateParts (column or str or bool) – When true, the degenerate parts of the geometry will be removed from the output. May remove small polylines when true.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0 ))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Generalize(SHAPE, 0.0, false) FROM df''')

Python Example:

df = spark \
    .createDataFrame(["POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0 ))"], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_generalize("SHAPE", 0.0, False).alias("SHAPE"))
bdt.functions.st_generalize2(struct, max_deviation, remove_degenerateParts)#

Return the generalization of a geometry in its shape struct representation. In generalize2, when remove degenerate parts is true and the input geometry is a line with length less than maxDeviation, then the line will be emitted with points at the start and end.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • max_deviation (column or str or float) – The maximum allowed deviation from the generalized geometry to the original geometry.

  • remove_degenerateParts (column or str or bool) – When this is true, the degenerate parts of the geometry will be removed from the output. Generalization can result in a geometry to overlapping with itself. This overlapping part is considered degenerate.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0 ))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Generalize2(SHAPE, 0.0, false) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0 ))",)], ["WKT"]) \
    .select(st_fromText("WKT").alias("SHAPE"))

df.select(st_generalize2("SHAPE", 0.0, False).alias("SHAPE"))
bdt.functions.st_simplify(struct, wkid, force_simplify)#

Simplify the given shape struct.

This can be useful to reduce memory usage and to be able to visualize the geometry in ArcGIS.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference ID.

  • force_simplify (column or str or bool) – When True, the Geometry will be simplified regardless of the internal IsKnownSimple flag.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_Simplify(SHAPE, 4326, false) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_simplify("SHAPE", 4326, False).alias("SHAPE"))
bdt.functions.st_simplifyOGC(struct, wkid, force_simplify)#

Simplify a geometry to make it valid for a Geodatabase to store without additional processing. Follows the OGC specification for the Simple Feature Access v. 1.2.1 (06-103r4).

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • wkid (column or str or int) – The spatial reference ID.

  • force_simplify (column or str or bool) – When True, the Geometry will be simplified regardless of the internal IsKnownSimple flag.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_SimplifyOGC(SHAPE, 4326, false) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_simplifyOGC("SHAPE", 4326, False).alias("SHAPE"))

Business Analyst#

These processors and functions perform business analysis such as enrichment based on supported Business Analyst (BA) variables.

Processors#

ba_variables()

Returns a DataFrame with a list of supported Business Analyst (BA) variables for geoenrichment.

enrich(poly_df[, var_df, variable_field, ...])

This processor receives two DataFrames:

bdt.processors.ba_variables()#

Returns a DataFrame with a list of supported Business Analyst (BA) variables for geoenrichment. On each row, a BA variable is listed along with 2 other columns representing its Description and Data Type. This function has no arguments.

The schema of the output DataFrame is:

root
 |-- Variable: string (nullable = false)
 |-- Description: string (nullable = false)
 |-- Data_Type: string (nullable = false)

As of 2024, there are 19,322 BA variables are supported for enrichment in BDT.

This processor should be used in conjunction with ProcessorEnrich in almost all cases.

Returns:

BA Variable Name, Description, and Data Type

Return type:

A DataFrame with 3 columns

Example:

# Processor ba_variables should be used in a workflow with processor enrich.

poly_df = (
    spark
        .sql("SELECT ST_FromWKT('POLYGON (( -8589916 4722261, -8559808 4722117,
            -8557660 4694677, -8590986 4694254))') AS SHAPE")
        .withMeta("Polygon", 3857)
)

var_df = bdt.processors.ba_variables()

bdt.processors.enrich(
    poly_df,
    var_df,
    variable_field="Variable",
    sliding=20
)
bdt.processors.enrich(poly_df, var_df=None, variable_field='Variable', sliding=20, shape_field='SHAPE')#

This processor receives two DataFrames:

  1. DataFrame of Polygons in Web Mercator Aux (3857)

  2. Esri Business Analyst variables.

It appends apportioned business analyst dataset variables to each polygon.

As of 2024, there are 19,459 BA variables are supported for enrichment in BDT. To get a DataFrame with all supported variables listed, use processor ba_variables. It is highly recommended to use this processor to generate the required variables DataFrame.

This processor will append null values for the enrichment variables if polygons are empty.

Parameters:
  • poly_df (DataFrame) – The input DataFrame of polygons in Web Mercator Aux (3857).

  • var_df (DataFrame) – The input DataFrame of Business Analyst variables. Each variable must be on a different row.

  • variable_field (str) – The column name in var_df of the Business Analyst variables. Default “Variable”.

  • sliding (int) – The number of study areas to include in enrich() call. Default 20.

  • shapeField (str) – The name of the SHAPE Struct for DataFrame. Default “SHAPE”.

Return type:

DataFrame

Returns:

Appended BA dataset variables.

Example:

poly_df = (
    spark
        .sql("SELECT ST_FromWKT('POLYGON (( -8589916 4722261, -8559808 4722117,
            -8557660 4694677, -8590986 4694254))') AS SHAPE")
        .withMeta("Polygon", 3857)
)

var_df = bdt.processors.ba_variables()

bdt.processors.enrich(
    poly_df,
    var_df,
    variable_field="Variable",
    sliding=20
)

Functions#

st_enrich(struct_array)

Accepts an array of SHAPES representing polygons in spatial reference Web Mercator Aux (3857).

bdt.functions.st_enrich(struct_array)#

Accepts an array of SHAPES representing polygons in spatial reference Web Mercator Aux (3857). Returns an array of apportioned Business Analyst (BA) variables.

This function will always return the full list of supported Business Analyst (BA) variables. As of 2024, there are 19,459 BA variables are supported for enrichment in BDT. If you would like to select which variables are enriched, use processor enrich with processor ba_variables instead.

Use inline() to unpack.

Parameters:

struct_array (array or str) – An array The shape struct with wkb, xmin, ymin, xmax, ymax

Returns:

(TOTPOP_CY, HHPOP_CY, FAMPOP_CY, …)

Return type:

Column of ArrayType with InternalRow of Business Analyst Variables

SQL Example:

spark.sql('''SELECT ST_Envelope(-13046128, 4035563, -13043076, 4038779) AS SHAPE''')             .createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_Enrich(array(SHAPE)) FROM df''')

Python Example:

df = (
    spark
        .createDataFrame([(-13046128, 4035563, -13043076, 4038779)], ["xmin", "ymin", "xmax", "ymax"])                 .select(st_envelope("xmin", "ymin", "xmax", "ymax").alias("SHAPE"))
)

(
df
    .select(st_enrich(array(col("SHAPE"))).alias("ENRICH"))
    .selectExpr("inline(ENRICH)")
)

Hex#

IMPORTANT: BDT Hex functions and processors will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

These processors and functions use built-in BDT hex tiling functionality.

Processors#

hex(df, hexFields[, shapeField, cache])

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release.

hexCentroidsAlongLine(df, hexSize[, ...])

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release.

hexNeighbors(df, hexField, size[, nbrField, ...])

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release.

hexToPolygon(df, hexField, size[, keep, ...])

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release.

bdt.processors.hex(df, hexFields, shapeField='SHAPE', cache=False)#

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

Enrich a point DataFrame with hex column and row values for the given sizes.

The input DataFrame must be the point type and its spatial reference should be WGS84 (4326). The size is in meters.

Parameters:
  • df (DataFrame) – The input DataFrame

  • hexFields (dict) – A map of hex field name and hex size.

  • shapeField (str) – The name of the SHAPE Struct for DataFrame.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with hex ID column(s).

Example:

bdt.processors.hex(
        df,
        hexFields = {"H100": 100.0, "H200": 200.0},
        shapeField = "SHAPE",
        cache = False)
bdt.processors.hexCentroidsAlongLine(df, hexSize, hexField='HEX_ID', xField='X', yField='Y', keep=False, shapeField='SHAPE', cache=False)#

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

This processor takes the street lines, generates hexagon ids along the lines, then generates the centroids of the hexagons. This processor calls distinct() on the output data frame.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • hexSize (float) – The hex size in meters.

  • hexField (str) – The field name to retain hex column and row values. The default is HEX_ID.

  • xField (str) – The field name to retain the hex centroid X. The default is X.

  • yField (str) – The field name to retain the hex centroid Y. The default is Y.

  • keep (bool) – To retain the input fields or not in the output data frame. The default is false.

  • shapeField (str) – The name of the SHAPE struct field in the input data frame which will be replaced with the centroid points.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Centroids along a line.

Example:

bdt.processors.hexCentroidsAlongLine(
        df,
        hexFields = {"H100": 100.0, "H200": 200.0},
        shapeField = "SHAPE",
        cache = False)
bdt.processors.hexNeighbors(df, hexField, size, nbrField='neighbors', cache=False)#

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

The input data frame must have a column containing hexagon QR codes. The SHAPE field is not required.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • hexField (str) – The field name to retain hex column and row values. The default is HEX_ID.

  • size (float) – The hexagon size in meters.

  • nbrField (str) – The name of the output field containing the neighbor hexagon QR values.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Hex neighbors.

bdt.processors.hexToPolygon(df, hexField, size, keep=False, shapeField='SHAPE', cache=False)#

IMPORTANT: This processor and all BDT Hex processors will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

Replace or create the SHAPE struct field containing hexagon polygons based on the configured hex column row field. The outgoing DataFrame will have the spatial reference of 3857 (102100) WGS84 Web Mercator (Auxiliary Sphere).

Parameters:
  • df (DataFrame) – The input DataFrame.

  • hexField (str) – The field name to retain hex column and row values. The default is HEX_ID.

  • size (float) – The hexagon size in meters.

  • keep (bool) – Set true to keep the hex field. Set false to drop the hex field.

  • shapeField (str) – The name of the SHAPE struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

SHAPE field containing hexagon polygons.

Example:

bdt.processors.hexToPolygon(
        df,
        hexField = "H100",
        size = 100.0,
        keep = False,
        shapeField = "SHAPE",
        cache = False)

Functions#

st_asHex(lon, lat, hexSize)

For a given longitude, latitude, and size in meters, return a hex qr code string delimited by :.

st_fromHex(hexQR, hexSize)

IMPORTANT: This function and all BDT Hex functions will be deprecated in the next major version BDT release.

st_hexNbrs(hex_id, hex_size)

IMPORTANT: This function and all BDT Hex functions will be deprecated in the next major version BDT release.

bdt.functions.st_asHex(lon, lat, hexSize)#

For a given longitude, latitude, and size in meters, return a hex qr code string delimited by :.

Parameters:
  • lon (column or str or float) – longitude

  • lat (column or str or float) – latitude

  • hexSize (column or str or float) – Hexagon size.

Return type:

Column

Returns:

Column of StringType

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_AsHex(lon, lat, 500.0) as HEXCODE_500 FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(st_asHex("lon", "lat", 500.0).alias("HEXCODE_500"))
bdt.functions.st_fromHex(hexQR, hexSize)#

IMPORTANT: This function and all BDT Hex functions will be deprecated in the next major version BDT release. It is recommended to use H3 functions instead.

For a given Hex QR code and size, return a polygon shape struct. The ‘hexSize’ parameter denotes the length of one side of the bounding box of the hex (not the area).

The returned polygon is in the spatial reference of WGS_1984_Web_Mercator_Auxiliary_Sphere 3857 (102100).

Parameters:
  • hexQR (column or str) – The QR code of the hex.

  • hexSize (column or str or float) – The size of the hex.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

df = spark.sql('''SELECT ST_AsHex(180, 90, 200) AS H200''')

spark.sql('''SELECT ST_FromHex(H200, 200) AS SHAPE FROM df''')

Python Example:

df = spark \
        .createDataFrame([180, 90], ["lon", "lat"])\
        .select(st_asHex("lon", "lat", 200).alias("H200"))

df.select(st_fromHex("H200", 200).alias("SHAPE"))
bdt.functions.st_hexNbrs(hex_id, hex_size)#

IMPORTANT: This function and all BDT Hex functions will be deprecated in the next major version BDT release. It is recommended to use H3 functions instead.

Returns the hex ids of neighbors. Use explode() to flatten.

Parameters:
  • hex_id (column or str or float) – The hex ID

  • hex_size (column or str or float) – The hex size in meters.

Return type:

Column

Returns:

Column of ArrayType with LongType.

SQL Example:

spark.sql('''SELECT 1 AS HEX_ID''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_HexNbrs(HEX_ID, 500.0)) AS HEX_NBRS FROM df''')

Python Example:

df = spark.createDataFrame([1], ["HEX_ID"])

df.select(explode(st_hexNbrs("HEX_ID", 500.0)).alias("HEX_NBR"))

Linear Referencing#

These functions facilitate analysis of line features.

Functions#

st_pointsAlongLine(struct, interval, spill, ...)

Create and return a collection of SHAPE Struct representing points that are spaced by the given interval distance.

bdt.functions.st_pointsAlongLine(struct, interval, spill, path_index)#

Create and return a collection of SHAPE Struct representing points that are spaced by the given interval distance. Use explode() to unpack.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • interval (column or str or float) – Distance Interval.

  • spill (column or str or float) – Distance to spill over

  • path_index (column or str or int) – Path Index.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('MULTILINESTRING ((0 0, 1 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_PointsAlongLine(SHAPE, 0.5, 0.1, 0)) AS SHAPE FROM df''')

Python Example:

df = spark \
    .createDataFrame([("MULTILINESTRING ((0 0, 1 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(explode(st_pointsAlongLine("SHAPE", 0.5, 0.1, 0)).alias("SHAPE"))

Network Analysis#

These processors and functions perform network analysis using an input SHAPE Struct and a Road Network dataset stored in an LMDB. It is recommended to use a machine/cluster with at minimum 64 GB of memory to get optimal performance when using the Road Network LMDB.

Processors#

drive_time(df, x_field, y_field, ...[, ...])

IMPORTANT: This processor and all processors that use BDT Hex will be deprecated in the next major version BDT release.

drive_time_h3(df, x_field, y_field, ...[, ...])

Generates drive times (service areas) per X & Y (in 3857).

mapMatch(df, idField, timeField, pidField, ...)

Sequentially snap a sequence of noisy gps points to their correct path vectors.

mapMatch2(df, cellSize, idField, timeField, ...)

Sequentially snap a sequence of noisy gps points to their correct path vectors.

routeClosestTarget(ldf, rdf, maxMinutes, ...)

For every origin point, find the closest destination point (target point) out of a set of possibilities on the road network.

routeClosestTarget2(ldf, rdf, ...[, ...])

For every origin point, find the closest destination point (target point) out of a set of possibilities on the road network.

snap(df[, shapeField, cache])

Snaps to the closest street within the 500-meter radius.

bdt.processors.drive_time(df, x_field, y_field, no_of_costs, hex_size, rings_only=False, cost_field='DRIVE_TIME_COST', shape_field='SHAPE', cache=False)#

IMPORTANT: This processor and all processors that use BDT Hex will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

Generates drive times (service areas) per X & Y (in 3857). Note that noOfCosts is 0 indexed.

The cost is based on the time to travel. 0 to 1 minute, 1 to 2 minute, 2 to 3 minute, and all the way up to noOfCosts - 1 to noOfCosts minute drive time polygons are created.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • x_field (str) – The field name to retain the hex centroid X. The default is X. This field in the input dataframe must be a double.

  • y_field (str) – The field name to retain the hex centroid Y. The default is Y. This field in the input dataframe must be a double.

  • no_of_costs (int) – The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • hex_size (float) – The hex size in meters.

  • rings_only (bool) – Optional. When true, only the time-bucket polygon rings are returned. When false, the entire drive time polygon up to the limit is returned. Default false.

  • cost_field (str) – The cost field name in the output.

  • shape_field (str) – The name of the SHAPE struct field.

  • cache (bool) – Persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Drive times per X & Y.

Example:

bdt.processors.drive_time(
         df,
         x_field="X",
         y_field="Y",
         no_of_costs=5,
         hex_size=125,
         rings_only=False,
         fill=False,
         cost_field="DRIVE_TIME_COST",
         shape_field="SHAPE",
         cache=False)
bdt.processors.drive_time_h3(df, x_field, y_field, no_of_costs, res, rings_only=False, fill=False, cost_field='DRIVE_TIME_COST', shape_field='SHAPE', cache=False)#

Generates drive times (service areas) per X & Y (in 3857). Note that noOfCosts is 0 indexed. This function uses the Uber H3 Library for the drive time hexes. It is suggested to use this function over driveTime as H3 brings significant performance improvements. The cost is based on the time to travel. 0 to 1 minute, 1 to 2 minute, 2 to 3 minute, and all the way up to noOfCosts - 1 to noOfCosts minute drive time polygons are created.

Note: Arguments ringsOnly and fill cannot both be true. If ringsOnly is true, fill must be set to false.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • x_field (str) – The field name to retain the hex centroid X. The default is X. This field in the input dataframe must be a double.

  • y_field (str) – The field name to retain the hex centroid Y. The default is Y. This field in the input dataframe must be a double.

  • no_of_costs (int) – The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • res (int) – The H3 resolution to use.

  • rings_only (bool) – Optional. When true, only the time-bucket polygon rings are returned. When false, the entire drive time polygon up to the limit is returned. Default false.

  • fill (bool) – Optional. When true, remove holes from the returned drive time polygons. When false, leave the polygons as is. Defaults to False.

  • cost_field (str) – The cost field name in the output.

  • shape_field (str) – The name of the SHAPE struct field.

  • cache (bool) – Persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Drive times per X & Y.

Example:

bdt.processors.drive_time_h3(
         df,
         x_field="X",
         y_field="Y",
         no_of_costs=5,
         res= 9,
         rings_only = False,
         fill = False,
         cost_field="DRIVE_TIME_COST",
         shape_field = "SHAPE",
         cache = False)
bdt.processors.mapMatch(df, idField, timeField, pidField, xField, yField, snapDist, microPathSize, pathDist=inf, alpha=10.0, beta=1.0, cache=False)#

Sequentially snap a sequence of noisy gps points to their correct path vectors. Depending on the context, “path” is used to refer to both (a) a path of noisy gps points and (b) a path of vectors on which the gps points are snapped

Parameters:
  • df (DataFrame) – The input DataFrame

  • idField (str) – The name of the unique id column in the input GPS dataframe.

  • timeField (str) – The name of the temporal column in the input GPS dataframe.

  • pidField (str) – The name of the path id column in the input GPS dataframe. GPS points are grouped by

  • xField (str) – The name of gps point x value column in the input GPS dataframe. Must be Web Mercator

  • yField (str) – The name of gps point y value column in the input GPS dataframe. Must be Web Mercator

  • snapDist (float) – The max snap distance. GPS points must be less than or equal to this distance from a candidate path vector.

  • microPathSize (int) – The micropath size. Each path of gps points is divided into smaller subsections (micropaths) for smart snapping.

  • pathDist (float) – The max path vector distance between two snap points.

  • alpha (float) – Penalty weight for the match of the heading of the gps point and the heading of the snapped path vector.

  • beta (float) – Penalty weight for the perpendicularity of the vector from the gps point to its snapped location on the path vector and the snapped path vector.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Collection of points snapped.

Example:

bdt.processors.mapMatch(
        gps_df,
        idField="oid",
        timeField="dt",
        pidField="vid",
        xField="x",
        yField="y",
        snapDist=50.0,
        microPathSize=5,
        pathDist=300.0,
        alpha=10.0,
        beta=1.0,
        cache = False)
bdt.processors.mapMatch2(df, cellSize, idField, timeField, pidField, xField, yField, maxSnapDist, maxPathDist=inf, headingFactor=10.0, snapAngleFactor=1.0)#

Sequentially snap a sequence of noisy gps points to their correct path vectors. Depending on the context, “path” is used to refer to both (a) a path of noisy gps points and (b) a path of vectors on which the gps points are snapped.

This processor groups points by QR to dynamically generate micropaths.

ProcessorMapMatch2 is generally faster than ProcessorMapMatch, except for certain cases like when the data skewed spatially.

Parameters:
  • df (DataFrame) – The input DataFrame

  • cellSize (float) – The spatial partitioning cell size.

  • idField (str) – The name of the unique id column in the input GPS dataframe.

  • timeField (str) – The name of the temporal column in the input GPS dataframe.

  • pidField (str) – The name of the path id column in the input GPS dataframe. GPS points are grouped by

  • xField (str) – The name of gps point x value column in the input GPS dataframe. Must be Web Mercator

  • yField (str) – The name of gps point y value column in the input GPS dataframe. Must be Web Mercator

  • maxSnapDist (float) – The max snap distance. GPS points must be less than or equal to this distance from a candidate path vector.

  • maxPathDist (float) – The max path vector distance between two snap points.

  • headingFactor (float) – Penalty weight for the match of the heading of the gps point and the heading of the snapped path vector.

  • snapAngleFactor (float) – Penalty weight for the perpendicularity of the vector from the gps point to its snapped location on the path vector and the snapped path vector.

Return type:

DataFrame

Returns:

Collection of points snapped.

Example:

bdt.processors.mapMatch2(
        gps_df,
        cellSize=0.1,
        idField="oid",
        timeField="dt",
        pidField="vid",
        xField="x",
        yField="y",
        maxSnapDist=50.0,
        maxPathDist=300.0,
        headingFactor=10.0,
        snapAngleFactor=1.0)
bdt.processors.routeClosestTarget(ldf, rdf, maxMinutes, maxMeters, oxField, oyField, dxField, dyField, origTolerance=500.0, destTolerance=500.0, allNodesAtLocation=False, costOnly=False, cache=False)#

For every origin point, find the closest destination point (target point) out of a set of possibilities on the road network. The associated costs (time and distance) to reach the target point are also returned. The distance unit of measurement is meters, and the time unit of measurement is minutes.

If a destination point is within the time limit OR the distance limit of a given origin point, it is considered a valid destination point. If no valid destination point is found for a given origin point, null values will be returned for the destination point and the associated costs.

The input Origin DataFrame must have the following columns:

  1. oxField: The Orig X (SR 3857) ordinate.

  2. oyField: The Orig Y (SR 3857) ordinate.

The input Destination DataFrame must have the following columns:

  1. dxField: The Dest X (SR 3857) ordinate.

  2. dyField: The Dest Y (SR 3857) ordinate.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • maxMinutes (float) – Maximum time in minutes traveled on the road network to get to a given target from the origin.

  • maxMeters (float) – Maximum distance in meters traveled on the road network to get to a given target from the origin.

  • oxField (str) – Origin point x value. Must be 3857.

  • oyField (str) – Origin point y value. Must be 3857.

  • dxField (str) – Destination point x value. Must be 3857.

  • dyField (str) – Destination point y value. Must be 3857.

  • origTolerance (float) – Snap tolerance for origin point. Default is 500 meters. If an origin point is not snapped, it is not routed

  • destTolerance (float) – Snap tolerance for destination points.

  • allNodesAtLocation (bool) – Default False. Whether to return all nodes at a given destination (target) location or just the first node.

  • costOnly (bool) – Default False. Whether to return just the route cost (in time and meters) or the cost and the route polyline shape.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Closest destination for each origin point.

Example:

bdt.processors.routeClosestTarget(
             origin_df,
             dest_df,
             maxMinutes = 30.0,
             maxMeters = 60375.0,
             oxField = "OX",
             oyField = "OY",
             dxField = "DX",
             dyField = "DY",
             origTolerance = 500.0,
             destTolerance = 500.0,
             allNodesAtLocation = False,
             costOnly = False,
             cache = False)
bdt.processors.routeClosestTarget2(ldf, rdf, maxMinutesField, maxMetersField, oxField, oyField, dxField, dyField, origTolerance=500.0, destTolerance=500.0, allNodesAtLocation=False, costOnly=False)#

For every origin point, find the closest destination point (target point) out of a set of possibilities on the road network. The associated costs (time and distance) to reach the target point are also returned. The distance unit of measurement is meters, and the time unit of measurement is minutes.

If a destination point is within the time limit OR the distance limit of a given origin point, it is considered a valid destination point. If no valid destination point is found for a given origin point, null values will be returned for the destination point and the associated costs.

RouteClosestTarget2 enables dynamic maxMins and maxMeters. It requires that each origin point in the origin DataFrame has an associated max minutes and max meters value. The input Origin DataFrame must have the following columns:

  1. maxMinutesField: The maximum time in minutes traveled beyond which a given destination point is not valid.

  2. maxMetersField: The maximum distance in meters traveled on the road network beyond which a given destination point is not valid.

  3. oxField: The Orig X (SR 3857) ordinate.

  4. oyField: The Orig Y (SR 3857) ordinate.

A sample origin table could look something like this. In this case, different counties have different maxMin and maxMeters criterion:

  _____________________________________
| x | y | County   | maxMins | maxMeters|
|---------------------------------------|
| 1 | 1 | County A | 10      | 20000    |
| 2 | 2 | County B | 20      | 30000    |
|_______________________________________|

The input Destination DataFrame must have the following columns:

  1. dxField: The Dest X (SR 3857) ordinate.

  2. dyField: The Dest Y (SR 3857) ordinate.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • maxMinutesField (str) – Column in the DataFrame that indicates the maximum time in minutes traveled on the road network to get to a given target from the origin.

  • maxMetersField (str) – Column in the DataFrame that indicates the maximum distance in meters traveled on the road network to get to a given target from the origin.

  • oxField (str) – Origin point x value. Must be 3857.

  • oyField (str) – Origin point y value. Must be 3857.

  • dxField (str) – Destination point x value. Must be 3857.

  • dyField (str) – Destination point y value. Must be 3857.

  • origTolerance (float) – Snap tolerance for origin point. Default is 500 meters. If an origin point is not snapped, it is not routed.

  • destTolerance (float) – Snap tolerance for destination points.

  • allNodesAtLocation (bool) – Default False. Whether to return all nodes at a given destination (target) location or just the first node.

  • costOnly (bool) – Default False. Whether to return just the route cost (in time and meters) or the cost and the route polyline shape.

Returns:

Closest destination for each origin point.

Example:

bdt.processors.routeClosestTarget2(
             origin_df,
             dest_df,
             maxMinutesField = "maxMinutes",
             maxMetersField = "maxMeters",
             oxField = "OX",
             oyField = "OY",
             dxField = "DX",
             dyField = "DY",
             origTolerance = 500.0,
             destTolerance = 500.0,
             allNodesAtLocation = False,
             costOnly = False)
bdt.processors.snap(df, shapeField='SHAPE', cache=False)#

Snaps to the closest street within the 500-meter radius. The [[SHAPE_FIELD]] must be in the web mercator aux 3857. The LMDB is used to snap to the closest street.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • shapeField (str) – The name of the SHAPE struct field.

  • cache (bool) – Persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Snapped points to the closest street.

Functions#

st_drive_time(struct, no_of_costs, hex_size)

IMPORTANT: This function and all functions that use BDT Hex will be deprecated in the next major version BDT release.

st_drive_time_h3(struct, no_of_costs, res[, ...])

Accepts a point shape struct, and generates drive time (service area) polygons with the cost in minutes from each point.

st_drive_time_h3_xy(x, y, no_of_costs, res)

Accepts the x and y coordinates of a point, and generates drive time (service area) polygons with the cost in minutes from each x and y point.

st_drive_time_points(x, y, time, interval, ...)

Accepts an X and Y source point in the web mercator (3857) and returns points placed on a regular interval that are reachable within a given time from the source.

st_drive_time_xy(x, y, no_of_costs, hex_size)

IMPORTANT: This function and all functions that use BDT Hex will be deprecated in the next major version BDT release.

bdt.functions.st_drive_time(struct, no_of_costs, hex_size, fill=False)#

IMPORTANT: This function and all functions that use BDT Hex will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

Accepts a point shape struct, and generates drive time (service area) polygons with the cost in minutes from each point. The output is an array of (drive_time_cost and polygon). The drive time polygons are hexagon based.

IMPORTANT: The point shape struct must be in spatial reference 3857 Web Mercator.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax in the point geometry and 3857.

  • no_of_costs (column or str or int) – The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • hex_size (column or str or float) – The hexSize in FloatType.

  • fill (column or str or bool) – Optional. When true, remove holes from the returned drive time polygons. When false, leave the polygons as is. Default is false.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of DoubleType and SHAPE Struct.

Spark SQL Example:

spark.sql('''SELECT ST_FromText('POINT (-13161875 4035019.53758)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_DriveTime(SHAPE, 30, 125.0, false)) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (-13161875 4035019.53758)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_drive_time("SHAPE", 30, 125.0, False).alias("SA")) \
    .selectExpr("inline(SA)")
bdt.functions.st_drive_time_h3(struct, no_of_costs, res, fill=False)#

Accepts a point shape struct, and generates drive time (service area) polygons with the cost in minutes from each point. The output is an array of (drive_time_cost and polygon). The drive time polygons are H3 hexagon based.

IMPORTANT: The point shape struct must be in spatial reference 3857 Web Mercator.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax in the point geometry and 3857.

  • no_of_costs (column or str or int) – The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • res (column or str or int) – The H3 resolution value.

  • fill (column or str or bool) – Optional. When true, remove holes from the returned drive time polygons. When false, leave the polygons as is. Default is false.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of DoubleType and SHAPE Struct.

Spark SQL Example:

spark.sql('''SELECT ST_FromText('POINT (-13161875 4035019.53758)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_DriveTimeH3(SHAPE, 30, 9, false)) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (-13161875 4035019.53758)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_drive_time_h3("SHAPE", 30, 9, False).alias("SA")) \
    .selectExpr("inline(SA)")
bdt.functions.st_drive_time_h3_xy(x, y, no_of_costs, res, fill=False)#

Accepts the x and y coordinates of a point, and generates drive time (service area) polygons with the cost in minutes from each x and y point. The output is an array of (drive_time_cost and polygon). The drive time polygons are H3 hexagon based.

IMPORTANT: The x and y coordinates must be in spatial reference 3857 Web Mercator.

Parameters:
  • x (column or str) – The X in web mercator (3857).

  • y (column or str) – The Y in web mercator (3857).

  • no_of_costs (column or str or int) – The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • res (column or str or int) – The H3 resolution value.

  • fill (column or str or bool) – Optional. When true, remove holes from the returned drive time polygons. When false, leave the polygons as is. Default is false.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of DoubleType and SHAPE Struct.

SQL Example:

spark.sql('''SELECT -13161875 AS X, 4035019.53758 AS Y''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_DriveTimeH3XY(X, Y, 30, 9, false)) FROM df''')

Python Example:

df = spark.createDataFrame([(-13161875, 4035019.53758)], ["X", "Y"])

df \
    .select(st_drive_time_h3_xy("X", "Y", 30, 9, False).alias("SA_XY")) \
    .select(inline("SA_XY"))
bdt.functions.st_drive_time_points(x, y, time, interval, radius)#

Accepts an X and Y source point in the web mercator (3857) and returns points placed on a regular interval that are reachable within a given time from the source. Use inline() to flatten out.

IMPORTANT: The x and y coordinates must be in spatial reference 3857 Web Mercator.

Parameters:
  • x (column or str or float) – The X in web mercator (3857).

  • y (column or str or float) – The Y in web mercator (3857).

  • time (column or str or int) – The time cost in minutes. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • interval (column or str or float) – The interval in meters to drop a point..

  • radius (column or str or float) – The radius in meters to search for a closest street in FloatType.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of 4 DoubleTypes - COST, X, Y, M (distance value from source in meters).

SQL Example:

spark.sql('''SELECT inline(ST_DriveTimePoints(-13161875.0, 4035019.53758, 30, 100.0, 500.0))''')

Python Example:

df = spark.createDataFrame([(-13161875, 4035019.53758)], ["X", "Y"])

df \
    .select(st_drive_time_points("X", "Y", 30, 100.0, 500.0).alias("pts")) \
    .select(inline("pts"))
bdt.functions.st_drive_time_xy(x, y, no_of_costs, hex_size, fill=False)#

IMPORTANT: This function and all functions that use BDT Hex will be deprecated in the next major version BDT release. It is recommended to use the H3 functions/processors instead.

Accepts the x and y coordinates of a point, and generates drive time (service area) polygons with the cost in minutes from each x and y point. The output is an array of (drive_time_cost and polygon). The drive time polygons are hexagon based.

IMPORTANT: The x and y coordinates must be in spatial reference 3857 Web Mercator.

Parameters:
  • x (column or str) – The X in web mercator (3857).

  • y (column or str) – The Y in web mercator (3857).

  • no_of_costs (column or str or int) – The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.

  • hex_size (column or str or float) – The hexSize in FloatType.

  • fill (column or str or bool) – Optional. When true, remove holes from the returned drive time polygons. When false, leave the polygons as is. Default is false.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of DoubleType and SHAPE Struct.

SQL Example:

spark.sql('''SELECT -13161875 AS X, 4035019.53758 AS Y''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_DriveTimeXY(X, Y, 30, 125.0, false)) FROM df''')

Python Example:

df = spark.createDataFrame([(-13161875, 4035019.53758)], ["X", "Y"])

df \
    .select(st_drive_time_xy("X", "Y", 30, 125.0, False).alias("SA_XY")) \
    .select(inline("SA_XY"))

Salting#

These functions perform salting. Salting is a Spark optimization technique for managing skewed data.

Functions#

saltList(maxSaltVal)

Given a maximum salt value, this function creates an array from 0 until the maximum salt value.

saltRand(maxSaltVal)

Returns a random integer in [0, maxSaltVal)

bdt.functions.saltList(maxSaltVal)#

Given a maximum salt value, this function creates an array from 0 until the maximum salt value. Chain this function with explode() to salt the rows in the DataFrame.

Parameters:

maxSaltVal (column or str or int) – The maximum salt value.

Return type:

Column

Returns:

Column of ArrayType with IntegerType

SQL Example:

spark.sql('''SELECT 10 AS maxSaltVal''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(SaltList(maxSaltVal)) AS SALT_VAL FROM df''')

Python Example:

df = self.spark.createDataFrame([(10,)], ["maxSaltVal"])

df.select(explode(saltList("maxSaltVal")).alias("SALT_VAL"))
bdt.functions.saltRand(maxSaltVal)#

Returns a random integer in [0, maxSaltVal)

This is a convenience function for salting. Salting is a Spark optimization technique for managing skewed data.

Note: This function can only be called in Python.

Parameters:

maxSaltVal (column or str or int) – The max salt value.

Return type:

Column

Returns:

Column of IntegerType

Python Example:

df = spark \
     .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
     .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select("SHAPE", saltRand(10).alias("SALT_VAL"))

Spatial Partitioning#

These processors and functions perform spatial partitioning of an input SHAPE struct.

Processors#

join_qr(ldf, rdf, join_type, cell_size, ...)

Join the input DataFrames on QR.

join_qr_salt(ldf, rdf, cell_size, ...[, ...])

Inner join the input DataFrames on QR and an additional salt column.

bdt.processors.join_qr(ldf, rdf, join_type, cell_size, join_distance, lhs_qr_field, rhs_qr_field, lhs_shape_field, rhs_shape_field)#

Join the input DataFrames on QR. Spatially index to accelerate the join and run lower-left QR Check to trim off QR values that are not relevant to subsequent computation.

This processor will accelerate joins significantly on very large DataFrames. On smaller DataFrames, the overhead incurred from spatially indexing the geometries is greater than the speed benefit.

For reference, benefit was not seen until each DataFrame exceeded 20 million records.

The schema of the output DataFrame will retain all columns from both DataFrames except for the QR column. IMPORTANT: If the name of the shape columns are the same for both dataframes, then the shape in the left DataFrame will have ‘_LHS’ appended to it and the shape in the right DataFrame will have ‘_RHS’ appended to it in the output to avoid column name collisions.

When doing polygon to polygon joins with qr spill, both sides must be spilled.

Setting distance to 0 enforces feature-to-feature touches and intersection conditions in downstream operations. If distance or nearest operations will be run downstream after the join, then set the distance value in join_qr to the same distance value that will be used later on in those functions. Some features may get joined that are further away from each other than the join_distance. This is because join_distance functions a box extent around the LHS feature, not a circular radius. Thus, the corners of the box will be further away than a circular radius of join_dist.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • join_type (str) – The join type. Supported join types are “inner” and “left”.

  • cell_size (float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data. It must be exactly the same as the grid size used to create both the input lhs and rhs QRs.

  • join_distance (float) – Features within this distance will be joined together.

  • lhs_qr_field (str) – The name of the QR field in the LHS DataFrame.

  • rhs_qr_field (str) – The name of the QR field in the RHS DataFrame.

  • lhs_shape_field (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhs_shape_field (str) – The name of the SHAPE Struct field in the RHS DataFrame.

Return type:

DataFrame

Returns:

DataFrame with the LHS and RHS Shapes joined on the QR field.

Example:

bdt.processors.join_qr(
     ldf,
     rdf,
     join_type="inner",
     cell_size = 2.0,
     join_distance = 0.0,
     lhs_qr_field = "QR",
     rhs_qr_field = "QR",
     lhs_shape_field = "SHAPE",
     rhs_shape_field = "SHAPE")
bdt.processors.join_qr_salt(ldf, rdf, cell_size, join_distance, lhs_salt_field, rhs_salt_field, lhs_qr_field='QR', rhs_qr_field='QR', lhs_shape_field='SHAPE', rhs_shape_field='SHAPE')#

Inner join the input DataFrames on QR and an additional salt column. Spatially index to accelerate the join and run lower-left QR Check to trim off QR values that are not relevant to subsequent computation.

This processor will accelerate joins significantly on very large DataFrames. On smaller DataFrames, the overhead incurred from spatially indexing the geometries is greater than the speed benefit.

For reference, benefit was not seen until each DataFrame exceeded 20 million records.

The schema of the output DataFrame will retain all columns from both DataFrames except for the QR column. IMPORTANT: If the name of the shape columns are the same for both dataframes, then the shape in the left DataFrame will have ‘_LHS’ appended to it and the shape in the right DataFrame will have ‘_RHS’ appended to it in the output to avoid column name collisions.

When doing polygon to polygon joins with qr spill, both sides must be spilled.

Setting distance to 0 enforces feature-to-feature touches and intersection conditions in downstream operations. If distance or nearest operations will be run downstream after the join, then set the distance value in join_qr to the same distance value that will be used later on in those functions. Some features may get joined that are further away from each other than the join_distance. This is because join_distance functions a box extent around the LHS feature, not a circular radius. Thus, the corners of the box will be further away than a circular radius of join_dist.

Parameters:
  • ldf (DataFrame) – The left-hand side (LHS) input DataFrame

  • rdf (DataFrame) – The right-hand side (RHS) input DataFrame

  • join_type (str) – The join type. Supported join types are “inner” and “left”.

  • cell_size (float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data. It must be exactly the same as the grid size used to create both the input lhs and rhs QRs.

  • join_distance (float) – Features within this distance will be joined together. Set to 0 to enforce feature-to-feature touches and intersection conditions.

  • lhs_qr_field (str) – The name of the QR field in the LHS DataFrame.

  • rhs_qr_field (str) – The name of the QR field in the RHS DataFrame.

  • lhs_salt_field (str) – The name of the salt field in the LHS DataFrame. Default “SALT”. Field must be Integer or String type.

  • rhs_salt_field (str) – The name of the salt field in the RHS DataFrame. Default “SALT”. Field must be Integer or String type.

  • lhs_shape_field (str) – The name of the SHAPE Struct field in the LHS DataFrame.

  • rhs_shape_field (str) – The name of the SHAPE Struct field in the RHS DataFrame.

Return type:

DataFrame

Returns:

DataFrame with the LHS and RHS Shapes joined on the QR and salt field.

Example:

ldf = spark.createDataFrame([ \
    ("POLYGON ((1 1, 2 1, 2 3, 4 3, 4 1, 5 1, 5 4, 1 4, 1 1))",),], schema="WKT1 string") \
    .selectExpr("ST_FromText(WKT1) AS SHAPE") \
    .select("*", F.saltRand(10).alias("SALT")) \
    .selectExpr("*", "explode(ST_AsQR(SHAPE, 2.0)) AS QR") \
    .withMeta("POLYGON", 0)

rdf = spark.createDataFrame([ \
    ("POLYGON ((1 1, 5 1, 5 2, 1 2, 1 1))",),], schema="WKT1 string") \
    .selectExpr("ST_FromText(WKT1) AS SHAPE") \
    .selectExpr("*", "explode(SaltList(10)) AS SALT") \
    .selectExpr("*", "explode(ST_AsQR(SHAPE, 2.0)) AS QR") \
    .withMeta("POLYGON", 0)

bdt.processors.join_qr_salt(
     ldf,
     rdf,
     cell_size = 2.0,
     join_distance = 0.0,
     lhs_salt_field = "SALT",
     rhs_salt_field = "SALT",
     lhs_qr_field = "QR",
     rhs_qr_field = "QR",
     lhs_shape_field = "SHAPE",
     rhs_shape_field = "SHAPE")

Functions#

lat_to_r(lat, cell_size)

Return the conversion of a Latitude value to its cell r value in web mercator.

lon_to_q(lon, cell_size)

Return the conversion of a Longitude value to its cell q value in web mercator.

q_to_lon(q, cell_size[, dist])

Return the conversion of a q value in web mercator to its longitude value.

q_to_x(q, cell_size[, dist])

Return the conversion of a q value to its web mercator x value.

r_to_lat(r, cell_size[, dist])

Return the conversion of a r value in web mercator to its latitude value.

r_to_y(r, cell_size[, dist])

Return the conversion of an r value to its web mercator y value.

st_asQR(struct, cellSize)

Return an array of long values that represent the spatial partition row and column of the shape struct.

st_asQRClip(struct, cellSize, wkid)

Return an array of struct with QR and clipped SHAPE struct.

st_asQREnvpClip(struct, cellSize, wkid, dist)

Returns an array of struct with QR and clipped SHAPE struct.

st_asQRPrune(struct, cellSize, wkid, distance)

Create and return an array of QR values for the given shape struct.

st_asQRSpill(struct, cellSize, distance)

Create and return an array of QR values of the shape struct.

st_asQR_explode(struct, cellSize)

Convenience function that calls explode() on st_asQR.

st_asRC(struct, cellSize)

Return an array of arrays of two long values, representing row and column, of shape struct.

st_asRCClip(struct, cellSize, wkid)

Return an array of arrays of two long values representing row & column and a shape struct of the clipped geometry.

st_asRCPrune(struct, cellSize, wkid, distance)

Create and return an array of RC value arrays for the given shape struct.

st_asRCSpill(struct, cellSize, distance)

Return an array of arrays of two long values, representing row and column, of shape struct.

st_createFishnet(xmin, ymin, xmax, ymax, ...)

Create and return cell grids for the given bounding box and cell size.

st_grid(struct, cell_size)

Returns an array of struct containing two columns: Array of long values representing row and column, and a shape struct.

st_gridFromExtent(extentList, cellSize)

Returns an array of structs containing two columns: QR value and its shape struct.

st_gridXY(lon, lat, cell_size)

For the given longitude, latitude and cell size, return a struct containing an array with the row & column (RC) and a shape struct representing the grid to which the given point location belongs.

st_isLowerLeftInCell(qr, cell_size, struct, ...)

Return true when the lower left corner point of the two shape envelops matches the given QR.

st_isLowerLeftInCellIntersects(qr, ...)

Returns true if:

st_qr_to_box(qr, cell_size, distance)

Returns the envelope of the cell represented by a given qr value.

st_qr_to_envelope(qr, cell_size, distance)

Returns the envelope of the cell represented by a given qr value.

x_to_q(x, cell_size)

Return the conversion of a web mercator x value to its cell q value.

xyToQR(x, y, cellSize)

Create and return a QR value for the given x and y coordinate.

xyToQRSpill(x, y, cellSize, distance)

Create and return an array of qr values of the given x and y coordinates.

xyToQRSpillOrigin(x, y, cellSize, distance)

For each x y point, create and return an array of a structs that contains: (1) The QR that contains the point.

y_to_r(y, cell_size)

Return the conversion of a web mercator y value to its cell r value.

bdt.functions.lat_to_r(lat, cell_size)#

Return the conversion of a Latitude value to its cell r value in web mercator.

Parameters:
  • lat (column or str or float) – The longitude value.

  • cell_size (column or str or float) – The cell size in meters.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT -75.26259694908107 AS lat, 10 AS cell''').createOrReplaceTempView("df")

spark.sql('''SELECT LatToR(lat, cell) AS R FROM df''')

Python Example:

df = spark.createDataFrame([(-75.26259694908107, 10.0)], ["lat", "cell"])

df.select(lat_to_r("lat", "cell").alias("R"))
bdt.functions.lon_to_q(lon, cell_size)#

Return the conversion of a Longitude value to its cell q value in web mercator.

Parameters:
  • lon (column or str or float) – The longitude value.

  • cell_size (column or str or float) – The cell size in meters.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT 117.195666 AS lon, 10 AS cell''').createOrReplaceTempView("df")

spark.sql('''SELECT LonToQ(lon, cell) AS Q FROM df''')

Python Example:

df = spark.createDataFrame([(117.195666, 10.0)], ["lon", "cell"])

df.select(lon_to_q("lon", "cell").alias("Q"))
bdt.functions.q_to_lon(q, cell_size, dist=nan)#

Return the conversion of a q value in web mercator to its longitude value.

Parameters:
  • q (column or str or int) – The q value.

  • cell_size (column or str or float) – The cell size in meters.

  • dist (column or str or float) – The offset distance. Optionally pass in NaN to use cell_size/2 as the distance.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT -1304617 AS q, 10 AS cell, 5 as distance''').createOrReplaceTempView("df")

spark.sql('''SELECT QToLon(q, cell, distance) AS LON FROM df''')

Python Example:

df = spark.createDataFrame([(-1304617, 10.0, 5.0)], ["q", "cell", "dist"])

df.select(q_to_lon("q", "cell", "dist").alias("LON"))
bdt.functions.q_to_x(q, cell_size, dist=nan)#

Return the conversion of a q value to its web mercator x value.

Parameters:
  • q (column or str or int) – The q value.

  • cell_size (column or str or float) – The cell size in meters.

  • dist (column or str or float) – The offset distance. Optionally pass in NaN to use cell_size/2 as the distance.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT -1304617 AS q, 10 AS cell, 5 as distance''').createOrReplaceTempView("df")

spark.sql('''SELECT QToX(q, cell, distance) AS X FROM df''')

Python Example:

df = spark.createDataFrame([(-1304617, 10.0, 5.0)], ["q", "cell", "dist"])

df.select(q_to_x("q", "cell", "dist").alias("X"))
bdt.functions.r_to_lat(r, cell_size, dist=nan)#

Return the conversion of a r value in web mercator to its latitude value.

Parameters:
  • r (column or str or int) – The r value.

  • cell_size (column or str or float) – The cell size in meters.

  • dist (column or str or float) – The offset distance. Optionally pass in NaN to use cell_size/2 as the distance.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT -1304617 AS r, 10 AS cell, 5 as distance''').createOrReplaceTempView("df")

spark.sql('''SELECT RToLat(r, cell, distance) AS LAT FROM df''')

Python Example:

df = spark.createDataFrame([(-1304617, 10.0, 5.0)], ["r", "cell", "dist"])

df.select(r_to_lat("r", "cell", "dist").alias("LAT"))
bdt.functions.r_to_y(r, cell_size, dist=nan)#

Return the conversion of an r value to its web mercator y value.

Parameters:
  • r (column or str or int) – The r value.

  • cell_size (column or str or float) – The cell size in meters.

  • dist (column or str or float) – The offset distance. Optionally pass in NaN to use cell_size/2 as the distance.

Return type:

Column

Returns:

Column of DoubleType

SQL Example:

spark.sql('''SELECT -1304617 AS r, 10 AS cell, 5 as distance''').createOrReplaceTempView("df")

spark.sql('''SELECT RToY(r, cell, distance) AS Y FROM df''')

Python Example:

df = spark.createDataFrame([(-1304617, 10.0, 5.0)], ["r", "cell", "dist"])

df.select(r_to_y("r", "cell", "dist").alias("Y"))
bdt.functions.st_asQR(struct, cellSize)#

Return an array of long values that represent the spatial partition row and column of the shape struct. This is optimized for 2-dimensions only.

Note: This will return QR value 0 for empty geometries.

Note: QR and RC are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Use explode() to unpack.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

Return type:

Column

Returns:

Column of ArrayType with LongType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_AsQR(SHAPE, 0.1)) AS QR FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(explode(st_asQR("SHAPE", 0.1)).alias("QR"))
bdt.functions.st_asQRClip(struct, cellSize, wkid)#

Return an array of struct with QR and clipped SHAPE struct.

If a clipped SHAPE struct is completely within the geometry, wkb will be empty but the extent will be populated accordingly. Use inline() function to flatten out. If the DataFrame has the default SHAPE column before calling this function, there will be two SHAPE columns in the DataFrame after calling the function. It’s the user’s responsibility to rename the columns appropriately to avoid the column name collisions.

Note: This will return QR value 0 for empty geometries.

Note: QR and RC are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of SHAPE Struct and LongType.

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_AsQRClip(SHAPE, 1.0, 4326)) AS QRClips FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_asQRClip("SHAPE", 0.1, 4326).alias("QRClips")) \
    .selectExpr("inline(QRClips)")
bdt.functions.st_asQREnvpClip(struct, cellSize, wkid, dist)#

Returns an array of struct with QR and clipped SHAPE struct.

Each cell is inflated by the given dist parameter before the geometry is clipped. If the cell is completely within the geometry, a shape struct is returned with WKB set to NULL and the envelope using the xmin,ymin,xmax,ymax represents the geometry of the cell. Otherwise, the cell clips the geometry.

If the clipped geometry is empty, that geometry is not returned. Hence, this functions prunes. If the clipped geometry is not empty, the wkb representation of it is returned.

The bounding box of the clipped geometry and the cell bounding box COULD be slightly different within the tolerance of the spatial reference. Hence, the xmin, ymin, xmax, ymax represents the intersected area of the cell and the bounding box of the polygon. Use inline() function to flatten out.

If the DataFrame has the default SHAPE column before calling this function, there will be two SHAPE columns in the DataFrame after calling this function & inline(). It’s the user’s responsibility to rename the columns appropriately to avoid the column name collisions.

Note: This will return QR value 0 for empty geometries.

Note: QR and RC are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax.

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • wkid (column or str or int) – The spatial reference id.

  • dist (column or str or float) – an expression. The spill distance. Must be greater than the spatial reference tolerance.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of LongType and SHAPE Struct.

SQL Example:

spark.sql('''SELECT inline(ST_AsQREnvpClip(SHAPE, 1.0, 3857, 0.1))
          FROM VALUES (ST_FromText('POINT(1 1)')) tab(SHAPE)''')

Python Example:

df = spark \
    .createDataFrame([("POINT(1 1)")], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_asQREnvpClip("SHAPE", 0.1, 4326, 0.01).alias("clipped")) \
    .selectExpr("inline(clipped)")
bdt.functions.st_asQRPrune(struct, cellSize, wkid, distance)#

Create and return an array of QR values for the given shape struct. QR values that overlap the geometry bounding box but not the actual geometry will not be returned.

Use inline() to unpack.

Note: This will return QR value 0 for empty geometries.

Note: QR and RC are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • wkid (column or str or int) – The spatial reference id.

  • distance (column or str or float) – The distance to spill over.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of LongType and SHAPE Struct.

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_AsQRPrune(SHAPE, 0.1, 4326, 0.5)) FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df \
    .select(st_asQRPrune("SHAPE", 0.1, 4326, 0.5).alias("QR_PRUNE")) \
    .selectExpr("inline(QR_PRUNE)")
bdt.functions.st_asQRSpill(struct, cellSize, distance)#

Create and return an array of QR values of the shape struct. Spill over to neighboring cells by the specified distance.

Use explode() to unpack.

Note: This will return QR value 0 for empty geometries.

Note: QR and RC are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • distance (column or str or float) – The distance to spill over.

Return type:

Column

Returns:

Column of ArrayType with LongType

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_AsQRSpill(SHAPE, 0.1, 0.01)) AS QRSpill FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(explode(st_asQRSpill("SHAPE", 0.1, 0.01)).alias("QRSpill"))
bdt.functions.st_asQR_explode(struct, cellSize)#

Convenience function that calls explode() on st_asQR.

See the docs for st_asQR for more information.

This function can only be used in Python. It cannot be used in a spark sql statement.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

Return type:

Column

Returns:

Column of LongType

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_asQR_explode("SHAPE", 0.1).alias("QR"))
bdt.functions.st_asRC(struct, cellSize)#

Return an array of arrays of two long values, representing row and column, of shape struct.

Note: RC and QR are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

Return type:

Column

Returns:

Column of ArrayType with ArrayType with two LongTypes

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_AsRC(SHAPE, 1.0) AS RC FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_asRC("SHAPE", 0.1).alias("RC"))
bdt.functions.st_asRCClip(struct, cellSize, wkid)#

Return an array of arrays of two long values representing row & column and a shape struct of the clipped geometry.

Note: RC and QR are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • wkid (column or str or int) – The spatial reference id.

Return type:

Column

Returns:

Array of InternalRow with SHAPE Struct and ArrayType with LongType

SQL Example:

df2 = spark.sql('''SELECT ST_AsRCClip(SHAPE, 1.0, 4326) AS RCClip FROM df''')

Python Example:

df.select(st_asRCClip("SHAPE", 0.1, 4326).alias("RCClip"))
bdt.functions.st_asRCPrune(struct, cellSize, wkid, distance)#

Create and return an array of RC value arrays for the given shape struct. RC values that overlap the geometry bounding box but not the actual geometry will not be returned.

Note: RC and QR are functionally equivalent, but it is recommended to use QR. RC will be deprecated in a future release.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • wkid (column or str or int) – The spatial reference id.

  • distance (column or str or float) – The distance to spill over.

Return type:

Column

Returns:

Column of ArrayType with ArrayType with two LongTypes

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_AsRCPrune(SHAPE, 1.0, 4326, 0.5) AS RCPrune FROM df''')

SQL Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(st_asRCPrune("SHAPE", 0.1, 4326, 0.01))
bdt.functions.st_asRCSpill(struct, cellSize, distance)#

Return an array of arrays of two long values, representing row and column, of shape struct. Spill over to neighboring cells by the specified distance.

Use explode() to unpack.

Parameters:
  • struct (column or str) – The shape struct with wkb, xmin, ymin, xmax, ymax

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • distance (column or str or float) – The distance to spill over.

Return type:

Column

Returns:

Column of ArrayType with ArrayType with two LongTypes

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(ST_AsRCSpill(SHAPE, 0.1, 0.01)) AS RCSpill FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select(explode(st_asRCSpill("SHAPE", 0.1, 0.01)).alias("RCSpill"))
bdt.functions.st_createFishnet(xmin, ymin, xmax, ymax, cell_size)#

Create and return cell grids for the given bounding box and cell size. The output is an array of (Seq(r,c), Shape struct).

Use inline() to unpack.

Parameters:
  • xmin (column or str or float) – The minimum x value.

  • ymin (column or str or float) – The minimum y value.

  • xmax (column or str or float) – The maximum x value.

  • ymax (column or str or float) – The maximum y value.

  • cell_size (column or str or float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of SHAPE Struct and ArrayType with 2 LongTypes.

SQL Statement Example:

spark.sql("SELECT inline(ST_CreateFishnet(-180, -90, 180, 90, 0.01))")

DataFrame Function Example:

df \
    .select(st_createFishnet(-180, -90, 180, 90, 0.01).alias("FISHNET")) \
    .selectExpr("inline(FISHNET)")
bdt.functions.st_grid(struct, cell_size)#

Returns an array of struct containing two columns: Array of long values representing row and column, and a shape struct. Use explode() and inline() function to flatten out.

Parameters:
  • struct (column or str) – Shape struct

  • cell_size (column or str or float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of LongType and a SHAPE Struct

SQL Example:

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(ST_Grid(SHAPE, 1.0))''')

Python Example:

df = spark \
    .createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df. \
    .select(st_grid("SHAPE", 1.0).alias("GRID")) \
    .selectExpr("inline(GRID)")
bdt.functions.st_gridFromExtent(extentList, cellSize)#

Returns an array of structs containing two columns: QR value and its shape struct. Use inline() to unpack.

Parameters:
  • extentList (List[float]) – Extent array. Must be an array of (xmin, ymin, xmax, ymax)

  • cellSize (column or str or float) – The spatial partitioning grid size.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of LongType and a SHAPE Struct.

SQL Example:

spark.sql('''SELECT inline(ST_GridFromExtent(array(-118, 32, -115, 34), 0.1))''')

Python Example:

df = spark \
    .createDataFrame([(1,)], ["extent_id"])

df \
    .select(st_gridFromExtent([-118, 32, -115, 34], 0.1).alias("GRID")) \
    .selectExpr("inline(GRID)")
bdt.functions.st_gridXY(lon, lat, cell_size)#

For the given longitude, latitude and cell size, return a struct containing an array with the row & column (RC) and a shape struct representing the grid to which the given point location belongs. Use inline() to flatten out.

Parameters:
  • lon (column or str) – The x value.

  • lat (column or str) – The y value.

  • cell_size (column or str or float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data.

Return type:

Column

Returns:

Column of ArrayType with InternalRow of LongType and a SHAPE Struct

SQl Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql(''' SELECT inline(ST_Grid(lon, lat, 1)) FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df \
    .select(st_gridXY("lon", "lat", 1).alias("GRID_XY")) \
    .selectExpr("inline(GRID_XY)")
bdt.functions.st_isLowerLeftInCell(qr, cell_size, struct, another_struct)#

Return true when the lower left corner point of the two shape envelops matches the given QR. Otherwise, return false.

This is key to the BDT spatial partitioning technique when considering two multipath geometries. These multipath geometries will likely span multiple QRs, especially if the resolution is small.

This lower-left check ensures that the QR partition containing the lower-left corner point is the ONLY partition that does the relevant operation, ensuring there are now duplicates in the output.

Parameters:
  • qr (column or str) – an expression. The row & column as a QR Long value.

  • cell_size (column or str or float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data. It must be exactly the same as the grid size used to create the input QR.

  • struct (column or str) – an expression. The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – an expression. The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark
    .sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1''')             .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")             .createOrReplaceTempView("df1")

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''')             .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")             .createOrReplaceTempView("df2")

spark.sql('''SELECT * FROM df1 LEFT JOIN df2 ON df1.QR = df2.QR WHERE ST_IsLowerLeftInCell(df1.QR, 1.0, SHAPE1,
SHAPE2))''')

Python Example:

df1 = spark \
    .sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1''') \
    .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")

df2 = spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''') \
    .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")

df \
    .join(df1, df2, on="QR", how="left") \
    .select("*", st_isLowerLeftInCell("QR", 0.5, SHAPE1, SHAPE2).alias("LOWER_LEFT")) \
    .filter(col("LOWER_LEFT") == True)
bdt.functions.st_isLowerLeftInCellIntersects(qr, cell_size, struct, another_struct)#

Returns true if:

  1. The two envelopes intersect, and

  2. The lower-left corner point of two envelopes matches the given QR.

It will return false otherwise.

This is key to the BDT spatial partitioning technique when considering two multipath geometries. These multipath geometries will likely span multiple QRs, especially if the resolution is small.

This lower-left check ensures that the QR partition containing the lower-left corner point is the ONLY partition that does the relevant operation, ensuring there are no duplicates in the output.

Parameters:
  • qr (column or str) – an expression. The row & column as a QR Long value.

  • cell_size (column or str or float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data. It must be exactly the same as the grid size used to create the input QR.

  • struct (column or str) – an expression. The shape struct with wkb, xmin, ymin, xmax, ymax

  • another_struct (column or str) – an expression. The shape struct with wkb, xmin, ymin, xmax, ymax

Return type:

Column

Returns:

Column of BooleanType

SQL Example:

spark
    .sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1''')             .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")             .createOrReplaceTempView("df1")

spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''')             .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")             .createOrReplaceTempView("df2")

spark.sql('''SELECT * FROM df1 LEFT JOIN df2 ON df1.QR = df2.QR
             WHERE ST_IsLowerLeftInCellIntersects(df1.QR, 1.0, SHAPE1, SHAPE2))''')

Python Example:

df1 = spark \
    .sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE1''') \
    .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")

df2 = spark.sql('''SELECT ST_FromText('POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))') AS SHAPE2''') \
    .selectExpr("SHAPE", "explode(ST_AsQR(SHAPE, 0.5)) AS QR")

df \
    .join(df1, df2, on="QR", how="left") \
    .select("*", st_isLowerLeftInCellIntersects("QR", 0.5, SHAPE1, SHAPE2).alias("LOWER_LEFT")) \
    .filter(col("LOWER_LEFT") == True)
bdt.functions.st_qr_to_box(qr, cell_size, distance)#

Returns the envelope of the cell represented by a given qr value. Functions the same as st_qr_to_envelope.

Parameters:
  • qr (column or str or int) – The row & column as a QR Long value.

  • cell_size (column or str or float) – The spatial partitioning grid size. Must be the same as the cell size that was used to obtain the input qr value.

  • distance (column or str or float) – The distance to spill the envelope over. A convenient way to inflate the envelope.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT 0 AS qr, 1.0 AS cell, 0.1 as dist''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_QRToBox(qr, cell, dist) AS BOX FROM df''')

Python Example:

df = spark.createDataFrame([(0, 1.0, 0.1)], ["qr", "cell", "dist"])

df.select(st_qr_to_box("qr", "cell", "dist").alias("BOX"))
bdt.functions.st_qr_to_envelope(qr, cell_size, distance)#

Returns the envelope of the cell represented by a given qr value. Functions the same as st_qr_to_box.

Parameters:
  • qr (column or str or int) – The row & column as a QR Long value.

  • cell_size (column or str or float) – The spatial partitioning grid size. Must be the same as the cell size that was used to obtain the input qr value.

  • distance (column or str or float) – The distance to spill the envelope over. A convenient way to inflate the envelope.

Return type:

Column

Returns:

Column of SHAPE Struct

SQL Example:

spark.sql('''SELECT 0 AS qr, 1.0 AS cell, 0.1 as dist''').createOrReplaceTempView("df")

spark.sql('''SELECT ST_QRToEnvelope(qr, cell, dist) AS ENVELOPE FROM df''')

Python Example:

df = spark.createDataFrame([(0, 1.0, 0.1)], ["qr", "cell", "dist"])

df.select(st_qr_to_envelope("qr", "cell", "dist").alias("ENVELOPE"))
bdt.functions.x_to_q(x, cell_size)#

Return the conversion of a web mercator x value to its cell q value.

Parameters:
  • lon (column or str or float) – The web mercator x value.

  • cell_size (column or str or float) – The cell size in meters.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT 47.350929 AS x, 10 AS cell''').createOrReplaceTempView("df")

spark.sql('''SELECT XToQ(x, cell) AS Q FROM df''')

Python Example:

df = spark.createDataFrame([(47.350929, 10)], ["x", "cell"])

df.select(x_to_q("x", "cell").alias("Q"))
bdt.functions.xyToQR(x, y, cellSize)#

Create and return a QR value for the given x and y coordinate.

Parameters:
  • x (column or str or float) – The coordinate x value.

  • y (column or str or float) – The coordinate y value.

  • cellSize (column or str or float) – The spatial partitioning grid size.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT XYToQR(lon, lat, 1.0) AS QR FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(xyToQR("lon", "lat", 0.1).alias("QR"))
bdt.functions.xyToQRSpill(x, y, cellSize, distance)#

Create and return an array of qr values of the given x and y coordinates. Spill over to neighboring cells by the specified distance. Use explode() to unpack.

Parameters:
  • x (column or str or float) – The coordinate x value.

  • y (column or str or float) – The coordinate y value.

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • distance (column or str or float) – The distance to spill over.

Return type:

Column

Returns:

Column of ArrayType with LongType.

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(XYToQRSpill(lon, lat, 0.1, 0.05)) AS QR FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(explode(xyToQRSpill(lon, lat, 0.1, 0.05)).alias("QR"))
bdt.functions.xyToQRSpillOrigin(x, y, cellSize, distance)#

For each x y point, create and return an array of a structs that contains: (1) The QR that contains the point. ( 2) The QR(s) that the point as spilled over into. Each struct in the array has the following: (1) The QR value ( 2) A Boolean, true if this is the QR that contains the point and false if this is a QR that the point as spilled into.

This is optimized for 2-dimensions only.

Use explode() and inline() to unpack.

Parameters:
  • x (column or str or float) – The coordinate x value.

  • y (column or str or float) – The coordinate y value.

  • cellSize (column or str or float) – The spatial partitioning grid size.

  • distance (column or str or float) – The distance to spill over.

Return type:

Column

Returns:

Column of Arraytype with InternalRow of LongType and BooleanType

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT inline(explode(XYToQRSpillOrigin(lon, lat, 0.1, 0.05))) AS QR FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df \
    .select(xyToQRSpillOrigin("lon", "lat", 0.1, 0.05).alias("QR_SPILL")) \
    .selectExpr("inline(QR_SPILL)")
bdt.functions.y_to_r(y, cell_size)#

Return the conversion of a web mercator y value to its cell r value.

Parameters:
  • y (column or str or float) – The web mercator y value.

  • cell_size (column or str or float) – The cell size in meters.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT -437.721420 AS y, 10 AS cell''').createOrReplaceTempView("df")

spark.sql('''SELECT YToR(y, cell) AS R FROM df''')

Python Example:

df = spark.createDataFrame([(-437.721420, 10)], ["y", "cell"])

df.select(y_to_r("y", "cell").alias("R"))

Uber H3#

Functions#

These functions use the Uber H3 library for hex tiling analysis.

geoToH3(lat, lon, res)

Return the conversion of Latitude and Longitude values to an H3 index at the given resolution.

h3Distance(h3Origin, h3Idx)

Return the distance in H3 grid cell hexagons between the two H3 indices.

h3Line(struct, res)

Given a multipath shape struct, returns the array of h3 indexes along the line between the start and end points of the multipath.

h3Polyfill(struct, res)

Returns an array of H3 indices whose centroid is contained in the given polygon.

h3ToChildren(h3Idx, res)

Return the children of the given H3 Index.

h3ToGeo(h3Idx)

Return the conversion of an H3 index to its Latitude and Longitude centroid.

h3ToGeoBoundary(h3Idx)

Return a collection of the hexagon boundary latitude and longitude values for an H3 index.

h3ToParent(h3Idx, res)

Return the parent H3 Index of the given H3 index the given resolution.

h3ToString(h3Idx)

Return the conversion of an H3 index to its string representation.

h3kRing(h3Idx, k)

Return k rings of the neighboring H3 indexes in all directions from an origin H3 index.

h3kRingDistances(h3Idx, k)

Return k rings of the neighboring H3 indexes in all directions from an origin H3 index.

stringToH3(h3Idx)

Return the conversion of an H3 index string to its long representation.

bdt.functions.geoToH3(lat, lon, res)#

Return the conversion of Latitude and Longitude values to an H3 index at the given resolution.

This function uses the Uber H3 Library.

Parameters:
  • lat (column or str or float) – latitude

  • lon (column or str or float) – longitude

  • res (column or str or int) – H3 resolution

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT 180 AS lon, 90 AS lat''').createOrReplaceTempView("df")

spark.sql('''SELECT GeoToH3(lat, lon, 10) AS H3Idx FROM df''')

Python Example:

df = spark.createDataFrame([(180, 90)], ["lon", "lat"])

df.select(geoToH3("lat", "lon", 10).alias("H3Idx"))
bdt.functions.h3Distance(h3Origin, h3Idx)#

Return the distance in H3 grid cell hexagons between the two H3 indices.

This function uses the Uber H3 Library.

Parameters:
  • h3Origin (column or str or int) – The origin H3 index value.

  • h3Idx (column or str or int) – The destination H3 index value.

Return type:

Column

Returns:

Column of IntegerType

SQL Example:

spark.sql('''SELECT GeoToH3(-90, -180, 10) AS h3Origin, GeoToH3(90, 180,
10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT H3Distance(h3Origin, h3Idx) AS Distance FROM df''')

Python Example:

spark \
    .createDataFrame([(-180, -90, 180, 90)], ["lon1", "lat1", "lon2", "lat2"]) \
    .select(geoToH3("lat1", "lon1", 10).alias("h3Origin"), geoToH3("lat2", "lon2", 10).alias("h3Idx"))

df.select(h3Distance("h3Origin", "h3idx").alias("Distance"))
bdt.functions.h3Line(struct, res)#

Given a multipath shape struct, returns the array of h3 indexes along the line between the start and end points of the multipath. The multipath shape struct must be in 4326 spatial reference.

This function uses the Uber H3 Library.

Parameters:
  • struct (column or str) – The multipath shape struct. Must be in 4326 spatial reference.

  • res (column or str or int) – The H3 resolution value.

Return type:

Column

Returns:

Column of ArrayType with LongType

SQL Example:

spark.sql('''SELECT ST_FromText("LINESTRING (-75.783691 45.440441, -80.975246 38.651198)") AS SHAPE''')            .createOrReplaceTempView("df")

spark.sql('''SELECT H3Line(SHAPE, 1) AS indices FROM df''')

Python Example:

df = spark.createDataFrame([
    ("LINESTRING (-75.783691 45.440441, -80.975246 38.651198)",)
], schema="wkt string")

result_df = df.select(
    h3Line(
        st_fromText("wkt"), 1
    ).alias("indices")
)
bdt.functions.h3Polyfill(struct, res)#

Returns an array of H3 indices whose centroid is contained in the given polygon. The polygon must be in 4326 spatial reference.

This function uses the Uber H3 Library.

Note: If the resolution is too large, there could be more H3 indices returned than the maximum size of an array and a java.lang.OutOfMemoryError or a java.lang.NegativeArraySizeException would be seen. In this case, it is recommended to split large input polygons into smaller ones or use a smaller resolution to limit the number of indices in the returned array.

Parameters:
  • struct (column or str) – The shape struct. Must be in 4326 spatial reference.

  • res (column or str or int) – The H3 resolution value.

Return type:

Column

Returns:

Column of ArrayType with LongType

SQL Example:

spark.sql('''SELECT ST_FromText("POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30),
        (15 20, 25 20, 15 25, 15 20))") AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT H3Polyfill(SHAPE, 1) AS indices FROM df''')

Python Example:

df = spark.createDataFrame([
    ("POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30), "
     "(15 20, 25 20, 15 25, 15 20))",)
], schema="wkt string")

result_df = df.select(
    h3Polyfill(
        st_fromText("wkt"), 1
    ).alias("indices")
)
bdt.functions.h3ToChildren(h3Idx, res)#

Return the children of the given H3 Index. The children will all be at the given resolution.

This function uses the Uber H3 Library.

Parameters:
  • h3Idx (column or str or int) – The H3 index value.

  • res (column or str or int) – The H3 resolution value.

Return type:

Column

Returns:

Column of ArrayType with LongType

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(H3ToChildren(h3Idx, 10)) AS child_h3Idx FROM df''')

Python Example:

spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df.select(explode(h3ToChildren("h3Idx", 10)).alias("child_h3Idx"))
bdt.functions.h3ToGeo(h3Idx)#

Return the conversion of an H3 index to its Latitude and Longitude centroid.

This function uses the Uber H3 Library.

Parameters:

h3Idx (column or str or int) – The H3 index value.

Return type:

Column

Returns:

Column of ArrayType with two LongTypes

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT element_at(CENTROID, 1) AS lat, element_at(CENTROID, 2) AS lon FROM
( SELECT H3ToGeo(h3Idx) AS CENTROID FROM df )''')

Python Example:

df = spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df \
    .select(h3ToGeo("h3Idx").alias("CENTROID")) \
    .select(element_at("CENTROID", 1).alias("lat"), element_at("CENTROID", 2).alias("lon"))
bdt.functions.h3ToGeoBoundary(h3Idx)#

Return a collection of the hexagon boundary latitude and longitude values for an H3 index. The output is ordered as follows: [x1, y1, x2, y2, x3, y3, …]

The result can be converted to a shape struct using st_makePolygon.

This function uses the Uber H3 Library.

Parameters:

h3Idx (column or str or int) – The H3 index value.

Return type:

Column

Returns:

Column of ArrayType with DoubleType

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT H3ToGeoBoundary(h3Idx) AS BOUNDARY FROM df''')

Python Example:

spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df.select(h3ToGeoBoundary("h3Idx").alias("BOUNDARY"))
bdt.functions.h3ToParent(h3Idx, res)#

Return the parent H3 Index of the given H3 index the given resolution.

This function uses the Uber H3 Library.

Parameters:
  • h3Idx (column or str or int) – The H3 index value.

  • res (column or str or int) – the H3 resolution value.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT H3ToParent(h3Idx, 9) AS parentIdx FROM df''')

Python Example:

spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df.select(h3ToParent("h3Idx", 10).alias("parentIdx"))
bdt.functions.h3ToString(h3Idx)#

Return the conversion of an H3 index to its string representation.

This function uses the Uber H3 Library.

Parameters:

h3Idx (column or str or int) – The H3 index value.

Return type:

Column

Returns:

Column of StringType

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT H3ToString(h3Idx) AS h3Str FROM df''')

Python Example:

spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df.select(h3ToString("h3Idx").alias("h3Str"))
bdt.functions.h3kRing(h3Idx, k)#

Return k rings of the neighboring H3 indexes in all directions from an origin H3 index. Use explode() to unpack.

This function uses the Uber H3 Library.

Parameters:
  • h3Idx (column or str or int) – The H3 index value.

  • k (column or str or int) – The number of rings from the origin H3 index.

Return type:

Column

Returns:

Column of ArrayType with LongType

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(H3kRing(h3Idx, 10)) AS nbr_h3Idx FROM df''')

Python Example:

spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df.select(explode(h3kRing("h3Idx", 10)).alias("nbr_h3Idx"))
bdt.functions.h3kRingDistances(h3Idx, k)#

Return k rings of the neighboring H3 indexes in all directions from an origin H3 index. This returns a collection of rings, each of which is a list of indexes. The rings are in order from closest to origin to farthest.

Use explode() to unpack.

This function uses the Uber H3 Library.

Parameters:
  • h3Idx (column or str or int) – The H3 index value.

  • k (column or str or int) – The number of rings from the origin H3 index.

Return type:

Column

Returns:

Column of ArrayType with ArrayType with LongType

SQL Example:

spark.sql('''SELECT GeoToH3(90, 180, 10) AS h3Idx''').createOrReplaceTempView("df")

spark.sql('''SELECT explode(H3kRingDistances(h3Idx, 10)) AS nbr_h3Indices FROM df''')

Python Example:

spark \
    .createDataFrame([(180, 90)], ["lon", "lat"]) \
    .select(geoToH3("lat", "lon", 10).alias("h3Idx"))

df.select(explode(h3kRingDistances("h3Idx", 10)).alias("nbr_h3Indices"))
bdt.functions.stringToH3(h3Idx)#

Return the conversion of an H3 index string to its long representation.

This function uses the Uber H3 Library.

Parameters:

h3Idx (column or str) – The H3 Index string representation. Must be a column or use the pyspark lit function for a single string.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT '8730b3985ffffff' AS h3IdxString''').createOrReplaceTempView("df")

spark.sql('''SELECT StringToH3(h3IdxString) AS h3Idx FROM df''')

Python Example:

df = spark.createDataFrame([("8730b3985ffffff",)], schema="h3Idx string")

df.select(stringToH3("h3Idx").alias("H3"))

OR

df.select(stringToH3(lit("8730b3985ffffff")).alias("H3"))

Raster#

These processors generate analysis using raster datasets.

These processors require additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Important note for functions that accept TIF files: Input raster TIFs with a pixel/bit depth of 8-bit signed are not supported. Please convert the raster to have a pixel depth of 8-bit unsigned if there are no negative values, or to 16-bit signed for negative value support. This can be done in ArcGIS Pro.

Processors#

raster_erase(df, tiff_path, id_seq[, ...])

Erase parts of polygons by raster ID.

raster_extract(df, tiff_path, miid_field[, ...])

Extract cell values from raster cells based on Point geometries.

raster_to_vector(tiff_name, output_type, ...)

Converts all raster cells that satisfy the input 'valueThreshold' to a vector geometry representation.

tabulate_area(df, tiff_path[, pid_field, ...])

Calculates area statistics for the intersection of raster and polygon.

vector_to_raster(df, output_path, cell_size)

Processor that converts geometry vectors to a raster image in TIF format.

zonal_statistics_table(df, tiff_path, miid_field)

Calculates statistics on cell values of a raster within zones defined by polygons from an input dataframe.

bdt.processors.raster_erase(df, tiff_path, id_seq, shape_field='SHAPE')#

Erase parts of polygons by raster ID.

If a given raster cell has an ID in id_seq, and that raster cell intersects a polygon, then the area of intersection will be erased from the polygon. The resulting polygons are returned.

If there are no relevant and intersecting raster cells, the polygon will be returned as-is with no erasure.

Example use case: Erase areas of a polygon that are covered by certain land type, where the land type data is in raster format.

The raster geotiff and the vector polygon data must be the same spatial reference. The input tif raster file is put onto the driver and executor filesystems using SparkContext.addFile(…).

The input tiff raster file is put onto the driver and executor filesystems using SparkContext.addFile(…). Note: Spark does not allow adding two files with the same name but with different content. If you want to use this processor with two different tiffs with the same file name, either rename one of the files or allow spark to overwrite the existing file by setting the “spark.files.overwrite” configuration property to “true”.

Note: This processor requires additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Note: Rasters with a pixel/bit depth of 8-bit signed are not supported. Please convert the raster to have a pixel depth of 8-bit unsigned if there are no negative values, or to 16-bit signed for negative value support. This can be done in ArcGIS Pro.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • tiff_Path (str) – The path to the raster data in .tif format.

  • id_seq (List[str]) – The list of raster cell ids to erase from the polygon.

  • shape_field (str) – The name of the SHAPE Struct field.

Return type:

DataFrame

Returns:

Erased polygons.

Example:

bdt.processors.raster_erase(
         df,
         "/path/to/tiff.tif",
         [1, 2, 3],
         shape_field = "SHAPE")
bdt.processors.raster_extract(df, tiff_path, miid_field, shape_field='SHAPE')#

Extract cell values from raster cells based on Point geometries. Points must be in same spatial reference as tiff. Points not falling within the raster bounds will not be returned in the output dataframe. Null input geometries will not be returned in the output dataframe.

Note: This processor requires additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • tiff_Path (str) – The path to the raster data in .tif format.

  • miid_field (str) – (monotonically increasing id) the name of the unique identifier field for each input point used to link stats back in the output table to original polygon.

  • shape_field (str) – The name of the SHAPE Struct field. Must be a point geometry.

Return type:

DataFrame

Returns:

Values associated with each point id

Example:

bdt.processors.raster_extract(
         df,
         "/path/to/tiff.tif",
         "ID",
         shape_field = "SHAPE")
bdt.processors.raster_to_vector(tiff_name, output_type, num_partitions, value_threshold=-inf)#

Converts all raster cells that satisfy the input ‘valueThreshold’ to a vector geometry representation. Each raster cell is converted to either its centroid (point) or square cell boundary (polygon). The value of the raster cell is emitted alongside this geometry. The output coordinate values will be in the same spatial reference as the input raster.

If the ‘outputType’ is set to ‘centroid’ then the output schema will be:

root
 |-- X: double (nullable = false)
 |-- Y: double (nullable = false)
 |-- VALUE: double (nullable = false)

If the ‘outputType’ is set to ‘boundary’ then the output schema will be:

root
 |-- XMIN: double (nullable = false)
 |-- YMIN: double (nullable = false)
 |-- XMAX: double (nullable = false)
 |-- YMAX: double (nullable = false)
 |-- VALUE: double (nullable = false)

The parameter ‘numPartitions’ should at least be set to the number of cores of the machine or cluster the processor is run on to take advantage of the parallelism, but often peformance is improved with significantly more partitions than that. It is recommended to start by using approximately 25 times the number of cores.

The input tiff raster file is put onto the driver and executor filesystems using SparkContext.addFile(…). Note: Spark does not allow adding two files with the same name but with different content. If you want to use this processor with two different tiffs with the same file name, either rename one of the files or allow spark to overwrite the existing file by setting the “spark.files.overwrite” configuration property to “true”.

Note: This processor requires additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Note: Some spatial references may not work, it is recommended to use 3857 or 4326 for the tiff if possible.

Note: When using this processor in a windows environment, the file path must be specified in the file:///drive:/Path/To/Tiff format.

Note: Rasters with a pixel/bit depth of 8-bit signed are not supported. Please convert the raster to have a pixel depth of 8-bit unsigned if there are no negative values, or to 16-bit signed for negative value support. This can be done in ArcGIS Pro.

Parameters:
  • tiff_name (str) – The path to the raster data in .tif format.

  • output_type (str) – Either ‘centroid’ or ‘boundary’. ‘centroid’ emits the centroid of the cell, and ‘boundary’ emits the square boundary.

  • num_partitions (int) – The number of partitions for processing.

  • value_threshold (float) – Raster cells with values in the first band that are less than this threshold will not be emitted.

Return type:

DataFrame

Returns:

Cell Centroids or Boundaries and Values

Example:

bdt.processors.raster_to_vector(
        "/path/to/tiff.tif",
        "centroid",
        num_partitions=1,
        value_threshold=2.0)
bdt.processors.tabulate_area(df, tiff_path, pid_field='uuid', stat_mode='ratio', shape_field='SHAPE', acceleration='mild')#

Calculates area statistics for the intersection of raster and polygon.

The output statistics calculated depend on the statMode parameter. The stats are calculated per unique land cover type. The possible modes are:

1. count This mode will append two columns, approx. count and approx. area. Raster cells are only counted if their centroid is contained in the parcel. The total area is approximated the total area of the raster cells whose midpoint are contained within the polygon. The entire area of a raster cell is counted even if only part of it intersects the parcel.

2. area This mode will append two columns, exact count and exact area. Raster cells are counted if any part of the cell intersects the parcel. If a raster cell partly intersects the parcel, then only the area that intersects the parcel will be counted. An equal area projection is recommended.

3. ratio This mode will append two columns, exact count and ratio. Raster cells are counted if any part of the cell intersects the parcel. Areas are calculated exactly and are taken as a ratio of the area of the overall parcel

Performance Note: The time to run a tabulate area process depends on the size of the input raster, the number of polygons, and most importantly, the extent of each polygon. The larger the extent of each polygon the larger the number of raster cells that need to be processed on a single task. If tasks appear stuck try chopping the polygon into smaller pieces, for example by using ST_Chop3 and choosing a cell size that breaks the polygon into smaller pieces. In this case use the area mode to get the area of each piece and then group the final output by the unique id and index to sum up the total area.

The raster geotiff and the vector polygon data must be the same spatial reference. The input tif raster file is put onto the driver and executor filesystems using SparkContext.addFile(…). Note: Spark does not allow adding two files with the same name but with different content. If you want to use this processor with two different tiffs with the same file name, either rename one of the files or allow spark to overwrite the existing file by setting the “spark.files.overwrite” configuration property to “true”.

Note: This processor requires additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Note: Rasters with a pixel/bit depth of 8-bit signed are not supported. Please convert the raster to have a pixel depth of 8-bit unsigned if there are no negative values, or to 16-bit signed for negative value support. This can be done in ArcGIS Pro.

Parameters:
  • df (DataFrame) – The input DataFrame

  • tiff_path (str) – The path to the raster data in .tif format

  • pid_field (str) – The name of the unique id column in the input DataFrame.

  • stat_mode (str) – The statistic mode.

  • shape_field (str) – The name of the SHAPE Struct field in the input DataFrame.

  • acceleration (str) – The parcel geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘mild’.

Return type:

DataFrame

Returns:

DataFrame.

Example:

bdt.processors.tabulate_area(
         df,
         "path/to/raster.tiff",
         "pid",
         stat_mode = "ratio",
         shape_field = "SHAPE")
bdt.processors.vector_to_raster(df, output_path, cell_size, image_dim=256, wkid=3857, comp_type='LZW', comp_quality=0.8, geom_field='SHAPE', attr_field='value', desc_field='desc', no_data=nan, swap_xy=False, antialiasing=False)#

Processor that converts geometry vectors to a raster image in TIF format. The input dataframe must have a geometry field and an attribute and description field for each geometry. The attribute field is the value assigned to the raster cells the geometry is drawn on.

If the geometry is a point, a 3x3 section of raster cells around the point will be populated with its value instead of the single raster cell the point is in.

Geometries are drawn in the order they are in the input dataframe. If geometries overlap, the value of the raster cell will be the value of the last geometry that was drawn on that cell. So, the order of the geometries in the input dataframe matters.

The geometries are partitioned by description. Each description partition is further partitioned by cellSize (QR). Each of the QR partitions is written to a separate TIF file. A sub folder for each description is created in the output path folder. All QR partitions with that description are written to that sub folder.

Outputs a dataframe with the path to each partition’s TIF file and the minimum x and y coordinate of the partition. The output TIF files can be put together into a continuous mosaic because the TIFs are all adjacent to each other.

Existing TIF files will be overwritten if their name and location match the output of the processor.

Note: This processor requires additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Parameters:
  • df (DataFrame) – the dataframe with the polygon shape structs and a miid for each geom.

  • output_path (str) – The output folder. This folder MUST exist.

  • cell_size (float) – an expression. The spatial partitioning grid size in units of the spatial reference of the input data. Also the world dimension of each QR image.

  • image_dim (int) – Pixel dimension (width and height) of each QR cell image. Default = 256.

  • wkid (int) – The spatial reference id of the input geometries. Default = 3857. Some spatial references may not work, it is recommended to use 3857 or 4326 if possible.

  • comp_type (str) – The compression type. Default = “LZW”.

  • comp_quality (float) – The compression quality. Set to 0.0 to disable. Default = 0.8.

  • geom_field (str) – The name of the geometry field. Defaults to “SHAPE”.

  • attr_field (str) – The name of the attribute field. Defaults to “value”.

  • desc_field (str) – The name of the description field. Defaults to “desc”.

  • no_data (float) – The no data value. Default = NaN.

  • swap_xy (bool) – Should xy be swapped. Default = False.

  • antialiasing (bool) – Should antialiasing be used. Default = False.

Return type:

DataFrame

Returns:

Paths to each partition’s tiff file and the min x and y coordinate of the partition.

Example:

df: DataFrame = spark.sql(f"SELECT ST_FromText('POINT (1.5 1.5)') AS SHAPE, 20.0 AS value, 'point' AS desc")

outDF = bdt.processors.vector_to_raster(df, "pointRaster", 10.0)
bdt.processors.zonal_statistics_table(df, tiff_path, miid_field, shape_field='SHAPE', no_data='noop', mode='area', deno='polygon', wkid=-1, acceleration='mild')#

Calculates statistics on cell values of a raster within zones defined by polygons from an input dataframe.

Calculated per polygon with overlapping raster cells:

  1. count: number of raster cell values overlapping the polygon used for the calculations

  2. min: minimum raster cell value

  3. max: maximum raster cell value

  4. area: area of the overlapping region with the polygon

  5. mean: mean of overlapping raster cell values

  6. std: standard deviation of overlapping raster cell values

  7. sum: sum of overlapping raster cell values

  8. pct50: 50th percentile of overlapping raster cell values

  9. pct90: 90th percentile of overlapping raster cell values

The raster geotiff and the vector polygon data must be the same spatial reference.The input tif raster file is put onto the driver and executor filesystems using SparkContext.addFile(…). Note: Spark does not allow adding two files with the same name but with different content. If you want to use this processor with two different tiffs with the same file name, either rename one of the files or allow spark to overwrite the existing file by setting the “spark.files.overwrite” configuration property to “true”.

Note: This processor requires additional libraries not included with the BDT jar. See the Setup for Raster guide for the respective environment to use these processors.

Note: Rasters with a pixel/bit depth of 8-bit signed are not supported. Please convert the raster to have a pixel depth of 8-bit unsigned if there are no negative values, or to 16-bit signed for negative value support. This can be done in ArcGIS Pro.

Parameters:
  • df (DataFrame) – the dataframe with the polygon shape structs and a miid for each geom.

  • tiff_path (str) – Path to geotiff.

  • miid_field (str) – (monotonically increasing id) the name of the unique identifier field for each parcel polygon used to link stats back in the output table to original polygon.

  • shape_field (str) – the name of the geometry field. Defaults to “SHAPE”.

  • no_data (str) – how to handle a polygon that overlaps raster cells wih no data, use “noop” to ignore cells with no data or “zero” to replace the value of the cells with no data with zero. Defaults to noop.

  • mode (str) – determines which raster cells overlap the polygon, “area” mode considers all raster cells that overlap the polygon while “cell” mode only considers raster cells whose centroid overlaps the polygon. In “area” mode, raster cells are weighted by the proportion of their overlap with the polygon. Defaults to area mode.

  • deno (str) –

    • Only applicable when mode is “area”. The denominator value used to calculate the weights applied to

    raster cells. “cell” mode applies a weight that is the cell’s overlapping area with the polygon divided by that cell’s total area. “polygon” mode applies a weight that is the cell’s overlapping area with the polygon divided by the overall area of the polygon. Defaults to polygon mode.

  • wkid (int) – the spatial reference identifier. If “-1” provided, the spatial reference of the tiff will be used. Defaults to -1.

  • acceleration (str) – the acceleration degree (mild, medium, hot). Defaults to mild.

Return type:

DataFrame

Returns:

DataFrame with rows of (uuid_field, count, min, max, area, mean, std, sum, pct50, pct90) per non-empty input polygon

Example:

df: DataFrame = spark.sql(f"SELECT ST_FromText('MULTIPOLYGON (((-5936507.275876233 -114352.93351182807,"
                               f"-5936507.275876233 -114412.93351182807,"
                               f"-5936417.275876233 -114412.93351182807,"
                               f"-5936417.275876233 -114352.93351182807,"
                               f"-5936507.275876233 -114352.93351182807)))') AS SHAPE, '1' AS miid") \
    .withMeta("polygon", 102008)

outDF = bdt.processors.zonal_statistics_table(df, "path/to/raster.tiff", geom_field="SHAPE")

Statistics#

These processors generate statistical analysis based on the input data.

Processors#

powerSetStats(df, statsFields, categoryIds, ...)

Processor to produce summary statistics for combinations of columns with categorical values.

selfJoinStatistics(df, cellSize, radius, ...)

Processor to enrich the single dataset with the statistics for neighbors found within the given radius.

univariateOutliers(df, attributeField, factor)

Processor to identify univariate outliers based on mean and standard deviation.

bdt.processors.powerSetStats(df, statsFields, categoryIds, geographyIds, idField=None, cache=False)#

Processor to produce summary statistics for combinations of columns with categorical values.

Combinations will be limited to those that appear in rows of the table.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • statsFields (List[str]) – The field names to compute statistics for.

  • categoryIds (List[str]) – Column names of categories to create combinations from.

  • geographyIds (List[str]) – Column names of geographies to create combinations from.

  • idField (str) – The column field of ids. Used to prevent possible over-counting, such as when there are overlapping geographies represented in the data.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Summary statistics for combinations of columns.

Example:

bdt.processors.powerSetStats(
        df,
        statsFields = ["TotalInsuredValue","TotalReplacementValue","YearBuilt"],
        geographyIds = ["FID","ST_ID","CY_ID","ZP_ID"],
        categoryIds = ["Quarter","LineOfBusiness"],
        idField = "Points_ID",
        cache = False)
bdt.processors.selfJoinStatistics(df, cellSize, radius, statisticsFields, emitEnrichedOnly=True, shapeField='SHAPE', cache=False, extentList=None, depth=16, acceleration='none')#

Processor to enrich the single dataset with the statistics for neighbors found within the given radius.

For example, for a given feature, find all other features within the search radius, calculate COUNT,MEAN,STDEV,MAX,MIN,SUM for each configured field and append them. Each feature does not consider itself a neighbor within its search radius.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • cellSize (float) – The spatial partitioning cell size.

  • radius (float) – The search radius.

  • statisticsFields (Array[string]) – The field names to compute statistics for.

  • emitEnrichedOnly (bool) – If set to false, when no rhs features are found, a lhs feature is enriched with null values. If set to true, when no rhs features are found, the lhs feature is not emitted in the output.

  • shapeField (str) – The name of the SHAPE Struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

  • extentList (List[float]) – The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].

  • depth (int) – The spatial index depth.

  • acceleration (str) – The geometry acceleration type. Can speed up geometry operations. Can be one of ‘none’, ‘mild’, ‘medium’, or ‘hot’. Default ‘none’.

Return type:

DataFrame

Returns:

Enrichment with statistics for neighbors found within a given radius.

Example:

bdt.processors.selfJoinStatistics(
         df,
         ["floor(tip_amount/100)", "floor(total_amount/100)"],
         cache = False)
bdt.processors.univariateOutliers(df, attributeField, factor, cache=False)#

Processor to identify univariate outliers based on mean and standard deviation. A feature is an outlier if it field value is beyond mean +/- factor * standard deviation.

Parameters:
  • df (DataFrame) – The input DataFrame

  • attributeField (str) – The numerical field used to identify outliers.

  • factor (float) – The factor to scale the standard deviation by.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Univariate outliers based on mean and standard deviation.

Example:

bdt.processors.univariateOutliers(
         df,
         attributeField = "price",
         factor = 1.0,
         cache = False)

Other#

Processors#

These processors and functions provide various capabilities.

addMetadata(df, geometryType, wkid[, ...])

IMPORTANT: This processor will be deprecated in the next major version BDT release.

from_feature_class(feature_class_name[, ...])

Converts an ArcGIS feature class to a Spark DataFrame.

timeFilter(df, timeField, minTimestamp, ...)

Filter rows in the DataFrame based on a time field.

to_feature_class(df, feature_class_name, wkid)

Converts a Spark DataFrame to an ArcGIS feature class.

to_geo_pandas(df, wkid[, shape_field])

Convert a DataFrame to a GeoPandas DataFrame.

to_geo_pandas_layers(dfs, wkid[, name_lst, ...])

Use GeoPandas to visualize a list of DataFrames on a map.

bdt.processors.addMetadata(df, geometryType, wkid, shapeField='SHAPE', cache=False)#

IMPORTANT: This processor will be deprecated in the next major version BDT release. It is recommended to use the .withMeta DataFrame implicit method instead.

Set the geometry and spatial reference metadata on the shape column. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.

Parameters:
  • df (DataFrame) – The input DataFrame

  • geometryType (str) – The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon.

  • wkid (int) – The spatial reference id.

  • shapeField (str) – The name of the SHAPE struct field.

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

DataFrame with SHAPE column.

Example:

df.withMeta(
    "Point",
     4326,
    "SHAPE")
bdt.processors.from_feature_class(feature_class_name, fields='*', where_clause=None)#

Converts an ArcGIS feature class to a Spark DataFrame. This function is only for use in ArcGIS Pro. Geometry columns stored as WKB in the feature class can be converted to the BDT Shape Struct using st_fromWKB.

Note: When converting empty geometries to a feature class in the memory workspace, the WKB will be null. However, when doing the same in any other workspace (like the scratch workspace), the WKB will be an empty byte array. Empty byte arrays can produce problems if they are converted back into a Spark DataFrame and used with BDT. It is best to filter empty byte arrays before using BDT functions that accept WKB.

Parameters:
  • feature_class_name (str) – Name of the feature class.

  • fields (List[str]) – List of fields to include in the DataFrame. Pass “*” to select all fields. Default is “*”.

  • where_clause (str) – An optional SQL where clause that limits the records returned. Default is None for no where clause.

Return type:

DataFrame

Returns:

Spark DataFrame.

Example:

bdt.processors.from_feature_class(
    "Sample_FeatureClass",
    fields=["ID", "SHAPE"],
    where_clause="ID >= 0"
)
bdt.processors.timeFilter(df, timeField, minTimestamp, maxTimestamp, cache=False)#

Filter rows in the DataFrame based on a time field. The attribute time value has to be strictly between minTimestamp and maxTimestamp. The attribute time values must all be in either timestamp, date, or unix millisecond format..

Parameters:
  • df (DataFrame) – The input DataFrame

  • timeField (str) – The name of the attribute time column.

  • minTimestamp (str) – The minimum time value. Must be in timestamp format: yyyy-mm-dd hh:mm:ss

  • maxTimestamp (str) – The maximum time value. Must be in timestamp format: yyyy-mm-dd hh:mm:ss

  • cache (bool) – To persist the outgoing DataFrame or not.

Return type:

DataFrame

Returns:

Filtered DataFrame.

Example:

bdt.processors.timeFilter(
         df,
         timeField = "time",
         minTimestamp = "1997-11-15 00:00:00",
         maxTimestamp = "1997-11-17 00:00:00",
         cache = False)
bdt.processors.to_feature_class(df, feature_class_name, wkid, workspace='memory', shape_field='SHAPE')#

Converts a Spark DataFrame to an ArcGIS feature class. This function is only for use in ArcGIS Pro. ArcGIS Pro does not support Long values. Cast or exclude the fields that are of Long type before using this function. Features in the memory workspace will not be persisted when Pro is closed and should only be used for temporary feature classes.

Note: ArcGIS Pro does not support Long values over 53 bits. This includes H3 indexes. Cast these fields (to a string for example) or exclude them before using this function.

Note: When converting empty geometries to a feature class in the memory workspace, the WKB will be null. However, when doing the same in any other workspace (like the scratch workspace), the WKB will be an empty byte array. Empty byte arrays can produce problems if they are converted back into a Spark DataFrame and used with BDT. It is best to filter empty byte arrays before using BDT functions that accept WKB.

Parameters:
  • df (DataFrame) – Spark DataFrame.

  • feature_class_name (str) – Name of the output feature class.

  • wkid (int) – Spatial reference ID.

  • workspace (str) – Workspace where the feature class will be created. Can be “memory”, “scratch” or a path to a geodatabase. Default is “memory”. Note: when using paths in Pro on Windows, use double backslashes or prepend an “r” to the string.

  • shape_field (str) – The name of the SHAPE Struct in the input DataFrame. Default “SHAPE”.

Return type:

None

Returns:

Nothing. The feature class is created in the specified workspace.

Example:

bdt.processors.to_feature_class(sa_df,
                                "Sample_FeatureClass",
                                4326,
                                workspace="<path_to_sample_gdb>.gdb",
                                shape_field="SHAPE")
bdt.processors.to_geo_pandas(df, wkid, shape_field='SHAPE')#

Convert a DataFrame to a GeoPandas DataFrame. Call the explore function on the resulting GeoDataFrame to visualize the geometry in a notebook.

This function requires additional Python libraries to use. See the GeoPandas section of the Concepts page for more information on GeoPandas visualization.

Processor to_geo_pandas is not supported in Amazon EMR.

IMPORTANT:

  • Columns containing WKB (not including the one specified in the shape_field) or arrays must be dropped before calling this function.

  • Ensure there are no Null entries in the shape_field column of the input DataFrame. If there are, filter them out before calling this function.

  • Ensure the spark.sql.execution.arrow.pyspark.enabled spark config is set to true.

Parameters:
  • df (DataFrame) – The DataFrame to convert.

  • wkid (int) – The spatial reference id.

  • shape_field (str) – The name of the SHAPE struct field. Defaults to “SHAPE”.

Returns:

GeoDataFrame. A GeoPandas DataFrame.

Python Example:

df = spark \
    .createDataFrame([(1, "POINT (1 1)"), (2, "POINT (2 2)")], ["ID", "WKT"]) \
    .selectExpr("ST_FromText(WKT) AS SHAPE")

gdf = bdt.processors.to_geo_pandas(df, 4326)
gdf.explore()

Dataframe Implicit Example:

df = spark \
    .createDataFrame([(1, "POINT (1 1)"), (2, "POINT (2 2)")], ["ID", "WKT"]) \
    .selectExpr("ST_FromText(WKT) AS SHAPE")

gdf = df.to_geo_pandas(4326)
gdf.explore()
bdt.processors.to_geo_pandas_layers(dfs, wkid, name_lst=None, shape_field='SHAPE')#

Use GeoPandas to visualize a list of DataFrames on a map. Each DataFrame will be visualized as a separate layer. Use the layer controls on the output map to toggle the visibility of each layer.

The shape_field in each input DataFrame must have the same spatial reference.

This function requires additional Python libraries to use. See the GeoPandas section of the Concepts page for more information on GeoPandas visualization.

Processor to_geo_pandas_layers is not supported in Amazon EMR.

IMPORTANT:

  • Columns containing WKB (not including the one specified in the shape_field) or arrays must be dropped in each DataFrame before calling this function.

  • Ensure there are no Null entries in the shape_field column of the input DataFrames. If there are, filter them out before calling this function.

  • Ensure the spark.sql.execution.arrow.pyspark.enabled spark config is set to true.

Parameters:
  • dfs (DataFrame) – The list of DataFrames to visualize.

  • wkid (int) – The spatial reference id.

  • name_lst (List[str]) – The list of names for each layer in the output Map corresponding to the input DataFrame list. Defaults to None, which will name the layers from 0 to the number of input DataFrames.

  • shape_field (str) – The name of the SHAPE struct field in each DataFrame. Defaults to “SHAPE”.

Returns:

A Folium Map object.

Python Example:

df1 = spark.sql("SELECT ST_FromText('POLYGON ((30 10, 35 10, 35 20, 30 20))') SHAPE, 1 AS ID")
df2 = spark.sql("SELECT ST_FromText('POLYGON ((0 10, 5 10, 5 20, 0 20))') SHAPE, 2 AS ID")

bdt.processors.to_geo_pandas_layers([df1, df2], 4326, ["df1", "df2"])

Functions#

mid()

A shorter name for monotonically_increasing_id() Returns monotonically increasing 64-bit integers.

bdt.functions.mid()#

A shorter name for monotonically_increasing_id() Returns monotonically increasing 64-bit integers.

Return type:

Column

Returns:

Column of LongType

SQL Example:

spark.sql('''SELECT ST_FromText('POINT (1 1)') AS SHAPE''').createOrReplaceTempView("df")

spark.sql('''SELECT *, mid() AS MID FROM df''')

Python Example:

df = spark \
    .createDataFrame([("POINT (1 1)",)], ["WKT"]) \
    .select(st_fromText(col("WKT")).alias("SHAPE"))

df.select("*", mid().alias("MID"))

Sinks#

bdt.sinks.sinkEGDB(df, options, postSQL=None, shapeField='SHAPE', geometryField='Shape', mode='Overwrite', debug=False)#

Persist to the enterprise geodatabase (aka relational database). Use this sink if the input DataFrame has the [[SHAPE_STRUCT]]. Otherwise, use Spark JDBC writer to persist to the relational database.

Note: When using in a databricks cluster environment, “;trustServerCertificate=true” must be appended to the database URL to avoid SSL errors.

Parameters:
  • df (DataFrame) – The input DataFrame. Must have a Shape column with Shape Column Metadata set.

  • options (Dict) – Apache Spark JDBC options, more information can be found here: https://spark.apache.org/docs/2.4.5/sql-data-sources-jdbc.html

  • postSQL (List[str]) – SQL to run after creating the table. For example, create a spatial index.

  • shapeField (str) – The name of the field for the Shape Column.

  • geometryField (str) – The name of the field to use for the geometry field in the table.

  • mode (str) – Save Mode Option. The possible values for mode are: Append, Overwrite.

  • debug (bool) – print out the physical plan and the generated code.

Example:

bdt.sinks.sinkEGDB(
    df,
    options = {
        "numPartitions": "8",
        "url": f"{url}",
        "user": f"{username}",
        "password": f"{password}",
        "dbtable": "DBO.test",
        "truncate": "false"
    },
    postSQL = [
        "ALTER TABLE DBO.test ADD ID INT IDENTITY(1,1) NOT NULL",
        "ALTER TABLE DBO.test ALTER COLUMN Shape geometry NOT NULL",
        "ALTER TABLE DBO.test ADD CONSTRAINT PK_test PRIMARY KEY CLUSTERED (ID)",
        "CREATE SPATIAL INDEX SIndx_test ON DBO.test(SHAPE) WITH ( BOUNDING_BOX = ( -14786743.1218, 2239452.0547, -5961629.5841, 6700928.5216 ) )"
        ],
    shapeField  = "SHAPE",
    geometryField = "Shape",
    mode = "Overwrite",
    debug = False)