What is Big Data Toolkit
The Big Data Toolkit (BDT) is an Esri Professional Services solution that allows customers to aggregate, analyze, and enrich big data within their existing big data environment.
It is delivered as a term subscription that includes a set of spatial analysis and data interoperability tools that work with an existing big data environment.
BDT can help data scientists and analysts enhance their big data analytics with spatial tools that take advantage of the massive computing capacity they already have.
With Big Data Toolkit, you can:
Geospatially enable your existing big data cluster.
Speed up your spatial analysis on data stored within your existing big data infrastructure.
Configure your workflows with no coding required.
Perform exploratory analysis with the notebook environment capability.
Share analysis results in ArcGIS to improve decision-making.
Use processors that package geospatial analytics libraries from Esri so they can run on your big data cluster, rather than in an ArcGIS environment.
Read data from the sources, protocols, and formats you already use to store your data.
Write analysis results back to your specific formats and data sinks.
Extend the API to solve your specific business problems.
Performance
BDT performance is measured by testing individual processors on massive datasets with a large Spark cluster. Benchmark runtimes are produced on an Azure HDInsight cluster using 40 executors, each with 4 cores and 59 GB of memory. Data sources include delimited and Parquet files on Azure Gen2 Blob store.
Often we need to enrich a point dataset with attributes from a polygon layer. Our workflow takes 600 million point locations, determines which US ZIP Code polygon contains each point, and tags each point with its ZIP Code information; a code sketch follows the table below.
points | polygons | process | time |
---|---|---|---|
600 million points (128GB) | US Zipcodes (26MB) | ProcessorPointInPolygon | 4.2 minutes |
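In v2.3 Python terms, this workflow corresponds roughly to the call sketched below (DataFrame names and the cellSize value are hypothetical; the pointInPolygon API appears again in the Concepts section):
# Python
points_with_zip = spark.bdt.pointInPolygon(
    points,          # 600 million point locations
    zipcodes,        # US ZIP Code polygons
    cellSize = 0.1)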
Another common process is finding the nearest line for each point. Our workflow enriches a point dataset with the name of, and distance to, the nearest street segment. A required input to this process is the radius: how far away, in decimal degrees, to search for a street segment. Our input (0.001) searches within roughly a city block for the nearest street; larger values take longer to process. A code sketch follows the table below.
points | polylines | process | radius | time |
---|---|---|---|---|
600 million points (128GB) | 50 million streets (50GB) | ProcessorDistance | 0.001 | 6.6 minutes |
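The nearest-street workflow corresponds roughly to a distance call (again a sketch with hypothetical DataFrame names and cellSize; the distance API is documented later in this reference):
# Python
nearest_street = spark.bdt.distance(
    points,          # 600 million point locations
    streets,         # 50 million street segments
    0.1,             # cellSize in decimal degrees
    0.001)           # search radius in decimal degrees, roughly a city block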
System Requirements
- Apache Spark cluster on one of the following:
- Cloudera/Hortonworks
- Azure HDInsight
- Amazon EMR
- Azure/AWS Databricks
- Google Dataproc
- Apache Spark 2.4.x is required.
- ArcGIS Pro 2.6 is recommended
- ArcGIS Enterprise (optional)
Setup
Azure Databricks Setup
This section covers how to install and set up Big Data Toolkit in Azure Databricks.
Create Resource Group
- Go to http://portal.azure.com. Under Resource Groups, add a new resource group.
- Go to your resource group. Add a storage account in the same region as your resource group.
Create Storage Account
- Search for Storage Account and follow these steps:
- Use the same region, standard performance, and select LRS
- Make sure to enable hierarchical namespace
Add Databricks Service to Resource Group
- In your resource group, add Azure Databricks Service.
- Use trial to save on costs and select the appropriate region
Add Notebook to Databricks Workspace
Install & configure Databricks CLI
- Launch Databricks workspace.
- Click on Generate New Token
- Save the token in a safe place
- Execute pip3 install databricks-cli. Set the following environment variables in your .bash_profile:
export DATABRICKS_HOST=https://westus2.azuredatabricks.net
export DATABRICKS_TOKEN=YOUR_TOKEN
- In your terminal, execute source ~/.bash_profile.
Use a Principal Key to set up DBFS
Setting up a service principal in your Azure subscription is a prerequisite.
After the service principal is set up, enter the following:
databricks secrets create-scope --scope adls --initial-manage-principal users
databricks secrets put --scope adls --key credential
Install Big Data Toolkit
- Create a Cluster
- Go to Libraries and Install new and upload the Jar file
- Install the Python Egg using the Python Egg library type, and install 'deprecation' using PyPI (a CLI alternative is sketched after this list)
- BDT will not work on a High Concurrency cluster. Use a Standard cluster instead.
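As an alternative to the UI steps above, the libraries can be attached with the Databricks CLI. This is a sketch that assumes the legacy databricks-cli libraries command and hypothetical DBFS paths; substitute your own cluster id and file locations.
# Upload the BDT artifacts to DBFS, then attach them to the cluster
databricks fs cp bdt.jar dbfs:/FileStore/jars/bdt.jar
databricks fs cp bdt.egg dbfs:/FileStore/jars/bdt.egg
databricks libraries install --cluster-id <cluster-id> --jar dbfs:/FileStore/jars/bdt.jar
databricks libraries install --cluster-id <cluster-id> --egg dbfs:/FileStore/jars/bdt.egg
databricks libraries install --cluster-id <cluster-id> --pypi-package deprecation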
How to access Azure Blob Storage
- You can access data in a blob container by mounting:
Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
source = "abfss://rawdata@cdcvh.dfs.core.windows.net",
mountPoint = "/mnt/data",
extraConfigs = configs)
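The configs referenced above hold the service principal credentials. Below is a minimal Python sketch that assumes the 'adls' secret scope and 'credential' key created earlier; <application-id> and <tenant-id> are placeholders for your service principal values.
# Python
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="adls", key="credential"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
dbutils.fs.mount(
  source = "abfss://rawdata@cdcvh.dfs.core.windows.net",
  mount_point = "/mnt/data",      # the mount point used throughout this walkthrough
  extra_configs = configs)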
Local Jupyter Notebook Setup
This section covers how to use Big Data Toolkit with Jupyter Notebook locally.
Prerequisites
To set up Jupyter Notebook, the following prerequisites are needed:
- Spark 2.4.5
- Java
- Pyspark
- Spark environment variables need to be configured
- Jupyter needs to be installed with a package manager such as pip
- Install deprecation
# Set the following environment variables in your bash or zsh profile to use Spark:
export SPARK_HOME={spark-location}
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_LOCAL_IP=localhost
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab --ip=0.0.0.0 --port 8888 --allow-root --no-browser --NotebookApp.token=""'
# Install Jupyter:
pip install notebook
# Install deprecation:
pip install deprecation
Launch Jupyter Notebook with BDT
Start the Notebook with the PySpark driver
pyspark \
  --conf spark.submit.pyFiles="/Users/{User}/Downloads/{Egg}" \
  --conf spark.jars="/Users/{User}/Downloads/{Jar}"
Start using BDT with Jupyter Notebook
Walkthrough
Importing Big Data Toolkit
import bdt
spark.withBDT("/mnt/data/test.lic")
- Use a test license to authorize BDT
ST Functions
spark.sql("""
SHOW USER FUNCTIONS LIKE 'ST_*'
""").show(5)
+------------+
| function|
+------------+
| st_area|
|st_asgeojson|
| st_ashex|
| st_asjson|
| st_asqr|
+------------+
spark.sql("""
SHOW USER FUNCTIONS LIKE 'H3*'
""").show()
+----------------+
| function|
+----------------+
| h3distance|
| h3kring|
|h3kringdistances|
| h3tochildren|
| h3togeo|
| h3togeoboundary|
| h3toparent|
| h3tostring|
+----------------+
- Big Data Toolkit ST functions get their "ST" prefix from the PostGIS naming convention
- PostGIS docs can be found here: postgis docs
- The functions are highly optimized by the Apache Spark Catalyst engine
- Big Data Toolkit also includes functions that don't start with ST. Generally speaking, a function that does not take or return a shape will not carry the "ST" prefix
- Examples of non-ST functions in Big Data Toolkit include the Uber H3 functions and others
ST Function Examples
- Big Data Toolkit can be used either in SQL or in Python, depending on which you are most comfortable with.
- Two different ways of producing the same output DataFrame are shown below: pure Spark SQL, and a hybrid of Spark SQL and the Python DataFrame API.
Do it all in Spark SQL
- Create a temporary view of the spark DataFrame so that it can be referenced in a Spark SQL Statement
- Write the entire query in SQL
df1 = spark.range(10)
df1.createOrReplaceTempView("df1")
df1_o = spark.sql("""
SELECT *, ST_Buffer(ST_MakePoint(x, y), 100.0, 3857) AS SHAPE
FROM
(SELECT *, LonToX(lon) as x, LatToY(lat) as y
FROM
(SELECT id, 360-180*rand() AS lon, 360-180*rand() AS lat
FROM df1))
""")
df1_o.show()
+---+------------------+------------------+--------------------+--------------------+--------------------+
| id| lon| lat| x| y| SHAPE|
+---+------------------+------------------+--------------------+--------------------+--------------------+
| 0| 209.7420517387036|278.69107966694395|2.3348378397488922E7|-1.64374601454058...|{ ...|
| 1| 235.6965940336686|283.60464624333105|2.6237624829536907E7|-1.35615066742416...|{ ...|
| 2|253.00393806010976| 183.6534861987022|2.8164269553544343E7| -406980.1151084085|{ ...|
| 3| 260.1383894167863| 193.1159327466332|2.8958473045658957E7| -1472980.4194758015|{ ...|
| 4|339.14158668989427|319.97117910147915| 3.775306873714188E7| -4870131.33823664|{ ...|
| 5|196.72906017810777|350.15890505545616| 2.189977880326623E7| -1100932.2276620644|{ ...|
| 6| 308.3393018275092| 218.5508226001505| 3.432417407099181E7| -4657533.480692809|{ ...|
| 7|230.71769139172113| 319.8814906385373| 2.568337592272603E7| -4883178.711132153|{ ...|
| 8| 328.6877953661514| 183.3260391179804| 3.658935801012368E7| -370461.10529312206|{ ...|
| 9|290.42449965744925|189.24602659107038| 3.23299074157585E7| -1033759.5256822291|{ ...|
+---+------------------+------------------+--------------------+--------------------+--------------------+
Hybrid of Spark SQL and Python DataFrame API
- Using 'selectExpr' from the python DataFrame API allows for a hybrid approach
- This syntax allows for the use of SQL, but makes the query less complicated to write
df2 = spark.range(10)\
.selectExpr("id","360-180*rand() lon","180-90*rand() lat")\
.selectExpr("*","LonToX(lon) x","LatToY(lat) y")\
.selectExpr("*","ST_Buffer(ST_MakePoint(x,y), 100.0, 4326) SHAPE")
df2.show()
+---+------------------+------------------+--------------------+--------------------+--------------------+
| id| lon| lat| x| y| SHAPE|
+---+------------------+------------------+--------------------+--------------------+--------------------+
| 0| 272.2434268912087| 99.48671731291171| 3.030599965334515E7|1.5876415679293292E7|{ ...|
| 1|357.53193763641457|158.27029821614542| 3.980027324001811E7| 2479103.475519385|{ ...|
| 2|265.19248738341645|134.20606970025696|2.9521092657723542E7| 5747387.678019282|{ ...|
| 3|328.91483911816016|109.22908582796377|3.6614632404985085E7|1.1324393353312327E7|{ ...|
| 4| 198.1155773828232| 99.50944059423078|2.2054125192471266E7|1.5861086452229302E7|{ ...|
| 5|183.39843143144566| 113.097849987619| 2.041581999923363E7|1.0128237129727399E7|{ ...|
| 6| 241.2867789263583|170.76182038746066|2.6859921365231376E7| 1032874.515275041|{ ...|
| 7|192.59185824421752| 98.05268650607889| 2.143922759067662E7| 1.692579219007473E7|{ ...|
| 8|244.69326236261554|175.35379731732678|2.7239129366751254E7| 517780.7017250775|{ ...|
| 9|317.54181519860066|157.47594852504096|3.5348593173480004E7| 2574561.293889059|{ ...|
+---+------------------+------------------+--------------------+--------------------+--------------------+
Processors
- Processors are highly optimized functions that are typically designed to handle more complex tasks than ST Functions.
- However, some simpler ST Functions still have a processor equivalent (like ST_Area and ProcessorArea)
- Processors require that input dataframes have metadata for their shape columns.
- Adding metadata gives the processor information about the geometry type and spatial reference (wkid)
- An error will be thrown if a processor receives the wrong geometry type
- Some processors require wkid 3857, like distanceAllCandidates
Call ProcessorArea on df1:
- Calling a processor is relatively straightforward: pass the DataFrame(s) as the first parameters and optional parameters after that
- This will calculate the area of each buffered point, which should be identical for every row (up to floating point precision)
df1_m = df1_o.withMeta("POLYGON", 3857)
area = P.area(df1_m)
area.show(5)
+---+------------------+------------------+--------------------+--------------------+--------------------+------------------+
| id| lon| lat| x| y| SHAPE| shape_area|
+---+------------------+------------------+--------------------+--------------------+--------------------+------------------+
| 0| 209.7420517387036|278.69107966694395|2.3348378397488922E7|-1.64374601454058...|{ ...| 31393.50203038831|
| 1| 235.6965940336686|283.60464624333105|2.6237624829536907E7|-1.35615066742416...|{ ...| 31393.50203038831|
| 2|253.00393806010976| 183.6534861987022|2.8164269553544343E7| -406980.1151084085|{ ...| 31393.50203039233|
| 3| 260.1383894167863| 193.1159327466332|2.8958473045658957E7| -1472980.4194758015|{ ...| 31393.50203039616|
| 4|339.14158668989427|319.97117910147915| 3.775306873714188E7| -4870131.33823664|{ ...|31393.502030451047|
+---+------------------+------------------+--------------------+--------------------+--------------------+------------------+
Concepts
Shape Column
Format
spark.sql("SELECT ST_MakePoint(-47.36744,-37.11551) AS SHAPE").printSchema()
root
|-- SHAPE: struct (nullable = false)
| |-- WKB: binary (nullable = true)
| |-- XMIN: double (nullable = true)
| |-- YMIN: double (nullable = true)
| |-- XMAX: double (nullable = true)
| |-- YMAX: double (nullable = true)
Shape is stored as a struct for easier handling and processing
The struct is composed of WKB, XMIN, YMIN, XMAX, and YMAX values
Transform Shape struct to WKT before persisting to csv
Parquet can store the Shape struct without additional transformation
Methods
df = (spark
.read
.format("com.esri.spark.shp")
.options(path="/data/poi.shp", format="WKB")
.load()
.selectExpr("ST_FromWKB(shape) as SHAPE")
.withMeta("Point", 4326))
df.selectExpr("ST_AsText(SHAPE) WKT").write.csv("/tmp/sink/wkt")
+---------------------------+
|WKT |
+---------------------------+
|POINT (-47.36744 -37.11551)|
+---------------------------+
Shape structs can be created using multiple methods:
Processors
- ProcessorAddShapeFromGeoJson
- ProcessorAddShapeFromJson
- ProcessorAddShapeFromWKB
- ProcessorAddShapeFromWKT
- ProcessorAddShapeFromXY
SQL functions
- ST_MakePoint
- ST_FromXY
- ST_MakeLine
- ST_MakePolygon
- ST_FromText
- ST_FromWKB
- ST_FromFGDBPoint
- ST_FromFGDBPolygon
- ST_FromFGDBPolyline
- ST_FromGeoJSON
- ST_FromHex
- ST_FromJSON
Export
Transform Shape structs into formats suitable for data interchange:
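For example, a sketch assuming a DataFrame df with a SHAPE struct column (st_asgeojson is listed among the ST functions above; a single-argument form is assumed here):
# Python
df.selectExpr(
    "ST_AsText(SHAPE) AS WKT",            # well-known text, as in the Methods example above
    "ST_AsGeoJSON(SHAPE) AS GEOJSON"      # GeoJSON string; single-argument form assumed
).write.csv("/tmp/sink/export")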
Metadata
# Python
df1 = (spark
.read
.format("csv")
.options(path="/data/taxi.csv", header="true", delimiter= ",", inferSchmea="false")
.schema("VendorID STRING,tpep_pickup_datetime STRING,tpep_dropoff_datetime STRING,passenger_count STRING,trip_distance STRING,pickup_longitude DOUBLE,pickup_latitude DOUBLE,RatecodeID STRING,store_and_fwd_flag STRING,dropoff_longitude DOUBLE,dropoff_latitude DOUBLE,payment_type STRING,fare_amount STRING,extra STRING,mta_tax STRING,tip_amount DOUBLE,tolls_amount DOUBLE,improvement_surcharge DOUBLE,total_amount DOUBLE")
.load()
.selectExpr("tip_amount", "total_amount", "ST_FromXY(pickup_longitude, pickup_latitude) as SHAPE")
.withMeta("Point", 4326, shapeField = "SHAPE"))
df1.schema[-1].metadata
{'type': 33, 'wkid': 4326}
Metadata is set for the DataFrame and describes the geometry
It describes the shape field column that holds the Shape struct
It includes the geometry type and spatial reference (wkid)
Parquet may persist metadata, but metadata must always be specified when reading CSV
Metadata can be set on Sources by specifying geometryType, shapeField, and wkid, or on an existing DataFrame using ProcessorAddMetadata
Cell size
points_in_county = spark.bdt.pointInPolygon(
points,
county,
cellSize = 0.1,
)
Cell size represents the spatial partitioning grid size
Specified in the units of the spatial reference e.g. degrees for WGS84, meters for Web Mercator
Cell size is an optimization parameter used to improve efficiency of processing and is based on your data size and distribution, size of the cluster, and analysis workflow.
Setting doBroadcast = True causes cellSize values to be ignored during processing (a sketch follows at the end of this section).
Cell size must not be smaller than the following values:
For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
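When doBroadcast is used, the rhs DataFrame is broadcast to the executors instead of being spatially partitioned. A sketch using the countWithin API documented later in this reference (DataFrame names are hypothetical):
# Python
counts = spark.bdt.countWithin(
    points,            # lhs: large point DataFrame
    small_polygons,    # rhs: small polygon DataFrame
    1.0,               # cellSize is still required, but is ignored when doBroadcast is True
    doBroadcast = True)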
Cache
# Python
points_in_county = spark.bdt.pointInPolygon(
points,
county,
cellSize = 0.1,
cache = True,
)
points_in_cy_zp = spark.bdt.pointInPolygon(
points_in_county,
zip_codes,
cellSize = 0.1,
cache = True, #Setting cache here substantially reduces total run time
)
pssoverlaps = spark.bdt.powerSetStats(
points_in_cy_zp,
statsFields = ["TotalInsuredValue"],
geographyIds = ["CY_ID","ZP_ID"],
categoryIds = ["Quarter","LineOfBusiness"],
idField = "Points_ID",
cache = False
)
pssoverlaps.cache() #same as setting cache = True above
pssoverlaps.write.mode("overwrite").parquet("/tmp/sink")
pssoverlaps.write.mode("overwrite").csv("/tmp/csv")
Cache uses the DataFrame API cache method. All processors have this option, with the default set to false. Using cache consumes memory but increases the efficiency of multistep workflows.
As shown in the example, leaving cache set to false instead of true can greatly increase the time it takes to run downstream steps.
With multiple sinks there is a benefit to setting cache to true so that the result is computed once and written out to multiple locations.
Calling the DataFrame cache method is the same as setting cache to True.
Extent
An array of xmin, ymin, xmax, ymax values that defines the spatial index extent
In units of the spatial reference e.g. degrees for WGS84, meters for Web Mercator, etc
Defaults to World Geographic [-180.0, -90.0, 180.0, 90.0]
When using a projected coordinate system, the extent must be specified
Try to tailor the extent for the data to reduce processing of empty areas and improve performance
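As a sketch, a nearest-feature enrichment on Web Mercator (wkid 3857) data might pass an extent, in meters, that covers only the study area. The DataFrame names and extent values below are hypothetical; the distance API is documented later in this reference.
# Python
nearest = spark.bdt.distance(
    points_3857,       # assumed point DataFrame with metadata ("Point", 3857)
    streets_3857,      # assumed polyline DataFrame with metadata ("Polyline", 3857)
    500.0,             # cellSize in meters
    100.0,             # search radius in meters
    extentList = [-13170000.0, 3990000.0, -13150000.0, 4010000.0])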
LMDB
LMDB stands for Lightning Memory-Mapped Database and is used by BDT for routing and street network capabilities.
Copies of the LMDB are deployed to every Spark node to provide fast read access.
Processors:
SQL Functions:
BDT v2
The version number in a title, for example ProcessorArea v2.1, indicates the version in which the item first became available: ProcessorArea has been available since v2.1, while a title marked v2 indicates availability since v2.0.
If new Processor configuration options are added, the version availability is noted next to the property, e.g. emitPippedOnly (v2.1).
Sources v2
A source can be thought of as the component that describes the data that the Spark application will be consuming.
SourceGeneric v2
Hocon
# csv
{
name = taxi
type = com.esri.source.SourceGeneric
format = csv
options = {
path = ./src/test/resources/taxi.csv
header = true
delimiter = ","
inferSchema = false
}
schema = "VendorID STRING,tpep_pickup_datetime STRING,tpep_dropoff_datetime STRING,passenger_count STRING,trip_distance STRING,pickup_longitude DOUBLE,pickup_latitude DOUBLE,RatecodeID STRING,store_and_fwd_flag STRING,dropoff_longitude DOUBLE,dropoff_latitude DOUBLE,payment_type STRING,fare_amount STRING,extra STRING,mta_tax STRING,tip_amount DOUBLE,tolls_amount DOUBLE,improvement_surcharge DOUBLE,total_amount DOUBLE,SHAPE) as SHAPE", "tip_amount", "total_amount"]
where = "VendorID = '2'"
geometryType = point
shapeField = SHAPE
wkid = 4326
cache = false
numPartitions = 8
}
Hocon
# parquet
{
name = taxi
type = com.esri.source.SourceGeneric
format = parquet
options = {
path = ./src/test/resources/taxi.parquet
mergeSchema = true
}
where = "VendorID = '2'"
geometryType = point
shapeField = SHAPE
wkid = 4326
cache = false
numPartitions = 8
}
Hocon
# JDBC
{
name = service_area
type = com.esri.source.SourceGeneric
format = jdbc
options = {
url = "jdbc:sqlserver://localhost:1433;databaseName=bdt2;integratedSecurity=false"
user = "you"
password = "your secret"
// Microsoft Spatial Type function.
query = "SELECT *, Shape.STAsBinary() WKB FROM BFS_HEXAGON_SPLIT_ONE_T_2"
}
// BDT Spatial Type function.
selectExpr = ["M_ID", "ST_FromWKB(WKB) SHAPE"]
geometryType = Polygon
shapeField = SHAPE
wkid = 3857
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sourceGeneric(
"csv",
{"path": "./src/test/resources/taxi.csv", "header": "true", "delimiter": ",", "inferSchema": "true"},
schema = "VendorID STRING,tpep_pickup_datetime STRING,tpep_dropoff_datetime STRING,passenger_count STRING,trip_distance STRING,pickup_longitude DOUBLE,pickup_latitude DOUBLE,RatecodeID STRING,store_and_fwd_flag STRING,dropoff_longitude DOUBLE,dropoff_latitude DOUBLE,payment_type STRING,fare_amount STRING,extra STRING,mta_tax STRING,tip_amount DOUBLE,tolls_amount DOUBLE,improvement_surcharge DOUBLE,total_amount DOUBLE, SHAPE STRING",
selectExpr = ["ST_FromXY(pickup_longitude, pickup_latitude) as SHAPE", "tip_amount", "total_amount"],
where = "VendorID = '2'",
geometryType = "Point",
shapeField = "SHAPE",
wkid = 4326,
cache = True,
numPartitions = 8)
# Spark API
spark\
.read\
.format("csv")\
.options(path="./src/test/resources/taxi.csv", header="true", delimiter= ",", inferSchmea="true")\
.schema("VendorID STRING,tpep_pickup_datetime STRING,tpep_dropoff_datetime STRING,passenger_count STRING,trip_distance STRING,pickup_longitude DOUBLE,pickup_latitude DOUBLE,RatecodeID STRING,store_and_fwd_flag STRING,dropoff_longitude DOUBLE,dropoff_latitude DOUBLE,payment_type STRING,fare_amount STRING,extra STRING,mta_tax STRING,tip_amount DOUBLE,tolls_amount DOUBLE,improvement_surcharge DOUBLE,total_amount DOUBLE, SHAPE STRING")\
.load()\
.selectExpr("ST_FromXY(pickup_longitude, pickup_latitude) as SHAPE", "tip_amount", "total_amount")\
.where("VendorID = '2'")\
.withMeta("Point", 4326, shapeField = "SHAPE")\
.repartition(8)\
.cache()
Read a dataset from external storage.
SourceGeneric is a BDT v2 module for reading generic file types that Spark already supports.
- If the input data has a shape column:
- Specify a geometryType, shapeField, and wkid. This will set the metadata of the shape column.
- If the shape column is WKT, then use ST_FromText on the shape column in selectExpr.
- If the shape column is WKB, then use ST_FromWKB on the shape column in selectExpr.
- If the shape column is shape struct, then no ST function is needed.
If the input data does not have a shape struct column, do not provide geometryType, shapeField, and wkid. These values will default to None.
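For instance, a sketch of the WKT case using the Spark API (hypothetical file and column names):
# Python
df = (spark
    .read
    .format("csv")
    .options(path="./data/zones.csv", header="true", delimiter=",")
    .schema("ZONE_ID STRING, SHAPE STRING")
    .load()
    .selectExpr("ZONE_ID", "ST_FromText(SHAPE) as SHAPE")
    .withMeta("Polygon", 4326, shapeField = "SHAPE"))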
Configuration
Required
format
- description: Supported formats by Spark DataFrameReader
- type: String
options
- description: Lists Spark options for each format
- type: Map[String, String]
Optional
schema
- description: Specifies the schema by using the input DDL-formatted string.
- type: String
- default value: None
selectExpr
- description: Select a subset of the fields.
- type: Array[String]
- default value: None
where
- description: Where clause to filter the data.
- type: String
- default value: None
geometryType (v2.2)
- description: The geometry type to be read. For a standalone table, use None.
- type: String
- default value: None
shapeField (v2.2)
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
wkid (v2.2)
- description: The spatial reference id.
- type: String
- default value: 4326
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
numPartitions
- description: Re-partitions the outgoing DataFrame.
- type: Int
- default value: None
References
- databricks-sources
- hive-warehouse-connector
- hive-warehouse
- move-hive-tables
- accessing-hive-from-spark-sql
- hdinsight-hadoop
- hive-support-in-spark-shell
- hdinsight-version-release
- hdinsight-hive-connector
SourceSQL v2
Hocon
{
name = zip
type = com.esri.source.SourceSQL
sql = "select 'hello' as world"
cache = false
}
Hocon
{
name = zip
type = com.esri.source.SourceSQL
sql = "SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"
cache = false
}
Hocon
# Example for shape column with WKB values
{
name = zip
type = com.esri.source.SourceSQL
sql = "SELECT ST_FromWKB(SHAPE) FROM parquet.`examples/src/main/resources/users.parquet`"
geometryType = polygon
shapeField = SHAPE
wkid = 4326
cache = false
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sourceSQL(
"SELECT * FROM parquet.`./src/test/resources/States_st.parquet`",
geometryType = "Polygon",
shapeField = "SHAPE",
wkid = 4326,
cache = True)
# Spark API
spark\
.sql("SELECT * FROM parquet. `./src/test/resources/States_st.parquet`")\
.withMeta("Polygon", 4326, shapeField = "SHAPE")\
.cache()
com.esri.source.SourceSQL
Generate a static DataFrame from sql.
You can run SQL on files directly, for example:
SELECT * FROM parquet.`examples/src/main/resources/users.parquet`
- If the input data has a shape column:
- Specify a geometryType, shapeField, and wkid. This will set the metadata of the shape column.
- If the shape column is WKT, then use ST_FromText on the shape column in the sql statement.
- If the shape column is WKB, then use ST_FromWKB on the shape column in the sql statement.
- If the shape column is shape struct, then no ST function is needed.
If the input data does not have shape struct column, do not provide geometryType, shapeField, and wkid. These values will default to None.
Capable of reading directly from a hive table.
Configuration
Required
- sql
- description: SQL to generate a static DataFrame or to read files directly.
- type: String
Optional
geometryType (v2.2)
- description: The geometry type to be read. For a standalone table, use None.
- type: String
- default value: None
shapeField (v2.2)
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
wkid (v2.2)
- description: The spatial reference id.
- type: String
- default value: 4326
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
SourceFGDB v2
Hocon
{
name = bgs
type = com.esri.source.SourceFGDB
path = "./src/test/resources/Redlands.gdb"
table = BlockGroups
sql = "SELECT TOTPOP_CY, Shape FROM BlockGroups WHERE ID = 060650301011"
geometryType = Polygon
geometryField = Shape
shapeField = SHAPE
wkid = 4326
cache = false
numPartitions = 4
}
Hocon
{
name = standalone
type = com.esri.source.SourceFGDB
path = "./src/test/resources/standalone_table.gdb"
table = standalone_table
sql = "SELECT * FROM standalone_table"
geometryType = None
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sourceFGDB(
"./src/test/resources/Redlands.gdb",
"Policy_25M",
"SELECT * FROM Policy_25M",
geometryType = "Point",
geometryField = "Shape",
shapeField = "SHAPE",
wkid = 4326,
cache = True,
numPartitions = 4)
# Spark API
spark\
.read\
.format("com.esri.gdb")\
.options(path="./src/test/resources/Redlands.gdb", name="Policy_25M")\
.load()\
.selectExpr("ST_FromFGDBPoint(Shape) as NEW_SHAPE", "*")\
.drop("Shape")\
.withColumnRenamed("NEW_SHAPE", "SHAPE")\
.withMeta("Point", 4326)\
.repartition(4)\
.cache()
com.esri.source.SourceFGDB
Read and create a DataFrame from the file geodatabase. [[SHAPE_STRUCT]] will be created when geometryType is not set to None. Unlike other sources, SourceFGDB sets the metadata on [[SHAPE_FIELD]] in the outgoing DataFrame.
Configuration
Required
path
- description: The path to the file geodatabase.
- type: String
table
- description: The name of the feature class or standalone table.
- type: String
sql
- description: Use SQL to select and/or filter fields. For the feature class, make sure to include the geometry field in the projection so that it can be turned into [[SHAPE_STRUCT]]. 'FROM' clause should use the table name specified with 'table' configuration.
- type: String
geometryType
- description: The geometry type to be read. For a standalone table, use None.
- type: String
Optional
geometryField
- description: The name of the geometry field in the feature class. Not applicable when geometryType is set to None.
- type: String
- default value: Shape
shapeField
- description: The name of the SHAPE struct field. Not applicable when geometryType is set to None.
- type: String
- default value: SHAPE
wkid
- description: The spatial reference id.
- type: Integer
- default value: 4326
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
numPartitions
- description: The number of partitions. If not specified, the default parallelism would be set.
- type: Integer
- default value: None
SourceSHP v2.2
Hocon
{
name = bgs
type = com.esri.source.SourceSHP
path = "./src/test/resources/Redlands.shp"
sql = "SELECT TOTPOP_CY, SHAPE FROM BlockGroups WHERE ID = 060650301011"
table = BlockGroups
geometryType = Polygon
shapeField = SHAPE
wkid = 4326
cache = false
numPartitions = 4
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sourceSHP(
"./src/test/resources/shp",
"Dev_Ops",
"select * from Dev_Ops",
"Point",
shapeField="SHAPE",
wkid=4326,
cache=True,
numPartitions=1)
# Spark API
spark\
.read\
.format("com.esri.spark.shp")\
.options(path="./src/test/resources/shp/", format="WKB")\
.load()\
.selectExpr("ST_FromWKB(shape) as SHAPE_NEW", "*")\
.drop("shape")\
.withColumnRenamed("SHAPE_NEW", "SHAPE")\
.withMeta("Point", 4326)\
.repartition(1)\
.cache()
com.esri.source.SourceSHP
Read and create a DataFrame from a shapefile. SourceSHP sets the metadata on [[SHAPE_FIELD]] in the outgoing DataFrame.
Configuration
Required
path
- description: The path to the shapefile.
- type: String
table
- description: The name of the feature class or standalone table.
- type: String
sql
- description: Use SQL to select and/or filter fields. 'FROM' clause should use the table name specified with 'table' configuration.
- type: String
geometryType
- description: The geometry type to be read. Possible values are Point, Polyline, Polygon.
- type: String
Optional
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
wkid
- description: The spatial reference id.
- type: Integer
- default value: 4326
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
numPartitions
- description: The number of partitions. If not specified, the default parallelism would be set.
- type: Integer
- default value: None
SourceHive v2
Hocon
{
name = taxi
type = com.esri.source.SourceHive
properties = {
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
}
preHQL = []
mainHQL = "SELECT * FROM taxi"
database = default
cache = false
numPartitions = 8
}
Hocon
# Example for shape column with WKT values
{
name = taxi
type = com.esri.source.SourceHive
properties = {
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
}
preHQL = []
mainHQL = "SELECT ST_FromText(SHAPE) FROM taxi"
database = default
geometryType = point
shapeField = SHAPE
wkid = 4326
cache = false
numPartitions = 8
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sourceHive(
{"hive.metastore.uris": "thrift://127.0.0.1:9083"},
"SELECT ST_FromText(SHAPE) FROM taxi",
[],
"default",
geometryType = "Point",
shapeField = "SHAPE",
wkid = 4326,
cache = False,
numPartitions = 8)
com.esri.source.SourceHive
Read from hive storage.
- If the input data has a shape column:
- Specify a geometryType, shapeField, and wkid. This will set the metadata of the shape column.
- If the shape column is WKT, then use ST_FromText on the shape column in the preHQL or mainHQL statement.
- If the shape column is WKB, then use ST_FromWKB on the shape column in the preHQL or mainHQL statement.
- If the shape column is shape struct, then no ST function is needed.
If the input data does not have shape struct column, do not provide geometryType, shapeField, and wkid. These values will default to None.
https://stackoverflow.com/questions/53044191/spark-warehouse-vs-hive-warehouse
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
Configuration
Required
properties
- description: Hive properties
- type: Map[String, String]
mainHQL
- description: Hive QL
- type: String
Optional
preHQL
- description: A set of Hive QLs to execute before the main HQL.
- type: Array[String]
- default value: []
database
- description: The database to use
- type: String
- default value: default
geometryType (v2.2)
- description: The geometry type to be read. For a standalone table, use None.
- type: String
- default value: None
shapeField (v2.2)
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
wkid (v2.2)
- description: The spatial reference id.
- type: String
- default value: 4326
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
numPartitions
- description: Re-partitions the outgoing DataFrame.
- type: Int
- default value: None
Processors v2
ProcessorAddFields v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddFields
expressions = ["(A/B) as C"]
cache = false
}
# Python
# v2.2
spark.bdt.processorAddFields(
df,
["(A/B) as C"],
cache = False)
# v2.3
spark.bdt.addFields(
df,
["(A/B) as C"],
cache = False)
com.esri.processor.ProcessorAddFields
Add new fields using SQL expressions.
Configuration
Required
- expressions
- description: A Seq of SQL expressions.
- type: Seq[String]
Optional
cache
- description: Persist the outgoing DataFrame.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddMetadata v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddMetadata
geometryType = Point
wkid = 4326
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorAddMetadata(
df,
"Point",
4326,
shapeField = "SHAPE",
cache = False)
# v2.3
df.withMeta(
"Point",
4326,
"SHAPE")
com.esri.processor.ProcessorAddMetadata
Set the geometry and spatial reference metadata on the [[SHAPE_STRUCT]] column.
Configuration
Required
geometryType
- description: The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon.
- type: String
wkid
- description: The spatial reference id.
- type: Integer
Optional
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddShapeFromGeoJson v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddShapeFromGeoJson
geoJsonField = GEOJSON
geometryType = Point
wkid = 4326
keep = false
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorAddShapeFromGeoJson(
df,
"GEOJSON",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.addShapeFromGeoJson(
df,
"GEOJSON",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAddShapeFromGeoJson
Add the SHAPE column based on the configured GeoJson field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
Consider using ST_FromGeoJSON
Configuration
Required
geoJsonField
- description: The name of the GeoJson field.
- type: String
geometryType
- description: The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
- type: String
Optional
wkid
- description: The spatial reference id.
- type: Integer
- default value: 4326
keep
- description: Whether to retain the original GeoJson field or not.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE Struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddShapeFromJson v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddShapeFromJson
geometryType = Polygon
jsonField = JSON
keep = false
wkid = 4326
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorAddShapeFromJson(
df,
"JSON",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.addShapeFromJson(
df,
"JSON",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAddShapeFromJson
Add the SHAPE column based on the configured JSON field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
Consider using ST_FromJson
Configuration
Required
jsonField
- description: The name of the JSON field.
- type: String
geometryType
- description: The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
- type: String
Optional
wkid
- description: The spatial reference id.
- type: String
- default value: 4326
keep
- description: Whether to retain the original JSON field or not.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddShapeFromWKB v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddShapeFromWKB
wkbField = WKB
geometryType = Point
wkid = 4326
keep = false
shapeField = SHAPE
}
# Python
# v2.2
spark.bdt.processorAddShapeFromWKB(
df,
"WKB",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.addShapeFromWKB(
df,
"WKB",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAddShapeFromWKB
Add the SHAPE column based on the configured WKB field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
Consider using ST_FromWKB
Configuration
Required
wkbField
- description: The name of the WKB field.
- type: String
geometryType
- description: The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
- type: String
Optional
wkid
- description: The spatial reference id.
- type: String
- default value: 4326
keep
- description: Whether to retain the original WKB field or not.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddShapeFromWKT v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddShapeFromWKT
wktField = WKT
geometryType = Point
wkid = 4326
keep = false
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorAddShapeFromWKT(
df,
"WKT",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.addShapeFromWKT(
df,
"WKT",
"Point",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAddShapeFromWKT
Add the SHAPE column based on the configured WKT field. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
Consider using ST_FromText
Configuration
Required
wktField
- description: The name of the WKT field.
- type: String
geometryType
- description: The geometry type. The possible values for geometryType are: Unknown, Point, Line, Envelope, MultiPoint, Polyline, Polygon, GeometryCollection.
- type: String
Optional
wkid
- description: The spatial reference id.
- type: String
- default value: 4326
keep
- description: Whether to retain the original WKT field or not.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddShapeFromXY v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddShapeFromXY
xField = x
yField = y
wkid = 4326
keep = false
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorAddShapeFromXY(
df,
"x",
"y",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.addShapeFromXY(
df,
"x",
"y",
wkid = 4326,
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAddShapeFromXY
Add SHAPE column based on the configured X field and Y field.
Consider using ST_FromXY
Configuration
Required
xField
- description: The name of the longitude field.
- type: String
yField
- description: The name of the latitude field.
- type: String
Optional
wkid
- description: The spatial reference id.
- type: String
- default value: 4326
keep
- description: Whether to retain the original xField, yField or not.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorAddWKT v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAddWKT
wktField = WKT
shapeField = SHAPE
keep = false
cache = false
}
# Python
# v2.2
spark.bdt.processorAddWKT(
df,
wktField = "WKT",
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.addWKT(
df,
wktField = "WKT",
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAddWKT
Add a new field containing the well-known text representation.
Consider using ST_AsText
Configuration
Optional
wktField
- description: The name of the field to hold the well-known text.
- type: String
- defaultValue: WKT
keep
- description: Boolean flag whether to keep the SHAPE struct field or not. The default value is false.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorArea v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorArea
areaField = shape_area
shapeField = SHAPE
cache = false
}
# Python
# 2.2
spark.bdt.processorArea(
df,
areaField = "shape_area",
shapeField = "SHAPE",
cache = False)
# 2.3
spark.bdt.area(
df,
areaField = "shape_area",
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorArea
Calculate the area of a geometry.
Consider using ST_Area
Configuration
Optional
areaField
- description: The name of the area column in the DataFrame
- type: String
- default value: shape_area
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorAssembler v2.2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorAssembler
targetIDFields = [ID1, ID2, ID3]
timeField = current_time
timeFormat = timestamp
separator = ":"
trackIDName = trackID
origMillisName = origMillis
destMillisName = destMillis
durationMillisName = durationMillis
numTargetsName = numTargets
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorAssembler(
df,
["ID1", "ID2", "ID3"],
"current_time",
"timestamp",
separator = ":",
trackIDName = "trackID",
origMillisName = "origMillis",
destMillisName = "destMillis",
durationMillisName = "durationMillis",
numTargetsName = "numTargets",
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.assembler(
df,
["ID1", "ID2", "ID3"],
"current_time",
"timestamp",
separator = ":",
trackIDName = "trackID",
origMillisName = "origMillis",
destMillisName = "destMillis",
durationMillisName = "durationMillis",
numTargetsName = "numTargets",
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorAssembler
Processor to assemble a set of targets (points) into a track (polyline). The user can specify a set of target fields that make up a track identifier and a timestamp field in the target attributes, such that all the targets with that same track identifier will be assembled chronologically as polyline vertices. The polyline consists of 3D points where x/y are the target location and z is the timestamp.
The output DataFrame will have the following columns:
- shape: The shape struct that defines the track (polyline)
- trackID: The track identifier. Typically, this is a concatenation of the string representations of the user-specified target fields separated by a colon (:)
- origMillis: The track starting epoch in milliseconds.
- destMillis: The track ending epoch in milliseconds.
- durationMillis: The track duration in milliseconds.
- numTargets: The number of targets in the track.
Configuration
Required
targetIDFields
- desc: The combination of target attribute fields to use as a track identifier.
- type: Array[String]
timeField
- desc: The time field.
- type: String
timeFormat
- desc: The format of the time input. Has to be one of: long, date, timestamp
- type: String
Optional
separator
- desc: The separator used to join the target id fields. The joined target id fields become the trackID.
- type: String
- defaultValue: ":"
trackIDName
- desc: The name of the output field to hold the track identifier as a polyline attribute.
- type: String
- defaultValue: trackID
origMillisName
- desc: The name of the output field to hold the track starting epoch in milliseconds as a polyline attribute.
- type: String
- defaultValue: origMillis
destMillisName
- desc: The name of the output field to hold the track ending epoch in milliseconds as a polyline attribute.
- type: String
- defaultValue: destMillis
durationMillisName
- desc: The name of the output field to hold the track duration in milliseconds as a polyline attribute.
- type: String
- defaultValue: durationMillis
numTargetsName
- desc: The name of the output field to hold the number of targets in the track as a polyline attribute.
- type: String
- defaultValue: numTargets
shapeField
- description: The name of the SHAPE Struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- desc: The input feature class name.
- type: String
- defaultValue: _default
output
- desc: The output feature class name
- type: String
- defaultValue: _default
ProcessorBuffer v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorBuffer
distance = 2.0
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorBuffer(
df,
2.0,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.buffer(
df,
2.0,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorBuffer
Buffer geometries by a radius.
Consider using ST_Buffer
Configuration
Required
- distance
- description: The buffer distance.
- type: Double
Optional
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorCentroid v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorCentroid
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorCentroid(
df,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.centroid(
df,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorCentroid
Produces centroids of features.
Consider using ST_Centroid
Configuration
Optional
shapeField
- description: The name of the SHAPE Struct field.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorCountWithin v2.2
Hocon
{
inputs = [_default, _default]
output = _default
type = com.esri.processor.ProcessorCountWithin
cellSize = 1
keepGeometry = false
emitContainsOnly = false
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorCountWithin(
df1,
df2,
1.0,
keepGeometry = False,
emitContainsOnly = False,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.countWithin(
df1,
df2,
1.0,
keepGeometry = False,
emitContainsOnly = False,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorCountWithin
Processor to produce a count of all lhs features that are contained in a rhs feature.
If emitContainsOnly is set to true (the default), this processor will only output rhs features that contain at least one lhs feature. To get all rhs features in the output, emitContainsOnly must be set to false. This will include rhs features that contain 0 lhs features. IMPORTANT: There will be a non-trivial time cost if emitContainsOnly is set to false.
If keepGeometry is set to false, the output format will be [rhs attributes + lhs count]. This is the default. If keepGeometry is set to true, the output format will be [rhs shape + rhs attributes + lhs count].
For example, let's assume that
Polygon with attributes [NAME: A, ID: 1] contains 2 points. Polygon with attributes [NAME: A, ID: 2] contains 1 point.
With keepGeometry set to false, the output rows will be: [NAME: A, ID: 1, 2] [NAME: A, ID: 2, 1]
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
- cellSize
- description: The spatial partitioning cell size.
- type: Double
Optional
keepGeometry
- description: To keep the rhs ShapeStruct in the output or not.
- type: Boolean
- default value: false
emitContainsOnly
- description: To only emit rhs geometries that contain lhs geometries.
- type: Boolean
- default value: false
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- description: The spatial index depth.
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorDifference v2.2
Hocon
{
inputs = [blockgroups, zipcodes]
output = _default
type = com.esri.processor.ProcessorDifference
cellSize = 0.1
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorDifference(
df1,
df2,
0.1,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.difference(
df1,
df2,
0.1,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.combine.ProcessorDifference
Per each left hand side (lhs) feature, find all relevant right hand side (rhs) features and take the set-theoretic difference of each rhs feature with the lhs feature. The lhs and rhs features can each be of point, polyline, or polygon type.
When n relevant rhs features are found, the lhs feature is emitted n times, and the lhs geometry is replaced with the difference of the lhs geometry and the rhs geometry. When a lhs feature has no overlapping rhs features, nothing is emitted. The output table can therefore have fewer rows than the number of input lhs features.
The final schema is: lhs attributes + [[SHAPE_FIELD]] + rhs attributes.
Cell size must not be smaller than the following values:
For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
- cellSize
- desc: The spatial partitioning grid size
- type: Double
Optional
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- desc: The spatial index depth
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorDissolve v2
Hocon
{
type = com.esri.processor.ProcessorDissolve
fields = [TYPE]
singlePart = false
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorDissolve(
df,
fields = ["TYPE"],
singlePart = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.dissolve(
df,
fields = ["TYPE"],
singlePart = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorDissolve
Dissolve features either by both attributes and spatial relationship, or spatially only.
Performs the topological union operation on two geometries.
The supported geometry types are multipoint, polyline and polygon.
singlePart indicates whether the returned geometry is multipart or not.
The output schema is: fields + SHAPE Struct.
Configuration
Optional
fields
- description: The collection of field names to dissolve by.
- type: Array[String]
- default value: []
singlePart
- description: Indicates whether the returned geometry is Multipart or not.
- type: Boolean
- default value: true
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: true
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorDissolve2 v2
Hocon
{
type = com.esri.processor.ProcessorDissolve2
fields = [TYPE]
singlePart = true
sort = false
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorDissolve2(
df,
fields = ["TYPE"],
singlePart = True,
sort = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.dissolve2(
df,
fields = ["TYPE"],
singlePart = True,
sort = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorDissolve2
This implementation is the same as ProcessorDissolve in BDT 1.0.
Dissolve features either by both attributes and spatial relationship, or spatially only.
Performs the topological union operation on two geometries.
The supported geometry types are multipoint, polyline and polygon.
singlePart indicates whether the returned geometry is multipart or not.
The output schema is: fields + SHAPE Struct.
Configuration
Optional
fields
- description: The collection of field names to dissolve by.
- type: Array[String]
- default value: []
singlePart
- description: Indicates whether the returned geometry is Multipart or not.
- type: Boolean
- default value: true
sort
- description: Sort by the fields or not
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorDistance v2.1
Hocon
{
inputs = [blockgroups, zipcodes]
output = _default
type = com.esri.processor.ProcessorDistance
cellSize = 0.1
radius = 0.003
emitEnrichedOnly = false
doBroadcast = false
distanceField = distance
lhsShapeField = SHAPE
rhsShapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
}
# Python
# v2.2
spark.bdt.processorDistance(
df1,
df2,
0.1,
0.003,
emitEnrichedOnly = True,
doBroadcast = False,
distanceField = "distance",
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.distance(
df1,
df2,
0.1,
0.003,
emitEnrichedOnly = True,
doBroadcast = False,
distanceField = "distance",
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorDistance
Find a target feature within the search distance for a source feature.
Per each left hand side (lhs) feature, find the closest right hand side (rhs) feature within the specified radius. The LHS feature type can be point, polyline, or polygon. The RHS feature type can be point, polyline, or polygon.
When the lhs and rhs feature types are point, the distance is in meters. When a lhs feature has no rhs features within the specified radius, that lhs feature is emitted with null values for rhs attributes.
The final schema is: lhs attributes + [[SHAPE_FIELD]] + rhs attributes + distance.
Extent must be specified when using a projected spatial reference. The default extent assumes world geographic (WGS84/4326).
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
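For example, a minimal sketch of dropping null-shape rows before using a broadcast-enabled processor (plain PySpark, assuming both inputs use a shape column named SHAPE):
# Python
# Minimal sketch (plain PySpark): drop rows whose SHAPE column is null before
# running a processor with doBroadcast = True. Column name "SHAPE" is the default.
from pyspark.sql.functions import col

lhs_clean = df1.filter(col("SHAPE").isNotNull())
rhs_clean = df2.filter(col("SHAPE").isNotNull())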
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Consider using ST_Distance
Configuration
Required
cellSize
- description: The spatial partitioning cell size. note: a value is required but cellSize is ignored when doBroadcast = True
- type: Double
radius
- description: The search radius in units of the spatial reference system.
- type: Double
Optional
emitEnrichedOnly (v2.1)
- description: When true, LHS records that have nearest RHS records will be emitted. LHS records with null values as RHS attribute won't be emitted.
- type: Boolean
- default value: true (note: hocon default is false)
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
distanceField
- description: The name of the distance field to be added.
- type: String
- default value: distance
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- description: The spatial index depth.
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorDistanceAllCandidates v2.1
Hocon
{
inputs = [_default, _default]
output = _default
type = com.esri.processor.ProcessorDistanceAllCandidates
cellSize = 1
radius = 1
emitEnrichedOnly = true
doBroadcast = false
distanceField = distance
lhsShapeField = SHAPE
rhsShapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorDistanceAllCandidates(
df1,
df2,
1.0,
1.0,
emitEnrichedOnly = True,
doBroadcast = False,
distanceField = "distance",
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.distanceAllCandidates(
df1,
df2,
1.0,
1.0,
emitEnrichedOnly = True,
doBroadcast = False,
distanceField = "distance",
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorDistanceAllCandidates
Processor to find all features from the right hand side dataset within a given search distance.
Enriches lhs features with the attributes of rhs features found within the specified distance.
If emitEnrichedOnly is set to false, when no rhs features are found, a lhs feature is enriched with null
values.
When emitEnrichedOnly is set to true, when no rhs features are found, the lhs feature is not emitted in the output.
The distance between the lhs and corresponding rhs geometry is appended to the end of each row. When the input sources are points, the distance is calculated in meters.
When multiple right hand side features (A, B, C) are found within the specified distance of a left hand side feature (X), the output is:
X,A
X,B
X,C
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
cellSize
- description: The spatial partitioning cell size. note: a value is required but cellSize is ignored when doBroadcast = True
- type: Double
radius
- description: The search radius in units of the spatial reference system.
- type: Double
Optional
emitEnrichedOnly
- description: When true, LHS records that have nearest RHS records will be emitted. LHS records with null values as RHS attribute won't be emitted.
- type: Boolean
- default value: true
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
distanceField
- description: The name of the distance field to be added.
- type: String
- default value: distance
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- description: The spatial index depth.
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorDropFields v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorDropFields
fields = ["tip_amount", "total_amount"]
}
# Python
# v2.2
spark.bdt.processorDropFields(
df,
["tip_amount", "total_amount"],
cache = False)
# v2.3
spark.bdt.dropFields(
df,
["tip_amount", "total_amount"],
cache = False)
com.esri.processor.ProcessorDropFields
Drop the configured fields.
Configuration
Required
- fields
- description: The collection of field names to drop.
- type: Array[String]
- default value: []
Optional
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorEliminatePolygonPartArea v2.3
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorEliminatePolygonPartArea
areaThreshold = 1.0
shapeField = SHAPE
cache = false
}
# Python
# v2.3
spark.bdt.eliminatePolygonPartArea(
df,
1.0,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorEliminatePolygonPartArea
Remove the holes of a polygon that have area below the given threshold.
Consider using ST_EliminatePolygonPartArea
Configuration
Required
- areaThreshold
- description: If a hole has area less than this value, then it will be removed.
- type: double
Optional
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorEliminatePolygonPartPercent v2.3
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorEliminatePolygonPartPercent
percentThreshold = 10.0
shapeField = SHAPE
cache = false
}
# Python
# v2.3
spark.bdt.eliminatePolygonPartPercent(
df,
1.0,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorEliminatePolygonPartPercent
Remove the holes of a polygon whose area as a percentage of the whole polygon is less than this value.
Consider using ST_EliminatePolygonPartPercent
Configuration
Required
- percentThreshold
- description: If the area of a hole as a percentage of the whole polygon is less than this value, then the hole will be removed.
- type: double
Optional
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorExtentFilter v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorExtentFilter
oper = CONTAINS
extent = [-180.0, -90.0, 180.0, 90.0]
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorExtentFilter(
df,
"CONTAINS",
extentList = [-180.0, -90.0, 180.0, 90.0],
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.extentFilter(
df,
"CONTAINS",
extentList = [-180.0, -90.0, 180.0, 90.0],
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorExtentFilter
Processor to filter geometries that meet the operation relation with the given envelope. The operation is executed on the geometry in relation to the envelope: read as 'envelope intersects/overlaps/within/contains geometry'
Configuration
Required
oper
- description: The type of operation
- type: String. Has to be one of ["INTERSECTS", "CONTAINS", "OVERLAPS", "WITHIN"].
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
Optional
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorFeatureFilter (application of com.esri.processor.ProcessorSQL)
Hocon
#Suppose your feature columns are [price, location_Name], where price is of type Int and location_Name is of type String.
#Return all rows where price is below 400000 and location_Name is equal to "San Francisco"
{
inputs = [_default]
output = _default
type = com.esri.processor.ProcessorSQL
sql = "SELECT * FROM _default WHERE x < 400000 AND y = 'San Fransisco'"
cache = false
}
Hocon
# Return the SHAPE column for rows where price is less than or equal to 700000 and location_Name starts with "San"
{
inputs = [_default]
output = _default
type = com.esri.processor.ProcessorSQL
sql = "SELECT SHAPE FROM _default WHERE x <= 700000 AND y = 'San%'"
cache = false
}
In BDT 2, ProcessorSQL accomplishes the same task as ProcessorFeatureFilter in BDT 1. Use ProcessorSQL instead. Your filter condition will have to be expressed as a SQL statement, and the condition must have a boolean return value.
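If you are working in the Python bindings, the same filter can also be expressed with plain Spark SQL; a minimal sketch, assuming a DataFrame df with price and location_Name columns:
# Python
# Sketch (plain Spark SQL, not a BDT call): filter on price and location_Name.
df.createOrReplaceTempView("listings")
filtered = spark.sql(
    "SELECT * FROM listings WHERE price < 400000 AND location_Name = 'San Francisco'")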
ProcessorGeneralize v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorGeneralize
maxDeviation = 0.001
removeDegenerateParts = true
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorGeneralize(
df,
maxDeviation = 0.001,
removeDegenerateParts = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.generalize(
df,
maxDeviation = 0.001,
removeDegenerateParts = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorGeneralize
Generalize geometries using Douglas-Peucker algorithm.
Consider using ST_Generalize
Configuration
Optional
maxDeviation
- description: The maximum allowed deviation from the generalized geometry to the original geometry. If maxDeviation <= 0 the processor returns the input geometries.
- type: Double
- default value: 0.0
removeDegenerateParts
- description: When true, the degenerate parts of the geometry will be removed from the output.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorGeodeticArea v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorGeodeticArea
curveType = 0
fieldName = geodetic_area
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorGeodeticArea(
df,
curveType=0,
fieldName="geodetic_area",
shapeField="SHAPE",
cache = False)
# v2.3, v.2.4
spark.bdt.geodeticArea(
df,
curveType=0,
fieldName="geodetic_area",
shapeField="SHAPE",
cache = False)
com.esri.processor.ProcessorGeodeticArea
Processor to calculate the geodetic areas of the input geometries. This is a more reliable measurement of area than ProcessorArea because it takes the Earth's curvature into account.
Possible curve type values:
- Geodesic = 0: the shortest distance between two points on an ellipsoid.
- Loxodrome = 1: a line of constant bearing or azimuth, also known as a rhumb line.
- GreatElliptic = 2: the line on a spheroid defined by the intersection at the surface of a plane that passes through the center of the spheroid. When the spheroid flattening is zero (a sphere), a great elliptic is a great circle.
- NormalSection = 3.
- ShapePreserving = 4: segment shapes are preserved in the spatial reference where they are defined. The behavior of the ShapePreserving type can be emulated by densifying the geometry with a small step and then calling a geodetic method using the Geodesic or GreatElliptic curve type.
Consider using ST_GeodeticArea
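For readability in the Python bindings, the curve type integers listed above can be bound to names before calling the processor; a small sketch using the documented values:
# Python
# Named constants for the documented curve type integers.
GEODESIC, LOXODROME, GREAT_ELLIPTIC, NORMAL_SECTION, SHAPE_PRESERVING = 0, 1, 2, 3, 4

out = spark.bdt.geodeticArea(
    df,
    curveType=GEODESIC,
    fieldName="geodetic_area",
    shapeField="SHAPE",
    cache=False)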
Configuration
Optional
curveType
- description: The curve type, represented as an Integer.
- type: Integer
- default value: Geodesic
fieldName
- description: The name of the field to hold geodetic area value.
- type: String
- default value: geodetic_area
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorHex v2
Hocon
{
type = com.esri.processor.ProcessorHex
input = _default
output = _default
shapeField = SHAPE
hexFields = [
{name = "H100", size = 100}
{name = "H200", size = 200}
]
cache = false
}
# Python
# v2.2
spark.bdt.processorHex(
df,
hexFields = {"H100": 100.0, "H200": 200.0},
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.hex(
df,
hexFields = {"H100": 100.0, "H200": 200.0},
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorHex
Enrich a point DataFrame with hex column and row values for the given sizes. The input DataFrame must be of point type and its spatial reference should be WGS84 (4326). The size is in meters.
Consider using ST_AsHex
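If the incoming points are not already in WGS84 (4326), they can be re-projected first with the project processor; a sketch using the documented v2.3 calls, assuming a Web Mercator (3857) input:
# Python
# v2.3 sketch: re-project Web Mercator points to WGS84, then enrich with hex ids.
wgs84_df = spark.bdt.project(df, fromWk="3857", toWk="4326", shapeField="SHAPE", cache=False)
hex_df = spark.bdt.hex(wgs84_df, hexFields={"H100": 100.0, "H200": 200.0}, shapeField="SHAPE", cache=False)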
Configuration
Required
- hexFields
- description: A map of hex field name and hex size.
- type: Map[String, Int]
Optional
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorHexToPolygon v2
Hocon
{
type = com.esri.processor.ProcessorHexToPolygon
input = _default
output = _default
hexField = H100
size = 100
shapeField = SHAPE
cache = false
keep = false
}
# Python
# v2.2
spark.bdt.processorHexToPolygon(
df,
hexField = "H100",
size = 100.0,
keep = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.hexToPolygon(
df,
hexField = "H100",
size = 100.0,
keep = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorHexToPolygon
Replace or create the SHAPE struct field containing hexagon polygons based on the configured hex column row field. The outgoing DataFrame will have the spatial reference of 3857 (102100) WGS84 Web Mercator (Auxiliary Sphere).
https://developers.arcgis.com/web-map-specification/objects/spatialReference/
The input DataFrame does not need to have SHAPE struct field since this processor only depends on the existing hex field.
Consider using ST_FromHex
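For example, a point DataFrame enriched by ProcessorHex can then be turned into hexagon polygons; a sketch chaining the documented v2.3 calls, where points_df is an assumed WGS84 point DataFrame:
# Python
# v2.3 sketch: enrich points with a 100 m hex id, then build the hexagon polygons.
hex_df = spark.bdt.hex(points_df, hexFields={"H100": 100.0}, shapeField="SHAPE", cache=False)
hex_polygons = spark.bdt.hexToPolygon(hex_df, hexField="H100", size=100.0, keep=False, shapeField="SHAPE", cache=False)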
Configuration
Required
hexField
- description: The field containing hex column and row values.
- type: String
size
- description: The hex size.
- type: Double
Optional
keep
- description: Set true to keep the hex field. Set false to drop the hex field.
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorIntersection v2
Hocon
{
type = com.esri.processor.ProcessorIntersection
inputs = [intersector, inGeoms]
output = _default
mask = -1
cellSize = 0.1
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
extent = [-180.0, -90.0, 180.0, 90.0]
cache = false
}
# Python
# v2.2
spark.bdt.processorIntersection(
df1,
df2,
cellSize = 0.1,
mask = -1,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
extentList = [-180.0, -90.0, 180.0, 90.0],
cache = False)
# v2.3
spark.bdt.intersection(
df1,
df2,
cellSize = 0.1,
mask = -1,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
extentList = [-180.0, -90.0, 180.0, 90.0],
cache = False)
com.esri.processor.ProcessorIntersection
Produce the intersection geometries of two spatial datasets.
This is equivalent to ProcessorIntersection2 in BDT 1.x.
This implementation uses a dimension mask as explained in http://esri.github.io/geometry-api-java/doc/Intersection.html
Each LHS row is an intersector. If it intersects a RHS row, the LHS row's shape is replaced by the intersection geometry and the LHS attributes are enriched with the RHS attributes.
If the intersection is empty, nothing will be emitted.
Depending on the dimension mask, multiple intersections could be emitted per intersector (i.e., per LHS row).
The outgoing DataFrame will have the same SHAPE metadata as the RHS SHAPE metadata, and its SHAPE field name will be the same as the LHS SHAPE field name.
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Consider using ST_Intersection
Configuration
Required
- cellSize
- description: The spatial partitioning cell size.
- type: Double
Optional
mask
- description: The dimension mask.
- type: Integer
- default value: -1
doBroadcast
- description: To broadcast the RHS DataFrame to all worker nodes or not. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- description: The spatial index depth.
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorLength v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorLength
lengthField = shape_length
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorLength(
df,
lengthField = "shape_length",
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.length(
df,
lengthField = "shape_length",
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorLength
Calculate the length of a geometry.
Consider using ST_Length
Configuration
Optional
lengthField
- description: The name of the length column in the DataFrame
- type: String
- default value: shape_length
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorMapMatch v2.4
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorMapMatch
idField = oid
timeField = dt
pidField = vid
xField = x
yField = y
snapDist = 50.0
pathDist = 300.0
microPathSize = 5
alpha = 10.0
beta = 1.0
cache = false
}
#Python
outDF = spark.bdt.mapMatch(
df,
idField="oid",
timeField="dt",
pidField="vid",
xField="x",
yField="y",
snapDist=50.0,
pathDist=300.0,
microPathSize=5,
alpha=10.0,
beta=1.0,
cache=False)
com.esri.processor.ProcessorMapMatch
Sequentially snap a sequence of noisy GPS points to their correct path vectors. Depending on the context, "path" is used to refer to both (a) a path of noisy GPS points and (b) a path of vectors on which the GPS points are snapped.
An example use case would be smart-snapping GPS breadcrumbs from vehicle paths to their correct streets.
A network of links is required in LMDB format, configured through the spark properties:
spark.bdt.lmdb.map.size 304857600000
spark.bdt.lmdb.path /data/LMDB_v2_Release
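These are ordinary Spark configuration properties; one way to supply them, for example when building the session in a standalone script (the path and map size are just the example values above), is:
# Python
# Sketch: set the LMDB properties when creating the SparkSession.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.bdt.lmdb.path", "/data/LMDB_v2_Release")
    .config("spark.bdt.lmdb.map.size", "304857600000")
    .getOrCreate())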
This processor uses Web Mercator (srid 3857). The x, y, and distance values are in meters.
Configuration
Required
idField
- description: The name of the unique id column in the input GPS dataframe.
- type: String
timeField
- description: The name of the temporal column in the input GPS dataframe.
- type: String
pidField
- description: The name of the path id column in the input GPS dataframe. GPS points are grouped by path id.
- type: String
xField
- description: The name of gps point x value column in the input GPS dataframe. Must be Web Mercator 3857.
- type: String
yField
- description: The name of gps point y value column in the input GPS dataframe. Must be Web Mercator 3857.
- type: String
snapDist
- description: The max snap distance. GPS points must be less than or equal to this distance from a candidate path vector.
- type: Double
pathDist
- description: The max path vector distance between two snap points.
- type: Double
microPathSize
- description: The micropath size. Each path of gps points is divided into smaller subsections (micropaths) for smart snapping.
- type: Int
Optional
alpha
- description: Penalty weight for the match of the heading of the gps point and the heading of the snapped path vector.
- type: Double
- default value: 10.0
beta
- description: Penalty weight for the perpendicularity of the vector from the gps point to its snapped location on the path vector and the snapped path vector.
- type: Double
- default value: 1.0
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorNearestCoordinate v2.1
Hocon
{
inputs = [points, lines]
output = _default
type = com.esri.processor.ProcessorNearestCoordinate
cellSize = 1.0
snapRadius = 1.0
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
distanceField = distance
xField = X
yField = Y
isOnRightField = isOnRight
distanceForNotSnapped = -999.0
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorNearestCoordinate(
df1,
df2,
cellSize = 1.0,
snapRadius = 1.0,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
distanceField = "distance",
xField = "X",
yField = "Y",
isOnRightField = "isOnRight",
distanceForNotSnapped = -999.0,
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.nearestCoordinate(
df1,
df2,
cellSize = 1.0,
snapRadius = 1.0,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
distanceField = "distance",
xField = "X",
yField = "Y",
isOnRightField = "isOnRight",
distanceForNotSnapped = -999.0,
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.core.processor.ProcessorNearestCoordinate
Processor to find nearest coordinate between two sources in a distributed parallel way. The features on the left-hand-side input are augmented with closest feature attributes on the right-hand-side input. In the implementation, the reduce side creates a spatial index of the right-hand-side features for fast lookup.
The LHS feature type has to be a point. The RHS feature type can be a point, polyline, or polygon type.
When RHS feature type is a point, the distance is in meters.
The input points that are snapped are kept in the output. When a point is not snapped:
- The nearest_distance is set to the input parameter distanceForNotSnapped (default value -999.0).
- The X and Y coordinate values of the snapped point geometry are the same as the input point geometry.
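If only snapped points are needed downstream, the not-snapped rows can be filtered out on the sentinel distance; a sketch assuming the defaults distanceField="distance" and distanceForNotSnapped=-999.0, where out_df is the processor output:
# Python
# Sketch: keep only points that were actually snapped.
from pyspark.sql.functions import col

snapped_only = out_df.filter(col("distance") != -999.0)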
If an LHS feature has multiple nearest coordinates (all with equal distances), the result could differ between doBroadcast=False and doBroadcast=True. Both results are valid.
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
cellSize
- description: The spatial partitioning cell size.
- type: Double
snapRadius
- description: The snapping radius.
- type: Double
Optional
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
xField
- description: The name of the field for the x coordinate of the nearest snapped point.
- type: String
- default value: X
yField
- description: The name of the field for the y coordinate of the nearest snapped point.
- type: String
- default value: Y
distanceField
- description: The name of the distance field to be added.
- type: String
- default value: distance
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
isOnRightField
- description: The name of the isOnRight field.
- type: String
- default value: isShapeOnRight
distanceForNotSnapped
- description: The distance to use when a given point is not snapped.
- type: Double
- default value: -999.0
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- description: The spatial index depth.
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorPointExclude v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorPointExclude
xmin = -10
ymin = -10
xmax = 10
ymax = 10
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorPointExclude(
df,
xmin = -10.0,
ymin = -10.0,
xmax = 10.0,
ymax = 10.0,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.pointExclude(
df,
xmin = -10.0,
ymin = -10.0,
xmax = 10.0,
ymax = 10.0,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorPointExclude
Filter point features that are not in the specified extent, excluding points on the extent boundary.
Configuration
Required
- xmin
- description: xmin value of the extent
- type: Double
- ymin
- description: ymin value of the extent
- type: Double
- xmax
- description: xmax value of the extent
- type: Double
- ymax
- description: ymax value of the extent
- type: Double
Optional
- shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
- cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
- input
- description: The name of the input DataFrame.
- type: String
- default value: _default
- output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorPointInclude v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorPointInclude
xmin = -10
ymin = -10
xmax = 10
ymax = 10
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorPointInclude(
df,
xmin = -10.0,
ymin = -10.0,
xmax = 10.0,
ymax = 10.0,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.pointInclude(
df,
xmin = -10.0,
ymin = -10.0,
xmax = 10.0,
ymax = 10.0,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorPointInclude
Filter point features that are in the specified extent, including points on the extent boundary.
Configuration
Required
- xmin
- description: xmin value of the extent
- type: Double
- ymin
- description: ymin value of the extent
- type: Double
- xmax
- description: xmax value of the extent
- type: Double
- ymax
- description: ymax value of the extent
- type: Double
Optional
- shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
- cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
- input
- description: The name of the input DataFrame.
- type: String
- default value: _default
- output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorPointInPolygon v2
Hocon
{
inputs = [taxi, zip]
output = _default
type = com.esri.processor.ProcessorPointInPolygon
cellSize = 1.0
emitPippedOnly = true
clip = true
doBroadcast = false
pointShapeField = SHAPE
polygonShapeField = SHAPE
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorPointInPolygon(
df1,
df2,
1.0,
emitPippedOnly = True,
take = 1,
clip = True,
doBroadcast = False,
pointShapeField = "SHAPE",
polygonShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.pointInPolygon(
df1,
df2,
1.0,
emitPippedOnly = True,
take = 1,
clip = True,
doBroadcast = False,
pointShapeField = "SHAPE",
polygonShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorPointInPolygon
Enrich a source point feature with the attributes from a target polygon feature which contains the source point feature. A point is enriched if it is inside a polygon. A point is not enriched if it is on an edge or outside of a polygon. The first DataFrame must be of Point type and the second DataFrame must be of Polygon type.
If you know the outgoing DataFrame will be re-used, the cache configuration would help.
The clip parameter is only relevant when doBroadcast is false.
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
- cellSize
- description: The spatial partitioning cell size.
- type: Double
Optional
emitPippedOnly (v2.1)
- description: When set to true, only enriched points are returned.
- type: Boolean
- default value: false
take (v2.1)
- description: The number of overlapped polygons to process. Use -1 to include all overlapping polygons in the result.
- type: Integer
- default value: 1
clip (v2.1)
- description: When set to true, polygons are clipped using the cell while processing.
- type: Boolean
- default value: false
doBroadcast (v2.4)
- description: If set to true, the polygon DataFrame is broadcast to all worker nodes. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
pointShapeField
- description: The name of the SHAPE Struct for point DataFrame.
- type: String
- default value: SHAPE
polygonShapeField
- description: The name of the SHAPE Struct for polygon DataFrame.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame.
- type: Boolean
- default value: false
extent
- desc: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- defaultValue: [-180.0, -90.0, 180.0, 90.0]
depth
- desc: The spatial index depth.
- type: Integer
- defaultValue: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorPowerSetStats v2.2
Hocon
{
input = _default
output = _default
type = com.esri.processor.summary.ProcessorPowerSetStats
statsFields = ["field1", "field2"]
geographyIds = ["geoId1", "geoId2"]
categoryIds = ["catId1", "catId2"]
idField = "ids"
cache = false
}
Hocon
{
input = points_source
output = pssoverlaps
type = com.esri.processor.ProcessorPowerSetStats
statsFields = [TotalInsuredValue,TotalReplacementValue,YearBuilt]
geographyIds = [FID,ST_ID,CY_ID,ZP_ID]
categoryIds = [Quarter,LineOfBusiness]
idField = Points_ID
cache = false
}
# Python
# v2.2
spark.bdt.processorPowerSetStats(
df,
statsFields = ["TotalInsuredValue","TotalReplacementValue","YearBuilt"],
geographyIds = ["FID","ST_ID","CY_ID","ZP_ID"],
categoryIds = ["Quarter","LineOfBusiness"],
idField = "Points_ID",
cache = False)
# v2.3
spark.bdt.powerSetStats(
df,
statsFields = ["TotalInsuredValue","TotalReplacementValue","YearBuilt"],
geographyIds = ["FID","ST_ID","CY_ID","ZP_ID"],
categoryIds = ["Quarter","LineOfBusiness"],
idField = "Points_ID",
cache = False)
com.esri.processor.ProcessorPowerSetStats
Processor to produce summary statistics for combinations of columns with categorical values. Combinations will be limited to those that appear in rows of the table.
The categorical columns from which combinations are created are specified by categoryIds and geographyIds inputs. Only one value will be taken from the value space of a given categorical column per combination. If no value is specified for a categorical column, then "ALL" will be put in its place, and the whole value space of that column will be considered for that combination.
If an idField is specified and repeated ids are found, then statistics will only be counted once per id for a given combination. This prevents double-counting, triple-counting, or any amount of over-counting that would skew a statistic from its true value.
Designed for AccumulationEngine
Configuration
Required
statsFields
- description: The field names to compute statistics for.
- type: Array[String]
categoryIds
- description: Column names of categories to create combinations from.
- type: Array[String]
geographyIds
- description: Column names of geographies to create combinations from.
- type: Array[String]
Optional
idField
- description: The column field of ids. Used to prevent possible over-counting, such as when there are overlapping geographies represented in the data.
- type: String
- default value: None
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorProject v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorProject
fromWk = 4326
toWk = 3857
shapeField = SHAPE
cache = true
}
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorProject
fromWk = "GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137.0,298.257223563]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]]"
toWk = 3857
shapeField = SHAPE
cache = true
}
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorProject
fromWk = 4326
toWk = "PROJCS[\"WGS_1984_Web_Mercator_Auxiliary_Sphere\",GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137.0,298.257223563]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Mercator_Auxiliary_Sphere\"],PARAMETER[\"False_Easting\",0.0],PARAMETER[\"False_Northing\",0.0],PARAMETER[\"Central_Meridian\",0.0],PARAMETER[\"Standard_Parallel_1\",0.0],PARAMETER[\"Auxiliary_Sphere_Type\",0.0],UNIT[\"Meter\",1.0]]"
shapeField = SHAPE
cache = true
}
{
input = _default
output = _default
type = com.esri.processor.ProcessorProject
fromWk = "GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137.0,298.257223563]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]]"
toWk = "PROJCS[\"WGS_1984_Web_Mercator_Auxiliary_Sphere\",GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137.0,298.257223563]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Mercator_Auxiliary_Sphere\"],PARAMETER[\"False_Easting\",0.0],PARAMETER[\"False_Northing\",0.0],PARAMETER[\"Central_Meridian\",0.0],PARAMETER[\"Standard_Parallel_1\",0.0],PARAMETER[\"Auxiliary_Sphere_Type\",0.0],UNIT[\"Meter\",1.0]]"
shapeField = SHAPE
cache = true
}
# Python
# v2.2
spark.bdt.processorProject(
df,
fromWk = "4326",
toWk = "3857",
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.project(
df,
fromWk = "4326",
toWk = "3857",
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorProject v2
Re-project geometry coordinates from one spatial reference to another.
Consider using ST_Project
Configuration
Required
fromWk
- description: wkid or wktext
- type: String
toWk
- description: wkid or wktext
- type: String
Optional
shapeField
- description: The name of the SHAPE Struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false (default value is true for Hocon)
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorRelation v2
Hocon
{
inputs = [df1, df2]
output = _default
type = com.esri.processor.ProcessorRelation
cellSize = 0.1
relation = TOUCHES
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
cache = false
}
# Python
# v2.2
spark.bdt.processorRelation(
df1,
df2,
cellSize = 0.1,
relation = "TOUCHES",
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.relation(
df1,
df2,
cellSize = 0.1,
relation = "TOUCHES",
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorRelation
Enrich a left hand side row with a right hand side row if the specified spatial relation is met. The left hand side row may be emitted multiple times when there are multiple right hand side rows satisfying the specified spatial relationship. When a left hand side row finds no right hand side rows satisfying the specified spatial relationship, it is not emitted.
The shape struct in the emitted row retains the left hand side shape struct.
The possible spatial relationships are: CONTAINS, EQUALS, TOUCHES, DISJOINT, INTERSECTS, CROSSES, OVERLAPS, WITHIN.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
If doBroadcast is true and there are null values in the shape column of either of the input dataframes, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast=false, then rows with null values for shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Consider using ST_Relate
Configuration
Required
- cellSize
- description: The spatial partitioning cell size.
- type: Double
- relation
- description: The spatial relationship. One of CONTAINS, EQUALS, TOUCHES, DISJOINT, INTERSECTS, CROSSES, OVERLAPS, WITHIN.
- type: String
Optional
- doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
- lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
- rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
- cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
- extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
- depth
- description: The spatial index depth.
- type: Integer
- default value: 16
- inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
- output
- description: The name of the output.
- type: String
- default value: _default
ProcessorRouteClosestTarget v2.4
Hocon
{
inputs = [origins, destinations]
output = routes
type = com.esri.processor.ProcessorRouteClosestTarget
maxMinutes = 30
maxMeters = 60375
oxField = OX
oyField = OY
dxField = DX
dyField = DY
origTolerance = 500.0
destTolerance = 500.0
allNodesAtLocation = true
costOnly = false
cache = false
}
#Python
df = spark.bdt.routeClosestTarget(
origin_df,
dest_df,
maxMinutes = 30.0,
maxMeters = 60375.0,
oxField = "OX",
oyField = "OY",
dxField = "DX",
dyField = "DY",
origTolerance = 500.0,
destTolerance = 500.0,
allNodesAtLocation = False,
costOnly = False,
cache = False
)
com.esri.processor.ProcessorRouteClosestTarget
For every origin point, find the closest destination point on the road network within either maxMinutes or maxMeters. The cost is in meters.
The output dataframe will include all columns from both inputs; therefore column names must be unique across both inputs. For example, do not include SHAPE columns in the input, and make the X and Y fields unique, e.g. OX, OY for origins and DX, DY for destinations.
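A small sketch of this preparation, assuming hypothetical input coordinate columns named x and y in both DataFrames:
# Python
# Sketch: drop SHAPE and make coordinate column names unique across both inputs
# before calling spark.bdt.routeClosestTarget (input column names are hypothetical).
origins = origin_df.drop("SHAPE").withColumnRenamed("x", "OX").withColumnRenamed("y", "OY")
dests = dest_df.drop("SHAPE").withColumnRenamed("x", "DX").withColumnRenamed("y", "DY")
The prepared DataFrames can then be passed to the routeClosestTarget call shown above.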
A network is required in LMDB format, configured through the spark properties:
spark.bdt.lmdb.map.size 304857600000
spark.bdt.lmdb.path /data/LMDB_v2_Release
Configuration
Required
maxMinutes
- description: Maximum time in minutes traveled on the road network to get to a given target from the origin.
- type: Double
maxMeters
- description: Maximum distance in meters traveled on the road network to get to a given target from the origin.
- type: Double
oxField
- description: The origin X coordinate field (SR 3857).
- type: String
oyField
- description: The origin Y coordinate field (SR 3857).
- type: String
dxField
- description: The destination X coordinate field (SR 3857).
- type: String
dyField
- description: The destination Y coordinate field (SR 3857).
- type: String
Optional
origTolerance
- description: Snap tolerance for origin point. Default is 500 meters. If an origin point is not snapped, it is not routed
- type: Double
- default value: 500
destTolerance
- description: Snap tolerance for destination points.
- type: Double
- default value: 500
allNodesAtLocation
- description: Whether to return all nodes at a given target location or just the first node.
- type: Boolean
- default value: false
costOnly
- description: To return only the cost of the nearest target or both the cost and the path geometry.
- type: Boolean
- default value: false
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorSelectFields v2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorSelectFields
exprs = ["floor(tip_amount/100)", "floor(total_amount/100)"]
}
# Python
# v2.2
spark.bdt.processorSelectFields(
df,
["floor(tip_amount/100)", "floor(total_amount/100)"],
cache = False)
# v2.3
spark.bdt.selectFields(
df,
["floor(tip_amount/100)", "floor(total_amount/100)"],
cache = False)
com.esri.processor.ProcessorSelectFields
Transform a DataFrame using only the configured expressions. This processor is mainly useful in Hocon. Consider using Spark SQL or select expressions instead when using the Python bindings.
See the Spark Documentation
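In Python, the equivalent result can usually be obtained with Spark's own selectExpr; a minimal sketch:
# Python
# Plain Spark alternative to ProcessorSelectFields.
out = df.selectExpr("floor(tip_amount/100)", "floor(total_amount/100)")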
Configuration
Required
- exprs (note: this property is called 'expressions' in hocon)
- description: The collection of expressions.
- type: Array[String]
Optional
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorSelfJoinStatistics v2.2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorSelfJoinStatistics
cellSize = 1
radius = 1
statisticsFields = [field1, field2]
emitEnrichedOnly = false
shapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorSelfJoinStatistics(
df,
1.0,
1.0,
["field1", "field2"],
emitEnrichedOnly = False,
shapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.selfJoinStatistics(
df,
1.0,
1.0,
["field1", "field2"],
emitEnrichedOnly = False,
shapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorSelfJoinStatistics
Processor to enrich a single dataset with statistics for neighbors found within the given radius. For example, for a given feature, find all other features within the search radius, calculate COUNT, MEAN, STDEV, MAX, MIN, SUM for each configured field, and append them. Each feature does not consider itself a neighbor within its search radius. This processor works with point, polyline, and polygon datasets.
If emitEnrichedOnly is set to false, a lhs feature with no rhs neighbors is enriched with null values. When emitEnrichedOnly is set to true, a lhs feature with no rhs neighbors is not emitted in the output.
For each statistics field, the following statistics are computed and appended.
{fieldName}_COUNT
{fieldName}_MEAN
{fieldName}_STDEV
{fieldName}_MAX
{fieldName}_MIN
{fieldName}_SUM
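For example, for a statistics field named field1, the appended columns can be inspected on the processor output (a sketch, where out_df is the result of the selfJoinStatistics call above):
# Python
# Sketch: inspect the appended neighbor statistics for "field1".
out_df.select("field1_COUNT", "field1_MEAN", "field1_SUM").show()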
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
cellSize
- description: The spatial partitioning cell size.
- type: Double
radius
- description: The search radius.
- type: Double
statisticsFields
- desc: The field names to compute statistics for.
- type: Array[String]
Optional
emitEnrichedOnly
- description: When true, only LHS records that have nearest RHS records will be emitted. LHS records with null values as RHS attribute will not be emitted.
- type: Boolean
- default value: true
shapeField
- description: The name of the SHAPE Struct field.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- description: The spatial index depth.
- type: Integer
- default value: 16
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorServiceArea v2.4
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorServiceArea
xField = X
yField = Y
noOfCosts = 5
hexSize = 125
shapeField = SHAPE
cache = false
}
# Python
# v2.4
spark.bdt.serviceArea(
df,
xField="X",
yField="Y",
noOfCosts=5,
hexSize= 125,
costField="SERVICE_AREA_COST",
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorServiceArea
The X field and Y field must be explicit, top-level columns; nested fields such as SHAPE.XMIN are not supported.
If you wish to use nested fields, bring them to the top level using [[ProcessorSQL]] before this processor (the less data to carry, the better).
The input DataFrame can therefore be a standalone DataFrame or a spatial DataFrame. The processor drops the specified shape field regardless of whether it exists.
Generates service areas per X & Y (in 3857). The cost is based on travel time: 0 to 1 minute, 1 to 2 minute, 2 to 3 minute, and so on up to noOfCosts - 1 to noOfCosts minute service areas are created.
Hence, an input row will be exploded into noOfCosts rows. Each row will be appended with the origination hex id, the SERVICE_AREA_COST field (for example, 0-1, 1-2, ...), an array of the destination hex ids, and a polygon, which could be an empty polygon.
This processor snaps the given X, Y to the closest street, then walks along the network at (hexSize / 2.0) meter intervals until it reaches the last max cost. Once we have the points along the path, we turn them into hexagons and union them to create the service areas.
In the future, there will be options to specify the type of service area polygons such as concave hull, convex hull, hexagon-based (current).
The input must have X & Y in 3857. Since explicit X & Y are required, a [[SHAPE_FIELD]] is not required in the input DataFrame.
The output appends the following columns:
- ORIG_HEX_ID
- SERVICE_AREA_COST ([[DoubleType]]) column
- DEST_HEX_IDS column in the array of string
- [[SHAPE_FIELD]] in 3857. For the rows where we couldn't produce the service areas, they will have empty polygons.
SERVICE_AREA_COST is [[DoubleType]] so that fractional time costs, such as 1.5 minutes, can be supported when required.
The insertion order is very important.
The following application variables are required:
- spark.bdt.lmdb.path /data/LMDB
- spark.bdt.lmdb.map.size 304857600000 (limit in bytes. LMDB needs to know how large our DB might be. Over-estimating is OK.)
https://github.com/lmdbjava/lmdbjava/blob/master/src/test/java/org/lmdbjava/TutorialTest.java#L90
Configuration
Required
xField
- description: The field name to retain the hex centroid X. The default is X.
- type: String
yField
- description: The field name to retain the hex centroid Y. The default is Y.
- type: String
noOfCosts
- description: The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.
- type: Int
hexSize
- description: The hex size in meters.
- type: Double
Optional
costField
- description: The cost field name in the output.
- type: String
- default value: SERVICE_AREA_COST
shapeField
- description: The name of the SHAPE struct field.
- type: String
- default value: SHAPE
cache
- description: Persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input.
- type: String
- default value: _default
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorSimplify v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorSimplify
forceSimplify = false
ogc = false
shapeField = SHAPE
cache = false
}
# Python
# v2.2
spark.bdt.processorSimplify(
df,
forceSimplify = False,
ogc = False,
shapeField = "SHAPE",
cache = False)
# v2.3
spark.bdt.simplify(
df,
forceSimplify = False,
ogc = False,
shapeField = "SHAPE",
cache = False)
com.esri.processor.ProcessorSimplify
Simplify a geometry to make it valid for a Geodatabase to store without additional processing.
WKB and WKT are OGC interchange formats and in general require the geometry to be topologically simple. When storing geometry that may be non-simple as WKB or WKT, call simplify first. WKB is better for interchange with OGC services. If you are interoperating with OGC services, use the WKB format (or WKT) with the option ogc = true.
Consider using ST_Simplify
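A minimal sketch of the ST_Simplify alternative, assuming BDT's SQL functions are registered on the session and the DataFrame df has a SHAPE struct column in WGS84 (4326):
# Python
from pyspark.sql.functions import expr

# Sketch: replace SHAPE with its simplified geometry (wkid 4326, forceSimplify = false)
simplified_df = df.withColumn("SHAPE", expr("ST_Simplify(SHAPE, 4326, false)"))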
Configuration
Optional
forceSimplify
- description: When true, the Geometry will be simplified regardless of the internal IsKnownSimple flag.
- type: Boolean
- default value: false
ogc (v2.2)
- description: When true, simplification follows the OGC specification for the Simple Feature Access v. 1.2.1 (06-103r4).
- type: Boolean
- default value: false
shapeField
- description: The name of the SHAPE Struct for DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorSQL v2
Hocon
{
inputs = [_default]
output = _default
type = com.esri.processor.ProcessorSQL
sql = "SELECT SUM(total_amount) FROM _default"
cache = false
}
# Python
# v2.2
spark.bdt.processorSQL(
df,
sql = "SELECT SUM(total_amount)",
cache = False)
# v2.3
spark.bdt.SQL(
df,
sql = "SELECT SUM(total_amount)",
cache = False)
com.esri.processor.ProcessorSQL
Processor that takes a SQL statement and executes it.
See the Spark Documentation
Configuration
Required
- sql
- description: SQL
- type: String
Optional
cache
- description: To persist the output DataFrame in memory or not
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorSummarizeWithin v2.2
Hocon
{
inputs = [_default, _default]
output = _default
type = com.esri.processor.ProcessorSummarizeWithin
cellSize = 0.1
statsFields = [field1, field2]
keepGeometry = false
emitContainsOnly = false
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorSummarizeWithin(
df1,
df2,
0.1,
["field1", "field2"],
keepGeometry = False,
emitContainsOnly = False,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.summarizeWithin(
df1,
df2,
0.1,
["field1", "field2"],
keepGeometry = False,
emitContainsOnly = False,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorSummarizeWithin
Processor to produce summary statistics of all lhs features that are contained in a rhs feature.
statsFields are fields from the lhs DataFrame that must have numeric values.
For each statistic field, the following lhs statistics are computed and appended to each unique rhs row.
{fieldName}_COUNT
{fieldName}_MEAN
{fieldName}_STDEV
{fieldName}_MAX
{fieldName}_MIN
{fieldName}_SUM
When emitContainsOnly is set to true, only summary statistics of lhs geometries that are contained within a rhs geometry are in the output. Setting emitContainsOnly to false (the default) will output all rhs geometries, even those that do not contain lhs geometries; their lhs statistics fields will be populated with count 0 and null values. IMPORTANT: Setting emitContainsOnly to false will result in a non-trivial time cost.
If doBroadcast is true and there are null values in the shape column of either of the input DataFrames, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast is false, rows with null values in the shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
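An illustrative sketch of the v2.3 Python API shown above; points_df and polygons_df are hypothetical inputs, and "population" is a hypothetical numeric lhs field:
# Python
# Sketch: summarize point attributes within polygons; the output appends
# population_COUNT, population_MEAN, population_STDEV, population_MAX, population_MIN, population_SUM
out = spark.bdt.summarizeWithin(
    points_df,
    polygons_df,
    0.1,
    ["population"],
    keepGeometry = True)
out.printSchema()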
Configuration
Required
cellSize
- desc: The spatial partitioning grid size
- type: Double
statsFields
- desc: The field names to compute statistics for.
- type: Array[String]
Optional
keepGeometry
- Description: To keep the rhsShape Struct in the output or not.
- type: Boolean
- default value: false
emitContainsOnly
- description: When true, only RHS records that contain at least one LHS geometry are emitted. When false, all RHS records are emitted; those containing no LHS geometries have statistics populated with count 0 and null values.
- type: Boolean
- default value: false
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- desc: The spatial index depth
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorSymmetricDifference v2.2
Hocon
{
inputs = [_default, _default]
output = _default
type = com.esri.processor.ProcessorSymmetricDifference
cellSize = 0.1
doBroadcast = false
lhsShapeField = SHAPE
rhsShapeField = SHAPE
cache = false
extent = [-180.0, -90.0, 180.0, 90.0]
depth = 16
}
# Python
# v2.2
spark.bdt.processorSymmetricDifference(
df1,
df2,
0.1,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
# v2.3
spark.bdt.symmetricDifference(
df1,
df2,
0.1,
doBroadcast = False,
lhsShapeField = "SHAPE",
rhsShapeField = "SHAPE",
cache = False,
extentList = [-180.0, -90.0, 180.0, 90.0],
depth = 16)
com.esri.processor.ProcessorSymmetricDifference
Per each left hand side (lhs) feature, find all relevant right hand side (rhs) features and take the set-theoretic symmetric difference of each rhs feature with the lhs feature. Both the lhs and rhs features can be of point, polyline, or polygon type.
The symmetric difference of two geometries A and B is defined as the portions of A and B that do not intersect with each other.
When n relevant rhs features are found, the lhs feature is emitted n times, and the lhs geometry is replaced with the symmetric difference of the lhs geometry and the rhs geometry. When a lhs feature has no overlapping rhs features, nothing is emitted. The output table can therefore have fewer rows than the number of input lhs features.
The final schema is: lhs attributes + [[SHAPE_FIELD]] + rhs attributes.
If doBroadcast is true and there are null values in the shape column of either of the input DataFrames, a NullPointerException will be thrown. These null values should be handled before using this processor. If doBroadcast is false, rows with null values in the shape column are implicitly handled and will be dropped.
Cell size must not be smaller than the following values:
- For Spatial Reference WGS 84 GPS: 0.000000167638064 degrees
- For Spatial Reference WGS 1984 Web Mercator: 0.0186264515 meters
Configuration
Required
- cellSize
- desc: The spatial partitioning grid size
- type: Double
Optional
doBroadcast
- description: If set to true, the rhs DataFrame is broadcast. Set spark.sql.autoBroadcastJoinThreshold accordingly.
- type: Boolean
- default value: false
lhsShapeField
- description: The name of the SHAPE Struct field in the LHS DataFrame.
- type: String
- default value: SHAPE
rhsShapeField
- description: The name of the SHAPE Struct field in the RHS DataFrame.
- type: String
- default value: SHAPE
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
extent
- description: The spatial index extent. Array of the extent coordinates [xmin, ymin, xmax, ymax].
- type: Array[Double]
- default value: [-180.0, -90.0, 180.0, 90.0]
depth
- desc: The spatial index depth
- type: Integer
- default value: 16
inputs
- description: The names of the input DataFrames.
- type: Array[String]
- default value: []
output
- description: The name of the output.
- type: String
- default value: _default
ProcessorTimeFilter v2.1
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorTimeFilter
timestampField = time
minTimestamp = "1997-11-15 00:00:00"
maxTimestamp = "1997-11-17 00:00:00"
cache = false
}
# Python
# v2.2
spark.bdt.processorTimeFilter(
df,
timeField = "time",
minTimestamp = "1997-11-15 00:00:00",
maxTimestamp = "1997-11-17 00:00:00",
cache = False)
# v2.3
spark.bdt.timeFilter(
df,
timeField = "time",
minTimestamp = "1997-11-15 00:00:00",
maxTimestamp = "1997-11-17 00:00:00",
cache = False)
com.esri.processor.ProcessorTimeFilter
Filter rows in the DataFrame based on a time field. The attribute time value has to be strictly between minTimestamp and maxTimestamp. The attribute time values must all be in either timestamp, date, or unix millisecond format.
Configuration
Required
timeField
- description: The name of the attribute time column.
- type: String
minTimestamp
- description: The minimum time value. Must be in timestamp format: yyyy-mm-dd hh:mm:ss
- type: String
maxTimestamp
- description: The maximum time value. Must be in timestamp format: yyyy-mm-dd hh:mm:ss
- type: String
Optional
cache
- description: To persist the output DataFrame in memory or not
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorUnivariateOutliers v2.2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorUnivariateOutliers
attributeField = price
factor = 1.0
cache = false
}
# Python
# v2.2
spark.bdt.processorUnivariateOutliers(
df,
attributeField = "price",
factor = 1.0,
cache = False)
# v2.3
spark.bdt.univariateOutliers(
df,
attributeField = "price",
factor = 1.0,
cache = False)
com.esri.processor.ProcessorUnivariateOutliers
Processor to identify univariate outliers based on the mean and standard deviation. A feature is an outlier if its field value is beyond mean +/- factor * standard deviation. For example, with a mean of 100, a standard deviation of 10, and factor = 1.0, values below 90 or above 110 are flagged as outliers.
Configuration
Required
attributeField
- description: The numerical field used to identify outliers.
- type: String
factor
- description: The factor to scale the standard deviation by.
- type: Double
Optional
cache
- description: To persist the output DataFrame in memory or not
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
ProcessorWebMercator v2.2
Hocon
{
input = _default
output = _default
type = com.esri.processor.ProcessorWebMercator
latField = lat
lonField = lon
xField = X
yField = Y
cache = false
}
# Python
# v2.2
spark.bdt.processorWebMercator(
df,
latField = "lat",
lonField = "lon",
xField = "X",
yField = "Y",
cache = False)
# v2.3
spark.bdt.webMercator(
df,
latField = "lat",
lonField = "lon",
xField = "X",
yField = "Y",
cache = False)
com.esri.processor.ProcessorWebMercator
Add x and y WebMercator coordinates to the output DataFrame. The input dataframe must have Latitude and Longitude columns.
Consider using ST_WebMercator
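Alternatively, the LonToX and LatToY functions documented under Spatial Type Functions produce the same coordinates column by column. A minimal sketch, assuming the functions are registered on the session and the input df has lon/lat columns:
# Python
# Sketch: append Web Mercator X/Y columns with LonToX / LatToY
out = df.selectExpr("*", "LonToX(lon) AS X", "LatToY(lat) AS Y")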
Configuration
Required
latField
- description: The name of the latitude column in the DataFrame.
- type: String
lonField
- description: The name of the longitude column in the DataFrame.
- type: String
Optional
xField
- description: The name of the output WebMercator x value column for the DataFrame.
- type: String
- default value: X
yField
- description: The name of the output WebMercator y value column for the DataFrame.
- type: String
- default value: Y
cache
- description: To persist the outgoing DataFrame or not.
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
Sinks v2
SinkGeneric v2
Hocon
{
input = _input
type = com.esri.sink.SinkGeneric
format = csv
options = {
path = /tmp/sink
}
mode = Overwrite
shapeField = SHAPE
numPartitions = 8
debug = true
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sinkGeneric(
df,
format = "csv",
options = { "path":"/tmp/sink" },
mode = "Overwrite",
shapeField = "SHAPE",
numPartitions = 8,
debug = True)
# Spark API: CSV Format. The internal SHAPE Struct has to be converted to WKT before writing to disk,
# and the struct column itself dropped, since the CSV data source cannot store struct columns.
df \
    .selectExpr("ST_AsText(SHAPE) as WKT", "*") \
    .drop("SHAPE") \
    .repartition(8) \
    .write \
    .format("csv") \
    .mode("Overwrite") \
    .options(**{"path": "/tmp/sink"}) \
    .save()
# Spark API: All Other Formats
df \
    .repartition(8) \
    .write \
    .format("parquet") \
    .mode("Overwrite") \
    .options(**{"path": "/tmp/sink"}) \
    .save()
com.esri.sink.SinkGeneric
Persist the DataFrame to storage.
Configuration
Required
format
- description: csv, parquet, etc
- type: String
options
- description: Map of options
- type: Map[String, String]
Optional
mode
- description: SaveMode. The default value is Overwrite.
- type: String
- default value: Overwrite
shapeField
- description: The name of the field for [[SHAPE_STRUCT]].
- type: String
- default value: SHAPE
numPartitions
- description: Re-partitions the outgoing DataFrame.
- type: Int
- default value: None
debug
- description: print out the physical plan and the generated code
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
SinkConsole v2
Hocon
{
input = $DEFAULT_INPUT
type = com.esri.sink.SinkConsole
numRows = 2
truncate = false
printSchema = true
debug = true
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sinkConsole(
df,
numRows = 2,
truncate = False,
printSchema = True,
debug = True)
# Spark API
printSchema = True
debug = True
if debug:
    df.doDebug()
if printSchema:
    df.printSchema()
df.show(2, truncate=False)
com.esri.sink.SinkConsole
Print the DataFrame to stdout.
Configuration
Optional
numRows
- description: The number of rows to print.
- type: Int
- default value: 20
truncate
- description: Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
- type: Boolean
- default value: false
printSchema
- description: The flag to print schema or not.
- type: Boolean
- default value: true
debug
- description: print out the physical plan and the generated code
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
SinkEGDB v2
Hocon
{
input = _default
type = com.esri.sink.SinkEGDB
options = {
numPartitions = 8
url = "$url"
user = $username
password = "$password"
dbtable = DBO.test
truncate = false
}
postSQL = [
"ALTER TABLE DBO.test ADD ID INT IDENTITY(1,1) NOT NULL",
"ALTER TABLE DBO.test ALTER COLUMN Shape geometry NOT NULL",
"ALTER TABLE DBO.test ADD CONSTRAINT PK_test PRIMARY KEY CLUSTERED (ID)",
"CREATE SPATIAL INDEX SIndx_test ON DBO.test(SHAPE) WITH ( BOUNDING_BOX = ( -14786743.1218, 2239452.0547, -5961629.5841, 6700928.5216 ) )"
]
shapeField = $SHAPE_FIELD
geometryField = Shape
mode = Overwrite
debug = false
}
# Python
spark.bdt.sinkEGDB(
df,
options = {
"numPartitions": "8",
"url": f"{url}",
"user": f"{username}",
"password": f"{password}",
"dbtable": "DBO.test",
"truncate": "false"
},
postSQL = [
"ALTER TABLE DBO.test ADD ID INT IDENTITY(1,1) NOT NULL",
"ALTER TABLE DBO.test ALTER COLUMN Shape geometry NOT NULL",
"ALTER TABLE DBO.test ADD CONSTRAINT PK_test PRIMARY KEY CLUSTERED (ID)",
"CREATE SPATIAL INDEX SIndx_test ON DBO.test(SHAPE) WITH ( BOUNDING_BOX = ( -14786743.1218, 2239452.0547, -5961629.5841, 6700928.5216 ) )"
],
shapeField = "SHAPE",
geometryField = "Shape",
mode = "Overwrite",
debug = False)
com.esri.sink.SinkEGDB
Persist to the enterprise geodatabase (aka relational database). Use this sink if the input DataFrame has the [[SHAPE_STRUCT]]. Otherwise, use SinkGeneric to persist to the relational database.
The numPartitions parameter in the options dictionary should be carefully selected. It controls the maximum number of concurrent JDBC database connections, and by default is set to the number of partitions of the input DataFrame. You can find additional options in the Spark SQL guide to JDBC
Note: postSQL will not be run when the save mode is Overwrite and truncate is set to true. More broadly, postSQL will only be run when a new table is created.
Green (✅) means postSQL is run, red (🛑) means it is not:
- The table does not already exist in the database ✅
- The table already exists in the database
  - Save Mode is Overwrite (default)
    - Truncate is false (default) ✅
    - Truncate is true 🛑
  - Save Mode is Append 🛑
Configuration
Required
- options
- description: Apache Spark JDBC options JDBC To Other Databases
- type: Map[String, String]
Optional
postSQL
- description: SQL to run after creating the table. For example, create a spatial index.
- type: Array[String]
- default value: []
shapeField
- description: The name of the field for [[SHAPE_STRUCT]].
- type: String
- default value: SHAPE
geometryField
- description: The name of the field to use for the geometry field in the table.
- type: String
- default value: Shape
mode
- description: [[SaveMode]] option.
- type: String
- default value: Overwrite
debug
- description: print out the physical plan and the generated code
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
output
- description: The name of the output DataFrame.
- type: String
- default value: _default
SinkTable v2
Hocon
{
input = _input
type = com.esri.sink.SinkTable
format = csv
options = {
path = /tmp/sink
}
mode = Overwrite
shapeField = SHAPE
numPartitions = 8
debug = true
}
# Python
# Deprecated as of v2.4. Use the Spark API instead.
spark.bdt.sinkTable(
df,
format = "csv",
options = { "path": "/tmp/sink" },
mode = "Overwrite",
shapeField = "SHAPE",
numPartitions = 8,
debug = False)
# Spark API: CSV Format. The internal SHAPE Struct has to be converted to WKT before writing to disk,
# and the struct column itself dropped, since the CSV data source cannot store struct columns.
df \
    .selectExpr("ST_AsText(SHAPE_2) as WKT", "*") \
    .drop("SHAPE_2") \
    .repartition(8) \
    .write \
    .format("csv") \
    .mode("Overwrite") \
    .options(**{"path": "/tmp/sink"}) \
    .saveAsTable("tableName")
# Spark API: All Other Formats
df \
    .repartition(8) \
    .write \
    .format("parquet") \
    .mode("Overwrite") \
    .options(**{"path": "/tmp/sink"}) \
    .saveAsTable("tableName")
com.esri.sink.SinkTable
Persist the DataFrame as a Spark table.
Configuration
Required
format
- description: csv, parquet, etc
- type: String
options
- description: Map of options
- type: Map[String, String]
Optional
mode
- description: SaveMode. The default value is Overwrite.
- type: String
- default value: Overwrite
shapeField
- description: The name of the field for [[SHAPE_STRUCT]].
- type: String
- default value: SHAPE
numPartitions
- description: Re-partitions the outgoing DataFrame.
- type: Int
- default value: None
debug
- description: print out the physical plan and the generated code
- type: Boolean
- default value: false
input
- description: The name of the input DataFrame.
- type: String
- default value: _default
SinkBDS v2.2
Hocon
{
inputs = [policy, highways, zipcodes]
type = com.esri.sink.bds.SinkBDS
geometryTypes = [esriGeometryPoint, esriGeometryPolyline, esriGeometryPolygon]
dataSourceNames = [policy, highways, zipcodes]
username = abc
password = def
machines = ["myMachine:9300:9220"]
clusterId = myClusterId
replicationFactor = 0
numShards = -1
responsePath = /tmp/sink
}
Hocon
{
inputs = [taxi]
type = com.esri.sink.bds.SinkBDS
geometryTypes = [esriGeometryPoint]
dataSourceNames = [taxi]
timeInfos = {
"taxi" : {
"startTimeField" : "tpep_pickup_datetime",
"endTimeField" : "tpep_dropoff_datetime",
"hasLiveData" : false,
"timeReference" : {
"timeZone" : "UTC",
"respectsDaylightSaving" : false
},
"timeInterval" : 1,
"timeIntervalUnits" : "esriTimeUnitsHours"
}
}
resolutions = {
"taxi" : {
"precision" : "50m",
"distance_error_pct" : 0.025
}
}
username = abc
password = def
machines = ["myServer:9300:9220"]
clusterId = myClusterId
replicationFactor = 0
numShards = -1
responsePath = /tmp/sink
}
com.esri.sink.bds.SinkBDS
Note: Requires ArcGIS Enterprise 10.8.1
Sink to persist one or more feature classes to Spatiotemporal Big Data Store.
Spatial references of the inputs must be 4326.
SinkBDS writes a summary for the newly created BDS data sources to responsePath: per input, the datasource name, spatial extent, big data store schema, timeInfo, and record count.
When a layer has no records, a data source is created in BDS but the layer is not saved to BDS as it errors out. However, responsePath will still contain all layer information.
The client-tier application or middle-tier service must be responsible for managing the portal including creating/updating the (existing) feature/map service using the summary information provided and ArcGIS REST Services.
Note: If a data source of identical name to the corresponding dataSourceName already exists in the data store, then an error will be thrown.
Configuration
Note: The BDS username, password, and clusterId can be obtained by running the listadminusers command on the Spatiotemporal Big Data Store machine. Find this utility under <ArcGIS Data Store installation directory>/datastore/tools
Required
geometryTypes
- desc: Describe the type of each input. To indicate an input is a table, use esriGeometryNull. Possible values are: esriGeometryPolygon, esriGeometryPolyline, esriGeometryLine, esriGeometryEnvelope, esriGeometryPoint, esriGeometryMultipoint, esriGeometryNull, esriGeometryUnknown.
- type: Array[String]
username
- desc: Username to Big Data Store.
- type: String
password
- desc: Password to Big Data Store.
- type: String
machines
- desc: List of Big Data Store nodes. Each node is a string value of hostname, tcpPort, httpPort separated by
:
. For example,["myServer.com:9300:9200"]
. - type: Array[String]
- desc: List of Big Data Store nodes. Each node is a string value of hostname, tcpPort, httpPort separated by
clusterId
- desc: Big Data Store Cluster ID.
- type: String
inputs
- desc: The names of feature classes to sink
- type: Array[String]
Optional
dataSourceNames
- desc: Provide the BDS datasource names for the inputs. Either provide names for all inputs or none at all. If not provided, unique names will be generated.
- type: Seq[String]
- default values: Seq.Empty
timeInfos
- desc: Specify the time related properties such as startTimeField, endTimeField, trackIdField, etc.
- type: Array[Object]
- default value: Map.Empty
resolutions
- desc: Control Elasticsearch's geo shape precision. By default "precision" is
50m
. https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-shape.html - type: Array[Object]
- default value: Map.Empty
- desc: Control Elasticsearch's geo shape precision. By default "precision" is
replicationFactor
- desc: Elasticsearch replication factor.
- type: Integer
- default value: 0
numShards
- desc: The number of shards in Elasticsearch.
- type: Integer
- default value: -1
responsePath
- desc: Path to write the datasources' metadata.
- type: String
- default value: String.Empty
shapeField
- desc: The name of the SHAPE Struct field in the input DataFrames.
- type: String
- default value: SHAPE
Spatial Type Functions
You can use these functions in ProcessorSQL, or in SourceGeneric with selectExpr. These functions can also be used in a Notebook environment.
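For example, in a notebook you can register a DataFrame as a temporary view and call the functions through spark.sql. A minimal sketch, assuming the functions are registered on the session and df has lat/lon columns (the view name and H3 resolution are illustrative):
# Python
# Sketch: call a BDT spatial function from SQL in a notebook
df.createOrReplaceTempView("df")
spark.sql("SELECT lat, lon, GeoToH3(lat, lon, 9) AS h3 FROM df").show()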
GeoToH3
SELECT GeoToH3(lat, lon, res) FROM df
Long
org.apache.spark.sql.catalyst.expressions.GeoToH3
For Uber H3. Return the conversion of Latitude and Longitude values to an H3 index.
since v2.3
Arguments:
lat - an expression. The latitude value.
lon - an expression. The longitude value.
res - an expression. The H3 resolution value.
H3Distance
SELECT H3Distance(h3Origin, h3idx) FROM df
Int
org.apache.spark.sql.catalyst.expressions.H3Distance
For Uber H3. Return the distance in H3 grid cell hexagons between the two H3 indices.
since v2.3
Arguments:
h3Origin - an expression. The origin H3 index value.
h3Idx - an expression. The destination H3 index value.
H3kRing
SELECT H3kRing(h3Idx, k) FROM df
Array[Long]
org.apache.spark.sql.catalyst.expressions.H3kRing
For Uber H3. Return a collection of the H3 indices within k distance of the origin H3 index.
since v2.3
Arguments:
h3Idx - an expression. The H3 index value.
k - an expression. The distance from the origin H3 index.
H3kRingDistances
SELECT H3kRingDistances(h3Idx, k) FROM df
Array[Array[Long]]
org.apache.spark.sql.catalyst.expressions.H3kRingDistances
For Uber H3. Return a collection of collections of the H3 indices within k distance of the origin H3 index, with one inner collection per ring distance.
since v2.4
Arguments:
h3Idx - an expression. The H3 index value.
k - an expression. The distance from the origin H3 index.
H3ToChildren
SELECT H3ToChildren(h3Idx, resolution) FROM df
Array[Long]
org.apache.spark.sql.catalyst.expressions.H3ToChildren
For Uber H3. Return a collection of the children H3 indices of an H3 index.
since v2.3
Arguments:
h3Idx - an expression. The H3 index value.
res - an expression. The H3 resolution value.
H3ToGeo
SELECT H3ToGeo(h3Idx) FROM df
Array[Double]
org.apache.spark.sql.catalyst.expressions.H3ToGeo
For Uber H3. Return the conversion of an H3 index to its Latitude and Longitude centroid.
since v2.3
Arguments:
- h3Idx - an expression. The H3 index value.
H3ToGeoBoundary
SELECT H3ToGeoBoundary(h3idx) from df
Array[Double]
org.apache.spark.sql.catalyst.expressions.H3ToGeoBoundary
For Uber H3. Return a collection of the hexagon boundary latitude and longitude values for an H3 index. The output is ordered as follows: [x1, y1, x2, y2, x3, y3, ...]
since v2.3
Arguments:
- h3Idx - an expression. The H3 index value.
H3ToParent
SELECT H3ToParent(h3idx, res) FROM df
Long
org.apache.spark.sql.catalyst.expressions.H3ToParent
For Uber H3. Return the parent H3 Index containing the given H3 Index.
since v2.3
Arguments:
h3Idx - an expression. The H3 index value.
res - an expression. The H3 resolution value.
H3ToString
SELECT H3ToString(h3idx) FROM df
string
org.apache.spark.sql.catalyst.expressions.H3ToString
For Uber H3. Return the conversion of an H3 index to its string representation.
since v2.3
Arguments:
- h3Idx - an expression. The H3 index value.
LatToY
SELECT LatToY(lat) FROM LatLonDF
Double
org.apache.spark.sql.catalyst.expressions.LatToY
Return the conversion of a Latitude value to a WebMercator Y value.
Arguments:
- lat - an expression. The latitude value.
LonToX
SELECT LonToX(lon) FROM LonLatDF
Double
org.apache.spark.sql.catalyst.expressions.LonToX
Return the conversion of a Longitude value to a WebMercator X value.
Arguments:
- lon - an expression. The longitude value.
ST_Area
SELECT ST_Area(A.SHAPE) FROM polygons A
wkt2 = """MULTIPOLYGON (((5148012 4080116, 4534599 3838522, 4985676 2684166,
7448886 2291969, 6703639 4769064, 5148012 4080116)))"""
df = spark.sql(f"""
SELECT ST_Area(ST_FromText('{wkt2}')) AREA
""")
df.show()
+------------------+
| AREA|
+------------------+
|4.3430662295895E12|
+------------------+
Double
org.apache.spark.sql.catalyst.expressions.STArea
Return the area of a given shape struct. The area is calculated in the units of the spatial reference.
since v2.1
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
Returns:
- Double
ST_AsGeoJSON
SELECT ST_AsGeoJSON(t1.shape) FROM t1
String
org.apache.spark.sql.catalyst.expressions.STAsGeoJSON
Return GeoJSON representation of the given shape struct.
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
ST_AsHex
SELECT
*,
ST_AsHex(pickup_longitude, pickup_latitude, 200.0) as H200,
ST_AsHex(pickup_longitude, pickup_latitude, 500.0) as H500
FROM
src_taxi
org.apache.spark.sql.catalyst.expressions.STAsHex
For a given longitude, latitude, and size in meters, return a hex QR code string delimited by ":".
Arguments:
longitude - an expression. The longitude.
latitude - an expression. The latitude.
hexSize - an expression. Hexagon size.
ST_AsQR
SELECT ST_AsQR(SHAPE, 1.0) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Array of long values.
org.apache.spark.sql.catalyst.expressions.STAsQR
Return an array of long QR values of the shape struct. This is optimized for 2 dimensions only.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
since v2.3
ST_AsQRClip
SELECT inline(ST_AsQRClip(SHAPE, 1.0, 4326)) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))' wkt))
Array of InternalRow with LongType and SHAPE Struct.
org.apache.spark.sql.catalyst.expressions.STAsQRClip
Return an array of struct with QR and clipped SHAPE struct.
If a clipped SHAPE struct is completely within the geometry, wkb will be empty but the extent will be populated accordingly.
Use inline() function to flatten out.
If the DataFrame has the default SHAPE column before calling this function, there will be two SHAPE columns in the DataFrame after calling the function. It is the user's responsibility to rename the columns appropriately to avoid column name collisions.
Currently, this function is used by ProcessorPointInPolygon(clip = true). Each clip is inflated slightly by (cellSize / 100) to handle points sitting on the edge of a completely interior clip, or on the inner edge of a clip that is not completely inside the polygon.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
wkid - an expression. The spatial reference id.
since v2.3
ST_AsQRPrune
SELECT explode(ST_AsQRPrune(SHAPE, 1.0, 4326, 0.5)) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Array of long values.
org.apache.spark.sql.catalyst.expressions.STAsQRPrune
Create and return an array of QR values for the given shape struct. QR values that overlap the geometry bounding box but not the actual geometry will not be returned.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
wkid - an expression. The spatial reference id.
distance - an expression. The distance to spill over.
since v2.3
ST_AsQRSpill
SELECT ST_AsQRSpill(SHAPE, 1.0, 0.1) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Array of long values.
org.apache.spark.sql.catalyst.expressions.STAsQRSpill
Create and return an array of qr values of the shape struct. Spill over to neighboring cells by the specified distance.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
distance - an expression. The distance to spill over.
since v2.3
ST_AsRC
SELECT ST_AsRC(SHAPE, 1.0) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Array of array of two long values.
org.apache.spark.sql.catalyst.expressions.STAsRC
Return an array of array of two long values, representing row and column, of shape struct.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
ST_AsRCClip
SELECT ST_AsRCClip(SHAPE, 1.0, 4326) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))' wkt))
Array of InternalRow with ArrayType(LongType) and SHAPE Struct.
org.apache.spark.sql.catalyst.expressions.STAsRCClip
Return an array of structs, each with an array of two long values representing row & column and a shape struct representing a clipped geometry.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
wkid - an expression. The spatial reference id.
ST_AsRCPrune
SELECT explode(ST_AsRCPrune(SHAPE, 1.0, 4326, 0.5)) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Array of Arrays with 2 long values.
org.apache.spark.sql.catalyst.expressions.STAsRCPrune
Create and return an array of RC value arrays for the given shape struct. RC values that overlap the geometry bounding box but not the actual geometry will not be returned.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
wkid - an expression. The spatial reference id.
distance - an expression. The distance to spill over.
since v2.3
ST_AsRCSpill
SELECT ST_AsRCSpill(SHAPE, 1.0, 0.1) FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Array of array of two long values.
org.apache.spark.sql.catalyst.expressions.STAsRCSpill
Return an array of arrays of two long values, representing row and column, of the shape struct. Spill over to neighboring cells by the specified distance.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
cellSize - an expression. The spatial partitioning grid size.
distance - an expression. The distance to spill over.
ST_AsText
SELECT ST_AsText(t1.shape) FROM t1
String
org.apache.spark.sql.catalyst.expressions.STAsText
Return the Well-Known Text (WKT) representation of the shape struct.
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
ST_Buffer
SELECT ST_Buffer(shape, 1.0, 4326) FROM poi
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STBuffer
Create and return a buffer polygon of the shape struct as specified by the distance and wkid.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
distance - an expression. The distance.
wkid - an expression. The spatial reference id.
ST_BufferGeodesic
SELECT ST_BufferGeodesic(shape, "ShapePreserving", 1.0, 4326, 'NaN') FROM locations
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STBufferGeodesic
Create and return a geodesic buffer polygon of the shape struct as specified by the curveType, distance, wkid, and maxDeviation.
The options for curveType are:
- Geodesic
- Loxodrome
- GreatElliptic
- NormalSection
- ShapePreserving
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
curveType - an expression. The geodetic curve type of the segments. If the curve_type is Geodetic_curve::shape_preserving, then the segments are densified in the projection where they are defined before buffering.
distance - an expression. The distance in Meters.
wkid - an expression. The spatial reference id
maxDeviation - an expression. The deviation offset to use for convergence. The geodesic arcs of the resulting buffer will be closer than the max deviation of the true buffer. Pass in 'NaN' to use the default deviation.
since v2.3
ST_Centroid
SELECT ST_Centroid(SHAPE) FROM locations
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STCentroid
Return centroid of shape struct.
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
ST_Contains
SELECT ST_Contains(A.SHAPE, B.SHAPE, 4326) FROM polygons A, points B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STContains
Return true if one shape struct contains another shape struct.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_Coverage
SELECT ST_Coverage(A.SHAPE, B.SHAPE, 10) FROM polylines A, polylines B
Array(double, double)
org.apache.spark.sql.catalyst.expressions.STCoverage
Return the coverage fraction and coverage distance of segment1 on segment2 as an array. Array(coverageFraction, coverageDistance)
The coverage fraction represents the fraction of segment1 that is covered by segment2; its range is [0, 1]. The coverage distance measures how close segment1 is to segment2, taking into account the input distance threshold. Its range is (0, 1]: a coverage distance of 1 means the actual distance between the segments is 0, and as the segments get further apart, the coverage distance approaches 0 (see the sketch below).
Arguments:
struct - an expression. A polyline shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. A polyline shape struct with wkb, xmin, ymin, xmax, ymax
distThreshold - an expression. A distance threshold above which the coverage distance between two lines will be very small.
since v2.3
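A sketch of unpacking the returned array, assuming two polyline DataFrames are registered as views A and B (the threshold of 10 is illustrative):
# Python
# Sketch: split the ST_Coverage result into its fraction and distance components
cov_df = spark.sql("""
    SELECT cov[0] AS coverage_fraction, cov[1] AS coverage_distance
    FROM (SELECT ST_Coverage(A.SHAPE, B.SHAPE, 10) AS cov FROM A, B)
""")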
ST_CreateFishnet
(SELECT GRID_ID, ST_FromWKB(WKB) AS SHAPE FROM parquet.`src/test/resources/grid_centroids`) LATERAL VIEW explode(ST_CreateFishnet(ST_Project(SHAPE, 4326, 3857), 400, 800)) as NBR
Array of InternalRow with ArrayType(LongType) and SHAPE Struct.
Create a fishnet of grid cells covering the extent defined by xmin, ymin, xmax, ymax at the given cell size. The output is an array of (Seq(r,c), Shape struct) rows; use explode() to flatten out.
Arguments:
xmin - The minimum longitude.
ymin - The minimum latitude.
xmax - The maximum longitude.
ymax - The maximum latitude.
cellSize - The cell size.
ST_CreateNeighbors
(SELECT GRID_ID, ST_FromWKB(WKB) AS SHAPE FROM parquet.`src/test/resources/grid_centroids`) LATERAL VIEW explode(ST_CreateNeighbors(ST_Project(SHAPE, 4326, 3857), 400, 800)) as NBR
Array of InternalRow with ArrayType(LongType) and SHAPE Struct.
org.apache.spark.sql.catalyst.expressions.STCreateNeighbors
Per a web mercator point, return additional points north, south, west, east directions in the given interval up to the given max distance. The input shape must be in 3857 web mercator spatial reference. The output is an array of (Seq(r,c), Shape struct). The output is in 3857 web mercator spatial reference.
since v2.1
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
interval - an expression. The interval distance in meter.
maxDist - an expression. The maximum distance in meter.
ST_Crosses
SELECT ST_Crosses(A.SHAPE, B.SHAPE, 4326) FROM line A, other_line B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STCrosses
Return true if one shape struct crosses another shape struct. Otherwise return false.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_Disjoint
SELECT ST_Disjoint(A.SHAPE, B.SHAPE, 4326) FROM polygons A, point B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STDisjoint
Return true when the two shape structs are disjoint. Otherwise return false.
Geometries are disjoint when their intersection is empty. The Disjoint relation is the opposite of Intersects.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_Distance
SELECT ST_Distance(t1.shape, t2.shape) FROM t1, t2
Double
org.apache.spark.sql.catalyst.expressions.STDistance
Calculates the 2D planar distance between two shape structs.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
ST_DistanceWGS84Points
SELECT ST_DistanceWGS84Points(t1.shape, t2.shape) FROM t1, t2
Double
org.apache.spark.sql.catalyst.expressions.STDistanceWGS84Points
Calculate and return the distance in meters between the two shape structs representing points in WGS84.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
ST_DoBBoxesOverlap
SELECT *
FROM roads A, grids B
WHERE
A.RC = B.RC
AND
ST_DoBBoxesOverlap(A.SHAPE, B.SHAPE)
AND ST_Intersects(A.SHAPE, B.SHAPE, 4326)
Boolean
org.apache.spark.sql.catalyst.expressions.STDoBBoxesOverlap
Return true when the extents from the two given shape structs overlap. Return false otherwise.
This expression is used to do a simple check before executing ST_* operations such as ST_Distance, ST_Contains. If the two extents do not overlap, there is no need to execute the following ST_* operation.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
struct2 - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
ST_DumpXY
wkt = """POLYGON((-57.55608571669294 -11.287931275985708,-43.84514821669295 -11.287931275985708,
-43.84514821669295 -23.020737845578587,-57.55608571669294 -23.020737845578587,
-57.55608571669294 -11.287931275985708))"""
df = spark.sql(f"""
SELECT
inline(array(dump))
FROM (
SELECT explode(ST_DumpXY(ST_FromText('{wkt}'))) dump
)
""")
df.show()
+------------------+-------------------+
| x| y|
+------------------+-------------------+
|-57.55608571669294|-11.287931275985708|
|-43.84514821669295|-11.287931275985708|
|-43.84514821669295|-23.020737845578587|
|-57.55608571669294|-23.020737845578587|
|-57.55608571669294| -11.2879312759857|
+------------------+-------------------+
org.apache.spark.sql.catalyst.expressions.STDumpXY
Return an array of (x,y) vertex values making up the given shape struct as an array of structs. Use explode() to flatten the array. The x and y values of the struct can be accessed in SQL as StructColName.x and StructColName.y, where StructColName is the name of the column with values returned by explode(ST_DumpXY(...)); see the sketch below.
since v2.4
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
Returns:
- Array(Struct(Double, Double))
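A short sketch of accessing the struct fields as described, assuming a view named df with a SHAPE struct column:
# Python
# Sketch: explode the vertex array and read x/y off the resulting struct column
spark.sql("""
    SELECT dump.x, dump.y
    FROM (SELECT explode(ST_DumpXY(SHAPE)) AS dump FROM df)
""").show()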
ST_DumpXYwithIndex
wkt = """POLYGON((-57.55608571669294 -11.287931275985708,-43.84514821669295 -11.287931275985708,
-43.84514821669295 -23.020737845578587,-57.55608571669294 -23.020737845578587,
-57.55608571669294 -11.287931275985708))"""
df = spark.sql(f"""
SELECT
inline(array(dump)),
ST_FromXY(x,y) SHAPE
FROM (
SELECT explode(ST_DumpXYWithIndex(ST_FromText('{wkt}'))) dump
)
""").withMeta('Point', 4326)
org.apache.spark.sql.catalyst.expressions.STDumpXYWithIndex
Return an array of (x,y,i) vertex values with indices making up the given shape struct as an array of structs. Use explode() to flatten the array. The x, y, and i values of the struct can be accessed in SQL as StructColName.x, StructColName.y, and StructColName.i, where StructColName is the name of the column with values returned by explode(ST_DumpXYWithIndex(...)).
since v2.4
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
Returns:
- Array(Struct(Double, Double))
ST_EliminatePolygonPartPercent
SELECT ST_EliminatePolygonPartPercent(shape, 10.0) FROM locations
Shape Struct
org.apache.spark.sql.catalyst.expressions.STEliminatePolygonPartPercent
Remove the holes of a polygon whose area as a percentage of the whole polygon is less than this value.
since v2.3
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
percentThreshold - an expression. If a hole has area less than this value, then it will be removed.
ST_EliminatePolygonPartArea
SELECT ST_EliminatePolygonPartArea(shape, 1.0) FROM locations
Shape Struct
org.apache.spark.sql.catalyst.expressions.STEliminatePolygonPartArea
Remove the holes of a polygon that have area below the given threshold.
since v2.3
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
areaThreshold - an expression. If a hole has area less than this value, then it will be removed.
ST_Envelope
SELECT ST_Envelope(-180.0, -90.0, 180.0, 90.0);
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STEnvelope
Create an [[Envelope]] using xmin, ymin, xmax, ymax.
since v2.3
Arguments:
xmin - an expression. The xmin.
ymin - an expression. The ymin.
xmax - an expression. The xmax.
ymax - an expression. The ymax.
ST_Equals
SELECT ST_Equals(A.SHAPE, B.SHAPE, 4326) FROM polygons A, other_polygons B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STEquals
Returns true if two geometries are equal. Otherwise return false.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_FromFGDBPoint
SELECT ST_FromFGDBPoint(t.SHAPE) as SHAPE FROM t
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromFGDBPoint
Create and return a shape struct representing a point for the given struct from the file geodatabase.
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
ST_FromFGDBPolygon
SELECT ST_FromFGDBPolygon(t.SHAPE) as SHAPE FROM t
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromFGDBPolygon
Return a shape struct representing a polygon for the given struct from the file geodatabase.
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
ST_FromFGDBPolyline
SELECT ST_FromFGDBPolyline(t.SHAPE) as SHAPE FROM t
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromFGDBPolyline
Return a shape struct representing a polyline for the given struct from the file geodatabase.
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
ST_FromGeoJSON
SELECT ST_FromGeoJSON(t.geojson) as SHAPE FROM t
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromGeoJSON
Return a shape struct from GeoJSON.
Arguments:
- geoJSON - an expression. GeoJSON.
ST_FromHex
SELECT
ST_FromHex(H200, 200) as SHAPE, H200, TIP_AVG
FROM
(
SELECT
*,
ST_AsHex(pickup_longitude, pickup_latitude, 200.0) as H200
FROM
taxi
)
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromHex
For a given hex qr code and size, return a polygon shape struct. The returned polygon is in the spatial reference of WGS_1984_Web_Mercator_Auxiliary_Sphere 3857 (102100).
Arguments:
hexQR - an expression. The hex QR code.
hexSize - an expression. Hexagon size.
ST_FromJSON
SELECT ST_FromJSON(t.json) as SHAPE FROM t
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromJSON
Return a shape struct from the given JSON.
Arguments:
- json - an expression. JSON
ST_FromText
SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt)
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromText
Create and return a shape struct from the OGC Well-Known text representation.
Arguments:
- wkt - an expression. The OGC Well-Known text representation
ST_FromWKB
SELECT ST_FromWKB(SHAPE.WKB) as WKB FROM (SELECT ST_FromText(wkt) as SHAPE FROM (SELECT 'POINT(1.0 1.0)' wkt))
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromWKB
Create and return shape struct from the well-known binary representation of a geometry.
Arguments:
- wkb - an expression. The well-known binary representation of a geometry.
ST_FromXY
SELECT *, ST_FromXY(pickup_longitude, pickup_latitude) as SHAPE FROM taxi
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STFromXY
Create and return a shape struct based on the given longitude and latitude.
Arguments:
longitude - an expression. The longitude value.
latitude - an expression. The latitude value.
ST_Generalize
SELECT ST_Generalize(A.SHAPE, 0.0, false) FROM polygons A
Shape struct
org.apache.spark.sql.catalyst.expressions.STGeneralize
Return the generalization of a geometry in its shape struct representation.
since v2.1
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
maxDeviation - an expression. The maximum allowed deviation from the generalized geometry to the original geometry.
removeDegenerateParts - an expression. When true, the degenerate parts of the geometry will be removed from the output.
ST_GeodeticArea
SELECT ST_GeodeticArea(A.SHAPE, 4326, 1) FROM polygons A
double
org.apache.spark.sql.catalyst.expressions.STGeodeticArea
Return the geodetic area of geometry.
since v2.1
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
curveType - an expression. The geodetic curve type.
ST_Grid
WITH T AS (SELECT explode(ST_Grid(ST_FromText('LineString(0.0 0.0, 1.0 1.0)'), 1.0)) GRID)
SELECT inline(array(GRID)) FROM T
Rows with an array of (row, column) and Shape struct.
org.apache.spark.sql.catalyst.expressions.STGrid
Returns an array of struct containing two columns: Array of long values representing row and column, and a shape struct. Use explode() and inline() function to flatten out.
since v2.3
Arguments:
row - an expression. Shape struct.
cellSize - an expression. Cell size.
ST_GridXY
WITH T AS (SELECT explode(sequence(0,2)) ID, ID X, ID Y)
SELECT inline(array(ST_GridXY(X, Y, 1.0))) FROM T
Rows with an array of (row, column) and a shape struct representing the grid to which the given point location belongs.
org.apache.spark.sql.catalyst.expressions.STGridXY
For the given longitude, latitude, and cell size, return a struct containing an array with the row & column (RC) and a shape struct representing the grid cell to which the given point location belongs. Use inline() to flatten out.
since v2.3
Arguments:
longitude - an expression. The longitude.
latitude - an expression. The latitude.
cellSize - an expression. The cell size.
ST_HaversineDistance
SELECT ST_HaversineDistance(X1, Y1, X2, Y2) AS haversine FROM T1
Double
Return the haversine distance in kilometers between two WGS84 points.
Arguments:
x1 - an expression. X1
y1 - an expression. Y1
x2 - an expression. X2
y2 - an expression. Y2
ST_Intersection
SELECT ST_Intersection(A.SHAPE, B.SHAPE, 4326, -1) as SHAPE FROM A, B
Array of struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STIntersection
Performs the topological intersection operation on the two given shape structs and returns an array of structs with wkb, xmin, ymin, xmax, ymax. Use the explode() function to flatten out.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
mask - an expression. The mask value.
ST_Intersects
SELECT ST_Intersects(A.SHAPE, B.SHAPE, 4326) FROM roads A, grids B
true or false
org.apache.spark.sql.catalyst.expressions.STIntersects
Return true if the two shape structs intersect. Return false otherwise.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_IsLowerLeftInCell
SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.RC = t2.RC where ST_IsLowerLeftInCell(t1.RC, 1.0, t1.SHAPE, t2.SHAPE)
Boolean
org.apache.spark.sql.catalyst.expressions.STIsLowerLeftInCell
Return true when the lower left corner of the intersection of the two shape envelopes is in the given cell. Otherwise return false.
Arguments:
rc - an expression. The row & column
cellSize - an expression. The spatial partitioning grid size.
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
ST_IsPointInExtent
SELECT ST_IsPointInExtent(A.SHAPE,[-10, -10, 10, 10]) FROM points A
Boolean
org.apache.spark.sql.catalyst.expressions.STIsPointInExtent
Return a boolean indicating if a point is in a given extent, including points on the extent boundary.
since v2.1
Arguments:
struct - an expression. The point shape struct with wkb, xmin, ymin, xmax, ymax
extent - an expression. The extent array with xmin, ymin, xmax, ymax
ST_Length
SELECT ST_Length(A.SHAPE) FROM polygons A
Double
org.apache.spark.sql.catalyst.expressions.STLength
Return the length of a given shape struct.
since v2.1
Arguments:
- struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
ST_Line
SELECT ST_Line(array(
ST_FromText('POINT (1.0 1.0)'),
ST_FromText('POINT (2.0 1.0)'),
ST_FromText('POINT (3.0 2.0)')
))
Shape Struct
org.apache.spark.sql.catalyst.expressions.STLine
Accepts an array of point [[SHAPE_STRUCT]] and returns a polyline [[SHAPE_STRUCT]].
since v2.2
Arguments:
- struct - an expression. An array of point [[SHAPE_STRUCT]].
ST_MakeLine
SELECT ST_MakeLine(coords) FROM df
Shape struct
org.apache.spark.sql.catalyst.expressions.STMakeLine
Return a line shape struct from an array of coordinates like [x1, y1, x2, y2, x3, y3...]
since v2.3
Arguments:
- struct - an expression. An array of coordinates in the order [x1, y1, x2, y2, x3, y3, ...]
ST_MakePoint
SELECT ST_MakePoint(x, y) FROM pointsDF
Shape struct
org.apache.spark.sql.catalyst.expressions.STMakePoint
Return a point shape struct from x and y values.
since v2.3
Arguments:
longitude - an expression. The longitude value.
latitude - an expression. The latitude value.
ST_MakePolygon
SELECT ST_MakePolygon(coords) FROM df
Shape struct
org.apache.spark.sql.catalyst.expressions.STMakePolygon
Return a polygon shape struct from an array of coordinates like [x1, y1, x2, y2, x3, y3...]
Can be used to convert Uber H3 GeoBoundaries (from H3ToGeoBoundary) to polygons; see the sketch below.
since v2.3
Arguments:
- struct - an expression. An array of coordinates in the order [x1, y1, x2, y2, x3, y3, ...]
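A sketch of the H3-to-polygon conversion mentioned above, assuming a view named df with an h3 column of H3 indices:
# Python
# Sketch: H3ToGeoBoundary returns [x1, y1, x2, y2, ...], which ST_MakePolygon accepts directly
hex_df = spark.sql("SELECT h3, ST_MakePolygon(H3ToGeoBoundary(h3)) AS SHAPE FROM df")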
ST_Overlaps
SELECT ST_Overlaps(A.SHAPE, B.SHAPE, 4326) FROM polygons A, other_polygons B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STOverlaps
Return true if one geometry overlaps another geometry. Otherwise return false.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_Project
SELECT ST_Project(SHAPE, 4326, 3857) FROM (SELECT ST_FromText('POINT (1.0 1.0)') as SHAPE)
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STProject
Re-project and return the newly re-projected shape struct based on the from wkid and to wkid.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
fromWkid - an expression. The well-known spatial reference id.
toWkid - an expression. The well-known spatial reference id.
ST_Project2
SELECT ST_Project2(SHAPE, GEOGCS[..], 3857) FROM (SELECT ST_FromText('POINT (1.0 1.0)') as SHAPE)
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STProject2
Re-project and return the newly re-projected shape struct based on the from wktext and to wkid.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
fromWkText - an expression. The well-known text.
toWkid - an expression. The well-known spatial reference id.
ST_Project3
SELECT ST_Project3(SHAPE, 4326, PROJCS[..]) FROM (SELECT ST_FromText('POINT (1.0 1.0)') as SHAPE)
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STProject3
Return the newly re-projected shape struct based on the from wkid and to wktext.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
fromWkid - an expression. The well-known spatial reference id.
toWktext - an expression. The well-known text.
ST_Project4
SELECT ST_Project4(SHAPE, GEOGCS[..], PROJCS[..]) FROM (SELECT ST_FromText('POINT (1.0 1.0)') as SHAPE)
Struct with wkb, xmin, ymin, xmax, ymax
org.apache.spark.sql.catalyst.expressions.STProject4
Return the newly re-projected shape struct based on the from wktext and to wktext.
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax.
fromWktext - an expression. The well-known text.
toWktext - an expression. The well-known text.
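As an illustrative sketch of how the four projection variants relate, the query below re-projects the same point from WGS84 (4326) to Web Mercator (3857); the GEOGCS[..] and PROJCS[..] placeholders stand in for full well-known text strings and must be replaced with real values.
-- wkid -> wkid; the commented alternatives take well-known text instead.
SELECT ST_Project(SHAPE, 4326, 3857) AS PROJECTED
FROM (SELECT ST_FromText('POINT (1.0 1.0)') AS SHAPE)
-- ST_Project2(SHAPE, GEOGCS[..], 3857)           wktext -> wkid
-- ST_Project3(SHAPE, 4326, PROJCS[..])           wkid   -> wktext
-- ST_Project4(SHAPE, GEOGCS[..], PROJCS[..])     wktext -> wktext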
ST_Relate
SELECT ST_Relate(A.SHAPE, B.SHAPE, 4326, "1*1***1**") FROM linestring A, other_linestring B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STRelate
Return true if the given relation (DE-9IM value) holds for the two geometries. Otherwise return false.
https://postgis.net/docs/ST_Relate.html https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
de9im - an expression. The DE-9IM value.
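As an illustrative sketch, the DE-9IM pattern "T*F**F***" expresses the standard within relationship, so the query below is expected to agree with ST_Within(A.SHAPE, B.SHAPE, 4326) for the same geometry pairs (table and column names are hypothetical).
SELECT ST_Relate(A.SHAPE, B.SHAPE, 4326, "T*F**F***") FROM point A, polygons B where A.RC = B.RC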
ST_ServiceArea
SELECT inline(ST_ServiceArea(SHAPE, 35, 125.0))
SELECT inline(ST_ServiceArea(ST_FromXY(-13161875, 4035019.53758), 35, 125.0))
Array of InternalRow with SERVICE_AREA_COST (in minutes) and SHAPE Struct.
org.apache.spark.sql.catalyst.expressions.STServiceArea
Accepts a point [[SHAPE_STRUCT]] in Web Mercator (3857) and returns service area polygons with the cost in minutes. The output is an array of (service_area_cost, polygon) pairs. The service area polygons are hexagon-based.
The algorithm used is Dijkstra's Search.
Use inline function to unpack the returned array.
since v2.4
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax in the point geometry and 3857.
noOfCosts - an expression. The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.
hexSize - an expression. The hexSize in DoubleType.
ST_ServiceAreaXY
SELECT inline(ST_ServiceAreaXY(-13161875, 4035019.53758, 35, 125.0))
Array of InternalRow with SERVICE_AREA_COST (in minutes) and SHAPE Struct.
org.apache.spark.sql.catalyst.expressions.STServiceAreaXY
Accepts explicit X and Y values in Web Mercator (3857) and returns service area polygons with the cost in minutes. The output is an array of (service_area_cost, polygon) pairs. The service area polygons are hexagon-based.
The algorithm used is Breadth First Search.
Use inline function to unpack the returned array.
since v2.4
Arguments:
x - an expression. The X in web mercator (3857).
y - an expression. The Y in web mercator (3857).
noOfCosts - an expression. The number of minute costs. If 2 is specified, 0-1 minute and 0-2 minutes polygons are generated.
hexSize - an expression. The hexSize in DoubleType.
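A minimal sketch of applying ST_ServiceAreaXY to a whole table; the facilities table and its FACILITY_ID, X, and Y columns are hypothetical, and the unpacked column names SERVICE_AREA_COST and SHAPE follow the return description above.
-- One output row per (facility, cost band): with noOfCosts = 2 this yields
-- a 0-1 minute and a 0-2 minute polygon for each input point.
SELECT F.FACILITY_ID, SA.SERVICE_AREA_COST, SA.SHAPE
FROM facilities F
LATERAL VIEW inline(ST_ServiceAreaXY(F.X, F.Y, 2, 125.0)) SA AS SERVICE_AREA_COST, SHAPE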
ST_Simplify
SELECT ST_Simplify(A.SHAPE, 4326, false) FROM polygons A
Shape struct
org.apache.spark.sql.catalyst.expressions.STSimplify
Simplify a geometry to make it valid for a Geodatabase to store without additional processing.
since v2.1
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference ID.
forceSimplify - an expression. When True, the Geometry will be simplified regardless of the internal IsKnownSimple flag.
ST_SimplifyOGC
SELECT ST_SimplifyOGC(A.SHAPE, 4326, false) FROM polygons A
Shape struct
org.apache.spark.sql.catalyst.expressions.STSimplifyOGC
Simplify a geometry to make it valid for a Geodatabase to store without additional processing. Follows the OGC specification for the Simple Feature Access v. 1.2.1 (06-103r4).
since v2.2
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference ID.
forceSimplify - an expression. When True, the Geometry will be simplified regardless of the internal IsKnownSimple flag.
ST_Touches
SELECT ST_Touches(A.SHAPE, B.SHAPE, 4326) FROM polygons A, point B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STTouches
Return true if one geometry touches another geometry. Otherwise return false.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
ST_WebMercator
SELECT ST_WebMercator(lat, lon) FROM LatLonDF
Array(Double, Double)
org.apache.spark.sql.catalyst.expressions.STWebMercator
Return the Latitude and Longitude values converted to a Web Mercator (X, Y) coordinate pair.
since v2.2
Arguments:
lat - an expression. The latitude value.
lon - an expression. The longitude value.
ST_Within
SELECT ST_Within(A.SHAPE, B.SHAPE, 4326) FROM point A, polygons B where A.RC = B.RC
true or false
org.apache.spark.sql.catalyst.expressions.STWithin
Return true if one geometry is within another geometry. Otherwise return false.
https://desktop.arcgis.com/en/arcmap/latest/manage-data/using-sql-with-gdbs/relational-functions-for-st-geometry.htm#
Arguments:
struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
another_struct - an expression. The shape struct with wkb, xmin, ymin, xmax, ymax
wkid - an expression. The spatial reference id.
XToLon
SELECT XToLon(x) FROM LatLonDF
Double
org.apache.spark.sql.catalyst.expressions.XToLon
Return the conversion of a Web Mercator X value to a Longitude value.
Arguments:
- x - an expression. The x value.
YToLat
SELECT YToLat(y) FROM LatLonDF
Double
org.apache.spark.sql.catalyst.expressions.YToLat
Return the conversion of a Web Mercator Y value to a Latitude value.
Arguments:
- y - an expression. The y value.
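A minimal round-trip sketch combining these conversions; it assumes the array returned by ST_WebMercator holds X at index 0 and Y at index 1.
-- Project lat/lon to Web Mercator, then convert back; LON2 and LAT2 should
-- match the original lon and lat columns.
SELECT XToLon(XY[0]) AS LON2, YToLat(XY[1]) AS LAT2
FROM (SELECT ST_WebMercator(lat, lon) AS XY FROM LatLonDF)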
Open Source
The Esri Big Data Toolkit open source acknowledgements can be found below.
BDT v2
Services Offering
Esri Professional Services
- Implementation and configuration planning. This includes a kick-off meeting, workflow requirements review and validation, and infrastructure validation.
- Installation and configuration of Big Data Toolkit. This includes installing and configuring Big Data Toolkit on the customer’s big data cluster, configuring and testing one or two custom workflows, and installing the ArcGIS Pro extension.
- Knowledge transfer. While on-site to implement and configure Big Data Toolkit, Esri will also provide knowledge transfer so the client can execute and configure additional workflows independently.
- Ongoing support. This includes a defined number of hours of technical support for troubleshooting and/or consulting, to help the customer configure additional workflows with Big Data Toolkit and grow its use throughout the term of the subscription.