Parquet Toolbox#

The Parquet Toolbox imports parquet folders exported from Spark or other data processes. It supports reading from local disk, Azure Blob Storage, Azure Data Lake Storage, Google Cloud Storage, and AWS S3.

The toolbox also includes an export tool that writes a feature class from ArcGIS Pro to a parquet folder, with support for the same cloud providers listed above.

Reach out to the BDT team to receive the Parquet Toolbox installation files.

Installation:#

Step 1 - Clone the base conda env in ArcGIS Pro:#

  • First, close ArcGIS Pro, open the Python Command Prompt, and run the following:

    • proswap arcgispro-py3

    • conda create --yes --name spark_esri --clone arcgispro-py3

    • proswap spark_esri

Step 2 - Install dependencies:#

  • Run the following pip commands in the Python Command Prompt.

  • pip install duckdb

  • If you plan on exporting parquet files to Azure, an additional dependency is required: pip install adlfs.

Step 3 - Check dependencies:#

  • The following packages should already be installed from the base ArcGIS Pro conda env. Verify they are installed with conda list <package_name>; if any are missing, pip install them (a quick Python check is also sketched after this list):

    • pyarrow

    • pandas

    • fsspec (if exporting to Azure)
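
As an alternative to conda list, the same check can be done from Python. The snippet below is a minimal sketch that imports each dependency and prints its version; it assumes you run it in the spark_esri environment (for example by typing python in the Python Command Prompt after proswap spark_esri).

    import duckdb
    import pandas
    import pyarrow

    # Print the version of each required package.
    print("duckdb ", duckdb.__version__)
    print("pyarrow", pyarrow.__version__)
    print("pandas ", pandas.__version__)

    # fsspec and adlfs are only needed when exporting to Azure.
    try:
        import adlfs
        import fsspec
        print("fsspec ", fsspec.__version__)
        print("adlfs  ", adlfs.__version__)
    except ImportError:
        print("fsspec/adlfs not installed (only required for Azure export)")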

Note:#

  • If you get the error IO Error: Failed to create directory ".tmp": Access is denied., run ArcGIS Pro as administrator.

  • This error occurs when DuckDB tries to create a .tmp folder in your user directory. It is a configuration step that typically only occurs once (a way to check DuckDB's temp directory is sketched after this list).

  • After the directory is created, the tool will work without having to run ArcGIS Pro as administrator each time.
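
If you want to confirm where DuckDB intends to place its temporary directory on your machine, a minimal standalone check is sketched below. This runs outside of the toolbox; the connection the toolbox creates internally may be configured differently.

    import duckdb

    # Open an in-memory connection and ask DuckDB for its temp_directory setting.
    con = duckdb.connect()
    print(con.sql("SELECT current_setting('temp_directory')").fetchone()[0])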

Environment#

The following is a list of all the versions this tool was tested with:

Package    ArcGIS Pro 3.4    ArcGIS Pro 3.2
Python     3.11.10           3.9.18
DuckDB     1.1.3             1.1.0
PyArrow    16.1.0            12.0.0
Pandas     2.0.2             2.0.2
ADLFS      2024.7.0          2024.7.0
FSSpec     2024.6.1          2024.4.0

Import Tool Usage:#


Data Source:#

  • Set this to the location of your data.

  • Possible options are: Local, S3, Azure, and Google Cloud Storage

Output Layer Name:#

  • The name of the output feature class layer, which will appear in your catalog once the tool completes.

Output File Geodatabase:#

  • By default the tool will output the feature class to a scratch gdb. Set this to the path of a specific gdb to override this.

Local Parquet Folder:#

  • This parameter is an input path to a local hive-partitioned parquet folder (it must be a folder; the tool does not support single parquet files).

  • Must be an absolute path in the format Drive:\path\to\your\parquet. Ex: C:\Users\dre11620\Downloads\tester.parquet. No env variables required. A sample folder layout is sketched below.
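
For reference, a hive-partitioned parquet folder usually looks something like the layout below. The partition column (year here) and part file names are hypothetical; only the overall folder-of-parquet-files structure matters.

    C:\Users\dre11620\Downloads\tester.parquet\
        year=2023\
            part-00000.snappy.parquet
            part-00001.snappy.parquet
        year=2024\
            part-00000.snappy.parquet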

Cloud Parquet Folder:#

  • This parameter is an input path to a cloud hive-partitioned parquet folder (it must be a folder; the tool does not support single parquet files).

  • Exactly one of the local or cloud folder paths must be provided as input to the tool; an error will be thrown if both are defined.

  • It can be one of the following (a quick check of the required credential environment variables is sketched after this list):

    • Azure path

      • Must be in one of these formats:

        • az for Azure Blob Storage - az://<containername>/<pathtoparquet> Ex: az://main/mydata.parquet

        • Or azure for Azure Blob Storage - azure://<containername>/<pathtoparquet> Ex: azure://main/mydata.parquet

        • abfss for Azure Data Lake Storage (ADLS) - abfss://<containername>/<pathtoparquet> Ex: abfss://main/mydata.parquet

        • Or abfs for Azure Data Lake Storage (ADLS) - abfs://<containername>/<pathtoparquet> Ex: abfs://main/mydata.parquet

      • Must have an environment variable on your machine called AZURE_CONN_STR:

        • With the format: DefaultEndpointsProtocol=https;AccountName=<youraccount>;AccountKey=<yourkey>==;EndpointSuffix=core.windows.net.

        • You can find this key under your storage account on the Azure portal under Security+networking->Access keys->Connection String.

    • Google cloud storage path

      • Must be in the format - gs://<bucketname>/<pathtoparquet> Ex: gs://sdsdrewtest/myfile.parquet.

      • Or gcs - gcs://<bucketname>/<pathtoparquet> Ex: gcs://sdsdrewtest/myfile.parquet.

      • Must have environment variables on your machine called GOOGLE_ACCESS_KEY and GOOGLE_SECRET.

      • You can find these keys under Cloud Storage->Settings->Access keys for your user account.

    • AWS S3

      • Must be in the format - s3://<bucketname>/<pathtoparquet> Ex: s3://sdsdrewtest/myfile.parquet.

      • Must have the following env variables set: AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY.

      • If using temporary credentials, you will also need AWS_SESSION_TOKEN set.

      • If using MinIO locally, you will need to set AWS_ENDPOINT_URL. Ex: 127.0.0.1:9000.
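
Before running the tool, you can confirm the relevant credentials are actually visible to ArcGIS Pro's Python session. The sketch below only checks that the environment variables listed above are set; it does not validate the credential values themselves.

    import os

    # Environment variables required for each cloud provider, as listed above.
    required = {
        "Azure": ["AZURE_CONN_STR"],
        "Google Cloud Storage": ["GOOGLE_ACCESS_KEY", "GOOGLE_SECRET"],
        "AWS S3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    }

    for provider, names in required.items():
        missing = [name for name in names if not os.environ.get(name)]
        print(provider, "OK" if not missing else "missing: " + ", ".join(missing))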

X and Y Column:#

  • Set these columns if you have point geometries that you would like to import from two columns.

  • Example: X=pickup_longitude, Y=pickup_latitude

Geometry Column:#

  • If not using X and Y columns, the tool will look for a geometry column (a schema-inspection sketch follows this list).

  • The tool supports three types of columns:

    1. WKB - Well-Known Binary. Set this parameter to the name of your WKB column.

    2. WKT - Well-Known Text. Set this parameter to the name of your WKT column.

    3. Esri Big Data Toolkit SHAPE Struct - Example: SHAPE{WKB, XMIN, YMIN, XMAX, YMAX}. Set this parameter to the name of your struct column and the name of the WKB key within the struct, separated by a period. Ex: MyStruct.myWKB
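
If you are unsure which column holds your geometry, pyarrow (already installed) can print the folder's schema, as in the minimal sketch below. The path is hypothetical; point it at your own parquet folder. A WKB column typically shows up as a binary type, WKT as a string type, and the BDT SHAPE struct as a struct type containing a WKB field.

    import pyarrow.dataset as ds

    # Read only the schema of the hive-partitioned parquet folder.
    dataset = ds.dataset(
        r"C:\Users\dre11620\Downloads\tester.parquet",  # hypothetical path
        format="parquet",
        partitioning="hive",
    )
    print(dataset.schema)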

Column Selection:#

  • Warning - Columns with array values are not supported. Columns with int64 longs are only supported up to 53 bits, more information here. Use the regular expression to filter out any problematic columns.

  • Set a regular expression to import only the desired columns. Ex: (VendorID|RatecodeID). A small matching sketch follows this list.

  • You do not need to include the shape column in the RegEx, as it will be added automatically based on the WKB column name or X/Y column names.
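
The sketch below illustrates how a column-selection expression like (VendorID|RatecodeID) picks columns by name. The column names are hypothetical, and the exact matching rule the tool applies internally may differ slightly (for example match vs. fullmatch).

    import re

    # Hypothetical column names from a taxi-style dataset.
    columns = ["VendorID", "RatecodeID", "fare_amount", "pickup_longitude"]

    pattern = re.compile(r"(VendorID|RatecodeID)")
    selected = [name for name in columns if pattern.match(name)]
    print(selected)  # ['VendorID', 'RatecodeID']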

Spatial Reference:#

  • Set this to the spatial reference of your geometry data.

Memory:#

  • If checked, the tool writes the feature class to the memory workspace; otherwise it writes the feature class to the scratch GDB or the output GDB path.

Parse Delta Table:#

  • If checked, the tool will attempt to read a Delta table instead of a parquet table.

Export Tool Usage:#


Input Dataset:#

  • The input feature class to be exported.

Data Export Location:#

  • Set this to the location where you want to export your data.

  • Possible options are: Local, S3, Azure, and Google Cloud Storage

Local Parquet Folder:#

  • This parameter is an output path to a local parquet folder.

  • Exactly one of the local or cloud folder paths must be provided as input to the tool; an error will be thrown if both are defined.

  • Must be an absolute path in the format Drive:\path\to\your\parquet. Ex: C:\Users\dre11620\Downloads\tester.parquet. No env variables required.

Cloud Parquet Folder:#

  • This parameter is an output path to a cloud parquet folder.

  • Exactly one of the local or cloud folder paths must be provided as input to the tool; an error will be thrown if both are defined.

  • It can be one of the following:

    • Azure path

      • The only path format supported for Azure output is abfs. Example: abfs://<containername>/<pathtoparquet> - abfs://main/mydata.parquet.

      • Must have an environment variable on your machine called AZURE_CONN_STR:

        • With the format: DefaultEndpointsProtocol=https;AccountName=<youraccount>;AccountKey=<yourkey>==;EndpointSuffix=core.windows.net.

        • You can find this key under your storage account on the Azure portal under Security+networking->Access keys->Connection String.

        • Writing to azure requires an additional dependency: pip install adlfs.

    • Google cloud storage path

      • Must be in the format - gs://<bucketname>/<pathtoparquet> Ex: gs://sdsdrewtest/myfile.parquet.

      • Or gcs - gcs://<bucketname>/<pathtoparquet> Ex: gcs://sdsdrewtest/myfile.parquet

      • Must have environment variables on your machine called GOOGLE_ACCESS_KEY and GOOGLE_SECRET.

      • You can find these keys under Cloud Storage->Settings->Access keys for your user account.

    • AWS S3

      • Must be in the format - s3://<bucketname>/<pathtoparquet> Ex: s3://sdsdrewtest/myfile.parquet.

      • Must have the following env variables set: AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY.

      • If using temporary credentials, you will also need AWS_SESSION_TOKEN set.

      • If using MinIO locally, you will need to set AWS_ENDPOINT_URL. Ex: 127.0.0.1:9000.

Output Shape:#

  • If you don’t want to include geometries in the output parquet, uncheck this box.

Shape Format:#

  • If outputting shape column, 3 formats are supported: "WKT", "WKB", "XY".

  • Depending on the chosen format, the shape column(s) will be named "ShapeWKT", "ShapeWKB", or "ShapeX" and "ShapeY".

Rows Per Partition:#

  • Rows per partition controls the minimum number of rows in a Parquet row group. The default is 100,000 rows per group. A sketch for inspecting the resulting row groups follows this list.

  • Read more here.
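
After an export, you can verify the row-group sizing by inspecting one of the written parquet files with pyarrow, as in the minimal sketch below. The file name is hypothetical; substitute one of the part files produced by the tool.

    import pyarrow.parquet as pq

    # Open one part file from the exported folder and report its row groups.
    pf = pq.ParquetFile(r"C:\Users\dre11620\Downloads\tester.parquet\part-00000.parquet")  # hypothetical file
    meta = pf.metadata
    print("row groups:", meta.num_row_groups)
    for i in range(meta.num_row_groups):
        print("group", i, "rows:", meta.row_group(i).num_rows)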

Import Tool S3 Usage Example:#

Step 1 - Set AWS Environment variables#

  • Be sure to have the following env variables set: AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY.

  • To set them, search for "Environment variables" in Windows.

  • If using temp credentials you will need to also have AWS_SESSION_TOKEN set.

  • Restart ArcGIS Pro after setting these.

Step 2 - Open the toolbox:#

  • Open ParquetToolbox.pyt and select Import From Parquet

Step 3 - Input parquet folder path:#

  • This parameter is a path to your hive-partitioned parquet folder (it must be a folder; the tool does not support single parquet files).

  • Example for AWS S3 - s3://<bucketname>/<pathtoparquet> Ex: s3://sdsdrewtest/myfile.parquet.

Step 4 - Set Output Layer Name:#

  • The name of the output feature class layer, which will appear in your catalog once the tool completes.

Step 5 - Geometry Column:#

  • We can define our geometry column for the tool.

  • The tool supports three types of columns:

    1. WKB - Well-Known Binary. Set this parameter to the name of your WKB column.

    2. WKT - Well-Known Text. Set this parameter to the name of your WKT column.

    3. Esri Big Data Toolkit SHAPE Struct - Example: SHAPE{WKB, XMIN, YMIN, XMAX, YMAX}. Set this parameter to the name of your struct column and the name of the WKB key within the struct, separated by a period. Ex: MyStruct.myWKB

Step 6 - Set Spatial Reference:#

  • Set this to the spatial reference of your geometry data.

Step 7 - Run the tool:#

  • The tool should now log details about the import process.