top of page
  • Writer's pictureMo Sarwat

Spatial Data, Parquet, and Apache Sedona

Apache Parquet is a columnar data file format optimized for analytical workloads. It is widely adopted in the big data ecosystem. Spatial data scientists may also use parquet to store spatial data, especially when analyzing large scale datasets on Spark or cloud data warehouses. The following problems arise when storing spatial data to parquet files.

  1. Parquet does not have native support for geometry objects such as point, linestring and polygon. These geometry objects needs to be serialized to be stored as a primitive type (such as BINARY) in parquet files. There’s no easy way to tell which column stores geometry objects by simply looking at the schema of the files. Users may adopt some conventions such that geometry objects were stored in geom or geometry column, or user has to read some metadata about the dataset to find which columns are geometry data.

  2. There are many serialization formats such as WKB, WKT, GeoJSON, etc. Users need to know which serialization format to use to properly read the geometry objects from geometry columns.

GeoParquet is an incubating OGC standard for addressing these problems. It defines how to serialize geometry objects as parquet values and the file-level metadata for describing geometry columns. Implementations that follow this standard could easily exchange parquet files containing spatial data, and enjoy the benefits of parquet columnar format when running analytics on GeoParquet files.

A Brief Introduction to GeoParquet Specification

The main idea of the GeoParquet specification is that it defines a standard schema for file-level metadata attached to the footer of parquet files. It defines which columns are for geometry data, and the encoding of geometry data in each column. For example, we can load a GeoParquet file using GeoPandas, and see that the column for storing geometry objects are interpreted as geometry field:

We can inspect the metadata of this GeoParquet file using PyArrow. The metadata keyed by geo stores information on the geometry columns:

The metadata designated that column g is a primary geometry column encoded as WKB. There are some additional informations about this column such as geometry types and CRS of geometries in this column. Readers can refer to the column metadata section of the specification to understand how to interpret these information.

There is an interesting metadata for geometry columns keyed by bbox. It is the minimum bounding box of all the geometries in this column. This metadata is useful for skipping unrelated GeoParquet files when user runs a spatial range query on a dataset made up of a bunch of GeoParquet files. For example, the query window shown as the red rectangle only overlaps with the bounding box of part1.parquet and part2.parquet, so the query engine can safely skip scanning part3.parquet and part4.parquet, reducing the IO cost and answer the query faster.

Working with GeoParquet in Apache Sedona

Apache Sedona 1.4.1 supports reading and writing GeoParquet files on Apache Spark 3.0 ~ 3.4. We’ll go through an example to demonstrate the usage of GeoParquet support of Apache Sedona, and how to speed up your spatial queries using GeoParquet.

Saving Spatial DataFrames to GeoParquet Files

We’ll use the TIGER census 2022 Shapefile dataset in our examples. Readers can navigate to and download the datasets. We’ve already prepared the data for Arizona state in our public S3 bucket so we can simply run this code to load the Shapefiles.

Now we can simply save the DataFrame as GeoParquet using df.write.format("geoparquet").save:

Loading GeoParquet Files into a Spatial DataFrame

We can call"geoparquet").load to load DataFrame from GeoParquet files. The geometry columns will be decoded as geometry objects automatically.

As we can see, the geometry column was read back as a geometry column. Users can directly apply ST functions on this column without first parsing it from WKB.

Interoperability with GeoPandas

The primary goal of GeoParquet is defining a standard so that we can exchange parquet files containing geospatial data amongst various geospatial libraries and platforms. For instance, the GeoParquet files saved by Apache Sedona could be loaded by GeoPandas and vice versa. We can load the GeoParquet files we saved in the previous section using GeoPandas and see that the geometry column was correctly loaded.

Load all Census 2022 Data

The entire census 2022 dataset takes 8 GB and contains 8 million polygons. We’ve already stored all the data in a single GeoParquet file. Readers can directly load data from our GeoParquet file stored on our public S3 bucket. It is recommended to run the above query on Sedona-Spark clusters with good network connection to AWS S3, otherwise you can download the dataset from here and store it in somewhere you have fast access to.

Query the Census 2022 dataset

Now let’s run some spatial range queries on this dataset. Here we define a range query to fetch records within the city of Phoenix. We can run the query and count the result set, and use %%time to measure time elapsed running this query:

The query took more than 1 minute to finish, and the result set contains 57812 rows. Now let’s explore ways to redistribute the dataset to achieve lower query latency.

Redistribute the dataset for faster spatial queries

GeoParquet files has bbox metadata for geometry columns. If we partition the dataset to multiple GeoParquet files such that geometry objects that are close to each other were stored in the same GeoParquet file, then each GeoParquet file will cover only a small portion of the entire space. We only need to scan the GeoParquet files whose bbox overlaps with the query window and skip scanning the rest. This optimization is called “filter pushdown” and it was already implemented by Apache Sedona since 1.4.0. In this section we explore different ways to repartition the dataset to speed up spatial range queries.

Partition by State

Census 2022 dataset comes up with a STATEFP20 column, which is the state ID of the record. We can partition the dataset by this field so that records for the same state were grouped together. Now, we can run the same spatial range query. Note that we don’t need to manually add a predicate for the partitioning field or manually figure out which state the query window overlaps with. Apache Sedona will prune GeoParquet files according to their bbox metadata.

The query took less than 8 seconds. Let’s visualize the bounding boxes of each GeoParquet files to see the distribution of the partitioned dataset.

Partition by S2 Cells

If your dataset does not have a field such as STATEFP20 to directly partition by, you can partition the dataset by the S2 cell ID of the geometry. This requires selecting a proper S2 resolution level suited for your dataset. Here we repartition the dataset by the level-4 S2 cell ID of the geometry values. Let’s run the same query on this dataset partitioned by S2. Note that we don’t need to add a predicate for the s2 partitioning column.

The bounding boxes of each GeoParquet files looks like this:

Sort by the GeoHash of geometry

An alternative approach is to sort the dataset by the GeoHash of the geometry values. The way GeoHash is encoded makes geometry objects that are close to each other has similar GeoHash values. We can utilize this property to redistribute our datasets for faster queries. Running the same query again, we can see it finishes in roughly 10 seconds.

The bounding boxes of each GeoParquet files looks like this. Note that although the bounding boxes of files significantly overlaps with each other, we can still prune most of the files when running spatial queries.

The Take-Away

Apache Sedona supports reading and writing GeoParquet files - A variant of Parquet for exchanging spatial data. Apache Sedona can perform spatial filter pushdown when running spatial range queries on GeoParquet files. You can group your GeoParquet dataset by spatial proximity to achieve faster spatial query performance. You can try Apache Sedona with GeoParquet, by downloading Apache Sedona here


If you want to access the notebook for running the experiments in this blogpost, please sign up here

664 views0 comments
bottom of page