Mo Sarwat
Release Apache Sedona 1.4 – A Giant Leap in Geospatial Data Processing
To the fantastic Apache Sedona community,
We're exhilarated to share the release of Apache Sedona 1.4 (https://sedona.apache.org/1.4.1/), encompassing both 1.4.0 and 1.4.1 versions, signifying a transformative stage in geospatial analytics. These successive versions introduce an array of striking new features, major enhancements, and indispensable bug fixes.
Highlights
Here are the standout features spanning across these two versions:
Use SedonaContext as a unified entry point (SEDONA-287): This allows you to use the same method to initialize your Sedona environment in different computation engines without tedious configuration.
Geodesic/Geography Functions Support (SEDONA-281, SEDONA-286): This functionality adds a whole new dimension to your geospatial data processing, enabling more accurate computations over geographical data.
Redesigned end-to-end raster ETL pipeline (SEDONA-251, SEDONA-269, SEDONA-292): Apache Sedona 1.4 includes new raster readers / writers, raster type, and many essential raster functions, paving the way for streamlined handling of vast volumes of geospatial data.
Google S2 based spatial join (SEDONA-235): Sedona 1.4 introduces the support of Google S2 cells which enables a speedy approximate point-in-polygon join.
Support for Spark 3.4 (SEDONA-276): Our support for Spark 3.4 allows you to merge the powerful capabilities of the latest Spark version with the enhanced geospatial features of Sedona.
Serialization and Deserialization Enhancements (SEDONA-207, SEDONA-227, SEDONA-231): Sedona 1.4 offered faster geometry serialization and deserialization, resulting in a performance boost by a factor of 3 to 7.
Native filter pushdown support in GeoParquet (SEDONA-156): this can push down spatial predicate on GeoParquet to reduce memory consumption by 10X.
Now, let's dive into Features 1-4 and learn how to leverage them in your applications. We'll discuss the remaining features in separate blog posts.
Use SedonaContext as a unified entry point (v1.4.1)
Initially designed to be a geospatial computation system on top of Apache Spark, Sedona has evolved over the years. It has extended its support to other engines and is now on a mission to become a cross-platform, scalable geospatial computation engine.
To use Sedona with different underlying engines, you must import several classes and register necessary Sedona data types, serializers, functions, and optimization rules. This process can be confusing for new users over time. Users previously had to manually import many classes and add serializer configuration. However, with the introduction of SedonaContext and other helper functions in SEDONA-287, starting applications has become much easier. Here's how to do it:

Geodesic/Geography Functions Support (v1.4.1)
As we all know, the Earth is a spheroid, making it difficult to accurately calculate geographic distances between two locations. Given the complicated surface of the Earth, finding the exact distance is a challenging task. In the GIS industry, it is considered best practice to project the geospatial objects (usually in WGS84 / EPSG:4326) to an appropriate meter-based Coordinate Reference System (CRS) before applying geometric calculations such as distance calculation. Users can choose a CRS that is close to the locations of their data. For example, in Apache Sedona, one can calculate the distance between two locations as follows:
Sedona ST_Transfrom in SQL

However, it is common for a single dataset to include geospatial objects across the globe, and there is no CRS suitable for everywhere. Therefore, it can be difficult to calculate distances accurately. In this case, the best way to measure geographic distance is either through great-circle distance (haversine distance) or spheroid distance (Vincenty's formulae or C. Karney’s algorithm).
SEDONA-281 introduces support for both great-circle distance and spheroid distance, namely ST_DistanceSphere and ST_DistanceSpheroid. Let's look at an example of calculating the distance between Hong Kong (22.308919, 113.914603) and Toronto (43.677223, -79.630556).
Sedona sphere and spheroid distance in SQL
All distance units in this example are in meters. The actual distance between Hong Kong and Toronto is about 12,544 to 12,568 kilometers. ST_DistanceSphere and ST_DistanceSpheroid both produce very accurate results. However, the ST_Transformbased approach cannot give a good answer if no proper CRS (coordinate reference system) is available.

Rule of thumb
Using ST_DistanceSphere and ST_DistanceSpheroid does not mean that the ST_Transform-based approach is no longer necessary. To select the most suitable approach, use the following rule of thumb: when calculating the distance between two geospatial objects that are far apart, such as cities in different countries or continents, use ST_DistanceSphere or ST_DistanceSpheroid. When calculating the distance between two objects in the same region such as a metropolitan area, use the ST_Transform-based approach.
Now that we are able to calculate the spherical distance between two geospatial objects, we can apply this to real-world applications that involve the distance calculation among many geospatial objects and not just two. For example, we can find the parks that are within 50KM distance from each city among the locations of cities and parks in the world (available from OpenStreetMap). This operation is called DistanceJoin in Sedona and is one of the most expensive and resource-consuming operations. Thanks to our highly efficient spatial join algorithm, Sedona can easily handle this operation on huge datasets. If you used our ST_Transform-based approach before, you may have enjoyed this algorithm for a while. SEDONA-286 extends our algorithm to support distance join over ST_DistanceSphere and ST_DistanceSpheroid. A distance join query can be written as follows:
Sedona distance join in SQL
50000 indicates 50KM

Redesigned end-to-end raster ETL pipeline (v1.4.1)
Sedona has been implementing geospatial vector data functions for several years, and the support has become mostly mature. In recent releases, the Sedona community has invested more in geospatial raster data support, including raster data reader, writer, and raster manipulation.To enable comprehensive raster data support, Sedona has made the following efforts:
Raster data type
SEDONA-251 introduces a native raster data type. Similar to the geometry type, the raster type indicates that the column in a schema contains raster data. Each record stored in this type contains two parts: metadata and raster band information. With this in place, we can create a table that has both a geometry column and a raster column.
Raster reader
To begin processing raster data, the first step is to load the raster data. Additionally, we must be able to load raster data in parallel given the vast amount of available raster data. SEDONA-251 introduces a scalable raster reader and native raster data constructors. To construct a raster column, use the following steps. We have added additional constructors, such as RS_FromGeoTiff and RS_FromAscGrid, which allow you to convert various raster binary formats to a unified raster format. After calling the geotiff constructor, we can print the schema of this table (code below). As we can see, the type of this column now is raster.

Raster writer
After processing raster data, it is necessary to store the data in an external storage. With SEDONA-269, this can be achieved in Sedona Spark as follows:
First, convert the raster data to a specific binary format, such as GeoTiff. Note that the raster can be imported from one image format, but exported to another image format. After processing the data, we can save it to an external location. Coding example is given below:

Raster functions
SEDONA-251, SEDONA-269, and SEDONA-292 together introduce several functions to transform and analyze raster data. Let's examine a few examples.
First, let's assume that we have a GeoTiff image as follows:

Extract bounding box, SRID and number of bands
We can extract the bounding box, srid, and number of bands of this raster. The result table will have a schema that contains both raster and geometry type (see below). This allows for more complex geometric operations on the table. For instance, we can save the table as a GeoParquet format and perform filter pushdown. When executing a Sedona spatial range query on this GeoParquet table, Sedona will retrieve only the data that intersects the spatial range query window.

Extract and modify an individual band
We can extract and modify any band in a raster. After modifying a band value, a user can save it to the original raster. Code example is given below:

The result of the modified raster will look like that:

Google S2-based spatial join (v1.4.0)
If you're experiencing suboptimal performance with Sedona's optimized join, it may be due to complicated and overlapping geometries. Fortunately, Sedona offers a built-in alternative: the Google S2-based approximate equi-join. This equi-join utilizes Spark's internal equi-join algorithm and may perform better than the optimized join, especially if you choose to skip the refinement step. However, be aware that skipping the refinement step will result in a loss of query accuracy.
Sedona's approximate equi-join is a useful tool to have in your arsenal when working with complex geometries. And with the recent updates to Sedona, such as the introduction of the native raster data type and the redesigned end-to-end raster ETL pipeline, Sedona is better equipped than ever to meet the evolving needs of the geospatial data processing community.

Note that steps 3 and 4 are optional. For step 3: due to the nature of S2 Cellid, the equi-join results might have a few false-positives depending on the S2 level you choose. A smaller level indicates bigger cells, less exploded rows, but more false positives. To ensure the correctness, you can use one of the Spatial Predicates to filter out them. Use this query instead of the query in Step 2. As you see, compared to the query in Step 2, we added one more filter, which is ST_Contains, to remove false positives. You can also use ST_Intersects and so on.
**Tip: You can skip step 3 if you don't need 100% accuracy and want faster query speed.
For step 4: Due to the explode function used when we generate S2 Cell Ids, the resulting DataFrame may have several duplicate <lcs_geom, rcs_geom> matches. You can remove them by performing a GroupBy query. The first() function is to take the first value from a number of duplicate values. If you don't have a unique id for each geometry, you can also group by geometry itself.
Try it now
With these improvements and additions, Apache Sedona 1.4 s better equipped to meet the evolving needs of the geospatial data processing community.
We are immensely grateful to our fantastic community for their ongoing support and contributions to the development and growth of Apache Sedona. We are excited to continue our journey together, pushing the boundaries of geospatial data processing.
Stay tuned for future updates and enjoy your data processing with Apache Sedona 1.4!