May 2, 2024

Havasu: A Table Format for Spatial Attributes in a Data Lake Architecture

Havasu is an open table format that extends Apache Iceberg to support spatial data. Havasu introduces a range of features, including native support for manipulating and storing geometry and raster objects directly in data lake tables, and enables seamless querying and processing of spatial tables using Spatial SQL.

In this talk we will cover how Havasu supports spatial data types and storage (including GeoParquet), spatial statistics, and spatial filter push down and indexing.

Topics Covered

Iceberg and Table Formats

Sign up to watch all Subsurface 2024 sessions

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

William Lyon:

Hi, everyone. Thanks for joining today. So, like Mike said, the title of this talk is Iceberg Havasu, a Spatial Data Lakehouse Format. So, in this talk, we’re going to be talking about working with geospatial data in Apache Iceberg. So, my name is Will. I work for a company called Wherebots on the developer relations team. You can find some of my contact info there on the screen. Feel free to reach out. Always happy to chat with folks. You can grab the slides at bit.ly/subsurface-havasu or scan that QR code. I’ll have the link again at the end if you miss it here. 

So, Wherebots was founded by the original creators of the open source Apache Sedona project. If you’re not familiar with Apache Sedona, it’s a project that extends the functionality of distributed compute frameworks like Apache Spark or Flink or Snowflake to add geospatial functionality. So, adding geospatial types and optimizations for really large scale geospatial data processing. And at Wherebots, we’re building the spatial analytics and AI cloud on top of Apache Sedona. So, managing the cluster for you and adding more cloud native features to give you more cloud native develop experience for working with large scale spatial data. And one of those, like, cloud native developer experiences that we really wanted in Wherebots cloud was this DBMS-like, this relational database-like experience for working with large scale geospatial analytics. So, having familiar SQL tables, familiar SQL syntax, but really underlying that, having this massive data lake of spatial data in Parquet files. 

So, this really was the driver for the need for a spatial data lake house table format, which gets us to Havasu. Havasu is an extension of Apache Iceberg. So, this is an open table format that adds support for working with large scale spatial data to Iceberg. 

So, for the rest of the talk here, I guess I’d like to, you know, first do a little bit of motivation here to talk about, you know, why are we interested in geospatial for Iceberg? Then we’ll talk about some of the features in Havasu, and then at the end, we’ll have a demo to see this in action. 

What is Geospatial Analytics

So, first of all, what is geospatial analytics? Well, rather than give you, like, an exact definition, I think it’s helpful to see some of the industries, some of the use cases, some of the data sources that are involved, and especially think about, like, the scale of the data involved here. So, for example, in the real estate industry, a common use case is something like site selection, identifying investment opportunities, managing infrastructure, planning, and the data sources might be parcel data, building footprint data, maybe derived from satellite imagery, mobility data. In the risk analysis world, this could be things like disaster modeling, pricing risk, pricing insurance products, assessing geopolitical risks. 

So, the data sources might be things like flood models, weather patterns, satellite imagery that we apply, maybe, object detection machine learning to. In ecology, things like migration patterns, species interactions, the impacts of infrastructure. In the telecom industry, these are going to be things like cell tower analysis, where should I place my next cell tower? Do I have good quality of service for my current customers, depending on analyzing call detail records, mobility data? Elevation models are also important here if I need line of sight for things like microwave signals, these sorts of things. Transportation and logistics, things like traffic changes into account, efficient routing, using road network data, telemetry, satellite imagery. So, these are some examples, but really, when we’re talking about geospatial analytics, really talking about making sense of working with data that has some spatial component. This blog post, the periodic table of spatial analysis, I thought was pretty helpful to give a broad overview of some of the functionality involved in spatial analytics. So, that’s something to check out if you want to see some of the techniques involved. 

Challenges with Geospatial Analytics

There’s some common challenges that come up when we’re working with large-scale geospatial data. There’s this saying that spatial is special, which implies that due to the specialized nature of the data, the specialized knowledge and experience, that working with spatial data requires specialized tooling. And some of the challenges that come up are things like the data representation format. So, in the geospatial world, there’s vector data and raster data. We’ll talk about what those are in a minute. These are different ways of representing spatial data. We have different coordinate systems in the spatial world. So, these are different ways of mapping points on the surface of the Earth using different units of measure. We have projected coordinate systems, which map a curved surface of the Earth to a 2D representation. 

There’s some distortion involved. You need to be able to translate data back and forth from coordinate systems if we’re working with multiple data types. So, point line polygon, these are the basic vector geometry types. But these can actually be quite complicated. If we think of a very complex, very large polygon, that can actually be a huge data structure that might just represent one attribute and one row in our data. The indexing structures that we use, because spatial data is by nature multidimensional, the indexing structures we use to work with spatial data can be a bit different from other data systems that we’re familiar with. And in a distributed system, spatial partitioning plays a really important role, making sure we’re efficiently distributing data throughout our cluster. 

So, Apache Sedona was created to address many of these challenges. We said Apache Sedona is an open source project that adds spatial types, optimizations for spatial queries, spatial processing functionality to distributed compute frameworks like Apache Spark. 

Spatial SQL

Now, Spatial SQL, this is one of the most common ways with Apache Sedona and where bots and other geospatial tools that we interact with spatial data. This is based on a standard from OGC, the Open Geospatial Council. This is the OGC SQL access standard that defines these ST_ functions. So, things like being able to construct a geospatial type, predicates that we might use in a join clause, aggregations, also working with raster data as well. Let’s take a look at some examples of the kinds of spatial queries that Apache Sedona introduces optimizations for. And if you think about how these execute on very, very large scale data, think about the kinds of indexing, the kind of optimizations that we want to have available. So, this first one is a spatial range query. So, in this example, we’re searching for point geometries that fall within a polygon. I just kind of drew a polygon roughly approximating the Gulf of Mexico. And in our SQL statement, select from where. So, we’re selecting the location of these structures from our offshore infrastructure table, where ST within. So, where the geometry of these offshore infrastructure points are within this polygon. 

Another spatial query is the spatial join query, where we have two tables, in this case, a table of airports. So, point geometries and country boundaries. So, polygons. And we want to join these tables where the geometry of the airport is within the geometry of the country. Then we might do an aggregation, a group by maybe a count of the number of airports within each country. 

Spatial KNN, K-Nearest Neighbors query based on distance. So, here we’re searching for points of interest in a specific category that are closest to some point near San Francisco. So, we want to be able to efficiently access data in this case that is within close spatial proximity. Again, think of the distributed indexing and partitioning functionality that is needed for executing this query on a very, very large dataset. 

Raster data, which we’ve mentioned a few times. If you’re not familiar with raster data, raster data, these are things like satellite imagery, or this could be something like population estimates. But basically, we’re working with gridded data that each cell has a value, one or more values. If we’re talking about satellite imagery, we could have maybe a three or four band raster that has a visible spectrum like our red, green, blue values for the bands. And if we’re working with large raster data, we’ll typically tile those. So, each tile of the raster of the image in this case, represents one row in the table. 

Then to query that, oftentimes we want to work with individual cell values within the raster. So, here we’re taking two rasters, the average Earth surface temperature over two years, subtracting the difference using the map algebra spatial SQL function to give us the distance. So, where was the Earth warmer? Where was it colder from one year to the other? So, this brings us back to Havasu and this architecture that we want to offer as developer experience for users, where we have multiple applications that are querying a single Havasu iceberg table using spatial SQL at the same time, supporting concurrent operations. Underneath that, we have this data lake partitioned parquet or partitioned geoparquet files. 

Havasu Spacial Table Format

So, there are two interesting things to note here. One is the geoparquet aspect, which we’ll talk about in a second, and then the Havasu table format. So, let’s dig into Havasu and we’ll talk a bit about the features there and what that looks like. So, Havasu is an extension of Apache Iceberg. I know a lot of you are familiar with Iceberg. If you’re not, Iceberg is this open table format with the goal of introducing reliable, simple SQL tables on top of huge analytic datasets. So, we can think about the table as the abstraction, not think about how are we going to partition our parquet files, these sorts of things. It allows us to bring compute to the data by having multiple query engines accessing our tables. 

So, Havasu is this open source extension of Iceberg that adds support for spatial geometry and raster data types, and also takes into account the spatial column metadata, which is really important for ensuring efficient spatial queries. And then we can encode that in parquet files, optionally as geoparquet to ensure compatibility with the ecosystem there. So, what are the benefits? Well, the main benefit is that we have this spatial DBMS-like experience. We have tables as the abstraction that we can send spatial SQL to query these tables. That’s the developer experience that we’re shooting for. But then we have performance benefits because we’re enforcing spatial integrity at the right time of these tables. We don’t have to parse and transform spatial data at the application level. And because we’re taking into account spatial metadata, the spatial filtering and queries can be optimized to add things like spatial filter pushdown. And then, of course, we have all the benefits of Apache Iceberg as well. What’s the Havasu layer? 

GeoParquet

Let’s talk a bit about this geoparquet data lake layer at the bottom there. So, geoparquet, if you’re not familiar, is a specification for working with geospatial data in Apache Parquet. So, really, the goal here is to introduce all the benefits of Parquet, which is namely efficient storage and efficient retrieval, but now also for spatial data. And this is a community effort. There’s been a few years in the works here. I think the 1.0 release came out in October of last year. It’s just about to the 1.1 release now. Parquet, if you’re not familiar, like I said, is this column-oriented data store that gives us very efficient data storage and data retrieval. I’m going to skip over these slides to make sure we have time for the demo here. But this goes into a bit of detail of how Parquet works. One important aspect is this idea of metadata for chunks or row groups. And so, Parquet keeps track of these statistics for the minimum and maximum values within a chunk or within a row group that readers at query time can use to prune or exclude chunks to not have to read, which allows for very efficient queries. 

So, geoparquet specifies how geometries should be serialized as binary, and then the benefits of Parquet can apply to those for efficient storage. So, I mentioned the 1.1 version of geoparquet is coming very soon. This adds support for geo-arrow encoding and also taking into account row-level bounding box column, which enables this pruning at the row group level. 

Spatial Attributes: Vector

So, let’s talk about Havasu a bit more. We said that Havasu adds support for vector spatial attributes. Vector spatial attributes, these are points, lines, polygons, and their associated attributes. So, for the geometry values, we have the shape of the geometry and the associated SRID, the spatial reference, the coordinate system, essentially. And we can encode these in Parquet using a variety of formats, optionally geoparquet, which we just reviewed. Now, raster data, so we said raster data is gridded data where we have cells, each cell maps to some area on the surface of the earth and has one or more values in bands. And there are two ways to persist raster data in Iceberg Havasu. One is NDB to store the actual band values alongside the geo-reference data in the actual data files or OutDB, which stores the geo-metadata in Havasu, but just holds a reference to something like cloud object storage. Typically, these are cloud-optimized geotips. And this allows us to take advantage of cloud object storage and can be much more efficient for working with large-scale raster data. 

Spatial Filter Pushdown

So, by taking into account the spatial metadata, if we go back to that example of the spatial range query where we’re filtering for points within a polygon, we can exclude certain data files based on the query window. This is spatial filter pushdown. In order to enable this, we need to create an index. So, Havasu supports the create spatial index functionality that uses Hilbert space-filling curve. This is really shorthand for Iceberg’s rewrite data files to essentially sort our rows according to spatial proximity. In Whereabouts Cloud, we also use the Iceberg catalog feature to expose this table database abstraction to the user. So, we implement an Iceberg catalog as a screenshot from Whereabouts Cloud, just showing some databases and tables that I have in my account. And we also use this for the Whereabouts Open Data Catalog, which are curated spatial datasets that we make available to each Whereabouts Cloud user. 

There’s been some exciting developments recently on the aspect of contributing some of these features back to Apache Iceberg. So, there’s, in collaboration with the community and some folks at Apple and other organizations, there’s now a proposal on GitHub to add geospatial support to Iceberg. So, if you’re interested in that, you can check out the GitHub issue linked there. 

Creating an Havasu Table

Okay. So, let’s see this in action. I’m going to switch over to this notebook that I have running in Whereabouts Cloud. And so, I’ve loaded some data from Bird Buddy. So, I have a data frame here that has observations of bird species and when and where they were found. And what I’m going to do now is create a Havasu table with this data. So, first of all, we can, you know, look at the catalog. So, catalog is kind of the top level namespace. By default, we use the Whereabouts namespace. So, here I’m creating a Havasu table, Whereabouts Bird Buddy observations. So, Whereabouts is the catalog, Bird Buddy is the database or namespace, observations is the table. We can look at all of the databases that I have in this catalog and we can look at the tables within the Bird Buddy namespace, which is just the observations. And we can describe the table and we can insert values. So, here I’m going to insert one observation. I found a purple finch and we can query that using SQL. We can visualize it using Kepler. So, this is the Kepler integration for Sedona. You can see our one lonely observation. Let’s write the data frame that we had above to our Bird Buddy observations table. I think this is maybe something like 25 million observations. And then we can create a spatial index. So, we created an index on the location property. And again, this is kind of syntactic sugar for Iceberg’s rewrite data files, which is now going to rewrite the parquet files, the partitioned parquet files, sorting them by spatial proximity. 

By the way, we can see these in Whereabouts UI. So, this is the Whereabouts Cloud UI in my Whereabouts catalog. If I take a look at the underlying files for the observation table, first of all, we can see the data files. These are partitioned parquet files. You can see the metadata files stored there as well. And so, now we can apply some spatial queries to this table. So, we can search for observations that intersect some buffer around a point in New York. And we can visualize these. So, here are all of the observations in this table that we found within a certain radius of this point in Manhattan. We can also work with raster data. So, in this example, we are going to load a raster image. So, this is a dataset from NASA of nighttime visible light, which is a really interesting dataset that can be used to derive some economic indicators as well. So, if we’re observing nighttime lights, that’s an indicator that there’s some amounts of industrial activity. So, we’re going to use this RSFromPath function to load an OutDB raster from an S3 bucket. This is a GeoTIFF. And then typically, if we have rasters that cover the entire earth, we will tile those. So, we’re tiling these now into 256 by 256 tiles. And now we can write that to a Havasu table. And this is quite small. We have only 11,000 rows in this table. Gives you a sense of what that looks like. And we zoom out here in the visualization. We can see how these are tiled across the surface of the earth. 

We can do things like find the tile that intersects a certain point and extract the value at that point. So, at this point, the value is 63, which means that there’s some measure of nighttime light there. Another operation we can do is zonal statistics. So, here we’re going to load a counties shapefile. So, these are boundaries of all US counties. And then we’re going to use this RSZonalStat. So, zonal statistics is basically taking some vector geometry, some boundary, identifying all the raster cells within that boundary and then either doing a count of the bound values or in this case we’re taking the average to get a sense of the average light exposure within each county, which we can then visualize. And you can see this maps largely to population density. There’s maybe some interesting outliers of perhaps industrial activity, things like perhaps nighttime oil operations or things like that. But it gives a sense of what are the counties with the most light exposure at night. 

Cool. Well, that was a quick demo. The code is linked there if you’d like to check it out. If you’d like to get started with Havasu today, you can sign up for a free Wherobots Cloud account, which gives you access to all the functionality we looked at today. There’s also this slide, which has some good resources for getting started as well. And I’ll just end here on this slide so you can grab the link to the slides, scan that QR code, and then you can go through and find any of the links that you’re looking for.

header-bg