How Software AG Makes IoT Analytics Available on the Data Lake with Dremio
With billions of connected devices, the Internet of Things (IoT) has driven a massive increase in data volume and diversity. According to a Cisco white paper from 2011, we passed the tipping point of having more connected “things” than people sometime between 2008 and 2009. It’s now projected we’ll see 30.9 billion connected devices by 2025.
IoT data is growing exponentially. Generating value from all that data, however, remains challenging. That’s why Software AG, an enterprise software company based in Darmstadt, Germany, built a self-service IoT data platform with the aim of making IoT projects easier. Software AG’s Cumulocity IoT data platform supports a variety of use cases, from real-time streaming analytics to batch processing for historical large-scale analytics on a data lake. For the latter, the company’s platform enables customers to integrate and analyze IoT data on the lake with a data hub powered by Dremio’s open lakehouse platform.
Here’s a look under the hood of Software AG’s Cumulocity IoT DataHub, along with considerations and best practices for offloading, storing, and managing IoT data on a data lake. (You can watch Software AG’s presentation at Subsurface LIVE 2022, and read more in the Cumulocity IoT DataHub fact sheet and white paper.)
Ingesting Data to a MongoDB Operational Store
In IoT data infrastructure, data arrives continuously. For the Software AG team, that means dealing with large volumes of data constantly flowing into the platform from various devices using different protocols. Regardless of the device or protocol, the team converts the data to Software AG’s canonical format (JSON) and stores it in an operational store – in this case, MongoDB.
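As a rough illustration of that normalization step, the sketch below maps a raw device reading onto a single canonical JSON document before it would be written to the operational store. The `to_canonical` helper and the field names (`source`, `time`, `values`) are assumptions for illustration, not Cumulocity IoT’s actual schema.

```python
import json

def to_canonical(device_id, payload):
    """Map one raw device reading onto a canonical JSON measurement.

    Illustrative only: real payloads arrive over many protocols and
    the actual canonical format is defined by Cumulocity IoT.
    """
    return json.dumps({
        "source": {"id": device_id},
        "time": payload["ts"],                      # ISO-8601 timestamp
        "type": payload.get("type", "measurement"),
        "values": payload["readings"],              # sensor name -> value
    }, sort_keys=True)

doc = to_canonical("dev-42", {"ts": "2022-03-01T12:00:00Z",
                              "readings": {"temperature": 21.5}})
```

Whatever the source protocol, every reading ends up in the same JSON shape, which is what makes a single downstream offloading pipeline possible.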
The MongoDB store isn’t the final destination, however. It’s much more cost-efficient and performant to offload the data to cloud data storage and make it available for analytics there.
Offloading to the Cumulocity IoT DataHub with Dremio
The Cumulocity IoT DataHub offloads historical data in the operational store to a data lake (e.g., Amazon S3 or Azure Data Lake Storage Gen2), converting the JSON files to Apache Parquet. The customer can then use traditional BI tools (such as PowerBI, Qlik, or Tableau) directly against the historical data in their own data lake. The DataHub also enables customers to train ML models on the historical data; they can then operationalize those models in streaming analytics. Dremio’s open lakehouse platform powers the DataHub to make these analytics use cases possible. Here’s how it works in detail.
Software AG’s IoT domain model has various asset types, including alarms, events, devices, and measurements (mostly time-series data coming from sensors). A variety of customer use cases means the data is shaped differently. For instance, one customer might send time-series data with only a handful of values per measurement, whereas another might send several hundred values per measurement. All this differently shaped data must be processed and transformed into Cumulocity IoT’s relational schema so that it can be stored (say, as Parquet files on S3). In practice, that means flattening the nested JSON attributes and using their JSON path notation as column names.
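A minimal sketch of that flattening idea, assuming a simple recursive walk over the document (the `flatten` helper and the `c8y_Temperature` payload are illustrative, not the actual transformation code):

```python
def flatten(doc, prefix=""):
    """Flatten a nested JSON document into columns keyed by JSON path.

    Differently shaped payloads all map onto flat column names such as
    "c8y_Temperature.T.value", which can serve as relational columns.
    """
    cols = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            cols.update(flatten(value, path))   # recurse into nested objects
        else:
            cols[path] = value
    return cols

row = flatten({"c8y_Temperature": {"T": {"value": 21.5, "unit": "C"}}})
# row: {"c8y_Temperature.T.value": 21.5, "c8y_Temperature.T.unit": "C"}
```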
Software AG’s ETL procedure uses Dremio’s CREATE TABLE AS functionality to write the raw, unprocessed data as Parquet files to the staging area of the customer’s data lake. Next, the newly arrived batch is analyzed to discover the shape of the time-series arrays – for instance, which data types and series names they contain. The final step applies local sorting and partitioning.
This step generates as few Parquet files as possible, appends them to the folder structure of the existing data lake table, and refreshes the metadata. From that point on, the new data is available for customers to query, either with Dremio or with other tools that read directly from the data lake.
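The three-step flow above can be sketched as follows. The `run_sql` client, the table names, and the batch-analysis query are all hypothetical; the SQL strings follow Dremio’s CTAS syntax (including its PARTITION BY and LOCALSORT BY clauses), but the exact statements Cumulocity IoT DataHub issues differ in detail.

```python
def offload_batch(run_sql, batch_id):
    """Sketch of the three-step offload from the operational store."""
    # 1. Dump the raw batch, unprocessed, into the staging area via CTAS.
    run_sql(f"""
        CREATE TABLE staging.measurements_{batch_id}
        AS SELECT * FROM mongo.measurements WHERE batch = '{batch_id}'
    """)
    # 2. Analyze the new batch to discover the shape of the time-series
    #    arrays (series names, data types) before writing the final table.
    shape = run_sql(f"SELECT * FROM staging.measurements_{batch_id} LIMIT 1000")
    # 3. Rewrite with time-based partitioning and local sorting, appending
    #    to the existing data lake table.
    run_sql(f"""
        CREATE TABLE lake.measurements
        PARTITION BY (year, month, day)
        LOCALSORT BY (time)
        AS SELECT * FROM staging.measurements_{batch_id}
    """)
    return shape

# Usage with a stub client that just records the statements issued:
executed = []
def fake_run_sql(stmt):
    executed.append(" ".join(stmt.split()))
    return []

offload_batch(fake_run_sql, "2022_03_01_12")
```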
Optimizing and Managing Data in the IoT Data Lake
Once data has been offloaded, the team follows a number of data management and optimization best practices to make sure that performance doesn’t degrade.
Partitioning: The Software AG team tries to lay data out physically so that it matches the most common query patterns. For IoT data, most queries have a time component (for instance, finding the average temperature in a table where the time is between A and B). For these query patterns, partitioning by time is clearly the best choice, and by default Cumulocity IoT partitions by year, month, and day. As a result, when time-range queries come in, the engine (whether Dremio Sonar, Spark, or something else) can do efficient partition pruning, which heavily reduces I/O, speeds up the query, and reduces costs.
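The pruning effect is easy to see in a small sketch: with a year/month/day folder layout, a time-range query only ever has to touch the folders whose date falls inside the range. The path layout below mirrors a common Hive-style convention and is illustrative.

```python
from datetime import date, timedelta

def partitions_for_range(start, end):
    """Yield the partition folders a time-range query must scan."""
    day = start
    while day <= end:
        yield f"year={day.year}/month={day.month:02d}/day={day.day:02d}"
        day += timedelta(days=1)

# A three-day query touches exactly three folders; every other partition
# in the table is pruned without any I/O.
scanned = list(partitions_for_range(date(2022, 3, 30), date(2022, 4, 1)))
```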
Compaction: Another important practice involves housekeeping operations, particularly around small files. Cumulocity IoT customers may have very different use case sizes. Since the data team does hourly offloading of data from the operational store, customers with smaller data inflows end up with a large number of small Parquet files. In the long term, this leads to performance degradation. To fix that problem, the team implemented a nightly compaction procedure that runs on all tenants. The procedure picks up the 24 incremental folders from the previous day and compacts them, rewriting them into as few Parquet files as possible.
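The planning side of such a nightly compaction can be sketched as a simple greedy bin-packing of the previous day’s 24 hourly files: fill each output file up to a target size, then start a new one. The 256 MB target and the 20 MB hourly file sizes are illustrative assumptions, not values from the Software AG team.

```python
TARGET_BYTES = 256 * 1024 * 1024  # assumed target output file size

def plan_compaction(file_sizes):
    """Greedily group hourly files into as few target-sized outputs as possible."""
    groups, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > TARGET_BYTES:
            groups.append(current)          # close the full output file
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 24 small hourly files from one day collapse into a handful of outputs.
hourly = [(f"hour={h:02d}.parquet", 20 * 1024 * 1024) for h in range(24)]
plan = plan_compaction(hourly)
```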
Denormalization: A third performance optimization for the IoT data lake involves denormalizing data as early as possible. For example, data is generated with a reference to the device that produces the data. For reporting purposes, you may not be interested in the device’s ID, but you want the device name or location. It makes sense to denormalize and join the data during the offloading procedure and store it in the data lake in denormalized fashion instead of doing the join at runtime (which can be more complex, given that there may be billions of records in the use case).
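A minimal sketch of that offload-time join, with illustrative field names (the actual Cumulocity schema and join keys differ):

```python
def denormalize(measurements, devices):
    """Attach device name and location to every measurement at offload time,
    so lake queries don't need a runtime join against the device table."""
    by_id = {d["id"]: d for d in devices}
    for m in measurements:
        device = by_id[m["source_id"]]
        yield {**m,
               "device_name": device["name"],
               "device_location": device["location"]}

rows = list(denormalize(
    [{"source_id": "dev-42", "temperature": 21.5}],
    [{"id": "dev-42", "name": "Boiler A", "location": "Hall 3"}],
))
```

Paying the join cost once per offloaded batch is cheap compared to repeating it over billions of records at query time.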
Looking Ahead: Apache Iceberg Advantages for IoT
For the future, the Software AG team is looking to adopt Apache Iceberg. The open source table format has a number of features that make it particularly useful for the Cumulocity IoT platform, including:
- Schema evolution: In IoT, firmware can change, devices can change, and sometimes mixed-type problems arise on the data lake where series names remain stable but the data type changes. Iceberg provides the ability to find the type of column through a metadata operation, and thus keep tables strongly typed. Iceberg’s support for in-place table evolution makes it a good choice for Cumulocity IoT.
- Dynamic partition evolution: This property would let the team adjust the granularity of partitions based on the amount of inflowing data.
- Compaction: Iceberg supports out-of-the-box data compaction, which would help the team manage small files.
- Serializable isolation: Using guarantees like serializable isolation when data is written to the data lake in Iceberg would help the team avoid temporary data inconsistencies during ETL processes or during compaction.
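The partition-evolution point can be sketched as a policy that picks folder granularity from observed inflow, which Iceberg would let the team change on an existing table without rewriting old data. The thresholds and the `choose_partition_granularity` helper are purely illustrative assumptions, not values from the Cumulocity team.

```python
def choose_partition_granularity(rows_per_day):
    """Pick a partition scheme based on a tenant's daily data inflow."""
    if rows_per_day < 100_000:
        return ["year", "month"]                 # small tenant: coarser folders
    if rows_per_day < 50_000_000:
        return ["year", "month", "day"]          # matches the current default
    return ["year", "month", "day", "hour"]      # very large inflows

granularity = choose_partition_granularity(5_000_000)
```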
To learn more, visit Software AG’s Cumulocity IoT platform and see how the leading self-service IoT platform can help you ensure reliable, scalable operations – then get started with a free trial. You can also try Dremio’s open lakehouse platform with a forever-free account.