With billions of connected devices, the Internet of Things (IoT) has driven a massive increase in data volume and diversity. According to a Cisco white paper from 2011, we reached the tipping point of having more connected “things” than people somewhere between the years 2008 and 2009. It’s now projected we’ll see 30.9 billion connected devices by 2025.
IoT data is growing exponentially. Generating value from all that data, however, remains challenging. That’s why Software AG, an enterprise software company based in Darmstadt, Germany, built a self-service IoT data platform with the aim of making IoT projects easier. Software AG’s Cumulocity IoT data platform supports a variety of use cases, from real-time streaming analytics to batch processing for historical large-scale analytics on a data lake. For the latter, the company’s platform enables customers to integrate and analyze IoT data on the lake with a data hub powered by Dremio’s open lakehouse platform.
Here’s a look under the hood of Software AG’s Cumulocity IoT DataHub, along with considerations and best practices for offloading, storing, and managing IoT data on a data lake. (You can watch the story from Software AG’s presentation at Subsurface LIVE 2022, as well as read more in the Cumulocity IoT DataHub fact sheet and white paper.)
In IoT data infrastructure, data arrives continuously. For the Software AG team, that means dealing with large volumes of data constantly flowing into the platform from various devices using different protocols. Regardless of the device or protocol, the team converts the data to Software AG’s canonical format (JSON) and stores it in an operational store – in this case, MongoDB.
The MongoDB store isn’t the final destination, however. It’s much more cost-efficient and performant to offload the data to cloud data storage and make it available for analytics there.
The Cumulocity IoT DataHub offloads historical data in the operational store to a data lake (e.g., Amazon S3 or Azure Data Lake Storage Gen2), converting the JSON files to Apache Parquet. The customer can then use traditional BI tools (such as PowerBI, Qlik, or Tableau) directly against the historical data in their own data lake. The DataHub also enables customers to train ML models on the historical data; they can then operationalize those models in streaming analytics. Dremio’s open lakehouse platform powers the DataHub to make these analytics use cases possible. Here’s how it works in detail.
Software AG’s IoT domain model has various asset types, including alarms, events, devices, and measurements (mostly time-series data coming from sensors). A variety of customer use cases means that data is differently shaped. For instance, one customer might have time-series data with only a handful of values sent as one measurement, whereas another customer might have a series with several hundred values sent as one measurement. All this differently shaped data must be processed and transformed into Cumulocity IoT’s relational schema so that it can be stored (say, as Parquet files on S3). That means using the different JSON attributes, expressed in JSON path notation, as column names.
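To make the mapping concrete, here is a minimal sketch (not Software AG's actual code) of turning a nested JSON measurement into a flat row whose column names use JSON path notation; the measurement document and its field names are illustrative assumptions.

```python
def flatten(doc, prefix=""):
    """Recursively flatten nested JSON into {json.path: value} columns."""
    columns = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested objects contribute dotted path segments to the column name.
            columns.update(flatten(value, path))
        else:
            columns[path] = value
    return columns

# A hypothetical Cumulocity-style measurement document.
measurement = {
    "time": "2022-05-01T12:00:00Z",
    "c8y_TemperatureMeasurement": {"T": {"value": 21.5, "unit": "C"}},
}
row = flatten(measurement)
# row keys: "time", "c8y_TemperatureMeasurement.T.value",
#           "c8y_TemperatureMeasurement.T.unit"
```

A measurement with a handful of values yields a handful of columns; one with several hundred series values yields several hundred, which is why the schema has to be discovered per batch rather than fixed up front.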
Software AG’s ETL procedure uses Dremio’s CREATE TABLE AS functionality to write the raw, unprocessed data as Parquet files in the staging area of the customer’s data lake. Next, the newly arrived batch must be analyzed to discover the shape of the time-series array – for instance, what data types and series names it contains. The final step applies local sorting and partitioning.
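The shape-discovery step might look something like the following sketch: scan a batch of flattened rows and collect the series names present along with the value types observed for each. The batch contents and column names are assumptions for illustration.

```python
def discover_shape(rows):
    """Collect column names and the set of value types seen for each."""
    schema = {}
    for row in rows:
        for col, value in row.items():
            schema.setdefault(col, set()).add(type(value).__name__)
    return schema

# Two hypothetical flattened measurements with different series present.
batch = [
    {"time": "2022-05-01T12:00:00Z", "series.temperature": 21.5},
    {"time": "2022-05-01T12:01:00Z", "series.temperature": 22.0, "series.humidity": 40},
]
shape = discover_shape(batch)
# shape -> {"time": {"str"}, "series.temperature": {"float"}, "series.humidity": {"int"}}
```

Once the series names and types are known, the batch can be written with a stable relational schema before the sorting and partitioning step.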
This step generates as few Parquet files as possible, appends them to the folder structure of the existing data lake table, and refreshes the metadata. From that point on, new data is available for customers to query either with Dremio or with other query tools that read directly from the data lake.
Once data has been offloaded, the team follows a number of data management and optimization best practices to make sure that performance doesn’t degrade.
Partitioning: The Software AG team tries to align data as physically close as possible to the most common query patterns. For IoT data, most queries have a time component (for instance, finding the average temperature in a table when the time is between A and B). For these query patterns, partitioning by time is clearly the best choice. By default, Cumulocity IoT uses year, month, and day as the partitioning schema. As a result, when time-range queries come in, the engine (whether Dremio Sonar, Spark, or something else) is able to do efficient partition pruning, which in turn heavily reduces I/O, speeds up the query, and reduces costs.
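The mechanics of year/month/day partitioning and partition pruning can be sketched as follows, assuming a Hive-style folder layout (the directory naming is an illustrative assumption, not Cumulocity IoT's actual layout):

```python
from datetime import date, datetime

def partition_path(ts: datetime) -> str:
    """Map a timestamp to its year/month/day partition folder."""
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

def prune(partitions, start: date, end: date):
    """Keep only partitions whose date falls inside [start, end] --
    the engine never touches files in the discarded folders."""
    kept = []
    for p in partitions:
        y, m, d = (int(part.split("=")[1]) for part in p.split("/"))
        if start <= date(y, m, d) <= end:
            kept.append(p)
    return kept

parts = [partition_path(datetime(2022, 5, day)) for day in (1, 2, 3)]
prune(parts, date(2022, 5, 2), date(2022, 5, 3))
# -> ["year=2022/month=05/day=02", "year=2022/month=05/day=03"]
```

Because a time-range predicate maps directly onto folder names, a query over two days out of a year skips the other 363 partitions entirely, which is where the I/O and cost savings come from.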
Compaction: Another important practice involves housekeeping operations, particularly around small files. Cumulocity IoT customers may have very different use case sizes. Since the data team does hourly offloading of data from the operational store, customers with smaller data inflows end up with a large number of small Parquet files. In the long term, this leads to performance degradation. To fix that problem, the team implemented a nightly compaction procedure that runs on all tenants. The procedure picks up the 24 incremental folders from the previous day and compacts them, rewriting them into as few Parquet files as possible.
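As a toy illustration of that nightly procedure, the sketch below merges the previous day's 24 hourly increments into a single sorted batch; file handling is simulated with in-memory lists, whereas the real job would rewrite Parquet files on the lake.

```python
def compact(hourly_batches):
    """Merge many small hourly batches into one time-sorted batch,
    standing in for rewriting 24 small Parquet files as one larger file."""
    merged = [row for batch in hourly_batches for row in batch]
    merged.sort(key=lambda r: r["time"])
    return merged

# A small tenant: each hourly offload produced just one row (one tiny file).
hourly = [
    [{"time": f"2022-05-01T{h:02d}:30:00Z", "value": h}] for h in range(24)
]
compacted = compact(hourly)
# 24 rows now live in one batch instead of 24 one-row files.
```

The read-side benefit is fewer file opens and fewer footers to parse per query; the sort also preserves the local ordering that makes time-range scans cheap.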
Denormalization: A third performance optimization for the IoT data lake involves denormalizing data as early as possible. For example, data is generated with a reference to the device that produces the data. For reporting purposes, you may not be interested in the device’s ID, but you want the device name or location. It makes sense to denormalize and join the data during the offloading procedure and store it in the data lake in denormalized fashion instead of doing the join at runtime (which can be more complex, given that there may be billions of records in the use case).
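The offload-time join can be sketched like this: each measurement is enriched once with its device's name and location, so later reporting queries need no runtime join. The field names ("deviceId", "name", "location") are illustrative assumptions.

```python
def denormalize(measurements, devices):
    """Join each measurement with its device's attributes at offload time."""
    by_id = {d["id"]: d for d in devices}  # build the lookup once
    enriched = []
    for m in measurements:
        device = by_id[m["deviceId"]]
        enriched.append({**m, "deviceName": device["name"],
                         "deviceLocation": device["location"]})
    return enriched

devices = [{"id": 7, "name": "pump-7", "location": "hall-A"}]
measurements = [{"time": "2022-05-01T12:00:00Z", "deviceId": 7, "value": 21.5}]
rows = denormalize(measurements, devices)
```

Paying the join cost once per offloaded batch, against a small device table, is far cheaper than joining billions of measurement records at query time.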
For the future, the Software AG team is looking to adopt Apache Iceberg, an open source table format with a number of features that make it particularly useful for the Cumulocity IoT platform.
To learn more, visit Software AG’s Cumulocity IoT platform. See how Software AG’s Cumulocity IoT platform, the leading self-service IoT platform, can help you ensure reliable, scalable operations. Get started with a free trial. And get started with Dremio’s open lakehouse platform with a forever-free account.