Refresh your AWS S3 Buckets With Minimal Data Lake Downtime
For every data-driven enterprise, the data lake is the first place where data lands. In this article I’ll focus on a data lake built on AWS where data containing new records is loaded into a S3 bucket a couple of times per day. Unfortunately, and completely expected, once data consumers start using this data, will realize that a certain percentage of records are incorrect, or missing and therefore requiring data to be updated or reloaded depending on the case.
Unfortunately, it is not possible to append to an existing file on S3 since the objects are immutable. The only possible workaround for this issue would be to clear the bucket, upload the new data and get the pipeline going again, this as you guessed it, means that the data lake will be down for a while and this downtime can trigger delays and operational costs that can be avoided.
Dremio can utilize reflections and Virtual Datasets to address this problem of regular updates and complete refreshes without experiencing an outage.
To handle this situation in Dremio, let’s break the problem into two parts; (1) refresh dremio with new data and (2) execute a complete data refresh without experiencing an outage.
CREATE THE INITIAL DREMIO S3 DATASOURCE
The first step is to register the S3 bucket as a Dremio datasource. The Registration process makes available the files in the S3 bucket for end users to query using Dremio.
For this S3 compatible datasource, enter the AWS Access and Access Secret Keys for your system as shown below.
Next, we need to specify how frequently Dremio should check the S3 bucket for new files provided from the ingest tool. To do this, specify a Fetch frequency of slightly more than the frequency of posting into the bucket by the ingest tool. For example, if the ingest tool posts updated files roughly every hour, then the best Metadata refresh cycle should be no less than every 2 hours as shown below.
Lastly, we will need to specify the reflection refresh rules. To save processing time, ideally we will want to select the “Incremental update based on new files” option. Then specify the refresh policy which for our example should be every 2 or more hours as shown below:
Once the S3 Datasource properties are set, Dremio will display of all the S3 buckets defined to the datasource (NOTE: for more information on the Dremio S3 Data source handing check the Dremio Documentation)
CREATE THE INITIAL S3 PHYSICAL DATASET
Now that the S3 bucket is registered to Dremio and the refresh rules established, it is now time to create a dataset from the S3 bucket. Hold the mouse over the right side of the screen for the S3 bucket folder to convert into a dataset as shown below (NOTE: this operation is not copying any data into the Dremio cluster):
Specify the dataset settings as shown below:
Then click Save and all of the CSV files posted into the S3 bucket are now visible as if a data pipeline had been manually created to load the data in a data warehouse.
The last step to do is to create a reflection. For this example, we will just create a simple RAW reflection (Check out our documentation for more information about creating reflections).
REFRESH DREMIO WITH NEW DATA
Our example here calls for a couple of new CSV files added to the S3 bucket each day. And further recall that the query outage is minimal to not at all while adding new data or refreshing.
Fortunately, Dremio in the background identifies the new file and makes all the appropriate changes to ensure user queries includes the latest version of the data based on the Metadata and Reflection refresh properties we set earlier in this process.
COMPLETE DATA REFRESH WITHOUT EXPERIENCING AN OUTAGE
Dremio leveraging virtual datasets (VDS) (not to be confused with data virtualization), provides the tools to define a semantic layer which will solve this problem.
In this case, our semantic layer definition consists of 3 layers as shown below:
As we see in this semantic layer chart, the current production Yellow Taxi queries are all going against the original Yellow S3 bucket. The trick to avoid the outage when completely refreshing the S3 Yellow Taxi data is to load the data and conduct the data curation activities off line and then quickly repoint the production Yellow Taxi production semantic layer to the V2 Yellow taxi S3 bucket producing a definition as shown below:
The steps we will take to affect this change with a minimal outage is as follows:
Step 1: Create a V2 bucket of the dataset in S3
In this step we will go through the steps described earlier in this blog to create a new bucket containing the ‘freshest’ version of the data.
Step 2: Apply the data format to the new S3 V2 bucket thus creating a new Dremio Physical dataset within the original datasource. In the example below, we see below that there are 2 S3 buckets containing the Yellow Taxi data; yellow and yellow_v2. At this point all ‘prod’ Yellow Taxi queries are still pointing to the ‘yellow’ copy providing the ability to perform data curation on yellow_v2 dataset offline without interrupting a user’s query.
Step 3: Create a new VDS Based on the “yellow_v2” Physical Dataset and apply the Column casting operations as shown below:
Save the above query as a virtual dataset and give it the name “YellowTaxiBaseV2”. Go and build the reflections on the new Virtual dataset so the end users won’t notice any degradation in performance when the switch over happens.
One thing to emphasize here is that the YellowTaxiBaseV2 is accessing the data directly in the S3 Yellow bucket and is not making a copy of the data which is one of several features which Dremio provides to keep the data as fresh as possible and avoid latency issues caused by data pipelines.
Once the YellowTaxiBaseV2 VDS is ready to deploy in production we are ready for the final step, the switchover.
Step 4: The new dataset is ready for the switch. The parent of TaxiYellowProd as we see below using the Dremio Lineage Graph is Yellow_Taxi_V1, which references the first S3 bucket prior to the refresh.
Now to repoint the TaxiYellowProd to the new S3 bucket with the full data refresh, we need to change the Original SQL from the V1 Virtual Datasource to the V2 and then save the updates, such that any subsequent queries will access “Taxi-yellow-v2” instead of the original “Yellow_Taxi_V1”. Once the OriginalSQL is updated to the V2 dataset the new Lineage for our TaxiYellowProd table will be as shown below.
So did the repointing work? To answer this question JMeter was setup for a 10 minute test. A little after 5 minutes, the TaxiYellowProd was re-pointed to the V2 Dataset (see blue arrow).
At the transition point, there were NO errors, hence we can say that the repointing V1 to V2 does NOT result in an outage.
In summary, following the approach outlined here will enable full refresh of a dataset without incurring an outage. Comparing this approach to other common techniques to support a full refresh, Dremio also is able to:
- Maintain a consistent query response
- Avoid duplicated storage anywhere except for the two S3 buckets. This storage impact is minimal, because as soon as the V2 S3 bucket is validated, there is nothing stopping you from deleting the original V1 to reduce the total storage. Virtual datasets as provided by Dremio means NO temporary storage costs nor transfer time lags caused by having to maintain temporary storage sets.