8 minute read · February 3, 2020
Accelerate Relational Databases with the Data Lake Engine
I am certain you know the problem. One or more of your databases popular for reporting and or analytics is overloaded, and new projects continue to pour in. Unfortunately, either it is impossible to add additional capacity in the system due to infrastructure constraints or there is insufficient budget to support purchase of additional instances/servers. The challenge is how to continue supporting existing projects and begin work on new projects?
Dremio, the data lake engine, provides the tools to extract more value from your existing overloaded and slow relational database. Whether your database is Oracle, PostgreSQL, Teradata or Microsoft SQL Server, Dremio provides the tools to accelerate your existing queries and reduce the load on your existing servers without requiring complex and expensive data pipelines nor database migration efforts.
In this blog we will go through an example offloading Oracle TPC-DS sample queries from an existing Oracle overloaded database to both improve query response times and number of concurrent queries.
THE CURRENT SITUATION
For this blog we are going to run through a sequence of TPC-DS queries in random order. The dataset is set with tables up to 30M rows on Oracle Database 12c. The objective is to run at least 5 concurrent users executing random TPC-DS queries on this dataset and free up space on the Oracle server. Unfortunately, in testing the Oracle database without the assistance of Dremio, the highest concurrency level achievable was only 2 concurrent users in the available infrastructure before Oracle starts to report errors and consume all 4 cores available. In this test, of Oracle we even saw the foreground processes representing over 80% of the total. The thought that more projects could get added to the existing resources is completely unthinkable.
REGISTER ORACLE AS A DREMIO DATASOURCE
Our first step is to register the Oracle datasource. As you will see here registering a new datasource to Dremio is a very quick process. But before we begin, you will first want to gather the following pieces of information
Configuration Item | Description |
Host | The hostname or IP address where your RDBMS server exists |
Port | The port used for communicating with the RDBMS |
Service name | The SID to register into Dremio |
Username & Password | The access credentials granted privilege to the tables to be queries. It is ideal to have this id as a service account |
One thing to keep in mind is that by changing the Host, Service Name and/or the authentication credentials you have the option to actually have multiple database instances available to Dremio.
Once you have this information, then click on the ‘+’ next to the “Sources” label on the Dremio home page. Then select “Oracle” as a New Source Type to setup.
Now take the information you gathered about your Oracle instance (or whichever RDBMS you are offloading) and enter that information into the General options section of the edit Source screen.
The last step is to set the “record fetch size”. The larger the “Record fetch size” the more efficient the record reads from Oracle to Dremio. But be careful as the setting increases, the performance benefits can reduce. Work with your Oracle system administrator with help to assign this value to the highest possible value. One thing to keep in mind when determining the value, as you offload queries from your RDBMS more resources in the source Host will get freed up. Bottom line don’t get too stingy with this value, the larger the value the faster Dremio will read source data and the more successful your RDBMS offload project is likely to become.
In the example below found that setting the value to 5000 was optimal on the 4 core 16 GB Oracle server used for this test.
Once all the Source properties are entered, click “Save” at the bottom right side of the screen and you now have immediate access to begin querying the Oracle tables.
This ease of access introduces another problem; and that is security. Without Dremio, RDBMS source system security could be physically controlled or at least managed through logins. Fortunately, Dremio respects the same security mechanisms already in place. In addition, Dremio introduces a thin “Sharing” security layer so if you are JOINing an Oracle table with a S3 file, the enterprise can be assured that consistent access controls are applied across the different data sources.
The Dremio “Sharing” capability can further restrict data visibility, by specifying which user or LDAP/AD membership groups can see which tables in Dremio. For example, in the view below, we can see all the tables which the Dremio administrator has access:
However, by enabling “sharing” for our “Data Analyst” user “da” to have view-only permissions on the Oracle “call_center_analytics” related tables. In the screen below we see that 2 users “Mark” and “da” have access to the Call_Center_Analytics space.
Now when our “da” user logs in they can only view data for the tables in the “Call_Center_Analytics” workspace.
The enterprises governance team further can assure that the “da” user is only accessing the CallCenter table by leveraging the “Jobs” page. From the “Jobs” page we have the ability to search for all queries performed by a user, ensuring that the enterprise access guidelines are being followed even if those rules are not consistently implemented in all underlying data sources.
As we see below, our “da” user can access the CallCenter table.
But, our “da” user CANNOT access any other of the other tables and based on either end user id or LDAP group membership will get a permissions denied error if they attempt access to other tables.
QUERY ACCELERATION
Dremio without any effort leverages Apache Arrow and other techniques to begin accelerating your queries and thus taking steps to manually accelerate can be unnecessary. But since, our objective here is to remove load from the Oracle Database server, a RAW reflection is recommended.
A RAW reflection is effectively like an Index and a Materialized view combined and on steroids. Within the RAW Reflection contains some or all of the data necessary to support queries without burdening the source system and as was mentioned at the start of the blog, a set of tools to simplify without a data pipeline, the data refresh process.
Creating a RAW Reflection is quite simple. As an “Admin” user (our “da” user does not have this permission), select the “settings” option as shown below.
Then, click “Reflections” followed by “Raw Reflections” from the menu to begin defining our reflection. We want all of the dataset columns accelerated, so all columns are selected for “Display” as shown below.
Click “Save” and Dremio will begin creating the reflection which depending on the dataset can take seconds to minutes.
CONCLUSION
By following all the steps in this first blog in the RDBMS offload series, we have integrated Dremio with your RDBMS, established a security layer to control and monitor access to datasets, and transferred to the RDBMS to our cloud storage with automatic refresh steps. All this has been done without having to create any special data pipelines or additional administrative overhead.
In the next blog of this series, we will explore the performance benefits resulting from following these steps. But as a teaser to the next blog, before Dremio, to run a sampling of TPC-DS queries with the 1 GB dataset scale and 2 concurrent users on a small Oracle server with 4 cores and 16 GB of RAM, the tests consumed all 4 cores as we see in the diagram below. However, using Dremio as a Data Lake engine offloading server load from our overloaded Oracle server using the steps described above, the server load dropped to negligible levels.
So what does all of this mean? Simply put Dremio can extend the life of your existing RDBMS systems and begin easing the migration to your cloud of choice.