12 minute read · September 2, 2024
Evaluating Dremio: Deploying a Single-Node Instance on a VM
· Senior Tech Evangelist, Dremio
If you are reading this, you are probably looking at Dremio as a potential solution to many different problems:
- To create a central access point across your databases, data lakes, and data warehouses to avoid "data silos"
- To unify data you have in the cloud and on-prem for a hybrid data lake or data lakehouse
- To see faster data access on your existing data lake
- To reduce your storage and infrastructure costs (Reduce Total Cost of Ownership)
- To make the data on your data lake and other sources more accessible for business users
- To create a central point of governance of your datasets across many sources
- To read and write your Apache Iceberg Lakehouse using Dremio's Enterprise Catalog or other Iceberg Catalogs
- To build a semantic layer of all your business metrics in one place so all organizational departments can have consistent data
- To improve BI Dashboard Performance
- To automate the management of your Apache Iceberg Lakehouse
- To run SQL on NoSQL sources like MongoDB and Elasticsearch
- To join Delta Lake tables with data in other systems
- To build data analytics apps on top of all your data at all locations
If any of the above would add value to your organization, then it's worth evaluating Dremio as a solution, and this guide will help you learn how to assess it.
Individual Assessment on your Laptop with Docker
At this level, it's just about getting hands-on with Dremio and running a few queries to better understand the Dremio workflow and features. You should see impressive performance when you connect to many of your datasets, but keep in mind that running Dremio on your laptop is limited by your laptop's specifications and internet connection.
With Docker installed, you can have Dremio running in moments by running the following command in your terminal/command line:
```shell
docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 -p 32010:32010 \
  -e DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist \
  --name try-dremio dremio/dremio-oss
```
A few moments after running this command, you'll find a local version of Dremio running at http://localhost:9047. With this local Dremio, you can connect your existing data sources (databases, data lakes, and data warehouses) and query data in each location. When you're done with this demo environment, you can shut it down with:
```shell
# turn off environment
docker stop try-dremio

# turn it back on
docker start try-dremio
```
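If the UI doesn't appear right away, you can check on the container from the same terminal. A quick sketch, guarded so it is harmless on a machine without Docker (the container name `try-dremio` comes from the command above):

```shell
# Show whether the try-dremio container is running
# and peek at its most recent startup logs
if command -v docker >/dev/null 2>&1; then
  docker ps --filter name=try-dremio || true
  docker logs --tail 20 try-dremio || true
else
  echo "docker is not installed on this machine"
fi
```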
If you want a more guided experience, we have many tutorials that simulate common workflows. For the more complex exercises, it is recommended that you allocate at least 6 GB of RAM to Docker.
- Intro to Dremio, Iceberg and Nessie
- From MongoDB to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From Postgres to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From SQLServer to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From Apache Druid to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From MySQL to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From Kafka Connect to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From JSON/CSV to Iceberg/Dremio/Nessie to Apache Superset Dashboard
- From Postgres/MongoDB to Iceberg/Dremio/Nessie to Apache Superset Dashboard using dbt & git-for-data
These exercises will give you a good feel for what is possible with Dremio across different sources. That said, one of Dremio's best features is not just accessing multiple sources yourself but collaborating with others on all your data in one place; for that, we'll need to deploy Dremio online.
Testing Out Collaboration with a Single-Node Deployment
You can deploy a single-node version of Dremio using a virtual machine from your favorite cloud compute provider (I tested the steps below using an AWS t2.medium instance). This deployment is not meant for production and is limited by the power of the compute you use, but it allows you to test the collaboration features by creating user accounts for colleagues and letting them access the data you've connected to this deployment.
Provision a compute instance from your favorite provider (this guide assumes an Ubuntu-based VM) and SSH into the shell of that instance.
Save the following scripts to .sh files on your instance using nano or vim:
setup1.sh
```shell
# Update the package list and install UFW
sudo apt-get update -y
sudo apt-get install -y ufw

# Set default firewall policies to deny incoming connections
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH and Dremio traffic through the firewall
sudo ufw allow OpenSSH
sudo ufw allow 9047/tcp   # Dremio Web UI
sudo ufw allow 31010/tcp  # Dremio ODBC/JDBC client connections
sudo ufw allow 32010/tcp  # Dremio Arrow Flight client connections
sudo ufw allow 45678/tcp  # Dremio internal process communication
sudo ufw enable

# Install Docker
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update -y
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Add the current user to the docker group to run Docker without sudo
sudo usermod -aG docker $USER
```
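After setup1.sh finishes, you can sanity-check the result before moving on. A minimal sketch; each check degrades gracefully so the snippet also runs on machines without UFW or Docker:

```shell
# Firewall status (-n avoids a sudo password prompt)
command -v ufw >/dev/null 2>&1 && sudo -n ufw status 2>/dev/null || echo "skipping ufw check"

# Confirm Docker installed correctly
command -v docker >/dev/null 2>&1 && docker --version || echo "skipping docker check"

# List group memberships (the docker group only applies after you log out and back in)
id -nG
```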
setup2.sh
```shell
# Pull the Dremio Docker image
docker pull dremio/dremio-oss

# Run Dremio in Docker
docker run -d --name dremio \
  -e DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist \
  -p 9047:9047 \
  -p 31010:31010 \
  -p 45678:45678 \
  -p 32010:32010 \
  dremio/dremio-oss

# Notify that setup is complete
echo "Setup complete. Dremio is running as $USER."
```
nginx.sh (optional)
```shell
# Ensure the DOMAIN environment variable is set
if [ -z "$DOMAIN" ]; then
  echo "Error: DOMAIN environment variable is not set."
  exit 1
fi

# Ensure the EMAIL environment variable is set
if [ -z "$EMAIL" ]; then
  echo "Error: EMAIL environment variable is not set."
  exit 1
fi

# Install Nginx
sudo apt-get update -y
sudo apt-get install -y nginx

# Allow the Nginx Full profile through the UFW firewall
sudo ufw allow 'Nginx Full'

# Install Certbot and the Nginx plugin
sudo apt-get install -y certbot python3-certbot-nginx

# Create an initial Nginx server block configuration
sudo tee /etc/nginx/sites-available/dremio <<EOF
server {
    listen 80;
    server_name $DOMAIN;

    location / {
        proxy_pass http://localhost:9047;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto \$scheme;
    }
}
EOF

# Enable the Nginx site configuration
sudo ln -s /etc/nginx/sites-available/dremio /etc/nginx/sites-enabled/

# Test the Nginx configuration
sudo nginx -t

# Reload Nginx to apply the configuration
sudo systemctl reload nginx

# Obtain an SSL certificate using Certbot
sudo certbot --nginx --non-interactive --agree-tos --email $EMAIL -d $DOMAIN

# Verify auto-renewal for the certificate
sudo certbot renew --dry-run

echo "Nginx reverse proxy with SSL setup complete. You can access Dremio via https://$DOMAIN"
```
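As a usage sketch for nginx.sh, you would export the two variables and then run the script in the same shell. The domain and email below are placeholders; substitute your own values:

```shell
# Placeholder values: substitute your own domain and a real contact email
export DOMAIN=dremio.example.com
export EMAIL=admin@example.com

# Run the script in the current shell so it inherits the variables
# (guarded so this snippet is a no-op when nginx.sh is not present)
[ -f nginx.sh ] && source nginx.sh || true
```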
Then follow these steps:
- Run setup1.sh with the command "source setup1.sh"
- Terminate the SSH connection
- Reconnect with a new SSH session (so your docker group membership takes effect)
- Run setup2.sh with the command "source setup2.sh"
- In a few minutes, Dremio should be available at http://IPADDRESS:9047
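To confirm the deployment from the VM itself, you can poll the web UI until it responds. A minimal sketch, assuming curl is installed; `wait_for_url` is a hypothetical helper, and the URL and retry values are illustrative:

```shell
# Poll a URL until it responds, or give up after the given retries
wait_for_url() {
  local url="$1" retries="${2:-60}" delay="${3:-5}"
  local i
  for i in $(seq 1 "$retries"); do
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "up"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out"
  return 1
}

# Example: wait up to five minutes for the Dremio UI on the default port
# wait_for_url http://localhost:9047 60 5
```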
At this point, you can already start working with Dremio using the compute instance's IP address. If you want to use a domain name with an SSL certificate, run the nginx.sh script after defining the DOMAIN and EMAIL environment variables, assuming your domain's DNS settings point to your instance. Other things to keep in mind:
- Confirm that your cloud provider's security rules allow inbound traffic to the ports Dremio uses
- Confirm that you have the credentials to SSH into the instance
- If using a domain, confirm that the domain's DNS settings have propagated
Doing a POC with Dremio Cloud-Managed or Self-Managed
At this point, you've been able to experience the value of Dremio along with your colleagues, and it's time to take the next step: a production POC with your organization's data, to see Dremio directly solving your organization's challenges. You can deploy a cloud-managed version of Dremio on AWS or Azure in moments, or use Kubernetes to deploy a self-managed version in any environment, in the cloud or on-prem. To get this process started, follow either of these two links:
- Start setting up Dremio now by following our Get Started Page
- Setup an Architectural Workshop with Dremio for a more Guided Process
Conclusion
I hope this guide has helped you on your journey of exploring whether Dremio is the right solution for eliminating data silos, reducing costs, and improving your organization's data outcomes overall.