Dremio Authors: Insights and Perspectives

Dremio Team

Articles and Resources from the Dremio Team

Guides

What Is Data Lineage?

Data Lineage Definition: Data lineage refers to the data’s “line of descent.” In other words, it’s a record of how data got to a specific location and the intermediate steps and transformations that took place as it traveled through business systems. For organizations that depend on data, understanding where data comes from, evaluating its quality, […]

Guides

Data Virtualization vs. Data Lakes

To make data available to data consumers like analysts for analytics and reporting, businesses need to aggregate data sources. Data virtualization and data lakes are popular approaches to breaking down data silos and providing centralized data access. Your approach can significantly impact scalability, cost, and performance, so it’s important to understand the differences.

Guides

What Is a Data Pipeline?

A data pipeline moves data between systems. Data pipelines involve a series of data processing steps to move data from source to target. These steps may involve copying data, moving it from an on-premises system to the cloud, standardizing it, joining it with other data sources, and more. 

Guides

Intro: Data Lake vs Warehouse by Dremio

If your organization depends on data, you need a place to store it. Not only that — you need the right kind of data storage and management solution for the data you use and produce. Most organizations find that a data warehouse or data lake meets their needs. Many even use both. Data lakes and […]

Guides

What Is ETL & Types of ETL Tools

If you’ve ever discussed data warehousing, you’ve probably heard the term “ETL.” It refers to processes that allow businesses to access data, modify it, and store it. Organizations use ETL for a variety of reasons, including the efficient management of data and the ability to run business intelligence (BI) against their data. There are several […]

Blog Post

Dremio’s $135M Series D

This week we announced a $135M Series D at a billion-dollar valuation, making Dremio one of the top-funded companies in our space. Chief Product Officer Tomer Shiran highlights our vision in this blog post.

Blog Post

Collecting App Metrics in your cloud data lake with Kafka

In this article, we demonstrate how Kafka can be used to collect metrics from a web application on data lake storage like Amazon S3.

AWS

Introducing Elastic Engines

In this article, we walk you through the steps to provision and manage Elastic Engines, and we show you how to manage workloads using queues and rules. Step 2. You will see a default engine already deployed. Now click on Add New. Step 3. The Set Up Engine popup […]

AWS

Introducing Parallel Projects

Parallel projects are multi-tenant instances of Dremio where you get a service-like cluster experience with end-to-end lifecycle automation across deployment, configuration with best practices, and upgrades, all running in your own AWS account. Every time that you launch a new project, it comes with all the best practices already set up for you. In this […]

Python

Data Science on the Data Lake using Dremio, NLTK and Spacy

Enterprises often need to work with data stored in different places, and because of the variety of data being produced and stored, it is almost impossible to use SQL to query all these data sources. These two things represent a great challenge for the data science and BI community. Prior to working on the […]

AWS

Using R to perform data science operations on AWS

Amazon Web Services (AWS) is a cloud services platform with extensive functionality. AWS provides solutions for databases, storage, data management and analytics, computing, security, AI, and more. Among the databases and storage services offered are Amazon Redshift and Amazon S3. Amazon Redshift is one of the leading data warehouses. It is designed […]

ADLS

Multi-Source Time Series Data Prediction with Python

Modern businesses generate, store, and use huge amounts of data. Often, the data is stored in different data sources. Moreover, many data users are comfortable interacting with data using SQL, while many data sources don’t support SQL. For example, you may have data inside a data lake or NoSQL database like MongoDB, or even […]

ADLS

Forecasting air quality with Dremio, Python and Kafka

Forecasting air quality is a worthwhile investment on many different levels, not only for individuals but also for communities in general. Having an idea of what the air quality will be at a certain point in time allows people to plan ahead and, as a result, decreases the effects on health and costs associated with […]

Tableau

Lightning Fast Analytics with Tableau Online and Dremio

Tableau Bridge is a way to connect your Tableau instance to your data. Connecting to online data sources using Tableau Online is easy: you can connect to both live and extracted data depending on your environment. But what if your data sources are constantly changing? You wouldn’t want to have to re-publish your workbooks every […]

Kubernetes

Easily Deploy Dremio on MicroK8s

One of the many advantages of Dremio is its deployment flexibility. You can deploy Dremio on any of your favorite cloud flavors, and also on-premises using different methods such as YARN, Docker, and Kubernetes. In this article I will walk through the steps of evaluating Dremio by deploying it through Kubernetes using MicroK8s on […]

AWS

Analyzing Multiple Stream Data Sources using Dremio and Python

New technologies, communication systems, and information processing algorithms place growing demands on data rates, availability, and performance. Accordingly, processing this data (messages) calls for technologies capable of handling this high demand. One of these technologies is RabbitMQ, which is used to develop service-oriented architecture (SOA) services and distributed resource-intensive operations. However, it […]

Amazon

Cluster Analysis The Cloud Data Lake with Dremio and Python

Today’s world is filled with a myriad of devices, gadgets, and systems equipped with GPS modules. The main function of these modules is to locate moving objects and record their positions to a file called a GPS track. The services for accounting and processing such files, which are generally called […]

Amazon

Machine Learning Models on S3 and Redshift with Python

An important requirement for businesses large and small is proper resource management. Classical solutions for such tasks include various optimization and control methods. In the last few years, however, approaches have appeared that use mathematical tools, statistics, and probability theory. They allow solving the optimization problems by detecting dependencies in […]

Python

How to Analyze Student Performance with Dremio and Python

Data analysis and data visualization are essential components of data science. In fact, before the machine learning era, data science was all about interpreting and visualizing data with different tools and drawing conclusions about the nature of the data. These tasks are still present today. They have just become one of many miscellaneous data science jobs. […]

AWS

Anomaly detection on cloud data with Dremio and Python

In datasets, some records very often do not match the rest of the data, whether by error or by nature. These kinds of records are useless and even harmful to ML models. In other problems, detecting anomalies is the sole purpose, for example in hospital health-monitoring systems or credit fraud detection. Either way, […]

AWS

Querying Cloud Data Lakes Using Dremio and Python Seaborn

In the last few years, more and more companies have realized the value of data. Therefore, the popularity of data analytics has been growing rapidly. In general, data analysis can be performed in several ways, which are classified into subtypes depending on the analysis task: descriptive, exploratory, inferential, predictive, causal, and mechanistic. Each of these […]

Python

Data Lake Machine Learning Models with Python and Dremio

Amazon Simple Storage Service (S3) is an object storage service that offers high availability and reliability, easy scaling, security, and performance. Many companies all around the world use Amazon S3 to store and protect their data. PostgreSQL is an open-source object-relational database system. In addition to many useful features, PostgreSQL is highly extensible, and this […]

Python

Gensim Topic Modeling with Python, Dremio and S3

Topic modeling is one of the most widespread tasks in natural language processing (NLP) and a vivid example of unsupervised learning. The main goal of this task is the following: a machine learning model should be trained on a corpus of texts with no predefined labels. In other words, we don’t have […]

ARP

How to Create an ARP Connector

The storage plugin configuration file tells Dremio what the name of the plugin should be; what connection options should be displayed in the source UI, such as host address and user credentials; what the name of the ARP file is; which JDBC driver to use; and how to make […]

Python

Visualizing Amazon SQS and S3 using Python and Dremio

Nowadays, analyzing different kinds of data is an important stage of business and technical research and development. Often the data is received as a series of messages (queues). This is typical for data loggers and recorders, IoT developments, live-tracking systems, communication and navigation systems, etc. After that, the following information is sent to […]

Python

Using Dremio and Python Dash to Visualize Data from Amazon S3

Data in its natural form is not that valuable if you cannot visualize it. There are lots of visualization libraries available in the community, which may make it difficult to select one. In this tutorial, we hope to make your selection process a little bit easier by showing you how to work with Dash. Dash […]
