
Analyzing web server logs with Dremio, Apache Spark, and Kotlin

Introduction

Apache Spark is a powerful, fast, open source framework for big data processing. We are going to use it with Kotlin, a modern programming language that is growing rapidly in popularity and adoption. In this tutorial, we will show how to integrate Apache Spark, Kotlin, and Dremio for effective data preparation and analysis.

Assumptions

We assume that you have already installed Dremio; if you haven’t done so, visit the Dremio Docs to read about installation and deployment. We are going to work in standalone mode on a single node, which is perfectly adequate for exploration and learning. We also recommend getting acquainted with Getting Oriented to Dremio and the other tutorials.

In addition to Dremio, the following technologies are used in this tutorial: Apache Spark 2.1.0 (the spark-sql module), Kotlin on the JVM, Gradle, and the Dremio JDBC driver.

Data curation with Dremio

After you have launched Dremio, navigated to the Dremio UI at http://localhost:9047, and logged in, you can connect to data sources. In this post, we will analyze the NASA Kennedy Space Center web server logs, which contain information about HTTP requests to the server in Florida. The advantage of Dremio is that we can transform data the way we want without deep knowledge of SQL or programming. Moreover, heavily loaded server logs contain a lot of data, and without Dremio we would spend much more time waiting for data transformations. Also, thanks to the UI, you can see the result just a few seconds after each edit.

Let’s get started. First, go to http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html and take a look at the information there. We are going to work with the data from the two zip archives located in the Distribution paragraph, so download and unzip them to any folder on your computer. Later, we will use this folder as the container for our data sources. In fact, unzipping isn’t strictly necessary; as long as you know what the data looks like, you can point Dremio at the zipped files and apply formatting.

Click Add Source on the main page of the Dremio UI to upload the folder containing the log files. Web server logs are usually stored as plain text files on the computer where the web server is running, so we select the NAS (Network Attached Storage) data source to share the folder located on your computer.

Adding NAS source in Dremio

Note that instead of NAS you could also choose an HDFS or Amazon S3 data source here.

In the modal window, type a name for the data source and the path to the folder containing the logs. Pay attention to the path: on Windows, it should be absolute and prefixed with “file:///”, e.g. “file:///C:/some/folder”.

Configuring the NAS source in Dremio

Now the folder is connected, but the data files in it remain unstructured. To transform the raw data into a Dremio dataset, hover over the folder and click the icon at the far right to open the Dataset Settings menu. This way you can make a single table out of the separate files in the folder.

Applying formatting to the data

The format you need to choose is plain text with space-delimited fields, so select the “Text (delimited)” format and “ “ (a single space) as the field delimiter.

Configuring the delimiter for our data in Dremio

Now our goal is to transform the data into a form suitable for analysis with Apache Spark. First, let’s drop the “B”, “C” and “E” columns, which represent the “rfc931”, “authuser” and “time zone” fields respectively, as they contain only the same “-” or “-0400]” values and are not useful at this time. To drop a column, click the small arrow next to the column name and select Drop from the drop-down menu.

Dropping a column from the data source in Dremio

Remember: if you’ve done something wrong, you can always revert to a previous state using the small circles (breadcrumbs) on the right side of the page.

Now rename the columns by clicking on each column name; the sketch after the list below shows how a raw log line maps to these columns.

  • A → host
  • D → request_time
  • F → request
  • G → status
  • H → bytes
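
For illustration only, here is a small Kotlin sketch, not part of the original tutorial, that splits a sample log line (the line itself is just an illustrative example in the Common Log Format these logs use) to show which raw field ends up in which column. Note that Dremio’s delimited parser keeps the quoted request as a single field, which the naive split below approximates by re-joining:

fun main() {
    // An illustrative sample line in the format used by the NASA logs.
    val line = """in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839"""
    val fields = line.split(" ")

    println("A (host)         = ${fields[0]}")
    println("B (rfc931)       = ${fields[1]}")   // always "-", dropped
    println("C (authuser)     = ${fields[2]}")   // always "-", dropped
    println("D (request_time) = ${fields[3]}")   // "[01/Aug/1995:00:00:01"
    println("E (time zone)    = ${fields[4]}")   // "-0400]", dropped
    // Dremio keeps the quoted request as one field; here we re-join it manually.
    println("F (request)      = ${fields.subList(5, fields.size - 2).joinToString(" ")}")
    println("G (status)       = ${fields[fields.size - 2]}")
    println("H (bytes)        = ${fields.last()}")
}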

For the “request_time” column we want to remove the square bracket at the start of every value. To do this, highlight everything except the bracket and select “Extract” from the menu. Then toggle the Drop Source Field checkbox and click Apply.

Applying formatting for the date data type

Now, let’s change the data types. You can do this by clicking the Abc button next to the column name. For “request_time”, choose the Date & Time type and then use the following custom pattern: DD/MON/YYYY:HH24:MI:SS.

Applying formatting for the date data type
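
If you want to sanity-check what this pattern matches, here is a small Kotlin sketch (not part of the original tutorial; the timestamp value is illustrative). Dremio’s DD/MON/YYYY:HH24:MI:SS corresponds roughly to the java.time pattern dd/MMM/yyyy:HH:mm:ss:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

fun main() {
    // dd/MMM/yyyy:HH:mm:ss is the java.time equivalent of Dremio's DD/MON/YYYY:HH24:MI:SS
    val formatter = DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH)
    val ts = LocalDateTime.parse("01/Aug/1995:00:00:01", formatter)  // illustrative value
    println(ts)  // prints 1995-08-01T00:00:01
}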

Then convert “status” and “bytes” to Integer. When converting the “status” column, select Delete records for non-matching values. For the “bytes” column, choose Replace values with 0 for non-matching values, because many fields contain “-”, which means a zero-sized response.

Configuring how null values should be treated

After all of these transformations, the query in the SQL Editor has automatically been adjusted to reflect our operations, so it now looks like this:

SELECT host, TO_TIMESTAMP(request_time, 'DD/MON/YYYY:HH24:MI:SS', 0) AS request_time, request, status, bytes
FROM (
  SELECT A AS host,
  CASE WHEN length(substr(D, 2, 20)) > 0 THEN substr(D, 2, 20) ELSE NULL END AS request_time,
  F AS request,
  CONVERT_TO_INTEGER(G, 1, 0, 0) AS status,
  CONVERT_TO_INTEGER(H, 1, 0, 0) AS bytes
  FROM apache-logs.nasa
  WHERE is_convertible_data(G, 1, 'INTEGER')
) nested_0
 WHERE is_TIMESTAMP(request_time, 'DD/MON/YYYY:HH24:MI:SS')

Finally, select Save As to create a virtual dataset. Virtual datasets apply any changes you want to make to the data dynamically, without changing the original data or making a copy. It’s better to choose a space other than your home space, because datasets in the home space aren’t shared with other users.

Saving the virtual dataset

That’s it! Now we are ready to connect to Dremio from a standalone application written in Kotlin (on JVM) and perform a simple data analysis.

Data analysis using Apache Spark

Now, having a transformed data source, we are ready to perform data analysis on the following topics:

  • Content size statistics.
  • Status code analysis.
  • Hosts analysis.
  • Top error paths.
  • Unique hosts.
  • Unique daily hosts.
  • Average number of daily requests per host.

Setting up the environment

First, create a Gradle project with Kotlin support:

Creating a Gradle project including Kotlin

In build.gradle, add the following line to the dependencies section:

compile group: 'org.apache.spark', name: 'spark-sql_2.10', version: '2.1.0'

Then press Import Changes to download the Spark SQL module. You can download the full build.gradle file here: https://gist.github.com/vkhitev/081a25fb0bba071ad164b959e125cd70
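
For reference, a minimal build file might look roughly like the sketch below. It is written in the Gradle Kotlin DSL rather than the Groovy DSL used by the gist above, and the Kotlin plugin version is an assumption, so treat it as a starting point rather than a copy of the original:

// build.gradle.kts - a minimal sketch, not the original gist
plugins {
    kotlin("jvm") version "1.2.71"  // plugin version is an assumption
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jetbrains.kotlin:kotlin-stdlib:1.2.71")
    implementation("org.apache.spark:spark-sql_2.10:2.1.0")
    // The Dremio JDBC driver could alternatively be added here as a file dependency
    // instead of through the IDE (described next), e.g.:
    // implementation(files("C:/Program Files/Dremio/jars/jdbc-driver/dremio-jdbc-driver-1.2.1-201710030121530889-8e49316.jar"))
}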

After that, include the JDBC driver for Dremio. Open File -> Project Structure -> Modules -> Dependencies -> Add (JAR) and provide the path to the JDBC driver (e.g. C:\Program Files\Dremio\jars\jdbc-driver\dremio-jdbc-driver-1.2.1-201710030121530889-8e49316.jar). After adding the file, toggle its checkbox to include the module in your project.

Configuring your Kotlin project

Now, Windows users should download the Winutils files and place them in C:/winutil/, so that the full path is C:/winutil/bin/winutils.exe. Winutils is used by Spark and is necessary on Windows to interact with the file system.

In the src/main/resources folder of your project create a “log4j.properties” file with the content listed below to prevent unnecessary “info” messages from Spark:

# Root logger option
log4j.rootLogger=ERROR, stdout

# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout

We are ready to start writing the application code and analyzing the data.

Our first step is to configure the logger, set the winutils path, and load the JDBC driver class.

import org.apache.log4j.PropertyConfigurator

fun main(args: Array<String>) {
  // Configure log4j from the file we created above so Spark logs only errors
  PropertyConfigurator.configure("src/main/resources/log4j.properties")
  // Point Hadoop at winutils.exe (needed on Windows)
  System.setProperty("hadoop.home.dir", "c:/winutil/")
  // Load the Dremio JDBC driver class
  Class.forName("com.dremio.jdbc.Driver")
  ...

Next, we create a SparkSession instance. “local[*]” means Spark will use as many worker threads as there are processors available to the Java virtual machine.

val spark = SparkSession.builder()
  .appName("Dremio-Spark example")
  .master("local[*]")
  .getOrCreate()

Then we connect to Dremio using the JDBC driver. You should provide the path to the virtual dataset, a username, and a password.

val dfRaw = spark.read()
  .format("jdbc")
  .option("url", "jdbc:dremio:direct=127.0.0.1:31010")
  .option("dbtable", """"@admin"."logs"""")
  .option("user", "admin")
  .option("password", "admin123qwe")
  .load()

As the next step, we perform some transformations to make the data more suitable for further analysis:

  • Split the “request” column into three values and take the first as “verb” and the second as “resource”. The third value is almost always “HTTP/1.0”, so we will drop it.
  • Cast the “status” column to the short data type.
  • Keep the “request_time” column, aliased as “time”, so we can group requests by day later.
fun castDf(df: Dataset<Row>) =
  df.withColumn("_tmp", split(col("request"), " ")).select(
      col("host"),
      col("request_time").alias("time"),  // keep the timestamp for the daily analyses below
      col("_tmp").getItem(0).alias("verb"),
      col("_tmp").getItem(1).alias("resource"),
      col("status").cast("short"),
      col("bytes")
  ).drop("_tmp")!!

Note: all of the above transformations can, and in some cases should, be performed in Dremio (through the virtual dataset’s ability to morph) so that they propagate to every other process that needs them. For example, if several Spark instances access the data, this eliminates the need to repeat the same actions and duplicate code in several places. However, in some cases it may be better to do the transformations at the application level.

Let’s take a look at the schema of the dataset.

val df = castDf(dfRaw)
df.printSchema()

Showing the output of the schema

Analysis examples

After all of the previous preparation, we are ready to perform the data analysis. The examples below are brief queries taken from the original application listed here. The results of running the program are listed here.

Let’s look at basic statistics for the size of the content returned by the server. In the example below, you can see the number of entries in the table, the mean and standard deviation, and the min and max values (the max is ~3.4 MB). The standard deviation is large, so content sizes diverge strongly from the mean.

df.describe("bytes").show()

Showing the differences in size of the files

We continue by examining the occurrence of each status code in the table. You can see that most requests to the server were successful, and only 1662 requests resulted in a 403/404 error.

df.groupBy("status")
  .count()
  .sort("status")
  .show()

Showing the distribution of status types

The next step is to find the hosts that accessed the server most often. The obvious winner is piweba3y.prodigy.com.

df.groupBy("host")
  .count()
  .filter(col("count").gt(700))
  .sort(col("count").desc())
  .show()

Showing which hosts accessed the files most often

Let’s see which requests generated the most 404 errors, and also find out how many unique hosts appear in the entire log. People tried to access readme.txt and release.txt, but there were no such files.

df.filter(col("status").equalTo(404))
  .groupBy("resource")
  .count()
  .sort(desc("count"))
  .show(10)

Showing which resources return the most errors

Note that the SQL pushed down to Dremio in these dataframe expressions looks like this:

SELECT "request","status" FROM "NASA logs"."logs"
WHERE ("status" IS NOT NULL)

In other words, the aggregation is performed client side. In our example the data size is small, so performance is not impacted. For very large datasets, moving all the data over the network to the client could be impractical; in addition, the data would need to fit in memory for the aggregation to be performed. In most deployments the preferred approach is to perform resource-intensive operations where the most resources are present, which is typically the data serving tier. Developers should consider these trade-offs when designing their data processing.
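
If the dataset were too large to aggregate client side, one option, sketched below and not part of the original application, is to push the aggregation into Dremio by passing a SQL subquery as the JDBC dbtable option. The connection settings are reused from the earlier example, and the query itself is an assumption about how you might phrase it against the virtual dataset:

// A sketch: let Dremio compute the 404 counts and return only the aggregated rows.
val errorsByRequest = spark.read()
    .format("jdbc")
    .option("url", "jdbc:dremio:direct=127.0.0.1:31010")
    .option("dbtable",
        """(SELECT "request", COUNT(*) AS "cnt"
           FROM "@admin"."logs"
           WHERE "status" = 404
           GROUP BY "request") pushed""")
    .option("user", "admin")
    .option("password", "admin123qwe")
    .load()

errorsByRequest.sort(col("cnt").desc()).show(10)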

And now, let’s count the unique hosts in the entire log, and then the number of unique hosts per day. As we can see, the most popular day was June the 1st, and the least popular was June the 3rd.

df.select("host")
  .distinct()
  .count()

Showing the number of distinct hosts

val days = df.select(
  col("host"),
  month(col("time")).alias("month"),
  dayofmonth(col("time")).alias("day")
).distinct().cache()

days.groupBy("month", "day")
  .count()
  .sort(desc("count"))
  .show()

Showing the number of unique hosts by day of the month

Finally, we examine the average number of daily requests per host. Here we reuse the DataFrame from the previous code block: we count the total number of requests per day and divide it by the number of unique hosts per day. As we can see, all of the daily averages are fairly similar.

val dailyHosts = days.groupBy("month", "day").count()

val totalReqPerDay = df.select(
  month(col("time")).alias("month"),
  dayofmonth(col("time")).alias("day")
).groupBy("month", "day").count()

val grouping = JavaConversions.asScalaBuffer(listOf("month", "day")).toSeq()
totalReqPerDay.join(dailyHosts, grouping)
  .select(
    col("month"),
    col("day"),
    totalReqPerDay.col("count").divide(dailyHosts.col("count"))
      .alias("avg")
  ).sort(col("avg").desc()).show()

Showing the average number of daily requests per host

Conclusion

In this article, we performed data analysis on the NASA Kennedy Space Center web server logs. Spark is designed for much larger datasets, but to keep this tutorial clear and easy to follow, we chose a smaller one. We used Dremio as a convenient intermediate link between the raw data and the standalone application; it played an invaluable role in managing the data source: parsing, casting, uniting, and joining data while letting us see the result of our work almost immediately. To gain an even better understanding of the data’s structure and relationships, we used Apache Spark, one of the most popular data processing solutions.