Serverless Data Analytics Using AWS Glue and Athena

In this short tutorial, I’m explaining how to perform SQL queries on big datasets without using any servers at all. We are going to play with Reddit comments dataset and find out which subreddit is the most popular and how many grateful comments are there on Reddit.

AWS Athena is a serverless service provided by AWS that allows you executing SQL queries on your data right where it sits, without a need of moving to some cluster/DB engine. It’s built on a top of the Presto(a distributed SQL engine built by Facebook).

AWS Glue is a serverless ETL tool provided by AWS. It’s built on a top of Hadoop and related tools. You can consider both services as a serverless version of AWS EMR.

Warning

This tutorial includes a relatively significant amount of data transfers and the use of paid AWS services. It’s your responsibility to calculate costs needed to run steps of this tutorial and control the usage of services. Some of the services/resources used in the tutorial are not eligible for the Free Tier usage. You have to use a different(smaller) dataset and smaller instances to reduce costs needed for executing steps from this tutorial.

Preparation

Let’s assume you need to quickly find out something about your data. You want to run 1-2 queries, get some insight and never return to your dataset again. The solution is rather simple in this case.

First, we need to create a database in the AWS Glue. To do so, open the AWS Glue in the AWS Management Console and click the “Add database” button:

Dataset Overview

The size of the dataset is around 300GB compressed.

The dataset itself is grouped by years:

Graph Example

Inside each folder you have 12 files in the .bz2 archive format for each month:

Graph Example

Each month is represented by a compressed file, that contains a newline-separated list of JSON objects, one object per comment:

Graph Example

Each comment is a JSON object with various fields:

Graph Example

Lazy Approach

Let’s assume you need to quickly find out something about your data. You just want to run 1-2 queries, get some insight and never return back to your dataset again. The solution is rather simple in this case.

First, we need to create a database in the AWS Glue. To do so, open the AWS Glue in the AWS Management Console and click the “Add database” button:

Graph Example

We’ll call the database “comments_db”:

Graph Example

Now we need to specify the data, stored in this database. We could do this manually, by adding tables to the databases. Instead, to save some time, we are going to discover our data scheme using an AWS Glue Crawler.

What’s an AWS Glue Crawler?

AWS Glue Crawler is a data processing tool, that automatically discovers the data, its format(JSON, CSV, Parquet, AVRO, compressed JSON/CSV, etc) and the scheme (column names, number of columns, column types).

To discover the schema of our dataset automatically, let’s create a new crawler using the “Add crawler” button in the “Crawlers” section of the AWS Glue:

Graph Example

Specify the “comments_crawler” as name and select the “Create a single schema for each S3 path” option under the “Grouping behavior for S3 data (optional)” section of the dialog:

Graph Example

After setting the name and grouping behavior, click “Next” and provide a name of the bucket and a path to the dataset:

Graph Example

Click “Next,” don’t add other data stores and proceed to the next step again. An IAM role of the crawler should be specified, and the simplest option is to create a new one. Add some suffix to the role name and click “Next”:

Graph Example

On the schedule page, select the “Run on demand” option, since we want to crawl the data only once and click “Next”. Specify the “comments_db” on the Crawler’s output page configuration. Also, be sure to select the “Update all new and existing partitions with metadata from the table” option:

Graph Example

It’s needed because over time the structure of the comment object changed (fields were added and removed), and we need to have a common schema that’s able to hold both old and new comment format. Without this option, we’ll end up with the “HIVE_PARTITION_SCHEMA_MISMATCH” error.

Click “Next” again, review all the information entered during previous steps and if everything looks good, create the crawler. It should appear in the list of crawlers:

Graph Example

Now it’s time to run the crawler. To do it, select the crawler in the list and click the “Run crawler” button:

Graph Example

Schema discovery takes some time. However, unfortunately, in the case of this particular dataset, the discovery finishes with an error. For some reason, the “edited” field gets the “double” type instead of the “boolean,” and it results in a “HIVE_BAD_DATA” error when performing a query:

Graph Example

To resolve this, the newly created table schema should be updated, and the “edited” column type should be changed to the “boolean”:

Graph Example
Graph Example

Don’t forget to click “Save” after editing. After the schema is updated, the crawler should be run one more time. It will propagate table schema changes to all partitions and will fix the issue. Now it’s time to switch to AWS Athena!

Running the first query in AWS Athena

To ensure the scheme is correct and we are able to run queries, let’s use this simple test:

SELECT * FROM "comments_db"."comments" limit 10;

Such query execution takes around 10 minutes, and the answer is(dataset includes all comments until March 2017): 103772273

Isn’t it impressive? Now let’s figure out which subreddit contains the biggest number of comments:

SELECT subreddit, COUNT(*) AS "Number of Comments" FROM comments GROUP BY subreddit ORDER BY "Number of Comments" DESC

It will take some time, so let’s discover other parts of the Athena Query Editor. On the “History” tab of the editor it’s possible both to check previous queries and also see the execution time, amount of scanned data and the current status of active queries:

Graph Example

Active queries can be canceled by using the corresponding link in the “Actions” section of the History table. And finally the result:

Graph Example

Summary

AWS Glue and AWS Athena are compelling services that can help you with analyzing significant amounts of data in its original format(if services support it). Using these tools directly from the AWS Management Console doesn’t require the knowledge of underlying technologies, but is beneficial.

In the next tutorial, I’m going to show you how to transform data in one of the columnar formats supported by AWS Glue and Athena to increase the performance and reduce the amount of data that should be scanned to execute the query.

In a case if you are interested in implementing data analytics for your company, please feel free to drop me an email to volodymyr@agilevision.io and I’ll gladly answer all your questions!

Comments

©2016-2018 — AgileVision sp. z o.o. All rights reserved.