Jumping into Azure Data Lake

Introduction

You may have heard of Data Lakes and are wondering if it is the next big thing. This post will provide an overview of Azure Data Lake, how to use it and some of its pros and cons.

History of Data Lakes and Azure Data Lake

The Data Lake term was coined by Pentaho in October 2010 to address limitations of a data mart, which typically has a subset of attributes and is aggregated. Per Jamie Dixon’b Blog, the lake terminology was chosen since “the contents of the data lake stream in from a source to fill a lake, and various users of the lake can come to examine, dive in, or take samples.”

Microsoft’s Azure Data Lake became generally available in November 2016, as described in this post: https://azure.microsoft.com/en-us/blog/the-intelligent-data-lake/. There are really three services that make up the data lake as shown in the diagram below.

The underlying technology for the data lake is based on Hadoop, but as an end user that is transparent. You will create objects in the Data Lake Store and use Data Lake Analytics to process the data.

Getting Started

To get started with Azure Data Lake, you will provision a new “Data Lake Analytics” account from the Azure portal. During that process you can also provision the Data Lake Store, or reference an existing Data Lake Store. Aside from the name and location (Azure region), the only other choice you have is the data encryption. By default, encryption will be enabled and use keys managed by the Data Lake Store. If you would prefer no encryption or to use keys from your own vault, you will need to create the Data Lake Store first.

As of April 2017, data lake store pricing for monthly plans ranges from 1 TB for $35 to 1 PB for $24,880. Pay as you go pricing if $0.04 per GB per month. The Data Lake Analytics service is priced based on Analytics Units (AU) hours, with monthly commitments of 100 AU for $100 to 100,000 AU for $52,000 or $2 per AU hour without a monthly commitment. See https://azure.microsoft.com/en-us/pricing/details/data-lake-store/ and https://azure.microsoft.com/en-us/pricing/details/data-lake-store/ for details.

Working with Azure Data Lake

If you are used to working with relational databases or data warehouses, you will find Azure Data Lake has a couple of key differences:

- The schema is defined when reading the data

- The processing is batch oriented

A data lake can store any type of data, regardless of it structure. Only when accessing the data will you specify its structure, such as the columns it contains, delimiter and any header rows to ignore.

Instead of just executing a query and getting results back immediately as you typically would with a relational database, with Azure Data Lake, you submit an analytical job. The jpb will go through 4 states: Preparing, Queued, Running, and Finalizing. Even simple tasks against small data sets will take at least 30 seconds in my experience. Also, the syntax is not immediately checked, so even identifying a typo will take 20 or 30 seconds for the job to be queued.

U-SQL

Azure Data Lake uses the U-SQL language. If you imagine SQL and C# having a baby, it would be U-SQL. It has most of the structural elements of SQL but uses the power of C# for more advanced features. For example, you use C# functions for string manipulation and “==” to check for equality.

One thing to be aware of is that it has also inherited the sensitive side of C#. Unlike SQL, U-SQL is case-sensitive. Also, the “Order By” clause has moved.

For an example illustrating these points, see below.

USE DATABASE retail;

@qty = SELECT
orders.customername,
SUM(orders.quantity) AS qty
FROM sales.stock
INNER JOIN sales.orders ON orders.stockitemid == stock.id
WHERE stock.item.ToUpper() LIKE "%CHOCOLATE%"
GROUP BY orders.customername;

OUTPUT @qty
TO "/retail/chocolatecust.csv"
ORDER BY qty DESC
USING Outputters.Csv();

Optimizing

There are just a few choices you can make to optimize performance with Azure Data Lake:

- Table indexing, distributing and partitioning

- U-SQL structure

- # of AUs

A data lake table can have just one clustered index, but the index can consist of more than one column. It basically determines how the data will be physically stored in the clustered table. (See https://msdn.microsoft.com/en-us/library/mt706196.aspx.) There are not any b-tree or column store indexes.

Every table also needs to have a distribution scheme. The most common schemes are hash and round robin. Tables can also optionally be partitioned which is a coarser level grain than the distribution. For example, partition by orderdate and distribute based on the hash of the orderid.

U-SQL is a powerful language, and will provide lots of options in how you process your data. If you have experience with SQL tuning, many of the same guidance will apply. For example, avoid using SELECT *, and eliminate partitions if possible. See https://www.slideshare.net/MichaelRys/usql-query-execution-and-performance-tuning for more information.

The final option that impacts performance is the Azure Data Lake Analytics Unit or AU for short. When you run a job, you assign it a number of AUs. Each AU is currently equivalent to 2 CPU cores and 6 GB RAM. There are two things to be aware of regarding AU’s. You pay for the AU for the duration of the job, regardless of whether it is actually used. Second, adding more AU’s will not necessarily improve the job’s speed. See https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/ for more details.

Visual Studio can provide graph of a job’s execution so you can see how many AU’s were used and adjust if needed for future runs. An example is shown below for a job run with an allocation of 4 AUs. The graph shows that all 4 AUs were actually used.

Conclusion

The Data Lake is a newer concept, and some may think of it as the answer to their data challenges. However, I’d suggest it is better to think of it as one piece of the puzzle. It is a great place to land and ingest high volumes of data. It provides the “Big Data” power of Hadoop without the complexity of spinning of clusters and doing lots of programming.

However, for high value data, a data warehouse still has benefits. The additional indexing capabilities and real-time responsiveness that a data warehouse provides enable it to more responsively serve large numbers of users. The data warehouse also provides compatibility with the widest variety of tools for ad hoc reporting.

References

Hands-on with Azure Data Lake: How to get productive fast
http://www.zdnet.com/pictures/hands-on-with-microsofts-azure-data-lake/

Data & Analytics Partners: Navigating the Azure data services jungle

https://blogs.technet.microsoft.com/msuspartner/2017/04/05/data-analytics-partners-navigating-data/

Load data from Azure Data Lake Store into SQL Data Warehouse

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store

Microsoft: DAT223.1x Processing Big Data with Azure Data Lake Analytics
https://courses.edx.org/courses/course-v1:Microsoft+DAT223.1x+2T2017/info

Search This Blog

Microsoft Power BI and Data Platform News and Tips