Jumping into Azure Data Lake
Introduction
You may have heard of Data Lakes and be wondering whether they are the next big thing. This post provides an overview of Azure Data Lake, how to use it, and some of its pros and cons.
History of Data Lakes and Azure Data Lake
The Data Lake term was coined by Pentaho CTO James Dixon in October 2010 to address the limitations of a data mart, which typically holds only a subset of attributes and aggregated data. Per James Dixon’s blog, the lake terminology was chosen since “the contents of the data lake stream in from a source to fill a lake, and various users of the lake can come to examine, dive in, or take samples.”
Microsoft’s Azure Data Lake became generally available in November 2016, as described in this post: https://azure.microsoft.com/en-us/blog/the-intelligent-data-lake/. There are really three services that make up the data lake, as shown in the diagram below: Azure Data Lake Store, Azure Data Lake Analytics, and HDInsight.
Getting Started
To get started with Azure Data Lake, you provision a new “Data Lake Analytics” account from the Azure portal. During that process you can also provision the Data Lake Store, or reference an existing Data Lake Store. Aside from the name and location (Azure region), the only other choice you have is data encryption. By default, encryption is enabled and uses keys managed by the Data Lake Store. If you prefer no encryption, or want to use keys from your own Azure Key Vault, you will need to create the Data Lake Store first.
As of April 2017, Data Lake Store pricing for monthly plans ranges from 1 TB for $35 to 1 PB for $24,880. Pay-as-you-go pricing is $0.04 per GB per month. The Data Lake Analytics service is priced on Analytics Unit (AU) hours, with monthly commitments ranging from 100 AU-hours for $100 to 100,000 AU-hours for $52,000, or $2 per AU-hour without a monthly commitment. See https://azure.microsoft.com/en-us/pricing/details/data-lake-store/ and https://azure.microsoft.com/en-us/pricing/details/data-lake-analytics/ for details.
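As a rough illustration of those rates, storing 10 TB pay as you go works out to about 10,240 GB × $0.04 ≈ $410 per month, and a job that uses 10 AUs for 2 hours consumes 20 AU-hours, or about $40 without a commitment.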
Working with Azure Data Lake
If you are used to working with relational databases or data warehouses, you will find Azure Data Lake has a couple of key differences:
- The schema is defined when reading the data
- The processing is batch oriented
A data lake can store any type of data, regardless of its structure. Only when accessing the data do you specify its structure, such as the columns it contains, the delimiter, and any header rows to ignore.
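Here is a minimal U-SQL sketch of that schema-on-read pattern; the file path and column names are hypothetical:

// Structure is declared only at read time; the file and columns below are made up.
@orders =
    EXTRACT customername string,
            stockitemid int,
            quantity int
    FROM "/retail/orders.csv"
    USING Extractors.Csv(skipFirstNRows:1); // skip one header row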
Instead of just executing a query and getting results back immediately, as you typically would with a relational database, with Azure Data Lake you submit an analytical job. The job goes through four states: Preparing, Queued, Running, and Finalizing. Even simple tasks against small data sets take at least 30 seconds in my experience. Also, the syntax is not checked immediately, so even identifying a typo can take 20 or 30 seconds while the job is queued.
U-SQL
Azure Data Lake uses the U-SQL language. If you imagine SQL and C# having a baby, it would
be U-SQL. It has most of the structural
elements of SQL but uses the power of C# for more advanced features. For example, you use C# functions for string
manipulation and “==” to check for equality.
One thing to be aware of is that U-SQL has also inherited the sensitive side of C#: unlike SQL, it is case-sensitive. Also, the ORDER BY clause has moved; when writing ordered output to a file, it belongs to the OUTPUT statement rather than the SELECT.
For an example illustrating these points, see below.
USE DATABASE retail;

@qty =
    SELECT orders.customername,
           SUM(orders.quantity) AS qty
    FROM sales.stock AS stock
         INNER JOIN sales.orders AS orders
             ON orders.stockitemid == stock.id
    WHERE stock.item.ToUpper() LIKE "%CHOCOLATE%"
    GROUP BY orders.customername;

OUTPUT @qty
TO "/retail/chocolatecust.csv"
ORDER BY qty DESC
USING Outputters.Csv();
Optimizing
There are just a few choices you can make to optimize performance with Azure Data Lake:
- Table indexing, distributing and partitioning
- U-SQL structure
- # of AUs
A data lake table can have just one clustered index, but the index can consist of more than one column. It basically determines how the data will be physically stored in the clustered table. (See https://msdn.microsoft.com/en-us/library/mt706196.aspx.) There are no b-tree or columnstore indexes. Every table also needs a distribution scheme; the most common schemes are hash and round robin. Tables can also optionally be partitioned, which is a coarser grain than the distribution. For example, you might partition by orderdate and distribute on the hash of orderid, as in the sketch below.
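Here is a minimal sketch of such a table definition; the table and column names are hypothetical:

USE DATABASE retail;

CREATE TABLE IF NOT EXISTS sales.ordersdistributed
(
    orderid int,
    orderdate DateTime,
    customername string,
    quantity int,
    // one clustered index per table, which determines physical storage order
    INDEX idx_orders CLUSTERED(orderid ASC)
    // coarse-grained partitions by order date
    PARTITIONED BY(orderdate)
    // rows spread across distributions by hash of the order id
    DISTRIBUTED BY HASH(orderid)
);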
U-SQL is a powerful language and provides lots of options for how you process your data. If you have experience with SQL tuning, much of the same guidance applies; for example, avoid SELECT * and eliminate partitions where possible. See https://www.slideshare.net/MichaelRys/usql-query-execution-and-performance-tuning for more information.
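As a small illustration, against the hypothetical sales.ordersdistributed table above, naming only the columns you need and filtering on the partition column gives the optimizer a chance to skip whole partitions:

// select only the needed columns (no SELECT *) and filter on the partition column
@recent =
    SELECT customername,
           quantity
    FROM sales.ordersdistributed
    WHERE orderdate >= DateTime.Parse("2017-01-01");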
The final option that impacts performance is the Azure Data
Lake Analytics Unit or AU for short.
When you run a job, you assign it a number of AUs. Each AU is currently equivalent to 2 CPU
cores and 6 GB of RAM. There are two things to be aware of regarding AUs. First, you pay for the AUs for the duration of the job, regardless of whether they are actually used. Second, adding more AUs will not necessarily improve the job’s speed. See
https://blogs.msdn.microsoft.com/azuredatalake/2016/10/12/understanding-adl-analytics-unit/ for more details.
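For example, a job allocated 10 AUs that finishes in 30 minutes is billed for 10 × 0.5 = 5 AU-hours (about $10 at the pay-as-you-go rate), even if only a few of those AUs ever did useful work.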
Visual Studio can provide a graph of a job’s execution so you can see how many AUs were used and adjust if needed for future runs. An example is shown below for a job run with an allocation of 4 AUs; the graph shows that all 4 AUs were actually used.
Conclusion
The Data Lake is a newer concept, and some may think of it
as the answer to their data challenges.
However, I’d suggest it is better to think of it as one piece of the
puzzle. It is a great place to land and
ingest high volumes of data. It provides
the “Big Data” power of Hadoop without the complexity of spinning up clusters
and doing lots of programming.
However, for high-value data, a data warehouse still has benefits. The additional indexing capabilities and real-time responsiveness of a data warehouse allow it to serve large numbers of users interactively. The data warehouse also offers compatibility with the widest variety of tools for ad hoc reporting.