NetApp’s Open Solution for Hadoop
Introduction
Apache Hadoop has gained considerable attention from the enterprise IT community as a data analytics
alternative to traditional BI systems and data warehousing. And while this is not the only alternative
currently available, it has become highly visible.
However, with heightened visibility comes heightened scrutiny. Hadoop’s shortcomings have also
become more visible to enterprise IT administrators who have expressed concern over data integrity,
system resiliency, ease of use, and maintainability. Now, a growing number of enterprise IT‐centric
vendors are responding to the opportunity to offer a Hadoop‐based data analytics solution that
conforms to the demands of a production data center environment. Here we review one such solution
that has resulted from a partnership between NetApp and Cloudera, the commercial face of Apache
Hadoop.
Target Market
The NetApp Open Solution for Hadoop consists of at least two NetApp storage arrays: the E2660, which
provides hardware RAID storage for Hadoop Data Nodes, and the FAS2000, which offers system resilience
and metadata protection capabilities to the Hadoop Name Node. The SANtricity Storage Manager is also
required. Cloudera Enterprise, including the Cloudera Enterprise Management Suite, can be included as
part of the solution. Hadoop servers, clients, SAS HBAs, and network switches are not included.
This is therefore not an appliance offering in the way that EMC Greenplum and IBM Netezza are: those
are pre-integrated, installable solutions that include all components.
The NetApp Open Solution for Hadoop augments traditional BI systems by allowing BI users to
embrace a much greater range of data types and data set sizes as well as perform iterative queries in
real time. And, unlike many of the early Hadoop implementations, care has been taken by the
Cloudera/NetApp partners to help users flatten the Hadoop learning curve and accelerate time to
production.
But perhaps more importantly, it is aimed at enterprise data center administrators looking for a Hadoop
platform that can be managed in ways that are more consistent with production data center policies and
practices. The objective is to provide an operational model that is tuned, tested, more stable and easier
to maintain over time.
Storage administrators who are also NetApp users have recently asked us to suggest ways they can
help make emerging Hadoop environments more stable and consistent with enterprise data
management policies regarding application availability, data protection, archive, compliance, security,
and audit. The NetApp Open Solution for Hadoop addresses these requirements and offers a blueprint
for integrating NetApp storage arrays with Hadoop clusters that preserves Hadoop’s “shared nothing”
architecture.
The Shared Nothing Imperative
Apache Hadoop users typically build their own parallelized computing clusters from commodity servers,
each with server‐internal storage, typically in the form of a small JBOD disk array. These are commonly
referred to as “shared nothing” architectures because all processing is done in parallel by servers in the
cluster that are self‐contained processing units. They communicate with one another over a common
network but otherwise do not share any other computing resources in the cluster including memory and
storage. SAN and NAS storage, while scalable and resilient, is typically seen as lacking the kind of I/O
performance these clusters need to rise above the capabilities of the standard data warehouse.
Therefore, Hadoop storage is DAS.
The practitioners of New Data Analytics processes are generally hostile to shared storage. They prefer
direct‐attached storage (DAS) in its various forms from solid state disk (SSD) to high capacity SATA disk
buried inside parallel processing nodes. The perception of shared storage architectures—SAN and NAS—
is that they are relatively slow, complex, and above all, expensive. These qualities are not consistent
with New Data Analytics systems that thrive on system performance, commodity infrastructure, and low
cost.
Real or near‐real time information delivery is one of the defining characteristics of New Data Analytics.
Latency is therefore avoided whenever and wherever possible. Data in memory is good. Data on
spinning disk at the other end of a FC SAN connection is not.
NetApp for Hadoop
The first thing to note about the NetApp Open Solution for Hadoop is that it preserves the shared
nothing architectural model. It provides DAS storage in the form of a NetApp E2660 array to each Data
Node within the Hadoop cluster. Each E2660 enclosure houses a total of 60 disks. Configured as
four volumes of DAS, the enclosure gives each Data Node its own non-shared set of disks, and each
Data Node “sees” only its share of disk (see Figure 1). Each Data Node is allocated fourteen disks
within the E2660 array as well as “array intelligence”: dual array controllers with hardware-assisted
computation of RAID parity.
Figure 1. NetApp Open Solution for Hadoop configuration (courtesy NetApp)
The E2660 operates as four completely separate and independent storage modules that are co-located
in the same 4U chassis. A single enclosure contains a total of sixty (60) 2 TB or 3 TB, 7,200 RPM
near-line SAS drives. Each module consists of 14 disks configured by the user as either RAID 5 (13 data
+ 1 parity) or RAID 6 (12 data + 2 parity). The four modules account for 56 drives; the remaining four
drives are available as global hot spares.
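The drive counts above translate into usable capacity per Data Node as follows. This is a minimal sketch that uses only the figures quoted in the text (60 drives, four 14-drive modules, 2 TB or 3 TB drives); the function and constant names are illustrative, and the results are nominal capacities before any file system or formatting overhead.

```python
# Figures taken from the text; names are illustrative only.
DRIVES_PER_ENCLOSURE = 60
MODULES_PER_ENCLOSURE = 4          # one module per Data Node
DRIVES_PER_MODULE = 14
PARITY_DRIVES = {"RAID5": 1, "RAID6": 2}


def hot_spares() -> int:
    """Drives left over after the four modules are populated."""
    return DRIVES_PER_ENCLOSURE - MODULES_PER_ENCLOSURE * DRIVES_PER_MODULE


def usable_tb_per_module(raid_level: str, drive_tb: int) -> int:
    """Nominal usable TB seen by one Data Node (data drives only)."""
    data_drives = DRIVES_PER_MODULE - PARITY_DRIVES[raid_level]
    return data_drives * drive_tb


if __name__ == "__main__":
    print("global hot spares:", hot_spares())
    for raid in ("RAID5", "RAID6"):
        for drive_tb in (2, 3):
            per_node = usable_tb_per_module(raid, drive_tb)
            per_enclosure = per_node * MODULES_PER_ENCLOSURE
            print(f"{raid}, {drive_tb} TB drives: "
                  f"{per_node} TB per Data Node, {per_enclosure} TB per enclosure")
```

With 3 TB drives, for example, a RAID 5 module presents 39 TB to its Data Node and a RAID 6 module presents 36 TB.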
The NetApp FAS2000, including its Data ONTAP operating system, provides NFS‐based storage for the
Hadoop Name Node server. The FAS system offers production data center quality storage for Hadoop
system metadata—a critical component to the overall functioning and resiliency of the Hadoop cluster.
The integration level between the FAS system and the Name Node server is described by NetApp as
modest in the first release of this solution. Later releases will use more ONTAP functionality and be
more tightly integrated into the Hadoop code base.
Problems the Solution Addresses
At the Name Node Level
The Hadoop Name Node is a well-known single point of failure: when it is not functioning, the
cluster shuts down. The FAS2040, a member of the FAS2000 family, is used as storage for the Name Node,
mitigating loss of cluster metadata due to Name Node failure. It functions as a single, unified
repository of cluster metadata that supports faster recovery from disk failure. It also serves as a
repository for other cluster software, including scripts, and as such can be used to simplify cluster
deployment, updates, and ongoing maintenance.
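One way such NFS-based metadata storage is commonly wired into Hadoop is through the Name Node's metadata directory list. The `dfs.name.dir` property (named `dfs.namenode.name.dir` in later Hadoop releases) accepts a comma-separated list of directories, and the Name Node writes its metadata to each of them. The sketch below renders such a property for `hdfs-site.xml`. It is an assumption that the solution is configured this way, since the source does not say, and both directory paths are hypothetical.

```python
import xml.etree.ElementTree as ET


def name_dir_property(dirs, key="dfs.name.dir"):
    """Render a Hadoop <property> element listing Name Node metadata dirs."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = key
    # Hadoop expects a comma-separated list with no spaces.
    ET.SubElement(prop, "value").text = ",".join(dirs)
    return ET.tostring(prop, encoding="unicode")


# Hypothetical paths: a local disk plus an NFS mount exported by the FAS system.
LOCAL_DIR = "/data/1/dfs/nn"
NFS_DIR = "/mnt/fas/dfs/nn"

if __name__ == "__main__":
    print(name_dir_property([LOCAL_DIR, NFS_DIR]))
```

Keeping one copy on the NFS mount means the metadata survives the loss of the Name Node server itself, which is the failure case the text describes.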
At the Data Node Level
Standard Hadoop clusters typically use Data Node-based software to provide data protection and
system resilience. Hadoop uses a distributed, host software-based mirroring scheme that
functions across all Data Nodes in a cluster. Upon data ingest, users typically specify that two additional
copies of the original data be written to two other Data Nodes in the cluster [1], resulting in three
copies of the data within the cluster. This provides both a degree of resilience in case of a failure
and balanced access (load balancing) to data across the Data Nodes in the cluster.
However, using a replication count of three, every TB of data ingested yields three TB stored. In
addition, the copy process consumes cluster processing resources and internal communications
bandwidth that would otherwise be available to analytic processes.
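The cost of replication described above can be expressed as simple arithmetic. The following is a minimal sketch; the function name is illustrative, and it counts only the additional copies written to other Data Nodes, not the initial write.

```python
def replication_cost(ingested_tb: float, replicas: int = 3):
    """Return (TB stored in the cluster, TB copied to other Data Nodes)."""
    if replicas < 1:
        raise ValueError("replication count must be at least 1")
    stored_tb = ingested_tb * replicas
    # Every replica beyond the first is an extra copy sent to another node.
    copied_tb = ingested_tb * (replicas - 1)
    return stored_tb, copied_tb


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        stored, copied = replication_cost(1, replicas)
        print(f"replicas={replicas}: {stored} TB stored, "
              f"{copied} TB copied between Data Nodes per TB ingested")
```

At the standard replication count of three, each TB ingested produces three TB on disk and two TB of copy traffic inside the cluster.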
NetApp moves data protection processes, and the creation of data replicas needed for adverse event
recovery purposes, off the Hadoop cluster and on to storage arrays that are designed to accomplish
these tasks far more efficiently. Triple mirroring within the cluster consumes server and network
bandwidth. Instead, NetApp allows admins to mirror data to a direct-attached NetApp E2660 array via
6 Gb/s SAS connections. Doing so replaces the triple mirror implemented in software that runs at the
Data Node level with hardware RAID that runs at the array level.
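To see what this trade means for raw capacity, the sketch below compares the raw disk needed per TB of data under software triple mirroring on JBOD with a reduced replica count on the E2660 RAID layouts (13+1 and 12+2). The reduced replica counts of two and one are assumptions chosen for illustration; the source does not state which count the solution uses.

```python
def raw_tb_per_tb_of_data(replicas: int, data_disks: int = 1,
                          parity_disks: int = 0) -> float:
    """Raw disk consumed per TB of data: replica count times RAID overhead."""
    raid_overhead = (data_disks + parity_disks) / data_disks
    return replicas * raid_overhead


# Replica counts for the RAID scenarios are illustrative assumptions.
SCENARIOS = {
    "JBOD, 3 replicas": (3, 1, 0),
    "RAID 5 (13+1), 2 replicas": (2, 13, 1),
    "RAID 6 (12+2), 2 replicas": (2, 12, 2),
    "RAID 5 (13+1), 1 replica": (1, 13, 1),
    "RAID 6 (12+2), 1 replica": (1, 12, 2),
}

if __name__ == "__main__":
    for label, (replicas, data, parity) in SCENARIOS.items():
        ratio = raw_tb_per_tb_of_data(replicas, data, parity)
        print(f"{label}: {ratio:.2f} TB raw per TB of data")
```

Under these assumptions, two replicas on RAID 5 need roughly 2.15 TB of raw disk per TB of data, against 3 TB for triple mirroring on JBOD.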
The net result is that the Hadoop Data Nodes can be protected from the risk of disk failures that result in
job failures. Support for non‐disruptive, simultaneous rebuild of logical volumes means that disk failures
can be handled without disrupting the cluster and without requiring administrator intervention. And
the use of enterprise‐grade disk by NetApp in the E2660 array will result in fewer disk failures over time.
Use of the E2660 can also increase overall cluster performance, even when JBOD disk used within the
Data Nodes is replaced by the E2660, by reducing the HDFS replica count and allowing the storage
array to process that workload. In addition, the use of hardware RAID combined with caching at the
E2660 array level will add a further margin of performance.
Conclusion
As mentioned earlier, the NetApp Open Solution for Hadoop differs from data analytics appliance
offerings in that it does not include Hadoop server and client hardware. Customers for this solution
are therefore free to source their own at the best price they can negotiate. Additionally, Cloudera’s
Distribution including Apache Hadoop (CDH) is available as a free download from Cloudera. Zaloni is one
of NetApp’s partners that offers the solution along with custom services and support.
Hadoop is now emerging in enterprise production data centers as a new BI tool that in some cases
augments already established data warehousing systems and in other cases delivers functionality that
is beyond the reach of the traditional data warehouse. We believe that these early Hadoop deployments
will grow in size and importance over time. Therefore it is important to start with an implementation
that offers production data center quality resilience and data integrity, and that can be scaled
upward in time by adding internal storage capacity rather than adding more Data Nodes each time more
storage capacity is needed. We also note that the ability to integrate an archival storage component
for security and compliance reasons will become more critical as time goes on. The NetApp Open
Solution for Hadoop addresses these requirements by delivering enterprise data center quality storage
platforms, integrated with Hadoop, that are well known and understood by enterprise IT administrators.
[1] Replication count is user controllable. Maintaining three copies of data has become standard
practice. However, to improve performance for large bulk data loads, users can and often do reduce the
replication count to one or two, and increase the count later.
About Evaluator Group
Evaluator Group Inc. is dedicated to helping IT professionals and vendors create and implement strategies that make the most of
the value of their storage and digital information. Evaluator Group services deliver in‐depth, unbiased analysis on storage
architectures, infrastructures and management for IT professionals. Since 1997 Evaluator Group has provided services for
thousands of end users and vendor professionals through product and market evaluations, competitive analysis and education.
www.evaluatorgroup.com Follow us on Twitter @evaluator_group
Copyright 2012 Evaluator Group, Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying
and recording, or stored in a database or retrieval system for any purpose without the express written consent of Evaluator Group Inc. The
information contained in this document is subject to change without notice. Evaluator Group assumes no responsibility for errors or omissions.
Evaluator Group makes no expressed or implied warranties in this document relating to the use or operation of the products described herein.
In no event shall Evaluator Group be liable for any indirect, special, inconsequential or incidental damages arising out of or associated with any
aspect of this publication, even if advised of the possibility of such damages. The Evaluator Series is a trademark of Evaluator Group, Inc. All
other trademarks are the property of their respective companies.