The document discusses the rise of Big Data as a Service (BDaaS) and how recent technological advancements have enabled its emergence. It provides a brief history of Hadoop and how improvements in networking, storage, virtualization and containers have addressed earlier limitations. It defines BDaaS and describes the public cloud and on-premises deployment models. Finally, it highlights how BlueData's software platform can deliver an integrated BDaaS solution both on-premises and across multiple public clouds including AWS.
The Time Has Come for Big-Data-as-a-Service
1. #HadoopSummit
The Time Has Come for
Big-Data-as-a-Service
Kris Applegate – Cloud and Big Data Solution Architect, Dell
Tom Phelan – Co-Founder and Chief Architect, BlueData
2. #HadoopSummit
Agenda
• A Brief History of Hadoop
• Data Storage and Networking Evolution
• The Virtualization Revolution
• Rise of Big-Data-as-a-Service
• Big-Data-as-a-Service (BDaaS) Defined
• BDaaS – Public Cloud or On-Premises?
• Q & A
4. #HadoopSummit
In the Beginning (circa 2003) …
• Networks were slow (1 Gigabit per second maximum)
• Siloed storage was expensive (proprietary and often required special hardware)
• Local HDDs were cheap and fast enough for big data needs
Source:
http://static.googleusercontent.com/media/researc
8. #HadoopSummit
Result: Is Disk-Locality Irrelevant?
Source: https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/disk-irrelevant_hotos2011.pdf
Less relevant may be more accurate
• Faster data center networks
• Distributed/non-distributed caching platforms
• Example: Alluxio (Tachyon)
• Compute and storage separation
9. #HadoopSummit
• Virtualization / “cloud” technology is not absolutely required
• But realistically … the flexibility and elasticity of BDaaS cannot be economically provided without these underlying technologies
BDaaS and Cloud
11. #HadoopSummit
Virtualization enabled several key benefits including:
• Automation, flexibility, elasticity
• Cost reduction and consolidation
• Higher utilization, less hardware overprovisioning
• Multi-tenancy
• Security
• VxLAN
• Fault isolation
The Virtualization Revolution
12. #HadoopSummit
But … the overhead involved in virtualizing storage and networking within a hypervisor makes it difficult to meet the performance needs of Big Data workloads (SLAs, QoS)
The Virtualization Revolution
13. #HadoopSummit
• Linux Containers
• OS virtualization reduces CPU, memory, network, and storage virtualization overhead
• Docker file format makes containers easy to use and share
The Virtualization Revolution
15. #HadoopSummit
Big Data New Realities
Big Data Traditional Assumptions
• Bare-metal
• Disk-locality
• HDFS on local disks
Big Data New Realities
• Containers
• Compute and storage separation
• In-place access on remote data stores
New Benefits and Value
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights
16. #HadoopSummit
Journey to BDaaS
Timeline, 2002 to 2016:
• 2002: Initial release of VMware ESX
• 2003: Google paper
• 2004: Big Data era begins
• 2007: Hadoop release 0.14.1
• 2008: Initial release of Linux containers
• 2009: Dell DCS delivers first Big Data server
• 2009: Amazon launches EMR
• 2011: Dell first to launch optimized Apache Hadoop solution
• 2012: Hadoop 1.0.2; Snappy compression
• 2012: 10 Gbit networking in data center
• 2013: Dell Hadoop Performance Analysis
• 2013: Initial release of Docker
• 2014: VxLANs available
• 2014: BlueData wins Strata + Hadoop World Showcase
• 2015: BlueData EPIC 2.0 with Docker
• 2015: 40 Gb networking in data center
• 2016: BDaaS available on-prem or cloud
17. #HadoopSummit
BDaaS – The Time Has Come
All the pieces are now available:
• Fast network hardware and good data compression
Compute and storage separation
Low overhead virtualization (containers)
Ability to run network and storage-intensive workloads
• No sacrifice in performance
• Demand from end users for agility, flexibility, & speed
18. #HadoopSummit
Big-Data-as-a-Service Defined
“A mechanism for the delivery of statistical analysis tools and information
that helps organizations understand and use insights gained from large
information sets in order to gain a competitive advantage.”
On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications, Analytics
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
19. #HadoopSummit
• Core BDaaS
• Performance BDaaS
• Feature BDaaS
• Integrated BDaaS
Four Types of BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
20. #HadoopSummit
Core BDaaS
• Minimal platform, such as Hadoop with YARN
Performance BDaaS
• “Downwards” vertical integration
• Includes optimized infrastructure
• Tight integration with Core BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
Four Types of BDaaS
21. #HadoopSummit
Four Types of BDaaS
Feature BDaaS
• “Upwards” vertical integration
• Include features beyond Hadoop
• Support for multiple Core BDaaS providers
Integrated BDaaS
• Full vertical integration and optimization
• Includes both Performance BDaaS & Feature BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
23. #HadoopSummit
Public Cloud
• Low Capex, high Opex
• “Infinite” expandability
• Less secure?
• Less control: software, SLAs, configs, etc.
On-Premises (Private Cloud)
• High Capex, low Opex
• Eventually reach resource limit
• More secure?
• More control: software, SLAs, configs, etc.
BDaaS – Public Cloud or On-Prem
24. #HadoopSummit
Challenge: Public cloud services can be proprietary
Goal: Deliver API-compatible on-prem + public cloud
• BDaaS layer (e.g. BlueData)
• PaaS layer (e.g. Cloudforms, Cloud Foundry)
• API-compatible private cloud (e.g. Microsoft Azure Pack/Stack, OpenStack, VMware)
BDaaS – Workload Portability
25. #HadoopSummit
• Workloads with a shorter life than 16 months*
(e.g. Dev/Test)
• When data is in the cloud too
• Public-facing services
Example Public Cloud Use Cases
BDaaS – Public Cloud
* www.dell.com/learn/us/en/555/business~solutions~whitepapers~en/documents~microsoft-private-cloud-tco-0914.pdf
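The 16-month threshold above is a break-even point between the two cost profiles (low Capex and high Opex in the public cloud, high Capex and low Opex on-premises). A minimal sketch of that arithmetic follows. Every dollar figure in it is a hypothetical placeholder, chosen only so that the result lands on 16 months; none of them come from the Dell whitepaper.

```python
from typing import Optional


def break_even_months(
    capex: float,
    onprem_opex_per_month: float,
    cloud_opex_per_month: float,
) -> Optional[float]:
    """Month at which cumulative on-prem cost equals cumulative cloud cost.

    On-prem cost after m months:  capex + onprem_opex_per_month * m
    Cloud cost after m months:    cloud_opex_per_month * m
    Returns None when the cloud is never more expensive per month.
    """
    monthly_saving = cloud_opex_per_month - onprem_opex_per_month
    if monthly_saving <= 0:
        return None
    return capex / monthly_saving


# Hypothetical inputs (assumptions for illustration only).
months = break_even_months(
    capex=160_000,                # up-front hardware purchase
    onprem_opex_per_month=2_000,  # power, space, administration
    cloud_opex_per_month=12_000,  # rented instances and storage
)
print(f"Break-even after {months:.0f} months")
```

Under these assumed inputs, a workload that lives for less than the break-even period is cheaper in the public cloud, and a longer-lived one is cheaper on-premises.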
26. #HadoopSummit
Example On-Prem Use Cases
• High performance clusters
• Data security
• Data compliance
• Persistent clusters with > 16 month lifespan*
• High capacity clusters
• When SLAs are needed
* The BlueData EPIC software platform addresses this potential limitation
BDaaS – On-Premises / Private Cloud
27. #HadoopSummit
• BDaaS software platform, using Docker containers
• Self-service, on-demand Hadoop / Spark clusters
• Bring your own application / distribution / version
• Compute and storage separation
Scale resources independently
Clusters with < 16 month lifespan well supported (e.g. transient)
No HDFS data ingestion penalty
• Secure multi-tenancy, Quality of Service (QoS)
BlueData EPIC – Integrated BDaaS
28. #HadoopSummit
Big Data On-Premises
Traditional Big Data On-Prem (IT serving Manufacturing, Sales, R&D, Services)
• < 30% utilization
• Duplication of data
• Management complexity
• Weeks to build each cluster
• Complex, painful upgrades
BDaaS On-Prem with BlueData (BlueData EPIC Software Platform serving Manufacturing, Sales, R&D, Services, with BI/Analytics Tools)
• > 90% utilization
• No duplication of data
• Simplified management
• Multi-tenant
• Simple, instant upgrades
• Self-service, on-demand clusters
29. #HadoopSummit
NEW – BDaaS On-Prem and Cloud
• BlueData announced AWS and multi-cloud strategy
Extending the user experience and value of BlueData to public cloud
Single pane of glass for on-prem and off-prem Big Data workloads
Initial AWS support; then MS Azure, Google Cloud Platform, others
• Support for data on-prem and compute in the cloud
Leverage cloud compute elasticity while keeping data on-premises
Eliminate challenge of data movement from on-prem to cloud
30. #HadoopSummit
BlueData and Dell Partnership
• Joint solution for Big-Data-as-a-Service
• BlueData = Certified Dell Technology Partner
• Installed, tested, validated on Dell hardware
• Featured in Dell’s Global Customer Solution Centers
Tom –
3x data replication was enough. No need for backup/recovery/geographical replication/snapshots etc.
Kris
Kris
Kris
Tom
Data locality is still important – caching. RAM & SSD. tiering.
Performance of random access on data working sets that exceed cache capacity will devolve to network speed.
Modern data center has at least 10 Gbit/s, more likely 40 Gbit/s network.
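To make the network-speed point concrete, here is a rough sketch of the ideal time to pull a working set over links of 1, 10, and 40 Gbit/s. The 1 TB working-set size is an assumed example, and the calculation ignores protocol overhead, contention, and disk speed, so real transfers will be slower.

```python
def transfer_seconds(data_bytes: float, link_gbit_per_s: float) -> float:
    """Ideal time to move data_bytes over a link of the given line rate."""
    bits = data_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


ONE_TB = 1e12  # hypothetical working set: 1 TB (decimal)

for gbit in (1, 10, 40):
    seconds = transfer_seconds(ONE_TB, gbit)
    print(f"{gbit:>2} Gbit/s: {seconds:,.0f} s")
```

At line rate this works out to 8,000 s at 1 Gbit/s, 800 s at 10 Gbit/s, and 200 s at 40 Gbit/s, which is why remote data access became far more practical as data center networks got faster.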
Tom –
Bare metal implementations of BDaaS are available.
Costly and cumbersome.
Tom
First hardware virtualization: hypervisors (ESX, Hyper-V, KVM, Xen, etc.)
Then operating system virtualization: jails, containers, etc.
Kris
Kris
Tom
Same Docker files can be run on-prem & in cloud.
Remember this. It will be important later.
Tom
Tom
Big Data users developed needs for agility, multiple clusters, remote data, independent compute & storage scalability
Technology progressed.
Did the need drive the development of the technology or was it coincidental?
Which came 1st, the chicken or the egg?
Ultimately, it does not matter.
Tom & Kris
Tom
The needs/wants of the Big Data user can be met by available technology - BDaaS.
Tom
There are many definitions of BDaaS.
Some say it is the combination of software and data; that can be hard to grasp.
We say it is a functionality stack:
Tom
There are four types. Integrated BDaaS is the nirvana. The other three are stepping stones to get there.
For the most part, the later types encompass the functionality of the earlier types.
Each step gives more “help” to the organization wrt the use of their data.
Kris
Tom
BI tools – Datameer, Platfora
CDH/HDP/Pivotal/MapR/BigInsights
Tom
This is often the $64,000 question.
I have data here. I have data there. Why do I need to “get” it somewhere in order to be able to use it?
Tom
Before we can answer the question of how to process my data,
let's look at where the data is. Typically it will be in one of two places, and each of those places has unique characteristics.
Kris
Kris
Kris
Tom
One example of BDaaS.
The EPIC platform from BlueData provides BDaaS.
Tom
Bring order to chaos.
Stop “cluster sprawl” – we originally heard this term years ago from an early (alpha level) customer. The term has since entered common usage.
Tom
This is breaking news.
EPIC support for this was announced LAST WEEK
A roadmap for the delivery of the nirvana of “Integrated BDaaS” across both on-prem and public cloud
With compute/storage separation – keep data on-prem, offload compute to the cloud – avoid cost, complexity, & potential risk of moving data to the public cloud