Data is being generated at a feverish pace, and forward-thinking companies are integrating big data and analytics as part of their core strategy from day one. However, it is often hard to sift through the hype around big data, and many companies start with only a small subset of data. Can smaller companies benefit from big data efforts? We will discuss several use cases and examples of how startups are using data to optimize their operations, connect with their users, and expand their market.
1. How Can Startups Leverage Big Data?
Trudging Through Myth To Discover Real Value
2. What is Big Data?
• Mostly Unstructured Data
• Client Data
• Customer Data
• Social Data
• Driving towards insight
www.rackspace.com
3. “Big Data is any dataset not suited to be processed by traditional legacy technology.”
4. The Three V’s
V3C: Volume, Velocity, Variety, Complexity
• Mining social data for sentiment
• Analyzing web clickstreams
• Analyzing log data for security breaches
• Telemetry from sensors and machines
• eCommerce predictive analytics
7. Big Data is Here to Stay
• Big Data is now much more than hype: real customers with real use cases are adopting it daily
• A recent survey found that business leaders expected the deployment of Hadoop to result in a 3-year benefit
ranging from $5M to $50M+
• Close to 100% of business leaders have already deployed or plan to deploy Apache™ Hadoop®
“Enterprises are showing increasing interest in the value provided by the large-scale data processing that Hadoop and Spark
can provide, but can be wary of the upfront cost and complexity of setting up a cluster to prove that value. Managed services
such as [OnMetal™ Cloud Big Data Platform] enable enterprises to focus their energies on generating business insights rather
than configuring and managing infrastructure.”
Matt Aslett
451 Research Director, Data Platforms and Analytics
8. Why leverage Big Data?
• To learn more about your customers
• To optimize your business processes
• To become a more targeted marketer
• To interact with users and customers in real time
• To add revenue and services
9. What Is the Cost of Lacking a Big Data Strategy?
• Today every company can be a data company
• Successful companies will be data companies
• Under Armour isn’t just a fitness company – they’re a data company
10. Hadoop Has Emerged As A Leader In Distributed Data Sets
• Open Source
• Able to process petabytes of data quickly
• Based on designs published by Google, implemented at scale at Yahoo
• Handles unstructured data very well
• One of the fastest growing ecosystems
11. Fundamentals of Hadoop v1
Layers: Operational Services, Data Services, Core Services
• Ambari: installation, monitoring, administration
• Zookeeper: configuration, sync and naming registry
• Oozie: workflow and job scheduling
• Knox: auth and access
• Falcon: data pipeline framework
• Pig: data flow scripting language
• Hive: DW analysis layer through HiveQL (SQL-like) queries
• HBase: distributed, scalable, non-relational database
• HCatalog: metadata and table management system
• Flume: log data aggregation and movement
• Sqoop: bulk data transfer from and to relational DB
• HDFS: distributed file system
• MapReduce: data processing framework
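MapReduce is listed above only as the "data processing framework". As a rough illustration of the programming model (not of Hadoop's actual Java API), the sketch below runs the three conceptual phases, map, shuffle, and reduce, in a single Python process to count words. The function names are hypothetical and chosen for this example; a real cluster distributes each phase across many machines.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework does between phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread across machines without coordination, which is what lets the model scale to large data sets.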
12. Hadoop is Hard
• Only 3 in 10 businesses that plan to implement Hadoop have done so
• Biggest impediments include:
– Insufficient skills in-house to design and deploy
– Designing and deploying takes too long
– High cost of physical infrastructure
13. Hadoop is Changing
• Original focus on batch processing
• Streaming and interactive use cases emerging
• Shift from jobs that take hours to seconds
• Impala, Spark, and Presto are emerging tools
14. But what are these companies doing with Big Data?
Gaining Insights!!!
15. What are Companies Doing with Hadoop?
Use cases by vertical (use case: data type)
• Financial Services
– New Account Risk Screens: Text, Server Logs
– Fraud Prevention: Server Logs
– Trading Risk: Server Logs
– Maximize Deposit Spread: Text, Server Logs
– Insurance Underwriting: Geographic, Sensor, Text
– Accelerate Loan Processing: Text
• Telecom
– Call Detail Records (CDRs): Machine, Geographic
– Infrastructure Investment: Machine, Server Logs
– Next Product to Buy (NPTB): Clickstream
– Real-time Bandwidth Allocation: Server Logs, Text, Sentiment
– New Product Development: Machine, Geographic
• Retail
– 360 View of the Customer: Clickstream, Text
– Analyze Brand Sentiment: Sentiment
– Localized, Personalized Promotions: Geographic
– Website Optimization: Clickstream
– Optimal Store Layout: Sensor
• Manufacturing
– Supply Chain and Logistics: Sensor
– Assembly Line Quality Assurance: Sensor
– Proactive Maintenance: Machine
– Crowdsourced Quality Assurance: Sentiment
16. Application Underpinning
People are building net-new applications with Hadoop as their database
• Mobile
– Enterprises consider support for mobility and productivity enhancement to mobile workers as their top-priority new application
category, according to a recent survey by CIMI Corp. That means most companies that have adopted, or are adopting,
Hadoop will likely have to integrate the framework with mobile applications.
• Data Aggregation
– “The two big use cases we're seeing for Impala are aggregating data in Hadoop to present analytic dashboards and improving
data-discovery applications by providing faster performance than Hive,” says Alex Gutow, Cloudera's product marketing
manager.
• Dashboarding
– Users are increasingly choosing Hadoop as the underlying technology to power interactive dashboarding capability.
• Internet of Things
– As wearables and other data-generating devices become commonplace, the backend of your application needs to
be built to handle the velocity and volume of data those devices produce.
17. Clickstream Analysis
Understand how your users are behaving on your website and optimize your experience
Your home page looks great. But how do you move customers on to bigger things—like submitting a form
or completing a purchase? Get more granular with customer segmentation. Hadoop makes it easier to
analyze, visualize and ultimately change how visitors behave on your website.
A clickstream is a series of page requests. Every page requested generates a signal. These signals can be
graphically represented for clickstream reporting. The main point of clickstream tracking is to give
webmasters insight into what visitors on their site are doing.
• Clickpath
– The study of human clicks on a website
• Tracking Cookies
– Tool used to understand and track online activity
• Data Mining
– Collecting data from websites and online properties
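To make the clickpath idea concrete, here is a minimal sketch that turns raw page requests into per-visitor clickpaths and counts the page-to-page transitions. It assumes each event is a hypothetical `(visitor_id, timestamp, page)` tuple; real clickstream records carry more fields, and at Hadoop scale the same grouping would run as a distributed job instead of in memory.

```python
from collections import Counter, defaultdict


def clickpaths(events):
    """Group page requests by visitor, ordered by timestamp."""
    by_visitor = defaultdict(list)
    for visitor, _timestamp, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_visitor[visitor].append(page)
    return dict(by_visitor)


def transition_counts(paths):
    """Count how often visitors move directly from one page to another."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts


if __name__ == "__main__":
    # Made-up sample events for illustration only.
    sample = [
        ("a", 1, "/home"),
        ("a", 2, "/pricing"),
        ("b", 1, "/home"),
        ("a", 3, "/signup"),
        ("b", 2, "/pricing"),
    ]
    for (src, dst), n in transition_counts(clickpaths(sample)).most_common():
        print(f"{src} -> {dst}: {n}")
```

Transitions that are common but rarely continue to a form submission or purchase are natural candidates for the page optimization described above.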
18. Sentiment Analysis
Find out what your users are saying about you. Are they happy? Does your product make them a promoter?
Your customers are talking. With Hadoop, you can mine Twitter, Facebook and other social media
conversations for sentiment data about you and your competition, and use it to make targeted, real-time
decisions that increase market share.
Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the
overall contextual polarity of a document.
• Social Media Feeds
– Many companies are now capturing entire Twitter and Facebook feeds to analyze.
• Data Mining
– Users are searching the web for comments, blogs, and whitepapers that can point to overall sentiment
• E-Communities
– Forums, user groups, Heroku
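One very simple way to estimate the polarity described above is a word-list (lexicon) score. The sketch below is an assumption-laden illustration: the `POSITIVE` and `NEGATIVE` word sets are tiny, made-up examples, and production systems typically use much larger lexicons or trained models.

```python
import re

# Hypothetical, deliberately tiny lexicons used only for illustration.
POSITIVE = {"love", "great", "happy", "recommend"}
NEGATIVE = {"hate", "slow", "broken", "angry"}


def sentiment_score(text):
    """Return (# positive words) minus (# negative words) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    """Map the numeric score to a coarse polarity label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Scoring each post independently means the work can be spread across a cluster, with one post per map call and an aggregate per brand or topic in the reduce step.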
19. Machine Learning
Interactive devices are now streamlining things like maintenance and troubleshooting
Your machines know things. From out in the field to the assembly line floor—machines stream low-cost,
always-on data. Hadoop makes it easier for you to store and refine that data and identify meaningful
patterns, providing you with the insight to make proactive business decisions.
Machine Learning is a scientific discipline that deals with the construction and study of algorithms that can
learn from data. Such algorithms operate by building a model based on inputs and using that to make
predictions or decisions, rather than following only explicitly programmed instructions.
• Pattern Recognition
– Users are building clusters to detect patterns and identify anomalies in data that these devices are generating
• Decision Tree
– Allows the system to take action and make choices based on the data
• Predictive Modeling
– Aims to anticipate the most common mistakes and errors so they can be handled automatically as part of a preventative model
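As a small illustration of identifying anomalies in machine data, the sketch below flags readings that sit unusually far from the mean, measured in standard deviations (a z-score). The threshold of 2.0 and the sample temperatures in the usage note are assumptions for this example; it is a basic statistical check, not a trained model, and real deployments would tune the threshold per sensor.

```python
from statistics import mean, stdev


def find_anomalies(readings, threshold=2.0):
    """Return indices of readings more than `threshold` std devs from the mean."""
    mu = mean(readings)
    sigma = stdev(readings)  # sample standard deviation; needs >= 2 readings
    if sigma == 0:
        return []  # all readings identical, nothing stands out
    return [i for i, x in enumerate(readings) if abs(x - mu) / sigma > threshold]
```

For example, `find_anomalies([70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 95.0, 70.1])` flags only the 95.0 reading at index 6. Note that in small samples a single extreme value inflates the standard deviation, which is why a lower threshold is assumed here.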
20. Fraud Detection
Users are detecting fraudulent online behavior and rejecting those users before they commit an offense
Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of
2009 suggests that close to 30% of companies worldwide have reported being victims of fraud in the past
year.
Fraud involves one or more persons who intentionally act secretly to deprive another of something of value,
for their own benefit. Fraud is as old as humanity itself and can take an unlimited variety of different forms.
However, in recent years, the development of new technologies has also provided further ways in which
criminals may commit fraud.
• Rules-Based Detection
– Even though internet hackers have become better at tricking online systems, they still exhibit very calculated behavior.
• Machine Learning
– The aggregation of data points can help you collect more info about the potential sale and detect if it might be fraud.
• Users Tagging and Tracing
– Once users are flagged as fraudulent, their repeated attempts can be prevented.
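The rules-based detection and user tagging ideas above can be sketched together. Everything specific here is hypothetical: the `Order` fields, the three rules, their thresholds, and the `FraudScreen` class are illustrative assumptions, not a recommended rule set.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Order:
    user_id: str
    amount: float
    billing_country: str
    shipping_country: str
    attempts_last_hour: int


# Hypothetical rules and thresholds, chosen only for illustration.
RULES: Dict[str, Callable[[Order], bool]] = {
    "country_mismatch": lambda o: o.billing_country != o.shipping_country,
    "high_amount": lambda o: o.amount > 1000,
    "rapid_retries": lambda o: o.attempts_last_hour > 5,
}


@dataclass
class FraudScreen:
    min_hits: int = 2  # how many rules must fire before rejecting
    flagged_users: Set[str] = field(default_factory=set)

    def triggered_rules(self, order: Order) -> List[str]:
        """Names of every rule the order trips."""
        return [name for name, rule in RULES.items() if rule(order)]

    def should_reject(self, order: Order) -> bool:
        """Reject flagged users outright; flag users whose order trips enough rules."""
        if order.user_id in self.flagged_users:
            return True
        if len(self.triggered_rules(order)) >= self.min_hits:
            self.flagged_users.add(order.user_id)
            return True
        return False
```

Keeping the rules in a plain dictionary makes them easy to audit and extend; a learned score could later be added as one more entry alongside the hand-written rules.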
21. Server Log Data
Aggregate server logs to find trends and anomalies in your security records
Security breaches happen. And when they do, your server logs may be your best line of defense. Hadoop
takes server-log analysis to the next level by speeding and improving security forensics and providing a low
cost platform to show compliance.
Server logs are generally small files that track user information inside a confined environment; they are often used to meet
compliance requirements or troubleshoot an incident.
• Scrub Data for Forensics
– If a security incident occurs, it is important to remediate fast
• Identify Anomalies
– Anti-patterns are often the first sign
• Discover Trends
– Some types of errors might become common; learn to identify them
• Actively Automate to Solve Issues with Log Files
– Many of these errors can be proactively eliminated through the use of automation.
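A small example of identifying anomalies in server logs: count failed logins per source IP and surface the addresses that cross a threshold. The log line format matched by `LINE` is a made-up assumption for this sketch, as is the threshold of 3; real logs (sshd, web servers, and so on) each need their own pattern.

```python
import re
from collections import Counter

# Assumed log format, e.g.:
#   2015-03-01T12:00:01 FAILED login user=root ip=10.0.0.5
LINE = re.compile(r"(?P<status>FAILED|OK) login user=(?P<user>\S+) ip=(?P<ip>\S+)")


def failed_logins_by_ip(lines):
    """Count failed login attempts per source IP, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group("status") == "FAILED":
            counts[match.group("ip")] += 1
    return counts


def suspicious_ips(lines, threshold=3):
    """Return, sorted, the IPs with at least `threshold` failed logins."""
    return sorted(ip for ip, n in failed_logins_by_ip(lines).items() if n >= threshold)
```

The output of `suspicious_ips` is the kind of signal that could feed an automated response, such as a temporary block, in line with the automation point above.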
22. 360 View of Customer – Dashboards and Analytics
Create in-depth personas for your customers based on how they are actually behaving.
Whenever a customer interacts with an organization, it is vital that the richness of information available on
that customer informs and guides the processes that will help to maximize their experience, while
simultaneously making the interaction as effective and efficient as possible. This includes everything from
avoiding repetition or rekeying of information, to viewing customer history, establishing context and initiating
desired actions.
A total 360 view often contains 3 views:
• The Past
– Understanding how your users acted in the past lets you understand who they are and serve them relevant content and
products
• The Present
– Where are users coming from? What is their experience on your site right now? Do they need help?
• The Future
– Did they buy? Can we serve them more information to help their choice? Can we market to them better?
23. What’s Next? Interactive Processing!
Interact with customers in real-time offering suggestions and inhibiting behavior
What if instead of reacting to behavior we can engage virtually with the user to inhibit behavior?
This is called interactive processing and it takes input from humans and reacts based on patterns and
algorithms.
The quicker we can serve up this interaction to the user, the better equipped we are to inhibit their behavior!
Input data → Process → Output data (source: Teach-ICT.com)
24. Apache Spark
• Introducing support for Apache Spark™
• Apache Spark enables enterprises to combine the breadth of structured and unstructured data with the
speed of in-memory processing to build streaming, machine learning, and graph-optimized applications
that allow businesses to take action at the speed of insight.
25. New Use Cases
• Deeper Integration with SQL Workloads
• Streaming Applications
• Machine Learning
• Iterative Processing
• Real-time Graphical Dashboards
26. Does the delivery method matter?
YES
27. Choose The Best Deployment Model
• Public Cloud
• Managed Cloud
• Your Private Cloud (on Premise)
• Private Cloud
29. Advantages of storing data in the cloud:
• Portability between providers
• Utility Pricing
• Minimal planning needed
• Scale to meet the exact demands
• Integration with data platforms
30. Advantages of Dedicated Hosting/On-Premise
• Dedicated Hosting
– No Capex Investment
– Choose new hardware and software versioning easily
– Rely on extended support personnel
– Increased security options
– Concurrent and predictable performance
• On-Premise
– Control Data Access
– Integrate with core mainframe and systems
– Build your own IP
– Control every aspect of design and operation
31. The Trade Off...
Custom Built
Consistent
Available
Performant
Purpose Built
Elastic
Flexible
On-Demand
32. OnMetal Lets You Scale Like the Internet Giants
Bare metal servers: Instantly Available, API-driven, Highly Specialized, No Hypervisor
“Rackspace Cloud, because of its single-tenant OnMetal line, is the only place on Earth where you can enjoy
Facebook/Google-style infrastructure rented by the hour.”
-Ev Kontsevoy
Director, Product
Rackspace
33. Benefits of Outsourced Hosting
Deliver resources fast
Offload management responsibilities
Scale as you grow
Optimize around specified hardware
34. The Level of Management You Need
Only you can decide what model is best for you!
• DIY
• Platform
• Managed Service
• Turnkey Service
35. Data as a Service: more time building, less time managing databases
• For some businesses, database or
infrastructure management IS core to the
business
• For most software-based businesses, database
or infrastructure management represents time
and resources not spent building the
application
• You must answer for yourself: are you in the
business of managing infrastructure, or in the
business of [your market here]?
Four levels of database management, ordered from the most tasks performed BY the developer (less time can be spent building the application) to the most tasks performed FOR the developer (more time can be spent building the application):
• Level 1: Do-it-yourself database
– Performed by the developer: HW selection, Installation, Patch, Upgrade, Backup/Restore, Monitoring, Replication, Sharding, Scaling, Performance, Availability, Analytics, Optimization, Proactive tasks, Complex admin, App-specific data mgmt
• Level 2: Provisioned database
– Performed for the developer: HW selection, Installation
– Performed by the developer: Patch, Upgrade, Backup/Restore, Monitoring, Replication, Sharding, Scaling, Performance, Availability, Analytics, Optimization, Proactive tasks, Complex admin, App-specific data mgmt
• Level 3: Automated database
– Performed for the developer: HW selection, Installation, Patch, Upgrade, Backup/Restore, Monitoring, Replication
– Performed by the developer: Sharding, Scaling, Performance, Availability, Analytics, Optimization, Proactive tasks, Complex admin, App-specific data mgmt
• Level 4: Data as a Service
– Performed for the developer: HW selection, Installation, Patch, Upgrade, Backup/Restore, Monitoring, Replication, Sharding, Scaling, Performance, Availability, Analytics, Optimization, Proactive tasks, Complex admin
– Performed by the developer: App-specific data mgmt
39. Rackspace Offerings for the Data Tier
• Managed Database Services for Production Apps
– Automatic DBA: Sharding, Backup, & HA
– Entire Stack Optimized on Bare Metal
– Supported 24x7x365 by experts
– More than MongoDB…
• Managed Offerings of Most Popular Big Data, SQL, & NoSQL Databases
– DBA Services: Architecture & Design, Tuning & Monitoring, 24x7x365 Support, Cost Effective
• Infrastructure for Data
– Cloud IaaS: Get started fast
– Dedicated Hosting: Predictable costs & performance
– OnMetal: Cloud Elasticity & Dedicated Performance
40. What’s Next?
1. Sign up for a free trial
2. Want to know more?
– Read my blog and check out the articles: www.baremetalbigdata.com
If your company doesn’t have a robust big data strategy, it’s a real concern. As you can see from the last slide, it’s likely that, regardless of industry, your competitors are building their own big data initiatives.
Today we are all data companies. Examples include Nike, Nest, MapMyFitness, and even old guard companies like John Deere.
Share the Under Armour story. The data they harvest has the potential to impact every part of their business, from how they manage their supply chain to how they interact with their customers. With consumer movements like “the instrumented self”, data is the differentiator.