The AWS cloud computing platform has disrupted big data. Managing big data applications used to be feasible only for well-funded research organizations and large corporations, but no longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. Not only is AWS technology available to everyone, but it is also self-service and on-demand, with innovative technology and flexible, low-cost pricing models that require no commitments. Learn from customer success stories, as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion of the tutorials, public datasets, test drives, and our grants program: all of the resources needed to get you started quickly.
4. Big Data: Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured
70% of this is user-generated content
Unstructured data growth is explosive, with estimates of compound annual growth (CAGR) at 62%
Source: IDC
[Figure: data volume scale: GB, TB, PB, EB, ZB]
5. The amount of information generated during the first day of
a baby’s life today is equivalent to 70 times the information
contained in the Library of Congress
8. Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
[Figure: data volume from 1990 to 2020, generated data vs. data available for analysis, showing the gap between them]
9. Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= Remove constraints
12. Big data and AWS Cloud computing
Big data: Variety, volume, and velocity requiring new tools
Cloud computing: Variety of compute, storage, and networking options
13. Big data and AWS Cloud computing
Big data: Potentially massive datasets
Cloud computing: Massive, virtually unlimited capacity
14. Big data and AWS Cloud computing
Big data: Iterative, experimental style of data manipulation and analysis
Cloud computing: Iterative, experimental style of infrastructure deployment/usage
15. Big data and AWS Cloud computing
Big data: Frequently not steady-state workload; peaks and valleys
Cloud computing: At its most efficient with highly variable workloads
16. Big data and AWS Cloud computing
Big data: Absolute performance not as critical as “time to results”; shared resources are a bottleneck
Cloud computing: Parallel compute projects allow each workgroup to have more autonomy, get faster results
23. Try Amazon Redshift with BI & ETL for Free!
aws.amazon.com/redshift/free-trial
2 months | 750 hours/month | dw2.large SSD instance
160GB of compressed storage per node
Try BI & ETL for free from nine partners at
aws.amazon.com/redshift/partners
24. Amazon Elastic MapReduce
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3, DynamoDB, and Kinesis
25. ODBC and JDBC drivers now for Amazon EMR
Amazon EMR now ships with ODBC and JDBC drivers for Hive, Impala, and HBase
Easier to use popular BI tools like Microsoft Excel, Tableau, MicroStrategy, and QlikView
33. Free steak campaign
Disaster recovery
Web site & media sharing
Facebook app
Ground campaign
SAP & SharePoint
Marketing web site
Business line of sight
Consumer social app
IT operations
Mars exploration ops
Interactive TV apps
Media streaming
Consumer social app
Facebook page
Securities Trading Data Archiving
Financial markets analytics
Web and mobile apps
Big data analytics
Digital media
Ticket pricing optimization
Streaming webcasts
Mobile analytics
Consumer social app
Core IT and media
36. Dropcam is the biggest inbound video service on the Web
More data uploaded per minute than YouTube
Petabytes of data processed every month
Billions of motion events detected
38. 4 months to production
300% speed gain
$500k - $1M in CAPEX saved
45. Cost & Scale
500MM tweets/day = ~20.8MM tweets/hr
2k/tweet is ~12MB/sec, need 6 shards, ~1TB/day
$0.015/hour per shard, $0.028/million PUTs
Kinesis cost is $0.765/hour
Redshift cost is $0.850/hour (for a 2TB dw1.xlarge)
Total: $1.615/hour
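The arithmetic on this slide can be re-derived with a short Python sketch. All rates and prices are copied from the slide; reading "2k/tweet" as 2,000 bytes is an assumption, and the function name `kinesis_usd_per_hour` is made up for illustration. The shard count is left as a parameter because the slide does not show how it was chosen.

```python
# Inputs copied from the slide.
TWEETS_PER_DAY = 500_000_000
SHARD_USD_PER_HOUR = 0.015
PUT_USD_PER_MILLION = 0.028
REDSHIFT_USD_PER_HOUR = 0.850
SLIDE_KINESIS_USD_PER_HOUR = 0.765

# Assumption: "2k/tweet" is read as 2,000 bytes per tweet.
TWEET_BYTES = 2_000

tweets_per_hour = TWEETS_PER_DAY / 24
mb_per_sec = TWEETS_PER_DAY * TWEET_BYTES / 86_400 / 1e6
tb_per_day = TWEETS_PER_DAY * TWEET_BYTES / 1e12


def kinesis_usd_per_hour(shards: int) -> float:
    """Shard-hour charge plus PUT charge for one hour of tweets."""
    put_cost = tweets_per_hour / 1e6 * PUT_USD_PER_MILLION
    return shards * SHARD_USD_PER_HOUR + put_cost


# Total as stated on the slide: Kinesis plus Redshift.
slide_total = SLIDE_KINESIS_USD_PER_HOUR + REDSHIFT_USD_PER_HOUR

print(f"{tweets_per_hour / 1e6:.1f}MM tweets/hr")
print(f"{mb_per_sec:.1f} MB/sec, {tb_per_day:.1f} TB/day")
for shards in (6, 12):
    print(f"{shards} shards: ${kinesis_usd_per_hour(shards):.3f}/hour")
print(f"Slide total: ${slide_total:.3f}/hour")
```

With these inputs, six shards comes to roughly $0.67/hour and twelve shards to roughly $0.76/hour, so the slide's $0.765 figure looks closer to a twelve-shard configuration. That is an inference from the arithmetic, not something the slide states.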
47. “THANKS TO AMAZON WEB SERVICES, WE CAN DELIGHT OUR PLAYERS WORLDWIDE.”
Sami Yliharju | Services Lead
49. The Climate Corporation - Weather Insurance for Farms
Challenge:
Volatile weather is deadly to crops like grapes
Solution:
Built a predictive model based on freely available data:
• 60 years of crop data,
• 14 TB of soil data, and
• 1M government Doppler radar points
50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
51. Foursquare…
33 million users
1.3 million businesses
…generates a lot of data
3.5 billion check-ins
15M+ venues
Terabytes of log data
52. Uses EMR for
Evaluation of new features
Machine learning
Exploratory analysis
Daily customer usage reporting
Long-term trend analysis
53. Benefits of Amazon EMR
Ease-of-Use
“We have decreased the processing time for urgent data-analysis”
Flexibility
To deal with changing requirements & dynamically expand reporting clusters
Costs
“We have reduced our analytics costs by over 50%”
54. Who is checking in?
[Figure: charts of check-ins by gender (Female, Male) and by age (0 to 80)]
62. What is DataXu?
• Digital Marketing Platform, Ad Tech Platform
• Real-time Multivariate Decision System
• 5th Fastest Growing Private Company in U.S. (Inc 500)
• Optimize Digital Marketing Campaigns
– …put the right ad campaign in front of the right customer
– …find customers who left their site without converting
– …find more customers who are likely to convert
– …offer insight into the who, why, when, and where of respondents
• 950,000 times per second
63. Big Data, Little Decisions
The Evolution of Real-Time Decision Systems
[Figure: decision impact (also proportional to risk) vs. decision rate; volume ~ value]
1990’s – “Should we advertise on the Superbowl? Should we run direct mail this qtr.?” Batch mode
2000’s – “How often can we run a permission-based email mktg. campaign?” Rules-based alerts
2010’s – Millions of decisions and actions taken, all in less than a blink of an eye
64. Real Time Bidding
User opens browser
Goes to sports site
Site auctions ads, e.g. Google
DataXu bids (others bid too)
DataXu wins bid
Ad shown, page loads
65. Quick Statistics
• 950K bid requests per second
• Billions of impressions per month, Petabyte of data
• 100 ms round trip response time
• 100+TB of warehouse data
• 3000+ Servers powering the platform
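To get a feel for what these figures imply, here is a small Python sketch that applies Little's law (requests in flight = arrival rate × time in system) to the slide's numbers. It assumes every bid request takes the full 100 ms round trip, which makes the result an upper-bound estimate rather than a measured value; the variable names are illustrative only.

```python
# Figures copied from the slide.
BID_REQUESTS_PER_SEC = 950_000
ROUND_TRIP_MS = 100

# Little's law: L = lambda * W.
# Assumption: each request spends the whole 100 ms in the system.
in_flight = BID_REQUESTS_PER_SEC * ROUND_TRIP_MS / 1000

# Sustained for a full day at the stated rate.
requests_per_day = BID_REQUESTS_PER_SEC * 86_400

print(f"Requests in flight at any instant: ~{in_flight:,.0f}")
print(f"Bid requests per day at this rate: ~{requests_per_day / 1e9:.1f} billion")
```

The per-day figure assumes the 950K rate is sustained around the clock; if it is a peak rate, the real daily volume would be lower.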
66. Why AWS
• Automation, API
• Costs, Pay As You Go
• Auto Scaling (elasticity – up and down)
• All Data in One Place (S3 foundational store)
• Improved Testability
• Security, Privacy
• Disaster Recovery and Business Continuity
67. DataXu Stack
[Architecture diagram. Components: Campaign Management; Business Intelligence; Data Mart; Interactive Queries; Batch Queries; Real Time Bidding System; Activity Logs; 1st Party; 3rd Party; Distributed Log Ingestion; S3/HDFS Warehouse; CDN; User Profiles; Campaign Metadata; ETL; Attribution; Machine Learning; Spend Decision System; Audience Calculation; Uniques/Segment]
Big Velocity: 950K TPS
Big Volume: Petabyte of Data
Big Variety: Data Providers
68. High Level Deployment
[Deployment diagram. Labels: On premise; SSL; Meta; Amazon S3; RTB System; Elastic Load Balancing; Availability Zone (x2); Route 53; EC2; Auto Scaling group (x2); Volumes; AMI; Log Ingestion System; Machine Learning System; EMR; CloudWatch]
69. Traditional Hadoop vs EMR
• Traditional Hadoop
– Anticipate and provision for peaks
– Can't de-couple storage and compute
– 75% of the cluster is idle
– Data duplication/multiple clusters
• EMR to the rescue
• Monthly savings of 72% using EMR
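The link between an idle cluster and the savings from elastic capacity can be illustrated with a short Python sketch. The hourly demand profile below is entirely hypothetical, chosen so that a cluster sized for the peak sits 75% idle as on the slide; the function names are made up for illustration, and the model ignores price differences, startup overhead, and storage costs.

```python
def fixed_node_hours(hourly_demand):
    """Node-hours paid for when the cluster is sized for the peak hour."""
    return max(hourly_demand) * len(hourly_demand)


def elastic_node_hours(hourly_demand):
    """Node-hours paid for when capacity follows demand hour by hour."""
    return sum(hourly_demand)


def idle_fraction(hourly_demand):
    """Share of the peak-sized cluster that does no work."""
    return 1 - elastic_node_hours(hourly_demand) / fixed_node_hours(hourly_demand)


# Hypothetical day: 20 quiet hours at 10 nodes, 4 peak hours at 100 nodes.
demand = [10] * 20 + [100] * 4

print(f"Fixed cluster:   {fixed_node_hours(demand)} node-hours")
print(f"Elastic cluster: {elastic_node_hours(demand)} node-hours")
print(f"Idle fraction of the fixed cluster: {idle_fraction(demand):.0%}")
```

In this simplified model the savings equal the idle fraction exactly. The slide's 72% is the figure reported on the slide itself, not an output of this sketch.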
70. S3 Provides Linearly Scalable Bandwidth
• Big volume workloads involve several datasets together and terabytes of data
• Aggregate bandwidth matters
• S3 scales pretty linearly
S3 Streaming Performance (m1.xlarge @ $0.34/hr)
100 VMs; 9.6GB/s; $34/hr
350 VMs; 28.7GB/s; $119/hr
34 secs per terabyte
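The numbers on this slide can be cross-checked with a few lines of Python. The VM counts, bandwidths, and the m1.xlarge price come from the slide; the use of decimal units (1 TB = 1000 GB) and the helper name `summarize` are assumptions made for this sketch.

```python
# Figures copied from the slide.
PRICE_PER_VM_HOUR = 0.34             # m1.xlarge, USD per hour
RUNS = [(100, 9.6), (350, 28.7)]     # (VM count, aggregate GB/s)


def summarize(vms, gb_per_sec):
    """Derive cost, per-VM throughput, and time per terabyte for one run."""
    return {
        "vms": vms,
        "cost_per_hour": vms * PRICE_PER_VM_HOUR,
        "mb_per_sec_per_vm": gb_per_sec * 1000 / vms,
        "secs_per_tb": 1000 / gb_per_sec,  # assumes 1 TB = 1000 GB
    }


summaries = [summarize(vms, gbps) for vms, gbps in RUNS]

# How much of the ideal linear speed-up the larger run achieves.
vm_ratio = RUNS[1][0] / RUNS[0][0]
bandwidth_ratio = RUNS[1][1] / RUNS[0][1]
scaling_efficiency = bandwidth_ratio / vm_ratio

for s in summaries:
    print(
        f"{s['vms']} VMs: ${s['cost_per_hour']:.0f}/hr, "
        f"{s['mb_per_sec_per_vm']:.0f} MB/s per VM, "
        f"{s['secs_per_tb']:.1f} s per TB"
    )
print(f"Scaling efficiency from 100 to 350 VMs: {scaling_efficiency:.0%}")
```

Going from 100 to 350 VMs (3.5x) raises aggregate bandwidth by about 3x, which is the sense in which the slide calls the scaling "pretty" linear. At 28.7GB/s a decimal terabyte takes a little under 35 seconds, in line with the slide's rounded figure.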