SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Downloaden Sie, um offline zu lesen
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Deep Dive: Amazon EMR
Matt Yanchyshyn - Sr. Manager, Solutions Architecture
Why Amazon EMR?
easy to use
launch a cluster in minutes
low cost
pay an hourly rate
elastic
easily add or remove capacity
reliable
spend less time monitoring
secure
managed firewalls
flexible
you control the cluster
Easy to deploy
AWS Management Console Command line
or use the EMR API with your favorite SDK
Easy to monitor and debug
Monitor Debug
integrated with Amazon CloudWatch
monitor cluster, node, and IO
Try different configurations to find your optimal architecture
CPU
c3 family
cc1.4xlarge
cc2.8xlarge
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m1 family
m3 family
Choose your instance types
Batch Machine Spark and Large
process learning interactive HDFS
Easy to add and remove compute
capacity on your cluster
Match compute
demands with
cluster sizing
Resizable clusters
Spot for
task nodes
Up to 90%
off EC2
on-demand
pricing
On-demand for
core nodes
Standard
Amazon EC2
pricing for
on-demand
capacity
Easy to use Spot Instances
Meet SLA at predictable cost Exceed SLA at lower cost
Read data directly into Hive,
Pig, streaming and cascading
from Amazon Kinesis streams
No intermediate data
persistence required
Simple way to introduce real-time sources into
batch-oriented systems
Multi-application support and automatic
checkpointing
Amazon EMR Integration with Amazon Kinesis
The Hadoop ecosystem can run in Amazon EMR
Bootstrap Actions
Use bootstrap actions to install applications or to configure
Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file – merge values in new config to existing
--keyword-key-value – override values provided
Configuration file
name
Configuration file
keyword
File name shortcut
Key-value pair
shortcut
core-site.xml core C c
hdfs-site.xml hdfs H h
mapred-site.xml mapred M m
yarn-site.xml yarn Y y
Hue
Amazon S3 and HDFS
Hue
Query Editor
Hue
Job Browser
Leverage Amazon S3 with EMRFS
Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon
EMR clusters with no data loss
• Point multiple Amazon EMR
clusters at same data in Amazon
S3
EMR
EMR
Amazon
S3
EMRFS makes it easier to use Amazon S3
• Read-after-write consistency
• Very fast list operations
• Error-handling options
• Support for Amazon S3 encryption
• Transparent to applications: s3://
EMRFS client-side encryption
Amazon S3
AmazonS3encryptionclients
EMRFSenabledfor
AmazonS3client-sideencryption
Key vendor (AWS KMS or your custom key vendor)
(client-side encrypted objects)
HDFS is still there if you need it
• Iterative workloads
– If you’re processing the same dataset more than
once
• Disk I/O-intensive workloads
• Persist data on Amazon S3 and use S3DistCp to
copy to/from HDFS for processing
Amazon EMR – Design Patterns
EMR example #1: Batch processing
GB of logs pushed to
S3 hourly
Daily EMR cluster
using Hive to process
data
Input and output
stored in S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
EMR example #2: Long-running cluster
Data pushed to S3 Daily EMR cluster
ETL data into
database
24/7 EMR cluster running
HBase holds last two years of
data
Front-end service uses
HBase cluster to power
dashboard with high
concurrency
TBs of logs sent
daily
Logs stored in
Amazon S3
Hive metastore
on Amazon EMR
EMR example #3: Interactive query
Interactive query using Presto on multi-petabyte warehouse
http://nflx.it/1dO7Pnt
EMR example #4: Streaming-data processing
TBs of logs sent
daily
Logs stored in
Amazon Kinesis
Amazon Kinesis
Client Library
AWS Lambda
Amazon EMR
Amazon EC2
Optimizations for Storage
File formats
• Row-oriented
– text files
– sequence files
• writable object
– Avro data files
• described by schema
• Columnar format
– Object Record Columnar (ORC)
– Parquet
logical table
row-oriented
column-oriented
Choosing the right file format
• Processing and query tools
– Hive, Impala, and Presto
• Evolution of schema
– Avro for schema and Presto for storage
• File format “splittability”
– Avoid JSON/XML files. Use them as records.
• Compression - block or file
File sizes
• Avoid small files
– Anything smaller than 100 MB
• Each mapper is a single JVM
– CPU time is required to spawn JVMs/mappers
• Fewer files, matching closely to block size
– fewer calls to S3
– fewer network/HDFS requests
Dealing with small files
• Reduce HDFS block size, e.g. 1 MB (default is 128 MB)
– --bootstrap-action s3://elasticmapreduce/bootstrap-
actions/configure-hadoop --args “-m,dfs.block.size=1048576”
• Better: use S3DistCP to combine smaller files together
– S3DistCP takes a pattern and target path to combine smaller
input files to larger ones
– Supply a target size and compression codec
Compression
• Always compress data files on Amazon S3
– reduces network traffic between Amazon S3 and
Amazon EMR
– speeds up your job
• Compress mappers and reducer output
Amazon EMR compresses inter-node traffic with LZO with
Hadoop 1 and Snappy with Hadoop 2
Choosing the right compression
• Time-sensitive, faster compressions are a better choice
• Large amount of data, use space-efficient compressions
• Combined workload, use gzip
Algorithm Splittable? Compression ratio
Compress and
decompress speed
Gzip (DEFLATE) No High Medium
bzip2 Yes Very high Slow
LZO Yes Low Fast
Snappy No Low Very fast
The Nielsen Company and Amazon EMR
Copyright©2012TheNielsenCompany.Confidentialandproprietary.
SOCIAL TV IS A CONSUMER PHENOMENON
Viewers interacting around TV programming through social to connect
with friends, fans, stars, and advertisers in real time.
of smartphone/tablet owners
use devices as second screens
while watching TV
1 Billion84percent
Tweets about U.S.
TV in 2014, sent by
25million people
Copyright©2012TheNielsenCompany.Confidentialandproprietary.
WHAT IS NIELSEN SOCIAL?
The leading provider of social TV measurement, analytics, and audience engagement
solutions.
• 90+ network, agency and
advertiser clients in the U.S.
• Exclusive provider of
Nielsen Twitter TV Ratings
• International presence in
Italy, Australia, and Mexico
Copyright©2012TheNielsenCompany.Confidentialandproprietary.
HOW DOES NIELSEN SOCIAL WORK?
Nielsen Social captures Twitter activity about every TV program across 250+ U.S.
networks, 1900+ brands, movies, and sports.
Twitter provides impressions and demographics for every tweet about TV,
which we aggregate and de-duplicate to produce Nielsen Twitter TV Ratings.
Copyright©2012TheNielsenCompany.Confidentialandproprietary.
NIELSEN SOCIAL – AMAZON EMR JOB FLOW
1) Country Segmentation Cluster
• Input: Global Firehose (500M TPD, 200 GB gz
files on S3)
• Output: US/IT/AU/MX Tweets (150M TPD)
• Data Volatility: Low (+/- 20%)
2) TV Tweet Matching Cluster
• Input: US/IT/AU/MX Tweets (150M TPD)
• Output: TV Tweets (2 to 25M)
• Data Volatility: High (+/- 1200%)
3) TV Analytics Cluster
• Input: TV Tweets (2 to 25M TPD)
• Output: Analytics and Reports
• Data Volatility: High (+/- 1200%)
Amazon EMR Config:
• Multiple Transient Clusters
• On Demand Instances
+ more for spikes
• Type: m4.2xlarge / m3.xl
• Job Freq: daily
Amazon EMR Config:
• Dedicated (24/7)
• Reserved Instances
• Type: m4.2xlarge
• Job Freq: every 10 min
Amazon EMR Config:
• Transient
• Reserved Instances
+ On Demand for spikes
• Type: m2.2xlarge
• Job Freq: hourly/overlap
TWITTER FIREHOSE
TV SCHEDULES
DEMOS & IMPRESSIONS
Copyright©2012TheNielsenCompany.Confidentialandproprietary.
TWITTER FIREHOSE
NIELSEN SOCIAL – AMAZON EMR JOB FLOW
1) Country Segmentation Cluster
• Input: Global Firehose (500M TPD, 200 GB gz
files on S3)
• Output: US/IT/AU/MX Tweets (150M TPD)
• Data Volatility: Low (+/- 20%)
2) TV Tweet Matching Cluster
• Input: US/IT/AU/MX Tweets (150M TPD)
• Output: TV Tweets (2 to 25M)
• Data Volatility: High (+/- 1200%)
3) TV Analytics Cluster
• Input: TV Tweets (2 to 25M TPD)
• Output: Analytics and Reports
• Data Volatility: High (+/- 1200%)
TV SCHEDULES
DEMOS & IMPRESSIONS
Amazon EMR Config:
• Dedicated (24/7)
• Reserved Instances
• Type: m4.2xlarge
• Job Freq: every 10 min
Amazon EMR Config:
• Transient
• Reserved Instances
+ On Demand for spikes
• Type: m2.2xlarge
• Job Freq: hourly/overlap
NIELSEN
TWITTER
TV RATINGS
Takeaway
Cost-saving tips for Amazon EMR
• Use Amazon S3 as your persistent data store
• Only pay for compute when you need it
• Use Amazon EC2 Spot instances to save >80%
• Use Amazon EC2 Reserved instances for steady workloads
• Use CloudWatch alerts to notify you if a cluster is
underutilized, then shut it down (for example, 0 mappers
running for >N hours)
• Contact your AWS sales for pricing options if you are
spending >$10K/mo on Amazon EMR
CHICAGO

Weitere ähnliche Inhalte

Was ist angesagt?

Amazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About PerformanceAmazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About PerformanceDanilo Poccia
 
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)Amazon Web Services
 
What’s New in Amazon Aurora for MySQL and PostgreSQL
What’s New in Amazon Aurora for MySQL and PostgreSQLWhat’s New in Amazon Aurora for MySQL and PostgreSQL
What’s New in Amazon Aurora for MySQL and PostgreSQLAmazon Web Services
 
Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...
Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...
Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...Amazon Web Services
 
Deep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature AnnouncementsDeep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature AnnouncementsAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...
Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...
Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...Amazon Web Services
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017Amazon Web Services
 
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)Amazon Web Services
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Amazon Web Services
 
AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...
AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...
AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...Amazon Web Services
 
AWS Summit London 2014 | Deployment Done Right (300)
AWS Summit London 2014 | Deployment Done Right (300)AWS Summit London 2014 | Deployment Done Right (300)
AWS Summit London 2014 | Deployment Done Right (300)Amazon Web Services
 
(DAT405) Amazon Aurora Deep Dive
(DAT405) Amazon Aurora Deep Dive(DAT405) Amazon Aurora Deep Dive
(DAT405) Amazon Aurora Deep DiveAmazon Web Services
 
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...Amazon Web Services
 
Deep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreDeep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreAmazon Web Services
 
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)Amazon Web Services
 
Getting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesGetting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesAmazon Web Services
 
Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...
Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...
Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...Amazon Web Services
 

Was ist angesagt? (20)

Amazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About PerformanceAmazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About Performance
 
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
 
What’s New in Amazon Aurora for MySQL and PostgreSQL
What’s New in Amazon Aurora for MySQL and PostgreSQLWhat’s New in Amazon Aurora for MySQL and PostgreSQL
What’s New in Amazon Aurora for MySQL and PostgreSQL
 
Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...
Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...
Amazon RDS & Amazon Aurora: Relational Databases on AWS - SRV206 - Atlanta AW...
 
Deep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature AnnouncementsDeep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature Announcements
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...
Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...
Announcing Amazon Aurora with PostgreSQL Compatibility - January 2017 AWS Onl...
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
 
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
 
AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...
AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...
AWS June 2016 Webinar Series - Amazon Aurora Deep Dive - Optimizing Database ...
 
AWS Summit London 2014 | Deployment Done Right (300)
AWS Summit London 2014 | Deployment Done Right (300)AWS Summit London 2014 | Deployment Done Right (300)
AWS Summit London 2014 | Deployment Done Right (300)
 
(DAT405) Amazon Aurora Deep Dive
(DAT405) Amazon Aurora Deep Dive(DAT405) Amazon Aurora Deep Dive
(DAT405) Amazon Aurora Deep Dive
 
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
Amazon RDS for Microsoft SQL: Performance, Security, Best Practices (DAT303) ...
 
Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28
 
Startup Showcase - Mojang
Startup Showcase - MojangStartup Showcase - Mojang
Startup Showcase - Mojang
 
Deep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreDeep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block Store
 
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
 
Getting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesGetting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute Services
 
Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...
Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...
Overview and Best Practices for Amazon Elastic Block Store - September 2016 W...
 

Andere mochten auch

Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel AvivSelf Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel AvivAmazon Web Services
 
OAuth 2.0 refresher Talk
OAuth 2.0 refresher TalkOAuth 2.0 refresher Talk
OAuth 2.0 refresher Talkmarcwan
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
 
Architecting for Greater Security on AWS
Architecting for Greater Security on AWSArchitecting for Greater Security on AWS
Architecting for Greater Security on AWSAmazon Web Services
 
Py.test
Py.testPy.test
Py.testsoasme
 
Masterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM RolesMasterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM RolesMalcolm Duncanson, CISSP
 
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS LambdaAmazon Web Services
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architectureCosta Pissaris
 
Enterprise Master Data Architecture
Enterprise Master Data ArchitectureEnterprise Master Data Architecture
Enterprise Master Data ArchitectureBoris Otto
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
(SEC302) IAM Best Practices To Live By
(SEC302) IAM Best Practices To Live By(SEC302) IAM Best Practices To Live By
(SEC302) IAM Best Practices To Live ByAmazon Web Services
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Databricks
 
AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...
AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...
AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...Amazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 

Andere mochten auch (20)

Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel AvivSelf Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
Self Service Agile Infrastructure for Product Teams - Pop-up Loft Tel Aviv
 
OAuth 2.0 refresher Talk
OAuth 2.0 refresher TalkOAuth 2.0 refresher Talk
OAuth 2.0 refresher Talk
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 
Architecting for Greater Security on AWS
Architecting for Greater Security on AWSArchitecting for Greater Security on AWS
Architecting for Greater Security on AWS
 
Py.test
Py.testPy.test
Py.test
 
Survival Analysis of Web Users
Survival Analysis of Web UsersSurvival Analysis of Web Users
Survival Analysis of Web Users
 
Masterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM RolesMasterless Puppet Using AWS S3 Buckets and IAM Roles
Masterless Puppet Using AWS S3 Buckets and IAM Roles
 
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
 
Enterprise Master Data Architecture
Enterprise Master Data ArchitectureEnterprise Master Data Architecture
Enterprise Master Data Architecture
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
(SEC302) IAM Best Practices To Live By
(SEC302) IAM Best Practices To Live By(SEC302) IAM Best Practices To Live By
(SEC302) IAM Best Practices To Live By
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
AWS Lambda and Serverless Cloud
AWS Lambda and Serverless CloudAWS Lambda and Serverless Cloud
AWS Lambda and Serverless Cloud
 
AWS CloudFormation Masterclass
AWS CloudFormation MasterclassAWS CloudFormation Masterclass
AWS CloudFormation Masterclass
 
AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...
AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...
AWS re:Invent 2016: Enterprise Fundamentals: Design Your Account and VPC Arch...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 

Ähnlich wie Deep Dive: Amazon Elastic MapReduce

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석Amazon Web Services Korea
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceAmazon Web Services
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...Amazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...Amazon Web Services
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
 
(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem
(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem
(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud EcosystemAmazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017Amazon Web Services
 
Deep Dive: Amazon Relational Database Service (March 2017)
Deep Dive: Amazon Relational Database Service (March 2017)Deep Dive: Amazon Relational Database Service (March 2017)
Deep Dive: Amazon Relational Database Service (March 2017)Julien SIMON
 

Ähnlich wie Deep Dive: Amazon Elastic MapReduce (20)

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database Service
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem
(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem
(ARC311) Decoding The Genetic Blueprint Of Life On A Cloud Ecosystem
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
Deep Dive RDS & Aurora - Pop-up Loft TLV 2017
 
Deep Dive: Amazon Relational Database Service (March 2017)
Deep Dive: Amazon Relational Database Service (March 2017)Deep Dive: Amazon Relational Database Service (March 2017)
Deep Dive: Amazon Relational Database Service (March 2017)
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 

Mehr von Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Kürzlich hochgeladen

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Kürzlich hochgeladen (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Deep Dive: Amazon Elastic MapReduce

  • 1. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Deep Dive: Amazon EMR Matt Yanchyshyn - Sr. Manager, Solutions Architecture
  • 2. Why Amazon EMR? easy to use launch a cluster in minutes low cost pay an hourly rate elastic easily add or remove capacity reliable spend less time monitoring secure managed firewalls flexible you control the cluster
  • 3. Easy to deploy AWS Management Console Command line or use the EMR API with your favorite SDK
  • 4. Easy to monitor and debug Monitor Debug integrated with Amazon CloudWatch monitor cluster, node, and IO
  • 5. Try different configurations to find your optimal architecture CPU c3 family cc1.4xlarge cc2.8xlarge Memory m2 family r3 family Disk/IO d2 family i2 family General m1 family m3 family Choose your instance types Batch Machine Spark and Large process learning interactive HDFS
  • 6. Easy to add and remove compute capacity on your cluster Match compute demands with cluster sizing Resizable clusters
  • 7. Spot for task nodes Up to 90% off EC2 on-demand pricing On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Easy to use Spot Instances Meet SLA at predictable cost Exceed SLA at lower cost
  • 8. Read data directly into Hive, Pig, streaming and cascading from Amazon Kinesis streams No intermediate data persistence required Simple way to introduce real-time sources into batch-oriented systems Multi-application support and automatic checkpointing Amazon EMR Integration with Amazon Kinesis
  • 9. The Hadoop ecosystem can run in Amazon EMR
  • 10. Bootstrap Actions Use bootstrap actions to install applications or to configure Hadoop --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --keyword-config-file – merge values in new config to existing --keyword-key-value – override values provided Configuration file name Configuration file keyword File name shortcut Key-value pair shortcut core-site.xml core C c hdfs-site.xml hdfs H h mapred-site.xml mapred M m yarn-site.xml yarn Y y
  • 14. Leverage Amazon S3 with EMRFS
  • 15. Amazon S3 as your persistent data store • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at same data in Amazon S3 EMR EMR Amazon S3
  • 16. EMRFS makes it easier to use Amazon S3 • Read-after-write consistency • Very fast list operations • Error-handling options • Support for Amazon S3 encryption • Transparent to applications: s3://
  • 17. EMRFS client-side encryption Amazon S3 AmazonS3encryptionclients EMRFSenabledfor AmazonS3client-sideencryption Key vendor (AWS KMS or your custom key vendor) (client-side encrypted objects)
  • 18. HDFS is still there if you need it • Iterative workloads – If you’re processing the same dataset more than once • Disk I/O-intensive workloads • Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
  • 19. Amazon EMR – Design Patterns
  • 20. EMR example #1: Batch processing GB of logs pushed to S3 hourly Daily EMR cluster using Hive to process data Input and output stored in S3 250 Amazon EMR jobs per day, processing 30 TB of data http://aws.amazon.com/solutions/case-studies/yelp/
  • 21. EMR example #2: Long-running cluster Data pushed to S3 Daily EMR cluster ETL data into database 24/7 EMR cluster running HBase holds last two years of data Front-end service uses HBase cluster to power dashboard with high concurrency
  • 22. TBs of logs sent daily Logs stored in Amazon S3 Hive metastore on Amazon EMR EMR example #3: Interactive query Interactive query using Presto on multi-petabyte warehouse http://nflx.it/1dO7Pnt
  • 23. EMR example #4: Streaming-data processing TBs of logs sent daily Logs stored in Amazon Kinesis Amazon Kinesis Client Library AWS Lambda Amazon EMR Amazon EC2
  • 25. File formats • Row-oriented – text files – sequence files • writable object – Avro data files • described by schema • Columnar format – Object Record Columnar (ORC) – Parquet logical table row-oriented column-oriented
  • 26. Choosing the right file format • Processing and query tools – Hive, Impala, and Presto • Evolution of schema – Avro for schema and Presto for storage • File format “splittability” – Avoid JSON/XML files. Use them as records. • Compression - block or file
  • 27. File sizes • Avoid small files – Anything smaller than 100 MB • Each mapper is a single JVM – CPU time is required to spawn JVMs/mappers • Fewer files, matching closely to block size – fewer calls to S3 – fewer network/HDFS requests
  • 28. Dealing with small files • Reduce HDFS block size, e.g. 1 MB (default is 128 MB) – --bootstrap-action s3://elasticmapreduce/bootstrap- actions/configure-hadoop --args “-m,dfs.block.size=1048576” • Better: use S3DistCP to combine smaller files together – S3DistCP takes a pattern and target path to combine smaller input files to larger ones – Supply a target size and compression codec
  • 29. Compression • Always compress data files on Amazon S3 – reduces network traffic between Amazon S3 and Amazon EMR – speeds up your job • Compress mappers and reducer output Amazon EMR compresses inter-node traffic with LZO with Hadoop 1 and Snappy with Hadoop 2
  • 30. Choosing the right compression • Time-sensitive, faster compressions are a better choice • Large amount of data, use space-efficient compressions • Combined workload, use gzip Algorithm Splittable? Compression ratio Compress and decompress speed Gzip (DEFLATE) No High Medium bzip2 Yes Very high Slow LZO Yes Low Fast Snappy No Low Very fast
  • 31. The Nielsen Company and Amazon EMR
  • 32. Copyright©2012TheNielsenCompany.Confidentialandproprietary. SOCIAL TV IS A CONSUMER PHENOMENON Viewers interacting around TV programming through social to connect with friends, fans, stars, and advertisers in real time. of smartphone/tablet owners use devices as second screens while watching TV 1 Billion84percent Tweets about U.S. TV in 2014, sent by 25million people
  • 33. Copyright©2012TheNielsenCompany.Confidentialandproprietary. WHAT IS NIELSEN SOCIAL? The leading provider of social TV measurement, analytics, and audience engagement solutions. • 90+ network, agency and advertiser clients in the U.S. • Exclusive provider of Nielsen Twitter TV Ratings • International presence in Italy, Australia, and Mexico
  • 34. Copyright©2012TheNielsenCompany.Confidentialandproprietary. HOW DOES NIELSEN SOCIAL WORK? Nielsen Social captures Twitter activity about every TV program across 250+ U.S. networks, 1900+ brands, movies, and sports. Twitter provides impressions and demographics for every tweet about TV, which we aggregate and de-duplicate to produce Nielsen Twitter TV Ratings.
  • 35. Copyright©2012TheNielsenCompany.Confidentialandproprietary. NIELSEN SOCIAL – AMAZON EMR JOB FLOW 1) Country Segmentation Cluster • Input: Global Firehose (500M TPD, 200 GB gz files on S3) • Output: US/IT/AU/MX Tweets (150M TPD) • Data Volatility: Low (+/- 20%) 2) TV Tweet Matching Cluster • Input: US/IT/AU/MX Tweets (150M TPD) • Output: TV Tweets (2 to 25M) • Data Volatility: High (+/- 1200%) 3) TV Analytics Cluster • Input: TV Tweets (2 to 25M TPD) • Output: Analytics and Reports • Data Volatility: High (+/- 1200%) Amazon EMR Config: • Multiple Transient Clusters • On Demand Instances + more for spikes • Type: m4.2xlarge / m3.xl • Job Freq: daily Amazon EMR Config: • Dedicated (24/7) • Reserved Instances • Type: m4.2xlarge • Job Freq: every 10 min Amazon EMR Config: • Transient • Reserved Instances + On Demand for spikes • Type: m2.2xlarge • Job Freq: hourly/overlap TWITTER FIREHOSE TV SCHEDULES DEMOS & IMPRESSIONS
  • 36. Copyright©2012TheNielsenCompany.Confidentialandproprietary. TWITTER FIREHOSE NIELSEN SOCIAL – AMAZON EMR JOB FLOW 1) Country Segmentation Cluster • Input: Global Firehose (500M TPD, 200 GB gz files on S3) • Output: US/IT/AU/MX Tweets (150M TPD) • Data Volatility: Low (+/- 20%) 2) TV Tweet Matching Cluster • Input: US/IT/AU/MX Tweets (150M TPD) • Output: TV Tweets (2 to 25M) • Data Volatility: High (+/- 1200%) 3) TV Analytics Cluster • Input: TV Tweets (2 to 25M TPD) • Output: Analytics and Reports • Data Volatility: High (+/- 1200%) TV SCHEDULES DEMOS & IMPRESSIONS Amazon EMR Config: • Dedicated (24/7) • Reserved Instances • Type: m4.2xlarge • Job Freq: every 10 min Amazon EMR Config: • Transient • Reserved Instances + On Demand for spikes • Type: m2.2xlarge • Job Freq: hourly/overlap NIELSEN TWITTER TV RATINGS
  • 38. Cost-saving tips for Amazon EMR • Use Amazon S3 as your persistent data store • Only pay for compute when you need it • Use Amazon EC2 Spot instances to save >80% • Use Amazon EC2 Reserved instances for steady workloads • Use CloudWatch alerts to notify you if a cluster is underutilized, then shut it down (for example, 0 mappers running for >N hours) • Contact your AWS sales for pricing options if you are spending >$10K/mo on Amazon EMR