Taming Big Data with Hadoop
Kul Subedi
Introduction
● What is Big Data?
Properties of Big Data
Cont...
● Large and growing data files
● Commonly measured in terabytes or
petabytes
● Unstructured data
● May not fit in a “relational database model”
● Derived from Users, Applications, Systems,
and Sensors
Problem: Data Throughput Mismatch
● Standard spinning hard drive 60-120 MB/sec
● Standard solid state hard drive 250-500
MB/sec
● Hard drive capacity growing
● Online data growth
Moving data on and off of disk is the bottleneck.
Cont...
● One Terabyte (TB) of data will take 10,000
seconds (approximately 167 minutes) to read
at 100 MB/sec (spinning disk)
● One TB of data will take 2,000 seconds
(approximately 33 minutes) to read at
500 MB/sec (solid state)
One TB is a “small” file size
Parallel data access is essential for Big Data.
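The arithmetic behind these read times can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and ideal parallelism, where the data is spread evenly over the drives and every drive is read at full speed; the function name and the 100-disk figure are illustrative only.

```python
def read_time_seconds(size_bytes, throughput_bytes_per_sec, disks=1):
    """Seconds to read size_bytes split evenly across `disks` drives read in parallel."""
    return size_bytes / (throughput_bytes_per_sec * disks)


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

print(read_time_seconds(TB, 100 * MB))       # one spinning disk at 100 MB/sec
print(read_time_seconds(TB, 500 * MB))       # one solid state drive at 500 MB/sec
print(read_time_seconds(TB, 100 * MB, 100))  # 100 spinning disks read in parallel
```

Under these assumptions, spreading the same terabyte over 100 spinning disks cuts the read time from hours to minutes, which is the motivation for parallel data access.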
Problem(1): Scaling
Agenda
● Hadoop Definition
● Hadoop Ecosystem
● History
● Hadoop Design Principle
● HDFS and MapReduce (Demo)
● Conclusion
Definition
● A framework of open-source tools,
libraries, and methodologies for the
distributed processing of large data sets
● Scale up from single servers to thousands of
machines, each offering local computation
and storage
Cont...
● The project includes
❏ Hadoop Common
❏ HDFS
❏ YARN
❏ MapReduce
Cont...
● Other Hadoop-related projects
❏ Pig
❏ Hive
❏ Tez
❏ Spark
❏ HBase
❏ Ambari etc
Hadoop Usage Modes
● Administrators
❏ Installation
❏ Monitor/Manage System
❏ Tune System
● End Users
❏ Design MapReduce Applications
❏ Import/Export Data
❏ Work with various Hadoop Tools
Hadoop History
● Developed by Doug Cutting and Michael J.
Cafarella
● Based on Google MapReduce technology
● Designed to handle large amounts of data
and be robust
● Donated to Apache Software Foundation in
2006 by Yahoo
Cont...
Application areas:
❏ Social media
❏ Retail
❏ Financial services
❏ Web Search
❏ Anywhere there are large amounts of
unstructured data
Cont...
Prominent Users:
❏ Yahoo!
❏ Facebook
❏ Amazon
❏ eBay
❏ American Airlines
❏ The New York Times, and many others
Design Principles
● Moving computation is cheaper than moving
data
● Hardware will fail, manage it
● Hide execution details from the user
● Use streaming data access
● Use a simple file system coherency model
Cont...
What Hadoop is not: A replacement for SQL,
always fast and efficient, good for quick ad-hoc
querying
HDFS
HDFS-Architecture
1. Where do I read or write data?
2. Use these data nodes.
NameNode
● Only one per cluster (the “master node”)
● Stores filesystem metadata such as file
names, permissions, directories, and blocks
● Metadata is kept in RAM for fast access
● Metadata is also persisted to disk
● The NameNode is the brain of the outfit
DataNode
● Many per cluster (the “slave nodes”)
● Stores individual file “blocks” but knows
nothing about them except the block name
● Reports regularly to NameNode
“Hey I am alive, and I have these blocks”
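The division of labour between the NameNode and the DataNodes can be illustrated with a toy in-memory model. This is only a sketch: `ToyNameNode`, its methods, and the block and node names are hypothetical and are not part of any Hadoop API.

```python
class ToyNameNode:
    """Illustrative in-memory metadata store; not the real NameNode API."""

    def __init__(self):
        self.file_blocks = {}      # filename -> ordered list of block names
        self.block_locations = {}  # block name -> set of DataNode ids

    def add_file(self, filename, block_names):
        self.file_blocks[filename] = list(block_names)

    def block_report(self, datanode_id, block_names):
        # A DataNode saying: "Hey I am alive, and I have these blocks"
        for name in block_names:
            self.block_locations.setdefault(name, set()).add(datanode_id)

    def locate(self, filename):
        # Answers the client's question: which DataNodes hold each block?
        return [(block, sorted(self.block_locations.get(block, set())))
                for block in self.file_blocks[filename]]


nn = ToyNameNode()
nn.add_file("/in/input.txt", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
print(nn.locate("/in/input.txt"))
```

The DataNodes only ever mention block names; the mapping from file names to blocks lives solely in the NameNode, which is why it is described as the brain of the cluster.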
HDFS Presents
● Transparency
● Replication
HDFS Properties
● Files are write-once
No in-place updates (Hadoop 2.x allows
appending to the end of a file)
● Disk access is optimized for sequential
reads
Store data in large “blocks” (128 MB by default)
● Avoid corruption
“Blocks” are verified with a checksum when
stored and read
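The block and checksum ideas can be sketched in a few lines. The sketch assumes the 128 MB block size from the slide and uses `zlib.crc32` purely for illustration; it does not reproduce HDFS's own checksum format, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size from the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return -(-file_size_bytes // block_size)  # ceiling division on integers


def store(data):
    """Return the data together with a checksum computed at write time."""
    return data, zlib.crc32(data)


def read(data, stored_checksum):
    """Recompute the checksum at read time and refuse corrupt data."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data


print(block_count(1024**3))  # blocks needed for a 1 GiB file
payload, checksum = store(b"see spot run")
print(read(payload, checksum) == payload)
```

A reader that hits a checksum mismatch on one copy of a block can fall back to another replica, which is where verification and replication work together.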
Cont...
● High throughput
Avoid contention, have system share as
little information and resources as possible
● Fault Tolerant
Loss of a disk, or machine, or rack of
machines should not lead to data loss
Client Reading From HDFS
Client Writing To HDFS
Demo
● File system health check
❏ $ hdfs fsck /
● List the file system content
❏ $ hdfs dfs -ls /
● Create a directory
❏ $ hdfs dfs -mkdir /data1
● Upload a file to HDFS
❏ $ hdfs dfs -put input.txt /in
Cont...
● Input directory in HDFS: /in
● Output directory in HDFS: /output
NameNode High Availability
● NFS Filer
● Quorum Journal Manager (QJM)
MapReduce
● A programming model for processing large
data sets in a distributed fashion over a
cluster of commodity machines
● Introduced by Google
● Uses two key steps: mapping and reducing
Cont...
● Almost all data can be mapped into <key,
value> pairs somehow
● Your keys and values may be of any type:
string, integers, dummy types, and <K,V>
pairs themselves and so on
● Scale-free programming
❏ If a program works for a 1KB file, it can
work for any file size
MapReduce Word Count Data Flow
● Input: see spot run / run spot run / see the cat
● Split: (1) see spot run (2) run spot run (3) see the cat
● Map: (1) see,1 spot,1 run,1 (2) run,1 spot,1 run,1 (3) see,1 the,1 cat,1
● Shuffle: see,1 see,1 | spot,1 spot,1 | run,1 run,1 run,1 | the,1 | cat,1
● Reduce: see,2 | spot,2 | run,3 | the,1 | cat,1
● Output: see,2 spot,2 run,3 the,1 cat,1
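The word count data flow can be simulated in a single process to make the stages concrete. This is a minimal sketch, not how Hadoop executes a job: the function names are illustrative, and everything runs in memory on the three-line sample input.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into a total."""
    return key, sum(values)


splits = ["see spot run", "run spot run", "see the cat"]
mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Each `map_phase` call looks at only one split and each `reduce_phase` call at only one key, so in principle both stages can run on different machines without coordination.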
Example: Hello World Program
● wget www.gutenberg.org/files/2600/2600.txt
● python mapper.py < input.txt | sort | python reducer.py
● cat *.txt | python mapper.py | sort | python reducer.py
Demo
● cat input1.txt | python mapper.py
● cat input1.txt | python mapper.py | sort
● cat input1.txt | python mapper.py | sort | python reducer.py
● cat input1.txt | python mapper.py | sort | python reducer.py > output.txt
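The slides run mapper.py and reducer.py without showing their contents. The sketch below is one plausible word count pair, written as two functions in a single file so it can be run directly; it is an assumption about what such scripts might look like, not the author's code. The tab-separated `word<TAB>count` record format is also an assumption.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a mapper script might print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records with the same key (input must be sorted)."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# In-process equivalent of: cat input1.txt | python mapper.py | sort | python reducer.py
sample = ["see spot run", "run spot run", "see the cat"]
result = list(reducer(sorted(mapper(sample))))
for record in result:
    print(record)
```

The `sort` step matters: the reducer only compares neighbouring records, so identical keys must arrive next to each other, which is what the shuffle stage guarantees on a cluster.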
How to run a job on the cluster?
● Using the Streaming interface
❏ hadoop jar /opt/hadoop/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar
-file ./mapper.py -mapper ./mapper.py
-file ./reducer.py -reducer ./reducer.py
-input /in/input.txt -output /output/run1
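When the streaming job is launched from a script, it can help to assemble the command as an argument list so that long paths are not broken by stray spaces. The helper below is a hypothetical convenience, not part of Hadoop; it assumes reducer.py is meant to be passed with `-reducer`, and it only builds and prints the command.

```python
import shlex


def streaming_command(jar, mapper, reducer, input_path, output_path):
    """Build the argument list for a Hadoop Streaming job (nothing is executed)."""
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", mapper,     # ship the mapper script and use it
        "-file", reducer, "-reducer", reducer,  # ship the reducer script and use it
        "-input", input_path,
        "-output", output_path,
    ]


cmd = streaming_command(
    "/opt/hadoop/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar",
    "./mapper.py",
    "./reducer.py",
    "/in/input.txt",
    "/output/run1",
)
print(shlex.join(cmd))
```

On a machine with Hadoop installed, the list could be passed to `subprocess.run(cmd, check=True)`; the output directory must not already exist in HDFS, or the job will refuse to start.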
Web Interfaces
● NameNode: http://10.0.0.160:50070
● ResourceManager: http://10.0.0.160:8088
● /opt/hadoop/hadoop-2.6.0/share/hadoop/hdfs/webapps/hdfs contains view
Prerequisites
● Java
● ssh
● rsync
Installation
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz.mds
Cont...
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz.mds
Integrity Check
● md5sum hadoop-2.6.0.tar.gz
● cat hadoop-2.6.0.tar.gz.mds | grep -i md5
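The same integrity check can be scripted instead of compared by eye. The sketch below computes a file's MD5 digest with the standard library and compares it with a published value; it assumes the published digest may be written in uppercase with spaces between groups, so it normalizes both before comparing. The function names are illustrative.

```python
import hashlib


def md5_hex(path, chunk_size=1024 * 1024):
    """MD5 digest of a file, read in chunks so large archives do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize(hex_text):
    """Drop whitespace and lowercase so 'AB CD' and 'abcd' compare equal."""
    return "".join(hex_text.split()).lower()


def matches(path, published_md5):
    """True when the file's digest equals the published digest."""
    return md5_hex(path) == normalize(published_md5)
```

A call such as `matches("hadoop-2.6.0.tar.gz", value_copied_from_the_mds_file)` returning `False` means the download should be discarded and fetched again.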
Startup Code
● https://github.com/kpsubedi/BigData
Open Source
● Apache Hadoop http://hadoop.apache.org/
Commercial Big Data Players
● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.com/content/cloudera/en/home.html
● MAPR https://www.mapr.com/
● Others
Conclusion
● Thank you
References
● Apache Hadoop http://hadoop.apache.org/
● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.com/content/cloudera/en/downloads.html
● MAPR https://www.mapr.com/
● The Google File System http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf
References (1)
● MapReduce: Simplified Data Processing on Large Clusters http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Hadoop-2.6.0 Slides

  • 1. Taming Big Data with Hadoop Kul Subedi
  • 4. Cont... ● Large and growing data files ● Commonly measured in terabytes or petabytes ● Unstructured data ● May not fit in a “relational database model” ● Derived from Users, Applications, Systems, and Sensors
  • 5. Problem: Data Throughput Mismatch ● Standard spinning hard drive 60-120 MB/sec ● Standard solid state hard drive 250-500 MB/sec ● Hard drive capacity growing ● Online data growth Moving data on and off of disk is the bottleneck.
  • 6. Cont... ● One Terabyte (TB) of data will take 10,000 seconds (approximately 167 minutes) to read at 100 MB/sec (spinning disk) ● One TB of data will take 2,000 seconds (approximately 33 minutes) to read at 500 MB/sec (solid state) One TB is a “small” file size. Parallel data access is essential for Big Data.
  • 8. Agenda ● Hadoop Definition ● Hadoop Ecosystem ● History ● Hadoop Design Principles ● HDFS and MapReduce (Demo) ● Conclusion
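The read-time arithmetic on this slide can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a spinning-disk rate of 100 MB/sec, which is the rate implied by the 10,000-second figure; the function names are illustrative only.

```python
ONE_TB = 1_000_000_000_000  # decimal terabyte, as the slide's arithmetic implies


def read_time_seconds(size_bytes, throughput_mb_per_sec):
    """Seconds needed to read size_bytes sequentially at the given rate (decimal MB)."""
    return size_bytes / (throughput_mb_per_sec * 1_000_000)


def parallel_read_time_seconds(size_bytes, throughput_mb_per_sec, drives):
    """Idealized time when the data is spread evenly over `drives` disks read in parallel."""
    return read_time_seconds(size_bytes, throughput_mb_per_sec) / drives


for label, rate in [("spinning disk", 100), ("solid state", 500)]:
    secs = read_time_seconds(ONE_TB, rate)
    print(f"{label}: {secs:,.0f} s (~{secs / 60:.0f} min)")
```

Under the same idealized assumptions, `parallel_read_time_seconds(ONE_TB, 100, 100)` comes to 100 seconds, which is the motivation for parallel data access.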
  • 9. Definition ● A framework of open-source tools, libraries, and methodologies for the distributed processing of large data sets ● Scale up from single servers to thousands of machines, each offering local computation and storage
  • 10.
  • 11. Cont... ● The project includes ❏ Hadoop Common ❏ HDFS ❏ YARN ❏ MapReduce
  • 12. Cont... ● Other Hadoop-related projects ❏ Pig ❏ Hive ❏ Tez ❏ Spark ❏ HBase ❏ Ambari, etc.
  • 13. Hadoop Usage Modes ● Administrators ❏ Installation ❏ Monitor/Manage System ❏ Tune System ● End Users ❏ Design MapReduce Applications ❏ Import/Export Data ❏ Work with various Hadoop Tools
  • 14. Hadoop History ● Developed by Doug Cutting and Michael J. Cafarella ● Based on Google MapReduce technology ● Designed to handle large amounts of data and be robust ● Donated to Apache Software Foundation in 2006 by Yahoo
  • 15. Cont... Application areas: ❏ Social media ❏ Retail ❏ Financial services ❏ Web Search ❏ Anywhere there are large amounts of unstructured data
  • 16. Cont... Prominent Users: ❏ Yahoo! ❏ Facebook ❏ Amazon ❏ eBay ❏ American Airlines ❏ The New York Times, and many others
  • 17. Design Principles ● Moving computation is cheaper than moving data ● Hardware will fail, manage it ● Hide execution details from the user ● Use streaming data access ● Use a simple file system coherency model
  • 18. Cont... What Hadoop is not: A replacement for SQL, always fast and efficient, good for quick ad-hoc querying
  • 19. HDFS
  • 20. HDFS Architecture 1. Where do I read or write data? 2. Use these data nodes.
  • 21. NameNode ● Only one per cluster, “master node” ● Stores meta information of the “filesystem” such as filenames, permissions, directories, blocks ● Kept in RAM for fast access ● Persisted to disk ● The NameNode is the brain of the outfit
  • 22. DataNode ● Many per cluster, “slave node” ● Stores individual file “blocks” but knows nothing about them, except the block name ● Reports regularly to NameNode: “Hey, I am alive, and I have these blocks”
  • 24.
  • 25. HDFS Properties ● Files are immutable: no updates, no appends ● Disk access is optimized for sequential reads: data is stored in large “blocks”, 128 MB by default ● Avoid corruption: “blocks” are verified with a checksum when stored and read
  • 26. Cont... ● High throughput: avoid contention, have systems share as little information and as few resources as possible ● Fault tolerant: loss of a disk, or machine, or rack of machines should not lead to data loss
  • 29. Demo ● File System Health check ❏ $ hdfs fsck / ● List the file system content ❏ $ hdfs dfs -ls / ● Create a directory ❏ $ hdfs dfs -mkdir /data1 ● Upload file to HDFS ❏ $ hdfs dfs -put input.txt /in
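The block and checksum ideas on this slide can be illustrated with a small sketch. This is not HDFS code: it uses plain CRC32 from Python's `zlib` as a stand-in checksum and checksums whole blocks, both of which are assumptions made for illustration, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size from the slide


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


def split_with_checksums(data, block_size):
    """Cut data into fixed-size blocks and record a checksum for each at write time."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append((block, zlib.crc32(block)))
    return blocks


def verify(block, stored_checksum):
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(block) == stored_checksum


print(block_count(1024 ** 3))  # blocks needed for a 1 GiB file

# A tiny block size keeps the demonstration readable.
stored = split_with_checksums(b"abcdefghij", block_size=4)
print([verify(block, checksum) for block, checksum in stored])
```

A block whose bytes changed after it was stored would fail `verify`, which is the point at which a system like HDFS would fall back to another replica.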
  • 30. Cont... ● Input directory in HDFS: /in ● Output directory in HDFS: /output
  • 31. NameNode High Availability ● NFS Filer ● Quorum Journal Manager (QJM)
  • 32.
  • 33. MapReduce ● A programming model for processing large data sets in a distributed fashion over a cluster of commodity machines ● Introduced by Google ● Uses two key steps: mapping and reducing
  • 34. Cont... ● Almost all data can be mapped into <key, value> pairs somehow ● Your keys and values may be of any type: strings, integers, dummy types, <K,V> pairs themselves, and so on ● Scale-free programming ❏ If a program works for a 1 KB file, it can work for any file size
  • 35. MapReduce Word Count Data Flow ● Input: see spot run / run spot run / see the cat ● Split: [see spot run] [run spot run] [see the cat] ● Map: [see,1 spot,1 run,1] [run,1 spot,1 run,1] [see,1 the,1 cat,1] ● Shuffle: [see,1 see,1] [spot,1 spot,1] [run,1 run,1 run,1] [the,1] [cat,1] ● Reduce: see,2 spot,2 run,3 the,1 cat,1 ● Output: see,2 spot,2 run,3 the,1 cat,1
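The word count data flow on slide 35 can be sketched in memory, with no Hadoop involved. The function names below are illustrative; only the three input lines come from the slide.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


splits = ["see spot run", "run spot run", "see the cat"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'see': 2, 'spot': 2, 'run': 3, 'the': 1, 'cat': 1}
```

In a real cluster each split would be mapped on a different machine and the shuffle would move pairs across the network; the logic per phase stays the same.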
  • 36. Example: Hello World Program ● wget www.gutenberg.org/files/2600/2600.txt ● python mapper.py < input.txt | sort | python reducer.py ● cat *.txt | python mapper.py | sort | python reducer.py
  • 37. Demo ● cat input1.txt | python mapper.py ● cat input1.txt | python mapper.py | sort ● cat input1.txt | python mapper.py | sort | python reducer.py ● cat input1.txt | python mapper.py | sort | python reducer.py > output.txt
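The slides do not show the contents of `mapper.py` and `reducer.py`. The following is a hypothetical sketch of what a word-count pair might look like, written as two functions so the `mapper | sort | reducer` pipeline can be simulated in one file; the tab-separated record format is an assumption.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper might."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum counts for consecutive records sharing a key; input must be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Stand-in for `cat input | mapper | sort | reducer` on a single machine."""
    return list(reducer(sorted(mapper(lines))))


for record in run_local(["see spot run", "run spot run", "see the cat"]):
    print(record)
```

As standalone scripts, each function would be fed `sys.stdin` and its records printed to standard output. The reducer relies on sorted input, which is why the pipeline includes `sort` between the two stages.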
  • 38. How to run a job in the cluster? ● Using Streaming Interface ❏ hadoop jar /opt/hadoop/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /in/input.txt -output /output/run1
  • 39. Web Interfaces ● NameNode: http://10.0.0.160:50070 ● ResourceManager: http://10.0.0.160:8088 ● /opt/hadoop/hadoop-2.6.0/share/hadoop/hdfs/webapps/hdfs contains view
  • 41. Installation ● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz ● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz.mds
  • 42. Cont... ● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz ● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz.mds
  • 43. Integrity Check ● md5sum hadoop-2.6.0.tar.gz ● cat hadoop-2.6.0.tar.gz.mds | grep -i md5
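The same integrity check can be sketched in Python. The helper names are hypothetical, and the normalization step assumes the published digest may appear as space-separated uppercase hex groups; the exact layout of the `.mds` file should be checked by eye.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize_digest(text):
    """Drop whitespace and lowercase, so '3A 84 ...' compares equal to '3a84...'."""
    return "".join(text.split()).lower()


def matches(path, published_digest):
    """True if the file's MD5 equals the published digest after normalization."""
    return md5_of_file(path) == normalize_digest(published_digest)
```

A call such as `matches("hadoop-2.6.0.tar.gz", digest_copied_from_mds)` would then stand in for comparing the two command outputs by hand. MD5 guards against accidental corruption only, not deliberate tampering.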
  • 45. Open Source ● Apache Hadoop http://hadoop.apache.org/
  • 46. Commercial Big Data Players ● Hortonworks http://hortonworks.com/ ● Cloudera http://www.cloudera.com/content/cloudera/en/home.html ● MAPR https://www.mapr.com/ ● Others
  • 48. References ● Apache Hadoop http://hadoop.apache.org/ ● Hortonworks http://hortonworks.com/ ● Cloudera http://www.cloudera.com/content/cloudera/en/downloads.html ● MAPR https://www.mapr.com/ ● The Google File System http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf
  • 49. References (1) ● MapReduce: Simplified Data Processing on Large Clusters http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf