Hadoop-2.6.0 Slides

Taming Big Data with Hadoop
Kul Subedi

Introduction
● What is Big Data?

Cont...
● Large and growing data files
● Commonly measured in terabytes or
petabytes
● Unstructured data
● May not fit in a “relational database model”
● Derived from Users, Applications, Systems,
and Sensors

Problem: Data Throughput Mis-match
● Standard spinning hard drive 60-120 MB/sec
● Standard solid state hard drive 250-500
MB/sec
● Hard drive capacity growing
● Online data growth
Moving data on and off of disk is the bottleneck.

Cont...
● One Terabyte (TB) of data will take 10,000
seconds (approximately 167 minutes)
● One TB of data will take 2000 seconds
(approximately 33 minutes) to read data at
500 MB/sec (solid state)
One TB is a “small” file size
The need for parallel data access is essential
for Big data.

Agenda
● Hadoop Definition
● Hadoop Ecosystem
● History
● Hadoop Design Principle
● HDFS and MapReduce (Demo)
● Conclusion

Definition
● A framework of open-source tools,
libraries, and methodologies for the
distributed processing of large data sets
● Scale up from single servers to thousands of
machines, each offering local computation
and storage

Cont...
● The project includes
❏ Hadoop Common
❏ HDFS
❏ YARN
❏ MapReduce

Cont...
● Other Hadoop-related projects
❏ Pig
❏ Hive
❏ Tez
❏ Spark
❏ HBase
❏ Ambari etc

Hadoop Usage Modes
● Administrators
❏ Installation
❏ Monitor/Manage System
❏ Tune System
● End Users
❏ Design MapReduce Applications
❏ Import/Export Data
❏ Work with various Hadoop Tools

Hadoop History
● Developed by Doug Cutting and Michael J.
Cafarella
● Based On Google MapReduce technology
● Designed to handle large amounts of data
and be robust
● Donated to Apache Software Foundation in
2006 by Yahoo

Cont...
Application areas:
❏ Social media
❏ Retail
❏ Financial services
❏ Web Search
❏ Everywhere there is large amounts of
unstructured data

Cont...
Prominent Users:
❏ Yahoo!
❏ Facebook
❏ Amazon
❏ Ebay
❏ American Airlines
❏ The New York Time, and many others

Design Principles
● Moving computation is cheaper than moving
data
● Hardware will fail, manage it
● Hide execution details from the user
● Use streaming data access
● Use a simple file system coherency model

Cont...
What Hadoop is not: A replacement for SQL,
always fast and efficient, good for quick ad-hoc
querying

HDFS-Architecture
1.Where do I read or write
data?
2.Use these data nodes.

NameNode
● Only one per cluster “master node”
● Stores meta information of “filesystem” such
as Filename, permission, directories, blocks
● Keep in RAM for fast access
● Persisted to disk
● The namenode is the brain of the outfit

DataNode
● Many per cluster, “slave node”
● Stores individual file “blocks” but knows
nothing about them, accept the block name
● Reports regularly to NameNode
“Hey I am alive, and I have these blocks”

HDFS Presents
● Transparency
● Replication

HDFS Properties
● Files are immutable
No updates, no appends
● Disk access is optimized for sequential
reads
Store data in large “blocks” 128 MB default
● Avoid Corruption
“blocks” are verified with checksum when
stored and read

Cont...
● High throughput
Avoid contention, have system share as
little information and resources as possible
● Fault Tolerant
Loss of a disk, or machine, or rack of
machines should not lead to data loss

Demo
● File System Health check
❏ hdfs fsck /
● List the file system content
❏ $hdfs dfs -ls /
● Create a directory
❏ $hdfs dfs -mkdir /data1
● Upload file to HDFS
❏ $hdfs dfs -put input.txt /in

Cont...
● Input directory in HDFS: /in
● Output directory in HDFS: /output

NameNode High Availability
● NFS Filer
● Quorum Journal Manager (QJM)

MapReduce
● A programming model for processing large
Data in Distributed fashion over cluster of
commodity machines
● Introduced by Google
● Uses two key steps: mapping and reducing

Cont...
● Almost all data can be mapped into <key,
value> pairs somehow
● Your keys and values may be of any type:
string, integers, dummy types, and <K,V>
pairs themselves and so on
● Scale-free programming
❏ If a program works for a 1KB file, it can
work for any file size

see spot run
run spot run
see the cat
run spot run
see spot run
see the cat
see,1
spot,1
run,1
see,1
see,1
spot,1
spot,1
run,1
run,1
run,1
the,1
cat,1
Input
Split
Map
Shuffle
see,1
spot,1
run,1
see,1
the,1
cat,1
see,2
spot,2
run,3
the,1
cat,1
Reduce
see,2
spot,2
run,3
the,1
cat,1
Output
Data Flow
MapReduce Word Count Data Flow

How to run job in Cluster?
● Using Streaming Interface
❏ hadoop jar /opt/hadoop/hadoop-2.6.0
/share/hadoop/tools/lib/hadoop-streaming-
2.6.0.jar -file ./mapper.py -mapper .
/mapper.py -file ./reducer.py -input
/in/input.txt -output /output/run1

Web Interfaces
● NameNode: http://10.0.0.160:50070
● ResourceManager: http://10.0.0.160:8088
● /opt/hadoop/hadoop-2.6.0
/share/hadoop/hdfs/webapps/hdfs contains
view

Prerequisites
● Java
● ssh
● rsync

Installation
● wget apache.osuosl.
org/hadoop/common/hadoop-2.6.0/hadoop-
2.6.0-src.tar.gz
2.6.0-src.tar.gz.mds

Cont...
2.6.0.tar.gz
2.6.0.tar.gz.mds

Integrity Check
● md5sum hadoop-2.6.0.tar.gz
● cat hadoop-2.6.0.tar.gz.mds | grep -i md5

Startup Code
● https://github.com/kpsubedi/BigData

OpenSource
● Apache Hadoop http://hadoop.apache.
org/

Commercial Big Data Players
● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.
com/content/cloudera/en/home.html
● MAPR https://www.mapr.com/
● Others

References
● Apache Hadoop http://hadoop.apache.org/
● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.com/content/cloudera/en/downloads.
html
● MAPR https://www.mapr.com/
● The Google File System http://static.googleusercontent.
com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf

References (1)
● MapReduce: Simplified Data Processing on
Large Clusters http://static.googleusercontent.
com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Hadoop-2.6.0 Slides

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (17)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie Hadoop-2.6.0 Slides

Ähnlich wie Hadoop-2.6.0 Slides (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Hadoop-2.6.0 Slides