The Hadoop Ecosystem
Zohar Elkayam & Ronen Fidel
Brillix
Agenda
• Big Data – The Challenge
• Introduction to Hadoop
– Deep dive into HDFS
– MapReduce and YARN
• Improving Hadoop: tools and extensions
• NoSQL and RDBMS
2
About Brillix
• Brillix is a leading company that specialized in Data
Management
• We provide professional services and consulting for
Databases, Security and Big Data solutions
3
Who am I?
• Zohar Elkayam, CTO at Brillix
• DBA, team leader, instructor and a senior consultant for over 17 years
• Oracle ACE Associate
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com
4
Big Data
"Big Data"??
Different definitions
“Big data exceeds the reach of commonly used hardware environments
and software tools to capture, manage, and process it within a tolerable
elapsed time for its user population.” - Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.”
- The McKinsey Global Institute, 2012
“Big data is a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools.” - Wikipedia, 2014
6
A Success Story
8
More success stories
9
MORE stories..
• Crime Prevention in Los Angeles
• Diagnosis and treatment of genetic diseases
• Investments in the financial sector
• Generation of personalized advertising
• Astronomical discoveries
10
Examples of Big Data Use Cases Today
• MEDIA / ENTERTAINMENT: viewers / advertising effectiveness
• COMMUNICATIONS: location-based advertising
• EDUCATION & RESEARCH: experiment sensor analysis
• CONSUMER PACKAGED GOODS: sentiment analysis of what’s hot, problems
• HEALTH CARE: patient sensors, monitoring, EHRs; quality of care
• LIFE SCIENCES: clinical trials, genomics
• HIGH TECHNOLOGY / INDUSTRIAL MFG.: manufacturing quality, warranty analysis
• OIL & GAS: drilling exploration sensor analysis
• FINANCIAL SERVICES: risk & portfolio analysis, new products
• AUTOMOTIVE: auto sensors reporting location, problems
• RETAIL: consumer sentiment, optimized marketing
• LAW ENFORCEMENT & DEFENSE: threat analysis - social media monitoring, photo analysis
• TRAVEL & TRANSPORTATION: sensor analysis for optimal traffic flows, customer sentiment
• UTILITIES: smart meter analysis for network capacity
• ON-LINE SERVICES / SOCIAL MEDIA: people & career matching, web-site optimization
11
Most Requested Uses of Big Data
• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
12
The Challenge
Big Data Big Problems
• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• Low quality
• And generally messy
Oh, and there is a lot of it
14
The Big Data Challenge
15
Big Data: Challenge to Value
Business Value
• Challenges today: High Variety, High Volume, High Velocity
• Tomorrow: Deep Analytics, High Agility, Massive Scalability, Real Time
16
Volume
• Big data comes in one size: big.
• Size is measured in terabytes (10^12), petabytes (10^15),
exabytes (10^18), zettabytes (10^21)
• The storing and handling of the data becomes an issue
• Producing value out of the data in a reasonable time is an
issue
17
Some Numbers
• How much data in the world?
– 800 Terabytes, 2000
– 160 Exabytes, 2006 (1 EB = 10^18 B)
– 4.5 Zettabytes, 2012 (1 ZB = 10^21 B)
– 44 Zettabytes by 2020
• How much is a zettabyte?
– 1,000,000,000,000,000,000,000 bytes
– A stack of 1TB hard disks that is 25,400 km high
18
Data grows fast!
19
Growth Rate
How much data is generated in a day?
– 7 TB, Twitter
– 10 TB, Facebook
20
Variety
• Big Data extends beyond structured data:
including semi-structured and unstructured
information: logs, text, audio and videos
• Wide variety of rapidly evolving data types
requires highly flexible stores and handling
21
Structured & Un-Structured
Un-Structured        Structured
Objects              Tables
Flexible             Columns and Rows
Structure Unknown    Predefined Structure
Textual and Binary   Mostly Textual
22
Big Data is ANY data:
Unstructured, Semi-Structure and Structured
• Some has fixed structure
• Some is “bring your own structure”
• We want to find value in all of it
23
Data Types by Industry
24
Velocity
• The speed at which the data is being generated and
collected
• Streaming data and large volume data movement
• High velocity of data capture – requires rapid ingestion
• Might cause the backlog problem
25
Global Internet Device Forecast
26
Internet of Things
27
Veracity
• Quality of the data can vary greatly
• Data sources might be messy or corrupted
28
So, What Defines Big Data?
• When we think that we can produce value from that data
and want to handle it
• When the data is too big or moves too fast to handle in a
sensible amount of time
• When the data doesn’t fit conventional database structure
• When the solution becomes part of the problem
29
Handling Big Data
Big Data in Practice
• Big data is big: technological infrastructure solutions
needed
• Big data is messy: data sources must be cleaned
before use
• Big data is complicated: need developers and system
admins to manage intake of data
32
Big Data in Practice (cont.)
• Data must be broken out of silos in order to be mined,
analyzed and transformed into value
• The organization must learn how to communicate and
interpret the results of analysis
33
Infrastructure Challenges
• Infrastructure that is built for:
– Large-scale
– Distributed
– Data-intensive jobs that spread the problem across clusters of
server nodes
34
Infrastructure Challenges (cont.)
• Storage:
– Efficient and cost-effective enough to capture and
store terabytes, if not petabytes, of data
– With intelligent capabilities to reduce your data
footprint such as:
• Data compression
• Automatic data tiering
• Data deduplication
35
Infrastructure Challenges (cont.)
• Network infrastructure that can quickly import large
data sets and then replicate it to various nodes for
processing
• Security capabilities that protect highly-distributed
infrastructure and data
36
Introduction To Hadoop
Apache Hadoop
• Open source project run by Apache (2006)
• Hadoop brings the ability to cheaply process large
amounts of data, regardless of its structure
• It has been the driving force behind the growth of the
big data industry
• Get the public release from:
http://hadoop.apache.org/core/
38
Hadoop Creation History
39
Key points
• An open-source framework that uses a simple programming model to
enable distributed processing of large data sets on clusters of computers.
• The complete technology stack includes
– common utilities
– a distributed file system
– analytics and data storage platforms
– an application layer that manages distributed processing, parallel
computation, workflow, and configuration management
• More cost-effective than conventional approaches for handling large
unstructured data sets, and it offers massive scalability and speed
40
Why use Hadoop?
• Cost: leverages commodity HW & open source SW
• Scalability: near-linear performance up to 1000s of nodes
• Flexibility: versatility with data, analytics & operation
41
No, really, why use Hadoop?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Need common infrastructure
– Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor’s, but
– Workloads are IO bound and not CPU bound
42
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
43
Hadoop Limitations
• Hadoop is scalable but it’s not fast
• Some assembly required
• Batteries not included
• Instrumentation not included either
• DIY mindset
44
Hadoop Components
Hadoop Main Components
• HDFS: Hadoop Distributed File System –
distributed file system that runs in a clustered
environment.
• MapReduce – programming paradigm for
running processes over a clustered
environments.
47
HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
48
HDFS Node Types
HDFS has three types of nodes:
• Namenode (MasterNode)
– Distributes files in the cluster
– Responsible for replication between the datanodes
and for file block locations
• Datanodes
– Responsible for the actual file storage
– Serve file data to clients
• BackupNode (version 0.23 and up)
– A backup of the NameNode
49
Typical implementation
• Nodes are commodity PCs
• 30-40 nodes per rack
• Uplink from racks is 3-4 gigabit
• Rack-internal is 1 gigabit
50
MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and performing
such computations
• An open-source implementation called Hadoop
51
MapReduce paradigm
• Implement two functions:
• MAP - Takes a large problem, divides it into sub-problems,
and performs the same function on each sub-problem
Map(k1, v1) -> list(k2, v2)
• REDUCE - Combine the output from all sub-problems
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else (almost)
• Values with the same key must go to the same reducer
52
Typical large-data problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Map
Reduce
53
Divide and Conquer
54
MapReduce - word count example
function map(String name, String document):
for each word w in document:
emit(w, 1)
function reduce(String word, Iterator
partialCounts):
totalCount = 0
for each count in partialCounts:
totalCount += count
emit(word, totalCount)
55
MapReduce Word Count Process
56
MapReduce Advantages
Example: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
• Runs programs (jobs) across many computers
• Protects against single-server failure by re-running failed steps
• MR jobs can be written in Java, C, Python, Ruby and
others
• Users only write Map and Reduce functions
57
MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
58
MapReduce is OK for...
• Iterative jobs (i.e., graph algorithms)
• Each iteration must read/write data to disk
• IO and latency cost of an iteration is high
59
MapReduce is NOT good for...
• Jobs that need shared state/coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
60
Deep Dive into HDFS
HDFS
• Appears as a single disk
• Runs on top of a native filesystem
– Ext3, Ext4, XFS
• Fault tolerant
– Can handle disk crashes, machine crashes, etc.
• Based on Google's Filesystem (GFS or GoogleFS)
– gfs-sosp2003.pdf:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
– http://en.wikipedia.org/wiki/Google_File_System
62
HDFS is Good for...
• Storing large files
– Terabytes, Petabytes, etc...
– Millions rather than billions of files
– 100MB or more per file
• Streaming data
– Write once and read-many times patterns
– Optimized for streaming reads rather than random reads
– Append operation added to Hadoop 0.21
• “Cheap” Commodity Hardware
– No need for super-computers, use less reliable commodity hardware
63
HDFS is not so good for...
• Low-latency reads
– High-throughput rather than low latency for small chunks of
data
– HBase addresses this issue
• Large amount of small files
– Better for millions of large files instead of billions of small files
• For example each file can be 100MB or more
• Multiple Writers
– Single writer per file
– Writes only at the end of file, no-support for arbitrary offset
64
HDFS: Hadoop Distributed File System
• A given file is broken down into blocks
(default=64MB), then blocks are
replicated across cluster (default=3)
• Optimized for:
– Throughput
– Put/Get/Delete
– Appends
• Block Replication for:
– Durability
– Availability
– Throughput
• Block Replicas are distributed across
servers and racks
65
HDFS Architecture
• Name Node : Maps a file to a
file-id and list of Map Nodes
• Data Node : Maps a block-id to
a physical location on disk
• Secondary Name Node:
Periodic merge of Transaction
log
66
HDFS Daemons
• Filesystem cluster is managed by three types of processes
– Namenode
• manages the File System's namespace/meta-data/file blocks
• Runs on 1 machine to several machines
– Datanode
• Stores and retrieves data blocks
• Reports to Namenode
• Runs on many machines
– Secondary Namenode
• Performs housekeeping work so Namenode doesn’t have to
• Requires similar hardware to the Namenode machine
• Not used for high-availability – not a backup for Namenode
67
Files and Blocks
• Files are split into blocks (single unit of storage)
– Managed by Namenode, stored by Datanode
– Transparent to user
• Replicated across machines at load time
– Same block is stored on multiple machines
– Good for fault-tolerance and access
– Default replication is 3
68
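Replication can also be checked and adjusted per file from the command line. A minimal sketch, assuming the /user/sample/sonnets.txt file used in the later examples already exists in HDFS:
$ hdfs dfs -setrep -w 2 /user/sample/sonnets.txt                 # set replication factor to 2 and wait until it is applied
$ hdfs fsck /user/sample/sonnets.txt -files -blocks -locations   # list the file's blocks and the datanodes holding each replica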
HDFS Blocks
• Blocks are traditionally either 64MB or 128MB
– Default is 128MB
• The motivation is to minimize the cost of seeks as compared to
transfer rate
– 'Time to transfer' > 'Time to seek'
• For example, let’s say
– seek time = 10ms
– transfer rate = 100 MB/s
• To make the seek time 1% of the transfer time
– the block size needs to be ≈ 100MB
69
Block Replication
• Namenode determines replica placement
• Replica placements are rack aware
– Balance between reliability and performance
• Attempts to reduce bandwidth
• Attempts to improve reliability by putting replicas on multiple racks
– Default replication is 3
• 1st replica on the local rack
• 2nd replica on the local rack but different machine
• 3rd replica on a different rack
– This policy may change/improve in the future
70
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes checksum per 512 bytes
– Data Node stores the checksum
• File access
– Client retrieves the data and checksum from Data Node
– If Validation fails, Client tries other replicas
71
Data Pipelining
• Client retrieves a list of Data Nodes on which to place
replicas of a block
• Client writes block to the first Data Node
• The first Data Node forwards the data to the next Data
Node in the Pipeline
• When all replicas are written, the Client moves on to
write the next block in file
72
Client, Namenode, and Datanodes
• Namenode does NOT directly write or read data
– One of the reasons for HDFS’s Scalability
• Client interacts with Namenode to update
Namenode’s HDFS namespace and retrieve block
locations for writing and reading
• Client interacts directly with Datanode to
read/write data
73
Name Node Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of Data Nodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions. etc.
74
Namenode Memory Concerns
• For fast access Namenode keeps all block metadata in-
memory
– The bigger the cluster - the more RAM required
• Best for millions of large files (100mb or more) rather than billions
• Will work well for clusters of 100s machines
• Hadoop 2+
– Namenode Federations
• Each namenode will host part of the blocks
• Horizontally scale the Namenode
– Support for 1000+ machine clusters
75
Using HDFS
Reading Data from HDFS
1. Create FileSystem
2. Open InputStream to a Path
3. Copy bytes using IOUtils
4. Close Stream
77
1: Create FileSystem
• FileSystem fs = FileSystem.get(new
Configuration());
– If you run with yarn command,
DistributedFileSystem (HDFS) will be created
• Utilizes fs.default.name property from configuration
• Recall that Hadoop framework loads core-site.xml which
sets property to hdfs (hdfs://localhost:8020)
78
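For reference, a minimal core-site.xml sketch that would give the behavior described on this slide (property name and port as quoted above; newer Hadoop releases use fs.defaultFS for the same purpose):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>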
2: Open Input Stream to a Path
...
InputStream input = null;
try {
input = fs.open(fileToRead);
...
• fs.open returns org.apache.hadoop.fs.FSDataInputStream
– Another FileSystem implementation will return their own custom
implementation of InputStream
• Opens stream with a default buffer of 4k
• If you want to provide your own buffer size use
– fs.open(Path f, int bufferSize)
79
3: Copy bytes using IOUtils
IOUtils.copyBytes(inputStream, outputStream,
buffer);
• Copy bytes from InputStream to OutputStream
• Hadoop’s IOUtils makes the task simple
– buffer parameter specifies number of bytes to
buffer at a time
80
4: Close Stream
...
} finally {
IOUtils.closeStream(input);
...
• Utilize IOUtils to avoid boilerplate code that catches
IOException
81
ReadFile.java Example
public class ReadFile {
public static void main(String[] args) throws IOException {
Path fileToRead = new Path("/user/sample/sonnets.txt");
FileSystem fs = FileSystem.get(new Configuration()); // 1: Open FileSystem
InputStream input = null;
try {
input = fs.open(fileToRead); // 2: Open InputStream
IOUtils.copyBytes(input, System.out, 4096); // 3: Copy from Input to Output
} finally {
IOUtils.closeStream(input); // 4: Close stream
}
}
}
$ yarn jar my-hadoop-examples.jar hdfs.ReadFile
82
Reading Data - Seek
• FileSystem.open returns FSDataInputStream
– Extension of java.io.DataInputStream
– Supports random access and reading via interfaces:
• PositionedReadable : read chunks of the stream
• Seekable : seek to a particular position in the stream
83
Seeking to a Position
• FSDataInputStream implements Seekable
interface
– void seek(long pos) throws IOException
• Seek to a particular position in the file
• Next read will begin at that position
• If you attempt to seek past the file boundary IOException is emitted
• Somewhat expensive operation – strive for streaming and not seeking
– long getPos() throws IOException
• Returns the current position/offset from the beginning of the
stream/file
84
SeekReadFile.java Example
public class SeekReadFile {
public static void main(String[] args) throws IOException {
Path fileToRead = new Path("/user/sample/readMe.txt");
FileSystem fs = FileSystem.get(new Configuration());
FSDataInputStream input = null;
try {
input = fs.open(fileToRead);
System.out.print("start postion=" + input.getPos() + ": ");
IOUtils.copyBytes(input, System.out, 4096, false);
input.seek(11);
System.out.print("start postion=" + input.getPos() + ": ");
IOUtils.copyBytes(input, System.out, 4096, false);
input.seek(0);
System.out.print("start postion=" + input.getPos() + ": ");
IOUtils.copyBytes(input, System.out, 4096, false);
} finally {
IOUtils.closeStream(input);
}
}
}
85
Run SeekReadFile Example
$ yarn jar my-hadoop-examples.jar hdfs.SeekReadFile
start position=0: Hello from readme.txt
start position=11: readme.txt
start position=0: Hello from readme.txt
86
Write Data
1. Create FileSystem instance
2. Open OutputStream
– FSDataOutputStream in this case
– Open a stream directly to a Path from FileSystem
– Creates all needed directories on the provided path
3. Copy data using IOUtils
87
WriteToFile.java Example
public class WriteToFile {
public static void main(String[] args) throws IOException {
String textToWrite = "Hello HDFS! Elephants are awesome!\n";
InputStream in = new BufferedInputStream(
new ByteArrayInputStream(textToWrite.getBytes()));
Path toHdfs = new Path("/user/sample/writeMe.txt");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf); // 1: Create FileSystem instance
FSDataOutputStream out = fs.create(toHdfs); // 2: Open OutputStream
IOUtils.copyBytes(in, out, conf); // 3: Copy Data
}
}
88
Run WriteToFile
$ yarn jar my-hadoop-examples.jar hdfs.WriteToFile
$ hdfs dfs -cat /user/sample/writeMe.txt
Hello HDFS! Elephants are awesome!
90
MapReduce and YARN
Hadoop MapReduce
• Model for processing large amounts of data in
parallel
– On commodity hardware
– Lots of nodes
• Derived from functional programming
– Map and reduce functions
• Can be implemented in multiple languages
– Java, C++, Ruby, Python (etc...)
92
The MapReduce Model
• Imposes key-value input/output
• Defines map and reduce functions
map: (K1,V1) → list (K2,V2)
reduce: (K2,list(V2)) → list (K3,V3)
1. Map function is applied to every input key-value pair
2. Map function generates intermediate key-value pairs
3. Intermediate key-values are sorted and grouped by key
4. Reduce is applied to sorted and grouped intermediate key-values
5. Reduce emits result key-values
93
MapReduce Programming Model
94
MapReduce in Hadoop (1)
95
MapReduce in Hadoop (2)
96
MapReduce Framework
• Takes care of distributed processing and
coordination
• Scheduling
– Jobs are broken down into smaller chunks called tasks.
– These tasks are scheduled
• Task Localization with Data
– Framework strives to place tasks on the nodes that host
the segment of data to be processed by that specific task
– Code is moved to where the data is
97
MapReduce Framework
• Error Handling
– Failures are an expected behavior so tasks are
automatically re-tried on other machines
• Data Synchronization
– Shuffle and Sort barrier re-arranges and moves data
between machines
– Input and output are coordinated by the framework
98
Map Reduce 2.0 on YARN
• Yet Another Resource Negotiator (YARN)
• Various applications can run on YARN
– MapReduce is just one choice (the main choice at this point)
– http://wiki.apache.org/hadoop/PoweredByYarn
• YARN was designed to address issues with
MapReduce1
– Scalability issues (max ~4,000 machines)
– Inflexible Resource Management
• MapReduce1 had slot based model
99
MapReduce1 vs. YARN
• MapReduce1 runs on top of JobTracker and TaskTracker daemons
– JobTracker schedules tasks, matches task with TaskTrackers
– JobTracker manages MapReduce Jobs, monitors progress
– JobTracker recovers from errors, restarts failed and slow tasks
• MapReduce1 has inflexible slot-based memory management model
– Each TaskTracker is configured at start-up to have N slots
– A task is executed in a single slot
– Slots are configured with maximum memory on cluster start-up
– The model is likely to cause over and under utilization issues
100
MapReduce1 vs. YARN (cont.)
• YARN addresses shortcomings of MapReduce1
– JobTracker is split into 2 daemons
• ResourceManager - administers resources on the cluster
• ApplicationMaster - manages applications such as MapReduce
– Fine-Grained memory management model
• ApplicationMaster requests resources by asking for
“containers” with a certain memory limit (ex 2G)
• YARN administers these containers and enforces memory usage
• Each Application/Job has control of how much memory to
request
101
Daemons
• YARN Daemons
– Node Manager
• Manages resources of a single node
• There is one instance per node in the cluster
– Resource Manager
• Manages Resources for a Cluster
• Instructs Node Manager to allocate resources
• Application negotiates for resources with Resource Manager
• There is only one instance of Resource Manager
• MapReduce Specific Daemon
– MapReduce History Server
• Archives Jobs’ metrics and meta-data
102
Old vs. New Java API
• There are two flavors of MapReduce API which became known as Old and
New
• Old API classes reside under
– org.apache.hadoop.mapred
• New API classes can be found under
– org.apache.hadoop.mapreduce
– org.apache.hadoop.mapreduce.lib
• We will use new API exclusively
• New API was re-designed for easier evolution
• Early Hadoop versions deprecated old API but deprecation was removed
• Do not mix new and old API
103
Developing First
MapReduce Job
MapReduce
• Divided in two phases
– Map phase
– Reduce phase
• Both phases use key-value pairs as input and output
• The implementer provides map and reduce functions
• MapReduce framework orchestrates splitting, and
distributing of Map and Reduce phases
– Most of the pieces can be easily overridden
105
MapReduce
• Job – execution of map and reduce
functions to accomplish a task
– Equal to Java’s main
• Task – single Mapper or Reducer
– Performs work on a fragment of data
106
Map Reduce Flow of Data
107
First Map Reduce Job
• StartsWithCount Job
– Input is a body of text from HDFS
• In this case hamlet.txt
– Split text into tokens
– For each first letter sum up all occurrences
– Output to HDFS
108
Word Count Job
109
Starts With Count Job
1. Configure the Job
– Specify Input, Output, Mapper, Reducer and Combiner
2. Implement Mapper
– Input is text – a line from hamlet.txt
– Tokenize the text and emit first character with a count of
1 - <token, 1>
3. Implement Reducer
– Sum up counts for each letter
– Write out the result to HDFS
4. Run the job
110
1: Configure Job
• Job class
– Encapsulates information about a job
– Controls execution of the job
Job job = Job.getInstance(getConf(), "StartsWithCount");
• A job is packaged within a jar file
– Hadoop Framework distributes the jar on your behalf
– Needs to know which jar file to distribute
– The easiest way to specify the jar that your job resides in is by calling
job.setJarByClass
job.setJarByClass(getClass());
– Hadoop will locate the jar file that contains the provided class
111
1: Configure Job - Specify Input
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
• Can be a file, directory or a file pattern
– Directory is converted to a list of files as an input
• Input is specified by implementation of InputFormat - in this
case TextInputFormat
– Responsible for creating splits and a record reader
– Controls input types of key-value pairs, in this case LongWritable
and Text
– File is broken into lines, mapper will receive 1 line at a time
112
Side Note – Hadoop IO Classes
• Hadoop uses its own serialization mechanism for writing data
in and out of network, database or files
– Optimized for network serialization
– A set of basic types is provided
– Easy to implement your own
• org.apache.hadoop.io package
– LongWritable for Long
– IntWritable for Integer
– Text for String
– Etc...
113
1: Configure Job – Specify Output
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
• OutputFormat defines specification for outputting data from
Map/Reduce job
• Count job utilizes an implementation of
OutputFormat - TextOutputFormat
– Define output path where reducer should place its output
• If path already exists then the job will fail
– Each reducer task writes to its own file
• By default a job is configured to run with a single reducer
– Writes key-value pair as plain text
114
1: Configure Job – Specify Output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
• Specify the output key and value types for
both mapper and reducer functions
– Many times the same type
– If types differ then use
• setMapOutputKeyClass()
• setMapOutputValueClass()
115
1: Configure Job
• Specify Mapper, Reducer and Combiner
– At a minimum will need to implement these classes
– Mappers and Reducer usually have same output
key
job.setMapperClass(StartsWithCountMapper.class);
job.setReducerClass(StartsWithCountReducer.class);
job.setCombinerClass(StartsWithCountReducer.class);
116
1: Configure Job
• job.waitForCompletion(true)
– Submits and waits for completion
– The boolean parameter flag specifies whether
job progress should be printed to the console
– If the job completes successfully ‘true’ is
returned, otherwise ‘false’ is returned
117
Our Count Job is configured to
• Chop up text files into lines
• Send records to mappers as key-value pairs
– Byte offset of the line and the actual line text
• Mapper class is StartsWithCountMapper
– Receives key-value of <LongWritable, Text>
– Outputs key-value of <Text, IntWritable>
• Reducer class is StartsWithCountReducer
– Receives key-value of <Text, IntWritable>
– Outputs key-values of <Text, IntWritable> as text
• Combiner class is StartsWithCountReducer
118
1: Configure Count Job
public class StartsWithCountJob extends Configured implements Tool{
@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "StartsWithCount");
job.setJarByClass(getClass());
// configure output and input source
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
// configure mapper and reducer
job.setMapperClass(StartsWithCountMapper.class);
job.setCombinerClass(StartsWithCountReducer.class);
job.setReducerClass(StartsWithCountReducer.class);
119
StartsWithCountJob.java (cont.)
// configure output
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(
new StartsWithCountJob(), args);
System.exit(exitCode);
}
}
120
2: Implement Mapper class
• Class has 4 Java Generics parameters
– (1) input key (2) input value (3) output key (4) output value
– Input and output utilizes hadoop’s IO framework
• org.apache.hadoop.io
• Your job is to implement map() method
– Input key and value
– Output key and value
– Logic is up to you
• The map() method receives a Context object; use it to:
– Write output
– Create your own counters
121
2: Implement Mapper
public class StartsWithCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable countOne = new IntWritable(1);
private final Text reusableText = new Text();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString());
while (tokenizer.hasMoreTokens()) {
reusableText.set(tokenizer.nextToken().substring(0, 1));
context.write(reusableText, countOne);
}
}
}
122
3: Implement Reducer
• Analogous to Mapper – generic class with four types
– (1) input key (2) input value (3) output key (4) output value
– The output types of map functions must match the input types of reduce
function
• In this case Text and IntWritable
– Map/Reduce framework groups key-value pairs produced by mapper by
key
• For each key there is a set of one or more values
• Input into a reducer is sorted by key
• Known as Shuffle and Sort
– Reduce function accepts key->setOfValues and outputs key-value pairs
• Also utilizes Context object (similar to Mapper)
123
3: Implement Reducer
public class StartsWithCountReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text token,
Iterable<IntWritable> counts,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts) {
sum+= count.get();
}
context.write(token, new IntWritable(sum));
}
}
124
3: Reducer as a Combiner
• Combine data per Mapper task to reduce amount of
data transferred to reduce phase
• Reducer can very often serve as a combiner
– Only works if reducer’s output key-value pair types are the
same as mapper’s output types
• Combiners are not guaranteed to run
– Optimization only
– Not for critical logic
• More about combiners later
125
4: Run Count Job
$ yarn jar my-hadoop-examples.jar \
mr.wordcount.StartsWithCountJob \
/user/sample/readme.txt \
/user/sample/wordcount
126
Output of Count Job
• Output is written to the configured output
directory
– /user/sample/wordCount/
• One output file per Reducer
– part-r-xxxxx format
• Output is driven by TextOutputFormat class
127
$yarn command
• The yarn script with the jar command launches a JVM
and executes the provided Job
$ yarn jar HadoopSamples.jar \
mr.wordcount.StartsWithCountJob \
/user/sample/hamlet.txt \
/user/sample/wordcount/
• You could use straight java but yarn script is more convenient
– Adds hadoop’s libraries to CLASSPATH
– Adds hadoop’s configurations to Configuration object
• Ex: core-site.xml, mapred-site.xml, *.xml
– You can also utilize $HADOOP_CLASSPATH environment variable
128
Input and Output
MapReduce Theory
• Map and Reduce functions consume input and produce output
– Input and output can range from Text to Complex data
structures
– Specified via Job’s configuration
– Relatively easy to implement your own
• Generally we can treat the flow as
map: (K1,V1) → list (K2,V2)
reduce: (K2,list(V2)) → list (K3,V3)
– Reduce input types are the same as map output types
130
Map Reduce Flow of Data
map: (K1,V1) → list (K2,V2)
reduce: (K2,list(V2)) → list (K3,V3)
131
Key and Value Types
• Utilizes Hadoop’s serialization mechanism for writing
data in and out of network, database or files
– Optimized for network serialization
– A set of basic types is provided
– Easy to implement your own
• Extends Writable interface
– Framework’s serialization mechanisms
– Defines how to read and write fields
– org.apache.hadoop.io package
132
Key and Value Types
• Keys must implement WritableComparable
interface
– Extends Writable and java.lang.Comparable<T>
– Required because keys are sorted prior to the reduce phase
• Hadoop is shipped with many default
implementations of WritableComparable<T>
– Wrappers for primitives (String, Integer, etc...)
– Or you can implement your own
133
WritableComparable<T>
Implementations
Hadoop’s Class Explanation
BooleanWritable Boolean implementation
BytesWritable Bytes implementation
DoubleWritable Double implementation
FloatWritable Float implementation
IntWritable Int implementation
LongWritable Long implementation
NullWritable Writable with no data
134
Implement Custom
WritableComparable<T>
• Implement 3 methods
– write(DataOutput)
• Serialize your attributes
– readFields(DataInput)
• De-Serialize your attributes
– compareTo(T)
• Identify how to order your objects
• If your custom object is used as the key it will be sorted
prior to reduce phase
135
BlogWritable – Implementation
of WritableComparable<T>
public class BlogWritable implements
WritableComparable<BlogWritable> {
private String author;
private String content;
public BlogWritable(){}
public BlogWritable(String author, String content) {
this.author = author;
this.content = content;
}
public String getAuthor() {
return author;
}
public String getContent() {
return content;
}
...
...
136
BlogWritable – Implementation
of WritableComparable<T>
...
@Override
public void readFields(DataInput input) throws IOException {
author = input.readUTF();
content = input.readUTF();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeUTF(author);
output.writeUTF(content);
}
@Override
public int compareTo(BlogWritable other) {
return author.compareTo(other.author);
}
}
137
Mapper
• Extend Mapper class
– Mapper<KeyIn, ValueIn, KeyOut, ValueOut>
• Simple life-cycle
1. The framework first calls setup(Context)
2. for each key/value pair in the split:
• map(Key, Value, Context)
3. Finally cleanup(Context) is called
138
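A minimal Java sketch of that life-cycle, using a hypothetical mapper (not part of the StartsWithCount example; imports from org.apache.hadoop.io and org.apache.hadoop.mapreduce are omitted, as in the other examples):
public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void setup(Context context) {
    // 1: called once per task, before any map() call - read configuration, open resources
  }
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // 2: called once for every key-value pair in the task's split
    context.write(new Text("lines"), new IntWritable(1));
  }
  @Override
  protected void cleanup(Context context) {
    // 3: called once per task, after the last map() call - release resources
  }
}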
InputSplit
• Splits are a set of logically arranged records
– A set of lines in a file
– A set of rows in a database table
• Each instance of mapper will process a single split
– Map instance processes one record at a time
• map(k,v) is called for each record
• Splits are implemented by extending InputSplit
class
139
InputSplit
• Framework provides many options for
InputSplit implementations
– Hadoop’s FileSplit
– HBase’s TableSplit
• Don’t usually need to deal with splits directly
– InputFormat’s responsibility
140
Combiner
• Runs on output of map function
• Produces output with the same key-value types as the map output
map: (K1,V1) → list (K2,V2)
combine: (K2,list(V2)) → list (K2,V2)
reduce: (K2,list(V2)) → list (K3,V3)
• Optimization to reduce bandwidth
– NO guarantees on being called
– May be applied to only a subset of map outputs
• Often is the same class as Reducer
• Each combine processes output from a single split
141
Combiner Data Flow
142
Sample StartsWithCountJob
Run without Combiner
143
Sample StartsWithCountJob
Run with Combiner
144
Specify Combiner Function
• To implement Combiner extend Reducer
class
• Set combiner on Job class
– job.setCombinerClass(StartsWithCountReducer.class);
145
Reducer
• Extend Reducer class
– Reducer<KeyIn, ValueIn, KeyOut, ValueOut>
– KeyIn and ValueIn types must match output types of mapper
• Receives input from mappers’ output
– Sorted on key
– Grouped on key of key-values produced by mappers
– Input is directed by Partitioner implementation
• Simple life-cycle – similar to Mapper
– The framework first calls setup(Context)
– for each key → list(value) calls
• reduce(Key, Values, Context)
– Finally cleanup(Context) is called
146
Reducer
• Can configure more than 1 reducer
– job.setNumReduceTasks(10);
– mapreduce.job.reduces property
• job.getConfiguration().setInt("mapreduce.job.reduces", 10)
• Partitioner implementation directs key-value pairs to the
proper reducer task
– A partition is processed by a reduce task
• # of partitions = # of reduce tasks
– Default strategy is to hash key to determine partition
implemented by HashPartitioner<K, V>
147
Partitioner Data Flow
148
HashPartitioner
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
• Calculate Index of Partition:
– Convert key’s hash into non-negative number
• Logical AND with maximum integer value
– Modulo by number of reduce tasks
• In case of more than 1 reducer
– Records distributed evenly across available reduce tasks
• Assuming a good hashCode() function
– Records with same key will make it into the same reduce task
– Code is independent from the # of partitions/reducers specified
149
Custom Partitioner
public class CustomPartitioner
extends Partitioner<Text, BlogWritable>{
@Override
public int getPartition(Text key, BlogWritable blog,
int numReduceTasks) {
int positiveHash =
blog.getAuthor().hashCode()& Integer.MAX_VALUE;
//Use author’s hash only, AND with
//max integer to get a positive value
return positiveHash % numReduceTasks;
}
}
• All blogs with the same author will end up in the same reduce task
150
Component Overview
151
Improving Hadoop
Improving Hadoop
• Core Hadoop is complicated so some tools were
added to make things easier
• Hadoop Distributions collect these tools and
release them as a whole package
153
Noticeable Distributions
• Cloudera
• MapR
• HortonWorks
• Amazon EMR
154
HADOOP Technology Eco System
155
Improving Programmability
• Pig: Programming language that simplifies
Hadoop actions: loading, transforming and
sorting data
• Hive: enables Hadoop to operate as a data
warehouse using SQL-like syntax.
156
Pig
• “is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. “
• Top Level Apache Project
– http://pig.apache.org
• Pig is an abstraction on top of Hadoop
– Provides high level programming language designed for data processing
– Converted into MapReduce and executed on Hadoop Clusters
• Pig is widely accepted and used
– Yahoo!, Twitter, Netflix, etc...
157
Pig and MapReduce
• MapReduce requires programmers
– Must think in terms of map and reduce functions
– More than likely will require Java programmers
• Pig provides high-level language that can be used by
– Analysts
– Data Scientists
– Statisticians
– Etc...
• Originally implemented at Yahoo! to allow analysts to access
data
158
Pig’s Features
• Join Datasets
• Sort Datasets
• Filter
• Data Types
• Group By
• User Defined Functions
159
Pig’s Use Cases
• Extract Transform Load (ETL)
– Ex: Processing large amounts of log data
• clean bad entries, join with other data-sets
• Research of “raw” information
– Ex. User Audit Logs
– Schema maybe unknown or inconsistent
– Data Scientists and Analysts may like Pig’s data
transformation paradigm
160
Pig Components
• Pig Latin
– Command based language
– Designed specifically for data transformation and flow expression
• Execution Environment
– The environment in which Pig Latin commands are executed
– Currently there is support for Local and Hadoop modes
• Pig compiler converts Pig Latin to MapReduce
– Compiler strives to optimize execution
– You automatically get optimization improvements with Pig updates
161
Pig Code Example
162
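The original slide shows the example as a screenshot; here is a minimal Pig Latin sketch in the same spirit (the file name, schema and filter are illustrative assumptions, not the deck's actual example):
-- load a tab-delimited log file from HDFS (path and schema are assumed)
logs = LOAD '/user/sample/excite-small.log' AS (user:chararray, time:long, query:chararray);
-- keep only records that actually contain a query
clean = FILTER logs BY query IS NOT NULL AND query != '';
-- count queries per user
grouped = GROUP clean BY user;
counts = FOREACH grouped GENERATE group AS user, COUNT(clean) AS n;
-- write the result back to HDFS; the script is converted to MapReduce jobs on execution
STORE counts INTO '/user/sample/query-counts';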
Hive
• Data Warehousing Solution built on top of Hadoop
• Provides SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are target audience
• Early Hive development work started at Facebook in
2007
• Today Hive is an Apache project under Hadoop
– http://hive.apache.org
163
Hive Provides
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing
and summarizing large amounts of data
• Access to files on various data stores such as
HDFS and HBase
164
When not to use Hive
• Hive does NOT provide low latency or real time queries
• Even querying small amounts of data may take minutes
• Designed for scalability and ease-of-use rather than low
latency responses
165
Hive
• Translates HiveQL statements into a set of MapReduce Jobs
which are then executed on a Hadoop Cluster
166
Hive Metastore
• To support features like schema(s) and data partitioning
Hive keeps its metadata in a Relational Database
– Packaged with Derby, a lightweight embedded SQL DB
• The default Derby-based metastore is good for evaluation and testing
• Schema is not shared between users as each user has their own
instance of embedded Derby
• Stored in the metastore_db directory, which resides in the directory that
Hive was started from
– Can easily switch to another SQL installation such as MySQL
167
Hive Architecture
168
1: Create a Table
• Let’s create a table to store data from $PLAY_AREA/data/user-posts.txt
169
1: Create a Table
170
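The original slide shows the DDL as a screenshot; a minimal HiveQL sketch for such a table (column names, types and the comma delimiter are assumptions based on the file name, not the deck's actual definition):
CREATE TABLE posts (username STRING, post STRING, post_time BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

SHOW TABLES;
DESCRIBE posts;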
2: Load Data Into a Table
171
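Also a screenshot in the original; a hedged sketch of loading the file mentioned above into that table ($PLAY_AREA is the deck's environment variable, so the local path prefix below is only a placeholder):
LOAD DATA LOCAL INPATH '/home/hadoop/play_area/data/user-posts.txt'
OVERWRITE INTO TABLE posts;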
3: Query Data
172
3: Query Data
173
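The query slides are screenshots as well; two hedged HiveQL examples against the assumed posts table:
SELECT * FROM posts LIMIT 10;

SELECT username, COUNT(*) AS num_posts
FROM posts
GROUP BY username
ORDER BY num_posts DESC;
Each statement is compiled into one or more MapReduce jobs, which is why even these small queries take noticeably longer than on an RDBMS.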
Databases and DB Connectivity
• HBase: column oriented database that runs on
HDFS.
• Sqoop: a tool designed to import data from
relational databases into Hadoop (HDFS or Hive).
174
HBase
• Distributed column-oriented database built on
top of HDFS, providing Big Table-like capabilities
for Hadoop
175
When do we use HBase?
• Huge volumes of randomly accessed data.
• HBase is at its best when it’s accessed in a distributed
fashion by many clients.
• Consider HBase when you’re loading data by key,
searching data by key (or range), serving data by key,
querying data by key or when storing data by row that
doesn’t conform well to a schema.
176
When not to use HBase
• HBase doesn’t use SQL, doesn’t have an optimizer, and
doesn’t support transactions or joins.
• If you need those things, you probably can’t use
HBase
177
HBase Example
Example:
create 'blogposts', 'post', 'image' ---create table
put 'blogposts', 'id1', 'post:title', 'Hello World' ---insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post' ---insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg' ---insert value
get 'blogposts', 'id1' ---select records
178
Sqoop
• Sqoop is a command line tool for moving data from RDBMS to Hadoop
• Uses MapReduce program or Hive to load the data
• Can also export data from Hadoop (HDFS) back to an RDBMS
• Comes with connectors to MySQL, PostgreSQL, Oracle, SQL Server and
DB2.
Example:
$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
--table lineitem --hive-import
$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
--table lineitem --export-dir /data/lineitemData
179
Improving Hadoop – More useful tools
• For improving coordination: Zookeeper
• For Improving log collection: Flume
• For improving scheduling/orchestration: Oozie
• For Monitoring: Chukwa
• For Improving UI: Hue
180
ZooKeeper
• ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services
• It allows distributed processes to coordinate with each other
through a shared hierarchal namespace which is organized
similarly to a standard file system
• ZooKeeper stamps each update with a number that reflects the
order of all ZooKeeper transactions
181
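A minimal Java sketch of working with that shared namespace (connection string, znode path and data are illustrative; standard ZooKeeper client API, imports from org.apache.zookeeper omitted):
ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { /* watch/connection events */ });

// create a node in the hierarchical namespace (assumes the parent /myapp node already exists)
zk.create("/myapp/config", "v1".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

// read it back and register a watch so this client is notified when it changes
byte[] data = zk.getData("/myapp/config", true, null);

zk.close();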
Flume
• Flume is a distributed system for collecting log data
from many sources, aggregating it, and writing it to
HDFS
• Flume maintains a central list of ongoing data flows,
stored redundantly in Zookeeper.
182
Oozie
• Oozie is a workflow scheduler system to manage
Hadoop jobs
• An Oozie workflow is a collection of actions arranged in a
control-dependency DAG that specifies the order in which
actions execute
• The Oozie Coordinator system allows the user to define
workflow execution based on time intervals or on demand
183
Spark
Fast and general MapReduce-like engine for large-scale data processing
• Fast
– In-memory data storage for very fast interactive queries; up to 100 times
faster than Hadoop
• General
– Unified platform that can combine SQL, machine learning, streaming, graph &
complex analytics
• Ease of use
– Can be developed in Java, Scala or Python
• Integrated with Hadoop
– Can read from HDFS, HBase, Cassandra, and any Hadoop data source.
184
Spark is the Most Active Open Source
Project in Big Data
185
The Spark Community
186
Key Concepts
Resilient Distributed Datasets
• Collections of objects spread
across a cluster, stored in RAM or
on Disk
• Built through parallel
transformations
• Automatically rebuilt on failure
Operations
• Transformations
(e.g. map, filter, groupBy)
• Actions
(e.g. count, collect, save)
Write programs in terms of transformations on
distributed datasets
187
Unified Platform
• Continued innovation bringing new functionality, e.g.:
• Java 8 (closures, lambda expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
188
Language Support
189
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop
InputFormat
• HBase
• Can also read from any other Hadoop data source.
190
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that
can be operated on in parallel
– Parallelized Collection: Scala collection which is
run in parallel
– Hadoop Dataset: records of files supported by
Hadoop
191
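A minimal sketch using Spark's Java API to tie these pieces together (app name and HDFS path are assumptions; imports from org.apache.spark and org.apache.spark.api.java omitted):
SparkConf conf = new SparkConf().setAppName("LineCount");
JavaSparkContext sc = new JavaSparkContext(conf);

// an RDD built from a Hadoop data source
JavaRDD<String> lines = sc.textFile("hdfs:///user/sample/hamlet.txt");

// transformation (lazy): keep only non-empty lines
JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());

// action: triggers the actual distributed computation
long count = nonEmpty.count();
System.out.println("non-empty lines: " + count);

sc.close();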
Hadoop Tools
192
Hadoop cluster
Cluster of machine running Hadoop at Yahoo! (credit: Yahoo!)
193
Big Data and NoSQL
The Challenge
• We want scalable, durable, high volume, high
velocity, distributed data storage that can handle
non-structured data and that will fit our specific
need
• RDBMS is too generic and doesn’t cut it any more –
it can do the job but it is not cost-effective for our
usage
195
The Solution: NoSQL
• Let’s take some parts of the standard RDBMS out
and design the solution for our specific uses
• NoSQL databases have been around for ages
under different names/solutions
196
Example Comparison: RDBMS vs. Hadoop
                      Traditional RDBMS            Hadoop
Data Size             Gigabytes                    Petabytes
Access                Interactive and Batch        Batch (NOT Interactive)
Updates               Read / Write many times      Write once, Read many times
Structure             Static Schema                Dynamic Schema
Scaling               Nonlinear                    Linear
Query Response Time   Can be near immediate        Has latency (due to batch processing)
197
Hadoop And Relational Database
Hadoop - Best Used For:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
• Cheaper compared to RDBMS
Relational Database - Best Used For:
• Interactive OLAP Analytics (<1 sec)
• Multistep Transactions
• 100% SQL Compliance
Best when used together
198
The NOSQL Movement
• NOSQL is not a technology – it’s a concept.
• We need high performance, scale out abilities or
an agile structure.
• We are now willing to sacrifice our sacred cows:
consistency, transactions.
• Over 200 different brands and solutions
(http://nosql-database.org/).
199
NoSQL, NOSQL or NewSQL
• NoSQL is not No to SQL
• NoSQL is not Never SQL
• NOSQL = Not Only SQL
200
Why NoSQL?
• Some applications need very few database features,
but need high scale.
• Desire to avoid data/schema pre-design altogether
for simple applications.
• Need for a low-latency, low-overhead API to access
data.
• Simplicity -- do not need fancy indexing – just fast
lookup by primary key.
201
Why NoSQL? (cont.)
• Developer friendly, DBAs not needed (?).
• Schema-less.
• Agile: non-structured (or semi-structured).
• In Memory.
• No (or loose) Transactions.
• No joins.
202
203
Is NoSQL an RDBMS Replacement?
NO
Well... sometimes it is…
204
RDBMS vs. NoSQL
Rationale for choosing a persistent store:
Relational Architecture                              NoSQL Architecture
High value, high density, complex data               Low value, low density, simple data
Complex data relationships                           Very simple relationships
Schema-centric                                       Schema-free, unstructured or semi-structured data
Designed to scale up & out                           Distributed storage and processing
Lots of general-purpose features/functionality       Stripped down, special-purpose data store
High overhead ($ per operation)                      Low overhead ($ per operation)
205
Scalability and Consistency
Scalability
• NoSQL is sometimes very easy to scale out
• Most have dynamic data partitioning and easy data
distribution
• But distributed systems always come with a price:
the CAP theorem and its impact on ACID transactions
207
ACID Transactions
Most DBMS are built with ACID transactions in mind:
• Atomicity: All or nothing, performs write operations as a single
transaction
• Consistency: Any transaction will take the DB from one
consistent state to another with no broken constraints,
ensures replicas are identical on different nodes
• Isolation: Other operations cannot access data that has been
modified during a transaction that has not been completed yet
• Durability: Ability to recover the committed transaction
updates against any kind of system failure (transaction log)
208
ACID Transactions (cont.)
• ACID is usually implemented by a locking
mechanism/manager
• In distributed systems, central locking can be a
bottleneck
• Most NoSQL databases do not use (or limit) ACID
transactions and replace them with something
else…
209
• The CAP theorem states that in a
distributed/partitioned application, you
can only pick two of the following
three characteristics:
– Consistency.
– Availability.
– Partition Tolerance.
CAP Theorem
210
CAP in Practice
211
NoSQL BASE
• NoSQL usually provide BASE characteristics instead of ACID.
BASE stands for:
– Basically Available
– Soft State
– Eventual Consistency
• It means that when an update is made in one place, the other
partitions will see it over time - there might be an inconsistency
window
• read and write operations complete more quickly, lowering
latency
212
Eventual Consistency
213
Types of NoSQL
NoSQL Taxonomy
Type              Examples
Key-Value Store   Amazon DynamoDB, Berkeley DB, Redis, Riak, Cassandra
Document Store    MongoDB, CouchDB, CouchBase
Column Store      Google BigTable, HBase, Amazon SimpleDB, Cassandra
Graph Store       Neo4j, InfiniteGraph, RDF
215
NoSQL Map (diagram): the typical RDBMS sits in the SQL comfort zone; key-value
stores, column stores, document databases and graph databases are positioned
against it by data size, complexity and performance.
216
Key Value Store
• Distributed hash tables.
• Very fast to get a single value.
• Examples:
– Amazon DynamoDB
– Berkeley DB
– Redis
– Riak
– Cassandra
217
Document Store
• Similar to Key/Value, but value is a document
• JSON or something similar, flexible schema
• Agile technology
• Examples:
– MongoDB
– CouchDB
– CouchBase
218
Column Store
• One key, multiple attributes
• Hybrid row/column
• Examples:
– Google BigTable
– HBase
– Amazon’s SimpleDB
– Cassandra
219
How Records are Organized?
• This is a logical table in RDBMS systems
• Its physical organization is just like the logical
one: column by column, row by row
(diagram: a logical table with rows Row 1–Row 4 and columns Col 1–Col 4)
220
Query Data
• When we query data, records
are read at the order they are
organized in the physical
structure
• Even when we query a single
column, we still need to read
the entire table and extract the
column
(diagram: the same table, rows Row 1–Row 4 and columns Col 1–Col 4, read row by row)
Select Col2
From MyTable
Select *
From MyTable
221
How Does a Column Store Keep Data?
(diagram: organization in a row store vs. organization in a column store)
Select Col2
From MyTable
222
Graph Store
• Inspired by Graph Theory
• Data model: Nodes, relationships, properties
on both
• Relational databases have a very hard time
representing a graph in the database
• Example:
– Neo4j
– InfiniteGraph
– RDF
223
• An abstract representation of a set of objects
where some pairs are connected by links.
• Object (Vertex, Node) – can have attributes like
name and value
• Link (Edge, Arc, Relationship) – can have attributes
like type and name or date
What is a Graph?
(diagram: two nodes connected by an edge)
224
Graph Types
Undirected Graph
Directed Graph
Pseudo Graph
Multi Graph
(diagrams: node-and-edge illustrations of each graph type)
225
More Graph Types
Weighted Graph
Labeled Graph
Property Graph
(diagrams: a weighted edge with weight 10; an edge labeled “Like”; a property graph
with nodes {Name: yosi, Age: 40} and {Name: ami, Age: 30} connected by an edge
with properties {friend, date: 2015})
226
Relationships
(diagram: nodes {ID:1, TYPE:F, NAME:alice}, {ID:2, TYPE:M, NAME:bob},
{ID:1, TYPE:F, NAME:dafna} and a group node {ID:1, TYPE:G, NAME:NoSQL},
connected by relationships of TYPE: member, Since: 2012)
227
228
Q&A
Conclusion
• The Challenge of Big Data
• Hadoop Basics: HDFS, MapReduce and YARN
• Improving Hadoop and Tools
• NoSQL and RDBMS
230
The Hadoop Ecosystem Explained
  • 1. The Hadoop Ecosystem Zohar Elkayam & Ronen Fidel Brillix
  • 2. Agenda • Big Data – The Challenge • Introduction to Hadoop – Deep dive into HDFS – MapReduce and YARN • Improving Hadoop: tools and extensions • NoSQL and RDBMS 2
  • 3. About Brillix • Brillix is a leading company that specialized in Data Management • We provide professional services and consulting for Databases, Security and Big Data solutions 3
  • 4. Who am I? • Zohar Elkayam, CTO at Brillix • DBA, team leader, instructor and a senior consultant for over 17 years • Oracle ACE Associate • Involved with Big Data projects since 2011 • Blogger – www.realdbamagic.com 4
  • 6. "Big Data"?? Different definitions “Big data exceeds the reach of commonlyused hardware environments and software tools to capture, manage, and process it with in a tolerable elapsed time for its user population.”- Teradata Magazinearticle,2011 “Big data refers to data sets whose size is beyond the abilityof typical database software tools to capture, store, manage and analyze.” - The McKinseyGlobal Institute, 2012 “Big data is a collectionof data sets so large and complex that it becomes difficultto process using on-handdatabasemanagement tools.” - Wikipedia, 2014 6
  • 7.
  • 10. MORE stories.. • Crime Prevention in Los Angeles • Diagnosis and treatment of genetic diseases • Investments in the financial sector • Generation of personalized advertising • Astronomical discoveries 10
  • 11. Examples of Big Data Use Cases Today MEDIA/ ENTERTAINMENT Viewers / advertising effectiveness COMMUNICATIONS Location-based advertising EDUCATION & RESEARCH Experiment sensor analysis CONSUMER PACKAGED GOODS Sentiment analysis of what’s hot, problems HEALTH CARE Patient sensors, monitoring, EHRs Quality of care LIFE SCIENCES Clinical trials Genomics HIGH TECHNOLOGY / INDUSTRIAL MFG. Mfg quality Warranty analysis OIL & GAS Drilling exploration sensor analysis FINANCIAL SERVICES Risk & portfolio analysis New products AUTOMOTIVE Auto sensors reporting location, problems RETAIL Consumer sentiment Optimized marketing LAW ENFORCEMENT & DEFENSE Threat analysis - social media monitoring, photo analysis TRAVEL & TRANSPORTATION Sensor analysis for optimal traffic flows Customer sentiment UTILITIES Smart Meter analysis for network capacity, ON-LINE SERVICES / SOCIAL MEDIA People & career matching Web-site optimization 11
  • 12. Most Requested Uses of Big Data • Log Analytics & Storage • Smart Grid / Smarter Utilities • RFID Tracking & Analytics • Fraud / Risk Management & Modeling • 360° View of the Customer • Warehouse Extension • Email / Call Center Transcript Analysis • Call Detail Record Analysis 12
  • 14. Big Data Big Problems • Unstructured • Unprocessed • Un-aggregated • Un-filtered • Repetitive • Low quality • And generally messy Oh, and there is a lot of it 14
  • 15. The Big Data Challenge 15
  • 16. Big Data: Challenge to Value Business Value  High Variety  High Volume  High Velocity Today  Deep Analytics  High Agility  Massive Scalability  Real TimeTomorrow Challenges 16
  • 17. Volume • Big data come in one size: Big. • Size is measured in Terabyte(1012), Petabyte(1015), Exabyte(1018), Zettabyte (1021) • The storing and handling of the data becomes an issue • Producing value out of the data in a reasonable time is an issue 17
  • 18. Some Numbers • How much data in the world? – 800 Terabytes, 2000 – 160 Exabytes, 2006 (1EB = 1018B) – 4.5 Zettabytes, 2012 (1ZB = 1021B) – 44 Zettabytes by 2020 • How much is a zettabyte? – 1,000,000,000,000,000,000,000 bytes – A stack of 1TB hard disks that is 25,400 km high 18
  • 20. Growth Rate How much data generated in a day? – 7 TB, Twitter – 10 TB, Facebook 20
  • 21. Variety • Big Data extends beyond structured data: including semi-structured and unstructured information: logs, text, audio and videos • Wide variety of rapidly evolving data types requires highly flexible stores and handling 21
  • 22. Structured & Un-Structured: Un-structured data is stored as objects, has a flexible or unknown structure, and is textual and binary; structured data is stored in tables, in columns and rows, has a predefined structure, and is mostly textual. 22
  • 23. Big Data is ANY data: Unstructured, Semi-Structure and Structured • Some has fixed structure • Some is “bring own structure” • We want to find value in all of it 23
  • 24. Data Types by Industry 24
  • 25. Velocity • The speed in which the data is being generated and collected • Streaming data and large volume data movement • High velocity of data capture – requires rapid ingestion • Might cause the backlog problem 25
  • 26. Global Internet Device Forecast 26
  • 28. Veracity • Quality of the data can vary greatly • Data sources might be messy or corrupted 28
  • 29. So, What Defines Big Data? • When we think that we can produce value from that data and want to handle it • When the data is too big or moves too fast to handle in a sensible amount of time • When the data doesn’t fit conventional database structure • When the solution becomes part of the problem 29
  • 31.
  • 32. Big Data in Practice • Big data is big: technological infrastructure solutions needed • Big data is messy: data sources must be cleaned before use • Big data is complicated: need developers and system admins to manage intake of data 32
  • 33. Big Data in Practice (cont.) • Data must be broken out of silos in order to be mined, analyzed and transformed into value • The organization must learn how to communicate and interpret the results of analysis 33
  • 34. Infrastructure Challenges • Infrastructure that is built for: – Large-scale – Distributed – Data-intensive jobs that spread the problem across clusters of server nodes 34
  • 35. Infrastructure Challenges (cont.) • Storage: – Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data – With intelligent capabilities to reduce your data footprint such as: • Data compression • Automatic data tiering • Data deduplication 35
  • 36. Infrastructure Challenges (cont.) • Network infrastructure that can quickly import large data sets and then replicate it to various nodes for processing • Security capabilities that protect highly-distributed infrastructure and data 36
  • 38. Apache Hadoop • Open source project run by Apache (2006) • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure • It has been the driving force behind the growth of the big data industry • Get the public release from: http://hadoop.apache.org/core/ 38
  • 40. Key points • An open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers. • The complete technology stack includes – common utilities – a distributed file system – analytics and data storage platforms – an application layer that manages distributed processing, parallel computation, workflow, and configuration management • More cost-effective for handling large unstructured data sets than conventional approaches, and it offers massive scalability and speed 40
  • 41. Why use Hadoop? • Scalability – near-linear performance up to 1000s of nodes • Cost – leverages commodity HW & open source SW • Flexibility – versatility with data, analytics & operation 41
  • 42. No, really, why use Hadoop? • Need to process multi-petabyte datasets • Expensive to build reliability into each application • Nodes fail every day – Failure is expected, rather than exceptional – The number of nodes in a cluster is not constant • Need common infrastructure – Efficient, reliable, open source (Apache License) • The above goals are the same as Condor's, but – Workloads are IO bound and not CPU bound 42
  • 43. Hadoop Benefits • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to leverage parallelism • Designed to scale • Flexible development platform • Solution Ecosystem 43
  • 44. Hadoop Limitations • Hadoop is scalable but it’s not fast • Some assembly required • Batteries not included • Instrumentation not included either • DIY mindset 44
  • 46. Hadoop Main Components • HDFS: Hadoop Distributed File System – a distributed file system that runs in a clustered environment. • MapReduce – a programming paradigm for running processes over clustered environments. 47
  • 47. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts • The Hadoop Distributed File System 48
  • 48. HDFS Node Types HDFS has three types of nodes • Namenode (MasterNode) – Distributes files in the cluster – Responsible for replication between the datanodes and for file block locations • Datanodes – Responsible for the actual file storage – Serve file data to clients • BackupNode (version 0.23 and up) – A backup of the NameNode 49
  • 49. Typical implementation • Nodes are commodity PCs • 30-40 nodes per rack • Uplink from racks is 3-4 gigabit • Rack-internal is 1 gigabit 50
  • 50. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop 51
  • 51. MapReduce paradigm • Implement two functions: • MAP - takes a large problem, divides it into sub-problems, and performs the same function on every sub-problem: Map(k1, v1) -> list(k2, v2) • REDUCE - combines the output from all sub-problems: Reduce(k2, list(v2)) -> list(v3) • The framework handles everything else (almost) • Values with the same key must go to the same reducer 52
  • 52. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output Map Reduce 53
  • 54. MapReduce - word count example
    function map(String name, String document):
      for each word w in document:
        emit(w, 1)

    function reduce(String word, Iterator partialCounts):
      totalCount = 0
      for each count in partialCounts:
        totalCount += count
      emit(word, totalCount)
    55
  • 55. MapReduce Word Count Process 56
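For reference, the same word count can also be written against Hadoop's Java API. The sketch below uses the new (org.apache.hadoop.mapreduce) API that the later slides rely on; the class names are illustrative and not taken from the deck.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      // map: (offset, line) -> list(word, 1)
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // emit(w, 1)
          }
        }
      }

      // reduce: (word, [1, 1, ...]) -> (word, totalCount)
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int totalCount = 0;
          for (IntWritable count : values) {
            totalCount += count.get();
          }
          result.set(totalCount);
          context.write(key, result);          // emit(word, totalCount)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Registering the reducer as the combiner is safe here because summation is associative and commutative; for other jobs that shortcut may not apply.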
  • 56. MapReduce Advantages
    Example (Hadoop Streaming):
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper /bin/cat \
      -reducer /bin/wc
    • Runs programs (jobs) across many computers
    • Protects against single-server failure by re-running failed steps
    • MR jobs can be written in Java, C, Python, Ruby and others
    • Users only write the Map and Reduce functions
    57
  • 57. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset 58
  • 58. MapReduce is OK for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk • The IO and latency cost of an iteration is high 59
  • 59. MapReduce is NOT good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records 60
  • 61. HDFS • Appears as a single disk • Runs on top of a native filesystem – Ext3, Ext4, XFS • Fault tolerant – can handle disk crashes, machine crashes, etc... • Based on Google's Filesystem (GFS or GoogleFS) – gfs-sosp2003.pdf • http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf – http://en.wikipedia.org/wiki/Google_File_System 62
  • 62. HDFS is Good for... • Storing large files – Terabytes, Petabytes, etc... – Millions rather than billions of files – 100MB or more per file • Streaming data – Write once and read-many times patterns – Optimized for streaming reads rather than random reads – Append operation added to Hadoop 0.21 • “Cheap” Commodity Hardware – No need for super-computers, use less reliable commodity hardware 63
  • 63. HDFS is not so good for... • Low-latency reads – High-throughput rather than low latency for small chunks of data – HBase addresses this issue • Large amount of small files – Better for millions of large files instead of billions of small files • For example each file can be 100MB or more • Multiple Writers – Single writer per file – Writes only at the end of file, no-support for arbitrary offset 64
  • 64. HDFS: Hadoop Distributed File System • A given file is broken down into blocks (default 128MB; 64MB in older Hadoop versions), then blocks are replicated across the cluster (default replication factor 3) • Optimized for: – Throughput – Put/Get/Delete – Appends • Block replication for: – Durability – Availability – Throughput • Block replicas are distributed across servers and racks 65
  • 65. HDFS Architecture • Name Node: Maps a file to a file-id and a list of blocks • Data Node: Maps a block-id to a physical location on disk • Secondary Name Node: Periodic merge of the transaction log 66
  • 66. HDFS Daemons • The filesystem cluster is managed by three types of processes – Namenode • Manages the file system's namespace/meta-data/file blocks • Runs on 1 machine (up to several machines) – Datanode • Stores and retrieves data blocks • Reports to the Namenode • Runs on many machines – Secondary Namenode • Performs housekeeping work so the Namenode doesn't have to • Requires similar hardware to the Namenode machine • Not used for high availability – not a backup for the Namenode 67
  • 67. Files and Blocks • Files are split into blocks (single unit of storage) – Managed by Namenode, stored by Datanode – Transparent to user • Replicated across machines at load time – Same block is stored on multiple machines – Good for fault-tolerance and access – Default replication is 3 68
  • 68. HDFS Blocks • Blocks are traditionally either 64MB or 128MB – Default is 128MB • The motivation is to minimize the cost of seeks compared to the transfer rate – 'Time to transfer' > 'Time to seek' • For example, let's say – seek time = 10ms – transfer rate = 100 MB/s • To keep seek time at 1% of transfer time – block size needs to be = 100MB 69
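To spell out the arithmetic on this slide, here is a small stand-alone sketch (the variable names are mine, not from the deck) that derives the block size from the seek time, transfer rate and target seek overhead:

    public class BlockSizeMath {
      public static void main(String[] args) {
        double seekTimeSec = 0.010;        // 10 ms average seek
        double transferRateMBps = 100.0;   // 100 MB/s sequential transfer
        double targetSeekFraction = 0.01;  // seek should be ~1% of the transfer time

        // If seek is 1% of transfer time, transfer time = seekTime / fraction = 1 s,
        // so block size = transfer rate * transfer time = 100 MB
        double transferTimeSec = seekTimeSec / targetSeekFraction;
        double blockSizeMB = transferRateMBps * transferTimeSec;

        System.out.printf("Block size ~ %.0f MB%n", blockSizeMB);
      }
    }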
  • 69. Block Replication • Namenode determines replica placement • Replica placements are rack aware – Balance between reliability and performance • Attempts to reduce bandwidth • Attempts to improve reliability by putting replicas on multiple racks – Default replication is 3 • 1st replica on the local rack • 2nd replica on the local rack but different machine • 3rd replica on the different rack – This policy may change/improve in the future 70
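Replica placement can also be observed from client code. A sketch, reusing the deck's /user/sample/sonnets.txt path purely for illustration, prints the replication factor and the hosts holding each block; the setReplication call at the end shows how the factor of an existing file can be changed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockInfo {
      public static void main(String[] args) throws Exception {
        Path file = new Path("/user/sample/sonnets.txt");   // illustrative path
        FileSystem fs = FileSystem.get(new Configuration());

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // One BlockLocation per block, with the hosts holding its replicas
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("Block at offset " + loc.getOffset()
              + " on hosts " + String.join(",", loc.getHosts()));
        }

        // The replication factor of an existing file can be changed after the fact
        fs.setReplication(file, (short) 2);
      }
    }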
  • 70. Data Correctness • Use checksums to validate data – Use CRC32 • File creation – Client computes a checksum per 512 bytes – Data Node stores the checksum • File access – Client retrieves the data and checksum from the Data Node – If validation fails, the client tries other replicas 71
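The deck does not show HDFS's internal checksum code, but the per-512-byte-chunk idea can be illustrated with plain java.util.zip.CRC32 (an illustration only, not the HDFS implementation):

    import java.util.zip.CRC32;

    public class ChunkChecksums {
      // Compute one CRC32 value per chunk of the given data,
      // mirroring the "checksum per 512 bytes" idea from the slide.
      public static long[] checksumPerChunk(byte[] data, int chunkSize) {
        int chunks = (data.length + chunkSize - 1) / chunkSize;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
          CRC32 crc = new CRC32();
          int offset = i * chunkSize;
          int len = Math.min(chunkSize, data.length - offset);
          crc.update(data, offset, len);
          sums[i] = crc.getValue();
        }
        return sums;
      }

      public static void main(String[] args) {
        byte[] data = "Hello HDFS! Elephants are awesome!".getBytes();
        long[] sums = checksumPerChunk(data, 512);
        System.out.println("Chunks: " + sums.length + ", first CRC32: " + sums[0]);
      }
    }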
  • 71. Data Pipelining • Client retrieves a list of Data Nodes on which to place replicas of a block • Client writes block to the first Data Node • The first Data Node forwards the data to the next Data Node in the Pipeline • When all replicas are written, the Client moves on to write the next block in file 72
  • 72. Client, Namenode, and Datanodes • Namenode does NOT directly write or read data – One of the reasons for HDFS’s Scalability • Client interacts with Namenode to update Namenode’s HDFS namespace and retrieve block locations for writing and reading • Client interacts directly with Datanode to read/write data 73
  • 73. Name Node Metadata • Meta-data in Memory – The entire metadata is in main memory – No demand paging of meta-data • Types of Metadata – List of files – List of Blocks for each file – List of Data Nodes for each block – File attributes, e.g. creation time, replication factor • A Transaction Log – Records file creations, file deletions. etc. 74
  • 74. Namenode Memory Concerns • For fast access the Namenode keeps all block metadata in-memory – The bigger the cluster, the more RAM required • Best for millions of large files (100MB or more) rather than billions • Will work well for clusters of 100s of machines • Hadoop 2+ – Namenode Federations • Each namenode will host part of the blocks • Horizontally scale the Namenode – Support for 1000+ machine clusters 75
  • 76. Reading Data from HDFS 1. Create FileSystem 2. Open InputStream to a Path 3. Copy bytes using IOUtils 4. Close Stream 77
  • 77. 1: Create FileSystem • FileSystem fs = FileSystem.get(new Configuration()); – If you run with yarn command, DistributedFileSystem (HDFS) will be created • Utilizes fs.default.name property from configuration • Recall that Hadoop framework loads core-site.xml which sets property to hdfs (hdfs://localhost:8020) 78
  • 78. 2: Open Input Stream to a Path ... InputStream input = null; try { input = fs.open(fileToRead); ... • fs.open returns org.apache.hadoop.fs.FSDataInputStream – Another FileSystem implementation will return their own custom implementation of InputStream • Opens stream with a default buffer of 4k • If you want to provide your own buffer size use – fs.open(Path f, int bufferSize) 79
  • 79. 3: Copy bytes using IOUtils IOUtils.copyBytes(inputStream, outputStream, buffer); • Copy bytes from InputStream to OutputStream • Hadoop’s IOUtils makes the task simple – buffer parameter specifies number of bytes to buffer at a time 80
  • 80. 4: Close Stream ... } finally { IOUtils.closeStream(input); ... • Utilize IOUtils to avoid boiler plate code that catches IOException 81
  • 81. ReadFile.java Example
    public class ReadFile {
      public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/user/sample/sonnets.txt");
        FileSystem fs = FileSystem.get(new Configuration());  // 1: Open FileSystem
        InputStream input = null;
        try {
          input = fs.open(fileToRead);                 // 2: Open InputStream
          IOUtils.copyBytes(input, System.out, 4096);  // 3: Copy from Input to Output
        } finally {
          IOUtils.closeStream(input);                  // 4: Close stream
        }
      }
    }
    $ yarn jar my-hadoop-examples.jar hdfs.ReadFile
    82
  • 82. Reading Data - Seek • FileSystem.open returns FSDataInputStream – Extension of java.io.DataInputStream – Supports random access and reading via interfaces: • PositionedReadable : read chunks of the stream • Seekable : seek to a particular position in the stream 83
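The SeekReadFile example that follows exercises Seekable but not PositionedReadable, so here is a minimal sketch of a positioned read, reusing the deck's /user/sample/readMe.txt file; the offset and buffer size are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PositionedRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/sample/readMe.txt"))) {
          byte[] buffer = new byte[10];
          // PositionedReadable: read up to 10 bytes starting at offset 11,
          // without changing the stream's current position
          int bytesRead = in.read(11L, buffer, 0, buffer.length);
          if (bytesRead > 0) {
            System.out.println(new String(buffer, 0, bytesRead));
          }
        }
      }
    }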
  • 83. Seeking to a Position • FSDataInputStream implements Seekable interface – void seek(long pos) throws IOException • Seek to a particular position in the file • Next read will begin at that position • If you attempt to seek past the file boundary IOException is emitted • Somewhat expensive operation – strive for streaming and not seeking – long getPos() throws IOException • Returns the current position/offset from the beginning of the stream/file 84
  • 84. SeekReadFile.java Example
    public class SeekReadFile {
      public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/user/sample/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream input = null;
        try {
          input = fs.open(fileToRead);
          System.out.print("start position=" + input.getPos() + ": ");
          IOUtils.copyBytes(input, System.out, 4096, false);
          input.seek(11);
          System.out.print("start position=" + input.getPos() + ": ");
          IOUtils.copyBytes(input, System.out, 4096, false);
          input.seek(0);
          System.out.print("start position=" + input.getPos() + ": ");
          IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(input);
        }
      }
    }
    85
  • 85. Run SeekReadFile Example
    $ yarn jar my-hadoop-examples.jar hdfs.SeekReadFile
    start position=0: Hello from readme.txt
    start position=11: readme.txt
    start position=0: Hello from readme.txt
    86
  • 86. Write Data 1. Create FileSystem instance 2. Open OutputStream – FSDataOutputStream in this case – Open a stream directly to a Path from FileSystem – Creates all needed directories on the provided path 3. Copy data using IOUtils 87
  • 87. WriteToFile.java Example
    public class WriteToFile {
      public static void main(String[] args) throws IOException {
        String textToWrite = "Hello HDFS! Elephants are awesome!\n";
        InputStream in = new BufferedInputStream(
            new ByteArrayInputStream(textToWrite.getBytes()));
        Path toHdfs = new Path("/user/sample/writeMe.txt");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);         // 1: Create FileSystem instance
        FSDataOutputStream out = fs.create(toHdfs);   // 2: Open OutputStream
        IOUtils.copyBytes(in, out, conf);             // 3: Copy Data
      }
    }
    88
  • 88. Run WriteToFile
    $ yarn jar my-hadoop-examples.jar hdfs.WriteToFile
    $ hdfs dfs -cat /user/sample/writeMe.txt
    Hello HDFS! Elephants are awesome!
    90
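The same verification can be done from Java instead of the hdfs shell. A small sketch that lists the /user/sample directory used throughout these examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListSampleDir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/sample");

        if (fs.exists(dir)) {
          // One FileStatus per entry: path, length, replication, and so on
          for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
          }
        }
      }
    }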
  • 90. Hadoop MapReduce • Model for processing large amounts of data in parallel – On commodity hardware – Lots of nodes • Derived from functional programming – Map and reduce functions • Can be implemented in multiple languages – Java, C++, Ruby, Python (etc...) 92
  • 91. The MapReduce Model • Imposes key-value input/output • Defines map and reduce functions map: (K1,V1) → list (K2,V2) reduce: (K2,list(V2)) → list (K3,V3) 1. Map function is applied to every input key-value pair 2. Map function generates intermediate key-value pairs 3. Intermediate key-values are sorted and grouped by key 4. Reduce is applied to sorted and grouped intermediate key-values 5. Reduce emits result key-values 93
  • 95. MapReduce Framework • Takes care of distributed processing and coordination • Scheduling – Jobs are broken down into smaller chunks called tasks. – These tasks are scheduled • Task Localization with Data – Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task – Code is moved to where the data is 97
  • 96. MapReduce Framework • Error Handling – Failures are an expected behavior so tasks are automatically re-tried on other machines • Data Synchronization – Shuffle and Sort barrier re-arranges and moves data between machines – Input and output are coordinated by the framework 98
  • 97. Map Reduce 2.0 on YARN • Yet Another Resource Negotiator (YARN) • Various applications can run on YARN – MapReduce is just one choice (the main choice at this point) – http://wiki.apache.org/hadoop/PoweredByYarn • YARN was designed to address issues with MapReduce1 – Scalability issues (max ~4,000 machines) – Inflexible Resource Management • MapReduce1 had slot based model 99
  • 98. MapReduce1 vs. YARN • MapReduce1 runs on top of JobTracker and TaskTracker daemons – JobTracker schedules tasks, matches task with TaskTrackers – JobTracker manages MapReduce Jobs, monitors progress – JobTracker recovers from errors, restarts failed and slow tasks • MapReduce1 has inflexible slot-based memory management model – Each TaskTracker is configured at start-up to have N slots – A task is executed in a single slot – Slots are configured with maximum memory on cluster start-up – The model is likely to cause over and under utilization issues 100
  • 99. MapReduce1 vs. YARN (cont.) • YARN addresses shortcomings of MapReduce1 – JobTracker is split into 2 daemons • ResourceManager - administers resources on the cluster • ApplicationMaster - manages applications such as MapReduce – Fine-Grained memory management model • ApplicationMaster requests resources by asking for “containers” with a certain memory limit (ex 2G) • YARN administers these containers and enforces memory usage • Each Application/Job has control of how much memory to request 101
  • 100. Daemons • YARN Daemons – Node Manger • Manages resources of a single node • There is one instance per node in the cluster – Resource Manager • Manages Resources for a Cluster • Instructs Node Manager to allocate resources • Application negotiates for resources with Resource Manager • There is only one instance of Resource Manager • MapReduce Specific Daemon – MapReduce History Server • Archives Jobs’ metrics and meta-data 102
  • 101. Old vs. New Java API • There are two flavors of MapReduce API which became known as Old and New • Old API classes reside under – org.apache.hadoop.mapred • New API classes can be found under – org.apache.hadoop.mapreduce – org.apache.hadoop.mapreduce.lib • We will use new API exclusively • New API was re-designed for easier evolution • Early Hadoop versions deprecated old API but deprecation was removed • Do not mix new and old API 103
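As a quick orientation, a new-API mapper skeleton looks like the sketch below (the class name is invented for illustration); the comments point out what would differ under the old org.apache.hadoop.mapred API.

    // New-API mapper skeleton (org.apache.hadoop.mapreduce).
    // The old API would instead implement org.apache.hadoop.mapred.Mapper
    // and emit results through OutputCollector/Reporter rather than a Context object.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NewApiMapperSkeleton
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // transformation logic goes here; results are emitted via context.write(...)
      }
    }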
  • 103. MapReduce • Divided in two phases – Map phase – Reduce phase • Both phases use key-value pairs as input and output • The implementer provides map and reduce functions • MapReduce framework orchestrates splitting, and distributing of Map and Reduce phases – Most of the pieces can be easily overridden 105
  • 104. MapReduce • Job – execution of map and reduce functions to accomplish a task – Equal to Java’s main • Task – single Mapper or Reducer – Performs work on a fragment of data 106
  • 105. Map Reduce Flow of Data 107
  • 106. First Map Reduce Job • StartsWithCount Job – Input is a body of text from HDFS • In this case hamlet.txt – Split text into tokens – For each first letter sum up all occurrences – Output to HDFS 108
  • 108. Starts With Count Job 1. Configure the Job – Specify Input, Output, Mapper, Reducer and Combiner 2. Implement Mapper – Input is text – a line from hamlet.txt – Tokenize the text and emit first character with a count of 1 - <token, 1> 3. Implement Reducer – Sum up counts for each letter – Write out the result to HDFS 4. Run the job 110
  • 109. 1: Configure Job • Job class – Encapsulates information about a job – Controls execution of the job Job job = Job.getInstance(getConf(), "StartsWithCount"); • A job is packaged within a jar file – Hadoop Framework distributes the jar on your behalf – Needs to know which jar file to distribute – The easiest way to specify the jar that your job resides in is by calling job.setJarByClass job.setJarByClass(getClass()); – Hadoop will locate the jar file that contains the provided class 111
  • 110. 1: Configure Job - Specify Input TextInputFormat.addInputPath(job, new Path(args[0])); job.setInputFormatClass(TextInputFormat.class); • Can be a file, directory or a file pattern – Directory is converted to a list of files as an input • Input is specified by implementation of InputFormat - in this case TextInputFormat – Responsible for creating splits and a record reader – Controls input types of key-value pairs, in this case LongWritable and Text – File is broken into lines, mapper will receive 1 line at a time 112
• 111. Side Note – Hadoop IO Classes • Hadoop uses its own serialization mechanism for writing data in and out of network, database or files – Optimized for network serialization – A set of basic types is provided – Easy to implement your own • org.apache.hadoop.io package – LongWritable for Long – IntWritable for Integer – Text for String – Etc... 113
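For illustration only (not part of the StartsWithCount job), the Writable wrappers behave like simple, reusable value holders:
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;

  Text word = new Text("hamlet");            // wraps a String
  IntWritable one = new IntWritable(1);      // wraps an int
  LongWritable offset = new LongWritable(42L);
  int total = one.get() + 1;                 // unwrap the value with get()
  word.set("ophelia");                       // instances are mutable, so they can be reused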
• 112. 1: Configure Job – Specify Output TextOutputFormat.setOutputPath(job, new Path(args[1])); job.setOutputFormatClass(TextOutputFormat.class); • OutputFormat defines specification for outputting data from Map/Reduce job • Count job utilizes an implementation of OutputFormat - TextOutputFormat – Define output path where reducer should place its output • If path already exists then the job will fail – Each reducer task writes to its own file • By default a job is configured to run with a single reducer – Writes key-value pair as plain text 114
  • 113. 1: Configure Job – Specify Output job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); • Specify the output key and value types for both mapper and reducer functions – Many times the same type – If types differ then use • setMapOutputKeyClass() • setMapOutputValueClass() 115
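For example, in a hypothetical job (not the count job) where the mapper emits <Text, IntWritable> but the reducer emits <Text, Text>, the configuration would look roughly like this:
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(IntWritable.class);
  job.setOutputKeyClass(Text.class);    // reducer output key
  job.setOutputValueClass(Text.class);  // reducer output value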
• 114. 1: Configure Job • Specify Mapper, Reducer and Combiner – At a minimum will need to implement these classes – Mappers and Reducers usually have the same output key type job.setMapperClass(StartsWithCountMapper.class); job.setReducerClass(StartsWithCountReducer.class); job.setCombinerClass(StartsWithCountReducer.class); 116
• 115. 1: Configure Job • job.waitForCompletion(true) – Submits the job and waits for completion – The boolean parameter specifies whether job progress should be printed to the console – If the job completes successfully ‘true’ is returned, otherwise ‘false’ is returned 117
• 116. Our Count Job is configured to • Chop up text files into lines • Send records to mappers as key-value pairs – Byte offset of the line (the LongWritable key) and the line itself • Mapper class is StartsWithCountMapper – Receives key-value of <LongWritable, Text> – Outputs key-value of <Text, IntWritable> • Reducer class is StartsWithCountReducer – Receives key-value of <Text, IntWritable> – Outputs key-values of <Text, IntWritable> as text • Combiner class is StartsWithCountReducer 118
  • 117. 1: Configure Count Job public class StartsWithCountJob extends Configured implements Tool{ @Override public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf(), "StartsWithCount"); job.setJarByClass(getClass()); // configure output and input source TextInputFormat.addInputPath(job, new Path(args[0])); job.setInputFormatClass(TextInputFormat.class); // configure mapper and reducer job.setMapperClass(StartsWithCountMapper.class); job.setCombinerClass(StartsWithCountReducer.class); job.setReducerClass(StartsWithCountReducer.class); 119
  • 118. StartsWithCountJob.java (cont.) // configure output TextOutputFormat.setOutputPath(job, new Path(args[1])); job.setOutputFormatClass(TextOutputFormat.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); return job.waitForCompletion(true) ? 0 : 1; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run( new StartsWithCountJob(), args); System.exit(exitCode); } } 120
• 119. 2: Implement Mapper class • Class has 4 Java Generics parameters – (1) input key (2) input value (3) output key (4) output value – Input and output utilize hadoop’s IO framework • org.apache.hadoop.io • Your job is to implement the map() method – Input key and value – Output key and value – Logic is up to you • map() receives a Context object, used to: – Write output – Create your own counters (a counter sketch follows the mapper code below) 121
  • 120. 2: Implement Mapper public class StartsWithCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable countOne = new IntWritable(1); private final Text reusableText = new Text(); @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer tokenizer = new StringTokenizer(value.toString()); while (tokenizer.hasMoreTokens()) { reusableText.set(tokenizer.nextToken().substring(0, 1)); context.write(reusableText, countOne); } } } 122
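As mentioned above, the Context can also maintain custom counters. A hedged, one-line addition inside the while loop of the mapper above might look like this (the counter group and name are made up for illustration):
  context.getCounter("StartsWithCount", "TokensProcessed").increment(1); // appears in the job's counter report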
  • 121. 3: Implement Reducer • Analogous to Mapper – generic class with four types – (1) input key (2) input value (3) output key (4) output value – The output types of map functions must match the input types of reduce function • In this case Text and IntWritable – Map/Reduce framework groups key-value pairs produced by mapper by key • For each key there is a set of one or more values • Input into a reducer is sorted by key • Known as Shuffle and Sort – Reduce function accepts key->setOfValues and outputs key-value pairs • Also utilizes Context object (similar to Mapper) 123
  • 122. 3: Implement Reducer public class StartsWithCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { @Override protected void reduce(Text token, Iterable<IntWritable> counts, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable count : counts) { sum+= count.get(); } context.write(token, new IntWritable(sum)); } } 124
  • 123. 3: Reducer as a Combiner • Combine data per Mapper task to reduce amount of data transferred to reduce phase • Reducer can very often serve as a combiner – Only works if reducer’s output key-value pair types are the same as mapper’s output types • Combiners are not guaranteed to run – Optimization only – Not for critical logic • More about combiners later 125
  • 124. 4: Run Count Job $ yarn jar my-hadoop-examples.jar mr.wordcount.StartsWithCountJob /user/sample/readme.txt /user/sample/wordcount 126
• 125. Output of Count Job • Output is written to the configured output directory – /user/sample/wordcount/ • One output file per Reducer – part-r-xxxxx format • Output is driven by TextOutputFormat class 127
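To inspect the result you can list and print the output directory with standard HDFS commands; assuming the default single reducer, the only data file is part-r-00000:
  $ hdfs dfs -ls /user/sample/wordcount
  $ hdfs dfs -cat /user/sample/wordcount/part-r-00000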
• 126. $yarn command • The yarn script with a class argument launches a JVM and executes the provided Job $ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob /user/sample/hamlet.txt /user/sample/wordcount/ • You could invoke java directly but the yarn script is more convenient – Adds hadoop’s libraries to CLASSPATH – Adds hadoop’s configurations to Configuration object • Ex: core-site.xml, mapred-site.xml, *.xml – You can also utilize the $HADOOP_CLASSPATH environment variable (see the sketch below) 128
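For example, extra client-side jars can be added before launching the job; the jar path here is purely illustrative:
  $ export HADOOP_CLASSPATH=/opt/libs/my-extra-lib.jar
  $ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob /user/sample/hamlet.txt /user/sample/wordcount/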
  • 128. MapReduce Theory • Map and Reduce functions produce input and output – Input and output can range from Text to Complex data structures – Specified via Job’s configuration – Relatively easy to implement your own • Generally we can treat the flow as map: (K1,V1) → list (K2,V2) reduce: (K2,list(V2)) → list (K3,V3) – Reduce input types are the same as map output types 130
  • 129. Map Reduce Flow of Data map: (K1,V1) → list (K2,V2) reduce: (K2,list(V2)) → list (K3,V3) 131
  • 130. Key and Value Types • Utilizes Hadoop’s serialization mechanism for writing data in and out of network, database or files – Optimized for network serialization – A set of basic types is provided – Easy to implement your own • Extends Writable interface – Framework’s serialization mechanisms – Defines how to read and write fields – org.apache.hadoop.io package 132
• 131. Key and Value Types • Keys must implement WritableComparable interface – Extends Writable and java.lang.Comparable<T> – Required because keys are sorted prior to the reduce phase • Hadoop is shipped with many default implementations of WritableComparable<T> – Wrappers for Java primitives and String (IntWritable, LongWritable, Text, etc...) – Or you can implement your own 133
• 132. WritableComparable<T> Implementations – BooleanWritable: Boolean implementation – BytesWritable: Bytes implementation – DoubleWritable: Double implementation – FloatWritable: Float implementation – IntWritable: Int implementation – LongWritable: Long implementation – NullWritable: Writable with no data 134
  • 133. Implement Custom WritableComparable<T> • Implement 3 methods – write(DataOutput) • Serialize your attributes – readFields(DataInput) • De-Serialize your attributes – compareTo(T) • Identify how to order your objects • If your custom object is used as the key it will be sorted prior to reduce phase 135
• 134. BlogWritable – Implementation of WritableComparable<T> public class BlogWritable implements WritableComparable<BlogWritable> { private String author; private String content; public BlogWritable(){} public BlogWritable(String author, String content) { this.author = author; this.content = content; } public String getAuthor() { return author; } public String getContent() { return content; } ... ... 136
• 135. BlogWritable – Implementation of WritableComparable<T> ... @Override public void readFields(DataInput input) throws IOException { author = input.readUTF(); content = input.readUTF(); } @Override public void write(DataOutput output) throws IOException { output.writeUTF(author); output.writeUTF(content); } @Override public int compareTo(BlogWritable other) { return author.compareTo(other.author); } } 137
  • 136. Mapper • Extend Mapper class – Mapper<KeyIn, ValueIn, KeyOut, ValueOut> • Simple life-cycle 1. The framework first calls setup(Context) 2. for each key/value pair in the split: • map(Key, Value, Context) 3. Finally cleanup(Context) is called 138
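A minimal sketch of a mapper that uses the full life-cycle; this class is hypothetical and not part of the count job:
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text outKey = new Text();
    private final IntWritable one = new IntWritable(1);
    private long recordsSeen;

    @Override
    protected void setup(Context context) {            // called once per split, before any map() call
      recordsSeen = 0;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {     // called once for each record in the split
      recordsSeen++;
      outKey.set(value.toString());
      context.write(outKey, one);
    }

    @Override
    protected void cleanup(Context context) {          // called once after the last record
      System.out.println("Processed " + recordsSeen + " records in this split");
    }
  }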
  • 137. InputSplit • Splits are a set of logically arranged records – A set of lines in a file – A set of rows in a database table • Each instance of mapper will process a single split – Map instance processes one record at a time • map(k,v) is called for each record • Splits are implemented by extending InputSplit class 139
  • 138. InputSplit • Framework provides many options for InputSplit implementations – Hadoop’s FileSplit – HBase’s TableSplit • Don’t usually need to deal with splits directly – InputFormat’s responsibility 140
• 139. Combiner • Runs on output of map function • Produces output with the same key-value types as the map output map: (K1,V1) → list (K2,V2) combine: (K2,list(V2)) → list (K2,V2) reduce: (K2,list(V2)) → list (K3,V3) • Optimization to reduce bandwidth – NO guarantees on being called – May be applied only to a sub-set of map outputs • Often is the same class as Reducer • Each combine processes output from a single split 141
• 143. Specify Combiner Function • To implement Combiner extend Reducer class • Set combiner on Job class – job.setCombinerClass(StartsWithCountReducer.class); 145
  • 144. Reducer • Extend Reducer class – Reducer<KeyIn, ValueIn, KeyOut, ValueOut> – KeyIn and ValueIn types must match output types of mapper • Receives input from mappers’ output – Sorted on key – Grouped on key of key-values produced by mappers – Input is directed by Partitioner implementation • Simple life-cycle – similar to Mapper – The framework first calls setup(Context) – for each key → list(value) calls • reduce(Key, Values, Context) – Finally cleanup(Context) is called 146
• 145. Reducer • Can configure more than 1 reducer – job.setNumReduceTasks(10); – mapreduce.job.reduces property • job.getConfiguration().setInt("mapreduce.job.reduces", 10) • Partitioner implementation directs key-value pairs to the proper reducer task – A partition is processed by a reduce task • # of partitions = # of reduce tasks – Default strategy is to hash the key to determine the partition, implemented by HashPartitioner<K, V> (a wiring sketch follows the CustomPartitioner example below) 147
  • 147. HashPartitioner public class HashPartitioner<K, V> extends Partitioner<K, V> { public int getPartition(K key, V value, int numReduceTasks) { return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; } } • Calculate Index of Partition: – Convert key’s hash into non-negative number • Logical AND with maximum integer value – Modulo by number of reduce tasks • In case of more than 1 reducer – Records distributed evenly across available reduce tasks • Assuming a good hashCode() function – Records with same key will make it into the same reduce task – Code is independent from the # of partitions/reducers specified 149
  • 148. Custom Partitioner public class CustomPartitioner extends Partitioner<Text, BlogWritable>{ @Override public int getPartition(Text key, BlogWritable blog, int numReduceTasks) { int positiveHash = blog.getAuthor().hashCode()& Integer.MAX_VALUE; //Use author’s hash only, AND with //max integer to get a positive value return positiveHash % numReduceTasks; } } • All blogs with the same author will end up in the same reduce task 150
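Wiring the custom partitioner into the job is done on the Job object; a short sketch (the reducer count is illustrative):
  job.setPartitionerClass(CustomPartitioner.class);
  job.setNumReduceTasks(10);  // one partition per reduce task, so authors are spread over 10 reducers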
  • 151. Improving Hadoop • Core Hadoop is complicated so some tools were added to make things easier • Hadoop Distributions collect these tools and release them as a whole package 153
  • 152. Noticeable Distributions • Cloudera • MapR • HortonWorks • Amazon EMR 154
  • 153. HADOOP Technology Eco System 155
• 154. Improving Programmability • Pig: Programming language that simplifies Hadoop actions: loading, transforming and sorting data • Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax. 156
  • 155. Pig • “is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. “ • Top Level Apache Project – http://pig.apache.org • Pig is an abstraction on top of Hadoop – Provides high level programming language designed for data processing – Converted into MapReduce and executed on Hadoop Clusters • Pig is widely accepted and used – Yahoo!, Twitter, Netflix, etc... 157
  • 156. Pig and MapReduce • MapReduce requires programmers – Must think in terms of map and reduce functions – More than likely will require Java programmers • Pig provides high-level language that can be used by – Analysts – Data Scientists – Statisticians – Etc... • Originally implemented at Yahoo! to allow analysts to access data 158
  • 157. Pig’s Features • Join Datasets • Sort Datasets • Filter • Data Types • Group By • User Defined Functions 159
• 158. Pig’s Use Cases • Extract Transform Load (ETL) – Ex: Processing large amounts of log data • clean bad entries, join with other data-sets • Research of “raw” information – Ex. User Audit Logs – Schema may be unknown or inconsistent – Data Scientists and Analysts may like Pig’s data transformation paradigm 160
  • 159. Pig Components • Pig Latin – Command based language – Designed specifically for data transformation and flow expression • Execution Environment – The environment in which Pig Latin commands are executed – Currently there is support for Local and Hadoop modes • Pig compiler converts Pig Latin to MapReduce – Compiler strives to optimize execution – You automatically get optimization improvements with Pig updates 161
  • 161. Hive • Data Warehousing Solution built on top of Hadoop • Provides SQL-like query language named HiveQL – Minimal learning curve for people with SQL expertise – Data analysts are target audience • Early Hive development work started at Facebook in 2007 • Today Hive is an Apache project under Hadoop – http://hive.apache.org 163
  • 162. Hive Provides • Ability to bring structure to various data formats • Simple interface for ad hoc querying, analyzing and summarizing large amounts of data • Access to files on various data stores such as HDFS and HBase 164
  • 163. When not to use Hive • Hive does NOT provide low latency or real time queries • Even querying small amounts of data may take minutes • Designed for scalability and ease-of-use rather than low latency responses 165
  • 164. Hive • Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster 166
• 165. Hive Metastore • To support features like schema(s) and data partitioning Hive keeps its metadata in a Relational Database – Packaged with Derby, a lightweight embedded SQL DB • The default Derby-based metastore is good for evaluation and testing • Schema is not shared between users as each user has their own instance of embedded Derby • Stored in metastore_db directory which resides in the directory that hive was started from – Can easily switch to another SQL installation such as MySQL 167
  • 167. 1: Create a Table • Let’s create a table to store data from $PLAY_AREA/data/user-posts.txt 169
  • 168. 1: Create a Table 170
  • 169. 2: Load Data Into a Table 171
• 172. Databases and DB Connectivity • HBase: column oriented database that runs on HDFS. • Sqoop: a tool designed to import data from relational databases into HDFS or Hive. 174
  • 173. HBase • Distributed column-oriented database built on top of HDFS, providing Big Table-like capabilities for Hadoop 175
  • 174. When do we use HBase? • Huge volumes of randomly accessed data. • HBase is at its best when it’s accessed in a distributed fashion by many clients. • Consider HBase when you’re loading data by key, searching data by key (or range), serving data by key, querying data by key or when storing data by row that doesn’t conform well to a schema. 176
• 175. When not to use HBase • HBase doesn’t use SQL, doesn’t have an optimizer, and doesn’t support transactions or joins. • If you need those things, you probably can’t use HBase 177
• 176. HBase Example Example: create 'blogposts', 'post', 'image' ---create table put 'blogposts', 'id1', 'post:title', 'Hello World' ---insert value put 'blogposts', 'id1', 'post:body', 'This is a blog post' ---insert value put 'blogposts', 'id1', 'image:header', 'image1.jpg' ---insert value get 'blogposts', 'id1' ---select records 178
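The same operations can also be issued from Java; this is a hedged sketch using the HBase 1.x client API with the table and column names from the shell example above:
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  try (Connection connection = ConnectionFactory.createConnection(conf);
       Table table = connection.getTable(TableName.valueOf("blogposts"))) {
    Put put = new Put(Bytes.toBytes("id1"));                      // row key
    put.addColumn(Bytes.toBytes("post"), Bytes.toBytes("title"),  // column family and qualifier
                  Bytes.toBytes("Hello World"));
    table.put(put);                                               // insert value
    Result result = table.get(new Get(Bytes.toBytes("id1")));     // select record
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("post"), Bytes.toBytes("title"))));
  }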
• 177. Sqoop • Sqoop is a command line tool for moving data from RDBMS to Hadoop • Uses a MapReduce program or Hive to load the data • Can also export data from HDFS back to RDBMS • Comes with connectors to MySQL, PostgreSQL, Oracle, SQL Server and DB2. Example: $bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --hive-import $bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemData 179
  • 178. Improving Hadoop – More useful tools • For improving coordination: Zookeeper • For Improving log collection: Flume • For improving scheduling/orchestration: Oozie • For Monitoring: Chukwa • For Improving UI: Hue 180
• 179. ZooKeeper • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services • It allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system • ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions 181
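A hedged sketch of that shared namespace using the ZooKeeper Java client; the host, port and znode path are made up for illustration:
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> { });  // connect string, session timeout, watcher
  zk.create("/app/config", "v1".getBytes(),                         // create a znode that holds a small value
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  byte[] data = zk.getData("/app/config", false, null);             // read it back
  zk.close();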
  • 180. Flume • Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS • Flume maintains a central list of ongoing data flows, stored redundantly in Zookeeper. 182
• 181. Oozie • Oozie is a workflow scheduler system to manage Hadoop jobs • An Oozie workflow is a collection of actions arranged in a control dependency DAG that specifies the order in which the actions execute • The Oozie Coordinator system allows the user to define workflow execution based on time intervals or on demand 183
• 182. Spark Fast and general MapReduce-like engine for large-scale data processing • Fast – In-memory data storage for very fast interactive queries; up to 100 times faster than Hadoop • General – Unified platform that can combine: SQL, Machine Learning, Streaming, Graph & Complex analytics • Ease of use – Can be developed in Java, Scala or Python • Integrated with Hadoop – Can read from HDFS, HBase, Cassandra, and any Hadoop data source. 184
  • 183. Spark is the Most Active Open Source Project in Big Data 185
  • 185. Key Concepts Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save) Write programs in terms of transformations on distributed datasets 187
• 186. Unified Platform • Continued innovation bringing new functionality, e.g.: • Java 8 (Closures, Lambda Expressions) • Spark SQL (SQL on Spark, not just Hive) • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark) 188
• 188. Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase • Can also read from any other Hadoop data source. 190
  • 189. Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel – Parallelized Collection: Scala collection which is run in parallel – Hadoop Dataset: records of files supported by Hadoop 191
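A minimal sketch using Spark's Java API showing both ways to create an RDD and a transformation followed by an action; the file path and application name are illustrative:
  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  SparkConf conf = new SparkConf().setAppName("RddExamples");
  JavaSparkContext sc = new JavaSparkContext(conf);

  // Parallelized collection
  JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
  long evens = numbers.filter(n -> n % 2 == 0).count();            // transformation, then an action

  // Hadoop dataset (any path HDFS can serve)
  JavaRDD<String> lines = sc.textFile("hdfs:///user/sample/hamlet.txt");
  long withThe = lines.filter(line -> line.contains("the")).count();

  sc.close();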
• 191. Hadoop cluster • Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!) 193
  • 192. Big Data and NoSQL
• 193. The Challenge • We want scalable, durable, high volume, high velocity, distributed data storage that can handle non-structured data and that will fit our specific needs • RDBMS is too generic and doesn’t cut it any more – it can do the job but it is not cost effective for our usage 195
• 194. The Solution: NoSQL • Let’s take some parts of the standard RDBMS out and design the solution around our specific uses • NoSQL databases have been around for ages under different names/solutions 196
• 195. Example Comparison: RDBMS vs. Hadoop (Typical Traditional RDBMS vs. Hadoop) – Data Size: Gigabytes vs. Petabytes – Access: Interactive and Batch vs. Batch – NOT Interactive – Updates: Read/Write many times vs. Write once, Read many times – Structure: Static Schema vs. Dynamic Schema – Scaling: Nonlinear vs. Linear – Query Response Time: Can be near immediate vs. Has latency (due to batch processing) 197
• 196. Hadoop and Relational Database – Best when used together • Hadoop best used for: Structured or Not (Flexibility), Scalability of Storage/Compute, Complex Data Processing, Cheaper compared to RDBMS • Relational Database best used for: Interactive OLAP Analytics (<1sec), Multistep Transactions, 100% SQL Compliance 198
  • 197. The NOSQL Movement • NOSQL is not a technology – it’s a concept. • We need high performance, scale out abilities or an agile structure. • We are now willing to sacrifice our sacred cows: consistency, transactions. • Over 200 different brands and solutions (http://nosql-database.org/). 199
  • 198. NoSQL, NOSQL or NewSQL • NoSQL is not No to SQL • NoSQL is not Never SQL • NOSQL = Not Only SQL 200
  • 199. Why NoSQL? • Some applications need very few database features, but need high scale. • Desire to avoid data/schema pre-design altogether for simple applications. • Need for a low-latency, low-overhead API to access data. • Simplicity -- do not need fancy indexing – just fast lookup by primary key. 201
  • 200. Why NoSQL? (cont.) • Developer friendly, DBAs not needed (?). • Schema-less. • Agile: non-structured (or semi-structured). • In Memory. • No (or loose) Transactions. • No joins. 202
• 202. Is NoSQL a RDBMS Replacement? NO Well... Sometimes it is… 204
• 203. RDBMS vs. NoSQL • Rationale for choosing a persistent store (Relational Architecture vs. NoSQL Architecture): – High value, high density, complex data vs. Low value, low density, simple data – Complex data relationships vs. Very simple relationships – Schema-centric vs. Schema-free, unstructured or semistructured data – Designed to scale up & out vs. Distributed storage and processing – Lots of general purpose features/functionality vs. Stripped down, special purpose data store – High overhead ($ per operation) vs. Low overhead ($ per operation) 205
• 205. Scalability • NoSQL is sometimes very easy to scale out • Most have dynamic data partitioning and easy data distribution • But distributed systems always come with a price: the CAP Theorem and its impact on ACID transactions 207
  • 206. ACID Transactions Most DBMS are built with ACID transactions in mind: • Atomicity: All or nothing, performs write operations as a single transaction • Consistency: Any transaction will take the DB from one consistent state to another with no broken constraints, ensures replicas are identical on different nodes • Isolation: Other operations cannot access data that has been modified during a transaction that has not been completed yet • Durability: Ability to recover the committed transaction updates against any kind of system failure (transaction log) 208
  • 207. ACID Transactions (cont.) • ACID is usually implemented by a locking mechanism/manager • Distributed systems central locking can be a bottleneck in that system • Most NoSQL does not use/limit the ACID transactions and replaces it with something else… 209
  • 208. • The CAP theorem states that in a distributed/partitioned application, you can only pick two of the following three characteristics: – Consistency. – Availability. – Partition Tolerance. CAP Theorem 210
  • 210. NoSQL BASE • NoSQL usually provide BASE characteristics instead of ACID. BASE stands for: – Basically Available – Soft State – Eventual Consistency • It means that when an update is made in one place, the other partitions will see it over time - there might be an inconsistency window • read and write operations complete more quickly, lowering latency 212
• 213. NoSQL Taxonomy • Key-Value Store • Document Store • Column Store • Graph Store (examples for each type on the following slides) 215
  • 215. Key Value Store • Distributed hash tables. • Very fast to get a single value. • Examples: – Amazon DynamoDB – Berkeley DB – Redis – Riak – Cassandra 217
  • 216. Document Store • Similar to Key/Value, but value is a document • JSON or something similar, flexible schema • Agile technology • Examples: – MongoDB – CouchDB – CouchBase 218
• 217. Column Store • One key, multiple attributes • Hybrid row/column • Examples: – Google BigTable – HBase – Amazon’s SimpleDB – Cassandra 219
• 218. How Records are Organized? • This is a logical table in RDBMS systems • Its physical organization is just like the logical one: column by column, row by row (diagram: a grid of Rows 1–4 by Columns 1–4) 220
• 219. Query Data • When we query data, records are read in the order they are organized in the physical structure • Even when we query a single column, we still need to read the entire table and extract the column (diagram: Select Col2 From MyTable and Select * From MyTable over the same Rows 1–4 / Columns 1–4 grid) 221
• 220. How Does Column Store Keep Data? (diagram: organization in row store vs. organization in column store for Select Col2 From MyTable) 222
• 221. Graph Store • Inspired by Graph Theory • Data model: Nodes, relationships, properties on both • Relational databases have a very hard time representing a graph • Examples: – Neo4j – InfiniteGraph – RDF stores 223
• 222. What is a Graph • An abstract representation of a set of objects where some pairs are connected by links. • Object (Vertex, Node) – can have attributes like name and value • Link (Edge, Arc, Relationship) – can have attributes like type and name or date 224
• 223. Graph Types (diagrams): Undirected Graph, Directed Graph, Pseudo Graph, Multi Graph 225
• 224. More Graph Types (diagrams): Weighted Graph (edges carry weights, e.g. 10), Labeled Graph (edges carry labels, e.g. “Like”), Property Graph (nodes and edges carry properties, e.g. friend, date 2015; Name:yosi, Age:40; Name:ami, Age:30) 226
  • 227. Q&A
  • 228. Conclusion • The Challenge of Big Data • Hadoop Basics: HDFS, MapReduce and YARN • Improving Hadoop and Tools • NoSQL and RDBMS 230