"Hadoop Ecosystem and Low Latency Streaming Architecture" was presented by Vijay Mandava and Lan Jiang to the Detroit Java User Group on 3/23/2015. It gives a basic introduction to the Hadoop ecosystem and then focuses on low-latency streaming architecture, including frameworks such as Flume, Kafka and Storm.
1. Hadoop Ecosystem and Low Latency Streaming Architecture
InSemble Inc.
http://www.insemble.com
2. Agenda
1. What is Big Data and why is it relevant?
2. Hadoop Ecosystem
3. Reference Architecture for Low Latency Streaming
4. Flume, Kafka and Storm
5. Demo
3. Big Data Definitions
• Wikipedia defines it as “Data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time”
• Gartner defines it as data with the following characteristics:
– High Velocity
– High Variety
– High Volume
• Another definition: “Big Data is large-volume, unstructured data which cannot be handled by traditional database management systems”
4. Why a game changer
• Schema on Read
– Data is interpreted at processing time
– Keys and values are not intrinsic properties of the data; they are chosen by the person analyzing it
• Move code to data
– With traditional systems we bring data to the code, and I/O becomes a bottleneck
– With home-grown distributed systems, we have to deal with our own checkpointing/recovery
• More data beats better algorithms
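The schema-on-read idea can be sketched in plain Java: raw lines are stored untouched, and a structure is imposed only when someone processes them. The log format and field positions below are made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class SchemaOnRead {
    // Raw lines are stored exactly as they arrived; no schema was
    // enforced at write time (hypothetical log format for illustration).
    static final List<String> RAW = List.of(
        "2015-03-23 10:01 login alice",
        "2015-03-23 10:02 login bob",
        "2015-03-23 10:05 logout alice");

    // One analyst's choice of key/value, applied at read time:
    // treat field 3 as an "action" and count events per action.
    static Map<String, Long> countByAction(List<String> lines) {
        return lines.stream()
            .map(l -> l.split(" ")[2])   // field position chosen by the analyst
            .collect(Collectors.groupingBy(a -> a, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countByAction(RAW));
    }
}
```

A different analyst could just as easily group the same raw lines by user or by hour; nothing about the storage forces one interpretation.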
5. Enterprise Relevance
• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture
7. Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
8. Big Data Journey
(Figure: a maturity curve showing business value growing over time as the journey progresses through five stages)
• Stages: Aware of Benefits (scout for opportunities) → Execute (pilot project) → Expand (multiple use cases) → Managed (governance model) → Optimized (core competency)
• Capabilities gained along the journey:
➢ Evaluate business benefits; understand the ecosystem; identify a platform
➢ Security standards defined; governance standards defined; integrated with the enterprise
➢ Consolidated analytics; ETL; time constraints
➢ Ad hoc data exploration; batch, interactive and real-time use cases; predictive analytics and machine learning
➢ Real-time insight from all channels; IT as a key differentiator for your business; perfect alignment of business and IT
10. Flume Architecture
• Distributed system for collecting and aggregating data from many sources into a centralized data store
• An Agent is a JVM process that hosts the Flume components
• A Channel stores events until they are picked up by a Sink
• Many different types of Flume sources are available
• Source and Sink are decoupled
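A Flume agent is wired together in a properties file. The sketch below is a minimal illustrative configuration using Flume's built-in netcat source, memory channel and logger sink; the agent and component names (a1, r1, c1, k1) are arbitrary.

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory until the sink takes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events (a real pipeline would use an HDFS or Kafka sink)
a1.sinks.k1.type = logger

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Because the source writes to the channel and the sink reads from it independently, swapping the logger sink for another sink type does not require touching the source.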
13. Kafka Introduction
• Messaging System which is distributed, partitioned and replicated
• Kafka brokers run as a cluster
• Producers and Consumers can be written in any language
14. Topic
• An ordered, immutable sequence of messages, each assigned a sequential id (“offset”)
• Messages are retained for a configurable period of time
• The consumer controls its own offset, i.e. its position in the log
• Each partition is replicated and has one “leader” and zero or more “followers”; reads and writes go only through the leader
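The partition model above can be illustrated with a toy in-memory log, sketched here in plain Java (this simulates the concept; it is not the Kafka client API). The key point is that the broker side only appends, while each consumer tracks its own offset and can rewind at will.

```java
import java.util.*;

public class PartitionLog {
    // Toy model of one Kafka partition: an append-only, offset-indexed log.
    final List<String> log = new ArrayList<>();

    long append(String msg) {        // the "broker" assigns the next offset
        log.add(msg);
        return log.size() - 1;
    }

    // The consumer, not the broker, tracks its own position.
    static class Consumer {
        long offset = 0;

        String poll(PartitionLog p) {
            return offset < p.log.size() ? p.log.get((int) offset++) : null;
        }

        void seek(long o) { offset = o; }  // replaying is just moving the offset
    }
}
```

Retention in real Kafka means old entries eventually disappear from the front of the log, but within the retained window a consumer can seek back and reprocess without any broker-side bookkeeping.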
15. Producers and Consumers
• The producer controls which partition each message goes to
• Supports both queuing and pub/sub via an abstraction called the consumer group
• Ordering is guaranteed within a partition
– Total ordering for a subscriber requires that only one consumer reads from that partition
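Since the producer picks the partition, keying messages gives per-key ordering for free: every message with the same key lands in the same partition. A minimal sketch of such key-based routing (mirroring the idea of a hash partitioner, not Kafka's exact implementation):

```java
public class KeyPartitioner {
    // Route a message to a partition by hashing its key. All messages
    // with the same key map to the same partition, so their relative
    // order is preserved for any consumer of that partition.
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

A producer that keys messages by, say, user id therefore guarantees each user's events are consumed in order, even though there is no ordering across partitions.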
16. Storm Introduction
• Distributed real-time computation system
– Processes unbounded streams of data
– Can use multiple programming languages
– Scalable, fault-tolerant, and guarantees that data will be processed
• Use Cases
– Real-time analytics, online machine learning
– Continuous computation
– Distributed RPC
– ETL
• Concepts
– Topology
– Spouts
– Bolts
17. Concepts
• Storm Cluster
– Master node (Nimbus)
• Distributing code
• Assigns tasks to machines
• Monitors for failures
– Worker nodes (Supervisors)
• Starts/stops worker processes
• Each worker process executes subset of a topology
– Zookeeper
• Coordinates between Nimbus and Supervisors
• Nimbus and Supervisors completely stateless
• State maintained by Zookeeper or local disks
18. Details
• Stream
– Unbounded sequence of tuples
• Spout (user-written logic)
– Source of a stream; emits tuples
• Bolt (user-written logic)
– Processes streams and emits tuples
• Topology
– DAG of spouts and bolts
– Submit a topology to a Storm cluster
– Each node runs in parallel and parallelism is controlled
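The spout → bolt → bolt DAG can be sketched as a single-process simulation in plain Java (this illustrates the data flow only; a real Storm topology distributes these components across worker processes). The classic example is word count, with names invented here for illustration:

```java
import java.util.*;

public class MiniTopology {
    // Simulated topology: a sentence spout feeds a split bolt,
    // which feeds a stateful word-count bolt.
    static Map<String, Integer> run(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {            // spout: emits sentence tuples
            for (String word : sentence.split(" ")) {  // split bolt: emits word tuples
                counts.merge(word, 1, Integer::sum);   // count bolt: keeps running state
            }
        }
        return counts;
    }
}
```

In Storm each of these stages would run as many parallel task instances, which is exactly why the framework needs stream groupings (next slide) to decide which task receives which tuple.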
19. Stream groupings
• Tells a topology how to send tuples between two components
• Since tasks are executed in parallel, how do we control which tasks the tuples are sent to?
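Two common answers can be sketched as routing functions (a simplified model of the idea, not Storm's internal code): a fields grouping routes by the hash of a chosen field, so all tuples with the same field value reach the same task, while a shuffle grouping spreads tuples evenly across tasks.

```java
public class Groupings {
    // Fields grouping: hash the grouping field to pick the task, so a
    // stateful bolt (e.g. a word counter) sees all tuples for one value.
    static int fieldsGrouping(String fieldValue, int numTasks) {
        return (fieldValue.hashCode() & 0x7fffffff) % numTasks;
    }

    // Shuffle grouping: distribute tuples across tasks regardless of
    // content (modeled here as simple round-robin by tuple index).
    static int shuffleGrouping(int tupleIndex, int numTasks) {
        return tupleIndex % numTasks;
    }
}
```

For word count, the split bolt can use shuffle grouping (any task may split any sentence) while the count bolt must use fields grouping on the word, or different tasks would hold conflicting counts for the same word.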
22. Demo - Twitter TopN Trending Topic
• Use the Flume Twitter source to ingest data and publish events to a Kafka topic
• Use Kafka as the messaging backbone
• Use Storm as a real-time event processing system to calculate the TopN trending topics
• Use Redis to store the TopN result
• Use Node.js/jQuery for visualization
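The TopN computation at the heart of the demo can be sketched in plain Java: count hashtag occurrences and keep the N most frequent (a simplified batch version; the demo's Storm bolts would maintain these counts incrementally over a sliding window).

```java
import java.util.*;
import java.util.stream.*;

public class TopNTopics {
    // Count each hashtag's occurrences, then return the N most
    // frequent tags in descending order of count.
    static List<String> topN(List<String> hashtags, int n) {
        return hashtags.stream()
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

In the demo pipeline this result would be written to Redis, where the Node.js front end polls it for display.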