STORM
    COMPARISON – INTRODUCTION – CONCEPTS




PRESENTATION BY KASPER MADSEN
MARCH - 2012
HADOOP                                 VS              STORM
Batch processing                                       Real-time processing
Jobs run to completion                                 Topologies run forever
JobTracker is a SPOF*                                  No single point of failure
Stateful nodes                                         Stateless nodes
Scalable                                               Scalable
Guarantees no data loss                                Guarantees no data loss
Open source                                            Open source


* Hadoop 0.21 added some checkpointing
  SPOF: Single Point Of Failure
COMPONENTS
     Nimbus daemon is comparable to the Hadoop JobTracker. It is the master.
     Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker.
     Worker is spawned by the supervisor, one per port defined in the storm.yaml configuration
     (see the sketch below).
     Task is run as a thread inside a worker.
     Zookeeper* is a distributed system used to store metadata. The Nimbus and
     Supervisor daemons are fail-fast and stateless. All state is kept in Zookeeper.




          Notice that all communication between Nimbus and
          the Supervisors is done through Zookeeper.

       On a cluster with 2k+1 Zookeeper nodes, the system
       can recover when at most k nodes fail.




* Zookeeper is an Apache top-level project
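
Each worker slot corresponds to one entry in supervisor.slots.ports. A minimal storm.yaml might look roughly like this (a sketch only; the host names are made up, the keys are the standard 2012-era ones):

     storm.zookeeper.servers:
         - "zk1.example.com"
         - "zk2.example.com"
         - "zk3.example.com"
     nimbus.host: "nimbus.example.com"
     supervisor.slots.ports:      # one worker is spawned per port listed here
         - 6700
         - 6701
         - 6702
         - 6703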
STREAMS
A stream is an unbounded sequence of tuples.
A topology is a graph where each node is a spout or a bolt, and the edges indicate
which bolts subscribe to which streams.
•   A spout is a source of a stream
•   A bolt consumes a stream (and possibly emits a new one)
•   An edge represents a grouping

[Diagram: two spouts are the sources of streams A and B; one bolt subscribes to A and emits C,
another subscribes to A and emits D, a third subscribes to C & D, and a fourth subscribes to A & B.]
GROUPINGS
Each spout or bolt runs X instances in parallel (called tasks).
Groupings are used to decide which task in the subscribing bolt a tuple is sent to.
Shuffle grouping     is a random grouping
Fields grouping      groups by value, so that equal values go to the same task (see the sketch below)
All grouping         replicates to all tasks
Global grouping      makes all tuples go to one task
None grouping        makes the bolt run in the same thread as the bolt/spout it subscribes to
Direct grouping      the producer (the task that emits) controls which consumer task receives the tuple
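
As a rough illustration of shuffle versus fields grouping in the builder API (a sketch only; WordCountBolt is a hypothetical bolt, and the 2012-era backtype.storm API is assumed):

     TopologyBuilder builder = new TopologyBuilder();
     builder.setSpout("words", new TestWordSpout(), 10);

     // Shuffle grouping: each "words" tuple goes to a random ExclamationBolt task
     builder.setBolt("exclaim", new ExclamationBolt(), 3)
            .shuffleGrouping("words");

     // Fields grouping: tuples with the same value of the "word" field always go to
     // the same WordCountBolt task, so per-word state stays on one task
     builder.setBolt("count", new WordCountBolt(), 4)
            .fieldsGrouping("words", new Fields("word"));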
    EXAMPLE
Topology: TestWordSpout → ExclamationBolt ("exclaim1") → ExclamationBolt ("exclaim2")

     TopologyBuilder builder = new TopologyBuilder();

     // Create the spout "words", running 10 tasks
     builder.setSpout("words", new TestWordSpout(), 10);

     // Create the bolt "exclaim1", running 3 tasks,
     // subscribing to stream "words" using shuffle grouping
     builder.setBolt("exclaim1", new ExclamationBolt(), 3)
            .shuffleGrouping("words");

     // Create the bolt "exclaim2", running 2 tasks,
     // subscribing to stream "exclaim1" using shuffle grouping
     builder.setBolt("exclaim2", new ExclamationBolt(), 2)
            .shuffleGrouping("exclaim1");

A bolt can subscribe to an unlimited number of streams by chaining groupings.

The source code for this example is part of the storm-starter project on GitHub.
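
The slide stops at building the topology; submitting it might look roughly like this (a sketch assuming the 2012-era backtype.storm API used by storm-starter):

     Config conf = new Config();
     conf.setDebug(true);

     // run in-process for testing ...
     LocalCluster cluster = new LocalCluster();
     cluster.submitTopology("exclamation", conf, builder.createTopology());
     Utils.sleep(10000);
     cluster.shutdown();

     // ... or submit to a real cluster
     // conf.setNumWorkers(3);
     // StormSubmitter.submitTopology("exclamation", conf, builder.createTopology());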

EXAMPLE – 1
TestWordSpout
public void nextTuple() {
     Utils.sleep(100);
     final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
     final Random rand = new Random();
     final String word = words[rand.nextInt(words.length)];
     _collector.emit(new Values(word));
}



The TestWordSpout emits a random string from the words array every 100 milliseconds.
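
nextTuple alone is not a complete spout; the pieces left off the slide might look roughly like this (a sketch modelled on the storm-starter version, 2012-era API assumed):

     SpoutOutputCollector _collector;

     // open is called once when the spout task is initialized
     public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
          _collector = collector;
     }

     // declares that the spout emits one-field tuples named "word"
     public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word"));
     }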

EXAMPLE – 2
ExclamationBolt

OutputCollector _collector;

// prepare is called when the bolt is created
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      _collector = collector;
}

// execute is called for each tuple
public void execute(Tuple tuple) {
     _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
     _collector.ack(tuple);
}

// declareOutputFields is called when the bolt is created
public void declareOutputFields(OutputFieldsDeclarer declarer) {
     declarer.declare(new Fields("word"));
}


declareOutputFields is used to declare streams and their schemas. It is possible to
declare several streams and to specify which stream to use when emitting a tuple
(see the sketch below).
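
A sketch of what declaring and emitting to a second, named stream could look like (the "errors" stream name is made up; declareStream and the stream-id emit overload are the relevant calls):

public void declareOutputFields(OutputFieldsDeclarer declarer) {
     declarer.declare(new Fields("word"));                    // default stream
     declarer.declareStream("errors", new Fields("word"));    // an extra, named stream
}

public void execute(Tuple tuple) {
     String word = tuple.getString(0);
     if (word.isEmpty()) {
          _collector.emit("errors", tuple, new Values(word)); // emit to the "errors" stream
     } else {
          _collector.emit(tuple, new Values(word + "!!!"));   // emit to the default stream
     }
     _collector.ack(tuple);
}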
FAULT TOLERANCE
Zookeeper stores metadata in a very robust way
Nimbus and Supervisor are stateless and only need metadata from ZK to work/restart
When a node dies
   • The tasks will time out and be reassigned to other workers by Nimbus.
When a worker dies
     • The supervisor will restart the worker.
     • Nimbus will reassign the worker to another supervisor if no heartbeats are sent.
     • If that is not possible (no free ports), the tasks will be run on other workers in the
       topology. If more capacity is added to the cluster later, STORM will
       automatically initialize a new worker and spread out the tasks.
When Nimbus or a Supervisor dies
     •   Workers will continue to run
     •   Workers cannot be reassigned without Nimbus
     •   Nimbus and Supervisor should be run under a process monitoring tool, to
         restart them automatically if they fail.
AT-LEAST-ONCE PROCESSING
STORM guarantees at-least-once processing of tuples.
Message id is assigned to a tuple when it is emitted from a spout or bolt; it is a 64-bit number.
Tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple.
Ack is called on the spout when the tree of tuples for a spout tuple is fully processed.
Fail is called on the spout if one of the tuples in the tree of tuples fails, or the tree of
tuples is not fully processed within a specified timeout (default is 30 seconds).
It is possible to specify the message id when emitting a tuple. This might be useful for
replaying tuples from a queue (see the sketch below).




                   The ack/fail method is called when the tree of
                   tuples has been fully processed, or has
                   failed / timed out.
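
A sketch of how a spout could attach its own message ids and react to ack/fail (the queue client and its methods are hypothetical placeholders):

// in nextTuple(): attach the queue's message id when emitting
Message msg = queue.poll();                        // hypothetical queue client
if (msg != null) {
     _collector.emit(new Values(msg.body()), msg.getId());
}

// called by STORM when the tree of tuples for this id is fully processed
public void ack(Object msgId) {
     queue.delete(msgId);
}

// called by STORM when the tree of tuples fails or times out
public void fail(Object msgId) {
     queue.requeue(msgId);
}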
AT-LEAST-ONCE PROCESSING – 2
Anchoring is used to copy the spout tuple message id(s) to the new tuples
generated. In this way, every tuple knows the message id(s) of all its spout tuples.
Multi-anchoring is when multiple tuples are anchored. If the tuple tree fails, then
multiple spout tuples will be replayed. Useful for doing streaming joins and more.
Ack called from a bolt indicates the tuple has been processed as intended.
Fail called from a bolt replays the spout tuple(s).
Every tuple must be acked or failed, or the task will run out of memory at some point.




_collector.emit(tuple, new Values(word));    // uses anchoring

_collector.emit(new Values(word));           // does NOT use anchoring
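
A sketch of an execute method that anchors, acks, and explicitly fails on error (process(...) is a hypothetical helper):

public void execute(Tuple tuple) {
     try {
          String result = process(tuple.getString(0));   // hypothetical processing step
          _collector.emit(tuple, new Values(result));    // anchored emit
          _collector.ack(tuple);                         // tuple handled successfully
     } catch (Exception e) {
          _collector.fail(tuple);                        // the spout tuple(s) will be replayed
     }
}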
AT-LEAST-ONCE PROCESSING – 3
Acker tasks track the tree of tuples for every spout tuple
     •   The acker task responsible for a given spout tuple is determined by modulo
         on the message id. Since all tuples carry all spout tuple message ids, it is easy
         to call the correct acker tasks.
     •   The acker task stores a map; the format is {spoutMsgId, {spoutTaskId, ”ack val”}}
     •   The ”ack val” represents the state of the entire tree of tuples. It is the XOR of
         all tuple message ids created and acked in the tree of tuples.
     •   When the ”ack val” is 0, the tuple tree is fully processed.
     •   Since message ids are random 64-bit numbers, the chance of the ”ack val”
         becoming 0 by accident is extremely small.




               It is important to set the number of acker tasks in the topology when
               processing large amounts of tuples (it defaults to 1); see the sketch below.
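
Setting the acker parallelism when building the topology configuration might look like this (a sketch; setNumAckers is the 2012-era Config setter):

Config conf = new Config();
conf.setNumAckers(4);     // run 4 acker tasks instead of the default 1
StormSubmitter.submitTopology("exclamation", conf, builder.createTopology());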
AT-LEAST-ONCE PROCESSING – 4
    Example
[Diagram: a Spout (task 1) emits ”hey” with msgId 10 to a Bolt (task 2); that bolt then emits ”h”
(spoutIds: 10, msgId: 2) to a Bolt (task 3) and ”ey” (spoutIds: 10, msgId: 3) to a Bolt (task 4).]
Shows what happens in the acker task for one spout tuple. Format is: {spoutMsgId, {spoutTaskId, ”ack val”}}
(4-bit ids are shown for readability; in reality the ids are 64 bits)
1. After emit ”hey”: {10, {1, 0000 XOR 1010 = 1010}}
2. After emit ”h”:   {10, {1, 1010 XOR 0010 = 1000}}
3. After emit ”ey”:  {10, {1, 1000 XOR 0011 = 1011}}
4. After ack ”hey”:  {10, {1, 1011 XOR 1010 = 0001}}
5. After ack ”h”:    {10, {1, 0001 XOR 0010 = 0011}}
6. After ack ”ey”:   {10, {1, 0011 XOR 0011 = 0000}}
7. Since the ”ack val” is 0, the spout tuple with id 10 must be fully processed. Ack is called on the spout (task 1).
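
The same bookkeeping can be reproduced in a few lines of Java (an illustration of the principle only, not STORM's actual acker code; the toy 4-bit ids from the slide are used):

long ackVal = 0;
long hey = 0b1010, h = 0b0010, ey = 0b0011;   // toy ids; STORM uses random 64-bit ids

ackVal ^= hey;  ackVal ^= h;  ackVal ^= ey;   // the three tuples are created (emitted)
ackVal ^= hey;  ackVal ^= h;  ackVal ^= ey;   // the three tuples are acked

System.out.println(ackVal == 0);              // true: the tree is fully processed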
AT-LEAST-ONCE PROCESSING – 5
A tuple isn't acked because the task died:
The spout tuple(s) at the root of the tree of tuples will time out and be replayed.
Acker task dies:
All the spout tuples the acker was tracking will time out and be replayed.
Spout task dies:
In this case the source that the spout talks to is responsible for replaying the
messages. For example, queues like Kestrel and RabbitMQ will place all pending
messages back on the queue when a client disconnects.
AT-LEAST-ONCE PROCESSING – 6
At-least-once processing might process a tuple more than once.
Example
[Diagram: a Spout (task 1) connected with an all grouping to two Bolt tasks (2 and 3).]
1. A spout tuple is emitted to task 2 and 3
2. Worker responsible for task 3 fails
3. Supervisor restarts worker
4. Spout tuple is replayed and emitted to task 2 and 3
5. Task 2 will now have executed the same bolt twice




Consider why the all grouping is not important in this example
EXACTLY-ONCE-PROCESSING
Transactional topologies (TT) are an abstraction built on STORM primitives.
TT guarantees exactly-once processing of tuples.
Acking is optimized in TT; there is no need to do anchoring or acking manually.
Bolts execute as new instances per attempt at processing a batch.


Example
[Diagram: a Spout (task 1) connected with an all grouping to two Bolt tasks (2 and 3).]
1. A spout tuple is emitted to task 2 and 3
2. Worker responsible for task 3 fails
3. Supervisor restarts worker
4. Spout tuple is replayed and emitted to task 2 and 3
5. Task 2 and 3 initiate new bolt instances because of the new attempt
6. Now there is no problem
EXACTLY-ONCE-PROCESSING – 2
For efficiency, batch processing of tuples is introduced in TT.
A batch has two states: processing or committing.
Many batches can be in the processing state concurrently.
Only one batch can be in the committing state, and a strong ordering is imposed. That
means batch 1 will always be committed before batch 2, and so on.
Types of bolts for TT: BasicBolt, BatchBolt, BatchBolt marked as committer
BasicBolt processes one tuple at a time.
BatchBolt processes batches; finishBatch is called when all tuples of the batch have been executed.
BatchBolt marked as committer calls finishBatch only when the batch is in the
committing state (see the sketch below).
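
A sketch of what a BatchBolt marked as committer looked like in the 2012-era transactional API (modelled loosely on the TransactionalGlobalCount example in storm-starter; the class name is made up, and in practice the count would be persisted to a store together with the transaction id):

public static class GlobalCount extends BaseTransactionalBolt implements ICommitter {
     TransactionAttempt _attempt;
     BatchOutputCollector _collector;
     int _count = 0;

     public void prepare(Map conf, TopologyContext context,
                         BatchOutputCollector collector, TransactionAttempt attempt) {
          _collector = collector;
          _attempt = attempt;
     }

     public void execute(Tuple tuple) {
          _count++;                                   // called for every tuple in the batch
     }

     public void finishBatch() {
          // only called when the batch is in the committing state, in strict
          // transaction order, so the count can safely be written with the txid
          _collector.emit(new Values(_attempt, _count));
     }

     public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("id", "count"));
     }
}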
EXACTLY-ONCE-PROCESSING – 3
A transactional spout has the capability to replay exact batches of tuples.
[Diagram: a transactional spout feeding a pipeline of batch bolts A, B, C and D; bolts B and D are
marked as committers.]
BATCH IS IN PROCESSING STATE
Bolt A:   execute method is called for all tuples received from spout
          finishBatch is called as soon as the full batch has been received
Bolt B:   execute method is called for all tuples received from bolt A
          finishBatch is NOT called because batch is in processing state
Bolt C:   execute method is called for all tuples received from bolt A (and B)
          finishBatch is NOT called, because bolt B has not called finishBatch
Bolt D:   execute method is called for all tuples received from bolt C
          finishBatch is NOT called because batch is in processing state
BATCH CHANGES TO COMMITTING STATE
Bolt B:   finishBatch is called
Bolt C:   finishBatch is called, because we know we got all tuples from Bolt B now
Bolt D:   finishBatch is called, because we know we got all tuples from Bolt C now
EXACTLY-ONCE-PROCESSING – 4
[Diagram: internally, the transactional spout consists of a coordinator (a regular spout with a
parallelism of 1, defining the streams ”batch” & ”commit”) and emitter tasks (a regular bolt with
a parallelism of P), connected with all groupings on the batch stream.]

When a batch should enter the processing state:
•  The coordinator emits a tuple with the TransactionAttempt and the metadata for that
   transaction to the "batch" stream.
•  All emitter tasks receive the tuple and begin to emit their portion of tuples for
   the given batch.

When the processing phase of the batch is done (determined by the acker task):
•  Ack gets called on the coordinator.

When ack gets called on the coordinator and all prior transactions have committed:
•  The coordinator emits a tuple with the TransactionAttempt to the commit stream.
•  All bolts which are marked as committers subscribe to the commit stream of the
   coordinator using an all grouping.
•  Bolts marked as committers now know the batch is in the committing phase.

When the batch is fully processed again (determined by the acker task):
•  Ack gets called on the coordinator.
•  The coordinator now knows the batch is committed.
STORM LIBRARIES
STORM uses a lot of libraries. The most prominent are
Clojure     a new Lisp programming language. Crash-course follows
Jetty       an embedded web server, used to host the Nimbus UI
Kryo        a fast serializer, used when sending tuples
Thrift      a framework for building services. Nimbus is a Thrift daemon
ZeroMQ      a very fast transport layer
Zookeeper   a distributed system for storing metadata
LEARN MORE
Wiki (https://github.com/nathanmarz/storm/wiki)
Storm-starter (https://github.com/nathanmarz/storm-starter)
Mailing list (http://groups.google.com/group/storm-user)
#storm-user room on freenode




