Spark: Making Big Data Analytics Interactive and Real-Time

Matei Zaharia, in collaboration with
Mosharaf Chowdhury, Tathagata Das, Timothy Hunter,
Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley,
Michael Franklin, Scott Shenker, Ion Stoica

spark-project.org                                UC BERKELEY
Overview

Spark is a parallel framework that provides:
  » Efficient primitives for in-memory data sharing
  » Simple APIs in Scala, Java, SQL
  » High generality (applicable to many emerging problems)

This talk will cover:
  » What it does
  » How people are using it (including some surprises)
  » Current research
Motivation

MapReduce simplified data analysis on large, unreliable clusters

But as soon as it got popular, users wanted more:
  » More complex, multi-pass applications (e.g. machine learning, graph algorithms)
  » More interactive ad-hoc queries
  » More real-time stream processing

One reaction: specialized models for some of these apps (e.g. Pregel, Storm)
Motivation

Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks:

        Efficient primitives for data sharing
Examples

[Diagram: an iterative job — each of iter. 1, iter. 2, … reads its input from HDFS and writes its output back to HDFS]

[Diagram: interactive queries — each of query 1, query 2, query 3, … re-reads the input from HDFS to produce its result]

        Slow due to replication and disk I/O,
          but necessary for fault tolerance
Goal: Sharing at Memory Speed

[Diagram: the same iterative job and interactive queries, but the input is processed one time into memory and shared across iterations and queries]

10-100× faster than network/disk, but how to get FT?
Challenge

How to design a distributed memory abstraction
  that is both fault-tolerant and efficient?
Existing Systems

Existing in-memory storage systems have interfaces based on fine-grained updates
  » Reads and writes to cells in a table
  » E.g. databases, key-value stores, distributed memory

Requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
  » 10-100× slower than memory write…
Solution: Resilient Distributed Datasets (RDDs)

Provide an interface based on coarse-grained operations (map, group-by, join, …)

Efficient fault recovery using lineage
  » Log one operation to apply to many elements
  » Recompute lost partitions on failure
  » No cost if nothing fails
RDD Recovery

[Diagram: the same in-memory iterative job and one-time-processed queries as before; a lost in-memory partition is rebuilt by rerunning the operations that produced it]
Generality of RDDs

RDDs can express surprisingly many parallel algorithms
  » These naturally apply same operation to many items

Capture many current programming models
  » Data flow models: MapReduce, Dryad, SQL, …
  » Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …

Allow these models to be composed
Outline

Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Spark Programming Interface

Language-integrated API in Scala*

Provides:
  » Resilient distributed datasets (RDDs)
      • Partitioned collections with controllable caching
  » Operations on RDDs
      • Transformations (define RDDs), actions (compute results)
  » Restricted shared variables (broadcast, accumulators)

*Also Java, SQL and soon Python
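The transformation/action split can be sketched in plain Python. `ToyRDD` below is a made-up stand-in for illustration, not Spark's API: transformations only record a function on top of the parent dataset, and nothing executes until an action forces evaluation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    actions trigger evaluation. Hypothetical sketch, not Spark's API."""
    def __init__(self, compute):
        self._compute = compute          # thunk producing this dataset

    # --- transformations: define a new ToyRDD, do no work yet ---
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):
        return ToyRDD(lambda: [x for x in self._compute() if p(x)])

    # --- actions: actually compute a result ---
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

data = ToyRDD(lambda: list(range(10)))
evens = data.filter(lambda x: x % 2 == 0)    # nothing computed yet
print(evens.map(lambda x: x * x).collect())  # action runs the whole pipeline
```

Because transformations compose thunks instead of materializing results, the runtime is free to pipeline them and to recompute them on failure.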
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // Base RDD
errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.persist()

cachedMsgs.filter(_.contains("foo")).count      // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its block (Block 1-3), caches its partition of messages (Msgs. 1-3) in memory, and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.:  messages = textFile(...).filter(_.contains("error"))
                               .map(_.split('\t')(2))

  HadoopRDD            →   FilteredRDD              →   MappedRDD
  path = hdfs://…          func = _.contains(...)       func = _.split(…)
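The recovery mechanism can be sketched in a few lines of plain Python (hypothetical names, not Spark's internals): each stage is a pure function of its parent, so a lost cached partition is rebuilt by re-reading just that partition from stable storage and reapplying the lineage.

```python
# Stable storage: partition id -> raw lines (can always be re-read)
source = {0: ["ERROR\tdisk\tfull", "INFO\tok\tfine"],
          1: ["ERROR\tnet\tdown"]}

def text_file(pid):                # base RDD: re-read from stable storage
    return source[pid]

def filtered(pid):                 # lineage step: textFile -> filter
    return [l for l in text_file(pid) if l.startswith("ERROR")]

def messages(pid):                 # lineage step: filtered -> map
    return [l.split("\t")[2] for l in filtered(pid)]

cache = {0: messages(0), 1: messages(1)}   # cached partitions in memory
del cache[1]                               # simulate losing one partition

lost = 1
cache[lost] = messages(lost)               # recompute only the lost partition
print(cache[1])
```

No replication was needed: the lineage (two small functions) is the only thing logged.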
Fault Recovery Results

[Chart: iteration time (s) over 10 iterations; the first iteration takes 119 s, iterations without failures take 56-59 s, and the iteration where the failure happens takes 81 s while lost partitions are recomputed]
Example: Logistic Regression

Goal: find best line separating two sets of points

[Diagram: two clusters of + and − points, a random initial line, and the target separating line]
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).persist()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
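The same gradient loop in plain Python on a toy 1-D dataset (no Spark; invented data, a sketch of the math only):

```python
from math import exp

# Toy points: x is a 1-D feature, y in {-1, +1}; separable at x = 0
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

w = 0.0
for _ in range(100):
    # Same per-point term as the Spark version:
    # (1 / (1 + exp(-y * (w * x))) - 1) * y * x, summed over all points
    gradient = sum((1 / (1 + exp(-y * (w * x))) - 1) * y * x
                   for x, y in data)
    w -= gradient

# sign(w * x) now classifies every point correctly
assert all((w * x > 0) == (y > 0) for x, y in data)
print("Final w:", w)
```

In Spark the `data.map(...).reduce(_ + _)` runs the same per-point term in parallel over a cached dataset, which is why iterations after the first are so cheap.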
Logistic Regression Performance

[Chart: running time (min) vs number of iterations (1-30); Hadoop takes 110 s per iteration, while Spark takes 80 s for the first iteration and 6 s for further iterations]
Example: Collaborative Filtering

Goal: predict users' movie ratings based on past ratings of other movies

             1  ?  ?  4  5  ?  3
             ?  ?  3  5  ?  ?  3
      R  =   5  ?  5  ?  ?  ?  1      (rows: users, columns: movies)
             4  ?  ?  ?  ?  2  ?
Model and Algorithm

Model R as product of user and movie feature matrices A and B of size U×K and M×K

      R  =  A · Bᵀ

Alternating Least Squares (ALS)
  » Start with random A & B
  » Optimize user vectors (A) based on movies
  » Optimize movie vectors (B) based on users
  » Repeat until converged
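A rank-1 (K = 1) version of ALS in plain Python makes the alternation concrete; with K = 1 and a fully observed toy matrix, each least-squares step has the closed form aᵢ = Σⱼ rᵢⱼ bⱼ / Σⱼ bⱼ². Sketch only, with invented data:

```python
# Toy fully observed ratings, rank-1 by construction: R = [1,2] outer [2,4,6]
R = [[2, 4, 6],
     [4, 8, 12]]
U, M = 2, 3
A = [1.0] * U          # "random" init (fixed here for determinism)
B = [1.0] * M

for _ in range(20):
    # optimize user factors given movie factors (closed-form least squares)
    A = [sum(R[i][j] * B[j] for j in range(M)) / sum(b * b for b in B)
         for i in range(U)]
    # optimize movie factors given user factors
    B = [sum(R[i][j] * A[i] for i in range(U)) / sum(a * a for a in A)
         for j in range(M)]

err = max(abs(A[i] * B[j] - R[i][j]) for i in range(U) for j in range(M))
assert err < 1e-6   # A · B^T reconstructs the rank-1 matrix
```

The real algorithm solves a K-dimensional least-squares problem per row and skips missing entries, but the alternate-and-repeat structure is exactly this.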
Serial ALS

var R = readRatingsMatrix(...)

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U): a Range object
  B = (0 until M).map(i => updateMovie(i, A, R))
}
Naïve Spark ALS

var R = readRatingsMatrix(...)

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R re-sent to all nodes in each iteration
Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as a broadcast variable

Result: 3× performance improvement
Scaling Up Broadcast

[Charts: iteration time (s), broken into communication and computation, vs number of machines (10-90); with the initial HDFS-based broadcast, communication time grows steeply with cluster size, while Cornet P2P broadcast keeps it small]

[Chowdhury et al, SIGCOMM 2011]
Other RDD Operations

Transformations            map            flatMap
(define a new RDD)         filter         union
                           sample         join
                           groupByKey     cogroup
                           reduceByKey    cross
                           sortByKey      …

Actions                    collect
(return a result to        reduce
driver program)            count
                           save
                           …
Spark in Java

Scala:

lines.filter(_.contains("error")).count()

Java:

JavaRDD<String> lines = sc.textFile(...);

lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Spark in Python (Coming Soon!)

lines = sc.textFile(sys.argv[1])

counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
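For intuition, the same word count expressed with plain Python built-ins (no Spark, invented input):

```python
from collections import Counter

lines = ["to be or not", "to be"]

# flatMap: split each line into words and flatten into one list
words = [w for line in lines for w in line.split(' ')]

# map + reduceByKey: pair each word with 1, then sum the 1s per key
counts = Counter(words)

print(counts["to"], counts["be"], counts["or"])
```

In Spark the same three steps run per-partition in parallel, with `reduceByKey` shuffling partial sums between nodes.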
Outline

Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Spark Users

400+ user meetup, 20+ contributors
User Applications

Crowdsourced traffic estimation (Mobile Millennium)
Video analytics & anomaly detection (Conviva)
Ad-hoc queries from web app (Quantifind)
Twitter spam classification (Monarch)
DNA sequence analysis (SNAP)
…
Mobile Millennium Project

Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)
Sample Data

[Map: GPS observations from SF taxis]

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
Challenge

Data is noisy and sparse (1 sample/minute)

Must infer path taken by each vehicle in addition to travel time distribution on each link
Solution

EM algorithm to estimate paths and travel time distributions simultaneously

    observations
        |  flatMap
    weighted path samples
        |  groupByKey
    link parameters
        |  broadcast (feeds the next EM iteration)
Results                                        [Hunter et al, SOCC 2011]

[Map: estimated travel times across the road network]

3× speedup from caching, 4.5× from broadcast
Conviva GeoReport

[Bar chart, time (hours):  Hive: 20,  Spark: 0.5]

SQL aggregations on many keys w/ same filter

40× gain over Hive from avoiding repeated I/O, deserialization and filtering
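The pattern behind the gain can be sketched in plain Python (hypothetical names): many aggregations share one filter, so the filtered data is materialized once instead of being re-read and re-filtered per query.

```python
reads = 0

def read_records():
    """Stand-in for scanning raw data from disk (the expensive part)."""
    global reads
    reads += 1
    return [{"country": "US", "views": 3}, {"country": "DE", "views": 5},
            {"country": "US", "views": 2}]

# Filter once, keep the result in memory, reuse it for every aggregation.
# Without caching, each query below would trigger its own read_records().
cached = [r for r in read_records() if r["country"] == "US"]

total_views = sum(r["views"] for r in cached)   # query 1
num_records = len(cached)                       # query 2
max_views = max(r["views"] for r in cached)     # query 3

assert (total_views, num_records, max_views) == (5, 2, 3)
assert reads == 1   # input scanned once for all three queries
```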
Other Programming Models

Pregel on Spark (Bagel)
  » 200 lines of code

Iterative MapReduce
  » 200 lines of code

Hive on Spark (Shark)
  » 5000 lines of code
  » Compatible with Apache Hive
  » Machine learning ops. in Scala
Outline

Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Implementation

Runs on Apache Mesos cluster manager to coexist w/ Hadoop

Supports any Hadoop storage system (HDFS, HBase, …)

[Diagram: Spark, Hadoop, MPI, … sharing nodes under Mesos]

Easy local mode and EC2 launch scripts

No changes to Scala
Task Scheduler

Runs general DAGs

Pipelines functions within a stage

Cache-aware data reuse & locality

Partitioning-aware to avoid shuffles

[Diagram: a DAG over RDDs A-G cut into stages at shuffle boundaries — Stage 1 ends in a groupBy, Stage 2 runs map and union, Stage 3 performs the join; cached data partitions are marked and skipped]
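"Pipelines functions within a stage" means chaining narrow operations so each record flows through all of them in turn, without materializing intermediate collections; in Python this is naturally expressed with generators. A sketch, not the scheduler's actual code:

```python
def read_partition():
    # stand-in for scanning one block of the input
    yield from ["ERROR a", "INFO b", "ERROR c"]

def filter_errors(records):
    return (r for r in records if r.startswith("ERROR"))

def extract_field(records):
    return (r.split(" ")[1] for r in records)

# filter and map are pipelined inside one stage: each record passes
# through both functions before the next record is read, and no
# intermediate list is ever built between the steps
result = list(extract_field(filter_errors(read_partition())))
print(result)   # → ['a', 'c']
```

A shuffle (e.g. groupByKey) breaks this chain, which is why it marks a stage boundary.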
Language Integration

Scala closures are Serializable Java objects
  » Serialize on master, load & run on workers

Not quite enough
  » Nested closures may reference entire outer scope, pulling in non-Serializable variables not used inside
  » Solution: bytecode analysis + reflection
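Python has an analogous hazard, sketched below with a hypothetical `Analysis` class: a task that closes over `self` drags the whole object toward the workers, while copying just the needed value into a top-level function ships cleanly. (The details differ — a local Python lambda cannot be pickled at all — and Spark's actual fix is bytecode analysis plus reflection rather than asking users to rewrite code.)

```python
import pickle
from functools import partial

def over(threshold, x):
    # top-level function: picklable, ships with just its bound arguments
    return x > threshold

class Analysis:
    def __init__(self, threshold):
        self.threshold = threshold
        self.resource = open    # stand-in for a field workers don't need

    def bad_task(self):
        # closes over self: serializing it would have to capture the
        # whole object (and in Python, the local lambda won't pickle)
        return lambda x: x > self.threshold

    def good_task(self):
        t = self.threshold      # copy only the value actually used
        return partial(over, t)

a = Analysis(10)
task = pickle.loads(pickle.dumps(a.good_task()))   # ships fine
assert task(11) and not task(9)

try:
    pickle.dumps(a.bad_task())
    shippable = True
except Exception:
    shippable = False
assert not shippable
```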
Interactive Spark

Modified Scala interpreter to allow Spark to be used interactively from the command line
  » Track variables that each line depends on
  » Ship generated classes to workers

Enables in-memory exploration of big data
Outline

Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Motivation

Many "big data" apps need to work in real time
  » Site statistics, spam filtering, intrusion detection, …

To scale to 100s of nodes, need:
  » Fault-tolerance: for both crashes and stragglers
  » Efficiency: don't consume many resources beyond base processing

Challenging in existing streaming systems
Traditional Streaming Systems

Continuous processing model
  » Each node has long-lived state
  » For each record, update state & send new records

[Diagram: input records are pushed through nodes 1 and 2 to node 3; each node holds mutable state]
Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Diagram, replication: each of nodes 1-3 has a synchronized replica node 1'-3' processing the same input]
        Fast recovery, but 2× hardware cost

[Diagram, upstream backup: nodes buffer their input and a single standby takes over a failed node's role]
        Only need 1 standby, but slow to recover

Neither approach can handle stragglers
                           [Borealis, Flux]    [Hwang et al, 2005]
Observation

Batch processing models, such as MapReduce, do provide fault tolerance efficiently
  » Divide job into deterministic tasks
  » Rerun failed/slow tasks in parallel on other nodes

Idea: run streaming computations as a series of small, deterministic batch jobs
  » Same recovery schemes at much smaller timescale
  » To make latency low, store state in RDDs
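The idea fits in a few lines of plain Python (hypothetical names): chop the input into small intervals, run a pure batch function on each, and rerun any interval whose result is lost.

```python
def batch_job(records):
    """Deterministic batch computation on one interval's data."""
    return sum(1 for r in records if "ERROR" in r)

# The stream, already divided into small (e.g. 1s) input batches,
# each stored reliably so it can be reprocessed at any time
intervals = [["INFO a", "ERROR b"], ["ERROR c", "ERROR d"], ["INFO e"]]

results = [batch_job(batch) for batch in intervals]

# A node holding results[1] fails: just rerun that one small job,
# exactly the recovery scheme batch systems already use
results[1] = batch_job(intervals[1])
assert results == [1, 2, 0]
```

Because each job is deterministic and its input is durable, recovery is re-execution, with no replicas and no synchronized standby.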
Discretized Stream Processing

[Diagram: at t = 1, a batch operation pulls the interval's input — an immutable dataset stored reliably — and produces an immutable output or state dataset stored as an RDD; the same repeats at t = 2, …]
Fault Recovery

Checkpoint state RDDs periodically

If a node fails/straggles, rebuild lost RDD partitions in parallel on other nodes

[Diagram: a map from an input dataset to an output dataset; lost output partitions are recomputed in parallel from the corresponding input partitions]

Faster recovery than upstream backup, without the cost of replication
How Fast Can It Go?

Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency

[Charts: maximum throughput (millions of records/s) under a given latency bound (1 or 2 s) vs nodes in cluster (0-100), for Grep and Top K Words; throughput scales roughly linearly with cluster size]
Comparison with Storm

[Charts: throughput (MB/s per node) vs record size (100 and 1000 bytes) for Grep and Top K Words; Spark sustains several times Storm's per-node throughput in both workloads]

Storm limited to 100K records/s/node
Also tried S4: 10K records/s/node            (both lack Spark's FT guarantees)
Commercial systems: O(500K) total
How Fast Can It Recover?

Recovers from faults/stragglers within 1 second

[Chart: interval processing time (s) over time; when the failure happens, processing time briefly rises toward 1 s, then returns to normal]

Sliding WordCount on 20 nodes with 10s checkpoint interval
Programming Interface

Extension to Spark: Spark Streaming
  » All Spark operators plus new "stateful" ones

// Running count of pageviews by URL
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Diagram: each line of the program defines a "discretized stream" (D-stream): a sequence of RDDs (views, ones, counts), one per interval t = 1, t = 2, ..., linked by map and reduce operations; each RDD is split into partitions]
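The D-stream model above can be mimicked in plain Python to make the semantics concrete: the input is chopped into per-interval batches, each batch is transformed like an RDD, and the running reduce folds state forward across intervals. This is a hypothetical standalone sketch of the model, not the Spark Streaming API; all function names here are illustrative.

```python
from collections import Counter

def map_stream(stream, fn):
    # Apply a map over every interval's batch, like views.map(...)
    return [[fn(ev) for ev in batch] for batch in stream]

def running_reduce(stream):
    # Fold (key, value) pairs across intervals, like ones.runningReduce(_ + _):
    # each interval's output is the cumulative count so far
    state, out = Counter(), []
    for batch in stream:
        for key, val in batch:
            state[key] += val
        out.append(dict(state))
    return out

# Two one-second intervals of pageview events (URLs), i.e. two batches
views = [["/a", "/b", "/a"], ["/b"]]
ones = map_stream(views, lambda url: (url, 1))
counts = running_reduce(ones)
# counts[0] == {"/a": 2, "/b": 1}; counts[1] == {"/a": 2, "/b": 2}
```

The key point the diagram makes is that each interval's state is itself an immutable RDD, so it can be rebuilt from lineage on failure rather than replicated.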
Incremental Operators

words.reduceByWindow("5s", max)                  (associative function)
words.reduceByWindow("5s", _+_, _-_)             (associative & invertible)

[Diagram: with only an associative function (max), each 5-second window is recomputed from all the interval counts it spans; with an associative and invertible function, each window result is derived incrementally from the previous one by adding the newest interval's count and subtracting the expired interval's count]
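The difference between the two reduceByWindow variants can be sketched outside Spark: with only an associative function the window must be rescanned every interval, while an invertible function (here + and -) lets each result be derived from the previous one in constant time. The helper names below are illustrative, not Spark API.

```python
def windowed_naive(interval_counts, w, op):
    # Associative only: recompute each window of up to `w` intervals from scratch
    out = []
    for t in range(len(interval_counts)):
        window = interval_counts[max(0, t - w + 1): t + 1]
        acc = window[0]
        for x in window[1:]:
            acc = op(acc, x)
        out.append(acc)
    return out

def windowed_incremental(interval_counts, w):
    # Associative & invertible: add the new interval, subtract the expired one
    out, total = [], 0
    for t, x in enumerate(interval_counts):
        total += x
        if t >= w:
            total -= interval_counts[t - w]
        out.append(total)
    return out

counts = [3, 1, 4, 1, 5, 9, 2]
# Both strategies agree for an invertible function like addition,
# but the incremental version does O(1) work per interval
assert windowed_incremental(counts, 5) == windowed_naive(counts, 5, lambda a, b: a + b)
```

For a non-invertible function like max, only the naive strategy applies, which is why the deck distinguishes the two operator signatures.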
Applications

Conviva video dashboard (>50 session-level metrics)
Mobile Millennium traffic estimation (online EM algorithm)

[Charts: both applications scale near-linearly with cluster size: active video sessions processed (millions) grows to roughly 4 million on 64 nodes, and GPS observations/sec grows to roughly 2000 on 80 nodes]
Unifying Streaming and Batch

D-streams and RDDs can seamlessly be combined
  » Same execution and fault recovery models

Enables powerful features:
  » Combining streams with historical data:
      pageViews.join(historicCounts).map(...)        (used in MM application)

  » Interactive ad-hoc queries on stream state:
      pageViews.slice("21:00", "21:05").topK(10)     (used in Conviva app)
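The stream-with-history join can be illustrated on a single interval's batch: because a D-stream interval is just a dataset of (key, value) pairs, it joins against a historical table the same way any batch dataset would. This is a hypothetical sketch of the idea, not the Spark join implementation.

```python
def join_batch(stream_batch, historic):
    # Inner-join one interval's (url, count) pairs against a historical
    # table, like pageViews.join(historicCounts)
    return [(url, (n, historic[url])) for url, n in stream_batch if url in historic]

# Hypothetical data: live pageviews for one interval, plus historical counts
historic_counts = {"/a": 100, "/b": 250}
page_views = [("/a", 3), ("/c", 1)]

joined = join_batch(page_views, historic_counts)
# "/c" has no history, so only "/a" survives the join
assert joined == [("/a", (3, 100))]
```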
Benefits of a Unified Stack

Write each algorithm only once
Reuse data across streaming & batch jobs
Query stream state instead of waiting for import

Some users were doing this manually!
  » Conviva anomaly detection, Quantifind dashboard
Conclusion

"Big data" is moving beyond one-pass batch jobs, to low-latency apps that need data sharing

RDDs offer fault-tolerant sharing at memory speed

Spark uses them to combine streaming, batch & interactive analytics in one system

www.spark-project.org
Related Work

DryadLINQ, FlumeJava
  » Similar "distributed collection" API, but cannot reuse datasets efficiently across queries

GraphLab, Piccolo, BigTable, RAMCloud
  » Fine-grained writes requiring replication or checkpoints

Iterative MapReduce (e.g. Twister, HaLoop)
  » Implicit data sharing for a fixed computation pattern

Relational databases
  » Lineage/provenance, logical logging, materialized views

Caching systems (e.g. Nectar)
  » Store data in files, no explicit control over what is cached
Behavior with Not Enough RAM

[Chart: iteration time vs. fraction of the working set cached in memory]

  % of working set in memory    Iteration time (s)
  Cache disabled                68.8
  25%                           58.1
  50%                           40.7
  75%                           29.7
  Fully cached                  11.5
RDDs for Debugging

Debugging general distributed apps is very hard

However, Spark and other recent frameworks run deterministic tasks for fault tolerance

Leverage this determinism for debugging:
  » Log lineage for all RDDs created (small)
  » Let user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions
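The rebuild-from-lineage idea can be sketched in a few lines: if every dataset records only its parent and the deterministic function that derived it, any intermediate dataset can be reconstructed on demand and inspected. This is a minimal hypothetical model of the technique, not Spark's internal representation; the dataset names and `rebuild` helper are made up for illustration.

```python
def rebuild(lineage, name):
    # Recompute a dataset purely from its logged lineage entry:
    # (parent_name, deterministic_fn). Sources have parent None.
    parent, fn = lineage[name]
    if parent is None:
        return fn()
    return fn(rebuild(lineage, parent))

# Lineage log: each entry records how a dataset was derived, not its data
lineage = {
    "nums":    (None,      lambda: list(range(5))),
    "squares": ("nums",    lambda d: [x * x for x in d]),
    "evens":   ("squares", lambda d: [x for x in d if x % 2 == 0]),
}

# Because every step is deterministic, rebuilding always yields the same
# data, so a debugger can reconstruct and query any intermediate dataset
assert rebuild(lineage, "evens") == [0, 4, 16]
```

The log stays small because it stores transformations, not data, which is exactly why the slide calls lineage logging cheap.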

Training state-of-the-art general text embedding
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Making Big Data Analytics Interactive and Real-­Time

  • 1. Spark: Making Big Data Analytics Interactive and Real-Time. Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. spark-project.org, UC Berkeley
  • 2. Overview   Spark  is  a  parallel  framework  that  provides:   » Efficient  primitives  for  in-­‐memory  data  sharing   » Simple  APIs  in  Scala,  Java,  SQL   » High  generality  (applicable  to  many  emerging  problems)   This  talk  will  cover:   » What  it  does   » How  people  are  using  it  (including  some  surprises)   » Current  research  
  • 3. Motivation   MapReduce  simplified  data  analysis  on  large,   unreliable  clusters   But  as  soon  as  it  got  popular,  users  wanted  more:   » More  complex,  multi-­‐pass  applications  (e.g.   machine  learning,  graph  algorithms)   » More  interactive  ad-­‐hoc  queries   » More  real-­‐time  stream  processing   One  reaction:  specialized  models  for  some  of   these  apps  (e.g.  Pregel,  Storm)  
  • 4. Motivation   Complex  apps,  streaming,  and  interactive  queries   all  need  one  thing  that  MapReduce  lacks:   Efficient  primitives  for  data  sharing    
  • 5. Examples   HDFS   HDFS   HDFS   HDFS   read   write   read   write   iter.  1   iter.  2   .    .    .   Input   HDFS   query  1   result  1   read   query  2   result  2   query  3   result  3   Input   .    .    .   Slow  due  to  replication  and  disk  I/O,   but  necessary  for  fault  tolerance  
  • 6. Goal:  Sharing  at  Memory  Speed   iter.  1   iter.  2   .    .    .   Input   query  1   one-­‐time   processing   query  2   query  3   Input   .    .    .   10-­‐100×  faster  than  network/disk,  but  how  to  get  FT?  
  • 7. Challenge   How  to  design  a  distributed  memory  abstraction   that  is  both  fault-­‐tolerant  and  efficient?  
  • 8. Existing  Systems   Existing  in-­‐memory  storage  systems  have   interfaces  based  on  fine-­‐grained  updates   » Reads  and  writes  to  cells  in  a  table   » E.g.  databases,  key-­‐value  stores,  distributed  memory   Requires  replicating  data  or  logs  across  nodes   for  fault  tolerance  è  expensive!   » 10-­‐100x  slower  than  memory  write…  
  • 9. Solution:  Resilient  Distributed   Datasets  (RDDs)   Provide  an  interface  based  on  coarse-­‐grained   operations  (map,  group-­‐by,  join,  …)   Efficient  fault  recovery  using  lineage   » Log  one  operation  to  apply  to  many  elements   » Recompute  lost  partitions  on  failure   » No  cost  if  nothing  fails  
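The lineage idea on slide 9 can be sketched in a few lines of plain Python (a toy single-machine model, not Spark's implementation): each dataset records the coarse-grained operation and parent that produced it, so a lost result can simply be recomputed on demand.

```python
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # lineage: how to (re)build this dataset
        self._cache = None        # may be lost at any time

    @staticmethod
    def from_list(data):
        return ToyRDD(lambda: list(data))

    def map(self, f):
        parent = self
        return ToyRDD(lambda: [f(x) for x in parent.collect()])

    def filter(self, pred):
        parent = self
        return ToyRDD(lambda: [x for x in parent.collect() if pred(x)])

    def collect(self):
        if self._cache is None:            # lost or never computed:
            self._cache = self._compute()  # replay the lineage
        return self._cache

rdd = ToyRDD.from_list(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())   # [0, 4, 16, 36, 64]
rdd._cache = None      # simulate losing the partition
print(rdd.collect())   # recomputed from lineage: [0, 4, 16, 36, 64]
```

Note the key property: if nothing fails, the logged lineage costs almost nothing; recovery work happens only when a cached result is actually lost.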
  • 10. RDD  Recovery   iter.  1   iter.  2   .    .    .   Input   query  1   one-­‐time   processing   query  2   query  3   Input   .    .    .  
  • 11. Generality  of  RDDs   RDDs  can  express  surprisingly  many  parallel   algorithms   » These  naturally  apply  same  operation  to  many  items   Capture  many  current  programming  models   » Data  flow  models:  MapReduce,  Dryad,  SQL,  …   » Specialized  models  for  iterative  apps:  Pregel,  iterative   MapReduce,  PowerGraph,  …   Allow  these  models  to  be  composed  
  • 12. Outline   Programming  interface   Examples   User  applications   Implementation   Demo   Current  research:  Spark  Streaming  
  • 13. Spark  Programming  Interface   Language-­‐integrated  API  in  Scala*   Provides:   » Resilient  distributed  datasets  (RDDs)   •  Partitioned  collections  with  controllable  caching   » Operations  on  RDDs   •  Transformations  (define  RDDs),  actions  (compute  results)   » Restricted  shared  variables  (broadcast,  accumulators)   *Also  Java,  SQL  and  soon  Python  
  • 14. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns:
    lines = spark.textFile("hdfs://...")              // base RDD
    errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.persist()
    cachedMsgs.filter(_.contains("foo")).count        // action
    cachedMsgs.filter(_.contains("bar")).count
    The driver ships tasks to workers, which cache message partitions and return results.
    Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
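The log-mining pipeline on slide 14 can be mimicked on a single machine in plain Python (the log lines below are made up for illustration): filter the ERROR lines, extract a field, keep the result around, then run several queries against it.

```python
log_lines = [
    "INFO\t2012-06-01\tstartup",
    "ERROR\t2012-06-01\tfoo failed",
    "ERROR\t2012-06-02\tbar timeout",
]

# analogous to filter(_.startsWith("ERROR")) and map(_.split('\t')(2))
errors = [l for l in log_lines if l.startswith("ERROR")]
cached_msgs = [l.split("\t")[2] for l in errors]   # kept in memory, like persist()

# two interactive "actions" over the cached messages
print(sum("foo" in m for m in cached_msgs))   # 1
print(sum("bar" in m for m in cached_msgs))   # 1
```

The point of the real system is that `cached_msgs` lives partitioned in cluster memory, so repeated queries avoid re-reading HDFS.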
  • 15. Fault Recovery. RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:
    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))
    Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))
  • 16. Fault Recovery Results: [chart] iteration time (s) over 10 iterations of a job, with a failure during iteration 6. Most iterations take 56-59 s; the first takes 119 s, and the failed iteration takes 81 s while lost partitions are recomputed.
  • 17. Example: Logistic Regression. Goal: find the best line separating two sets of points. [diagram: two classes of points, a random initial line, and the target separating line]
  • 18. Example:  Logistic  Regression   val data = spark.textFile(...).map(readPoint).persist() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w)
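A plain-Python transcription of the loop on slide 18 (the points, dimensions and iteration count below are made up for illustration; in Spark the map/reduce over `data` runs in parallel across the cluster, with `data` cached in memory):

```python
import math
import random

random.seed(0)
D, ITERATIONS = 2, 50
# toy separable dataset: (features, label) with label in {+1, -1}
data = [((1.0, 2.0), 1.0), ((1.0, 3.0), 1.0),
        ((1.0, -2.0), -1.0), ((1.0, -3.0), -1.0)]
w = [random.random() for _ in range(D)]   # random initial line

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for _ in range(ITERATIONS):
    # same gradient formula as the Scala version on the slide
    gradient = [0.0] * D
    for x, y in data:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wj - gj for wj, gj in zip(w, gradient)]   # w -= gradient

# trained line should score positive points > 0 and negative points < 0
print(dot(w, (1.0, 2.0)) > 0, dot(w, (1.0, -2.0)) < 0)   # True True
```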
  • 19. Logistic Regression Performance: [chart] running time (min) vs number of iterations (1-30) for Hadoop and Spark. Hadoop: 110 s per iteration. Spark: 80 s for the first iteration, 6 s for each further iteration.
  • 20. Example:  Collaborative  Filtering   Goal:  predict  users’  movie  ratings  based  on  past   ratings  of  other  movies   1  ?  ?  4  5  ?  3   ?  ?  3  5  ?  ?  3   R  =   Users   5  ?  5  ?  ?  ?  1   4  ?  ?  ?  ?  2  ?   Movies  
  • 21. Model  and  Algorithm   Model  R  as  product  of  user  and  movie  feature   matrices  A  and  B  of  size  U×K  and  M×K   BT   R   =   A   Alternating  Least  Squares  (ALS)   » Start  with  random  A  &  B   » Optimize  user  vectors  (A)  based  on  movies   » Optimize  movie  vectors  (B)  based  on  users   » Repeat  until  converged  
  • 22. Serial  ALS   var R = readRatingsMatrix(...) var A = // array of U random vectors var B = // array of M random vectors for (i <- 1 to ITERATIONS) { A = (0 until U).map(i => updateUser(i, B, R)) B = (0 until M).map(i => updateMovie(i, A, R)) } Range  objects  
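The serial loop above can be made concrete for rank K = 1, where each update has a closed form (the ratings matrix below is a made-up, fully observed example; real ALS uses rank-K least-squares solves per row and skips missing ratings):

```python
# R ~ a * b^T for user factors a and movie factors b (rank 1)
R = [[4.0, 2.0], [2.0, 1.0]]   # U=2 users, M=2 movies; exactly rank 1
U, M, ITERATIONS = 2, 2, 10
A = [1.0] * U                  # "random" initial factors
B = [1.0] * M

for _ in range(ITERATIONS):
    # optimize user factors given movie factors (closed form for rank 1)
    A = [sum(R[i][j] * B[j] for j in range(M)) / sum(b * b for b in B)
         for i in range(U)]
    # optimize movie factors given user factors
    B = [sum(R[i][j] * A[i] for i in range(U)) / sum(a * a for a in A)
         for j in range(M)]

pred = [[A[i] * B[j] for j in range(M)] for i in range(U)]
print(pred)   # close to R, since R is exactly rank 1
```

Each alternating step only needs the other side's factors plus R, which is exactly why shipping R efficiently matters in the distributed versions on the next slides.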
  • 23. Naïve Spark ALS:
    var R = readRatingsMatrix(...)
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R))
               .collect()
    }
    Problem: R is re-sent to all nodes in each iteration.
  • 24. Efficient Spark ALS. Solution: mark R as a broadcast variable:
    var R = spark.broadcast(readRatingsMatrix(...))
    var A = // array of U random vectors
    var B = // array of M random vectors
    for (i <- 1 to ITERATIONS) {
      A = spark.parallelize(0 until U, numSlices)
               .map(i => updateUser(i, B, R.value))
               .collect()
      B = spark.parallelize(0 until M, numSlices)
               .map(i => updateMovie(i, A, R.value))
               .collect()
    }
    Result: 3× performance improvement.
  • 25. Scaling Up Broadcast: [charts] iteration time (s), split into communication and computation, vs number of machines (10, 30, 60, 90), for the initial HDFS-based broadcast and for Cornet P2P broadcast; Cornet keeps communication time low as the cluster grows. [Chowdhury et al, SIGCOMM 2011]
  • 26. Other RDD Operations.
    Transformations (define a new RDD): map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, ...
    Actions (return a result to the driver program): collect, reduce, count, save, ...
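The transformation/action split can be illustrated with ordinary Python thunks (a toy model, not Spark's API): transformations only record work to be done, and nothing executes until an action forces the chain.

```python
log = []   # records when each step actually evaluates

def transform(name, data_fn):
    # a "transformation": returns a thunk instead of computing anything now
    def thunk():
        log.append(name)
        return data_fn()
    return thunk

base = transform("read", lambda: [1, 2, 3, 4])
mapped = transform("map", lambda: [x * 10 for x in base()])

print(log)         # [] -- defining transformations ran nothing
result = mapped()  # the "action": forces the whole chain
print(result)      # [10, 20, 30, 40]
print(log)         # ['map', 'read']
```

Laziness is what lets a scheduler see the whole chain before running it, pipeline steps together, and replay the chain for recovery.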
  • 27. Spark in Java. The Scala one-liner
    lines.filter(_.contains("error")).count()
    becomes:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("error"); }
    }).count();
  • 28. Spark in Python (Coming Soon!):
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
  • 29. Outline   Programming  interface   Examples   User  applications   Implementation   Demo   Current  research:  Spark  Streaming  
  • 30. Spark  Users   400+  user  meetup,  20+  contributors  
  • 31. User  Applications   Crowdsourced  traffic  estimation  (Mobile  Millennium)   Video  analytics  &  anomaly  detection  (Conviva)   Ad-­‐hoc  queries  from  web  app  (Quantifind)   Twitter  spam  classification  (Monarch)   DNA  sequence  analysis  (SNAP)   …  
  • 32. Mobile  Millennium  Project   Estimate  city  traffic  from  GPS-­‐equipped  vehicles   (e.g.  SF  taxis)  
  • 33. Sample  Data   Credit:  Tim  Hunter,  with  support  of  the  Mobile  Millennium  team;  P.I.  Alex  Bayen;  traffic.berkeley.edu  
  • 34. Challenge   Data  is  noisy  and  sparse  (1  sample/minute)   Must  infer  path  taken  by  each  vehicle  in   addition  to  travel  time  distribution  on  each  link  
  • 35. Challenge   Data  is  noisy  and  sparse  (1  sample/minute)   Must  infer  path  taken  by  each  vehicle  in   addition  to  travel  time  distribution  on  each  link  
  • 36. Solution   EM  algorithm  to  estimate  paths  and  travel  time   distributions  simultaneously   observations   flatMap   weighted  path  samples   groupByKey   link  parameters   broadcast  
  • 37. Results   [Hunter  et  al,  SOCC  2011]   3×  speedup  from  caching,  4.5x  from  broadcast  
  • 38. Conviva GeoReport: [chart] time to run: Hive 20 hours, Spark 0.5 hours. SQL aggregations on many keys with the same filter; a 40× gain over Hive from avoiding repeated I/O, deserialization and filtering.
  • 39. Other  Programming  Models   Pregel  on  Spark  (Bagel)   » 200  lines  of  code   Iterative  MapReduce   » 200  lines  of  code   Hive  on  Spark  (Shark)   » 5000  lines  of  code   » Compatible  with  Apache  Hive   » Machine  learning  ops.  in  Scala  
  • 40. Outline   Programming  interface   Examples   User  applications   Implementation   Demo   Current  research:  Spark  Streaming  
  • 41. Implementation   Runs  on  Apache  Mesos  cluster   Spark   Hadoop   MPI   manager  to  coexist  w/  Hadoop   …   Mesos   Supports  any  Hadoop  storage   system  (HDFS,  HBase,  …)   Node   Node   Node   Easy  local  mode  and  EC2  launch  scripts   No  changes  to  Scala  
  • 42. Task Scheduler. Runs general DAGs; pipelines functions within a stage; cache-aware data reuse & locality; partitioning-aware to avoid shuffles. [diagram: a DAG of RDDs A-G (groupBy, map, join, union) cut into three stages, with cached data partitions marked]
  • 43. Language  Integration   Scala  closures  are  Serializable  Java  objects   » Serialize  on  master,  load  &  run  on  workers   Not  quite  enough   » Nested  closures  may  reference  entire  outer  scope,   pulling  in  non-­‐Serializable  variables  not  used  inside   » Solution:  bytecode  analysis  +  reflection    
  • 44. Interactive  Spark   Modified  Scala  interpreter  to  allow  Spark  to  be   used  interactively  from  the  command  line   » Track  variables  that  each  line  depends  on   » Ship  generated  classes  to  workers   Enables  in-­‐memory  exploration  of  big  data  
  • 45. Outline   Programming  interface   Examples   User  applications   Implementation   Demo   Current  research:  Spark  Streaming  
  • 46. Outline   Programming  interface   Examples   User  applications   Implementation   Demo   Current  research:  Spark  Streaming  
  • 47. Motivation   Many  “big  data”  apps  need  to  work  in  real  time   » Site  statistics,  spam  filtering,  intrusion  detection,  …   To  scale  to  100s  of  nodes,  need:   » Fault-­‐tolerance:  for  both  crashes  and  stragglers   » Efficiency:  don’t  consume  many  resources  beyond   base  processing   Challenging  in  existing  streaming  systems  
  • 48. Traditional  Streaming  Systems   Continuous  processing  model   » Each  node  has  long-­‐lived  state   » For  each  record,  update  state  &  send  new  records   mutable  state   input  records   push   node  1   node  3   input  records   node  2  
  • 49. Traditional  Streaming  Systems   Fault  tolerance  via  replication  or  upstream  backup:   input   node  1   input   node  1   node  3   node  3   input   node  2   input   synchronization   node  2   node  1’   standby   node  3’   node  2’  
  • 50. Traditional  Streaming  Systems   Fault  tolerance  via  replication  or  upstream  backup:   input   node  1   input   node  1   node  3   node  3   input   node  2   input   synchronization   node  2   node  1’   standby   node  3’   Fast  recovery,  but  2x   Only  need  1  standby,   hardware  cost   node  2’   but  slow  to  recover  
  • 51. Traditional  Streaming  Systems   Fault  tolerance  via  replication  or  upstream  backup:   input   node  1   input   node  1   node  3   node  3   input   node  2   input   synchronization   node  2   node  1’   standby   node  3’   node  2’   Neither  approach  can  handle  stragglers   [Borealis,  Flux]   [Hwang  et  al,  2005]  
  • 52. Observation   Batch  processing  models,  such  as  MapReduce,   do  provide  fault  tolerance  efficiently   » Divide  job  into  deterministic  tasks   » Rerun  failed/slow  tasks  in  parallel  on  other  nodes   Idea:  run  streaming  computations  as  a  series  of   small,  deterministic  batch  jobs   » Same  recovery  schemes  at  much  smaller  timescale   » To  make  latency  low,  store  state  in  RDDs  
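The idea on slide 52 can be sketched in plain Python: group incoming records into small batches and run each through a deterministic batch job whose output (the running state) is itself an immutable value (a toy word-count stream; the data is made up).

```python
from collections import Counter

def batch_job(state, batch):
    # deterministic: the same (state, batch) always yields the same new state,
    # so a lost result can be recomputed, just like any RDD
    new_state = Counter(state)   # build a new value; never mutate the old one
    new_state.update(batch)
    return new_state

stream = [["a", "b"], ["b", "c"], ["a"]]   # one list of records per interval
state = Counter()
for batch in stream:
    state = batch_job(state, batch)
print(dict(state))   # {'a': 2, 'b': 2, 'c': 1}
```

Because every interval's state is an immutable dataset produced by a deterministic job, the same recompute-on-failure scheme works here at sub-second timescales.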
  • 53. Discretized Stream Processing: [diagram] at each interval t = 1, 2, ..., the input batch (an immutable dataset, stored reliably) is pulled into a batch operation that produces another immutable dataset (output or state), stored as an RDD.
  • 54. Fault  Recovery   Checkpoint  state  RDDs  periodically   If  a  node  fails/straggles,  rebuild  lost  RDD  partitions   in  parallel  on  other  nodes   map   input  dataset   output  dataset   Faster  recovery  than  upstream  backup,   without  the  cost  of  replication  
  • 55. How Fast Can It Go? Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency. [charts] maximum throughput (records/s, millions) vs nodes in cluster for Grep and Top K Words, under 1 s and 2 s latency bounds.
  • 56. Comparison with Storm: [charts] throughput (MB/s per node) vs record size (100-1000 bytes) for Grep and Top K Words; Spark well above Storm at both sizes. Storm was limited to 100K records/s/node (and lacks Spark's FT guarantees); S4 reached 10K records/s/node; commercial systems report O(500K) total.
  • 57. How Fast Can It Recover? Recovers from faults/stragglers within 1 second. [chart] interval processing time (s) over a 25 s run of sliding WordCount on 20 nodes with a 10 s checkpoint interval; processing time spikes briefly when the failure happens, then returns to normal within about a second.
  • 58. Programming Interface. Extension to Spark: Spark Streaming, with all Spark operators plus new "stateful" ones. Example on a "discretized stream" (D-stream):
    // Running count of pageviews by URL
    views = readStream("http:...", "1s")
    ones = views.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)
    [diagram: at each interval t, the views D-stream is mapped to ones and reduced into counts; each dataset is an RDD divided into partitions]
  • 59. Incremental Operators.
    words.reduceByWindow("5s", max) requires only an associative function and recomputes the max over each window.
    words.reduceByWindow("5s", _+_, _-_) requires an associative and invertible function, and updates the sliding count incrementally: add the interval entering the window, subtract the interval leaving it. [diagram: interval counts combined into sliding counts across intervals t-1 to t+4]
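The invertible-function trick on slide 59 is easy to sketch in plain Python: rather than re-reducing the whole window at each step, add the interval entering the window and subtract the one leaving it (valid for any associative, invertible function such as +).

```python
def sliding_counts(interval_counts, window):
    """Sliding sum over the last `window` intervals, updated incrementally."""
    out, running = [], 0
    for t, c in enumerate(interval_counts):
        running += c                                 # interval entering
        if t >= window:
            running -= interval_counts[t - window]   # interval leaving
        out.append(running)
    return out

counts = [3, 1, 4, 1, 5]   # per-interval word counts (made-up data)
print(sliding_counts(counts, 3))   # [3, 4, 8, 6, 10]
```

This does O(1) work per interval regardless of window length, whereas the associative-only version must re-reduce all intervals in the window; `max` has no inverse, which is why it needs the slower form.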
  • 60. Applications: [charts] Mobile Millennium traffic estimation (online EM algorithm): GPS observations/sec (millions) scales with cluster size up to 64 nodes. Conviva video dashboard (>50 session-level metrics): active video sessions handled scales with cluster size up to 80 nodes.
  • 61. Unifying Streaming and Batch. D-streams and RDDs can seamlessly be combined (same execution and fault recovery models). This enables powerful features:
    Combining streams with historical data (used in the Mobile Millennium application): pageViews.join(historicCounts).map(...)
    Interactive ad-hoc queries on stream state (used in the Conviva app): pageViews.slice("21:00", "21:05").topK(10)
  • 62. Benefits  of  a  Unified  Stack   Write  each  algorithm  only  once   Reuse  data  across  streaming  &  batch  jobs   Query  stream  state  instead  of  waiting  for  import     Some  users  were  doing  this  manually!   » Conviva  anomaly  detection,  Quantifind  dashboard  
  • 63. Conclusion   “Big  data”  is  moving  beyond  one-­‐pass  batch  jobs,   to  low-­‐latency  apps  that  need  data  sharing   RDDs  offer  fault-­‐tolerant  sharing  at  memory  speed   Spark  uses  them  to  combine  streaming,  batch  &   interactive  analytics  in  one  system   www.spark-­‐project.org  
  • 64. Related  Work   DryadLINQ,  FlumeJava   » Similar  “distributed  collection”  API,  but  cannot  reuse   datasets  efficiently  across  queries   GraphLab,  Piccolo,  BigTable,  RAMCloud   » Fine-­‐grained  writes  requiring  replication  or  checkpoints   Iterative  MapReduce  (e.g.  Twister,  HaLoop)   » Implicit  data  sharing  for  a  fixed  computation  pattern   Relational  databases   » Lineage/provenance,  logical  logging,  materialized  views   Caching  systems  (e.g.  Nectar)   » Store  data  in  files,  no  explicit  control  over  what  is  cached  
  • 65. Behavior with Not Enough RAM: [chart] iteration time (s) vs % of working set in memory: 68.8 s with caching disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached.
  • 66. RDDs  for  Debugging   Debugging  general  distributed  apps  is  very  hard   However,  Spark  and  other  recent  frameworks   run  deterministic  tasks  for  fault  tolerance   Leverage  this  determinism  for  debugging:   » Log  lineage  for  all  RDDs  created  (small)   » Let  user  replay  any  task  in  jdb,  rebuild  any  RDD  to   query  it  interactively,  or  check  assertions