SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Interac(ve	
  Big	
  data	
  analysis	
  
Viet-­‐Trung	
  Tran	
  
1	
  
MapReduce	
  wordcount	
  
2	
  
MR	
  –	
  batch	
  processing	
  
•  Long	
  running	
  job	
  
– latency	
  between	
  running	
  the	
  job	
  and	
  geBng	
  the	
  
answer	
  
•  Lot	
  of	
  computa(ons	
  
•  Specific	
  language	
  
3	
  
Example	
  Problem	
  
•  Jane	
  works	
  as	
  an	
  
analyst	
  at	
  an	
  e-­‐
commerce	
  company	
  
•  How	
  does	
  she	
  figure	
  
out	
  good	
  targe(ng	
  
segments	
  for	
  the	
  next	
  
marke(ng	
  campaign?	
  
•  She	
  has	
  some	
  ideas	
  
and	
  lots	
  of	
  data	
  
User	
  	
  
profiles	
  
Transac.on	
  
informa.on	
  
Access	
  
logs	
  
4	
  
Solving	
  the	
  problems?	
  
All	
  compiled	
  to	
  Map	
  Reduce	
  jobs	
  
5	
  
Dremel:	
  interac(ve	
  analysis	
  of	
  
web-­‐scale	
  datasets	
  
Melnik	
  et.	
  al,	
  Google	
  inc	
  
[VLDB	
  2010]	
  
6	
  
What	
  is	
  Dremel?	
  
•  Near	
  real	
  (me	
  interac(ve	
  analysis	
  (instead	
  batch	
  
processing).	
  SQL-­‐like	
  query	
  language	
  
–  Trillion	
  record,	
  mul(-­‐terabyte	
  datasets	
  
•  Nested	
  data	
  with	
  a	
  column	
  storage	
  representa(on	
  
•  Serving	
  tree:	
  mul(-­‐level	
  execu(on	
  trees	
  for	
  query	
  
processing	
  
•  Interoperates	
  "in	
  place"	
  with	
  GFS,	
  Big	
  Table	
  
•  The	
  engine	
  behind	
  Google	
  BigQuery	
  
•  Builds	
  on	
  the	
  ideas	
  from	
  web	
  search	
  and	
  parallel	
  
DBMS.	
  
7	
  
•  Brand of power tools that primarily rely on
their speed as opposed to torque
•  Data analysis tool that uses speed instead
of raw power
Why call it Dremel
8	
  
Widely used inside Google
•  Analysis of crawled web
documents
•  Tracking install data for
applications on Android
Market
•  Crash reporting for Google
products
•  OCR results from Google
Books
•  Spam analysis
•  Debugging of map tiles on
Google Maps
•  Tablet migrations in
managed Bigtable instances
•  Results of tests run on
Google's distributed build
system
•  Disk I/O statistics for
hundreds of thousands of
disks
•  Resource monitoring for
jobs run in Google's data
centers
•  Symbols and dependencies
in Google's codebase
9	
  
Records vs. columns
A	
  
B	
  
C	
   D	
  
E	
  
*	
  
*	
  
*	
  
.	
  .	
  .	
  
.	
  .	
  .	
  
r1	
  
r2	
   r1	
  
r2	
  
r1	
  
r2	
  
r1	
  
r2	
  
Challenge: preserve structure,
reconstruct from a subset of fields
Read less,
cheaper
decompression
10	
  
Columnar	
  format	
  
•  Values	
  in	
  a	
  column	
  stored	
  next	
  to	
  one	
  another	
  
– Beher	
  compression	
  
– Range-­‐map:	
  save	
  min-­‐max	
  
•  Only	
  access	
  columns	
  par(cipa(ng	
  in	
  query	
  
•  Aggrega(ons	
  can	
  be	
  done	
  without	
  decoding	
  
11	
  
Nested data model
message Document {
required int64 DocId; [1,1]
optional group Links {
repeated int64 Backward; [0,*]
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country; [0,1]
}
optional string Url;
}
}
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
r1	
  
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
r2	
  
multiplicity:
12	
  
Column-striped representation
value r d
10 0 0
20 0 0
DocId
value r d
http://A 0 2
http://B 1 2
NULL 1 1
http://C 0 2
Name.Url
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code Name.Language.Country
Links.BackwardLinks.Forward
value r d
us 0 3
NULL 2 2
NULL 1 1
gb 1 3
NULL 0 1
value r d
20 0 2
40 1 2
60 1 2
80 0 2
value r d
NULL 0 1
10 0 2
30 1 2
13	
  
Repetition and
definition levels
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
r1	
  
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
r2	
  
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code
r: At what repeated field in the field's path
the value has repeated	
  
d: How many fields in paths that could be
undefined (opt. or rep.) are actually present	
  
record (r=0) has repeated	
  
r=2	
  r=1	
  
Language (r=2) has repeated	
  
(non-repeating)	
  
14	
  
Record assembly FSM	
  
message Document {
required int64 DocId; [1,1]
optional group Links {
repeated int64 Backward; [0,*]
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country; [0,1]
}
optional string Url;
}
}
Name.Language.CountryName.Language.Code
Links.Backward Links.Forward
Name.Url
DocId
1	
  
0	
  
1	
  
0	
  
0,1,2	
  
2	
  
0,1	
  1	
  
0	
  
0	
  
Transitions
labeled with
repetition levels
15	
  
Record assembly FSM: example
Name.Language.CountryName.Language.Code
Links.Backward Links.Forward
Name.Url
DocId
1	
  
0	
  
1	
  
0	
  
0,1,2	
  
2	
  
0,1	
  1	
  
0	
  
0	
  
Transitions
labeled with
repetition levels
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
16	
  
Reading two fields
DocId
Name.Language.Country1,2	
  
0	
  
0	
  
DocId: 10
Name
Language
Country: 'us'
Language
Name
Name
Language
Country: 'gb'
DocId: 20
Name
s1	
  
s2	
  
Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country
17	
  
Query processing
•  Optimized for select-project-aggregate
– Very common class of interactive queries
– Single scan
– Within-record and cross-record aggregation
•  Approximations: count(distinct), top-k
•  Joins, temp tables, UDFs/TVFs, etc.
18	
  
SQL dialect for nested data
Id: 10
Name
Cnt: 2
Language
Str: 'http://A,en-us'
Str: 'http://A,en'
Name
Cnt: 0
t1	
  
SELECT DocId AS Id,
COUNT(Name.Language.Code) WITHIN Name AS Cnt,
Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
message QueryResult {
required int64 Id;
repeated group Name {
optional uint64 Cnt;
repeated group Language {
optional string Str;
}
}
}
Output table	
   Output schema	
  
No record assembly during query processing	
  
19	
  
Serving tree
storage layer (e.g., GFS)
. . .	
  
. . .	
  
. . .	
  leaf servers
(with local
storage)	
  
intermediate
servers	
  
root server	
  
client	
  
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
!" %" '" )" *" $!" $%" $'" $)"
histogram of
response times	
  
20	
  
Mul(-­‐level	
  serving	
  tree	
  
•  Parallelizes scheduling and aggregation
– Reduced fan-in
– Divide/conquer
– Better network utilization
•  Fault tolerance
21	
  
Example: count()
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
SELECT A, SUM(c)
FROM (R11 UNION ALL R110)
GROUP BY A
SELECT A, COUNT(B) AS c
FROM T11 GROUP BY A
T11 = {/gfs/1, …, /gfs/10000}
SELECT A, COUNT(B) AS c
FROM T12 GROUP BY A
T12 = {/gfs/10001, …, /gfs/20000}
SELECT A, COUNT(B) AS c
FROM T31 GROUP BY A
T31 = {/gfs/1}
. . .	
  
0	
  
1	
  
3	
  
R11	
   R12	
  
Data access ops	
  
. . .	
  
. . .	
  
22	
  
Experiments
Table
name
Number of
records
Size (unrepl.,
compressed)
Number
of fields
Data
center
Repl.
factor
T1 85 billion 87 TB 270 A 3×
T2 24 billion 13 TB 530 A 3×
T3 4 billion 70 TB 1200 A 3×
T4 1+ trillion 105 TB 50 B 3×
T5 1+ trillion 20 TB 30 B 2×
•  1 PB of real data
(uncompressed, non-replicated)
•  100K-800K tablets per table
•  Experiments run during business hours
23	
  
!"
#"
$"
%"
&"
'!"
'#"
'$"
'%"
'&"
#!"
'" #" (" $" )" %" *" &" +" '!"
Read from disk
columns	
  
records	
  
objects	
  
fromrecords	
  fromcolumns	
  
(a) read +
decompress	
  
(b) assemble
records	
  
(c) parse as
C++ objects	
  
(d) read +
decompress	
  
(e) parse as
C++ objects	
  
time (sec)	
  
number of fields	
  
Table partition: 375 MB (compressed), 300K rows, 125 columns	
  
2-4x overhead of
using records
10x speedup
using columnar
storage
24	
  
MR and Dremel execution
Sawzall program ran on MR:
num_recs: table sum of int;
num_words: table sum of int;
emit num_recs <- 1;
emit num_words <-
count_words(input.txtField);!"
!#"
!##"
!###"
!####"
$%&'()*'+," $%&)*-./0," 1'(/(-"
execution time (sec) on 3000 nodes 	
  
SELECT SUM(count_words(txtField)) / COUNT(*)
FROM T1
Q1:	
  
87 TB	
   0.5 TB	
   0.5 TB	
  
MR overheads: launch jobs, schedule 0.5M tasks,
assemble records
Avg # of terms in txtField in 85 billion record table T1	
  
25	
  
Impact of serving tree depth
!"
#!"
$!"
%!"
&!"
'!"
(!"
)$" )%"
$"*+,+*-"
%"*+,+*-"
&"*+,+*-"
execution time (sec)	
  
SELECT country, SUM(item.amount) FROM T2

GROUP BY country
SELECT domain, SUM(item.amount) FROM T2

WHERE domain CONTAINS ’.net’

GROUP BY domain
Q2:
Q3:
40 billion nested items
(returns 100s of records) (returns 1M records)
26	
  
!"
#!"
$!!"
$#!"
%!!"
%#!"
$!!!" %!!!" &!!!" '!!!"
Scalability
execution time (sec)	
  
number of
leaf servers	
  
SELECT TOP(aid, 20), COUNT(*) FROM T4
Q5 on a trillion-row table T4:
27	
  
Interactive speed
!"
#"
$!"
$#"
%!"
%#"
&!"
$" $!" $!!" $!!!"
execution time
(sec)	
  
percentage of queries
Most queries complete under 10 sec
Monthly query workload
of one 3000-node Dremel
instance
28	
  
Observations
•  Possible to analyze large disk-resident datasets
interactively on commodity hardware
–  1T records, 1000s of nodes
•  MR can benefit from columnar storage just like a parallel
DBMS
–  But record assembly is expensive
–  Interactive SQL and MR can be complementary
•  Parallel DBMSes may benefit from serving tree
architecture just like search engines
29	
  
Vs.	
  MapReduce	
  
•  Scheduling	
  Model	
  
–  Coarse	
  resource	
  model	
  reduces	
  hardware	
  u(liza(on	
  
–  Acquisi(on	
  of	
  resources	
  typically	
  takes	
  100’s	
  of	
  millis	
  to	
  seconds	
  
•  Barriers	
  
–  Map	
  comple(on	
  required	
  before	
  shuffle/reduce	
  
commencement	
  
–  All	
  maps	
  must	
  complete	
  before	
  reduce	
  can	
  start	
  
–  In	
  chained	
  jobs,	
  one	
  job	
  must	
  finish	
  en(rely	
  before	
  the	
  next	
  one	
  
can	
  start	
  
•  Persistence	
  and	
  Recoverability	
  
–  Data	
  is	
  persisted	
  to	
  disk	
  between	
  each	
  barrier	
  
–  Serializa(on	
  and	
  deserializa(on	
  are	
  required	
  between	
  execu(on	
  
phase	
  
30	
  
Apache	
  Drill	
  
31	
  
32	
  
33	
  
34	
  
Full	
  SQL	
  –	
  ANSI	
  SQL	
  2003	
  
•  SQL	
  like	
  is	
  not	
  enough	
  
•  Fine	
  integra(on	
  with	
  exis(ng	
  BI	
  tools	
  
– Tableau,	
  SAP	
  
– Standard	
  ODBC/JDBC	
  driver	
  
35	
  
Working	
  data	
  
•  Flat	
  files	
  in	
  DFS	
  
– Complex	
  data	
  (thrif,	
  Avro,	
  protobuf)	
  
– Columnar	
  data	
  (Parquet,	
  ORC)	
  
– JSON	
  
– CSV,	
  TSV	
  
•  NoSQL	
  stores	
  
– Document	
  stores	
  
– Spare	
  data	
  
– Rela(onal-­‐like	
  
36	
  
37	
  
Flexible	
  schema	
  	
  
38	
  
Sample	
  query	
  
39	
  
40	
  
Nested	
  data	
  
•  Nested	
  data	
  as	
  first	
  class	
  en(ty	
  
– Similar	
  to	
  BigQuery	
  
– No	
  upfront	
  flahening	
  required	
  
– JSON,	
  BSON,	
  AVRO,	
  Protocol	
  buffers	
  
41	
  
Cross	
  data	
  source	
  queries	
  
•  Combilne	
  data	
  from	
  Files,	
  HBASE,	
  Hive	
  in	
  one	
  
single	
  query	
  
•  No	
  central	
  metadata	
  defini(ons	
  necessary	
  
42	
  
High	
  level	
  architecture	
  
•  Cluster	
  of	
  drillbits,	
  one	
  per	
  node,	
  designed	
  to	
  maximize	
  data	
  locality	
  
•  Form	
  a	
  distributed	
  query	
  processing	
  engine	
  
•  Zookeeper	
  for	
  cluster	
  membership	
  only	
  
•  Hazelcast	
  distributed	
  cache	
  for	
  query	
  plans,	
  metadata,	
  locality	
  informa(on	
  
•  Columnar	
  record	
  organiza(on	
  
•  No	
  dependency	
  on	
  other	
  execu(on	
  engines	
  (Mapreduce,	
  Tez,	
  Spark)	
  
43	
  
Basic	
  query	
  flow	
  
44	
  
Drillbit	
  modules	
  
•  SQL	
  parser	
  
•  Op(mizer	
  
•  execu(on	
  
•  Query	
  execu(on	
  
– source	
  query:	
  what	
  
– logical	
  plan:	
  what	
  
– physical	
  plan:	
  how	
  
– execu(on	
  plan:	
  where	
  
45	
  
46	
  
Op(mis(c	
  execu(on	
  
•  Short	
  running	
  query	
  
– No	
  checkpoints	
  
– Rerun	
  en(re	
  query	
  in	
  face	
  of	
  failure	
  
•  No	
  barriers	
  
•  No	
  persistence	
  
47	
  
Run(me	
  compila(on	
  
48	
  
Roadmap	
  
49	
  

Weitere ähnliche Inhalte

Was ist angesagt?

Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquetNAVER D2
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and RRadek Maciaszek
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章moai kids
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientistsLambda Tree
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big DataDhafer Malouche
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)moai kids
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Brian O'Neill
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsWorkhorse Computing
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterJeffrey Breen
 

Was ist angesagt? (20)

14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章
 
Scalding
ScaldingScalding
Scalding
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Report
ReportReport
Report
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop cluster
 
C07.heaps
C07.heapsC07.heaps
C07.heaps
 

Andere mochten auch

Social media strategies for libraries poster
Social media strategies for libraries posterSocial media strategies for libraries poster
Social media strategies for libraries posterNataly Blas
 
Practica 2 quimica organica -espol
Practica 2  quimica organica -espolPractica 2  quimica organica -espol
Practica 2 quimica organica -espolLissy Rodriguez
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworksViet-Trung TRAN
 
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Paul Brown
 
The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16Sightlines
 
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Dave McClure
 
Balanceo de una ecuación química
Balanceo de una ecuación químicaBalanceo de una ecuación química
Balanceo de una ecuación químicadopamina mexico
 
xoxooo tkmmm
xoxooo tkmmmxoxooo tkmmm
xoxooo tkmmmceny2
 
Guia De Estudio Digestivo
Guia De Estudio DigestivoGuia De Estudio Digestivo
Guia De Estudio DigestivoLuciana Yohai
 
Jobs consultant
Jobs consultantJobs consultant
Jobs consultantTenforce
 
How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. Liquis Design
 
William Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsWilliam Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsTric Park
 
Charitable Giving and Happiness
Charitable Giving and HappinessCharitable Giving and Happiness
Charitable Giving and HappinessFaircom New York
 

Andere mochten auch (20)

Social media strategies for libraries poster
Social media strategies for libraries posterSocial media strategies for libraries poster
Social media strategies for libraries poster
 
Practica 2 quimica organica -espol
Practica 2  quimica organica -espolPractica 2  quimica organica -espol
Practica 2 quimica organica -espol
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
 
The Rules - SGS
The Rules - SGSThe Rules - SGS
The Rules - SGS
 
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
 
The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16
 
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
 
Balanceo de una ecuación química
Balanceo de una ecuación químicaBalanceo de una ecuación química
Balanceo de una ecuación química
 
teaching methods
teaching methods teaching methods
teaching methods
 
Moving to the Right Side of Safety
Moving to the Right Side of SafetyMoving to the Right Side of Safety
Moving to the Right Side of Safety
 
God Is Forgiving
God Is ForgivingGod Is Forgiving
God Is Forgiving
 
xoxooo tkmmm
xoxooo tkmmmxoxooo tkmmm
xoxooo tkmmm
 
Guia De Estudio Digestivo
Guia De Estudio DigestivoGuia De Estudio Digestivo
Guia De Estudio Digestivo
 
Jobs consultant
Jobs consultantJobs consultant
Jobs consultant
 
Jvm mbeans jmxtran
Jvm mbeans jmxtranJvm mbeans jmxtran
Jvm mbeans jmxtran
 
How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website.
 
William Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsWilliam Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of Millions
 
Latin Dansları
Latin DanslarıLatin Dansları
Latin Dansları
 
Charitable Giving and Happiness
Charitable Giving and HappinessCharitable Giving and Happiness
Charitable Giving and Happiness
 
Torque
TorqueTorque
Torque
 

Ähnlich wie Interactive big data analytics

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets robertlz
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersCleverence Kombe
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelAndrey Lomakin
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsCarl Lu
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...InfluxData
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 

Ähnlich wie Interactive big data analytics (20)

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Handout3o
Handout3oHandout3o
Handout3o
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
User biglm
User biglmUser biglm
User biglm
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 

Mehr von Viet-Trung TRAN

Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Viet-Trung TRAN
 
Dynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value StoreDynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value StoreViet-Trung TRAN
 
Pregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnPregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnViet-Trung TRAN
 
Mapreduce simplified-data-processing
Mapreduce simplified-data-processingMapreduce simplified-data-processing
Mapreduce simplified-data-processingViet-Trung TRAN
 
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của FacebookTìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của FacebookViet-Trung TRAN
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studyViet-Trung TRAN
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkViet-Trung TRAN
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkViet-Trung TRAN
 
Large-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkLarge-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkViet-Trung TRAN
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learningViet-Trung TRAN
 
success factors for project proposals
success factors for project proposalssuccess factors for project proposals
success factors for project proposalsViet-Trung TRAN
 
OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents Viet-Trung TRAN
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Viet-Trung TRAN
 
Introduction to BigData @TCTK2015
Introduction to BigData @TCTK2015Introduction to BigData @TCTK2015
Introduction to BigData @TCTK2015Viet-Trung TRAN
 
From neural networks to deep learning
From neural networks to deep learningFrom neural networks to deep learning
From neural networks to deep learningViet-Trung TRAN
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forestsViet-Trung TRAN
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringViet-Trung TRAN
 

Mehr von Viet-Trung TRAN (20)

Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
 
Dynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value StoreDynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value Store
 
Pregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnPregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớn
 
Mapreduce simplified-data-processing
Mapreduce simplified-data-processingMapreduce simplified-data-processing
Mapreduce simplified-data-processing
 
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của FacebookTìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case study
 
Giasan.vn @rstars
Giasan.vn @rstarsGiasan.vn @rstars
Giasan.vn @rstars
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
Large-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkLarge-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on Spark
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
success factors for project proposals
success factors for project proposalssuccess factors for project proposals
success factors for project proposals
 
GPSinsights poster
GPSinsights posterGPSinsights poster
GPSinsights poster
 
OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Introduction to BigData @TCTK2015
Introduction to BigData @TCTK2015Introduction to BigData @TCTK2015
Introduction to BigData @TCTK2015
 
From neural networks to deep learning
From neural networks to deep learningFrom neural networks to deep learning
From neural networks to deep learning
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forests
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filtering
 

Kürzlich hochgeladen

Phase noise transfer functions.pptx
Phase noise transfer      functions.pptxPhase noise transfer      functions.pptx
Phase noise transfer functions.pptxSaiGouthamSunkara
 
Landsman converter for power factor improvement
Landsman converter for power factor improvementLandsman converter for power factor improvement
Landsman converter for power factor improvementVijayMuni2
 
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdfSummer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdfNaveenVerma126
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Sean Meyn
 
Modelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovationsModelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovationsYusuf Yıldız
 
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfRenewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfodunowoeminence2019
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxwendy cai
 
A Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationA Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationMohsinKhanA
 
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratoryدليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide LaboratoryBahzad5
 
nvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxnvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxjasonsedano2
 
Basic Principle of Electrochemical Sensor
Basic Principle of  Electrochemical SensorBasic Principle of  Electrochemical Sensor
Basic Principle of Electrochemical SensorTanvir Moin
 
Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)Bahzad5
 
Mohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptxMohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptxKISHAN KUMAR
 
IT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptxIT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptxSAJITHABANUS
 
ASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderjuancarlos286641
 
Guardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSecGuardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSecTrupti Shiralkar, CISSP
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS Bahzad5
 
me3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Ame3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Akarthi keyan
 

Kürzlich hochgeladen (20)

Phase noise transfer functions.pptx
Phase noise transfer      functions.pptxPhase noise transfer      functions.pptx
Phase noise transfer functions.pptx
 
Landsman converter for power factor improvement
Landsman converter for power factor improvementLandsman converter for power factor improvement
Landsman converter for power factor improvement
 
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdfSummer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
Summer training report on BUILDING CONSTRUCTION for DIPLOMA Students.pdf
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
 
Modelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovationsModelling Guide for Timber Structures - FPInnovations
Modelling Guide for Timber Structures - FPInnovations
 
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdfRenewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
Renewable Energy & Entrepreneurship Workshop_21Feb2024.pdf
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptx
 
A Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software SimulationA Seminar on Electric Vehicle Software Simulation
A Seminar on Electric Vehicle Software Simulation
 
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratoryدليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
دليل تجارب الاسفلت المختبرية - Asphalt Experiments Guide Laboratory
 
nvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxnvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptx
 
Basic Principle of Electrochemical Sensor
Basic Principle of  Electrochemical SensorBasic Principle of  Electrochemical Sensor
Basic Principle of Electrochemical Sensor
 
Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)
 
Mohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptxMohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptx
 
Lecture 2 .pdf
Lecture 2                           .pdfLecture 2                           .pdf
Lecture 2 .pdf
 
IT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptxIT3401-WEB ESSENTIALS PRESENTATIONS.pptx
IT3401-WEB ESSENTIALS PRESENTATIONS.pptx
 
計劃趕得上變化
計劃趕得上變化計劃趕得上變化
計劃趕得上變化
 
ASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entender
 
Guardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSecGuardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSec
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
 
me3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Ame3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part A
 

Interactive big data analytics

  • 1. Interac(ve  Big  data  analysis   Viet-­‐Trung  Tran   1  
  • 3. MR  –  batch  processing   •  Long  running  job   – latency  between  running  the  job  and  geBng  the   answer   •  Lot  of  computa(ons   •  Specific  language   3  
  • 4. Example  Problem   •  Jane  works  as  an   analyst  at  an  e-­‐ commerce  company   •  How  does  she  figure   out  good  targe(ng   segments  for  the  next   marke(ng  campaign?   •  She  has  some  ideas   and  lots  of  data   User     profiles   Transac.on   informa.on   Access   logs   4  
  • 5. Solving  the  problems?   All  compiled  to  Map  Reduce  jobs   5  
  • 6. Dremel:  interac(ve  analysis  of   web-­‐scale  datasets   Melnik  et.  al,  Google  inc   [VLDB  2010]   6  
  • 7. What  is  Dremel?   •  Near  real  (me  interac(ve  analysis  (instead  batch   processing).  SQL-­‐like  query  language   –  Trillion  record,  mul(-­‐terabyte  datasets   •  Nested  data  with  a  column  storage  representa(on   •  Serving  tree:  mul(-­‐level  execu(on  trees  for  query   processing   •  Interoperates  "in  place"  with  GFS,  Big  Table   •  The  engine  behind  Google  BigQuery   •  Builds  on  the  ideas  from  web  search  and  parallel   DBMS.   7  
  • 8. •  Brand of power tools that primarily rely on their speed as opposed to torque •  Data analysis tool that uses speed instead of raw power Why call it Dremel 8  
  • 9. Widely used inside Google •  Analysis of crawled web documents •  Tracking install data for applications on Android Market •  Crash reporting for Google products •  OCR results from Google Books •  Spam analysis •  Debugging of map tiles on Google Maps •  Tablet migrations in managed Bigtable instances •  Results of tests run on Google's distributed build system •  Disk I/O statistics for hundreds of thousands of disks •  Resource monitoring for jobs run in Google's data centers •  Symbols and dependencies in Google's codebase 9  
  • 10. Records vs. columns A   B   C   D   E   *   *   *   .  .  .   .  .  .   r1   r2   r1   r2   r1   r2   r1   r2   Challenge: preserve structure, reconstruct from a subset of fields Read less, cheaper decompression 10  
  • 11. Columnar  format   •  Values  in  a  column  stored  next  to  one  another   – Beher  compression   – Range-­‐map:  save  min-­‐max   •  Only  access  columns  par(cipa(ng  in  query   •  Aggrega(ons  can  be  done  without  decoding   11  
  • 12. Nested data model message Document { required int64 DocId; [1,1] optional group Links { repeated int64 Backward; [0,*] repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; [0,1] } optional string Url; } } DocId: 10 Links Forward: 20 Forward: 40 Forward: 60 Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A' Name Url: 'http://B' Name Language Code: 'en-gb' Country: 'gb' r1   DocId: 20 Links Backward: 10 Backward: 30 Forward: 80 Name Url: 'http://C' r2   multiplicity: 12  
  • 13. Column-striped representation value r d 10 0 0 20 0 0 DocId value r d http://A 0 2 http://B 1 2 NULL 1 1 http://C 0 2 Name.Url value r d en-us 0 2 en 2 2 NULL 1 1 en-gb 1 2 NULL 0 1 Name.Language.Code Name.Language.Country Links.BackwardLinks.Forward value r d us 0 3 NULL 2 2 NULL 1 1 gb 1 3 NULL 0 1 value r d 20 0 2 40 1 2 60 1 2 80 0 2 value r d NULL 0 1 10 0 2 30 1 2 13  
  • 14. Repetition and definition levels DocId: 10 Links Forward: 20 Forward: 40 Forward: 60 Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A' Name Url: 'http://B' Name Language Code: 'en-gb' Country: 'gb' r1   DocId: 20 Links Backward: 10 Backward: 30 Forward: 80 Name Url: 'http://C' r2   value r d en-us 0 2 en 2 2 NULL 1 1 en-gb 1 2 NULL 0 1 Name.Language.Code r: At what repeated field in the field's path the value has repeated   d: How many fields in paths that could be undefined (opt. or rep.) are actually present   record (r=0) has repeated   r=2  r=1   Language (r=2) has repeated   (non-repeating)   14  
  • 15. Record assembly FSM   message Document { required int64 DocId; [1,1] optional group Links { repeated int64 Backward; [0,*] repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; [0,1] } optional string Url; } } Name.Language.CountryName.Language.Code Links.Backward Links.Forward Name.Url DocId 1   0   1   0   0,1,2   2   0,1  1   0   0   Transitions labeled with repetition levels 15  
  • 16. Record assembly FSM: example Name.Language.CountryName.Language.Code Links.Backward Links.Forward Name.Url DocId 1   0   1   0   0,1,2   2   0,1  1   0   0   Transitions labeled with repetition levels DocId: 10 Links Forward: 20 Forward: 40 Forward: 60 Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A' Name Url: 'http://B' Name Language Code: 'en-gb' Country: 'gb' 16  
  • 17. Reading two fields DocId Name.Language.Country1,2   0   0   DocId: 10 Name Language Country: 'us' Language Name Name Language Country: 'gb' DocId: 20 Name s1   s2   Structure of parent fields is preserved. Useful for queries like /Name[3]/Language[1]/Country 17  
  • 18. Query processing •  Optimized for select-project-aggregate – Very common class of interactive queries – Single scan – Within-record and cross-record aggregation •  Approximations: count(distinct), top-k •  Joins, temp tables, UDFs/TVFs, etc. 18  
  • 19. SQL dialect for nested data Id: 10 Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en' Name Cnt: 0 t1   SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS Str FROM t WHERE REGEXP(Name.Url, '^http') AND DocId < 20; message QueryResult { required int64 Id; repeated group Name { optional uint64 Cnt; repeated group Language { optional string Str; } } } Output table   Output schema   No record assembly during query processing   19  
  • 20. Serving tree storage layer (e.g., GFS) . . .   . . .   . . .  leaf servers (with local storage)   intermediate servers   root server   client   !" !#$" !#%" !#&" !#'" !#(" !#)" !" %" '" )" *" $!" $%" $'" $)" histogram of response times   20  
  • 21. Mul(-­‐level  serving  tree   •  Parallelizes scheduling and aggregation – Reduced fan-in – Divide/conquer – Better network utilization •  Fault tolerance 21  
  • 22. Example: count() SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/1, /gfs/2, …, /gfs/100000} SELECT A, SUM(c) FROM (R11 UNION ALL R110) GROUP BY A SELECT A, COUNT(B) AS c FROM T11 GROUP BY A T11 = {/gfs/1, …, /gfs/10000} SELECT A, COUNT(B) AS c FROM T12 GROUP BY A T12 = {/gfs/10001, …, /gfs/20000} SELECT A, COUNT(B) AS c FROM T31 GROUP BY A T31 = {/gfs/1} . . .   0   1   3   R11   R12   Data access ops   . . .   . . .   22  
  • 23. Experiments Table name Number of records Size (unrepl., compressed) Number of fields Data center Repl. factor T1 85 billion 87 TB 270 A 3× T2 24 billion 13 TB 530 A 3× T3 4 billion 70 TB 1200 A 3× T4 1+ trillion 105 TB 50 B 3× T5 1+ trillion 20 TB 30 B 2× •  1 PB of real data (uncompressed, non-replicated) •  100K-800K tablets per table •  Experiments run during business hours 23  
  • 24. !" #" $" %" &" '!" '#" '$" '%" '&" #!" '" #" (" $" )" %" *" &" +" '!" Read from disk columns   records   objects   fromrecords  fromcolumns   (a) read + decompress   (b) assemble records   (c) parse as C++ objects   (d) read + decompress   (e) parse as C++ objects   time (sec)   number of fields   Table partition: 375 MB (compressed), 300K rows, 125 columns   2-4x overhead of using records 10x speedup using columnar storage 24  
  • 25. MR and Dremel execution Sawzall program ran on MR: num_recs: table sum of int; num_words: table sum of int; emit num_recs <- 1; emit num_words <- count_words(input.txtField);!" !#" !##" !###" !####" $%&'()*'+," $%&)*-./0," 1'(/(-" execution time (sec) on 3000 nodes   SELECT SUM(count_words(txtField)) / COUNT(*) FROM T1 Q1:   87 TB   0.5 TB   0.5 TB   MR overheads: launch jobs, schedule 0.5M tasks, assemble records Avg # of terms in txtField in 85 billion record table T1   25  
  • 26. Impact of serving tree depth !" #!" $!" %!" &!" '!" (!" )$" )%" $"*+,+*-" %"*+,+*-" &"*+,+*-" execution time (sec)   SELECT country, SUM(item.amount) FROM T2
 GROUP BY country SELECT domain, SUM(item.amount) FROM T2
 WHERE domain CONTAINS ’.net’
 GROUP BY domain Q2: Q3: 40 billion nested items (returns 100s of records) (returns 1M records) 26  
  • 27. !" #!" $!!" $#!" %!!" %#!" $!!!" %!!!" &!!!" '!!!" Scalability execution time (sec)   number of leaf servers   SELECT TOP(aid, 20), COUNT(*) FROM T4 Q5 on a trillion-row table T4: 27  
  • 28. Interactive speed !" #" $!" $#" %!" %#" &!" $" $!" $!!" $!!!" execution time (sec)   percentage of queries Most queries complete under 10 sec Monthly query workload of one 3000-node Dremel instance 28  
  • 29. Observations •  Possible to analyze large disk-resident datasets interactively on commodity hardware –  1T records, 1000s of nodes •  MR can benefit from columnar storage just like a parallel DBMS –  But record assembly is expensive –  Interactive SQL and MR can be complementary •  Parallel DBMSes may benefit from serving tree architecture just like search engines 29  
  • 30. Vs.  MapReduce   •  Scheduling  Model   –  Coarse  resource  model  reduces  hardware  u(liza(on   –  Acquisi(on  of  resources  typically  takes  100’s  of  millis  to  seconds   •  Barriers   –  Map  comple(on  required  before  shuffle/reduce   commencement   –  All  maps  must  complete  before  reduce  can  start   –  In  chained  jobs,  one  job  must  finish  en(rely  before  the  next  one   can  start   •  Persistence  and  Recoverability   –  Data  is  persisted  to  disk  between  each  barrier   –  Serializa(on  and  deserializa(on  are  required  between  execu(on   phase   30  
  • 32. 32  
  • 33. 33  
  • 34. 34  
  • 35. Full  SQL  –  ANSI  SQL  2003   •  SQL  like  is  not  enough   •  Fine  integra(on  with  exis(ng  BI  tools   – Tableau,  SAP   – Standard  ODBC/JDBC  driver   35  
  • 36. Working  data   •  Flat  files  in  DFS   – Complex  data  (thrif,  Avro,  protobuf)   – Columnar  data  (Parquet,  ORC)   – JSON   – CSV,  TSV   •  NoSQL  stores   – Document  stores   – Spare  data   – Rela(onal-­‐like   36  
  • 37. 37  
  • 40. 40  
  • 41. Nested  data   •  Nested  data  as  first  class  en(ty   – Similar  to  BigQuery   – No  upfront  flahening  required   – JSON,  BSON,  AVRO,  Protocol  buffers   41  
  • 42. Cross  data  source  queries   •  Combilne  data  from  Files,  HBASE,  Hive  in  one   single  query   •  No  central  metadata  defini(ons  necessary   42  
  • 43. High  level  architecture   •  Cluster  of  drillbits,  one  per  node,  designed  to  maximize  data  locality   •  Form  a  distributed  query  processing  engine   •  Zookeeper  for  cluster  membership  only   •  Hazelcast  distributed  cache  for  query  plans,  metadata,  locality  informa(on   •  Columnar  record  organiza(on   •  No  dependency  on  other  execu(on  engines  (Mapreduce,  Tez,  Spark)   43  
  • 45. Drillbit  modules   •  SQL  parser   •  Op(mizer   •  execu(on   •  Query  execu(on   – source  query:  what   – logical  plan:  what   – physical  plan:  how   – execu(on  plan:  where   45  
  • 46. 46  
  • 47. Op(mis(c  execu(on   •  Short  running  query   – No  checkpoints   – Rerun  en(re  query  in  face  of  failure   •  No  barriers   •  No  persistence   47