HadoopDB
An Architectural Hybrid of MapReduce & DBMS Technologies for Analytical Workloads
                           Tilani Gunawardena
Road Map
•   Motivation
•   Introduction
•   Desired Properties
•   Background & Shortfalls
•   HadoopDB
•   Benchmarks
•   Fault Tolerance
•   Conclusion
•   Related Work
•   References

Introduction
• Analyzing massive structured data on 1000s of shared-nothing
  nodes

• Shared-nothing architecture:
  • A collection of independent, possibly virtual machines, each with its own
    local disk and local main memory, connected together by a high-speed
    network

• Approaches:
  • Parallel databases
  • Map/Reduce systems
Desired Properties
• Performance
  • A primary characteristic that commercial database systems use to
    distinguish themselves

• Fault tolerance

• Heterogeneous environments
  • The number of nodes keeps increasing
  • Keeping all nodes homogeneous is difficult

• Flexible query interface
  • Usually JDBC or ODBC
  • UDF mechanism
  • Both SQL and non-SQL interfaces are desirable
Background-PDBMS
• Standard relational tables and SQL
  • Indexing, compression, caching, I/O sharing
• Tables partitioned over nodes
  • Transparent to the user
• Meets the performance requirement
  • But needs a highly skilled DBA
• Flexible query interfaces
  • UDF support varies across implementations
• Fault tolerance
  • Does not score so well
  • Assumption: failures are rare
  • Assumption: clusters have only dozens of nodes

Background-MapReduce
• Satisfies fault tolerance
• Works in heterogeneous environments
• Drawback: performance
  • No built-in performance-enhancing techniques
• Interfaces
  • M/R jobs can be written in multiple languages
  • SQL is not supported directly (except through layers such as Hive)




MapReduce (Hadoop)
• MapReduce is a programming model that specifies:
  • A map function that processes a key/value pair to generate a set
    of intermediate key/value pairs,
  • A reduce function that merges all intermediate values associated
    with the same intermediate key.
• Hadoop
  • Is a MapReduce implementation for processing large data sets
    over 1000s of nodes.
  • Maps and Reduces run independently of each other over blocks
    of data distributed across a cluster
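
The map and reduce functions described above can be illustrated with a small word count. This is a minimal single-process sketch, not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for illustration, and the in-memory grouping stands in for the shuffle that Hadoop performs across the cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one (line id, line text) pair into intermediate (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical driver: map every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in map_fn(key, value):
            groups[intermediate_key].append(intermediate_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


lines = [("line1", "hadoop db hadoop"), ("line2", "db hive")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

Because each call to `map_fn` sees only one record and each call to `reduce_fn` sees only one key, the calls can run independently on different nodes, which is the property the slide refers to.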

Differences between Parallel
Databases and MapReduce?




HadoopDB
• Hadoop as communication layer above multiple nodes running
  single-node DBMS instances
• Fully open-source solution:
  • PostgreSQL as DB layer
  • Hadoop as communication layer
  • Hive as translation layer




HadoopDB
RDBMS                      Hadoop
• Careful layout of data   • Job scheduling
• Indexing                 • Task coordination
• Sorting                  • Parallelization
• Query optimization
• Compression




Ideas
• Main goal: achieve the properties described before
• Connect multiple single-node database systems
  • Hadoop as the task coordination & network communication layer
  • Queries parallelized across the nodes using the MapReduce framework
• Fault tolerant and able to work on heterogeneous nodes
• Performance of parallel databases
  • Query processing is done in the database engine




Architecture Background
• Data Storage layer (HDFS)
  • Block-structured file system managed by a central NameNode
  • Files are broken into blocks and distributed across the cluster
• Data processing layer (Map/Reduce framework)
  • Master/slave architecture
  • JobTracker and TaskTrackers




HadoopDB Components
•   Database Connector
•   Catalog
•   Data Loader
•   Planner (SMS)




Database Connector
• Interface between the DBMS and the TaskTracker
• Responsibilities
  • Connect to the database
  • Execute the SQL query
  • Return the results as key-value pairs
• Achieved goal
  • Data sources look like data blocks in HDFS to Hadoop
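
The three responsibilities listed above (connect, execute, return key-value pairs) can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for a node-local PostgreSQL instance, and `read_as_key_value`, the `rankings` table and its columns are hypothetical names, not the HadoopDB connector's actual interface.

```python
import sqlite3


def read_as_key_value(conn, sql, key_column=0):
    """Run `sql` on one node-local database and yield (key, row) pairs,
    the shape a map task expects from any input source."""
    for row in conn.execute(sql):
        yield row[key_column], row


# Stand-in for one node's local database (assumption: sqlite3 instead of PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (page_url TEXT, page_rank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.html", 12), ("b.html", 3), ("c.html", 40)],
)

pairs = list(
    read_as_key_value(
        conn,
        "SELECT page_url, page_rank FROM rankings "
        "WHERE page_rank > 10 ORDER BY page_url",
    )
)
print(pairs)
```

From the caller's point of view the generator behaves like a reader over a file block, which is the sense in which a database becomes just another data source for the MapReduce framework.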




Catalog
 • Maintains information about the databases
     • Database location, driver class
     • Datasets in the cluster, replicas, and partitioning
 • Catalog is stored as an XML file in HDFS
 • Planned to be deployed as a separate service
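
To make the catalog's role concrete, here is a sketch of looking up where a table's partitions live. The XML layout below is invented for illustration and is not the real HadoopDB catalog schema; the host names, the `partitions_for` helper and the attribute names are all assumptions. Only the kinds of fields (location, driver class, replicas, partitioning) come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout (assumption: not the actual HadoopDB format).
CATALOG_XML = """
<catalog>
  <node host="node1.example.org" driver="org.postgresql.Driver">
    <partition table="user_visits" id="0"
               url="jdbc:postgresql://node1.example.org/chunk0"/>
  </node>
  <node host="node2.example.org" driver="org.postgresql.Driver">
    <partition table="user_visits" id="1"
               url="jdbc:postgresql://node2.example.org/chunk1"/>
    <partition table="user_visits" id="0" replica="true"
               url="jdbc:postgresql://node2.example.org/chunk0_replica"/>
  </node>
</catalog>
"""


def partitions_for(table, catalog_xml):
    """Return connection details for every partition (and replica) of `table`."""
    root = ET.fromstring(catalog_xml.strip())
    found = []
    for node in root.findall("node"):
        for part in node.findall("partition"):
            if part.get("table") == table:
                found.append({
                    "host": node.get("host"),
                    "driver": node.get("driver"),
                    "partition": int(part.get("id")),
                    "url": part.get("url"),
                    "replica": part.get("replica") == "true",
                })
    return found


parts = partitions_for("user_visits", CATALOG_XML)
for part in parts:
    print(part["partition"], part["host"], "replica" if part["replica"] else "primary")
```

A scheduler with this information can send a task for partition 0 to either host, which is what allows a failed or slow node to be bypassed.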




Data Loader
• Responsibilities:
  • Globally partition the data on a given key
  • Break each node's data into chunks
  • Bulk-load the chunks into the single-node databases

• Two main components:
  • Global Hasher
      • Map/Reduce job that reads from HDFS and repartitions the data
  • Local Hasher
      • Copies a partition from HDFS to the node's local file system
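
The two-level partitioning described above can be sketched in a few lines. This is a simplified in-memory illustration, not the loader's code: `stable_hash`, `global_hash` and `local_hash` are hypothetical names, CRC32 is an arbitrary choice of hash, and the salted second hash is an assumption made here so that chunk assignment does not simply mirror node assignment.

```python
import zlib
from collections import defaultdict


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() of strings is salted per process)."""
    return zlib.crc32(text.encode("utf-8"))


def global_hash(records, key_fn, num_nodes):
    """Global step: repartition all records into one partition per node."""
    partitions = defaultdict(list)
    for record in records:
        partitions[stable_hash(key_fn(record)) % num_nodes].append(record)
    return partitions


def local_hash(node_records, key_fn, num_chunks):
    """Local step: split one node's partition into smaller chunks,
    using a differently salted hash than the global step."""
    chunks = defaultdict(list)
    for record in node_records:
        chunks[stable_hash("local:" + key_fn(record)) % num_chunks].append(record)
    return chunks


# Toy data: (source_ip, visit_id) pairs partitioned on source_ip.
records = [(f"10.0.0.{i % 7}", i) for i in range(100)]


def by_ip(record):
    return record[0]


layout = {
    node: local_hash(part, by_ip, num_chunks=2)
    for node, part in global_hash(records, by_ip, num_nodes=3).items()
}
for node, chunks in sorted(layout.items()):
    print(node, {chunk: len(rows) for chunk, rows in sorted(chunks.items())})
```

All records with the same key end up on the same node, so joins and aggregations on that key can be evaluated locally; each chunk would then be bulk-loaded into its own database on that node.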




SMS Planner
• Extends Hive
• Steps
 •   Parser transforms the query into an abstract syntax tree (AST)
 •   Table schema information is fetched from the catalog
 •   Logical plan generator creates the query plan
 •   Optimizer breaks the plan up into Map and Reduce phases
 •   Executable plan is generated for one or more MapReduce jobs
 •   SMS tries to push as much work as possible into the database layer
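
The last step, pushing work into the database layer, can be illustrated with an aggregation. In this sketch each node's database evaluates the `GROUP BY` itself and the reduce phase only merges the partial sums. It is an assumption-laden toy, not planner output: `sqlite3` stands in for the per-node DBMS, and the `user_visits` table, its columns and the function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQL handed to every node-local database by the map phase (hypothetical query).
PUSHED_DOWN_SQL = (
    "SELECT source_ip, SUM(ad_revenue) FROM user_visits GROUP BY source_ip"
)


def make_node_db(rows):
    """Create an in-memory stand-in for one node's local database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_visits (source_ip TEXT, ad_revenue INTEGER)")
    conn.executemany("INSERT INTO user_visits VALUES (?, ?)", rows)
    return conn


def map_phase(conn):
    """Map: let the database do the scan and the partial aggregation."""
    return conn.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_phase(partials):
    """Reduce: merge partial sums for keys that appear on several nodes."""
    totals = defaultdict(int)
    for source_ip, partial_sum in partials:
        totals[source_ip] += partial_sum
    return dict(totals)


nodes = [
    make_node_db([("10.0.0.1", 5), ("10.0.0.2", 7), ("10.0.0.1", 1)]),
    make_node_db([("10.0.0.1", 4), ("10.0.0.3", 2)]),
]
partials = [pair for conn in nodes for pair in map_phase(conn)]
totals = reduce_phase(partials)
print(totals)
```

Only one row per key per node crosses the network here, rather than every raw row, which is the benefit of pushing the aggregation down.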




Benchmarking
• Environment
  • Amazon EC2 “large” instances
  • Each instance
     • 7.5 GB memory, 2 virtual cores, 850 GB storage, 64-bit Linux Fedora 8
• Systems
  • Hadoop
     • 256 MB data blocks, 1024 MB heap size, 200 MB sort buffer
  • HadoopDB
     • Hadoop configuration similar to the above, PostgreSQL 8.2.5, data not compressed
  • Vertica
     • Cloud edition used
     • All data is compressed
  • DBMS-X
     • Commercial parallel row-oriented DBMS
     • Run on EC2 (no cloud edition available)
Benchmarking
• Data used
  • HTTP log files, HTML pages, rankings
  • Sizes (per node):
     • 155 million user visits (~20 GB)
     • 18 million rankings (~1 GB)
     • Stored as plain text in HDFS




Evaluating HadoopDB
• Compare HadoopDB to
  1. Hadoop
  2. Parallel databases (Vertica, DBMS-X)
• Features:
  1. Performance:
     • We expected HadoopDB to approach the performance of parallel
       databases
  2. Scalability:
     • We expected HadoopDB to scale as well as Hadoop
• We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of
  10, 50, and 100 nodes

Benchmark tasks
•   Data loading
•   Grep task
•   Selection task
•   Aggregation task
•   Join task
•   UDF Aggregation task
•   Fault tolerance and Heterogeneous environment




Data Load




Query Results




• Load: HadoopDB data loads are slower than Hadoop's, but faster than
  those of the parallel databases
• Runtime:
  • Structured data: HadoopDB is faster than Hadoop but slower than the
    parallel databases, although its performance comes close to theirs
  • Unstructured data: HadoopDB's performance matches Hadoop's




Scalability: Setup
•   Simple aggregation task with a full table scan
•   Data replicated across 10 nodes
•   Fault tolerance: kill a node halfway through the query
•   Fluctuation tolerance: slow down a node for the entire
    experiment




Scalability: Results
• HadoopDB and Hadoop take advantage of runtime scheduling
  by splitting data into chunks
• Parallel databases restart the entire query on node failure, or wait
  for the slowest node
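
The effect of chunk-level runtime scheduling on a slowed-down node can be seen in a toy simulation. This is a deliberately simplified model and not a description of either system's scheduler: it ignores failures, data locality and speculative execution, and the function names, node speeds and chunk count are all made-up values.

```python
import heapq


def static_finish_time(num_chunks, speeds):
    """Fixed equal shares per node: the query finishes when the slowest node does."""
    share = num_chunks / len(speeds)
    return max(share / speed for speed in speeds)


def dynamic_finish_time(num_chunks, speeds):
    """Runtime scheduling: whichever node becomes idle first takes the next chunk."""
    # Heap of (time at which the node becomes free, node index).
    free_at = [(0.0, node) for node in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_chunks):
        now, node = heapq.heappop(free_at)
        done = now + 1.0 / speeds[node]  # time to process one chunk on this node
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


# Three normal nodes and one node slowed to a quarter of the speed (hypothetical numbers).
speeds = [1.0, 1.0, 1.0, 0.25]
static = static_finish_time(40, speeds)
dynamic = dynamic_finish_time(40, speeds)
print(f"static shares: {static:.1f} time units, runtime scheduling: {dynamic:.1f}")
```

With fixed shares the slow node dictates the completion time, whereas with small chunks the fast nodes absorb most of the slow node's work; the same mechanism lets the chunks of a failed node be rerun elsewhere instead of restarting the whole query.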




To Summarize
• HadoopDB - a hybrid of DBMS and MapReduce
• HadoopDB is close in performance to parallel databases
• HadoopDB can operate in truly heterogeneous environments
  and has the fault tolerance of Hadoop
• HadoopDB is free and open source



 http://hadoopdb.sourceforge.net


Related Work
• Pig project at Yahoo
• SCOPE project at Microsoft
• Hive project




Future Work
• Integration with other open source databases
• Full automation of the loading and replication process
• Dynamically adjusting fault-tolerance levels based on failure
  rate




Thank You!

             34

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS Dr Neelesh Jain
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designphanleson
 

Was ist angesagt? (19)

Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Hadoop Ecosystem Overview
Hadoop Ecosystem OverviewHadoop Ecosystem Overview
Hadoop Ecosystem Overview
 
4. hbase overview
4. hbase overview4. hbase overview
4. hbase overview
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS  Cloud File System with GFS and HDFS
Cloud File System with GFS and HDFS
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
Tutorial Haddop 2.3
Tutorial Haddop 2.3Tutorial Haddop 2.3
Tutorial Haddop 2.3
 

Ähnlich wie Hadoop DB

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...Qian Lin
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in HyderabadRajitha D
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoopyaevents
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 

Ähnlich wie Hadoop DB (20)

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Anju
AnjuAnju
Anju
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 

Mehr von Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL

Mehr von Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL (20)

BlockChain.pptx
BlockChain.pptxBlockChain.pptx
BlockChain.pptx
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Introduction to cloud computing
Introduction to cloud computingIntroduction to cloud computing
Introduction to cloud computing
 
Data analytics
Data analyticsData analytics
Data analytics
 
Hadoop Eco system
Hadoop Eco systemHadoop Eco system
Hadoop Eco system
 
Parallel Computing on the GPU
Parallel Computing on the GPUParallel Computing on the GPU
Parallel Computing on the GPU
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
evaluation and credibility-Part 1
evaluation and credibility-Part 1evaluation and credibility-Part 1
evaluation and credibility-Part 1
 
Machine Learning and Data Mining
Machine Learning and Data MiningMachine Learning and Data Mining
Machine Learning and Data Mining
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Decision tree
Decision treeDecision tree
Decision tree
 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
 
Covering algorithm
Covering algorithmCovering algorithm
Covering algorithm
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Assosiate rule mining
Assosiate rule miningAssosiate rule mining
Assosiate rule mining
 
Big data in telecom
Big data in telecomBig data in telecom
Big data in telecom
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
MapReduce
MapReduceMapReduce
MapReduce
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 

Kürzlich hochgeladen

6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroomSamsung Business USA
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Celine George
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...Nguyen Thanh Tu Collection
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptx4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptxmary850239
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...Nguyen Thanh Tu Collection
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPCeline George
 

Kürzlich hochgeladen (20)

Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptx4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptx
 
Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERP
 

Hadoop DB

  • 1. HadoopDB An An Architectural Hybrid of MapReduce & DBMS Technologies for Analytical Workloads 1 Tilani Gunawardena
  • 2. Road Map • Motivation • Introduction • Desired Properties • Background & Shortfalls • HadoopDB • Benchmarks • Fault Tolerance • Conclusion • Related Work • References 2
  • 3. Introduction • Analyzing massive structured data on 1000s of shared-nothing nodes • Shared nothing architecture: • A collection of independent,possibly virtual matchines eact with local disk and local main memory connected together on a high- speed network • Approachs: • Parallel databases • Map/Reduce systems 3
  • 4. Desired Properties • Performance • A primary characteristic that commercial database systems use to distinguish themselves • A Fault tolerance • Heterogeneus environments • Increasing number of nodes • Difficult homogeneous • Flexible query interface • Usually JDBC or ODBC • UDF mechanism • Desirable SQL and no SQL interfaces 4
  • 5. Background-PDBMS • Standard relational tables and SQL • Indexing, compression,caching, I/O sharing • Tables partitioned over nodes • Transparent to the user • Meet performance • Needed highly skilled DBA • Flexible query interfaces • UDFs varies accros implementations • Fault tolerance • Not score so well • Assumption: failures are rare • Assumption: dozens of nodes in clusters 5
  • 6. Background-MapReduce • Satisfies fault tolerance • Works on heterogeneus environment • Drawback: performance • No enhacing performance techniques • Interfaces • Write M/R jobs in multiple languages • SQL not supported directly ( excluding eg: Hive ) 6
  • 7. • MapReduce (Hadoop) MapReduce is a programming model which specifies: • A map function that processes a key/value pair to generate a set of intermediate key/value pairs, • A reduce function that merges all intermediate values associated with the same intermediate key. • Hadoop • Is a MapReduce implementation for processing large data sets over 1000s of nodes. • Maps and Reduces run independently of each other over blocks of data distributed across a cluster 7
  • 10. 10
  • 11. HadoopDB 11
  • 12. HadoopDB • Hadoop as communication layer above multiple nodes running single-node DBMS instances • Full open-source solution : • PostgreSQL as DB layer • Hadoop as communication layer • Hive as translation layer 12
  • 13. HadoopDB RDBMS Hadoop • Careful layout of data • Job scheduling • Indexing • Task coordination • Sorting • Parallellization • Query optimization • compression 13
  • 14. Ideas • Main goal: achieve the properties described before • Connect multiple single-datanode systems • Hadoop as the task coordination & network communication layer • Queries parallelized across the nodes using MapReduce framework • Fault tolerant and work in heterogeneus nodes • Parallel databases performance • Query processing in database engine 14
  • 15. Architecture Background • Data Storage layer (HDFS) • Block structured file system managed by central NameNode • Files broken in blocks and ditributed • Data processing layer (Map/Reduce framework) • Master/slave architecture • Job and Task trackers 15
  • 16. HadoopDB Components • Database Connector • Catalog • Data Loader • Planner (SMS) 16
  • 17. Database Connector • Interface between DBMS and TaskTacker • Responsabilities • Connect to the database • Execute the SQL query • Return the results as key-value pairs • Achieved goal • Datasources are similar to datablocks in HDFS 17
  • 18. Catalog • Maintain information about database • Database location, driver class • Darasets in cluster, replica or partitioning • Catalog stored as xml file in HDFS • Plan to deploy as separated service 18
  • 19. Data Loader • Responsabilities: • Globally partition the data on given key • Break single node data into chunks • Bulk-loading chunks in single-node databases • Two main components: • Global hasher • Map/Reduce job read from HDS and repartition • Local Hasher • Copies from HDFS to local file system 19
  • 20. SMS Planner • Extends Hive • Steps • Parser transforms the query into an abstract syntax tree (AST) • Get table schema information from the catalog • Logical plan generator creates the query plan • Optimizer breaks the plan up into Map or Reduce phases • Executable plan generated for one or more MapReduce jobs • SMS tries to push as much work as possible into the database layer
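The last point, pushing work into the database layer, can be illustrated with a small aggregation. The sketch below is an assumption-laden toy, not planner output: `sqlite3` stands in for the node-local databases, and `make_node`, `map_phase`, `reduce_phase`, and the `user_visits` table are hypothetical names. Each node runs the GROUP BY itself, so the Map phase only ships partial sums and the Reduce phase merges them.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every node-local database (illustrative query).
PUSHED_DOWN_SQL = (
    "SELECT source_ip, SUM(ad_revenue) FROM user_visits GROUP BY source_ip"
)


def make_node(rows):
    """Create an in-memory database standing in for one node's DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_visits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO user_visits VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Map: the database does the scan and partial aggregation."""
    return node.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_phase(partials):
    """Reduce: merge the partial sums that share a key."""
    totals = defaultdict(float)
    for source_ip, partial_sum in partials:
        totals[source_ip] += partial_sum
    return dict(totals)


nodes = [
    make_node([("a", 1.0), ("b", 2.0), ("a", 0.5)]),
    make_node([("a", 3.0)]),
]
partials = [pair for node in nodes for pair in map_phase(node)]
totals = reduce_phase(partials)
```

The design point is that the rows themselves never leave the database; only one partial result per key and node crosses into the MapReduce layer.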
  • 22. Benchmarking • Environment • Amazon EC2 “large” instances • Each instance • 7.5 GB memory, 2 virtual cores, 850 GB storage, 64-bit Fedora 8 Linux • Systems • Hadoop • 256 MB data blocks, 1024 MB heap size, 200 MB sort buffer • HadoopDB • Configuration similar to Hadoop, PostgreSQL 8.2.5, no data compression • Vertica • Used a cloud edition • All data is compressed • DBMS-X • Commercial parallel row-oriented database • Run on EC2 (no cloud edition available)
  • 23. Benchmarking • Data used • HTTP log files, HTML pages, rankings • Sizes (per node): • 155 million user visits (~20 GB) • 18 million rankings (~1 GB) • Stored as plain text in HDFS
  • 24. Evaluating HadoopDB • Compare HadoopDB to: • 1. Hadoop • 2. Parallel databases (Vertica, DBMS-X) • Features: • 1. Performance: we expected HadoopDB to approach the performance of parallel databases • 2. Scalability: we expected HadoopDB to scale as well as Hadoop • We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes.
  • 25. Benchmark tasks • Data loading • Grep task • Selection task • Aggregation task • Join task • UDF aggregation task • Fault tolerance and heterogeneous environment
  • 26. Data Load
  • 28. • Load: data loads are slower than in Hadoop but faster than in parallel databases • Runtime: • Structured data: HadoopDB is faster than Hadoop but slower than parallel databases (its performance is close to theirs) • Unstructured data: HadoopDB’s performance matches Hadoop’s
  • 29. Scalability: Setup • Simple aggregation task (full table scan) • Data replicated across 10 nodes • Fault tolerance: kill a node halfway through • Fluctuation tolerance: slow down a node for the entire experiment
  • 30. Scalability: Results • HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks • Parallel databases restart the entire query on node failure or wait for the slowest node
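Why chunking helps with failures can be shown with a toy scheduler. This is a deliberately simplified sketch, not Hadoop's scheduler: `run_with_failure` is a hypothetical function, nodes take chunks in round-robin order, and details such as speculative execution or where intermediate output is stored are ignored. The point is only that a failure sends the affected chunk back to the queue instead of restarting the whole job.

```python
from collections import deque


def run_with_failure(chunks, nodes, failed_node, fail_after):
    """Assign chunks to nodes; failed_node dies after finishing fail_after chunks.

    Returns a dict mapping each chunk to the node that completed it.
    """
    pending = deque(chunks)
    completed_by_node = {node: 0 for node in nodes}
    done = {}
    alive = list(nodes)
    while pending:
        if not alive:
            raise RuntimeError("no nodes left to run the remaining chunks")
        for node in list(alive):
            if not pending:
                break
            chunk = pending.popleft()
            if node == failed_node and completed_by_node[node] >= fail_after:
                # The node dies while holding this chunk: reschedule only it.
                alive.remove(node)
                pending.append(chunk)
                continue
            done[chunk] = node
            completed_by_node[node] += 1
    return done


done = run_with_failure(
    chunks=list(range(6)),
    nodes=["n1", "n2", "n3"],
    failed_node="n2",
    fail_after=1,
)
```

A system that treats the query as one indivisible unit per node has no such queue to fall back on, which is the contrast the second bullet draws for parallel databases.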
  • 31. To Summarize • HadoopDB is a hybrid of DBMS and MapReduce • HadoopDB is close in performance to parallel databases • HadoopDB is able to operate in truly heterogeneous environments and has the fault tolerance of Hadoop • It is free and open source: http://hadoopdb.sourceforge.net
  • 32. Related Work • Pig project at Yahoo • SCOPE project at Microsoft • Hive project
  • 33. Future Work • Integration with other open-source databases • Full automation of the loading and replication process • Dynamically adjusting fault-tolerance levels based on failure rate

Editor's notes

  1. Hadoop = open-source MapReduce. Parallel databases = shared-nothing RDBMS. What parallel databases got right: data partitioning, indexing, parallel sorts, joins, aggregation. Interesting ideas in MR: very flexible, can handle almost any data type (records, arrays, images); runtime job scheduler and load balancing; fault tolerance and straggler handling.
  2. It is impossible to get homogeneous performance across hundreds or thousands of nodes, even if the nodes run on identical hardware or on identical virtual machines.
  3. Scaling, not performance.
  4. Parallel DBMS: best at ad hoc analytical queries; substantially faster once data is loaded, but loading the data takes considerably longer. Who wants to program parallel joins? MapReduce: well suited to extract, transform, load (ETL) tasks; easy to use for complex analytic tasks.
  5. Basic design idea: multiple, independent, single-node databases coordinated by Hadoop. Fault tolerance: scheduling and job tracking implementation from Hadoop.
  6. The NameNode maintains metadata about the size and location of blocks and their replicas. The MapReduce framework has a master-slave architecture: the master is a single JobTracker and the slaves are TaskTrackers. Each job is broken down into Map tasks and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing. It achieves locality by matching a TaskTracker to Map tasks that process data local to it, and it load-balances by ensuring all available TaskTrackers are assigned tasks.
  7. AST building. Semantic analyzer connects to the catalog. DAG of relational operators. Optimizer restructuring. Convert plan to M/R jobs. DAG of M/R jobs serialized in an XML plan.
  8. SMS planner extends Hive
  9. After global partition
  10. Small query and large query
  11. HadoopDB distinguishes itself from many of the current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters.