3. Introduction
• Analyzing massive structured data on 1000s of shared-nothing
nodes
• Shared-nothing architecture:
• A collection of independent, possibly virtual machines, each with
local disk and local main memory, connected together on a high-speed
network
• Approaches:
• Parallel databases
• Map/Reduce systems
4. Desired Properties
• Performance
• A primary characteristic that commercial database systems use to
distinguish themselves
• Fault tolerance
• Heterogeneous environments
• Increasing number of nodes
• Homogeneous performance is difficult to achieve
• Flexible query interface
• Usually JDBC or ODBC
• UDF mechanism
• Both SQL and non-SQL interfaces are desirable
5. Background-PDBMS
• Standard relational tables and SQL
• Indexing, compression, caching, I/O sharing
• Tables partitioned over nodes
• Transparent to the user
• Meets the performance requirement
• Needs a highly skilled DBA
• Flexible query interfaces
• UDF support varies across implementations
• Fault tolerance
• Does not score as well
• Assumption: failures are rare
• Assumption: dozens of nodes in clusters
6. Background-MapReduce
• Satisfies fault tolerance
• Works in heterogeneous environments
• Drawback: performance
• No performance-enhancing techniques
• Interfaces
• Write M/R jobs in multiple languages
• SQL not supported directly (except through add-ons such as Hive)
7. MapReduce (Hadoop)
• MapReduce is a programming model which specifies:
• A map function that processes a key/value pair to generate a set
of intermediate key/value pairs,
• A reduce function that merges all intermediate values associated
with the same intermediate key.
• Hadoop
• Is a MapReduce implementation for processing large data sets
over 1000s of nodes.
• Maps and Reduces run independently of each other over blocks
of data distributed across a cluster
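The map and reduce functions described on this slide can be illustrated with a minimal single-process sketch. This is not Hadoop's API: the names map_fn, reduce_fn, and run_mapreduce are assumptions made for illustration, and the word-count task is just a convenient example.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: take one (key, value) pair and emit intermediate (key, value) pairs.
    Here the input value is a line of text and we emit (word, 1) per word."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Input records are (line number, line text) pairs.
lines = [(0, "Hadoop runs map tasks"), (1, "Hadoop runs reduce tasks")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
```

In a real cluster the grouping step is the shuffle between Map and Reduce tasks, and each call to map_fn or reduce_fn can run on a different node.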
12. HadoopDB
• Hadoop as communication layer above multiple nodes running
single-node DBMS instances
• Fully open-source solution:
• PostgreSQL as DB layer
• Hadoop as communication layer
• Hive as translation layer
14. Ideas
• Main goal: achieve the properties described before
• Connect multiple single-node database systems
• Hadoop as the task coordination & network communication layer
• Queries parallelized across the nodes using MapReduce framework
• Fault tolerant and works across heterogeneous nodes
• Parallel-database performance
• Query processing in database engine
15. Architecture Background
• Data Storage layer (HDFS)
• Block structured file system managed by central NameNode
• Files broken into blocks and distributed across nodes
• Data processing layer (Map/Reduce framework)
• Master/slave architecture
• JobTracker and TaskTrackers
17. Database Connector
• Interface between DBMS and TaskTracker
• Responsibilities
• Connect to the database
• Execute the SQL query
• Return the results as key-value pairs
• Achieved goal
• Data sources are treated like data blocks in HDFS
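The three responsibilities listed on this slide (connect, execute the SQL query, return key-value pairs) can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for a node's PostgreSQL instance, and the function name run_query_as_pairs, the table, and the column names are assumptions, not HadoopDB's actual connector code.

```python
import sqlite3

def run_query_as_pairs(conn, sql, key_column=0):
    """Execute sql on conn and yield each result row as a (key, value) pair.
    The key is the column at key_column; the value is the rest of the row."""
    cursor = conn.execute(sql)
    for row in cursor:
        key = row[key_column]
        value = row[:key_column] + row[key_column + 1:]
        yield key, value

# Stand-in for one node's local database (assumption: in-memory sqlite3).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("10.0.0.1", 1.5), ("10.0.0.2", 2.0), ("10.0.0.1", 0.5)],
)

# The Map task sees the query result as a stream of key-value pairs.
pairs = list(run_query_as_pairs(
    conn,
    "SELECT source_ip, SUM(ad_revenue) FROM visits "
    "GROUP BY source_ip ORDER BY source_ip",
))
```

Because the result arrives as key-value pairs, the rest of the MapReduce pipeline can consume a database the same way it would consume a block read from HDFS.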
18. Catalog
• Maintains information about the databases
• Database location, driver class
• Datasets in the cluster, replicas and partitioning
• Catalog stored as an XML file in HDFS
• Planned to be deployed as a separate service
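To make the catalog's role concrete, here is a small sketch of looking up where a dataset lives. The XML layout, element names, attribute names, and host names below are hypothetical, invented for this example; the slide only states that the catalog is an XML file holding connection details and dataset, replica, and partitioning information.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout (assumption, not HadoopDB's real schema).
CATALOG_XML = """
<catalog>
  <node host="node1.example" driver="org.postgresql.Driver">
    <dataset name="uservisits" partition="0" replica="false"/>
    <dataset name="rankings" partition="0" replica="false"/>
  </node>
  <node host="node2.example" driver="org.postgresql.Driver">
    <dataset name="uservisits" partition="1" replica="false"/>
    <dataset name="uservisits" partition="0" replica="true"/>
  </node>
</catalog>
"""

def locate(catalog_xml, dataset_name):
    """Return (host, partition, is_replica) for every copy of a dataset."""
    root = ET.fromstring(catalog_xml)
    found = []
    for node in root.findall("node"):
        for ds in node.findall("dataset"):
            if ds.get("name") == dataset_name:
                found.append((
                    node.get("host"),
                    int(ds.get("partition")),
                    ds.get("replica") == "true",
                ))
    return found

locations = locate(CATALOG_XML, "uservisits")
```

A planner could use such a lookup to decide which nodes must run a query and which replica to fall back on if a node fails.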
19. Data Loader
• Responsibilities:
• Globally partition the data on a given key
• Break single-node data into chunks
• Bulk-load chunks into the single-node databases
• Two main components:
• Global hasher
• MapReduce job reads from HDFS and repartitions the data
• Local hasher
• Copies from HDFS to local file system
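The two-level partitioning described on this slide can be sketched in a few lines. This is a simplified in-memory illustration, not the loader's code: the function names are assumptions, CRC32 is an arbitrary choice of deterministic hash, and salting the local hash so that it differs from the global one is a design choice of this sketch.

```python
import zlib
from collections import defaultdict

def stable_hash(key, salt=""):
    """Deterministic hash (Python's built-in hash() of str varies per run)."""
    return zlib.crc32((salt + str(key)).encode("utf-8"))

def global_partition(records, key_fn, num_nodes):
    """Global step: assign each record to one node by hashing the partition key."""
    by_node = defaultdict(list)
    for rec in records:
        by_node[stable_hash(key_fn(rec)) % num_nodes].append(rec)
    return by_node

def local_chunks(node_records, key_fn, num_chunks):
    """Local step: split one node's records into smaller chunks.
    A salted hash keeps the local split independent of the global one."""
    chunks = defaultdict(list)
    for rec in node_records:
        chunks[stable_hash(key_fn(rec), salt="local:") % num_chunks].append(rec)
    return chunks

# 100 hypothetical (source_ip, revenue) records, partitioned on source_ip.
records = [(f"10.0.0.{i}", i * 0.1) for i in range(100)]
by_node = global_partition(records, key_fn=lambda r: r[0], num_nodes=4)
chunked = {
    node: local_chunks(recs, lambda r: r[0], num_chunks=3)
    for node, recs in by_node.items()
}
```

Each chunk would then be bulk-loaded into the database on its node, so records with the same key always end up on the same node.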
20. SMS Planner
• Extends Hive
• Steps
• Parser transforms the query into an abstract syntax tree (AST)
• Get table schema information from the catalog
• Logical plan generator creates the query plan
• Optimizer breaks the plan up into Map and Reduce phases
• Executable plan generated for one or more MapReduce jobs
• SMS tries to push as much work as possible into the database layer
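The idea of pushing work into the database layer can be illustrated with a grouped aggregation. In this sketch each "node" is an in-memory sqlite3 database standing in for a single-node DBMS, the Map phase runs a pushed-down SQL query locally, and the Reduce phase only merges partial sums. The table, columns, and function names are assumptions for illustration, and the sketch does not reproduce the planner itself.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create a stand-in single-node database holding one partition of the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE uservisits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO uservisits VALUES (?, ?)", rows)
    return conn

# SQL pushed down to every node: partial aggregation runs inside the database.
PUSHED_SQL = (
    "SELECT source_ip, SUM(ad_revenue) FROM uservisits GROUP BY source_ip"
)

def map_phase(conn):
    """Map task: run the pushed-down SQL locally and emit (key, partial_sum)."""
    return conn.execute(PUSHED_SQL).fetchall()

def reduce_phase(pairs):
    """Reduce task: merge partial sums for keys that appear on several nodes."""
    totals = defaultdict(float)
    for key, partial in pairs:
        totals[key] += partial
    return dict(totals)

nodes = [
    make_node([("10.0.0.1", 1.0), ("10.0.0.2", 2.0)]),
    make_node([("10.0.0.1", 3.0), ("10.0.0.3", 4.0)]),
]
intermediate = [pair for conn in nodes for pair in map_phase(conn)]
totals = reduce_phase(intermediate)
```

If the table were already partitioned on source_ip, every key would live on exactly one node and the merge step would have nothing left to combine.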
22. Benchmarking
• Environment
• Amazon EC2 “large” instances
• Each instance
• 7.5 GB memory, 2 virtual cores, 850 GB storage, 64-bit Linux Fedora 8
• Systems
• Hadoop
• 256 MB data blocks, 1024 MB heap size, 200 MB sort buffer
• HadoopDB
• Configuration similar to Hadoop, PostgreSQL 8.2.5, data not compressed
• Vertica
• Used a cloud edition
• All data is compressed
• DBMS-X
• Commercial parallel row-oriented database
• Run on EC2 (no cloud edition available)
23. Benchmarking
• Used data
• HTTP log files, HTML pages, rankings
• Sizes (per node):
• 155 million user visits (~20 GB)
• 18 million rankings (~1 GB)
• Stored as plain text in HDFS
24. Evaluating HadoopDB
• Compare HadoopDB to
• 1. Hadoop
• 2. Parallel databases (Vertica, DBMS-X)
• Features:
• 1. Performance:
• We expected HadoopDB to approach the performance of parallel
databases
• 2. Scalability:
• We expected HadoopDB to scale as well as Hadoop
• We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of
10, 50, and 100 nodes.
28. Performance: Results
• Load: data loads are slower than Hadoop, but faster than
parallel databases
• Runtime:
• Structured data: HadoopDB is faster than Hadoop but slower than
parallel databases (HadoopDB’s performance is close to that of parallel
databases)
• Unstructured data: HadoopDB’s performance matches Hadoop
29. Scalability: Setup
• Simple aggregation task - full table scan
• Data replicated across 10 nodes
• Fault tolerance: kill a node halfway through the query
• Fluctuation tolerance: slow down a node for the entire
experiment
30. Scalability: Results
• HadoopDB and Hadoop take advantage of runtime scheduling
by splitting data into chunks
• Parallel databases restart entire query on node failure or wait
for the slowest node
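The difference between waiting for the slowest node and scheduling chunks at runtime can be shown with a toy simulation. It is not a model of Hadoop's scheduler: the node speeds, work units, and chunk count are made-up numbers, and it covers only the slow-node case, not node failure.

```python
import heapq

def static_finish_time(total_work, speeds):
    """Static partitioning: every node gets an equal share, and the query
    finishes only when the slowest node finishes its share."""
    share = total_work / len(speeds)
    return max(share / s for s in speeds)

def dynamic_finish_time(total_work, speeds, num_chunks):
    """Runtime scheduling: work is split into chunks, and whichever node
    becomes free first takes the next chunk."""
    chunk = total_work / num_chunks
    # Heap of (time at which the node is next free, node index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_chunks):
        t, i = heapq.heappop(free_at)
        done = t + chunk / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

# Ten nodes, one of them running at a quarter of the normal speed.
speeds = [1.0] * 9 + [0.25]
static_t = static_finish_time(1000, speeds)
dynamic_t = dynamic_finish_time(1000, speeds, num_chunks=200)
```

With static shares the slow node dictates the finish time, while with small chunks the fast nodes absorb most of the work the slow node would otherwise hold up.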
31. To Summarize
• HadoopDB - a hybrid of DBMS and MapReduce
• HadoopDB is close in performance to parallel databases
• HadoopDB can operate in truly heterogeneous
environments and has the fault tolerance of Hadoop
• HadoopDB is free and open source:
http://hadoopdb.sourceforge.net
32. Related Work
• Pig project at Yahoo
• SCOPE project at Microsoft
• Hive project
33. Future Work
• Integration with other open source databases
• Full automation of the loading and replication process
• Dynamically adjusting fault-tolerance levels based on failure
rate
Hadoop = open-source MapReduce
Parallel databases = shared-nothing RDBMS
What parallel databases got right: data partitioning, indexing, parallel sorts, joins, aggregation
Interesting ideas in MR:
• Very flexible; can handle almost any data type: records, arrays, images
• Runtime job scheduler & load balancing
• Fault tolerance and straggler handling
It is impossible to get homogeneous performance across hundreds or thousands of nodes, even if the nodes run on identical hardware or on identical virtual machines.
Scaling not performance
Parallel DBMS: best at ad-hoc analytical queries; substantially faster once data is loaded, but loading the data takes considerably longer. Who wants to program parallel joins?
MapReduce: well suited to extract, transform, load (ETL) tasks; easy to use for complex analytic tasks.
Basic design idea: multiple, independent, single-node databases coordinated by Hadoop.
Fault tolerance: scheduling and job tracking implementation from Hadoop.
NameNode maintains metadata about the size and location of blocks and their replicas.
MapReduce framework: master-slave architecture. The master is a single JobTracker and the slaves are TaskTrackers.
Each job is broken down into Map tasks and Reduce tasks.
The JobTracker assigns tasks to TaskTrackers based on locality and load balancing.
It achieves locality by matching a TaskTracker to Map tasks that process data local to it.
It load-balances by ensuring all available TaskTrackers are assigned tasks.
AST building
Semantic analyzer connects to catalog
DAG of relational operators
Optimizer restructuring
Convert plan to M/R jobs
DAG of M/R jobs serialized in an XML plan
SMS planner extends Hive
After global partition
Small query and large query
HadoopDB distinguishes itself from many of the current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters.