Sf NoSQL MeetUp: Apache Hadoop and HBase

Apache Hadoop and HBase Todd Lipcon todd@cloudera.com @tlipcon @cloudera August 9th, 2011

Outline Why should you care? (Intro) What is Apache Hadoop? How does it work? What is Apache HBase? Use Cases

Software Engineer at Committer and PMC member on Apache HBase, HDFS, MapReduce, and Thrift Previously: systems programming, operations, large scale data analysis I love data and data systems Introductions

“Every two days we create as much information as we did from the dawn of civilization up until 2003.” Eric Schmidt (Chairman of Google)

“I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” Hal Varian (Google’s chief economist)

Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), … . Are you throwing it away because it doesn’t ‘fit’?

So, what’s Hadoop? The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

Apache Hadoop is anopen-source systemto reliably store and processGOBS of dataacross many commodity computers.

Two Core Components Store Process HDFS MapReduce Self-healing high-bandwidthclustered storage. Fault-tolerant distributed processing.

Falsehood #1: Machines can be reliable… Image: MadMan the Mighty CC BY-NC-SA

Hadoop separates distributed system fault-tolerance code from application logic. Unicorns Systems Programmers Statisticians

Falsehood #2: Machines deserve identities... Image:Laughing Squid CC BY-NC-SA

Hadoop lets you interact with a cluster, not a bunch of machines. Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Falsehood #3: Your analysis fits on one machine… Image: Matthew J. Stinson CC-BY-NC

Hadoop scales linearlywith data sizeor analysis complexity. Data-parallel or compute-parallel. For example: Extensive machine learning on <100GB of image data Simple SQL queries on >100TB of clickstream data Hadoop works for both applications!

Hadoop sounds likemagic.How is it possible?

A Typical Look... 5-4000 commodity servers(8-core, 24GB RAM, 4-12 TB, gig-E) 2-level network architecture 20-40 nodes per rack

Cluster nodes Master nodes (1 each) NameNode (metadata server and database) JobTracker (scheduler) Slave nodes (1-4000 each) DataNodes (block storage) TaskTrackers (task execution)

HDFS Data Storage DN 1 NameNode /logs/weblog.txt 64MB 64MB 30MB DN 2 blk_29232 158MB blk_19231 DN 3 blk_329432 DN 4

HDFS has split the file into 64MB blocks and stored it on the DataNodes. Now, we want to process that data.

The MapReduce Programming Model

You specify map() and reduce() functions.The framework does the rest.

map() map: K₁,V₁->list K₂,V₂ Key: byte offset 193284 Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326” Key: userimage Value: 2326 bytes The map function runs on the same node as the data was stored!

Input Format Wait! HDFS is not a Key-Value store! InputFormatinterprets bytes as a Key and Value 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326 Key: log offset 193284 Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”

The Shuffle Each map output is assigned to a “reducer” based on its key map output is grouped andsorted by key

reduce() K₂, iter(V₂)->list(K₃,V₃) Key: userimage Value: 2326 bytes (from map task 0001) Value: 1000 bytes (from map task 0008) Value: 3020 bytes (from map task 0120) Reducer function Key: userimage Value: 6346 bytes TextOutputFormat userimage 6346

Putting it together... Note: not limited to just one reducer. Result set may be many TB!

Hadoop is not NoSQL(NoNoSQL? Sorry…) Hive project adds SQL support to Hadoop HiveQL (SQL dialect) compiles to a query plan Query plan executes as MapReduce jobs

Hive Example CREATE TABLE movie_rating_data ( userid INT, movieid INT, rating INT, unixtime STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '‘ STORED AS TEXTFILE; LOAD DATA INPATH ‘/datasets/movielens’ INTO TABLE movie_rating_data; CREATE TABLE average_ratings AS SELECT movieid, AVG(rating) FROM movie_rating_data GROUP BY movieid;

The Hadoop Ecosystem (Column DB)

Hadoop in the Wild(yes, it’s used in production) Yahoo! Hadoop Clusters: > 82PB, >40k machines(as of Jun ‘11) Facebook: 15TB new data per day;1200+ machines, 30PB in one cluster Twitter: >1TB per day, ~120 nodes Lots of 5-40 node clusters at companies withoutpetabytes of data (web, retail, finance, telecom, research, government)

What about real time access? MapReduce is a batch system The fastest MR job takes 15+ seconds HDFS just stores bytes, and is append-only Not about to serve data for your next web site.

Apache HBase HBase is an open source, distributed, sorted map modeled after Google’s BigTable

Open Source Apache 2.0 License Committers and contributors from diverse organizations Cloudera, Facebook, StumbleUpon, Trend Micro, etc.

Distributed Store and access data on 1-1000 commodity servers Automatic failover based on Apache ZooKeeper Linear scaling of capacity and IOPS by adding servers

Sorted Map Datastore Tables consist of rows, each of which has a primary key (row key) Each row may have any number of columns -- like a Map<byte[], byte[]> Rows are stored in sorted order

Sorted Map Datastore(logical view as “records”) Implicit PRIMARY KEY in RDBMS terms Data is all byte[] in HBase Different types of data separated into different “column families” Different rows may have different sets of columns(table is sparse) Useful for *-To-Many mappings A single cell might have different values at different timestamps

Sorted Map Datastore(physical view as “cells”) info Column Family roles Column Family Sorted on disk byRow key, Col key, descending timestamp Milliseconds since unix epoch

Column Families Different sets of columns may have different properties and access patterns Configurable by column family: Block Compression (none, gzip, LZO, Snappy) Version retention policies Cache priority CFs stored separately on disk: access one without wasting IO on the other.

HBase API get(row) put(row, Map<column, value>) scan(key range, filter) increment(row, columns) … (checkAndPut, delete, etc…) MapReduce/Hive

Accessing HBase Java API (thick client) REST/HTTP Apache Thrift (any language) Hive/Pig for analytics

High Level Architecture Your PHP Application MapReduce Hive/Pig Thrift/REST Gateway Your Java Application Java Client ZooKeeper HBase HDFS

HBase vs just HDFS If you have neither random write nor random read, stick to HDFS!

HBase vs other “NoSQL” Favors Strict Consistency over Availability (but availability is good in practice!) Great Hadoop integration (very efficient bulk loads, MapReduce analysis) Ordered range partitions (not hash) Automatically shards/scales (just turn on more servers, really proven at petabyte scale) Sparse column storage (not key-value)

HBase in Numbers Largest cluster: ~1000 nodes, ~1PB Most clusters: 5-20 nodes, 100GB-4TB Writes: 1-3ms, 1k-10k writes/sec per node Reads: 0-3ms cached, 10-30ms disk 10-40k reads / second / node from cache Cell size: 0-3MB preferred

HBase in Production Facebook (Messages, Analytics, operational datastore, more on the way) [see SIGMOD paper] StumbleUpon / http://su.pr Mozilla (receives crash reports) Yahoo (stores a copy of the web) Twitter (stores users and tweets for analytics) … many others

Ok, fine, what next? Get Hadoop! CDH - Cloudera’s Distribution including Apache Hadoophttp://cloudera.com/ http://hadoop.apache.org/ Try it out! (Locally, VM, or EC2) Watch free training videos onhttp://cloudera.com/

Thanks! todd@cloudera.com @tlipcon (feedback? yes!) (hiring? yes!)

Sf NoSQL MeetUp: Apache Hadoop and HBase

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Sf NoSQL MeetUp: Apache Hadoop and HBase

Ähnlich wie Sf NoSQL MeetUp: Apache Hadoop and HBase (20)

Mehr von Cloudera, Inc.

Mehr von Cloudera, Inc. (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Sf NoSQL MeetUp: Apache Hadoop and HBase

Hinweis der Redaktion