Introduction to Hadoop.
What are Hadoop, MapReduce, and the Hadoop Distributed File System (HDFS)?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, and Mahout?
- There is a flood of data and content being produced: user-generated content, social networks, sharing, logging and tracking
- Google, Yahoo, and others need to index the entire internet and return search results in milliseconds
- NYSE generates 1 TB of data per day
- Facebook uses Hadoop to manage 400 terabytes of stored data and ingest 20 terabytes of new data per day; it hosts approx. 10 billion photos, about 1 petabyte (2009)
- The challenge is to both store and analyze this data
  - reliably (computers break down, storage crashes)
  - affordably (fast, reliable systems are expensive)
  - and quickly (lots of data takes time to process)
- Split up the data
- Run jobs in parallel
- Recombine the results to get the answer
- Schedule work across an arbitrarily sized cluster
- Handle fault tolerance
- Since even the best systems break down, use cheap commodity computers
- Open-source Apache project
- Grew out of the Apache Nutch project, an open-source search engine
- Inspired by two Google papers:
  - The Google File System (2003): distributed filesystem for fault-tolerant data storage
  - MapReduce (2004): programming model for parallel processing
A MapReduce job usually splits the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks.

UNIX:      cat input | grep "pattern" | sort           | uniq -c | cat > output
MapReduce: Input     | Map            | Shuffle & Sort | Reduce  | Output
The MapReduce framework operates exclusively on <key, value> pairs: it views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
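To make the <k1, v1> -> <k2, v2> -> <k3, v3> flow concrete, here is a minimal word-count sketch against the Hadoop 2.x org.apache.hadoop.mapreduce API. The class names follow the stock WordCount example; treat this as an illustrative sketch rather than the only way to write it.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: <byte offset, line of text>  ->  <word, 1>
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // emit <k2, v2>
        }
    }
}

// reduce: <word, [1, 1, ...]>  ->  <word, count>
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);            // emit <k3, v3>
    }
}

For word count, the same reducer class can also be registered as the combiner, which is what the -> combine -> step in the flow above refers to: partial sums are computed on each map node before the shuffle.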
- Files are split into large blocks
- Designed for streaming reads and appending writes, not random access
- 3 replicas of each block by default
- Data can be encoded/archived
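From client code, storing data in HDFS goes through the org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming a configured HDFS and purely illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to the configured filesystem (HDFS here)

        // Copy a local file into HDFS; it is split into large blocks behind the scenes.
        Path local = new Path("/tmp/access.log");        // illustrative local path
        Path remote = new Path("/data/logs/access.log"); // illustrative HDFS path
        fs.copyFromLocalFile(local, remote);

        // Replication is per-file; 3 replicas is the default, but it can be changed.
        fs.setReplication(remote, (short) 3);

        fs.close();
    }
}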
- Hadoop brings the computation as physically close to the data as possible for best bandwidth, instead of copying the data to the computation
- It tries to use the same node, then the same rack, then the same data center
- Lost data is automatically re-replicated
- Tasks that take too long or run on flaky nodes are automatically killed and restarted on another node
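The kill-and-restart behavior for slow tasks is called speculative execution and is switched on per job. A minimal sketch, assuming Hadoop 2.x property names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow backup attempts for tasks that run unusually slowly; the framework
        // keeps whichever attempt finishes first and kills the stragglers.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "speculation demo");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
    }
}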
- The simplest example
- Most Hadoop jobs are actually a series of jobs that first prepare the data by filtering, cleaning, and formatting (a driver sketch for such a pipeline follows)
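A sketch of such a pipeline as a single driver chaining two jobs: a hypothetical map-only cleaning pass followed by the word count from the earlier sketch. The intermediate path and the CleanMapper logic are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Pipeline {

    // Hypothetical cleaning step: normalize to lower case and drop blank lines.
    public static class CleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim().toLowerCase();
            if (!line.isEmpty()) {
                out.set(line);
                context.write(out, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path raw = new Path(args[0]);
        Path cleaned = new Path(args[1] + "_cleaned");   // intermediate output, illustrative
        Path counts = new Path(args[1]);

        // Job 1: filter/clean/format the raw input (map-only, no reducers).
        Job clean = Job.getInstance(conf, "clean");
        clean.setJarByClass(Pipeline.class);
        clean.setMapperClass(CleanMapper.class);
        clean.setNumReduceTasks(0);
        clean.setOutputKeyClass(Text.class);
        clean.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(clean, raw);
        FileOutputFormat.setOutputPath(clean, cleaned);
        if (!clean.waitForCompletion(true)) System.exit(1);

        // Job 2: word count over the cleaned data, reusing the mapper/reducer
        // classes from the earlier word-count sketch.
        Job count = Job.getInstance(conf, "word count");
        count.setJarByClass(Pipeline.class);
        count.setMapperClass(TokenizerMapper.class);
        count.setCombinerClass(IntSumReducer.class);
        count.setReducerClass(IntSumReducer.class);
        count.setOutputKeyClass(Text.class);
        count.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(count, cleaned);
        FileOutputFormat.setOutputPath(count, counts);
        System.exit(count.waitForCompletion(true) ? 0 : 1);
    }
}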
Yahoo!
- More than 100,000 CPUs in over 25,000 computers running Hadoop
- Biggest cluster: 4,000 nodes (2x4-CPU boxes with 4x1 TB disks and 16 GB RAM)
- Used to support research for Ad Systems and Web Search

Facebook
- All their stats: daily and hourly reports on user growth, page views, average time spent on page, ad campaign performance
- Friend and application suggestions
- Ad hoc jobs on historical data for product and executive teams to compare performance of new features

Netflix
- Movie recommendation; runs jobs every hour to parse and analyze logs

eHarmony
- Writes MapReduce in Ruby to match 20 million people and improve its algorithms

NYTimes
- Used Hadoop to process 4 TB of scanned archives and convert them to PDF in 24 hours on 100 EC2 machines

Last.fm
- Hundreds of daily jobs: analyzing logs, evaluating A/B tests, generating charts
What is the difference between standalone and pseudo-distributed mode?
- Standalone (local) mode: everything runs in a single JVM against the local filesystem, with no daemons started; useful for development and debugging
- Pseudo-distributed mode: all Hadoop daemons run on one machine, each in its own JVM, with data stored in HDFS; useful for testing a realistic setup on a single box
- How do you set up your own cluster?
- Cloudera's distribution runs on your own cluster
- Cloudera also provides scripts to launch and manage EC2 clusters