1. Hadoop Summit 2012
Infrastructure Around Hadoop
Backups, failover, configuration and monitoring
Terran Melconian, Edmund MacKenty
tripadvisor.com/careers
2. What TripAdvisor Does
• World's largest travel site and community
• Trip planning and user reviews
• >50 million unique monthly visitors, 30 countries*
• >60 million reviews and opinions*
• Run like a startup: 30+ teams all doing their own thing
• Heavy use of open-source projects
• Speed Wins!
* source: comScore Media Metrix for TripAdvisor Sites, Worldwide, January 2012
3. What the Warehouse Team Does
• Retain and aggregate historic site activity data
• Make data available throughout the company
• Hits, reviews, forums, contacts, locations, businesses, etc.
• ~50 nodes in 4 clusters: Cloudera CDH3u3 (Hadoop 0.20.2)
• Used by ~12 analytics teams, heavy use of Hive
• Some jobs must run every day (e.g. ETL, aggregations)
• Systems are very open, we trust our users (usually)
• 3 people, fairly new to Hadoop/Hive
4. Why Hadoop at TripAdvisor
• Hadoop is how we scale analysis past the limits of one machine
– Some daily jobs were taking nearly 24 hours, and we're still growing quickly
• Our old RDBMS data warehouse could barely keep up with data
ingestion, even running on expensive hardware with a SAN
– We obtained 20x improvement in wall clock time
• Reprocess unaggregated historical data as definitions change
– Before, impossible except for a small sample
– Now, reprocess years of data at the finest level in a few days
• Efficient platform for many kinds of statistics
– Representative example: five-hour RDBMS job went to 25 minutes
5. HA NameNode: DRBD, Corosync and Pacemaker
• Namenode and JobTracker run on “master” node
• Datanode and TaskTracker run on “slave” nodes
• Automatic fail-over of all master-node services to a passive node
• Provision two identical systems
• Set up virtual Master IP address to be failed over
• Secondary namenode on passive node, if available
• Monitor and automatically restart failed services
6. DRBD/Corosync Configuration
• DRBD: replicates namenode image, Hive metadata, Oozie job data
– Create two identical storage devices (we used RAID 1)
– Connect the master nodes with a cross-over ethernet cable
– Configure DRBD to use the cross-over and storage devices
– Use drbdadm to create the replicated device
– Create a filesystem on /dev/drbd0 with mkfs
– cat /proc/drbd to see the state of the device
– Once created, use /etc/init.d/drbd to manage it (steps sketched below)
• Corosync: messaging between active-passive masters
– Configure Corosync to also use the cross-over ethernet cable
– Corosync will start Pacemaker for you
– Use /etc/init.d/corosync to manage it, and Pacemaker
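A minimal sketch of the drbdadm steps above, assuming a DRBD resource named r0 backed by /dev/drbd0; resource name, filesystem choice, and the promotion syntax (which varies by DRBD version) are assumptions:

# On both masters, after writing the resource configuration:
drbdadm create-md r0        # write DRBD metadata on the backing device
/etc/init.d/drbd start      # bring up replication over the cross-over link
# On the node that will be active, for the initial sync only:
drbdadm primary --force r0  # DRBD 8.3: drbdadm -- --overwrite-data-of-peer primary r0
mkfs -t ext3 /dev/drbd0     # create a filesystem on the replicated device
cat /proc/drbd              # check connection and sync state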
7. Pacemaker Configuration
• Define each resource you want to manage:
– DRBD device, master IP address, ethernet connectivity checks,
Hadoop namenode and jobtracker, Hive thrift server, MySQL for Hive
metadata, Oozie for workflow coordination
• Set monitoring intervals for each resource
• Define resource co-location dependencies
• Define resource ordering dependencies
• Restarts failed services, e.g. the Hive Thrift server
• Use crm tool to manage nodes and resources
• Test with a manual fail-over (see the sketch below):
– Migrate the namenode resource to the passive master
– Use crm status to watch all resources move over
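For illustration, the crm definitions for the master IP and one service could look like this sketch; resource names, the IP address, and the LSB init-script name are placeholders, not our exact configuration:

crm configure primitive master-ip ocf:heartbeat:IPaddr2 \
    params ip=10.0.0.10 cidr_netmask=24 op monitor interval=10s
crm configure primitive namenode lsb:hadoop-0.20-namenode \
    op monitor interval=30s
crm configure colocation nn-with-ip inf: namenode master-ip
crm configure order ip-before-nn inf: master-ip namenode
crm resource migrate master-ip passive-master   # manual fail-over test
crm status                                      # watch resources move over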
8. Monitoring: Ganglia and Nagios, Job Tracking
• Visibility into cluster operations
• Monitor hardware states and resource usage
• Notify on specific boundary or failure conditions
• Track MapReduce jobs and Hive tables
• Identify immediate problems
• Show trends over time to predict future needs
9. Ganglia
• Standard monitoring of CPU, Memory, Disk usage, etc.
• Perl script parses Hadoop metrics, sends them using gmetric(1) (example call below)
• ~50 Hadoop metrics, ~30 system metrics
• Graphs for entire cluster and individual nodes
• Example: Two jobs with different resource profiles
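The gmetric(1) side of that script is one command-line call per metric; a hypothetical example (metric name and value are invented):

gmetric --name hdfs_bytes_read --value 1234567 --type uint32 --units bytes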
10. Nagios
• Our primary notification system
• About 80 checks, ~25 are our own. Examples:
– check_hdp_connectivity: can master talk to all its slaves?
– check_hdp_data_nodes: are all configured slave datanodes running?
– check_hdp_max_mr_settings: does jobtracker have resources we expect?
– check_hadoop_master_logfiles: are logs being written to?
– check_hive_server: is it up? (sketched below)
• Some warnings:
– Do not let Nagios run hadoop fsck (check_hdp_hdfs)
– LDAP failure causes email cascade
– High loads can cause timeouts, which cause notifications
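As a sketch, a check like check_hive_server can be as simple as a TCP probe of the Hive Thrift port, using standard Nagios exit codes; the real script is site-specific, and the port and timeout here are assumptions:

#!/bin/sh
# check_hive_server sketch: is the Hive Thrift server accepting connections?
HOST=${1:-localhost}; PORT=${2:-10000}
if nc -z -w 5 "$HOST" "$PORT"; then
    echo "HIVE OK - Thrift server responding on $HOST:$PORT"
    exit 0    # Nagios OK
else
    echo "HIVE CRITICAL - no response on $HOST:$PORT"
    exit 2    # Nagios CRITICAL
fi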
11. Job Tracking
• Perl script invoked frequently by cron
• Parses jobtracker log entries since last run
• Records data on each job in a PostgreSQL DB:
– Job ID, user, submitting IP and time, status
– Cluster ID, queue, Hive query
– Start/stop times for the job and for the first mapper and reducer
– Mapper and reducer counts, max memory, slots, splits
• CGI script to do queries (sample query below):
– Running jobs, failed jobs, MapReduce capacity usage
– Job resource usage by status, queue, user
• Helps post-mortem of problems
• Used to predict trends, future resource needs
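With the data in PostgreSQL, the queries are plain SQL; a hypothetical example (database, table, and column names are assumptions, not our schema):

psql warehouse_jobs -c "
    SELECT job_id, username, submit_time, status
    FROM jobs
    WHERE status = 'FAILED'
      AND submit_time >= now() - interval '1 day'
    ORDER BY submit_time;"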
12. Other cron scripts we run
• check_load:
– Dumps Java stack trace when load is too high
– Emails list of top processes so we can see what was wrong
• Master nodes:
– Compresses Hadoop/Hive logs more than 30 days old (crontab sketch below)
– Removes logs more than 120 days old (we keep 10+ GB)
– check_hdfs: Runs hadoop fsck to see if HDFS is “healthy”
– Backs up the current namenode fsimage
• Slave nodes:
– check_disks: Removes read-only disks from datanode configuration
– check_load: Kills some tasks and notifies us when load is too high
• Refresh production data to development cluster
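The master-node log rotation can be expressed as two find(1) crontab entries; paths and times below are assumptions:

# Compress Hadoop/Hive logs older than 30 days; remove them after 120 days.
0 3 * * *  find /var/log/hadoop -name '*.log.*' ! -name '*.gz' -mtime +30 -exec gzip {} \;
30 3 * * * find /var/log/hadoop -name '*.gz' -mtime +120 -delete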
13. Configuration Management
• Seems like extra work at first, but essential as you grow.
• Not Hadoop-specific: manage OS packages, Nagios and Ganglia
scripts, cron jobs, svn, SSH keys, NFS mounts, jars
– Consistent UID/GIDs critical with DRBD
– We replace some jars from the RPMs with local fixes
– Templatized configuration files are very convenient. ERB is good.
– SSH keys made consistent across nodes, masters share host key
• Use SVN as the file delivery mechanism: a checkout on each box (sketched below)
• We chose Puppet as a tool
– Gets the job done
– Lacks flexibility in inheritance to specialize defaults per-machine
– Some aspects of operation are hard to debug
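The SVN-as-delivery pattern is just a working copy on every node, refreshed from cron; the repository URL and paths are placeholders:

svn checkout https://svn.example.com/ops/config /opt/ops-config   # once, at provision time
0 * * * * svn update -q /opt/ops-config                          # then hourly from cron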
14. Backup: HDFS and Hive DDL
• Objectives:
– Provide safety against total HDFS failure due to software bugs or
machine room environmental incident
– Protect against user error in dropping or overwriting tables
– Restore data to another cluster
• Assumptions
– Repeating one day of processing is acceptable when restoring
• Components
– Incremental HDFS backup
– Hive DDL backup
• Runs on separate backup server with storage (NexSan)
– Pull process driven by processes on backup server
15. Backup HDFS
• Open-source Java app
• Requires customization to your environment
• Traverses HDFS directory tree
• Copies out files modified after a given date (sketched below)
• Doesn't copy very new directories
– Needed a way to avoid copying files still being written at the time of backup
– HDFS has no snapshots
• Ignores specified directories
• Generates restore shell scripts to set owners, perms
• Verification tool checks file sizes and checksums
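The real tool is a Java app, but the core loop can be sketched against 0.20's hadoop fs -lsr output; the cutoff date and paths are assumptions, and this ignores the guard against very new files (and breaks on paths with spaces):

CUTOFF='2012-06-01'     # date of the last successful backup
hadoop fs -lsr /warehouse |
  awk -v d="$CUTOFF" '$1 !~ /^d/ && $6 > d {print $8}' |   # files modified after cutoff
  while read f; do
    mkdir -p "/backup$(dirname "$f")"
    hadoop fs -copyToLocal "$f" "/backup$f"
  done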
16. Backup Hive DDL
• Open-source Java app that uses the Thrift server
• Iterates over all tables and views
• Constructs DDL statements from Hive metadata
• Ignores specific tables
• Generates Hive command script
– Recreates all tables, adds all partitions back one at a time
• Used to move metadata to MySQL
• Restore a full cluster (sketched below):
– Copy files back with copyFromLocal
– Run perm/owner scripts
– Reapply Hive DDL
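The full-cluster restore amounts to three steps; the script and path names below are assumptions:

hadoop fs -copyFromLocal /backup/warehouse/* /user/hive/warehouse/   # restore the files
sh restore_perms.sh        # generated script: reset owners and permissions
hive -f restore_ddl.hql    # generated DDL: recreate tables, views, partitions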
17. Other Things To Potentially Back Up
• Back up the namenode metadata (one approach sketched below)
– We do this once every 4 hours
– This is in addition to mirroring on four physical drives
• Our job tracking database
• No general backups of root or local FS on machines
– Recreate machines with Puppet or other configuration management
tool instead
• Oozie job database
– We do NOT back this up
– Tightly coupled with HDFS state and restore would be problematic
– The recovery procedure is to rebuild and reinstall coordinators
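One way to take that 4-hourly fsimage copy on 0.20 is the same HTTP interface the secondary namenode uses to fetch the image; the hostname and backup path are placeholders, and our actual script may differ:

curl -sf -o "/backup/nn/fsimage.$(date +%Y%m%d%H)" \
    'http://namenode:50070/getimage?getimage=1'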
18. Oozie: Why
• Drawback: several times slower to write than cron jobs, while also less expressive
• Advantage: Ability to cleanly depend on input data
– With cron, you would have to poll for stamps
• Advantage: Clean and consistent metadata
– See what ran, what failed, what is still waiting and why
– Easily retry things which failed – good luck doing that with cron
– Output datasets are deleted on rerun so ordering is preserved
19. Oozie: How
• Establish consistent local practices for completion stamps, job
naming, owners, and source code locations
• Enforce that all jobs must be idempotent
• Create scripts/makefiles/build.xml to rebuild and reinstall jobs
after changes in their dependencies
• Bypass the Oozie GUI
– The CLI is a more capable tool
– Go straight to the Oozie backing DB and issue SQL queries (see the example query on slide 34)
• Rerun coordinator actions, not workflows (CLI example below)
• Don't ever use Derby – we experienced massive corruption
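Rerunning at the coordinator-action level is a single CLI call; the server URL, job ID, and action number below are placeholders:

oozie job -oozie http://oozie-host:11000/oozie \
    -rerun 0000123-120618000000000-oozie-C -action 42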
20. Experiences and Expectations
• Hadoop is not mature from a reliability and stability point of view
– It will probably get there in a few more years
• Cluster outages are common events, not outliers
– Must bounce key services to pick up basic configuration changes such
as adding a new queue
– As you scale up, you will encounter new classes of problems
– Example: kernel deadlocks during heavy disk IO
• You must design for failure and have a robust mechanism to
cleanly and easily resume execution once the cluster is back up.
• Important jobs must be isolated from developers
– Each cluster should contain ONE tier of jobs, grouped by SLA, release
process, and time-to-recovery requirements
21. Attributes of Robust Jobs
• Idempotent and resumable regardless of when/how terminated
• Has an external framework for recording success/failure, timing,
and amount of data processed
• Knows what input data it needs and waits for it to be ready
• Has mechanism for reprocessing if the input data is restated
• Checked into source control
• Testable in an expendable cluster before release
22. Benchmarks
• How to evaluate hardware/network changes or map/reduce slot
tuning?
– Key insight: For the same job, the same task always does the same
work
– Rerun the job and compare execution of the same task across machines
Machine Tasks Comps Relative Perf (larger is better)
~~~~~~~~~~~~ ~~~~~ ~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
type1_1 82 37 0.99 ====================
type1_2 91 76 0.98 ====================
type1_3 92 35 1.01 ====================
type1_4 88 85 1.06 =====================
type2_1 71 26 1.30 ==========================
type3_1 92 80 0.68 ==============
type4_1 78 42 1.19 ========================
type4_2 78 45 1.29 ==========================
type4_3 75 75 1.19 ========================
remote 546 534 0.97 ===================
local 378 69 1.05 =====================
23. Features you Should Use
• Fair Scheduler
• refreshNodes, refreshQueues
• Hadoop metrics
• Namenode audit logging (disabled by default in 0.20)
• Exclude files to decommission slave nodes (example below)
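Decommissioning with an exclude file, for example, is two steps; the hostname and file path (set via dfs.hosts.exclude) are assumptions:

echo slave42.example.com >> /etc/hadoop/conf/hosts.exclude
hadoop dfsadmin -refreshNodes    # namenode re-reads the exclude file
hadoop mradmin -refreshQueues    # likewise reloads queue configuration without a restart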
24. Staffing
• We're living proof that you can hire some engineers with good
fundamentals but no specialized experience and throw them in
the deep end (it's the TA way)
• Skills to hire for:
– Operations and Linux experience
– General service troubleshooting
– Scripting
– Java
– SQL (even if not using Hive)
• Managing clusters which are growing 2x - 4x per year takes 1-2
people working full time just to run in place
25. Open Questions
• Resuming of jobs on jobtracker restart
• Reloading of configurations without a restart
• Robust response to cluster OOM conditions
• Disabling job submission while allowing existing jobs to finish
• Please tell us if you have the answers!
34. Example Oozie Query
SELECT
a.todaystatus as today,
a.yesterdaystatus as yday,
j.status as parent,
j.app_name,
a.last_modified_time,
a.nominal_time,
a.id
FROM (
SELECT
t.status as todaystatus,
y.status as yesterdaystatus,
COALESCE(t.id, y.id) AS id,
y.job_id,
COALESCE(t.nominal_time, y.nominal_time) AS nominal_time,
COALESCE(t.last_modified_time, y.last_modified_time) AS last_modified_time
FROM (SELECT *
FROM COORD_ACTIONS
WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 0) t
RIGHT OUTER JOIN (SELECT *
FROM COORD_ACTIONS
WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 1) y
ON (t.job_id=y.job_id)
WHERE COALESCE(t.status, '') NOT IN ('SUCCEEDED', 'WAITING')
-- If they're WAITING today, then make sure yesterday ran OK.
OR (t.status = 'WAITING' and y.status <> 'SUCCEEDED')
UNION DISTINCT
-- Dummy record to force the table to exist even when empty, since MySQL
-- otherwise emits nothing if data is not returned.
SELECT 'EMPTY', 'RECORD', '', '', '', 'THIS IS A DUMMY RECORD'
) a
LEFT OUTER JOIN COORD_JOBS j
ON a.job_id=j.id
WHERE j.status = 'RUNNING' OR j.status IS NULL
;