1. Hadoop Summit 2012
Infrastructure Around Hadoop
Backups, failover, configuration and monitoring
Terran Melconian, Edmund MacKenty
tripadvisor.com/careers
2. What TripAdvisor Does
• World's largest travel site and community
• Trip planning and user reviews
• >50 million unique monthly visitors, 30 countries*
• >60 million reviews and opinions*
• Run like a startup: 30+ teams all doing their own thing
• Heavy use of open-source projects
• Speed Wins!
* source: comScore Media Metrix for TripAdvisor Sites, Worldwide, January 2012
3. What the Warehouse Team Does
• Retain and aggregate historic site activity data
• Make data available throughout the company
• Hits, reviews, forums, contacts, locations, businesses, etc.
• ~50 nodes in 4 clusters: Cloudera CDH3u3 (Hadoop 0.20.2)
• Used by ~12 analytics teams, heavy use of Hive
• Some jobs must run every day (e.g. ETL, aggregations)
• Systems are very open, we trust our users (usually)
• 3 people, fairly new to Hadoop/Hive
4. Why Hadoop at TripAdvisor
• Hadoop is how we scale analysis past the limits of one machine
– Some daily jobs were taking nearly 24 hours, and we're still growing quickly
• Our old RDBMS data warehouse could barely keep up with data
ingestion, even running on expensive hardware with a SAN
– We obtained 20x improvement in wall clock time
• Reprocess unaggregated historical data as definitions change
– Before, impossible except for a small sample
– Now, reprocess years of data at the finest level in a few days
• Efficient platform for many kinds of statistics
– Representative example: five-hour RDBMS job went to 25 minutes
5. HA NameNode: DRBD, Corosync and Pacemaker
• Namenode and JobTracker run on “master” node
• Datanode and TaskTracker run on “slave” nodes
• Automatic fail-over of all master-node services to a passive node
• Provision two identical systems
• Set up virtual Master IP address to be failed over
• Secondary namenode on passive node, if available
• Monitor and automatically restart failed services
6. DRBD/Corosync Configuration
• DRBD: replicates namenode image, Hive metadata, Oozie job data
– Create two identical storage devices (we used RAID 1)
– Connect the master nodes with a cross-over ethernet cable
– Configure DRBD to use the cross-over and storage devices
– Use drbdadm to create the replicated device
– Create a filesystem on /dev/drbd0 with mkfs
– cat /proc/drbd to see the state of the device
– Once created, use /etc/init.d/drbd to manage it (steps sketched below)
• Corosync: messaging between active-passive masters
– Configure Corosync to also use the cross-over ethernet cable
– Corosync will start Pacemaker for you
– Use /etc/init.d/corosync to manage it, and Pacemaker
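A minimal sketch of the drbdadm steps above, assuming a DRBD resource named r0 backed by /dev/drbd0; resource name, filesystem choice, and the promotion syntax (which varies by DRBD version) are assumptions:

# On both masters, after writing the resource configuration:
drbdadm create-md r0        # write DRBD metadata on the backing device
/etc/init.d/drbd start      # bring up replication over the cross-over link
# On the node that will be active, for the initial sync only:
drbdadm primary --force r0  # DRBD 8.3: drbdadm -- --overwrite-data-of-peer primary r0
mkfs -t ext3 /dev/drbd0     # create a filesystem on the replicated device
cat /proc/drbd              # check connection and sync state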
7. Pacemaker Configuration
• Define each resource you want to manage:
– DRBD device, master IP address, ethernet connectivity checks,
Hadoop namenode and jobtracker, Hive thrift server, MySQL for Hive
metadata, Oozie for workflow coordination
• Set monitoring intervals for each resource
• Define resource co-location dependencies
• Define resource ordering dependencies
• Restarts failed services, e.g. the Hive Thrift server
• Use crm tool to manage nodes and resources
• Test with a manual fail-over (see the sketch below):
– Migrate the namenode resource to the passive master
– Use crm status to watch all resources move over
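For illustration, the crm definitions for the master IP and one service could look like this sketch; resource names, the IP address, and the LSB init-script name are placeholders, not our exact configuration:

crm configure primitive master-ip ocf:heartbeat:IPaddr2 \
    params ip=10.0.0.10 cidr_netmask=24 op monitor interval=10s
crm configure primitive namenode lsb:hadoop-0.20-namenode \
    op monitor interval=30s
crm configure colocation nn-with-ip inf: namenode master-ip
crm configure order ip-before-nn inf: master-ip namenode
crm resource migrate master-ip passive-master   # manual fail-over test
crm status                                      # watch resources move over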
8. Monitoring: Ganglia and Nagios, Job Tracking
• Visibility into cluster operations
• Monitor hardware states and resource usage
• Notify on specific boundary or failure conditions
• Track MapReduce jobs and Hive tables
• Identify immediate problems
• Show trends over time to predict future needs
9. Ganglia
• Standard monitoring of CPU, Memory, Disk usage, etc.
• Perl script parses Hadoop metrics, sends them using gmetric(1) (example call below)
• ~50 Hadoop metrics, ~30 system metrics
• Graphs for entire cluster and individual nodes
• Example: Two jobs with different resource profiles
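The gmetric(1) side of that script is one command-line call per metric; a hypothetical example (metric name and value are invented):

gmetric --name hdfs_bytes_read --value 1234567 --type uint32 --units bytes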
10. Nagios
• Our primary notification system
• About 80 checks, ~25 are our own. Examples:
– check_hdp_connectivity: can master talk to all its slaves?
– check_hdp_data_nodes: are all configured slave datanodes running?
– check_hdp_max_mr_settings: does jobtracker have resources we expect?
– check_hadoop_master_logfiles: are logs being written to?
– check_hive_server: is it up? (sketched below)
• Some warnings:
– Do not let Nagios run hadoop fsck (check_hdp_hdfs)
– LDAP failure causes email cascade
– High loads can cause timeouts, which cause notifications
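As a sketch, a check like check_hive_server can be as simple as a TCP probe of the Hive Thrift port, using standard Nagios exit codes; the real script is site-specific, and the port and timeout here are assumptions:

#!/bin/sh
# check_hive_server sketch: is the Hive Thrift server accepting connections?
HOST=${1:-localhost}; PORT=${2:-10000}
if nc -z -w 5 "$HOST" "$PORT"; then
    echo "HIVE OK - Thrift server responding on $HOST:$PORT"
    exit 0    # Nagios OK
else
    echo "HIVE CRITICAL - no response on $HOST:$PORT"
    exit 2    # Nagios CRITICAL
fi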
11. Job Tracking
• Perl script invoked frequently by cron
• Parses jobtracker log entries since last run
• Records data on each job in a PostgreSQL DB:
– Job ID, user, submitting IP and time, status
– Cluster ID, queue, Hive query
– Start/stop times for the job and for the first mapper and reducer
– Mapper and reducer counts, max memory, slots, splits
• CGI script to do queries (sample query below):
– Running jobs, failed jobs, MapReduce capacity usage
– Job resource usage by status, queue, user
• Helps post-mortem of problems
• Used to predict trends, future resource needs
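With the data in PostgreSQL, the queries are plain SQL; a hypothetical example (database, table, and column names are assumptions, not our schema):

psql warehouse_jobs -c "
    SELECT job_id, username, submit_time, status
    FROM jobs
    WHERE status = 'FAILED'
      AND submit_time >= now() - interval '1 day'
    ORDER BY submit_time;"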
12. Other cron scripts we run
• check_load:
– Dumps Java stack trace when load is too high
– Emails list of top processes so we can see what was wrong
• Master nodes:
– Compresses Hadoop/Hive logs more than 30 days old (crontab sketch below)
– Removes logs more than 120 days old (we keep 10+ GB)
– check_hdfs: Runs hadoop fsck to see if HDFS is “healthy”
– Backs up the current namenode fsimage
• Slave nodes:
– check_disks: Removes read-only disks from datanode configuration
– check_load: Kills some tasks and notifies us when load is too high
• Refresh production data to development cluster
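The master-node log rotation can be expressed as two find(1) crontab entries; paths and times below are assumptions:

# Compress Hadoop/Hive logs older than 30 days; remove them after 120 days.
0 3 * * *  find /var/log/hadoop -name '*.log.*' ! -name '*.gz' -mtime +30 -exec gzip {} \;
30 3 * * * find /var/log/hadoop -name '*.gz' -mtime +120 -delete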
13. Configuration Management
• Seems like extra work at first, but essential as you grow.
• Not Hadoop-specific: manage OS packages, Nagios and Ganglia
scripts, cron jobs, svn, SSH keys, NFS mounts, jars
– Consistent UID/GIDs critical with DRBD
– We replace some jars from the RPMs with local fixes
– Templatized configuration files are very convenient. ERB is good.
– SSH keys made consistent across nodes, masters share host key
• Use SVN as the file delivery mechanism: a checkout on each box (sketched below)
• We chose Puppet as a tool
– Gets the job done
– Lacks flexibility in inheritance to specialize defaults per-machine
– Some aspects of operation are hard to debug
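The SVN-as-delivery pattern is just a working copy on every node, refreshed from cron; the repository URL and paths are placeholders:

svn checkout https://svn.example.com/ops/config /opt/ops-config   # once, at provision time
0 * * * * svn update -q /opt/ops-config                          # then hourly from cron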
14. Backup: HDFS and Hive DDL
• Objectives:
– Provide safety against total HDFS failure due to software bugs or
machine room environmental incident
– Protect against user error in dropping or overwriting tables
– Restore data to another cluster
• Assumptions
– Repeating one day of processing is acceptable when restoring
• Components
– Incremental HDFS backup
– Hive DDL backup
• Runs on separate backup server with storage (NexSan)
– Pull process driven by processes on backup server
15. Backup HDFS
• Open-source Java app
• Requires customization to your environment
• Traverses HDFS directory tree
• Copies out files modified after a given date (sketched below)
• Doesn't copy very new directories
– Needed a way to avoid copying files still being written at the time of backup
– HDFS has no snapshots
• Ignores specified directories
• Generates restore shell scripts to set owners, perms
• Verification tool checks file sizes and checksums
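The real tool is a Java app, but the core loop can be sketched against 0.20's hadoop fs -lsr output; the cutoff date and paths are assumptions, and this ignores the guard against very new files (and breaks on paths with spaces):

CUTOFF='2012-06-01'     # date of the last successful backup
hadoop fs -lsr /warehouse |
  awk -v d="$CUTOFF" '$1 !~ /^d/ && $6 > d {print $8}' |   # files modified after cutoff
  while read f; do
    mkdir -p "/backup$(dirname "$f")"
    hadoop fs -copyToLocal "$f" "/backup$f"
  done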
16. Backup Hive DDL
• Open-source Java app that uses the Thrift server
• Iterates over all tables and views
• Constructs DDL statements from Hive metadata
• Ignores specific tables
• Generates Hive command script
– Recreates all tables, adds all partitions back one at a time
• Used to move metadata to MySQL
• Restore a full cluster (sketched below):
– Copy files back with copyFromLocal
– Run perm/owner scripts
– Reapply Hive DDL
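The full-cluster restore amounts to three steps; the script and path names below are assumptions:

hadoop fs -copyFromLocal /backup/warehouse/* /user/hive/warehouse/   # restore the files
sh restore_perms.sh        # generated script: reset owners and permissions
hive -f restore_ddl.hql    # generated DDL: recreate tables, views, partitions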
17. Other Things To Potentially Back Up
• Back up the namenode metadata (one approach sketched below)
– We do this once every 4 hours
– This is in addition to mirroring on four physical drives
• Our job tracking database
• No general backups of root or local FS on machines
– Recreate machines with Puppet or other configuration management
tool instead
• Oozie job database
– We do NOT back this up
– Tightly coupled with HDFS state and restore would be problematic
– The recovery procedure is to rebuild and reinstall coordinators
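One way to take that 4-hourly fsimage copy on 0.20 is the same HTTP interface the secondary namenode uses to fetch the image; the hostname and backup path are placeholders, and our actual script may differ:

curl -sf -o "/backup/nn/fsimage.$(date +%Y%m%d%H)" \
    'http://namenode:50070/getimage?getimage=1'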
18. Oozie: Why
• Drawback: several times slower to write than cron jobs, while also less expressive
• Advantage: Ability to cleanly depend on input data
– With cron, you would have to poll for stamps
• Advantage: Clean and consistent metadata
– See what ran, what failed, what is still waiting and why
– Easily retry things which failed – good luck doing that with cron
– Output datasets are deleted on rerun so ordering is preserved
19. Oozie: How
• Establish consistent local practices for completion stamps, job
naming, owners, and source code locations
• Enforce that all jobs must be idempotent
• Create scripts/makefiles/build.xml to rebuild and reinstall jobs
after changes in their dependencies
• Bypass the Oozie GUI
– The CLI is a more capable tool
– Go straight to the Oozie backing DB and issue SQL queries (see the example query on slide 34)
• Rerun coordinator actions, not workflows (CLI example below)
• Don't ever use Derby – we experienced massive corruption
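Rerunning at the coordinator-action level is a single CLI call; the server URL, job ID, and action number below are placeholders:

oozie job -oozie http://oozie-host:11000/oozie \
    -rerun 0000123-120618000000000-oozie-C -action 42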
20. Experiences and Expectations
• Hadoop is not mature from a reliability and stability point of view
– It will probably get there in a few more years
• Cluster outages are common events, not outliers
– Must bounce key services to pick up basic configuration changes such
as adding a new queue
– As you scale up, you will encounter new classes of problems
– Example: kernel deadlocks during heavy disk IO
• You must design for failure and have a robust mechanism to
cleanly and easily resume execution once the cluster is back up.
• Important jobs must be isolated from developers
– Each cluster should contain ONE tier of jobs, grouped by SLA, release
process, and time-to-recovery requirements
21. Attributes of Robust Jobs
• Idempotent and resumable regardless of when/how terminated
• Has an external framework for recording success/failure, timing,
and amount of data processed
• Knows what input data it needs and waits for it to be ready
• Has mechanism for reprocessing if the input data is restated
• Checked into source control
• Testable in an expendable cluster before release
22. Benchmarks
• How to evaluate hardware/network changes or map/reduce slot
tuning?
– Key insight: For the same job, the same task always does the same
work
– Rerun the job and compare execution of the same task across machines
Machine Tasks Comps Relative Perf (larger is better)
~~~~~~~~~~~~ ~~~~~ ~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
type1_1 82 37 0.99 ====================
type1_2 91 76 0.98 ====================
type1_3 92 35 1.01 ====================
type1_4 88 85 1.06 =====================
type2_1 71 26 1.30 ==========================
type3_1 92 80 0.68 ==============
type4_1 78 42 1.19 ========================
type4_2 78 45 1.29 ==========================
type4_3 75 75 1.19 ========================
remote 546 534 0.97 ===================
local 378 69 1.05 =====================
23. Features you Should Use
• Fair Scheduler
• refreshNodes, refreshQueues
• Hadoop metrics
• Namenode audit logging (disabled by default in 0.20)
• Exclude files to decommission slave nodes (example below)
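Decommissioning with an exclude file, for example, is two steps; the hostname and file path (set via dfs.hosts.exclude) are assumptions:

echo slave42.example.com >> /etc/hadoop/conf/hosts.exclude
hadoop dfsadmin -refreshNodes    # namenode re-reads the exclude file
hadoop mradmin -refreshQueues    # likewise reloads queue configuration without a restart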
24. Staffing
• We're living proof that you can hire some engineers with good
fundamentals but no specialized experience and throw them in
the deep end (it's the TA way)
• Skills to hire for:
– Operations and Linux experience
– General service troubleshooting
– Scripting
– Java
– SQL (even if not using Hive)
• Managing clusters which are growing 2x - 4x per year takes 1-2
people working full time just to run in place
25. Open Questions
• Resuming of jobs on jobtracker restart
• Reloading of configurations without a restart
• Robust response to cluster OOM conditions
• Disabling job submission while allowing existing jobs to finish
• Please tell us if you have the answers!
34. Example Oozie Query
SELECT
a.todaystatus as today,
a.yesterdaystatus as yday,
j.status as parent,
j.app_name,
a.last_modified_time,
a.nominal_time,
a.id
FROM (
SELECT
t.status as todaystatus,
y.status as yesterdaystatus,
COALESCE(t.id, y.id) AS id,
y.job_id,
COALESCE(t.nominal_time, y.nominal_time) AS nominal_time,
COALESCE(t.last_modified_time, y.last_modified_time) AS last_modified_time
FROM (SELECT *
FROM COORD_ACTIONS
WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 0) t
RIGHT OUTER JOIN (SELECT *
FROM COORD_ACTIONS
WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 1) y
ON (t.job_id=y.job_id)
WHERE COALESCE(t.status, '') NOT IN ('SUCCEEDED', 'WAITING')
-- If they're WAITING today, then make sure yesterday ran OK.
OR (t.status = 'WAITING' and y.status <> 'SUCCEEDED')
UNION DISTINCT
-- Dummy record to force the table to exist even when empty, since MySQL
-- otherwise emits nothing if data is not returned.
SELECT 'EMPTY', 'RECORD', '', '', '', 'THIS IS A DUMMY RECORD'
) a
LEFT OUTER JOIN COORD_JOBS j
ON a.job_id=j.id
WHERE j.status = 'RUNNING' OR j.status IS NULL
;