Hadoop Summit 2012




 Infrastructure Around Hadoop

Backups, failover, configuration and monitoring

       Terran Melconian, Edmund MacKenty


               tripadvisor.com/careers
What TripAdvisor Does


•  World's largest travel site and community
•  Trip planning and user reviews
•  >50 million unique monthly visitors, 30 countries*
•  >60 million reviews and opinions*
•  Run like a startup: 30+ teams all doing their own thing
•  Heavy use of open-source projects
•  Speed Wins!




            * source: comScore Media Metrix for TripAdvisor Sites, Worldwide, January 2012


What the Warehouse Team Does


•  Retain and aggregate historic site activity data
•  Make data available throughout the company
•  Hits, reviews, forums, contacts, locations, businesses, etc.
•  ~50 nodes in 4 clusters: Cloudera CDH3u3 (Hadoop 0.20.2)
•  Used by ~12 analytics teams, heavy use of Hive
•  Some jobs must run every day (e.g. ETL, aggregations)
•  Systems are very open, we trust our users (usually)
•  3 people, fairly new to Hadoop/Hive




Why Hadoop at TripAdvisor


•  Hadoop is how we scale analysis past the limits of one machine
  –  Some daily jobs were taking nearly 24 hours, and we're still growing quickly

•  Our old RDBMS data warehouse could barely keep up with data
   ingestion, even running on expensive hardware with a SAN
  –  We obtained a 20x improvement in wall-clock time

•  Reprocess unaggregated historical data as definitions change
  –  Before, impossible except for a small sample
  –  Now, reprocess years of data at the finest level in a few days

•  Efficient platform for many kinds of statistics
  –  Representative example: five-hour RDBMS job went to 25 minutes




HA NameNode: DRBD, Corosync and Pacemaker


•  Namenode and JobTracker run on “master” node
•  Datanode and TaskTracker run on “slave” nodes
•  Automatic fail-over of all master-node services to a passive node
•  Provision two identical systems
•  Set up virtual Master IP address to be failed over
•  Secondary namenode on passive node, if available
•  Monitor and automatically restart failed services




DRBD/Corosync Configuration


•  DRBD: replicates namenode image, Hive metadata, Oozie job data
  –  Create two identical storage devices (we used RAID 1)
  –  Connect the master nodes with a cross-over ethernet cable
  –  Configure DRBD to use the cross-over and storage devices
  –  Use drbdadm to create the replicated device
  –  Create a filesystem on /dev/drbd0 with mkfs
  –  cat /proc/drbd to see the state of the device
  –  Once created, use /etc/init.d/drbd to manage it (command sketch after this list)

•  Corosync: messaging between active-passive masters
  –  Configure Corosync to also use the cross-over ethernet cable
  –  Corosync will start Pacemaker for you
  –  Use /etc/init.d/corosync to manage it, and Pacemaker
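
As a rough sketch, the initial DRBD bring-up boils down to the commands below (the resource name "internal" matches the appendix config; exact flags vary by DRBD version):

    # On both masters: write DRBD metadata and attach the resource
    drbdadm create-md internal
    drbdadm up internal

    # On the node that should become primary: force the initial sync,
    # then create the filesystem on the replicated device
    drbdadm -- --overwrite-data-of-peer primary internal
    mkfs -t ext3 /dev/drbd0

    # Watch replication state
    cat /proc/drbd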




Pacemaker Configuration


•  Define each resource you want to manage:
  –  DRBD device, master IP address, ethernet connectivity checks,
    Hadoop namenode and jobtracker, Hive thrift server, MySQL for Hive
    metadata, Oozie for workflow coordination

•  Set monitoring intervals for each resource
•  Define resource co-location dependencies
•  Define resource ordering dependencies
•  Restarts failed services, e.g. Hive Thrift
•  Use crm tool to manage nodes and resources
•  Test with a manual fail-over:
  –  Migrate the namenode resource to the passive master
  –  Use crm status to watch all resources move over
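
A minimal sketch of that test with the crm shell, using the resource names from the appendix config:

    crm resource migrate NameNode master02.tripadvisor.com
    crm status                       # watch the colocated resources follow
    crm resource unmigrate NameNode  # clear the migration constraint after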

Monitoring: Ganglia and Nagios, Job Tracking


•  Visibility into cluster operations
•  Monitor hardware states and resource usage
•  Notify on specific boundary or failure conditions
•  Track MapReduce jobs and Hive tables
•  Identify immediate problems
•  Show trends over time to predict future needs




Ganglia


•  Standard monitoring of CPU, Memory, Disk usage, etc.
•  Perl script parses Hadoop metrics, sends them via gmetric(1) (sketch below)
•  ~50 Hadoop metrics, ~30 system metrics
•  Graphs for entire cluster and individual nodes
•  Example: Two jobs with different resource profiles
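
A minimal sketch of the gmetric(1) half of that script; the metric name and value here are invented for illustration:

    # push one parsed Hadoop metric into Ganglia
    gmetric --name hdfs_capacity_used_pct --value 62.4 \
            --type float --units percent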




Nagios


•  Our primary notification system
•  About 80 checks, ~25 are our own. Examples:
  –  check_hdp_connectivity: can master talk to all its slaves?
  –  check_hdp_data_nodes: are all configured slave datanodes running?
  –  check_hdp_max_mr_settings: does jobtracker have resources we expect?
  –  check_hadoop_master_logfiles: are logs being written to?
  –  check_hive_server: is it up? (plugin sketch after this list)

•  Some warnings:
  –  Do not let Nagios run hadoop fsck (check_hdp_hdfs)
  –  LDAP failure causes email cascade
  –  High loads can cause timeouts, which cause notifications
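
For reference, a hypothetical check in the spirit of check_hive_server, not the real plugin; Nagios only cares about the one-line status message and the exit code (0=OK, 1=WARNING, 2=CRITICAL):

    #!/bin/bash
    # Probe the Hive Thrift port (host/port defaults are assumptions)
    HOST=${1:-hivemaster}
    PORT=${2:-10000}
    if (exec 3<>/dev/tcp/$HOST/$PORT) 2>/dev/null; then
      echo "HIVE OK - Thrift port $PORT reachable on $HOST"
      exit 0
    else
      echo "HIVE CRITICAL - cannot connect to $HOST:$PORT"
      exit 2
    fi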




Job Tracking


•  Perl script invoked frequently by cron (sketch after this list)
•  Parses jobtracker log entries since last run
•  Records data on each job in a PostgreSQL DB:
  –  Job ID, user, submitting IP and time, status
  –  Cluster ID, queue, Hive query
  –  Start/stop times for the job and its first mapper and reducer
  –  Mapper and reducer counts, max memory, slots, splits

•  CGI script to do queries:
  –  Running jobs, failed jobs, MapReduce capacity usage
  –  Job resource usage by status, queue, user

•  Helps with post-mortem analysis of problems
•  Used to predict trends, future resource needs
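
A stripped-down sketch of the collection loop; the log path, job-ID extraction, and table are illustrative assumptions, not the actual script:

    #!/bin/bash
    LOG=/var/log/hadoop/hadoop-jobtracker.log
    STAMP=/var/lib/jobtrack/offset
    off=$(cat "$STAMP" 2>/dev/null || echo 0)

    # read only what was appended since the last cron run
    tail -c +$((off + 1)) "$LOG" | grep -o 'job_[0-9]*_[0-9]*' | sort -u |
    while read -r jobid; do
      psql -d jobtrack -c \
        "INSERT INTO jobs (job_id, seen_at) VALUES ('$jobid', now())"
    done
    wc -c < "$LOG" > "$STAMP"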

Other cron scripts we run


•  Check_load:
  –  Dumps Java stack trace when load is too high
  –  Emails list of top processes so we can see what was wrong

•  Master nodes:
  –  Compresses Hadoop/Hive logs more than 30 days old
  –  Removes logs more than 120 days old (we keep 10+ GB)
  –  Check_hdfs: Runs hadoop fsck to see if HDFS is “healthy”
  –  Backs up the current namenode fsimage (sketch after this list)

•  Slave Nodes:
  –  Check_disks: Removes read-only disks from datanode configuration
  –  Check_load: Kills some tasks and notifies us when load is too high

•  Refresh production data to development cluster
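
The log-aging and fsimage items reduce to something like this sketch (paths and the namenode address are assumptions; the getimage servlet is the 0.20-era interface the secondary namenode itself uses):

    # compress logs older than 30 days, remove compressed logs older than 120
    find /var/log/hadoop /var/log/hive -name '*.log*' ! -name '*.gz' \
         -mtime +30 -exec gzip {} \;
    find /var/log/hadoop /var/log/hive -name '*.gz' -mtime +120 -delete

    # snapshot the current namenode fsimage over HTTP
    curl -s -o /backup/fsimage.$(date +%Y%m%d%H) \
         'http://master:50070/getimage?getimage=1'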


Configuration Management


•  Seems like extra work at first, but essential as you grow.
•  Not Hadoop-specific: manage OS packages, Nagios and Ganglia
   scripts, cron jobs, svn, SSH keys, NFS mounts, jars
  –  Consistent UID/GIDs critical with DRBD
  –  We replace some jars from the RPMs with local fixes
  –  Templatized configuration files very convenient. ERB is good.
  –  SSH keys made consistent across nodes, masters share host key

•  Use SVN as file delivery mechanism: checkout on each box
•  We chose Puppet as a tool
  –  Gets the job done
  –  Lacks flexibility in inheritance to specialize defaults per-machine
  –  Some aspects of operation are hard to debug



Backup: HDFS and Hive DDL


•  Objectives:
  –  Provide safety against total HDFS failure due to software bugs or
     machine room environmental incident
  –  Protect against user error in dropping or overwriting tables
  –  Restore data to another cluster

•  Assumptions
  –  Repeating one day of processing is acceptable when restoring

•  Components
  –  Incremental HDFS backup
  –  Hive DDL backup

•  Runs on separate backup server with storage (NexSan)
  –  Pull model: processes on the backup server drive the copying



Backup HDFS


•  Open-source Java app
•  Requires customization to your environment
•  Traverses HDFS directory tree
•  Copies out files modified after a given date (sketch after this list)
•  Doesn't copy very new directories
  –  Needed a way to avoid copying files still being written at backup time
  –  HDFS has no snapshots

•  Ignores specified directories
•  Generates restore shell scripts to set owners, perms
•  Verification tool checks file sizes and checksums
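
Conceptually, the incremental pass is equivalent to this sketch (built on 0.20's hadoop fs -lsr listing; the cutoff handling and paths are simplified assumptions):

    CUTOFF='2012-06-01 00:00'            # time of the previous backup
    hadoop fs -lsr /warehouse |
    awk -v c="$CUTOFF" '$1 !~ /^d/ && ($6 " " $7) > c { print $NF }' |
    while read -r path; do
      mkdir -p "/backup/hdfs$(dirname "$path")"
      hadoop fs -copyToLocal "$path" "/backup/hdfs$path"
    done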


Backup Hive DDL


•  Open-source Java app that uses the Hive Thrift server
•  Iterates over all tables and views
•  Constructs DDL statements from Hive metadata
•  Ignores specific tables
•  Generates Hive command script
  –  Recreates all tables, adds all partitions back one at a time

•  Used to move metadata to MySQL
•  Restore full cluster:
  –  Copy files back with copyFromLocal
  –  Run perm/owner scripts
  –  Reapply Hive DDL
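
As a sketch, the full restore amounts to (paths and script names hypothetical):

    # copy the backed-up files into HDFS
    for f in /backup/hdfs/warehouse/*; do
      hadoop fs -copyFromLocal "$f" "/warehouse/$(basename "$f")"
    done
    sh /backup/restore_perms.sh    # generated owner/permission script
    hive -f /backup/hive_ddl.q     # re-create tables, re-add partitions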


Other Things To Potentially Back Up


•  Back up the namenode metadata
  –  We do this once every 4 hours
  –  This is in addition to mirroring on four physical drives

•  Our job tracking database
•  No general backups of root or local FS on machines
  –  Recreate machines with Puppet or other configuration management
    tool instead

•  Oozie job database
  –  We do NOT back this up
  –  Tightly coupled with HDFS state and restore would be problematic
  –  The recovery procedure is to rebuild and reinstall coordinators




Oozie: Why


•  Drawback: several times slower to write than cronjobs, while also
   less expressive
•  Advantage: Ability to cleanly depend on input data
  –  With cron, you would have to poll for completion stamps

•  Advantage: Clean and consistent metadata
  –  See what ran, what failed, what is still waiting and why
  –  Easily retry things which failed – good luck doing that with cron
  –  Output datasets are deleted on rerun so ordering is preserved




Oozie: How


•  Establish consistent local practices for completion stamps, job
   naming, owners, and source code locations
•  Enforce that all jobs must be idempotent
•  Create scripts/makefiles/build.xml to rebuild and reinstall jobs
   after changes in their dependencies
•  Bypass the Oozie GUI
  –  The CLI is a more capable tool
  –  Go straight to the Oozie backing DB and issue SQL queries

•  Rerun coordinator actions, not workflows (example below)
•  Don't ever use Derby – we experienced massive corruption
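
For example, re-running one failed coordinator action from the CLI looks roughly like this (the job ID is a placeholder; -refresh re-evaluates the input dependencies):

    oozie job -oozie http://master:11000/oozie \
          -rerun 0000123-120601120000000-oozie-oozi-C -action 42 -refresh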




Experiences and Expectations


•  Hadoop is not mature from a reliability and stability point of view
  –  It will probably get there in a few more years

•  Cluster outages are common events, not outliers
  –  Must bounce key services to pick up basic configuration changes such
     as adding a new queue
  –  As you scale up, you will encounter new classes of problems
  –  Example: kernel deadlocks during heavy disk IO

•  You must design for failure and have a robust mechanism to
   cleanly and easily resume execution once the cluster is back up.
•  Important jobs must be isolated from developers
  –  Each cluster should contain ONE tier of jobs, grouped by SLA, release
    process, and time-to-recovery requirements



Attributes of Robust Jobs


•  Idempotent and resumable regardless of when/how terminated
•  Has an external framework for recording success/failure, timing,
   and amount of data processed
•  Knows what input data it needs and waits for it to be ready (sketch below)
•  Has mechanism for reprocessing if the input data is restated
•  Checked into source control
•  Testable in an expendable cluster before release
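
For instance, "waits for input" can be as simple as blocking on a completion stamp (path hypothetical):

    # block until the day's input is marked complete
    until hadoop fs -test -e /warehouse/hits/2012-06-01/_SUCCESS; do
      sleep 300
    done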




Benchmarks


•  How to evaluate hardware/network changes or map/reduce slot
   tuning?
  –  Key insight: For the same job, the same task always does the same
     work
  –  Rerun job and compare execution of the same task across machines
Machine      Tasks  Comps  Relative Perf (larger is better)
~~~~~~~~~~~  ~~~~~  ~~~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
type1_1         82     37  0.99  ====================
type1_2         91     76  0.98  ====================
type1_3         92     35  1.01  ====================
type1_4         88     85  1.06  =====================
type2_1         71     26  1.30  ==========================
type3_1         92     80  0.68  ==============
type4_1         78     42  1.19  ========================
type4_2         78     45  1.29  ==========================
type4_3         75     75  1.19  ========================

remote         546    534  0.97  ===================
local          378     69  1.05  =====================
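
A hedged sketch of the comparison itself, given one "machine task seconds" record per task attempt: score each machine by how much faster than the cross-machine average it ran each shared task:

    awk '{ dur[$1,$2] = $3; sum[$2] += $3; cnt[$2]++ }
         END {
           for (k in dur) {
             split(k, a, SUBSEP)
             rel[a[1]] += (sum[a[2]] / cnt[a[2]]) / dur[k]; n[a[1]]++
           }
           for (m in rel) printf "%-12s %.2f\n", m, rel[m] / n[m]
         }' task_times.txt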
Features you Should Use


•  Fair Scheduler
•  refreshNodes, refreshQueues
•  Hadoop metrics
•  Namenode audit logging (disabled by default in 0.20)
•  Exclude files to decommission slave nodes
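
The last two are plain CLI operations; a sketch (mradmin availability varies across 0.20 builds):

    # list hosts in the dfs.hosts.exclude file, then tell the namenode:
    hadoop dfsadmin -refreshNodes

    # re-read queue configuration without bouncing the jobtracker:
    hadoop mradmin -refreshQueues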




Staffing


•  We're living proof that you can hire some engineers with good
   fundamentals but no specialized experience and throw them in
   the deep end (it's the TA way)
•  Skills to hire for:
   –  Operations and Linux experience
   –  General service troubleshooting
   –  Scripting
   –  Java
   –  SQL (even if not using Hive)

•  Managing clusters which are growing 2x - 4x per year takes 1-2
   people working full time just to run in place




Open Questions


•  Resuming of jobs on jobtracker restart
•  Reloading of configurations without a restart
•  Robust response to cluster OOM conditions
•  Disabling job submission while allowing existing jobs to finish


•  Please tell us if you have the answers!




Questions?




Appendix


This is for you to read later, after downloading the presentation.
Downloads




https://github.com/TAwarehouse/




DRBD Configuration
global {
  usage-count no;
  minor-count 1;
}
common {
  protocol C;
  syncer { rate 90M; }
}
resource internal {
  startup {
    wfc-timeout 600;
    degr-wfc-timeout 60;
  }
  disk {
    on-io-error detach;
  }
  net {
    # timeout        60;
    # connect-int    10;
    # ping-int       10;
    # max-buffers    2048;
    # max-epoch-size 2048;
  }
  on master01.tripadvisor.com {
    device /dev/drbd0;
    disk   /dev/sda3;
    address 10.0.0.1:7789;
    flexible-meta-disk internal;
  }
  on master02.tripadvisor.com {
    device /dev/drbd0;
    disk   /dev/sda3;
    address 10.0.0.2:7789;
    flexible-meta-disk internal;
  }
}
Corosync Configuration
compatibility: whitetank
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        mcastaddr: 239.0.0.11
        mcastport: 5415
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
amf {
    mode: disabled
}
aisexec {
    user: root
    group: root
}
service {
    name: pacemaker
    ver: 0
}
Pacemaker Configuration

node master01.tripadvisor.com attributes standby="off"
node master02.tripadvisor.com attributes standby="off"
property $id="cib-bootstrap-options" stonith-enabled="false" no-quorum-policy="ignore" 
              expected-quorum-votes="2" dc-version="1.0.12-unknown" cluster-infrastructure="openais" 
              last-lrm-refresh="1337718104"
rsc_defaults $id="rsc-options" resource-stickiness="100"
primitive DataStore ocf:linbit:drbd params drbd_resource="internal" 
              op start interval="0" timeout="240s" op stop interval="0" timeout="100s"
primitive fs_DataStore ocf:heartbeat:Filesystem 
              params device="/dev/drbd0" directory="/data/internal" fstype="ext3" 
              op monitor interval="60s" timeout="40s" op start interval="0" timeout="60s" 
              op stop interval="0" timeout="60s"
ms Cluster DataStore 
              meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation fs-with-drbd inf: fs_DataStore Cluster:Master
order drbd-fs inf: Cluster:promote fs_DataStore:start
primitive MasterIP ocf:heartbeat:IPaddr2 
              params ip="192.168.236.10" nic="bond0" op monitor interval="30s"
colocation ip-with-drbd inf: MasterIP Cluster:Master
order fs-ip inf: fs_DataStore MasterIP
primitive NameNode lsb:hadoop-0.20-namenode op monitor interval="30s" meta target-role="Started"
colocation namenode-with-fs inf: NameNode fs_DataStore
order ip-namenode inf: MasterIP NameNode
primitive JobTracker lsb:hadoop-0.20-jobtracker op monitor interval="30s" meta target-role="Started"
colocation jobtracker-with-fs inf: JobTracker fs_DataStore
order namenode-jobtracker inf: NameNode JobTracker




Pacemaker Configuration (cont.)
primitive SecondaryNameNode lsb:hadoop-0.20-secondarynamenode 
             op monitor interval="30s" meta target-role="Started"
colocation secondarynamenode-not-with-ip -inf: SecondaryNameNode MasterIP
order jobtracker-secnamenode inf: JobTracker SecondaryNameNode
primitive Mysql ocf:heartbeat:mysql 
             params datadir="/data/internal/mysql" socket="/data/internal/mysql/mysql.sock" 
             binary="/usr/bin/mysqld_safe" op monitor interval="30s" timeout="30s" op start 
             interval="0" timeout="120s" op stop interval="0" timeout="120s" 
             meta target-role="Started"
colocation mysql-with-fs inf: Mysql fs_DataStore
order ip-mysql inf: MasterIP Mysql
primitive HiveThrift lsb:hive-thrift 
             op monitor interval="30s" meta target-role="Started"
colocation hivethrift-with-ip inf: HiveThrift MasterIP
order jobtracker-hivethrift inf: JobTracker HiveThrift
order mysql-hivethrift inf: Mysql HiveThrift
primitive Oozie lsb:oozie 
             op monitor interval="30s" meta target-role="Started"
colocation oozie-with-ip inf: Oozie MasterIP
order jobtracker-oozie inf: JobTracker Oozie
primitive PingNodes ocf:pacemaker:ping 
             params host_list="192.168.236.1 192.168.236.2 192.168.236.5" multiplier="100" 
             op start interval="0" timeout="60s" op monitor interval="30s" timeout="60s"
clone PingClone PingNodes meta interleave="true"
location ping-with-ip MasterIP 
rule $id="ping-with-ip-rule" pingd: defined pingd
location prefer-master01.tripadvisor.com MasterIP 
             rule $id="prefer-master01.tripadvisor.com-rule" 50: #uname eq master01.tripadvisor.com
order ip-ping inf: MasterIP PingClone


Nagios Checks

check_apt              check_breeze      check_by_ssh            check_checkup_metric
check_clamd            check_cluster     check_cronjobs          check_crontabs
check_dhcp             check_dig         check_disk              check_disk_smb
check_disk_writable    check_dns         check_dummy             check_fbrs
check_file_age         check_files_age   check_filesystems       check_flexlm
check_ftp              check_gc          check_hadoop_master_logfiles
check_hdp_connectivity check_hdp_data_nodes                      check_hdp_hdfs
check_hdp_max_mr_settings                check_hive              check_hive_nsc
check_hive_server      check_http        check_icmp              check_ide_smart
check_ifoperstatus     check_ifstatus    check_imap              check_ircd
check_jabber           check_load        check_local_mail        check_log
check_log_updated      check_mailq       check_memcached         check_minerva
check_mrtg             check_mrtgtraf    check_mysql_repl        check_nagios
check_nntp             check_nntps       check_nrpe              check_nt
check_ntp              check_ntp_peer    check_ntp_time          check_nwstat
check_oracle           check_overcr      check_ping              check_pop
check_proc_filehandles check_procs       check_real              check_rpc
check_sensors          check_simap       check_smtp              check_spop
check_ssh              check_ssmtp       check_swap              check_swapping
check_sys_filehandles  check_ta_services check_tcp               check_time
check_udp              check_ups         check_users             check_wave
check_writeable_tmp
Example Oozie Query
SELECT
  a.todaystatus as today,
  a.yesterdaystatus as yday,
  j.status as parent,
  j.app_name,
  a.last_modified_time,
  a.nominal_time,
  a.id
FROM (
  SELECT
  t.status as todaystatus,
  y.status as yesterdaystatus,
  COALESCE(t.id, y.id) AS id,
  y.job_id,
  COALESCE(t.nominal_time, y.nominal_time) AS nominal_time,
  COALESCE(t.last_modified_time, y.last_modified_time) AS last_modified_time
  FROM (SELECT *
      FROM COORD_ACTIONS
      WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 0) t
  RIGHT OUTER JOIN (SELECT *
      FROM COORD_ACTIONS
      WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 1) y
  ON (t.job_id=y.job_id)
  WHERE COALESCE(t.status, '') NOT IN ('SUCCEEDED', 'WAITING')
      -- If they're WAITING today, then make sure yesterday ran OK.
        OR (t.status = 'WAITING' and y.status <> 'SUCCEEDED')
  UNION DISTINCT
  -- Dummy record to force the table to exist even when empty, since MySql
  -- otherwise emits nothing if data is not returned.
  SELECT 'EMPTY', 'RECORD', '', '', '', 'THIS IS A DUMMY RECORD'
) a
LEFT OUTER JOIN COORD_JOBS j
ON a.job_id=j.id
WHERE j.status = 'RUNNING' OR j.status IS NULL
;



Sessions will resume at 4:30pm




                             Page 35

More Related Content

What's hot

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Dataconomy Media
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...Yahoo Developer Network
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGuang Xu
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine OverviewKunal Gupta
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...DataWorks Summit
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)Ontico
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopDataWorks Summit
 
Managing PostgreSQL with Ansible
 Managing PostgreSQL with Ansible Managing PostgreSQL with Ansible
Managing PostgreSQL with AnsibleEDB
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 

What's hot (20)

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
Lessons Learned from Building an Enterprise Big Data Platform from the Ground...
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
 
Managing PostgreSQL with Ansible
 Managing PostgreSQL with Ansible Managing PostgreSQL with Ansible
Managing PostgreSQL with Ansible
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
What database
What databaseWhat database
What database
 

Similar to Infrastructure Around Hadoop

Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceDerek Chen
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersAmal G Jose
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsDataWorks Summit
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxDanishMahmood23
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudRogue Wave Software
 
Facing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopFacing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopfann wu
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentationAmrut Patil
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdfvishal choudhary
 

Similar to Infrastructure Around Hadoop (20)

Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop Deployments
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloud
 
Facing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopFacing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoop
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Infrastructure Around Hadoop

  • 1. Hadoop Summit 2012 Infrastructure Around Hadoop Backups, failover, configuration and monitoring Terran Melconian, Edmund MacKenty tripadvisor.com/careers 1
  • 2. What TripAdvisor Does •  World's largest travel site and community •  Trip planning user reviews •  >50 million unique monthly visitors, 30 countries* •  >60 million reviews and opinions* •  Run like a startup: 30+ teams all doing their own thing •  Heavy use of open-source projects •  Speed Wins! * source: comScore Media Metrix for TripAdvisor Sites, Worldwide, January 2012 2
  • 3. What the Warehouse Team Does •  Retain and aggregate historic site activity data •  Make data available throughout the company •  Hits, reviews, forums, contacts, locations, businesses, etc. •  ~50 nodes in 4 clusters: Cloudera CDH3u3 (Hadoop 0.20.2) •  Used by ~12 analytics teams, heavy use of Hive •  Some jobs must run every day (eg. ETL, aggregations) •  Systems are very open, we trust our users (usually) •  3 people, fairly new to Hadoop/Hive 3
  • 4. Why Hadoop at TripAdvisor •  Hadoop is how we scale analysis past the limits of one machine –  Some daily jobs taking nearly 24 hours, and we're still growing quickly •  Our old RDBMS data warehouse could barely keep up with data ingestion, even running on expensive hardware with a SAN –  We obtained 20x improvement in wall clock time •  Reprocess unaggregated historical data as definitions change –  Before, impossible except for a small sample –  Now, reprocess years of data at the finest level in a few days •  Efficient platform for many kinds of statistics –  Representative example: five-hour RDBMS job went to 25 minutes 4
  • 5. HA NameNode: DRBD, Corosync and Pacemaker •  Namenode and JobTracker run on “master” node •  Datanode and TaskTracker run on “slave” nodes •  Automatic fail-over of all master-node services to a passive node •  Provision two identical systems •  Set up virtual Master IP address to be failed over •  Secondary namenode on passive node, if available •  Monitor and automatically restart failed services 5
  • 6. DRBD/Corosync Configuration •  DRBD: replicates namenode image, Hive metadata, Oozie job data –  Create two identical storage devices (we used RAID 1) –  Connect the master nodes with a cross-over ethernet cable –  Configure DRBD to use the cross-over and storage devices –  Use drbdadm to create the replicated device –  Create a filesystem on /dev/drbd0 with mkfs –  Cat /proc/drbd to see state of the device –  Once created, use /etc/init.d/drbd to manage it •  Corosync: messaging between active-passive masters –  Configure Corosync to also use the cross-over ethernet cable –  Corosync will start Pacemaker for you –  Use /etc/init.d/corosync to manage it, and Pacemaker 6
•  Define resource ordering dependencies
•  Restarts failed services, e.g. Hive-Thrift
•  Use the crm tool to manage nodes and resources
•  Test with a manual fail-over (see the sketch below):
  –  Migrate the namenode resource to the passive master
  –  Use crm status to watch all resources move over

                                                                       7
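For example, a manual fail-over test from the crm shell might look like this
(resource names follow the appendix configuration):

    # Push the master IP (and everything ordered after it) to the passive
    # master, watch it move, then clear the migration constraint.
    crm resource migrate MasterIP master02.tripadvisor.com
    crm status                        # watch resources start on master02
    crm resource unmigrate MasterIP   # remove the constraint afterwards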
Monitoring: Ganglia and Nagios, Job Tracking

•  Visibility into cluster operations
•  Monitor hardware states and resource usage
•  Notify on specific boundary or failure conditions
•  Track MapReduce jobs and Hive tables
•  Identify immediate problems
•  Show trends over time to predict future needs

                                                                       8
Ganglia

•  Standard monitoring of CPU, memory, disk usage, etc.
•  Perl script parses Hadoop metrics, sends them using gmetric(1)
•  ~50 Hadoop metrics, ~30 system metrics
•  Graphs for entire cluster and individual nodes
•  Example: two jobs with different resource profiles

                                                                       9
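For example, a parsed metric might be pushed to Ganglia like this (a sketch;
the metric name and value are illustrative, not from the actual script):

    # Push one parsed Hadoop metric into Ganglia via gmetric(1).
    gmetric --name hdfs_files_total --value 1843020 --type uint32 \
            --units files --tmax 300 --dmax 600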
Nagios

•  Our primary notification system
•  About 80 checks, ~25 are our own. Examples:
  –  check_hdp_connectivity: can master talk to all its slaves?
  –  check_hdp_data_nodes: are all configured slave datanodes running?
  –  check_hdp_max_mr_settings: does jobtracker have resources we expect?
  –  check_hadoop_master_logfiles: are logs being written to?
  –  check_hive_server: is it up?
•  Some warnings:
  –  Do not let Nagios run hadoop fsck (check_hdp_hdfs)
  –  LDAP failure causes email cascade
  –  High loads can cause timeouts, which cause notifications

                                                                      10
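A minimal sketch of what a check like check_hdp_data_nodes might do; the
paths and output parsing are assumptions, not the actual script:

    #!/bin/sh
    # Compare live datanodes reported by HDFS against the configured
    # slaves file (path assumed).
    SLAVES=/etc/hadoop/conf/slaves
    expected=$(grep -cv '^#' $SLAVES)
    live=$(hadoop dfsadmin -report 2>/dev/null |
           awk '/Datanodes available:/ {print $3}')
    if [ "$live" -lt "$expected" ]; then
        echo "CRITICAL: $live of $expected datanodes live"
        exit 2
    fi
    echo "OK: $live of $expected datanodes live"
    exit 0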
Job Tracking

•  Perl script invoked frequently by cron
•  Parses jobtracker log entries since last run
•  Records data on each job in a PostgreSQL DB:
  –  Job ID, user, submitting IP and time, status
  –  Cluster ID, queue, Hive query
  –  Start/stop times for job and first mapper and reducer
  –  Mapper and reducer counts, max memory, slots, splits
•  CGI script to do queries:
  –  Running jobs, failed jobs, MapReduce capacity usage
  –  Job resource usage by status, queue, user
•  Helps post-mortem analysis of problems
•  Used to predict trends and future resource needs

                                                                      11
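With the data in a relational DB, questions like "who used the most mapper
slots yesterday?" become one-liners. A hypothetical example; the database,
table, and column names are invented, not the actual schema:

    # Top mapper consumers over the last day (schema is hypothetical).
    psql -d jobtrack -c "
      SELECT username, SUM(mappers) AS mapper_tasks
      FROM jobs
      WHERE submit_time >= now() - interval '1 day'
      GROUP BY username ORDER BY mapper_tasks DESC LIMIT 10;"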
Other cron scripts we run

•  check_load:
  –  Dumps a Java stack trace when load is too high
  –  Emails a list of top processes so we can see what was wrong
•  Master nodes:
  –  Compress Hadoop/Hive logs more than 30 days old
  –  Remove logs more than 120 days old (we keep 10+ GBs)
  –  check_hdfs: runs hadoop fsck to see if HDFS is “healthy”
  –  Back up the current namenode fsimage
•  Slave nodes:
  –  check_disks: removes read-only disks from the datanode configuration
  –  check_load: kills some tasks and notifies us when load is too high
•  Refresh production data to the development cluster

                                                                      12
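The log-aging policy can be sketched as two find(1) invocations; the log
directory here is an assumption:

    # Compress logs older than 30 days, delete logs older than 120 days.
    LOGDIR=/var/log/hadoop
    find $LOGDIR -type f -name '*.log*' ! -name '*.gz' -mtime +30 \
         -exec gzip {} \;
    find $LOGDIR -type f -mtime +120 -delete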
Configuration Management

•  Seems like extra work at first, but essential as you grow
•  Not Hadoop-specific: manage OS packages, Nagios and Ganglia scripts,
   cron jobs, svn, SSH keys, NFS mounts, jars
  –  Consistent UIDs/GIDs are critical with DRBD
  –  We replace some jars from the RPMs with local fixes
  –  Templatized configuration files are very convenient. ERB is good.
  –  SSH keys made consistent across nodes; masters share a host key
•  Use SVN as the file delivery mechanism: checkout on each box
•  We chose Puppet as a tool
  –  Gets the job done
  –  Lacks flexibility in inheritance to specialize defaults per machine
  –  Some aspects of operation are hard to debug

                                                                      13
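As a hypothetical illustration of the ERB point, a templated Hadoop property
might look like this (the variable name is invented, not from our manifests):

    <%# Hypothetical ERB fragment for mapred-site.xml: per-machine slot
        counts come from Puppet variables, not hand-edited XML. %>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value><%= map_slots %></value>
    </property>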
Backup: HDFS and Hive DDL

•  Objectives:
  –  Provide safety against total HDFS failure due to software bugs or
     machine room environmental incident
  –  Protect against user error in dropping or overwriting tables
  –  Restore data to another cluster
•  Assumptions:
  –  Repeating one day of processing is acceptable when restoring
•  Components:
  –  Incremental HDFS backup
  –  Hive DDL backup
•  Runs on separate backup server with storage (NexSan)
  –  Pull process driven by processes on the backup server

                                                                      14
Backup HDFS

•  Open-source Java app
•  Requires customization to your environment
•  Traverses the HDFS directory tree
•  Copies out files modified after a given date
•  Doesn't copy very new directories
  –  Needed a way to avoid copying files being written at time of backup
  –  HDFS has no snapshots
•  Ignores specified directories
•  Generates restore shell scripts to set owners, perms
•  Verification tool checks file sizes and checksums

                                                                      15
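Not the Java app itself, but the core idea in shell form: copy out files
whose modification date is on or after a cutoff. Paths and the simple
string-compare on the date column are assumptions:

    # Incremental pull of files modified on/after a cutoff date.
    # 0.20 'hadoop fs -lsr' columns: perms repl owner group size date time path
    CUTOFF="2012-06-01"
    hadoop fs -lsr /warehouse 2>/dev/null |
    awk -v c="$CUTOFF" '$1 !~ /^d/ && $6 >= c {print $NF}' |
    while read f; do
        mkdir -p /backup$(dirname $f)
        hadoop fs -copyToLocal "$f" "/backup$f"
    done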
Backup Hive DDL

•  Open-source Java app uses the Thrift server
•  Iterates over all tables and views
•  Constructs DDL statements from Hive metadata
•  Ignores specific tables
•  Generates a Hive command script
  –  Recreates all tables, adds all partitions back one at a time
•  Also used to move metadata to MySQL
•  To restore a full cluster:
  –  Copy files back with copyFromLocal
  –  Run the perm/owner scripts
  –  Reapply the Hive DDL

                                                                      16
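A sketch of that restore sequence under assumed paths; the generated script
names are illustrative:

    # 1. Copy file data back, 2. reapply owners/perms, 3. reapply DDL.
    hadoop fs -copyFromLocal /backup/warehouse/* /warehouse/
    sh /backup/restore_perms.sh      # generated owner/perm script
    hive -f /backup/hive_ddl.q       # regenerated DDL: tables + partitions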
Other Things To Potentially Back Up

•  Namenode metadata
  –  We do this once every 4 hours
  –  This is in addition to mirroring on four physical drives
•  Our job-tracking database
•  No general backups of root or local FS on machines
  –  Recreate machines with Puppet or another configuration management
     tool instead
•  Oozie job database
  –  We do NOT back this up
  –  Tightly coupled with HDFS state; restore would be problematic
  –  The recovery procedure is to rebuild and reinstall coordinators

                                                                      17
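On 0.20, one way to grab the current fsimage is the same HTTP servlet the
secondary namenode uses. A hypothetical cron entry; host, port, and paths
are assumptions:

    # Fetch the namenode fsimage every 4 hours (crontab syntax).
    0 */4 * * * curl -sf "http://master:50070/getimage?getimage=1" \
        -o /backup/namenode/fsimage.$(date +\%Y\%m\%d\%H)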
Oozie: Why

•  Drawback: several times slower to write than cron jobs, while also
   less expressive
•  Advantage: ability to cleanly depend on input data
  –  With cron, you would have to poll for stamps
•  Advantage: clean and consistent metadata
  –  See what ran, what failed, what is still waiting and why
  –  Easily retry things which failed – good luck doing that with cron
  –  Output datasets are deleted on rerun, so ordering is preserved

                                                                      18
Oozie: How

•  Establish consistent local practices for completion stamps, job naming,
   owners, and source code locations
•  Enforce that all jobs must be idempotent
•  Create scripts/makefiles/build.xml to rebuild and reinstall jobs after
   changes in their dependencies
•  Bypass the Oozie GUI
  –  The CLI is a more capable tool
  –  Go straight to the Oozie backing DB and issue SQL queries
•  Rerun coordinator actions, not workflows (see the sketch below)
•  Don't ever use Derby – we experienced massive corruption

                                                                      19
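For example, rerunning a range of coordinator actions from the CLI; the job
ID and action numbers are illustrative, and exact flags vary by Oozie
version:

    # Rerun coordinator actions 12-14 (not the underlying workflows).
    oozie job -oozie http://master:11000/oozie \
        -rerun 0000042-120613000000000-oozie-C \
        -action 12-14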
Experiences and Expectations

•  Hadoop is not mature from a reliability and stability point of view
  –  It will probably get there in a few more years
•  Cluster outages are common events, not outliers
  –  Must bounce key services to pick up basic configuration changes
     such as adding a new queue
  –  As you scale up, you will encounter new classes of problems
  –  Example: kernel deadlocks during heavy disk IO
•  You must design for failure and have a robust mechanism to cleanly and
   easily resume execution once the cluster is back up
•  Important jobs must be isolated from developers
  –  Each cluster should contain ONE tier of jobs, grouped by SLA,
     release process, and time-to-recovery requirements

                                                                      20
Attributes of Robust Jobs

•  Idempotent and resumable regardless of when/how terminated
•  Has an external framework for recording success/failure, timing, and
   amount of data processed
•  Knows what input data it needs and waits for it to be ready
•  Has a mechanism for reprocessing if the input data is restated
•  Checked into source control
•  Testable in an expendable cluster before release

                                                                      21
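A wrapper sketch showing two of these attributes, waiting on an input stamp
and writing a success stamp, under assumed stamp conventions (the paths, jar,
and class are invented):

    IN=/data/hits/2012-06-13
    OUT=/data/agg/2012-06-13
    # Wait for input to be ready (Oozie does this declaratively).
    until hadoop fs -test -e $IN/_DONE; do sleep 300; done
    # Idempotence: clear any partial output from an earlier failed run.
    hadoop fs -rmr $OUT >/dev/null 2>&1
    hadoop jar aggregate.jar com.example.DailyAgg $IN $OUT \
        && hadoop fs -touchz $OUT/_DONE   # stamp success for downstream jobs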
Benchmarks

•  How to evaluate hardware/network changes or map/reduce slot tuning?
  –  Key insight: for the same job, the same task always does the same work
  –  Rerun the job and compare execution of the same task across machines

  Machine      Tasks Comps Relative Perf (larger is better)
  ~~~~~~~~~~~~ ~~~~~ ~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  type1_1         82    37 0.99 ====================
  type1_2         91    76 0.98 ====================
  type1_3         92    35 1.01 ====================
  type1_4         88    85 1.06 =====================
  type2_1         71    26 1.30 ==========================
  type3_1         92    80 0.68 ==============
  type4_1         78    42 1.19 ========================
  type4_2         78    45 1.29 ==========================
  type4_3         75    75 1.19 ========================
  remote         546   534 0.97 ===================
  local          378    69 1.05 =====================

                                                                      22
Features You Should Use

•  Fair Scheduler
•  refreshNodes, refreshQueues
•  Hadoop metrics
•  Namenode audit logging (disabled by default in 0.20)
•  Exclude files to decommission slave nodes

                                                                      23
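For example, decommissioning a slave with an exclude file; the file path is
an assumption and must match dfs.hosts.exclude in your hdfs-site.xml:

    # Add the node to the exclude list and tell the namenode to re-read it.
    echo slave17.tripadvisor.com >> /etc/hadoop/conf/excludes
    hadoop dfsadmin -refreshNodes    # namenode begins draining its blocks
    hadoop dfsadmin -report          # watch for "Decommission Status"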
Staffing

•  We're living proof that you can hire some engineers with good
   fundamentals but no specialized experience and throw them in the
   deep end (it's the TA way)
•  Skills to hire for:
  –  Operations and Linux experience
  –  General service troubleshooting
  –  Scripting
  –  Java
  –  SQL (even if not using Hive)
•  Managing clusters which are growing 2x - 4x per year takes 1-2 people
   working full time just to run in place

                                                                      24
Open Questions

•  Resuming jobs on jobtracker restart
•  Reloading configurations without a restart
•  Robust response to cluster OOM conditions
•  Disabling job submission while allowing existing jobs to finish
•  Please tell us if you have the answers!

                                                                      25
Appendix

This is for you to read later after downloading the presentation

                                                                      27
DRBD Configuration

global {
    usage-count no;
    minor-count 1;
}

common {
    protocol C;
    syncer { rate 90M; }
}

resource internal {
    startup {
        wfc-timeout 600;
        degr-wfc-timeout 60;
    }
    disk {
        on-io-error detach;
    }
    net {
        # timeout 60;
        # connect-int 10;
        # ping-int 10;
        # max-buffers 2048;
        # max-epoch-size 2048;
    }
    on master01.tripadvisor.com {
        device /dev/drbd0;
        disk /dev/sda3;
        address 10.0.0.1:7789;
        flexible-meta-disk internal;
    }
    on master02.tripadvisor.com {
        device /dev/drbd0;
        disk /dev/sda3;
        address 10.0.0.2:7789;
        flexible-meta-disk internal;
    }
}

                                                                      29
Corosync Configuration

compatibility: whitetank

totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
        mcastaddr: 239.0.0.11
        mcastport: 5415
    }
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

service {
    name: pacemaker
    ver: 0
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

                                                                      30
Pacemaker Configuration

node master01.tripadvisor.com attributes standby="off"
node master02.tripadvisor.com attributes standby="off"
property $id="cib-bootstrap-options" stonith-enabled="false" \
    no-quorum-policy="ignore" expected-quorum-votes="2" \
    dc-version="1.0.12-unknown" cluster-infrastructure="openais" \
    last-lrm-refresh="1337718104"
rsc_defaults $id="rsc-options" resource-stickiness="100"

primitive DataStore ocf:linbit:drbd params drbd_resource="internal" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s"
primitive fs_DataStore ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/data/internal" fstype="ext3" \
    op monitor interval="60s" timeout="40s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"
ms Cluster DataStore meta master-max="1" master-node-max="1" \
    clone-max="2" clone-node-max="1" notify="true"
colocation fs-with-drbd inf: fs_DataStore Cluster:Master
order drdb-fs inf: Cluster:promote fs_DataStore:start

primitive MasterIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.236.10" nic="bond0" op monitor interval="30s"
colocation ip-with-drbd inf: MasterIP Cluster:Master
order fs-ip inf: fs_DataStore MasterIP

primitive NameNode lsb:hadoop-0.20-namenode op monitor interval="30s" \
    meta target-role="Started"
colocation namenode-with-fs inf: NameNode fs_DataStore
order ip-namenode inf: MasterIP NameNode

primitive JobTracker lsb:hadoop-0.20-jobtracker op monitor interval="30s" \
    meta target-role="Started"
colocation jobtracker-with-fs inf: JobTracker fs_DataStore
order namenode-jobtracker inf: NameNode JobTracker

                                                                      31
Pacemaker Configuration (cont.)

primitive SecondaryNameNode lsb:hadoop-0.20-secondarynamenode \
    op monitor interval="30s" meta target-role="Started"
colocation secondarynamenode-not-with-ip -inf: SecondaryNameNode MasterIP
order jobtracker-secnamenode inf: JobTracker SecondaryNameNode

primitive Mysql ocf:heartbeat:mysql \
    params datadir="/data/internal/mysql" \
    socket="/data/internal/mysql/mysql.sock" binary="/usr/bin/mysqld_safe" \
    op monitor interval="30s" timeout="30s" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" meta target-role="Started"
colocation mysql-with-fs inf: Mysql fs_DataStore
order ip-mysql inf: MasterIP Mysql

primitive HiveThrift lsb:hive-thrift op monitor interval="30s" \
    meta target-role="Started"
colocation hivethrift-with-ip inf: HiveThrift MasterIP
order jobtracker-hivethrift inf: JobTracker HiveThrift
order mysql-hivethrift inf: Mysql HiveThrift

primitive Oozie lsb:oozie op monitor interval="30s" meta target-role="Started"
colocation oozie-with-fs inf: Oozie MasterIP
order jobtracker-oozie inf: JobTracker Oozie

primitive PingNodes ocf:pacemaker:ping \
    params host_list="192.168.236.1 192.168.236.2 192.168.236.5" \
    multiplier="100" \
    op start interval="0" timeout="60s" \
    op monitor interval="30s" timeout="60s"
clone PingClone PingNodes meta interleave="true"
location ping-with-ip MasterIP rule $id="ping-with-ip-rule" pingd: defined pingd
location prefer-master01.tripadvisor.com MasterIP \
    rule $id="prefer-master01.tripadvisor.com-rule" 50: \
    #uname eq master01.tripadvisor.com
order ip-ping inf: MasterIP PingClone

                                                                      32
Nagios Checks

check_apt check_breeze check_by_ssh check_checkup_metric check_clamd
check_cluster check_cronjobs check_crontabs check_dhcp check_dig
check_disk check_disk_smb check_disk_writable check_dns check_dummy
check_fbrs check_file_age check_files_age check_filesystems check_flexlm
check_ftp check_gc check_hadoop_master_logfiles check_hdp_connectivity
check_hdp_data_nodes check_hdp_hdfs check_hdp_max_mr_settings check_hive
check_hive_nsc check_hive_server check_http check_icmp check_ide_smart
check_ifoperstatus check_ifstatus check_imap check_ircd check_jabber
check_load check_local_mail check_log check_log_updated check_mailq
check_memcached check_minerva check_mrtg check_mrtgtraf check_mysql_repl
check_nagios check_nntp check_nntps check_nrpe check_nt check_ntp
check_ntp_peer check_ntp_time check_nwstat check_oracle check_overcr
check_ping check_pop check_proc_filehandles check_procs check_real
check_rpc check_sensors check_simap check_smtp check_spop check_ssh
check_ssmtp check_swap check_swapping check_sys_filehandles
check_ta_services check_tcp check_time check_udp check_ups check_users
check_wave check_writeable_tmp

                                                                      33
Example Oozie Query

SELECT a.todaystatus AS today, a.yesterdaystatus AS yday,
       j.status AS parent, j.app_name,
       a.last_modified_time, a.nominal_time, a.id
FROM (
    SELECT t.status AS todaystatus, y.status AS yesterdaystatus,
           COALESCE(t.id, y.id) AS id, y.job_id,
           COALESCE(t.nominal_time, y.nominal_time) AS nominal_time,
           COALESCE(t.last_modified_time, y.last_modified_time)
               AS last_modified_time
    FROM (SELECT * FROM COORD_ACTIONS
          WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 0) t
    RIGHT OUTER JOIN
         (SELECT * FROM COORD_ACTIONS
          WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 1) y
      ON (t.job_id = y.job_id)
    WHERE COALESCE(t.status, '') NOT IN ('SUCCEEDED', 'WAITING')
       -- If they're WAITING today, then make sure yesterday ran OK.
       OR (t.status = 'WAITING' AND y.status <> 'SUCCEEDED')
    UNION DISTINCT
    -- Dummy record to force the table to exist even when empty, since MySql
    -- otherwise emits nothing if data is not returned.
    SELECT 'EMPTY', 'RECORD', '', '', '', 'THIS IS A DUMMY RECORD'
) a
LEFT OUTER JOIN COORD_JOBS j ON a.job_id = j.id
WHERE j.status = 'RUNNING' OR j.status IS NULL;

                                                                      34
Sessions will resume at 4:30pm

                                                                      35