DEBUGGING HIVE WITH
HADOOP IN THE CLOUD
Soam Acharya, David Chaiken, Denis Sheahan, Charles Wimmer
Altiscale, Inc.
#LABDUG @ 20150115T19:30-0800
WHO ARE WE?
•  Altiscale: Infrastructure Nerds!
•  Hadoop As A Service
•  Rack and build our own Hadoop clusters
•  Provide a suite of Hadoop tools
o  Hive, Pig, Oozie
o  Others as needed: R, Python, Spark, Mahout, Impala, etc.
•  Monthly billing plan: compute (YARN), storage (HDFS)
•  https://www.altiscale.com
•  @Altiscale #HadoopSherpa
TALK ROADMAP
•  Our Platform and Perspective
•  Hadoop 2 Primer
•  Hadoop Debugging Tools
•  Accessing Logs in Hadoop 2
•  Hive + Hadoop Architecture
•  Hive Logs
•  Hive Issues + Case Studies
o  Hive + Interactive (DRAM Centric) Processing Engines
•  Conclusion: Making Hive Easier to Use
OUR DYNAMIC PLATFORM
•  Hadoop 2.0.5 => Hadoop 2.2.0 => Hadoop 2.4.1 => …
•  Hive 0.10 => Hive 0.12 => Stinger (Hive 0.13 + Tez) => …
•  Hive, Pig and Oozie most commonly used tools
•  Working with customers on:
Spark, H2O, Trifacta, Impala, Flume, Camus/Kafka, …
ALTISCALE PERSPECTIVE
•  What we do as a service provider…
o  Performance + Reliability: Jobs finish faster, fewer failures
o  Instant Access: Always-on access to HDFS and YARN
o  Hadoop Helpdesk: Tools + experts ensure customer success
o  Secure: Networking, SOC 2 Audit, Kerberos
o  Results: Faster Time-to-Value (TTV), Lower TCO
•  Operational approach in this presentation…
o  How to use Hadoop 2 cluster tools and logs
to debug and to tune Hive
o  This talk will not focus on query optimization
[Diagram: Hadoop 2 cluster with a NameNode, a Secondary NameNode, and a Resource Manager, plus Hadoop slave nodes that each run a Node Manager and a DataNode]
QUICK PRIMER – HADOOP 2
QUICK PRIMER – HADOOP 2 YARN
•  Resource Manager (per cluster)
o  Manages job scheduling and execution
o  Global resource allocation
•  Application Master (per job)
o  Manages task scheduling and execution
o  Local resource allocation
•  Node Manager (per-machine agent)
o  Manages the lifecycle of task containers
o  Reports to RM on health and resource usage
HADOOP 1 VS HADOOP 2
•  No more JobTrackers, TaskTrackers
•  YARN ~ Operating System for Clusters
o  MapReduce is implemented as a YARN application
o  Bring on the applications! (Spark is just the start…)
•  Should be Transparent to Hive users
HADOOP 2 DEBUGGING TOOLS
•  Monitoring
o  System state of cluster:
§  CPU, Memory, Network, Disk
§  Nagios, Ganglia, Sensu!
§  Collectd, statsd, Graphite
o  Hadoop level
§  HDFS usage
§  Resource usage:
•  Container memory allocated vs used
•  # of jobs running at the same time
•  Long running tasks
HADOOP 2 DEBUGGING TOOLS
•  Hadoop logs
o  Daemon logs: Resource Manager, NameNode, DataNode
o  Application logs: Application Master, MapReduce tasks
o  Job history file: resources allocated during job lifetime
o  Application configuration files: store all Hadoop application
parameters
•  Source code instrumentation
ACCESSING LOGS IN HADOOP 2
•  To view the logs for a job, click on the link under the ID
column in Resource Manager UI.
ACCESSING LOGS IN HADOOP 2
•  To view application top level logs, click on logs.
•  To view individual logs for the mappers and reducers,
click on History.
ACCESSING LOGS IN HADOOP 2
•  Log output for the entire application.
ACCESSING LOGS IN HADOOP 2
•  Click on the Map link for mapper logs and the Reduce
link for reducer logs.
ACCESSING LOGS IN HADOOP 2
•  Clicking on a single link under Name provides an
overview for that particular map job.
ACCESSING LOGS IN HADOOP 2
•  Finally, clicking on the logs link will take you to the log
output for that map job.
ACCESSING LOGS IN HADOOP 2
•  Fun, fun, donuts, and more fun…
HIVE + HADOOP 2 ARCHITECTURE
•  Hive 0.10+
[Diagram: Hive CLI and HiveServer2 (serving JDBC/ODBC clients such as Tableau and Kettle) connect to the Hive Metastore and submit work to the Hadoop 2 cluster]
HIVE LOGS
•  Query Log location
•  From /etc/hive/hive-site.xml:


<property>
  <name>hive.querylog.location</name>
  <value>/home/hive/log/${user.name}</value>
</property>

SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"
HIVE CLIENT LOGS
•  /etc/hive/hive-log4j.properties:
o  hive.log.dir=/var/log/hive/${user.name}
2014-05-29 19:51:09,830 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select count(*) from dogfood_job_data
2014-05-29 19:51:09,852 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>
2014-05-29 19:51:09,853 INFO ql.Driver (PerfLogger.java:PerfLogBegin(97)) - <PERFLOG method=semanticAnalyze>
2014-05-29 19:51:09,890 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8305)) - Starting Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8340)) - Completed phase 1 of Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1060)) - Get metadata for source tables
2014-05-29 19:51:09,906 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1167)) - Get metadata for subqueries
2014-05-29 19:51:09,909 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1187)) - Get metadata for destination tables
HIVE METASTORE LOGS
•  /etc/hive-metastore/hive-log4j.properties:
o  hive.log.dir=/service/log/hive-metastore/${user.name}
2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,261 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
HIVE ISSUES + CASE STUDIES
•  Hive Issues
o  Hive client out of memory
o  Hive map/reduce task out of memory
o  Hive metastore out of memory
o  Hive launches too many tasks
•  Case Studies:
o  Hive “stuck” job
o  Hive “missing directories”
o  Analyze Hive Query Execution
o  Hive + Interactive (DRAM Centric) Processing Engines
HIVE CLIENT OUT OF MEMORY
•  Memory intensive client side hive query (map-side join)
Number of reduce tasks not specified. Estimated from input data size: 999
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.lang.OutOfMemoryError: Java heap space
  at java.nio.CharBuffer.wrap(CharBuffer.java:350)
  at java.nio.CharBuffer.wrap(CharBuffer.java:373)
  at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
HIVE CLIENT OUT OF MEMORY
•  Use HADOOP_HEAPSIZE prior to launching Hive client
•  HADOOP_HEAPSIZE=<new heapsize> hive <fileName>
•  Watch out for HADOOP_CLIENT_OPTS issue in hive-env.sh!
•  Know the amount of memory available on the machine
running the client; do not exceed it or use a
disproportionate share of it.
$ free -m
             total       used       free     shared    buffers     cached
Mem:          1695       1388        306          0         60        424
-/+ buffers/cache:        903        791
Swap:          895        101        794
HIVE TASK OUT OF MEMORY
•  Query spawns MapReduce jobs that run out of memory
•  How to find this issue?
o  Hive diagnostic message
o  Hadoop MapReduce logs
HIVE TASK OUT OF MEMORY
•  Fix is to increase task RAM allocation…
set mapreduce.map.memory.mb=<new RAM allocation>;
set mapreduce.reduce.memory.mb=<new RAM allocation>;
•  Also watch out for…
set mapreduce.map.java.opts=-Xmx<heap size>m;
set mapreduce.reduce.java.opts=-Xmx<heap size>m;
•  Not a magic bullet – requires manual tuning
•  Increase in individual container memory size:
o  Decrease in overall containers that can be run
o  Decrease in overall parallelism
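Putting the settings above together, one illustrative combination (values are ours, not from the original deck) keeps each JVM heap roughly 20% below its container size so YARN does not kill the container for exceeding its memory limit:

```sql
-- Illustrative values only: 2 GB map containers, 4 GB reduce containers,
-- with the -Xmx heap set ~20% below each container limit.
set mapreduce.map.memory.mb=2048;
set mapreduce.map.java.opts=-Xmx1638m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;
```

If the heap is set equal to (or above) the container size, the Node Manager will kill tasks for exceeding their physical memory limit even though the JVM itself never ran out of heap.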
HIVE METASTORE OUT OF MEMORY
•  Out of memory issues not necessarily dumped to logs
•  Metastore can become unresponsive
•  Can’t submit queries
•  Restart with a higher heap size:
export HADOOP_HEAPSIZE in hcat_server.sh
•  After notifying hive users about downtime:
service hcat restart
HIVE LAUNCHES TOO MANY TASKS
•  Typically a function of the input data set
•  Lots of little files
HIVE LAUNCHES TOO MANY TASKS
•  Set mapred.max.split.size to appropriate fraction of data size
•  Also verify that
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
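A minimal sketch of both settings together (the split size below is an illustrative value, not a recommendation; choose a fraction of your input data size):

```sql
-- Combine many small files into fewer, larger input splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Illustrative: 256 MB maximum split size.
set mapred.max.split.size=268435456;
```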
CASE STUDY: HIVE STUCK JOB
From an Altiscale customer:
“This job [jobid] has been running now for
41 hours. Is it still progressing or has
something hung up the map/reduce so it’s
just spinning? Do you have any insight?”
HIVE STUCK JOB
1.  Received jobId,
application_1382973574141_4536, from client
2.  Logged into client cluster.
3.  Pulled up Resource Manager
4.  Entered part of jobId (4536) in the search box.
5.  Clicked on the link that says:
application_1382973574141_4536
6.  On resulting Application Overview page, clicked on link
next to “Tracking URL” that said Application Master
HIVE STUCK JOB
7.  On resulting MapReduce Application page, we clicked on the
Job Id (job_1382973574141_4536).
8.  The resulting MapReduce Job page displayed detailed status
of the mappers, including 4 failed mappers
9.  We then clicked on the 4 link on the Maps row in the Failed
column.
10. Title of the next page was “FAILED Map attempts in
job_1382973574141_4536.”
11.  Each failed mapper generated an error message.
12. Buried in the 16th line:
Caused by: java.io.FileNotFoundException: File does not exist:
hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq
HIVE STUCK JOB
•  Job was stuck for a day or so, retrying a mapper that
would never finish successfully.
•  During the job, our customers’ colleague realized input
file was corrupted and deleted it.
•  Colleague did not anticipate the effect of removing
corrupted data on a running job
•  Hadoop didn’t make it easy to find out:
o  RM => search => application link => AM overview page => MR
Application Page => MR Job Page => Failed jobs page =>
parse long logs
o  Task retry without hope of success
HIVE “MISSING DIRECTORIES”
From an Altiscale customer:
“One problem we are seeing after the
[Hive Metastore] restart is that we lost
quite a few directories in [HDFS]. Is there
a way to recover these?”
HIVE “MISSING DIRECTORIES”
•  Obtained list of “missing” directories from customer:
o  /hive/biz/prod/*
•  Confirmed they were missing from HDFS
•  Searched through NameNode audit log to get block IDs that
belonged to missing directories.
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hive/biz/prod/incremental/carryoverstore/postdepuis/lmt_unmapped_pggroup_schema._COPYING_.
BP-798113632-10.251.255.251-1370812162472
blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.251.255.177:50010|RBW],
ReplicaUnderConstruction[10.251.255.174:50010|RBW],
ReplicaUnderConstruction[10.251.255.169:50010|RBW]]}
HIVE “MISSING DIRECTORIES”
•  Used blockID to locate exact time of file deletion from
Namenode logs:
13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates:
blk_3560522076897293424_2448396 to 10.251.255.177:50010
10.251.255.169:50010 10.251.255.174:50010
•  Used time of deletion to inspect hive logs
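The two-step search above can be sketched as a pair of greps over the NameNode log. The sample log below is adapted from the excerpts on these slides; the real log path and format vary by installation:

```shell
# Hypothetical, simplified NameNode log excerpt (real logs live under the
# Hadoop log directory, e.g. /var/log/hadoop-hdfs/, and carry more fields).
cat > namenode-sample.log <<'EOF'
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/file1 blk_3560522076897293424_2448396
13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: blk_3560522076897293424_2448396 to 10.251.255.177:50010
EOF

# 1. Find block IDs allocated under the missing directory.
BLK=$(grep 'allocateBlock: /hive/biz/prod' namenode-sample.log | grep -o 'blk_[0-9_]*')

# 2. Find when those blocks were invalidated, i.e. when the files were deleted.
grep "addToInvalidates.*$BLK" namenode-sample.log
```

The timestamp on the `addToInvalidates` line is the deletion time used to search the Hive logs in the next step.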
HIVE “MISSING DIRECTORIES”
QueryStart QUERY_STRING="create database biz_weekly location '/hive/biz/prod'"
QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b"
TIME="1375245164667"
:
QueryEnd QUERY_STRING="create database biz_weekly location '/hive/biz/prod'"
QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b"
QUERY_RET_CODE="0" QUERY_NUM_TASKS="0" TIME="1375245166203"
:
QueryStart QUERY_STRING="drop database biz_weekly"
QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733"
TIME="1375256014799"
:
QueryEnd QUERY_STRING="drop database biz_weekly"
QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733"
QUERY_NUM_TASKS="0" TIME="1375256014838"
HIVE “MISSING DIRECTORIES”
•  In effect, user “usrprod” issued:
At 2013-07-31 04:32:44: create database biz_weekly
location '/hive/biz/prod'
At 2013-07-31 07:33:24: drop database biz_weekly
•  This is functionally equivalent to:
hdfs dfs -rm -r /hive/biz/prod"
HIVE “MISSING DIRECTORIES”
•  Customer manually placed their own data in /hive –
the warehouse directory managed and controlled by hive
•  Customer used CREATE and DROP db commands in
their code
o  Hive deletes database and table locations in /hive with
impunity
•  Why didn’t deleted data end up in .Trash?
o  Trash collection not turned on in configuration settings
o  It is now, but we still need a -skipTrash option (HIVE-6469)
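For reference, HDFS trash is enabled via `fs.trash.interval` in `core-site.xml`; the retention value below is an illustrative choice, not the setting Altiscale used:

```xml
<!-- core-site.xml: keep deleted files in .Trash for 24 hours (1440 minutes).
     A value of 0 (the default) disables trash entirely. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```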
HIVE “MISSING DIRECTORIES”
•  Hadoop forensics: piece together disparate sources…
o  Hadoop daemon logs (NameNode)
o  Hive query and metastore logs
o  Hadoop config files
•  Need better tools to correlate the different layers of the
system: hive client, hive metastore, MapReduce job,
YARN, HDFS, operating system metrics, …
By the way… Operating any distributed system would be
totally insane without NTP and a standard time zone (UTC).
CASE STUDY – ANALYZE QUERY
•  Customer provided Hive query + data sets
(100GBs to ~5 TBs)
•  Needed help optimizing the query
•  Didn’t rewrite query immediately
•  Wanted to characterize query performance and isolate
bottlenecks first
ANALYZE AND TUNE EXECUTION
•  Ran original query on the datasets in our environment:
o  Two M/R Stages: Stage-1, Stage-2
•  Long running reducers run out of memory
o  set mapreduce.reduce.memory.mb=5120
o  Reduces slots and extends reduce time
•  Query fails to launch Stage-2 with out of memory
o  set HADOOP_HEAPSIZE=1024 on client machine
•  Query has 250,000 Mappers in Stage-2, which causes failure
o  set mapred.max.split.size=5368709120 to reduce Mappers
ANALYSIS: HOW TO VISUALIZE?
•  Next challenge - how to visualize job execution?
•  Existing hadoop/hive logs not sufficient for this task
•  Wrote internal tools
o  parse job history files
o  plot mapper and reducer execution
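Before writing custom tools, a starting point worth knowing is Hadoop's built-in history analyzer. Assuming a Hadoop 2 CLI and a `.jhist` file retrieved from the job history directory (path below is a placeholder), it prints per-task start and finish times that can be fed to a plotting script:

```shell
# Print job summary plus task-level timing analysis from a history file.
mapred job -history <path-to-.jhist-file>
```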
ANALYSIS: MAP STAGE-1
[Plot: task execution timeline, annotated "Single reduce task"]
ANALYSIS: REDUCE STAGE-1
ANALYSIS: MAP STAGE-2
ANALYSIS: REDUCE STAGE-2
ANALYZE EXECUTION: FINDINGS
•  Lone, long running reducer in first stage of query
•  Analyzed input data:
o  Query split input data by userId
o  Bucketizing input data by userId
o  One very large bucket: “invalid” userId
o  Discussed “invalid” userid with customer
•  An error value is a common pattern!
o  Need to differentiate between “Don’t know and don’t care”
or “don’t know and do care.”
INTERACTIVE (DRAM CENTRIC)
PROCESSING SYSTEMS
•  Loading data into DRAM makes processing fast!
•  Examples: Spark, Impala, 0xdata, …, [SAP HANA], …
•  Streaming systems (Storm, DataTorrent) may be similar
•  Need to increase YARN container memory size
•  Caution: larger YARN container settings for interactive
jobs may not be right for batch systems like Hive
•  Container size: needs to combine vcores and memory:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
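A sketch of the relevant `yarn-site.xml` properties; the sizes are illustrative and must be tuned per node and per workload:

```xml
<!-- yarn-site.xml: resources each Node Manager offers to YARN
     (illustrative values) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
<!-- Largest single container the scheduler will grant -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>
```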
HIVE + INTERACTIVE:
WATCH OUT FOR CONTAINER SIZE
HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
•  Attempting to schedule interactive systems and batch
systems like Hive may result in fragmentation
•  Interactive systems may require all-or-nothing scheduling
•  Batch jobs with little tasks may starve interactive jobs
HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
Solutions for fragmentation…
•  Reserve interactive nodes before starting batch jobs
•  Reduce interactive container size (if the algorithm permits)
•  Node labels (YARN-2492) and gang scheduling (YARN-624)
CONCLUSIONS
•  Hive + Hadoop debugging can get very complex
o  Sifting through many logs and screens
o  Automatic transmission versus manual transmission
•  Static memory partitioning imposed by the Java Virtual
Machine has benefits but also creates challenges.
•  Where there are difficulties, there’s opportunity:
o  Better tooling, instrumentation, integration of logs/metrics
•  YARN still evolving into an operating system
•  Hadoop as a Service: aggregate and share expertise
•  Need to learn from the traditional database community!
QUESTIONS? COMMENTS?
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale

• 8. HADOOP 1 VS HADOOP 2
•  No more JobTrackers, TaskTrackers
•  YARN ~ Operating System for Clusters
o  MapReduce is implemented as a YARN application
o  Bring on the applications! (Spark is just the start…)
•  Should be transparent to Hive users
• 9. HADOOP 2 DEBUGGING TOOLS
•  Monitoring
o  System state of cluster:
§  CPU, Memory, Network, Disk
§  Nagios, Ganglia, Sensu
§  Collectd, statsd, Graphite
o  Hadoop level:
§  HDFS usage
§  Resource usage:
•  Container memory allocated vs used
•  # of jobs running at the same time
•  Long-running tasks
• 10. HADOOP 2 DEBUGGING TOOLS
•  Hadoop logs
o  Daemon logs: Resource Manager, NameNode, DataNode
o  Application logs: Application Master, MapReduce tasks
o  Job history file: resources allocated during job lifetime
o  Application configuration files: store all Hadoop application parameters
•  Source code instrumentation
  • 11.
• 12. ACCESSING LOGS IN HADOOP 2
•  To view the logs for a job, click on the link under the ID column in the Resource Manager UI.
• 13. ACCESSING LOGS IN HADOOP 2
•  To view application top-level logs, click on logs.
•  To view individual logs for the mappers and reducers, click on History.
• 14. ACCESSING LOGS IN HADOOP 2
•  Log output for the entire application.
• 15. ACCESSING LOGS IN HADOOP 2
•  Click on the Map link for mapper logs and the Reduce link for reducer logs.
• 16. ACCESSING LOGS IN HADOOP 2
•  Clicking on a single link under Name provides an overview for that particular map task.
• 17. ACCESSING LOGS IN HADOOP 2
•  Finally, clicking on the logs link takes you to the log output for that map task.
• 18. ACCESSING LOGS IN HADOOP 2
•  Fun, fun, donuts, and more fun…
• 19. HIVE + HADOOP 2 ARCHITECTURE
•  Hive 0.10+
•  [Diagram: Hive CLI, Hive Metastore, and HiveServer2 (JDBC/ODBC clients such as Tableau, Kettle, …) in front of a Hadoop 2 cluster]
• 20. HIVE LOGS
•  Query log location
•  From /etc/hive/hive-site.xml:
<property>
  <name>hive.querylog.location</name>
  <value>/home/hive/log/${user.name}</value>
</property>
SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"
• 21. HIVE CLIENT LOGS
•  /etc/hive/hive-log4j.properties:
o  hive.log.dir=/var/log/hive/${user.name}
2014-05-29 19:51:09,830 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select count(*) from dogfood_job_data
2014-05-29 19:51:09,852 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>
2014-05-29 19:51:09,853 INFO ql.Driver (PerfLogger.java:PerfLogBegin(97)) - <PERFLOG method=semanticAnalyze>
2014-05-29 19:51:09,890 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8305)) - Starting Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8340)) - Completed phase 1 of Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1060)) - Get metadata for source tables
2014-05-29 19:51:09,906 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1167)) - Get metadata for subqueries
2014-05-29 19:51:09,909 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1187)) - Get metadata for destination tables
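The PERFLOG markers in the client log make quick timing triage possible with standard text tools. A minimal sketch, assuming the log format shown above (the excerpt below is a hypothetical file written to /tmp so the pipeline is self-contained; in practice you would point it at a file under hive.log.dir):

```shell
# Write a two-line sample that mirrors the PERFLOG end-marker format.
cat > /tmp/hive-client.log <<'EOF'
2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>
2014-05-29 19:51:10,412 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=semanticAnalyze start=1401393069853 end=1401393070412 duration=559>
EOF

# Print "phase duration_ms" for each completed phase.
grep -o 'method=[^ ]* .*duration=[0-9]*' /tmp/hive-client.log |
  sed 's/method=//; s/ start=[0-9]*//; s/ end=[0-9]*//; s/duration=//'
```

This prints one line per phase (for the sample: `parse 22` and `semanticAnalyze 559`), which makes slow compilation phases stand out immediately.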
• 22. HIVE METASTORE LOGS
•  /etc/hive-metastore/hive-log4j.properties:
o  hive.log.dir=/service/log/hive-metastore/${user.name}
2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,261 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
• 23. HIVE ISSUES + CASE STUDIES
•  Hive issues:
o  Hive client out of memory
o  Hive map/reduce task out of memory
o  Hive metastore out of memory
o  Hive launches too many tasks
•  Case studies:
o  Hive “stuck” job
o  Hive “missing directories”
o  Analyze Hive query execution
o  Hive + interactive (DRAM-centric) processing engines
• 24. HIVE CLIENT OUT OF MEMORY
•  Memory-intensive client-side Hive query (map-side join):
Number of reduce tasks not specified. Estimated from input data size: 999
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.lang.OutOfMemoryError: Java heap space
  at java.nio.CharBuffer.wrap(CharBuffer.java:350)
  at java.nio.CharBuffer.wrap(CharBuffer.java:373)
  at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
• 25. HIVE CLIENT OUT OF MEMORY
•  Set HADOOP_HEAPSIZE before launching the Hive client:
HADOOP_HEAPSIZE=<new heapsize> hive <fileName>
•  Watch out for the HADOOP_CLIENT_OPTS issue in hive-env.sh!
•  Know how much memory is available on the machine running the client; do not exceed it or use a disproportionate share of it:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          1695       1388        306          0         60        424
-/+ buffers/cache:        903        791
Swap:          895        101        794
• 26. HIVE TASK OUT OF MEMORY
•  Query spawns MapReduce jobs that run out of memory
•  How to find this issue?
o  Hive diagnostic message
o  Hadoop MapReduce logs
• 27. HIVE TASK OUT OF MEMORY
•  Fix is to increase the task RAM allocation…
set mapreduce.map.memory.mb=<new RAM allocation>;
set mapreduce.reduce.memory.mb=<new RAM allocation>;
•  Also watch out for…
set mapreduce.map.java.opts=-Xmx<heap size>m;
set mapreduce.reduce.java.opts=-Xmx<heap size>m;
•  Not a magic bullet – requires manual tuning
•  Increasing individual container memory size means:
o  Fewer containers can run at once
o  Decrease in overall parallelism
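When raising mapreduce.*.memory.mb, the matching -Xmx in mapreduce.*.java.opts must stay below the container size, or YARN will kill containers whose JVM heap plus off-heap overhead exceeds their allocation. A common rule of thumb (an assumption, not an official formula) is to set the heap to roughly 80% of the container; a sketch that derives the paired settings:

```shell
# Derive a conservative -Xmx (~80% of container) for the java.opts that
# pair with a given container size. The 4096 MB value is illustrative.
container_mb=4096
heap_mb=$(( container_mb * 80 / 100 ))
echo "set mapreduce.map.memory.mb=${container_mb};"
echo "set mapreduce.map.java.opts=-Xmx${heap_mb}m;"
```

For 4096 MB containers this emits a 3276 MB heap, leaving headroom for JVM metadata and native buffers.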
• 28. HIVE METASTORE OUT OF MEMORY
•  Out-of-memory issues are not necessarily dumped to logs
•  Metastore can become unresponsive; queries can't be submitted
•  Restart with a higher heap size: export HADOOP_HEAPSIZE in hcat_server.sh
•  After notifying Hive users about the downtime: service hcat restart
• 29. HIVE LAUNCHES TOO MANY TASKS
•  Typically a function of the input data set
•  Lots of little files
• 30. HIVE LAUNCHES TOO MANY TASKS
•  Set mapred.max.split.size to an appropriate fraction of the data size
•  Also verify that hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
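To pick a sensible mapred.max.split.size, it helps to estimate the resulting mapper count, since each mapper processes at most one split's worth of data. A back-of-the-envelope sketch with illustrative numbers (not from any customer workload):

```shell
# Rough mapper-count estimate: total input bytes / max split size.
# 500 GB of input with 512 MB splits -> ~1000 mappers.
input_gb=500
split_mb=512
mappers=$(( input_gb * 1024 / split_mb ))
echo "~${mappers} mappers"
```

If the estimate is far above what the cluster can schedule, raise the split size (or combine small files upstream) until the mapper count is reasonable.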
• 31. CASE STUDY: HIVE STUCK JOB
From an Altiscale customer: “This job [jobid] has been running now for 41 hours. Is it still progressing, or has something hung up the map/reduce so it's just spinning? Do you have any insight?”
• 32. HIVE STUCK JOB
1.  Received the jobId, application_1382973574141_4536, from the client
2.  Logged into the client cluster
3.  Pulled up the Resource Manager
4.  Entered part of the jobId (4536) in the search box
5.  Clicked on the link that says application_1382973574141_4536
6.  On the resulting Application Overview page, clicked on the link next to “Tracking URL” that said Application Master
• 33. HIVE STUCK JOB
7.  On the resulting MapReduce Application page, clicked on the Job Id (job_1382973574141_4536)
8.  The resulting MapReduce Job page displayed detailed status of the mappers, including 4 failed mappers
9.  Clicked on the 4 link in the Maps row of the Failed column
10. Title of the next page was “FAILED Map attempts in job_1382973574141_4536”
11. Each failed mapper generated an error message
12. Buried in the 16th line: Caused by: java.io.FileNotFoundException: File does not exist: hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq
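The same root cause can often be surfaced without the seven-screen click path: with log aggregation enabled, `yarn logs -applicationId <appId>` dumps all task-attempt logs to stdout, and grepping for “Caused by:” jumps straight to the exception. A sketch on a saved excerpt (the log content below is hypothetical, modeled on the FileNotFoundException in step 12):

```shell
# Excerpt of an aggregated task-attempt log, saved locally for the demo;
# the content (including the HDFS path) is hypothetical. In practice,
# fetch the real logs with:
#   yarn logs -applicationId application_1382973574141_4536
cat > /tmp/task-attempt.log <<'EOF'
2013-11-02 10:14:22,101 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://namenode:8020/warehouse/table/part-00000.seq
EOF

# Surface root causes without paging through the UI.
grep 'Caused by:' /tmp/task-attempt.log
```

One grep across all attempts also makes it obvious when every retry is failing on the same missing file, i.e. a retry without hope of success.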
• 34. HIVE STUCK JOB
•  Job was stuck for a day or so, retrying a mapper that would never finish successfully
•  During the job, our customer's colleague realized an input file was corrupted and deleted it
•  The colleague did not anticipate the effect of removing corrupted data on a running job
•  Hadoop didn't make it easy to find out:
o  RM => search => application link => AM overview page => MR Application page => MR Job page => Failed jobs page => parse long logs
o  Task retry without hope of success
• 35. HIVE “MISSING DIRECTORIES”
From an Altiscale customer: “One problem we are seeing after the [Hive Metastore] restart is that we lost quite a few directories in [HDFS]. Is there a way to recover these?”
• 36. HIVE “MISSING DIRECTORIES”
•  Obtained the list of “missing” directories from the customer: /hive/biz/prod/*
•  Confirmed they were missing from HDFS
•  Searched through the NameNode audit log to get block IDs that belonged to the missing directories:
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/incremental/carryoverstore/postdepuis/lmt_unmapped_pggroup_schema._COPYING_. BP-798113632-10.251.255.251-1370812162472 blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.251.255.177:50010|RBW], ReplicaUnderConstruction[10.251.255.174:50010|RBW], ReplicaUnderConstruction[10.251.255.169:50010|RBW]]}
• 37. HIVE “MISSING DIRECTORIES”
•  Used the block ID to locate the exact time of file deletion in the NameNode logs:
13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: blk_3560522076897293424_2448396 to 10.251.255.177:50010 10.251.255.169:50010 10.251.255.174:50010
•  Used the time of deletion to inspect the Hive logs
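The forensic step of mapping a deleted path to its block IDs is just pattern matching on the NameNode state-change log. A sketch on an abbreviated, illustrative excerpt (real allocateBlock lines carry more detail, as shown above; the second line is a made-up unrelated path):

```shell
# Abbreviated NameNode log excerpt for the demo; real lines include the
# block pool ID and replica state as shown in the slides.
cat > /tmp/nn-audit.log <<'EOF'
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/file1 blk_3560522076897293424_2448396
13/07/24 21:11:01 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /other/path/file2 blk_1111111111111111111_0000001
EOF

# Block IDs allocated under the missing directory tree.
grep 'allocateBlock: /hive/biz/prod/' /tmp/nn-audit.log | grep -o 'blk_[0-9_]*'
```

Feeding the extracted block IDs back into a grep for addToInvalidates then pins down the deletion time, which is the pivot for searching the Hive query logs.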
• 38. HIVE “MISSING DIRECTORIES”
QueryStart QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" TIME="1375245164667"
:
QueryEnd QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" QUERY_RET_CODE="0" QUERY_NUM_TASKS="0" TIME="1375245166203"
:
QueryStart QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" TIME="1375256014799"
:
QueryEnd QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" QUERY_NUM_TASKS="0" TIME="1375256014838"
• 39. HIVE “MISSING DIRECTORIES”
•  In effect, user “usrprod” issued:
o  At 2013-07-31 04:32:44: create database biz_weekly location '/hive/biz/prod'
o  At 2013-07-31 07:33:24: drop database biz_weekly
•  This is functionally equivalent to:
hdfs dfs -rm -r /hive/biz/prod
• 40. HIVE “MISSING DIRECTORIES”
•  Customer manually placed their own data in /hive – the warehouse directory managed and controlled by Hive
•  Customer used CREATE and DROP database commands in their code
o  Hive deletes database and table locations in /hive with impunity
•  Why didn't the deleted data end up in .Trash?
o  Trash collection was not turned on in the configuration settings
o  It is now, but we still need a –skipTrash option for Hive (HIVE-6469)
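Trash collection is controlled by fs.trash.interval in core-site.xml; with a non-zero interval, deleted files move to the user's .Trash directory instead of being destroyed immediately, which would have made this incident recoverable. A sketch of the setting (the 24-hour value is illustrative):

```xml
<!-- core-site.xml: enable HDFS trash so accidental deletes are
     recoverable; the interval is in minutes (1440 = 24 hours). -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```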
• 41. HIVE “MISSING DIRECTORIES”
•  Hadoop forensics: piece together disparate sources…
o  Hadoop daemon logs (NameNode)
o  Hive query and metastore logs
o  Hadoop config files
•  Need better tools to correlate the different layers of the system: Hive client, Hive metastore, MapReduce job, YARN, HDFS, operating system metrics, …
•  By the way… operating any distributed system would be totally insane without NTP and a standard time zone (UTC).
• 42. CASE STUDY – ANALYZE QUERY
•  Customer provided a Hive query + data sets (100s of GBs to ~5 TBs)
•  Needed help optimizing the query
•  Didn't rewrite the query immediately
•  Wanted to characterize query performance and isolate bottlenecks first
• 43. ANALYZE AND TUNE EXECUTION
•  Ran the original query on the datasets in our environment:
o  Two M/R stages: Stage-1, Stage-2
•  Long-running reducers ran out of memory
o  set mapreduce.reduce.memory.mb=5120
o  Reduces slots and extends reduce time
•  Query failed to launch Stage-2 with out of memory
o  set HADOOP_HEAPSIZE=1024 on the client machine
•  Query had 250,000 mappers in Stage-2, which caused failure
o  set mapred.max.split.size=5368709120 to reduce the number of mappers
• 44. ANALYSIS: HOW TO VISUALIZE?
•  Next challenge – how to visualize job execution?
•  Existing Hadoop/Hive logs not sufficient for this task
•  Wrote internal tools to:
o  parse job history files
o  plot mapper and reducer execution
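The core of such a tool is simple: extract each task's start and finish timestamps from the job history file and look for outliers. A minimal sketch of the idea using made-up timestamps (real job history parsing involves the JSON-encoded .jhist format, which this deliberately skips):

```shell
# Per-task "name  start_ms  finish_ms" rows, as a tool might extract them
# from a job history file; the data below is made up.
cat > /tmp/tasks.tsv <<'EOF'
r_000000	1375245164000	1375245170000
r_000001	1375245164000	1375331564000
EOF

# Print per-task durations in seconds; stragglers jump out immediately.
awk '{ printf "%s %d s\n", $1, ($3-$2)/1000 }' /tmp/tasks.tsv
```

Here the second reducer runs for a full day while the first finishes in seconds, which is exactly the lone-straggler shape the Stage-1 plot revealed.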
• 46. ANALYSIS: REDUCE STAGE-1
•  [Plot: Stage-1 reduce timeline dominated by a single reduce task]
• 49. ANALYZE EXECUTION: FINDINGS
•  Lone, long-running reducer in the first stage of the query
•  Analyzed input data:
o  Query split input data by userId
o  Bucketized input data by userId
o  One very large bucket: “invalid” userId
o  Discussed the “invalid” userId with the customer
•  An error value is a common pattern!
o  Need to differentiate between “don't know and don't care” and “don't know and do care”
• 50. INTERACTIVE (DRAM-CENTRIC) PROCESSING SYSTEMS
•  Loading data into DRAM makes processing fast!
•  Examples: Spark, Impala, 0xdata, …, [SAP HANA], …
•  Streaming systems (Storm, DataTorrent) may be similar
•  Need to increase YARN container memory size
• 51. HIVE + INTERACTIVE: WATCH OUT FOR CONTAINER SIZE
•  Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive
•  Container size needs to combine vcores and memory:
o  yarn.scheduler.maximum-allocation-vcores
o  yarn.nodemanager.resource.cpu-vcores
o  …
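The memory-side caps live alongside the vcore settings named above. A yarn-site.xml sketch (the values are illustrative and must be sized to the cluster's actual nodes):

```xml
<!-- yarn-site.xml: cap what any single container may request; oversized
     caps let one interactive job crowd out Hive's batch containers. -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
```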
• 52. HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION
•  Attempting to schedule interactive systems alongside batch systems like Hive may result in fragmentation
•  Interactive systems may require all-or-nothing scheduling
•  Batch jobs with little tasks may starve interactive jobs
• 53. HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION
Solutions for fragmentation…
•  Reserve interactive nodes before starting batch jobs
•  Reduce interactive container size (if the algorithm permits)
•  Node labels (YARN-2492) and gang scheduling (YARN-624)
• 54. CONCLUSIONS
•  Hive + Hadoop debugging can get very complex
o  Sifting through many logs and screens
o  Automatic transmission versus manual transmission
•  Static partitioning induced by the Java Virtual Machine has benefits but also induces challenges
•  Where there are difficulties, there's opportunity:
o  Better tooling, instrumentation, integration of logs/metrics
•  YARN is still evolving into an operating system
•  Hadoop as a Service: aggregate and share expertise
•  Need to learn from the traditional database community!