SlideShare ist ein Scribd-Unternehmen logo
1 von 46
1
Scaling ETL
with Hadoop
Gwen Shapira, Solutions Architect
@gwenshap
gshapira@cloudera.com
2
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
3
Hadoop Is…
• HDFS – Replicated, distributed data storage
• Map-Reduce – Batch oriented data processing at
scale. Parallel programming platform.
4
The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low latency query language
5
6
Why ETL with Hadoop?
Got unstructured data?
• Traditional ETL:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, ProtoBuffs, ORC, Parquet
• Compression
• Office, OpenDocument, iWorks
• PDF, Epup, RTF
• Midi, MP3
• JPEG, Tiff
• Java Classes
• Mbox, RFC822
• Autocad
• TrueType Parser
• HFD / NetCDF
8
Replace ETL Clusters
• Informatica is:
• Easy to program
• Complete ETL solution
• Hadoop is:
• Cheaper
• MUCH more flexible
• Faster?
• More scalable?
• You can have both
9
Data Warehouse Offloading
• DWH resources are
EXPENSIVE
• Reduce storage costs
• Release CPU capacity
• Scale
• Better tools
10
What I often see
ETL
Cluster
ELT in
DWH
ETL in
Hadoop
11
ETL Cluster
ETL Cluster with
some Hadoop
12
We’ll Discuss:
Technologies Speed & Scale Tips & Tricks
Extract
Transform
Load
Workflow
13
14
Extract
Let me count the ways
• From Databases: Sqoop
• Log Data: Flume + CDK
• Or just write data to HDFS
15
Sqoop – The Balancing Act
16
Scale Sqoop Slowly
• Balance between:
• Maximizing network utilization
• Minimizing database impact
• Start with smallish table (1-10G)
• 1 mapper, 2 mappers, 4 mappers
• Where’s the bottleneck?
• vmtat, iostat, mpstat, netstat, iptraf
17
When Loading Files:
Same principles apply:
• Parallel Copy
• Add Parallelism
• Find Bottlenecks
• Resolve them
• Avoid Self-DDOS
18
Scaling Sqoop
• Split column - match index or partitions
• Compression
• Drivers + Connectors + Direct mode
• Incremental import
19
Ingest Tips
• Use file system tricks to ensure consistency
• Directory structure:
/intent
/category
/application (optional)
/dataset
/partitions
/files
• Examples:
/data/fraud/txs/2011-01-01/20110101-00.avro
/group/research/model-17/training-tx/part-00000.txt
/user/gshapira/scratch/surge/
20
Ingest Tips
• External tables in Hive
• Keep raw data
• Trigger workflows on
file arrival
21
22
Transform
Endless Possibilites
• Map Reduce
(in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
• Morphlines
23
Prototype
24
Partitioning
• Hive
• Directory Structure
• Pre-filter
• Adds metadata
25
Tune Data Structures
• Joins are expensive
• Disk space is not
• De-normalize
• Store same data in
multiple formats
26
Map-Reduce
• Assembly language of data processing
• Simple things are hard, hard things are possible
• Use for:
• Optimization: Do in one MR job what Hive does in 3
• Optimization: Partition the data just right
• GeoSpatial
• Mahout – Map/Reduce machine learning
27
Parallelism –Unit of Work
• Amdahl’s Law
• Small Units
• That stay small
• One user?
• One day?
• Ten square meter?
28
Remember the Basics
• X reduce output is 3X disk IO and 2X network IO
• Less jobs = Less reduces = Less IO = Faster and Scalier
• Know your network and disk throughput
• Have rough idea of ops-per-second
29
Instrumentation
• Optimize the right things
• Right jobs
• Right hardware
• 90% of the time –
its not the hardware
30
Tips
Avoid the
“old data warehouse guy”
syndrome
31
Tips
• Slowly changing dimensions:
• Load changes
• Merge
• And swap
• Store intermediate results:
• Performance
• Debugging
• Store Source/Derived relation
32
Fault and Rebuild
• Tier 0 – raw data
• Tier 1 – cleaned data
• Tier 2 – transformations, lookups and denormalization
• Tier 3 - Aggregations
33
Never Metadata I didn’t like
• Metadata is small data
• That changes frequently
• Solutions:
• Oozie
• Directory structure / Partitions
• Outside of HDFS
• HBase
• Cloudera Navigator
34
Few words about Real Time ETL
• What does it even mean?
• Fast reporting?
• No delay from OLTP to DWH?
• Micro-batches make more sense:
• Aggregation
• Economy of scale
• Late data happens
• Near-line solutions
35
36
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle Connectors
• NoSQLs
37
Scaling
• Sqoop works better for some DBs
• Fuse-FS and DB tools give more control
• Load in parallel
• Directly to FS
• Partitions
• Do all formatting on Hadoop
• Do you REALLY need to load that?
38
How not to Load
• Most Hadoop customers don’t load data in bulk
• History can stay in Hadoop
• Load only aggregated data
• Or computation results – recommendations, reports.
• Most queries can run in Hadoop
• BI tools often run in Hadoop
39
40
Workflow Management
Tools
• Oozie
• Azkaban
• Pentaho Kettle
• TalenD
• Informatica
41
} Native Hadoop
} Kinda Open Source
Scaling Challenges
• Keeping track of:
• Code Components
• Metadata
• Integrations and Adapters
• Reports, results, artifacts
• Scheduling and Orchestration
• Cohesive System View
• Life Cycle
• Instrumentation, Measurement and Monitoring
42
My Toolbox
• Hue + Oozie:
• Scheduling + Orchestration
• Cohesive system view
• Process repository
• Some metadata
• Some instrumentation
• Cloudera Manager for monitoring
• … and way too many home grown scripts
43
Hue + Oozie
44
45
— Josh Wills
A lot is still missing
• Metadata solution
• Instrumentation and monitoring solution
• Lifecycle management
• Lineage tracking
Contributions are welcome
46
47

Weitere ähnliche Inhalte

Was ist angesagt?

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance DataWorks Summit/Hadoop Summit
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupGwen (Chen) Shapira
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 

Was ist angesagt? (20)

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 

Andere mochten auch

A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationA Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationMichael Rainey
 
Designing High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseDesigning High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseMarcel Franke
 
ETL Is Dead, Long-live Streams
ETL Is Dead, Long-live StreamsETL Is Dead, Long-live Streams
ETL Is Dead, Long-live StreamsC4Media
 
Designing and implementing_an_etl_framework
Designing and implementing_an_etl_frameworkDesigning and implementing_an_etl_framework
Designing and implementing_an_etl_frameworkBharat Vadlamudi
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Mark Rittman
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
ETL Using Informatica Power Center
ETL Using Informatica Power CenterETL Using Informatica Power Center
ETL Using Informatica Power CenterEdureka!
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
 
How to document a database
How to document a databaseHow to document a database
How to document a databasePiotr Kononow
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...Mark Rittman
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data WarehouseCaserta
 

Andere mochten auch (12)

A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationA Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
 
Designing High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseDesigning High Performance ETL for Data Warehouse
Designing High Performance ETL for Data Warehouse
 
ETL Is Dead, Long-live Streams
ETL Is Dead, Long-live StreamsETL Is Dead, Long-live Streams
ETL Is Dead, Long-live Streams
 
Designing and implementing_an_etl_framework
Designing and implementing_an_etl_frameworkDesigning and implementing_an_etl_framework
Designing and implementing_an_etl_framework
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
ETL Using Informatica Power Center
ETL Using Informatica Power CenterETL Using Informatica Power Center
ETL Using Informatica Power Center
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
How to document a database
How to document a databaseHow to document a database
How to document a database
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 

Ähnlich wie Scaling etl with hadoop shapira 3

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldSearch in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldAlex Moundalexis
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Zohar Elkayam
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemZohar Elkayam
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 

Ähnlich wie Scaling etl with hadoop shapira 3 (20)

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldSearch in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
 
Apache drill
Apache drillApache drill
Apache drill
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 

Mehr von Gwen (Chen) Shapira

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep DiveGwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebookGwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupGwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersGwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 

Mehr von Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Ssd collab13
Ssd   collab13Ssd   collab13
Ssd collab13
 
Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
 

Kürzlich hochgeladen

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Kürzlich hochgeladen (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Scaling etl with hadoop shapira 3

  • 1. 1 Scaling ETL with Hadoop Gwen Shapira, Solutions Architect @gwenshap gshapira@cloudera.com
  • 2. 2
  • 3. ETL is… • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target • (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load) 3
  • 4. Hadoop Is… • HDFS – Replicated, distributed data storage • Map-Reduce – Batch oriented data processing at scale. Parallel programming platform. 4
  • 5. The Ecosystem • High level languages and abstractions • File, relational and streaming data integration • Process Orchestration and Scheduling • Libraries for data wrangling • Low latency query language 5
  • 6. 6 Why ETL with Hadoop?
  • 7. Got unstructured data? • Traditional ETL: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, ProtoBuffs, ORC, Parquet • Compression • Office, OpenDocument, iWorks • PDF, Epup, RTF • Midi, MP3 • JPEG, Tiff • Java Classes • Mbox, RFC822 • Autocad • TrueType Parser • HFD / NetCDF 8
  • 8. Replace ETL Clusters • Informatica is: • Easy to program • Complete ETL solution • Hadoop is: • Cheaper • MUCH more flexible • Faster? • More scalable? • You can have both 9
  • 9. Data Warehouse Offloading • DWH resources are EXPENSIVE • Reduce storage costs • Release CPU capacity • Scale • Better tools 10
  • 10. What I often see ETL Cluster ELT in DWH ETL in Hadoop 11 ETL Cluster ETL Cluster with some Hadoop
  • 11. 12
  • 12. We’ll Discuss: Technologies Speed & Scale Tips & Tricks Extract Transform Load Workflow 13
  • 14. Let me count the ways • From Databases: Sqoop • Log Data: Flume + CDK • Or just write data to HDFS 15
  • 15. Sqoop – The Balancing Act 16
  • 16. Scale Sqoop Slowly • Balance between: • Maximizing network utilization • Minimizing database impact • Start with smallish table (1-10G) • 1 mapper, 2 mappers, 4 mappers • Where’s the bottleneck? • vmtat, iostat, mpstat, netstat, iptraf 17
  • 17. When Loading Files: Same principles apply: • Parallel Copy • Add Parallelism • Find Bottlenecks • Resolve them • Avoid Self-DDOS 18
  • 18. Scaling Sqoop • Split column - match index or partitions • Compression • Drivers + Connectors + Direct mode • Incremental import 19
  • 19. Ingest Tips • Use file system tricks to ensure consistency • Directory structure: /intent /category /application (optional) /dataset /partitions /files • Examples: /data/fraud/txs/2011-01-01/20110101-00.avro /group/research/model-17/training-tx/part-00000.txt /user/gshapira/scratch/surge/ 20
  • 20. Ingest Tips • External tables in Hive • Keep raw data • Trigger workflows on file arrival 21
  • 22. Endless Possibilites • Map Reduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java • Morphlines 23
  • 24. Partitioning • Hive • Directory Structure • Pre-filter • Adds metadata 25
  • 25. Tune Data Structures • Joins are expensive • Disk space is not • De-normalize • Store same data in multiple formats 26
  • 26. Map-Reduce • Assembly language of data processing • Simple things are hard, hard things are possible • Use for: • Optimization: Do in one MR job what Hive does in 3 • Optimization: Partition the data just right • GeoSpatial • Mahout – Map/Reduce machine learning 27
  • 27. Parallelism –Unit of Work • Amdahl’s Law • Small Units • That stay small • One user? • One day? • Ten square meter? 28
  • 28. Remember the Basics • X reduce output is 3X disk IO and 2X network IO • Less jobs = Less reduces = Less IO = Faster and Scalier • Know your network and disk throughput • Have rough idea of ops-per-second 29
  • 29. Instrumentation • Optimize the right things • Right jobs • Right hardware • 90% of the time – its not the hardware 30
  • 30. Tips Avoid the “old data warehouse guy” syndrome 31
  • 31. Tips • Slowly changing dimensions: • Load changes • Merge • And swap • Store intermediate results: • Performance • Debugging • Store Source/Derived relation 32
  • 32. Fault and Rebuild • Tier 0 – raw data • Tier 1 – cleaned data • Tier 2 – transformations, lookups and denormalization • Tier 3 - Aggregations 33
  • 33. Never Metadata I didn’t like • Metadata is small data • That changes frequently • Solutions: • Oozie • Directory structure / Partitions • Outside of HDFS • HBase • Cloudera Navigator 34
  • 34. Few words about Real Time ETL • What does it even mean? • Fast reporting? • No delay from OLTP to DWH? • Micro-batches make more sense: • Aggregation • Economy of scale • Late data happens • Near-line solutions 35
  • 36. Technologies • Sqoop • Fuse-DFS • Oracle Connectors • NoSQLs 37
  • 37. Scaling • Sqoop works better for some DBs • Fuse-FS and DB tools give more control • Load in parallel • Directly to FS • Partitions • Do all formatting on Hadoop • Do you REALLY need to load that? 38
  • 38. How not to Load • Most Hadoop customers don’t load data in bulk • History can stay in Hadoop • Load only aggregated data • Or computation results – recommendations, reports. • Most queries can run in Hadoop • BI tools often run in Hadoop 39
  • 40. Tools • Oozie • Azkaban • Pentaho Kettle • TalenD • Informatica 41 } Native Hadoop } Kinda Open Source
  • 41. Scaling Challenges • Keeping track of: • Code Components • Metadata • Integrations and Adapters • Reports, results, artifacts • Scheduling and Orchestration • Cohesive System View • Life Cycle • Instrumentation, Measurement and Monitoring 42
  • 42. My Toolbox • Hue + Oozie: • Scheduling + Orchestration • Cohesive system view • Process repository • Some metadata • Some instrumentation • Cloudera Manager for monitoring • … and way too many home grown scripts 43
  • 45. A lot is still missing • Metadata solution • Instrumentation and monitoring solution • Lifecycle management • Lineage tracking Contributions are welcome 46
  • 46. 47

Hinweis der Redaktion

  1. Start with a portion of your data for fast iterationsPrototype – with Impala / streamingStart high level – tune as you go
  2. Pregnancy takes 9 month, no matter how many women are assigned to itBut – Elephants are pregnant for two years!
  3. 50% of Hadoop systems end up managed by the data warehouse team. And they want to do things as they always did – Kimball methodologies, time dimensions, etc.Not always a good idea, not always scalable, not always possible.