2. Coming soon to a bookstore near you…
• Hadoop Application
Architectures
How to build end-to-end solutions
using Apache Hadoop and related
tools
@hadooparchbook
www.hadooparchitecturebook.com
3. ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
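The three steps above can be sketched in a few lines of Python (all names and data here are illustrative, not from any real pipeline):

```python
# Minimal ETL sketch: extract rows from an outside source, transform
# them to fit operational needs, load them into the end target.
# Everything is in-memory for illustration; real pipelines read from
# files, queues, or databases.

def extract():
    # Pretend this came from an outside source (file, API, OLTP table).
    return [{"name": " Alice ", "amount": "10"},
            {"name": "Bob",     "amount": "32"}]

def transform(rows):
    # Normalize types and trim whitespace to fit the target's needs.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, target):
    # Append into the end target (a list standing in for a table).
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'amount': 10}
```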
4. Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch oriented data processing at scale
• Many, many ways to process data in parallel at scale
5. The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low-latency query languages
7. Data Has Changed in the Last 30 Years
[Chart: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices and sophisticated machines. Structured data is about 10% of the total; unstructured data about 90%.]
8. Volume, Variety, Velocity Cause Problems
[Diagram: OLTP systems and enterprise applications feed an extract-transform-load pipeline into the data warehouse, which is queried by business intelligence tools.]
1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can’t provide value.
10. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
11. What I often see
• ETL cluster
• ELT in the DWH
• ETL in Hadoop
13. Best Practices
Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When is it not applicable?
22. Lesson:
1. There are connectors to:
Oracle, Netezza and Teradata
2. Download them
3. Read the documentation
4. Ask questions if anything is unclear
5. Follow the installation instructions
6. Use Sqoop with connectors
23. Data Loading Mistake #3
Just copying files?
This sounds too simple. We
probably need some cool
whizzbang tool.
— Famous last words
29. Data Processing Mistake #1
This system must be ready in 12 months.
We have to convert 100 data sources and 5000
transformations to Hadoop.
Let’s spend 2 days planning a schedule and budget
for the entire year and then just go and implement
it.
Prototype? Who needs that?
— Famous last words
31. Lessons
• Take learning curve into account
• You don’t know what you don’t know
• Hadoop will be difficult and frustrating
for at least 3 months.
32. Data Processing Mistake #2
Hadoop is all about MapReduce.
So I’ll use MapReduce for all my
data processing needs.
— Famous last words
34. Lessons:
MapReduce is the assembly language of Hadoop:
Simple things are hard.
Hard things are possible.
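To see why simple things are hard, here is a word count written as explicit map, shuffle and reduce phases, simulated in plain Python (the documents are illustrative):

```python
from collections import defaultdict

# Word count in explicit map / shuffle / reduce phases, the way
# MapReduce forces you to think. A high-level tool (Hive, Pig, Spark)
# would express the same thing in a single line.

docs = ["to be or not to be", "to thine own self be true"]

# Map: emit (word, 1) pairs.
mapped = [(w, 1) for doc in docs for w in doc.split()]

# Shuffle: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group.
counts = {word: sum(vals) for word, vals in groups.items()}

print(counts["to"], counts["be"])  # 3 3
```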
35. Data Processing Mistake #3
I got 5000 tiny XMLs, and Hadoop
is great at processing
unstructured data.
So I’ll just leave the data like that
and parse the XML in every job.
— Famous last words
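A better approach, sketched below with hypothetical data, is to parse the XML once at ingest and store flat records, so downstream jobs never touch an XML parser again:

```python
import xml.etree.ElementTree as ET
import csv, io

# Parse-once-at-ingest sketch: convert small XML records into a flat
# format (CSV here; Avro or Parquet in practice) so every later job
# reads structured columns instead of re-parsing XML.

xml_docs = [
    "<order><id>1</id><amount>25</amount></order>",
    "<order><id>2</id><amount>99</amount></order>",
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "amount"])
for doc in xml_docs:
    root = ET.fromstring(doc)
    writer.writerow([root.findtext("id"), root.findtext("amount")])

flat = out.getvalue()
print(flat.splitlines()[1])  # 1,25
```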
40. Lessons
1. Without partitions, every query is a full table scan.
2. Yes, Hadoop scans fast.
3. But the fastest read is the one you don’t perform.
4. Cheap storage allows you to store the same dataset
partitioned multiple ways.
5. Use partitions for fast data loading.
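The partition-pruning idea behind these lessons can be sketched in Python; the dict below stands in for a date-partitioned directory tree in HDFS (all names hypothetical):

```python
# Partition-pruning sketch: with data laid out by partition key, a
# query for one day opens one "directory" instead of scanning all of
# them. The dict stands in for dt=YYYY-MM-DD directories in HDFS.

partitions = {
    "dt=2014-05-01": ["rec1", "rec2"],
    "dt=2014-05-02": ["rec3"],
    "dt=2014-05-03": ["rec4", "rec5", "rec6"],
}

def query(day):
    # With partitions: read only the matching directory.
    return partitions.get("dt=" + day, [])

def full_scan():
    # Without partitions: every query touches every record.
    return [r for recs in partitions.values() for r in recs]

print(len(query("2014-05-03")), len(full_scan()))  # 3 6
```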
45. Lessons:
• Use Relational:
• To maintain tool compatibility
• DWH enrichment
• Stay in Hadoop for:
• Text search
• Graph analysis
• Reduce time in pipeline
• Big data & small network
• Congested database
46. Data Loading Mistake #2
We used Sqoop to get data out of
Oracle. Let’s use Sqoop to get it back in.
— Famous last words
48. Lesson
Use the Oracle direct connectors if you can afford them.
They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient
53. Lesson:
A workflow management tool should provide:
• Keeping track of metadata, components and
integrations
• Scheduling and Orchestration
• Restarts and retries
• Cohesive System View
• Instrumentation, Measurement and Monitoring
• Reporting
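As a taste of the “restarts and retries” item above, here is a minimal retry wrapper of the kind a workflow manager provides (all names hypothetical; real tools also back off, alert, and record state):

```python
import time

# Minimal retry wrapper: rerun a failing step a bounded number of
# times before giving up, as a workflow manager would.

def run_with_retries(step, max_attempts=3, delay=0.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay)  # real tools back off and alert here

attempts = []
def flaky_step():
    # Fails twice, then succeeds — a typical transient failure.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "done"

print(run_with_retries(flaky_step))  # done
```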
64. Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java
65. Beginner Projects
• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
• Complete works of Shakespeare
• MovieLens dataset
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case:
XML ingestion, ETL process, DWH history
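The Shakespeare exercise above fits in a few lines once the data is loaded; the two lines of text below are a stand-in for the complete works:

```python
from collections import Counter
import re

# Beginner exercise: the 10 most common words. The text variable is a
# tiny stand-in for the complete works of Shakespeare.

text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer"""

words = re.findall(r"[a-z']+", text.lower())
top10 = Counter(words).most_common(10)
print(top10[0])  # ('to', 3)
```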
The data landscape looks totally different now than it did 30 years ago when the fundamental concepts and technologies of data management were developed. Since 1980, the birth of personal computing, the internet, mobile devices and sophisticated electronic machines has led to an explosion in data volume, variety and velocity. Simply put, the data you’re managing today looks nothing like the data you were managing in 1980. In fact, structured data represents only somewhere between 10-20% of the total data volume in any given enterprise.
Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware.
Two primary components, HDFS and MapReduce. Based on software originally developed at Google.
An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed.
Allows companies to begin storing data that was previously thrown away.
Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate into existing data infrastructure.
ETL clusters are easy to use and manage, but are often inefficient. ELT is efficient, but spends expensive DWH cycles on low-value transformations, wastes storage on temp data and can cause missed SLAs. The next step is moving the ETL to Hadoop.
There are many Hadoop technologies for ETL. It’s easy to Google each one and see what it does and how to use it. The trick is to use the right technology at the right time, and avoid some common mistakes.
So, I’m not going to waste time telling you how to use Hadoop. You’ll easily find out. I’ll tell you what to use it for, and how to avoid pitfalls.
Database specific connectors make data ingest about 1000 times faster.
Important note: Sqoop’s connector to Oracle is NOT the Oracle connector for Hadoop. It is called OraOop, and it is both free and open source.
If you get a bunch of files to a directory every day or hour, just copying them to Hadoop is fine. If you can do something in 1 line in shell, do it. Don’t overcomplicate.
You are just starting to learn Hadoop, so make life easier on yourself.
Give yourself a fighting chance
These days there is hardly ever a reason to use plain MapReduce. There are so many other good tools.
Like walking the Huashan plank trail: not very stable and not very fast.
Mottainai – a Japanese term for wasting resources or doing something inefficiently.
Use Hadoop cores to pre-sort and pre-partition the data and turn it into Oracle’s data format. Then load it into Oracle as an append. This uses no redo, very little CPU, and lots of parallelism.
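The pre-sort/pre-partition idea can be sketched in Python (keys, data and the partitioning scheme are illustrative, not Oracle’s actual format):

```python
from collections import defaultdict

# Sketch of "do the expensive work in Hadoop": sort and partition the
# rows on the cheap cluster so the database-side load is a pure append.

rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]

def presort_and_partition(rows, n_parts=2):
    parts = defaultdict(list)
    for key, val in sorted(rows):                # pre-sort on cluster CPUs
        parts[key % n_parts].append((key, val))  # hash-partition by key
    return dict(parts)

ready = presort_and_partition(rows)
# Each partition arrives at the database already sorted, ready for a
# direct-path append: no database-side sorting, minimal redo.
print(ready[1])  # [(1, 'a'), (3, 'c'), (5, 'e')]
```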