2. Coming soon to a bookstore near you…
• Hadoop Application
Architectures
How to build end-to-end solutions
using Apache Hadoop and related
tools
@hadooparchbook
www.hadooparchitecturebook.com
3. ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
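The three steps above can be sketched in a few lines of Python (all names and data here are illustrative, not from any real pipeline):

```python
# Minimal ETL sketch: extract rows from an outside source, transform
# them to fit operational needs, load them into the end target.
# Everything is in-memory for illustration; real pipelines read from
# files, queues, or databases.

def extract():
    # Pretend this came from an outside source (file, API, OLTP table).
    return [{"name": " Alice ", "amount": "10"},
            {"name": "Bob",     "amount": "32"}]

def transform(rows):
    # Normalize types and trim whitespace to fit the target's needs.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, target):
    # Append into the end target (a list standing in for a table).
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'amount': 10}
```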
4. Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch oriented data processing at scale
• Many, many ways to process data in parallel at scale
5. The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low-latency query languages
7. Data Has Changed in the Last 30 Years
[Chart: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices and sophisticated machines. Structured data is about 10% of the total; unstructured data about 90%.]
8. Volume, Variety, Velocity Cause Problems
[Diagram: OLTP systems and enterprise applications feed an extract-transform-load pipeline into the data warehouse, which is queried by business intelligence tools.]
1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can’t provide value.
10. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
11. What I often see
• ETL cluster
• ELT in the DWH
• ETL in Hadoop
13. Best Practices
Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When is it not applicable?
22. Lesson:
1. There are connectors to:
Oracle, Netezza and Teradata
2. Download them
3. Read the documentation
4. Ask questions if anything is unclear
5. Follow the installation instructions
6. Use Sqoop with connectors
23. Data Loading Mistake #3
Just copying files?
This sounds too simple. We
probably need some cool
whizzbang tool.
— Famous last words
29. Data Processing Mistake #1
This system must be ready in 12 months.
We have to convert 100 data sources and 5000
transformations to Hadoop.
Let’s spend 2 days planning a schedule and budget
for the entire year and then just go and implement
it.
Prototype? Who needs that?
— Famous last words
31. Lessons
• Take learning curve into account
• You don’t know what you don’t know
• Hadoop will be difficult and frustrating
for at least 3 months.
32. Data Processing Mistake #2
Hadoop is all about MapReduce.
So I’ll use MapReduce for all my
data processing needs.
— Famous last words
34. Lessons:
MapReduce is the assembly language of Hadoop:
Simple things are hard.
Hard things are possible.
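To see why simple things are hard, here is a word count written as explicit map, shuffle and reduce phases, simulated in plain Python (the documents are illustrative):

```python
from collections import defaultdict

# Word count in explicit map / shuffle / reduce phases, the way
# MapReduce forces you to think. A high-level tool (Hive, Pig, Spark)
# would express the same thing in a single line.

docs = ["to be or not to be", "to thine own self be true"]

# Map: emit (word, 1) pairs.
mapped = [(w, 1) for doc in docs for w in doc.split()]

# Shuffle: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group.
counts = {word: sum(vals) for word, vals in groups.items()}

print(counts["to"], counts["be"])  # 3 3
```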
35. Data Processing Mistake #3
I got 5000 tiny XMLs, and Hadoop
is great at processing
unstructured data.
So I’ll just leave the data like that
and parse the XML in every job.
— Famous last words
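A better approach, sketched below with hypothetical data, is to parse the XML once at ingest and store flat records, so downstream jobs never touch an XML parser again:

```python
import xml.etree.ElementTree as ET
import csv, io

# Parse-once-at-ingest sketch: convert small XML records into a flat
# format (CSV here; Avro or Parquet in practice) so every later job
# reads structured columns instead of re-parsing XML.

xml_docs = [
    "<order><id>1</id><amount>25</amount></order>",
    "<order><id>2</id><amount>99</amount></order>",
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "amount"])
for doc in xml_docs:
    root = ET.fromstring(doc)
    writer.writerow([root.findtext("id"), root.findtext("amount")])

flat = out.getvalue()
print(flat.splitlines()[1])  # 1,25
```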
40. Lessons
1. Without partitions, every query is a full table scan.
2. Yes, Hadoop scans fast.
3. But the fastest read is the one you don’t perform.
4. Cheap storage allows you to store the same dataset
partitioned multiple ways.
5. Use partitions for fast data loading.
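The partition-pruning idea behind these lessons can be sketched in Python; the dict below stands in for a date-partitioned directory tree in HDFS (all names hypothetical):

```python
# Partition-pruning sketch: with data laid out by partition key, a
# query for one day opens one "directory" instead of scanning all of
# them. The dict stands in for dt=YYYY-MM-DD directories in HDFS.

partitions = {
    "dt=2014-05-01": ["rec1", "rec2"],
    "dt=2014-05-02": ["rec3"],
    "dt=2014-05-03": ["rec4", "rec5", "rec6"],
}

def query(day):
    # With partitions: read only the matching directory.
    return partitions.get("dt=" + day, [])

def full_scan():
    # Without partitions: every query touches every record.
    return [r for recs in partitions.values() for r in recs]

print(len(query("2014-05-03")), len(full_scan()))  # 3 6
```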
45. Lessons:
• Use Relational:
• To maintain tool compatibility
• DWH enrichment
• Stay in Hadoop for:
• Text search
• Graph analysis
• Reduce time in pipeline
• Big data & small network
• Congested database
46. Data Loading Mistake #2
We used Sqoop to get data out of
Oracle. Let’s use Sqoop to get it back in.
— Famous last words
48. Lesson
Use the Oracle direct connectors if you can afford them.
They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient
53. Lesson:
A workflow management tool should provide:
• Keeping track of metadata, components and
integrations
• Scheduling and Orchestration
• Restarts and retries
• Cohesive System View
• Instrumentation, Measurement and Monitoring
• Reporting
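As a taste of the “restarts and retries” item above, here is a minimal retry wrapper of the kind a workflow manager provides (all names hypothetical; real tools also back off, alert, and record state):

```python
import time

# Minimal retry wrapper: rerun a failing step a bounded number of
# times before giving up, as a workflow manager would.

def run_with_retries(step, max_attempts=3, delay=0.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay)  # real tools back off and alert here

attempts = []
def flaky_step():
    # Fails twice, then succeeds — a typical transient failure.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "done"

print(run_with_retries(flaky_step))  # done
```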
64. Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java
65. Beginner Projects
• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
• Complete works of Shakespeare
• MovieLens dataset
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case:
XML ingestion, ETL process, DWH history
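The Shakespeare exercise above fits in a few lines once the data is loaded; the two lines of text below are a stand-in for the complete works:

```python
from collections import Counter
import re

# Beginner exercise: the 10 most common words. The text variable is a
# tiny stand-in for the complete works of Shakespeare.

text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer"""

words = re.findall(r"[a-z']+", text.lower())
top10 = Counter(words).most_common(10)
print(top10[0])  # ('to', 3)
```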
The data landscape looks totally different now than it did 30 years ago when the fundamental concepts and technologies of data management were developed. Since 1980, the birth of personal computing, the internet, mobile devices and sophisticated electronic machines has led to an explosion in data volume, variety and velocity. Simply put, the data you’re managing today looks nothing like the data you were managing in 1980. In fact, structured data represents only somewhere between 10-20% of the total data volume in any given enterprise.
Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware.
Two primary components, HDFS and MapReduce. Based on software originally developed at Google.
An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed.
Allows companies to begin storing data that was previously thrown away.
Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate into existing data infrastructure.
ETL clusters are easy to use and manage, but are often inefficient. ELT is efficient, but spends expensive DWH cycles on low-value transformations, wastes storage on temp data and can cause missed SLAs. The next step is moving the ETL to Hadoop.
There are many Hadoop technologies for ETL. It’s easy to Google each one and see what it does and how to use it. The trick is to use the right technology at the right time, and avoid some common mistakes.
So, I’m not going to waste time telling you how to use Hadoop. You’ll easily find out. I’ll tell you what to use it for, and how to avoid pitfalls.
Database specific connectors make data ingest about 1000 times faster.
Important note: Sqoop’s connector to Oracle is NOT the Oracle connector for Hadoop. It is called OraOop, and it is both free and open source.
If you get a bunch of files to a directory every day or hour, just copying them to Hadoop is fine. If you can do something in 1 line in shell, do it. Don’t overcomplicate.
You are just starting to learn Hadoop, so make life easier on yourself.
Give yourself a fighting chance
These days there is hardly ever a reason to use plain MapReduce. There are so many other good tools.
Like walking the Huashan plank trail: not very stable and not very fast.
Mottainai – a Japanese term for wasting resources or doing something inefficiently.
Use Hadoop cores to pre-sort and pre-partition the data and turn it into Oracle’s data format. Then load it into Oracle as an append. This uses no redo, very little CPU, and lots of parallelism.
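The pre-sort/pre-partition idea can be sketched in Python (keys, data and the partitioning scheme are illustrative, not Oracle’s actual format):

```python
from collections import defaultdict

# Sketch of "do the expensive work in Hadoop": sort and partition the
# rows on the cheap cluster so the database-side load is a pure append.

rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]

def presort_and_partition(rows, n_parts=2):
    parts = defaultdict(list)
    for key, val in sorted(rows):                # pre-sort on cluster CPUs
        parts[key % n_parts].append((key, val))  # hash-partition by key
    return dict(parts)

ready = presort_and_partition(rows)
# Each partition arrives at the database already sorted, ready for a
# direct-path append: no database-side sorting, minimal redo.
print(ready[1])  # [(1, 'a'), (3, 'c'), (5, 'e')]
```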