Application Architectures with Hadoop - Big Data TechCon SF 2014

Application
Architectures with
Hadoop
Big Data TechCon San Francisco 2014
slideshare.com/hadooparchbook
Mark Grover | @mark_grover
Jonathan Seidman | @jseidman

2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
©2014 Cloudera, Inc. All Rights Reserved.

3
About Us
• Mark
– Software Engineer
– Committer on Apache Bigtop, committer and PPMC member on Apache
Sentry (incubating).
– Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
– @mark_grover
• Jonathan
– Senior Solutions Architect, Partner Enablement.
– Co-founder of Chicago Hadoop User Group and Chicago Big Data.
– jseidman@cloudera.com
– @jseidman

4
Case Study
Clickstream Analysis

5
Analytics

6
Analytics

7
Web Logs – Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
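
A minimal sketch of how a Combined Log Format line like the ones above might be split into fields. The regular expression, the field names and the `parse_line` helper are assumptions for illustration, not code from the talk. The sample lines on the slide carry no timezone offset, so only the date and time are parsed.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Keep only the first 20 characters (dd/Mon/yyyy:HH:MM:SS) so that a
    # missing or present timezone offset does not affect parsing.
    record["time"] = datetime.strptime(
        record["time"].strip()[:20], "%d/%b/%Y:%H:%M:%S"
    )
    return record


sample = (
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
    '200 4463 "http://bestcyclingreviews.com/top_online_shops" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"'
)
record = parse_line(sample)
```

Lines that do not match the pattern come back as `None`, which gives a later filtering step a simple way to spot incomplete records.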

8
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

9
Challenges of Hadoop Implementation

10
Challenges of Hadoop Implementation

11
Hadoop Architectural Considerations
• Storage managers?
– HDFS? HBase?
• Data storage and modeling:
– File formats? Compression? Schema design?
• Data movement
– How do we actually get the data into Hadoop? How do we get it out?
• Metadata
– How do we manage data about the data?
• Data access and processing
– How will the data be accessed once in Hadoop? How can we transform it? How do
we query it?
• Orchestration
– How do we manage the workflow for all of this?

12
Architectural
Considerations
Data Storage and Modeling

13
Data Modeling Considerations
• We need to consider the following in our architecture:
– Storage layer – HDFS? HBase? Etc.
– File system schemas – how will we lay out the data?
– File formats – what storage formats to use for our data, both raw and
processed data?
– Data compression formats?

14
Architectural
Considerations
Data Modeling – Storage Layer

15
Data Storage Layer Choices
• Two likely choices for raw data:

16
Data Storage Layer Choices
• HDFS
– Stores data directly as files
– Fast scans
– Poor random reads/writes
• HBase
– Stores data as HFiles on HDFS
– Slow scans
– Fast random reads/writes

17
Data Storage – Storage Manager Considerations
• Incoming raw data:
– Processing requirements call for batch transformations across multiple
records – for example sessionization.
• Processed data:
– Access to processed data will be via things like analytical queries – again
requiring access to multiple records.
• We choose HDFS
– Processing needs in this case served better by fast scans.

18
Architectural
Considerations
Data Modeling – Raw Data Storage

19
Storage Formats – Raw Data and Processed Data
[Figure: raw data and processed data]

20
Data Storage – Format Considerations
[Figure: multiple log files (plain text)]

21
Data Storage – Compression
snappy
Well, maybe.
But not splittable.
X
Splittable. Getting
better…
Splittable, but no... Hmmm….

22
Data Storage – More About Snappy
– Designed at Google to provide high compression speeds with reasonable
compression.
– Not the highest compression, but provides very good performance for
processing on Hadoop.
– Snappy is not splittable though, which brings us to…

23
Hadoop File Types
• Formats designed specifically to store and process data on Hadoop:
– File based – Sequence File
– Serialization formats – Thrift, Protocol Buffers, Avro
– Columnar formats – RCFile, ORC, Parquet

24
SequenceFile
• Stores records as
binary key/value pairs.
• SequenceFile “blocks”
can be compressed.
• This enables
splittability with non-splittable
compression.
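
A toy sketch of the idea behind block compression: if records are compressed in independent blocks, a worker can decompress one block without reading the rest of the file. This is not the SequenceFile format itself (which also uses sync markers between blocks), and `zlib` from the Python standard library stands in for Snappy here. The helper names and the block size are assumptions for illustration.

```python
import zlib


def write_blocks(records, records_per_block=2):
    """Compress records in independent blocks; return the compressed blocks."""
    blocks = []
    for start in range(0, len(records), records_per_block):
        chunk = "\n".join(records[start:start + records_per_block])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def read_block(blocks, index):
    """Decompress a single block without touching any other block."""
    return zlib.decompress(blocks[index]).decode("utf-8").split("\n")


records = ["click-1", "click-2", "click-3", "click-4", "click-5"]
blocks = write_blocks(records)

# A worker assigned only the second block can read it on its own.
second = read_block(blocks, 1)
```

Compressing the whole file as one stream would instead force every reader to start from the first byte, which is what makes a plain non-splittable codec a poor fit for parallel processing.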

25
Avro
• Kinda SequenceFile on
Steroids.
• Self-documenting –
stores schema in
header.
• Provides very efficient
storage.
• Supports splittable
compression.

26
Our Format Choices…
• Avro with Snappy
– Snappy provides optimized compression.
– Avro provides compact storage, self-documenting files, and supports schema
evolution.
– Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly
supported by ingestion tools in the ecosystem.
– But they only support Java.

27
Architectural
Considerations
Data Modeling – Processed Data Storage

28
Storage Formats – Raw Data and Processed Data
[Figure: raw data and processed data]

29
Access to Processed Data
Column A Column B Column C Column D
Value Value Value Value
Analytical
Queries

30
Columnar Formats
• Eliminate I/O for columns that are not part of a query.
• Work well for queries that access a subset of columns.
• Often provide better compression.
• These add up to dramatically improved performance for many
queries.
Row-oriented layout:
1 2014-10-13 abc
2 2014-10-14 def
3 2014-10-15 ghi
Column-oriented layout:
1 2 3
2014-10-13 2014-10-14 2014-10-15
abc def ghi
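
A small sketch of the difference between the two layouts, using three sample records modeled on the slide. The column names `id`, `date` and `value` are hypothetical, and plain Python lists stand in for on-disk storage.

```python
rows = [
    (1, "2014-10-13", "abc"),
    (2, "2014-10-14", "def"),
    (3, "2014-10-15", "ghi"),
]
column_names = ("id", "date", "value")  # hypothetical column names

# Row-oriented: all values of one record are stored next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: all values of one column are stored next to each other.
columns = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}
column_layout = [value for name in column_names for value in columns[name]]

# A query that needs only the dates reads a single column and skips the rest.
dates = columns["date"]
```

Values in one column also tend to resemble each other, which is one reason columnar formats often compress better.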

31
Columnar Choices – RCFile
• Provides efficient processing for MapReduce applications, but
generally only used with Hive.
• Only supports Java.
• No Avro support.
• Limited compression support.
• Sub-optimal performance compared to newer columnar formats.

32
Columnar Choices – ORC
• A better RCFile.
• Supports Hive, but currently limited support for other interfaces.
• Only supports Java.

33
Columnar Choices – Parquet
• Supports multiple programming interfaces – MapReduce, Hive,
Impala, Pig.
• Multiple language support.
• Broad vendor support.
• These features make Parquet a good choice for our processed data.

34
Architectural
Considerations
Data Modeling – HDFS Schema Design

35
Recommended HDFS Schema Design
• How to lay out data on HDFS?

36
Recommended HDFS Schema Design
/user/<username> - User specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows

37
Architectural
Considerations
Data Modeling – Advanced HDFS Schema
Design

38
What is Partitioning?
Un-partitioned HDFS directory structure
dataset
file1.txt
file2.txt
.
.
.
filen.txt
Partitioned HDFS directory structure
dataset
col=val1/file.txt
col=val2/file.txt
.
.
.
col=valn/file.txt

39
What is Partitioning?
Un-partitioned HDFS directory structure
clicks
clicks-2014-01-01.txt
.
.
.
Partitioned HDFS directory structure
clicks
dt=2014-01-01/clicks.txt
.
.
.

40
Partitioning
• Split the dataset into smaller consumable chunks
• Rudimentary form of “indexing”
• <data set name>/<partition_column_name=partition_column_value>/{files}

41
Partitioning considerations
• What column to partition by?
– HDFS is append only.
– Avoid too many partitions (keep the count under 10,000).
– Avoid many small files in the partitions (files should generally be larger
than the HDFS block size).
• We decided to partition by timestamp

42
Partitioning For Our Case Study
• Raw dataset:
– /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
• Processed dataset:
– /data/bikeshop/clickstream/year=2014/month=10/day=10
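
A minimal sketch of deriving the partition directory from a click timestamp, using the two base paths from the slide. The `partition_path` helper is hypothetical, and zero-padding of month and day is an assumption (the slide only shows two-digit values).

```python
from datetime import datetime


def partition_path(base_dir, timestamp):
    """Build a year=/month=/day= partition directory for a click timestamp."""
    return "{}/year={:04d}/month={:02d}/day={:02d}".format(
        base_dir.rstrip("/"), timestamp.year, timestamp.month, timestamp.day
    )


click_time = datetime(2014, 10, 10, 21, 8, 30)
raw_dir = partition_path("/etl/BI/casualcyclist/clicks/rawlogs", click_time)
processed_dir = partition_path("/data/bikeshop/clickstream", click_time)
```

A query restricted to one day then only needs to read the files under that day's directory.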

43
A Word About Partitioning Alternatives
year=2014/month=10/day=10 vs. dt=2014-10-10

44
Architectural
Considerations
Data Ingestion

45
File Transfers
• "hadoop fs -put <file>"
• Reliable, but not
resilient to failure.
• Other options are
mountable HDFS, for
example NFSv3.

46
Streaming Ingestion
• Flume
– Reliable, distributed, and available system for efficient collection, aggregation
and movement of streaming data, e.g. logs.
• Kafka
– Reliable and distributed publish-subscribe messaging system.

47
Flume vs. Kafka
• Flume
– Purpose built for Hadoop data ingest.
– Pre-built sinks for HDFS, HBase, etc.
– Supports transformation of data in-flight.
• Kafka
– General pub-sub messaging framework.
– Hadoop not supported, requires 3rd-party component (Camus).
– Just a message transport (a very fast one).

48
Flume vs. Kafka
• Bottom line:
– Flume very well integrated with Hadoop ecosystem, well suited to ingestion of
sources such as log files.
– Kafka is a highly reliable and scalable enterprise messaging system, and
great for scaling out to multiple consumers.

49
Short Intro to Flume
Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks
• Sources: Twitter, logs, JMS, webserver
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file
• Sinks: HDFS, HBase, Solr

50
A Quick Introduction to Flume
• Reliable – events are stored in channel until delivered to next
stage.
• Recoverable – events can be persisted to disk and recovered in
the event of failure.
[Figure: Flume Agent (Source → Channel → Sink) → Destination]

51
A Quick Introduction to Flume
• Declarative
– No coding required.
– Configuration specifies
how components are
wired together.

52
A Brief Discussion of Flume Patterns – Fan-in
• Flume agent runs on
each of our servers.
• These agents send
data to multiple agents
to provide reliability.
• Flume provides support
for load balancing.

53
Sqoop Overview
• Apache project designed to ease import and export of data between
Hadoop and external data stores such as relational databases.
• Great for doing bulk imports and exports of data between HDFS,
Hive and HBase and an external data store. Not suited for ingesting
event based data.

54
Ingestion Decisions
• Historical Data
– Smaller files: file transfer
– Larger files: Flume with spooling directory source.
• Incoming Data
– Flume with the spooling directory source.
• Relational Data Sources – ODS, CRM, etc.
– Sqoop

55
Architectural
Considerations
Data Processing – Engines

56
Data flow
Raw data
Partitioned
clickstream data
Other data
(Financial,
CRM, etc.)
Aggregated
dataset #2
Aggregated
dataset #1

57
Processing Engines
• MapReduce
• Abstractions – Pig, Hive, Cascading, Crunch
• Spark
• Impala

58
MapReduce
• Oldie but goody
• Restrictive Framework / Innovated Work Around
• Extreme Batch

59
MapReduce Basic High Level
Mapper
HDFS
(Replicated)
Native File System
Block of
Data
Temp Spill
Data
Partitioned
Sorted Data
Reducer
Reducer
Local Copy
Output File

60
Abstractions
• SQL
– Hive
• Script/Code
– Pig: Pig Latin
– Crunch: Java/Scala
– Cascading: Java/Scala

61
Spark
• The New Kid that isn’t that New Anymore
• Easily 10x less code
• Extremely Easy and Powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG Engine

62
Spark - DAG

63
Impala
• Real-time open source MPP style engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC

64
What processing needs to happen?
• Sessionization
• Filtering
• Deduplication
• BI / Discovery

65
Sessionization
Website visit
Visitor 1
Session 1
Visitor 1
Session 2
Visitor 2
Session 1
> 30 minutes

66
Why sessionize?
Helps answer questions like:
• What is my website’s bounce rate?
– i.e. what percentage of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ad, etc.) are
leading to most sessions?
– Which ones of those lead to most conversions (e.g. people buying things,
signing up, etc.)
• Do attribution analysis – which channels are responsible for most
conversions?
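
A sketch of how bounce rate could be computed once clicks have been grouped into sessions. It assumes each session is represented as a list of the pages viewed and treats a single-page session as a bounce. Both the representation and the `bounce_rate` helper are assumptions for illustration.

```python
def bounce_rate(sessions):
    """Percentage of sessions that consist of exactly one page view."""
    if not sessions:
        return 0.0
    bounces = sum(1 for pages in sessions if len(pages) == 1)
    return 100.0 * bounces / len(sessions)


sessions = [
    ["/seatposts"],
    ["/seatposts", "/Store/cart.jsp?productID=1023"],
    ["/"],
    ["/", "/seatposts", "/Store/cart.jsp?productID=1023"],
]
rate = bounce_rate(sessions)
```

Here two of the four sessions stop at the landing page, so the computed rate is 50 percent.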

67
How to Sessionize?
1. Given a list of clicks, determine which clicks
came from the same user
2. Given a particular user's clicks, determine if a
given click is a part of a new session or a
continuation of the previous session

68
#1 – Which clicks are from the same user?
• We can use:
– IP address (244.157.45.12)
– Cookies (A9A3BECE0563982D)
– IP address (244.157.45.12) and user agent string ("(KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")

70
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

71
#2 – Which clicks are part of the same session?
> 30 mins apart = different sessions
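
The two sessionization steps can be sketched as follows. This is an illustrative in-memory version, not the MR or Spark code from the book's GitHub repository. The `sessionize` helper, the tuple layout, the use of the IP address alone as the user key, and the `/saddles` click are all assumptions.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(clicks):
    """Assign a per-user session number to each click.

    clicks: iterable of (user_key, timestamp, url) tuples.
    Returns a list of (user_key, session_number, timestamp, url) tuples.
    """
    result = []
    # Step 1: bring each user's clicks together, ordered by time.
    ordered = sorted(clicks, key=lambda c: (c[0], c[1]))
    for user_key, user_clicks in groupby(ordered, key=lambda c: c[0]):
        session_number = 0
        previous_time = None
        # Step 2: a gap of more than 30 minutes starts a new session.
        for _, timestamp, url in user_clicks:
            if previous_time is None or timestamp - previous_time > SESSION_TIMEOUT:
                session_number += 1
            previous_time = timestamp
            result.append((user_key, session_number, timestamp, url))
    return result


clicks = [
    ("244.157.45.12", datetime(2014, 10, 17, 21, 8, 30), "/seatposts"),
    ("244.157.45.12", datetime(2014, 10, 17, 21, 59, 59), "/Store/cart.jsp?productID=1023"),
    ("244.157.45.12", datetime(2014, 10, 17, 21, 20, 0), "/saddles"),  # hypothetical click
]
sessions = sessionize(clicks)
```

In a distributed engine the sort-and-group step maps naturally onto a shuffle keyed by user, with the gap check applied to each user's clicks after they have been brought together.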

72
Sessionization engine recommendation
• We have sessionization code in MR and Spark on GitHub. The
complexity of the code varies; the right choice depends on the expertise in
the organization.
• We choose MR, since it’s fairly simple and maintainable code.

73
Filtering – filter out incomplete records
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…

74
Filtering – filter out records from bots/spiders
Google spider IP address

Filtering recommendation
• Bot/Spider filtering can be done easily in any of the engines
• Incomplete records are harder to filter in schema systems like
Hive, Impala, Pig, etc.
• Pretty close choice between MR, Hive and Spark
• Can be done in Flume interceptors as well
• We can simply embed this in our sessionization job
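
A sketch of both filters as plain predicates that could be embedded in a sessionization job. The required field names, the bot markers in the user agent, and the bot IP address are hypothetical examples, not values from the talk.

```python
REQUIRED_FIELDS = ("ip", "time", "request", "status", "agent")
BOT_MARKERS = ("bot", "spider", "crawler")  # hypothetical marker list


def is_complete(record):
    """True if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def is_bot(record, bot_ips=frozenset()):
    """True if the IP is a known bot address or the agent looks like a bot."""
    agent = (record.get("agent") or "").lower()
    return record.get("ip") in bot_ips or any(m in agent for m in BOT_MARKERS)


def keep(record, bot_ips=frozenset()):
    return is_complete(record) and not is_bot(record, bot_ips)


bot_ips = frozenset({"192.0.2.10"})  # hypothetical spider address
records = [
    {"ip": "244.157.45.12", "time": "17/Oct/2014:21:08:30",
     "request": "GET /seatposts HTTP/1.0", "status": "200",
     "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"},
    {"ip": None, "time": None, "request": None, "status": "200",
     "agent": "Mozilla/5.0 (Linux; U"},  # incomplete record
    {"ip": "192.0.2.10", "time": "17/Oct/2014:21:09:00",
     "request": "GET / HTTP/1.0", "status": "200",
     "agent": "Mozilla/5.0"},  # known bot IP
    {"ip": "244.157.45.13", "time": "17/Oct/2014:21:10:00",
     "request": "GET / HTTP/1.0", "status": "200",
     "agent": "ExampleBot/1.0"},  # bot-like user agent
]
kept = [r for r in records if keep(r, bot_ips)]
```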

76
Deduplication – remove duplicate records

Deduplication recommendation
• Can be done in all engines.
• We already have a Hive table with all the columns, so a simple
DISTINCT query will perform deduplication.
• We use Pig
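
For illustration only, exact-duplicate removal can be sketched in a few lines of Python. The `deduplicate` helper and the tuple representation of a record are assumptions. It keeps one copy of each identical record, which is what a DISTINCT query does, although here the input order is also preserved.

```python
def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


records = [
    ("244.157.45.12", "17/Oct/2014:21:08:30", "/seatposts"),
    ("244.157.45.12", "17/Oct/2014:21:08:30", "/seatposts"),
    ("244.157.45.12", "17/Oct/2014:21:59:59", "/Store/cart.jsp?productID=1023"),
]
unique = deduplicate(records)
```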

BI/Discovery engine recommendation
• Main requirements for this are:
– Low latency
– SQL interface (e.g. JDBC/ODBC)
– Users don’t know how to code
• We chose Impala
– It’s a SQL engine
– Much faster than other engines
– Provides standard JDBC/ODBC interfaces

79
Architectural
Considerations
Orchestration

80
Orchestrating Clickstream
• Data arrives through Flume
• Triggers a processing event:
– Sessionize
– Enrich – Location, marketing channel…
– Store as Parquet
• Each day we process events from the previous day

• Workflow is fairly simple
• Need to trigger workflow based on data
• Be able to recover from errors
• Perhaps notify on the status
• And collect metrics for reporting
Choosing Right

Choosing…
Easier in Oozie

Choosing the right Orchestration Tool
Better in Azkaban

Important Decision Consideration!
The best orchestration tool
is the one you are an expert on

86
Putting It All
Together
Final Architecture

87
Final Architecture – High Level Overview
Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis

88
Data
Processing
Data
Reporting/
Analysis

89
Final Architecture – Ingestion
[Figure: Web App + Avro Agent (×8) → Flume Agent (×4) → HDFS]
• Fan-in pattern
• Multiple agents for failover and rolling restarts

90
Data
Processing
Data
Reporting/
Analysis

91
Final Architecture – Storage and Processing
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…
Data Processing
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…

92
Data
Processing
Data
Reporting/
Analysis

93
Final Architecture – Data Access
Hive/
Impala
BI/
Analytics
Tools
JDBC/ODBC
DWH
Sqoop
Local
Disk
R, etc.
DB import tool

95
Contact info
• Mark Grover
– @mark_grover
– www.linkedin.com/in/grovermark
• Jonathan Seidman
– jseidman@cloudera.com
– @jseidman
– www.linkedin.com/in/jaseidman
– www.slideshare.net/jseidman
• Slides at slideshare.net/hadooparchbook
