SlideShare ist ein Scribd-Unternehmen logo
1 von 95
Downloaden Sie, um offline zu lesen
Application 
Architectures with 
Hadoop 
Big Data TechCon San Francisco 2014 
slideshare.com/hadooparchbook 
Mark Grover | @mark_grover 
Jonathan Seidman | @jseidman
2 
About the book 
• @hadooparchbook 
• hadooparchitecturebook.com 
• github.com/hadooparchitecturebook 
• slideshare.com/hadooparchbook 
©2014 Cloudera, Inc. All Rights Reserved.
3 
About Us 
• Mark 
– Software Engineer 
– Committer on Apache Bigtop, committer and PPMC member on Apache 
Sentry (incubating). 
– Contributor to Hadoop, Hive, Spark, Sqoop, Flume. 
– @mark_grover 
• Jonathan 
– Senior Solutions Architect, Partner Enablement. 
– Co-founder of Chicago Hadoop User Group and Chicago Big Data. 
– jseidman@cloudera.com 
– @jseidman 
©2014 Cloudera, Inc. All Rights Reserved.
4 
Case Study 
Clickstream Analysis
5 
Analytics 
©2014 Cloudera, Inc. All Rights Reserved.
6 
Analytics 
©2014 Cloudera, Inc. All Rights Reserved.
7 
Web Logs – Combined Log Format 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ 
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile 
Safari/533.1” 
©2014 Cloudera, Inc. All Rights Reserved.
8 
Clickstream Analytics 
©2014 Cloudera, Inc. All Rights Reserved. 
244.157.45.12 - - [17/Oct/ 
2014:21:08:30 ] "GET /seatposts 
HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/ 
top_online_shops" "Mozilla/5.0 
(Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like 
Gecko) Chrome/36.0.1944.0 Safari/ 
537.36”
9 
Challenges of Hadoop Implementation 
©2014 Cloudera, Inc. All Rights Reserved.
10 
Challenges of Hadoop Implementation 
©2014 Cloudera, Inc. All Rights Reserved.
11 
Hadoop Architectural Considerations 
• Storage managers? 
– HDFS? HBase? 
• Data storage and modeling: 
– File formats? Compression? Schema design? 
• Data movement 
– How do we actually get the data into Hadoop? How do we get it out? 
• Metadata 
– How do we manage data about the data? 
• Data access and processing 
– How will the data be accessed once in Hadoop? How can we transform it? How do 
we query it? 
• Orchestration 
– How do we manage the workflow for all of this? 
©2014 Cloudera, Inc. All Rights Reserved.
12 
Architectural 
Considerations 
Data Storage and Modeling
13 
Data Modeling Considerations 
• We need to consider the following in our architecture: 
– Storage layer – HDFS? HBase? Etc. 
– File system schemas – how will we lay out the data? 
– File formats – what storage formats to use for our data, both raw and 
processed data? 
– Data compression formats? 
©2014 Cloudera, Inc. All Rights Reserved.
14 
Architectural 
Considerations 
Data Modeling – Storage Layer
15 
Data Storage Layer Choices 
• Two likely choices for raw data: 
©2014 Cloudera, Inc. All Rights Reserved.
16 
Data Storage Layer Choices 
• Stores data directly as files 
• Fast scans 
• Poor random reads/writes 
• Stores data as Hfiles on 
HDFS 
• Slow scans 
• Fast random reads/writes 
©2014 Cloudera, Inc. All Rights Reserved.
17 
Data Storage – Storage Manager Considerations 
• Incoming raw data: 
– Processing requirements call for batch transformations across multiple 
records – for example sessionization. 
• Processed data: 
– Access to processed data will be via things like analytical queries – again 
requiring access to multiple records. 
• We choose HDFS 
– Processing needs in this case served better by fast scans. 
©2014 Cloudera, Inc. All Rights Reserved.
18 
Architectural 
Considerations 
Data Modeling – Raw Data Storage
Storage Formats – Raw Data and Processed Data 
19 
©2014 Cloudera, Inc. All Rights Reserved. 
Processed 
Data 
Raw 
Data
20 
Data Storage – Format Considerations 
Click to enter confidentiality information 
Logs 
(plain 
text) 
Logs 
(plain 
Lotgesx t) 
(plain 
text) 
Logs 
(plain 
text) 
Logs 
(plain 
text) 
Logs 
(plain 
text) 
Logs 
(plain 
text) 
Logs 
(plain 
text)
21 
Data Storage – Compression 
Click to enter confidentiality information 
snappy 
Well, maybe. 
But not splittable. 
X 
Splittable. Getting 
better… 
Splittable, but no... Hmmm….
22 
Data Storage – More About Snappy 
– Designed at Google to provide high compression speeds with reasonable 
compression. 
– Not the highest compression, but provides very good performance for 
processing on Hadoop. 
– Snappy is not splittable though, which brings us to… 
©2014 Cloudera, Inc. All Rights Reserved.
23 
Hadoop File Types 
• Formats designed specifically to store and process data on Hadoop: 
– File based – Sequence File 
– Serialization formats – Thrift, Protocol Buffers, Avro 
– Columnar formats – RCFile, ORC, Parquet 
Click to enter confidentiality information
24 
SequenceFile 
• Stores records as 
binary key/value pairs. 
• SequenceFile “blocks” 
can be compressed. 
• This enables 
splittability with non-splittable 
compression. 
©2014 Cloudera, Inc. All Rights Reserved.
25 
Avro 
• Kinda SequenceFile on 
Steroids. 
• Self-documenting – 
stores schema in 
header. 
• Provides very efficient 
storage. 
• Supports splittable 
compression. 
©2014 Cloudera, Inc. All Rights Reserved.
26 
Our Format Choices… 
• Avro with Snappy 
– Snappy provides optimized compression. 
– Avro provides compact storage, self-documenting files, and supports schema 
evolution. 
– Avro also provides better failure handling than other choices. 
• SequenceFiles would also be a good choice, and are directly 
supported by ingestion tools in the ecosystem. 
– But only supports Java. 
©2014 Cloudera, Inc. All Rights Reserved.
27 
Architectural 
Considerations 
Data Modeling – Processed Data Storage
Storage Formats – Raw Data and Processed Data 
28 
©2014 Cloudera, Inc. All Rights Reserved. 
Processed 
Data 
Raw 
Data
29 
Access to Processed Data 
Column A Column B Column C Column D 
Value Value Value Value 
Value Value Value Value 
Value Value Value Value 
Value Value Value Value 
©2014 Cloudera, Inc. All Rights Reserved. 
Analytical 
Queries
30 
Columnar Formats 
• Eliminates I/O for columns that are not part of a query. 
• Works well for queries that access a subset of columns. 
• Often provide better compression. 
• These add up to dramatically improved performance for many 
queries. 
©2014 Cloudera, Inc. All Rights Reserved. 
1 2014-10-1 
3 
abc 
2 2014-10-1 
4 
def 
3 2014-10-1 
5 
ghi 
1 2 3 
2014-10-1 
2014-10-1 
3 
4 
2014-10-1 
5 
abc def ghi
31 
Columnar Choices – RCFile 
• Provides efficient processing for MapReduce applications, but 
generally only used with Hive. 
• Only supports Java. 
• No Avro support. 
• Limited compression support. 
• Sub-optimal performance compared to newer columnar formats. 
©2014 Cloudera, Inc. All Rights Reserved.
32 
Columnar Choices – ORC 
• A better RCFile. 
• Supports Hive, but currently limited support for other interfaces. 
• Only supports Java. 
©2014 Cloudera, Inc. All Rights Reserved.
33 
Columnar Choices – Parquet 
• Supports multiple programming interfaces – MapReduce, Hive, 
Impala, Pig. 
• Multiple language support. 
• Broad vendor support. 
• These features make Parquet a good choice for our processed data. 
©2014 Cloudera, Inc. All Rights Reserved.
34 
Architectural 
Considerations 
Data Modeling – HDFS Schema Design
35 
Recommended HDFS Schema Design 
• How to lay out data on HDFS? 
©2014 Cloudera, Inc. All Rights Reserved.
36 
Recommended HDFS Schema Design 
/user/<username> - User specific data, jars, conf files 
/etl – Data in various stages of ETL workflow 
/tmp – temp data from tools or shared between users 
/data – shared data for the entire organization 
/app – Everything but data: UDF jars, HQL files, Oozie workflows 
©2014 Cloudera, Inc. All Rights Reserved.
37 
Architectural 
Considerations 
Data Modeling – Advanced HDFS Schema 
Design
38 
What is Partitioning? 
dataset 
col=val1/file.txt 
col=val2/file.txt 
. 
. 
. 
col=valn/file.txt 
Un-partitioned HDFS directory 
dataset 
file1.txt 
file2.txt 
. 
. 
. 
filen.txt 
structure 
Partitioned HDFS directory 
structure 
©2014 Cloudera, Inc. All Rights Reserved.
39 
What is Partitioning? 
clicks 
dt=2014-01-01/clicks.txt 
dt=2014-01-02/clicks.txt 
. 
. 
. 
dt=2014-03-31/clicks.txt 
Un-partitioned HDFS directory 
structure 
clicks 
clicks-2014-01-01.txt 
clicks-2014-01-02.txt 
. 
. 
. 
clicks-2014-03-31.txt 
Partitioned HDFS directory 
structure 
©2014 Cloudera, Inc. All Rights Reserved.
40 
Partitioning 
• Split the dataset into smaller consumable chunks 
• Rudimentary form of “indexing” 
• <data set name>/<partition_column_name=partition_column_value>/ 
{files} 
©2014 Cloudera, Inc. All Rights Reserved.
41 
Partitioning considerations 
• What column to partition by? 
– HDFS is append only. 
– Don’t have too many partitions (<10,000) 
– Don’t have too many small files in the partitions (more than block size 
generally) 
• We decided to partition by timestamp 
©2014 Cloudera, Inc. All Rights Reserved.
42 
Partitioning For Our Case Study 
• Raw dataset: 
– /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10! 
• Processed dataset: 
– /data/bikeshop/clickstream/year=2014/month=10/day=10! 
©2014 Cloudera, Inc. All Rights Reserved.
43 
A Word About Partitioning Alternatives 
year=2014/month=10/day=10 dt=2014-10-10 
©2014 Cloudera, Inc. All Rights Reserved.
44 
Architectural 
Considerations 
Data Ingestion
45 
File Transfers 
• “hadoop fs –put <file>” 
• Reliable, but not 
resilient to failure. 
• Other options are 
mountable HDFS, for 
example NFSv3. 
©2014 Cloudera, Inc. All Rights Reserved.
46 
Streaming Ingestion 
• Flume 
– Reliable, distributed, and available system for efficient collection, aggregation 
and movement of streaming data, e.g. logs. 
• Kafka 
– Reliable and distributed publish-subscribe messaging system. 
©2014 Cloudera, Inc. All Rights Reserved.
47 
Flume vs. Kafka 
• Purpose built for 
Hadoop data ingest. 
• Pre-built sinks for 
HDFS, HBase, etc. 
• Supports 
transformation of data 
in-flight. 
• General pub-sub 
messaging framework. 
• Hadoop not supported, 
requires 3rd-party 
component (Camus). 
• Just a message 
transport (a very fast 
one). 
©2014 Cloudera, Inc. All Rights Reserved.
48 
Flume vs. Kafka 
• Bottom line: 
– Flume very well integrated with Hadoop ecosystem, well suited to ingestion of 
sources such as log files. 
– Kafka is a highly reliable and scalable enterprise messaging system, and 
great for scaling out to multiple consumers. 
©2014 Cloudera, Inc. All Rights Reserved.
49 
Short Intro to Flume 
Sources Interceptors Selectors Channels Sinks 
Flume Agent 
Twitter, logs, JMS, 
webserver 
Mask, re-format, 
validate… 
DR, critical Memory, file HDFS, HBase, 
Solr
50 
A Quick Introduction to Flume 
• Reliable – events are stored in channel until delivered to next 
stage. 
• Recoverable – events can be persisted to disk and recovered in 
the event of failure. 
Flume Agent 
Source Channel Sink Destination 
©2014 Cloudera, Inc. All Rights Reserved.
51 
A Quick Introduction to Flume 
• Declarative 
– No coding required. 
– Configuration specifies 
how components are 
wired together. 
©2014 Cloudera, Inc. All Rights Reserved.
52 
A Brief Discussion of Flume Patterns – Fan-in 
• Flume agent runs on 
each of our servers. 
• These agents send 
data to multiple agents 
to provide reliability. 
• Flume provides support 
for load balancing. 
©2014 Cloudera, Inc. All Rights Reserved.
53 
Sqoop Overview 
• Apache project designed to ease import and export of data between 
Hadoop and external data stores such as relational databases. 
• Great for doing bulk imports and exports of data between HDFS, 
Hive and HBase and an external data store. Not suited for ingesting 
event based data. 
©2014 Cloudera, Inc. All Rights Reserved.
54 
Ingestion Decisions 
• Historical Data 
– Smaller files: file transfer 
– Larger files: Flume with spooling directory source. 
• Incoming Data 
– Flume with the spooling directory source. 
• Relational Data Sources – ODS, CRM, etc. 
– Sqoop 
©2014 Cloudera, Inc. All Rights Reserved.
55 
Architectural 
Considerations 
Data Processing – Engines
56 
Data flow 
Raw data 
Partitioned 
clickstream data 
Other data 
(Financial, 
CRM, etc.) 
Aggregated 
dataset #2 
Aggregated 
dataset #1 
©2014 Cloudera, Inc. All Rights Reserved.
57 
Processing Engines 
• MapReduce 
• Abstractions – Pig, Hive, Cascading, Crunch 
• Spark 
• Impala 
Confidentiality Information Goes Here
58 
MapReduce 
• Oldie but goody 
• Restrictive Framework / Innovated Work Around 
• Extreme Batch 
Confidentiality Information Goes Here
59 
MapReduce Basic High Level 
Confidentiality Information Goes Here 
Mapper 
HDFS 
(Replicated) 
Native File System 
Block of 
Data 
Temp Spill 
Data 
Partitioned 
Sorted Data 
Reducer 
Reducer 
Local Copy 
Output File
60 
Abstractions 
• SQL 
– Hive 
• Script/Code 
– Pig: Pig Latin 
– Crunch: Java/Scala 
– Cascading: Java/Scala 
Confidentiality Information Goes Here
61 
Spark 
• The New Kid that isn’t that New Anymore 
• Easily 10x less code 
• Extremely Easy and Powerful API 
• Very good for machine learning 
• Scala, Java, and Python 
• RDDs 
• DAG Engine 
Confidentiality Information Goes Here
62 
Spark - DAG 
Confidentiality Information Goes Here
63 
Impala 
• Real-time open source MPP style engine for Hadoop 
• Doesn’t build on MapReduce 
• Written in C++, uses LLVM for run-time code generation 
• Can create tables over HDFS or HBase data 
• Accesses Hive metastore for metadata 
• Access available via JDBC/ODBC 
©2014 Cloudera, Inc. All Rights Reserved.
64 
What processing needs to happen? 
Confidentiality Information Goes Here 
• Sessionization 
• Filtering 
• Deduplication 
• BI / Discovery
65 
Sessionization 
Confidentiality Information Goes Here 
Website visit 
Visitor 1 
Session 1 
Visitor 1 
Session 2 
Visitor 2 
Session 1 
> 30 minutes
66 
Why sessionize? 
Helps answers questions like: 
• What is my website’s bounce rate? 
– i.e. how many % of visitors don’t go past the landing page? 
• Which marketing channels (e.g. organic search, display ad, etc.) are 
leading to most sessions? 
– Which ones of those lead to most conversions (e.g. people buying things, 
Confidentiality Information Goes Here 
signing up, etc.) 
• Do attribution analysis – which channels are responsible for most 
conversions?
67 
How to Sessionize? 
Confidentiality Information Goes Here 
1. Given a list of clicks, determine which clicks 
came from the same user 
2. Given a particular user's clicks, determine if a 
given click is a part of a new session or a 
continuation of the previous session
68 
#1 – Which clicks are from same user? 
• We can use: 
– IP address (244.157.45.12) 
– Cookies (A9A3BECE0563982D) 
– IP address (244.157.45.12)and user agent string ((KHTML, like 
Gecko) Chrome/36.0.1944.0 Safari/537.36") 
©2014 Cloudera, Inc. All Rights Reserved.
69 
#1 – Which clicks are from same user? 
• We can use: 
– IP address (244.157.45.12) 
– Cookies (A9A3BECE0563982D) 
– IP address (244.157.45.12)and user agent string ((KHTML, like 
Gecko) Chrome/36.0.1944.0 Safari/537.36") 
©2014 Cloudera, Inc. All Rights Reserved.
70 
#1 – Which clicks are from same user? 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC 
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 
©2014 Cloudera, Inc. All Rights Reserved.
71 
#2 – Which clicks part of the same session? 
> 30 mins apart = different 
sessions 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC 
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 
©2014 Cloudera, Inc. All Rights Reserved.
Sessionization engine recommendation 
• We have sessionization code in MR, Spark on github. The 
complexity of the code varies, depends on the expertise in the 
organization. 
• We choose MR, since it’s fairly simple and maintainable code. 
©2014 Cloudera, Inc. All rights reserved. 72
73 
Filtering – filter out incomplete records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U… 
©2014 Cloudera, Inc. All Rights Reserved.
74 
Filtering – filter out records from bots/spiders 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC 
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 
©2014 Cloudera, Inc. All Rights Reserved. 
Google spider IP address
Filtering recommendation 
• Bot/Spider filtering can be done easily in any of the engines 
• Incomplete records are harder to filter in schema systems like 
Hive, Impala, Pig, etc. 
• Pretty close choice between MR, Hive and Spark 
• Can be done in Flume interceptors as well 
• We can simply embed this in our sessionization job 
©2014 Cloudera, Inc. All rights reserved. 75
76 
Deduplication – remove duplicate records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
©2014 Cloudera, Inc. All Rights Reserved.
Deduplication recommendation 
• Can be done in all engines. 
• We already have a Hive table with all the columns, a simple 
DISTINCT query will perform deduplication 
• We use Pig 
©2014 Cloudera, Inc. All rights reserved. 77
BI/Discovery engine recommendation 
• Main requirements for this are: 
– Low latency 
– SQL interface (e.g. JDBC/ODBC) 
– Users don’t know how to code 
• We chose Impala 
– It’s a SQL engine 
– Much faster than other engines 
– Provides standard JDBC/ODBC interfaces 
©2014 Cloudera, Inc. All rights reserved. 78
79 
Architectural 
Considerations 
Orchestration
80 
Orchestrating Clickstream 
• Data arrives through Flume 
• Triggers a processing event: 
– Sessionize 
– Enrich – Location, marketing channel… 
– Store as Parquet 
• Each day we process events from the previous day
• Workflow is fairly simple 
• Need to trigger workflow based on data 
• Be able to recover from errors 
• Perhaps notify on the status 
• And collect metrics for reporting 
©2014 Cloudera, Inc. All rights reserved. 81 
Choosing Right
82 
Oozie or Azkaban? 
©2014 Cloudera, Inc. All rights reserved.
• Workflow is fairly simple 
• Need to trigger workflow based on data 
• Be able to recover from errors 
• Perhaps notify on the status 
• And collect metrics for reporting 
©2014 Cloudera, Inc. All rights reserved. 83 
Choosing… 
Easier in Oozie
Choosing the right Orchestration Tool 
• Workflow is fairly simple 
• Need to trigger workflow based on data 
• Be able to recover from errors 
• Perhaps notify on the status 
• And collect metrics for reporting 
Better in Azkaban 
©2014 Cloudera, Inc. All rights reserved. 84
Important Decision Consideration! 
The best orchestration tool 
is the one you are an expert on 
©2014 Cloudera, Inc. All rights reserved. 85
86 
Putting It All 
Together 
Final Architecture
87 
Final Architecture – High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
©2014 Cloudera, Inc. All Rights Reserved.
88 
Final Architecture – High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
©2014 Cloudera, Inc. All Rights Reserved.
89 
Final Architecture – Ingestion 
Web App Avro Agent 
Web App Avro Agent 
Web App Avro Agent 
Web App Avro Agent 
Web App Avro Agent 
Web App Avro Agent 
Web App Avro Agent 
Web App Avro Agent 
Flume Agent 
Flume Agent 
Flume Agent 
Flume Agent 
Fan-in 
Pattern 
Multi Agents for 
Failover and rolling 
restarts 
HDFS 
©2014 Cloudera, Inc. All Rights Reserved.
90 
Final Architecture – High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
©2014 Cloudera, Inc. All Rights Reserved.
91 
Final Architecture – Storage and Processing 
/etl/weblogs/20140331/ 
/etl/weblogs/20140401/ 
… 
Data Processing 
/data/marketing/clickstream/ 
bouncerate/ 
/data/marketing/clickstream/ 
attribution/ 
… 
©2014 Cloudera, Inc. All Rights Reserved.
92 
Final Architecture – High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
©2014 Cloudera, Inc. All Rights Reserved.
93 
Final Architecture – Data Access 
Hive/ 
Impala 
BI/ 
Analytics 
Tools 
JDBC/ODBC 
DWH 
Sqoop 
Local 
Disk 
R, etc. 
DB import tool 
©2014 Cloudera, Inc. All Rights Reserved.
Thank you
95 
Contact info 
• Mark Grover 
– @mark_grover 
– www.linkedin.com/in/grovermark 
• Jonathan Seidman 
– jseidman@cloudera.com 
– @jseidman 
– www.linkedin.com/in/jaseidman 
– www.slideshare.net/jseidman 
• Slides at slideshare.net/hadooparchbook 
©2014 Cloudera, Inc. All Rights Reserved.

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 

Was ist angesagt? (20)

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 

Andere mochten auch

Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)
Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)
Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)Spark Summit
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Hortonworks
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...DataStax Academy
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasSpark Summit
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014John Berns
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Spark Summit
 
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Spark Summit
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 

Andere mochten auch (16)

Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)
Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)
Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
 
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 

Ähnlich wie Application Architectures with Hadoop - Big Data TechCon SF 2014

Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Chris Nauroth
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Kevin Crocker
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valleymarkgrover
 
Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Ontico
 
Big Data Conference April 2015
Big Data Conference April 2015Big Data Conference April 2015
Big Data Conference April 2015Aaron Benz
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and FutureDataWorks Summit
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 

Ähnlich wie Application Architectures with Hadoop - Big Data TechCon SF 2014 (20)

Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...
 
Big Data Conference April 2015
Big Data Conference April 2015Big Data Conference April 2015
Big Data Conference April 2015
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 

Kürzlich hochgeladen

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Kürzlich hochgeladen (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Application Architectures with Hadoop - Big Data TechCon SF 2014

  • 1. Application Architectures with Hadoop Big Data TechCon San Francisco 2014 slideshare.com/hadooparchbook Mark Grover | @mark_grover Jonathan Seidman | @jseidman
  • 2. 2 About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook ©2014 Cloudera, Inc. All Rights Reserved.
  • 3. 3 About Us • Mark – Software Engineer – Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating). – Contributor to Hadoop, Hive, Spark, Sqoop, Flume. – @mark_grover • Jonathan – Senior Solutions Architect, Partner Enablement. – Co-founder of Chicago Hadoop User Group and Chicago Big Data. – jseidman@cloudera.com – @jseidman ©2014 Cloudera, Inc. All Rights Reserved.
  • 4. 4 Case Study Clickstream Analysis
  • 5. 5 Analytics ©2014 Cloudera, Inc. All Rights Reserved.
  • 6. 6 Analytics ©2014 Cloudera, Inc. All Rights Reserved.
  • 7. 7 Web Logs – Combined Log Format 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights Reserved.
  • 8. 8 Clickstream Analytics ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/ 2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/ top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/ 537.36”
  • 9. 9 Challenges of Hadoop Implementation ©2014 Cloudera, Inc. All Rights Reserved.
  • 10. 10 Challenges of Hadoop Implementation ©2014 Cloudera, Inc. All Rights Reserved.
  • 11. 11 Hadoop Architectural Considerations • Storage managers? – HDFS? HBase? • Data storage and modeling: – File formats? Compression? Schema design? • Data movement – How do we actually get the data into Hadoop? How do we get it out? • Metadata – How do we manage data about the data? • Data access and processing – How will the data be accessed once in Hadoop? How can we transform it? How do we query it? • Orchestration – How do we manage the workflow for all of this? ©2014 Cloudera, Inc. All Rights Reserved.
  • 12. 12 Architectural Considerations Data Storage and Modeling
  • 13. 13 Data Modeling Considerations • We need to consider the following in our architecture: – Storage layer – HDFS? HBase? Etc. – File system schemas – how will we lay out the data? – File formats – what storage formats to use for our data, both raw and processed data? – Data compression formats? ©2014 Cloudera, Inc. All Rights Reserved.
  • 14. 14 Architectural Considerations Data Modeling – Storage Layer
  • 15. 15 Data Storage Layer Choices • Two likely choices for raw data: ©2014 Cloudera, Inc. All Rights Reserved.
  • 16. 16 Data Storage Layer Choices • Stores data directly as files • Fast scans • Poor random reads/writes • Stores data as Hfiles on HDFS • Slow scans • Fast random reads/writes ©2014 Cloudera, Inc. All Rights Reserved.
  • 17. 17 Data Storage – Storage Manager Considerations • Incoming raw data: – Processing requirements call for batch transformations across multiple records – for example sessionization. • Processed data: – Access to processed data will be via things like analytical queries – again requiring access to multiple records. • We choose HDFS – Processing needs in this case served better by fast scans. ©2014 Cloudera, Inc. All Rights Reserved.
  • 18. 18 Architectural Considerations Data Modeling – Raw Data Storage
  • 19. Storage Formats – Raw Data and Processed Data 19 ©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 20. 20 Data Storage – Format Considerations Click to enter confidentiality information Logs (plain text) Logs (plain Lotgesx t) (plain text) Logs (plain text) Logs (plain text) Logs (plain text) Logs (plain text) Logs (plain text)
  • 21. 21 Data Storage – Compression Click to enter confidentiality information snappy Well, maybe. But not splittable. X Splittable. Getting better… Splittable, but no... Hmmm….
  • 22. 22 Data Storage – More About Snappy – Designed at Google to provide high compression speeds with reasonable compression. – Not the highest compression, but provides very good performance for processing on Hadoop. – Snappy is not splittable though, which brings us to… ©2014 Cloudera, Inc. All Rights Reserved.
  • 23. 23 Hadoop File Types • Formats designed specifically to store and process data on Hadoop: – File based – Sequence File – Serialization formats – Thrift, Protocol Buffers, Avro – Columnar formats – RCFile, ORC, Parquet Click to enter confidentiality information
  • 24. 24 SequenceFile • Stores records as binary key/value pairs. • SequenceFile “blocks” can be compressed. • This enables splittability with non-splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.
  • 25. 25 Avro • Kinda SequenceFile on Steroids. • Self-documenting – stores schema in header. • Provides very efficient storage. • Supports splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.
  • 26. 26 Our Format Choices… • Avro with Snappy – Snappy provides optimized compression. – Avro provides compact storage, self-documenting files, and supports schema evolution. – Avro also provides better failure handling than other choices. • SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem. – But only supports Java. ©2014 Cloudera, Inc. All Rights Reserved.
  • 27. 27 Architectural Considerations Data Modeling – Processed Data Storage
  • 28. Storage Formats – Raw Data and Processed Data 28 ©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 29. 29 Access to Processed Data Column A Column B Column C Column D Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value ©2014 Cloudera, Inc. All Rights Reserved. Analytical Queries
  • 30. 30 Columnar Formats • Eliminates I/O for columns that are not part of a query. • Works well for queries that access a subset of columns. • Often provide better compression. • These add up to dramatically improved performance for many queries. ©2014 Cloudera, Inc. All Rights Reserved. 1 2014-10-1 3 abc 2 2014-10-1 4 def 3 2014-10-1 5 ghi 1 2 3 2014-10-1 2014-10-1 3 4 2014-10-1 5 abc def ghi
  • 31. 31 Columnar Choices – RCFile • Provides efficient processing for MapReduce applications, but generally only used with Hive. • Only supports Java. • No Avro support. • Limited compression support. • Sub-optimal performance compared to newer columnar formats. ©2014 Cloudera, Inc. All Rights Reserved.
  • 32. 32 Columnar Choices – ORC • A better RCFile. • Supports Hive, but currently limited support for other interfaces. • Only supports Java. ©2014 Cloudera, Inc. All Rights Reserved.
  • 33. 33 Columnar Choices – Parquet • Supports multiple programming interfaces – MapReduce, Hive, Impala, Pig. • Multiple language support. • Broad vendor support. • These features make Parquet a good choice for our processed data. ©2014 Cloudera, Inc. All Rights Reserved.
  • 34. 34 Architectural Considerations Data Modeling – HDFS Schema Design
  • 35. 35 Recommended HDFS Schema Design • How to lay out data on HDFS? ©2014 Cloudera, Inc. All Rights Reserved.
  • 36. 36 Recommended HDFS Schema Design /user/<username> - User specific data, jars, conf files /etl – Data in various stages of ETL workflow /tmp – temp data from tools or shared between users /data – shared data for the entire organization /app – Everything but data: UDF jars, HQL files, Oozie workflows ©2014 Cloudera, Inc. All Rights Reserved.
  • 37. 37 Architectural Considerations Data Modeling – Advanced HDFS Schema Design
  • 38. 38 What is Partitioning? dataset col=val1/file.txt col=val2/file.txt . . . col=valn/file.txt Un-partitioned HDFS directory dataset file1.txt file2.txt . . . filen.txt structure Partitioned HDFS directory structure ©2014 Cloudera, Inc. All Rights Reserved.
  • 39. 39 What is Partitioning? clicks dt=2014-01-01/clicks.txt dt=2014-01-02/clicks.txt . . . dt=2014-03-31/clicks.txt Un-partitioned HDFS directory structure clicks clicks-2014-01-01.txt clicks-2014-01-02.txt . . . clicks-2014-03-31.txt Partitioned HDFS directory structure ©2014 Cloudera, Inc. All Rights Reserved.
  • 40. 40 Partitioning • Split the dataset into smaller consumable chunks • Rudimentary form of “indexing” • <data set name>/<partition_column_name=partition_column_value>/ {files} ©2014 Cloudera, Inc. All Rights Reserved.
  • 41. 41 Partitioning considerations • What column to partition by? – HDFS is append only. – Don’t have too many partitions (<10,000) – Don’t have too many small files in the partitions (more than block size generally) • We decided to partition by timestamp ©2014 Cloudera, Inc. All Rights Reserved.
  • 42. 42 Partitioning For Our Case Study • Raw dataset: – /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10! • Processed dataset: – /data/bikeshop/clickstream/year=2014/month=10/day=10! ©2014 Cloudera, Inc. All Rights Reserved.
  • 43. 43 A Word About Partitioning Alternatives year=2014/month=10/day=10 dt=2014-10-10 ©2014 Cloudera, Inc. All Rights Reserved.
  • 45. 45 File Transfers • “hadoop fs –put <file>” • Reliable, but not resilient to failure. • Other options are mountable HDFS, for example NFSv3. ©2014 Cloudera, Inc. All Rights Reserved.
  • 46. 46 Streaming Ingestion • Flume – Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Kafka – Reliable and distributed publish-subscribe messaging system. ©2014 Cloudera, Inc. All Rights Reserved.
  • 47. 47 Flume vs. Kafka • Purpose built for Hadoop data ingest. • Pre-built sinks for HDFS, HBase, etc. • Supports transformation of data in-flight. • General pub-sub messaging framework. • Hadoop not supported, requires 3rd-party component (Camus). • Just a message transport (a very fast one). ©2014 Cloudera, Inc. All Rights Reserved.
  • 48. 48 Flume vs. Kafka • Bottom line: – Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files. – Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers. ©2014 Cloudera, Inc. All Rights Reserved.
  • 49. 49 Short Intro to Flume Sources Interceptors Selectors Channels Sinks Flume Agent Twitter, logs, JMS, webserver Mask, re-format, validate… DR, critical Memory, file HDFS, HBase, Solr
  • 50. 50 A Quick Introduction to Flume • Reliable – events are stored in channel until delivered to next stage. • Recoverable – events can be persisted to disk and recovered in the event of failure. Flume Agent Source Channel Sink Destination ©2014 Cloudera, Inc. All Rights Reserved.
  • 51. 51 A Quick Introduction to Flume • Declarative – No coding required. – Configuration specifies how components are wired together. ©2014 Cloudera, Inc. All Rights Reserved.
  • 52. 52 A Brief Discussion of Flume Patterns – Fan-in • Flume agent runs on each of our servers. • These agents send data to multiple agents to provide reliability. • Flume provides support for load balancing. ©2014 Cloudera, Inc. All Rights Reserved.
  • 53. 53 Sqoop Overview • Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases. • Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event based data. ©2014 Cloudera, Inc. All Rights Reserved.
  • 54. 54 Ingestion Decisions • Historical Data – Smaller files: file transfer – Larger files: Flume with spooling directory source. • Incoming Data – Flume with the spooling directory source. • Relational Data Sources – ODS, CRM, etc. – Sqoop ©2014 Cloudera, Inc. All Rights Reserved.
  • 55. 55 Architectural Considerations Data Processing – Engines
  • 56. 56 Data flow Raw data Partitioned clickstream data Other data (Financial, CRM, etc.) Aggregated dataset #2 Aggregated dataset #1 ©2014 Cloudera, Inc. All Rights Reserved.
  • 57. 57 Processing Engines • MapReduce • Abstractions – Pig, Hive, Cascading, Crunch • Spark • Impala Confidentiality Information Goes Here
  • 58. 58 MapReduce • Oldie but goody • Restrictive Framework / Innovated Work Around • Extreme Batch Confidentiality Information Goes Here
  • 59. 59 MapReduce Basic High Level Confidentiality Information Goes Here Mapper HDFS (Replicated) Native File System Block of Data Temp Spill Data Partitioned Sorted Data Reducer Reducer Local Copy Output File
  • 60. 60 Abstractions • SQL – Hive • Script/Code – Pig: Pig Latin – Crunch: Java/Scala – Cascading: Java/Scala Confidentiality Information Goes Here
  • 61. 61 Spark • The New Kid that isn’t that New Anymore • Easily 10x less code • Extremely Easy and Powerful API • Very good for machine learning • Scala, Java, and Python • RDDs • DAG Engine Confidentiality Information Goes Here
  • 62. 62 Spark - DAG Confidentiality Information Goes Here
  • 63. 63 Impala • Real-time open source MPP style engine for Hadoop • Doesn’t build on MapReduce • Written in C++, uses LLVM for run-time code generation • Can create tables over HDFS or HBase data • Accesses Hive metastore for metadata • Access available via JDBC/ODBC ©2014 Cloudera, Inc. All Rights Reserved.
  • 64. 64 What processing needs to happen? Confidentiality Information Goes Here • Sessionization • Filtering • Deduplication • BI / Discovery
  • 65. 65 Sessionization Confidentiality Information Goes Here Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 66. 66 Why sessionize? Helps answers questions like: • What is my website’s bounce rate? – i.e. how many % of visitors don’t go past the landing page? • Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions? – Which ones of those lead to most conversions (e.g. people buying things, Confidentiality Information Goes Here signing up, etc.) • Do attribution analysis – which channels are responsible for most conversions?
  • 67. 67 How to Sessionize? Confidentiality Information Goes Here 1. Given a list of clicks, determine which clicks came from the same user 2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
  • 68. 68 #1 – Which clicks are from same user? • We can use: – IP address (244.157.45.12) – Cookies (A9A3BECE0563982D) – IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") ©2014 Cloudera, Inc. All Rights Reserved.
  • 69. 69 #1 – Which clicks are from same user? • We can use: – IP address (244.157.45.12) – Cookies (A9A3BECE0563982D) – IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") ©2014 Cloudera, Inc. All Rights Reserved.
  • 70. 70 #1 – Which clicks are from same user? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights Reserved.
  • 71. 71 #2 – Which clicks part of the same session? > 30 mins apart = different sessions 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights Reserved.
  • 72. Sessionization engine recommendation • We have sessionization code in MR, Spark on github. The complexity of the code varies, depends on the expertise in the organization. • We choose MR, since it’s fairly simple and maintainable code. ©2014 Cloudera, Inc. All rights reserved. 72
  • 73. 73 Filtering – filter out incomplete records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U… ©2014 Cloudera, Inc. All Rights Reserved.
  • 74. 74 Filtering – filter out records from bots/spiders 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights Reserved. Google spider IP address
  • 75. Filtering recommendation • Bot/Spider filtering can be done easily in any of the engines • Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. • Pretty close choice between MR, Hive and Spark • Can be done in Flume interceptors as well • We can simply embed this in our sessionization job ©2014 Cloudera, Inc. All rights reserved. 75
  • 76. 76 Deduplication – remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” ©2014 Cloudera, Inc. All Rights Reserved.
  • 77. Deduplication recommendation • Can be done in all engines. • We already have a Hive table with all the columns, a simple DISTINCT query will perform deduplication • We use Pig ©2014 Cloudera, Inc. All rights reserved. 77
  • 78. BI/Discovery engine recommendation • Main requirements for this are: – Low latency – SQL interface (e.g. JDBC/ODBC) – Users don’t know how to code • We chose Impala – It’s a SQL engine – Much faster than other engines – Provides standard JDBC/ODBC interfaces ©2014 Cloudera, Inc. All rights reserved. 78
  • 80. 80 Orchestrating Clickstream • Data arrives through Flume • Triggers a processing event: – Sessionize – Enrich – Location, marketing channel… – Store as Parquet • Each day we process events from the previous day
  • 81. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting ©2014 Cloudera, Inc. All rights reserved. 81 Choosing Right
  • 82. 82 Oozie or Azkaban? ©2014 Cloudera, Inc. All rights reserved.
  • 83. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting ©2014 Cloudera, Inc. All rights reserved. 83 Choosing… Easier in Oozie
  • 84. Choosing the right Orchestration Tool • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting Better in Azkaban ©2014 Cloudera, Inc. All rights reserved. 84
  • 85. Important Decision Consideration! The best orchestration tool is the one you are an expert on ©2014 Cloudera, Inc. All rights reserved. 85
  • 86. 86 Putting It All Together Final Architecture
  • 87. 87 Final Architecture – High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 88. 88 Final Architecture – High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 89. 89 Final Architecture – Ingestion Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Flume Agent Flume Agent Flume Agent Flume Agent Fan-in Pattern Multi Agents for Failover and rolling restarts HDFS ©2014 Cloudera, Inc. All Rights Reserved.
  • 90. 90 Final Architecture – High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 91. 91 Final Architecture – Storage and Processing /etl/weblogs/20140331/ /etl/weblogs/20140401/ … Data Processing /data/marketing/clickstream/ bouncerate/ /data/marketing/clickstream/ attribution/ … ©2014 Cloudera, Inc. All Rights Reserved.
  • 92. 92 Final Architecture – High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 93. 93 Final Architecture – Data Access Hive/ Impala BI/ Analytics Tools JDBC/ODBC DWH Sqoop Local Disk R, etc. DB import tool ©2014 Cloudera, Inc. All Rights Reserved.
  • 95. 95 Contact info • Mark Grover – @mark_grover – www.linkedin.com/in/grovermark • Jonathan Seidman – jseidman@cloudera.com – @jseidman – www.linkedin.com/in/jaseidman – www.slideshare.net/jseidman • Slides at slideshare.net/hadooparchbook ©2014 Cloudera, Inc. All Rights Reserved.