This document provides an overview of big data and various big data tools including Pig, Hive, and Cascading. It discusses the history and motivation for each tool, how they work by mapping operations to MapReduce jobs, and compares key aspects of their data models, typing, and procedural vs declarative styles. The document is intended as a training presentation on these popular big data frameworks.
BreizhJUG - January 2014 - Big Data - Dataiku - Pages Jaunes
1. BIG DATA
How do elephants make babies
Florian Douetteau
CEO, Dataiku
2. Agenda
• Big Data & Hadoop Overview
• Practical Big Data Coding: Pig / Hive / Cascading
• PagesJaunes Big Data Use Case
• Machine Learning For Big Data
5. “Big” Data in 1999
struct Element {
    Key key;
    void* stat_data;
};
….
C
Optimized Data structures
Perfect Hashing
HP-UNIX Servers – 4GB Ram
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
Dataiku
1 Month
1/8/14
6. Big Data in 2013
Hadoop
Java / Pig / Hive / Scala / Clojure / …
A dozen NoSQL data stores
MPP Databases
Real-Time
1 Hour
7. Data Analytics: The Stakes
1 TB
1B $
1 TB
?$
1 TB
100M $
Web Search
1999
Logistics
2004
10 TB
10M $
100 TB
?$
Banking
CRM
2008
50TB
1B$
1000TB
500M $
E-Commerce
2013
Social Gaming
2011
Web
Search
2010
Online
Advertising
2012
8. Meet Hal Alowne
Hal Alowne
BI Manager
Dim's Private Showroom
European E-commerce Web site
• 100M$ Revenue
• 1 million customers
• 1 Data Analyst (Hal himself)
Dataiku - Data Tuesday
Dim Sum
CEO & Founder
Dim's Private Showroom
Hey Hal! We need
a big data platform
like the big guys.
Let's just do as they do!
Big Data
Copy Cat Project
Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists
24. MERIT = TIME + ROI
TIME : 6 MONTHS
ROI : APPS
2014
2013
Find the right
people
(6 months?)
Choose the
technology
(6 months?)
Make it work
(6 months?)
2013
Build the lab
(6 months)
• Train People
• Reuse working patterns
Build a lab in 6 months
(rather than 18 months)
Targeted
Newsletter
Recommender
Systems
Adapted Product
/ Promotions
Deploy apps that actually deliver value
1/9/14
27. CHOOSE TECHNOLOGY
NoSQL-Slavia
Hadoop
Elastic Search
Ceph
SOLR
Riak
Machine Learning
Mystery Land
Scalability Central
Cassandra
MongoDB
Membase
Scikit-Learn
GraphLAB
prediction.io jubatus
Mahout
WEKA
Sphere
Kafka Flume
Real-time island
Spark Storm
SQL Columnar Republic
MLBase
RapidMiner
Vertica
Netezza
QlikView
Kibana
SpotFire D3
Cascading
Tableau
Dataiku - Pig, Hive and Cascading
SPSS
Pandas
Pig
Visualization County
R
SAS
InfiniDB Drill
GreenPlum
Impala
LibSVM
Talend
Data Clean Wasteland
Statistician Old House
28. Large E-Retailer
Business Intelligence stack has scalability and maintenance issues
Backoffice implements
business rules that are
challenged
Existing infrastructure cannot
cope with per-user
information
Main Pain Point:
23 hours 52 minutes to
compute Business Intelligence
aggregates for one day.
29. Large E-Retailer: The Datalab
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience
1h12 to perform the aggregate, available every morning
New home page personalization deployed in a few weeks
Hadoop
Cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6-week project
30. Example (Social Gaming)
Social Gaming Communities
Correlation
◦ between community size and
engagement / virality
Some mid-size
communities
Meaningful patterns
◦ 2 players / Family / Group
What is the minimum
number of friends to have in
the application to get
additional engagement ?
A very large community
Lots of small clusters (mostly 2 players)
31. How do I (pre)process data?
Implicit User Data
(Views, Searches…)
Online User
Information
Transformation
Predictor
500TB
Transformation
Matrix
Explicit User Data
Predictor
Runtime
(Click, Buy, …)
Per User Stats
Rank Predictor
50TB
Per Content Stats
User Information
(Location, Graph…)
User Similarity
1TB
Content Data
(Title, Categories, Price, …)
200GB
Content Similarity
A/B Test Data
33. The Questions
Pour Data In
How often ?
What kind of
interaction?
How much ?
Compute Something
Smart About It
How complex ?
Do you need all
data at once ?
How incremental?
Make Available
Interaction ?
Random Access ?
40. Agenda
Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
41. Pig History
Yahoo Research in 2006
Inspired by Sawzall, a Google paper from 2003
2007 as an Apache Project
Initial motivation
◦ Search Log Analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
42. Hive History
Developed by Facebook in January 2007
Open source in August 2008
Initial Motivation
◦ Provide a SQL like abstraction to perform statistics on
status updates
create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
43. Cascading History
Authored by Chris Wensel in 2008
Associated Projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter in 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
44. Agenda
Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
45. Pig Hive
Mapping to Mapreduce jobs
events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;  -- comparison value missing in the source
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE group AS user,
                  SUM(events_filtered.price) AS total_price,
                  MAX(events_filtered.timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;
Job 1: Mapper
LOAD
FILTER
Shuffle and sort by user
Job 1: Reducer
GROUP
FOREACH
FILTER
* VAT excluded
Dataiku - Innovation Services
46. Pig Hive
Mapping to Mapreduce jobs
events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;  -- comparison value missing in the source
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE group AS user,
                  SUM(events_filtered.price) AS total_price,
                  MAX(events_filtered.timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;
recent_high     = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';
Job 1: Mapper
LOAD
FILTER
Shuffle and sort by user
Job 1: Reducer
GROUP
FOREACH
FILTER
Job 2: Mapper
LOAD (from tmp)
Shuffle and sort by max_ts
Job 2: Reducer
STORE
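The way the script above splits into two jobs can be mimicked in plain Python. This is only a sketch of the idea, not how Pig actually executes a plan: the sample rows, the `purchase` event type used as the filter predicate, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (type, user, price, timestamp)
EVENTS = [
    ("purchase", "alice", 700, 10),
    ("purchase", "alice", 600, 30),
    ("view", "alice", 0, 40),
    ("purchase", "bob", 200, 20),
    ("purchase", "carol", 1500, 5),
]

def job1_map(rows):
    """LOAD + FILTER: keep purchases, emit (user, (price, timestamp))."""
    for kind, user, price, ts in rows:
        if kind == "purchase":  # assumed predicate, not from the slides
            yield user, (price, ts)

def shuffle(pairs):
    """Group mapper output by key, as the shuffle-and-sort phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def job1_reduce(grouped, threshold=1000):
    """GROUP + FOREACH + FILTER: aggregate per user, keep high totals."""
    for user, values in grouped:
        total = sum(price for price, _ in values)
        max_ts = max(ts for _, ts in values)
        if total > threshold:
            yield user, total, max_ts

def job2(rows):
    """ORDER BY max_ts DESC: the second job sorts on max_ts."""
    return sorted(rows, key=lambda row: row[2], reverse=True)

tmp = list(job1_reduce(shuffle(job1_map(EVENTS))))  # stands in for the tmp file
recent_high = job2(tmp)
```

The second sort needs a different key than the first shuffle, which is why a second job (and an intermediate file) is required.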
47. Pig
How does it work
Data Execution Plan compiled into 10
map reduce jobs executed in parallel
(or not)
48. Hive Joins
How to join with MapReduce ?
[Diagram: reduce-side join on uid. Two input tables, (uid, name) with names Dupont and Durand and (uid, type) with types Type1 and Type2. Mappers output: every row tagged with a tbl_idx (1 or 2). Shuffle by uid, sort by (uid, tbl_idx). Reducer 1 and Reducer 2 each receive all rows for their uids and produce the joined (uid, name, type) rows.]
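A reduce-side join of this kind can be sketched in a few lines of Python. The rows below are hypothetical (the exact values of the diagram did not survive), and sorting a single in-memory list stands in for the shuffle and sort that Hadoop performs across reducers.

```python
from itertools import groupby

# Hypothetical rows: users(uid, name) and events(uid, type)
USERS = [(1, "Dupont"), (2, "Durand")]
EVENTS = [(1, "Type1"), (2, "Type1"), (1, "Type2"), (2, "Type2")]

def map_phase(users, events):
    """Tag every row with its table index: 1 for users, 2 for events."""
    for uid, name in users:
        yield (uid, 1), name
    for uid, kind in events:
        yield (uid, 2), kind

def reduce_side_join(users, events):
    # Sort by (uid, tbl_idx): for each uid the user row arrives first.
    records = sorted(map_phase(users, events), key=lambda kv: kv[0])
    joined = []
    for uid, group in groupby(records, key=lambda kv: kv[0][0]):
        name = None
        for (_, tbl_idx), value in group:
            if tbl_idx == 1:
                name = value  # remember the user row
            elif name is not None:
                joined.append((uid, name, value))  # emit one joined row
    return joined

result = reduce_side_join(USERS, EVENTS)
```

Because the user row sorts first within each uid, the reducer only has to keep one name in memory while it streams through the matching events.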
49. Agenda
Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
50. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
51. Procedural Vs Declarative
Transformation as a sequence of operations
Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Transformation as a set of formulas
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
52. Data type and Model
Rationale
All three extend the basic data model with extended data types
◦ array-like [ event1, event2, event3]
◦ map-like { type1:value1, type2:value2, …}
Different approaches
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing
53. Hive
Data Type and Schema
CREATE TABLE visit (
    user_name    STRING,
    user_id      INT,
    user_details STRUCT<age:INT, zipcode:INT>
);
Simple type | Details
TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
FLOAT, DOUBLE | 4 and 8 bytes
BOOLEAN |
STRING | Arbitrary-length, replaces VARCHAR
TIMESTAMP |
Complex type | Details
ARRAY | Array of typed items (0-indexed)
MAP | Associative map
STRUCT | Complex class-like objects
Dataiku Training – Hadoop for Data Science
54. Data types and Schema
Pig
rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);
Simple type | Details
int, long, float, double | 32 and 64 bits, signed
chararray | A string
bytearray | An array of … bytes
boolean | A boolean
Complex type | Details
tuple | An ordered fieldname:value map
bag | An unordered collection of tuples (duplicates allowed)
55. Data Type and Schema
Cascading
Support for Any Java Types, provided they can be
serialized in Hadoop
No support for Typing
Simple type | Details
Int, Long, Float, Double | 32 and 64 bits, signed
String | A string
byte[] | An array of … bytes
Boolean | A boolean
Complex type | Details
Object | Object must be "Hadoop serializable"
56. Style Summary
Tool | Style | Typing | Data Model | Metadata store
Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
Cascading | Procedural | Weak | scalar + java objects | No
57. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
59. Headaches
Pig
Out Of Memory Error (Reducer)
Exceptions in built-in / extended functions (handling of null)
Null vs “”
Nested Foreach and scoping
Date Management (pig 0.10)
Field implicit ordering
61. Headaches
Hive
Out of Memory Errors in
Reducers
Few Debugging Options
Null / “”
No builtin “first”
62. Headaches
Cascading
Weak Typing Errors (comparing
Int and String … )
Illegal Operation Sequence
(Group after group …)
Field Implicit Ordering
63. Testing
Motivation
How to perform unit tests ?
How to have different versions of the same script (parameters)?
66. Checkpointing
Motivation
Lots of iteration while developing on Hadoop
Sometimes jobs fail
Sometimes need to restart from the start …
[Diagram: pipeline stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output; on failure, fix and relaunch]
67. Pig
Manual Checkpointing
STORE command to manually store intermediate files
Comment out the beginning of the script and relaunch
[Diagram: pipeline stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output]
69. Cascading
Topological Scheduler
Checks the timestamp of each intermediate file
Executes a step only if its inputs are more recent than its output
[Diagram: pipeline stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output]
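The timestamp rule can be illustrated with a small make-style runner in Python. This is a sketch of the general idea only, not Cascading's API: `is_stale`, `run_pipeline` and the `copy_upper` job are hypothetical names, and the two-step pipeline is a reduced version of the one in the diagram.

```python
import os
import tempfile
from pathlib import Path

def is_stale(output, inputs):
    """A step must run if its output is missing or older than any input."""
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(src.stat().st_mtime > out_mtime for src in inputs)

def run_pipeline(steps):
    """Run steps in order; each step is (name, inputs, output, action)."""
    executed = []
    for name, inputs, output, action in steps:
        if is_stale(output, inputs):
            action(inputs, output)
            executed.append(name)
    return executed

def copy_upper(inputs, output):
    """Hypothetical stand-in for a real job."""
    output.write_text("".join(p.read_text().upper() for p in inputs))

workdir = Path(tempfile.mkdtemp())
logs = workdir / "logs.txt"
parsed = workdir / "parsed.txt"
stats = workdir / "stats.txt"
logs.write_text("GET /index\n")
steps = [
    ("Parse Logs", [logs], parsed, copy_upper),
    ("Per Page Stats", [parsed], stats, copy_upper),
]
first_run = run_pipeline(steps)   # nothing exists yet: every step runs
second_run = run_pipeline(steps)  # outputs are up to date: nothing runs
os.utime(stats, (0, 0))           # pretend the stats output is very old
third_run = run_pipeline(steps)   # only the stale step is re-executed
```

After a failure, relaunching the same pipeline therefore resumes from the first step whose output is missing or out of date.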
71. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
72. Formats Integration
Motivation
Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (Binary Hadoop format)
◦ Avro, Thrift ...
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, Database, …)
Format impact on size and performance
Format | Size on Disk (GB) | HIVE Processing time (24 cores)
Text File, uncompressed | 18.7 | 1m32s
1 Text File, Gzipped | 3.89 | 6m23s (no parallelization)
JSON compressed | 7.89 | 2m42s
multiple text file gzipped | 4.02 | 43s
Sequence File, Block, Gzip | 5.32 | 1m18s
Text File, LZO Indexed | 7.03 | 1m22s
74. Partitions
Motivation
No support for “UPDATE” patterns; any increment is performed by adding or deleting a partition
Common partition schemas on Hadoop
◦ By Date /apache_logs/dt=2013-01-23
◦ By Data center /apache_logs/dc=redbus01/…
◦ By Country
◦ …
◦ Or any combination of the above
75. Hive Partitioning
Partitioned tables
CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1
INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
76. Cascading Partition
No Direct support for partition
Support for a “Glob” Tap, to read from files using patterns
➔ You can code your own custom or virtual partition schemes
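The key=value directory layout makes partition selection a matter of path matching. The sketch below shows both a lookup on partition values and a glob-style pattern in plain Python; the paths follow the Hive disk structure shown earlier, and `partition_values` and `prune` are hypothetical helper names, not part of Hive or Cascading.

```python
from fnmatch import fnmatch

# Paths following the /hive/event/day=.../server_id=... layout
FILES = [
    "/hive/event/day=2013-01-27/server_id=s1/file0",
    "/hive/event/day=2013-01-27/server_id=s1/file1",
    "/hive/event/day=2013-01-27/server_id=s2/file0",
    "/hive/event/day=2013-01-28/server_id=s2/file0",
]

def partition_values(path):
    """Extract {column: value} from the key=value path components."""
    parts = (p.split("=", 1) for p in path.split("/") if "=" in p)
    return {key: value for key, value in parts}

def prune(files, **wanted):
    """Keep only files whose partition values match every requested one."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]

one_day = prune(FILES, day="2013-01-27")
one_server = prune(FILES, day="2013-01-27", server_id="s2")
# Glob-style selection: every January 2013 day, server s2 only
glob_like = [
    f for f in FILES
    if fnmatch(f, "/hive/event/day=2013-01-*/server_id=s2/*")
]
```

Either way, files outside the selected partitions are never opened, which is what makes incremental processing by partition cheap.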
80. Spring Batch
Cascading Integration
Allows calling a Cascading flow from a Spring Batch job
No full Integration with Spring MessageSource or
MessageHandler yet (only for local flows)
81. Integration
Summary
Tool | Partition/Incremental Updates | External Code | Format Integration
Pig | No Direct Support | Simple | Doable and rich community
Hive | Fully integrated, SQL Like | Very simple, but complex dev setup | Doable and existing community
Cascading | With Coding | Complex UDFs but regular, and Java Expression embeddable | Doable and growing community
82. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
83. Optimization
Several Common Map Reduce Optimization Patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism
Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write
84. Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Diagram: without a combiner, each mapper emits one record per input row (2012-02-14 4354, 2012-02-15 21we2, 2012-02-14 qa334, 2012-02-15 23aq2, …) and the reducer computes the final counts: 2012-02-14 20, 2012-02-15 35, 2012-02-16 1]
85. Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Diagram: with a combiner, the mappers emit partial counts per date (2012-02-14 8 and 2012-02-15 12 on one side; 2012-02-14 12, 2012-02-15 23 and 2012-02-16 1 on the other) and the reducer sums them into the final counts: 2012-02-14 20, 2012-02-15 35, 2012-02-16 1]
Reduced network bandwidth. Better parallelism
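The effect of a combiner on the COUNT(*) query can be reproduced with a short Python sketch. The two input splits are hypothetical sample rows, and the functions only mimic the mapper, combiner and reducer roles; they are not Hadoop APIs.

```python
from collections import Counter

# Hypothetical input splits: one (date, product_id) row per product
SPLIT_1 = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "qa334")]
SPLIT_2 = [("2012-02-15", "23aq2"), ("2012-02-16", "77zz1"), ("2012-02-15", "9x")]

def map_rows(split):
    """Mapper: emit (date, 1) for every row."""
    return [(date, 1) for date, _ in split]

def combine(pairs):
    """Combiner: pre-aggregate counts on the mapper side."""
    partial = Counter()
    for date, n in pairs:
        partial[date] += n
    return list(partial.items())

def reduce_counts(pairs):
    """Reducer: sum the (possibly pre-aggregated) counts per date."""
    total = Counter()
    for date, n in pairs:
        total[date] += n
    return dict(total)

splits = [SPLIT_1, SPLIT_2]
without = [kv for s in splits for kv in map_rows(s)]
with_combiner = [kv for s in splits for kv in combine(map_rows(s))]
```

Both record lists reduce to the same counts, but the combined one carries at most one record per date and per mapper across the network. This works because counting is associative: partial sums can be summed again.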
86. Join Optimization
Map Join
Hive
set hive.auto.convert.join = true;
Pig
Cascading
(no aggregation support after HashJoin)
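The principle behind a map join is that the small table is loaded into memory on every mapper, so the join needs neither shuffle nor reducer. A minimal Python sketch, with hypothetical tables and a hypothetical `map_join` helper:

```python
# Hypothetical tables: a small dimension table and a large fact table
SMALL_USERS = {1: "Dupont", 2: "Durand"}  # assumed to fit in mapper memory
BIG_EVENTS = [(1, "Type1"), (2, "Type2"), (3, "Type1"), (2, "Type1")]

def map_join(big_rows, small_table):
    """Join inside the mapper: one hash lookup per row of the big table."""
    for uid, kind in big_rows:
        name = small_table.get(uid)
        if name is not None:  # inner join: drop unmatched rows
            yield uid, name, kind

joined = list(map_join(BIG_EVENTS, SMALL_USERS))
```

The approach only applies when one side of the join is small enough to be held in memory by each mapper.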
87. Number of Reducers
Critical for performance
Estimated from the size of the input files
◦ Hive
divide the input size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig
divide the input size by pig.exec.reducers.bytes.per.reducer (default 1GB)
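As a worked example of this estimate, the arithmetic can be written as a small Python function. The 1GB value is the default quoted above; the `max_reducers` cap and the function name are assumptions added for illustration, and the exact rounding used by each tool may differ.

```python
import math

BYTES_PER_REDUCER = 1_000_000_000  # the 1GB default quoted above

def estimate_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=999):
    """Input size divided by bytes-per-reducer, rounded up, at least 1.

    max_reducers is a hypothetical upper bound, not taken from the slides.
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

reducers_for_text_file = estimate_reducers(18_700_000_000)  # an 18.7 GB input
```

Lowering the bytes-per-reducer setting is therefore the usual way to get more reducers for a given input.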
89. Date • Presentation title
BIG DATA AND MACHINE LEARNING USE CASES
Search quality
• ERWAN PIGNEUL
• TEAM LEADER – PROJECT MANAGER
90. PAGESJAUNES CONTEXT
CORE BUSINESS: LOCAL SEARCH FOR PROFESSIONALS
PAGESJAUNES USES A SPECIFIC INTERPRETATION ENGINE THAT REQUIRES MANUAL INDEXING
THIS HANDLES THE MOST FREQUENT QUERIES WELL
BUT IT DOES NOT COVER THE LONG TAIL
91. HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS
THROUGH USER BEHAVIOR ANALYSIS?
20 M
1.4M
10
occurrences
queries
Analysis
corrections
200M
searches
0.5M queries
prioritized
automation
93. TECHNICAL LESSONS
HADOOP / PIG / HIVE:
Effective
Challenges some test/prod assumptions (problems appear on large volumes)
Caution, it is still young (compatibility, …)
DATAIKU STUDIO:
Big data development accelerator
Scheduler that integrates all our jobs and manages their dependencies
Easy machine learning
ELASTICSEARCH:
Indexed volume and search speed
94. EFFECTIVENESS OF THE APPROACH
Evolution of the fragility of the query ‘Parc enfant’
Fragile
Query ‘Parc enfant’
Overall average
Not fragile
99. clustering applications
• Fraud: Detect Outliers
• CRM: Mine for customer segments
• Image Processing: Similar Images
• Search: Similar documents
• Search: Allocate Topics
100. K-Means
Guess an initial placement for centroids
Assign each point to closest Center
Reposition Center
MAP
REDUCE
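The three steps map naturally onto a map phase (assign points) and a reduce phase (reposition centers), repeated until the centers stop moving. The Python sketch below illustrates that loop on hypothetical 2D points; it is not Mahout's implementation, and the function names are made up for the example.

```python
from collections import defaultdict

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def kmeans_map(points, centroids):
    """MAP: assign each point to its closest center."""
    for point in points:
        yield closest(point, centroids), point

def kmeans_reduce(assignments, centroids):
    """REDUCE: reposition each center at the mean of its points."""
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    new_centroids = list(centroids)  # a center with no points stays put
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(axis) / len(pts) for axis in zip(*pts))
    return new_centroids

def kmeans(points, centroids, iterations=10):
    """Repeat map and reduce until the centers no longer move."""
    for _ in range(iterations):
        updated = kmeans_reduce(kmeans_map(points, centroids), centroids)
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids

# Hypothetical data and initial guess for the centers
POINTS = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
```

On Hadoop, each pass of this loop is a separate MapReduce job, so every iteration reads its input from disk and writes the new centers back.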
110. clustering challenges
• Curse of Dimensionality
• Choice of distance / number of parameters
• Performance
• Choice # of clusters
111. Mahout Clustering
Challenges
• No Integrated Feature Engineering Stack: get ready to write data processing in Java
• Hadoop SequenceFile required as an input
• Iterations as Map/Reduce read and write to disks: relatively slow compared to in-memory processing
115. Convert a CSV File to Mahout Vector
Real code would also need:
• Converting categorical variables to dimensions
• Variable rescaling
• Dropping IDs (name, forename …)
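The three preparation steps listed above can be sketched in Python on a tiny in-memory CSV. The column names and values are hypothetical, min-max scaling is one rescaling choice among several, and the output is a plain list of floats rather than an actual Mahout Vector.

```python
import csv
import io

# Hypothetical CSV: the columns and values are made up for illustration
RAW = """name,age,country,basket
Dupont,30,FR,100
Durand,50,DE,300
Martin,40,FR,200
"""

def to_vectors(text, id_columns=("name",), categorical=("country",)):
    rows = list(csv.DictReader(io.StringIO(text)))
    # Dropping IDs: identifier columns never reach the vector
    numeric = [c for c in rows[0] if c not in id_columns + categorical]
    # Categorical to dimensions: one dimension per distinct value
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    # Rescaling: min-max bounds bring numeric columns into [0, 1]
    bounds = {
        c: (min(float(r[c]) for r in rows), max(float(r[c]) for r in rows))
        for c in numeric
    }
    vectors = []
    for r in rows:
        vec = []
        for c in numeric:
            lo, hi = bounds[c]
            vec.append((float(r[c]) - lo) / (hi - lo) if hi > lo else 0.0)
        for c in categorical:
            vec.extend(1.0 if r[c] == level else 0.0 for level in levels[c])
        vectors.append(vec)
    return vectors

vectors = to_vectors(RAW)
```

Each row becomes a vector of two rescaled numeric values followed by one indicator per country, which is the shape a distance-based clustering algorithm expects.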
116. Mahout Algorithms
Algorithm | Parameters | Implicit Assumption | Output
K-Means | K (number of clusters), Convergence | Circles | Point - ClusterId
Fuzzy K-Means | K (number of clusters), Convergence | Circles | Point - ClusterId*, Probability
Expectation Maximization | K (number of clusters), Convergence | Gaussian distribution | Point - ClusterId*, Probability
Mean-Shift Clustering | Distance boundaries, Convergence | Gradient like distribution | Point - Cluster ID
Top Down Clustering | Two Clustering Algorithms | Hierarchy | Point - Large ClusterId, Small ClusterId
Dirichlet Process | Model Distribution | Points are a mixture of distribution | Point - ClusterId, Probability
Spectral Clustering | - | - | Point - ClusterId
MinHash Clustering | Number of hash / keys, Hash Type | High Dimension | Point - Hash*