Simple representations for learning:
factorizations and similarities
Gaël Varoquaux
Settings: Very high dimensionality
- signals (images, spectra)
- many entities (customers, products)
- non-standardized categories (typos, variants)
Exploit links & redundancy across features
G Varoquaux 2
1 Factorizing huge matrices
2 Encoding with similarities
1 Factorizing huge matrices
with A. Mensch, J. Mairal, B. Thirion
[Mensch... 2016, 2017]
Model (samples × features): Y ≈ E · Sᵀ + N
Challenge: scalability
1 Intuitions
2 Experiments
3 Algorithms
4 Proof
1 Real world data: recommender systems
Product ratings
Millions of entries
Hundreds of thousands of
products and users
Large sparse matrix
Model (users × products): Y ≈ E · Sᵀ + N
1 Real world data: brain imaging
Brain activity at rest
1000 subjects with ∼ 100–10 000 samples each
Images of dimensionality > 100 000
Dense matrix, large both ways
Model (time × voxels): Y ≈ E · Sᵀ + N
1 Scalable solvers for matrix factorizations
Large matrices = terabytes of data

argmin_{E,S} ‖Y − E Sᵀ‖²_Fro + λ Ω(S)
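The objective above can be sketched numerically. A minimal numpy sketch, not the paper's solver: plain alternating least squares, with a squared ℓ2 penalty standing in for Ω (an assumption; the talk's experiments use sparse penalties):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 50, 5                    # samples, features, components
Y = rng.standard_normal((n, p))         # data matrix
E = rng.standard_normal((n, k))         # loadings
S = rng.standard_normal((p, k))         # dictionary
lam = 0.1

def objective(Y, E, S, lam):
    # ||Y - E S^T||^2_Fro + lambda * Omega(S), with Omega = squared l2 here
    return np.linalg.norm(Y - E @ S.T, "fro") ** 2 + lam * np.sum(S ** 2)

# Alternating minimization: each half-step is a closed-form least squares,
# so the objective can only decrease from one pass to the next
for _ in range(10):
    E = Y @ S @ np.linalg.inv(S.T @ S + 1e-8 * np.eye(k))
    S = Y.T @ E @ np.linalg.inv(E.T @ E + lam * np.eye(k))
```

This monotone-decrease property is what the online variants on the following slides preserve while touching only a fraction of Y at each step.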
Alternating minimization: data access, code computation, dictionary update,
on a data matrix streamed over time (seen at t, seen at t+1, unseen at t)

Rewrite as an expectation: [Mairal... 2010]
argmin_E Σ_i min_s ‖Y_i − E sᵀ‖²_Fro + λ Ω(s),   i.e.   argmin_E 𝔼[f(E)]
⇒ Optimize on approximations (sub-samples)

Online matrix factorization [Mairal... 2010]: stream columns
   159 h run time on 2 terabytes of data
   12 h run time on 100 gigabytes of data

Subsampled Online Matrix Factorization (SOMF) [Mensch... 2017]:
also subsample rows, with a new subsampling algorithm
   13 h run time on 1 terabyte of data: a ×10 speed up
1 Experimental results: resting-state fMRI
[Figure: test objective value vs time (100 s to 24 h) on HCP (3.5 TB):
SGD with best step-size, online matrix factorization, proposed SOMF (r = 12)]
SOMF = Subsampled Online Matrix Factorization
1 Experimental results: large images
[Figure: test objective value vs time, four panels: ADHD (sparse dictionary, 2 GB),
Aviris (NMF, 103 GB), Aviris (dictionary learning, 103 GB), HCP (sparse dictionary, 2 TB);
OMF vs SOMF with subsampling ratios r = 4, 6, 8, 12, 24, and best step-size SGD]
SOMF = Subsampled Online Matrix Factorization
1 Experimental results: recommender system
SOMF = Subsampled Online Matrix Factorization
1 Algorithm: Online matrix factorization prior art
Stream samples xt: [Mairal... 2010]
1. Compute code
   α_t = argmin_{α∈R^k} ‖x_t − D_{t−1} α‖₂² + λ Ω(α)
2. Update the surrogate function
   g_t(D) = (1/t) Σ_{i=1..t} ‖x_i − D α_i‖₂²  =  trace( ½ Dᵀ D A_t − Dᵀ B_t )
   A_t = (1 − 1/t) A_{t−1} + (1/t) α_t α_tᵀ
   B_t = (1 − 1/t) B_{t−1} + (1/t) x_t α_tᵀ
3. Minimize surrogate
   D_t = argmin_{D∈C} g_t(D),   ∇g_t = D A_t − B_t
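These three steps can be sketched in a few lines of numpy. A simplifying sketch: a ridge code stands in for the sparse solver, and the unit ball for the constraint set C (both are assumptions, not the talk's exact choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 20, 5                       # feature dimension, number of atoms
D = rng.standard_normal((p, k))    # dictionary
D /= np.linalg.norm(D, axis=0)     # atoms start on the unit ball (set C)

A = np.zeros((k, k))               # surrogate statistics A_t, B_t
B = np.zeros((p, k))
lam = 0.1

for t in range(1, 201):
    x = rng.standard_normal(p)                         # stream one sample
    # 1. code: ridge system (D^T D + lam I) alpha = D^T x
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
    # 2. surrogate statistics with 1/t averaging
    A += (np.outer(alpha, alpha) - A) / t
    B += (np.outer(x, alpha) - B) / t
    # 3. block coordinate descent on the surrogate, one atom at a time
    for j in range(k):
        grad_j = D @ A[:, j] - B[:, j]
        D[:, j] -= grad_j / (A[j, j] + 1e-8)
        D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))   # project onto unit ball
```

Only the k×k and p×k statistics A and B are kept between samples: the past data never needs to be revisited.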
g_t(D) is a surrogate (majorant) of Σ_x l(x, D): the stored codes α_i are used,
not the optimal α*(D)
⇒ Stochastic Majorization-Minimization
No nasty hyper-parameters
Complexity: the code computation depends on p; the surrogate update and
surrogate minimization are O(p)
1 Sub-sample features
Data stream: (x_t)_t → masked stream (M_t x_t)_t
Dimension: p → s
Use only M_t x_t in the computations → complexity in O(s)
Modify all steps to work on s features:
code computation, surrogate update, surrogate minimization
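The O(s) point can be illustrated directly: with a mask keeping s of p features, a quantity like Dᵀx_t is estimated from the observed rows only. A sketch with an illustrative p/s rescaling to keep the estimate unbiased (not the paper's exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s = 1000, 10, 100                       # features, atoms, kept features
D = rng.standard_normal((p, k))
x = rng.standard_normal(p)

idx = rng.choice(p, size=s, replace=False)    # mask M_t: s rows observed
# O(s*k) estimate of the O(p*k) product D^T x, rescaled to be unbiased
est = (p / s) * D[idx].T @ x[idx]
full = D.T @ x                                # O(p*k) reference
```

A single masked estimate is noisy; averaging such estimates over iterations, as the next slides do, controls the variance.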
1 Sub-sample features
Original online MF
1. Code computation
   α_t = argmin_{α∈R^k} ‖x_t − D_{t−1} α‖₂² + λ Ω(α)
2. Surrogate aggregation
   A_t = (1/t) Σ_{i=1..t} α_i α_iᵀ
   B_t = B_{t−1} + (1/t) (x_t α_tᵀ − B_{t−1})
3. Surrogate minimization (column j)
   Dʲ ← p⊥_{Cʲ}( Dʲ − (1/(A_t)_{j,j}) (D A_tʲ − B_tʲ) )

Our algorithm
1. Approximate code computation, on masked data:
   β_t ← (1 − γ) β_{t−1} + γ D_{t−1}ᵀ M_t x_t
   G_t ← (1 − γ) G_{t−1} + γ D_{t−1}ᵀ M_t D_{t−1}
   α_t ← argmin_{α∈R^k} ½ αᵀ G_t α − αᵀ β_t + λ Ω(α)
2. Surrogate aggregation, with averaging:
   A_t = (1/w_t) α_t α_tᵀ + (1 − 1/w_t) A_{t−1}
   P_t B̄_t ← (1 − w_t) P_t B̄_{t−1} + w_t P_t x_t α_tᵀ
3. Surrogate minimization, on the selected rows:
   P_t D_t ← argmin_{D_r∈C_r} ½ tr(D_rᵀ D_r Ā_t) − tr(D_rᵀ P_t B̄_t)
   P_t⊥ B̄_t ← (1 − w_t) P_t⊥ B̄_{t−1} + w_t P_t⊥ x_t α_tᵀ
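The averaged estimators of step 1 can be illustrated in isolation. A sketch with a fixed dictionary (an assumption: in the real algorithm D changes between iterations), showing that exponential averaging of masked estimates converges to the full Gram matrix DᵀD:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s, gamma = 500, 8, 50, 0.05
D = rng.standard_normal((p, k)) / np.sqrt(p)   # columns of roughly unit norm

G = np.zeros((k, k))                           # running estimate of D^T D
for t in range(2000):
    idx = rng.choice(p, size=s, replace=False)         # mask M_t
    # masked estimate built from s of the p rows, rescaled by p/s,
    # folded into an exponential moving average
    G = (1 - gamma) * G + gamma * (p / s) * D[idx].T @ D[idx]

G_full = D.T @ D
```

Each step costs O(s·k²) instead of O(p·k²), and the averaging keeps the estimate's variance small.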
1 Sub-sample features – variance reduction
[Figure: test objective function vs subsampling ratio (none, r = 12, r = 24)
and vs time, comparing code computations: no subsampling, masked loss,
averaged estimators]
1 Why does it work?
Objective:
   D* = argmin_{D∈C} Σ_x l(x, D)   where   l(x, D) = min_α f(x, D, α)
Algorithm (online matrix factorization):
   g_t(D) is a majorant of Σ_x l(x, D); the stored codes α_i are used, not α*(D)
⇒ Stochastic Majorization-Minimization [Mairal 2013]
1 Stochastic Approximate Majorization-Minimization
SMM: surrogate computation, then full minimization
SAMM: surrogate approximation, then partial minimization
Massive matrix factorization via subsampling
Subsampling features ⇒ doubly stochastic
10× speed ups on a fast algorithm
Analysis via stochastic approximate majorization-minimization
Conclusive on various high-dimensional problems
2 Encoding with similarities
with P. Cerda and B. Kégl [Cerda... 2018]
When categories create a huge dimensionality
Machine learning: let X ∈ R^{n×p}
The real world:
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F 11/19/1989 Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M 11/22/2010 Library Assistant I
A data cleaning problem?
A feature engineering problem?
A problem of representations in high dimension
2 The problem of “dirty categories”
Non-curated categorical entries
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
2 Dirty categories in the wild
Employee Salaries: salary information for employees
of Montgomery County, Maryland.
Employee Position Title
Master Police Officer
Social Worker IV
...
Open Payments: payments by health care
companies to medical doctors or hospitals.
Company name Frequency
Pfizer Inc. 79,073
Pfizer Pharmaceuticals LLC 486
Pfizer International LLC 425
Pfizer Limited 13
Pfizer Corporation Hong Kong Limited 4
Pfizer Pharmaceuticals Korea Limited 3
...
Medical charges: patient discharges: utilization,
payment, and hospital-specific charges across 3 000
US hospitals.
...
Nothing on UCI machine-learning data repository
Cardinality slowly increases with the number of rows
[Figure: number of categories vs number of rows (100 to 1M) for beer reviews,
road safety, traffic violations, midwest survey, open payments, employee
salaries, medical charges; reference curves: √n and 5 log₂(n)]
Creates a high-dimensional learning problem
Our goal: a statistical view of supervised learning on dirty categories
The statistical question should inform curation:
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
2 Related work: Database cleaning
Recognizing / merging entities
Record linkage:
matching across different (clean) tables
Deduplication/fuzzy matching:
matching in one dirty table
Techniques [Fellegi and Sunter 1969]
Supervised learning (known matches)
Clustering
Expectation Maximization to learn a metric
Outputs a “clean” database
2 Related work: natural language processing
Stemming / normalization
   Set of (handcrafted) rules
   Needs to be adapted to new languages / new domains
Semantics
   Relate different discrete objects
   Formal semantics (entity resolution in knowledge bases)
   Distributional semantics: “a word is characterized by the company it keeps”
Character-level NLP
   For entity resolution [Klein... 2003]
   For semantics [Bojanowski... 2017]
   “London” & “Londres” may carry different information
2 Similarity encoding: a simple solution
Adding similarities to one-hot encoding
1. One-hot encoding maps categories to vector spaces
2. String similarities capture information
2 Similarity encoding: a simple solution
One-hot encoding
            London  Londres  Paris
Londres        0       1       0
London         1       0       0
Paris          0       0       1
B_X ∈ R^{n×p}
   p grows fast
   new categories?
   link categories?
Similarity encoding
            London  Londres  Paris
Londres       0.3     1.0     0.0
London        1.0     0.3     0.0
Paris         0.0     0.0     1.0
0.3 = string similarity(Londres, London)
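A minimal sketch of this encoding. It uses the standard library's difflib ratio as a stand-in string similarity (an assumption: the slides use n-gram, Levenshtein or Jaro-Winkler similarities), and hypothetical train-set categories:

```python
from difflib import SequenceMatcher

def sim(a, b):
    # difflib's ratio: a stand-in for the string similarities of the next slide
    return SequenceMatcher(None, a, b).ratio()

train_categories = ["London", "Londres", "Paris"]

def similarity_encode(entries):
    # each entry becomes its vector of similarities to the train-set categories
    return [[sim(e, c) for c in train_categories] for e in entries]

X = similarity_encode(["Londres", "London", "Paris"])
```

Unlike one-hot encoding, an entry unseen at train time (say "Londra") still gets a meaningful, non-zero vector, and similar categories get nearby vectors.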
2 Some string similarities
Levenshtein
   number of edit operations to turn one string into the other
Jaro-Winkler
   d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m − t)/(3m)
   m: number of matching characters; t: number of character transpositions
n-gram similarity
   n-gram: group of n consecutive characters
   similarity = (# n-grams in common) / (# n-grams in total)
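The n-gram similarity is easy to state in code. A sketch for n = 3, reading "in common / in total" as set intersection over union (one common variant), with whitespace padding so word boundaries count:

```python
def ngrams(s, n=3):
    s = " " + s.lower() + " "               # pad so boundaries form n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    # shared n-grams over total distinct n-grams (a Jaccard-style ratio)
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)
```

With this padding, ngram_similarity("London", "Londres") comes out to 0.3: 3 shared 3-grams (" lo", "lon", "ond") out of 10 distinct ones.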
2 Empirical study
Datasets with dirty categories

Dataset              # of rows   # of categories   Count of least      Prediction type
                                                   frequent category
medical charges      160k        100               613                 regression
employee salaries    9.2k        385               1                   regression
open payments        100k        973               1                   binary clf
midwest survey       2.8k        1009              1                   multiclass clf
traffic violations   100k        3043              1                   multiclass clf
road safety          10k         4617              1                   binary clf
beer reviews         10k         4634              1                   multiclass clf

7 datasets! All open
Experimental paradigm: cross-validation & measure prediction. Stupid simple.
2 Experiments: gradient boosted trees
[Figure: prediction scores on medical charges, employee salaries, open payments,
midwest survey, traffic violations, road safety and beer reviews, comparing
similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) with target,
one-hot and hash encoding]
Average rankings across datasets: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9
2 Experiments: ridge
[Figure: the same comparison with a ridge learner; average rankings across
datasets: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0]
Best: similarity encoding, with 3-gram similarity
2 Experiments: different learners
[Figure: one-hot encoding vs 3-gram similarity encoding across the seven
datasets, for random forest, gradient boosting, ridge CV and logistic CV;
average rankings: 2.7, 2.4, 2.3, 2.0]
2 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
   ⟨s_i, s_j⟩_sim = Σ_{l=1..k} sim(s_i, s⁽ˡ⁾) sim(s_j, s⁽ˡ⁾)
   sum over the k reference categories s⁽ˡ⁾ of the train set
The categories in the train set shape the similarity
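The kernel above is just an inner product of similarity encodings. A sketch, with difflib's ratio again standing in for sim and hypothetical reference categories (both assumptions):

```python
from difflib import SequenceMatcher

def sim(a, b):
    # stand-in string similarity
    return SequenceMatcher(None, a, b).ratio()

reference = ["London", "Londres", "Paris"]    # the k train-set categories s(l)

def encode(s):
    return [sim(s, c) for c in reference]

def kernel(si, sj):
    # <s_i, s_j>_sim = sum_l sim(s_i, s(l)) * sim(s_j, s(l))
    return sum(a * b for a, b in zip(encode(si), encode(sj)))
```

Changing the reference categories changes the kernel, which is exactly the point: the train-set data shapes the similarity.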
[Figure: the same comparison with two extra encoders, a bag of 3-grams and MDV;
average rankings across datasets: 1.1, 3.1, 3.4, 4.1, 5.3, 6.4, 4.7, 7.3]
Similarity encoding ≫ a feature map that merely captures string similarities
2 Reducing the dimensionality
B_X ∈ R^{n×p}, but p is large:
statistical problems, computational problems
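One standard reduction is a Gaussian random projection. A minimal sketch (the encoded matrix is simulated here; in practice it would be the similarity-encoded B_X):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 200, 1000, 50            # samples, encoded dim, reduced dim
X = rng.standard_normal((n, p))    # stand-in for a similarity-encoded matrix

# Gaussian random projection: pairwise distances and norms are
# approximately preserved (Johnson-Lindenstrauss-style argument)
R = rng.standard_normal((p, d)) / np.sqrt(d)
X_reduced = X @ R
```

The projection is data-independent and cheap; the slides compare it with data-driven reductions such as keeping the most frequent categories or K-means.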
2 Reducing the dimensionality
[Figure: prediction scores per dataset, with the cardinality k of the
categorical variable: employee salaries (k = 355), open payments (k = 910),
midwest survey (k = 644), traffic violations (k = 2588), road safety
(k = 3988), beer reviews (k = 4015); one-hot encoding vs 3-gram similarity
encoding, full and reduced to d = 30, 100, 300 by random projections, most
frequent categories, K-means, and deduplication with K-means; average
rankings across datasets shown per method]
Factorizing one-hot: Multiple Correspondence Analysis
Hashing n-grams (for speed and collisions)
@GaelVaroquaux
Representations in high dimension: factorizations and similarities
signals, entities, categories

Factorizations
   Costly in large-p, large-n
   Sub-sampling p gives huge speed ups
   Stochastic Approximate Majorization-Minimization
   https://github.com/arthurmensch/modl

Similarity encoding for categories
   No separate deduplication / cleaning step
   Creates a category-aware metric space
   https://dirty-cat.github.io
DirtyData project (hiring)
References I
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching
word vectors with subword information. Transactions of the
Association for Computational Linguistics, 5(1):135–146, 2017.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for
learning with dirty categorical variables. Machine Learning,
pages 1–18, 2018.
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal
of the American Statistical Association, 64:1183, 1969.
D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity
recognition with character-level models. In Proceedings of the
seventh conference on Natural language learning at HLT-NAACL
2003-Volume 4, pages 180–183. Association for Computational
Linguistics, 2003.
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
References II
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for
matrix factorization and sparse coding. Journal of Machine
Learning Research, 11:19–60, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary
learning for massive matrix factorization. In ICML, 2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic
subsampling for factorizing huge matrices. IEEE Transactions on
Signal Processing, 66(1):113–128, 2017.
 
Massive Matrix Factorization : Applications to collaborative filtering
Massive Matrix Factorization : Applications to collaborative filteringMassive Matrix Factorization : Applications to collaborative filtering
Massive Matrix Factorization : Applications to collaborative filtering
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
 
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 
Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)
 
Introduction to computing Processing and performance.pdf
Introduction to computing Processing and performance.pdfIntroduction to computing Processing and performance.pdf
Introduction to computing Processing and performance.pdf
 
Distributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUsDistributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUs
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
 
Matrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender SystemsMatrix Factorizations for Recommender Systems
Matrix Factorizations for Recommender Systems
 
Introduction to Algorithms
Introduction to AlgorithmsIntroduction to Algorithms
Introduction to Algorithms
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
Introduction to Artificial Neural Networks
Introduction to Artificial Neural NetworksIntroduction to Artificial Neural Networks
Introduction to Artificial Neural Networks
 
Analysis of Algorithum
Analysis of AlgorithumAnalysis of Algorithum
Analysis of Algorithum
 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
 
Feedback Particle Filter and its Applications to Neuroscience
Feedback Particle Filter and its Applications to NeuroscienceFeedback Particle Filter and its Applications to Neuroscience
Feedback Particle Filter and its Applications to Neuroscience
 
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in R
 

Mehr von Gael Varoquaux

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueGael Varoquaux
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingGael Varoquaux
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing valuesGael Varoquaux
 
Dirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataDirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataGael Varoquaux
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settingsGael Varoquaux
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Gael Varoquaux
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Gael Varoquaux
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingGael Varoquaux
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomesGael Varoquaux
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingGael Varoquaux
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible scienceGael Varoquaux
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovationGael Varoquaux
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data scienceGael Varoquaux
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataGael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsityGael Varoquaux
 

Mehr von Gael Varoquaux (20)

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic value
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imaging
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing values
 
Dirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataDirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated data
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settings
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mapping
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data science
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsity
 

Kürzlich hochgeladen

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Kürzlich hochgeladen (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

Simple representations for learning: factorizations and similarities

  • 1. Simple representations for learning: factorizations and similarities. Gaël Varoquaux
  • 2. Settings: very high dimensionality. Signals (images, spectra); many entities (customers, products); non-standardized categories (typos, variants). Exploit links & redundancy across features. G Varoquaux 2
  • 3. 1 Factorizing huge matrices. 2 Encoding with similarities.
  • 4. 1 Factorizing huge matrices, with A. Mensch, J. Mairal, B. Thirion [Mensch... 2016, 2017]. [Figure: matrix factorization Y ≈ E · S^T + N, samples × features.] Challenge: scalability. 1 Intuitions, 2 Experiments, 3 Algorithms, 4 Proof.
  • 5. 1 Real-world data: recommender systems. [Figure: sparsely filled matrix of product ratings, users × products, Y ≈ E · S^T + N.] Product ratings: millions of entries, hundreds of thousands of products and users. Large sparse matrix.
  • 6. 1 Real-world data: brain imaging. Brain activity at rest: 1000 subjects with ∼ 100–10 000 samples each, images of dimensionality > 100 000. Dense matrix, large both ways. [Figure: time × voxels matrices, Y ≈ E · S^T + N.]
  • 7. 1 Scalable solvers for matrix factorizations. Large matrices = terabytes of data. argmin_{E,S} ||Y − E S^T||²_Fro + λ Ω(S)
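This objective can be minimized by alternating over E and S. A minimal numpy sketch of the batch version, with an l1 penalty standing in for Ω and illustrative solver choices (exact least squares for E, one proximal-gradient step for S); all variable names follow the slide:

```python
import numpy as np

# Sketch of argmin_{E,S} ||Y - E S^T||^2_Fro + lam * ||S||_1
# via alternating minimization. The l1 choice for Omega and the
# solvers (least squares for E, one ISTA step for S) are illustrative.
rng = np.random.default_rng(0)
n, p, k = 100, 30, 5
Y = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))

E = rng.standard_normal((n, k))
S = rng.standard_normal((p, k))
lam = 0.1

def objective(E, S):
    return np.sum((Y - E @ S.T) ** 2) + lam * np.abs(S).sum()

obj_start = objective(E, S)
for _ in range(50):
    # E-step: exact least squares with S fixed (solves Y^T = S E^T)
    E = np.linalg.lstsq(S, Y.T, rcond=None)[0].T
    # S-step: one proximal-gradient (ISTA) step with E fixed
    lipschitz = 2 * np.linalg.norm(E.T @ E, 2)
    grad = 2 * (E @ S.T - Y).T @ E          # gradient of the smooth part
    S = S - grad / lipschitz
    S = np.sign(S) * np.maximum(np.abs(S) - lam / lipschitz, 0)  # soft-threshold
obj_end = objective(E, S)
```

Against the slide's scalability point: each sweep of this batch scheme touches all of Y, which is exactly what becomes infeasible at terabyte scale and motivates the online rewrite on the following slides.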
  • 8. 1 Scalable solvers for matrix factorizations. [Diagram: alternating minimization over the data matrix: data access, dictionary update, code computation; columns seen at t, seen at t+1, unseen at t.] Large matrices = terabytes of data. argmin_{E,S} ||Y − E S^T||²_Fro + λ Ω(S)
  • 9. 1 Scalable solvers for matrix factorizations. Large matrices = terabytes of data. argmin_{E,S} ||Y − E S^T||²_Fro + λ Ω(S). Rewrite as an expectation [Mairal... 2010]: argmin_E Σ_i min_s ||Y_i − E s^T||²_Fro + λ Ω(s), i.e. argmin_E E[f(E)] ⇒ optimize on approximations (sub-samples)
  • 10. 1 Scalable solvers for matrix factorizations. [Diagram: online matrix factorization: stream columns for data access, then dictionary update and code computation; alternating minimization over the data matrix, columns seen at t, seen at t+1, unseen at t.]
  • 11. 1 Scalable solvers for matrix factorizations. Online matrix factorization [Mairal... 2010]: 159 h run time on 2 terabytes of data, versus 12 h run time on 100 gigabytes of data. [Diagram: stream columns; dictionary update; code computation.]
  • 12. 1 Scalable solvers for matrix factorizations – SOMF. New subsampling algorithm: stream columns for data access, subsample rows for code computation, on top of online matrix factorization with alternating minimization. Subsampled Online Matrix Factorization = SOMF.
  • 13. 1 Scalable solvers for matrix factorizations – SOMF. Subsampled Online Matrix Factorization = SOMF. Run times: 159 h on 2 terabytes of data; 12 h on 100 gigabytes with online matrix factorization [Mairal... 2010]; 13 h on 1 terabyte with SOMF [Mensch... 2017], a ×10 speed-up.
  • 14. 1 Experimental results: resting-state fMRI. [Plot: test objective value vs. time on HCP (3.5 TB), comparing SGD (best step-size), online matrix factorization, and the proposed SOMF (r = 12).] SOMF = Subsampled Online Matrix Factorization
  • 15. 1 Experimental results: large images. [Plots: test objective value vs. time for ADHD sparse dictionary (2 GB), Aviris NMF (103 GB), Aviris dictionary learning (103 GB), and HCP sparse dictionary (2 TB); OMF vs. SOMF with subsampling ratios r = 1, 4, 6, 8, 12, 24, and best step-size SGD.] SOMF = Subsampled Online Matrix Factorization
  • 16. 1 Experimental results: recommender system. SOMF = Subsampled Online Matrix Factorization
  • 17. 1 Algorithm: online matrix factorization, prior art. Stream samples x_t [Mairal... 2010]: 1. Compute the code: α_t = argmin_{α ∈ R^k} ||x_t − D_{t−1} α||²_2 + λ Ω(α). 2. Update the surrogate function: g_t(D) = (1/t) Σ_{i=1}^t ||x_i − D α_i||²_2 = trace(½ D^T D A_t − D^T B_t), with A_t = (1 − 1/t) A_{t−1} + (1/t) α_t α_t^T and B_t = (1 − 1/t) B_{t−1} + (1/t) x_t α_t^T. 3. Minimize the surrogate: D_t = argmin_{D ∈ C} g_t(D), with ∇g_t = D A_t − B_t.
  • 18. 1 Algorithm: online matrix factorization, prior art. Stream samples x_t [Mairal... 2010], with the same three steps as slide 17. g_t(D) is a surrogate of E_x l(x, D): the stored codes α_i are used, not the optimal codes for the current D ⇒ Stochastic Majorization-Minimization. No nasty hyper-parameters.
  • 19. 1 Algorithm: online matrix factorization, prior art. Stream samples x_t [Mairal... 2010]: 1. Compute the code (complexity depends on p): α_t = argmin_{α ∈ R^k} ||x_t − D_{t−1} α||²_2 + λ Ω(α). 2. Update the surrogate function, O(p). 3. Minimize the surrogate, O(p): D_t = argmin_{D ∈ C} g_t(D), with ∇g_t = D A_t − B_t.
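The three steps above can be sketched in a few lines of numpy. This is a toy illustration of the online scheme of [Mairal... 2010], not a reference implementation: a ridge penalty stands in for Ω so the code step has a closed form, and the stream is synthetic data from a known dictionary:

```python
import numpy as np

# Toy online matrix factorization: stream samples x_t, maintain the
# surrogate statistics A_t, B_t, and update the dictionary D with one
# block-coordinate pass per sample. Ridge codes (closed form) stand in
# for the slide's generic Omega.
rng = np.random.default_rng(0)
p, k, lam = 50, 8, 0.01
D_true = rng.standard_normal((p, k))      # ground-truth dictionary

D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)            # columns in the unit ball
D_init = D.copy()
A = np.zeros((k, k))                      # running average of alpha alpha^T
B = np.zeros((p, k))                      # running average of x alpha^T

def code(x, D):
    # Step 1: alpha_t = argmin ||x - D a||^2 + lam ||a||^2 (ridge)
    return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)

for t in range(1, 2001):
    x = D_true @ rng.standard_normal(k)   # one streamed sample
    alpha = code(x, D)
    # Step 2: surrogate statistics A_t, B_t as running averages
    A = (1 - 1 / t) * A + np.outer(alpha, alpha) / t
    B = (1 - 1 / t) * B + np.outer(x, alpha) / t
    # Step 3: minimize the surrogate, one block-coordinate pass on D
    for j in range(k):
        if A[j, j] > 1e-12:
            D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))  # project to ball

x_new = D_true @ rng.standard_normal(k)   # held-out sample
def rel_error(D_):
    return np.linalg.norm(x_new - D_ @ code(x_new, D_)) / np.linalg.norm(x_new)

err_init, err_learned = rel_error(D_init), rel_error(D)
```

After streaming, the learned dictionary reconstructs a held-out sample far better than the random initial one, without ever holding more than one sample in memory.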
  • 20. 1 Sub-sample features. Data stream: (x_t)_t → masked (M_t x_t)_t; dimension: p → s. Use only M_t x_t in the computations → complexity in O(s). [Diagram: stream the masked entries M_t x_t of the p × n matrix, ignore the rest.] Modify all steps to work on s features: code computation, surrogate update, surrogate minimization.
  • 21. 1 Sub-sample features. Original online MF: 1. Code computation: α_t = argmin_{α ∈ R^k} ||x_t − D_{t−1} α||²_2 + λ Ω(α). 2. Surrogate aggregation: A_t = (1/t) Σ_{i=1}^t α_i α_i^T, B_t = B_{t−1} + (1/t)(x_t α_t^T − B_{t−1}). 3. Surrogate minimization: D^j ← proj_{C_r^j}(D^j − (1/(A_t)_{j,j})(D A_t^j − B_t^j)). Our algorithm: 1. Approximate code computation, masked: β_t ← (1 − γ) β_{t−1} + γ D_{t−1}^T M_t x_t; G_t ← (1 − γ) G_{t−1} + γ D_{t−1}^T M_t D_{t−1}; α_t ← argmin_{α ∈ R^k} ½ α^T G_t α − α^T β_t + λ Ω(α). 2. Surrogate aggregation, averaging: A_t = (1/w_t) α_t α_t^T + (1 − 1/w_t) A_{t−1}; P_t B̄_t ← (1 − w_t) P_t B̄_{t−1} + w_t P_t x_t α_t^T. 3. Surrogate minimization on the seen rows: P_t D_t ← argmin_{D_r ∈ C_r} ½ tr(D_r^T D_r Ā_t) − tr(D_r^T P_t B̄_t); unseen rows: P_t^⊥ B̄_t ← (1 − w_t) P_t^⊥ B̄_{t−1} + w_t P_t^⊥ x_t α_t^T.
  • 22. 1 Sub-sample features: variance reduction. [Same two-column algorithm as slide 21.] [Plot: test objective function vs. time, zoomed relative to the lowest value, for subsampling ratios none, r = 12, r = 24; code computation with no subsampling, averaged estimators, and the masked loss.]
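The key trick, taken in isolation: with a random mask keeping s of p features, suitably rescaled masked products are unbiased estimates of the full ones, and the running averages above shrink their variance. A small numpy check of just the masked estimators G ≈ D^T D and β ≈ D^T x for a fixed D and x; γ and the sizes are illustrative choices:

```python
import numpy as np

# Masked, O(s)-per-step estimates of the code-computation statistics
# G ~ D^T D and beta ~ D^T x, built from only s of p features per step,
# rescaled by p/s for unbiasedness and smoothed by a running average
# with step gamma (illustrative value).
rng = np.random.default_rng(0)
p, k, s = 200, 5, 40                      # keep s of p features
D = rng.standard_normal((p, k))
x = rng.standard_normal(p)

G = np.zeros((k, k))
beta = np.zeros(k)
gamma = 0.01
for t in range(5000):
    mask = rng.choice(p, size=s, replace=False)
    Dm, xm = D[mask], x[mask]
    scale = p / s                         # inverse inclusion probability
    G = (1 - gamma) * G + gamma * scale * Dm.T @ Dm
    beta = (1 - gamma) * beta + gamma * scale * Dm.T @ xm

err_G = np.linalg.norm(G - D.T @ D) / np.linalg.norm(D.T @ D)
err_beta = np.linalg.norm(beta - D.T @ x) / np.linalg.norm(D.T @ x)
```

In SOMF the mask changes every step while D and the stream also evolve; this snippet only isolates why the O(s) masked updates remain consistent estimates of the full O(p) quantities.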
  • 23. 1 Why does it work? Objective: D = argmin_{D ∈ C} E_x l(x, D), where l(x, D) = min_α f(x, D, α). Algorithm (online matrix factorization): g_t(D) is a majorant of E_x l(x, D); the stored α_i is used, not the optimal code ⇒ Stochastic Majorization-Minimization [Mairal 2013]
  • 24. 1 Why does it work? Objective: D = argmin D∈C x l(x, D) where l(x, D) = minα f (x, D, α) Algorithm (online matrix factorization) gt(D) majorant = x l(x, D) αi is used, and not α ⇒ Stochastic Majorization-Minimization [Mairal 2013] Surrogate computation SMM Full minimization G Varoquaux 14
  • 25. 1 Stochastic Approximate Majorization-Minimization Objective: D = argmin D∈C x l(x, D) where l(x, D) = minα f (x, D, α) Algorithm (online matrix factorization) gt(D) majorant = x l(x, D) αi is used, and not α ⇒ Stochastic Majorization-Minimization [Mairal 2013] Surrogate computation Surrogate approximation Partial minimization SMM SAMM Full minimization G Varoquaux 14
  • 26. [Diagram: Y = E · S + N, samples × features] Massive matrix factorization via subsampling — Subsampling features ⇒ doubly stochastic algorithm; 10× speed-ups over an already fast algorithm; analysis via stochastic approximate majorization-minimization; conclusive on various high-dimensional problems.
  • 27–29. [Diagram: Y = E · S + N, samples × features] 2 Encoding with similarities — with P. Cerda and B. Kégl [Cerda... 2018]. When categories create a huge dimensionality.
  Machine learning: let X ∈ R^{n×p}. The real world:
    Gender | Date Hired | Employee Position Title
    M | 09/12/1988 | Master Police Officer
    F | 11/19/1989 | Social Worker IV
    M | 07/16/2007 | Police Officer III
    F | 02/05/2007 | Police Aide
    M | 01/13/2014 | Electrician I
    F | 06/26/2006 | Social Worker III
    F | 01/26/2000 | Library Assistant I
    M | 11/22/2010 | Library Assistant I
  A data cleaning problem? A feature engineering problem? A problem of representations in high dimension.
  • 30. 2 The problem of “dirty categories” — Non-curated categorical entries (Employee Position Title: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Social Worker III, Library Assistant I, ...). Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality: 400 unique entries in 10 000 rows. Rare categories: only 1 “Architect III”. New categories in the test set.
  • 31–33. 2 Dirty categories in the wild
  Employee Salaries: salary information for employees of Montgomery County, Maryland.
  Open Payments: payments by health care companies to medical doctors or hospitals. E.g. company names and their frequencies:
    Pfizer Inc.: 79,073; Pfizer Pharmaceuticals LLC: 486; Pfizer International LLC: 425; Pfizer Limited: 13; Pfizer Corporation Hong Kong Limited: 4; Pfizer Pharmaceuticals Korea Limited: 3; ...
  Medical charges: patient discharges: utilization, payment, and hospital-specific charges across 3 000 US hospitals.
  ... Nothing on the UCI machine-learning data repository.
  • 34. 2 Dirty categories in the wild — Cardinality slowly increases with the number of rows. [Plot: number of categories vs number of rows for beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, medical charges; reference curves √n and 5 log₂(n).] Creates a high-dimensional learning problem.
  • 35. 2 Dirty categories in the wild — Our goal: a statistical view of supervised learning on dirty categories. The statistical question should inform curation: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea.
  • 36. 2 Related work: Database cleaning Recognizing / merging entities Record linkage: matching across different (clean) tables Deduplication/fuzzy matching: matching in one dirty table Techniques [Fellegi and Sunter 1969] Supervised learning (known matches) Clustering Expectation Maximization to learn a metric Outputs a “clean” database G Varoquaux 19
  • 37–39. 2 Related work: natural language processing
  Stemming / normalization: a set of (handcrafted) rules; needs to be adapted to new languages / new domains.
  Semantics: relate different discrete objects. Formal semantics (entity resolution in knowledge bases). Distributional semantics: “a word is characterized by the company it keeps”.
  Character-level NLP: for entity resolution [Klein... 2003]; for semantics [Bojanowski... 2017]. “London” & “Londres” may carry different information.
  • 40. 2 Similarity encoding: a simple solution Adding similarities to one-hot encoding 1. One-hot encoding maps categories to vector spaces 2. String similarities capture information G Varoquaux 21
  • 41–42. 2 Similarity encoding: a simple solution
  One-hot encoding:
               London  Londres  Paris
    Londres      0       1        0
    London       1       0        0
    Paris        0       0        1
  X ∈ R^{n×p}: p grows fast; new categories? link categories?
  Similarity encoding:
               London  Londres  Paris
    Londres     0.3      1.0     0.0
    London      1.0      0.3     0.0
    Paris       0.0      0.0     1.0
  0.3 = string similarity between “Londres” and “London”
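  The encoder sketched in the tables above replaces each category by its vector of similarities to the categories seen at train time. A minimal, hypothetical sketch (not the dirty_cat library API), with difflib's `ratio` standing in for a string similarity:

```python
from difflib import SequenceMatcher
import numpy as np

def similarity_encode(values, vocabulary):
    """Encode strings as similarities to a reference vocabulary.

    One-hot encoding is the special case where the similarity is
    the 0/1 exact-match indicator.
    """
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    return np.array([[sim(v, ref) for ref in vocabulary] for v in values])

vocab = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris"], vocab)
```

  Unseen categories at test time no longer break the encoding: they simply get a dense similarity vector against the train-time vocabulary.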
  • 43. 2 Some string similarities
  Levenshtein: number of edit operations on one string to match the other.
  Jaro: d_jaro(s1, s2) = (1/3) ( m/|s1| + m/|s2| + (m − t)/m ), with m the number of matching characters and t the number of character transpositions.
  n-gram similarity: an n-gram is a group of n consecutive characters; similarity = (# n-grams in common) / (# n-grams in total).
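  The n-gram similarity above fits in a few lines. This is a sketch; the exact variant used in the paper may differ (e.g. in padding, or in counting n-grams as multisets rather than sets):

```python
def ngram_similarity(s1, s2, n=3):
    """Jaccard-style similarity between the sets of character n-grams."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    g1, g2 = grams(s1.lower()), grams(s2.lower())
    if not g1 and not g2:
        return 1.0   # both strings shorter than n: treat as identical
    return len(g1 & g2) / len(g1 | g2)
```

  On dirty categories this behaves as wanted: near-duplicate entries such as “Police Officer III” and “Police Officer II” score much higher than unrelated ones.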
  • 44. 2 Empirical study — Datasets with dirty categories:
    Dataset            | # rows | # categories | Least frequent category (count) | Prediction type
    medical charges    | 160k   |  100         | 613 | regression
    employee salaries  | 9.2k   |  385         |   1 | regression
    open payments      | 100k   |  973         |   1 | binary clf
    midwest survey     | 2.8k   | 1009         |   1 | multiclass clf
    traffic violations | 100k   | 3043         |   1 | multiclass clf
    road safety        | 10k    | 4617         |   1 | binary clf
    beer reviews       | 10k    | 4634         |   1 | multiclass clf
  7 datasets! All open. Experimental paradigm: cross-validation & measure prediction. Stupid. Simple.
  • 45–49. 2 Experiments: gradient boosted trees — [Bar plots: prediction scores on the seven datasets (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding with 3-gram, Levenshtein ratio, and Jaro-Winkler similarities, vs target encoding, one-hot encoding, and hash encoding; average rankings across datasets: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9.]
  • 50. 2 Experiments: ridge — [Bar plots: same encoders, ridge learner; average rankings across datasets: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0.] Similarity encoding, with 3-gram similarity.
  • 51. 2 Experiments: different learners — [Bar plots: one-hot encoding vs 3-gram similarity encoding across Random Forest, Gradient Boosting, Ridge CV, and Logistic CV on the seven datasets; average rankings: 2.7, 2.4, 2.3, 2.0.]
  • 52–53. 2 This is just a string similarity? What similarity is defined by our encoding? (a kernel)
  ⟨s_i, s_j⟩_sim = Σ_{l=1..k} sim(s_i, s^(l)) sim(s_j, s^(l))
  A sum over the reference categories: the categories in the train set shape the similarity.
  [Bar plots on the seven datasets: similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) vs bag of 3-grams, target encoding, MDV, one-hot encoding, and hash encoding; average rankings: 1.1, 3.1, 3.4, 4.1, 5.3, 6.4, 4.7, 7.3.]
  Similarity encoding ≫ a plain feature map capturing string similarities.
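  The kernel identity above is easy to check numerically: the dot product between two rows of the similarity-encoded matrix is exactly the sum over reference categories. A small sketch with hypothetical helper names and a toy character-overlap similarity (an assumption, for illustration only):

```python
import numpy as np

def encode(strings, references, sim):
    # Each row: similarities of one string to all reference categories
    return np.array([[sim(s, r) for r in references] for s in strings])

def char_sim(a, b):
    # Toy similarity: Jaccard overlap of character sets (illustrative only)
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

refs = ["London", "Londres", "Paris"]          # train-set categories
E = encode(["London", "Paris"], refs, char_sim)

# Kernel between the two encoded strings, as a dot product of rows:
k_dot = E[0] @ E[1]
# Same quantity, written as the sum over reference categories:
k_sum = sum(char_sim("London", r) * char_sim("Paris", r) for r in refs)
```

  Changing the train-set categories changes `refs`, and with it the induced kernel — which is exactly the point made on the slide.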
  • 54. 2 Reducing the dimensionality — X ∈ R^{n×p}, but p is large: statistical problems, computational problems.
  • 55–59. 2 Reducing the dimensionality — [Bar plots: prediction scores on employee salaries (k = 355), open payments (k = 910), midwest survey (k = 644), traffic violations (k = 2588), road safety (k = 3988), beer reviews (k = 4015), where k is the cardinality of the categorical variable; one-hot encoding vs 3-gram similarity encoding, each reduced to d = 30, 100, 300 or kept full, via random projections, most frequent categories, k-means, and deduplication with k-means; average rankings across datasets.] Also possible: factorizing one-hot encoding (Multiple Correspondence Analysis); hashing n-grams (for speed and collisions).
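  Hashing n-grams, mentioned above, keeps the dimensionality fixed regardless of vocabulary size, at the price of collisions. A minimal sketch (hypothetical code; scikit-learn's HashingVectorizer with a character analyzer offers a production-grade version):

```python
import zlib
import numpy as np

def hashed_ngram_encode(strings, n=3, dim=128):
    """Map each string to a fixed-size vector of hashed character n-grams."""
    X = np.zeros((len(strings), dim))
    for i, s in enumerate(strings):
        s = " %s " % s.lower()        # pad so word boundaries form n-grams
        for j in range(len(s) - n + 1):
            # crc32 as a cheap deterministic hash into `dim` buckets
            X[i, zlib.crc32(s[j:j + n].encode()) % dim] += 1
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)   # l2-normalize rows
```

  With a fixed `dim`, new categories at test time cost nothing, and near-duplicate entries still land close to each other in the encoded space.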
  • 60–62. @GaelVaroquaux — Representations in high dimension: factorizations and similarities (signals, entities, categories)
  Factorizations: costly in large-p, large-n; sub-sampling p gives huge speed-ups; Stochastic Approximate Majorization-Minimization. https://github.com/arthurmensch/modl
  Similarity encoding for categories: no separate deduplication / cleaning step; creates a category-aware metric space. https://dirty-cat.github.io
  DirtyData project (hiring)
  • 63. References I
  P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
  P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
  I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969.
  D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 180–183. Association for Computational Linguistics, 2003.
  J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013.
  • 64. References II
  J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
  A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML, 2016.
  A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2017.