Real-life data seldom comes in the ideal form for statistical learning.
This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, and to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products, or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing, and show how it interpolates between one-hot encoding and techniques used in character-level natural language processing [2].
[1] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2018.
[2] P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
2. Settings: Very high dimensionality
- signals (images, spectra)
- many entities (customers, products)
- non-standardized categories (typos, variants)
Exploit links & redundancy across features
4. 1 Factorizing huge matrices
with A. Mensch, J. Mairal, B. Thirion
[Mensch... 2016, 2017]
Y = E · Sᵀ + N    (Y: samples × features)
Challenge: scalability
1 Intuitions
2 Experiments
3 Algorithms
4 Proof
5. 1 Real world data: recommender systems
[Figure: large sparse matrix of product ratings]
Product ratings: millions of entries; hundreds of thousands of products and users. A large sparse matrix.
Y = E · Sᵀ + N    (Y: users × products)
6. 1 Real world data: brain imaging
Brain activity at rest: 1 000 subjects with ∼100–10 000 samples each; images of dimensionality > 100 000. A dense matrix, large both ways.
Y = E · Sᵀ + N    (Y: time × voxels)
7. 1 Scalable solvers for matrix factorizations
Large matrices = terabytes of data
$$\operatorname*{argmin}_{E,S}\ \|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$$
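To fix notation, a minimal NumPy sketch of this objective, taking Ω as the ℓ1 norm; the shapes and variable names are illustrative, not those of the actual solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 1000, 200, 10, 0.1      # samples, features, components, penalty
Y = rng.standard_normal((n, p))        # data matrix
E = rng.standard_normal((n, k))        # left factor (codes)
S = rng.standard_normal((p, k))        # right factor (dictionary)

# ||Y - E S^T||^2_Fro + lam * Omega(S), with Omega = l1 norm here
objective = np.sum((Y - E @ S.T) ** 2) + lam * np.abs(S).sum()
```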
8. 1 Scalable solvers for matrix factorizations
Alternating minimization on this objective cycles through three steps: data access, code computation, dictionary update.
[Figure: data matrix; columns seen at t, seen at t+1, unseen at t]
9. 1 Scalable solvers for matrix factorizations
Large matrices = terabytes of data
$$\operatorname*{argmin}_{E,S}\ \|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$$
Rewrite as an expectation [Mairal... 2010]:
$$\operatorname*{argmin}_{E}\ \sum_i \min_s \|Y_i - E\,s^\top\|_2^2 + \lambda\,\Omega(s) \;=\; \operatorname*{argmin}_{E}\ \mathbb{E}\bigl[f(E)\bigr]$$
⇒ Optimize on approximations (sub-samples)
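For intuition, a hedged sketch of this expectation view: a ridge penalty stands in for λΩ so that the inner minimization has a closed form, and a mean over a random subset of columns approximates 𝔼[f(E)]. The helper per_column_loss is hypothetical:

```python
import numpy as np

def per_column_loss(y_i, E, lam):
    """f_i(E) = min_s ||y_i - E s||^2 + lam ||s||^2, closed form for ridge."""
    k = E.shape[1]
    s = np.linalg.solve(E.T @ E + lam * np.eye(k), E.T @ y_i)
    return np.sum((y_i - E @ s) ** 2) + lam * np.sum(s ** 2)

rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 100))
E = rng.standard_normal((500, 10))
# The objective is an expectation over columns: estimate it on a sub-sample.
cols = rng.choice(Y.shape[1], size=10, replace=False)
f_estimate = np.mean([per_column_loss(Y[:, i], E, 0.1) for i in cols])
```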
10. 1 Scalable solvers for matrix factorizations
Online matrix factorization: stream columns for the data access step; code computation and dictionary update proceed as in alternating minimization.
[Figure: data matrix; columns seen at t, seen at t+1, unseen at t]
11. 1 Scalable solvers for matrix factorizations
Online matrix factorization [Mairal... 2010], streaming columns:
- 2 terabytes of data: 159 h run time
- 100 gigabytes of data: 12 h run time
12. 1 Scalable solvers for matrix factorizations – SOMF
New subsampling algorithm: stream columns for data access, and subsample rows within the code computation.
Subsampled Online Matrix Factorization = SOMF
[Figure: data matrix; streamed columns, subsampled rows]
13. 1 Scalable solvers for matrix factorizations – SOMF
Subsampled Online Matrix Factorization = SOMF: stream columns, subsample rows.
Run times:
- Online matrix factorization [Mairal... 2010]: 159 h on 2 terabytes of data; 12 h on 100 gigabytes
- SOMF [Mensch... 2017]: 13 h on 1 terabyte of data
⇒ ×10 speed-up
14. 1 Experimental results: resting-state fMRI
[Figure: test objective value vs. time (100 s to 24 h) on HCP (3.5 TB), comparing SGD with the best step-size, online matrix factorization, and the proposed SOMF (r = 12)]
SOMF = Subsampled Online Matrix Factorization
15. 1 Experimental results: large images
[Figure: test objective value vs. time for OMF, SOMF (r = 1, 4, 6, 8, 12, 24), and best step-size SGD, on four problems: ADHD (sparse dictionary, 2 GB), Aviris (NMF, 103 GB), Aviris (dictionary learning, 103 GB), HCP (sparse dictionary, 2 TB)]
SOMF = Subsampled Online Matrix Factorization
16. 1 Experimental results: recommender system
SOMF = Subsampled Online Matrix Factorization
17. 1 Algorithm: online matrix factorization prior art
Stream samples $x_t$ [Mairal... 2010]:
1. Compute the code (complexity depends on p):
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\,\alpha\|_2^2 + \lambda\,\Omega(\alpha)$$
2. Update the surrogate function, in O(p):
$$g_t(D) = \frac{1}{t}\sum_{i=1}^{t} \|x_i - D\,\alpha_i\|_2^2 = \operatorname{trace}\Bigl(\frac{1}{2}\,D^\top D\,A_t - D^\top B_t\Bigr)$$
$$A_t = \bigl(1 - \tfrac{1}{t}\bigr) A_{t-1} + \tfrac{1}{t}\,\alpha_t \alpha_t^\top, \qquad B_t = \bigl(1 - \tfrac{1}{t}\bigr) B_{t-1} + \tfrac{1}{t}\,x_t \alpha_t^\top$$
3. Minimize the surrogate, in O(p):
$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} g_t(D), \qquad \nabla g_t = D\,A_t - B_t$$
$g_t(D)$ is a surrogate (majorant) of $\sum_x l(x, D)$: the stored codes $\alpha_i$ are used, not the minimizers $\alpha^\star$.
⇒ Stochastic Majorization-Minimization, with no nasty hyper-parameters.
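A compact sketch of this loop, under simplifying assumptions: a ridge code step stands in for the lasso, and C is taken as columns of norm at most 1. Illustrative only, not the reference modl implementation:

```python
import numpy as np

def online_mf(stream, p, k, lam=0.1, seed=0):
    """Online matrix factorization, following the three steps above."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)              # start inside the constraint set C
    A, B = np.zeros((k, k)), np.zeros((p, k))
    for t, x in enumerate(stream, start=1):
        # 1. Code computation (ridge stand-in for the sparse lasso step)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Surrogate update: running averages A_t, B_t
        A += (np.outer(alpha, alpha) - A) / t
        B += (np.outer(x, alpha) - B) / t
        # 3. Surrogate minimization: block coordinate descent on D's columns,
        #    projected onto C = {columns of norm <= 1}
        for j in range(k):
            if A[j, j] > 1e-12:
                D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))
    return D

# Usage: stream the columns of a data matrix Y, each with p features
Y = np.random.default_rng(1).standard_normal((200, 1000))
D = online_mf(iter(Y.T), p=200, k=10)
```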
20. 1 Sub-sample features
Data stream: (x_t)_t → masked stream (M_t x_t)_t; dimension: p → s.
Use only M_t x_t in the computations → complexity in O(s).
[Figure: n × p data matrix; streamed columns with ignored (masked) features]
Modify all steps to work on s features: code computation, surrogate update, surrogate minimization.
22. 1 Sub-sample features – variance reduction
Original online MF:
1. Code computation: $\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$
2. Surrogate aggregation: $A_t = \frac{1}{t}\sum_{i=1}^{t} \alpha_i\alpha_i^\top$, $\quad B_t = B_{t-1} + \frac{1}{t}\bigl(x_t\alpha_t^\top - B_{t-1}\bigr)$
3. Surrogate minimization: $D^j \leftarrow \mathrm{proj}_{\mathcal{C}^r_j}\bigl(D^j - \tfrac{1}{(A_t)_{j,j}}(D A_t^j - B_t^j)\bigr)$

Our algorithm (SOMF), on masked data:
1. Approximate code computation:
$$\beta_t^{(i)} \leftarrow (1-\gamma)\,\beta_{t-1}^{(i)} + \gamma\, D_{t-1}^\top M_t x^{(i)}, \qquad G_t^{(i)} \leftarrow (1-\gamma)\, G_{t-1}^{(i)} + \gamma\, D_{t-1}^\top M_t D_{t-1}$$
$$\alpha_t \leftarrow \operatorname*{argmin}_{\alpha \in \mathbb{R}^k}\ \tfrac{1}{2}\,\alpha^\top G_t \alpha - \alpha^\top \beta_t + \lambda\,\Omega(\alpha)$$
2. Surrogate aggregation, with averaging:
$$A_t = \tfrac{1}{w_t}\,\alpha_t\alpha_t^\top + \bigl(1 - \tfrac{1}{w_t}\bigr) A_{t-1}, \qquad P_t\bar{B}_t \leftarrow (1-w_t)\,P_t\bar{B}_{t-1} + w_t\,P_t\,x_t\alpha_t^\top$$
$$P_t^{\perp}\bar{B}_t \leftarrow (1-w_t)\,P_t^{\perp}\bar{B}_{t-1} + w_t\,P_t^{\perp}\,x_t\alpha_t^\top$$
3. Surrogate minimization, restricted to the seen rows:
$$P_t D_t \leftarrow \operatorname*{argmin}_{D^r \in \mathcal{C}^r}\ \tfrac{1}{2}\operatorname{tr}\bigl(D^{r\top} D^r \bar{A}_t\bigr) - \operatorname{tr}\bigl(D^{r\top} P_t \bar{B}_t\bigr)$$
[Figure: test objective function (relative to lowest value) vs. time and vs. subsampling ratio (none, r = 12, r = 24), comparing code computation with no subsampling, with the masked loss (a), and with averaged estimators (c)]
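A sketch of the heart of SOMF's step 1, under strong simplifications (one shared estimator instead of per-sample ones, a ridge penalty, illustrative γ and mask size); the exact weights and updates are those of [Mensch... 2017]:

```python
import numpy as np

def somf_code_step(x, D, G, beta, gamma=0.05, s_ratio=0.25, lam=0.1, rng=None):
    """One subsampled code computation: only a mask M_t of features is touched."""
    rng = rng or np.random.default_rng()
    p, k = D.shape
    mask = rng.choice(p, size=int(s_ratio * p), replace=False)   # M_t, s features
    D_m, x_m = D[mask], x[mask]                                  # O(s) views
    # Moving averages reduce the variance of the masked Gram/target estimates;
    # dividing by s_ratio rescales the sub-sampled sums toward full-data values.
    G = (1 - gamma) * G + gamma * (D_m.T @ D_m) / s_ratio
    beta = (1 - gamma) * beta + gamma * (D_m.T @ x_m) / s_ratio
    alpha = np.linalg.solve(G + lam * np.eye(k), beta)           # ridge stand-in
    return alpha, G, beta

# G and beta start at np.zeros((k, k)) and np.zeros(k), then persist across steps.
```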
23. 1 Why does it work?
Objective:
$$D^\star = \operatorname*{argmin}_{D \in \mathcal{C}} \sum_x l(x, D), \quad \text{where } l(x, D) = \min_\alpha f(x, D, \alpha)$$
Algorithm (online matrix factorization): $g_t(D)$ is a majorant of $\sum_x l(x, D)$; the stored codes $\alpha_i$ are used, not the minimizers $\alpha^\star$.
⇒ Stochastic Majorization-Minimization [Mairal 2013]
25. 1 Stochastic Approximate Majorization-Minimization
Same objective; SOMF only approximates the surrogate and minimizes it partially:
SMM: surrogate computation → full minimization
SAMM: surrogate approximation → partial minimization
26. Massive matrix factorization via subsampling
Y = E · Sᵀ + N    (Y: samples × features)
Subsampling features ⇒ a doubly stochastic algorithm
10× speed-ups on an already fast algorithm
Analysis via stochastic approximate majorization-minimization
Conclusive on various high-dimensional problems
27. 2 Encoding with similarities
with P. Cerda and B. Kégl [Cerda... 2018]
When categories create a huge dimensionality
28. 2 Encoding with similarities
Machine learning expects X ∈ ℝ^{n×p}. The real world gives:
Gender | Date Hired  | Employee Position Title
M      | 09/12/1988  | Master Police Officer
F      | 11/19/1989  | Social Worker IV
M      | 07/16/2007  | Police Officer III
F      | 02/05/2007  | Police Aide
M      | 01/13/2014  | Electrician I
F      | 06/26/2006  | Social Worker III
F      | 01/26/2000  | Library Assistant I
M      | 11/22/2010  | Library Assistant I
A data cleaning problem? A feature engineering problem?
A problem of representations in high dimension.
30. 2 The problem of “dirty categories”
Non-curated categorical entries (Employee Position Title): Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I
- Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
- High cardinality: 400 unique entries in 10 000 rows
- Rare categories: only 1 “Architect III”
- New categories in the test set
31. 2 Dirty categories in the wild
Employee Salaries: salary information for employees of Montgomery County, Maryland.
    Employee Position Title: Master Police Officer, Social Worker IV, ...
Open Payments: payments by health-care companies to medical doctors or hospitals.
    Company name                              Frequency
    Pfizer Inc.                                  79,073
    Pfizer Pharmaceuticals LLC                      486
    Pfizer International LLC                        425
    Pfizer Limited                                   13
    Pfizer Corporation Hong Kong Limited              4
    Pfizer Pharmaceuticals Korea Limited              3
    ...
Medical charges: patient discharge data: utilization, payment, and hospital-specific charges across 3 000 US hospitals.
...
Nothing comparable on the UCI machine-learning repository.
Cardinality slowly increases with the number of rows:
[Figure: number of categories vs. number of rows (100 to 1M) for beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, medical charges; reference curves √n and 5 log²(n)]
⇒ This creates a high-dimensional learning problem.
Our goal: a statistical view of supervised learning on dirty categories. The statistical question should inform curation: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
36. 2 Related work: Database cleaning
Recognizing / merging entities
Record linkage: matching across different (clean) tables
Deduplication / fuzzy matching: matching in one dirty table
Techniques [Fellegi and Sunter 1969]:
- Supervised learning (known matches)
- Clustering
- Expectation-Maximization to learn a metric
Outputs a “clean” database
37. 2 Related work: natural language processing
Stemming / normalization: sets of (handcrafted) rules; they need to be adapted to new languages and new domains.
Semantics: relating different discrete objects
- Formal semantics (entity resolution in knowledge bases)
- Distributional semantics: “a word is characterized by the company it keeps”
Character-level NLP
- For entity resolution [Klein... 2003]
- For semantics [Bojanowski... 2017]
- “London” & “Londres” may carry different information
40. 2 Similarity encoding: a simple solution
Adding similarities to one-hot encoding
1. One-hot encoding maps categories to vector spaces
2. String similarities capture information
41. 2 Similarity encoding: a simple solution
One-hot encoding:
             London   Londres   Paris
Londres         0        1        0
London          1        0        0
Paris           0        0        1
B_X ∈ ℝ^{n×p}; but p grows fast. New categories? Links between categories?
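For reference, scikit-learn's one-hot encoder; handle_unknown='ignore' is one answer to the "new categories?" question, encoding unseen entries as all-zeros:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')   # unseen categories -> all-zero rows
B = enc.fit_transform([['London'], ['Londres'], ['Paris']])   # n x p, sparse
enc.transform([['Madrid']])                    # new category: encoded as zeros
```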
42. 2 Similarity encoding: a simple solution
Replace the 0/1 membership with a string similarity to each category:
             London   Londres   Paris
Londres        0.3       1.0      0.0
London         1.0       0.3      0.0
Paris          0.0       0.0      1.0
e.g. 0.3 = string similarity(Londres, London)
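A minimal sketch of such an encoder, using the 3-gram similarity defined on the next slide; the padding and the Jaccard-style normalization are illustrative choices:

```python
import numpy as np

def ngrams(s, n=3):
    s = '  ' + s + '  '
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)       # n-grams in common / in total

def similarity_encode(entries, train_categories):
    """One column per train category, holding string similarities."""
    return np.array([[ngram_similarity(e, c) for c in train_categories]
                     for e in entries])

X = similarity_encode(['Londres', 'London', 'Madrid'],
                      ['London', 'Londres', 'Paris'])
# 'Madrid' was never seen, yet it still receives a (low-similarity) encoding.
```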
43. 2 Some string similarities
Levenshtein
Number of edit operations on one string to match the other
Jaro-Winkler
$$d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$$
m: number of matching characters; t: number of character transpositions. (Jaro-Winkler additionally rewards common prefixes.)
n-gram similarity
n-gram: group of n consecutive characters
$$\text{similarity} = \frac{\#\,n\text{-grams in common}}{\#\,n\text{-grams in total}}$$
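A plain implementation of the Levenshtein distance above, normalized into a similarity in [0, 1]; normalizing by the longer string is an assumption, libraries differ:

```python
def levenshtein(a, b):
    """Number of single-character edits (insert, delete, substitute) a -> b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a, b):
    """Distance turned into a similarity in [0, 1]."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

levenshtein_ratio('London', 'Londres')    # ~0.57
```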
44. 2 Empirical study
Datasets with dirty categories:
Dataset              # rows   # categories   Least frequent category (count)   Prediction type
medical charges      160k        100             613                           regression
employee salaries    9.2k        385               1                           regression
open payments        100k        973               1                           binary clf
midwest survey       2.8k       1009               1                           multiclass clf
traffic violations   100k       3043               1                           multiclass clf
road safety          10k        4617               1                           binary clf
beer reviews         10k        4634               1                           multiclass clf
7 datasets, all open.
Experimental paradigm: cross-validation & measure prediction. Stupid simple.
45. 2 Experiments: gradient boosted trees
[Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews) for each encoding: similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler), target encoding, one-hot encoding, hash encoding. Average rankings across datasets span 1.6 to 5.9, with 3-gram similarity encoding ranked best]
50. 2 Experiments: ridge
[Figure: the same comparison with a ridge model. Average rankings across datasets span 1.0 to 6.0, with 3-gram similarity encoding ranked best at 1.0]
⇒ Similarity encoding, with 3-gram similarity.
52. 2 This is just a string similarity?
What similarity is defined by our encoding? It is a kernel:
$$\langle s_i, s_j \rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}\bigl(s_i, s^{(l)}\bigr)\,\mathrm{sim}\bigl(s_j, s^{(l)}\bigr)$$
a sum over the k reference categories. The categories in the train set shape the similarity.
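Concretely, this kernel is the Gram matrix of the similarity-encoded rows; a two-line check, reusing similarity_encode from the earlier sketch:

```python
X = similarity_encode(['Londres', 'London'], ['London', 'Londres', 'Paris'])
K = X @ X.T        # K[i, j] = sum_l sim(s_i, s^(l)) * sim(s_j, s^(l))
```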
53. 2 This is just a string similarity?
[Figure: prediction scores per dataset, adding to the comparison a bag-of-3-grams feature map and MDV. Average rankings across datasets span 1.1 to 7.3, with 3-gram similarity encoding ranked best]
Similarity encoding ≫ a feature map that merely captures string similarities.
54. 2 Reducing the dimensionality
B_X ∈ ℝ^{n×p}, but p is large: statistical problems, computational problems.
55. 2 Reducing the dimensionality
[Figure: prediction score vs. encoding dimensionality d (d = 30, 100, 300, full) on each dataset: employee salaries (k=355), open payments (k=910), midwest survey (k=644), traffic violations (k=2588), road safety (k=3988), beer reviews (k=4015), where k is the cardinality of the categorical variable. Reduction methods, applied to one-hot encoding and to 3-gram similarity encoding: random projections, most frequent categories, k-means, deduplication with k-means. Average rankings across datasets span 2.0 to 16.3]
Also possible: factorizing the one-hot encoding (Multiple Correspondence Analysis), or hashing n-grams (for speed, at the cost of collisions), as sketched below.
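A sketch of the random-projection reduction compared above; shapes are illustrative:

```python
from sklearn.random_projection import GaussianRandomProjection
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 4_000))   # stand-in for an n x k similarity encoding
proj = GaussianRandomProjection(n_components=300, random_state=0)
X_d = proj.fit_transform(X)       # n x 300, approximately preserving distances
```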
61. @GaelVaroquaux
Representations in high dimension: factorizations and similarities, for signals, entities, categories.
Factorizations
- Costly in large-p, large-n settings
- Sub-sampling p gives huge speed-ups
- Analysis via Stochastic Approximate Majorization-Minimization
https://github.com/arthurmensch/modl
Similarity encoding for categories
- No separate deduplication / cleaning step
- Creates a category-aware metric space
https://dirty-cat.github.io
DirtyData project (hiring)
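A usage sketch with the dirty-cat package linked above; the SimilarityEncoder API shown is the one from the time of the talk and may since have changed:

```python
# Check the current dirty-cat documentation: the project has evolved since.
from dirty_cat import SimilarityEncoder

enc = SimilarityEncoder(similarity='ngram')
X = enc.fit_transform([['Master Police Officer'],
                       ['Police Officer III'],
                       ['Police Aide']])   # one column per train-set category
```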
63. References I
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching
word vectors with subword information. Transactions of the
Association of Computational Linguistics, 5(1):135–146, 2017.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal
of the American Statistical Association, 64:1183, 1969.
D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity
recognition with character-level models. In Proceedings of the
seventh conference on Natural language learning at HLT-NAACL
2003-Volume 4, pages 180–183. Association for Computational
Linguistics, 2003.
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
64. References II
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary
learning for massive matrix factorization. In ICML, 2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2018.