In this session we show how to build a text classifier using Apache Lucene/Solr together with the LibSVM library. We classify our corpus of job offers into a number of predefined categories; each indexed document (a job offer) then belongs to zero, one, or more categories. Well-known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, and the support vector machine (SVM). We use Lucene/Solr to construct the feature vectors, then use LibSVM, the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, and then use the Hadoop MapReduce framework to reconcile the results of the classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
1. Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
2. Text Classification with Lucene/Solr and LibSVM
By Majirus FANSI, PhD
@majirus
Agile Software Developer
3. Motivation: Guiding user search
● Search engines are basically keyword-oriented
  – What about the meaning?
● Synonym search requires listing the synonyms
● The More-Like-This component is about more like THIS
● Category search for a better user experience
  – Deals with the cases where the user's keywords are not in the collection
  – The user searches for "emarketing"; you return documents on "webmarketing"
5. Text classification or categorization
● Aims
  – Classify documents into a fixed number of predefined categories
  – Each document can be in multiple categories, exactly one, or none at all
● Applications
  – Classifying emails (spam / not spam)
  – Guiding user search
● Challenges
  – Building text classifiers by hand is difficult and time-consuming
  – It is advantageous to learn classifiers from examples
6. Machine Learning
● Definition (by Tom Mitchell, 1998)
  "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
● Experience E: observing the labels of documents
● Task T: classifying a document
● Performance P: the probability that a document is correctly classified
7. Machine Learning Algorithms
● Unsupervised learning
  – Let the program learn by itself
    ● Market segmentation, social network analysis...
● Supervised learning
  – Teach the computer program how to do something
  – We give the algorithm the "right answers" for some examples
9. Supervised learning: how it works
● m training examples form the training set (X, Y)
  – (x(i), y(i)) is the i-th training example
  – x's: input variables or features
  – y's: output/target variable
● It is the job of the learning algorithm to produce the model h
  – Training set → learning algorithm → hypothesis h
  – Feature vector x → h(x) → predicted value y
10. Classifier/Decision Boundary
● Carves up the feature space into volumes
● Feature vectors in the same volume are assigned to the same class
● Decision regions are separated by surfaces
● The decision boundary is linear if it is a straight line in the feature space
  – A line in 2D, a plane in 3D, a hyperplane in 4+D
12. Properties of text
● High-dimensional input space
  – More than 10,000 features
● Very few irrelevant features
● Document vectors are sparse
  – Few entries are non-zero
● Most text categorization problems are linearly separable
  – No need to map the input features to a higher-dimensional space
13. Classification algorithm: choosing the method
● Thorsten Joachims compares SVM to naïve Bayes, Rocchio, k-nearest neighbors, and the C4.5 decision tree
● SVM consistently achieves good performance on categorization tasks
  – It outperforms the other methods
  – It eliminates the need for feature selection
  – It is more robust than the others
● Thorsten Joachims, 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features
14. SVM? Yes, but...
"The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora."
– Banko & Brill, in "Scaling to Very Very Large Corpora for Natural Language Disambiguation"
15. What is SVM - Support Vector Machine?
● "Support-Vector Networks", Cortes & Vapnik, 1995
● SVM implements the following idea
  – Map the input vectors into some high-dimensional feature space Z
    ● Through some non-linear mapping chosen a priori
  – In this feature space, a linear decision surface is constructed
  – Special properties of the decision surface ensure the high generalization ability of the learning machine
16. SVM - Classification of an unknown pattern
[Diagram: an input vector x is mapped by a non-linear transformation into a feature-space vector; it is compared (via dot products) with the support vectors sv1 ... svk in feature space, and the results, weighted by w1 ... wN, are combined to produce the classification]
17. SVM - decision boundary
● Optimal hyperplane
  – When the training data can be separated without errors
  – It is the linear decision function with maximal margin between the vectors of the two classes
● Soft margin hyperplane
  – When the training data cannot be separated without errors
20. SVM - optimal hyperplane
● Given the training set X of (x1, y1), (x2, y2), ..., (xm, ym); yi ∈ {-1, 1}
● X is linearly separable if there exists a vector w and a scalar b such that
  w·xi + b ≥ 1 if yi = 1   (1)
  w·xi + b ≤ −1 if yi = −1   (2)
  (1), (2) ⇒ yi(w·xi + b) ≥ 1   (3)
● Vectors xi for which yi(w·xi + b) = 1 are termed support vectors
  – They are used to construct the hyperplane
  – If the training vectors are separated without errors by an optimal hyperplane, the expected error rate is bounded by
  E[Pr(error)] ≤ E[number of support vectors] / number of training vectors   (4)
● The optimal hyperplane, w0·z + b0 = 0   (5)
  – is the unique one which separates the training data with a maximal margin
21. SVM - optimal hyperplane – decision function
● Let us consider the optimal hyperplane
  w0·z + b0 = 0   (5)
● The weight vector w0 can be written as a linear combination of support vectors
  w0 = Σ(over support vectors) αi zi   (6)
● The linear decision function I(z) is of the form
  I(z) = sign( Σ(over support vectors) αi zi·z + b0 )   (7)
● zi·z is the dot product between support vector zi and vector z
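In the linear case zi·z is a plain dot product, so the sum in (7) collapses into the single weight vector w0 of (6), and classifying a new document reduces to a sign test. A minimal Java sketch over a sparse feature vector (method and parameter names are illustrative):

  // Linear SVM decision function: sign(w·x + b) for a sparse vector x
  // given as parallel arrays of 1-based feature indices and values.
  static int classify(double[] w, double b, int[] indices, double[] values) {
      double score = b;
      for (int i = 0; i < indices.length; i++) {
          score += w[indices[i] - 1] * values[i]; // LibLinear indices start at 1
      }
      return score >= 0 ? 1 : -1;
  }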
23. Soft margin classification
● We want to separate the training set with a minimal number of errors
  Φ(ξ) = Σ(i=1..m) ξi^σ ; ξi ≥ 0 ; for small σ > 0   (5)
  s.t. yi(w·xi + b) ≥ 1 − ξi ; i = 1, ..., m   (6)
● The functional (5) describes the number of training errors
● Removing the subset of training errors from the training set
  – The remaining part is separated without errors
  – By constructing an optimal hyperplane
24. SVM - soft margin Idea
● Soft margin SVM can be expressed as
  min(w, b, ξ) ½‖w‖² + C Σ(i=1..m) ξi   (7)
  s.t. yi(w·xi + b) ≥ 1 − ξi ; ξi ≥ 0   (8)
● For sufficiently large C, the vector w0 and constant b0 that minimize (7) under (8) determine the hyperplane that
  – minimizes the sum of deviations ξ of the training errors
  – maximizes the margin for the correctly classified vectors
25. SVM - soft margin figure
[Figure: two classes in the (x1, x2) plane separated by a soft-margin separator; correctly classified points on or outside the margin have ξ = 0, points inside the margin have 0 < ξ < 1, and misclassified points have ξ > 1]
27. Constructing and using the text classifier
● Which library?
  – Efficient optimization packages are available
    ● SVMlight, LibSVM
● From text to feature vectors
  – Lucene/Solr helps here
● Multi-class classification vs. one-vs-the-rest
● Using the categories for semantic search
  – A dedicated Solr index with the most predictive terms
28. SVM library
● SVMlight
  – By Thorsten Joachims
● LibSVM
  – By Chang & Lin from National Taiwan University
  – Under heavy development and testing
  – Library for Java, C, Python, ...; package for the R language
● LibLinear
  – By Fan, Lin, et al.
  – Brother of LibSVM
  – Recommended by the LibSVM authors for large-scale linear classification
29. LibLinear
● A Library for Large Linear Classification
  – Binary and multi-class
  – Implements logistic regression and linear SVM
● The format of the training and testing data file is:
  – <label> <index1>:<value1> <index2>:<value2> ...
  – Each line contains one instance and ends with '\n'
  – <label> is an integer indicating the class label
  – The pair <index>:<value> gives a feature value
    ● <index> is an integer starting from 1
    ● <value> is a real number
  – Indices must be in ascending order
30. LibLinear input and dictionary
● Example input file for training
  1 101:1 123:5 234:2
  -1 54:2 64:1 453:3
  – Zeros do not have to be represented
● We need a dictionary of terms in lexicographical order
  1 .net
  2 aa
  ...
  6000 jav
  ...
  7565 solr
31. Building the dictionary
● Divide the overall training data into a number of portions
  – Using knowledge of your domain
    ● A software development portion
    ● A marketing portion ...
  – This avoids a very large dictionary
    ● A Java dev position and a marketing position share few common terms
● Use expert boolean queries to load a dedicated Solr core per domain
  – description:python AND title:python
32. Building the dictionary with Solr
● What we need in the dictionary
  – Terms properly analyzed
    ● LowerCaseFilterFactory, StopFilterFactory, ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
  – Terms that occur in a minimum number of documents (df > min)
    ● Rare terms may cause the model to overfit
● Terms are retrieved from Solr
  – Using the Solr TermVectorsComponent
33. Solr TermVectorComponent
● A SearchComponent designed to return information about terms in documents
  – tv.df returns the document frequency per term in the document
  – tv.tf returns the term frequency per term in the document
    ● Used as the feature value
  – tv.fl provides the list of fields to get term vectors for
    ● Only the catch-all field we use for classification
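For example, assuming the component is wired to a request handler named /tvrh (as configured on the next slide) and a core named jobs (an illustrative name), the term vectors for one document can be fetched with:

  http://localhost:8983/solr/jobs/tvrh?q=id:42&tv.tf=true&tv.df=true&tv.fl=title_and_description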
34. Solr Core configuration
● Set the termVectors attribute on the fields you will use
  – <field name="title_and_description" type="texte_analyse" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
● Normalize your text and use stemming during the analysis
● Enable the TermVectorComponent in solrconfig.xml
  – <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
  – Configure a RequestHandler to use this component
    ● <lst name="defaults"> <bool name="tv">true</bool> </lst>
    ● <arr name="last-components"> <str>tvComponent</str> </arr>
36. Feature extraction
● A domain expert query is used to extract the docs for each category
  – The TermVectorComponent returns the term info for the terms in each document
  – Each term is replaced by its index from the dictionary
    ● This is the attribute
  – Its tf info is used as the value
    ● Some use presence/absence (1/0)
    ● Others use tf-idf
  – term_index_from_dico:term_freq is an input feature
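A minimal Java sketch of this step, turning one document's term frequencies into a LibLinear input line (the maps and their contents are assumed to come from the TermVectorComponent response and the dictionary core):

  import java.util.Map;
  import java.util.SortedMap;
  import java.util.TreeMap;

  // Builds one LibLinear input line ("label index:value ...") for a document.
  // 'termFreqs' maps analyzed terms to tf values; 'dictionary' maps terms to
  // their 1-based feature indices.
  static String toLibLinearLine(int label, Map<String, Integer> termFreqs,
                                Map<String, Integer> dictionary) {
      // LibLinear requires indices in ascending order, so sort them first
      SortedMap<Integer, Integer> features = new TreeMap<>();
      for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
          Integer index = dictionary.get(e.getKey());
          if (index != null) {            // terms outside the dictionary are dropped
              features.put(index, e.getValue());
          }
      }
      StringBuilder line = new StringBuilder(String.valueOf(label));
      for (Map.Entry<Integer, Integer> f : features.entrySet()) {
          line.append(' ').append(f.getKey()).append(':').append(f.getValue());
      }
      return line.toString();
  }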
37. Training and test set partition
● We shuffle the document set so that high-score docs do not all go to the same bucket
● We split the resulting list (see the sketch below) so that
  – 60% goes to the training set (TS)
    ● These are the positive examples (the +1s)
  – 20% goes to the validation set (VS)
    ● Positive in this model, negative in others
  – 20% is used for the other classes' training sets (OTS)
    ● These are negative examples for the others
● Balanced training set (≈50% of +1s and ≈50% of −1s)
  – The negatives come from the others' 20% OTS
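A sketch of the shuffle-and-split in Java (the document IDs and the fixed seed are illustrative):

  import java.util.Collections;
  import java.util.List;
  import java.util.Random;

  // Shuffles the docs, then splits them 60/20/20 into TS, VS, and OTS.
  static List<List<String>> split(List<String> docIds) {
      Collections.shuffle(docIds, new Random(42)); // spread high-score docs around
      int n = docIds.size();
      return List.of(
          docIds.subList(0, (int) (0.6 * n)),               // training set (TS)
          docIds.subList((int) (0.6 * n), (int) (0.8 * n)), // validation set (VS)
          docIds.subList((int) (0.8 * n), n));              // negatives for others (OTS)
  }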
38. Model file
● A model file is saved after training
  – One model per category
  – It outlines the following (recall the constraint w·xi + b ≥ 1 if yi = 1)
    ● solver_type L2R_L2LOSS_SVC
    ● nr_class 2
    ● label 1 -1
    ● nr_feature 8920
    ● bias 1.000000000000000
    ● w
      -0.1626437446641374
      0
      7.152404908494515e-05
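A model file with this header is what LibLinear's command-line trainer produces; for instance (file names are illustrative; -s 2 selects the L2R_L2LOSS_SVC solver and -B 1 adds the bias term):

  ./train -s 2 -c 1 -B 1 category_java.train category_java.model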
39. Most predictive terms
● The model file contains the weight vector w
● Use w to compute the most predictive terms of the model
  – They give an indication as to whether the model is good or not
    ● You are the domain expert
  – They are useful to extend basic keyword search to semantic search
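One possible Java sketch: sort the feature indices by their weight in w and map them back through the dictionary (the dictionary list is assumed to be ordered by feature index):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  // Returns the k terms with the largest positive weights in w.
  static List<String> topPredictiveTerms(double[] w, List<String> dictionary, int k) {
      Integer[] idx = new Integer[w.length];
      for (int i = 0; i < w.length; i++) idx[i] = i;
      Arrays.sort(idx, (a, b) -> Double.compare(w[b], w[a])); // descending by weight
      List<String> top = new ArrayList<>();
      for (int i = 0; i < k && i < idx.length; i++) {
          top.add(dictionary.get(idx[i])); // dictionary entry i holds feature i + 1
      }
      return top;
  }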
40. Toward semantic search - Indexing
● Create a category core in Solr
  – Each document represents a category
    ● One field for the category ID
    ● One multi-valued field holds its top predictive terms
● At indexing time
  – Each document is sent to the classification service
  – The service returns the categories of the document
  – The categories are saved in a multi-valued field along with the other domain-pertinent document fields
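Populating the category core could look like this with SolrJ (field names such as predictive_terms are assumptions, not from the talk):

  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.common.SolrInputDocument;

  // Indexes one category document with its most predictive terms.
  static void indexCategory(SolrClient solr, String categoryId, List<String> topTerms)
          throws Exception {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", categoryId);             // the category ID field
      for (String term : topTerms) {
          doc.addField("predictive_terms", term); // multi-valued field (assumed name)
      }
      solr.add(doc);
      solr.commit();
  }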
41. Toward semantic search - searching
● At search time
  – The user query is run on the category core
  – The returned categories are used to extend the initial query
    ● A boost < 1 is assigned to the category clause (see the example below)
● What about libShortText?
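For instance (field names and the 0.5 boost are illustrative), a user query for "emarketing" whose top matching category is "webmarketing" could be expanded to:

  title_and_description:(emarketing) OR category:(webmarketing)^0.5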
42. References
● Cortes and Vapnik, 1995. Support-Vector Networks
● Chang and Lin, 2012. LibSVM: A Library for Support Vector Machines
● Fan, Lin, et al., 2012. LibLinear: A Library for Large Linear Classification
● Thorsten Joachims, 1998. Text Categorization with SVM: Learning with Many Relevant Features
● Rifkin and Klautau, 2004. In Defense of One-Vs-All Classification
43. A big thank you
● To the Lucene/Solr Revolution EU 2013 organizers
● To the Valtech management
● To Michels, Maj-Daniels, and Marie-Audrey Fansi
● To all of you for your presence and attention