In this session we show how to build a text classifier using Apache Lucene/Solr together with the LibSVM library. We classify our corpus of job offers into a number of predefined categories; each indexed document (a job offer) then belongs to zero, one, or more categories. Well-known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, and the support vector machine (SVM). We use Lucene/Solr to construct the feature vectors, then use LibSVM, the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, and then use the Hadoop MapReduce framework to reconcile the results of the classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
1. Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
2. Text Classification with Lucene/Solr and LibSVM
By Majirus FANSI, PhD
@majirus
Agile Software Developer
3. Motivation: Guiding user search
● Search engines are basically keyword-oriented
  – What about the meaning?
● Synonym search requires listing the synonyms
● The More-Like-This component is about more like THIS
● Category search for a better user experience
  – Deals with the cases where the user's keywords are not in the collection
  – The user searches for "emarketing"; you return documents on "webmarketing"
5. Text classification or categorization
● Aims
  – Classify documents into a fixed number of predefined categories
  – Each document can be in multiple categories, exactly one, or none at all
● Applications
  – Classifying emails (spam / not spam)
  – Guiding user search
● Challenges
  – Building text classifiers by hand is difficult and time-consuming
  – It is advantageous to learn classifiers from examples
6. Machine Learning
● Definition (by Tom Mitchell, 1998)
  "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
● Experience E: observing the labels of documents
● Task T: classifying a document
● Performance P: the probability that a document is correctly classified
7. Machine Learning Algorithms
● Unsupervised learning
  – Let the program learn by itself
    ● Market segmentation, social network analysis...
● Supervised learning
  – Teach the computer program how to do something
  – We give the algorithm the "right answers" for some examples
9. Supervised learning: how it works
● m training examples form the training set (X, Y)
  – (x(i), y(i)) is the i-th training example
  – x's: input variables or features
  – y's: output/target variable
● It is the job of the learning algorithm to produce the model h
  – Training set → learning algorithm → hypothesis h
  – Feature vector x → h(x) → predicted value y
10. Classifier/Decision Boundary
● Carves up the feature space into volumes
● Feature vectors in the same volume are assigned to the same class
● Decision regions are separated by surfaces
● The decision boundary is linear if it is a straight line in the feature space
  – A line in 2D, a plane in 3D, a hyperplane in 4+D
12. Properties of text
● High-dimensional input space
  – More than 10,000 features
● Very few irrelevant features
● Document vectors are sparse
  – Few entries are non-zero
● Most text categorization problems are linearly separable
  – No need to map the input features to a higher-dimensional space
13. Classification algorithm: choosing the method
● Thorsten Joachims compares SVM to naïve Bayes, Rocchio, k-nearest neighbors, and the C4.5 decision tree
● SVM consistently achieves good performance on categorization tasks
  – It outperforms the other methods
  – It eliminates the need for feature selection
  – It is more robust than the others
● Thorsten Joachims, 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features
14. SVM? Yes, but...
"The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora."
– Banko & Brill, in "Scaling to Very Very Large Corpora for Natural Language Disambiguation"
15. What is SVM - Support Vector Machine?
● "Support-Vector Networks", Cortes & Vapnik, 1995
● SVM implements the following idea
  – Map the input vectors into some high-dimensional feature space Z
    ● Through some non-linear mapping chosen a priori
  – In this feature space, a linear decision surface is constructed
  – Special properties of the decision surface ensure the high generalization ability of the learning machine
16. SVM - Classification of an unknown pattern
[Diagram: an input vector x is mapped by a non-linear transformation into a feature-space vector; it is compared (via dot products) with the support vectors sv1 ... svk in feature space, and the results, weighted by w1 ... wN, are combined to produce the classification]
17. SVM - decision boundary
● Optimal hyperplane
  – When the training data can be separated without errors
  – It is the linear decision function with maximal margin between the vectors of the two classes
● Soft margin hyperplane
  – When the training data cannot be separated without errors
20. SVM - optimal hyperplane
● Given the training set X of (x1, y1), (x2, y2), ..., (xm, ym); yi ∈ {-1, 1}
● X is linearly separable if there exists a vector w and a scalar b such that
  w·xi + b ≥ 1 if yi = 1   (1)
  w·xi + b ≤ −1 if yi = −1   (2)
  (1), (2) ⇒ yi(w·xi + b) ≥ 1   (3)
● Vectors xi for which yi(w·xi + b) = 1 are termed support vectors
  – They are used to construct the hyperplane
  – If the training vectors are separated without errors by an optimal hyperplane, the expected error rate is bounded by
  E[Pr(error)] ≤ E[number of support vectors] / number of training vectors   (4)
● The optimal hyperplane, w0·z + b0 = 0   (5)
  – is the unique one which separates the training data with a maximal margin
21. SVM - optimal hyperplane – decision function
● Let us consider the optimal hyperplane
  w0·z + b0 = 0   (5)
● The weight vector w0 can be written as a linear combination of support vectors
  w0 = Σ(over support vectors) αi zi   (6)
● The linear decision function I(z) is of the form
  I(z) = sign( Σ(over support vectors) αi zi·z + b0 )   (7)
● zi·z is the dot product between support vector zi and vector z
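In the linear case zi·z is a plain dot product, so the sum in (7) collapses into the single weight vector w0 of (6), and classifying a new document reduces to a sign test. A minimal Java sketch over a sparse feature vector (method and parameter names are illustrative):

  // Linear SVM decision function: sign(w·x + b) for a sparse vector x
  // given as parallel arrays of 1-based feature indices and values.
  static int classify(double[] w, double b, int[] indices, double[] values) {
      double score = b;
      for (int i = 0; i < indices.length; i++) {
          score += w[indices[i] - 1] * values[i]; // LibLinear indices start at 1
      }
      return score >= 0 ? 1 : -1;
  }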
23. Soft margin classification
● We want to separate the training set with a minimal number of errors
  Φ(ξ) = Σ(i=1..m) ξi^σ ; ξi ≥ 0 ; for small σ > 0   (5)
  s.t. yi(w·xi + b) ≥ 1 − ξi ; i = 1, ..., m   (6)
● The functional (5) describes the number of training errors
● Removing the subset of training errors from the training set
  – The remaining part is separated without errors
  – By constructing an optimal hyperplane
24. SVM - soft margin Idea
● Soft margin SVM can be expressed as
  min(w, b, ξ) ½‖w‖² + C Σ(i=1..m) ξi   (7)
  s.t. yi(w·xi + b) ≥ 1 − ξi ; ξi ≥ 0   (8)
● For sufficiently large C, the vector w0 and constant b0 that minimize (7) under (8) determine the hyperplane that
  – minimizes the sum of deviations ξ of the training errors
  – maximizes the margin for the correctly classified vectors
25. SVM - soft margin figure
[Figure: two classes in the (x1, x2) plane separated by a soft-margin separator; correctly classified points on or outside the margin have ξ = 0, points inside the margin have 0 < ξ < 1, and misclassified points have ξ > 1]
27. Constructing and using the text classifier
● Which library?
  – Efficient optimization packages are available
    ● SVMlight, LibSVM
● From text to feature vectors
  – Lucene/Solr helps here
● Multi-class classification vs. one-vs-the-rest
● Using the categories for semantic search
  – A dedicated Solr index with the most predictive terms
28. SVM library
● SVMlight
  – By Thorsten Joachims
● LibSVM
  – By Chang & Lin from National Taiwan University
  – Under heavy development and testing
  – Library for Java, C, Python, ...; package for the R language
● LibLinear
  – By Fan, Lin, et al.
  – Brother of LibSVM
  – Recommended by the LibSVM authors for large-scale linear classification
29. LibLinear
● A Library for Large Linear Classification
  – Binary and multi-class
  – Implements logistic regression and linear SVM
● The format of the training and testing data file is:
  – <label> <index1>:<value1> <index2>:<value2> ...
  – Each line contains one instance and ends with '\n'
  – <label> is an integer indicating the class label
  – The pair <index>:<value> gives a feature value
    ● <index> is an integer starting from 1
    ● <value> is a real number
  – Indices must be in ascending order
30. LibLinear input and dictionary
● Example input file for training
  1 101:1 123:5 234:2
  -1 54:2 64:1 453:3
  – Zeros do not have to be represented
● We need a dictionary of terms in lexicographical order
  1 .net
  2 aa
  ...
  6000 jav
  ...
  7565 solr
31. Building the dictionary
● Divide the overall training data into a number of portions
  – Using knowledge of your domain
    ● A software development portion
    ● A marketing portion ...
  – This avoids a very large dictionary
    ● A Java dev position and a marketing position share few common terms
● Use expert boolean queries to load a dedicated Solr core per domain
  – description:python AND title:python
32. Building the dictionary with Solr
● What we need in the dictionary
  – Terms properly analyzed
    ● LowerCaseFilterFactory, StopFilterFactory, ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
  – Terms that occur in a minimum number of documents (df > min)
    ● Rare terms may cause the model to overfit
● Terms are retrieved from Solr
  – Using the Solr TermVectorsComponent
33. Solr TermVectorComponent
● A SearchComponent designed to return information about terms in documents
  – tv.df returns the document frequency per term in the document
  – tv.tf returns the term frequency per term in the document
    ● Used as the feature value
  – tv.fl provides the list of fields to get term vectors for
    ● Only the catch-all field we use for classification
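For example, assuming the component is wired to a request handler named /tvrh (as configured on the next slide) and a core named jobs (an illustrative name), the term vectors for one document can be fetched with:

  http://localhost:8983/solr/jobs/tvrh?q=id:42&tv.tf=true&tv.df=true&tv.fl=title_and_description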
34. Solr Core configuration
● Set the termVectors attribute on the fields you will use
  – <field name="title_and_description" type="texte_analyse" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
● Normalize your text and use stemming during the analysis
● Enable the TermVectorComponent in solrconfig.xml
  – <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
  – Configure a RequestHandler to use this component
    ● <lst name="defaults"> <bool name="tv">true</bool> </lst>
    ● <arr name="last-components"> <str>tvComponent</str> </arr>
36. Feature extraction
● A domain expert query is used to extract the docs for each category
  – The TermVectorComponent returns the term info for the terms in each document
  – Each term is replaced by its index from the dictionary
    ● This is the attribute
  – Its tf info is used as the value
    ● Some use presence/absence (1/0)
    ● Others use tf-idf
  – term_index_from_dico:term_freq is an input feature
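A minimal Java sketch of this step, turning one document's term frequencies into a LibLinear input line (the maps and their contents are assumed to come from the TermVectorComponent response and the dictionary core):

  import java.util.Map;
  import java.util.SortedMap;
  import java.util.TreeMap;

  // Builds one LibLinear input line ("label index:value ...") for a document.
  // 'termFreqs' maps analyzed terms to tf values; 'dictionary' maps terms to
  // their 1-based feature indices.
  static String toLibLinearLine(int label, Map<String, Integer> termFreqs,
                                Map<String, Integer> dictionary) {
      // LibLinear requires indices in ascending order, so sort them first
      SortedMap<Integer, Integer> features = new TreeMap<>();
      for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
          Integer index = dictionary.get(e.getKey());
          if (index != null) {            // terms outside the dictionary are dropped
              features.put(index, e.getValue());
          }
      }
      StringBuilder line = new StringBuilder(String.valueOf(label));
      for (Map.Entry<Integer, Integer> f : features.entrySet()) {
          line.append(' ').append(f.getKey()).append(':').append(f.getValue());
      }
      return line.toString();
  }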
37. Training and test set partition
● We shuffle the document set so that high-score docs do not all go to the same bucket
● We split the resulting list (see the sketch below) so that
  – 60% goes to the training set (TS)
    ● These are the positive examples (the +1s)
  – 20% goes to the validation set (VS)
    ● Positive in this model, negative in others
  – 20% is used for the other classes' training sets (OTS)
    ● These are negative examples for the others
● Balanced training set (≈50% of +1s and ≈50% of −1s)
  – The negatives come from the others' 20% OTS
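A sketch of the shuffle-and-split in Java (the document IDs and the fixed seed are illustrative):

  import java.util.Collections;
  import java.util.List;
  import java.util.Random;

  // Shuffles the docs, then splits them 60/20/20 into TS, VS, and OTS.
  static List<List<String>> split(List<String> docIds) {
      Collections.shuffle(docIds, new Random(42)); // spread high-score docs around
      int n = docIds.size();
      return List.of(
          docIds.subList(0, (int) (0.6 * n)),               // training set (TS)
          docIds.subList((int) (0.6 * n), (int) (0.8 * n)), // validation set (VS)
          docIds.subList((int) (0.8 * n), n));              // negatives for others (OTS)
  }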
38. Model file
● A model file is saved after training
  – One model per category
  – It outlines the following (recall the constraint w·xi + b ≥ 1 if yi = 1)
    ● solver_type L2R_L2LOSS_SVC
    ● nr_class 2
    ● label 1 -1
    ● nr_feature 8920
    ● bias 1.000000000000000
    ● w
      -0.1626437446641374
      0
      7.152404908494515e-05
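A model file with this header is what LibLinear's command-line trainer produces; for instance (file names are illustrative; -s 2 selects the L2R_L2LOSS_SVC solver and -B 1 adds the bias term):

  ./train -s 2 -c 1 -B 1 category_java.train category_java.model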
39. Most predictive terms
● The model file contains the weight vector w
● Use w to compute the most predictive terms of the model
  – They give an indication as to whether the model is good or not
    ● You are the domain expert
  – They are useful to extend basic keyword search to semantic search
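One possible Java sketch: sort the feature indices by their weight in w and map them back through the dictionary (the dictionary list is assumed to be ordered by feature index):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  // Returns the k terms with the largest positive weights in w.
  static List<String> topPredictiveTerms(double[] w, List<String> dictionary, int k) {
      Integer[] idx = new Integer[w.length];
      for (int i = 0; i < w.length; i++) idx[i] = i;
      Arrays.sort(idx, (a, b) -> Double.compare(w[b], w[a])); // descending by weight
      List<String> top = new ArrayList<>();
      for (int i = 0; i < k && i < idx.length; i++) {
          top.add(dictionary.get(idx[i])); // dictionary entry i holds feature i + 1
      }
      return top;
  }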
40. Toward semantic search - Indexing
● Create a category core in Solr
  – Each document represents a category
    ● One field for the category ID
    ● One multi-valued field holds its top predictive terms
● At indexing time
  – Each document is sent to the classification service
  – The service returns the categories of the document
  – The categories are saved in a multi-valued field along with the other domain-pertinent document fields
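Populating the category core could look like this with SolrJ (field names such as predictive_terms are assumptions, not from the talk):

  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.common.SolrInputDocument;

  // Indexes one category document with its most predictive terms.
  static void indexCategory(SolrClient solr, String categoryId, List<String> topTerms)
          throws Exception {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", categoryId);             // the category ID field
      for (String term : topTerms) {
          doc.addField("predictive_terms", term); // multi-valued field (assumed name)
      }
      solr.add(doc);
      solr.commit();
  }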
41. Toward semantic search - searching
● At search time
  – The user query is run on the category core
  – The returned categories are used to extend the initial query
    ● A boost < 1 is assigned to the category clause (see the example below)
● What about libShortText?
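For instance (field names and the 0.5 boost are illustrative), a user query for "emarketing" whose top matching category is "webmarketing" could be expanded to:

  title_and_description:(emarketing) OR category:(webmarketing)^0.5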
42. References
● Cortes and Vapnik, 1995. Support-Vector Networks
● Chang and Lin, 2012. LibSVM: A Library for Support Vector Machines
● Fan, Lin, et al., 2012. LibLinear: A Library for Large Linear Classification
● Thorsten Joachims, 1998. Text Categorization with SVM: Learning with Many Relevant Features
● Rifkin and Klautau, 2004. In Defense of One-Vs-All Classification
43. A big thank you
● To the Lucene/Solr Revolution EU 2013 organizers
● To the Valtech management
● To Michels, Maj-Daniels, and Marie-Audrey Fansi
● To all of you for your presence and attention