Text classification with Lucene/Solr and LibSVM

By Majirus FANSI, PhD
@majirus
Agile Software Developer
Motivation: Guiding user search
● Search engines are basically keyword-oriented
  – What about the meaning?
● Synonym search requires listing the synonyms
● The More-Like-This component is about more like THIS
● Category search for better user experience
  – Deals with the cases where user keywords are not in the collection
  – User searches for « emarketing », you return documents on « webmarketing »
Outline
● Text Categorization
● Introducing Machine Learning
● Why SVM?
● How Solr can help?
Putting it all together is our aim
Text classification or Categorization
● Aims
  – Classify documents into a fixed number of predefined categories
    ● Each document can be in multiple, exactly one, or no category at all.
● Applications
  – Classifying emails (Spam / Not Spam)
  – Guiding user search
● Challenges
  – Building text classifiers by hand is difficult and time-consuming
  – It is advantageous to learn classifiers from examples
Machine Learning
● Definition (by Tom Mitchell - 1998)
  “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E”
  ● Experience E: watching the label of a document
  ● Task T: classify a document
  ● Performance P: probability that a document is correctly classified.
Machine Learning Algorithms
● Unsupervised learning
  – Let the program learn by itself
    ● Market segmentation, social network analysis...
● Supervised learning
  – Teach the computer program how to do something
  – We give the algorithm the “right answers” for some examples
Supervised learning problems
– Regression
  ● Predict a continuous-valued output
  ● Ex: price of the houses in Corbeil Essonnes
– Classification
  ● Predict a discrete-valued output (+1, -1)
Supervised learning: Working
● Training set (X, Y) of m training examples
  – (X(i), Y(i)): i-th training example
  – X's: input variables or features
  – Y's: output/target variable
● It is the job of the learning algorithm to produce the model h
[Figure: the training set feeds the training algorithm, which outputs the hypothesis h; a feature vector x goes into h, and the predicted value y = h(x) comes out]
Classifier/Decision Boundary
● Carves up the feature space into volumes
● Feature vectors in the same volume are assigned to the same class
● Decision regions are separated by surfaces
● The decision boundary is linear if it is a straight line in the feature space
  – A line in 2D, a plane in 3D, a hyperplane in 4+D
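A linear decision boundary can be illustrated with a few lines of Python. This is only a sketch: the weights `w`, the offset `b` and the two test points are made-up values, and ties (score exactly 0) are arbitrarily sent to the +1 side.

```python
def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical 2D boundary: the straight line x1 + x2 - 1 = 0
w, b = [1.0, 1.0], -1.0
above = classify(w, b, [2.0, 2.0])   # point on the positive side
below = classify(w, b, [0.0, 0.0])   # point on the negative side
print(above, below)
```

The same function works unchanged in any dimension, which is why the boundary is called a hyperplane in 4+D.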
Which Algorithm for text classifier
Properties of text
● High dimensional input space
  – More than 10 000 features
● Very few irrelevant features
● Document vectors are sparse
  – Few entries are non-zero
● Most text categorization problems are linearly separable
  – No need to map the input features to a higher dimensional space
Classification algorithm / choosing the method
● Thorsten Joachims compares SVM to Naive Bayes, Rocchio, k-nearest neighbor and the C4.5 decision tree
● SVM consistently achieves good performance on categorization tasks
  – It outperforms the other methods
  – Eliminates the need for feature selection
  – More robust than the others
Thorsten Joachims, 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features
SVM? Yes, but...
« The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora »
Banko & Brill in « Scaling to Very Very Large Corpora for Natural Language Disambiguation »
What is SVM - Support Vector Machine?
● « Support-Vector Networks », Cortes & Vapnik, 1995
● SVM implements the following idea
  – Maps the input vectors into some high dimensional feature space Z
    ● Through some non-linear mapping chosen a priori
  – In this feature space a linear decision surface is constructed
  – Special properties of the decision surface ensure high generalization ability of the learning machine
SVM - Classification of an unknown pattern
[Figure: the input vector x goes through a non-linear transformation into feature space, where it is compared with the support vectors z_i (sv1, sv2, ..., svk); the weights w1, w2, ..., wN then produce the classification]
SVM - decision boundary
● Optimal hyperplane
  – Training data can be separated without errors
  – It is the linear decision function with maximal margin between the vectors of the two classes
● Soft margin hyperplane
  – Training data cannot be separated without errors
Optimal hyperplane
Optimal hyperplane - figure
[Figure: axes x1 and x2; the optimal hyperplane separating the two classes, with the optimal margin]
SVM - optimal hyperplane
● Given the training set X of $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$; $y_i \in \{-1, 1\}$
● X is linearly separable if there exist a vector w and a scalar b s.t.
    $w \cdot x_i + b \ge 1$ if $y_i = 1$    (1)
    $w \cdot x_i + b \le -1$ if $y_i = -1$    (2)
    (1), (2) $\Rightarrow y_i (w \cdot x_i + b) \ge 1$    (3)
● Vectors $x_i$ for which $y_i (w \cdot x_i + b) = 1$ are termed support vectors
  – Used to construct the hyperplane
  – If the training vectors are separated without errors by an optimal hyperplane
    $E[\Pr(\text{error})] \le \dfrac{E[\text{number of support vectors}]}{\text{number of training vectors}}$    (4)
● The optimal hyperplane
    $w_0 \cdot z + b_0 = 0$    (5)
  – Unique one which separates the training data with a maximal margin
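Constraint (3) and the definition of support vectors can be checked numerically. The sketch below uses a hand-made 2D training set and a candidate hyperplane `w`, `b` (all values are illustrative assumptions, not data from the talk); the margin width 2 / ||w|| is the standard distance between the two hyperplanes w.x + b = 1 and w.x + b = -1.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def functional_margins(w, b, X, y):
    """Return y_i * (w . x_i + b) for every training example."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

# Toy training set and candidate hyperplane x1 = 1 (illustrative values)
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], -1.0

margins = functional_margins(w, b, X, y)
separable = all(m >= 1 for m in margins)                     # constraint (3)
support = [i for i, m in enumerate(margins) if abs(m - 1) < 1e-9]
width = 2 / math.sqrt(dot(w, w))                             # 2 / ||w||
print(margins, separable, support, width)
```

Only the examples whose margin equals exactly 1 (indices 0 and 2 here) lie on the margin; the other points could be removed without changing the hyperplane.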
SVM - optimal hyperplane – decision function
● Let us consider the optimal hyperplane
    $w_0 \cdot z + b_0 = 0$    (5)
● The weight $w_0$ can be written as some linear combination of SVs
    $w_0 = \sum_{\text{support vectors}} \alpha_i z_i$    (6)
● The linear decision function I(z) is of the form
    $I(z) = \operatorname{sign}\left( \sum_{\text{support vectors}} \alpha_i \, z_i \cdot z + b_0 \right)$    (7)
● $z_i \cdot z$ is the dot product between support vector $z_i$ and vector z
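The decision function (7) translates almost literally into code. In this sketch the coefficients `alphas` are signed, i.e. the class label is assumed to be folded into each alpha as equation (6) suggests; the support vectors, coefficients and `b0` are made-up values, and sign(0) is arbitrarily mapped to +1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decide(support_vectors, alphas, b0, z):
    """I(z) = sign(sum_i alpha_i * (z_i . z) + b0), summed over support vectors only."""
    score = sum(a * dot(zi, z) for a, zi in zip(alphas, support_vectors)) + b0
    return 1 if score >= 0 else -1

# Illustrative model: two support vectors with signed coefficients
support_vectors = [[2.0, 0.0], [0.0, 0.0]]
alphas = [0.5, -0.5]
b0 = -1.0

# Equation (6): w0 is the alpha-weighted sum of the support vectors
w0 = [sum(a * zi[j] for a, zi in zip(alphas, support_vectors)) for j in range(2)]

positive = decide(support_vectors, alphas, b0, [3.0, 0.0])
negative = decide(support_vectors, alphas, b0, [0.5, 0.0])
print(w0, positive, negative)
```

Classification only needs dot products with the support vectors, so the rest of the training set can be discarded once the model is built.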
Soft margin hyperplane
Soft margin Classification
● Want to separate the training set with a minimal number of errors
    $\Phi(\xi) = \sum_{i=1}^{m} \xi_i^{\sigma}$; $\xi_i \ge 0$; for small $\sigma > 0$    (5)
    s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$; $i = 1, \dots, m$    (6)
● The functional (5) describes the number of training errors
● Removing the subset of training errors from the training set
  – Remaining part separated without errors
  – By constructing an optimal hyperplane
SVM - soft margin Idea
● Soft margin SVM can be expressed as
    $\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$    (7)
    s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$    (8)
● For sufficiently large C, the vector $w_0$ and constant $b_0$ that minimize (7) under (8) determine the hyperplane that
  – Minimizes the sum of deviations, ξ, of training errors
  – Maximizes the margin for correctly classified vectors
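For a fixed hyperplane, the smallest slack values allowed by (8) are ξ_i = max(0, 1 - y_i (w.x_i + b)), which makes objective (7) easy to evaluate by hand. The sketch below does that on four made-up points; `w`, `b` and `C` are illustrative values, and no optimization is performed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Smallest xi_i satisfying y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def objective(w, b, X, y, C):
    """Value of (7): 1/2 ||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Illustrative data: one point inside the margin, one misclassified
X = [[2.0, 0.0], [1.5, 0.0], [0.5, 0.0], [0.0, 0.0]]
y = [1, 1, 1, -1]
w, b = [1.0, 0.0], -1.0

xi = slacks(w, b, X, y)
value = objective(w, b, X, y, C=10.0)
print(xi, value)
```

A slack between 0 and 1 marks a point that is correctly classified but inside the margin, while a slack above 1 marks a training error; a larger C penalizes both more heavily.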
SVM - soft margin figure
[Figure: axes x1 and x2; the separator and its soft margin, with training points labelled ξ = 0, 0 < ξ < 1 and ξ > 1]
Constructing text classifier with SVM
Constructing and using the text classifier
● Which library?
  – Efficient optimization packages are available
    ● SVMlight, LibSVM
● From text to feature vectors
  – Lucene/Solr helps here
● Multi-class classification vs One-vs-the-rest
● Using the categories for semantic search
  – Dedicated Solr index with the most predictive terms
SVM library
● SVMlight
  – By Thorsten Joachims
● LibSVM
  – By Chang & Lin from National Taiwan University
  – Under heavy development and testing
  – Library for Java, C, Python, ...; package for the R language
● LibLinear
  – By Fan, Chang, Lin et al.
  – Sibling of LibSVM
  – Recommended by LibSVM authors for large-scale linear classification
LibLinear
● A Library for Large Linear Classification
  – Binary and multi-class
  – Implements logistic regression and linear SVM
● Format of the training and testing data files:
  – <label> <index1>:<value1> <index2>:<value2> ...
  – Each line contains an instance and is ended by a '\n'
  – <label> is an integer indicating the class label
  – The pair <index>:<value> gives a feature value
    ● <index> is an integer starting from 1
    ● <value> is a real number
  – Indices must be in ascending order
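The format rules above can be enforced by a small helper. This is a minimal sketch; the function name `to_liblinear_line` and its dict-based input are assumptions made for illustration, not part of LibLinear.

```python
def to_liblinear_line(label, features):
    """Format one instance as '<label> <index>:<value> ...'.

    features maps an integer index (>= 1) to a numeric value.
    Zero values are skipped and indices are written in ascending order.
    """
    pairs = sorted((i, v) for i, v in features.items() if v != 0)
    if pairs and pairs[0][0] < 1:
        raise ValueError("feature indices start at 1")
    body = " ".join(f"{i}:{v:g}" for i, v in pairs)
    return f"{label} {body}".rstrip()

positive_line = to_liblinear_line(1, {234: 2, 101: 1, 123: 5})
negative_line = to_liblinear_line(-1, {453: 3, 54: 2, 64: 1, 70: 0})
print(positive_line)
print(negative_line)
```

Each returned string still needs a trailing '\n' when it is written to the training or testing file.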
LibLinear input and dictionary
● Example input file for training
    1 101:1 123:5 234:2
    -1 54:2 64:1 453:3
  – Zeros do not have to be represented.
● Need a dictionary of terms in lexicographical order
    1 .net
    2 aa
    ...
    6000 jav
    ...
    7565 solr
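A dictionary like the one above can be sketched as a mapping from each distinct term to its 1-based rank in sorted order. The term list here is a made-up sample, and Python's default string ordering (by code point) is assumed to be an acceptable lexicographical order.

```python
def build_dictionary(terms):
    """Map each distinct term to a 1-based index in lexicographical order."""
    return {term: i for i, term in enumerate(sorted(set(terms)), start=1)}

# Illustrative, already-analyzed terms (duplicates are collapsed)
dictionary = build_dictionary(["solr", ".net", "jav", "aa", "solr"])
print(dictionary)
```

The indices only stay valid while the term list is unchanged, so the same dictionary has to be reused for training and for classification.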
Building the dictionary
● Divide the overall training data into a number of portions
  – Using knowledge of your domain
    ● Software development portion
    ● Marketing portion...
  – Avoid a very large dictionary
    ● A Java dev position and a marketing position share few common terms
● Use expert boolean queries to load a dedicated Solr core per domain
  – description:python AND title:python
Building the dictionary with Solr
● What do we need in the dictionary?
  – Terms properly analyzed
    ● LowerCaseFilterFactory, StopFilterFactory,
    ● ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
  – Terms that occur in a number of documents (df > min)
    ● Rare terms may cause the model to overfit
● Terms are retrieved from Solr
  – Using the Solr TermVectorComponent
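The df > min rule can be sketched as follows. In the talk the document frequencies come from Solr; here they are counted locally from already-analyzed token lists as a stand-in, and the documents and threshold are invented for illustration.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))   # a term counts once per document
    return df

def keep_frequent(df, min_df):
    """Keep the terms whose document frequency is strictly above min_df."""
    return sorted(term for term, n in df.items() if n > min_df)

# Illustrative analyzed documents (lowercased, stemmed)
docs = [["java", "solr"], ["java", "lucen"], ["java", "solr", "rareterm"]]
df = document_frequencies(docs)
kept = keep_frequent(df, min_df=1)
print(kept)
```

Terms seen in a single document are dropped, which keeps the dictionary small and removes features the model could only memorize.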
Solr TermVectorComponent
● SearchComponent designed to return information about terms in documents
  – tv.df returns the document frequency of each term in the document
  – tv.tf returns the term frequency of each term in the document
    ● Used as feature value
  – tv.fl provides the list of fields to get term vectors for
    ● Only the catch-all field we use for classification
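A request using these parameters might be assembled as below. This is a sketch under assumptions: the base URL, the core name `software` and the handler path `/tvrh` are placeholders for whatever handler the term vector component is attached to in your solrconfig; only the `tv`, `tv.tf`, `tv.df` and `tv.fl` parameters come from the component itself.

```python
from urllib.parse import urlencode

def term_vector_url(base_url, core, query, field, rows=10):
    """Build a Solr URL asking for tf and df of every term in `field`."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id",
        "tv": "true",
        "tv.tf": "true",    # term frequency in the document
        "tv.df": "true",    # document frequency in the index
        "tv.fl": field,     # only the catch-all field
        "wt": "json",
    }
    # "/tvrh" is an assumed handler name, adjust to your configuration
    return f"{base_url}/{core}/tvrh?{urlencode(params)}"

url = term_vector_url(
    "http://localhost:8983/solr",
    "software",
    "description:python AND title:python",
    "title_and_description",
)
print(url)
```

The function only builds the URL; fetching and parsing the response depends on the Solr version and the response writer in use.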
Solr Core configuration
● Set the termVectors attribute on the fields you will use
  – <field name="title_and_description" type="texte_analyse"
    indexed="true" stored="true" termVectors="true"
    termPositions="true" termOffsets="true"/>
● Normalize your text and use stemming during the analysis
● Enable TermVectorComponent in solrconfig
  – <searchComponent name="tvComponent"
    class="org.apache.solr.handler.component.TermVectorComponent"/>
  – Configure a RequestHandler to use this component
    ● <lst name="defaults"> <bool name="tv">true</bool> </lst>
    ● <arr name="last-components"> <str>tvComponent</str> </arr>
Constructing Training and Test sets per model
Feature extraction
● Domain expert query is used to extract docs for each category
  – TVC returns the term info of the terms in each document
  – Each term is replaced by its index from the dictionary
    ● This is the attribute
  – Its tf info is used as value
    ● Some use presence/absence (or 1/0)
    ● Others tf-idf
● term_index_from_dico:term_freq is an input feature
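The mapping from term info to input features can be sketched like this. The function name, the `weighting` switch and the sample values are illustrative assumptions; the tf-idf branch uses tf * ln(N / df), which is only one common variant of that weighting.

```python
import math

def to_features(term_freqs, dictionary, weighting="tf", doc_freqs=None, n_docs=None):
    """Turn {term: tf} into {dictionary index: value}; unknown terms are dropped."""
    features = {}
    for term, tf in term_freqs.items():
        index = dictionary.get(term)
        if index is None:            # term was not kept in the dictionary
            continue
        if weighting == "binary":    # presence/absence
            features[index] = 1
        elif weighting == "tfidf":   # one common tf-idf variant
            features[index] = tf * math.log(n_docs / doc_freqs[term])
        else:                        # raw term frequency, as in the talk
            features[index] = tf
    return features

dictionary = {".net": 1, "jav": 2, "solr": 3}
doc_terms = {"solr": 2, "jav": 1, "unknownterm": 4}

tf_features = to_features(doc_terms, dictionary)
binary_features = to_features(doc_terms, dictionary, weighting="binary")
tfidf_features = to_features(
    doc_terms, dictionary, weighting="tfidf",
    doc_freqs={"solr": 10, "jav": 100}, n_docs=100,
)
print(tf_features, binary_features, tfidf_features)
```

Each resulting pair is one index:value entry of the training line once the indices are sorted.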
Training and Test sets partition
● We shuffle the document set so that high-score docs do not all go to the same bucket
● We split the result list so that
  – 60 % goes to the training set (TS)
    ● These are positive examples (the +1s)
  – 20 % goes to the validation set (VS)
    ● Positive in this model, negative in others
  – 20 % is used for the training sets of the other classes (OTS)
    ● These are negative examples for the others
● Balanced training set (≈50 % of +1s and ≈50 % of -1s)
  – The negatives come from the others' 20 % OTS
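The shuffle, the 60/20/20 split and the balancing step can be sketched as follows. The function names, the fixed seed and the sample identifiers are assumptions for illustration; how negatives are drawn from the other categories' OTS is a simple random sample here, which the talk does not specify.

```python
import random

def split_category(doc_ids, seed=0):
    """Shuffle one category's documents and split them 60 / 20 / 20."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)
    n_train = len(docs) * 60 // 100
    n_valid = len(docs) * 20 // 100
    training = docs[:n_train]                      # TS: the +1s of this model
    validation = docs[n_train:n_train + n_valid]   # VS
    others = docs[n_train + n_valid:]              # OTS: -1s for other models
    return training, validation, others

def balanced_training_set(positives, negatives_pool, seed=0):
    """Pair the positives with an equal number of negatives from other categories."""
    k = min(len(positives), len(negatives_pool))
    negatives = random.Random(seed).sample(list(negatives_pool), k)
    return [(d, 1) for d in positives] + [(d, -1) for d in negatives]

ts, vs, ots = split_category(range(10))
other_categories_ots = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"]
labelled = balanced_training_set(ts, other_categories_ots)
print(len(ts), len(vs), len(ots), len(labelled))
```

Fixing the seed makes the partition reproducible, which helps when models for several categories have to be retrained and compared.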
Model file
● Model file is saved after training
  – One model per category
  – It outlines the following
    ● solver_type L2R_L2LOSS_SVC
    ● nr_class 2
    ● label 1 -1
    ● nr_feature 8920
    ● bias 1.000000000000000
    ● w (the weight vector in w.xi + b ≥ 1 if yi = 1)
      -0.1626437446641374
      ...
      0
      7.152404908494515e-05
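A model file with this layout can be read back with a few lines of Python. This sketch assumes a two-class model with one weight per line after the `w` marker; the sample text is a shortened, made-up model (3 features plus one extra weight for the bias term), not the file from the talk.

```python
def read_liblinear_weights(text):
    """Parse a two-class LibLinear model file into (header dict, weight list)."""
    header, weights, in_weights = {}, [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if in_weights:
            weights.append(float(line.split()[0]))
        elif line == "w":
            in_weights = True          # everything after this line is a weight
        else:
            key, _, value = line.partition(" ")
            header[key] = value
    return header, weights

# Shortened, illustrative model file
sample = """solver_type L2R_L2LOSS_SVC
nr_class 2
label 1 -1
nr_feature 3
bias 1
w
-0.5
0
0.25
0.1
"""
header, weights = read_liblinear_weights(sample)
print(header["nr_feature"], weights)
```

When the bias is non-negative, the weight list is one entry longer than nr_feature; the last entry belongs to the bias term rather than to a dictionary term.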
Most predictive terms
● Model file contains the weight vector w
● Use w to compute the most predictive terms of the model
  – Gives an indication as to whether the model is good or not
    ● You are the domain expert
  – Useful to extend basic keyword search to semantic search
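Ranking the terms by weight can be sketched as below. It assumes the i-th weight belongs to the term with dictionary index i, that positive weights push towards the +1 label (the first label in the model file), and that any extra trailing weight is the bias term; the dictionary and weights are made-up values.

```python
def top_terms(weights, dictionary, k=10):
    """Return the k (term, weight) pairs with the largest weights."""
    index_to_term = {index: term for term, index in dictionary.items()}
    scored = [
        (index_to_term[i], w)
        for i, w in enumerate(weights, start=1)
        if i in index_to_term          # skips the trailing bias weight
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

dictionary = {".net": 1, "jav": 2, "solr": 3}
weights = [-0.5, 0.0, 0.25, 0.1]       # last value: assumed bias weight
best = top_terms(weights, dictionary, k=2)
print(best)
```

A domain expert can read the resulting list to judge whether the model has picked up sensible vocabulary for the category.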
Toward semantic search - Indexing
● Create a category core in Solr
  – Each document represents a category
    ● One field for the category ID
    ● One multi-valued field holds its top predictive terms
● At indexing time
  – Each document is sent to the classification service
  – The service returns the categories of the document
  – Categories are saved in a multi-valued field along with other domain-pertinent document fields
Toward semantic search - searching
● At search time
  – User query is run on the category core
    ● What about LibShortText?
  – The returned categories are used to extend the initial query
    ● A boost < 1 is assigned to the category
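The query extension step might look like the sketch below. The field name `categories`, the boost value 0.5 and the OR combination are assumptions made for illustration; the talk only states that the returned categories extend the initial query with a boost below 1. The `^` suffix is the standard Lucene query syntax for boosting a clause.

```python
def extend_query(user_query, categories, field="categories", boost=0.5):
    """Append category clauses, each with a boost below 1, to the user query."""
    if not categories:
        return user_query
    clause = " OR ".join(f'{field}:"{c}"^{boost}' for c in categories)
    return f"({user_query}) OR ({clause})"

extended = extend_query("emarketing", ["webmarketing"])
unchanged = extend_query("emarketing", [])
print(extended)
```

With a boost below 1, documents that match the user's own keywords still rank above those matched only through their category.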
References
● Cortes and Vapnik, 1995. Support-Vector Networks
● Chang and Lin, 2012. LibSVM: A Library for Support Vector Machines
● Fan, Lin, et al., 2012. LibLinear: A Library for Large Linear Classification
● Thorsten Joachims, 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features
● Rifkin and Klautau, 2004. In Defense of One-Vs-All Classification
A big thank you
● Lucene/Solr Revolution EU 2013 organizers
● To Valtech Management
● To Michels, Maj-Daniels, and Marie-Audrey Fansi
● To all of you for your presence and attention
Questions?
To my wife, Marie-Audrey, for all the attention she pays to our family

Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...lucenerevolution
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platformlucenerevolution
 

More from lucenerevolution (20)

State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 

Recently uploaded

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

  • 2. Text classification with Lucene/Solr and LibSVM
    By Majirus FANSI, PhD (@majirus), Agile Software Developer
  • 3. Motivation: Guiding user search
    ● Search engines are basically keyword-oriented
      – What about the meaning?
    ● Synonym search requires listing the synonyms
    ● The More-Like-This component is about more like THIS
    ● Category search for a better user experience
      – Deals with the cases where user keywords are not in the collection
      – User searches for « emarketing », you return documents on « webmarketing »
  • 4. Outline
    ● Text Categorization
    ● Introducing Machine Learning
    ● Why SVM?
    ● How can Solr help?
    Putting it all together is our aim
  • 5. Text classification or Categorization
    ● Aims
      – Classify documents into a fixed number of predefined categories
      – Each document can be in multiple, exactly one, or no category at all
    ● Applications
      – Classifying emails (Spam / Not Spam)
      – Guiding user search
    ● Challenges
      – Building text classifiers by hand is difficult and time-consuming
      – It is advantageous to learn classifiers from examples
  • 6. Machine Learning
    ● Definition (by Tom Mitchell - 1998)
      “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E”
      – Experience E: watching the label of a document
      – Task T: classify a document
      – Performance P: probability that a document is correctly classified
  • 7. Machine Learning Algorithms
    ● Unsupervised learning
      – Let the program learn by itself
      – Market segmentation, social network analysis...
    ● Supervised learning
      – Teach the computer program how to do something
      – We give the algorithm the “right answers” for some examples
  • 8. Supervised learning problems
    – Regression
      ● Predict continuous valued output
      ● Ex: price of the houses in Corbeil Essonnes
    – Classification
      ● Predict a discrete valued output (+1, -1)
  • 9. Supervised learning: Working
    ● Training set (X, Y) of m training examples
      – (X(i), Y(i)): the i-th training example
      – X's: input variables or features
      – Y's: output/target variable
    ● It is the job of the learning algorithm to produce the model h
    [Diagram: Training set → Training algorithm → Hypothesis h; Feature vector (x) → h(x) → Predicted value (y)]
  • 10. Classifier/Decision Boundary
    ● Carves up the feature space into volumes
    ● Feature vectors in the same volume are assigned to the same class
    ● Decision regions are separated by surfaces
    ● The decision boundary is linear if it is a straight line in the feature space
      – A line in 2D, a plane in 3D, a hyperplane in 4+D
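As a rough illustration of a linear decision boundary, the sketch below assigns a class according to the side of the hyperplane w.x + b = 0 on which a feature vector falls. The weights, bias and points are made-up values for a 2D example, not taken from the talk.

```python
def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Points exactly on the boundary are assigned to +1 by convention here.
    return 1 if score >= 0 else -1

# Hypothetical 2D boundary: the line x1 + x2 - 1 = 0
w, b = [1.0, 1.0], -1.0
above = predict(w, b, [2.0, 2.0])  # score = 2 + 2 - 1 = 3
below = predict(w, b, [0.0, 0.0])  # score = 0 + 0 - 1 = -1
```

The same function works unchanged in higher dimensions, where the boundary becomes a plane or a hyperplane.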
  • 11. Which algorithm for a text classifier?
  • 12. Properties of text
    ● High dimensional input space
      – More than 10 000 features
    ● Very few irrelevant features
    ● Document vectors are sparse
      – Few entries are non-zero
    ● Most text categorization problems are linearly separable
      – No need to map the input features to a higher dimension space
  • 13. Classification algorithm / choosing the method
    ● Thorsten Joachims compares SVM to Naive Bayes, Rocchio, K-nearest neighbor and the C4.5 decision tree
    ● SVM consistently achieves good performance on categorization tasks
      – It outperforms the other methods
      – Eliminates the need for feature selection
      – More robust than the others
    Thorsten Joachims, 1998. Text Categorization with SVM: Learning with many relevant features
  • 14. SVM? Yes, but...
    « The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora »
    Banko & Brill in « Scaling to Very Very Large Corpora for Natural Language Disambiguation »
  • 15. What is SVM - Support Vector Machine?
    ● « Support Vector Networks », Cortes & Vapnik, 1995
    ● SVM implements the following idea
      – Maps the input vectors into some high dimensional feature space Z
        ● Through some non-linear mapping chosen a priori
      – In this feature space a linear decision surface is constructed
      – Special properties of the decision surface ensure high generalization ability of the learning machine
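  • 16. SVM - Classification of an unknown pattern
    [Figure: the input vector x goes through a non-linear transformation into feature space; the input vector in feature space is compared with the support vectors zi (sv1, sv2, ..., svk) in feature space, and the results are combined with weights w1, w2, ..., wN to produce the classification]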
  • 17. SVM - decision boundary
    ● Optimal hyperplane
      – Training data can be separated without errors
      – It is the linear decision function with maximal margin between the vectors of the two classes
    ● Soft margin hyperplane
      – Training data cannot be separated without errors
  • 19. Optimal hyperplane - figure
    [Figure: axes x1 and x2; the optimal hyperplane separating the two classes, with the optimal margin]
  • 20. SVM - optimal hyperplane
    ● Given the training set X of $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$; $y_i \in \{-1, 1\}$
    ● X is linearly separable if there exist a vector w and a scalar b s.t.
      $w \cdot x_i + b \ge 1$ if $y_i = 1$ (1)
      $w \cdot x_i + b \le -1$ if $y_i = -1$ (2)
      (1), (2) $\Rightarrow y_i (w \cdot x_i + b) \ge 1$ (3)
    ● Vectors $x_i$ for which $y_i (w \cdot x_i + b) = 1$ are termed support vectors
      – Used to construct the hyperplane
      – If the training vectors are separated without errors by an optimal hyperplane
        $E[\Pr(\text{error})] \le \frac{E[\text{number of support vectors}]}{\text{number of training vectors}}$ (4)
    ● The optimal hyperplane $w_0 \cdot z + b_0 = 0$ (5)
      – Unique one which separates the training data with a maximal margin
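To make constraint (3) and the notion of support vector concrete, here is a small sketch that checks whether a candidate (w, b) separates a toy training set and lists the vectors lying exactly on the margin. The one-dimensional data and the values of w and b are invented for illustration; no optimization is performed.

```python
def functional_margins(w, b, X, y):
    """Compute y_i * (w.x_i + b) for every training example."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

def is_separated(w, b, X, y):
    """Constraint (3): every example must satisfy y_i * (w.x_i + b) >= 1."""
    return all(m >= 1 for m in functional_margins(w, b, X, y))

def support_vector_indices(w, b, X, y, tol=1e-9):
    """Indices of examples for which y_i * (w.x_i + b) equals 1."""
    return [i for i, m in enumerate(functional_margins(w, b, X, y))
            if abs(m - 1) <= tol]

# Toy 1D training set (hypothetical values)
X = [[2.0], [3.0], [-2.0], [-4.0]]
y = [1, 1, -1, -1]
w, b = [0.5], 0.0

separated = is_separated(w, b, X, y)
sv = support_vector_indices(w, b, X, y)
```

Here the examples at x = 2 and x = -2 sit exactly on the margin, so they are the support vectors of this toy problem.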
  • 21. SVM - optimal hyperplane – decision function
    ● Let us consider the optimal hyperplane
      $w_0 \cdot z + b_0 = 0$ (5)
    ● The weight $w_0$ can be written as some linear combination of SVs
      $w_0 = \sum_{\text{support vectors}} \alpha_i z_i$ (6)
    ● The linear decision function I(z) is of the form
      $I(z) = \operatorname{sign}\left( \sum_{\text{support vectors}} \alpha_i \, z_i \cdot z + b_0 \right)$ (7)
    ● $z_i \cdot z$ is the dot product between support vector $z_i$ and vector $z$
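The decision function (7) can be sketched directly from its definition. The support vectors, coefficients and bias below are hypothetical; as in the slide's notation, each coefficient α_i is assumed to carry the sign of its class.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def decide(support_vectors, alphas, b0, z):
    """I(z) = sign(sum_i alpha_i * (z_i . z) + b0), returning +1 or -1."""
    score = sum(a * dot(zi, z) for a, zi in zip(alphas, support_vectors)) + b0
    # A score of exactly 0 is mapped to +1 by convention in this sketch.
    return 1 if score >= 0 else -1

# Hypothetical support vectors and coefficients (1D feature space)
support_vectors = [[2.0], [-2.0]]
alphas = [0.125, -0.125]
b0 = 0.0

# Equation (6): w0 is the alpha-weighted sum of the support vectors
w0 = [sum(a * zi[j] for a, zi in zip(alphas, support_vectors)) for j in range(1)]

positive = decide(support_vectors, alphas, b0, [3.0])
negative = decide(support_vectors, alphas, b0, [-1.0])
```

Only the support vectors enter the sum, which is why the remaining training vectors can be discarded once the model is built.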
  • 23. Soft margin Classification
    ● Want to separate the training set with a minimal number of errors
      $\Phi(\xi) = \sum_{i=1}^{m} \xi_i^{\sigma}$; $\xi_i \ge 0$; for small $\sigma > 0$ (5)
      s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$; $i = 1, \ldots, m$ (6)
    ● The functional (5) describes the number of training errors
    ● Removing the subset of training errors from the training set
      – Remaining part separated without errors
      – By constructing an optimal hyperplane
  • 24. SVM - soft margin Idea
    ● Soft margin SVM can be expressed as
      $\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i$ (7)
      s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$; $\xi_i \ge 0$ (8)
    ● For sufficiently large C, the vector $w_0$ and constant $b_0$ that minimize (7) under (8) determine the hyperplane that
      – Minimizes the sum of deviations, ξ, of training errors
      – Maximizes the margin for correctly classified vectors
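The objective (7) can be evaluated for a given candidate (w, b) as in the sketch below. It relies on the fact that, for fixed w and b, the smallest slack satisfying (8) is ξ_i = max(0, 1 - y_i (w.x_i + b)). The data, w, b and C are invented values; this evaluates the objective only and does not solve the minimization.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Smallest xi_i satisfying (8) for fixed w and b."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """Value of (7): 1/2 * ||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Toy 1D data (hypothetical): the last example is on the wrong side
X = [[2.0], [0.5], [-2.0], [1.0]]
y = [1, 1, -1, -1]
w, b = [1.0], 0.0

xi = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)
```

In this toy case the second example lies inside the margin (0 < ξ < 1) and the fourth is misclassified (ξ > 1), matching the cases shown in the soft margin figure.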
  • 25. SVM - soft margin figure
    [Figure: axes x1 and x2; the separator and the soft margin, with points labelled ξ = 0, 0 < ξ < 1 and ξ > 1]
  • 27. Constructing and using the text classifier
    ● Which library?
      – Efficient optimization packages are available
        ● SVMlight, LibSVM
    ● From text to feature vectors
      – Lucene/Solr helps here
    ● Multi-class classification vs One-vs-the-rest
    ● Using the categories for semantic search
      – Dedicated Solr index with the most predictive terms
  • 28. SVM library
    ● SVMlight
      – By Thorsten Joachims
    ● LibSVM
      – By Chang & Lin from National Taiwan University
      – Under heavy development and testing
      – Library for Java, C, Python, ...; package for the R language
    ● LibLinear
      – By Chang, Lin et al.
      – Brother of LibSVM
      – Recommended by the LibSVM authors for large-scale linear classification
  • 29. LibLinear
    ● A Library for Large Linear Classification
      – Binary and multi-class
      – Implements Logistic Regression and linear SVM
    ● Format of the training and testing data file is:
      – <label> <index1>:<value1> <index2>:<value2> ...
      – Each line contains an instance and is ended by a '\n'
      – <label> is an integer indicating the class label
      – The pair <index>:<value> gives a feature value
        ● <index> is an integer starting from 1
        ● <value> is a real number
      – Indices must be in ascending order
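A minimal sketch of writing one instance in the format described above, assuming the features are held in a dictionary from 1-based index to value. The helper name `to_liblinear_line` is hypothetical; it only formats text and does not call LibLinear.

```python
def to_liblinear_line(label, features):
    """Format one instance as '<label> <index>:<value> ...'.

    Zero-valued features are omitted and indices are emitted in ascending
    order, as the format requires.
    """
    parts = [f"{index}:{value:g}"
             for index, value in sorted(features.items())
             if value != 0]
    return " ".join([str(label)] + parts)

# Hypothetical sparse vector; index 300 has value 0 and is therefore dropped
line = to_liblinear_line(1, {234: 2, 101: 1, 123: 5, 300: 0})
negative_line = to_liblinear_line(-1, {453: 3, 54: 2, 64: 1})
```

When writing a file, each such line would be followed by a newline character.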
  • 30. LibLinear input and dictionary
    ● Example input file for training
      1 101:1 123:5 234:2
      -1 54:2 64:1 453:3
      – Do not have to represent the zeros
    ● Need a dictionary of terms in lexicographical order
      1 .net
      2 aa
      ...
      6000 jav
      ...
      7565 solr
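The dictionary can be sketched as a mapping from each distinct term to its 1-based rank in sorted order. This assumes Python's default string ordering (by code point) is an acceptable stand-in for the lexicographical order mentioned above; the term list is made up.

```python
def build_dictionary(terms):
    """Map each distinct term to a 1-based index in lexicographical order."""
    return {term: index
            for index, term in enumerate(sorted(set(terms)), start=1)}

# Hypothetical analyzed terms, with a duplicate
dictionary = build_dictionary(["solr", "aa", ".net", "jav", "aa"])
```

Because the indices depend on the full term list, the dictionary has to be frozen before the training and test files are generated, otherwise feature indices would shift between files.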
  • 31. Building the dictionary
    ● Divide the overall training data into a number of portions
      – Using knowledge of your domain
        ● Software development portion
        ● Marketing portion...
      – Avoid a very large dictionary
        ● A Java dev position and a marketing position share few common terms
    ● Use expert boolean queries to load a dedicated Solr core per domain
      – description:python AND title:python
  • 32. Building the dictionary with Solr
    ● What do we need in the dictionary?
      – Terms properly analyzed
        ● LowerCaseFilterFactory, StopFilterFactory,
        ● ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
      – Terms that occur in a number of documents (df > min)
        ● Rare terms may cause the model to overfit
    ● Terms are retrieved from Solr
      – Using the Solr TermVectorComponent
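The df > min rule can be sketched as follows, assuming each document has already been analyzed into a list of terms. In the talk the document frequencies come from Solr; here they are recomputed locally from made-up data, and the threshold value is arbitrary.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # a term counts once per document
    return df

def frequent_terms(docs, min_df):
    """Keep only terms whose document frequency is strictly above min_df."""
    df = document_frequencies(docs)
    return sorted(term for term, count in df.items() if count > min_df)

# Hypothetical analyzed documents
docs = [["java", "solr", "java"], ["java", "python"], ["solr", "java"]]
kept = frequent_terms(docs, min_df=1)
```

Here "python" appears in a single document and is dropped, which is the intended protection against rare terms.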
  • 33. Solr TermVectorComponent
    ● SearchComponent designed to return information about terms in documents
      – tv.df returns the document frequency per term in the document
      – tv.tf returns document term frequency info per term in the document
        ● Used as feature value
      – tv.fl provides the list of fields to get term vectors for
        ● Only the catch-all field we use for classification
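A sketch of how a request using these parameters might be assembled. The base URL, the core name `jobs` and the handler path `/tvrh` are assumptions (the path is the name used in Solr's sample configuration; your handler may be registered differently). The code only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

def term_vector_url(base_url, query, field, rows=10):
    """Build a query URL asking for tf and df of the terms of one field."""
    params = {
        "q": query,
        "fl": "id",
        "rows": rows,
        "tv": "true",      # switch the term vector component on
        "tv.tf": "true",   # term frequency per term in the document
        "tv.df": "true",   # document frequency per term
        "tv.fl": field,    # field(s) to return term vectors for
    }
    # "/tvrh" is an assumed handler name, not something the slides specify
    return f"{base_url}/tvrh?{urlencode(params)}"

url = term_vector_url(
    "http://localhost:8983/solr/jobs",          # hypothetical host and core
    "description:python AND title:python",
    "title_and_description",
)
```

The query string is URL-encoded, so the colon and spaces of the expert query are escaped automatically.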
  • 34. Solr Core configuration
    ● Set the termVectors attribute on the fields you will use
      – <field name="title_and_description" type="texte_analyse" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
      – Normalize your text and use stemming during the analysis
    ● Enable TermVectorComponent in solrconfig
      – <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
      – Configure a RequestHandler to use this component
        ● <lst name="defaults"> <bool name="tv">true</bool> </lst>
        ● <arr name="last-components"> <str>tvComponent</str> </arr>
  • 35. Constructing Training and Test sets per model
  • 36. Feature extraction
    ● Domain expert query is used to extract docs for each category
      – TVC returns the terms info of the terms in each document
      – Each term is replaced by its index from the dictionary
        ● This is the attribute
      – Its tf info is used as value
        ● Some use presence/absence (or 1/0)
        ● Others tf-idf
      – term_index_from_dico:term_freq is an input feature
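The mapping from a document's terms to input features can be sketched like this, assuming the term frequencies have already been read from the term vector response into a plain dictionary. The helper names and the sample data are hypothetical; terms absent from the dictionary are simply skipped.

```python
def to_features(term_freqs, dictionary):
    """Replace each known term by its dictionary index, keeping tf as value."""
    return {dictionary[term]: tf
            for term, tf in term_freqs.items()
            if term in dictionary}

def to_training_line(label, term_freqs, dictionary):
    """Emit '<label> <term_index>:<term_freq> ...' with ascending indices."""
    features = to_features(term_freqs, dictionary)
    pairs = [f"{index}:{tf}" for index, tf in sorted(features.items())]
    return " ".join([str(label)] + pairs)

# Hypothetical dictionary excerpt and per-document term frequencies
dictionary = {".net": 1, "aa": 2, "jav": 6000, "solr": 7565}
term_freqs = {"solr": 3, "jav": 1, "rareword": 2}

line = to_training_line(1, term_freqs, dictionary)
```

Switching to presence/absence would only mean replacing `tf` by 1 in `to_features`.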
  • 37. Training and Test sets partition
    ● We shuffle the document set so that high-score docs do not go to the same bucket
    ● We split the result list so that
      – 60 % goes to the training set (TS)
        ● Here are positive examples (the +1s)
      – 20 % goes to the validation set (VS)
        ● Positive in this model, negative in others
      – 20 % is used for the other classes' training sets (OTS)
        ● These are negative examples to others
    ● Balanced training set (≈50 % of +1s and ≈50 % of -1s)
      – The negatives come from the others' 20 % OTS
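The shuffle and the 60/20/20 split could look like the sketch below. The fixed seed and the integer rounding are choices made here for reproducibility, not something the talk specifies, and the document ids are placeholders.

```python
import random

def split_documents(doc_ids, seed=0):
    """Shuffle, then split into training (60 %), validation (20 %) and
    other-classes (20 %) buckets."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)  # avoid grouping high-score docs
    n_train = len(docs) * 60 // 100
    n_valid = len(docs) * 20 // 100
    training = docs[:n_train]
    validation = docs[n_train:n_train + n_valid]
    others = docs[n_train + n_valid:]  # negatives for the other models
    return training, validation, others

doc_ids = [f"doc{i}" for i in range(10)]  # hypothetical result list
training, validation, others = split_documents(doc_ids)
```

Every document lands in exactly one bucket; the last bucket absorbs any remainder left by the integer division.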
  • 38. Model file
    ● Model file is saved after training
      – One model per category
      – It outlines the following
        ● solver_type L2R_L2LOSS_SVC
        ● nr_class 2
        ● label 1 -1
        ● nr_feature 8920
        ● bias 1.000000000000000
        ● w
          -0.1626437446641374
          ...
          0
          7.152404908494515e-05
      – Reminder: w.xi + b ≥ 1 if yi = 1
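A sketch of reading such a model file, assuming the layout shown above: "key value" header lines, a line containing only "w", then one weight per line. The sample text is a shortened, made-up model with three features, not the model from the talk.

```python
def parse_model(text):
    """Return (header dict, list of weights) from a two-class model file."""
    header, weights = {}, []
    in_weights = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_weights:
            weights.append(float(line.split()[0]))  # one weight per line
        elif line == "w":
            in_weights = True  # everything after this line is a weight
        else:
            key, _, value = line.partition(" ")
            header[key] = value
    return header, weights

# Shortened hypothetical model text
sample = """solver_type L2R_L2LOSS_SVC
nr_class 2
label 1 -1
nr_feature 3
bias -1
w
-0.1626437446641374
0
7.152404908494515e-05
"""

header, weights = parse_model(sample)
```

Header values are kept as strings so the caller can decide how to interpret each of them.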
  • 39. Most predictive terms
    ● Model file contains the weight vector w
    ● Use w to compute the most predictive terms of the model
      – Gives an indication as to whether the model is good or not
        ● You are the domain expert
      – Useful to extend basic keyword search to semantic search
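One way to rank terms by weight is sketched below. It assumes that the j-th weight belongs to the feature with dictionary index j (1-based) and that the largest positive weights point to the +1 class; both are assumptions of this sketch, and the weights and dictionary are invented.

```python
def most_predictive_terms(weights, dictionary, k=3):
    """Return up to k terms with the largest positive weights."""
    index_to_term = {index: term for term, index in dictionary.items()}
    ranked = sorted(
        ((weight, index_to_term[index])
         for index, weight in enumerate(weights, start=1)
         if index in index_to_term),
        reverse=True,
    )
    return [term for weight, term in ranked[:k] if weight > 0]

# Hypothetical dictionary and weight vector
dictionary = {".net": 1, "aa": 2, "jav": 3, "solr": 4}
weights = [0.1, -0.3, 0.8, 0.5]

top_terms = most_predictive_terms(weights, dictionary, k=2)
```

A domain expert can then inspect the returned terms to judge whether the model has learned something sensible for the category.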
  • 40. Toward semantic search - Indexing
    ● Create a category core in Solr
      – Each document represents a category
        ● One field for the category ID
        ● One multi-valued field holds its top predictive terms
    ● At indexing time
      – Each document is sent to the classification service
      – The service returns the categories of the document
      – Categories are saved in a multi-valued field along with the other domain-pertinent document fields
  • 41. Toward semantic search - searching
    ● At search time
      – User query is run on the category core
        ● What about LibShortText?
      – The returned categories are used to extend the initial query
        ● A boost < 1 is assigned to the category
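Extending the user query with the returned categories might look like the following sketch, which builds a Lucene-style query string. The field name `categories` and the boost value 0.5 are assumptions chosen for illustration; the only constraint taken from the slide is that the boost is below 1.

```python
def expand_query(user_query, categories, boost=0.5, field="categories"):
    """Append one boosted category clause per returned category."""
    if not categories:
        return user_query  # nothing to add, keep the keyword query as is
    clauses = " OR ".join(f'{field}:"{category}"^{boost}'
                          for category in categories)
    return f"({user_query}) OR {clauses}"

# Hypothetical category returned by the category core for this query
expanded = expand_query("emarketing", ["webmarketing"])
unchanged = expand_query("emarketing", [])
```

With a boost below 1, documents that match the original keywords still rank above documents that match only through their category.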
  • 42. References
    ● Cortes and Vapnik, 1995. Support-Vector Networks
    ● Chang and Lin, 2012. LibSVM: A Library for Support Vector Machines
    ● Fan, Lin, et al., 2012. LibLinear: A Library for Large Linear Classification
    ● Thorsten Joachims, 1998. Text Categorization with SVM: Learning with many relevant features
    ● Rifkin and Klautau, 2004. In Defense of One-Vs-All Classification
  • 43. A big thank you
    ● To the Lucene/Solr Revolution EU 2013 organizers
    ● To Valtech Management
    ● To Michels, Maj-Daniels, and Marie-Audrey Fansi
    ● To all of you for your presence and attention
  • 45. To my wife, Marie-Audrey, for all the attention she pays to our family