We use metadata of various kinds to improve and enrich text document clustering using an extension of Latent Dirichlet Allocation (LDA). The methods are fully implemented and evaluated, and the software is available on GitHub.
These are the slides of an invited talk I gave September 8 at the Alexandria Workshop of TPDL-2016: http://alexandria-project.eu/events/3rd-workshop/
1. Text Mining Using LDA with Context
Christoph Kling, Steffen Staab
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Web and Internet Science Group · ECS · University of Southampton, UK
2. Text Mining Documents
Documents are
PDFs, emails, tweets,
Flickr photo tags, CVs, ...
Documents consist of
bag of words
metadata
- author(s)
- timestamp
- geolocation
- publisher
- booktitle
- device
...
[Figure: three example topics with their top words. Chinese food: dimsum, duck, eggs, ...; Vegan food: vegan, tofu, ...; Breakfast: eggs, ham, ...]
Objective: cluster, categorize & explain
3. Latent Dirichlet Allocation (LDA)
4. Latent Dirichlet Allocation (LDA)
Document-topic distributions
Topic-word distributions
K topics
M documents
Each document m ∈ {1, …, M} has length N_m
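The quantities named on this slide can be tied together by the standard LDA generative process. The following is a minimal sketch using only the Python standard library; the function names, the symmetric hyperparameters `alpha` and `beta`, and the toy vocabulary are assumptions made for illustration and do not come from the talk.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(K, M, vocab, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Generate M bag-of-words documents from K topics (LDA generative process).

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    V = len(vocab)
    # Topic-word distributions: one distribution over the vocabulary per topic.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    docs, thetas = [], []
    for m in range(M):
        # Document-topic distribution of document m.
        theta = sample_dirichlet([alpha] * K, rng)
        words = []
        for _ in range(doc_lengths[m]):  # document m has length N_m
            z = rng.choices(range(K), weights=theta)[0]   # topic of this token
            w = rng.choices(range(V), weights=phi[z])[0]  # word drawn from topic z
            words.append(vocab[w])
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi

if __name__ == "__main__":
    vocab = ["dimsum", "duck", "eggs", "vegan", "tofu", "ham"]
    docs, thetas, phi = generate_corpus(K=3, M=4, vocab=vocab, doc_lengths=[5, 6, 7, 8])
    for words, theta in zip(docs, thetas):
        print([round(p, 2) for p in theta], words)
```

Inference runs this process in reverse: given only the words, it estimates the document-topic and topic-word distributions.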
5. Use Metadata to Help Topic Prediction
Improve topic detection
→ Morning times may help to improve the breakfast topic
Describe dependencies: metadata ↔ topics
→ breakfast topic happens during morning hours
6. Use Metadata to Help Topic Prediction
Improve topic detection
→ Morning times may help to improve the breakfast topic
Describe dependencies: metadata ↔ topics
→ breakfast topic happens
during morning hours
Usage
Autocompletion
→ From words to words
Prediction of search queries
→ From metadata to words
→ From words to metadata
7. Structures of Metadata Spaces
Nominal
Ordinal
Cyclic
Spherical
Networked (figure example: nodes Nejdl, Staab, Kling)
8. Challenges for Using Metadata for Text Mining
Generalizing the Text Mining Model
Creating a special text mining model for every dataset and its
particular kinds of metadata spaces is impractical
→ we need flexible models!
9. Challenges for Using Metadata for Text Mining
Generalizing the Text Mining Model
Efficiency of the Text Mining Model
Rich metadata
→ complex models
→ complex inference, slow convergence of samplers
→ analysis of big datasets impossible
10. Challenges for Using Metadata for Text Mining
Generalizing the Text Mining Model
Efficiency of the Text Mining Model
Explaining the Result
Importance of Metadata
→ learn how to weight metadata
→ exclude irrelevant metadata (improves efficiency!)
Complex dependencies & complex probability functions
→ Learned parameters incomprehensible
→ Reduced usefulness for data analysis / visualisation
→ No sanity checks on parameters
11. Topic Models for Arbitrary Metadata
12. Topic Models for Arbitrary Metadata
Predict document-topic distributions using metadata
→ Gaussian Process Regression Topic Model
(Agovic & Banerjee, 2012)
→ Dirichlet-Multinomial Regression Topic Model
(Mimno & McCallum, 2012)
→ Structural Topic Model (logistic normal regression)
(Roberts et al., 2013)
13. Topic Models for Arbitrary Metadata
Predict document-topic distributions using metadata
→ Gaussian Process Regression Topic Model
→ Dirichlet-Multinomial Regression Topic Model
→ Structural Topic Model (logistic normal regression)
Regression input: Metadata
Regression output: Topic distribution
14. Topic Models for Arbitrary Metadata
Dirichlet-multinomial regression
Metadata
Document-topic distributions
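To make the link from metadata to document-topic distributions concrete, here is a small sketch of the log-linear prior commonly used in Dirichlet-multinomial regression, where the Dirichlet parameter of topic k for a document with feature vector x is exp(x · λ_k). The feature encoding (a bias term plus an "is morning" indicator) and the weight values are hypothetical, chosen only to mirror the breakfast example.

```python
import math

def dmr_prior(features, lam):
    """Dirichlet parameters alpha_k = exp(x . lambda_k) for one document.

    features: metadata feature vector x of the document
    lam:      one regression weight vector lambda_k per topic
    """
    return [math.exp(sum(x * l for x, l in zip(features, lam_k))) for lam_k in lam]

def expected_topics(alpha):
    """Mean of Dirichlet(alpha): the topic distribution predicted from metadata."""
    total = sum(alpha)
    return [a / total for a in alpha]

if __name__ == "__main__":
    # Hypothetical features: [bias, is_morning]
    x_morning = [1.0, 1.0]
    x_evening = [1.0, 0.0]
    # Hypothetical weights for two topics: [other, breakfast]
    lam = [[0.0, 0.0], [0.0, 1.0]]
    print(expected_topics(dmr_prior(x_morning, lam)))  # breakfast topic favoured
    print(expected_topics(dmr_prior(x_evening, lam)))  # both topics equally likely
```

The regression weights act on a log scale, which is one reason they are harder to read than probabilities or counts.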
15. Topic Models for Arbitrary Metadata
Gaussian process regression
Metadata
Document-topic distributions
16. Topic Models for Arbitrary Metadata
Logistic normal regression
Metadata
Document-topic distributions
17. Topic Models for Arbitrary Metadata
Alternating inference:
Estimate topics
Estimate regression model
Use prediction for re-estimating topics
Re-estimate regression model with new topics
...
19. Topic Models for Arbitrary Metadata
Applicable to a wide range of metadata!
Estimation of regression parameters relatively expensive
Learned parameters have no natural interpretation
The alternating process of parameter estimation is expensive
20. Topic Models for Arbitrary Metadata
Dirichlet-multinomial and logistic-normal regression do not
support complex input data
(e.g. geographical data, temporal cycles, …)
Gaussian process regression topic models are very
powerful with the right kernel function
...but require expert knowledge for kernel selection and
efficient inference!
21. Hierarchical Multi-Dirichlet Process Topic Models
The Idea
22. Topic Prediction
[Figure: topic probability of documents (e.g. emails) plotted against metadata (e.g. time)]
23. Dirichlet-Multinomial Regression
[Figure: topic probability plotted against metadata (e.g. time)]
24. Gaussian Process Regression
[Figure: topic probability plotted against metadata (e.g. time)]
25. Cluster-Based Prediction
[Figure: topic probability plotted against metadata (e.g. time)]
27. Cluster-Based Prediction
[Figure: topic probability plotted against metadata (e.g. time), with three further panels labelled "topic probability"]
29. Idea
Two-step model:
1) Cluster similar documents
2) Learn topics for clusters and documents simultaneously
▪ Learn topic distributions of document clusters
▪ Use cluster-topic distributions for topic prediction
30. Performance, Complex Metadata
Cluster documents separately for each kind of metadata
32. Performance, Complex Metadata
Cluster documents separately for each kind of metadata
+ nominal, ordinal, cyclic, spherical data
+ any data which can be clustered!
33. Performance, Complex Metadata
Metadata clusters are associated with topics
German Beer
Party
34. Mixture of Metadata Predictions
Metadata clusters are associated with topics
German Beer
Party
The topic prediction for a single document is a mixture of
the prediction of its metadata clusters
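The mixture described on this slide can be sketched as a weighted average of the topic distributions of the clusters a document belongs to. This is a simplified illustration, not the HMDP inference itself: the data layout, the function name `predict_topics`, and all numbers are assumptions.

```python
def predict_topics(cluster_topic, doc_clusters, weights):
    """Mix the topic distributions of a document's metadata clusters.

    cluster_topic: {metadata name: {cluster id: topic distribution}}
    doc_clusters:  {metadata name: cluster id of this document}
    weights:       {metadata name: non-negative weight}, normalised here
    """
    total = sum(weights[name] for name in doc_clusters)
    mixed = None
    for name, cluster_id in doc_clusters.items():
        dist = cluster_topic[name][cluster_id]
        share = weights[name] / total
        if mixed is None:
            mixed = [0.0] * len(dist)
        for k, p in enumerate(dist):
            mixed[k] += share * p
    return mixed

if __name__ == "__main__":
    # Hypothetical topics: [breakfast, party, other]
    cluster_topic = {
        "time": {"morning": [0.7, 0.2, 0.1]},
        "place": {"berlin": [0.1, 0.3, 0.6]},
    }
    doc_clusters = {"time": "morning", "place": "berlin"}
    weights = {"time": 0.75, "place": 0.25}
    print([round(p, 3) for p in predict_topics(cluster_topic, doc_clusters, weights)])
```

Because every component is a probability distribution and the shares sum to one, the mixed prediction is again a probability distribution over topics.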
35. Smoothing of HMDP
36. Cluster-Based Prediction vs. Outliers and Noisy Data
[Figure: topic probability plotted against metadata (e.g. time)]
37. Adjacency Smoothing
Naive approach: Smoothed value of a cluster is the mean
of the cluster and its adjacent clusters
Repeat n times
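The naive approach above can be written down directly. The sketch below smooths one scalar value per cluster (for example the probability of a single topic); the cluster ids, the neighbour lists, and the function name `adjacency_smooth` are illustrative assumptions. The example uses a ring of four clusters, as would arise for cyclic metadata.

```python
def adjacency_smooth(values, neighbors, n):
    """Replace each cluster's value by the mean of itself and its adjacent clusters.

    values:    {cluster id: value}
    neighbors: {cluster id: list of adjacent cluster ids}
    n:         number of smoothing rounds
    """
    for _ in range(n):
        # The new dict is built entirely from the old values (synchronous update).
        values = {
            c: (v + sum(values[a] for a in neighbors[c])) / (1 + len(neighbors[c]))
            for c, v in values.items()
        }
    return values

if __name__ == "__main__":
    # Four clusters on a ring: cluster 3 is adjacent to cluster 0.
    ring = {c: [(c - 1) % 4, (c + 1) % 4] for c in range(4)}
    spiky = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}  # an outlier in cluster 0
    for rounds in (1, 2, 5):
        print(rounds, adjacency_smooth(spiky, ring, rounds))
```

More rounds spread the outlier further; on this regular ring the total mass is preserved and the values approach the uniform value 0.25.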
38. Smoothing topics associated with metadata clusters
Documents receive topics from their own and neighboring
metadata clusters
39. Performance, Complex Metadata
Smooth topics associated with metadata clusters
40. Metadata space types (figure panels): nominal, ordinal, cyclic, spherical, networked
41. Smoothing
Smoothing strength is learned during inference
Similar clusters → stronger smoothing
Dissimilar clusters → weaker smoothing
Alternatively, the smoothing strength can be predefined by the user
42. Metadata Weighting in HMDPs
43. Feature Weighting
One variable η governs the influence of a metadata cluster on
documents
If η < threshold, ignore the corresponding metadata.
44. Metadata Weighting
Importance of metadata is learned during inference,
answering the question:
What percentage of the topics is explained by a given kind of
metadata? (e.g. time, geographical coordinates, ...)
→ Interpretable parameter!
Metadata with a low weight can be removed during
inference
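The thresholding rule from the feature weighting slides can be illustrated as follows. This is only a sketch: the weight values and the threshold are made up, and renormalising the remaining weights so that they sum to one is an assumption, not something stated in the talk.

```python
def prune_metadata(weights, threshold):
    """Drop metadata whose weight is below the threshold, then renormalise.

    weights: {metadata name: share of topics explained by that metadata}
    """
    kept = {name: w for name, w in weights.items() if w >= threshold}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}

if __name__ == "__main__":
    # Hypothetical learned weights for four kinds of metadata.
    weights = {"timeline": 0.5, "weekly cycle": 0.3, "daily cycle": 0.15, "yearly cycle": 0.05}
    print(prune_metadata(weights, threshold=0.1))
```

Dropping a kind of metadata means its clusters no longer have to be tracked during inference, which is where the efficiency gain comes from.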
45. Example Application
46. Dataset
Linux Kernel Mailing List
3,400,000 emails with timestamps and mailing list ID
47. Dataset
Linux Kernel Mailing List
3,400,000 emails with timestamps and mailing list ID
Timeline
Yearly cycle
Weekly cycle
Daily cycle
Mailing list
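One way to derive the timeline and the three cycles from a single email timestamp is sketched below. The bin choices (one bin per year, month, weekday, and hour) and the function names are assumptions for illustration; the talk does not state how the clusters were formed.

```python
from datetime import datetime, timezone

def time_clusters(unix_seconds):
    """Map a timestamp to one cluster id per temporal metadata space."""
    t = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return {
        "timeline": t.year,           # linear: assumed one bin per year
        "yearly_cycle": t.month - 1,  # cyclic: 12 bins, December next to January
        "weekly_cycle": t.weekday(),  # cyclic: 7 bins, Monday = 0
        "daily_cycle": t.hour,        # cyclic: 24 bins
    }

def cyclic_neighbors(bin_index, n_bins):
    """Adjacent bins on a cycle: the last bin is a neighbour of the first."""
    return [(bin_index - 1) % n_bins, (bin_index + 1) % n_bins]

if __name__ == "__main__":
    clusters = time_clusters(0)  # 1970-01-01 00:00 UTC
    print(clusters)
    print(cyclic_neighbors(clusters["yearly_cycle"], 12))
```

The wrap-around in `cyclic_neighbors` is what distinguishes a cyclic space from an ordinal one: smoothing can pass from December to January or from 23:00 to 00:00.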
50. Topics
Professional topics:
Hobbyist topics:
51. Topics
Metadata weighting:
52. Topics
Metadata weighting:
can be removed during inference
53. Efficient Inference in HMDP
54. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
Cluster-topic distributions
Document-topic distributions
Metadata
55. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
Inference:
Nearly completely collapsed
inference!
56. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
We only need to learn
Global topic distribution
Topic assignments to words
57. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
We only need to learn
Global topic distribution
Topic assignments to words
Dirichlet parameters
58. Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
Approximations:
Variational
Practical
Stochastic
→ low memory consumption
→ online inference
59. Parameters of HMDP
Cluster-topic distributions:
How many documents of a cluster contain topic x?
60. Parameters of HMDP
Cluster-topic distributions:
How many documents of a cluster contain topic x?
Metadata-weights
How many of the topics of documents are explained
by metadata x?
61. Parameters of HMDP
Cluster-topic distributions:
How many documents of a cluster contain topic x?
Metadata-weights
How many of the topics of documents are explained
by metadata x?
Dirichlet process scaling parameters
How many pseudo-counts do we add to the topic
distributions?
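The pseudo-count reading of the scaling parameter can be shown with the usual Dirichlet smoothing formula: observed topic counts are combined with `scale` pseudo-counts that are distributed according to a prior topic distribution. The sketch below is a simplification of what happens inside the model; the function name and all numbers are hypothetical.

```python
def smoothed_topics(counts, prior, scale):
    """Posterior mean topic distribution with `scale` pseudo-counts from `prior`.

    counts: number of words assigned to each topic in a document
    prior:  topic distribution predicted for the document (e.g. from its clusters)
    scale:  total number of pseudo-counts added
    """
    n = sum(counts)
    return [(c + scale * p) / (n + scale) for c, p in zip(counts, prior)]

if __name__ == "__main__":
    counts = [8, 2, 0]            # hypothetical topic counts of a 10-word document
    prior = [0.55, 0.225, 0.225]  # hypothetical prediction from metadata clusters
    for scale in (0.1, 10.0, 1000.0):
        print(scale, [round(p, 3) for p in smoothed_topics(counts, prior, scale)])
```

A small scale lets the document's own words dominate, while a large scale pulls the estimate towards the prediction from its metadata clusters.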
62. Properties of HMDP
Interpretable parameters
Simultaneous inference of topics and metadata-topic
dependencies
Efficient online inference
63. Comparison of Topic Models for Arbitrary Metadata
64. Comparison
Gaussian Process Topic Model
The “perfect” model:
Can cope with arbitrary metadata
Models dependencies between metadata
Parameter learning is very expensive
Kernel selection and inference require expert knowledge
Parameters of Gaussian processes hard to interpret
65. Comparison
Multinomial Regression Topic Model
The “straightforward” model:
Can cope with many metadata
Parameter learning is cheaper than for Gaussian
processes but still expensive (due to alternating inference
and repeated distance calculations)
Cannot cope with complex metadata
(e.g. geographical, cyclic, ...)
Does not model dependencies between metadata
Regression weights of Dirichlet-multinomial regression
hard to interpret
66. Comparison
Hierarchical Multi-Dirichlet Process Topic Model
The “fast” model:
Can cope with arbitrary metadata
Fast inference (simultaneously for topics and topic
predictions)
All parameters have natural interpretations as probabilities
or pseudo-counts
Requires a (simple) pre-clustering of documents
Does not model dependencies between metadata
67. Thank you for your attention!